WO2000046694A1 - System and process for creating a structured tag representation of a document - Google Patents
- Publication number
- WO2000046694A1 (PCT/US2000/002747)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- content
- style
- attributes
- dtd
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
- G06F16/94—Hypermedia
Definitions
- This invention relates to the field of extracting content from an existing document into a structural representation of the document.
- SGML: Standard Generalized Markup Language
- XML: Extensible Markup Language
- DTD: Document Type Definition
- XML and SGML use the DTD or other structured document models to associate the content with the appropriate mark up commands to enable the content to be displayed with a desired presentation and style.
- the mark up language adds identifiers for each of the "elements" or parts of the document for identification purposes.
- a DTD may define a document model as having a title, a main paragraph and several secondary paragraphs.
- the mark up language then adds identifiers, called a "tag", to designate the beginning and the end of a particular element.
- the presentational attributes and/or the style attributes can also be associated by additional tags or by association with separate style sheets.
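As a minimal sketch of how start and end tags identify the elements of a document, the Python snippet below parses a small tagged document that follows the title / main paragraph / secondary paragraph model mentioned above. The tag names (`document`, `title`, `main`, `secondary`) and the sample text are hypothetical and do not come from the patent.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged document: each element is delimited by a start tag
# (<title>) and an end tag (</title>).
tagged = (
    "<document>"
    "<title>Example title</title>"
    "<main>The main paragraph.</main>"
    "<secondary>A secondary paragraph.</secondary>"
    "<secondary>Another secondary paragraph.</secondary>"
    "</document>"
)

root = ET.fromstring(tagged)

# The tags let a program recover which element each piece of content belongs to.
elements = [(child.tag, child.text) for child in root]

for name, text in elements:
    print(f"{name}: {text}")
```

Because the tags carry only structure, presentation can be attached separately, for instance through a style sheet keyed on the tag names.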
- such structural document models can also be used to convert existing documents into content which can be presented in other applications.
- the present invention accomplishes these needs and others by providing a system and process for systematically extracting content from a document into a tagged structural representation of the document.
- the system "interrogates" the document in order to systematically extract the content from the document based on the defined rules or "hints".
- the system is able to structure the content in accordance with a defined structural document model (such as a DTD) to create a structural representation of the document.
- This structural representation can then be used to create a document to enable a meaningful presentation of the extracted content, such as on a browser or other presentation applications.
- the system is able to quickly extract content from an existing document into a structured representation of the existing document. This is particularly useful when it is desired to create a plurality of documents of similar type from existing documents, or when there is a need to frequently update documents.
- the present invention utilizes a structural document model, such as a Document Type Definition ("DTD"), to define the structure of the content based on the elements of the document to be represented.
- the DTD is graphically represented by a logical structure tree, which allows the document elements to be easily inserted and/or moved about.
- Elements can be defined (or nested) within other elements. These elements can be, for example, a Title element, a Headline element, an Author element, a Body element, a Photo element, and so forth.
- a particular structure or sequence of elements is grouped together to form a particular DTD.
- most documents, particularly documents of a similar type, are represented by an existing DTD. For instance, a magazine will typically use a standard DTD for feature articles, another DTD for columns, and the like.
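The logical structure tree described above can be sketched as a small recursive data structure. This is only an illustration under stated assumptions: the `Element` class and the `outline` helper are hypothetical names, and the nesting of Photo inside Body is chosen for the example rather than taken from a specific DTD in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One node of a (hypothetical) logical structure tree."""
    name: str
    children: list = field(default_factory=list)


def outline(element, depth=0):
    """Return the tree as indented lines, one line per element."""
    lines = ["  " * depth + element.name]
    for child in element.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Elements can be nested within other elements.
article = Element("Article", children=[
    Element("Title"),
    Element("Headline"),
    Element("Author"),
    Element("Body", children=[Element("Photo")]),
])

print("\n".join(outline(article)))
```

Inserting or moving an element then amounts to editing a `children` list, which mirrors the way the graphical tree lets elements be inserted or moved about.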
- the system also includes a Hint Set to associate particular presentational or style attributes to each of the designated elements of a particular DTD.
- These presentational and/or style attributes are associated by selecting from a menu or by other means.
- the Title element for a particular DTD may be associated with the presentational attribute of being the first text box in the document or by a style attribute of having a particular font size.
- the Headline element may be associated with the presentational attribute of being the second text box.
- a Keyword element may be associated with the style attribute of having a particular character style, such as italics.
- Hint Sets can be applied to different DTD's as well. The user can select a particular Hint Set from a menu of Hint Sets and associate that Hint Set to a particular DTD.
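One way to picture a Hint Set is as a mapping from element names to the attributes that identify them. The sketch below assumes a hypothetical `Hint` class with three illustrative attributes (`text_box_index`, `font_size`, `character_style`) and represents a block of the document as a plain dictionary; none of these names come from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hint:
    """Attributes that identify one element (all fields are illustrative)."""
    text_box_index: Optional[int] = None    # presentational: 1 = first text box
    font_size: Optional[float] = None       # style attribute
    character_style: Optional[str] = None   # style attribute, e.g. "italic"

    def matches(self, block: dict) -> bool:
        """True when every attribute set on the hint equals the block's value."""
        wanted = {k: v for k, v in vars(self).items() if v is not None}
        return bool(wanted) and all(block.get(k) == v for k, v in wanted.items())


# A hypothetical Hint Set for a DTD with Title, Headline and Keyword elements.
hint_set = {
    "Title": Hint(text_box_index=1),
    "Headline": Hint(text_box_index=2),
    "Keyword": Hint(character_style="italic"),
}

block = {"text_box_index": 3, "character_style": "italic", "text": "colds"}
print([name for name, hint in hint_set.items() if hint.matches(block)])
```

Keeping the Hint Set separate from the DTD is what would allow the same set of hints to be reused with a different DTD.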
- the system is able to quickly "interrogate" the document, extract the content and create a structural representation of the extracted content in accordance with the DTD based on the selected Hint Set.
- the system parses the document file to search for the attributes assigned to each of the elements in the Hint Set for that DTD. Once it finds a Hint or defined attributes associated with a particular element, it extracts the content associated with those attributes and associates that content with the element in the DTD with which that Hint has been associated. For example, a DTD may have a Title element, a Headline element, and a Body element with a Picture element as a subset of the Body element. The Hint Set for that DTD would have certain attributes associated with each of those elements. The system analyzes the document to which the DTD was applied.
- As it finds the presentational and style attributes associated with the Title element, it extracts the content with which those attributes were associated, associates that extracted content with the Title element and represents it in the structure for the Title element for that DTD. The system continues to search for the attributes for the remaining elements. As it finds each of the attributes or style sheets for each of the elements, it extracts that content, associates the extracted content with the element and represents that content with the element in the structure defined in the DTD.
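The search-and-extract pass described above can be sketched as a single loop over the blocks of a document. This is a simplified illustration, not the patented implementation: it assumes each block carries one style-sheet name, that the Hint Set maps each element to exactly one style sheet, and the names `extract`, `blocks` and `hint_set` are hypothetical.

```python
def extract(blocks, hint_set):
    """Group block text under the element whose hinted style sheet it uses.

    blocks:   list of {"style": str, "text": str} in document order
    hint_set: {element name: style sheet name}
    """
    style_to_element = {style: element for element, style in hint_set.items()}
    structure = {element: [] for element in hint_set}
    for block in blocks:
        element = style_to_element.get(block["style"])
        if element is not None:          # blocks without a matching hint are skipped
            structure[element].append(block["text"])
    return structure


# Hypothetical sample document and Hint Set.
blocks = [
    {"style": "Title style", "text": "Sample title"},
    {"style": "Caption", "text": "Unmapped caption"},
    {"style": "Headline style", "text": "Sample headline"},
    {"style": "Body style", "text": "First paragraph."},
    {"style": "Body style", "text": "Second paragraph."},
]
hint_set = {"Title": "Title style", "Headline": "Headline style", "Body": "Body style"}

structure = extract(blocks, hint_set)
print(structure)
```

In this sketch unmatched blocks are silently dropped; the description instead hands such cases to heuristics or to the user.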
- the system of the preferred embodiment also employs heuristic techniques to improve the efficiency of the process. The system may encounter multiple options for various attributes in analyzing the document. The system is capable of intelligently resolving an appropriate path among these multiple options by the use of previous history, by looking ahead to the following sequence of Hints and elements, and by other intelligence.
- the system of the preferred embodiment will query the user if unrecognized style sheets or attributes are encountered or if there are irreconcilable unresolved options. The decision provided by either the system in resolving the best option or by input from the user will then be used in other documents when those problems are encountered.
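The idea that a decision, once made, is reused for later documents can be sketched with a small memo object. The class name `DecisionMemory`, its methods and the callback signature are assumptions made for this illustration only.

```python
class DecisionMemory:
    """Remember which element an unrecognized style sheet was resolved to."""

    def __init__(self, ask_user):
        self.ask_user = ask_user   # callback: style sheet name -> element name
        self.history = {}          # style sheet name -> element name

    def element_for(self, style):
        """Reuse an earlier decision if one exists; otherwise query the user once."""
        if style not in self.history:
            self.history[style] = self.ask_user(style)
        return self.history[style]


# Hypothetical usage: the "user" is simulated by a function that records its calls.
questions = []

def simulated_user(style):
    questions.append(style)
    return "PullQuote"

memory = DecisionMemory(simulated_user)
first = memory.element_for("Big Quote")    # triggers a query
second = memory.element_for("Big Quote")   # answered from history
print(first, second, questions)
```

Persisting `history` between runs would be one way to carry decisions over to other documents, although the description does not say how the decisions are stored.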
- Figure 1 illustrates a screen shot of a document from which content is to be extracted under a preferred embodiment of the present invention.
- Figure 2 illustrates a screen shot of a DTD and Hint Set of a preferred embodiment of the present invention.
- Figure 3 is a screen shot of a structural representation of a document from which content has been extracted.
- Figure 4 is another screen shot of Figure 3.
- Figure 5 is an illustration of an XML encoded version of the document of
- the present invention provides a process and system for extracting content into a structural representation of a defined structural document model from an existing document.
- the system "interrogates" the document to find the elements of the document based on a set of hints or rules associated with a selected structural document model, extracts the content for each of the document elements and structurally represents the extracted content in accordance with the selected structural document model.
- This program, as well as other word-processing and/or desktop publishing systems, allows the user to input text and graphics into a user-defined layout in electronic digital form.
- the user is able to utilize presentational attributes such as design objects, including text boxes, picture boxes, lines, color fills, as well as locations, dimensions, spacing and the like.
- the user is also able to apply style attributes to the content of the document, such as fonts, indentation, spacing, color, image types, and many other attributes. Selective groupings of certain attributes may be assigned designations as "style sheets". These style sheets can then be saved to allow reuse.
- the style sheets can be applied to a single element (such as a title, headline, paragraph, etc.) or to a group of elements (such as an article, book, etc.).
- a title is normally the first text box and is often characterized by a center-justified sentence, in bold letters with a large font. This could be identified as a title style sheet.
- a headline style sheet may be the second text box, while a keyword style sheet may be the third text box and/or have characters in a different style than the other elements, such as italics.
- a paragraph is often characterized by a text box, with an indented sentence, followed by one or more other sentences and ending with a "hard return". This could be identified as a paragraph style sheet.
- a plurality of presentational and/or style attributes can be grouped together to form a document.
- a technical note document may include a title style sheet, a headline style sheet, a keyword style sheet, a body text style sheet in which a series of paragraph style sheets could be included, and so forth. It is to be expressly understood that this description is intended for explanatory purposes only, and is not meant to limit the claimed inventions to this embodiment. The use of other embodiments of document types, and programs for creating them are considered to be within the scope of the claimed inventions.
- An example of a document from which the content may be extracted is illustrated in Figure 1.
- This document (also referred to as an "Article"), prepared under QuarkXPress, uses a Style Book which includes a Headline style ("When the Bough Breaks"), a SubHead style ("Mothers tell 20 secrets of keeping children from catching those nasty winter colds"), a Body style, a Photo style, and a PullQuote style ("those of us with newborns know the terror one experiences when children come down with their first case of the sniffles").
- the Body style includes a BodySubHead style and several paragraphs.
- the Photo style includes a Source style, a Dimension style and a Caption style. These styles are all standard for this particular style of article and were created in accordance with Style Books historically used by the industry.
- the present system utilizes these elements to provide an intelligent heuristic and user-definable process for extracting the content of a document into a structural representation of the original document.
- the user selects or defines a Hint Set for the extraction and structural representation of the content from a document.
- the user first creates or selects a Document Type Definition ("DTD") for the extraction process.
- a window, such as that illustrated in Figure 2, allows a user to define a DTD, or select one already created from a library.
- an Article DTD is selected, which is the same DTD used in creating the Article illustrated in Figure
- the DTD is graphically represented by a logical structural tree, as shown in Figure 2. It is to be expressly understood that representations other than the logical structure tree embodiment can be utilized under the present invention.
- a "Hint Set" is associated with the selected DTD.
- the Hint Set associates certain presentational and/or style attributes or style sheets with each of the elements of the DTD.
- the system will "search" for these attributes in the original document based on the associated Hint Set.
- An example of a Hint Set is illustrated in Figure 2.
- the Hint Sets may be selected from a menu of defined Hint Sets, or defined by the user.
- the user is able to associate the sets of presentational or style attributes to the elements of the DTD as necessary or desired.
- the elements and attributes can be associated in the DTD and Hint Set by selection from a menu or by other known techniques.
- existing style sheets may be used for the Hint Sets.
- a style sheet may have already been defined for assigning presentational and style attributes for associating with content to create a Headline for the existing document.
- This style sheet can thus also be used in the Hint Set for association with the Headline element.
- each of the elements in the DTD is associated with certain presentational attributes (such as being the first text box, first paragraph, location, etc.) and/or style attributes (font types, character styles, color, etc.).
- the Headline element for the Article DTD is associated with a Headline style sheet.
- the BodySubHead element is associated with a Sub sub head style sheet.
- the p1 element is associated with a Body style sheet.
- the p element is associated with a Pull Out Quote style sheet.
- the Byline element and the Subhead element are designated as optional (not shown).
- the decision as to whether there may be multiple occurrences of an element, for instance multiple first paragraphs or secondary paragraphs in the Body element or multiple Photos, is also defined in the DTD.
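The associations listed above for the Article DTD could be written down as data along the following lines. The element-to-style-sheet pairs for Headline, BodySubHead, p1 and p follow the text; the `ElementHint` class, the `repeatable` flags and the style-sheet names used for the optional SubHead and Byline elements are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementHint:
    element: str              # element name in the DTD
    style_sheet: str          # style sheet that identifies the element
    optional: bool = False    # may the element be absent?
    repeatable: bool = False  # may the element occur more than once? (assumed)


ARTICLE_HINTS = [
    ElementHint("Headline", "Headline"),
    ElementHint("SubHead", "SubHead", optional=True),        # style name assumed
    ElementHint("Byline", "Byline", optional=True),          # style name assumed
    ElementHint("BodySubHead", "Sub sub head", repeatable=True),
    ElementHint("p1", "Body", repeatable=True),
    ElementHint("p", "Pull Out Quote", repeatable=True),
]


def required(hints):
    """Names of the elements that must be found in the document."""
    return [h.element for h in hints if not h.optional]


print(required(ARTICLE_HINTS))
```

Holding optionality and repetition as flags next to each association keeps both decisions in one place, although in the described system the occurrence rules live in the DTD itself.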
- the selected DTD and associated Hint Set are then applied to the desired document.
- the system of the present invention parses the document by checking the attributes or style sheets of the document. It analyzes those attributes in the document based on the Hint Set for the selected DTD. In this example, the system recognizes the original document as an Article. It then moves to the next Hint, the Headline style sheet, and looks for content carrying the attributes of that style sheet. If the system does find the attributes for the Headline style sheet, it extracts the content associated with the Headline style sheet and associates that extracted content with the Headline element in the DTD.
- the system parses to the next Hint, a Sub-head style sheet.
- the system continues in this fashion until it has analyzed each of the style sheets or sets of attributes set forth in the Hint Set. If the system is unable to find a Hint, or if it encounters attributes or style sheets which are not listed in the Hint Set, then it employs heuristic techniques to resolve these issues.
- the system may attempt to resolve the missing Hint by determining whether the Hint is mandatory or optional, by checking whether another style sheet may be used in place of the style sheet defined in the missing Hint, by consulting previous decisions made when this Hint was missing, by "looking" ahead to the next sequence of Hints to determine whether to use another style sheet, or by other "intelligent" decisions.
- the system is also able to employ multiple paths to attempt to resolve this dilemma, such as skipping the Hint to see if the continuing sequence of Hints can be resolved. If the system is able to successfully resolve this issue, then this resolution goes into future decision making. If the system is unable to successfully resolve this issue, then the system may query the user for assistance. If the user provides assistance, or later corrects the structural representation, this assistance or correction can be later used by the system to resolve future dilemmas.
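A much-reduced sketch of the "skip the Hint and continue" path is shown below. It only covers the mandatory/optional test and in-order matching; the substitution of other style sheets, the use of history and the user query are left out. The function name `match_hints` and the tuple layout are hypothetical.

```python
def match_hints(hints, styles):
    """Match an ordered list of hints against the styles found in a document.

    hints:  list of (element, style_sheet, optional) in the order the DTD expects
    styles: list of style sheet names in document order
    Returns (matched, unresolved): matched is a list of (element, position);
    unresolved lists mandatory elements whose style sheet was never found.
    """
    matched, unresolved = [], []
    pos = 0
    for element, style, optional in hints:
        try:
            idx = styles.index(style, pos)   # search from the current position onward
        except ValueError:
            if not optional:
                unresolved.append(element)   # candidate for heuristics or a user query
            continue                         # skip this hint and try the next one
        matched.append((element, idx))
        pos = idx + 1
    return matched, unresolved


hints = [
    ("Headline", "Headline", False),
    ("SubHead", "SubHead", True),
    ("Byline", "Byline", True),
    ("BodySubHead", "Sub sub head", False),
    ("p1", "Body", False),
]
matched, unresolved = match_hints(hints, ["Headline", "Sub sub head", "Body"])
print(matched, unresolved)
```

An empty `unresolved` list means the sequence of Hints was resolved by skipping only optional elements; anything left in it would be passed to further heuristics or to the user.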
- the Hint Set can be saved and applied to other document types. This is particularly useful when a number of similar documents are processed, or if a particular document is frequently updated.
- the document from which the content is to be extracted is applied to an existing DTD using the desired Hint Set to create the tagged structural representation of the document.
- the document illustrated in Figure 1 is applied to the DTD and Hint Set illustrated in Figure 2.
- the system analyzed the document for the occurrence of the Hints for the applied DTD, as shown in Figure 2.
- the system recognized the attributes for the Headline style sheet, and extracted the content associated with that style sheet. This extracted content was associated with the DTD element Headline.
- the system then proceeded to analyze the document for the SubHead style sheet.
- the system was unable to find this style sheet and, since the SubHead element was designated as optional, ignored this element and proceeded on.
- the system was also unable to find the Byline style sheet, and thus ignored the Byline element.
- the system was able to find multiple style sheets for BodySubHead, p1 and p.
- the system extracted the content associated with those style sheets and associated each piece of extracted content with the appropriate element.
- the system extracted the content associated with each of these Hints and associated the extracted content with the appropriate structural elements of the DTD.
- the system analyzed the document for the style sheets associated with the Photo element, Source element, Width element, Height element and Caption element and associated the extracted content with the appropriate elements.
- a graphical structural representation of the extracted content is illustrated in Figures 4 and 5.
- the nested elements may be hidden for conciseness purposes in Figure 4.
- the nested elements may be viewed in a tree structure by opening the parent element, as shown in Figure 5.
- FIG. 5 illustrates the content extracted from the document shown in Figure 1 by a preferred embodiment of the present invention with XML tags applied.
- the XML tags provide the identifiers for each of the elements represented in the structural representation shown in Figures 4 and 5.
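To illustrate how XML tags can be applied to a structural representation, the sketch below serializes a small nested structure with Python's standard `xml.etree.ElementTree` module. The headline text is the one quoted for the Article; the body text is placeholder content, and the `build` helper and the nested-tuple layout are assumptions for this example rather than the format shown in the figures.

```python
import xml.etree.ElementTree as ET


def build(name, content):
    """Create an XML element from a name and either text or a list of children."""
    node = ET.Element(name)
    if isinstance(content, str):
        node.text = content
    else:
        for child_name, child_content in content:
            node.append(build(child_name, child_content))
    return node


# Nested (element, content) pairs standing in for the extracted structure.
article = build("Article", [
    ("Headline", "When the Bough Breaks"),
    ("Body", [
        ("BodySubHead", "Placeholder subhead"),
        ("p1", "Placeholder paragraph"),
    ]),
])

xml_text = ET.tostring(article, encoding="unicode")
print(xml_text)
```

Each element name from the structure becomes a start tag and an end tag around its content, so the nesting of the tree is preserved in the output.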
- once a DTD and Hint Set have been selected, the entire process can extract the content from an existing document prepared for print into a structured representation of that document, from which a presentation of that content can be created, such as for use on a Web site.
- One feature of the preferred embodiment of the present invention is the use of the complexities of the document itself to create a more efficient process for extracting the content into a structural relationship.
- previously, the greater the density of the attributes used to create a stylistic document, the greater the difficulty in extracting the content in a meaningful manner.
- the present system is able to efficiently utilize these attributes to extract the content into a structural relationship, and provide greater structural detail with higher density of attributes in the document. While the descriptive embodiment is particularly useful in processing documents in QuarkXPress, other embodiments may also be used in conjunction with other publishing and/or word-processing systems.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP00905944A EP1240599A1 (en) | 1999-02-03 | 2000-02-02 | System and process for creating a structured tag representation of a document |
AU27532/00A AU2753200A (en) | 1999-02-03 | 2000-02-02 | System and process for creating a structured tag representation of a document |
JP2000597706A JP2002536745A (en) | 1999-02-03 | 2000-02-02 | Systems and processes for creating structured tag representations of documents |
CA002361398A CA2361398A1 (en) | 1999-02-03 | 2000-02-02 | System and process for creating a structured tag representation of a document |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/243,744 (US24374499A) | 1999-02-03 | 1999-02-03 | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2000046694A1 (en) | 2000-08-10 |
Family
ID=22919948
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2000/002747 WO2000046694A1 (en) | 1999-02-03 | 2000-02-02 | System and process for creating a structured tag representation of a document |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1240599A1 (en) |
JP (1) | JP2002536745A (en) |
AU (1) | AU2753200A (en) |
CA (1) | CA2361398A1 (en) |
WO (1) | WO2000046694A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2860618A1 (en) * | 2003-10-02 | 2005-04-08 | Stelae Technologies Sa | Digital information unit e.g. electronic mail, processing method for enterprise, involves numbering data blocks in ascending order, allocating XML markup to each block, and obtaining processed information unit in XML format |
US7251697B2 (en) | 2002-06-20 | 2007-07-31 | Koninklijke Philips Electronics N.V. | Method and apparatus for structured streaming of an XML document |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5752020A (en) * | 1993-08-25 | 1998-05-12 | Fuji Xerox Co., Ltd. | Structured document retrieval system |
US5812999A (en) * | 1995-03-16 | 1998-09-22 | Fuji Xerox Co., Ltd. | Apparatus and method for searching through compressed, structured documents |
US5875441A (en) * | 1996-05-07 | 1999-02-23 | Fuji Xerox Co., Ltd. | Document database management system and document database retrieving method |
US5915259A (en) * | 1996-03-20 | 1999-06-22 | Xerox Corporation | Document schema transformation by patterns and contextual conditions |
-
2000
- 2000-02-02 AU AU27532/00A patent/AU2753200A/en not_active Abandoned
- 2000-02-02 WO PCT/US2000/002747 patent/WO2000046694A1/en not_active Application Discontinuation
- 2000-02-02 JP JP2000597706A patent/JP2002536745A/en active Pending
- 2000-02-02 CA CA002361398A patent/CA2361398A1/en not_active Abandoned
- 2000-02-02 EP EP00905944A patent/EP1240599A1/en not_active Withdrawn
Also Published As
Publication number | Publication date |
---|---|
EP1240599A1 (en) | 2002-09-18 |
JP2002536745A (en) | 2002-10-29 |
AU2753200A (en) | 2000-08-25 |
CA2361398A1 (en) | 2000-08-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7013309B2 (en) | Method and apparatus for extracting anchorable information units from complex PDF documents | |
US8515939B2 (en) | Method and system for facilitating rule-based document content mining | |
US5956726A (en) | Method and apparatus for structured document difference string extraction | |
US6533822B2 (en) | Creating summaries along with indicators, and automatically positioned tabs | |
US7305612B2 (en) | Systems and methods for automatic form segmentation for raster-based passive electronic documents | |
US7984076B2 (en) | Document processing apparatus, document processing method, document processing program and recording medium | |
US7823061B2 (en) | System and method for text segmentation and display | |
US7805671B1 (en) | Style sheet generation | |
US20140006913A1 (en) | Visual template extraction | |
US20120304051A1 (en) | Automation Tool for XML Based Pagination Process | |
JPH077408B2 (en) | Method and system for changing emphasis characteristics | |
Hardy et al. | Mapping and displaying structural transformations between xml and pdf | |
US9286272B2 (en) | Method for transformation of an extensible markup language vocabulary to a generic document structure format | |
US20070180359A1 (en) | Method of and apparatus for preparing a document for display or printing | |
JP4666996B2 (en) | Electronic filing system and electronic filing method | |
US20040044691A1 (en) | Method and browser for linking electronic documents | |
US8584007B2 (en) | Information processing method, information processing apparatus, and program | |
US20060112327A1 (en) | Structured document processing apparatus and structured document processing method, and program | |
WO2000046694A1 (en) | System and process for creating a structured tag representation of a document | |
CN111274761A (en) | Font editing method and system using SVG format, and computer-readable recording medium | |
JP2004178011A (en) | Document conversion device and documents conversion method | |
Chakraborty et al. | Extracting anchorable information units from PDF files | |
Hufflen | {mlBibTeX} Meets {ConTeXt} | |
CN113935282A (en) | Document editing method, device, storage medium and equipment | |
JPH08202714A (en) | Multimedia document preparing/editing system |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
ENP | Entry into the national phase | Ref country code: JP; Ref document number: 2000 597706; Kind code of ref document: A; Format of ref document f/p: F |
ENP | Entry into the national phase | Ref country code: CA; Ref document number: 2361398; Kind code of ref document: A; Format of ref document f/p: F |
WWE | Wipo information: entry into national phase | Ref document number: 27532/00; Country of ref document: AU |
WWE | Wipo information: entry into national phase | Ref document number: 2000905944; Country of ref document: EP |
REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
WWP | Wipo information: published in national office | Ref document number: 2000905944; Country of ref document: EP |
WWW | Wipo information: withdrawn in national office | Ref document number: 2000905944; Country of ref document: EP |