WO2002097667A2 - Visual and interactive wrapper generation, automated information extraction from web pages, and translation into XML

Info

Publication number
WO2002097667A2
Authority
WO
WIPO (PCT)
Prior art keywords
wrapper
document
pattern
xml
documents
Prior art date
Application number
PCT/IB2002/003036
Other languages
French (fr)
Other versions
WO2002097667A8 (en)
Inventor
Robert Baumgartner
Sergio Flesca
Georg Gottlob
Marcus Herzog
Original Assignee
Lixto Software Gmbh
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lixto Software Gmbh
Priority to US10/479,039 (US7581170B2)
Priority to EP02755419A (EP1430420A2)
Publication of WO2002097667A2
Publication of WO2002097667A8

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures

Definitions

  • This disclosure teaches techniques related in general to the field of information processing. More particularly, the teachings relate to methods, systems and computer-program products for information extraction from Web pages, to the construction of wrappers (i.e. extraction programs) based on example Web pages, and to the transformation of "relevant" parts of HTML documents into XML.
  • the World Wide Web (abbreviated as Web) is the world's largest data repository and information retrieval system.
  • client machines effect transactions to Web servers using the Hypertext Transfer Protocol (HTTP), which is an application protocol usually providing user access to files formatted in a standard page description language known as Hypertext Markup Language (HTML).
  • HTML provides basic document formatting (and some logic markup) and allows the developer to specify "links" to other servers and documents.
  • HTTP Hypertext Transfer Protocol
  • URL Uniform Resource Locator
  • HTML-compatible e.g. Netscape Navigator, Microsoft Internet Explorer, Amaya or Opera
  • client machine involves the specification of a link by means of a URL.
  • the client then makes a request to the server (also referred to as "Web site") identified by the link and receives in return an HTML document or some other object of a known file type.
  • server also referred to as "Web site"
  • Simple and less sophisticated browsers can easily be written in a short time and with little effort in (object-oriented) programming languages such as Java, where powerful program libraries are available that already contain modules or classes providing the
  • the system lacks sufficiently powerful semi-automatic generalization mechanisms that allow a user to specify several similar patterns at once while marking only a single pattern of the desired type.
  • the system lacks sufficient visual facilities for imposing inherent (internal) or contextual (external) conditions to an extraction pattern, e.g.
  • structure extractor severely limit the ways to define extraction patterns. For example,
  • computer-program products including computer-readable media with instructions that implement the systems and methods disclosed, completely or partially, are also contemplated as being within the overall scope of the disclosed teachings.
  • the media could be anything including but not limited to RAMs, ROMs, hard disks, CDs, tapes, floppy disks, Internet downloads, etc.
  • any medium that can fix all or a subset of the instructions, even for a transient period of time, is considered as a computer-readable media for the purposes of this disclosed teaching.
  • the type of computer that can implement is also not restricted to any particular type, but includes personal computers, workstations, mainframes and the like. Also, it can be implemented on a stand-alone computer or in a distributed fashion across a network, including but not limited to, the Internet.
  • Pattern-Filter-Diagram Logical structure of a sample pattern illustrating the general pattern structure
  • Package Architecture of preferred embodiment Package structure of actual implementation.
  • Lixto Screenshots Vectorized screenshots showing Lixto at work.
  • Lixto Screenshots Vectorized screenshots showing Lixto at work (part 2).
  • Table of empirical results Evaluation of Lixto w.r.t. different sample web sites.
  • Example of an HTML tree Illustration of a possible way to parse an HTML document.
  • Pattern EER The entity relationship diagram of pattern and filters.
  • Filter EER The entity relationship diagram of filters and their constituents.
  • Rule Evaluation Algorithm of evaluating an Elog rule.
  • subsequence evaluation Algorithm of evaluating the "subsequence" predicate.
  • before evaluation Algorithm of evaluating the "before” predicate for tree filters.
  • before string evaluation Algorithm for evaluating the "before" predicate for string filters. target pattern instance is allowed to appear.
  • a "before" condition might
  • the XML Translation Builder, another interactive module of the visual builder (Fig. 1), is responsible for supporting a wrapper designer during the generation of the so-called XML translation program.
  • Fig. 1 another interactive module of the visual builder
  • One key issue is that, by default, pattern names that are chosen by the designer during the pattern design process are taken as output XML tags and that the
  • hierarchy of extracted pattern instances determines the structure of the output XML document.
  • a standard translation of the extracted pattern instances into XML will be performed without any need of further interaction.
  • Lixto also offers a wrapper designer the possibility to modify the standard XML translation in the three following ways:
  • the designer can rename some patterns with the effect that a new name instead of the actual pattern name appears as tag name in the XML translation.
  • the designer can suppress some patterns from the translation.
  • a designer wants to construct a wrapper that extracts all records from the third table of some Web page. She may first construct a pattern table that precisely identifies the third table. Then she may define a pattern record using a filter whose parent-pattern is the table pattern (with the effect that only records from the document's third table are identified as instances of the record pattern).
  • the pattern hierarchy is of the form document <- table <- record, where each arrow symbolizes a reference to a parent pattern. While the pattern table has an essential
  • the designer may decide to suppress it in the XML output. Then the XML output document will display a hierarchy of the form document <- record, where each record pattern instance is a child of the document instance.
  • font, color, or positional attributes of HTML items can either be carried over to the XML output or can be suppressed in case they are not of interest.
  • the desired modalities of the XML translation are determined during the wrapper design process by a very simple and very user-friendly graphical interface and are stored in the form of a so-called "XML translation scheme" that encodes the mapping between extraction patterns and the XML schema in a suitable form. pattern building and program construction phases, and as a stand-alone program (Fig. 2) and last children are tables. The sequence is too general to match the desired target only,
  • each filter refers to a parent pattern from which it has to extract the desired information and specifies how to retrieve this information from a parent pattern instance.
  • a filter specifies the type of the desired information and how to distinguish the desired information from other similar information, i.e. it specifies some additional conditions that the desired information must fulfill. All the conditions in a filter are interpreted conjunctively in the sense that a portion of a Web page satisfies a filter if and only if it satisfies all the conditions expressed in this filter. For a sample pattern see Fig. 3.
  • Fig. 9 shows an extended entity-relationship diagram (EER) describing a data structure for representing and storing the definition of information patterns.
  • EER extended entity-relationship diagram
  • a pattern [901] can be a document pattern [903] or a source pattern [904], and is characterized by a unique name [902].
  • a document pattern represents a set of similarly structured Web pages, and is defined
  • Document filters define a way to retrieve documents from the text (in particular, from given links) of other documents. Essentially, when a document filter is evaluated (which takes as parent pattern a string pattern instance or a constant string representing the URL), the corresponding documents are fetched and their document trees constructed. With document filters, patterns can be reused on further pages, and recursive behavior can be exhibited.
  • a source pattern describes parts or elements of (HTML) documents. It can be a tree pattern [907] or a string pattern [906]. Tree patterns define the information of interest using the tree representation of (HTML) documents, whereas string patterns represent (HTML) documents or parts of those as plain text disregarding formatting tags.
  • a tree pattern is defined by one or more tree filters [912] (relation [909]), whereas a string pattern is defined by one or more string filters [911] (relation [908]).
  • Filters [910] can be tree, string or document filters; string filters can be distinguished in text filters [913] and attribute filters [914] - both are described below in more detail.
  • Tree and string filters extract information (relation [905]) from instances of a pattern (either a source or a document pattern).
  • An attribute filter can define a string pattern only and works on tree and document patterns only.
  • a text filter can define a string pattern only and can refer to all kinds of patterns. Example of Yahoo auction pages: Yahoo Auction pages are (at least were at the time these lines were written) structured as follows: each item is described in one table row; however, there are two different tables and headings, namely "Featured Auctions" and "All auctions".
  • Tree filters [1003] extract tree regions from other tree regions. They are defined by [1020] a tree region extraction definition [1015]. Pattern instances identified by a tree region extraction definition can be further filtered out by imposing additional tree conditions. Tree region extraction definitions specify the characteristics of the simple or general tree regions being extracted. In particular they specify how to identify the
  • Attribute filters [1004] extract information from attribute values. They must identify an attribute designator whose information has to be extracted and impose some further conditions (as explained later). For this reason, attribute filters are defined by [1021] attribute extraction definitions [1016]. An attribute filter extracts the values of one kind of attribute designator (or optionally, of more than one kind, e.g. defined via regular expressions).
  • Text filters [1005] always extract a substring of the parent string pattern instance, but they are also defined by [1017] a string extraction definition [1014] that can further restrict the characteristics of the substring being extracted.
  • A string extraction definition is essentially the specification of a language that defines all the substrings that can be extracted.
  • all parent patterns of the filters contained in this pattern need to be tree patterns or document patterns.
  • all parent patterns of the filters contained in this pattern can either be string patterns or tree patterns.
  • all parent patterns of the filters contained in the document pattern have to be string patterns.
  • a pattern acts like a disjunction of rule bodies: to be extracted, i.e. to be an instance of this pattern, a target needs to be in the solution set of at least one rule. Adding rules usually matches more targets, while adding constraints in the rule bodies removes unwanted matches.
  • Fig. 15 precisely the use of tree path minimization.
  • the tree path ". * table", due to minimization, matches only those tables which occur at the outermost level of any hierarchy of nested tables.
  • To get all tables one can either disallow tree path minimization (which is an option in the GUI; then instead of the star another sign is used for this general descendant navigation), or much better, use recursion to distinguish the various hierarchy levels. In the first solution, all extracted tables at any nesting level are direct be useful.
  • the evaluator removes the instances whose position does not appear in any of the specified intervals [1104], and returns [1105] the resulting list of pattern instances.
  • minimization step is performed by default, but can be optionally omitted if the designer wishes so. d) Evaluating tree extraction predicates and tree conditions
  • the evaluation of the tree extraction predicates is mainly concerned with the computation of the list of elements matched by an element path definition. Essentially, this computation can be split into two parts: finding the elements reachable from a given start element (the root of the parent pattern tree-region) following a certain tree path, and the validation of these elements w.r.t. the specified attribute conditions
  • Finding elements The following functions compute the list of elements matched by an incompletely specified tree path. They use the function children(x) that returns the list of the children of the element x ordered with respect to their position in the document tree and the function findDescendants(e, t) defined below that returns the list of the descendants of an element e that have name t, again ordered with respect to their position in the document tree.
  • the function children(x) that returns the list of the children of the element x ordered with respect to their position in the document tree
  • findDescendants(e, t) defined below that returns the list of the descendants of an element e that have name t, again ordered with respect to their position in the document tree.
  • [] identify lists and a notation like [x
  • the function matchelements(e,p) seeks the paths in the document tree rooted in the element e that fulfill the element path definition. It starts by finding the elements reachable from e whose type is the same as the type contained in the first part of the tree path p. This process is iterated with the remaining part of the path p.
  • An element e, i.e., a node of a document parse-tree
  • the test "substrg(s,x,y) in L(spd)" checks whether a string is in the language defined by a regular expression (see e.g., J. E. Hopcroft and J. D. Ullman: Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, ISBN: 0201441241) or verifies that such a string is in the relation defined by a concept predicate (or a mixture of both).
  • the set of the valid (ground) substitutions for an atom subtext(s,spd,X) is trivially computed by calling the function matchsubtext(s, spd). In our preferred embodiment, only those matches are considered which are not already part of another one, using left-to-right evaluation.
  • the evaluator starts [1401] by computing all the substrings contained in the parent pattern instance s matching the string path definition spd [1402]. Each of these substrings is
  • the atoms notbefore(s,x,spd,d) and notafter(s,x,spd,d) are evaluated using the corresponding atoms before(s,x,spd,0,d,Y,D) and after(s,x,spd,0,d,Y,D).
  • the evaluation of a contains(s,spd,X) atom derives straightforwardly from the matchsubtext function.
  • Pattern name EloglDB [318]
  • the standard mode is to refer case, since we are merely interested in writing a wrapper for a single Web page. However, time to maintain compatibility with a servlet frontend. (3) UI Package
  • one way to create consistency conditions is to mark that some patterns need to have at least one instance to make the wrapper consistent (in this case, labeling a checkbox for such a pattern is sufficient).
  • Such patterns are constructed in the same way as ordinary ones. They do not necessarily need to extract information.
  • Another scenario is to provide support in programming a video recorder. Instead of typing in a VPS number one can choose to type in the name of an interesting broadcast. The VR returns more information about this from a wrapped database and the option to program one of these broadcasts.
  • the extraction job server is embedded into a user-personalizable information process flow which accesses the XML output in order to query it and deliver parts of it to the user in case of changes, and a merger to map several XML outputs of different wrappers into a common scheme.
  • Lixto wrappers can be embedded into a personalizable user information pipe. There, information is processed from various sources, wrapped, merged, transformed and finally delivered.
  • Lixto is a tool for Web information extraction, information labelling and translation to XML. Its ideal companion is the InfoPipes system, which provides a tool for information integration and transformation, and multi-platform information delivery.
  • Lixto an HTML page to wrap. InfoPipes is capable of integrating various XML companions, transforming them, and querying them by providing easy-to-use graphical interfaces for
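
The element-finding functions named in the entries above (children, findDescendants, matchelements) and the effect of tree path minimization can be illustrated with a minimal Python sketch. It is not the implementation of the preferred embodiment: the Element class, the representation of a tree path as a list of (axis, name) steps, the axis names "child" and "descendant", and the minimize flag are assumptions introduced here for illustration only, and attribute conditions are left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    """Hypothetical stand-in for a node of a document parse-tree."""
    name: str
    kids: List["Element"] = field(default_factory=list)
    label: str = ""  # only used to tell nodes apart in the example


def children(x: Element) -> List[Element]:
    """Children of x, ordered by their position in the document tree."""
    return list(x.kids)


def find_descendants(e: Element, t: str, minimize: bool = True) -> List[Element]:
    """Descendants of e named t, in document order.

    Assumption: minimization means that the search does not continue
    below a matching element, so only outermost matches are returned.
    """
    found = []
    for c in children(e):
        if c.name == t:
            found.append(c)
            if minimize:
                continue  # skip everything nested inside this match
        found.extend(find_descendants(c, t, minimize))
    return found


def match_elements(e: Element, path: List[Tuple[str, str]],
                   minimize: bool = True) -> List[Element]:
    """Elements reachable from e along path; the first step is matched,
    then the process is iterated with the remaining part of the path."""
    if not path:
        return [e]
    (axis, name), rest = path[0], path[1:]
    if axis == "child":
        step = [c for c in children(e) if c.name == name]
    else:  # "descendant" plays the role of the star in a tree path
        step = find_descendants(e, name, minimize)
    result = []
    for m in step:
        result.extend(match_elements(m, rest, minimize))
    return result


# A table nested inside another table.
doc = Element("body", [
    Element("table", [
        Element("tr", [Element("td", [Element("table", label="inner")])]),
    ], label="outer"),
    Element("p"),
])

outermost = match_elements(doc, [("descendant", "table")])
all_tables = match_elements(doc, [("descendant", "table")], minimize=False)
```

With minimization, the descendant step stops at the first matching element on each branch, so only the outer table is returned; without it, the nested table is returned as well.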

Abstract

A method and a system for information extraction from Web pages formatted with markup languages such as HTML [8]. A method and system for interactively and visually describing information patterns of interest based on visualized sample Web pages [5,6,16-29]. A method and data structure for representing and storing these patterns [1]. A method and system for extracting information corresponding to a set of previously defined patterns from Web pages [2], and a method for transforming the extracted data into XML, are described. Each pattern is defined via the (interactive) specification of one or more filters. Two or more filters for the same pattern contribute disjunctively to the pattern definition [3], that is, an actual pattern describes the set of all targets specified by any of its filters. A method and system for extracting relevant elements from Web pages by interpreting and executing a previously defined wrapper program of the above form on an input Web page [9-14] and producing as output the extracted elements represented in a suitable data structure. A method and system for automatically translating said output into XML format by exploiting the hierarchical structure of the patterns and by using pattern names as XML tags is described.

Description

VISUAL AND INTERACTIVE WRAPPER GENERATION, AUTOMATED
INFORMATION EXTRACTION FROM WEB PAGES, AND TRANSLATION INTO
XML
I. INTRODUCTION
A. RELATED APPLICATIONS
[01] The present application claims priority from the copending U.S. Provisional
Application Serial No. 60/294,213, having the same title, filed May 31, 2001.
B. BACKGROUND
FIELD OF INVENTION
[02] This disclosure teaches techniques related in general to the field of information processing. More particularly, the teachings relate to methods, systems and computer-program products for information extraction from Web pages, to the construction of wrappers (i.e. extraction programs) based on example Web pages, and to the transformation of "relevant" parts of HTML documents into XML.
BASIC CONCEPTS, TERMINOLOGY, AND INTRODUCTION
[03] The World Wide Web (abbreviated as Web) is the world's largest data repository and information retrieval system. In this environment, client machines effect transactions to Web servers using the Hypertext Transfer Protocol (HTTP), which is an application protocol usually providing user access to files formatted in a standard page description language known as Hypertext Markup Language (HTML). HTML provides basic document formatting (and some logic markup) and allows the developer to specify "links" to other servers and documents. In the Internet paradigm, a network location reference to a server or to a specific Web resource at a server (for example a Web page) is identified by a so-called Uniform Resource Locator (URL) having a well-defined syntax for describing such a network location. The use of an (HTML-compatible) browser (e.g. Netscape Navigator, Microsoft Internet Explorer, Amaya or Opera) at a client machine involves the specification of a link by means of a URL. The client then makes a request to the server (also referred to as "Web site") identified by the link and receives in return an HTML document or some other object of a known file type. Simple and less sophisticated browsers can easily be written in a short time and with little effort in (object-oriented) programming languages such as Java, where powerful program libraries are available that already contain modules or classes providing the
[18] These approaches are discussed in detail.
a) Wrapper programming languages and environments
[31] Note that there are Web change monitoring and notification tools that have much better information extraction capabilities because they rely on a separate and independent supervised wrapper generator. An example is the continual query system OpenCQ (L.Liu, C.Pu, and Wei Tang, "Continual Queries for Internet Scale Event-Driven Information
Structured and Semistructured Data from Text Documents", Proceedings of the
learning structured data and translation into XML are given. This works very nicely on plain
that some explicit description (or tagging) of the relevant data is already contained in the sample document (in form of names of section headers) or at least unambiguously determined by simple attributes of headers (such as their font size), which is very often not the case. There is no possibility of locating desired extraction items by more complex contextual conditions. Moreover, no user-defined complex patterns can be created that are not already determined by basic formatting issues and paragraph headings. Accordingly, the constructed wrappers are not robust, except for sets of input documents having exactly the same formatting structure. No translation into XML is offered.
Another early approach of supervised wrapper generation was developed in "Wrapper
Generation for Web Accessible Data Sources" by J.R. Gruser, L. Raschid, M.E. Vidal, and L. Bright, in the Proceedings of CoopIS 1998. A translation to a semistructured data format can be generated.
The following systems are more advanced and can be considered the prior art most related to our own invention.
The XWRAP system by Ling Liu, Calton Pu and Wei Han (L. Liu, C. Pu, and W. Han, "XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources", Proceedings of the 16th International Conference on Data Engineering, San Diego CA, February 28 - March 3, 2000, IEEE Computer Society Press, pp. 611-621, 2000). See also the
will facilitate the wrapper code generation. After we get all the extraction rules, XWrap can compile these rules into a runable program."
[44] The main drawbacks of this system are:
• The limited expressive power of its pattern definition mechanism. First, the system lacks sufficiently powerful semi-automatic generalization mechanisms that allow a user to specify several similar patterns at once while marking only a single pattern of the desired type. Secondly, the system lacks sufficient visual facilities for imposing inherent (internal) or contextual (external) conditions to an extraction pattern, e.g. "extraction pattern X should appear after a recognized instance of the extraction pattern Y but not before an instance of pattern Z", and so on. Thirdly, the division into the two levels of description "region" and "token" and the automatic hierarchical structure extractor severely limit the ways to define extraction patterns. For example,
paths generated by the wizard and adding further HEL language constructs. The expressive power of the set of queries that can be visually generated (without hand coding) is extremely limited. This implies that a user of the W4F system is required to have both expertise of the HEL language and expertise of HTML. This, in turn, means that, notwithstanding the visual
that while Lixto is discussed in detail, it is only an example implementation and should not be construed to restrict the scope of the claims in any way.
[55] Further, computer-program products including computer-readable media with instructions that implement the systems and methods disclosed, completely or partially, are also contemplated as being within the overall scope of the disclosed teachings. It should be noted that the media could be anything including but not limited to RAMs, ROMs, hard disks, CDs, tapes, floppy disks, Internet downloads, etc. In short, any medium that can fix all or a subset of the instructions, even for a transient period of time, is considered as a computer-readable media for the purposes of this disclosed teaching. Further, the type of computer that can implement is also not restricted to any particular type, but includes personal computers, workstations, mainframes and the like. Also, it can be implemented on a stand-alone computer or in a distributed fashion across a network, including but not limited to, the Internet.
II. BRIEF DESCRIPTION OF THE DRAWINGS
[56] 1. Architecture Overview: A diagram depicting the overview of the Lixto architecture.
[57] 2. Architecture Overview: Using the extractor as stand-alone program.
[58] 3. Pattern-Filter-Diagram: Logical structure of a sample pattern illustrating the general pattern structure.
[59] 4. Package Architecture of preferred embodiment: Package structure of actual implementation.
[60] 5. Lixto Screenshots: Vectorized screenshots showing Lixto at work.
[61] 6. Lixto Screenshots: Vectorized screenshots showing Lixto at work (part 2).
[62] 7. Table of empirical results: Evaluation of Lixto w.r.t. different sample web sites.
[63] 8. Example of an HTML tree: Illustration of a possible way to parse an HTML document.
[64] 9. Pattern EER: The entity relationship diagram of pattern and filters.
[65] 10. Filter EER: The entity relationship diagram of filters and their constituents.
[66] 11. Rule Evaluation: Algorithm of evaluating an Elog rule.
[67] 12. subsequence evaluation: Algorithm of evaluating the "subsequence" predicate.
[68] 13. before evaluation: Algorithm of evaluating the "before" predicate for tree filters.
[69] 14. before string evaluation: Algorithm for evaluating the "before" predicate for string filters.
target pattern instance is allowed to appear. For example, a "before" condition might
[114] The XML Translation Builder, another interactive module of the visual builder (Fig. 1), is responsible for supporting a wrapper designer during the generation of the so-called XML translation program. One key issue is that, by default, pattern names that are chosen by the designer during the pattern design process are taken as output XML tags and that the hierarchy of extracted pattern instances (which is always a proper tree-like hierarchy) determines the structure of the output XML document. Thus, in case no specific action is taken by the designer, a standard translation of the extracted pattern instances into XML will be performed without any need of further interaction. However, Lixto also offers a wrapper designer the possibility to modify the standard XML translation in the three following ways:
• The designer can rename some patterns, with the effect that a new name instead of the actual pattern name appears as tag name in the XML translation.
• The designer can suppress some patterns from the translation. In this case, instances of non-suppressed patterns that are children of a suppressed pattern instance I will appear in the XML translation as children of the closest non-suppressed ancestor of I. For example, assume a designer wants to construct a wrapper that extracts all records from the third table of some Web page. She may first construct a pattern table that precisely identifies the third table. Then she may define a pattern record using a filter whose parent-pattern is the table pattern (with the effect that only records from the document's third table are identified as instances of the record pattern). In this case the pattern hierarchy is of the form document <- table <- record, where each arrow symbolizes a reference to a parent pattern. While the pattern table has an essential (but in a sense auxiliary) role in the definition of extraction items, the designer may decide to suppress it in the XML output. Then the XML output document will display a hierarchy of the form document <- record, where each record pattern instance is a child of the document instance.
• For each pattern the designer can choose the set of attributes that should be carried over to the output XML document. For example, font, color, or positional attributes of HTML items can either be carried over to the XML output or can be suppressed in case they are not of interest.
[115] The desired modalities of the XML translation (as described above) are determined during the wrapper design process by a very simple and very user-friendly graphical interface and are stored in the form of a so-called "XML translation scheme" that encodes the mapping between extraction patterns and the XML schema in a suitable form.
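The default translation and the three modifications described above (renaming, suppression, attribute selection) can be sketched as follows. This is only an illustrative Python sketch under stated assumptions: the classes Instance and Scheme, the function translate, and the choice to carry over no attributes unless they are listed are hypothetical and are not taken from the Lixto implementation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Instance:
    """Hypothetical extracted pattern instance in the instance hierarchy."""
    pattern: str
    text: str = ""
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Instance"] = field(default_factory=list)


@dataclass
class Scheme:
    """Hypothetical encoding of an XML translation scheme."""
    rename: Dict[str, str] = field(default_factory=dict)
    suppress: Set[str] = field(default_factory=set)
    keep_attrs: Dict[str, Set[str]] = field(default_factory=dict)


def translate(inst: Instance, scheme: Scheme,
              parent: Optional[ET.Element] = None) -> ET.Element:
    """Translate an instance hierarchy into an XML element tree."""
    if inst.pattern in scheme.suppress and parent is not None:
        # Suppressed: children attach to the closest non-suppressed ancestor.
        # (Assumption: the root instance is never suppressed.)
        target = parent
    else:
        # By default the pattern name is used as the XML tag.
        tag = scheme.rename.get(inst.pattern, inst.pattern)
        wanted = scheme.keep_attrs.get(inst.pattern, set())
        kept = {k: v for k, v in inst.attrs.items() if k in wanted}
        if parent is None:
            target = ET.Element(tag, kept)
        else:
            target = ET.SubElement(parent, tag, kept)
        if not inst.children:
            target.text = inst.text
    for child in inst.children:
        translate(child, scheme, target)
    return target


# Pattern hierarchy document <- table <- record, with "table" suppressed.
page = Instance("document", children=[
    Instance("table", children=[
        Instance("record", "first", {"color": "red", "font": "Arial"}),
        Instance("record", "second", {"color": "blue"}),
    ]),
])
scheme = Scheme(rename={"record": "item"},
                suppress={"table"},
                keep_attrs={"record": {"color"}})
xml_text = ET.tostring(translate(page, scheme), encoding="unicode")
```

For this sample input, xml_text is `<document><item color="red">first</item><item color="blue">second</item></document>`: the table level is gone, the records are renamed, and only the color attribute is carried over.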
pattern building and program construction phases, and as a stand-alone program (Fig. 2)
and last children are tables. The sequence is too general to match the desired target only. [162]
[163]
[164]
[165] More specifically, each filter refers to a parent pattern from which it has to extract the desired information and specifies how to retrieve this information from a parent pattern instance. Essentially, a filter specifies the type of the desired information and how to distinguish the desired information from other similar information, i.e. it specifies some additional conditions that the desired information must fulfill. All the conditions in a filter are interpreted conjunctively in the sense that a portion of a Web page satisfies a filter if and only if it satisfies all the conditions expressed in this filter. For a sample pattern see Fig. 3.
[166] Fig. 9 shows an extended entity-relationship diagram (EER) describing a data structure for representing and storing the definition of information patterns. For an explanation of various notations used in entity-relationship diagrams refer to Bernhard Thalheim, "Entity-Relationship Modeling: Foundations of Database Technology", Springer, ISBN 3540654704. We recall that there are alternative ways to describe this data structure, such as class diagrams or logical representation.
[167] A pattern [901] can be a document pattern [903] or a source pattern [904], and is characterized by a unique name [902].
[168] A document pattern represents a set of similarly structured Web pages, and is defined [916] by one or more document filters [915]. Document filters define a way to retrieve documents from the text (in particular, from given links) of other documents. Essentially, when evaluating a document filter (which takes as parent pattern a string pattern instance or a constant string representing the URL), the corresponding documents are fetched and their document tree constructed. With document filters, patterns can be reused on further pages, and recursive behavior can be exhibited.
[169] A source pattern describes parts or elements of (HTML) documents. It can be a tree pattern [907] or a string pattern [906]. Tree patterns define the information of interest using the tree representation of (HTML) documents, whereas string patterns represent (HTML) documents or parts of those as plain text disregarding formatting tags.
[170] A tree pattern is defined by one or more tree filters [912] (relation [909]), whereas a string pattern is defined by one or more string filters [911] (relation [908]). Filters [910] can be tree, string or document filters; string filters can be divided into text filters [913] and attribute filters [914]; both are described below in more detail. Tree and string filters extract information (relation [905]) from instances of a pattern (either a source or a document pattern). An attribute filter can define a string pattern only and works on tree and document patterns only. A text filter can define a string pattern only and can refer to all kinds of patterns.
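The relationship between pattern kinds and the filter kinds that may define them can be sketched as a small Python data model. This is only an illustrative reading of the description above; the names PatternKind, FilterKind, DEFINES, Filter and Pattern are assumptions made for this sketch and do not reproduce the entities of Fig. 9.

```python
from dataclasses import dataclass, field
from enum import Enum


class PatternKind(Enum):
    DOCUMENT = "document"
    TREE = "tree"
    STRING = "string"


class FilterKind(Enum):
    DOCUMENT = "document"
    TREE = "tree"
    TEXT = "text"            # a string filter
    ATTRIBUTE = "attribute"  # a string filter


# Which kind of pattern each kind of filter can define.
DEFINES = {
    FilterKind.DOCUMENT: PatternKind.DOCUMENT,
    FilterKind.TREE: PatternKind.TREE,
    FilterKind.TEXT: PatternKind.STRING,
    FilterKind.ATTRIBUTE: PatternKind.STRING,
}


@dataclass
class Filter:
    kind: FilterKind
    parent: str  # name of the parent pattern this filter extracts from


@dataclass
class Pattern:
    name: str  # unique pattern name
    kind: PatternKind
    filters: list = field(default_factory=list)

    def add_filter(self, flt):
        """Attach a filter, rejecting kinds that cannot define this pattern."""
        if DEFINES[flt.kind] is not self.kind:
            raise ValueError(
                f"a {flt.kind.value} filter cannot define a {self.kind.value} pattern")
        self.filters.append(flt)


price = Pattern("price", PatternKind.STRING)
price.add_filter(Filter(FilterKind.TEXT, parent="record"))
price.add_filter(Filter(FilterKind.ATTRIBUTE, parent="record"))
```

A pattern defined by several filters, as in this example, collects the instances found by any of them.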
[175] Example of Yahoo auction pages: Yahoo Auction pages are (at least were at the time these lines were written) structured as follows: each item is described in one table row; however, there are two different tables and headings, namely "Featured Auctions" and "All auctions". One can define "record" by referring to two different parent patterns: "record" contains two filters, where one refers to "tablefeatured" (1515) and the other refers to "tableall" (1514) as parent pattern.
[176] Filters [1001] are mainly characterized by the way they identify information being extracted. There are three different kinds of filters:
Tree filters [1003] extract tree regions from other tree regions. They are defined by [1020] a tree region extraction definition [1015]. Pattern instances identified by a tree region extraction definition can be further filtered out by imposing additional tree conditions. Tree region extraction definitions specify the characteristics of the simple or general tree regions being extracted. In particular, they specify how to identify the root elements of these trees and, if the desired tree regions are not subtrees (in the spirit of our above definition), the characteristics of the first and last child that belongs to the region.
String filters [1002], which are further subdivided into attribute filters and text filters:
Attribute filters [1004] extract information from attribute values. They must identify an attribute designator whose information has to be extracted and impose some further conditions (as explained later). For this reason, attribute filters are defined by [1021] attribute extraction definitions [1016]. An attribute filter extracts the values of one kind of attribute designator (or optionally, of more than one kind, e.g. defined via regular expressions).
Text filters [1005] always extract a substring of the parent string pattern instance, but they are also defined by [1017] a string extraction definition [1014] that can further restrict the characteristics of the substring being extracted. A string extraction definition is essentially the specification of a language that defines all the substrings that can be extracted.
[177] We give three easy examples of when to use which kind of filter based upon the example page used above whose HTML tree is depicted in Fig. 8:
definition of child patterns is not clearly defined).
[267] Patterns (and their filters) are restricted in their use of parent patterns in the following manner: In case of a tree pattern (all filters are tree extraction rules), all parent patterns of the filters contained in this pattern need to be tree patterns or document patterns. In case of a string pattern, all parent patterns of the filters contained in this pattern can either be string patterns or tree patterns. In case of a document pattern, all parent patterns of the filters contained in the document pattern have to be string patterns.
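The restriction on parent patterns stated in this paragraph amounts to a small lookup table. The following Python sketch is one way to check it; the table ALLOWED_PARENTS and the function check_parents are hypothetical names introduced here, and pattern kinds are represented as plain strings for brevity.

```python
# Allowed kinds of parent pattern for each kind of pattern, as listed in the text.
ALLOWED_PARENTS = {
    "tree": {"tree", "document"},
    "string": {"string", "tree"},
    "document": {"string"},
}


def check_parents(pattern_kind, parent_kinds):
    """Return the parent kinds that violate the restriction (empty list if valid)."""
    return sorted(set(parent_kinds) - ALLOWED_PARENTS[pattern_kind])


ok = check_parents("tree", ["tree", "document"])
bad = check_parents("document", ["tree"])
```

A wrapper builder could run such a check whenever a filter is added, so that an invalid combination is reported before the wrapper is evaluated.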
[268] In case of a homogeneous pattern, i.e. all filters refer to the same parent pattern, the notion of "parent pattern" can be associated with a pattern rather than with its filters. In fact, in a more restrictive embodiment of the disclosed invention, where only homogeneous patterns are allowed, the parent pattern is always specified together with a pattern and not for its filters.
[269] As for standard datalog rules, a pattern acts like a disjunction of rule bodies: to be extracted, i.e. to be an instance of this pattern, a target needs to be in the solution set of at least one rule. Adding rules usually matches more targets, while adding constraints in the rule bodies removes unwanted matches.
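The interplay of disjunction (across the rules of a pattern) and conjunction (across the conditions of one rule) can be sketched in a few lines of Python. The functions evaluate_filter and evaluate_pattern and the sample rows are hypothetical; conditions are modeled simply as Boolean functions over candidate targets, which is a simplification of the rule language described in this document.

```python
def evaluate_filter(candidates, conditions):
    """A filter is a conjunction: keep candidates meeting every condition."""
    return [c for c in candidates if all(cond(c) for cond in conditions)]


def evaluate_pattern(candidates, filters):
    """A pattern is a disjunction: keep candidates accepted by any filter."""
    result = []
    for conditions in filters:
        for c in evaluate_filter(candidates, conditions):
            if c not in result:
                result.append(c)
    return result


rows = ["featured: lamp", "featured: sold chair", "all: desk", "ad: banner"]


def is_featured(row):
    return row.startswith("featured:")


def is_all(row):
    return row.startswith("all:")


def not_sold(row):
    return "sold" not in row


one_rule = evaluate_pattern(rows, [[is_featured]])
two_rules = evaluate_pattern(rows, [[is_featured], [is_all]])
constrained = evaluate_pattern(rows, [[is_featured, not_sold], [is_all]])
```

Here the second rule adds a further match, while the extra condition in the first rule removes one.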
[270] The extracted targets of a pattern can be minimized. In our preferred embodiment we chose to consider only those targets not contained in any other target of the same pattern instance. Minimization can be carried out even with recursive patterns. However, this type of minimization is restricted to instances of the same parent-pattern instance. Hence, in example [1521, 1522] of Fig. 15, where nested tables are to be extracted, minimization does not cut off the interior tables, because every table has a different table as parent pattern.
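As a rough illustration, the minimization step can be sketched in Python as follows. The sketch follows the wording of this paragraph literally: a target is kept unless it is contained in another target derived from the same parent-pattern instance. Representing a target as a (parent, start, end) tuple of document positions and the function name minimize are assumptions made for this example.

```python
def minimize(targets):
    """Drop every target that lies inside another target of the same parent.

    targets: list of (parent, start, end) tuples, where parent identifies the
    parent-pattern instance and start/end delimit the target in the document.
    """
    def contained(a, b):
        # a lies inside b, both derive from the same parent, and a differs from b
        return a != b and a[0] == b[0] and b[1] <= a[1] and a[2] <= b[2]

    return [t for t in targets if not any(contained(t, o) for o in targets)]


targets = [("p1", 0, 10), ("p1", 2, 5), ("p2", 2, 5)]
kept = minimize(targets)
```

The span (2, 5) under parent p1 is removed because it lies inside (0, 10) of the same parent, whereas the identical span under parent p2 survives, mirroring the restriction to instances of the same parent-pattern instance.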
[271] As those skilled in the art can easily recognize, instead of minimization various other alternative ways of simplifying or restricting the legal output pattern instances can be used. It is conceivable that in certain contexts some other methods may be more appropriate than the described minimization. Lixto is open for incorporating such other methods in alternative embodiments.
[272] Note that the reason why a recursive approach was taken in the second example of Fig. 15 is precisely the use of tree path minimization. In fact, starting from the document root, the tree path ". * table", due to minimization, matches only those tables which occur at the outermost level of any hierarchy of nested tables. To get all tables one can either disallow tree path minimization (which is an option in the GUI; then instead of the star another sign is used for this general descendant navigation), or much better, use recursion to distinguish the various hierarchy levels. In the first solution, all extracted tables at any nesting level are direct
be useful. In a final step the evaluator removes the instances whose position does not appear in any of the specified intervals [1104], and returns [1105] the resulting list of pattern instances.
[300] When evaluating a pattern, the union of all extracted instances of all its rules is considered and all non-minimal targets, i.e. all pattern instances which derive from the same parent instance as an instance which is entirely contained inside this instance, are dropped. This minimization step is performed by default, but can be optionally omitted if the designer wishes so.
d) Evaluating tree extraction predicates and tree conditions
[301] The evaluation of the tree extraction predicates is mainly concerned with the computation of the list of elements matched by an element path definition. Essentially, this computation can be split into two parts: finding the elements reachable from a given start element (the root of the parent pattern tree-region) following a certain tree path, and the validation of these elements w.r.t. the specified attribute conditions.
[302] Finding elements: The following functions compute the list of elements matched by an incompletely specified tree path. They use the function children(x) that returns the list of the children of the element x ordered with respect to their position in the document tree and the function findDescendants(e, t) defined below that returns the list of the descendants of name t of an element e, again ordered with respect to their position in the document tree. The brackets [] identify lists and a notation like [x|C] identifies a list having head x and tail C, whereas the function concat returns the concatenation of two lists. Essentially, the function matchelements(e,p) seeks the paths in the document tree rooted in the element e that fulfill the element path definition. It starts by finding the elements reachable from e whose type is the same as the type contained in the first part of the tree path p. This process is iterated with the remaining part of the path p.
function matchelements(e, p): List of matched elements
INPUT
  An element e (i.e., a node of a document parse-tree)
  An incompletely specified tree path p
OUTPUT
  List of matched elements
BEGIN
  /* the list of children of x */
List of matched elements
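Since the bodies of these functions are not reproduced in the text, the following Python sketch shows one possible reading of the description of matchelements, children and findDescendants. It is not the original pseudocode. It assumes that an incompletely specified tree path is given as a list of (any_depth, name) steps, where any_depth set to True plays the role of a wildcard followed by an element name, and it works on xml.etree.ElementTree elements. Attribute conditions and tree path minimization are left out.

```python
import xml.etree.ElementTree as ET


def children(x):
    """Children of element x in document order."""
    return list(x)


def find_descendants(e, t):
    """Descendants of e with tag t in document order, excluding e itself."""
    return [d for d in e.iter(t) if d is not e]


def match_elements(e, p):
    """Elements reachable from e along the (assumed) path representation p."""
    if not p:
        return [e]
    (any_depth, name), rest = p[0], p[1:]
    if any_depth:
        step = find_descendants(e, name)
    else:
        step = [c for c in children(e) if c.tag == name]
    result = []
    for x in step:
        # iterate with the remaining part of the path, skipping duplicates
        for m in match_elements(x, rest):
            if not any(m is r for r in result):
                result.append(m)
    return result


page = ET.fromstring(
    "<html><body><table><tr><td>a</td></tr><tr><td>"
    "<table><tr><td>b</td></tr></table>"
    "</td></tr></table></body></html>")

outer = match_elements(page, [(False, "body"), (False, "table")])
all_tables = match_elements(page, [(True, "table")])
cells = match_elements(page, [(True, "table"), (True, "td")])
```

In this sample page the plain path body/table reaches only the outer table, while the wildcard step reaches both the outer and the nested table.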
[308]
[309]
[310]
R := concat(R, stringpattern(s, x, y));
return R
[311] The test "substrg(s,x,y) in L(spd)" checks whether a string is in the language defined by a regular expression (see e.g., J. E. Hopcroft and J. Ullman: Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, ISBN: 0201441241) or verifies that such a string is in the relation defined by a concept predicate (or a mixture of both). Thus the set of the valid (ground) substitutions for an atom subtext(s,spd,X) is trivially computed by calling the function matchsubtext(s, spd). In our preferred embodiment, only those matches are considered which are not already part of another one, using left-to-right evaluation.
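For the regular-expression case, a function in the spirit of matchsubtext can be sketched with Python's re module. This is a simplification under stated assumptions: the string path definition spd is taken to be a plain regular expression, concept predicates are not covered, and the left-to-right, non-overlapping scan of re.finditer stands in for the rule that a match must not already be part of another one. The name match_subtext and the returned (substring, start, end) triples are choices made for this example.

```python
import re


def match_subtext(s, spd):
    """Return (substring, start, end) for each left-to-right,
    non-overlapping match of the regular expression spd in s."""
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(spd, s)]


prices = match_subtext("EUR 12, EUR 345", r"\d+")
```

Each returned triple keeps the position of the substring, which corresponds to the notion of a string source (a string together with its position) used in the evaluation of the before and after predicates.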
[312] The computation of the above function can be sped up with the same methods as used for finite state automata (see again the book of Hopcroft and Ullman).
[313] As for the predicate subtext, the evaluation of before and after predicates for string sources uses the function matchsubtext. A flow diagram for the evaluation of the before predicate is reported in Fig. 14. The evaluation of the after predicate is completely analogous, as will be understood by those skilled in the art, and is not reported here.
[314] The evaluator starts [1401] by computing all the substrings contained in the parent pattern instance s matching the string path definition spd [1402]. Each of these substrings is processed in an iteration of the loop [1404,1403,1406,1408,1407], where it is verified that it is in the desired distance interval [1403,1406], and if this is the case, the string source (i.e. the string together with its position) is inserted into the result list [1408].
[315] As for elements, the atoms notbefore(s,x,spd,d) and notafter(s,x,spd,d) are evaluated using the corresponding atoms before(s,x,spd,0,d,Y,D) and after(s,x,spd,0,d,Y,D). The evaluation of a contains(s,spd,X) atom derives straightforwardly from the matchsubtext function. In our preferred embodiment we consider "minimal" regular expressions only, i.e. regular expressions not contained in each other, starting evaluation from the left to the right with a "greedy" operator interpretation (like Perl treats regular expressions).
[316] The evaluation of concept predicates straightforwardly follows from this definition. Indeed, syntactic concept predicates can be evaluated by simply testing the membership of a string in the language defined by a regular expression, whereas semantic predicates are directly represented as sets of ground atoms.
f) Translation into XML
[317] The XML translator offers the possibility to map extracted instances of the pattern instance base into XML elements. A simple correspondence holds: Pattern name = EloglDB [318]
[319]
//next line
appear before the desired target pattern.
Java Swing browser with navigation capabilities is used) [1602]. This sample page is
should be added [1607]. Otherwise, the pattern already exactly identifies the desired
(1) Constructing a Tree Filter (Fig. 17)
express that the content of an element is a date and is before 10th of March 2001 (keep
must occur after the desired pattern instance. Note that such pattern references do not
designer the relevant groups and the designer selects which groups to consider and
for tree conditions) to opt for one of the following two modes: The standard mode is to refer
case, since we are merely interested in writing a wrapper for a single Web page. However,
time to maintain compatibility with a servlet frontend.
(3) UI Package
[497] If the consistency condition is not fulfilled, a warning signal is given or a warning email is sent to an administrator and/or wrapper execution is stopped. Consider a wrapper designer who once created a wrapper and has used it since then for repeated extraction; after three months, the structure of the page changes significantly and he could lose some information without even knowing about it. The wrapper program hence can give some warning message that some structural conditions are no longer fulfilled, and that this page does not classify as a suitable page for this wrapper. It is useful that such consistency conditions can be specified by the wrapper designer herself, because she knows best which requirements to pose as conditions for stable extraction.
[498] As just said, one way to create consistency conditions is to mark that some patterns need to have at least one instance to make the wrapper consistent (in this case, checking a checkbox for such a pattern is sufficient). Such patterns are constructed in the same way as ordinary ones. They do not necessarily need to extract information.
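The "at least one instance" form of consistency condition can be sketched in Python as a check over the extraction result. The function names check_consistency and run_wrapper, the dictionary layout of the extraction result, and the use of a callback for the warning are all assumptions made for this illustration; sending an email to an administrator would be one possible implementation of that callback.

```python
def check_consistency(instances_by_pattern, required_patterns):
    """Return the required patterns that produced no instance."""
    return [p for p in required_patterns if not instances_by_pattern.get(p)]


def run_wrapper(instances_by_pattern, required_patterns, warn=print):
    """Warn and stop (return None) if a consistency condition is violated."""
    missing = check_consistency(instances_by_pattern, required_patterns)
    if missing:
        warn("consistency check failed, no instances for: " + ", ".join(missing))
        return None
    return instances_by_pattern


warnings = []
good = run_wrapper({"record": ["r1", "r2"], "price": ["10"]},
                   ["record"], warn=warnings.append)
changed = run_wrapper({"record": [], "price": ["10"]},
                      ["record"], warn=warnings.append)
```

In the second call the page structure is assumed to have changed so that no record instance is found; the wrapper reports the violation instead of silently returning an incomplete result.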
(10) Possible Example Scenarios
[499] One possible scenario uses Lixto wrappers to wrap sites of TV broadcast companies and provide a user interface to query the combined XML companions of TV program HTML pages of various channels.
[500] Another scenario is to provide support in programming a video recorder. Instead of typing in a VPS number one can choose to type in the name of an interesting broadcast. The VR returns more information about this from a wrapped database and the option to program one of these broadcasts.
(11) Information Pipes System
[501] In one of our embodiments, the extraction job server is embedded into a user-personalizable information process flow which accesses the XML output in order to query it and deliver parts of it to the user in case of changes, and a merger to map several XML outputs of different wrappers into a common scheme. Lixto wrappers can be embedded into a personalizable user information pipe. There, information is processed from various sources, wrapped, merged, transformed and finally delivered.
[502] Lixto is a tool for Web information extraction, information labelling and translation to XML. Its ideal companion is the InfoPipes system, which provides a tool for information integration and transformation, and multi-platform information delivery.
[503] InfoPipes takes care of navigating through passwords, HTML forms, etc. to provide Lixto an HTML page to wrap. InfoPipes is capable of integrating various XML companions, transforming them, and querying them by providing easy-to-use graphical interfaces for

Claims

WHAT IS CLAIMED IS
1. A wrapper generation system comprising: a network including at least one example document and at least one production document; and a visual builder that is adapted to interactively generate a wrapper program by letting a user visually and interactively declare at least one desired property of example-elements to be extracted from the example document thereby creating user declarations; a program evaluator adapted to execute a wrapper program over the production document and to extract desired production elements from the production document and to translate the production elements into XML yielding an XML companion of the production document.
2. The wrapper generation system of Claim 1, wherein the visual builder further includes: a pattern builder adapted to provide a visual interface for a user to specify at least one desired pattern whose instances are to be extracted from the document.
3. The wrapper generation system of Claim 2, wherein the visual builder further includes: an XML translation builder adapted to interactively generate an XML translation scheme based on user specifications that specifies how to translate at least one pattern into XML using XML translation rules.
4. The wrapper generation system of Claim 1, wherein the program evaluator further includes: an extractor that extracts data from the production document and provides an XML document.
5. The wrapper generation system of Claim 2, further including: an XML translator that performs actual mapping from a set of pattern instances to an XML document.
6. The wrapper generation system of Claim 1, wherein the example document is received in a browser window.
7. The wrapper generation system of Claim 1, wherein a user controls wrapper generation, selects examples from the example document and adds further user specifications.
colgroup, caption, td, p, br, div, blockquote, body, head, h1, h2, h3, h4, h5, h6, dl, dd, dt, ol, ul, li, dir, menu, form, input, select, option, address, center, pre, xmp, nobr, wbr, hr, img, b, i, font-size, font-color, underline, blink, a (anchors), href, tt, big, sup, sub, cite, code, strong, em, samp, area, map, script, and CSS styles applied to the content and structure.
25. The wrapper generation system of Claim 13, wherein said generalized location descriptor is obtained from a corresponding location descriptor by syntactic generalization operations.
26. The wrapper generation system of Claim 25, wherein said generalized location descriptor is obtained by inserting zero or more wildcards, by substituting wildcards for zero or more elements and by eliminating zero or more elements from said location descriptor.
27. The wrapper generation system of Claim 13, wherein the example-document is tree-structured and wherein each location descriptor, called plain tree path, created for an example-element occurring in an example-document corresponds to the sequence of element-types on the path in the parsing tree of said example-document from the root to said example-element, and where the corresponding generalized location descriptor, called incompletely specified tree path, is obtained from said location descriptor by inserting zero or more wildcards, by substituting wildcards for zero or more elements and by eliminating zero or more elements from said location descriptor.
28. The wrapper generation system of Claim 13, wherein at least one production-document is tree-structured and wherein each location descriptor, called plain tree path, created for an example-element occurring in an example-document corresponds to the sequence of element-types on the path in the parsing tree of said example-document from the root to said example-element, and where the corresponding generalized location descriptor, called incompletely specified tree path, is obtained from said location descriptor by inserting zero or more wildcards, by substituting wildcards for zero or more elements and by eliminating zero or more elements from said location descriptor.
29. The wrapper generation system of Claim 13, wherein at least one production-document is in HTML format and wherein each location descriptor, called plain tree path, created for an example-element occurring in an example-document corresponds to the sequence of HTML tags on the path in the parsing tree of said example-document from the root to said example-
5. The wrapper I
1. The method of Claim 57 wherein at least one extraction pattern is organized according
eliminating zero or more elements from said location descriptor.
94. The method of Claim 57, wherein the pattern description generated for each of said patterns consists of a set of rules formulated in a logic programming language and where said wrapper consists of a logic program containing said rules.
95. The method of Claim 94 wherein the variables and terms of said logic program range over elements occurring in the documents to which said wrapper is applied.
96. The method of Claim 94 wherein the variables and terms of said logic program range over elements and strings occurring in the documents to which said wrapper is applied.
97. The method of Claim 94 wherein the variables and terms of said logic program range over elements, element-lists, and strings occurring in the documents to which said wrapper is applied.
98. The method of Claim 94 wherein the variables and terms of said logic program range over elements, element-lists, strings and attribute values occurring in the documents to which said wrapper is applied.
99. The method of Claim 94 wherein said documents are tree-structured and where the variables of said logic program range over nodes of the parsing trees of documents to which said wrapper is applied.
100. The method of Claim 94 wherein said documents are tree-structured and where the variables of said logic program range over nodes of the parsing trees and over strings occurring in the documents to which said wrapper is applied.
101. The method of Claim 94 wherein said documents are tree-structured and where the variables of said logic program range over nodes and sequences of nodes of the parsing trees of documents to which said wrapper is applied.
102. The method of Claim 94 wherein said documents are tree-structured and where the variables of said logic program range over nodes of the parsing tree of documents to which said wrapper is applied, over sequences of such nodes, and over strings occurring in documents to which said wrapper is applied.
wherein the contains conditions impose one or more restrictions on some subelement of the pattern to be defined and the notcontains conditions require that instances of the pattern to be defined do not contain any subelement that satisfies specified restrictions.
125. The method of Claim 123, wherein range intervals can be specified with each rule of said extended logic program, and wherein only those facts are computed by a rule that match the ranges of said intervals in the ordering induced by document order.
PCT/IB2002/003036 2001-05-31 2002-05-28 Visual and interactive wrapper generation, automated information extraction from web pages, and translation into xml WO2002097667A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US10/479,039 US7581170B2 (en) 2001-05-31 2002-05-28 Visual and interactive wrapper generation, automated information extraction from Web pages, and translation into XML
EP02755419A EP1430420A2 (en) 2001-05-31 2002-05-28 Visual and interactive wrapper generation, automated information extraction from web pages, and translation into xml

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US29421301P 2001-05-31 2001-05-31
US60/294,213 2001-05-31

Publications (2)

Publication Number Publication Date
WO2002097667A2 true WO2002097667A2 (en) 2002-12-05
WO2002097667A8 WO2002097667A8 (en) 2004-04-15

Family

ID=23132374

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2002/003036 WO2002097667A2 (en) 2001-05-31 2002-05-28 Visual and interactive wrapper generation, automated information extraction from web pages, and translation into xml

Country Status (3)

Country Link
US (1) US7581170B2 (en)
EP (1) EP1430420A2 (en)
WO (1) WO2002097667A2 (en)

JP4676136B2 (en) * 2003-05-19 2011-04-27 株式会社日立製作所 Document structure inspection method and apparatus
US7890852B2 (en) 2003-06-26 2011-02-15 International Business Machines Corporation Rich text handling for a web application
US20040268229A1 (en) * 2003-06-27 2004-12-30 Microsoft Corporation Markup language editing with an electronic form
US7451392B1 (en) * 2003-06-30 2008-11-11 Microsoft Corporation Rendering an HTML electronic form by applying XSLT to XML using a solution
US7406660B1 (en) 2003-08-01 2008-07-29 Microsoft Corporation Mapping between structured data and a visual surface
US7334187B1 (en) * 2003-08-06 2008-02-19 Microsoft Corporation Electronic form aggregation
US7546288B2 (en) * 2003-09-04 2009-06-09 Microsoft Corporation Matching media file metadata to standardized metadata
US7725875B2 (en) * 2003-09-04 2010-05-25 Pervasive Software, Inc. Automated world wide web navigation and content extraction
US7236982B2 (en) * 2003-09-15 2007-06-26 Pic Web Services, Inc. Computer systems and methods for platform independent presentation design
US20050066269A1 (en) * 2003-09-18 2005-03-24 Fujitsu Limited Information block extraction apparatus and method for Web pages
US20050091224A1 (en) * 2003-10-22 2005-04-28 Fisher James A. Collaborative web based development interface
US8392823B1 (en) 2003-12-04 2013-03-05 Google Inc. Systems and methods for detecting hidden text and hidden links
US7783555B2 (en) * 2003-12-11 2010-08-24 Ebay Inc. Auction with interest rate bidding
US7640497B1 (en) 2003-12-22 2009-12-29 Apple Inc. Transforming a hierarchical data structure according to requirements specified in a transformation template
US8819072B1 (en) 2004-02-02 2014-08-26 Microsoft Corporation Promoting data from structured data files
US8423471B1 (en) * 2004-02-04 2013-04-16 Radix Holdings, Llc Protected document elements
US7440954B2 (en) * 2004-04-09 2008-10-21 Oracle International Corporation Index maintenance for operations involving indexed XML data
US7398265B2 (en) * 2004-04-09 2008-07-08 Oracle International Corporation Efficient query processing of XML data using XML index
US7603347B2 (en) * 2004-04-09 2009-10-13 Oracle International Corporation Mechanism for efficiently evaluating operator trees
US7496837B1 (en) * 2004-04-29 2009-02-24 Microsoft Corporation Structural editing with schema awareness
US20050268233A1 (en) * 2004-04-30 2005-12-01 Configurecode, Inc. System and method for mixed language editing
US7774620B1 (en) 2004-05-27 2010-08-10 Microsoft Corporation Executing applications at appropriate trust levels
EP1759315B1 (en) * 2004-06-23 2010-06-30 Oracle International Corporation Efficient evaluation of queries using translation
US7516121B2 (en) * 2004-06-23 2009-04-07 Oracle International Corporation Efficient evaluation of queries using translation
US7529731B2 (en) * 2004-06-29 2009-05-05 Xerox Corporation Automatic discovery of classification related to a category using an indexed document collection
US9098476B2 (en) * 2004-06-29 2015-08-04 Microsoft Technology Licensing, Llc Method and system for mapping between structured subjects and observers
US7302426B2 (en) * 2004-06-29 2007-11-27 Xerox Corporation Expanding a partially-correct list of category elements using an indexed document collection
US7558792B2 (en) * 2004-06-29 2009-07-07 Palo Alto Research Center Incorporated Automatic extraction of human-readable lists from structured documents
US7370273B2 (en) * 2004-06-30 2008-05-06 International Business Machines Corporation System and method for creating dynamic folder hierarchies
US8566300B2 (en) * 2004-07-02 2013-10-22 Oracle International Corporation Mechanism for efficient maintenance of XML index structures in a database system
US7668806B2 (en) 2004-08-05 2010-02-23 Oracle International Corporation Processing queries against one or more markup language sources
US7769773B1 (en) * 2004-08-31 2010-08-03 Adobe Systems Incorporated Relevant rule inspector for hierarchical documents
US9171100B2 (en) 2004-09-22 2015-10-27 Primo M. Pettovello MTree an XPath multi-axis structure threaded index
US20060074933A1 (en) * 2004-09-30 2006-04-06 Microsoft Corporation Workflow interaction
US7692636B2 (en) * 2004-09-30 2010-04-06 Microsoft Corporation Systems and methods for handwriting to a screen
US7801874B2 (en) * 2004-10-22 2010-09-21 Mahle Powertrain Llc Reporting tools
US8487879B2 (en) * 2004-10-29 2013-07-16 Microsoft Corporation Systems and methods for interacting with a computer through handwriting to a screen
US20060107224A1 (en) * 2004-11-15 2006-05-18 Microsoft Corporation Building a dynamic action for an electronic form
US7712022B2 (en) 2004-11-15 2010-05-04 Microsoft Corporation Mutually exclusive options in electronic forms
US7584417B2 (en) * 2004-11-15 2009-09-01 Microsoft Corporation Role-dependent action for an electronic form
US7509353B2 (en) * 2004-11-16 2009-03-24 Microsoft Corporation Methods and systems for exchanging and rendering forms
US7721190B2 (en) * 2004-11-16 2010-05-18 Microsoft Corporation Methods and systems for server side form processing
WO2006065877A2 (en) * 2004-12-14 2006-06-22 Freedom Scientific, Inc. Custom labeler for screen readers
US7904801B2 (en) * 2004-12-15 2011-03-08 Microsoft Corporation Recursive sections in electronic forms
US7437376B2 (en) * 2004-12-20 2008-10-14 Microsoft Corporation Scalable object model
US7937651B2 (en) * 2005-01-14 2011-05-03 Microsoft Corporation Structural editing operations for network forms
US7725834B2 (en) * 2005-03-04 2010-05-25 Microsoft Corporation Designer-created aspect for an electronic form template
US7673228B2 (en) * 2005-03-30 2010-03-02 Microsoft Corporation Data-driven actions for network forms
US8468445B2 (en) * 2005-03-30 2013-06-18 The Trustees Of Columbia University In The City Of New York Systems and methods for content extraction
US8463801B2 (en) * 2005-04-04 2013-06-11 Oracle International Corporation Effectively and efficiently supporting XML sequence type and XQuery sequence natively in a SQL system
US8010515B2 (en) * 2005-04-15 2011-08-30 Microsoft Corporation Query to an electronic form
US20060235839A1 (en) * 2005-04-19 2006-10-19 Muralidhar Krishnaprasad Using XML as a common parser architecture to separate parser from compiler
US7949941B2 (en) * 2005-04-22 2011-05-24 Oracle International Corporation Optimizing XSLT based on input XML document structure description and translating XSLT into equivalent XQuery expressions
US7499931B2 (en) * 2005-05-09 2009-03-03 International Business Machines Corporation Method and apparatus for approximate projection of XML documents
US8015549B2 (en) 2005-05-10 2011-09-06 Novell, Inc. Techniques for monitoring application calls
EP1896969A2 (en) * 2005-05-31 2008-03-12 Ipifini, Inc. Computer program for identifying and automating repetitive user inputs
US20070011665A1 (en) * 2005-06-21 2007-01-11 Microsoft Corporation Content syndication platform
US7543228B2 (en) * 2005-06-27 2009-06-02 Microsoft Corporation Template for rendering an electronic form
US8200975B2 (en) * 2005-06-29 2012-06-12 Microsoft Corporation Digital signatures for network forms
US8166059B2 (en) 2005-07-08 2012-04-24 Oracle International Corporation Optimization of queries on a repository based on constraints on how the data is stored in the repository
US7478092B2 (en) * 2005-07-21 2009-01-13 International Business Machines Corporation Key term extraction
US7613996B2 (en) * 2005-08-15 2009-11-03 Microsoft Corporation Enabling selection of an inferred schema part
US20070036433A1 (en) * 2005-08-15 2007-02-15 Microsoft Corporation Recognizing data conforming to a rule
US7698695B2 (en) * 2005-08-31 2010-04-13 International Business Machines Corporation Search technique for design patterns in Java source code
US20070061706A1 (en) * 2005-09-14 2007-03-15 Microsoft Corporation Mapping property hierarchies to schemas
US20070061467A1 (en) * 2005-09-15 2007-03-15 Microsoft Corporation Sessions and session states
CA2621348C (en) * 2005-09-27 2010-07-06 Teamon Systems, Inc. System for obtaining image using xslt extension and related method
US20070083599A1 (en) * 2005-09-27 2007-04-12 Teamon Systems, Inc. System for transforming application data using xslt extensions to render templates from cache and related methods
US7484173B2 (en) * 2005-10-18 2009-01-27 International Business Machines Corporation Alternative key pad layout for enhanced security
US7664742B2 (en) * 2005-11-14 2010-02-16 Pettovello Primo M Index data structure for a peer-to-peer network
US8001459B2 (en) * 2005-12-05 2011-08-16 Microsoft Corporation Enabling electronic documents for limited-capability computing devices
US8144730B2 (en) * 2005-12-13 2012-03-27 The Boeing Company Automated tactical datalink translator
US7949646B1 (en) 2005-12-23 2011-05-24 At&T Intellectual Property Ii, L.P. Method and apparatus for building sales tools by mining data from websites
CN101346997B (en) * 2005-12-28 2015-01-14 英特尔公司 Novel user sensitive information adaptive video code conversion structure, and method and equipment of the same
US8862551B2 (en) 2005-12-29 2014-10-14 Nextlabs, Inc. Detecting behavioral patterns and anomalies using activity data
US20070162409A1 (en) * 2006-01-06 2007-07-12 Godden Kurt S Creation and maintenance of ontologies
US20070174309A1 (en) * 2006-01-18 2007-07-26 Pettovello Primo M Mtreeini: intermediate nodes and indexes
US9170987B2 (en) * 2006-01-18 2015-10-27 Microsoft Technology Licensing, Llc Style extensibility applied to a group of shapes by editing text files
US7984389B2 (en) 2006-01-28 2011-07-19 Rowan University Information visualization system
US7836399B2 (en) * 2006-02-09 2010-11-16 Microsoft Corporation Detection of lists in vector graphics documents
US20070240032A1 (en) * 2006-04-07 2007-10-11 Wilson Jeff K Method and system for vertical acquisition of data from HTML tables
US20070245327A1 (en) * 2006-04-17 2007-10-18 Honeywell International Inc. Method and System for Producing Process Flow Models from Source Code
US7774746B2 (en) * 2006-04-19 2010-08-10 Apple, Inc. Generating a format translator
US8793584B2 (en) * 2006-05-24 2014-07-29 International Business Machines Corporation Customizable user interface wrappers for web applications
WO2007139039A1 (en) * 2006-05-26 2007-12-06 Nec Corporation Information classification device, information classification method, and information classification program
ITUD20060161A1 (en) * 2006-06-22 2007-12-23 Matteo Macoratti PROCEDURE AND EQUIPMENT FOR THE ELECTRONIC PROCESSING OF ELECTRONIC DOCUMENTS, OR FILES
GB0612673D0 (en) * 2006-06-27 2006-08-09 Gems Tv Ltd Computer system
JP4539613B2 (en) * 2006-06-28 2010-09-08 富士ゼロックス株式会社 Image forming apparatus, image generation method, and program
US7499909B2 (en) * 2006-07-03 2009-03-03 Oracle International Corporation Techniques of using a relational caching framework for efficiently handling XML queries in the mid-tier data caching
US7660804B2 (en) * 2006-08-16 2010-02-09 Microsoft Corporation Joint optimization of wrapper generation and template detection
US20080059486A1 (en) * 2006-08-24 2008-03-06 Derek Edwin Pappas Intelligent data search engine
CN101140578B (en) * 2006-09-06 2010-12-08 鸿富锦精密工业(深圳)有限公司 Method and system for multithread analyzing web page data
US7739219B2 (en) 2006-09-08 2010-06-15 Oracle International Corporation Techniques of optimizing queries using NULL expression analysis
US7647351B2 (en) * 2006-09-14 2010-01-12 Stragent, Llc Web scrape template generation
US8321845B2 (en) * 2006-10-13 2012-11-27 International Business Machines Corporation Extensible markup language (XML) path (XPATH) debugging framework
US7680782B2 (en) * 2006-10-18 2010-03-16 International Business Machines Corporation Method to generate semantically valid queries in the XQuery language
FR2907567B1 (en) * 2006-10-23 2008-12-26 Canon Kk METHOD AND DEVICE FOR GENERATING REFERENCE PATTERNS FROM WRITING LANGUAGE DOCUMENT AND ASSOCIATED ENCODING AND DECODING METHODS AND DEVICES.
JP4771915B2 (en) * 2006-11-15 2011-09-14 京セラミタ株式会社 Apparatus, method, and program for converting HTML text
US7949993B2 (en) * 2006-11-28 2011-05-24 International Business Machines Corporation Method and system for providing a visual context for software development processes
US20080172606A1 (en) * 2006-12-27 2008-07-17 Generate, Inc. System and Method for Related Information Search and Presentation from User Interface Content
US7908260B1 (en) * 2006-12-29 2011-03-15 BrightPlanet Corporation II, Inc. Source editing, internationalization, advanced configuration wizard, and summary page selection for information automation systems
CN101211336B (en) * 2006-12-29 2011-05-04 鸿富锦精密工业(深圳)有限公司 Visualized system and method for generating inquiry file
US8285697B1 (en) 2007-01-23 2012-10-09 Google Inc. Feedback enhanced attribute extraction
WO2008092079A2 (en) 2007-01-25 2008-07-31 Clipmarks Llc System, method and apparatus for selecting content from web sources and posting content to web logs
US7912828B2 (en) * 2007-02-23 2011-03-22 Apple Inc. Pattern searching methods and apparatuses
FR2913274A1 (en) * 2007-03-02 2008-09-05 Canon Kk Structured document i.e. XML document, coding method, involves creating drifted pattern formed by modification of another pattern, and coding data of document for providing code, where code associates third pattern to coded data
US20080222237A1 (en) * 2007-03-06 2008-09-11 Microsoft Corporation Web services mashup component wrappers
US20080222599A1 (en) * 2007-03-07 2008-09-11 Microsoft Corporation Web services mashup designer
US20080235260A1 (en) * 2007-03-23 2008-09-25 International Business Machines Corporation Scalable algorithms for mapping-based xml transformation
US7873640B2 (en) * 2007-03-27 2011-01-18 Adobe Systems Incorporated Semantic analysis documents to rank terms
FR2914759B1 (en) * 2007-04-03 2009-06-05 Canon Kk METHOD AND DEVICE FOR CODING A HIERARCHISED DOCUMENT
US7873902B2 (en) * 2007-04-19 2011-01-18 Microsoft Corporation Transformation of versions of reports
US8719291B2 (en) * 2007-04-24 2014-05-06 Lixto Software Gmbh Information extraction using spatial reasoning on the CSS2 visual box model
WO2008144547A1 (en) * 2007-05-16 2008-11-27 The Generations Network, Inc. User-directed capture of unstructured information from web pages with assignment to data type
US20090288033A1 (en) * 2008-05-15 2009-11-19 The Generations Network, Inc. User-Directed Capture of Unstructured Information from Web Pages with Assignment to Data Type
CA2687484A1 (en) * 2007-05-17 2008-11-27 Fat Free Mobile Inc. Web page transcoding method and system applying queries to plain text
US8762556B2 (en) * 2007-06-13 2014-06-24 Apple Inc. Displaying content on a mobile device
US7895189B2 (en) * 2007-06-28 2011-02-22 International Business Machines Corporation Index exploitation
US8086597B2 (en) * 2007-06-28 2011-12-27 International Business Machines Corporation Between matching
US20090006316A1 (en) * 2007-06-29 2009-01-01 Wenfei Fan Methods and Apparatus for Rewriting Regular XPath Queries on XML Views
EP2019361A1 (en) * 2007-07-26 2009-01-28 Siemens Aktiengesellschaft A method and apparatus for extraction of textual content from hypertext web documents
US8601361B2 (en) * 2007-08-06 2013-12-03 Apple Inc. Automatically populating and/or generating tables using data extracted from files
US8103495B2 (en) * 2007-08-08 2012-01-24 Microsoft Corporation Feature oriented protocol modeling
US20090043736A1 (en) * 2007-08-08 2009-02-12 Wook-Shin Han Efficient tuple extraction from streaming xml data
US20090063533A1 (en) * 2007-08-27 2009-03-05 International Business Machines Corporation Method of supporting multiple extractions and binding order in xml pivot join
US20090064337A1 (en) * 2007-09-05 2009-03-05 Shih-Wei Chien Method and apparatus for preventing web page attacks
US20090172517A1 (en) * 2007-12-27 2009-07-02 Kalicharan Bhagavathi P Document parsing method and system using web-based GUI software
US7996444B2 (en) * 2008-02-18 2011-08-09 International Business Machines Corporation Creation of pre-filters for more efficient X-path processing
JP4613214B2 (en) * 2008-02-26 2011-01-12 日立オートモティブシステムズ株式会社 Software automatic configuration device
JP2009223485A (en) * 2008-03-14 2009-10-01 Brother Ind Ltd Link tree creation program and creation device
US7912927B2 (en) * 2008-03-26 2011-03-22 Microsoft Corporation Wait for ready state
CN101546309B (en) * 2008-03-26 2012-07-04 国际商业机器公司 Method and equipment for constructing indexes to resource content in computer network
US8196118B2 (en) * 2008-03-27 2012-06-05 Microsoft Corporation Event set recording
JP2009265279A (en) 2008-04-23 2009-11-12 Sony Ericsson Mobilecommunications Japan Inc Voice synthesizer, voice synthetic method, voice synthetic program, personal digital assistant, and voice synthetic system
US20090271388A1 (en) * 2008-04-23 2009-10-29 Yahoo! Inc. Annotations of third party content
FR2930660A1 (en) * 2008-04-25 2009-10-30 Canon Kk METHOD FOR ACCESSING A PART OR MODIFYING A PART OF A BINARY XML DOCUMENT, ASSOCIATED DEVICES
US8180758B1 (en) * 2008-05-09 2012-05-15 Amazon Technologies, Inc. Data management system utilizing predicate logic
US8311806B2 (en) * 2008-06-06 2012-11-13 Apple Inc. Data detection in a sequence of tokens using decision tree reductions
US8738360B2 (en) 2008-06-06 2014-05-27 Apple Inc. Data detection of a character sequence having multiple possible data types
US9582292B2 (en) 2008-10-07 2017-02-28 Microsoft Technology Licensing, Llc. Merged tree-view UI objects
US8117343B2 (en) * 2008-10-28 2012-02-14 Hewlett-Packard Development Company, L.P. Landmark chunking of landmarkless regions
KR101574603B1 (en) 2008-10-31 2015-12-04 삼성전자주식회사 A method for conditional processing and an apparatus thereof
US20100114902A1 (en) * 2008-11-04 2010-05-06 Brigham Young University Hidden-web table interpretation, conceptulization and semantic annotation
US8489388B2 (en) 2008-11-10 2013-07-16 Apple Inc. Data detection
US8805861B2 (en) * 2008-12-09 2014-08-12 Google Inc. Methods and systems to train models to extract and integrate information from data sources
US20100169311A1 (en) * 2008-12-30 2010-07-01 Ashwin Tengli Approaches for the unsupervised creation of structural templates for electronic documents
US8225277B2 (en) * 2009-01-31 2012-07-17 Ted J. Biggerstaff Non-localized constraints for automated program generation
US8327321B2 (en) * 2009-01-31 2012-12-04 Ted J. Biggerstaff Synthetic partitioning for imposing implementation design patterns onto logical architectures of computations
US20100198770A1 (en) * 2009-02-03 2010-08-05 Yahoo!, Inc., a Delaware corporation Identifying previously annotated web page information
US20100199165A1 (en) * 2009-02-03 2010-08-05 Yahoo!, Inc., a Delaware corporation Updating wrapper annotations
US9344401B2 (en) * 2009-02-04 2016-05-17 Citrix Systems, Inc. Methods and systems for providing translations of data retrieved from a storage system in a cloud computing environment
US8719308B2 (en) * 2009-02-16 2014-05-06 Business Objects, S.A. Method and system to process unstructured data
US20100223214A1 (en) * 2009-02-27 2010-09-02 Kirpal Alok S Automatic extraction using machine learning based robust structural extractors
US8001273B2 (en) * 2009-03-16 2011-08-16 Hewlett-Packard Development Company, L.P. Parallel processing of input data to locate landmarks for chunks
US7979491B2 (en) * 2009-03-27 2011-07-12 Hewlett-Packard Development Company, L.P. Producing chunks from input data using a plurality of processing elements
US8311330B2 (en) * 2009-04-06 2012-11-13 Accenture Global Services Limited Method for the logical segmentation of contents
TWI385537B (en) * 2009-05-04 2013-02-11 Univ Nat Taiwan Assisting method and apparatus for accessing markup language document
US8332763B2 (en) * 2009-06-09 2012-12-11 Microsoft Corporation Aggregating dynamic visual content
US8600814B2 (en) * 2009-08-30 2013-12-03 Cezary Dubnicki Structured analysis and organization of documents online and related methods
US8631028B1 (en) 2009-10-29 2014-01-14 Primo M. Pettovello XPath query processing improvements
CN102053993B (en) * 2009-11-10 2014-04-09 阿里巴巴集团控股有限公司 Text filtering method and text filtering system
US8666913B2 (en) 2009-11-12 2014-03-04 Connotate, Inc. System and method for using pattern recognition to monitor and maintain status quo
US20120233323A1 (en) * 2009-11-18 2012-09-13 Giuseppe Conte Method and apparatus for use in a communications network
US8271479B2 (en) * 2009-11-23 2012-09-18 International Business Machines Corporation Analyzing XML data
US8468104B1 (en) * 2009-12-02 2013-06-18 Hrl Laboratories, Llc System for anomaly detection
US8683311B2 (en) * 2009-12-11 2014-03-25 Microsoft Corporation Generating structured data objects from unstructured web pages
US20110191381A1 (en) * 2010-01-29 2011-08-04 Microsoft Corporation Interactive System for Extracting Data from a Website
US20110197133A1 (en) * 2010-02-11 2011-08-11 Yahoo! Inc. Methods and apparatuses for identifying and monitoring information in electronic documents over a network
US20120233210A1 (en) * 2011-03-12 2012-09-13 Matthew Thomas Bogosian Storage of Arbitrary Points in N-Space and Retrieval of Subset thereof Based on Criteria Including Maximum Distance to an Arbitrary Reference Point
US20110265058A1 (en) * 2010-04-26 2011-10-27 Microsoft Corporation Embeddable project data
US8479170B2 (en) * 2010-05-12 2013-07-02 Fujitsu Limited Generating software application user-input data through analysis of client-tier source code
US20110307479A1 (en) * 2010-06-10 2011-12-15 Microsoft Corporation Automatic Extraction of Structured Web Content
CA2706743A1 (en) * 2010-06-30 2010-09-08 Ibm Canada Limited - Ibm Canada Limitee Dom based page uniqueness indentification
US9317622B1 (en) * 2010-08-17 2016-04-19 Amazon Technologies, Inc. Methods and systems for fragmenting and recombining content structured language data content to reduce latency of processing and rendering operations
CA2712028C (en) * 2010-08-25 2011-12-20 Ibm Canada Limited - Ibm Canada Limitee Geospatial database integration using business models
WO2012041216A1 (en) * 2010-09-30 2012-04-05 北京联想软件有限公司 Portable electronic device, content publishing method, and prompting method
US9280528B2 (en) * 2010-10-04 2016-03-08 Yahoo! Inc. Method and system for processing and learning rules for extracting information from incoming web pages
US8375031B2 (en) * 2011-02-10 2013-02-12 Tektronix, Inc. Lossless real-time line-rate filtering using PCAP style filters and hardware assisted patricia trees
US9065793B2 (en) 2011-02-24 2015-06-23 Cbs Interactive Inc. Rendering web content using pre-caching
US8788927B2 (en) * 2011-02-24 2014-07-22 Cbs Interactive Inc. System and method for displaying web page content
EP2506157A1 (en) 2011-03-30 2012-10-03 British Telecommunications Public Limited Company Textual analysis system
US8706762B1 (en) 2011-05-16 2014-04-22 Intuit Inc. System and method for automated web site information retrieval scripting using untrained users
US9430583B1 (en) 2011-06-10 2016-08-30 Salesforce.Com, Inc. Extracting a portion of a document, such as a web page
PL395376A1 (en) * 2011-06-22 2013-01-07 Google Inc. Rendering approximate webpage screenshot client-side
CN102262658B (en) * 2011-07-13 2013-10-16 东北大学 Method for extracting web data from bottom to top based on entity
CN103139260B (en) * 2011-11-30 2015-09-30 国际商业机器公司 For reusing the method and system of HTML content
EP2788860A4 (en) * 2011-12-06 2016-07-06 Autograph Inc Consumer self-profiling gui, analysis and rapid information presentation tools
US8762315B2 (en) * 2012-02-07 2014-06-24 Alan A. Yelsey Interactive portal for facilitating the representation and exploration of complexity
US9116947B2 (en) * 2012-03-15 2015-08-25 Hewlett-Packard Development Company, L.P. Data-record pattern searching
US9639575B2 (en) * 2012-03-30 2017-05-02 Khalifa University Of Science, Technology And Research Method and system for processing data queries
US8589837B1 (en) * 2012-04-25 2013-11-19 International Business Machines Corporation Constructing inductive counterexamples in a multi-algorithm verification framework
US9753926B2 (en) 2012-04-30 2017-09-05 Salesforce.Com, Inc. Extracting a portion of a document, such as a web page
US8578311B1 (en) 2012-05-09 2013-11-05 International Business Machines Corporation Method and system for optimal diameter bounding of designs with complex feed-forward components
US9152619B2 (en) * 2012-05-21 2015-10-06 Google Inc. System and method for constructing markup language templates and input data structure specifications
US10803233B2 (en) * 2012-05-31 2020-10-13 Conduent Business Services Llc Method and system of extracting structured data from a document
US9110852B1 (en) * 2012-07-20 2015-08-18 Google Inc. Methods and systems for extracting information from text
CN103885972B (en) * 2012-12-20 2017-02-08 北大方正集团有限公司 Method and device for document content structuring
CN103902535B (en) * 2012-12-24 2019-02-22 腾讯科技(深圳)有限公司 Obtain the method, apparatus and system of associational word
US9075619B2 (en) * 2013-01-15 2015-07-07 Nuance Corporation, Inc. Method and apparatus for supporting multi-modal dialog applications
US10545932B2 (en) * 2013-02-07 2020-01-28 Qatar Foundation Methods and systems for data cleaning
US9215133B2 (en) * 2013-02-20 2015-12-15 Tekelec, Inc. Methods, systems, and computer readable media for detecting orphan Sy or Rx sessions using audit messages with fake parameter values
US9582494B2 (en) 2013-02-22 2017-02-28 Altilia S.R.L. Object extraction from presentation-oriented documents using a semantic and spatial approach
KR102074734B1 (en) * 2013-02-28 2020-03-02 삼성전자주식회사 Method and apparatus for pattern discoverty in sequence data
DE102013204245A1 (en) * 2013-03-12 2014-09-18 Bayerische Motoren Werke Aktiengesellschaft Method and apparatus for providing extracted data
US10445415B1 (en) * 2013-03-14 2019-10-15 Ca, Inc. Graphical system for creating text classifier to match text in a document by combining existing classifiers
US9448979B2 (en) * 2013-04-10 2016-09-20 International Business Machines Corporation Managing a display of results of a keyword search on a web page by modifying attributes of DOM tree structure
US9501569B2 (en) 2013-04-23 2016-11-22 Microsoft Technology Licensing, Llc Automatic taxonomy construction from keywords
US10817662B2 (en) * 2013-05-21 2020-10-27 Kim Technologies Limited Expert system for automation, data collection, validation and managed storage without programming and without deployment
US10803232B2 (en) * 2013-06-06 2020-10-13 International Business Machines Corporation Optimizing loading of web page based on aggregated user preferences for web page elements of web page
US20150058390A1 (en) * 2013-08-20 2015-02-26 Matthew Thomas Bogosian Storage of Arbitrary Points in N-Space and Retrieval of Subset Thereof Based on a Determinate Distance Interval from an Arbitrary Reference Point
US10140627B2 (en) * 2014-03-26 2018-11-27 Excalibur Ip, Llc Xpath related and other techniques for use in native advertisement placement
US9600596B2 (en) * 2014-04-08 2017-03-21 Sap Se Parser wrapper class
US10333979B1 (en) * 2014-06-10 2019-06-25 Amazon Technologies, Inc. Multi-tenant network data validation service
US10262077B2 (en) * 2014-06-27 2019-04-16 Intel Corporation Systems and methods for pattern matching and relationship discovery
US10867273B2 (en) 2014-09-26 2020-12-15 Oracle International Corporation Interface for expanding logical combinations based on relative placement
US10419295B1 (en) * 2014-10-03 2019-09-17 Amdocs Development Limited System, method, and computer program for automatically generating communication device metadata definitions
WO2016089346A1 (en) * 2014-12-01 2016-06-09 Hewlett Packard Enterprise Development Lp Statuses of exit criteria
US10425341B2 (en) 2015-01-23 2019-09-24 Ebay Inc. Processing high volume network data
CN107431664B (en) 2015-01-23 2021-03-12 电子湾有限公司 Message transmission system and method
US9658938B2 (en) * 2015-03-30 2017-05-23 Fujitsu Limited Iterative test generation based on data source analysis
US9952916B2 (en) * 2015-04-10 2018-04-24 Microsoft Technology Licensing, Llc Event processing system paging
CA2988369C (en) * 2015-06-09 2019-07-16 Nissan Motor Co., Ltd. Solid oxide fuel cell
US10180932B2 (en) 2015-06-30 2019-01-15 Datawatch Corporation Systems and methods for automatically creating tables using auto-generated templates
WO2017007775A2 (en) * 2015-07-06 2017-01-12 Abbott Diabetes Care Inc. Systems, devices, and methods for episode detection and evaluation
US10386985B2 (en) * 2015-07-14 2019-08-20 International Business Machines Corporation User interface pattern mapping
US9886250B2 (en) 2016-01-26 2018-02-06 International Business Machines Corporation Translation of a visual representation into an executable information extraction program
US10423675B2 (en) * 2016-01-29 2019-09-24 Intuit Inc. System and method for automated domain-extensible web scraping
US10277637B2 (en) 2016-02-12 2019-04-30 Oracle International Corporation Methods, systems, and computer readable media for clearing diameter session information
US10042846B2 (en) * 2016-04-28 2018-08-07 International Business Machines Corporation Cross-lingual information extraction program
EP3516536A4 (en) 2016-09-19 2020-05-13 Kim Technologies Limited Actively adapted knowledge base, content calibration, and content recognition
US10621271B2 (en) 2017-05-25 2020-04-14 Microsoft Technology Licensing, Llc Reordering a multi-level layout using a hierarchical tree
US10462010B2 (en) * 2017-06-13 2019-10-29 Cisco Technology, Inc. Detecting and managing recurring patterns in device and service configuration data
GB201711315D0 (en) * 2017-07-13 2017-08-30 Univ Oxford Innovation Ltd Method for automatically generating a wrapper for extracting web data, and a computer system
US10922366B2 (en) 2018-03-27 2021-02-16 International Business Machines Corporation Self-adaptive web crawling and text extraction
CN110851678B (en) * 2018-07-24 2024-02-02 京东科技控股股份有限公司 Method and device for crawling data
TWI682287B (en) * 2018-10-25 2020-01-11 財團法人資訊工業策進會 Knowledge graph generating apparatus, method, and computer program product thereof
US11003442B2 (en) 2019-05-14 2021-05-11 Fujitsu Limited Application programming interface documentation annotation
US11645513B2 (en) * 2019-07-03 2023-05-09 International Business Machines Corporation Unary relation extraction using distant supervision
US10614382B1 (en) * 2019-07-12 2020-04-07 Capital One Services, Llc Computer-based systems and methods configured to utilize automating deployment of predictive models for machine learning tasks
US11567939B2 (en) * 2019-12-26 2023-01-31 Snowflake Inc. Lazy reassembling of semi-structured data
US11308090B2 (en) 2019-12-26 2022-04-19 Snowflake Inc. Pruning index to support semi-structured data types
CN112347736A (en) * 2020-12-01 2021-02-09 中国商用飞机有限责任公司 Method for converting structured data files relating to civil aviation regulations into word documents
US11638134B2 (en) 2021-07-02 2023-04-25 Oracle International Corporation Methods, systems, and computer readable media for resource cleanup in communications networks
US11709725B1 (en) 2022-01-19 2023-07-25 Oracle International Corporation Methods, systems, and computer readable media for health checking involving common application programming interface framework

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6081804A (en) * 1994-03-09 2000-06-27 Novell, Inc. Method and apparatus for performing rapid and multi-dimensional word searches
US5860071A (en) 1997-02-07 1999-01-12 At&T Corp Querying and navigating changes in web repositories
US5966516A (en) * 1996-05-17 1999-10-12 Lucent Technologies Inc. Apparatus for defining properties in finite-state machines
US5913214A (en) 1996-05-30 1999-06-15 Massachusetts Inst Technology Data extraction from world wide web pages
US6085186A (en) 1996-09-20 2000-07-04 Netbot, Inc. Method and system using information written in a wrapper description language to execute query on a network
US5826258A (en) 1996-10-02 1998-10-20 Junglee Corporation Method and apparatus for structuring the querying and interpretation of semistructured information
US5841895A (en) 1996-10-25 1998-11-24 Pricewaterhousecoopers, Llp Method for learning local syntactic relationships for use in example-based information-extraction-pattern learning
US5983268A (en) 1997-01-14 1999-11-09 Netmind Technologies, Inc. Spreadsheet user-interface for an internet-document change-detection tool
US5898836A (en) 1997-01-14 1999-04-27 Netmind Services, Inc. Change-detection tool indicating degree and location of change of internet documents by comparison of cyclic-redundancy-check(CRC) signatures
US6128655A (en) 1998-07-10 2000-10-03 International Business Machines Corporation Distribution mechanism for filtering, formatting and reuse of web based content
US20020169771A1 (en) * 2001-05-09 2002-11-14 Melmon Kenneth L. System & method for facilitating knowledge management

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
No Search *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010017159A1 (en) * 2008-08-05 2010-02-11 Beliefnetworks, Inc. Systems and methods for concept mapping
US8412646B2 (en) 2008-10-03 2013-04-02 Benefitfocus.Com, Inc. Systems and methods for automatic creation of agent-based systems
WO2010043212A3 (en) * 2008-10-16 2010-08-19 Newbase Gmbh Data organization and evaluation method
US8572760B2 (en) 2010-08-10 2013-10-29 Benefitfocus.Com, Inc. Systems and methods for secure agent information
CN110580174A (en) * 2018-06-11 2019-12-17 中国移动通信集团浙江有限公司 Application component generation method, server and terminal
CN110580174B (en) * 2018-06-11 2022-07-01 中国移动通信集团浙江有限公司 Application component generation method, server and terminal
CN110276039A (en) * 2019-06-27 2019-09-24 北京金山安全软件有限公司 Page element path generation method and device and electronic equipment

Also Published As

Publication number Publication date
US7581170B2 (en) 2009-08-25
US20050022115A1 (en) 2005-01-27
EP1430420A2 (en) 2004-06-23
WO2002097667A8 (en) 2004-04-15

Similar Documents

Publication Publication Date Title
WO2002097667A2 (en) Visual and interactive wrapper generation, automated information extraction from web pages, and translation into xml
Arocena et al. WebOQL: Restructuring documents, databases, and webs
Hogue et al. Thresher: automating the unwrapping of semantic content from the world wide web
Laender et al. DEByE–data extraction by example
JP4264118B2 (en) How to configure information from different sources on the network
Hammer et al. Semistructured data: The TSIMMIS experience
US6889223B2 (en) Apparatus, method, and program for retrieving structured documents
US6604099B1 (en) Majority schema in semi-structured data
US6094649A (en) Keyword searches of structured databases
Ceri et al. XML: Current developments and future challenges for the database community
Hogue Tree pattern inference and matching for wrapper induction on the World Wide Web
Álvarez et al. Finding and extracting data records from web pages
Baumgartner et al. The elog web extraction language
Liu et al. An XML-based wrapper generator for Web information extraction
Myllymaki et al. Robust web data extraction with xml path expressions
Liu et al. An XML-enabled data extraction toolkit for web sources
Gregg et al. Adaptive web information extraction
JP3914081B2 (en) Access authority setting method and structured document management system
CN1326078C (en) Forming method for package device
JP3842576B2 (en) Structured document editing method and structured document editing system
Liu et al. Towards building logical views of websites
May An integrated architecture for exploring, wrapping, mediating and restructuring information from the web
Zhang et al. Odaies: ontology-driven adaptive Web information extraction system
Škrbić et al. Bibliographic records editor in XML native environment
Bhowmick et al. Representation of web data in a web warehouse

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
WWE WIPO information: entry into national phase

Ref document number: 2002755419

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

D17 Declaration under article 17(2)a
WWE WIPO information: entry into national phase

Ref document number: 10479039

Country of ref document: US

WWP WIPO information: published in national office

Ref document number: 2002755419

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWW WIPO information: withdrawn in national office

Country of ref document: JP