WO2001027712A2 - A method and system for automatically structuring content from universal marked-up documents - Google Patents


Info

Publication number
WO2001027712A2
Authority
WO
WIPO (PCT)
Prior art keywords
offers, information, offer, database, pages
Prior art date
Application number
PCT/IL2000/000648
Other languages
French (fr)
Other versions
WO2001027712A3 (en)
Inventor
Alon Fishman
Ari Enoshi
Udi Ran
Original Assignee
The Shopper Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Shopper Inc.
Priority to AU79416/00A
Publication of WO2001027712A2
Publication of WO2001027712A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/221 Parsing markup language streams

Definitions

  • the present invention relates to a system and method for automatically extracting, processing and structuring dynamic content from universal marked-up documents.
  • universal marked-up documents refer to any type of document that has marked up properties, such as HTML, XML, Microsoft Word, PDF, WML, VML or any other current or future mark-up languages or document types.
  • Mark-up refers to the sequence of characters or other symbols (called tags) that may be inserted at certain places in a file to delimit and describe document subcomponents as information objects. They provide metadata about the document's content that may be used for further processing of the document, such as displaying it, printing it, etc. It is also used to describe the document's logical structure.
  • the use of the word "automatically” implies the ability to structure information from such documents without having had previous exposure to a particular style, type or example of such a document.
  • General Internet search engines have historically tried to access and organize the great abundance of information available and to bring it to Internet users. They work on a very large scale and cover a huge number of web pages. They usually index the pages for all words occurring in the page title, keywords or full-text content. This indexing is usually done on a separate keyword basis, without any effort to understand the context or the grammar of the page. The only connection between different keywords in the same page is made in retrospect, when a user performs a Boolean search operation. There is no content understanding or structuring.
  • An alternative solution for automatically aggregating content is content robots (for some content sources) that analyze relevant site information and make it easily accessible. These robots run on sites that generate their pages using automated tools (usually database-driven) that give a uniform structure to their pages (such as www.amazon.com). Due to this uniformity within a site (or parts of it), content robots can be created and programmed (for some of the sites) to analyze relevant site information and structure it. However, as different sites have different structures, this is a site-specific adjustment task. Into this category of solutions fall the various Internet comparative shopping engines, often based on scrapers (software tools that are programmed to extract data from specific page formats).
  • the aggregator may receive the information in a variety of ways. For example, the aggregator might write a specific software interface to the content producer's information systems (e.g., to its database). An alternative might be that the aggregator receives periodic feeds in some agreed upon format from the content provider.
  • the drawback is that the aggregator needs to make specific arrangements (business and technical) with each individual information source. This requires much time and technical effort for every single source of information.
  • In a virtual database (VDB) approach, a central Web shop signs contracts with some dozens of affiliate retail sites.
  • the central shop writes special "software agents" to be able to extract and organize data from the affiliate's computerized catalog.
  • the central shop issues a query to some of the distributed agents.
  • Each agent searches the local catalog for the product. If the affiliate site has an offer for the product, the agent sends a price quote to the central shop. After a definitive time (usually 30 seconds), the central shop collects all the offers it has received and presents them to the consumer. The consumer can choose to buy according to one of these offers either through the central shop or through the affiliate shop.
  • A further requirement is scalability, which is the ability to cover an immense and constantly growing number of data sources quickly and cheaply. To achieve such scalability, the technology must operate without site-specific business or technical agreements or adjustments. Also required is the ability to economically maintain and list updated offers from a multitude of data sources.
  • There is a need for a system or method for enabling automatic aggregation and structuring of information on a global and local scale. This need is all the more acute when facing the emerging world of mobile Internet appliances, as experts predict that wireless consumers will be highly sensitive to data relevancy issues (and much less tolerant than wired Internet users).
  • a structured search methodology that fits a wide variety of content and can be used by a variety of devices is therefore a major breakthrough in the consumer content search tools niche.
  • the present invention provides such a method and system for enabling automatic structuring of information from universal marked-up documents.
  • This invention can monitor, analyze, aggregate, compare and present information from varied, dynamic sources, in an organized, structured form.
  • the present invention is driven by the trends discussed above, and it provides high quality aggregated content, i.e., multi-source, relevant, structured, and updated information, regarding topics that are in the scope of interest of users.
  • the present system and method (unlike existing systems, methods and technologies) is able to satisfy these important requirements.
  • the present invention further enables various applications for structuring information, based on its page processing technology. These include a local marketplace enabler, comparative shopping engine and site, geographic based searching, services comparisons and automatic aggregation of many types of content from Internet, Intranet, Extranet or any other network-based sources.
  • the present invention is of a data processing system and method for automatic, knowledge-based processing and extracting of structured information from universal marked-up documents (documents that contain structural, presentational, and semantic information (e.g. tags) alongside content, such as SGML, XML, HTML, and Microsoft Word documents).
  • A- The present invention consists of:
  • a back-end system including communications means and a mark-up page processing algorithm
  • a database system for storing structured data and processing requests from said back-end system.
  • An optional front-end system for enabling user or third party interaction with said database.
  • the back-end system automatically processes content from universal marked-up documents, independent of prior knowledge of content structure or type for particular sites.
  • the back-end system comprises a page processing algorithm for automatically processing content from universal marked-up documents.
  • D- The back-end system processes documents from any network-based source via any computerized communications means. This includes data found in any type of computerized information system, where the system is located on a network.
  • E- Marked-up information sources include content existing in data formats such as SGML, HTML, XML, Microsoft Word, PDF, WML, VML, RTF, XHTML, SMIL and HDML.
  • F- User interaction is executed using interactive devices such as PCs, cellular phones, pagers, handheld PCs, pocket PCs, mobile computers, interactive TVs, etc.
  • G- User interaction is executed with user interfaces such as graphic user interfaces, text based interfaces, voice-based interfaces, keyboards & pointing devices and any combination of these.
  • H- Information offers processed by the system include product offers, such as consumer goods, auctions, classifieds, bartering, wholesale goods and B2B offers, and service offers, such as professional services, job offers, real estate offers, events, classifieds and job finding tools.
  • I- Offers may be presented according to geographic preferences.
  • the front-end system may be an e-commerce web site for comparison-shopping for products and services.
  • the comparison shopping function is a geographically enabled localized shopping application, such that users can research product offers according to geographic preferences and/or online preferences.
  • L- The present invention further comprises a method for automatically structuring network-based content, according to the following steps: i. Finding information pages for information offers; ii. Retrieving relevant content from said information pages; iii. Processing retrieved pages in order to identify information offers; and iv. Aggregating said information offers in a central database.
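The four steps of this method can be sketched as a simple pipeline. Everything below is illustrative, not taken from the patent: the function names, the stand-in `fetch` and `identify_offers` callables, and the toy "$ means offer" heuristic are all assumptions made for the sketch.

```python
def structure_network_content(sources, fetch, identify_offers):
    """Sketch of the four-step method: find pages, retrieve content,
    identify offers, and aggregate them in a central database."""
    central_database = []
    for source in sources:                  # i. find information pages
        page = fetch(source)                # ii. retrieve relevant content
        offers = identify_offers(page)      # iii. process to identify offers
        central_database.extend(offers)     # iv. aggregate in central database
    return central_database

# Toy usage with hypothetical stand-in functions:
pages = {"siteA": "Sony SLV-D380 VCR $99", "siteB": "no offers here"}
aggregated = structure_network_content(
    pages,
    fetch=pages.get,
    identify_offers=lambda text: [text] if "$" in text else [],
)
```

In the real system, `identify_offers` corresponds to the knowledge-base-driven page processor described later; here it is reduced to a trivial predicate.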
  • M- The method further comprises interfacing the central database in a front-end system for responding to user queries.
  • the front-end system may be one or more Web, application or other servers for running an interactive web site.
  • User queries may be geographic location based queries, and may be for the purpose of comparative shopping.
  • Shopping may be researched at online or offline stores.
  • O- Processing retrieved pages further comprises the execution of a generalized algorithm for web page processing, according to the following steps: i. Pre-processing, for filtering out all Web pages that are not relevant; ii. Web page processing, for parsing Web pages to build legitimate product offer records; and iii. Post-processing, for enriching the knowledge base.
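The three-phase algorithm can be sketched as follows; the function names, the dictionary-based knowledge base, and the toy parsing callables are assumptions made for illustration only:

```python
def process_retrieved_pages(pages, is_relevant, parse_offers, knowledge_base):
    """Sketch of the generalized three-phase algorithm: pre-process
    (filter out irrelevant pages), process (parse pages into offer
    records), post-process (enrich the knowledge base)."""
    relevant = [p for p in pages if is_relevant(p)]           # i. pre-processing
    records = [r for p in relevant for r in parse_offers(p)]  # ii. processing
    for record in records:                                    # iii. post-processing
        for attr, value in record.items():
            knowledge_base.setdefault(attr, set()).add(value)
    return records

# Toy usage with hypothetical stand-ins:
kb = {}
records = process_retrieved_pages(
    ["Sony SLV-D380 $99", "about us"],
    is_relevant=lambda p: "$" in p,
    parse_offers=lambda p: [{"brand": p.split()[0], "price": p.split()[-1]}],
    knowledge_base=kb,
)
```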
  • P- Web page processing includes the following operations: i. Updating the site's identified offers in the offers database, including saving historical information about offers in the history database; ii. Updating the site's information in the merchants database; iii. Adjusting the site's next revisit time in the site revisit queue, based on the amount of change in the site's processed data.
  • the present invention includes a method for structuring information from universal marked-up documents, comprising the execution of a page processing algorithm, according to the following steps: i. Receiving information documents; ii. Scanning said documents for offers; iii. Parsing said documents into a parsing tree; iv. Running an attribute identification program to find candidates for offer components; and v. Running a structure identification program in order to find structures in documents.
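Steps iii and iv — walking the document's parse structure and finding candidate offer components via known grammatical templates — can be sketched with the standard-library HTML parser. The class name, the use of regexes to stand in for grammatical templates, and the sample template set are all assumptions of this sketch, not the patent's implementation:

```python
from html.parser import HTMLParser
import re

class AttributeScanner(HTMLParser):
    """Sketch: walk the page's parse events and collect text fragments
    that match known grammatical templates (regexes stand in here)."""
    def __init__(self, templates):
        super().__init__()
        self.templates = templates   # attribute name -> compiled regex
        self.candidates = []         # (attribute, matched text) pairs
    def handle_data(self, data):
        for attr, rx in self.templates.items():
            for match in rx.finditer(data):
                self.candidates.append((attr, match.group()))

# Hypothetical templates for a VCR-like category:
templates = {
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "brand": re.compile(r"\b(?:Sony|Panasonic)\b"),
}
scanner = AttributeScanner(templates)
scanner.feed("<tr><td>Sony VCR</td><td>$99.99</td></tr>")
```

A subsequent structure-identification pass (step v) would then look for repeating arrangements of such candidates across the parse tree.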
  • the core technology of the data processing system and method is an offer processing system and method.
  • the offer processing system collects, stores, processes, retrieves, and presents offers. It automatically aggregates offers from a very large number of information sources, such as Web merchants and service providers (potentially, most of such sites on the Web).
  • Figure 1 is a block diagram of the main blocks of the present invention.
  • Figure 2 is a block diagram decomposition of the offer processing module.
  • Figure 3 is a block diagram decomposition of the lead processor module.
  • Figure 4 is a block diagram depicting the site classification module.
  • Figure 5 is a block diagram decomposition of the offer processor module.
  • Figure 6 is a block diagram decomposition of the page processor module.
  • Figure 7 is a part of an exemplary web page from a site that offers wines.
  • Figure 8 is a graphical representation of the results of processing the exemplary Web page.
  • Figure 9 is a graphical representation of part of the parsing tree of the exemplary web page.
  • Figure 10 is a block diagram decomposition of the database system/module.
  • Figure 11 is an exemplary screen shot of a geographic-based comparative pricing search.

DESCRIPTION OF THE PREFERRED EMBODIMENT
  • the present invention is of a system and method for automatic, knowledge-based processing and structuring of information from marked-up documents.
  • universal marked-up documents refer to any type of document that has marked-up properties, such as SGML, HTML, XML, Microsoft Word, PDF, WML, VML, RTF, XHTML, SMIL and HDML, or any other current or future mark-up languages or document types.
  • Mark-up refers to the sequence of characters or other symbols that may be inserted at certain places in a file, such as tags, to indicate how the file should perform when it is printed, displayed or otherwise used or processed. It is also used to describe the document's logical structure.
  • the core technology of the present invention is a back-end system, which processes data sources and stores the results of the processing in a database.
  • the components of this back-end system include:
  • An additional, optional element of the present invention is a front-end system, which may or may not be utilized in any particular application of the system.
  • This is an Information Presentation system for presenting the information elements to users in a structured form, or alternatively it may be used for data retrieval for further processing, such as by 3rd party's secondary servers.
  • the present invention provides a system and method for extracting information elements, referred to hereinafter as information offers, from information sources.
  • This Information processing system retrieves, processes, and stores offers. Offers may be defined as any relevant information elements from a document that can be grouped together for the purpose of describing an item or service.
  • the system may define a real estate offer, and may attach information elements such as type, location, price, size, features, owners etc.
  • the core, back-end technology of the present invention provides for the means to create a database where processed information offers are stored
  • the back-end system is self-standing and can be operated independently as a supplier of information offers to third parties.
  • the front-end technology of the present invention provides various means to present the information elements in response to client queries, or to prepare these elements for further processing by a third party.
  • the front-end system is not a necessary component of the present invention.
  • the present invention is a system and method for processing and structuring information offers retrieved by the system from Internet, intranet, extranet and other network-based marked-up pages. Every information offer pertains to a certain, pre-defined, offer category (e.g., the VCR category of all VCRs).
  • the formal system definition for an offer is an information element uniquely identified by specific values for a set of attributes.
  • An attribute is a feature of the element, which is being offered. That element may be a product, a service etc.
  • An offer category is the set of all the elements that share the same set of attributes. Every offer category has its own set of offer category attributes (e.g., the VCR category attributes may include the brand that makes the VCR, the model of the VCR, the number of heads it has, etc).
  • Associated with every offer category attribute there is a list of known grammatical templates and values. These grammatical templates and values provide the set of possible values that the attribute may assume.
  • For a certain offer category there are subsets (one or more) of its offer category attributes which, collectively, uniquely identify an offer in that category (each subset is used separately for the identification). For example, a brand and a model uniquely identify a certain VCR within the VCR offer category. An offer category attribute that belongs to one of these subsets is called an (offer category) key attribute. Such a subset is called a key attribute subset.
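The notion of key attribute subsets can be illustrated with a small data structure. The category definition and attribute names below are examples invented for this sketch:

```python
# Illustrative offer-category definition (attribute names are examples).
VCR_CATEGORY = {
    "attributes": {"brand", "model", "heads", "price", "color"},
    # each subset, on its own, uniquely identifies an offer:
    "key_subsets": [("brand", "model")],
}

def offer_key(offer, category):
    """Return the values of the first key-attribute subset fully present
    in the offer, or None if no subset is satisfied (no valid offer)."""
    for subset in category["key_subsets"]:
        if all(attr in offer for attr in subset):
            return tuple(offer[attr] for attr in subset)
    return None

key = offer_key({"brand": "Sony", "model": "SLV-D380", "price": 99}, VCR_CATEGORY)
```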
  • the other offer category attributes may be either identification attributes or search attributes. Identification attributes are used to further discriminate between offers which were found using the same key attribute subset but differ in other important attributes. For example, a computer may be identified by its brand and model, yet two offers for computers of the same brand and model may differ by the amount of RAM the computers have.
  • a search attribute is an offer attribute, which is not required in order to generate an offer record, but still contains additional information about the offer.
  • the color of the above mentioned computer might be a search attribute.
  • Both the Identification attributes and the Search attributes are found by the page processor component of the present invention, but are used by the offer presentation system for the benefit of its users. Both attributes might have a default value in the knowledge base of the present invention, which is used if a value was not found in the offer itself.
  • the Offer processing system is based on a generic algorithm - the page processor algorithm, which processes documents and identifies offers.
  • the page processor behavior is defined by the system knowledge base.
  • the system's knowledge base contains the data about offer categories that the system has knowledge of (the known offer categories), the attributes, associated grammatical templates and values, key attributes, key attribute subsets of each known offer category, etc.
  • the knowledge base for a category is prepared using proprietary definitions, based on domain-specific expertise. Before starting system identification of a new offer category, a system editor inserts the known grammatical templates for the new category. Since the system's learning (see description below) capability enlarges the known grammatical templates automatically, during system operation, it is sufficient to prepare an initial (non-comprehensive) set of known grammatical templates. This minimizes necessary work when preparing for a new offer category, and also significantly helps lower maintenance costs.
  • the knowledge base is stored in the database (specifically, the offer categories database).
  • the offer processing system of the present invention scans selected information sources and retrieves pages from them. It then processes these pages, using the system's knowledge base, in order to identify offers that belong to one of the known offer categories.
  • An offer is identified by searching for known grammatical templates that match key attributes. Each identified offer is stored in an offer database.
  • the offer processing system regularly performs the scan, retrieve and identify procedure in order to find new offers as well as update the already found offers.
  • the offer processing system applies a sophisticated learning algorithm in order to enrich its known template repertoire. This feature increases the number of identified offers. It also allows the system to start identifying offers using a smaller knowledge base, enlarging it over time.
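The patent does not detail its learning algorithm in this passage, so the following is only a hedged sketch of the general idea it describes: values observed in confidently identified offers are folded back into the knowledge base, enlarging the template repertoire over time. All names here are illustrative:

```python
def enrich_known_values(identified_offers, known_values):
    """Hedged sketch of knowledge-base enrichment: attribute values seen
    in identified offers are added to the known-value lists, so later
    pages can be matched against a larger repertoire. (The patent's
    actual learning algorithm is not disclosed in this passage.)"""
    for offer in identified_offers:
        for attr, value in offer.items():
            known_values.setdefault(attr, set()).add(value)
    return known_values

# Starting from a small knowledge base, one identified offer enlarges it:
kv = {"brand": {"Sony"}}
enrich_known_values([{"brand": "Panasonic", "model": "PV-V4020"}], kv)
```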
  • the offer processing system assumes a partial fulfillment requirement, i.e., it is a heuristic system that may miss offers and may also exhibit a certain number of singular errors. However, the system still provides substantial value and business benefits by its information structuring.
  • the offer processing system as described has many generic aspects in its operation, which can be adjusted and tuned according to the specific application being applied.
  • the present invention which enables automatic, knowledge-based data processing and structuring of information from marked-up documents, includes an innovative offer processing system. The principles and operations of the present invention may be better understood with reference to the attached drawings, and the accompanying descriptions, wherein:
  • Figure 1 is a block diagram containing the main blocks of the present invention, referred to as the offers processing system.
  • This system has 3 parts: a back-end (the offer processing 1000), a front-end (offer presentation 2000) and a database system 3000 that the other 2 parts use.
  • the offer presentation 2000 system is not essential, and can be executed in various ways or by various third parties.
  • 3 rd Parties are intended to include any business partners who have access to at least a part of the present invention for the purpose of further processing. This processing may be for presenting to consumers, corporations, and any other users, or alternatively to provide the processing means for any other purposes.
  • the database system 3000 stores the offers and other data that the system needs for its operation.
  • the purpose of the back-end is to build the offer database in the database system 3000. It performs this operation by accessing data sources (such as Internet sites, intranet files etc.), retrieving pages, and processing them.
  • the purpose of the front-end is to allow prospective users to access the offer database and retrieve information from it.
  • the front-end system includes one or more servers that may be accessed by different web-enabled means (e.g., web browsers, cellular phones / PDAs, digital TV, Internet appliances, voice-activated user interfaces, etc.). Secondary servers (servers of 3rd parties) could also access the front-end. In this case, the secondary servers are "powered by" the offer processing system.
  • the offer processing system has a scalable architecture, in that it may be operated across an unlimited number of Web, application, or other servers, according to need.
  • the offer processing 1000 and offer presentation 2000 operate independently. Thus, even in case one of them fails, the other keeps functioning.
  • the figures present the logical decomposition of the system into modules. Those skilled in the art will realize that the functionality of any of the modules can be distributed over a plurality of computers, wherein the databases and processors are housed in separate units or locations. Those skilled in the art will appreciate that an almost unlimited number of processors and / or storage units may be supported. This arrangement yields a scalable, high performance, dynamic, highly available, and flexible system that is secure in the face of catastrophic hardware failures affecting the entire system.
  • FIG. 2 is a block diagram decomposition of the back-end offer processing system 1000.
  • the Leads processor 100 traverses selected information sites that may contain offers. Each site is checked to see if it is likely to contain offers from known offer categories.
  • the Leads processor 100 retrieves data about the known offer categories from the database system 3000. Sites that contain such offers are inserted into the site revisit queue 200.
  • the site revisit queue 200 stores all the sites that probably contain offers. It regularly performs time-based scans of these sites.
  • the Site Revisit Queue is a software means for managing continued interaction with information sources so that the offers are up to date.
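A time-ordered revisit queue of this kind can be sketched with a heap, together with the adjust-revisit-time heuristic mentioned in summary item P (revisit sooner when a site's processed data changed, later when it did not). The class, the halving/doubling rule, and the explicit `now` parameter are assumptions of this sketch:

```python
import heapq

class SiteRevisitQueue:
    """Sketch of a time-ordered revisit queue (illustrative)."""
    def __init__(self):
        self._heap = []
    def schedule(self, next_visit, site):
        heapq.heappush(self._heap, (next_visit, site))
    def due(self, now):
        """Pop and return every site whose revisit time has arrived."""
        sites = []
        while self._heap and self._heap[0][0] <= now:
            sites.append(heapq.heappop(self._heap)[1])
        return sites

def next_interval(previous, changed):
    """Hypothetical heuristic: shorten the interval when the site's
    processed data changed, lengthen it when it did not."""
    return max(1, previous // 2) if changed else previous * 2

queue = SiteRevisitQueue()
queue.schedule(10, "siteA")
queue.schedule(5, "siteB")
due_now = queue.due(now=7)   # only siteB is due by time 7
```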
  • Each scanned site is processed in the offer processor 300.
  • the offer processor 300 extracts offers of known offer categories, as described above.
  • the identified offers are simply stored in the database system 3000.
  • the site's identified offers are updated in the database system 3000.
  • Sites containing offers that have been processed are subsequently stored in the Site revisit queue 200 for future monitoring and re-processing.
  • FIG. 3 is a block diagram illustrating the decomposition of the leads processor 100.
  • the leads processor 100 is the component of the backend system that enlarges the collection of sites that contain relevant information for the database system.
  • the whole back-end system can still operate and keep virtually any number of constant sources up-to-date without activating the leads processor.
  • the leads processor 100 increases the number of sites that it processes by employing automated tools to search the relevant network for potentially valuable sites and pages.
  • the tools, or components, that execute this function from within the leads processor 100 are the following:
  • the manual Leads 110 component that allows a system operator to input Leads manually.
  • External knowledge bases such as the yellow pages, search engines, directories, etc 120, which represent different, focused Web sources for Leads.
  • the Leads processor 100 queries these focused sources to receive result pages with lists of possible relevant lead addresses (i.e., URLs). The exact manner of the querying depends on the type of the source and the interface it supports for user queries (search engines are used in a different manner than yellow pages). It may also depend on the category of offers that are expected to be found in the site. For example, when searching for leads, search engines will be queried using the known grammatical templates and values contained in the system's knowledge base.
  • the Leads processor 100 follows "next" links between the result pages, to retrieve multiple pages.
  • the Leads processor 100 filters all site addresses from the results, to obtain individual leads for its operations.
  • the e-mail registry 130, which is not integral to the invention, is a registry of merchant sites that were received via e-mail. Site owners that would like to be covered by the system send these e-mails.
  • the DNS (Domain Name System) scan 140 is an automatic scan of domain names.
  • the Leads database 3400 ( Figure 10), which is a part of the database system 3000, for policy based, periodical check and classification of leads.
  • the Leads database 3400 also contains all the leads that the system has processed in the past, together with their classification (as described below). The Leads Manager constantly retrieves leads from the Leads database 3400 using a priority policy. For example, manually fed addresses will be of top priority, while "brute force" global DNS searching will be of lower priority. Higher priority is assigned to Leads that have a higher probability of containing offers in known offer categories.
  • the site classification manager 150 is responsible for classifying sites into one of two categories: "interesting" (probably contains offers) or "not interesting" (probably doesn't contain offers). This classification is probabilistic.
  • the general steps the Leads manager 160 performs are the following: 1) Retrieving a lead to be processed (according to priority). 2) Checking against the Leads database whether or not the lead has been processed before. 3) If it has, and it is classified as interesting, it skips it, and goes back to step 1 (retrieving a new lead). 4) Otherwise, classifying the lead using the site classification 150. 5) Storing the lead and its classification in the Leads database 3400. If the lead was classified as interesting, it is passed on to the Site Revisit Queue 200. Otherwise the Leads Manager 160 lowers the lead's priority in the Leads database 3400. This enables the system to handle changes in sites' content, changes in the known offer categories, and avoids frequent scanning of sites with low probability of being interesting.
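Steps 2 through 5 of the Leads Manager loop can be sketched as a single function; the names, the dictionary standing in for the Leads database 3400, and the list standing in for the Site Revisit Queue 200 are illustrative assumptions:

```python
def process_lead(lead, leads_db, classify, revisit_queue):
    """Sketch of steps 2-5 of the Leads Manager loop described above."""
    if leads_db.get(lead) == "interesting":
        return "skipped"                      # step 3: already interesting
    label = classify(lead)                    # step 4: classify the lead
    leads_db[lead] = label                    # step 5: store classification
    if label == "interesting":
        revisit_queue.append(lead)            # pass on to the revisit queue
    return label

# Toy usage with a hypothetical classifier:
db, queue = {}, []
process_lead("http://wines.example", db, lambda url: "interesting", queue)
process_lead("http://dull.example", db, lambda url: "not interesting", queue)
result = process_lead("http://wines.example", db, lambda url: "interesting", queue)
```

Reprocessing an already-interesting lead is skipped, matching step 3, while uninteresting leads remain in the database at lower priority for later rechecking.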
  • the Site Classification is an integral component for determining the relevancy of the content of information sources, as provided by the Leads Processor 100. It is comprised of:
  • a site classifier 151 which manages the classification.
  • a Web Walker 152 which retrieves documents from the information sources.
  • a Reduced page Processor 153 which processes retrieved documents and estimates the probability of existence of offers.
  • the site classifier 151 manages the classification process. It receives a URL of a site, or an address of an information source, for classification, and uses the web walker 152 to fetch Web pages from the site (the web walker is described below). It instructs the web walker 152 to retrieve a pre-defined, configurable number of pages from the site. It also limits its walking depth within the site (see the description of the web walker for more detail). These parameters can be relatively small. For example, a depth of 2-3 usually should be sufficient for good classification. This is because, most likely, relevant information such as offers are accessible by following 2-3 hyper-links. Retrieving a small number of documents, while maintaining good classification results saves system resources and provides an effective balance between missing relevant information sources and using system resources dedicated to offer extraction.
  • the reduced page processor 153 is similar to the page processor 330 (see description below, Figure 5). It has many of the fundamental capabilities and methods of the page processor 330. The difference is the focus of its operation so that it generates a different output (it doesn't extract offers).
  • the page processor 330 identifies partial or complete offers in the document. The results of this process are levels of certainty of the existence of offers in the document and their corresponding identified offer categories.
  • the reduced page processor searches the pages for known keywords and templates, from all offer categories.
  • the reduced page processor counts occurrence frequencies by offer category.
  • High frequency of templates or values from one offer category classifies the page for that category (for example, a page containing words or templates like "chateau XXX", "Champagne", "bottle", "1986", etc. is classified into the offer category "Wines"). If the frequency exceeds a certain value, the page is classified as interesting.
  • the site classifier 151 accumulates the results from the processed documents, uses thresholds and decides on the information source classification.
  • the site classifier 151 stores this classification in the Leads Database 3400.
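The classification flow above (keyword counting per category over a small sample of pages, then thresholding) can be sketched roughly as follows; the keyword table and threshold are illustrative assumptions, not taken from the patent:

```python
# Illustrative sketch of the site-classification heuristic: count keyword
# hits per offer category over a small sample of pages, then classify the
# source against a threshold.
from collections import Counter

# Hypothetical keyword lists per offer category; a real system would load
# these from the offer categories database (3300).
CATEGORY_KEYWORDS = {
    "Wines": {"chateau", "champagne", "bottle", "vintage"},
    "Jobs": {"salary", "position", "experience", "resume"},
}

def classify_pages(pages, threshold=5):
    """Return the offer categories whose keyword frequency, accumulated
    over all sampled pages, meets or exceeds the threshold."""
    counts = Counter()
    for text in pages:
        tokens = {t.strip(".,").lower() for t in text.split()}
        for category, keywords in CATEGORY_KEYWORDS.items():
            counts[category] += len(tokens & keywords)
    return {c for c, n in counts.items() if n >= threshold}

pages = [
    "Chateau Margaux 1986, one bottle of fine champagne, great vintage",
    "Another chateau bottle offer: vintage champagne by the bottle",
]
print(classify_pages(pages, threshold=5))  # {'Wines'}
```

The site classifier would accumulate such per-page results over the sampled documents before deciding on the final source classification.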
  • the web walker 152 locates pages by repeatedly processing web or other pages, extracting links from them, and following these links to other pages. It operates in a similar way to a user that uses a graphical tool built for site navigation (e.g., a browser). The difference is that, compared to a human being, the web walker navigates the site in a much more orderly and exhaustive manner. It retrieves the documents from the site exactly as navigation tools do, by submitting client requests to web servers. Thus, the documents are retrieved without any need for active participation of the information source, apart from its general accessibility to its users.
  • the web walker 152 starts from a certain page, e.g., the home page of some site.
  • the web walker 152 processes the page and finds all the links on the page that link to other pages in the same site.
  • the web walker 152 uses a commercial marked-up document parser. There are various means to specify links between pages, depending on the markup language.
  • the web walker 152 identifies the language and uses appropriate tools and methods to obtain the links. For example, in HTML-based documents the web walker 152 obtains links that are specified using methods (and combinations of such methods) such as:
  • Forms (usually used to query a database-based site that contains offers): It handles both GET and POST form submission.
  • the web walker 152 automatically attempts to create a sufficient form data set that it submits to the form-processing agent. The exact method is determined after analysis of the form. For example, for a form that has lists of options to select from, the web walker 152 may iterate over the options of one selection, while keeping all the other selections constant. This method usually covers all the possible results in the database that the form enables access to.
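As a rough illustration of the form-handling strategy described above (iterating over one selection's options while keeping the other fields constant), assuming a simplified dictionary representation of form fields:

```python
# Hypothetical sketch of building form data sets for submission: vary the
# first multi-option field, holding every other field at its first value.
# Field names and values here are illustrative, not from the patent.
def form_data_sets(fields):
    """fields: dict mapping field name -> list of possible values
    (a single-item list for fixed fields). Returns the list of data sets
    to submit to the form-processing agent."""
    varying = next((n for n, v in fields.items() if len(v) > 1), None)
    base = {n: v[0] for n, v in fields.items()}
    if varying is None:
        return [base]
    return [dict(base, **{varying: val}) for val in fields[varying]]

form = {"category": ["wine", "beer", "spirits"], "sort": ["price"]}
for data in form_data_sets(form):
    print(data)  # each dict would be submitted via GET or POST
```

Iterating one selection at a time, as the text notes, usually covers all the results the form gives access to without an explosion of combinations.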
  • Links that are partially or wholly generated with the aid of client side scripts e.g., java-scripts.
  • the web walker 152 uses commercial tools that can interpret the script, execute it on demand, and support integration with the underlying object model of the parsed document. Using these tools, the web walker 152 is able to handle and obtain links of this type.
  • the web walker 152 only follows links to documents that are later processed by the page processor 330. For example, it follows links to HTML or XML documents, but does not necessarily follow links to GIF or JPEG documents. It has a configurable list of document types with an indication of whether or not to follow links to documents of the corresponding type. Documents whose type is unknown or can't be reasonably assumed are retrieved and processed. Such cases are reported and monitored by a system operator in order to update the system with regard to existing document types.
  • the web walker 152 navigates a site using the well-known A* algorithm.
  • the basic traversal method uses BFS (breadth first search - a special case of the A* algorithm). Consequently, when following relevant links in a document, the web walker follows the links one by one, in the order of their appearance. It retrieves the documents that are directly linked to the processed page (direct descendants that are "one click" away from the processed page) before retrieving documents that are indirectly linked. As it follows the links in the site, the web walker maintains a tree of the documents in the site, their links, and the relations between them. It uses this information to avoid loops. During traversal, the web walker employs heuristics that may result in a non-BFS traversal.
  • the results may span more than one page.
  • These pages are usually linked together using common methods such as a "next" button.
  • the web walker tries to locate these special links (the "next" button). If the web walker decides (using certainty thresholds) that it has identified a special link, it gives it precedence over regular links, thus deviating from the simple BFS traversal method.
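The traversal described above (BFS with loop avoidance, a depth limit, and precedence for "next"-style pagination links) might be sketched as follows; `fetch_links` and `is_next_link` are illustrative stand-ins for the real link-extraction and heuristic components:

```python
# Minimal sketch of the walker's traversal: breadth-first, with a visited
# set for loop avoidance, a configurable depth limit, and a heuristic that
# lets "next"-style links jump the queue (deviating from pure BFS).
from collections import deque

def walk(start_url, fetch_links, max_depth=3, max_pages=50, is_next_link=None):
    """Visit direct descendants ("one click" away) before deeper pages;
    return the retrieval order."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # maximal walking depth reached for this branch
        for link in fetch_links(url):
            if link in visited:
                continue  # loop avoidance
            visited.add(link)
            if is_next_link and is_next_link(link):
                queue.appendleft((link, depth + 1))  # precedence: non-BFS
            else:
                queue.append((link, depth + 1))
    return order

# Toy site graph standing in for real link extraction.
site = {"/": ["/a", "/b"], "/a": ["/a2"], "/b": ["/"], "/a2": []}
print(walk("/", lambda u: site.get(u, []), max_depth=2))
```

With this toy graph the walker retrieves `/`, then its direct descendants `/a` and `/b`, then `/a2`, and skips the back-link from `/b` to `/`.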
  • the web walker 152 uses, for every information source that it handles, several configurable operational parameters:
  • the maximal walking depth: the smallest number of successive links that need to be followed to retrieve some document, relative to the document where the retrieval started.
  • the web walker 152 stops retrieving documents from an information source when one of the following occurs:
  • FIG. 5 is a block diagram depicting the site revisit module and a decomposition of the offer processor 300.
  • the site revisit queue 200 manages the whole processing of the offers.
  • the site revisit queue 200 usually receives sites to the queue from the leads processor 100. However, it also supports direct insertion of sites that haven't been processed and classified by the leads processor 100. The insertion could be manual (by a system operator) or computerized (e.g., by accessing some information system, possibly of a 3rd party).
  • the site revisit queue 200 uses the site processor 310 to process single sites, and perform operations at the site level.
  • the site processor 310 manages various operations while processing a site, these operations include:
  • the site processor 310 uses the web walker 152 to fetch pages from the site.
  • the site processor 310 hands the fetched pages to the page processor 330 to identify and update offers.
  • the site processor 310 is also responsible for removing sites from the site revisit queue 200 when it reaches a decision (based on the processing history of the site) that the site isn't relevant anymore. This could happen, for example, when a site is classified as interesting by the leads processor 100 and inserted into the site revisit queue 200. After full processing of the site by the site processor 310, no offers are identified. In this case the initial classification was wrong and the site is put back into the leads database 3400 and marked as not interesting.
  • Figure 6 is a block diagram decomposition of the page processor 330.
  • the page processor 330 receives a single marked-up document and scans the document for offers. If the document contains offers, it will have certain structural properties, which enables the users of the document to assemble the offer components into offers.
  • HTML structural properties are usually the result of the fact that a human being is supposed to view the document (in addition to the inherent structural properties that the syntax of the mark-up language provides).
  • a human being should be able to easily understand the offers in the exemplary HTML document.
  • this HTML document author will arrange the contents of the offers in such a way that they will be visually comprehensible to a human being.
  • the author uses HTML tags for that purpose.
  • Any mark up document which is intended to be displayed to humans, will have the visual structural properties since the structural properties are derived from the capabilities of the document's user.
  • Other types of documents, which are not used by humans, still have some pre-defined structural properties.
  • XML may be used for inter-system communications, and so the structural properties may differ from those found in HTML documents.
  • the knowledge base contains the structural properties per mark up language and possibly per application.
  • the page processor 330 parses the marked-up document into a parsing tree, using a commercial mark up type designated parser (i.e. there is a different parser for each type of mark up document).
  • Each mark up tag is a node, and the text content is in the leaves.
  • the page processor 330 then runs the attribute identification 332 in order to find candidates for offer components.
  • the attribute identification 332 scans each text node and looks for a text token that matches one of the templates and values stored in the offer categories database 3300. If some mark-up tags contain attribute information (which is mark-up language-specific), it is analyzed as well; for example, the alt attributes of IMG elements (graphic images) in HTML documents.
  • the result of the attribute identification 332 is that each node in the parsing tree has a list of all the recognized templates or values that match one or more attributes of some offer categories. For example, Sony is identified as a brand attribute of some consumer electronics offer categories (TVs, VCRs, DVDs, etc.).
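A minimal sketch of this attribute-identification step, with a hypothetical value table and price template standing in for the knowledge held in the offer categories database 3300:

```python
# Illustrative sketch of attribute identification: match one text node's
# content against known attribute values and grammatical templates.
import re

# Hypothetical knowledge; a real system loads this from the database.
KNOWN_VALUES = {"brand": {"Sony", "Philips"}}
TEMPLATES = {"price": re.compile(r"\$\d+(?:\.\d{2})?")}

def identify_attributes(text):
    """Return the (attribute, matched_value) pairs recognized in one
    text node of the parsing tree."""
    found = []
    for attr, values in KNOWN_VALUES.items():
        for v in values:
            if v in text:
                found.append((attr, v))
    for attr, pattern in TEMPLATES.items():
        for m in pattern.findall(text):
            found.append((attr, m))
    return found

print(identify_attributes("Sony 21-inch TV, only $199.99"))
```

Each node in the parsing tree would carry the resulting list, to be consumed later by the offer identification step.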
  • the page processor 330 uses the Structure Identification 331 to identify the structural properties within the document.
  • the Structure Identification 331 is specific per mark-up document type, since different mark-up types use different sets of tags and may have different structural properties.
  • the Structure Identification 331 identifies pre-defined structures within the document. For example, in the exemplary HTML-based documents, it handles the specific table, paragraph and line structures. As time passes, there may be changes in the way documents are structured. A possible cause is a change in the mark-up technology. System engineers and operators will regularly monitor such changes, aided by system reports and performance tracking. They will adjust the structural knowledge in the knowledge base accordingly.
  • the page processor 330 tries also to integrate text tokens spanning over more than one node into a value that matches a known template or value.
  • the page processor 330 identifies an offer within the identified structures. There must be candidates for all the key attributes in at least one of the key attribute subsets. For example, if the key attribute subset of a job offering is the offering firm, the title, the required experience and the location, an offer might be Vineto ltd., Software engineer, 3 years experience, in Tel-Aviv, Israel. An offer is identified only if values were found for all necessary attributes within an identified structure. The offer identification 333 stores the identified offer in the database. If not all the necessary attributes have values, this is a partial offer.
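The offer/partial-offer distinction can be sketched as a key-attribute-subset test; the job-offer subset below mirrors the example in the text, while the function and field names are illustrative:

```python
# Sketch of offer identification: a candidate is a full offer only when
# values were found for every key attribute in at least one key attribute
# subset; otherwise it is a partial offer.
JOB_KEY_SUBSETS = [{"firm", "title", "experience", "location"}]

def classify_candidate(found_attributes, key_subsets):
    found = set(found_attributes)
    if any(subset <= found for subset in key_subsets):
        return "offer"
    return "partial offer" if found else "no offer"

candidate = {"firm": "Vineto ltd.", "title": "Software engineer",
             "experience": "3 years", "location": "Tel-Aviv, Israel"}
print(classify_candidate(candidate, JOB_KEY_SUBSETS))        # offer
print(classify_candidate({"title": "Engineer"}, JOB_KEY_SUBSETS))  # partial offer
```

Partial offers are not discarded outright; as described below, structural symmetry can later supply the missing attributes.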
  • the offer identification 333 uses the structural knowledge in order to assume where an offer component is to be expected. For example, structural symmetries are a good source for learning new grammatical templates. "Ordered" sites (sites that organize the data in their pages in a consistent manner) will usually exhibit such symmetry properties. If the values collected this way pass a confirmation test (automatic or manual), they will be entered into the database, enlarging the knowledge source regarding the grammatical templates associated with attributes.
  • This process will gradually enlarge the number of identified components and will improve the page processor's 330 accuracy and percentage of identified offers. For example, if we found 5 offers in the first 5 rows of a table and didn't find an offer in the 6th row, and the structure of the first 5 rows is similar to that of the 6th row (it doesn't have to be identical), then there is a high probability that the 6th row contains an offer as well.
  • the page processor 330 deduces the location of the 6th offer's attributes according to the location of the attributes of the preceding 5 offers.
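The row-symmetry deduction described above might look roughly like this; the support threshold and data layout are illustrative assumptions:

```python
# Hypothetical sketch of the structural-symmetry heuristic: if an attribute
# was found in the same cell position in most rows of a table, propose the
# same position for the rows where it was missed.
def infer_missing(rows, identified, min_support=0.8):
    """rows: list of rows, each a list of cell texts.
    identified: dict mapping attribute -> {row_index: cell_index} for the
    rows where the attribute was recognized.
    Returns (row_index, attribute, cell_text) proposals for the rest."""
    proposals = []
    for attr, positions in identified.items():
        if len(positions) / len(rows) < min_support:
            continue  # not enough symmetry to trust the inference
        cols = set(positions.values())
        if len(cols) != 1:
            continue  # attribute not found in a consistent column
        col = cols.pop()
        for i, row in enumerate(rows):
            if i not in positions and col < len(row):
                proposals.append((i, attr, row[col]))
    return proposals

rows = [["TV A", "Sony", "$100"], ["TV B", "Sony", "$120"],
        ["TV C", "Philips", "$90"], ["TV D", "Sony", "$110"],
        ["TV E", "Philips", "$95"], ["TV F", "NoName", "$85"]]
identified = {"price": {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}}
print(infer_missing(rows, identified))  # proposes row 5's price
```

Proposals produced this way would still pass the confirmation test (automatic or manual) described above before entering the database.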
  • the page processor component of the back-end system is in itself a unique sub-system that can operate independently of all the other system modules. It needs the knowledge base for its operation. However, it can, for example, extract offers from documents that were placed on a local disk and process them one by one, without necessarily requiring the leads processor component, web walker component, etc.
  • Figure 7 and Figure 8 are examples of HTML web pages that contain offers. They demonstrate 2 different exemplary methods to organize offers that conform to the principles stated above.
  • the first method, illustrated in Figure 7, is to present several offers in a table. Each offer is presented within a single table row. The human eye easily identifies the offers' borders. Since a row is a known HTML structure, the page processor will look into the rows. The attribute identification identifies the known templates and values. If a row contains values for all the necessary attributes, an offer is identified. If an offer was not found in one of the rows, the page processor still might find the offer attributes due to the structural similarity between the rows.
  • the second method illustrated in figure 8, is to present each offer in a different table. Someone who is skilled in the art can easily see that what was described for the first method applies to the second as well.
  • Figure 9 shows part of the parsing tree constructed from the exemplary web page of Figure 8.
  • the parsing tree is the base of the structural properties analysis. Under each table block the offer's attributes can be seen. If all the identified values within the table block are collected, an offer can be deduced.
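Assembling offers from the table blocks of the parsing tree, as described above, can be sketched as follows (the `Node` class is a simplified stand-in for a real parser's tree):

```python
# Minimal sketch: walk each table-block subtree, collect the attribute
# values identified in its nodes, and emit an offer when the block yields
# all the necessary attributes.
class Node:
    def __init__(self, tag, children=None, attrs=None):
        self.tag = tag                  # mark-up tag ('table', '#text', ...)
        self.children = children or []
        self.attrs = attrs or {}        # attributes identified in this node

def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)

def collect(node):
    """Gather all identified attribute values in a subtree."""
    merged = dict(node.attrs)
    for child in node.children:
        merged.update(collect(child))
    return merged

def offers_in_tree(root, necessary):
    offers = []
    for table in (n for n in iter_nodes(root) if n.tag == "table"):
        attrs = collect(table)
        if necessary <= set(attrs):
            offers.append(attrs)
    return offers

tree = Node("body", [
    Node("table", [Node("#text", attrs={"brand": "Sony"}),
                   Node("#text", attrs={"price": "$199"})]),
    Node("table", [Node("#text", attrs={"brand": "Philips"})]),
])
print(offers_in_tree(tree, {"brand", "price"}))
```

Here the second table yields only a partial offer (brand without price), mirroring the second exemplary method of Figure 8, where each offer sits in its own table.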
  • Figure 10 is a block diagram decomposition of the database system 3000.
  • the entire database system is based on a commercial database management system (DBMS, such as Oracle).
  • the database management 3100 represents that software. It is responsible for the creation and maintenance of the database system 3000. It provides interfaces for all the database operations (insertion, removal, queries, etc).
  • the database system 3000 is the repository of the present invention's entire system data.
  • the other blocks in the diagram (3200 through 3600) represent the various individual logical repositories.
  • the database management 3100 manages the access to these repositories.
  • the offer database 3200 stores all the data about identified offers. This repository is mainly updated by the offer processor.
  • An offer record includes, at least, the following information: 1) The site that offers the offer (a link to the site's record in the offer providers database 3600); 2) The offer category to which it belongs; 3) Values for the key attributes that identify the offer and other attributes of the offer.
  • the specific fields that are present depend upon the offer's category. For example, if the offer is for a room in a motel, there may be information about the price of the room, whether it's a single or a double, etc. In a different example, say for a job offer, the data may be the salary offered (if it was published and identified), the company offering the position, etc.; 4) The link to the page that contains the offer; 5) The date when the offer was last updated.
  • the offer categories database 3300 contains the data about the known offer categories. System editors update it manually.
  • An offer category record includes, at least, the following fields: 1) The offer category attributes. 2) The key attribute subset or subsets. 3) The grammatical templates associated with the offer category attributes. Grammatical templates may be applicable to one offer category, or may be applicable to more than one. For example, the grammatical templates for a price attribute may be applicable to many offer categories.
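The two record layouts described above might be represented, purely illustratively, as:

```python
# Hypothetical sketch of the offer and offer-category record layouts;
# field names are illustrative, not taken from the patent.
from dataclasses import dataclass, field

@dataclass
class OfferCategory:
    name: str
    attributes: list                    # all attributes of the category
    key_attribute_subsets: list         # each subset identifies an offer
    templates: dict = field(default_factory=dict)  # attr -> grammatical templates

@dataclass
class Offer:
    provider_id: int                    # link into offer providers DB 3600
    category: str
    attributes: dict                    # key + other attribute values
    page_url: str                       # link to the page with the offer
    last_updated: str                   # date the offer was last updated

wine = OfferCategory("Wines", ["brand", "vintage", "price"],
                     [{"brand", "vintage"}])
offer = Offer(17, "Wines", {"brand": "chateau", "vintage": "1986"},
              "http://example.com/wines", "2000-10-12")
print(offer.category, sorted(offer.attributes))
```

A shared template, such as one for a price attribute, could appear in the `templates` dict of many categories, as noted above.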
  • the Leads database 3400 contains the data about Leads.
  • a lead record includes, at least, the following fields: 1) The lead's URL; 2) The lead's classification (containing / not containing offers).
  • the history database 3500 contains data gathered by the system over time, which may be used for various database products.
  • the history database 3500 could contain the average price of used cars, according to model, age, etc., at different points in time. This could be used, for example, to derive a price trend graph for used cars.
  • for location-based applications (e.g., local commerce aggregation), valuable reports could also be generated based on geographic location.
  • the offer providers' database 3600 contains the data about offer providers that the system recognized.
  • An offer provider's record includes, at least, the following fields: 1) URL of web sites of the offer provider. 2) Information about brick and mortar stores that the merchant might have (an optional field - some offer providers have only web operations), such as the physical addresses of the stores.
  • the offer processing system operates automatically, with minimal human intervention.
  • the only mandatory manual operation is the preparation of an initial knowledge base, for every known offer category.
  • the system contains auxiliary applications to be used by system operators, such as system administrators and editors.
  • Such applications include:
  • An application to maintain the offer categories database 3300 (e.g., insertion of the initial knowledge base about an offer category).
  • An application to maintain the Leads database 3400 (e.g., to manually insert new Leads).
  • An application to maintain the site revisit queue 200 (e.g., to change the scheduled revisit time of a site, to directly insert sites into the revisit queue).
  • An application to aid in the maintenance of the learning process of the page processor 330 (e.g., acknowledging new proposed grammatical templates).
  • An application to view various system error logs (e.g., reports about broken links in the web walker 152, SW warnings and errors).
  • a preferred embodiment of the present invention is a global and/or local content unification platform for various vertical markets.
  • This platform is a local commerce unification platform in the Application Service Provider model, whereby the back-end and database modules may be outsourced to 3rd parties, such as wireless service providers, portals, local media players, VAR (Value Added Resellers), ISPs, content syndicators, local sites, other content aggregators, etc.
  • This platform aggregates content from various vertical markets, and may optionally be bundled with a front-end application, such as a web site, for the end users of the distributor. Alternatively, the aggregated content (or parts thereof) may be supplied on its own; different distributors may be interested in different combinations of content type, market and geographic location.
  • a newspaper chain that has sites in New York and Los Angeles may wish to receive unified content for all the vertical markets, whereas a national real-estate portal may wish to receive real-estate content only, across all of the USA.
  • This platform is beneficial for brands, merchants and aggregated businesses; for content distributors, offering scalable, fast and efficient aggregation of content in multiple markets and localities; and for end users (Internet & wireless subscribers), offering an efficient, high-quality search tool.
  • This platform enables content unification for markets that include retail, professional services (e.g., dry cleaning, plumbers, doctors, lawyers), job offers, real estate offers, events, accommodations, and job finding related sources (e.g., job offers, candidates' CVs), as well as auctions, classifieds, B2B processes, consumer goods and wholesale items.
  • This platform leverages its automatic, structured, product/service-oriented search engine technology to aggregate information from an almost unlimited number of content sources (e.g., merchants or service providers), online as well as offline, in a cheap, fast and scalable method.
  • This preferred embodiment uses the offer processing system of the present invention in a manner that is focused on finding, identifying and aggregating product or service offerings from these content sources (e.g., merchants or service providers).
  • the present embodiment offers consumers a powerful product/service finding experience (including structured, feature based and location based searches) over the WWW using PC's, PDAs, mobile computers, cellular phones, Digital TV as well as any other Internet enabled devices or services.
  • the Web site application of the present embodiment offers easy, structured location, price and product features based product/service offer search facilities, supportive editorial content and buying decision aids, user registration and personalization services, focused advertisements, bi-directional e-mail services, etc. Similar features are available for users accessing this embodiment with mobile devices, such as WAP phones and PDA's.
  • the user interface on each of these devices is adapted according to the limitations and options of the specific device (which vary considerably).
  • simple, WAP based interfaces in use by many current cellular phones have only several lines of textual display and a very inconvenient and limited "keyboard".
  • the interface will be adapted to display very small amounts of data, and that data will be the most relevant to the user (aided by user personalization through the web site).
  • the input from the user is based mainly on simple selections from short option lists that require a minimal use of keys.
  • Another possible interaction channel is voice.
  • the user uses a voice enabled Internet appliance to interact with the web site's servers. The user "surfs" by talking to an IVR-like system that uses commercial voice recognition technologies.
  • the interface could be based solely on voice interaction, or it could be using combinations of voice, visual display, and keyboard inputs.
  • the system could ask and direct the user using voice, the user provides her inputs by pressing on the device's keys, and the system will read the results of the search to the user.
  • Another possibility is for the system to display the options on the device's display, and let the user select the options using her voice.
  • the results could also be displayed on the device's display. The exact combination depends on many factors including the capabilities of the specific device and the details of the application at hand.
  • the unique technology that is disclosed in the present embodiment has significant advantages over all known existing technologies and business practices of local or global content unification platforms. Among these advantages are:
  • the page processor can automatically process offers from sites with new content structures and styles.
  • Geographical segmentation capabilities enables localized service over all of the above mentioned devices and means.
  • the present invention facilitates the structuring of information both from online and offline merchant stores and other businesses, enabling users to compare between an almost unlimited number of merchants and service providers. Furthermore the users can undertake local merchant or services searches, and receive product or service comparison information based on geographical specifications.
  • Figure 11 illustrates a possible exemplary screen shot of the web site.
  • the web site enables the style and layout discrimination of regular merchant product offerings from "Preferred business partners" (for example, merchants paying commissions) product offerings.
  • the web site's user performs a hierarchical product or service search.
  • the user drills-down from general product category into specific product category (for example from “Food and beverages” into “Wine” into “Red wines” etc.).
  • While searching for offers, the user can search products (or models) by choosing a combination of attributes like price range, specific vendor, and product-category-specific values like size, color, features, etc.
  • a user can limit the offers to products that are sold only through the web, and/or to products offered by vendors that operate brick-and-mortar outlets in the geographic vicinity of the user (calculated from the user's address/zip code; see the "User registration" section).
  • the results can be seen in Figure 11 , illustrating a typical user interface reflecting the results of a product search.
  • the user receives an answer to his or her query, ordered in a table (may be divided into some pages).
  • the table is ordered by some default criterion (like ascending price), which the user can simply change according to his or her preferences. Offers which are in "uncertain" status, or offers priced far below what is common in the product category, are listed at the end and marked as "uncertain" (to prevent spam or errors).
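The described ordering (default ascending price, with "uncertain" or suspiciously cheap offers pushed to the end) can be sketched as follows; the low-price cutoff relative to the category median is an illustrative assumption:

```python
# Illustrative sketch of the result-table ordering: sort by ascending
# price, but push offers flagged "uncertain" (or priced far below the
# category norm) to the end.
def order_results(offers, category_median, low_factor=0.2):
    def uncertain(o):
        return (o.get("status") == "uncertain"
                or o["price"] < low_factor * category_median)
    # False sorts before True, so certain offers come first, each group
    # ordered by ascending price.
    return sorted(offers, key=lambda o: (uncertain(o), o["price"]))

offers = [
    {"name": "A", "price": 120},
    {"name": "B", "price": 1, "status": "uncertain"},
    {"name": "C", "price": 95},
    {"name": "D", "price": 3},   # suspiciously cheap vs. the median
]
print([o["name"] for o in order_results(offers, category_median=100)])
```

The user could replace the default key (price) with any other sortable field, re-sorting the same result set.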
  • the user gets the main fields of data for each product, and can ask for additional information or link to the original product's web page.
  • a user can easily filter his or her query by using a combination of field constraints.
  • the system informs the user how many different products and vendors offer products in each category, and how much this number is reduced after each filtering criterion is applied.
  • a "compare tool" lets the user compare the "interesting" product offers with each other, attribute by attribute. If the user has undertaken a localized search (asked for brick-and-mortar outlets near his or her geographic location), she or he can get a web map with all the local stores offering the product and a summarized report listing those shops.
  • the present invention supplies local as well as global content automation & infrastructure.
  • the preferred embodiment's market includes Internet & wireless subscribers, wireless providers, Internet portals, Internet portal infrastructure integrators, merchants, and brands. While the invention has been described with respect to a limited number of embodiments, it will be appreciated that many variations, modifications and other applications of the invention may be made.

Abstract

A system and method for automatic, knowledge-based processing and structuring of information from marked-up documents such as SGML, XML and HTML based documents. The system includes three parts: an offer processing (1000), an offer presentation (2000) and a database system (3000) that the offer processing (1000) and the offer presentation (2000) use. The offer processing (1000) builds an offer database in the database system (3000) by accessing data sources such as Internet sites and intranet files, retrieving pages and processing them. The offer presentation (2000) allows users to access the offer database and retrieve information from it. The offer presentation (2000) also includes servers that may be accessed by different web-enabled means such as web browsers, cellular phones, PDAs, digital TV, Internet appliances and voice activated user interfaces.

Description

A method and system for automatically structuring content from universal marked-up documents
FIELD AND BACKGROUND OF THE INVENTION
The present invention relates to a system and method for automatically extracting, processing and structuring dynamic content from universal marked-up documents. Hereinafter, universal marked-up documents refer to any type of document that has marked up properties, such as HTML, XML, Microsoft Word, PDF, WML, VML or any other current or future mark-up languages or document types. Mark-up refers to the sequence of characters or other symbols (called tags) that may be inserted at certain places in a file to delimit and describe document subcomponents as information objects. They provide metadata about the document's content that may be used for further processing of the document, such as displaying it, printing it, etc. It is also used to describe the document's logical structure. The use of the word "automatically" implies the ability to structure information from such documents without having had previous exposure to a particular style, type or example of such a document.
The rise and popularization of computer networks such as the Internet, intranets and corporate extranets have caused an explosion of information available to anyone with Internet or network access. This information is usually available to users as HTML and XML based documents on some network. In general, the content (the information to be conveyed to the users) in these pages has no standard format or organization. Potential content consumers could benefit substantially from the ability to access an aggregated, unified, structured repository (e.g., a commercial DB) of such content. However, performing this task automatically, on a large scale, and generally, without prior knowledge of the content format or organization in specific information sources (which would require source-specific adaptations), is generally impractical or impossible for all known types of automated comparative-type robots, search tools, or other content aggregation tools.
This is true in general for various environments. In particular, the well-known Internet environment has seen many different approaches to handle content, and is therefore a good case study. A closer look at the major approaches will help to elaborate on the difficulties.
General Internet search engines have historically tried to access and organize the great abundance of information available, and to bring it to Internet users. They work on a very large scale and cover a huge number of web pages. They usually index the pages for all words occurring in the page title, keywords or full-text content. This indexing is usually done on a separate keyword basis, without any effort to understand the context or the grammar of the page. The only connection between different keywords in the same page is made in retrospect, when a user performs a Boolean search operation. There is no content understanding and structuring.
Examples of such engines are www.alltheweb.com, www.altavista.com, www.excite.com, www.google.com and www.northernlight.com. In addition to searching for keywords and phrases, these engines often utilize statistical methods of filtering results according to previous usage. This approach does improve searching accuracy over time, by introducing a human element, but is still far from achieving a real "understanding" of document content.
The disadvantage of the above mentioned search technologies is that these engines have limited search capabilities (based on series of individual words, with Boolean relations between them), that produce results as lists with hyperlinks. The more "intelligent" engines, like www.ask.com can also respond to natural language questions with sorted and statistically relevant suggestions. However these engines cannot incorporate and present details from information sources in a meaningful way. For example, they cannot compare prices or features of products or services from various searched pages.
An alternative solution for automatically aggregating content is content robots (for some content sources) that analyze relevant site information and make it easily accessible. These robots run on sites that generate their pages using automated tools (usually database-driven) that give a uniform structure to their pages (such as www.amazon.com). Due to this uniformity within a site (or parts of it), content robots can be created and programmed (for some of the sites) to analyze relevant site information and structure the information. However, as different sites have different structures, this is a site-specific adjustment task. In this category of solutions fall the various Internet comparative shopping engines, often based on scrapers. (Scrapers are software tools that are programmed to extract data from specific page formats. When writing a specific scraper for a site, the writer has to manually analyze the way offers' data is displayed on that site. Then, a specific scraper is created that is tailored to extract offers from that site. When the site changes the way it displays its relevant information, the scraper usually needs to be manually adjusted accordingly.) Building the scrapers and maintaining them takes a lot of technical work, so these engines usually cover only dozens of merchants (or up to a few thousand, by using scraper-building MMI tools operated by human editors).
Examples of such engines and technologies are: www.mysimon.com and www.dealtime.com
Another common technology for content aggregation relies on an agreed upon interface for information exchange between the content aggregator and the content source. The aggregator may receive the information in a variety of ways. For example, the aggregator might write a specific software interface to the content producer's information systems (e.g., to its database). An alternative might be that the aggregator receives periodic feeds in some agreed upon format from the content provider. The drawback is that the aggregator needs to make specific arrangements (business and technical) with each individual information source. This requires much time and technical effort for every single source of information.
In the e-commerce world, these technologies are usually referred to as "shopping agents" or "shopping 'bots" (robots). Similar technologies are called "virtual database" (VDB) or "virtual store", whereby a central Web shop signs contracts with some dozens of affiliate retail sites. In certain current models, the central shop writes special "software agents" to be able to extract and organize data from each affiliate's computerized catalog. When a consumer enters the central shop site and searches for a specific product, the central site issues a query to some of the distributed agents. Each agent searches the local catalog for the product. If the affiliate site has an offer for the product, the agent sends a price quote to the central shop. After a defined time (usually 30 seconds), the central shop collects all the offers it has received and presents them to the consumer. The consumer can choose to buy according to one of these offers, either through the central shop or through the affiliate shop.
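By way of illustration only, the collect-quotes-within-a-deadline behavior described above can be sketched as follows (Python; the shop names, catalogs, latencies and the one-second deadline are hypothetical stand-ins for the affiliate agents and the 30-second window):

```python
import concurrent.futures as cf
import time

def query_agent(agent_name, catalog, product):
    """Simulate one affiliate's agent searching its local catalog."""
    time.sleep(catalog.get("latency", 0))  # stand-in for network/search delay
    price = catalog.get(product)
    return (agent_name, price) if price is not None else None

def collect_quotes(catalogs, product, timeout=1.0):
    """Fan the query out to every agent; keep whatever answers arrive
    before the deadline, then present them sorted by price."""
    quotes = []
    with cf.ThreadPoolExecutor() as pool:
        futures = [pool.submit(query_agent, name, cat, product)
                   for name, cat in catalogs.items()]
        done, _ = cf.wait(futures, timeout=timeout)
        for f in done:
            result = f.result()
            if result:                      # agent had an offer
                quotes.append(result)
    return sorted(quotes, key=lambda q: q[1])  # cheapest first

catalogs = {
    "shopA": {"vcr-x100": 199.0},
    "shopB": {"vcr-x100": 189.5},
    "shopC": {"dvd-z9": 120.0},  # no offer for the searched product
}
print(collect_quotes(catalogs, "vcr-x100"))
# -> [('shopB', 189.5), ('shopA', 199.0)]
```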
Examples of this technology are Junglee (bought by Amazon) and C2B (bought by Inktomi).
Other methods for content aggregation that usually result in detailed and accurate information are based on human-edited operations. These methods tend to aggregate information from on-line as well as off-line sources (e.g., newspapers). Specific examples in the e-commerce world include www.eshop.com (formerly www.compare.net), www.shoppinglist.com and www.pricescan.com. The drawback of this method is clearly the inefficiency of developing and maintaining a non-automated system, which is unable to keep up with the massive quantities of dynamic data that are available.
Most of the currently popular e-commerce product categories (like books or CDs) are simple to describe, sell and ship, wherever the buyer may be. Many other product categories, however, need local representation. These include products and services that need to be demonstrated, measured, explained, installed, etc. In these product and service categories, e-commerce will not likely replace the presence of brick-and-mortar shops, but will rather complement them. In other product categories, some consumers may research on the Web before buying, but will still want to buy from a local merchant. In most of the existing relevant Internet tools and methods, consumers cannot combine searches of localized retail offers ("yellow page" sites, for example, enable localized merchant searches but not localized product searches) with Internet merchant searches.
Some major technical requirements for a preferred information-based offer aggregation tool and method, without the limitations mentioned above, include: i. First and foremost, the ability to automatically process arbitrary pages or files from arbitrary information sources, without page- or site-specific software adjustments. This requirement is mandatory in order to keep operational costs low, update offers quickly, etc. Satisfying this requirement is the major enabler for the present invention to satisfy the requirements that follow. ii. A breadth of data sources, needed not just to compare more offers per offer category but also to enable a true offer comparison between two similar items or services, and to enable geographic-location-based searches that return a large number of geographically specific offers in each category. iii. Scalability, which is the ability to cover an immense and constantly growing number of data sources in a way that is fast and cheap. To achieve such scalability the technology must operate without site-specific business or technical agreements or adjustments. iv. The ability to economically maintain and list updated offers from a multitude of data sources. There is thus a widely recognized need for, and it would be highly advantageous to have, a system or method for enabling automatic aggregation and structuring of information on a global and local scale. This need is all the more acute when facing the emerging world of mobile Internet appliances, as experts predict that wireless consumers will be highly sensitive to data relevancy issues (and much less tolerant than wired Internet users). A structured search methodology that fits a wide variety of content and can be used by a variety of devices is therefore a major breakthrough in the consumer content search tools niche.
The present invention provides such a method and system for enabling automatic structuring of information from universal marked-up documents. This invention can monitor, analyze, aggregate, compare and present information from varied, dynamic sources, in an organized, structured form. The present invention is driven by the trends discussed above, and it provides high quality aggregated content, i.e., multi-source, relevant, structured, and updated information, regarding topics that are in the scope of interest of users. The present system and method (unlike existing systems, methods and technologies) is able to satisfy these important requirements.
The present invention further enables various applications for structuring information, based on its page processing technology. These include a local marketplace enabler, comparative shopping engine and site, geographic based searching, services comparisons and automatic aggregation of many types of content from Internet, Intranet, Extranet or any other network-based sources.
SUMMARY OF THE INVENTION
According to the present invention there is provided a system and method for automatically structuring content from marked-up pages. More specifically, the present invention is of a data processing system and method for automatic, knowledge-based processing and extracting of structured information from universal marked-up documents (documents that contain structural, presentational, and semantic information (e.g., tags) alongside content, such as SGML, XML, HTML, and Microsoft Word documents).
A- The present invention consists of:
1. A back-end system including communications means and a mark-up page processing algorithm;
2. A database system for storing structured data and processing requests from said back-end system; and
3. An optional front-end system for enabling user or third party interaction with said database.
B- The back-end system automatically processes content from universal marked-up documents, independent of prior knowledge of content structure or type for particular sites.
C- The back-end system comprises a page processing algorithm for automatically processing content from universal marked-up documents.
D- The back-end system processes documents from any network-based source via any computerized communications means. This includes data found in any type of computerized information system, where the system is located on a network.
E- Marked-up information sources include content existing in data formats such as SGML, HTML, XML, Microsoft WORD, PDF, WML, VML, RTF, XHTML, SMIL and HDML.
F- User interaction is executed using interactive devices such as PCs, cellular phones, pagers, handheld PCs, pocket PCs, mobile computers, interactive TVs, Internet appliances and mobile communications devices.
G- User interaction is executed with user interfaces such as graphic user interfaces, text based interfaces, voice-based interfaces, keyboards & pointing devices and any combination of these.
H- Information offers processed by the system include product offers, such as consumer goods, auctions, classifieds, bartering, wholesale goods and B2B offers, and service offers, such as professional services, job offers, real estate offers, events, classifieds and job finding tools.
I- Offers may be presented according to geographic preferences.
J- In an exemplary application, the front-end system may be an e-commerce web site for comparison-shopping for products and services.
K- The comparison shopping function is a geographically enabled localized shopping application, such that users can research product offers according to geographic preferences and/or online preferences.
L- The present invention further comprises a method for automatically structuring network-based content, according to the following steps: i. Finding information pages for information offers; ii. Retrieving relevant content from said information pages; iii. Processing retrieved pages in order to identify information offers; and iv. Aggregating said information offers in a central database.
M- The method further comprises interfacing the central database in a front-end system for responding to user queries.
N- The front-end system may be one or more Web, application or other servers for running an interactive web site. User queries may be geographic-location-based queries, and may be for the purpose of comparative shopping. Shopping may be researched at online or offline stores.
O- Processing retrieved pages further comprises the execution of a generalized algorithm for web page processing, according to the following steps: i. Pre-processing of Web pages, for filtering out all Web pages that are not relevant; ii. Web page processing, for parsing Web pages to build legitimate product offer records; and iii. Post-processing for enriching the knowledge base.
P- Web page processing includes the following operations: i. Updating the site's identified offers in the offers database, including saving historical information about offers in the history database; ii. Updating the site's information in the merchants database; iii. Adjusting the site's next revisit time in the site revisit queue, based on the amount of change in the site's processed data; iv. Using the web walker to fetch pages from the Web site; v. Handing the fetched pages to the page processor to identify and update offers, and to extract site information; vi. Removing sites from the site revisit queue and from the Leads database when it is discovered that no offers were identified in them.
Q- The present invention includes a method for structuring information from universal marked-up documents, comprising the execution of a page processing algorithm, according to the following steps: i. Receiving information documents; ii. Scanning said documents for offers; iii. Parsing said documents into a parsing tree; iv. Running an attribute identification program to find candidates for offer components; and v. Running a structure identification program in order to find structures in documents.
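By way of illustration only, the steps of this page processing algorithm can be sketched in simplified form (Python; the regular-expression templates, the single VCR-like category and the merge-adjacent-candidates heuristic are illustrative assumptions, not the actual knowledge base or structure identification program):

```python
import re
from html.parser import HTMLParser

# Hypothetical knowledge base: attribute -> grammatical template (regex).
TEMPLATES = {
    "brand": re.compile(r"\b(Sony|Philips|JVC)\b"),
    "model": re.compile(r"\b[A-Z]{2,4}-?\d{3,4}\b"),
    "price": re.compile(r"\$\s?\d+(?:\.\d{2})?"),
}

class TextCollector(HTMLParser):
    """Step iii (parsing): collect text nodes with their tag depth,
    a stand-in for positions in the parsing tree."""
    def __init__(self):
        super().__init__()
        self.depth, self.nodes = 0, []
    def handle_starttag(self, tag, attrs): self.depth += 1
    def handle_endtag(self, tag): self.depth -= 1
    def handle_data(self, data):
        if data.strip():
            self.nodes.append((self.depth, data.strip()))

def identify_offers(html):
    parser = TextCollector()
    parser.feed(html)
    # Step iv (attribute identification): tag each text node with candidates.
    candidates = []
    for depth, text in parser.nodes:
        found = {attr: m.group(0) for attr, rx in TEMPLATES.items()
                 if (m := rx.search(text))}
        if found:
            candidates.append(found)
    # Step v (structure identification): merge consecutive candidate nodes
    # until a complete offer record (all attributes) is assembled.
    offers, current = [], {}
    for found in candidates:
        current.update(found)
        if set(current) == set(TEMPLATES):
            offers.append(current)
            current = {}
    return offers

page = "<ul><li><b>Sony SLV-779</b> <i>$149.99</i></li></ul>"
print(identify_offers(page))
```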
The core technology of the data processing system and method is an offer processing system and method. The offer processing system collects, stores, processes, retrieves, and presents offers. It automatically aggregates offers from a very large number of information sources, such as Web merchants and service providers (potentially, most of such sites on the Web). Those skilled in the art will recognize that the method and system of the present invention has many applications, and that the present invention is not limited to the representative examples disclosed herein. Moreover, the scope of the present invention covers conventionally known variations and modifications to the system components described herein, as would be known by those skilled in the art.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:
Figure 1 is a block diagram of the main blocks of the present invention.
Figure 2 is a block diagram decomposition of the offer processing module.
Figure 3 is a block diagram decomposition of the lead processor module.
Figure 4 is a block diagram depicting the site classification module.
Figure 5 is a block diagram decomposition of the offer processor module.
Figure 6 is a block diagram decomposition of the page processor module.
Figure 7 is a part of an exemplary web page from a site that offers wines.
Figure 8 is a graphical representation of the results of processing the exemplary Web page.
Figure 9 is a graphical representation of part of the parsing tree of the exemplary web page.
Figure 10 is a block diagram decomposition of the database system/module.
Figure 11 is an exemplary screen shot of a geographic based comparative pricing search.
DESCRIPTION OF THE PREFERRED EMBODIMENT
The present invention is of a system and method for automatic, knowledge-based processing and structuring of information from marked-up documents.
Specifically, the present invention can be used for automatically locating, analyzing, categorizing, extracting, aggregating and presenting universal marked-up content in a structured, organized form. Hereinafter, universal marked-up documents refer to any type of documents that have marked-up properties, such as SGML, HTML, XML, Microsoft WORD, PDF, WML, VML, RTF, XHTML, SMIL and HDML, or any other current or future mark-up languages or document types. Mark-up refers to the sequence of characters or other symbols that may be inserted at certain places in a file, such as tags, to indicate how the file should perform when it is printed, displayed or otherwise used or processed. It is also used to describe the document's logical structure. The ability to automatically structure information from such documents enables the analysis, aggregation and structuring of information found in such documents without requiring prior exposure or programming in order to process the particular style, type or example of such a document. In this way documents with new types and structures can be processed independent of prior knowledge of content type or structure. It is possible, however, that different information types may require different processing algorithms or changes in the system's knowledge base, without changing the algorithm. Even in these situations, however, once the algorithm has been prepared, the present invention is able to automatically process all pages from various sites without prior specific programming or other preparation for any particular site or page.
The core technology of the present invention is a back-end system, which processes data sources and stores the results of the processing in a database. The components of this back-end system include:
• A database system for storing data and responding to queries.
• An Information Processing System for processing various types of information elements from marked-up pages.
An additional, optional element of the present invention is a front-end system, which may or may not be utilized in any particular application of the system. This is an Information Presentation system for presenting the information elements to users in a structured form; alternatively, it may be used for data retrieval for further processing, such as by third parties' secondary servers.
The present invention provides a system and method for extracting information elements, referred to hereinafter as information offers, from information sources. This Information processing system retrieves, processes, and stores offers. Offers may be defined as any relevant information elements from a document that can be grouped together for the purpose of describing an item or service. For example, the system may define a real estate offer, and may attach information elements such as type, location, price, size, features, owners etc.
The core, back-end technology of the present invention provides the means to create a database where processed information offers are stored. The back-end system is self-standing and can be operated independently as a supplier of information offers to third parties. The front-end technology of the present invention provides various means to present the information elements in response to client queries, or to prepare these elements for further processing by a third party. The front-end system is not a necessary component of the present invention.
The present invention is a system and method for processing and structuring information offers retrieved by the system from Internet, intranet, extranet and other network-based marked-up pages. Every information offer pertains to a certain, pre-defined offer category (e.g., the VCR category of all VCRs). The formal system definition for an offer is an information element uniquely identified by specific values for a set of attributes. An attribute is a feature of the element being offered; that element may be a product, a service, etc. An offer category is the set of all the elements that share the same set of attributes. Every offer category has its own set of offer category attributes (e.g., the VCR category attributes may include the brand that makes the VCR, the model of the VCR, the number of heads it has, etc.). Associated with every offer category attribute there is a list of known grammatical templates and values. These grammatical templates and values provide the set of possible values that the attribute may assume.
For a certain offer category, there are subsets (one or more) of its offer category attributes which, collectively, uniquely identify an offer in that category (each subset is used separately for the identification). For example, a brand and a model uniquely identify a certain VCR within the VCR offer category. An offer category attribute that belongs to one of these subsets is called an offer category key attribute. Such a subset is called a key attribute subset. The other offer category attributes may be either identification attributes or search attributes. Identification attributes are used to further discriminate between offers which were found using the same key attribute subset but differ in other important attributes. For example, a computer may be identified by its brand and model, yet two offers for computers of the same brand and model may differ by the amount of RAM the computers have. A search attribute is an offer attribute that is not required in order to generate an offer record, but still contains additional information about the offer. For example, the color of the above-mentioned computer might be a search attribute. Both the identification attributes and the search attributes are found by the page processor component of the present invention, but are used by the offer presentation system for the benefit of its users. Both attribute types might have a default value in the knowledge base of the present invention, which is used if a value was not found in the offer itself.
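By way of illustration only, the relationship between key attribute subsets, identification attributes and search attributes can be sketched as a small data model (Python; the VCR category definition, attribute names and values are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class OfferCategory:
    """Illustrative rendering of an offer category definition."""
    name: str
    key_subsets: list                     # each subset alone uniquely identifies an offer
    identification: list = field(default_factory=list)
    search: list = field(default_factory=list)

def offer_key(category, attrs):
    """Return the first key attribute subset fully present in the extracted
    attributes, or None if the record cannot identify an offer."""
    for subset in category.key_subsets:
        if all(a in attrs for a in subset):
            return tuple((a, attrs[a]) for a in subset)
    return None

vcr = OfferCategory(
    name="VCR",
    key_subsets=[["brand", "model"]],
    identification=["heads"],             # discriminates same-key offers
    search=["color"],                     # extra, non-identifying information
)

print(offer_key(vcr, {"brand": "JVC", "model": "HR-S9600", "color": "silver"}))
print(offer_key(vcr, {"brand": "JVC"}))   # incomplete key -> no offer record
```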
The Offer processing system is based on a generic algorithm - the page processor algorithm, which processes documents and identifies offers. The page processor behavior is defined by the system knowledge base.
The system's knowledge base contains the data about offer categories that the system has knowledge of (the known offer categories): the attributes, associated grammatical templates and values, key attributes, key attribute subsets of each known offer category, etc. The knowledge base for a category is prepared using proprietary definitions, based on domain-specific expertise. Before starting system identification of a new offer category, a system editor inserts the known grammatical templates for the new category. Since the system's learning capability (see description below) enlarges the set of known grammatical templates automatically during system operation, it is sufficient to prepare an initial (non-comprehensive) set of known grammatical templates. This minimizes the necessary work when preparing for a new offer category, and also significantly helps lower maintenance costs. The knowledge base is stored in the database (specifically, the offer categories database).
The offer processing system of the present invention scans selected information sources and retrieves pages from them. It then processes these pages, using the system's knowledge base, in order to identify offers that belong to one of the known offer categories. An offer is identified by searching for known grammatical templates that match key attributes. Each identified offer is stored in an offer database. The offer processing system regularly performs the scan, retrieve and identify procedure in order to find new offers as well as update the already found offers. The offer processing system applies a sophisticated learning algorithm in order to enrich its known template repertoire. This feature increases the number of identified offers. It also allows the system to start identifying offers using a smaller knowledge base, enlarging it over time. The offer processing system assumes a partial fulfillment requirement, i.e., it is a heuristic system that may miss offers and may also exhibit a certain number of singular errors. However, the system still provides substantial value and business benefits by its information structuring. The offer processing system as described has many generic aspects in its operation, which can be adjusted and tuned according to the specific application being applied. The present invention, which enables automatic, knowledge-based data processing and structuring of information from marked-up documents, includes an innovative offer processing system. The principles and operations of the present invention may be better understood with reference to the attached drawings, and the accompanying descriptions, wherein:
Figure 1 is a block diagram containing the main blocks of the present invention, referred to as the offers processing system. This system has three parts: a back-end (the offer processing 1000), a front-end (offer presentation 2000) and a database system 3000 that the other two parts use. As mentioned above, the offer presentation 2000 system is not essential, and can be executed in various ways or by various third parties. Third parties are intended to include any business partners who have access to at least a part of the present invention for the purpose of further processing. This processing may be for presenting to consumers, corporations, and any other users, or alternatively to provide the processing means for any other purposes. The database system 3000 stores the offers and other data that the system needs for its operation. The purpose of the back-end is to build the offer database in the database system 3000. It performs this operation by accessing data sources (such as Internet sites, intranet files etc.), retrieving pages, and processing them. The purpose of the front-end is to allow prospective users to access the offer database and retrieve information from it. The front-end system includes one or more servers that may be accessed by different web-enabled means (e.g., web browsers, cellular phones / PDAs, digital TV, Internet appliances, voice activated user interfaces etc.). Secondary servers (servers of third parties) could also access the front end. In this case, the secondary servers are "powered by" the offer processing system. The offer processing system has a scalable architecture, in that it may be operated across an unlimited number of Web, application, or other servers, according to need. The offer processing 1000 and offer presentation 2000 operate independently. Thus, even in case one of them fails, the other keeps functioning. The figures present the logical decomposition of the system into modules.
Those skilled in the art will realize that the functionality of any of the modules can be distributed over a plurality of computers, wherein the databases and processors are housed in separate units or locations. Those skilled in the art will appreciate that an almost unlimited number of processors and / or storage units may be supported. This arrangement yields a scalable, high performance, dynamic, highly available, and flexible system that is secure in the face of catastrophic hardware failures affecting the entire system.
Figure 2 is a block diagram decomposition of the back-end offer processing system 1000. The Leads processor 100 traverses selected information sites that may contain offers. Each site is checked to see if it is likely to contain offers from known offer categories. The Leads processor 100 retrieves data about the known offer categories from the database system 3000. Sites that contain such offers are inserted into the site revisit queue 200. The site revisit queue 200 stores all the sites that probably contain offers. It regularly performs time-based scans of these sites. The Site Revisit Queue is a software means for managing continued interaction with information sources so that the offers are up to date. Each scanned site is processed in the offer processor 300. The offer processor 300 extracts offers of known offer categories, as described above. For sites that are scanned for the first time, the identified offers are simply stored in the database system 3000. For other sites, the site's identified offers are updated in the database system 3000. Sites containing offers that have been processed are subsequently stored in the Site revisit queue 200 for future monitoring and re-processing.
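By way of illustration only, the site revisit queue's time-based scanning, together with the adjustment of a site's next revisit time by the amount of change found on the last scan, can be sketched as follows (Python; the halve/double interval policy and the site names are illustrative assumptions):

```python
import heapq
import itertools

class SiteRevisitQueue:
    """Minimal sketch of the revisit queue: sites are popped when their
    next-visit time arrives."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal times

    def schedule(self, site, next_visit):
        heapq.heappush(self._heap, (next_visit, next(self._counter), site))

    def due(self, now):
        """Yield every site whose scheduled visit time has arrived."""
        while self._heap and self._heap[0][0] <= now:
            yield heapq.heappop(self._heap)[2]

def next_interval(base, change_ratio):
    """Assumed policy: revisit fast-changing sites sooner, stable sites later."""
    if change_ratio > 0.5:
        return base / 2
    if change_ratio == 0:
        return base * 2
    return base

q = SiteRevisitQueue()
q.schedule("wines.example", next_visit=10)
q.schedule("books.example", next_visit=30)
print(list(q.due(now=15)))     # only wines.example is due by time 15
print(next_interval(24, 0.8))  # busy site: revisit twice as often
```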
Figure 3 is a block diagram illustrating the decomposition of the leads processor 100. The leads processor 100 is the component of the back-end system that enlarges the collection of sites that contain relevant information for the database system. The leads processor 100 locates, classifies and extracts information from information sources. This is done by searching for new sites that contain offers, and adding them to the list of sites that are periodically revisited. In this way, the whole back-end system can still operate and keep virtually any number of constant sources up-to-date without activating the leads processor 100.
The leads processor 100 increases the number of sites that it processes by employing automated tools to search the relevant network for potentially valuable sites and pages.
The tools, or components, that execute this function from within the leads processor 100 are the following:
The manual Leads 110 component that allows a system operator to input Leads manually.
External knowledge bases such as the yellow pages, search engines, directories, etc 120, which represent different, focused Web sources for Leads.
The Leads processor 100 queries these focused sources to receive result pages with lists of possible relevant lead addresses (i.e., URLs). The exact manner of the querying depends on the type of the source and the interface it supports for user queries (search engines are used in a different manner than yellow pages). It may also depend on the category of offers that are expected to be found in the site. For example, when searching for leads, search engines will be queried using the known grammatical templates and values contained in the system's knowledge base. The Leads processor 100 follows "next" links between the result pages, to retrieve multiple pages. The Leads processor 100 filters all site addresses from the results, to obtain individual leads for its operations. The e-mail registry 130, which is not integral to the invention, is a registry of merchant sites that were received via e-mail. Site owners that would like to be covered by the system send these e-mails. The DNS (Domain Name System) scan 140 is an automatic scan of domain names.
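By way of illustration only, the Leads processor's filtering of all site addresses from the result pages, to obtain individual leads, can be sketched as follows (Python; the URLs are hypothetical):

```python
from urllib.parse import urlparse

def extract_leads(result_pages):
    """Reduce raw result URLs to unique site-level leads (one per host),
    preserving the order in which sites were first seen."""
    leads, seen = [], set()
    for page in result_pages:        # one list of URLs per result page
        for url in page:
            host = urlparse(url).netloc
            if host and host not in seen:
                seen.add(host)
                leads.append("http://" + host)
    return leads

# Two hypothetical result pages from a focused source (e.g., a search engine),
# reached by following "next" links between them.
pages = [
    ["http://www.wineshop.example/catalog?p=1",
     "http://www.wineshop.example/catalog?p=2",
     "http://cellar.example/reds.html"],
    ["http://cellar.example/whites.html",
     "http://newmerchant.example/"],
]
print(extract_leads(pages))
```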
All the Leads that are obtained via any of these means are stored in the Leads database 3400 (Figure 10), which is a part of the database system 3000, for policy-based, periodical checking and classification of leads. The Leads database 3400 also contains all the leads that have been processed in the past, together with their classification (as described below). The Leads manager 160 constantly retrieves leads from the Leads database 3400 using a priority policy. For example, manually fed addresses will be of top priority, while "brute force" global DNS searching will be of lower priority. Higher priority is assigned to Leads that have a higher probability of containing offers in known offer categories. The site classification manager 150 is responsible for classifying sites into one of two categories: "interesting" (probably contains offers) or "not interesting" (probably doesn't contain offers). This classification is probabilistic. The general steps the Leads manager 160 performs are the following: 1) Retrieving a lead to be processed (according to priority). 2) Checking against the Leads database whether or not the lead has been processed before. 3) If it has, and it is classified as interesting, skipping it and going back to step 1 (retrieving a new lead). 4) Otherwise, classifying the lead using the site classification 150. 5) Storing the lead and its classification in the Leads database 3400. If the lead was classified as interesting, it is passed on to the Site Revisit Queue 200; otherwise the Leads Manager 160 lowers the lead's priority in the Leads database 3400. This enables the system to handle changes in sites' content and changes in the known offer categories, and avoids frequent scanning of sites with a low probability of being interesting.
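By way of illustration only, the five-step loop of the Leads manager 160 can be sketched as follows (Python; the lead names and the toy classifier are hypothetical, and the revisit queue is reduced to a plain list):

```python
def process_leads(pending, leads_db, revisit_queue, classify):
    """Sketch of the Leads manager loop: `pending` is already ordered by
    priority; `leads_db` maps lead -> last known classification."""
    for lead in pending:                              # step 1: next lead
        if leads_db.get(lead) == "interesting":       # steps 2-3: skip known-good
            continue
        verdict = classify(lead)                      # step 4: classify
        leads_db[lead] = verdict                      # step 5: store result
        if verdict == "interesting":
            revisit_queue.append(lead)                # hand to revisit queue

db = {"old.example": "interesting"}                   # processed before
queue = []
process_leads(
    ["old.example", "shop.example", "blog.example"],
    db, queue,
    # Toy stand-in for the site classification 150.
    classify=lambda lead: "interesting" if "shop" in lead else "not interesting",
)
print(queue)               # only newly classified interesting leads
print(db["blog.example"])  # stored for future priority decisions
```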
As can be seen in Figure 4, the Site Classification is an integral component for determining the relevancy of the content of information sources, as provided by the Leads Processor 100. It is comprised of:
A site classifier 151 which manages the classification.
A Web Walker 152 which retrieves documents from the information sources.
A Reduced Page Processor 153 which processes retrieved documents and estimates the probability of existence of offers.
The site classifier 151 manages the classification process. It receives a URL of a site, or an address of an information source, for classification, and uses the web walker 152 to fetch Web pages from the site (the web walker is described below). It instructs the web walker 152 to retrieve a pre-defined, configurable number of pages from the site. It also limits its walking depth within the site (see the description of the web walker for more detail). These parameters can be relatively small. For example, a depth of 2-3 usually should be sufficient for good classification, because, most likely, relevant information such as offers is accessible by following 2-3 hyper-links. Retrieving a small number of documents while maintaining good classification results saves system resources and provides an effective balance between missing relevant information sources and using system resources dedicated to offer extraction. Each individual document is given to the reduced page processor 153, which tries to classify it. The reduced page processor 153 is similar to the page processor 330 (see description below, Figure 5). It has many of the fundamental capabilities and methods of the page processor 330. The difference is the focus of its operation, so that it generates a different output (it doesn't extract offers). The page processor 330 identifies partial or complete offers in the document. The results of this process are levels of certainty of the existence of offers in the document and their corresponding identified offer categories. The reduced page processor searches the pages for known keywords and templates, from all offer categories. The reduced page processor counts occurrence frequencies by offer category. A high frequency of templates or values from one offer category classifies the page for that category (for example, a page containing words or templates like "chateau XXX", "Champagne", "bottle", 1986, etc. is classified into the offer category "Wines").
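By way of illustration only, the reduced page processor's frequency-based classification can be sketched as follows (Python; the per-category term lists and the threshold value are illustrative assumptions drawn from the wine example above):

```python
# Hypothetical per-category keyword lists from the knowledge base.
CATEGORY_TERMS = {
    "Wines": ["chateau", "champagne", "bottle", "vintage"],
    "VCRs":  ["vcr", "heads", "hi-fi", "vhs"],
}

def classify_page(text, threshold=2):
    """Count known-term occurrences per offer category; classify the page
    into the most frequent category, if the count is high enough."""
    words = text.lower()
    counts = {cat: sum(words.count(term) for term in terms)
              for cat, terms in CATEGORY_TERMS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] >= threshold else None

page = "Chateau Margaux 1986 - a fine bottle from a great vintage"
print(classify_page(page))
# -> Wines
```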
If the frequency exceeds a certain value, the page is classified as interesting.
The site classifier 151 accumulates the results from the processed documents, uses thresholds and decides on the information source classification. The site classifier 151 stores this classification in the Leads Database 3400.
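The frequency-based classification described above can be sketched as follows. The keyword sets and the threshold value are illustrative assumptions; in the actual system the templates and values would come from the offer categories database 3300, and the thresholds would be configurable.

```python
from collections import Counter

# Hypothetical keyword lists per offer category; the real system would
# load templates and values from the offer categories database 3300.
CATEGORY_KEYWORDS = {
    "Wines": {"chateau", "champagne", "bottle", "vintage"},
    "Jobs": {"salary", "experience", "position", "resume"},
}

def classify_page(tokens, threshold=3):
    """Count keyword occurrences per offer category and classify the page
    for the category whose frequency reaches the threshold."""
    counts = Counter()
    for token in tokens:
        for category, keywords in CATEGORY_KEYWORDS.items():
            if token.lower() in keywords:
                counts[category] += 1
    if not counts:
        return None  # no known keywords: not an interesting page
    category, freq = counts.most_common(1)[0]
    return category if freq >= threshold else None
```

The site classifier would then accumulate such per-page results over all retrieved pages before deciding on the site-level classification.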
The web walker 152, in greater detail, locates pages by repeatedly processing web or other pages, extracting links from them, and following these links to other pages. It operates in a similar way to a user that uses a graphical tool that was built for site navigation (e.g., a browser). The difference is that, compared to a human being, the web walker navigates the site in a much more orderly and exhaustive manner. It retrieves the documents from the site exactly as navigation tools do, by submitting client requests to web servers. Thus, the documents are retrieved without any need for active participation of the information source, apart from its general accessibility to its users. The web walker 152 starts from a certain page, e.g., the home page of some site. It processes the page and finds all the links on the page that link to other pages in the same site. To access the various tags, their structure, attributes and content, the web walker 152 uses a commercial marked-up document parser. There are various means to specify links between pages, depending on the markup language. The web walker 152 identifies the language and uses appropriate tools and methods to obtain the links. For example, in HTML-based documents the web walker 152 obtains links that are specified using methods (and combinations of such methods) such as:
1. Explicit links (e.g., <A HREF="/foo/sales.htm">).
2. Forms (usually used to query a database-based site that contains offers): It handles both GET & POST form submitting. When handling forms, the web walker 152 automatically attempts to create a sufficient form data set that it submits to the form-processing agent. The exact method is determined after analysis of the form. For example, for a form that has lists of options to select from, the web walker 152 may iterate over the options of one selection, while keeping all the other selections constant. This method usually covers all the possible results in the database that the form enables access to.
3. Links that are partially or wholly generated with the aid of client side scripts (e.g., java-scripts). The web walker 152 uses commercial tools that can interpret the script, execute it on demand, and support integration with the underlying object model of the parsed document. Using these tools, the web walker 152 is able to handle and obtain links of this type. The web walker 152 only follows links to documents that are later processed by the page processor 330. For example, it follows links to HTML or XML documents, but does not necessarily follow links to GIF or JPEG documents. It has a configurable list of document types with an indication of whether or not to follow links to documents of the corresponding type. Documents whose type is unknown or can't be reasonably assumed are retrieved and processed. Such cases are reported and monitored by a system operator in order to update the system with regard to existing document types.
The web walker 152 navigates a site using the well-known A* algorithm. The basic traversal method uses BFS (breadth first search - a special case of the A* algorithm). Consequently, when following relevant links in a document, the web walker follows the links one by one, in the order of their appearance. It retrieves the documents that are directly linked to the processed page (direct descendants that are "one click" away from the processed page) before retrieving documents that are indirectly linked. As it follows the links in the site, the web walker maintains a tree of the documents in the site, their links, and the relations between them. It uses this information to avoid loops. During traversal, the web walker employs heuristics that may result in a non-BFS traversal. For example, after automatically submitting a form (as described above), the results may span more than one page. These pages are usually linked together using common methods such as a "next" button. In such a case, the web walker tries to locate these special links (the "next" button). If the web walker decides (using certainty thresholds) that it has identified a special link, it gives it precedence over regular links, thus deviating from the simple BFS traversal method.
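The basic BFS traversal with a depth limit, a page budget, and loop avoidance can be sketched as below. The `get_links` callback is a hypothetical interface standing in for the parsing and link-extraction machinery described above; a visited set plays the role of the document tree used to avoid loops.

```python
from collections import deque

def walk_site(start_url, get_links, max_depth=3, max_pages=50):
    """Breadth-first traversal of a site starting from start_url.
    get_links(url) is a caller-supplied function that fetches the page
    and returns the same-site links found on it."""
    visited = {start_url}          # avoids loops in the link graph
    order = []                     # retrieval order (direct descendants first)
    queue = deque([(start_url, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # maximal walking depth reached for this branch
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

The heuristics described above (e.g., giving "next"-button links precedence) would amount to reordering the queue when such a special link is identified with sufficient certainty.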
The web walker 152 uses, for every information source that it handles, several configurable operational parameters:
- The maximal walking depth. The depth of a document is the smallest number of successive links that need to be followed to retrieve it, relative to the document where the retrieval started.
- The number of documents to retrieve from the information source.
The web walker 152 stops retrieving documents from an information source when one of the following occurs:
- It reached the maximal specified depth; or
- It reached the maximal specified number of documents to retrieve; or
- There are no more relevant documents to retrieve from that information source.
Communications failures, dead links, and other issues may prevent the web walker from retrieving a certain document or even an entire site. The web walker consistently checks the status of its retrieval operations. In case of failure in following a link, the web walker continues to the next, attempting to retrieve those pages that are accessible. As information sources are periodically revisited, transient failures will not prevent the system from extracting the information from information sources. In this way, the web walker module successfully retrieves most pages from most sites. However, it is not supposed to retrieve 100% of all documents of all types from all processed information sources.
Figure 5 is a block diagram depicting the site revisit module and a decomposition of the offer processor 300. The site revisit queue 200 manages the whole processing of the offers. It manages an adjustable, time-based revisit of all the sites. Every site in the queue has a next revisit time attribute attached to it. This attribute indicates when the module needs to revisit the site in order to update its offers. The site revisit queue 200 usually receives sites to the queue from the leads processor 100. However, it also supports direct insertion of sites that haven't been processed and classified by the leads processor 100. The insertion could be manual (by a system operator) or computerized (e.g., by accessing some information system, possibly of a 3rd party). The site revisit queue 200 uses the site processor 310 to process single sites and perform operations at the site level. The site processor 310 manages various operations while processing a site; these operations include:
1) Updating the site's identified offers in the Offers database 3200 (Figure 10). This operation includes saving historical information about offers in the history database 3500.
2) Updating the site's information in the offer providers database 3600 (Figure 10) (e.g., the addresses of the offices of some service provider, or the brick & mortar store information of some vendor).
3) Adjusting the site's next revisit time in the site revisit queue 200. The adjustment is mainly based on the amount of change in the site's processed data (e.g., the number of offers that were updated since the last time that site was processed).
The site processor 310 uses the web walker 152 to fetch pages from the site, and hands the fetched pages to the page processor 330 to identify and update offers. The site processor 310 is also responsible for removing sites from the site revisit queue 200 when it reaches a decision (based on the processing history of the site) that the site isn't relevant anymore. This could happen, for example, when a site is classified as interesting by the leads processor 100 and inserted into the site revisit queue 200, but after full processing of the site by the site processor 310 no offers are identified. In this case the initial classification was wrong, and the site is put back into the leads database 3400 and marked as not interesting.
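The change-based revisit adjustment can be sketched as below. The factors, ratios, and bounds are illustrative assumptions; in the actual system they would be configurable operational parameters.

```python
def next_revisit_interval(current_interval_hours, offers_changed, offers_total,
                          min_hours=6, max_hours=24 * 14):
    """Adaptive revisit scheduling sketch: shorten the interval for sites
    whose offers changed a lot since the last visit, and lengthen it for
    mostly static sites, clamped to configurable bounds."""
    change_ratio = offers_changed / offers_total if offers_total else 0.0
    if change_ratio > 0.5:      # very dynamic site: revisit twice as often
        interval = current_interval_hours / 2
    elif change_ratio < 0.1:    # mostly static site: revisit half as often
        interval = current_interval_hours * 2
    else:
        interval = current_interval_hours
    return max(min_hours, min(max_hours, interval))
```

The site revisit queue 200 would then set the site's next revisit time attribute to the current time plus this interval.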
Figure 6 is a block diagram decomposition of the page processor 330. The page processor 330 receives a single marked-up document and scans the document for offers. If the document contains offers, it will have certain structural properties which enable the users of the document to assemble the offer components into offers. For example, HTML structural properties are usually the result of the fact that a human being is supposed to view the document (in addition to the inherent structural properties that the syntax of the mark-up language provides). A human being should be able to easily understand the offers in the exemplary HTML document. Thus, the HTML document author will arrange the contents of the offers in such a way that they will be visually comprehensible to a human being. The author uses HTML tags for that purpose. Any marked-up document which is intended to be displayed to humans will have these visual structural properties, since the structural properties are derived from the capabilities of the document's user. Other types of documents, which are not intended for human use, still have some pre-defined structural properties. For example, XML may be used for inter-system communications, and so the structural properties may differ from those found in HTML documents. The knowledge base contains the structural properties per mark-up language and possibly per application.
The page processor 330 parses the marked-up document into a parsing tree, using a commercial parser designated for the mark-up type (i.e., there is a different parser for each type of marked-up document). Each mark-up tag is a node, and the text content is in the leaves.
The page processor 330 then runs the attribute identification 332 in order to find candidates for offer components. The attribute identification 332 scans each text node and looks for text tokens that match one of the templates and values stored in the offer categories database 3300. If some mark-up tags contain attribute information, which is mark-up-language-specific, it is analyzed as well, for example, graphic images and alt attributes of IMG elements in HTML documents. The result of the attribute identification 332 is that each node in the parsing tree has a list of all the recognized templates or values that match one or more attributes of some offer categories. For example, "Sony" is identified as a brand attribute of some consumer electronics offer categories (TVs, VCRs, DVDs, etc.).
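The template-matching step of the attribute identification 332 can be sketched as below. The regular-expression templates are illustrative assumptions; the actual grammatical templates and values live in the offer categories database 3300 and are maintained by system editors.

```python
import re

# Hypothetical grammatical templates per attribute; the real templates
# would be loaded from the offer categories database 3300.
TEMPLATES = {
    "price": re.compile(r"^\$\d+(\.\d{2})?$"),
    "year":  re.compile(r"^(19|20)\d{2}$"),
    "brand": re.compile(r"^(Sony|Philips|Samsung)$"),
}

def identify_attributes(text):
    """For each text token, list the attributes whose templates it matches,
    mirroring the per-node candidate lists built by attribute identification 332."""
    candidates = {}
    for token in text.split():
        matched = [attr for attr, pat in TEMPLATES.items() if pat.match(token)]
        if matched:
            candidates[token] = matched
    return candidates
```

In the actual system the scan runs over text nodes of the parsing tree rather than over a flat string, so each candidate stays attached to its node.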
The page processor 330 uses the Structure Identification 331 to identify the structural properties within the document. The Structure Identification 331 is specific per marked-up document type, since different mark-up types use different sets of tags and may have different structural properties. The Structure Identification 331 identifies pre-defined structures within the document; for example, in the exemplary HTML-based documents, it handles the specific table, paragraph and line structures. As time passes, there may be changes in the way documents are structured. A possible cause is a change in the mark-up technology. System engineers and operators will regularly monitor such changes, aided by system reports and performance tracking. They will adjust structural knowledge in the knowledge base accordingly.
Since a value for a certain template may span more than one node (this is done, for example, for presentation purposes), the page processor 330 also tries to integrate text tokens spanning more than one node into a value that matches a known template or value.
The page processor 330 identifies an offer within the identified structures. There must be candidates for all the key attributes in at least one of the key attribute subsets. For example, if the key attribute subset of a job offering is the offering firm, the title, the required experience and the location, an offer might be Vineto ltd., Software engineer, 3 years experience, in Tel-Aviv, Israel. An offer is identified only if values were found for all necessary attributes within an identified structure. The offer identification 333 stores the identified offer in the database. If not all the necessary attributes have values, this is a partial offer.
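The key-attribute-subset check performed by the offer identification 333 can be sketched as follows; the function and parameter names are illustrative, not the actual module interface.

```python
def identify_offer(candidates, key_attribute_subsets):
    """Offer identification sketch: `candidates` maps attribute names to the
    values found within one identified structure (e.g., a table row).
    An offer is identified only if every attribute of at least one key
    attribute subset has a value; otherwise it is at most a partial offer."""
    for subset in key_attribute_subsets:
        if all(attr in candidates for attr in subset):
            return {attr: candidates[attr] for attr in subset}
    return None  # partial offer (or no offer) for this structure
```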
Since the database might not include all the possible grammatical templates for attributes, the attribute identification 332 might miss an offer component, and therefore may not be able to identify an offer. Therefore, sophisticated learning capabilities are employed. The offer identification 333 uses the structural knowledge in order to assume where an offer component is to be expected. For example, structural symmetries are a good source for learning new grammatical templates. "Ordered" sites (sites that organize the data in their pages in a consistent manner) will usually exhibit such symmetry properties. If the values collected this way pass a confirmation test (automatic or manual), they will be entered into the database and enlarge the knowledge source regarding the grammatical templates associated with attributes. This process will gradually enlarge the number of identified components and will improve the page processor's 330 accuracy and percentage of identified offers. For example, if we found 5 offers in the first 5 rows of a table and didn't find an offer in the 6th row, and the structure of the first 5 rows is similar to that of the 6th row (it doesn't have to be identical), then there is a high probability that the 6th row contains an offer as well. The page processor 330 deduces the location of the 6th offer's attributes according to the location of the attributes of the preceding 5 offers.
The page processor component of the back-end system is in itself a unique sub-system that can operate independently of all the other system modules. It needs the knowledge base for its operation. However, for example, it can extract offers from documents that were placed on a local disk and process them one by one, without necessarily requiring the leads processor component, web walker component, etc.
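The symmetry-based learning step can be sketched as below. The representation of rows as flat cell lists and known attribute positions is an illustrative simplification of the parsing-tree structures the system actually operates on.

```python
def propose_missing_attributes(rows, identified, attribute_positions):
    """Symmetry-based learning sketch: for table rows where no offer was
    identified, assume the row structure matches that of the identified
    rows and propose the cell at each known attribute position as a new
    candidate value (to be confirmed, automatically or manually, before
    entering the knowledge base)."""
    proposals = {}
    for i, row in enumerate(rows):
        if i in identified:
            continue  # an offer was already found in this row
        proposals[i] = {attr: row[pos]
                        for attr, pos in attribute_positions.items()
                        if pos < len(row)}
    return proposals
```

Values confirmed this way would then enlarge the set of grammatical templates stored in the offer categories database 3300.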
Figure 7 and Figure 8 are examples of HTML web pages that contain offers. They demonstrate 2 different exemplary methods to organize offers that conform to the principles stated above. The first method, illustrated in Figure 7, is to present several offers in a table. Each offer is presented within a single table row. The human eye easily identifies the offers' borders. Since a row is a known HTML structure, the page processor will look into the rows. The attribute identification identifies the known templates and values. If a row contains values for all the necessary attributes, an offer is identified. If an offer was not found in one of the rows, the page processor still might find the offer attributes due to the structural similarity between the rows. The second method, illustrated in Figure 8, is to present each offer in a different table. Someone who is skilled in the art can easily see that what was described for the first method applies to the second as well.
Figure 9 shows part of the parsing tree constructed from the exemplary web page of Figure 8. The parsing tree is the base of the structural properties analysis. Under each table block the offer's attributes can be seen. If all the identified values within the table block are collected, an offer can be deduced.
Figure 10 is a block diagram decomposition of the database system 3000. The entire database system is based on a commercial database management system (DBMS, such as Oracle). The database management 3100 represents that software. It is responsible for the creation and maintenance of the database system 3000. It provides interfaces for all the database operations (insertion, removal, queries, etc). The database system 3000 is the repository of the present invention's entire system data. The other blocks in the diagram (3200 through 3600) represent the various individual logical repositories. The database management 3100 manages the access to these repositories.
The offer database 3200 stores all the data about identified offers. This repository is mainly updated by the offer processor. An offer record includes, at least, the following information: 1) The site that offers the offer (a link to the site's record in the offer providers database 3600); 2) The offer category to which it belongs; 3) Values for the key attributes that identify the offer and other attributes of the offer. The specific fields that are present depend upon the offer's category. For example, if the offer is for a room in a motel, there may be information about the price of the room, whether it's a single or a double, etc. In a different example, say for a job offer, the data may be the salary offered (if it was published and identified), the company offering the position, etc.; 4) The link to the page that contains the offer; 5) The date when the offer was last updated.
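The shape of an offer record can be sketched as the following data structure. The field names and types are assumptions based on the description above, not the actual database schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class OfferRecord:
    """Illustrative shape of a record in the offers database 3200."""
    provider_id: int                                 # link to the offer providers database 3600
    category: str                                    # the offer category the offer belongs to
    attributes: dict = field(default_factory=dict)   # key attributes and other category-specific fields
    page_url: str = ""                               # link to the page that contains the offer
    last_updated: Optional[date] = None              # date when the offer was last updated
```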
The offer categories database 3300 contains the data about the known offer categories. System editors update it manually. An offer category record includes, at least, the following fields: 1) The offer category attributes. 2) The key attribute subset or subsets. 3) The grammatical templates associated with the offer category attributes. Grammatical templates may be applicable to one offer category, or may be applicable to more than one. For example, the grammatical templates for a price attribute may be applicable to many offer categories.
The Leads database 3400 contains the data about Leads. A lead record includes, at least, the following fields: 1) The lead's URL; 2) The lead's classification (containing / not containing offers).
The history database 3500 contains data gathered by the system over time, which may be used for various database products. For example, in an exemplary classified offers application, the history database 3500 could contain the average price of used cars, according to model, age, etc, at different points in time. This could be used, for example, to derive a price trend graph for used cars. In location based applications (e.g., local commerce aggregation), valuable reports could also be generated based on geographic location.
The offer providers' database 3600 contains the data about offer providers that the system recognized. An offer provider's record includes, at least, the following fields: 1) URL of web sites of the offer provider. 2) Information about brick and mortar stores that the merchant might have (an optional field - some offer providers have only web operations), such as the physical addresses of the stores.
The offer processing system operates automatically, with minimal human intervention. The only mandatory manual operation is the preparation of an initial knowledge base, for every known offer category. However, to be able to monitor the system's operation, and to improve its performance, the system contains auxiliary applications to be used by system operators, such as system administrators and editors. Such applications include:
1. An application to maintain the offer categories database 3300 (e.g., insertion of the initial knowledge base about an offer category).
2. An application to maintain the Leads database 3400 (e.g., manually insert new Leads).
3. An application to maintain the site revisit queue 200 (e.g., to change the scheduled revisit time of a site, to directly insert sites into the revisit queue).
4. An application to aid in the maintenance of the learning process of the page processor 330 (e.g., acknowledging new proposed grammatical templates).
5. An application to view various system errors logs (e.g., reports about broken links in the web walker 152, SW warnings and errors).
A preferred embodiment of the present invention is a global and/or local content unification platform for various vertical markets. This platform is a local commerce unification platform in the Application Service Provider model, whereby the back-end and database modules may be outsourced to 3rd parties, such as wireless service providers, portals, local media players, VARs (Value Added Resellers), ISPs, content syndicators, local sites, other content aggregators, etc. This platform aggregates content from various vertical markets, and may optionally be bundled with a front-end application, such as a web site, for the end user of the distributor. Alternatively, the aggregated content (or parts thereof) may be supplied on its own, depending on the distributor - different distributors may be interested in different combinations of content type / market and geographic locations. For example, a newspaper chain that has sites in New York and Los Angeles may wish to receive unified content for all the vertical markets, whereas a real-estate national portal may wish to receive real-estate content only, over all of the USA. This platform is beneficial for brands, merchants and aggregated businesses; for content distributors - scalable, fast and efficient aggregation of content in multiple markets and localities; and for end users (Internet & wireless subscribers) - an efficient, high-quality search tool. This platform enables content unification for markets that include retail, professional services (e.g., dry cleaning, plumbers, doctors, lawyers), job offers, real estate offers, events, accommodations, and job finding related sources (e.g., job offers, candidates' CVs), as well as auctions, classifieds, B2B processes, consumer goods and wholesale items.
This platform leverages its automatic, structured, product/service-oriented search engine technology to aggregate information from an almost unlimited number of content sources (e.g., merchants or service providers), online as well as offline, in a cheap, fast and scalable manner. This preferred embodiment uses the offer processing system of the present invention in a manner that is focused on finding, identifying and aggregating product or service offerings from these content sources (e.g., merchants or service providers). The present embodiment offers consumers a powerful product/service finding experience (including structured, feature based and location based searches) over the WWW using PC's, PDAs, mobile computers, cellular phones, Digital TV, as well as any other Internet enabled devices or services. The Web site application of the present embodiment offers easy, structured location, price and product features based product/service offer search facilities, supportive editorial content and buying decision aids, user registration and personalization services, focused advertisements, bi-directional e-mail services, etc. Similar features are available for users accessing this embodiment with mobile devices, such as WAP phones and PDA's. The user interface on each of these devices is adapted according to the limitations and options of the specific device (which vary considerably). For example, simple, WAP based interfaces (in use by many current cellular phones) have only several lines of textual display and a very inconvenient and limited "keyboard". In that case, the interface will be adapted to display very small amounts of data, and that data will be the most relevant to the user (aided by user personalization through the web site). Also, the input from the user is based mainly on simple selections from short option lists that require a minimal use of keys. Another possible interaction channel is voice.
In this case the user uses a voice enabled Internet appliance to interact with the web site's servers. The user "surfs" by talking to an IVR-like system that uses commercial voice recognition technologies. The interface could be based solely on voice interaction, or it could use combinations of voice, visual display, and keyboard inputs. For example, the system could ask and direct the user using voice, the user provides her inputs by pressing on the device's keys, and the system will read the results of the search to the user. Another possibility is for the system to display the options on the device's display, and let the user select the options using her voice. The results could also be displayed on the device's display. The exact combination depends on many factors including the capabilities of the specific device and the details of the application at hand.
The unique technology that is disclosed in the present embodiment has significant advantages over all known existing technologies and business practices of local or global content unification platforms. Among these advantages are:
1. Breadth of merchants/service providers covered: The integration of hundreds of thousands of merchants and/or service providers with their addresses and offerings enables aggregation of local small and medium sized businesses. This enables consumers to search for local offerings and enables the integration of brick-and-mortar commerce into the e-commerce world.
2. No prior technological or business engagement with the serviced business is required in order to begin the basic service.
3. Faster scalability due to the "understanding" ability of the marked-up page processing algorithm. The page processor can automatically process offers from sites with new content structures and styles.
4. Geographical segmentation capabilities enable localized service over all of the above mentioned devices and means.
The present invention facilitates the structuring of information both from online and offline merchant stores and other businesses, enabling users to compare between an almost unlimited number of merchants and service providers. Furthermore, the users can undertake local merchant or services searches, and receive product or service comparison information based on geographical specifications.
Figure 11 illustrates a possible exemplary screen shot of the web site. The web site enables the style and layout discrimination of regular merchant product offerings from "Preferred business partners" (for example, merchants paying commissions) product offerings.
The web site's user performs a hierarchical product or service search. The user drills down from a general product category into a specific product category (for example, from "Food and beverages" into "Wine" into "Red wines", etc.). When in a specific product category, the user can consult various information sources about the purchase (see next item), or search for offers. While searching for offers, the user can search products (or models) by choosing a combination of attributes like price range, specific vendor and product category specific values like size, color, features, etc. A user can limit the offers to products that are only sold through the web, and/or to products offered by vendors that operate brick-and-mortar outlets in the geographic vicinity of the user (calculated from the user's address/zip codes, see "User registration" section). The results can be seen in Figure 11, illustrating a typical user interface reflecting the results of a product search.
The user receives an answer to his or her query, ordered in a table (which may be divided into several pages). The table is ordered by some default criterion (like ascending price), which the user can simply change according to his or her preferences. Offers which are in "uncertain" status, or offers whose price is far too low for what is common in the product category, are listed at the end and are marked as "uncertain" (to prevent spam or errors).
The user gets the main fields of data for each product, and can ask for additional information or link to the original product's web page. A user can easily filter his or her query by using a combination of field constraints.
The system informs the user how many different products and vendors offer products in each category, and how much this number is reduced after each filtering criterion is applied.
If different vendors are offering the same product, all those offers may be grouped together.
The user has the opportunity to mark some products/offers as "interesting". A "compare tool" lets the user compare the "interesting" product offers with each other, attribute by attribute. If the user has undertaken a localized search (asked for brick-and-mortar outlets near his or her geographic location), she or he can get a web map with all the local stores offering this product, and a summarized report listing those shops.
The present invention supplies local as well as global content automation & infrastructure. The preferred embodiment's market includes Internet & wireless subscribers, wireless providers, Internet portals, Internet portal infrastructure integrators, merchants, and brands. While the invention has been described with respect to a limited number of embodiments, it will be appreciated that many variations, modifications and other applications of the invention may be made.
APPENDICES
The following appendices contain additional information about the components and operation of the present invention.
1. Outline of the automatic page processing algorithm.
2. Building the algorithm of the present invention by examples.
3. The generalized Algorithm for automatic page processing.

Claims

WHAT IS CLAIMED IS:
1. A system for automatically structuring content from universal marked-up documents, comprising: i. A back-end module for building a database of the structured content; and ii. A database sub-system for storing structured data and processing requests from said back-end system.
2. The system of claim 1 where said back-end module automatically processes content from universal marked-up documents, independent of prior knowledge of content structure for particular sites.
3. The system of claim 1 where said back-end module automatically processes content from universal marked-up documents, independent of prior knowledge of content for particular sites.
4. The system of claim 1, wherein said back-end module further comprises a computer usable medium having computer readable program code embodied therein for automatically processing content from said universal marked-up documents.
5. The system of claim 1, wherein said back-end module processes documents from any network-based source selected from the group consisting of Internet, intranet, extranet and Virtual Private Network, via computerized communications means.
6. The system of claim 1, further comprising a front-end module for enabling user interaction with the system.
7. The system of claim 1, further comprising a front-end module for enabling data retrieval for further processing by a 3rd party.
8. The system of claim 2, where said universal marked-up information sources include content existing in data formats selected from the group consisting of SGML, HTML, XML, Microsoft WORD, PDF, WML, VML, RTF, XHTML, SMIL and HDML.
9. The system of claim 6 wherein said front-end module includes at least one server that enables retrieval of the structured information from said database.
10. The system of claim 6, wherein said user interaction is executed using an interactive device selected from the group consisting of PC's, cellular phones, pagers, handheld PC's, pocket PCs, Mobile computers, interactive TV's, Internet appliances and mobile communications devices.
11. The system of claim 6, wherein said front-end module includes a user interface selected from the group consisting of graphic user interfaces, text based interfaces, voice-based interfaces, keyboards & pointing devices and any combination of them.
12. A system for automatic processing and aggregation of content from universal marked-up documents, such that information offers are extracted and structured, comprising: i. A back-end system for building an offers database of the information offers; and ii. A database system for storing offers records and processing requests from said back-end system.
13. The system of claim 12, further comprising a front-end module for enabling user interaction with the system.
14. The system of claim 12, further comprising a front-end module for enabling access to the information offers for further processing by third parties.
15. The system of claim 12, wherein said information offers are product offers, selected from the group consisting of consumer goods, auctions, classifieds, bartering, wholesale goods and B2B offers.
16. The system of claim 15, wherein said product offers are geographically enabled localized products.
17. The system of claim 12, wherein said information offers are service offers, such that service offers are selected from the group consisting of professional services, job offers, real estate offers, events, classifieds and job finding tools.
18. The system of claim 17, wherein said service offers are geographically enabled localized services, such that users can research service offers according to geographic preferences.
19. The system of claim 13, wherein said front-end module is a comparison-shopping engine for products and services.
20. The system of claim 19, wherein said comparison-shopping engine includes geographically enabled localized shopping and Internet based virtual shopping, such that users can research product offers at online stores, offline stores and a combination thereof.
21. The system of claim 12, wherein said back-end system and database system form a stand-alone local content product and services offerings aggregator.
22. The system of claim 21, wherein said product and services offerings aggregator is used by third party entities selected from the group consisting of WSPs (Wireless Service Providers), on-line news sites and other media sites, voice portals, local portals, community portals, e-commerce hubs, B2B marketplaces and ISPs (Internet Service Providers).
23. A method for automatically structuring network-based content, comprising the steps of: i. Finding information sources for information offers; ii. Retrieving information pages from said information sources; iii. Processing retrieved pages in order to identify information offers; and iv. Aggregating of said information offers in a central database.
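As a toy illustration of how the four steps of claim 23 (find sources, retrieve pages, identify offers, aggregate) might compose into a single pipeline, here is a minimal Python sketch. Every function name and the stubbed return values are illustrative assumptions, not part of the claimed method.

```python
# Illustrative end-to-end pipeline for the four steps of claim 23.
# Each step is a deliberately trivial stub standing in for real logic.

def find_sources(seeds):
    return list(seeds)                        # step i: find information sources

def retrieve_pages(source):
    return [f"{source}/page1"]                # step ii: retrieve pages (stubbed)

def identify_offers(page):
    return [{"page": page, "offer": "demo"}]  # step iii: process retrieved pages

def aggregate(sources):
    database = []                             # step iv: central database
    for src in find_sources(sources):
        for page in retrieve_pages(src):
            database.extend(identify_offers(page))
    return database
```

The point of the sketch is only the data flow: each stage feeds the next, and the final aggregation step accumulates structured offer records in one place.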
24. The method of claim 23, further comprising the step of interfacing the central database with a front-end module for enabling user interaction with said central database.
25. The method of claim 24, wherein said front-end module includes at least one server for running an interactive web site.
26. The method of claim 24, wherein said front-end module includes at least one server for interacting with Internet enabled devices.
27. The method of claim 24, wherein said user interaction includes geographic location based queries.
28. The method of claim 25, wherein said server enables presentations of product offers for the purpose of comparative shopping.
29. The method of claim 28, wherein said product offers include product prices and features obtained from data sources selected from the group consisting of merchant stores, job databases, real estate rentals, events, classifieds and B2B marketplaces.
30. The method of claim 28, wherein said comparative shopping is executed for offline products.
31. The method of claim 25, wherein said server enables presentations of service offers.
32. The method of claim 31, wherein said service offers are geographic location based offers.
33. The method of claim 23, further comprising the step of interfacing the central database with a front-end server module for further processing by a third party.
34. The method of claim 33, wherein said front-end server module includes an engine for supplying information services to third parties.
35. The method of claim 23, wherein said finding of information sources is executed by a Leads Processor by steps including: a) Retrieving a lead to be processed; b) Classifying said lead using a site classification program; c) Checking a leads database to determine whether said lead has been processed before; and d) Storing said lead and said classification thereof in a Leads database.
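The leads-processing loop of claim 35 can be sketched in a few lines of Python. All names here (`LeadsDB`, `classify_site`, `process_lead`) and the classification heuristic are illustrative assumptions; the patent does not specify an implementation.

```python
# Hedged sketch of claim 35: retrieve a lead, classify it, skip
# already-processed leads, and store the result.

class LeadsDB:
    """Minimal in-memory stand-in for the Leads database."""
    def __init__(self):
        self._leads = {}

    def seen(self, url):
        return url in self._leads

    def store(self, url, classification):
        self._leads[url] = classification

def classify_site(url):
    # Placeholder for the site-classification program (step b).
    if "shop" in url or "store" in url:
        return "merchant"
    return "unknown"

def process_lead(url, db):
    # Step c: skip leads that have been processed before.
    if db.seen(url):
        return None
    # Step b: classify the lead.
    classification = classify_site(url)
    # Step d: store the lead and its classification.
    db.store(url, classification)
    return classification
```

Returning `None` for an already-seen lead makes the dedup check (step c) observable to the caller.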
36. The method of claim 23, wherein said retrieving of information pages includes extraction of data from sites of local merchant stores.
37. The method of claim 23, wherein said retrieving of information pages includes extraction of data from sites of local merchants and local service providers.
38. The method of claim 23, wherein said retrieving of information pages comprises the steps of: i. Revisiting information sources that are stored in a site revisit queue based on a scheduled next revisit time for said information sources; ii. Handing said revisited information sources to a web walker to retrieve relevant pages from the information source; iii. Handing the retrieved pages to a page processor to identify and update offers in an offers database, and to extract information about said information source; iv. Updating said information source's information in an offers providers database; v. Adjusting said information source's next revisit time in the site revisit queue, based on the amount of change in the source's processed data; vi. Removing sources from the site revisit queue; and vii. Updating states of said information sources in the Leads database.
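Step v of claim 38 adjusts a source's next revisit time based on how much its data changed. One common way to realize such a policy is multiplicative adaptation (revisit changed sources sooner, back off on unchanged ones); the function below is a sketch under that assumption, and the constants are invented for illustration.

```python
def next_revisit_interval(current, changed, min_h=1, max_h=168):
    """Adaptive revisit scheduling sketch for claim 38, step v.

    `current` is the current revisit interval in hours; `changed` says
    whether the source's processed data changed since the last visit.
    Sources that changed are revisited twice as often; unchanged
    sources back off, bounded by illustrative min/max limits (hours).
    """
    if changed:
        return max(min_h, current // 2)
    return min(max_h, current * 2)
```

The halving/doubling rule is only one possible policy; the patent claims the adjustment itself, not any particular formula.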
39. The method of claim 23, wherein said retrieving of relevant information pages utilizes a web walker module, comprising the following steps: e) Receiving a starting point for traversal in an information source; f) Traversing said information source according to a traversal policy, wherein said policy is based on the well-known A* algorithm; g) Repeating said traversal according to said traversal arguments and said traversal policy; and h) Retrieving said pages.
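Claim 39's traversal policy is said to be based on A*. In crawling terms this usually amounts to best-first expansion from a priority queue ordered by a heuristic score, which the sketch below implements; a full A* would also carry an accumulated path cost. The `neighbors` and `score` callables are assumed inputs, not named in the patent.

```python
import heapq

def walk(start, neighbors, score, limit=10):
    """A*-style best-first site traversal sketch for claim 39.

    `neighbors(page)` yields pages linked from a page (step f/g);
    `score(page)` estimates how promising a page is, lower = better,
    playing the role of the A* heuristic. Returns retrieved pages in
    visit order (step h).
    """
    frontier = [(score(start), start)]  # step e: starting point
    visited = set()
    retrieved = []
    while frontier and len(retrieved) < limit:
        _, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        retrieved.append(page)            # step h: retrieve the page
        for nxt in neighbors(page):       # expand per the policy
            if nxt not in visited:
                heapq.heappush(frontier, (score(nxt), nxt))
    return retrieved
```

With `score = len` and a small link graph, the walker visits the shortest (most "promising") page names first, illustrating how the heuristic steers traversal order.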
40. The method of claim 23, wherein said processing of retrieved pages includes the steps of: i) parsing pages to create a structure for said offer records; j) building legitimate offer records using a knowledge base; and k) post-processing for enriching a knowledge base of the system.
41. The method of claim 40, wherein said parsing of pages further comprises automatic processing of content from new and unknown marked-up based pages, independent of having prior programming for said pages.
42. The method of claim 40, wherein said building of legitimate offer records further comprises using grammatical templates to enable offer identification.
43. The method of claim 40, wherein said post-processing further comprises executing computerized means for expanding said knowledge base.
44. The method of claim 23, wherein said aggregating of information includes aggregation of product and service offers for comparative price and feature searches.
45. A method for structuring information from universal marked-up documents, comprising the steps of: i. Receiving a marked-up document; ii. Parsing said document into a parsing tree; iii. Finding candidates for offer components from within said document; iv. Finding structures in said document; v. Using knowledge stored in an offer categories database to identify said offer components and said structures; vi. Identifying offers using pre-defined identification criteria from a knowledge base; vii. Analyzing said parsing tree, said offer components, and said structures, using pre-defined heuristics and said knowledge; viii. Making a knowledge base expansion decision; and ix. Storing said identified offers in an offers database.
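A toy version of claim 45's core idea — parse a marked-up document and combine structural context with domain knowledge to identify offer components — can be sketched with Python's standard-library HTML parser. The price regex, the "known terms" set standing in for the offer-categories database, and the adjacency heuristic are all illustrative assumptions, far simpler than anything the patent describes.

```python
import re
from html.parser import HTMLParser

# Illustrative offer-component pattern: a dollar price.
PRICE = re.compile(r"\$\d+(?:\.\d{2})?")

class OfferFinder(HTMLParser):
    """Toy sketch of claim 45: walk the parse events of a marked-up
    document and pair price-like text with the preceding text node
    when that node matches a known offer category."""

    def __init__(self, known_terms):
        super().__init__()
        self.known_terms = known_terms  # stands in for the offer-categories DB
        self.offers = []                # identified (name, price) offers
        self._last_text = ""

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        m = PRICE.search(text)
        # Heuristic: a price next to a known category term forms an offer.
        if m and self._last_text.lower() in self.known_terms:
            self.offers.append((self._last_text, m.group()))
        self._last_text = text

def extract_offers(html, known_terms):
    finder = OfferFinder(known_terms)
    finder.feed(html)
    return finder.offers
```

In a table row like `<td>Camera</td><td>$199.99</td>`, the parser sees the two text nodes in order, so the adjacency heuristic pairs the product name with its price — a miniature of the "offer components + structures + knowledge" combination the claim recites.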
PCT/IL2000/000648 1999-10-12 2000-10-12 A method and system for automatically structuring content from universal marked-up documents WO2001027712A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU79416/00A AU7941600A (en) 1999-10-12 2000-10-12 A method and system for automatically structuring content from universal marked-up documents

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15885499P 1999-10-12 1999-10-12
US60/158,854 1999-10-12

Publications (2)

Publication Number Publication Date
WO2001027712A2 true WO2001027712A2 (en) 2001-04-19
WO2001027712A3 WO2001027712A3 (en) 2001-10-18

Family

ID=22570004

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2000/000648 WO2001027712A2 (en) 1999-10-12 2000-10-12 A method and system for automatically structuring content from universal marked-up documents

Country Status (2)

Country Link
AU (1) AU7941600A (en)
WO (1) WO2001027712A2 (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5933822A (en) * 1997-07-22 1999-08-03 Microsoft Corporation Apparatus and methods for an information retrieval system that employs natural language processing of search results to improve overall precision
US5987454A (en) * 1997-06-09 1999-11-16 Hobbs; Allen Method and apparatus for selectively augmenting retrieved text, numbers, maps, charts, still pictures and/or graphics, moving pictures and/or graphics and audio information from a network resource
US6006225A (en) * 1998-06-15 1999-12-21 Amazon.Com Refining search queries by the suggestion of correlated terms from prior searches
US6081814A (en) * 1997-07-07 2000-06-27 Novell, Inc. Document reference environment manager

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6931397B1 (en) * 2000-02-11 2005-08-16 International Business Machines Corporation System and method for automatic generation of dynamic search abstracts contain metadata by crawler
EP1253530A2 (en) * 2001-04-26 2002-10-30 Siemens Aktiengesellschaft System and method for automatically updating product data in an electronic catalog
EP1253530A3 (en) * 2001-04-26 2005-04-20 Siemens Aktiengesellschaft System and method for automatically updating product data in an electronic catalog
EP1503306A2 (en) * 2003-07-29 2005-02-02 Matsushita Electric Industrial Co., Ltd. Information display apparatus
EP1503306A3 (en) * 2003-07-29 2006-01-04 Matsushita Electric Industrial Co., Ltd. Information display apparatus
US7257573B2 (en) 2003-07-29 2007-08-14 Matsushita Electric Industrial Co., Ltd. Information display apparatus
EP1566750A1 (en) * 2004-02-20 2005-08-24 Brother Kogyo Kabushiki Kaisha Data processing unit and data processing program stored in computer readable medium
US7529748B2 (en) 2005-11-15 2009-05-05 Ji-Rong Wen Information classification paradigm
WO2008067191A3 (en) * 2006-11-27 2008-10-02 Designin Corp Systems, methods, and computer program products for home and landscape design
WO2008067191A2 (en) * 2006-11-27 2008-06-05 Designin Corporation Systems, methods, and computer program products for home and landscape design
US8117558B2 (en) 2006-11-27 2012-02-14 Designin Corporation Converting web content into two-dimensional CAD drawings and three-dimensional CAD models
US8122370B2 (en) 2006-11-27 2012-02-21 Designin Corporation Visual bookmarks for home and landscape design
US8260581B2 (en) 2006-11-27 2012-09-04 Designin Corporation Joining and disjoining individual rooms in a floor plan
US9019266B2 (en) 2006-11-27 2015-04-28 Designin Corporation Systems, methods, and computer program products for home and landscape design
US8868598B2 (en) 2012-08-15 2014-10-21 Microsoft Corporation Smart user-centric information aggregation
CN111401986A (en) * 2020-02-28 2020-07-10 周永东 Commodity trading method and system of trading platform

Also Published As

Publication number Publication date
WO2001027712A3 (en) 2001-10-18
AU7941600A (en) 2001-04-23

Similar Documents

Publication Publication Date Title
CN103150352B (en) System to generate related search queries
US7260579B2 (en) Method and apparatus for accessing data within an electronic system by an external system
CN100462972C (en) Document-based information and uniform resource locator (URL) management method and device
US7080064B2 (en) System and method for integrating on-line user ratings of businesses with search engines
US8380721B2 (en) System and method for context-based knowledge search, tagging, collaboration, management, and advertisement
US7099859B2 (en) System and method for integrating off-line ratings of businesses with search engines
KR100601578B1 (en) Summarizing and Clustering to Classify Documents Conceptually
US6256623B1 (en) Network search access construct for accessing web-based search services
US20020107718A1 "Host vendor driven multi-vendor search system for dynamic market preference tracking"
US8560518B2 (en) Method and apparatus for building sales tools by mining data from websites
WO2001027712A2 (en) A method and system for automatically structuring content from universal marked-up documents
Desikan et al. Web mining for business computing
JP2007018476A (en) Stock brand retrieval system

Legal Events

Date Code Title Description
ENP Entry into the national phase in:

Ref country code: US

Ref document number: 2001 752552

Date of ref document: 20010103

Kind code of ref document: A

Format of ref document f/p: F

AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP