WO2008129339A1

WO2008129339A1 - Method for location identification in web pages and location-based ranking of internet search results

Info

Publication number: WO2008129339A1
Application number: PCT/IB2007/001006
Authority: WO
Inventors: Joachim Diederich; Hermann Havermann; Carsten Tautz
Original assignee: Mitsco - Seekport Fz-Llc
Priority date: 2007-04-18
Filing date: 2007-04-18
Publication date: 2008-10-30

Abstract

The invention relates to a method for determining a geographic location of a service or facility described in a web page by processing location relevant information in a Web page or by using a support vector machine and for location-based ranking by using heuristic functions and an expert system in a localized Internet search. Web pages are preprocessed in different manners to be capable of being applied to a support vector machine. A trained support vector machine determines a location classification of the Web page's content. This location classification is considered by an expert system while ranking a search result.

Description

Method for location identification in Web pages and location-based ranking of Internet search results

The invention relates to a method for location identification in (multilingual) Web pages and the location-based ranking of Internet search results. In particular, this invention relates to a method for determining the geographic location of a service or facility described in a web page by processing location relevant information of the Web page or by using a support vector machine and for the location-based ranking by use of heuristic functions and an expert system in the context of localized Internet search.

The World Wide Web (in the following: WWW) is a collection of documents in specific formats provided on data processing devices connected via the Internet. These documents are accessed by specifying a particular document in a program, i.e. a Web browser. Since its beginning, the WWW has amassed a huge number of documents and has been subject to various and frequent changes. Thus, it is not possible for a user to keep an overview of the available information. Conventional techniques like employing conventional mercantile directories, e.g. yellow pages or the like, have proven unable to cope with the amount and volatility of the available information. In this environment, search engines have been established as helpful utilities for finding the required information. But even with the use of search engines, a query representing a topic for which information is requested produces many search results, so that results relevant to a specific user can easily be overlooked. Search engines established in the market therefore employ different algorithms and methods for ranking the display of search results in such a way that the results most relevant to a user are displayed first. Known methods include "relevance ranking", which aims at displaying results with the highest priority first. In other methods, the priority of a given Web page depends on the number of links from other Web pages referencing the page. A further method, described in patent document DE 102 10 840 A1, groups search results by thematic domains and provides ranks for the respective domains.

These methods of ranking search results have proven insufficient for location relevant Web searches, i.e. searches where the relevance of a result depends on the location of a service, facility, or event described or named in a Web page. As an example, a user in Dubai using a search engine with the keywords "Italian restaurant" is usually only interested in Italian restaurants located in Dubai and not, for example, in Italian restaurants located in New York. With conventional ranking systems or display orders of search results, the user is faced with the cumbersome task of considering a huge number of search results to discover relevant items, or alternatively needs to have at least basic knowledge of his environment and of the usage of sophisticated search terms. In addition, the user is faced with the problem that Web pages often do not contain explicit location information and thus do not provide a basis for Boolean search. This makes the usage of a conventional search engine difficult. It is the object of the invention to overcome the problems described above. In particular, it is the object of the invention to provide a method for enabling a user to quickly and easily find relevant results by use of a Web search that utilizes localized information.

This object is solved by a method according to claim 1 and claim 4 and by the information analysis device according to claim 15. Further advantageous developments are subject-matter of the dependent claims.

The invention provides a computer based method for determining geographical location information of contents of documents, in particular Web pages, employing machine learning techniques among other methods. The method is advantageous, since it allows a geographical classification of Web pages even if no explicit geographic location information is available either from the Web page's content itself or from other sources.

The result of the location information determination can even be enhanced in accuracy and amount by applying the subject-matter of the dependent claims.

The invention further provides a computer based method for providing Web search results with localized relevance, wherein the search results are ranked according to location relevance, e.g. minimal geographic distance to a user. This method employs an expert system implementing a heuristic function. This method is advantageous, since it allows processing of geographic location information in a multitude of formats like geographic positions or areas, exact building addresses, fragmented addresses, suburbs, regions and the like. It is also advantageous, since it allows the combined usage of exact and vague location information, which can be adequately taken into consideration by the expert system. The ranking of the search results can even be improved by applying the subject-matter of the dependent claims.

Further features and advantages of the invention become obvious from the description of an embodiment and the figure.

The Figure shows a schematic diagram of components of an information analysis device as well as a flow chart of the methods according to the invention.

An information analysis device according to an embodiment of the invention comprises one or more data processing devices which are connected to the Internet. On these devices, a program called Crawler C is installed. Via the Internet, the Crawler C has access to Web pages Pi constituting the WWW. The Crawler C is designed to collect the information of all Web pages Pi to be made available to a user of the information analysis device. The Crawler C thus collects a large set of Web pages Pi and preprocesses the Web pages in that structural page information like HTML tags is removed and the Web pages are grouped according to their content language.

The thus preprocessed contents of the Web pages Pi are then distributed to two components for identifying the location of the respective Web pages Pi. In this case, the location of a Web page Pi does not relate to a place where a Web server hosting the Web page is located but relates to the location the content of the Web page refers to, i.e. to a location where a service or facility described or mentioned on the Web page is placed.

The first component A is a program running on one or more data processing devices. Component A performs the task of determining a location of the Web page by processing information directly pertaining to locations provided in the Web page.

For this purpose, as a first step, Web page content is searched for location relevant information using Boolean expressions. For example, if a Web page relating to an Italian restaurant in Dubai contains address information like country, city, street and street number, i.e. it contains character strings matching distinct formats like postcodes, street names and the like, this information is determined by the Boolean expression, extracted and kept for further processing.
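As a rough illustration of this first step, the following sketch matches a few address-like formats with regular expressions and combines the hits with a Boolean condition. The patterns, the city list and the function name `extract_location_fields` are assumptions made for this example only; the description does not specify the actual expressions, and real address formats differ by country and language.

```python
import re

# Hypothetical patterns for illustration; not taken from the description.
PATTERNS = {
    "phone": re.compile(r"\+\d{1,3}[\s-]?\d{1,4}[\s-]?\d{3}[\s-]?\d{4}"),
    "street": re.compile(r"\b\d+\s+(?:[A-Z][a-z]+\s)+(?:Road|Street|Avenue)"),
}
CITIES = ("Dubai", "Abu Dhabi", "New York")  # illustrative list only


def extract_location_fields(text):
    """Return the location-relevant strings found in a page's text."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(0)
    for city in CITIES:
        if re.search(r"\b" + re.escape(city) + r"\b", text):
            fields["city"] = city
            break
    # Boolean combination: an address counts as complete only if
    # both a street and a city were matched.
    fields["has_full_address"] = "street" in fields and "city" in fields
    return fields


page = "Trattoria Roma, 12 Sheikh Zayed Road, Dubai. Call +971 4 555 0100."
fields = extract_location_fields(page)
```

The extracted fields would then be kept for the later steps (ontology expansion, directory lookup and GIS query).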

In a next step, ontologies are used to expand the query in the case that the above step of using the Boolean expression provides insufficient results. For this purpose, publicly available databases like Wordnet and OpenCyc are used. For example, if the Web page Pi of the Italian restaurant only contains a reference to the city Dubai and the street and street number, the country United Arab Emirates can be determined by using ontologies or databases providing semantic information. Likewise, if the Web page Pi only contains a reference to the "capital of the United Arab Emirates", the capital Abu Dhabi could be determined.
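The expansion step can be pictured with a toy lookup. The two dictionaries below merely stand in for an ontology such as Wordnet or OpenCyc; their structure, the function names and the string handling are assumptions for illustration and do not reflect the interfaces of those databases.

```python
# Toy stand-in for an ontology; the data structure is hypothetical.
PART_OF = {"Dubai": "United Arab Emirates", "Abu Dhabi": "United Arab Emirates"}
CAPITAL_OF = {"United Arab Emirates": "Abu Dhabi"}


def expand_location(fields):
    """Add a country when only the city is known."""
    expanded = dict(fields)
    city = expanded.get("city")
    if city and "country" not in expanded and city in PART_OF:
        expanded["country"] = PART_OF[city]
    return expanded


def resolve_capital(description):
    """Resolve a phrase of the form 'capital of the <country>'."""
    prefix = "capital of the "
    if description.lower().startswith(prefix):
        return CAPITAL_OF.get(description[len(prefix):])
    return None


expanded = expand_location({"city": "Dubai", "street": "12 Sheikh Zayed Road"})
capital = resolve_capital("capital of the United Arab Emirates")
```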

In a next step, the extracted address information is complemented by external directories, such as Yellow Page services. For example, using names and/or identified phone numbers, address information can be retrieved and attached.

In a next step, the location information gathered in the first three steps is transmitted to a geographic information system (in the following GIS) to retrieve a geographic position.

The information analysis device further comprises an index database generating device I consisting of a computer program running on one or more data processing devices. The index database generating device I indexes all terms of the Web page Pi in a conventional manner and stores them in the search index database ID. The function of the index database generating device I and the search index database ID is to index every term of the Web page for later retrieval. In addition, the location information gathered by component A is stored in the search index database ID in relation to the Web page Pi.

In contrast to component A, component B is used to gather location information of the Web page Pi in a case, in which the Web content does not have exact location information such as street addresses, postal codes or even phone numbers. Component B utilizes machine learning techniques.

In a first step, the Web page content of a Web page found by the Crawler C is converted to a so-called bag-of-words (in the following BOW) representation. This means that the text of the Web page is represented as an attribute-value vector where each distinct word W_j,i of the Web page Pi corresponds to an attribute whose value is the frequency f(W_j,i) of the word W_j,i in the Web page Pi.
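A minimal sketch of such a BOW conversion is given below, assuming lower-casing and a simple letters-only tokenizer. Both choices, and the function name `bag_of_words`, are assumptions of this example; the description does not prescribe a tokenizer.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each distinct word of the text to its frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


bow = bag_of_words("Italian restaurant near the marina. The best Italian pizza.")
```

Here `bow["italian"]` plays the role of f(W_j,i) for the word "italian" in this page.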

In a next step, the BOW is further processed by applying stemming, thresholding, and the formation of N-grams. By stemming, words W_j,i in the BOW with coincident stems are reduced to their stems and pooled. By thresholding, words W_j,i with low frequency are removed, since they are not of relevance to the result. N-grams are formed, i.e. attribute-value vectors where each tuple of two or more words T_k,i corresponds to an attribute whose value is the frequency f(T_k,i) of the sequence of these words in the Web page Pi.
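The three operations can be sketched as follows. The suffix-stripping function is a deliberately crude stand-in for a real stemmer, and the threshold value and all names are assumptions of this example rather than details from the description.

```python
import re
from collections import Counter


def crude_stem(word):
    """Very rough suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def threshold(counts, min_count):
    """Remove terms that occur fewer than min_count times."""
    return Counter({term: n for term, n in counts.items() if n >= min_count})


def bigrams(tokens):
    """Count directly consecutive word pairs (N-grams with N = 2)."""
    return Counter(zip(tokens, tokens[1:]))


text = "Restaurants serving pizza. The restaurant serves pizza daily."
stems = [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]
counts = Counter(stems)           # words with coincident stems are pooled
frequent = threshold(counts, 2)   # low-frequency words are dropped
pairs = bigrams(stems)
```

In this toy text, "restaurants" and "restaurant" are pooled under one stem, and the pair ("serv", "pizza") becomes an attribute of its own.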

In a next step, several techniques are applied for normalizing and transforming word frequencies. The process of normalizing, transforming and importance weighting of word frequencies f in text analysis is well known in the related art and is described e.g. in Applied Intelligence, March 30, 2003, 109-123, the content of which is hereby incorporated by reference. It is therefore not discussed in detail. In this embodiment, the Euclidean norm is used for normalization, since it is known that the support vector machines used later work best when input vectors are normalized to unit length with respect to the Euclidean norm. For transforming word frequencies, raw frequencies and logarithmic frequencies can be used. Inverse document frequency is used to determine importance weights for the terms.
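One way to combine these three operations is sketched below. The concrete formulas, log(1 + f) for the logarithmic frequency and log(N / df) for the inverse document frequency, are common textbook variants chosen here as assumptions; the description does not fix them, and the function names are hypothetical.

```python
import math


def log_tf(counts):
    """Logarithmic transformation of raw term frequencies."""
    return {t: math.log(1 + n) for t, n in counts.items()}


def idf(documents):
    """Inverse document frequency over a list of term-count mappings."""
    n_docs = len(documents)
    df = {}
    for doc in documents:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return {t: math.log(n_docs / d) for t, d in df.items()}


def weight_and_normalize(counts, idf_weights):
    """Apply importance weights, then scale to unit Euclidean length."""
    weighted = {t: v * idf_weights.get(t, 0.0) for t, v in log_tf(counts).items()}
    norm = math.sqrt(sum(v * v for v in weighted.values()))
    if norm == 0:
        return weighted
    return {t: v / norm for t, v in weighted.items()}


docs = [{"italian": 2, "dubai": 1}, {"italian": 1, "marina": 3}]
weights = idf(docs)
vector = weight_and_normalize(docs[0], weights)
```

A term that occurs in every document, such as "italian" in this toy corpus, receives weight zero and therefore does not influence the classification.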

The thus derived BOW format of the Web page is applied to a support vector machine (in the following SVM) for determining location information of the Web page. The usage of SVMs for text classification is well known in the related art and is described e.g. in Applied Soft Computing, 7 (2007), 923-928 or Heyer, C., Diederich, J., Tibianna: A Learning-Based Search Engine with Query Refinement, Thorn, J., Kay, J. (Eds.), Proceedings of the Seventh Australian Document Computing Symposium, Sydney, Australia (16 December 2002) 105-108. Sydney, Australia: The University of Sydney (2002), ISBN 1-86487-525-9, the content of which is hereby incorporated by reference. A detailed description is therefore omitted.

The Web page Pi is indexed and stored in the search index database ID together with the location information gathered by component B, in the same way as the Web pages treated by component A are stored.

The SVM for classifying the Web pages is trained, i.e. the hyperplanes separating input vectors in different classes are calculated by applying input vectors to the SVM. The input vectors are formed by applying the above steps of conversion to a BOW representation, stemming, thresholding, forming of N-grams, and normalizing and transforming to Web pages with known location information. These Web pages are labeled with the known location information. Labeling with location information can be accomplished by using Web pages classified by component A, by the use of SVM transduction or even by hand-labeling Web pages with location information. In addition, explicit location information is excluded from the Web page in BOW representation, i.e. the exact location information used for labeling the Web page is excluded from the input for training the SVM. The consequence is that the machine learning methods will utilise contextual information for determining location information of the Web page instead of explicit location information, which is already taken into account by the processing of component A.
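The training procedure can be illustrated with a strongly simplified sketch: explicit location terms are removed from the labeled pages, and a linear two-class SVM is then fitted by subgradient descent on the regularized hinge loss. This training method, the hyperparameters, the toy pages and all names are assumptions of this example; the description does not state which SVM formulation or solver is used, and more than two locations would require an extension such as one-versus-rest training.

```python
def mask_location_terms(counts, location_terms):
    """Drop the explicit location terms that were used for labeling."""
    return {t: n for t, n in counts.items() if t not in location_terms}


def train_linear_svm(samples, labels, epochs=200, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (labels +1/-1)."""
    w = {t: 0.0 for s in samples for t in s}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(w[t] * v for t, v in x.items())
            for t in w:                  # shrinkage from the L2 penalty
                w[t] *= 1 - lr * lam
            if margin < 1:               # sample violates the margin
                for t, v in x.items():
                    w[t] += lr * y * v
    return w


def predict(w, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(w.get(t, 0.0) * v for t, v in x.items())
    return 1 if score >= 0 else -1


# Toy labeled pages: +1 = Dubai, -1 = New York (illustrative data only).
pages = [
    {"dubai": 1, "marina": 2, "dirham": 1},
    {"dubai": 2, "souk": 1, "dirham": 2},
    {"york": 1, "subway": 2, "dollar": 1},
    {"york": 1, "broadway": 1, "dollar": 2},
]
labels = [1, 1, -1, -1]
masked = [mask_location_terms(p, {"dubai", "york"}) for p in pages]
w = train_linear_svm(masked, labels)
```

Because the city names are masked, the learned weights rest only on contextual words such as "dirham" or "subway", which is the effect the paragraph above describes.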

A search engine S, consisting of a computer program running on one or more data processing devices, is provided for performing a search in the search index database ID and for generating a result list of found Web pages. For this purpose, the search engine S comprises a user interface in a Web page format allowing the user to input a search term and the user's geographic position. The geographic position of the user could also be determined by detecting the user's IP address and determining the region to which said address is assigned. Alternatively, if the user is employing a mobile terminal for accessing the search engine, the geographic position of the user can be determined by requesting position information, e.g. cell information, of the mobile terminal from the mobile network provider. If the mobile device is equipped with or connected to a position detection device like a GPS receiver, the user's geographic position could be determined by transmitting the positional data determined by the position detection device. The search result list determined by applying the search term to the search engine is ranked by the search engine according to the positional relation of the location information of each Web page Pi of the search result to the geographic position of the user. For this purpose, an expert system implementing a heuristic function is used to determine the ranking of the search results in the search result list. Results are ranked according to the precision of their location information and their distance from the user's geographic position. Those Web pages Pi with exact location information close to the user are ranked higher than those pages with imprecise location information or a greater distance from the user. If the search input field for inputting the key words is part of a specific Web site, the geographical context may be preset or restricted. For example, a city's Web portal may restrict Internet search results to restaurants or those located within the city boundaries.
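The ranking depends on the distance between a page's location and the user's position. The description does not say how this distance is computed; as one plausible assumption, the great-circle distance between two latitude/longitude pairs can be obtained with the haversine formula, as sketched below. The function name and the mean Earth radius value are choices of this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

For short inner-city distances a simpler planar approximation would also be conceivable; the haversine form is used here only because it works for any pair of coordinates.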

As an example for the heuristic, it is assumed that A is the geographic position of the user. It is further assumed that B, C, D and E are geographic locations expressed in or inferred from Web pages' contents. Then, the following are example rules for the expert system that realizes the heuristic function:

If B is a house/building address, representing narrow/exact coordinates, and is geographically close to A, then the Web page with location B is ranked high.

If B and C are house addresses and B is geographically closer to A than C, then the Web page with location B is ranked over the Web page with location C and both of them are ranked high.

If D is a street block and geographically close to A and there are street addresses, e.g. B and C, closer to A, then the Web page with the location D is ranked below B and C.

If D is a street block and geographically close to A and there is no street address close to A, then the Web page with location D is ranked high.

If E is a suburb and geographically close to A and there are street blocks, e.g. D, closer to A but not street addresses, then rank the Web page with the location E below the Web page with the location D.

If E is a suburb and geographically close to A and there are no street addresses or blocks close to A, then rank the Web page with location E high.

If E is a suburb and geographically close to A and C is a house address that is geographically more distant from A than the entire suburb E, then the Web page of E is ranked below the Web page of C.
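One possible reading of the example rules above is an ordering that first prefers more precise location information and then, within the same precision, the smaller distance to the user. The sketch below encodes that reading as a sort key. The numeric precision levels, the `Result` class and the function `rank` are assumptions of this example, not the expert system of the embodiment, and the notion of "close" is not modeled by a threshold.

```python
from dataclasses import dataclass
from typing import Optional

# Smaller value = more precise location; the levels are an assumption.
PRECISION = {"address": 0, "street_block": 1, "suburb": 2, "city": 3, "unknown": 4}


@dataclass
class Result:
    url: str
    precision: str
    distance_km: Optional[float]  # None when no position could be determined


def rank(results):
    """Order results by location precision first, then by distance."""
    def key(result):
        distance = result.distance_km
        if distance is None:
            distance = float("inf")
        return (PRECISION[result.precision], distance)
    return sorted(results, key=key)


results = [
    Result("e.example", "suburb", 1.0),
    Result("x.example", "unknown", None),
    Result("c.example", "address", 3.0),
    Result("d.example", "street_block", 0.3),
    Result("b.example", "address", 0.5),
]
ordered = [r.url for r in rank(results)]
```

With these toy entries, both house addresses B and C precede the street block D, which precedes the suburb E, even though C is farther away than E, in line with the last rule above.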

In the following, the usage of the search engine is described. A user is driving in a car on Sheikh Zayed Road in Dubai, a large inner-city highway. The user is loading the Web page hosting the interface for the localized search engine by use of his mobile phone. The user is entering the keywords "Italian restaurants" to the search engine. Alternatively, the keywords could be entered by a speech recognition interface. The search engine determines the position of the mobile phone by requesting cell information from the user's mobile net provider and processes the search in the search index database. The search engine ranks the list of search results corresponding to Web pages describing facilities offering Italian food. The Web pages of restaurants with known coordinates close to the geographic position of the user, e.g. "Mall of Emirates", are ranked high while Web pages of the restaurants in the general suburb, e.g. "Jumeirah" are ranked lower. The Web pages for Italian restaurants in Dubai without any further knowledge of their position are ranked low. Web pages of restaurants for which location could not be determined are ranked lowest.

This embodiment is only of exemplary character to explain the invention. It is obvious to the person skilled in the art that variations of the above embodiment are possible to achieve the object of the invention. For example, in this embodiment term frequencies are normalized using the Euclidean norm. Other norms, in particular other p-norms, are applicable. In this embodiment, inverse document frequency was used for importance weighting. Other methods of weighting, and even omitting importance weighting, are applicable without losing the ability to achieve the invention's object. Forms of N-grams other than the bigrams used by the embodiment are applicable. In this embodiment, a BOW representation of text is used. Any other representation transforming a text into a vector representation capable of being applied to machine learning methods is applicable.

Claims

1. Computer based method for providing geographic location information of a content of a data set, in particular a document, further in particular a web page, comprising the steps: a) providing a set of training data, comprising training data subsets, b) transforming the content of each training data subset in a format processible by a machine learning method, c) attributing each training data subset with geographical location information, d) removing geographic location information from the content of each training data subset, the geographic location information being conform to the geographic location information attributed to the training data subset, e) training the machine learning method by applying the set of training data to the machine learning method, f) providing a search data set for determining geographic location information of its content, g) transforming the content of the search data set in a format processible by the machine learning method, h) determining the geographic location information of the search data set's content by applying it to the machine learning method.

2. Method according to claim 1 wherein the machine learning method is a support vector machine.

3. Method according to one of claims 1 or 2 wherein steps b) and g) comprise one or more of the steps:

-transforming the content into (an attribute-value vector, wherein each distinct term of the content corresponds to an attribute whose value is a number of occurrences of the term in the data set) a collection of distinct terms associated with a number of occurrences of the term in the data set,

-aggregating terms with coinciding stems and summing their number of occurrences,

-removing terms with a number of occurrences less than a predetermined value,

-forming a collection of tuples of terms, each tuple being associated with a number of directly consecutive occurrences of the tuple's terms in the data set,

-transforming the scale for the number of occurrences, in particular transforming the scale for the number of occurrences into a logarithmic scale,

-other forms of weighting terms.

4. Computer based method for providing information with localized relevance in accordance with a character string, the information being from a set of digital data comprising subsets of data, the method comprising the steps: a) providing each subset of data with geographic location information, b) determining a result subset containing subsets of data in accordance with the character string, c) determining a geographic target position, d) ranking the set of subsets of data according to a relation of distances between the geographic location information of each subset of data and the geographic target position by applying a set of rules implementing a heuristic function, e) providing information to the subsets of data contained in the result subset ordered by the rank of each subset of data.

5. Method according to claim 4 wherein the set of digital data comprises documents, in particular web pages accessible via the internet and the subsets of data are formed by the documents.

6. Method according to one of claims 4 or 5 wherein the geographic target position is the position of a terminal utilized by a user conducting a search process.

7. Method according to one of claims 4 to 6 wherein the terminal is a mobile terminal and step c) comprises a step of receiving cell information of the mobile terminal from a mobile network provider.

8. Method according to one of claims 4 to 7 wherein the terminal is connected to or comprises a position detecting device and step c) comprises a step of receiving the position detected by the position detecting device.

9. Method according to one of claims 4 to 8 wherein the terminal comprises an interface provided with an address according to RFC 791 (IPv4) or RFC 2460 (IPv6) and step c) comprises a step of determining the region in which said address is assigned.

10. Method according to one of claims 4 to 9 wherein step a) comprises the step of determining the geographic location information to which a subset of data relates.

11. Method according to claim 10 wherein the determination of geographic local information comprises one or more of the steps:

- applying one or more Boolean expressions to the subset of data,

- expanding the Boolean expression by ontologies,

- completing fragmentary geographic location information by comparison with mercantile directories,

- determining a geographic position by transferring the geographic location information to a geographic information system.

12. Method according to one of claims 10 or 11 wherein the determination of geographic local information is accomplished by applying the method according to one of claims 1 to 3.

13. Method according to one of claims 4 to 12 wherein step d) comprises a step of applying an expert system.

14. Method according to one of claims 1 to 3 wherein in step c) the geographical location information attributed to a subset of the set of training data is determined by applying step a) of the method according to claims 4 to 13.

15. Information analysis device performing the methods according to one of claims 1 to 14.

16. Information analysis device according to claim 15 comprising a terminal adapted to input a search request to the information analysis device.

17. Information analysis device according to claim 16 wherein the terminal is a mobile terminal connected to the information analysis device via a wireless network maintained by a mobile network provider.