(12) United States Patent (10) Patent No.: US 6,738,678 B1
Bharat et al. (45) Date of Patent: May 18, 2004
(54) METHOD FOR RANKING HYPERLINKED PAGES USING CONTENT AND CONNECTIVITY ANALYSIS
(76) Inventors: Krishna Asur Bharat, 470 Oak Grove Dr. #205, Santa Clara, CA (US) 95054; Monika R. Henzinger, 80 La Loma Dr., Menlo Park, CA (US) 94025
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
(21) Appl. No.: 09/007,635
(22) Filed: Jan. 15, 1998
(51) Int. Cl.7 G05B 13/02
(52) U.S. Cl. 700/48; 707/3; 707/4;
(58) Field of Search 700/48; 707/3,
707/5, 102, 501, 1, 7, 2, 4, 10, 500; 358/402,
(56) References Cited
U.S. PATENT DOCUMENTS
5,301,317 A * 4/1994 Lohman et al. 395/600
5,442,784 A * 8/1995 Powers et al. 707/102
5,495,604 A * 2/1996 Harding et al. 395/600
5,761,493 A * 6/1998 Blakeley et al. 395/604
5,873,081 A * 2/1999 Harel 707/3
5,937,422 A * 8/1999 Nelson et al. 707/531
5,953,718 A * 9/1999 Wical 707/5
OTHER PUBLICATIONS

Syu et al., "A Competition-Based Connectionist Model for Information Retrieval," IEEE, pp. 3301-3306, 1994.*

Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Proc. of ACM-SIAM Symposium on Discrete Algorithms, 1998 (to appear). Also appears as IBM Research Report RJ 10076, May 1997.

Frakes et al., "Information Retrieval, Data Structures and Algorithms," Prentice Hall, Englewood Cliffs, New Jersey.
* cited by examiner
Primary Examiner—William Grant
Assistant Examiner—McDieunel Marc
(57) ABSTRACT

A computerized method determines the ranking of documents including information content. The present method uses both content and connectivity analysis. An input set of documents is represented as a neighborhood graph in a memory. In the graph, each node represents one document, and each directed edge connecting a pair of nodes represents a linkage between the pair of documents. The input set of documents represented in the graph is ranked according to the contents of the documents. A subset of documents is selected from the input set of documents if the content ranking of the selected documents is greater than a first predetermined threshold. Nodes representing any documents, other than the selected documents, are deleted from the graph. The selected subset of documents is ranked according to the linkage of the documents, and an output set of documents exceeding a second predetermined threshold is selected for presentation to users.
48 Claims, 2 Drawing Sheets
METHOD FOR RANKING HYPERLINKED PAGES USING CONTENT AND CONNECTIVITY ANALYSIS
FIELD OF THE INVENTION
This invention relates generally to computerized information retrieval, and more particularly to ranking documents having related content.
BACKGROUND OF THE INVENTION
It has become common for users of client computers connected to the World Wide Web (the "Web") to employ Web browsers and search engines to locate Web pages having content of interest. A search engine, such as Digital Equipment Corporation's AltaVista search engine, indexes hundreds of millions of Web pages maintained by server computers all over the world. The users compose queries to specify a search topic, and the search engine identifies pages having content that satisfies the queries, e.g., pages that match on the key words of the queries. These pages are known as the result set.
In many cases, particularly when a query is short or not well defined, the result set can be quite large, for example, thousands of pages. For this reason, most search engines rank order the result set, and only a small number, for example twenty, of the highest ranking pages are actually returned at a time. Therefore, the quality of search engines can be evaluated not only on the number of pages that are indexed, but also on the usefulness of the ranking process that determines which pages are returned.
Sampling of search engine operation has shown that most queries tend to be quite short, on the average about 1 to 2 words. Therefore, there is usually not enough information in the query itself to rank the pages of the result set. Furthermore, there may be pages that are very relevant to the search that do not include the specific query words. This makes ranking difficult.
In Information Retrieval, some approaches to ranking have used relevance feedback supplied by users. This requires the user to supply feedback on the relevance of some of the results that were returned by the search in order to iteratively improve ranking. However, studies have shown that users of the Web are reluctant to provide relevance feedback.
In one prior art technique, an algorithm for connectivity analysis of a neighborhood graph (n-graph) is described in J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998, and also in IBM Research Report RJ 10076, May 1997. The algorithm analyzes the link structure, or connectivity, of Web pages "in the vicinity" of the result set to suggest useful pages in the context of the search that was performed.
The vicinity of a Web page is defined by the hyperlinks that connect the pages. A Web page can point to other pages, and the page can be pointed to by other pages. Close pages are directly linked; farther pages are indirectly linked. These connections can be expressed as a graph where the nodes represent the pages, and the directed edges represent the links.
Specifically, the algorithm attempts to identify "hub" and "authority" pages. Hubs and authorities exhibit a mutually reinforcing relationship: a good hub page is one that points to many good authorities, and a good authority page is one that is pointed to by many good hubs. Kleinberg constructs a graph for a specified base set of hyperlinked pages. Using an iterative algorithm, an authority weight x and a hub weight y are assigned to each page when the algorithm converges.
When a page points to many pages with large x values, the page receives a large y value and is designated as a hub. When a page is pointed to by many pages with large y values, the page receives a large x value and is designated as an authority. The iterative weights can be ranked to compute "strong" hubs and authorities.
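The mutually reinforcing iteration described above can be sketched as follows. This is a minimal illustration of a Kleinberg-style hub/authority computation, not the patented method; the dict-based graph representation, the fixed iteration count, and the normalization choice are assumptions made for the sketch.

```python
# Sketch of an iterative hub/authority computation in the style of
# Kleinberg's algorithm. `links` maps each page to the set of pages
# it points to; page names are purely illustrative.

def hits(links, iterations=50):
    """Return (authority weights x, hub weights y) for all pages."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(iterations):
        # A page pointed to by many pages with large y values
        # receives a large x value (authority).
        x = {p: sum(y[q] for q in pages if p in links.get(q, ())) for p in pages}
        # A page pointing to many pages with large x values
        # receives a large y value (hub).
        y = {p: sum(x[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the weights converge rather than grow without bound.
        nx = sum(v * v for v in x.values()) ** 0.5 or 1.0
        ny = sum(v * v for v in y.values()) ** 0.5 or 1.0
        x = {p: v / nx for p, v in x.items()}
        y = {p: v / ny for p, v in y.items()}
    return x, y
```

Ranking the resulting x and y values then identifies the "strong" authorities and hubs, respectively.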
However, there are some problems with Kleinberg's algorithm, which is strictly based on connectivity. First, there is the problem of topic drift. For example, a user composes a query including the key words "jaguar" and "car." The graph will tend to have more pages that talk about "cars" than specifically about "jaguars." These self-reinforcing pages will tend to overwhelm pages mentioning "jaguar," causing topic drift.
Second, it is possible to have multiple "parallel" edges connected from a certain host to the same authority or the same hub. This occurs when a single Web site stores multiple copies or versions of pages having essentially the same content. In this case, the single site has undue influence, hence, the authority or hub scores may not be representative.
Therefore, it is desired to provide a method which precisely identifies the content of pages related to a topic specified in a query without having a local concentration of pages influence the outcome.
SUMMARY OF THE INVENTION
Provided is a method for ranking documents including information content. The method can be used to rank documents such as Web pages maintained by server computers connected to the World Wide Web. The method is useful in the context of search engines used on the Web to rank result sets generated by the search engines in response to user queries. The present ranking method uses both content and connectivity analysis.
The method proceeds as follows. An input set of documents is represented as a neighborhood graph in a memory. In the graph, each node represents one document, and each directed edge connecting a pair of nodes represents a linkage between the pair of documents. A particular document can point to other documents, and other documents can point to the particular document. There are no edges between documents on the same site.
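The graph construction above, including the rule that edges between same-site documents are omitted, can be sketched as follows. This is an illustrative implementation, not the patent's own code; comparing URL host names is one plausible way to define "same site," and the URLs are made up for the example.

```python
# Build a neighborhood graph from hyperlinks, dropping edges between
# documents hosted on the same site so a single site cannot reinforce
# itself through internal links. URLs here are illustrative.

from urllib.parse import urlparse

def build_neighborhood_graph(link_pairs):
    """link_pairs: iterable of (source_url, target_url) hyperlinks."""
    graph = {}
    for src, dst in link_pairs:
        # Same-site edge: skip it entirely.
        if urlparse(src).netloc == urlparse(dst).netloc:
            continue
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())  # ensure the target is a node too
    return graph
```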
The input set of documents represented in the graph is ranked according to the content of the documents based on their match to a certain topic. Ranking can be done using either a vector space model or a probabilistic model. A subset of documents is selected from the input set of documents if the content ranking of the selected documents is greater than a first predetermined threshold. Nodes representing any documents, other than the selected documents, are deleted from the graph.
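The content-ranking and pruning step can be sketched with a simple vector space model, one of the two model families named above: documents and the query topic become bag-of-words vectors, cosine similarity serves as the content score, and nodes scoring at or below the threshold are deleted from the graph. The tokenization and the threshold value are assumptions for illustration only.

```python
# Content-based pruning sketch: score each document against the topic
# with a bag-of-words cosine similarity, then delete all nodes (and
# their incident edges) whose score does not exceed the threshold.

from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two term-frequency Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prune_by_content(graph, texts, topic, threshold=0.1):
    """Keep only nodes whose content similarity to the topic exceeds the threshold."""
    topic_vec = Counter(topic.lower().split())
    keep = {d for d in graph
            if cosine(Counter(texts[d].lower().split()), topic_vec) > threshold}
    # Rebuild the graph over the surviving nodes only.
    return {d: {n for n in nbrs if n in keep}
            for d, nbrs in graph.items() if d in keep}
```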
The selected subset of documents is ranked according to the linkage of the documents, and an output set of documents exceeding a second predetermined threshold is selected for presentation to users.
In one aspect of the invention, the input set of documents includes a result set of Web pages generated by a Web search engine in response to a user query, and pages directly linked to the result set. The rank of a particular document is based on the similarity of the content of the particular document