WO2001011444A2 - System and method for searching and indexing world-wide-web pages - Google Patents

System and method for searching and indexing world-wide-web pages

Info

Publication number
WO2001011444A2
Authority
WO
WIPO (PCT)
Prior art keywords
url
word
bookmark
pages
searching
Prior art date
Application number
PCT/US2000/021770
Other languages
French (fr)
Other versions
WO2001011444A3 (en
Inventor
Jonathan K. Kilberg
Christopher E. Seline
Original Assignee
2Wrongs.Com, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 2Wrongs.Com, Inc. filed Critical 2Wrongs.Com, Inc.
Priority to AU66277/00A priority Critical patent/AU6627700A/en
Publication of WO2001011444A2 publication Critical patent/WO2001011444A2/en
Publication of WO2001011444A3 publication Critical patent/WO2001011444A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562Bookmark management

Abstract

A system and method for indexing, searching and performing related operations on the Internet uses an automated browser (10) to find Internet web pages that are bookmark files (20). These may be identified by being of a particular type of file, such as a 'NETSCAPE' bookmark file, or as being a list of URL's found outside the server where the bookmark resides, or otherwise. The web pages (30) corresponding to the URL's contained within the bookmark files (20) are parsed for relevant information and a database is constructed based thereon. Another database (40) may be constructed to associate categories contained within found bookmark files with URL's contained within the categories. The databases are searched by a user who inputs a query, and relevant URL's and other information are displayed.

Description

SYSTEM AND METHOD FOR SEARCHING AND INDEXING WORLD-WIDE-WEB
PAGES
FIELD OF THE INVENTION
The present invention relates to a system and method for searching world-wide-web
(WWW) pages. The method also includes indexing, databasing, categorizing, and ranking web
pages.
BACKGROUND
Use of the Internet, and more particularly the world-wide-web, has increased dramatically
in recent years. Web pages may be found on any conceivable subject for many purposes, such as
to advertise commercial products and serve as a conduit between companies and their customers;
to provide services such as news reporting; to provide direct entertainment such as Internet radio
stations; to provide educational opportunities; and to provide information of general interest on a
particular subject.
As is well known, the WWW is made up of numerous web pages that include text,
graphics, and sometimes other multimedia information. Web pages are uniquely identified by an
address called the URL (uniform resource locator). A typical URL is: http://www.2wrongs.com,
wherein http:// specifies the type of transfer protocol, and 2wrongs.com is the Internet domain
name. Other typical URL's may have directories, subdirectories, and file names specified. A
web browser, such as "NETSCAPE NAVIGATOR" or "MICROSOFT INTERNET
EXPLORER," retrieves and displays the contents of a web page using the URL.
A problem faced by Internet users is finding the URL's of web pages that are of interest.
Users currently have several options to search for desired web pages. Perhaps the simplest method is to enter a URL that seems to correspond to the subject of interest, and see what web
page is retrieved, if any. For instance, a user who is interested in patent law might enter
http://patents.com or http://www.patents.com, and see what appears. There are a number of
limitations to this technique. For example, it only retrieves one web page, instead of providing a
list of the many pages that may be of interest. Also, it is of use only for the simplest searches,
those that can be represented by one word that accurately describes the subject of interest. The fact
that this method is used at all is perhaps a reflection of the weaknesses of some of the other
methods described below.
Another method to search the web is to use an index, which is a collection of URL's that
have been categorized by subject matter, generally by human compilation and in a hierarchical
arrangement. A problem with indexes is that human compilation is time consuming, and is thus
expensive and not likely to include current web pages. This is a particular problem in light of the
fast growth of the Internet.
Another method to search the web is to use a search engine, which is a searchable
database that organizes, in some sense, information on at least a portion of the Internet. A search
engine system consists of a spider (an automated browser or agent) which traverses the Internet
gathering information about web pages (by following links to URL's), a database to store the
gathered information, and a search tool for searching the database. Search engines extract and
index information using a number of parameters. In general, they index some or all words found
in documents, and may also index a document's size, title, headings and subheadings, and other
information. While search engines are useful, they have several problems that have not been
overcome. The Internet contains such an enormous amount of information that it is not feasible to index every web page; thus, current search engines only index a relatively small portion of the
Internet. Information that is not indexed is ignored by the search engine. Also, the Internet
includes a large number of web pages that are probably of very little interest to anyone other than
their creator. Search engines nevertheless will, in general, index such web pages. While some
search engines attempt to rank, in some sense, the popularity of URL's, none do so similarly to
that of the present invention as disclosed below. Yet another search engine problem is that the
creators of some web pages will insert words into a web page simply for the purpose of
"tricking" search engines into retrieving the web pages, even when the subject of the web page
has nothing to do with the subject of the search. Taken together, these and other problems have
limited the usefulness of search engines. While search engines are widely used, a search engine
will typically retrieve the URL's of many web pages that are not of interest along with web pages
that are, and, as discussed above, may not find relevant web pages simply because a given web
page may not reside on a portion of the web that has been indexed by a particular search engine.
All documents, including any technical protocols such as http and HTML, referred to in
this document are hereby incorporated by reference in their entirety, although no documents are
admitted to render any of the claims unpatentable either alone or in combination with any other
references known to the applicant.
The present invention is described in the context of the Internet and the WWW.
However, it will be readily understood that the utility of the invention is not limited to the
Internet as that term is used to describe a particular computer network, but instead may be used
by any other computer network that utilizes hyperlinks and addresses similar to URL's, wherein
users collect lists of URL's.
SUMMARY
The present invention relates to a method of searching a computer network comprising a
number of pages. The method includes the steps of searching the network for pages that are
bookmark files; and creating a database based upon the contents of the bookmarked web pages. A
word database is created based upon the words contained within the bookmarked web pages,
relating the words to the URL's and other information contained within the web pages. Another
database may be created based upon the categories contained within the bookmark files.
A user may query the database(s) and retrieve web pages of interest that are ranked
according to a desired ranking scheme. A detailed description is provided below, along with
other aspects of the invention.
BRIEF DESCRIPTION OF THE DRAWING
The FIGURE is a schematic representation of the architecture of an embodiment of the
present invention.
DETAILED DESCRIPTION
The present invention is a system and method for searching, indexing and otherwise
classifying web pages on the Internet, and related computer networks.
Bookmarks
The invention uses an automated browser to search for files (web pages) that are
identified as primarily lists of URL's. Files of this type are referred to as "bookmark files"
herein. In general, bookmarks are a feature supported by most web browsers that allow a user to
save important links (as selected by the user) in a bookmark file so they can be found
immediately without the user having to look up the URL and type it into the browser. The user simply views the bookmark file and selects a link therefrom to go to the desired web page.
Bookmark files may be recognized as such by the automated browser in several ways.
The files may be recognized as being "NETSCAPE" bookmark files. Or, the files may be
recognized as being formatted similarly to a "NETSCAPE" bookmark file, even if they are not
formatted identically. Or, the files may be recognized as being a bookmark file supported by
another browser, such as Microsoft's "INTERNET EXPLORER" (called "favorites"), or as files
being similarly formatted. Similarly formatted files may include, for example, lists of hyperlinks
to URL's found outside the server where the bookmark file resides, which can be determined by
comparing the URL of the bookmark file with the listed URL's.
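The recognition rules described above can be illustrated with a minimal Python sketch. It is only one possible reading of the text: the function name looks_like_bookmark_file, the helper class, and the thresholds min_links and min_external are assumptions made for illustration and do not appear in the specification.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def looks_like_bookmark_file(page_url, html, min_links=5, min_external=0.8):
    """Heuristic test for a bookmark file.

    min_links and min_external are illustrative thresholds (assumptions),
    not values taken from the specification.
    """
    # Rule 1: the page declares itself a NETSCAPE bookmark file.
    if "netscape-bookmark-file" in html.lower():
        return True

    # Rule 2: the page is mostly a list of links to other servers,
    # found by comparing each listed URL's host with the page's own host.
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    total = len(collector.hrefs)
    if total < min_links:
        return False
    own_host = urlparse(page_url).netloc.lower()
    external = 0
    for href in collector.hrefs:
        host = urlparse(href).netloc.lower()
        if host and host != own_host:  # relative links stay on the same server
            external += 1
    return external / total >= min_external
```

A real embodiment could add further rules, for example for "favorites" files of other browsers; the sketch covers only the two rules named in the text.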
Another useful feature of bookmark files is that they are, in general, categorized by the
user into folders related to a particular subject matter so that the user places the URL's of related
web pages into particular folders. The user creates as many folders as desired, names the folders,
and places URL's within the folders. For example, a computer user who is a patent attorney and
a rock climbing enthusiast might create: a folder titled "Patent and Trademark Office", wherein
URL's of frequently accessed web pages of the PTO are placed; another folder titled "Patents -
Non-PTO", wherein URL's of other frequently accessed web pages related to patents are placed;
and another folder titled "Rock Climbing", wherein URL's of web pages related to rock climbing
are placed. While the use of categories is not necessary, it assists the user in locating a specific
URL. Since a user may have a potentially unlimited number of URL's saved in the user's
bookmark file, it may be very time consuming and tedious to inspect a complete list of URL's to
find the one the user has in mind.
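As a hedged illustration of how user-defined folders might be read out of a NETSCAPE-style bookmark file, the following Python sketch maps each folder name to the URL's placed in it. The names BookmarkFolderParser and parse_bookmark_folders are hypothetical, and the sketch assumes the common layout in which an H3 heading names a folder and the DL list that follows holds its links.

```python
from html.parser import HTMLParser


class BookmarkFolderParser(HTMLParser):
    """Maps each folder name (<H3>) to the URLs listed in its <DL>."""

    def __init__(self):
        super().__init__()
        self.folders = {}      # folder name -> list of URLs
        self._stack = []       # names of the currently open <DL> lists
        self._pending = None   # folder name read from <H3>, awaiting its <DL>
        self._in_h3 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self._pending = ""
        elif tag == "dl":
            # A <DL> opens the folder named by the preceding <H3>, if any.
            name = self._pending.strip() if self._pending else None
            self._stack.append(name)
            self._pending = None
        elif tag == "a" and self._stack and self._stack[-1]:
            href = dict(attrs).get("href")
            if href:
                self.folders.setdefault(self._stack[-1], []).append(href)

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "dl" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._in_h3:
            self._pending += data


def parse_bookmark_folders(html):
    """Returns a dict mapping folder name to the list of URLs in that folder."""
    parser = BookmarkFolderParser()
    parser.feed(html)
    parser.close()
    return parser.folders
```

Links that sit outside any named folder are ignored in this sketch; an embodiment could instead file them under a default category.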
Downloading Bookmark Files and Other Web Pages
As discussed above, the automated browser searches the Internet to find bookmark files.
The automated browser downloads and locally stores each bookmark file. The automated
browser also locally downloads the web pages stored at each URL contained in each bookmark.
The automated browser continues searching the Internet to find additional bookmark
files. This can be done in several ways. It can be done by selecting an IP address at random, or
according to any predefined search criteria, determining if the selected address is a bookmark
file, downloading if so, and repeating the process. It can also be done by feeding the automated
browser a certain URL and having the browser follow links attached to the URL, and then follow
links attached to these URL's, and so on. Both of these methods, or potentially other methods,
may be used. In practice, automated browsers are well understood and are used by current
generation search engines (although not for the purpose of the present invention), so that one
skilled in the art will understand how an automated browser can search the Internet and
download bookmark files and web pages whose URL's are bookmarked in the bookmark files.
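The link-following approach can be sketched in Python as a breadth-first traversal. This is a minimal sketch under stated assumptions: fetch_links and is_bookmark_file are hypothetical callables supplied by the caller (the first returns the URL's linked from a page, the second decides whether a page is a bookmark file), and max_pages is an illustrative limit that is not part of the specification.

```python
from collections import deque


def crawl_for_bookmark_files(seed_url, fetch_links, is_bookmark_file, max_pages=1000):
    """Follows links breadth-first from seed_url and returns bookmark-file URLs.

    fetch_links(url) -> iterable of URLs linked from url (hypothetical helper).
    is_bookmark_file(url) -> bool (hypothetical helper).
    """
    seen = {seed_url}          # URLs already queued, to avoid revisiting
    queue = deque([seed_url])
    found = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        if is_bookmark_file(url):
            found.append(url)
        # Follow the links on this page, then the links on those pages, and so on.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return found
```

Passing the network access in as functions keeps the traversal logic separate from downloading, so the same loop could be driven by a live fetcher or by stored pages.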
While it is not necessary to download all (or substantially all) of the bookmark files on
the Internet and the corresponding bookmarked web pages, it can be appreciated that an
embodiment of the present invention can download all or substantially all of those webpages
using far fewer resources than are presently employed by search engines, which attempt to
download as much of the Internet as possible (at least the portion of the Internet containing user
input text, as opposed to graphics files, Internet resource files, or the like). Further, the
downloaded information of the present invention only corresponds to web pages that have been
bookmarked by Internet users, and so are deemed to be useful. Thus, "junk" that is of little or no
interest to anyone other than the creator is not likely to be included in the downloaded information.
Database Compilation
A. Word Database
A word database is compiled by cross-referencing each word of each downloaded page to
a file specific to each unique word. Thus, each unique word has a file associated with it. (If
desired, the database can be constructed to exclude unique or very rare words, as defined by
determining the relative frequency of the words, dictionary entries, or otherwise, in order to
conserve resources.) The file has multiple entries, each entry corresponding to: the URL of a
page that has at least one reference to the word; the HTML title of the page; the length of the
page in both number of words and bytes; the word number of each instance of the word with data
specific to each instance including font, case, and HTML type; the date the file was retrieved;
and the frequency that the URL was found within the downloaded bookmark files (i.e., the same
URL may appear in many users' bookmark files, and this would tend to indicate a more popular
web page). A sample database entry of a word file for the word "wrongs" appears below. The
entry would be one entry stored in a file named, for example, "wrongs.txt."
a | b | c | d | e | f | g | h |
45.23|www.2wrongs.com|2wrongs.com Homepage|45a123f145a178f232a|1050|4|75|43|
a. Precompiled ranking of word
b. URL
c. Title
d. Occurrences and font of each word (45a = 45th word with font type a)
e. Number of words in page
f. Size of page in kilobytes (rounded)
g. Day, from zeroth day, the file was retrieved
h. # occurrences of URL in bookmarks
Subpart d labeled above is interpreted to mean that in the web page found at
www.2wrongs.com, the word "wrongs" appears in the text of that web page as the 45th word
(having font type a which can be any predefined font type), the 123rd word (having font type f
which can be another predefined font type), the 145th word (having the font type a), the 178th
word (having the font type f), and the 232nd word (having the font type a). As another example,
if a downloaded URL included the text: "The dog is eating the sock" the "the.txt" file would
have a subpart d entry of |1a5f| (i.e., the word "the" is the first word of the sentence with font
type a (here normal font), and is the fifth word of the sentence with font type f (here italics)); the
"dog.txt" file would have a subpart d entry of |2a|; the "is.txt" file would have a subpart d entry of
|3a|; and so on.
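The subpart d encoding lends itself to a short decoding sketch in Python. The function name decode_occurrences is hypothetical, and the sketch assumes that each font type is a single lowercase letter, which matches the examples given but is not stated as a requirement.

```python
import re


def decode_occurrences(field):
    """Decodes a subpart d string into (word number, font type) pairs.

    Assumes one lowercase letter per font type (an assumption drawn from
    the examples in the text).
    """
    pairs = re.findall(r"(\d+)([a-z])", field)
    # Reject input that is not made up entirely of number-letter pairs.
    if "".join(number + font for number, font in pairs) != field:
        raise ValueError("malformed occurrence field: %r" % field)
    return [(int(number), font) for number, font in pairs]
```

For instance, decoding the sample entry for "wrongs" yields the 45th, 123rd, 145th, 178th and 232nd word positions with their font types, in the order stored.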
The word "wrongs" would have a similar entry for each URL that is included within the
database that contains at least one instance of the word "wrongs." Thus, the database has a file
for each unique word; and the file for each unique word has an entry for each URL that includes
that unique word in the text of the web page (or domain name, title, etc.) identified by the URL.
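A minimal Python sketch of this cross-referencing is given here, with an in-memory dictionary standing in for the per-word files. The function name build_word_database, the input shape (a list of word and font-code pairs per URL), and the normalization of words to lowercase letters and digits are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_word_database(pages):
    """Builds a word -> {url: occurrence string} mapping.

    pages: dict mapping a URL to its tokens in reading order, each token
    being a (word, font_code) pair. This input shape is an assumption;
    the specification does not prescribe one.
    """
    database = defaultdict(dict)
    for url, tokens in pages.items():
        per_word = defaultdict(list)
        for position, (word, font) in enumerate(tokens, start=1):
            # Normalize so that "The" and "the" share one word file.
            key = re.sub(r"[^a-z0-9]", "", word.lower())
            if key:
                per_word[key].append("%d%s" % (position, font))
        for key, occurrences in per_word.items():
            # One entry per URL in each word's "file", in subpart d form.
            database[key][url] = "".join(occurrences)
    return dict(database)
```

The remaining fields of an entry (title, page length, retrieval date, bookmark frequency) could be attached to each URL in the same pass; they are omitted to keep the sketch short.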
It should be understood that the classification of information contained in the downloaded
web pages as described is a specific embodiment of the invention, and could be modified either
by including more or less information. For example, it may be desirable to limit the information to
the first 100 (for example) words of text in each web page, to conserve space. Or, only headings and subheadings could be indexed. More generally, indexing web pages for the purpose of
constructing a database for use with a search engine is a known procedure, and any procedure
currently used or that becomes known may be incorporated as an aspect of the present invention.
B. Categories
In addition to the word database, the present invention may also include a category
database that includes categories of URL's. These categories may be determined in a first step by
using each (or any desired subset) of the categories that are included within each of the downloaded
bookmark files. For example, in the sample bookmark file of a typical Internet user named Yu Wu
attached hereto as the Appendix, the file includes the categories: Reference, Search Engine, and
News & TV. Another sample bookmark file might include the categories: Search Engine, Law,
and MP3. The invention would then include the categories: Reference, Search Engine, News &
TV, Law, and MP3. The category Search Engine would have the bookmarks of both Yu Wu (here
Excite; AOL NetFind; AltaVista Search: Main Page; Infoseek; and Yahoo!) and the other user,
since each user used that category. Each other category would have the bookmarks of either Yu
Wu or the other user. Thus, the concept can easily be extended to include each of the categories
that are included in each downloaded bookmark file.
The categories can be further combined or otherwise manipulated by human or automatic
intervention in order to overcome some potential complications. In particular, the categories can
be manipulated to prevent multiple similar categories which might otherwise result because of
users' use of synonymous terms. For example, three different users might use three different
categories such as Bicycling, Biking, or Cycling to denote web pages having the same activity.
Or, categories could be translated from one language to another. The significant point is that the resulting category database uses the categorization of the downloaded bookmark files to create a
database of categories and associated URL's.
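The merging of categories across bookmark files, including the folding of synonymous names, might be sketched in Python as follows. The function name build_category_database and the synonyms mapping are hypothetical; the specification leaves open whether synonyms are resolved by human or automatic means.

```python
from collections import defaultdict


def build_category_database(bookmark_files, synonyms=None):
    """Merges per-user folders into one category database.

    bookmark_files: iterable of dicts mapping a folder name to its URLs.
    synonyms: optional dict mapping a folder name to a canonical category
    name (a hypothetical stand-in for human or automatic intervention).
    Returns a dict mapping category -> {url: number of folders listing it}.
    """
    synonyms = synonyms or {}
    database = defaultdict(lambda: defaultdict(int))
    for folders in bookmark_files:
        for category, urls in folders.items():
            canonical = synonyms.get(category, category)
            for url in set(urls):  # count a URL once per folder
                database[canonical][url] += 1
    return {category: dict(urls) for category, urls in database.items()}
```

Keeping a count per URL, rather than a bare list, lets a later search rank the URL's in a category by how often they were bookmarked.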
Searching the Database
A. Word Database
A user can search the database described above to search and rank web pages. It is noted
at the outset that search engines are presently known to search databases to find and rank web
pages in response to a user input search criteria, and the present invention can use any search
method known or developed as well as the particular methods disclosed herein. In general, a
user enters one or more keywords that are intended to describe the information that the user
wants to find. The database is searched based on the keyword(s), and results are returned in
HTML pages. The results are generally ranked according to some ranking system. The results
generally include the URL's of the relevant web pages, and often include the first several sentences
of the web pages and/or the title of the web pages.
If one keyword is selected by the user, then the database may retrieve a desired number of
results, such as ten, and display them to the user. The results may be retrieved by the
precompiled ranking associated with each entry of the relevant word file. Referring to the entry
for the word file "wrongs" stored in the file wrongs.txt, the entry for the URL 2wrongs.com has a
precompiled ranking that is stored in the field denoted by reference character "a", and is an
arbitrary ranking to order the entries composing the word files. The user may select additional
results after viewing the initially retrieved results. Such a display is simple to implement and
requires relatively little computing power.
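Retrieval by precompiled ranking can be sketched in Python as parsing each pipe-delimited entry of a word file and returning the highest-ranked entries. The names parse_entry and top_results and the dictionary keys are hypothetical, and the sketch assumes that a larger value in field a means a better ranking, which the specification does not state.

```python
def parse_entry(line):
    """Parses one word-file entry with fields a through h separated by '|'."""
    text = line.strip()
    if text.endswith("|"):
        text = text[:-1]  # drop the trailing delimiter
    fields = text.split("|")
    if len(fields) != 8:
        raise ValueError("expected 8 fields, got %d" % len(fields))
    a, b, c, d, e, f, g, h = fields
    return {
        "rank": float(a),          # a. precompiled ranking
        "url": b,                  # b. URL
        "title": c,                # c. title
        "occurrences": d,          # d. occurrences and fonts
        "word_count": int(e),      # e. number of words in page
        "size_kb": int(f),         # f. size in kilobytes
        "retrieved_day": int(g),   # g. day the file was retrieved
        "bookmark_count": int(h),  # h. occurrences of URL in bookmarks
    }


def top_results(lines, limit=10):
    """Returns up to `limit` entries, best precompiled ranking first.

    Assumes a higher value in field a is better (an assumption).
    """
    entries = [parse_entry(line) for line in lines if line.strip()]
    entries.sort(key=lambda entry: entry["rank"], reverse=True)
    return entries[:limit]
```

If the word files were stored already sorted by field a, the sort could be skipped and only the first entries read, which is consistent with the low computing cost noted above.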
Alternatively, a ranking system can be used. A presently preferred ranking system is described below.
For single word queries:
$$\mathrm{RANK}_i = T + Rf + U + D + S + RP$$
For multiple word queries:
$$\mathrm{RANK}_{total} = \sum_{i=1}^{I} \mathrm{RANK}_i \cdot Rf_i \cdot A$$
where, for words with common URL's, A is given by the expression in Figure imgf000012_0001 (an image in the original that is not reproduced in this text), in which
word_ab = the word number of the b-th occurrence of the a-th word from the query
I = number of words in query
X_1 = number of instances of word i in the URL
X_2 = number of instances of word q in the URL
else, if URL's are not common,
$$A = 1$$
The definitions of the independent variables are as follows:
$$T = w_1 \cdot (\#t \text{ in title})$$
$$Rf = \begin{cases}
w_2 \cdot (\#t \text{ in Page}) \,/\, (\mathrm{PageSize}/3000) & \text{if } \mathrm{PageSize} > 3000 \\
w_2 \cdot (\#t \text{ in Page}) \,/\, (\mathrm{PageSize}/2000) & \text{if } 2000 < \mathrm{PageSize} < 3000 \\
w_2 \cdot (\#t \text{ in Page}) \,/\, (\mathrm{PageSize}/1000) & \text{if } 1000 < \mathrm{PageSize} < 2000 \\
w_2 \cdot (\#t \text{ in Page}) \,/\, (\mathrm{PageSize}/500) & \text{if } \mathrm{PageSize} < 1000
\end{cases}$$
$$U = w_3 \cdot (\#t \text{ in URL})$$
$$D = w_4 \cdot (\#t \text{ in Domain name})$$
$$S = w_5 \cdot (\#\text{ forward slashes in URL})$$
$$RP = \begin{cases}
w_6 \cdot P \,/\, \bigl(40 - \bigl(1.1 \cdot (10 - x) + 1.2/(10 - x)\bigr)\bigr) & \text{if } x \le 10 \\
P \cdot w_6 & \text{if } x > 10
\end{cases}$$
P = relative number of times the URL occurs in all bookmarks
#t = number of times the word occurs
w_1 to w_6 = variables to adjust for customized ranking (i.e., weighting indexes).
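The single word ranking can be illustrated with a hedged Python sketch. Several points are assumptions rather than statements of the specification: the function and parameter names are hypothetical; page sizes exactly equal to 1000, 2000 or 3000 are not covered by the stated conditions and are placed in the lower bracket here; the variable x is not defined in the text and is simply passed through; and because the x <= 10 expression divides by zero at x = 10, the sketch applies the x > 10 expression in that case.

```python
def relative_frequency(count_in_page, page_size, w2):
    """Rf: occurrences in the page, scaled by a size-dependent divisor."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Boundary sizes (1000, 2000, 3000) fall into the lower bracket: assumption.
    if page_size > 3000:
        divisor = 3000
    elif page_size > 2000:
        divisor = 2000
    elif page_size > 1000:
        divisor = 1000
    else:
        divisor = 500
    return w2 * count_in_page / (page_size / divisor)


def bookmark_popularity(p, x, w6):
    """RP: popularity term; x is undefined in the text and passed through."""
    if x < 10:
        return w6 * p / (40 - (1.1 * (10 - x) + 1.2 / (10 - x)))
    # x == 10 is handled here to avoid division by zero (assumption).
    return p * w6


def single_word_rank(in_title, in_page, page_size, in_url, in_domain,
                     slashes, p, x, weights):
    """RANK_i = T + Rf + U + D + S + RP for one word-file entry."""
    w1, w2, w3, w4, w5, w6 = weights
    t = w1 * in_title
    rf = relative_frequency(in_page, page_size, w2)
    u = w3 * in_url
    d = w4 * in_domain
    s = w5 * slashes
    rp = bookmark_popularity(p, x, w6)
    return t + rf + u + d + s + rp
```

Since the text says the slash count gives the front page of a domain priority over deeper pages, a caller would presumably choose a negative w5; the sketch leaves all six weights to the caller.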
Briefly discussing the above, it can be seen that the ranking for a single word query
reviews each entry for the word file for the single word, and increases the ranking of an entry
depending upon: the number of times the word appears in the title of the web page; the relative
frequency that the word appears in the text of the web page; the number of times the word
appears in the URL; the number of times the word appears in the domain name; the number of
forward slashes in the URL (which gives the front page of a domain priority, rather than a page several directories deep); and the relative number of times the URL appears in the downloaded
bookmark files, which corresponds to a relatively more popular web page. The ranking system
for multiple word queries uses generally the same parameters, and also ranks web pages more highly
depending upon the proximity of the queried words to each other in a given web page.
B. Categories
A user can also search the database to retrieve URL's (and any associated information) by
searching the category database described above. The search can retrieve all URL's associated
with a user query that is associated with a category that is stored in the database. The search
results may be displayed arbitrarily, or may be ranked by the number of times a URL is included
in the downloaded bookmark files. The category database may be searched independently from
the word database. Or, a search may be performed in both the word database and the category
database, with the results being determined by combining the results of each database search in
any desired way. By way of illustration, the results could be ranked as for the word database,
with additional weighting given to URL's that are also found in the categories
database. It will be apparent that any ranking system including the results of searching both the
word and the category database may be used.
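One way of combining the two searches can be sketched in Python. The function name combine_results and the multiplicative boost are assumptions chosen for illustration; the text permits any way of weighting URL's found in both databases.

```python
def combine_results(word_scores, category_urls, boost=1.5):
    """Ranks word-database results, favoring URLs also in the category results.

    word_scores: dict mapping URL -> score from the word database search.
    category_urls: set of URLs returned by the category database search.
    boost: hypothetical multiplier applied to URLs found in both.
    """
    combined = {
        url: score * boost if url in category_urls else score
        for url, score in word_scores.items()
    }
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

An additive bonus, or a separate listing of category-only URL's, would fit the description equally well.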
Schematic Representation
With reference to the FIGURE, a schematic representation of an embodiment according
to the present invention is described. It should be understood that the schematic representation is
provided solely for the purpose of explaining an embodiment of the invention, and that the
invention is not limited to any particular architecture.
The Internet, denoted I, includes web pages, denoted W1 to Wn (representing all of the web pages on the WWW). The automated browser, denoted with reference numeral 10, crawls
the Internet I and examines the pages W1 to Wn in order to find which pages are bookmark files,
denoted with reference numeral 20. These bookmark files are downloaded and stored locally.
The URL's contained within the bookmark files 20 are used to download the web pages
associated with those URL's, identified with reference numeral 30.
The bookmark files 20 are parsed to create a categories database, denoted with reference
numeral 40, as described above. The categories database 40 associates user defined categories
with the URL's stored in the user defined categories, subject to human or automatic manipulation
to combine synonyms or to perform other actions. The downloaded web pages 30 are parsed to
create a word database, denoted with reference numeral 50, as described above. The word
database 50 creates a file for each unique word, each file having an entry for each web page
containing at least one instance of the word.
A user U interacts with a search engine, denoted with reference numeral 60, to
search either one or both of the databases 40 and 50. The user U queries the search engine 60
using one or more search terms, and results from the databases 40 and 50 are returned to the user.
While the Internet I, user U, and components 10 - 60 are shown as being connected in a certain
configuration, it should be understood that that is simply one configuration and that any other
configuration could be used as an alternative embodiment of the invention.
Conclusion
It should be understood that a representative embodiment of the invention is disclosed,
and that the scope of the invention should not be unduly limited to the disclosed embodiment.
For example, certain operations are described as being performed locally, after downloading information. It will be understood by those skilled in the art that the method disclosed herein can
be performed by software and hardware at any location. It also will be understood that a number
of useful features are disclosed herein, and that not every feature need be incorporated into a
useful product in order to fall within the scope of the present invention. For example, both the
category and word databases have been described. However, a useful product could only include
one or the other database. Additional modifications will be obvious to one skilled in the art.
Appendix
The following is a typical user bookmark file, shown first as it would be displayed upon a
user's screen and then with the HTML tags shown.
As displayed by Web Browsers
Yu Wu's Bookmarks
Reference
Free On-line Dictionary of Computing from FOLDOC
GMU Patron databases
Eric's Treasure Trove of Mathematics
Hypertext Webster Interface
XLibris on the Web
Search Engine
Excite
AOL NetFind
AltaVista Search: Main Page
Infoseek
Yahoo!
News & TV ...
CNN Interactive
Yahoo! - Reuters Hourly News Summary
Welcome to WashingtonPost.com
Plain Text
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file. It will be read and overwritten. Do Not Edit! -->
<TITLE>Yu Wu's Bookmarks</TITLE>
<H1>Yu Wu's Bookmarks</H1>
<DL><p>
<DT><H3 FOLDED ADD_DATE="869594090">Reference</H3>
<DL><p>
<DT><A HREF="http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?Free+On-line+Dictionary" ADD_DATE="869603871" LAST_VISIT="870100656" LAST_MODIFIED="869603866">Free On-line Dictionary of Computing from FOLDOC</A>
<DT><A HREF="http://library.gmu.edu/lib/dbase/local.html" ADD_DATE="869595248" LAST_VISIT="870099395" LAST_MODIFIED="869595207">GMU Patron databases</A>
<DT><A HREF="http://www.astro.virginia.edu/~eww6n/math/ghindex.html" ADD_DATE="869436954" LAST_VISIT="869437657" LAST_MODIFIED="869437657">Eric's Treasure Trove of Mathematics</A>
<DT><A HREF="http://work.ucsd.edu:5141/cgi-bin/http_webster" ADD_DATE="869436954" LAST_VISIT="870101208" LAST_MODIFIED="869594227">Hypertext Webster Interface</A>
<DT><A HREF="http://bluehaze.wrlc.org/webpac-bin/wgbroker?new+-access+top" ADD_DATE="850331540" LAST_VISIT="855072896" LAST_MODIFIED="850331532">XLibris on the Web</A>
</DL><p>
<DT><H3 FOLDED ADD_DATE="869593680">Search Engine</H3>
<DL><p>
<DT><A HREF="http://www.excite.com" ADD_DATE="870302095" LAST_VISIT="870302084" LAST_MODIFIED="870302084">Excite</A>
<DT><A HREF="http://nearnet.gnn.com/search/" ADD_DATE="869611014" LAST_VISIT="869611002" LAST_MODIFIED="869611002">AOL NetFind</A>
<DT><A HREF="http://www.altavista.digital.com/" ADD_DATE="869605211" LAST_VISIT="869605190" LAST_MODIFIED="869605190">AltaVista Search: Main Page</A>
<DT><A HREF="http://www.infoseek.com/" ADD_DATE="869604884" LAST_VISIT="870720350" LAST_MODIFIED="869604879">Infoseek</A>
<DT><A HREF="http://www.yahoo.com/" ADD_DATE="858107417" LAST_VISIT="870720901" LAST_MODIFIED="858107414">Yahoo!</A>
</DL><p>
<DT><H3 FOLDED ADD_DATE="869594585">News & TV ...</H3>
<DL><p>
Figure imgf000018_0001
<DT><A HREF="http://www.yahoo.com/headlines/current/news/summary.html" ADD_DATE="869611179" LAST_VISIT="869611146" LAST_MODIFIED="869611146">Yahoo! - Reuters Hourly News Summary</A>
<DT><A HREF="http://www.washingtonpost.com/" ADD_DATE="862020505" LAST_VISIT="866206162" LAST_MODIFIED="862020495">Welcome to WashingtonPost.com</A>
</DL><p>
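
A bookmark file of the kind shown above could be parsed into user defined categories and their URL's along the following lines. This is a minimal Python sketch using the standard library HTML parser. The class name, the function name, and the choice to file a bookmark under its innermost folder (or under an empty name when it sits outside every folder) are assumptions made for this example.

```python
from html.parser import HTMLParser

class BookmarkParser(HTMLParser):
    """Collects {category: [(url, title), ...]} from a Netscape-style bookmark file."""

    def __init__(self):
        super().__init__()
        self.categories = {}
        self._stack = []       # folder name for each currently open <DL>
        self._pending = None   # folder name from the most recent <H3>
        self._capture = None   # "h3" or "a" while collecting text
        self._text = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "a"):
            self._capture, self._text = tag, []
            if tag == "a":
                self._href = dict(attrs).get("href")
        elif tag == "dl":
            # A <DL> opens the folder named by the preceding <H3>, if any.
            self._stack.append(self._pending)
            self._pending = None

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            text = "".join(self._text).strip()
            if tag == "h3":
                self._pending = text
            elif self._href:
                # File the bookmark under the innermost named folder.
                folder = next((f for f in reversed(self._stack) if f), "")
                self.categories.setdefault(folder, []).append((self._href, text))
            self._capture = None
        elif tag == "dl" and self._stack:
            self._stack.pop()

def parse_bookmarks(html):
    parser = BookmarkParser()
    parser.feed(html)
    parser.close()
    return parser.categories
```

Applied to the listing above, the result would be expected to contain the keys "Reference", "Search Engine", and "News & TV ...", each mapped to the URL's and titles bookmarked under that heading.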

Claims

THE INVENTION CLAIMED IS:
1. A method of searching a computer network comprising a number of pages, the
method comprising the steps of:
(a) searching the network for pages that are bookmark files, each of the bookmark files having a
number of bookmarked URL's that identify corresponding bookmarked web pages; and
(b) creating a database based upon the contents of the bookmarked web pages.
2. The method of claim 1, further comprising the step of: searching the database in
response to a user query.
3. The method of claim 2, wherein step (a) uses an automated browser.
4. The method of claim 3, wherein step (a) identifies bookmark files by comparing
the searched pages with a known bookmark file format.
5. The method of claim 3, wherein step (a) identifies bookmark files by searching
each page having a server for a list of hyperlinks to URL's found outside the server of the
searched page.
6. The method of claim 3, wherein step (b) downloads the contents of each page that
is a bookmark file.
7. The method of claim 6, wherein the database includes a list of files corresponding
to words that are contained within the downloaded pages of step (b).
8. The method of claim 7, wherein each file corresponding to a word has an entry
corresponding to each downloaded web page that contains the word.
9. The method of claim 8, wherein each entry corresponding to each web page
includes the URL of the web page.
10. The method of claim 9, wherein each entry corresponding to each web page includes at least one of a group of parameters related to that page selected from the group
consisting of an HTML title of the page, a length of the page in number of words and bytes, a
word number of each instance of the word, a date the page was retrieved, and a frequency that
the URL was found within the downloaded pages.
11. The method of claim 10, wherein each entry corresponding to each web page
includes more than one of the parameters.
12. The method of claim 11, wherein the searching step (c) retrieves web page URL's
and ranks the retrieved URL's based upon the parameters.
13. The method of claim 12, wherein the user query is a single word query and the
ranks of the retrieved URL's are based upon: the number of times the word appears in a title of
the web page; the relative frequency that the word appears in a text of the web page; the
number of times the word appears in a URL; the number of times the word appears in a domain
name; the number of forward slashes in the URL; and the relative number of times the URL
appears in the downloaded bookmark files.
14. The method of claim 12, wherein the user query is a multiple word query
consisting of a group of words, and the ranks of the retrieved URL's are based upon: the number
of times that each of the group of words appears in a title of the web page; the relative frequency
that each of the group of words appears in a text of the web page; the number of times that
each of the group of words appears in a URL; the number of times that each of the group of
words appears in a domain name; the number of forward slashes in the URL's; and the relative
number of times the URL's appear in the downloaded bookmark files and a proximity of the group
of queried words to each other in a given web page.
15. The method of claim 3, wherein at least some of the bookmark files include categories, and further comprising the step of creating a database based upon the categories.
16. The method of claim 14, wherein step (c) includes searching the database based
upon the categories.
17. The method of claim 16, wherein step (c) includes ranking web pages based upon
the searching of both databases.
18. The method of claim 1, wherein step (a) searches the world-wide-web.
19. A method of searching a computer network comprising a number of pages, the
method comprising the steps of:
(a) searching the network for pages that are bookmark files, each of the bookmark
files having a number of bookmarked URL's that identify corresponding bookmarked web pages,
at least some of the bookmark files having categories categorizing at least some of the URL's;
and (b) creating a database based upon the contents of the bookmarked web pages including the
categories.
20. The method of claim 19, further comprising the step of: (c) searching the
database in response to a user query.
21. A system for searching a computer network comprising a number of pages, the
system comprising:
(a) an automated program for fetching bookmark files, the bookmark files
referencing web pages, the program crawling the web and the pages referenced by the bookmark
files;
(b) a database including information based upon the contents of the bookmark web
pages; and
(c) means for searching the database in response to a user query.
PCT/US2000/021770 1999-08-10 2000-08-09 System and method for searching and indexing world-wide-web pages WO2001011444A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU66277/00A AU6627700A (en) 1999-08-10 2000-08-09 System and method for searching and indexing world-wide-web pages

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US37159899A 1999-08-10 1999-08-10
US09/371,598 1999-08-10

Publications (2)

Publication Number Publication Date
WO2001011444A2 (en) 2001-02-15
WO2001011444A3 (en) 2001-09-27

Family

ID=23464612

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/021770 WO2001011444A2 (en) 1999-08-10 2000-08-09 System and method for searching and indexing world-wide-web pages

Country Status (2)

Country Link
AU (1) AU6627700A (en)
WO (1) WO2001011444A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10113902A1 (en) * 2001-03-21 2002-09-26 Matthias Jaekle Processing program of events dates involves downloading pages from Internet, searching downloaded pages for event information, storing event information found in result table or database

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6041360A (en) * 1997-11-21 2000-03-21 International Business Machines Corporation Web browser support for dynamic update of bookmarks
US6061738A (en) * 1997-06-27 2000-05-09 D&I Systems, Inc. Method and system for accessing information on a network using message aliasing functions having shadow callback functions

Also Published As

Publication number Publication date
WO2001011444A3 (en) 2001-09-27
AU6627700A (en) 2001-03-05

Similar Documents

Publication Publication Date Title
US6516312B1 (en) System and method for dynamically associating keywords with domain-specific search engine queries
US7340459B2 (en) Information access
US6931397B1 (en) System and method for automatic generation of dynamic search abstracts contain metadata by crawler
US9342602B2 (en) User interfaces for search systems using in-line contextual queries
US8276065B2 (en) System and method for classifying electronically posted documents
US6321228B1 (en) Internet search system for retrieving selected results from a previous search
US6256623B1 (en) Network search access construct for accessing web-based search services
KR100567005B1 (en) Information retrieval from hierarchical compound documents
US20010047353A1 (en) Methods and systems for enabling efficient search and retrieval of records from a collection of biological data
US20080140657A1 (en) Document Searching Tool and Method
US20020091661A1 (en) Method and apparatus for automatic construction of faceted terminological feedback for document retrieval
US8589391B1 (en) Method and system for generating web site ratings for a user
WO2001016807A1 (en) An internet search system for tracking and ranking selected records from a previous search
KR20040029895A (en) Search system
EP1462952B1 (en) Method for indexing and searching a collection of internet documents
WO2001024046A2 (en) Authoring, altering, indexing, storing and retrieving electronic documents embedded with contextual markup
WO2001011444A2 (en) System and method for searching and indexing world-wide-web pages
US8090736B1 (en) Enhancing search results using conceptual document relationships
KR20030034265A (en) Devices and Method for Total Bulletin Board Services
Lam The Overview of Web Search Engines
Clarke Search engines for the World Wide Web: an evaluation of recent developments
Du A Web Meta-Search Engine
Lin et al. Vipas: virtual link powered authority search in the web
Hu et al. World wide web search engines
Chen VIPAS: Virtual Link Powered Authority Search in the Web

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AU CA GB JP

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AU CA GB JP

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP