CA2711793A1 - Compressed document surrogates - Google Patents

Compressed document surrogates

Info

Publication number
CA2711793A1
CA2711793A1 CA2711793A CA2711793A CA2711793A1 CA 2711793 A1 CA2711793 A1 CA 2711793A1 CA 2711793 A CA2711793 A CA 2711793A CA 2711793 A CA2711793 A CA 2711793A CA 2711793 A1 CA2711793 A1 CA 2711793A1
Authority
CA
Canada
Prior art keywords
document
term
documents
found
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CA2711793A
Other languages
French (fr)
Other versions
CA2711793C (en)
Inventor
Jay Michael Ponte
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Verizon Laboratories Inc
Original Assignee
Verizon Laboratories Inc.
Jay Michael Ponte
Gte Laboratories Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Verizon Laboratories Inc., Jay Michael Ponte, Gte Laboratories Incorporated
Publication of CA2711793A1
Application granted
Publication of CA2711793C
Anticipated expiration
Expired - Lifetime (current legal status)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574 Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99934 Query formulation, input preparation, or translation
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access
    • Y10S707/99937 Sorting

Abstract

Disclosed are a method and a device for storing information about Web documents, such as pages or sites, in a form that can be used in conjunction with inverted term lists to facilitate the retrieval of documents of interest from the Web.
The method involves constructing compressed surrogates for documents, so that various operations can be performed without retrieving a copy of the document from the Web. The method permits the efficient updating of inverted term lists when documents on the Web have been modified or deleted, and also permits the efficient processing of search queries in a variety of circumstances.
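
As a rough illustration of the idea in the abstract, the following Python sketch builds a surrogate holding a few of the per-term fields that claim 1(a) lists (term identification number, occurrence count, positions) and then uses it to remove a deleted document from the inverted term lists without fetching the document again. The names `TermEntry`, `Surrogate`, `build_surrogate`, and `remove_document`, the whitespace tokenizer, and the dictionary layout of the inverted lists are all assumptions made for this sketch; the patent text here does not specify an encoding or a compression scheme.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TermEntry:
    """Per-term record in a surrogate (a subset of the fields in claim 1(a))."""
    term_id: int
    count: int = 0                                  # occurrences in the document
    positions: List[int] = field(default_factory=list)


@dataclass
class Surrogate:
    """Compact stand-in for a document: one entry per term of interest."""
    doc_id: int
    entries: List[TermEntry]


def build_surrogate(doc_id: int, text: str, vocabulary: Dict[str, int]) -> Surrogate:
    """Record the terms of interest in `text`; `vocabulary` maps term -> term_id.

    Hypothetical tokenizer: lowercase and split on whitespace.
    """
    found: Dict[int, TermEntry] = {}
    for pos, word in enumerate(text.lower().split()):
        term_id = vocabulary.get(word)
        if term_id is None:
            continue  # not a term of interest, so it is not stored
        entry = found.setdefault(term_id, TermEntry(term_id))
        entry.count += 1
        entry.positions.append(pos)
    return Surrogate(doc_id, sorted(found.values(), key=lambda e: e.term_id))


def remove_document(surrogate: Surrogate,
                    inverted_lists: Dict[int, Dict[int, int]]) -> None:
    """Drop a deleted document from every inverted list it appears on.

    Only the surrogate is consulted, so the document itself need not be
    retrieved. `inverted_lists` maps term_id -> {doc_id: count} (assumed layout).
    """
    for entry in surrogate.entries:
        inverted_lists.get(entry.term_id, {}).pop(surrogate.doc_id, None)
```

The surrogate tells the updater exactly which inverted lists mention the document, which is what lets a deletion touch only those lists instead of scanning the whole index.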

Claims (8)

1. A method for returning a list of a number of documents N in order of predicted utility, from among a collection of documents, as predicted by a search query containing terms to be present or absent, the method comprising:
(a) creating a compressed document surrogate for each document in the database, where the compressed document surrogate contains information about each term, from among the terms of interest in the database, which occurs in the document, and which compressed document surrogate is created with top and remainder inverted term lists that contain information about the terms of interest in the database, and where the information about each term included in the compressed document surrogate for a document includes at least one of:
the term identification number of the term, the location in a lookup table of an entry for the term, the number of times the term occurs in the document, the location in the document of each occurrence of the term, the address of the inverted term list of the term which contains the document, and the address of the location in the inverted term list of the document;
(b) choosing, from among the terms in the search query which are to be found in documents, the term whose top inverted term list has not yet been considered, which occurs in the fewest documents in the collection;
(c) consulting the top inverted term list for said term, calculating the score for each document found in the top inverted term list;
(i) if the document has not previously been found on an inverted term list, assigning the document the calculated score;
(ii) if the document has previously been found on an inverted term list, increasing its previously-calculated score by the calculated score;
(d) calculating a maximum score, S Max, achieved by a document, not already found on a top inverted term list, if it is found on all top inverted term lists, for terms to be found in documents, not yet consulted;

(e) calculating a maximum score, S Sub, to be subtracted from a document score, as a result of said document being found to contain terms to be absent from a document;
(f) determining whether there are N or more documents already found, with scores such that if S Sub were subtracted from their scores, the remainder would be greater than S Max;
(g) if there are N or more such documents, determining by use of the compressed document surrogate for each document a final score for the documents that have so far been found in any inverted term list of a desired term, and providing a list of the N documents with the highest scores, ranked in order of score;

(h) if there are not N or more such documents, repeating (b) through (f) until either N or more such documents are found, or until no top inverted term list of a term to be found in the document remains unanalyzed;
(i) if there are not N or more such documents, and the top inverted term lists of all terms desired to be found in the document have been analyzed, repeating (b) through (h) utilizing remainder inverted term lists instead of top inverted term lists, until either N or more such documents are found, or until no remainder inverted term list of a term desired to be found in the document remains unanalyzed; and
(j) determining by use of the compressed document surrogate for each document the final score for the documents found on the inverted term lists of the desired terms, and providing a list of the documents ranked in order of score.
2. The method of claim 1, wherein the documents are Web pages.
3. The method of claim 1, wherein the documents are Web sites.
4. The method of claim 1, wherein only terms desired to be found are contained in a search query, so that S Sub is zero.
5. A device for returning a list of a number of documents N in order of predicted utility, from among a collection of documents, as predicted by a search query containing terms to be present or absent, the device comprising:

(a) a processor;

(b) means for creating a compressed document surrogate for each document in the database, where the compressed document surrogate contains information about each term, from among the terms of interest in the database, which occurs in the document, and where the compressed document surrogate is created with top and remainder inverted term lists that contain information about the terms of interest in the database, and where the information about each term included in the compressed document surrogate for a document includes at least one of: the term identification number of the term, the location in a lookup table of an entry for the term, the number of times the term occurs in the document, the location in the document of each occurrence of the term, the address of the inverted term list of the term which contains the document, and the address of the location in the inverted term list of the document;
(c) means for choosing, from among the terms in the search query which are to be found in documents, the term whose top inverted term list has not yet been considered, which occurs in the fewest documents in the collection;
(d) means for consulting the top inverted term list for said term, and calculating the score for each document found in the top inverted term list;
(i) means for assigning the document the calculated score, in response to the document not having previously been found on an inverted term list;
(ii) means for increasing the document's previously-calculated score by the calculated score, in response to the document having previously been found on an inverted term list;
(e) means for calculating a maximum score, S Max, achieved by a document, not already found on a top inverted term list, in response to it being found on all top inverted term lists, for terms to be found in documents, not yet consulted;
(f) means for calculating a maximum score, S Sub, to be subtracted from a document score, as a result of said document being found to contain terms to be absent from a document;
(g) means for determining whether there are N or more documents already found, with scores such that if S Sub were subtracted from their scores, the remainder would be greater than S Max;
(h) means for determining by use of the compressed document surrogate for each document a final score for the documents that have so far been found in any inverted term list of a desired term, and providing a list of the N documents with the highest scores, ranked in order of score, in response to there being N or more documents already found with scores such that if S Sub were subtracted from their scores, the remainder would be greater than S Max;
(i) means for repeating (c) through (g) until either N or more such documents are found, or until no top inverted term list of a term desired to be found in the document remains unanalyzed, in response to there not being N or more such documents;
(j) means for repeating (c) through (i) utilizing remainder inverted term lists instead of top inverted term lists, until either N or more such documents are found, or until no remainder inverted term list of a term desired to be found in the document remains unanalyzed, in response to there not being N or more such documents, and the top inverted term lists of all terms desired to be found in the document having been analyzed; and
(k) means for determining by use of the compressed document surrogate for each document the final score for the documents found on the inverted term lists of the desired terms, and providing a list of the documents ranked in order of score.
6. The device of claim 5, wherein the documents are Web pages.
7. The device of claim 5, wherein the documents are Web sites.
8. The device of claim 5, wherein only terms desired to be found are contained in a search query, so that S Sub is zero.
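
The query procedure of claims 1 and 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `top_n`, the `INDEX` and `SURROGATES` layouts, the additive per-term scores, and the tie-breaking by document number are all hypothetical, since the claims do not fix a scoring function or data layout. Each term is assumed to have a "top" list (its highest-scoring documents) and a "rest" list (the remainder), and each surrogate is reduced to a map from term to that term's score contribution.

```python
# Hypothetical sample data: per-term top/remainder lists of {doc: score},
# and per-document surrogates of {term: score contribution}.
INDEX = {
    "surrogate": {"top": {1: 3.0, 2: 2.0}, "rest": {3: 1.0}},
    "web": {"top": {1: 2.0, 3: 2.0}, "rest": {2: 1.0, 4: 1.0}},
    "spam": {"top": {3: 5.0}, "rest": {}},
}
SURROGATES = {
    1: {"surrogate": 3.0, "web": 2.0},
    2: {"surrogate": 2.0, "web": 1.0},
    3: {"surrogate": 1.0, "web": 2.0, "spam": 5.0},
    4: {"web": 1.0},
}


def top_n(wanted, unwanted, index, surrogates, n):
    """Return up to n (doc, final_score) pairs, best first.

    Assumes every query term has an entry in `index`.
    """

    def doc_freq(term):
        return len(index[term]["top"]) + len(index[term]["rest"])

    def final_score(doc):
        # Steps (g) and (j): rescore from the surrogate alone.
        entry = surrogates[doc]
        gain = sum(entry.get(t, 0.0) for t in wanted)
        loss = sum(entry.get(t, 0.0) for t in unwanted)
        return gain - loss

    # Step (e): S Sub, the largest penalty any document could incur.
    s_sub = sum(
        max([*index[t]["top"].values(), *index[t]["rest"].values()], default=0.0)
        for t in unwanted
    )

    partial = {}          # running scores of documents found so far
    enough = False
    for tier in ("top", "rest"):                 # step (i): remainder lists second
        pending = sorted(wanted, key=doc_freq)   # step (b): rarest term first
        while pending and not enough:
            term = pending.pop(0)
            for doc, score in index[term][tier].items():      # step (c)
                partial[doc] = partial.get(doc, 0.0) + score
            # Step (d): S Max, the best score an unseen document could still
            # collect from the lists of this tier not yet consulted.
            s_max = sum(max(index[t][tier].values(), default=0.0) for t in pending)
            # Step (f): count documents that stay ahead even after S Sub.
            safe = sum(1 for s in partial.values() if s - s_sub > s_max)
            enough = safe >= n
        if enough:
            break

    ranked = sorted(((d, final_score(d)) for d in partial),
                    key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]
```

With the sample data, `top_n(["surrogate", "web"], ["spam"], INDEX, SURROGATES, 2)` consults every list and returns `[(1, 5.0), (2, 3.0)]`, while `top_n(["surrogate", "web"], [], INDEX, SURROGATES, 1)` stops after the first top list because S Sub is zero (the situation of claims 4 and 8) and returns `[(1, 5.0)]`. The sketch truncates the fallback result of step (j) to N entries, which is a simplification of the claim wording.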
CA2711793A 1999-07-30 2000-06-07 Compressed document surrogates Expired - Lifetime CA2711793C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US09/365,326 1999-07-30
US09/365,326 US6665665B1 (en) 1999-07-30 1999-07-30 Compressed document surrogates
CA2310931A CA2310931C (en) 1999-07-30 2000-06-07 Compressed document surrogates

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CA2310931A Division CA2310931C (en) 1999-07-30 2000-06-07 Compressed document surrogates

Publications (2)

Publication Number Publication Date
CA2711793A1 (en) 2001-01-30
CA2711793C (en) 2011-12-06

Family

ID=23438405

Family Applications (2)

Application Number Title Priority Date Filing Date
CA2310931A Expired - Lifetime CA2310931C (en) 1999-07-30 2000-06-07 Compressed document surrogates
CA2711793A Expired - Lifetime CA2711793C (en) 1999-07-30 2000-06-07 Compressed document surrogates

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CA2310931A Expired - Lifetime CA2310931C (en) 1999-07-30 2000-06-07 Compressed document surrogates

Country Status (2)

Country Link
US (2) US6665665B1 (en)
CA (2) CA2310931C (en)

Families Citing this family (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839321B2 (en) 1997-01-06 2020-11-17 Jeffrey Eder Automated data storage system
US9471672B1 (en) * 1999-11-10 2016-10-18 Fastcase, Inc. Relevance sorting for database searches
JP3368883B2 (en) * 2000-02-04 2003-01-20 インターナショナル・ビジネス・マシーンズ・コーポレーション Data compression device, database system, data communication system, data compression method, storage medium, and program transmission device
US8132110B1 (en) 2000-05-04 2012-03-06 Aol Inc. Intelligently enabled menu choices based on online presence state in address book
US7979802B1 (en) 2000-05-04 2011-07-12 Aol Inc. Providing supplemental contact information corresponding to a referenced individual
US9100221B2 (en) 2000-05-04 2015-08-04 Facebook, Inc. Systems for messaging senders and recipients of an electronic message
US9356894B2 (en) 2000-05-04 2016-05-31 Facebook, Inc. Enabled and disabled menu choices based on presence state
US8001190B2 (en) 2001-06-25 2011-08-16 Aol Inc. Email integrated instant messaging
US7392238B1 (en) * 2000-08-23 2008-06-24 Intel Corporation Method and apparatus for concept-based searching across a network
US20020116365A1 (en) * 2001-01-12 2002-08-22 A System And Method For Classifying Tangible Assets System and method for classifying tangible assets
US20020165936A1 (en) * 2001-01-25 2002-11-07 Victor Alston Dynamically branded web sites
DE10118127A1 (en) * 2001-04-11 2002-10-17 Philips Corp Intellectual Pty Process for operating an automatic industry information system
US8260786B2 (en) * 2002-05-24 2012-09-04 Yahoo! Inc. Method and apparatus for categorizing and presenting documents of a distributed database
US7231395B2 (en) * 2002-05-24 2007-06-12 Overture Services, Inc. Method and apparatus for categorizing and presenting documents of a distributed database
US7249312B2 (en) * 2002-09-11 2007-07-24 Intelligent Results Attribute scoring for unstructured content
WO2004090692A2 (en) 2003-04-04 2004-10-21 Icosystem Corporation Methods and systems for interactive evolutionary computing (iec)
JP2004348241A (en) * 2003-05-20 2004-12-09 Hitachi Ltd Information providing method, server, and program
EP1649346A2 (en) 2003-08-01 2006-04-26 Icosystem Corporation Methods and systems for applying genetic operators to determine system conditions
EP1661031A4 (en) * 2003-08-21 2006-12-13 Idilia Inc System and method for processing text utilizing a suite of disambiguation techniques
US7346839B2 (en) * 2003-09-30 2008-03-18 Google Inc. Information retrieval based on historical data
US7281005B2 (en) * 2003-10-20 2007-10-09 Telenor Asa Backward and forward non-normalized link weight analysis method, system, and computer program product
US8577893B1 (en) * 2004-03-15 2013-11-05 Google Inc. Ranking based on reference contexts
US9104689B2 (en) * 2004-03-17 2015-08-11 International Business Machines Corporation Method for synchronizing documents for disconnected operation
US7493320B2 (en) 2004-08-16 2009-02-17 Telenor Asa Method, system, and computer program product for ranking of documents using link analysis, with remedies for sinks
US7548917B2 (en) * 2005-05-06 2009-06-16 Nelson Information Systems, Inc. Database and index organization for enhanced document retrieval
US7363225B2 (en) * 2005-06-23 2008-04-22 Microsoft Corporation Compressing language models with Golomb coding
US20060294199A1 (en) * 2005-06-24 2006-12-28 The Zeppo Network, Inc. Systems and Methods for Providing A Foundational Web Platform
JP4756953B2 (en) * 2005-08-26 2011-08-24 富士通株式会社 Information search apparatus and information search method
US20070050361A1 (en) * 2005-08-30 2007-03-01 Eyhab Al-Masri Method for the discovery, ranking, and classification of computer files
EP1927058A4 (en) 2005-09-21 2011-02-02 Icosystem Corp System and method for aiding product design and quantifying acceptance
US7562074B2 (en) * 2005-09-28 2009-07-14 Epacris Inc. Search engine determining results based on probabilistic scoring of relevance
US7529761B2 (en) * 2005-12-14 2009-05-05 Microsoft Corporation Two-dimensional conditional random fields for web extraction
CA2549536C (en) * 2006-06-06 2012-12-04 University Of Regina Method and apparatus for construction and use of concept knowledge base
US8001130B2 (en) * 2006-07-25 2011-08-16 Microsoft Corporation Web object retrieval based on a language model
US7720830B2 (en) * 2006-07-31 2010-05-18 Microsoft Corporation Hierarchical conditional random fields for web extraction
US7921106B2 (en) * 2006-08-03 2011-04-05 Microsoft Corporation Group-by attribute value in search results
JP2008059099A (en) * 2006-08-29 2008-03-13 Access Co Ltd Information display device, information display program and information display system
US8099429B2 (en) * 2006-12-11 2012-01-17 Microsoft Corporation Relational linking among resoures
US8631005B2 (en) 2006-12-28 2014-01-14 Ebay Inc. Header-token driven automatic text segmentation
US7720837B2 (en) * 2007-03-15 2010-05-18 International Business Machines Corporation System and method for multi-dimensional aggregation over large text corpora
US7844609B2 (en) * 2007-03-16 2010-11-30 Expanse Networks, Inc. Attribute combination discovery
US20090043752A1 (en) 2007-08-08 2009-02-12 Expanse Networks, Inc. Predicting Side Effect Attributes
US20090089293A1 (en) * 2007-09-28 2009-04-02 Bccg Ventures, Llc Selfish data browsing
US20090144266A1 (en) * 2007-12-04 2009-06-04 Eclipsys Corporation Search method for entries in a database
US20100057685A1 (en) * 2008-09-02 2010-03-04 Qimonda Ag Information storage and retrieval system
US7917438B2 (en) * 2008-09-10 2011-03-29 Expanse Networks, Inc. System for secure mobile healthcare selection
US8200509B2 (en) 2008-09-10 2012-06-12 Expanse Networks, Inc. Masked data record access
US20100169262A1 (en) * 2008-12-30 2010-07-01 Expanse Networks, Inc. Mobile Device for Pangenetic Web
US8386519B2 (en) 2008-12-30 2013-02-26 Expanse Networks, Inc. Pangenetic web item recommendation system
US8108406B2 (en) 2008-12-30 2012-01-31 Expanse Networks, Inc. Pangenetic web user behavior prediction system
US8554696B2 (en) * 2009-02-13 2013-10-08 Fujitsu Limited Efficient computation of ontology affinity matrices
US8713007B1 (en) 2009-03-13 2014-04-29 Google Inc. Classifying documents using multiple classifiers
WO2011052526A1 (en) * 2009-10-30 2011-05-05 楽天株式会社 Characteristic content determination program, characteristic content determination device, characteristic content determination method, recording medium, content generation device, and related content insertion device
US10614134B2 (en) 2009-10-30 2020-04-07 Rakuten, Inc. Characteristic content determination device, characteristic content determination method, and recording medium
CN103164435B (en) * 2011-12-13 2016-03-09 北大方正集团有限公司 A kind of acquisition method of network data and system
TWI518616B (en) * 2014-09-24 2016-01-21 國立清華大學 Method and electronic device for rating outfit

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6029195A (en) * 1994-11-29 2000-02-22 Herz; Frederick S. M. System for customized electronic identification of desirable objects
US5764906A (en) * 1995-11-07 1998-06-09 Netword Llc Universal electronic resource denotation, request and delivery system
JP3040945B2 (en) * 1995-11-29 2000-05-15 松下電器産業株式会社 Document search device
US5915249A (en) * 1996-06-14 1999-06-22 Excite, Inc. System and method for accelerated query evaluation of very large full-text databases
US5920859A (en) * 1997-02-05 1999-07-06 Idd Enterprises, L.P. Hypertext document retrieval system and method
US6038561A (en) * 1996-10-15 2000-03-14 Manning & Napier Information Services Management and analysis of document information text
GB9701866D0 (en) * 1997-01-30 1997-03-19 British Telecomm Information retrieval
US5895470A (en) * 1997-04-09 1999-04-20 Xerox Corporation System for categorizing documents in a linked collection of documents
US6094649A (en) * 1997-12-22 2000-07-25 Partnet, Inc. Keyword searches of structured databases
US6122647A (en) * 1998-05-19 2000-09-19 Perspecta, Inc. Dynamic generation of contextual links in hypertext documents
US6405188B1 (en) * 1998-07-31 2002-06-11 Genuity Inc. Information retrieval system
US6389412B1 (en) * 1998-12-31 2002-05-14 Intel Corporation Method and system for constructing integrated metadata

Also Published As

Publication number Publication date
CA2711793C (en) 2011-12-06
CA2310931C (en) 2010-10-26
US20060184521A1 (en) 2006-08-17
CA2310931A1 (en) 2001-01-30
US7240056B2 (en) 2007-07-03
US6665665B1 (en) 2003-12-16

Legal Events

Date Code Title Description
EEER Examination request
MKEX Expiry

Effective date: 20200607