US20150234917A1 - Per-document index for semantic searching - Google Patents


Info

Publication number
US20150234917A1
Authority
US
United States
Prior art keywords
document
section
term
sections
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/700,693
Inventor
Yue-Sheng Liu
Xiao-Song Yang
Hui Shen
Weihu Wang
John G. Bennett
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Priority to US14/700,693
Publication of US20150234917A1
Current legal status: Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/319: Inverted lists
    • G06F17/30622
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G06F17/30864

Definitions

  • An inverted index stores a mapping from content, such as keywords, to its location in a database file, or in a document or set of documents. Those documents having keywords that match search query keywords are returned as search results.
  • Search ranking algorithms have been developed, however, that rely on additional information in documents besides keywords in order to return more contextually-meaningful search results that better match user intent.
  • the requirements of these new algorithms along with the ever-increasing size of Web data can present issues regarding the storage of document information.
  • aspects of the present invention relate to systems, methods, and computer-storage media for, among other things, generating a per-document index (PDI) used for semantic searching and search ranking.
  • the PDI is forward encoded which preserves the semantic and contextual information of the original document including keywords and surrounding non-keyword terms; this semantic information provides valuable indicators as to the underlying meaning of the document.
  • the PDI is generated in such a way that rich per-document information, including semantic and/or contextual information, of different kinds can be efficiently stored. As well, the information can be accessed and utilized at the time a search query is received without significant search-time penalties.
  • the PDI is generated by parsing a document into a plurality of sections and, for each section, translating in order each term to either a cache index or a term identifier. Subsequent to translating each term, each section is group encoded and stored in a data store.
  • a PDI generated in this manner can be used in combination with an inverted index to identify contextually-relevant search results. For example, a search query is received which comprises keyword terms and surrounding non-keyword terms having a contextual meaning. The inverted index is accessed, and the search query keyword terms are used to identify a set of documents that contains the keywords. A data store storing PDIs is then accessed, and the PDIs for the set of documents are identified.
  • the keyword terms in the document are located, and the keyword terms along with surrounding non-keyword terms are analyzed to determine a respective contextual meaning.
  • Those documents that have respective contextual meanings most relevant to the contextual meaning associated with the search query are identified, ranked, and presented in an ordered list to a user on a search results page. The result is a list which better represents contextually-relevant search results than when using an inverted index by itself.
  • FIG. 1 is a block diagram of an exemplary computing environment suitable for use in implementing embodiments of the present invention
  • FIG. 2 is a block diagram of an exemplary system for generating a per-document index suitable for use in implementing embodiments of the present invention
  • FIG. 3 depicts an illustrative data table suitable for storing mappings between document sections and term identifiers or cache indexes in accordance with an embodiment of the present invention
  • FIG. 4 is a flow diagram that illustrates an exemplary method of generating a per-document index for semantic searching in accordance with an embodiment of the present invention.
  • FIG. 5 is a flow diagram that illustrates an exemplary method of utilizing a per-document index in combination with an inverted index to identify contextually-relevant search results in accordance with an embodiment of the present invention.
  • aspects of the present invention relate to systems, methods, and computer-storage media for, among other things, generating a per-document index (PDI) used for semantic searching, search confirmation, and/or search ranking.
  • the PDI is forward encoded which preserves the contextual information along with other information associated with the document. This information may include keywords, surrounding non-keyword terms, annotations associated with the document, metadata associated with the document, and the like. All this information provides valuable indicators as to the underlying meaning of the document.
  • the PDI is generated by parsing a document into a plurality of sections and, for each section, translating in order each term to either a cache index or a term identifier.
  • the term “document” includes the document in its original or native form along with any information that may have been added or associated with the document. This information may include annotations and/or metadata that help to augment the understanding of the document's contents and context.
  • each section is group encoded and stored in a data store.
  • the PDI data structure may comprise a plurality of sections with each section comprising an in-order arrangement of at least one of terms, term identifiers or cache indexes.
  • the sections include a document data section comprising terms that identify the document, a custom-dictionary section comprising document-specific terms, a body section comprising terms in the document body, one or more meta-streams sections comprising terms associated with at least one of document layout, anchor uniform resource locators (URLs), or domain URLs, and one or more meta-word sections comprising terms associated with identified attributes and/or known contexts of the document.
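
The section layout described above can be pictured as a small container type. The following is a minimal sketch only: the class names `Section` and `PerDocumentIndex`, the field names, and the sample values are hypothetical and do not come from the application, which does not prescribe a concrete in-memory representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """One PDI section: an in-order list of term identifiers or cache indexes."""
    name: str
    entries: List[int] = field(default_factory=list)


@dataclass
class PerDocumentIndex:
    """Hypothetical container mirroring the sections named in the text."""
    document_data: Section                  # terms that identify the document
    custom_dictionary: List[str]            # document-specific terms, by position
    body: Section                           # terms in the document body
    meta_streams: List[Section] = field(default_factory=list)
    meta_words: List[Section] = field(default_factory=list)

    def all_sections(self) -> List[Section]:
        """Return the encoded sections in a fixed order."""
        return [self.document_data, self.body, *self.meta_streams, *self.meta_words]


# Illustrative values only; the numbers are made-up term identifiers.
pdi = PerDocumentIndex(
    document_data=Section("document_data", [12, 7]),
    custom_dictionary=["contoso"],
    body=Section("body", [5, 1230, 5]),
    meta_streams=[Section("layout", [3]), Section("anchors", [9, 4])],
    meta_words=[Section("attributes", [2])],
)
print([s.name for s in pdi.all_sections()])
```

Keeping each section as its own ordered list reflects the statement that sections hold an in-order arrangement of terms, term identifiers, or cache indexes.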
  • a PDI generated in this manner can be used in combination with an inverted index to identify contextually-relevant search results.
  • a search query is received which comprises keyword terms and surrounding non-keyword terms having a contextual meaning.
  • the inverted index is accessed, and the search query keyword terms are used to identify a set of documents that contains the keywords.
  • a data store storing PDIs is then accessed, and the PDIs for the set of documents are identified, retrieved, and decoded.
  • the keyword terms in the document are located, and the keyword terms along with surrounding non-keyword terms are analyzed to determine a respective contextual meaning.
  • Those documents that have respective contextual meanings most relevant to the contextual meaning associated with the search query are identified, ranked, and presented in an ordered list to a user on a search results page.
  • the result is a search result list which better represents contextually-relevant search results than when using an inverted index by itself.
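
The two-stage lookup described above (inverted index first, then the PDIs of the candidate documents) can be sketched as follows. This is an illustration under stated assumptions: the toy data, the function names, and in particular the bigram-overlap scoring are hypothetical stand-ins; the application does not specify how contextual meaning is computed.

```python
from typing import Dict, List, Set

# Hypothetical toy data: an inverted index (keyword -> document ids) and a
# per-document index (document id -> terms in their original order).
INVERTED_INDEX: Dict[str, Set[int]] = {
    "books": {1, 2},
    "children": {1, 2},
}
PDI_STORE: Dict[int, List[str]] = {
    1: ["great", "books", "for", "children"],
    2: ["books", "written", "by", "children"],
}


def candidate_documents(keywords: List[str]) -> Set[int]:
    """Stage 1: documents that contain every keyword, via the inverted index."""
    sets = [INVERTED_INDEX.get(k, set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


def context_score(query: List[str], doc_terms: List[str]) -> int:
    """Stage 2 (assumed scoring): count adjacent query term pairs that also
    appear adjacent, in the same order, in the document."""
    doc_bigrams = set(zip(doc_terms, doc_terms[1:]))
    return sum(1 for pair in zip(query, query[1:]) if pair in doc_bigrams)


def search(query: List[str], keywords: List[str]) -> List[int]:
    """Rank candidate documents by the assumed context score, best first."""
    docs = candidate_documents(keywords)
    scored = [(context_score(query, PDI_STORE[d]), d) for d in docs]
    # Highest score first; ties broken by document id for determinism.
    return [d for _, d in sorted(scored, key=lambda x: (-x[0], x[1]))]


print(search(["books", "for", "children"], ["books", "children"]))
print(search(["books", "by", "children"], ["books", "children"]))
```

Both queries reach the same candidate set through the inverted index; only the in-order terms kept in the PDI let the second stage tell them apart.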
  • An exemplary computing environment suitable for use in implementing embodiments of the present invention is described below in order to provide a general context for various aspects of the present invention.
  • Referring to FIG. 1, such an exemplary computing environment is shown and designated generally as computing device 100.
  • the computing device 100 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention. Neither should the computing device 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated.
  • Embodiments of the invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device.
  • Generally, program modules, including routines, programs, objects, components, data structures, etc., refer to code that performs particular tasks or implements particular abstract data types.
  • Embodiments of the invention may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialty computing devices, and the like.
  • Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
  • the computing device 100 includes a bus 110 that directly or indirectly couples the following devices: a memory 112 , one or more processors 114 , one or more presentation components 116 , one or more input/output (I/O) ports 118 , I/O components 120 , and an illustrative power supply 122 .
  • the bus 110 represents what may be one or more busses (such as an address bus, data bus, or combination thereof).
  • FIG. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments of the present invention. Distinction is not made between such categories as “workstation,” “server,” “laptop,” “hand-held device,” etc., as all are contemplated within the scope of FIG. 1 and reference to “computer” or “computing device.”
  • the computing device 100 typically includes a variety of computer-readable media.
  • Computer-readable media may be any available media that is accessible by the computing device 100 and includes both volatile and nonvolatile media, removable and non-removable media.
  • Computer-readable media comprises computer storage media and communication media; computer storage media excludes signals per se.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 100 .
  • Communication media embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
  • the memory 112 includes computer-storage media in the form of volatile and/or nonvolatile memory.
  • the memory may be removable, non-removable, or a combination thereof.
  • Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, solid state drives (SSDs), and the like.
  • the computing device 100 includes one or more processors that read data from various entities such as the memory 112 or the I/O components 120 .
  • the presentation component(s) 116 present data indications to a user or other device.
  • Exemplary presentation components include a display device, speaker, printing component, vibrating component, and the like.
  • the I/O ports 118 allow the computing device 100 to be logically coupled to other devices including the I/O components 120 , some of which may be built in.
  • Illustrative components include a microphone, a camera, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.
  • aspects of the subject matter described herein may be described in the general context of computer-executable instructions, such as program modules, being executed by a mobile device.
  • program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types.
  • aspects of the subject matter described herein may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including memory storage devices.
  • Although the term “server” is often used herein, it will be recognized that this term may also encompass a search engine, a Web browser, a set of one or more processes distributed on one or more computers, one or more stand-alone storage devices, a set of one or more other computing or storage devices, a combination of one or more of the above, and the like.
  • an exemplary system 200 is depicted for use in generating a per-document index (PDI) for use in semantic or contextual searching and search ranking.
  • the system 200 is merely an example of one suitable system environment and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the present invention. Neither should the system 200 be interpreted as having any dependency or requirement related to any single module/component or combination of modules/components illustrated therein.
  • the system includes a per-document index service 210 , a data store 212 , and an end-user computing device 214 all in communication with one another via a network 216 .
  • the network 216 may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet. Accordingly, the network 216 is not further described herein.
  • one or more of the illustrated components/modules may be implemented as stand-alone applications. In other embodiments, one or more of the illustrated components/modules may be integrated directly into, for example, the operating system of the end-user computing device 214 or the per-document index service 210 .
  • the components/modules illustrated in FIG. 2 are exemplary in nature and in number and should not be construed as limiting. Any number of components/modules may be employed to achieve the desired functionality within the scope of embodiments hereof. Further, components/modules may be located on any number of servers. By way of example only, the per-document index service 210 might reside on a server, a cluster of servers, or a computing device remote from one or more of the remaining components.
  • the data store 212 is configured to store one or more per-document indexes (PDIs) or forward indexes (for the purposes of this application, the two terms are used interchangeably).
  • Each PDI stores a forward-encoded document.
  • a document comprises a Web page, a collection of Web pages, representations of documents (e.g., a PDF file), and the like.
  • a document comprises the document in its native or original form along with any annotations and/or metadata that have been associated with the document.
  • Forward encoding preserves not only the keywords associated with the original document but also the contextual and other information associated with the document including the contextual order of the document.
  • the PDI is structured in such a way as to allow access to both keyword terms and the context surrounding those terms at the time a search query is received without significant search-time penalties.
  • the data store 212 comprises a solid state drive (SSD) that stores the PDI persistently.
  • the data store 212 may also be configured as an inverted index that maps keyword terms to documents that contain those keyword terms.
  • the data store 212 is configured to store information used by the per-document index service 210 .
  • the data store 212 may store section-specific, prefix-specific, culture-specific, language-specific, or custom dictionaries for use in translating terms in a document section to term identifiers.
  • the data store 212 may store attributes and/or known contexts that are identified for each document such as, for example, category classifications of documents, social media references, location information associated with the document, originating language of the document, fingerprint information associated with the document, and the like.
  • the data store 212 may also store information concerning recorded search behavior (query logs, rating logs, browser or search logs, query click logs, browsing times, related search lists, etc.) of users in general.
  • Query click logs provide information on documents selected by users in response to a search query; browsing times provide information on the estimated total time users spend browsing a document; and browser/search logs provide information on documents viewed by users during a search session and how frequently any one document is visited by users.
  • rating logs indicate an importance or ranking of a document based on, for example, various rating algorithms known in the art. This information may be used by the per-document index service 210 to carry out various search result ranking algorithms.
  • the information stored in association with the data store 212 is configured to be searchable for one or more items of information stored in association therewith.
  • the content and volume of such information in the data store 212 are not intended to limit the scope of embodiments of the present invention in any way.
  • the data store 212 may, in fact, be a plurality of storage devices, for instance, a database cluster, portions of which may reside on the per-document index service 210 , the end-user computing device 214 , and/or any combination thereof.
  • the end-user computing device 214 shown in FIG. 2 may be any type of computing device, such as, for example, the computing device 100 described above with reference to FIG. 1 .
  • the end-user computing device 214 may be a personal computer, desktop computer, laptop computer, handheld device, mobile handset, consumer electronic device, or the like. It should be noted, however, that embodiments are not limited to implementation on such computing devices, but may be implemented on any of a variety of different types of computing devices within the scope of embodiments hereof.
  • the end-user computing device 214 may receive inputs through a variety of means such as voice, touch, and/or gestures.
  • the end-user computing device includes a display screen. The display screen is configured to present information, including search results, to the user of the end-user computing device 214 .
  • the system 200 is merely exemplary. While the per-document index service 210 is illustrated as a single unit, it will be appreciated that the per-document index service 210 is scalable. For example, the per-document index service 210 may in actuality include a plurality of computing devices in communication with one another. Moreover, the data store 212 , or portions thereof, may be included within, for instance, the per-document index service 210 as a computer-storage medium.
  • the single unit depictions are meant for clarity, not to limit the scope of embodiments in any form.
  • the per-document index service 210 comprises a receiving component 218 , a parsing component 220 , a translation component 222 , an encoding component 224 , a decoding component 226 , and a contextual analysis component 228 .
  • one or more of the components 218 , 220 , 222 , 224 , 226 , and 228 may be implemented as stand-alone applications.
  • one or more of the components 218 , 220 , 222 , 224 , 226 , and 228 may be integrated directly into the operating system of a computing device such as the computing device 100 of FIG. 1 .
  • the receiving component 218 is configured to receive one or more search queries inputted by a user.
  • the search queries may be inputted on a search engine page, a search box on a Web page, and the like.
  • the search query may comprise one or more terms arranged in a defined grammatical pattern or sequence. Some of the terms may comprise keyword terms, while other terms may join the keyword terms, surround the keyword terms, or act as qualifiers of the keyword terms. For the purposes of this application, terms that join or surround keyword terms or act as qualifiers of the keyword terms are known as surrounding terms. For instance, the search query “books for children” may be considered to have two keywords, “books” and “children,” and a surrounding word, “for.” The word “for” provides context for the search query but is often ignored by traditional ranking algorithms.
  • the search query “books by children” contains the same two keywords as the search query “books for children,” but the surrounding word “by” completely changes the semantic meaning of the search query.
  • the presence of a qualifier may change the semantic or contextual meaning of the search query.
  • the search query “non-profit organizations” has a different contextual meaning than the search query “for-profit organizations” although the two search queries may share the same keywords.
  • the terms “semantic” and “contextual” are used interchangeably. Both terms refer to the underlying meaning of a group of words or a phrase.
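
The distinction between keyword terms and surrounding terms in the “books for children” example can be illustrated with a short sketch. The set `SURROUNDING_TERMS` and the function `split_query` are hypothetical; the application does not say how surrounding terms are recognized, and this sketch does not handle hyphenated qualifiers such as “non-profit.”

```python
from typing import List, Tuple

# Assumed, illustrative list of surrounding (joining) words.
SURROUNDING_TERMS = {"for", "by", "of", "the", "a", "in"}


def split_query(query: str) -> Tuple[List[str], List[str]]:
    """Split a query into keyword terms and surrounding terms, keeping order."""
    terms = query.lower().split()
    keywords = [t for t in terms if t not in SURROUNDING_TERMS]
    surrounding = [t for t in terms if t in SURROUNDING_TERMS]
    return keywords, surrounding


print(split_query("books for children"))
print(split_query("books by children"))
```

The two queries yield identical keyword lists and differ only in the surrounding term, which is the information a keyword-only ranking would discard.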
  • the receiving component 218 is further configured to receive one or more documents.
  • the documents may be received in response to a Web crawler executing a crawl and extracting documents.
  • the receiving component 218 may receive the documents from a data store such as the data store 212 , a document source, or a third-party source.
  • Documents comprise the original document along with any annotations and/or metadata that have been associated with or added to the document.
  • the parsing component 220 is configured to parse the document into a plurality of sections.
  • the sections may include a document data section, one or more meta-word sections, one or more meta-streams sections, and a body section.
  • the document data section comprises terms or common attributes that identify the document. Such terms may include the author(s), a date the document was created, the reading level, whether the document is classified as “adult content,” ratings associated with the document, spam likelihood, the number of pages of the document, and the like.
  • the body section comprises terms in the main body of the document.
  • the meta-streams section(s) comprises terms that occur in, for example, the title of the document; headers, section descriptions, footnotes, and endnotes associated with the document; and the like.
  • the meta-streams section(s) may also include anchor uniform resource locators (URLs) that comprise URLs that are determined to reference the document in question, as well as descriptions or phrases found at the anchor URL locations which are believed to be descriptive of the document.
  • Further, the meta-streams section(s) may include domain URLs that comprise URLs within the document that link to other documents (e.g., hyperlinks), and any URLs associated with the document itself.
  • the meta-streams section(s) provides information regarding the layout of the document and information concerning how the document relates to other documents.
  • the meta-streams section may comprise one section.
  • Alternatively, there may be one meta-streams section for terms that define the structure or layout of the document (e.g., title, headers, section descriptors) and another meta-streams section that provides information on how the document relates to other documents (e.g., anchor URLs, hyperlinks, and the like). It is contemplated that there may be more than two meta-streams sections with each section providing information related to document structure and/or document relatedness to other documents. Any and all such aspects are contemplated as being within the scope of the invention.
  • the meta-words section(s) comprises terms associated with attributes and/or determined contexts of the document.
  • the parsing component 220 is configured to determine or identify attributes associated with the document and to generate meta-word terms for the identified attributes.
  • There may be many types of attributes associated with a document.
  • one attribute may comprise fingerprints associated with the document.
  • the fingerprint of the document is a snapshot of a portion of the document that uniquely defines some aspect of the document.
  • the fingerprint of a document may be compared to fingerprints of other documents to determine duplication or plagiarism.
  • Another attribute may include location information associated with the document.
  • the parsing component 220 determines the country, state, zip code, and the originating language of the document.
  • Another attribute may include a determined category associated with the document (e.g., if the document is directed to common home repairs, it may be classified in the home improvement category).
  • the parsing component 220 is configured to generate the terms that describe the attributes.
  • each term comprises a prefix that describes the type of attribute and a value associated with the attribute.
  • the prefix may be “ZIP” and the value may be the actual zip code such as “98052.”
  • the complete meta-word term may comprise “meta_ZIP_98052.”
  • each term may comprise just a prefix that describes the type of attribute. Any and all such aspects are contemplated as being within the scope of the invention.
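
A meta-word term built from a prefix and an optional value can be sketched as below. This assumes the underscore-separated form suggested by the zip code example; the helper name `make_meta_word` and the `ADULT` prefix are hypothetical.

```python
from typing import Optional


def make_meta_word(prefix: str, value: Optional[str] = None) -> str:
    """Build a meta-word term from an attribute prefix and an optional value.

    Assumes the form "meta_<PREFIX>_<value>", or "meta_<PREFIX>" when the
    term consists of a prefix only.
    """
    if value is None:
        return f"meta_{prefix}"
    return f"meta_{prefix}_{value}"


print(make_meta_word("ZIP", "98052"))
print(make_meta_word("ADULT"))
```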
  • the parsing component is additionally configured to identify known or determined contexts associated with the document.
  • a known or determined context may comprise blog posts related to the document, social media (Facebook®, Twitter®, Instagram®, etc.) comments, likes, or postings associated with the document, applications that reference the document, and the like.
  • For example, the document may be a Web page describing a restaurant, and a celebrity may have included the restaurant name in a blog posting.
  • the portion of the blog associated with the restaurant is identified by the parsing component 220 and is included in a meta-word section.
  • the meta-words section may comprise one section.
  • the parsing component 220 is further configured to generate a custom-dictionary section by identifying terms in the document that do not ordinarily occur outside of the context of the document.
  • the custom-dictionary section comprises document-specific terms at specified positions within the dictionary.
  • the translation component 222 is configured to translate or encode in-order (e.g., forward encode) each term in the document data, body, meta-words, and meta-streams sections to either a cache index or a term identifier. Each section is encoded separately by the translation component 222 . In-order encoding encodes the terms as they appear in the document and yields the position of the term in the document implicitly. This can be contrasted with an inverted index where the position of the term in the document is explicitly encoded.
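
The contrast between implicit and explicit positions can be made concrete with a small sketch. The function names and the term identifiers are hypothetical; the point is only that a forward-encoded section carries position through list order, while an inverted posting list must store each position.

```python
from collections import defaultdict
from typing import Dict, List


def forward_encode(terms: List[str], term_ids: Dict[str, int]) -> List[int]:
    """In-order encoding: the list index is the term's position, so no
    position needs to be stored."""
    return [term_ids[t] for t in terms]


def inverted_postings(terms: List[str]) -> Dict[str, List[int]]:
    """Inverted form: each term maps to an explicit list of positions."""
    postings: Dict[str, List[int]] = defaultdict(list)
    for position, term in enumerate(terms):
        postings[term].append(position)
    return dict(postings)


terms = ["books", "for", "children", "books"]
term_ids = {"for": 1, "books": 40, "children": 77}  # made-up identifiers
encoded = forward_encode(terms, term_ids)
postings = inverted_postings(terms)
print(encoded)
print(postings)
```

With the forward form, the terms surrounding any position can be read by stepping to neighboring list entries, which is what contextual analysis at query time relies on.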
  • the translation component 222 forward encodes each section by accessing a section-specific dictionary that comprises terms commonly associated with the particular section. Each term is at an identified position within the section-specific dictionary, and the position of the term within the dictionary comprises the term identifier for the term. Thus, the term “the” may be found at position 5 within the dictionary; the term identifier for the word “the” would then comprise the number “5.”
  • the translation component 222 then replaces the term with its corresponding term identifier. Using term identifiers enables the terms in the document to be compressed, thus taking up less storage space. Further, term identifiers are easy to encode and decode which speeds up the retrieval of search results.
  • the term identifiers may comprise numerical values. Frequently-used words are associated with smaller term identifiers while less frequently-used words are associated with larger term identifiers.
  • the term “the” may be associated with the term identifier of “5,” while the less-frequently used word “especially” may be associated with the term identifier of “1230.”
  • the term identifier may be 1-byte, 2-byte, or 3-byte.
  • each section generally has its own section-specific dictionary. However, the body and meta-streams sections may share the same dictionary because of the commonality of terms between these two sections.
  • section-specific dictionaries comprise commonly-used section terms.
  • the section-specific dictionaries may be used to generate term identifiers for a plurality of documents. Terms that are unique to the document may not be found in the section-specific dictionary.
  • the translation component 222 accesses the custom dictionary to identify the respective position of the document-specific term and uses this position as the term identifier for the document-specific word.
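  • The dictionary lookup described above can be sketched as follows. This is a minimal illustration, assuming each dictionary is an ordered list whose index serves as the term identifier. The names `forward_encode` and `forward_decode` and the `"section"`/`"custom"` tags are hypothetical; the description does not say how identifiers from the two dictionaries are told apart, so the tag is an assumption made only for this sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical section-specific dictionary: a term's list index is its
# term identifier, so "the" sits at position 5 as in the example above.
SECTION_DICTIONARY: List[str] = ["a", "of", "and", "to", "in", "the", "books", "by"]

Encoded = Tuple[str, int]  # (which dictionary, position in that dictionary)


def build_lookup(dictionary: List[str]) -> Dict[str, int]:
    """Map each term to its position in the dictionary."""
    return {term: position for position, term in enumerate(dictionary)}


def forward_encode(terms: List[str],
                   section_dictionary: List[str],
                   custom_dictionary: List[str]) -> List[Encoded]:
    """Encode terms in document order; position in the output is implicit."""
    section_lookup = build_lookup(section_dictionary)
    custom_lookup = build_lookup(custom_dictionary)
    encoded: List[Encoded] = []
    for term in terms:
        if term in section_lookup:
            encoded.append(("section", section_lookup[term]))
        else:
            # Document-specific term: fall back to the custom dictionary.
            # A KeyError here means the custom dictionary was built incorrectly.
            encoded.append(("custom", custom_lookup[term]))
    return encoded


def forward_decode(encoded: List[Encoded],
                   section_dictionary: List[str],
                   custom_dictionary: List[str]) -> List[str]:
    """Recover the original in-order terms from their identifiers."""
    tables = {"section": section_dictionary, "custom": custom_dictionary}
    return [tables[source][position] for source, position in encoded]
```

  • Because the output list keeps document order, the index of each entry is the position of the term in the section, and no explicit position needs to be stored.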
  • a dictionary may be specialized for a prefix, or may be shared by several prefixes. Alternatively, there may not be a dictionary for a particular prefix.
  • the prefix is not translated into a term identifier but is stored in its native form.
  • the term identifier for a value associated with a prefix is dependent upon the dictionary used for the value's prefix.
  • the prefix “ZIP” may be encoded using a zip code-specific dictionary. Values associated with the prefix would be encoded using the same zip code-specific dictionary.
  • the translation component 222 is further configured to process the term identifiers through an entry cache using methods known in the art.
  • If there is a cache miss, the original term identifier is output. If there is a cache hit, a cache index for the term is output. On an initial cache miss for a term identifier, the cache allocates a new entry and copies in data from memory. Thus, the next time the term identifier is processed, a cache index for that term is output.
  • the entry cache is 1,024 terms, and the cache index may be 10 bits. The use of an entry cache further compresses the document data in the PDI.
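  • The hit and miss behavior of the entry cache can be sketched as follows. The description leaves the cache policy to methods known in the art, so this sketch assumes a direct-mapped cache in which a term identifier maps to slot `term_id % 1024`; that policy, the `("id", ...)`/`("cache", ...)` tags, and the function names are assumptions for illustration only. The decoder mirrors the encoder's cache updates so that a 10-bit cache index can be resolved back to a term identifier.

```python
from typing import List, Optional, Tuple

CACHE_SIZE = 1024  # 1,024 entries, so a cache index fits in 10 bits

Symbol = Tuple[str, int]  # ("id", term identifier) or ("cache", cache index)


def cache_encode(term_ids: List[int]) -> List[Symbol]:
    """Replace repeated term identifiers with cache indexes where possible."""
    slots: List[Optional[int]] = [None] * CACHE_SIZE
    output: List[Symbol] = []
    for term_id in term_ids:
        slot = term_id % CACHE_SIZE  # assumed direct-mapped placement
        if slots[slot] == term_id:
            output.append(("cache", slot))   # cache hit: emit the index
        else:
            output.append(("id", term_id))   # cache miss: emit the identifier
            slots[slot] = term_id            # allocate the entry for next time
    return output


def cache_decode(symbols: List[Symbol]) -> List[int]:
    """Rebuild the term identifiers by replaying the same cache updates."""
    slots: List[Optional[int]] = [None] * CACHE_SIZE
    term_ids: List[int] = []
    for kind, value in symbols:
        if kind == "id":
            slots[value % CACHE_SIZE] = value
            term_ids.append(value)
        else:
            cached = slots[value]
            if cached is None:
                raise ValueError("cache index refers to an empty slot")
            term_ids.append(cached)
    return term_ids
```

  • In this sketch a second occurrence of a term identifier is emitted as a cache index only if no colliding identifier has overwritten its slot in the meantime.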
  • the encoding component 224 is configured to group encode the different sections using methods known in the art. In general, a section stream is processed 32 bits at a time and contains three operational codes to reduce unpredictable branches during the decoding process. The six high bits of the term are shifted down and used to switch into one of 27 possible packings.
  • Each operational code can indicate: 1) a 10 bit cache index for a meta-stream or body term from the cache; 2) a term identifier for a body term or meta-stream term that was not present in the cache, or term identifiers or terms for terms in the other sections (e.g., meta-data section, document data section, and/or custom-dictionary section); 3) constants for use in position or metadata; and 4) other features of the PDI such as section boundaries, boundaries within the document, and the like.
  • the decoding component 226 is configured to decode the data using, for example, the operational codes.
  • each term in each section may be viewed as having three parts: 1) the position of the term in the document (this is implicitly encoded); 2) the term identifier or a cache index; and 3) any metadata associated with the term.
  • Metadata may include type of font, capitalization, bolding, italicizing, underlining, punctuation, position of the term at the end of a sentence, section or document, and the like.
  • Metadata may also include an attribute of the term and its associated value. For instance, an attribute of a term may include a “date” or a “reading level.” These attributes are associated with values. For instance, the “date” attribute may have a value of “Aug. 12, 2012,” and the “reading level” attribute may have a value of “4.”
  • Up to 64 bits of metadata may be associated with a term; the metadata may be used in ranking algorithms.
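  • One way to picture per-term metadata that fits in 64 bits is a set of bit flags. The flag names and bit assignments below are hypothetical; the description lists the kinds of metadata but does not define a layout, so this is only a sketch under that assumption.

```python
from enum import IntFlag

MAX_METADATA_BITS = 64  # upper bound stated in the description


class TermMeta(IntFlag):
    """Hypothetical bit assignments for a few formatting attributes."""
    CAPITALIZED = 1 << 0
    BOLD = 1 << 1
    ITALIC = 1 << 2
    UNDERLINED = 1 << 3
    END_OF_SENTENCE = 1 << 4


def pack_metadata(flags: TermMeta) -> int:
    """Return the flags as an integer, checking the 64-bit budget."""
    value = int(flags)
    if value >> MAX_METADATA_BITS:
        raise ValueError("metadata does not fit in 64 bits")
    return value


def has_flag(packed: int, flag: TermMeta) -> bool:
    """Test whether a packed metadata value carries the given flag."""
    return bool(packed & int(flag))
```

  • A ranking algorithm could then test individual attributes, for example whether a matched keyword was bolded, without decoding anything beyond the packed integer.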
  • the contextual analysis component 228 is configured to perform a number of actions.
  • the contextual analysis component 228 is configured to determine a contextual meaning of a search query by analyzing keyword terms as well as any surrounding terms of a search query received by the receiving component 218 .
  • the contextual analysis component 228 is also configured to use the search query keyword terms to identify a set of documents that contains the keyword terms; this may be accomplished by accessing and utilizing an inverted data store stored in association with the data store 212 . Once the set of documents is identified, the contextual analysis component 228 accesses the PDIs associated with the set of documents and instructs the decoding component 226 to decode the PDIs associated with the documents. The contextual analysis component 228 then performs contextual analysis of the documents.
  • the contextual analysis component 228 locates the instances of the keyword term(s) in each decoded document, constructs a contextual window around each keyword term, and determines a contextual meaning for the contextual window.
  • the contextual window may include a predetermined number of terms surrounding the keyword.
  • the size of the contextual window may be determined on the fly by continually including terms surrounding the keyword term until a contextual meaning is established.
  • the document may contain the following phrase, “Books by children authors are rare.”
  • the keyword terms are “Books” and “children.”
  • the contextual analysis component 228 would include all the terms in the sentence in the contextual window because all of the terms establish a contextual meaning for the phrase.
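  • The fixed-size variant of the contextual window can be sketched as follows, assuming the decoded document is available as an in-order list of terms. The function name `contextual_windows` and the `radius` parameter are hypothetical, and case-insensitive matching is an assumption; the on-the-fly variant that grows the window until a meaning is established is not modeled here.

```python
from typing import Iterable, List


def contextual_windows(terms: List[str],
                       keywords: Iterable[str],
                       radius: int = 3) -> List[List[str]]:
    """Return one window of surrounding terms per keyword occurrence.

    Each window holds up to `radius` terms on either side of the keyword,
    clipped at the start and end of the document.
    """
    keyword_set = {keyword.lower() for keyword in keywords}
    windows: List[List[str]] = []
    for position, term in enumerate(terms):
        if term.lower() in keyword_set:
            start = max(0, position - radius)
            end = min(len(terms), position + radius + 1)
            windows.append(terms[start:end])
    return windows
```

  • With the phrase above and a radius of three, the window around “children” spans the whole sentence, which matches the behavior described for this example.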
  • the contextual analysis component 228 is configured to rank the documents within the set based on how relevant each document's contextual window is compared to the contextual meaning of the search query. Those documents having contextual windows that are most relevant to the contextual meaning of the search query are promoted to a higher ranking, while those documents having contextual windows that are unrelated or not relevant to the contextual meaning of the search query are demoted to a lower ranking. The set of documents is then presented as a rank-ordered list on a search results page.
  • a search query may comprise the phrase, “Books by children.”
  • the contextual meaning of this search query can be surmised as “books written by children authors.”
  • An inverted index is utilized to identify, for example, two documents that contain the search query keyword terms of “books” and “children.”
  • In a first document, a contextual window is constructed that includes the phrase “some great children's books include,” while a contextual window in a second document is constructed that includes the phrase “although books by children authors are rare, some examples include.”
  • the contextual meaning associated with the second document is more relevant to the contextual meaning of the search query as compared to the contextual meaning associated with the first document.
  • the second document would be ranked higher than the first document and presented before the first document on a search results page.
  • the contextual analysis component 228 is also configured to use the contextual windows to classify documents by category.
  • a document's contextual window may comprise the phrase “hammers and nails are useful tools to install home windows,” where “hammer” and “nail” are keyword terms.
  • the contextual analysis component 228 may classify this document in the “home improvement” category.
  • a document's contextual window may contain the same keywords but may be classified in a different category.
  • a document's contextual window may comprise the phrase “We sell hammers and nails at Miller Hardware.”
  • the contextual analysis component 228 may classify this document in the “retail” category.
  • the contextual analysis component 228 is further configured to cluster documents with similar categories and present these document clusters together on a search results page.
  • the contextual analysis component 228 is configured to identify documents whose contextual windows have keyword terms and surrounding terms that share structural pattern similarity with the search query. These documents may be ranked higher than documents whose contextual windows do not share structural pattern similarity with the search query. This is especially useful when a user inputs a long search query phrase.
  • a user may input a query such as “How do I get a taxi at the Beijing airport.” Documents that contain the same terms in substantially the same order as the search query may be ranked higher than those documents that contain the terms in a different order.
  • In FIG. 3, an exemplary data table 300 is shown representing relationships between document sections and term identifiers and/or cache indexes in a PDI.
  • This data structure is illustrative and exemplary in nature and is not meant to be limiting in any way.
  • the purpose of FIG. 3 is to portray concepts and relationships between various data elements and not the actual arrangement of data in a data store. As such, instead of using actual term identifier values and/or cache indexes, textual descriptions of these values are used instead (e.g., “cacheIndex 1 ”).
  • a document 310 is stored in its PDI as a plurality of sections such as a document data section 312 , a custom-dictionary section 314 , a meta-words section 316 , a body section 318 , and a meta-streams section 320 .
  • the document data section 312 comprises terms or attributes that identify the document 310 .
  • Each of the terms in the document data section 312 is associated with a term identifier using, for example, a section-specific dictionary.
  • the term identifiers are encoded in the order they appear in the document data section 312 as shown by term identifiers 322 .
  • the document data term identifiers are the same at positions one and four in the document data section (e.g., D.D.Id 1 ), indicating that the same document data term appears at both of these positions in the document data section 312 .
  • the custom-dictionary section 314 comprises an in-order arrangement of document-specific terms 324 .
  • Document-specific terms are considered unique to the document 310 and are generally not found in the section-specific dictionaries used for each of the sections. Thus, when one of the document-specific terms is encountered during the translation process, the custom-dictionary section 314 is accessed and the position of that term in the custom dictionary is identified.
  • the meta-words section 316 comprises an in-order arrangement of prefix term identifiers 326 and 328 , and, optionally, value term identifiers 328 and 332 associated with identified attributes of the document 310 .
  • a prefix-specific dictionary may be used to determine term identifiers for one or more prefixes.
  • the prefix-specific dictionary used to determine term identifiers for a prefix is also used to determine a term identifier for the value associated with the prefix.
  • the meta-words section 316 may comprise multiple sections with each section sharing structural and/or semantic similarity.
  • the body section 318 comprises an in-order arrangement of either cache indexes 336 , 338 or term identifiers 334 , 340 associated with terms in the body of the document 310 .
  • a cache index, such as cache indexes 336 and 338, is stored when there is a cache hit when the term's associated term identifier is processed through the entry cache.
  • If there is a cache miss, the term identifier, such as term identifiers 334 and 340, is stored in the PDI.
  • the meta-streams section 320 comprises an in-order arrangement of either cache indexes or term identifiers (represented by numeral 342 ) associated with terms in the title of the document 310 , headers and section headings of the document 310 , and/or URLs associated with the document 310 .
  • a cache index is stored when there is a cache hit when the term's associated term identifier is processed through the entry cache.
  • the term identifier is stored in the PDI if there is a cache miss when a term's associated term identifier is processed through the entry cache.
  • In FIG. 4, a flow diagram is depicted of an exemplary method 400 of generating a per-document index for semantic searching.
  • a document is received by a receiving component such as the receiving component 218 of FIG. 2 .
  • the document may be received from a data store such as the data store 212 of FIG. 2 , in response to a Web crawler extracting the document from the World Wide Web, from the document source, or from a third-party.
  • the document may comprise one or more Web pages, Web sites, representations of documents (e.g., a PDF file), and the like.
  • the document includes the original document along with any added annotations.
  • the document is parsed into a plurality of sections by a parsing component such as the parsing component 220 of FIG. 2 .
  • the sections may comprise a document data section that comprises terms or attributes that identify the document, and a body section comprising terms from the body of the document.
  • the sections also include a meta-streams section(s) that comprises: 1) terms associated with the title, header, footnotes, endnotes, and/or section headings of the document; 2) URLs that reference the document (e.g., anchor URLs), are associated with the document, or are embedded in the document as hyperlinks; and/or 3) context or descriptions found where the anchor URLs are located.
  • Parsing the document also includes identifying attributes associated with the document and associating those attributes with meta-word terms.
  • Document attributes are numerous. Some representative examples include originating language of the document, fingerprints associated with the document, the country, state and zip code of the document, a category corresponding to the document, and the like.
  • Each meta-word term includes a prefix that identifies the type of attribute (e.g., fingerprint, location, category, etc.) and, optionally, a value associated with the attribute.
  • an exemplary meta-word may be “meta_CATG_retail,” where “CATG” is the prefix and “retail” is the value.
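  • Splitting a meta-word into its prefix and optional value can be sketched as follows. The `meta_<PREFIX>_<value>` layout is inferred from the single example above, and the `MetaWord` type and `parse_meta_word` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class MetaWord(NamedTuple):
    prefix: str            # attribute type, e.g. "CATG"
    value: Optional[str]   # attribute value, e.g. "retail"; may be absent


def parse_meta_word(meta_word: str) -> MetaWord:
    """Split 'meta_<PREFIX>[_<value>]' into its prefix and optional value."""
    # maxsplit=2 keeps any further underscores inside the value.
    parts = meta_word.split("_", 2)
    if len(parts) < 2 or parts[0] != "meta" or not parts[1]:
        raise ValueError(f"not a meta-word: {meta_word!r}")
    prefix = parts[1]
    value = parts[2] if len(parts) == 3 else None
    return MetaWord(prefix, value)
```

  • The prefix returned here is what would select the prefix-specific dictionary, if any, used to encode the value.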
  • Parsing the document also includes identifying known contexts associated with the documents. Known contexts may include blog postings that reference the document, social media comments, posts, and likes associated with the document, applications that routinely access the document, and the like. Parsing the document may further include creating a custom-dictionary section that comprises document-specific terms not found in the section dictionaries. Each of the document-specific terms is associated with a position in the custom dictionary.
  • each term in the document data, body, meta-words, and meta-streams sections is translated in order to either a cache index or a term identifier by a translation component such as the translation component 222 of FIG. 2 .
  • a section-specific dictionary is accessed, and the term is identified in the section-specific dictionary. The position of the term in the dictionary is used as the term identifier. If a term is not found in the section-specific dictionary, the custom dictionary is accessed to identify the term and its associated position.
  • Term identifiers comprise numerical values that are 1-byte, 2-byte, or 3-byte. Terms that are frequently used are associated with smaller term identifiers, while terms that are infrequently used are associated with larger term identifiers.
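  • The link between term frequency and storage size can be sketched as follows: if frequent terms receive small identifiers, they also receive short encodings. The sketch assumes plain unsigned big-endian integers that use the full range of each width; how the width itself is signaled in the stored stream is not described and is not modeled. The function names are hypothetical.

```python
def term_id_width(term_id: int) -> int:
    """Return the number of bytes (1, 2, or 3) needed for a term identifier."""
    if term_id < 0:
        raise ValueError("term identifiers are non-negative")
    if term_id < 1 << 8:
        return 1
    if term_id < 1 << 16:
        return 2
    if term_id < 1 << 24:
        return 3
    raise ValueError("term identifier does not fit in 3 bytes")


def encode_term_id(term_id: int) -> bytes:
    """Encode a term identifier in the smallest of the three widths."""
    return term_id.to_bytes(term_id_width(term_id), "big")
```

  • Under these assumptions the identifier 5 for “the” occupies one byte, while the identifier 1230 for “especially” occupies two.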
  • each section has a section-specific dictionary.
  • the body and meta-streams sections may share the same section-specific dictionary because these two sections commonly use the same terms.
  • each prefix may have a corresponding dictionary.
  • a dictionary may be shared by several prefixes, or a prefix may not have a corresponding dictionary.
  • a prefix that does not have a corresponding dictionary will be encoded in its native form.
  • the dictionary used for a prefix will also be used for any values associated with that prefix.
  • Term identifiers associated with terms in the body section and meta-streams section are further processed by passing the term identifiers through an entry cache.
  • the entry cache is 1,024 terms. If there is a cache miss, the original term identifier is outputted. However, if there is a cache hit, a cache index is outputted; the cache index is 10 bits. The use of term identifiers and cache indexes allows the large amount of data associated with a document to be efficiently compressed which saves on storage space.
  • each of the sections is group encoded by an encoding component such as the encoding component 224 of FIG. 2 to generate the PDI.
  • the sections are encoded 32 bits at a time. Each 32 bits contains three operational codes that can indicate that the remaining bits comprise one or more 10 bit cache indexes, terms, or term identifiers, data used to identify term position or metadata associated with the term, section and stream boundaries, and the like.
  • the PDI is stored in association with a data store such as a solid state drive (SSD).
  • Each term in the PDI is associated with a position, which is implicitly encoded through in-order encoding, a term identifier or cache index, and any metadata associated with the term.
  • Metadata may include font, capitalization, quantities or values, punctuation, bolding, italicizing, underlining, position within the document body, and the like.
  • In FIG. 5, a flow diagram is depicted of an exemplary method 500 of utilizing a PDI in combination with an inverted index to identify contextually-relevant search results.
  • a search query comprising one or more keyword terms and one or more surrounding terms is received by a receiving component such as the receiving component 218 of FIG. 2 .
  • the search query may be received in response to a user inputting the query into a search box of a search engine page, or a search box associated with a Web page or Web site.
  • the keyword terms and the surrounding terms impart a contextual meaning to the search query; this contextual meaning may be determined by a contextual analysis component such as the contextual analysis component 228 of FIG. 2 .
  • an inverted index is accessed by the contextual analysis component.
  • the inverted index comprises a mapping between keywords and documents that contain those keywords and may be stored in association with a first data store such as the data store 212 of FIG. 2 .
  • the contextual analysis component uses the keyword terms from the search query to identify a set of documents in the inverted index that contain one or more of the search query keyword terms.
  • Each document in the set may be associated with a document identifier (DocId) such as the URL of the document, or other unique identifiers such as a cryptographic hash of the URL, or a unique sequence number assigned to the document when received by the system.
  • the contextual analysis component accesses the per-document indexes associated with the set of documents using, for example, the DocIds.
  • the PDIs may be stored in association with a second data store.
  • the second data store may be the same as the first data store, while in another aspect, the second data store may be different from the first data store.
  • the second data store may comprise a SSD.
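  • The two-stage lookup described above, inverted index first and PDI store second, can be sketched with in-memory dictionaries. The store contents, the DocIds, and the function names are hypothetical, and the PDIs are shown as already-decoded term lists so that the sketch stays self-contained.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical first data store: keyword -> DocIds of documents containing it.
INVERTED_INDEX: Dict[str, Set[str]] = {
    "books": {"doc1", "doc2"},
    "children": {"doc1", "doc2", "doc3"},
}

# Hypothetical second data store: DocId -> PDI (stand-in: decoded body terms).
PDI_STORE: Dict[str, List[str]] = {
    "doc1": ["some", "great", "children's", "books", "include"],
    "doc2": ["although", "books", "by", "children", "authors", "are", "rare"],
    "doc3": ["children", "enjoy", "playgrounds"],
}


def candidate_doc_ids(inverted_index: Dict[str, Set[str]],
                      keywords: Iterable[str]) -> Set[str]:
    """Return DocIds of documents containing one or more of the keywords."""
    matches: Set[str] = set()
    for keyword in keywords:
        matches |= inverted_index.get(keyword.lower(), set())
    return matches


def fetch_pdis(pdi_store: Dict[str, List[str]],
               doc_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Look up the PDI for each candidate DocId, skipping any that are missing."""
    return {doc_id: pdi_store[doc_id]
            for doc_id in sorted(doc_ids) if doc_id in pdi_store}
```

  • Only the documents selected by the inverted index have their PDIs fetched, which keeps the more expensive contextual analysis limited to a small candidate set.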
  • the contextual analysis component analyzes each document in the set to determine one or more contextual meanings associated with each document.
  • the contextual analysis involves locating instances of the keyword terms in a document and constructing a contextual window around the keyword term(s) that includes the keyword term(s) and one or more surrounding non-search query terms.
  • the contextual window may include a predetermined number of terms or may include a variable number of terms sufficient to establish a contextual meaning for the contextual window.
  • a contextual meaning is determined.
  • a single document may contain multiple contextual windows. Each of the multiple contextual windows may share the same contextual meaning, different contextual meanings, or a combination of both.
  • An overall contextual meaning of the document may be determined by analyzing in-order the multiple contextual windows.
  • each document in the set of documents is ranked based on the respective contextual meaning of the document as compared to the contextual meaning associated with the search query. Those documents having contextual meanings that are relevant to the search query contextual meaning are promoted in the ranking, while documents having contextual meanings that are less relevant or different from the search query contextual meaning are demoted in the ranking.
  • the ranked set of documents is presented as a rank-ordered list on a search results page.
  • the method 500 may further comprise classifying each document in the set of documents into categories based on the contextual meaning associated with the contextual window of the document. For example, a contextual window in one document may comprise the phrase, “Traveling in Hong Kong is exciting,” where “Kong” is the search query keyword term. This document may be classified in the travel category. A contextual window in a second document may comprise the phrase, “Kong toys are a favorite of dogs,” where “Kong” is again the search query keyword term. This document may be classified in the pet category. Documents that share the same category are clustered together and are presented together on the search results page. Thus, the documents classified in the travel category would be presented in a group, and the documents classified in the pet category would be presented in a separate group.
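  • The clustering step can be sketched as a simple grouping by category. The sketch assumes that each document has already been assigned a category label by some classifier, which is not modeled here; the DocIds and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_by_category(classified: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (DocId, category) pairs by category.

    Documents keep their incoming (ranked) order inside each cluster, and
    clusters appear in the order their category is first seen.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for doc_id, category in classified:
        clusters[category].append(doc_id)
    return dict(clusters)
```

  • For the “Kong” example, documents labeled as travel and documents labeled as pet would end up in two separate groups on the results page.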
  • the method 500 may also include analyzing the structural pattern of terms in the contextual window and comparing it to the structural pattern of terms in the search query.
  • the structural pattern is the arrangement of terms within a contextual window and/or a search query. For instance, the structural pattern of the phrase “the dog ran after the cat” is different from the structural pattern of the phrase “the cat ran after the dog” although each contains the same terms. Documents whose contextual windows have terms in the same or similar structural pattern as that of the search query may be ranked higher than documents whose contextual windows have terms in a structural pattern different from that of the search query.
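  • One possible way to score structural pattern similarity is the longest common subsequence (LCS) of terms, which rewards shared terms that appear in the same order. The description does not name a specific measure, so the use of LCS, the normalization by query length, and the function names are assumptions made for this sketch.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two term lists."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def structural_similarity(query_terms: List[str], window_terms: List[str]) -> float:
    """Fraction of query terms matched, in order, by the contextual window."""
    if not query_terms:
        return 0.0
    return lcs_length(query_terms, window_terms) / len(query_terms)
```

  • Under this measure a window reading “the dog ran after the cat” matches the identical query fully, while “the cat ran after the dog” matches only four of the six query terms in order and would therefore be ranked lower.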
  • the methods described above are representative examples of the different types of contextually-relevant search results that can result from using resources associated with the PDIs.
  • the array of attributes, context, and in-order contextual windows present in the PDI can be utilized in new ways to produce ever more meaningful search results that match user intent.

Abstract

Methods, computer systems, and computer-storage media for generating a per-document index used for semantic searching are provided. A document is received and parsed into a plurality of sections. Each term in each section is translated in order to at least one of a cache index or a term identifier. Subsequent to translating the terms, each section is separately group encoded to generate the per-document index. The per-document index is stored in association with a data store.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application, having attorney docket number 337674.02/MFCP.230961 and entitled “Per-Document Index for Semantic Searching,” is a continuation application of pending U.S. application Ser. No. 13/687,135, filed Nov. 28, 2012 and entitled “Per-Document Index for Semantic Searching.” The entirety of the afore-mentioned application is incorporated by reference herein.
    BACKGROUND
  • Traditional search ranking algorithms rely on an inverted index to match keywords extracted from search queries to keywords associated with one or more documents. Inverted indices store a mapping from content, such as keywords, to its location in a database file, or in a document or set of documents. Those documents having keywords that match search query keywords are returned as search results.
  • Search ranking algorithms have been developed, however, that rely on additional information in documents besides keywords in order to return more contextually-meaningful search results that better match user intent. The requirements of these new algorithms along with the ever-increasing size of Web data can present issues regarding the storage of document information.
    SUMMARY
  • This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
  • Aspects of the present invention relate to systems, methods, and computer-storage media for, among other things, generating a per-document index (PDI) used for semantic searching and search ranking. The PDI is forward encoded which preserves the semantic and contextual information of the original document including keywords and surrounding non-keyword terms; this semantic information provides valuable indicators as to the underlying meaning of the document. The PDI is generated in such a way that rich per-document information, including semantic and/or contextual information, of different kinds can be efficiently stored. As well, the information can be accessed and utilized at the time a search query is received without significant search-time penalties.
  • In one aspect, the PDI is generated by parsing a document into a plurality of sections and, for each section, translating in order each term to either a cache index or a term identifier. Subsequent to translating each term, each section is group encoded and stored in a data store. A PDI generated in this manner can be used in combination with an inverted index to identify contextually-relevant search results. For example, a search query is received which comprises keyword terms and surrounding non-keyword terms having a contextual meaning. The inverted index is accessed, and the search query keyword terms are used to identify a set of documents that contains the keywords. A data store storing PDIs is then accessed, and the PDIs for the set of documents are identified. For each of the documents in the set, the keyword terms in the document are located, and the keyword terms along with surrounding non-keyword terms are analyzed to determine a respective contextual meaning. Those documents that have respective contextual meanings most relevant to the contextual meaning associated with the search query are identified, ranked, and presented in an ordered list to a user on a search results page. The result is a list which better represents contextually-relevant search results than when using an inverted index by itself.
    BRIEF DESCRIPTION OF THE DRAWINGS
  • The present invention is described in detail below with reference to the attached drawing figures, wherein:
  • FIG. 1 is a block diagram of an exemplary computing environment suitable for use in implementing embodiments of the present invention;
  • FIG. 2 is a block diagram of an exemplary system for generating a per-document index suitable for use in implementing embodiments of the present invention;
  • FIG. 3 depicts an illustrative data table suitable for storing mappings between document sections and term identifiers or cache indexes in accordance with an embodiment of the present invention;
  • FIG. 4 is a flow diagram that illustrates an exemplary method of generating a per-document index for semantic searching in accordance with an embodiment of the present invention; and
  • FIG. 5 is a flow diagram that illustrates an exemplary method of utilizing a per-document index in combination with an inverted index to identify contextually-relevant search results in accordance with an embodiment of the present invention.
    DETAILED DESCRIPTION
  • The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms “step” and/or “block” may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
  • Aspects of the present invention relate to systems, methods, and computer-storage media for, among other things, generating a per-document index (PDI) used for semantic searching, search confirmation, and/or search ranking. The PDI is forward encoded which preserves the contextual information along with other information associated with the document. This information may include keywords, surrounding non-keyword terms, annotations associated with the document, metadata associated with the document, and the like. All this information provides valuable indicators as to the underlying meaning of the document. The PDI is generated in such a way that rich per-document information, including semantic and/or contextual information, of different kinds can be efficiently stored. As well, the information can be accessed and utilized at the time a search query is received without significant search-time penalties.
  • In one aspect, the PDI is generated by parsing a document into a plurality of sections and, for each section, translating in order each term to either a cache index or a term identifier. As used throughout this application, the term “document” includes the document in its original or native form along with any information that may have been added or associated with the document. This information may include annotations and/or metadata that help to augment the understanding of the document's contents and context. Subsequent to translating each term, each section is group encoded and stored in a data store. The PDI data structure may comprise a plurality of sections with each section comprising an in-order arrangement of at least one of terms, term identifiers or cache indexes. The sections include a document data section comprising terms that identify the document, a custom-dictionary section comprising document-specific terms, a body section comprising terms in the document body, one or more meta-streams sections comprising terms associated with at least one of document layout, anchor uniform resource locators (URLs), or domain URLs, and one or more meta-word sections comprising terms associated with identified attributes and/or known contexts of the document.
  • A PDI generated in this manner can be used in combination with an inverted index to identify contextually-relevant search results. For example, a search query is received which comprises keyword terms and surrounding non-keyword terms having a contextual meaning. The inverted index is accessed, and the search query keyword terms are used to identify a set of documents that contains the keywords. A data store storing PDIs is then accessed, and the PDIs for the set of documents are identified, retrieved, and decoded. For each of the documents in the set, the keyword terms in the document are located, and the keyword terms along with surrounding non-keyword terms are analyzed to determine a respective contextual meaning. Those documents that have respective contextual meanings most relevant to the contextual meaning associated with the search query are identified, ranked, and presented in an ordered list to a user on a search results page. The result is a search result list which better represents contextually-relevant search results than when using an inverted index by itself.
  • An exemplary computing environment suitable for use in implementing embodiments of the present invention is described below in order to provide a general context for various aspects of the present invention. Referring to FIG. 1, such an exemplary computing environment is shown and designated generally as computing device 100. The computing device 100 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention. Neither should the computing device 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated.
  • Embodiments of the invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program modules, including routines, programs, objects, components, data structures, etc., refer to code that performs particular tasks or implements particular abstract data types. Embodiments of the invention may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialty computing devices, and the like. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
  • With continued reference to FIG. 1, the computing device 100 includes a bus 110 that directly or indirectly couples the following devices: a memory 112, one or more processors 114, one or more presentation components 116, one or more input/output (I/O) ports 118, I/O components 120, and an illustrative power supply 122. The bus 110 represents what may be one or more busses (such as an address bus, data bus, or combination thereof). Although the various blocks of FIG. 1 are shown with lines for the sake of clarity, in reality, delineating various components is not so clear, and metaphorically, the lines would more accurately be grey and fuzzy. For example, one may consider a presentation component such as a display device to be an I/O component. Additionally, many processors have memory. The inventors hereof recognize that such is the nature of the art, and reiterate that the diagram of FIG. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments of the present invention. Distinction is not made between such categories as “workstation,” “server,” “laptop,” “hand-held device,” etc., as all are contemplated within the scope of FIG. 1 and reference to “computer” or “computing device.”
  • The computing device 100 typically includes a variety of computer-readable media. Computer-readable media may be any available media that is accessible by the computing device 100 and includes both volatile and nonvolatile media, removable and non-removable media. Computer-readable media comprises computer storage media and communication media; computer storage media excludes signals per se. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 100. Communication media, on the other hand, embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
  • The memory 112 includes computer-storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, solid state drives (SSDs), and the like. The computing device 100 includes one or more processors that read data from various entities such as the memory 112 or the I/O components 120. The presentation component(s) 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, and the like.
  • The I/O ports 118 allow the computing device 100 to be logically coupled to other devices including the I/O components 120, some of which may be built in. Illustrative components include a microphone, a camera, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.
  • Aspects of the subject matter described herein may be described in the general context of computer-executable instructions, such as program modules, being executed by a mobile device. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. Aspects of the subject matter described herein may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
  • Furthermore, although the term “server” is often used herein, it will be recognized that this term may also encompass a search engine, a Web browser, a set of one or more processes distributed on one or more computers, one or more stand-alone storage devices, a set of one or more other computing or storage devices, a combination of one or more of the above, and the like.
  • With this as a background and turning to FIG. 2, an exemplary system 200 is depicted for use in generating a per-document index (PDI) for use in semantic or contextual searching and search ranking. The system 200 is merely an example of one suitable system environment and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the present invention. Neither should the system 200 be interpreted as having any dependency or requirement related to any single module/component or combination of modules/components illustrated therein.
  • The system includes a per-document index service 210, a data store 212, and an end-user computing device 214 all in communication with one another via a network 216. The network 216 may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet. Accordingly, the network 216 is not further described herein.
  • In some embodiments, one or more of the illustrated components/modules may be implemented as stand-alone applications. In other embodiments, one or more of the illustrated components/modules may be integrated directly into, for example, the operating system of the end-user computing device 214 or the per-document index service 210. The components/modules illustrated in FIG. 2 are exemplary in nature and in number and should not be construed as limiting. Any number of components/modules may be employed to achieve the desired functionality within the scope of embodiments hereof. Further, components/modules may be located on any number of servers. By way of example only, the per-document index service 210 might reside on a server, a cluster of servers, or a computing device remote from one or more of the remaining components.
  • It should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components/modules, and in any suitable combination and location. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory.
  • In one aspect, the data store 212 is configured to store one or more per-document indexes (PDIs) or forward indexes (for the purposes of this application, the two terms are used interchangeably). Each PDI stores a forward-encoded document. A document comprises a Web page, a collection of Web pages, representations of documents (e.g., a PDF file), and the like. As mentioned above, a document comprises the document in its native or original form along with any annotations and/or metadata that have been associated with the document. Forward encoding preserves not only the keywords associated with the original document but also the contextual and other information associated with the document including the contextual order of the document. The PDI is structured in such a way as to allow access to both keyword terms and the context surrounding those terms at the time a search query is received without significant search-time penalties. In one aspect, the data store 212 comprises a solid state drive (SSD) that stores the PDI persistently. The data store 212 may also be configured as an inverted index that maps keyword terms to documents that contain those keyword terms.
  • Additionally, the data store 212 is configured to store information used by the per-document index service 210. For instance, the data store 212 may store section-specific, prefix-specific, culture-specific, language-specific, or custom dictionaries for use in translating terms in a document section to term identifiers. The data store 212 may store attributes and/or known contexts that are identified for each document such as, for example, category classifications of documents, social media references, location information associated with the document, originating language of the document, fingerprint information associated with the document, and the like. These aspects will be explored in greater depth below. The data store 212 may also store information concerning recorded search behavior (query logs, rating logs, browser or search logs, query click logs, browsing times, related search lists, etc.) of users in general. Query click logs provide information on documents selected by users in response to a search query, browsing times provide information on the estimated total time users spend browsing a document, while browser/search logs provide information on documents viewed by users during a search session and how frequently any one document is visited by users. Additionally, rating logs indicate an importance or ranking of a document based on, for example, various rating algorithms known in the art. This information may be used by the per-document index service 210 to carry out various search result ranking algorithms.
  • The information stored in association with the data store 212 is configured to be searchable for one or more items of information stored in association therewith. The content and volume of such information in the data store 212 are not intended to limit the scope of embodiments of the present invention in any way. Further, though illustrated as a single, independent component, the data store 212 may, in fact, be a plurality of storage devices, for instance, a database cluster, portions of which may reside on the per-document index service 210, the end-user computing device 214, and/or any combination thereof.
  • The end-user computing device 214 shown in FIG. 2 may be any type of computing device, such as, for example, the computing device 100 described above with reference to FIG. 1. By way of example only and not limitation, the end-user computing device 214 may be a personal computer, desktop computer, laptop computer, handheld device, mobile handset, consumer electronic device, or the like. It should be noted, however, that embodiments are not limited to implementation on such computing devices, but may be implemented on any of a variety of different types of computing devices within the scope of embodiments hereof. The end-user computing device 214 may receive inputs through a variety of means such as voice, touch, and/or gestures. The end-user computing device includes a display screen. The display screen is configured to present information, including search results, to the user of the end-user computing device 214.
  • The system 200 is merely exemplary. While the per-document index service 210 is illustrated as a single unit, it will be appreciated that the per-document index service 210 is scalable. For example, the per-document index service 210 may in actuality include a plurality of computing devices in communication with one another. Moreover, the data store 212, or portions thereof, may be included within, for instance, the per-document index service 210 as a computer-storage medium. The single unit depictions are meant for clarity, not to limit the scope of embodiments in any form.
  • As shown in FIG. 2, the per-document index service 210 comprises a receiving component 218, a parsing component 220, a translation component 222, an encoding component 224, a decoding component 226, and a contextual analysis component 228. In some embodiments, one or more of the components 218, 220, 222, 224, 226, and 228 may be implemented as stand-alone applications. In other embodiments, one or more of the components 218, 220, 222, 224, 226, and 228 may be integrated directly into the operating system of a computing device such as the computing device 100 of FIG. 1. It will be understood that the components 218, 220, 222, 224, 226, and 228 illustrated in FIG. 2 are exemplary in nature and in number and should not be construed as limiting. Any number of components may be employed to achieve the desired functionality within the scope of embodiments hereof.
  • The receiving component 218 is configured to receive one or more search queries inputted by a user. The search queries may be inputted on a search engine page, a search box on a Web page, and the like. The search query may comprise one or more terms arranged in a defined grammatical pattern or sequence. Some of the terms may comprise keyword terms, while other terms may join the keyword terms, surround the keyword terms, or act as qualifiers of the keyword terms. For the purposes of this application, terms that join or surround keyword terms or act as qualifiers of the keyword terms are known as surrounding terms. For instance, the search query “books for children” may be considered to have two keywords, “books” and “children,” and a surrounding word, “for.” The word “for” provides context for the search query but is often ignored by traditional ranking algorithms. By way of contrast, the search query “books by children” contains the same two keywords as the search query “books for children,” but the surrounding word “by” completely changes the semantic meaning of the search query. In another example, the presence of a qualifier may change the semantic or contextual meaning of the search query. For instance, the search query “non-profit organizations” has a different contextual meaning than the search query “for-profit organizations” although the two search queries may share the same keywords. For the purposes of this application, the terms “semantic” and “contextual” are used interchangeably. Both terms refer to the underlying meaning of a group of words or a phrase.
  • The receiving component 218 is further configured to receive one or more documents. The documents may be received in response to a Web crawler executing a crawl and extracting documents. As well, the receiving component 218 may receive the documents from a data store such as the data store 212, a document source, or a third-party source. Documents comprise the original document along with any annotations and/or metadata that have been associated with or added to the document.
  • The parsing component 220 is configured to parse the document into a plurality of sections. The sections may include a document data section, one or more meta-word sections, one or more meta-streams sections, and a body section. The document data section comprises terms or common attributes that identify the document. Such terms may include the author(s), a date the document was created, the reading level, whether the document is classified as “adult content,” ratings associated with the document, spam likelihood, the number of pages of the document, and the like.
  • The body section comprises terms in the main body of the document, and the meta-streams section(s) comprises terms that occur in, for example, the title of the document; headers, section descriptions, footnotes, and endnotes associated with the document; and the like. The meta-streams section(s) may also include anchor uniform resource locators (URLs) that comprise URLs that are determined to reference the document in question, as well as descriptions or phrases found at the anchor URL locations which are believed to be descriptive of the document. Additionally, the meta-streams section(s) may include domain URLs that comprise URLs within the document that link to other documents (e.g., hyperlinks), and any URLs associated with the document itself. In general, the meta-streams section(s) provides information regarding the layout of the document and information concerning how the document relates to other documents. In one aspect, the meta-streams section may comprise one section. In another aspect, there may be one meta-streams section for terms that define the structure or layout of the document (e.g., title, headers, section descriptors), and another meta-streams section that provides information on how the document relates to other documents (e.g., anchor URLs, hyperlinks, and the like). It is contemplated that there may be more than two meta-streams sections with each section providing information related to document structure and/or document relatedness to other documents. Any and all such aspects are contemplated as being within the scope of the invention.
  • The meta-words section(s) comprises terms associated with attributes and/or determined contexts of the document. With respect to the meta-words section(s), the parsing component 220 is configured to determine or identify attributes associated with the document and to generate meta-word terms for the identified attributes. There may be many types of attributes associated with a document. For example, one attribute may comprise fingerprints associated with the document. The fingerprint of the document is a snapshot of a portion of the document that uniquely defines some aspect of the document. The fingerprint of a document may be compared to fingerprints of other documents to determine duplication or plagiarism. Another attribute may include location information associated with the document. For instance, the parsing component 220 determines the country, state, zip code, and the originating language of the document. Another attribute may include a determined category associated with the document (e.g., if the document is directed to common home repairs, it may be classified in the home improvement category).
  • The parsing component 220 is configured to generate the terms that describe the attributes. In one aspect, each term comprises a prefix that describes the type of attribute and a value associated with the attribute. For instance, with respect to a zip code associated with the document, the prefix may be “ZIP” and the value may be the actual zip code such as “98052.” Thus, the complete meta-word term may comprise “meta_ZIP98052.” In another aspect, each term may comprise just a prefix that describes the type of attribute. Any and all such aspects are contemplated as being within the scope of the invention.
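  • By way of illustration only, the construction of a meta-word term from a prefix and an optional value may be sketched as follows. The function name `make_meta_word` is hypothetical, and the "meta_" marker is simply taken from the example term above; an actual embodiment may format meta-word terms differently.

```python
def make_meta_word(prefix, value=None):
    """Build a meta-word term from an attribute prefix and an optional value.

    The "meta_" marker mirrors the example term in the text; the function
    itself is a hypothetical helper, not part of any described embodiment.
    """
    term = "meta_" + prefix
    if value is not None:
        # Prefix-plus-value aspect, e.g. a zip code attribute.
        term += str(value)
    # When value is None, the term consists of the prefix alone.
    return term


zip_term = make_meta_word("ZIP", "98052")
prefix_only_term = make_meta_word("ZIP")
```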
  • The parsing component is additionally configured to identify known or determined contexts associated with the document. A known or determined context may comprise blog posts related to the document, social media (Facebook®, Twitter®, Instagram®, etc.) comments, likes, or postings associated with the document, applications that reference the document, and the like. By way of illustrative example, the document may be a Web page describing a restaurant. A celebrity included the restaurant name in a blog posting. The portion of the blog associated with the restaurant is identified by the parsing component 220 and is included in a meta-word section. In one aspect, the meta-words section may comprise one section. In another aspect, there may be more than one meta-word section with each section corresponding to a type of attribute or a known context. Any and all such aspects are contemplated as being within the scope of the invention.
  • The parsing component 220 is further configured to generate a custom-dictionary section by identifying terms in the document that do not ordinarily occur outside of the context of the document. The custom-dictionary section comprises document-specific terms at specified positions within the dictionary.
  • The translation component 222 is configured to translate or encode in-order (e.g., forward encode) each term in the document data, body, meta-words, and meta-streams sections to either a cache index or a term identifier. Each section is encoded separately by the translation component 222. In-order encoding encodes the terms as they appear in the document and yields the position of the term in the document implicitly. This can be contrasted with an inverted index where the position of the term in the document is explicitly encoded.
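  • The contrast between in-order (forward) encoding and an inverted index may be illustrated with the following minimal sketch. The sample terms and the small dictionary are assumptions chosen for illustration. In the forward encoding, the position of a term is simply its index in the sequence; in the inverted structure, each position is stored explicitly.

```python
# Illustrative section content and dictionary (both are assumptions).
terms = ["books", "by", "children", "books"]
dictionary = {"books": 0, "by": 1, "children": 2}

# Forward (in-order) encoding: one identifier per term, in document order.
# The position of each term is implicit in its list index.
forward = [dictionary[t] for t in terms]

# Inverted structure: each term maps to an explicit list of positions.
inverted = {}
for position, term in enumerate(terms):
    inverted.setdefault(term, []).append(position)
```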
  • The translation component 222 forward encodes each section by accessing a section-specific dictionary that comprises terms commonly associated with the particular section. Each term is at an identified position within the section-specific dictionary, and the position of the term within the dictionary comprises the term identifier for the term. Thus, the term “the” may be found at position 5 within the dictionary; the term identifier for the word “the” would then comprise the number “5.” The translation component 222 then replaces the term with its corresponding term identifier. Using term identifiers enables the terms in the document to be compressed, thus taking up less storage space. Further, term identifiers are easy to encode and decode which speeds up the retrieval of search results.
  • In one aspect, the term identifiers may comprise numerical values. Frequently-used words are associated with smaller term identifiers while less frequently-used words are associated with larger term identifiers. By way of illustrative example, the term “the” may be associated with the term identifier of “5,” while the less-frequently used word “especially” may be associated with the term identifier of “1230.” Depending on the popularity of the word, the term identifier may be 1-byte, 2-byte, or 3-byte. As mentioned, each section generally has its own section-specific dictionary. However, the body and meta-streams sections may share the same dictionary because of the commonality of terms between these two sections.
  • In general, the section-specific dictionaries comprise commonly-used section terms. The section-specific dictionaries may be used to generate term identifiers for a plurality of documents. Terms that are unique to the document may not be found in the section-specific dictionary. In this case, the translation component 222 accesses the custom dictionary to identify the respective position of the document-specific term and uses this position as the term identifier for the document-specific word.
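  • The dictionary lookup described above, including the fallback to the custom dictionary, may be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the tagging of each output with its source dictionary, the sample terms, and the byte-width thresholds are all hypothetical and are not specified in this description.

```python
def build_dictionary(terms_by_frequency):
    """Map each term to its position; earlier (more frequent) terms get smaller identifiers."""
    return {term: position for position, term in enumerate(terms_by_frequency)}


def translate_section(terms, section_dict, custom_dict):
    """Translate terms in order, falling back to the custom dictionary.

    Tagging each identifier with its source dictionary is an assumption made
    here so the two identifier spaces cannot be confused.
    """
    encoded = []
    for term in terms:
        if term in section_dict:
            encoded.append(("section", section_dict[term]))
        else:
            # Document-specific term: use its position in the custom dictionary.
            encoded.append(("custom", custom_dict[term]))
    return encoded


def id_width_bytes(term_id):
    """Assumed width rule: the smallest of 1, 2, or 3 bytes that holds the identifier."""
    for width in (1, 2, 3):
        if term_id < 1 << (8 * width):
            return width
    raise ValueError("identifier does not fit in 3 bytes")


section_dict = build_dictionary(["the", "of", "and", "especially"])
custom_dict = build_dictionary(["zorblax"])  # hypothetical document-specific term
encoded = translate_section(["the", "zorblax", "especially"], section_dict, custom_dict)
```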
  • With respect to the meta-words section(s), a dictionary may be specialized for a prefix, or may be shared by several prefixes. Alternatively, there may not be a dictionary for a particular prefix. In this case, the prefix is not translated into a term identifier but is stored in its native form. The term identifier for a value associated with a prefix is dependent upon the dictionary used for the value's prefix. By way of illustrative example, the prefix “ZIP” may be encoded using a zip code-specific dictionary. Values associated with the prefix would be encoded using the same zip code-specific dictionary.
  • With respect to the body and meta-streams sections, the translation component 222 is further configured to process the term identifiers through an entry cache using methods known in the art. When there is a cache miss, the original term identifier is output. If there is a cache hit, a cache index for the term is output. If there is an initial cache miss for a term identifier, the cache allocates a new entry and copies in data from memory. Thus, next time the term identifier is processed, a cache index for that term is outputted. In one aspect, the entry cache is 1,024 terms, and the cache index may be 10 bits. The use of an entry cache further compresses the document data in the PDI.
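  • The hit and miss behavior of the entry cache may be sketched as follows. The class name `EntryCache` and the round-robin replacement of entries once the cache is full are assumptions; the description does not specify a replacement policy. A decoder would need to apply the same policy to recover term identifiers from cache indexes.

```python
class EntryCache:
    """Fixed-size cache mapping term identifiers to cache indexes (slots)."""

    def __init__(self, size=1024):
        self.size = size
        self.slot_of = {}              # term identifier -> slot
        self.id_in = [None] * size     # slot -> term identifier
        self.next_slot = 0             # next slot to fill (round-robin, assumed)

    def process(self, term_id):
        slot = self.slot_of.get(term_id)
        if slot is not None:
            # Cache hit: output the cache index instead of the identifier.
            return ("cache_index", slot)
        # Cache miss: allocate an entry, evicting any previous occupant,
        # and output the original term identifier.
        slot = self.next_slot
        evicted = self.id_in[slot]
        if evicted is not None:
            del self.slot_of[evicted]
        self.id_in[slot] = term_id
        self.slot_of[term_id] = slot
        self.next_slot = (slot + 1) % self.size
        return ("term_id", term_id)


# A tiny cache makes the eviction behavior visible.
cache = EntryCache(size=2)
outputs = [cache.process(t) for t in [5, 1230, 5, 7, 5]]
```

With the default size of 1,024 entries, the largest slot number is 1,023, which fits in a 10-bit cache index.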
  • The encoding component 224 is configured to group encode the different sections using methods known in the art. In general, a section stream is processed 32 bits at a time and contains three operational codes to reduce unpredictable branches during the decoding process. The six high bits of the term are shifted down and used to switch into one of 27 possible packings. Each operational code can indicate: 1) a 10-bit cache index for a meta-streams or body term from the cache; 2) a term identifier for a body term or meta-streams term that was not present in the cache, or term identifiers or terms for terms in the other sections (e.g., meta-words section, document data section, and/or custom-dictionary section); 3) constants for use in position or metadata; and 4) other features of the PDI such as section boundaries, boundaries within the document, and the like. The decoding component 226 is configured to decode the data using, for example, the operational codes.
  • After encoding, each term in each section may be viewed as having three parts: 1) the position of the term in the document (this is implicitly encoded); 2) the term identifier or a cache index; and 3) any metadata associated with the term. Metadata may include type of font, capitalization, bolding, italicizing, underlining, punctuation, position of the term at the end of a sentence, section or document, and the like. Metadata may also include an attribute of the term and its associated value. For instance, an attribute of a term may include a “date” or a “reading level.” These attributes are associated with values. For instance, the “date” attribute may have a value of “Aug. 12, 2012,” and the “reading level” attribute may have a value of “4.” Up to 64 bits of metadata may be associated with a term; the metadata may be used in ranking algorithms.
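  • One way to picture the three parts of an encoded term is the following sketch, in which per-term metadata is packed into an integer of at most 64 bits. The flag names and bit assignments are purely hypothetical; the description states only that up to 64 bits of metadata may be associated with a term.

```python
from enum import IntFlag


class TermMeta(IntFlag):
    """Hypothetical metadata flags; bit positions are illustrative only."""
    CAPITALIZED = 1 << 0
    BOLD = 1 << 1
    ITALIC = 1 << 2
    UNDERLINED = 1 << 3
    END_OF_SENTENCE = 1 << 4


MAX_META_BITS = 64


def pack_meta(flags):
    """Return the metadata as an integer, checking the 64-bit limit."""
    value = int(flags)
    if value.bit_length() > MAX_META_BITS:
        raise ValueError("metadata exceeds 64 bits")
    return value


# Each entry is (term identifier, metadata). The position of the term is
# implicit: it is the index of the entry within the section.
encoded_section = [
    (5, pack_meta(TermMeta.CAPITALIZED)),
    (1230, pack_meta(TermMeta.BOLD | TermMeta.END_OF_SENTENCE)),
]
positions = [position for position, _ in enumerate(encoded_section)]
```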
  • The contextual analysis component 228 is configured to perform a number of actions. The contextual analysis component 228 is configured to determine a contextual meaning of a search query by analyzing keyword terms as well as any surrounding terms of a search query received by the receiving component 218. The contextual analysis component 228 is also configured to use the search query keyword terms to identify a set of documents that contains the keyword terms; this may be accomplished by accessing and utilizing an inverted index stored in association with the data store 212. Once the set of documents is identified, the contextual analysis component 228 accesses the PDIs associated with the set of documents and instructs the decoding component 226 to decode the PDIs associated with the documents. The contextual analysis component 228 then performs contextual analysis of the documents.
  • In one aspect, for instance, the contextual analysis component 228 locates the instances of the keyword term(s) in each decoded document, constructs a contextual window around the keyword term, and determines a contextual meaning for the contextual window. The contextual window may include a predetermined number of terms surrounding the keyword. In another aspect, the size of the contextual window may be determined on the fly by continually including terms surrounding the keyword term until a contextual meaning is established. For instance, the document may contain the following phrase, “Books by children authors are rare.” The keyword terms are “Books” and “children.” In this case the contextual analysis component 228 would include all the terms in the sentence in the contextual window because all of the terms establish a contextual meaning for the phrase.
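  • The fixed-size variant of the contextual window may be sketched as follows. The function name `contextual_windows` and the `radius` parameter (the predetermined number of terms taken on each side of a keyword) are assumptions. The on-the-fly variant, which grows the window until a contextual meaning is established, is not shown because the description does not specify how that condition is tested.

```python
def contextual_windows(terms, keywords, radius=2):
    """Return (keyword position, window terms) for each keyword occurrence.

    The window holds up to `radius` terms on each side of the keyword and is
    clipped at the section boundaries.
    """
    keyword_set = {k.lower() for k in keywords}
    windows = []
    for position, term in enumerate(terms):
        if term.lower() in keyword_set:
            start = max(0, position - radius)
            end = min(len(terms), position + radius + 1)
            windows.append((position, terms[start:end]))
    return windows


decoded_terms = "Books by children authors are rare".split()
windows = contextual_windows(decoded_terms, ["books", "children"], radius=2)
```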
  • Once the contextual windows have been constructed for the document(s) and the contextual meanings of the windows have been determined, the contextual analysis component 228 is configured to rank the documents within the set based on how relevant each document's contextual window is compared to the contextual meaning of the search query. Those documents having contextual windows that are most relevant to the contextual meaning of the search query are promoted to a higher ranking, while those documents having contextual windows that are unrelated or not relevant to the contextual meaning of the search query are demoted to a lower ranking. The set of documents is then presented as a rank-ordered list on a search results page.
  • By way of illustrative example, a search query may comprise the phrase, “Books by children.” The contextual meaning of this search query can be surmised as “books written by children authors.” An inverted index is utilized to identify, for example, two documents that contain the search query keyword terms of “books” and “children.” In the first document, a contextual window is constructed that includes the phrase “some great children's books include,” while a contextual window in a second document is constructed that includes the phrase “although books by children authors are rare, some examples include.” As can be seen, the contextual meaning associated with the second document is more relevant to the contextual meaning of the search query as compared to the contextual meaning associated with the first document. The second document would be ranked higher than the first document and presented before the first document on a search results page.
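  • The ranking in the example above may be approximated with the following sketch. The description does not specify how a contextual meaning is computed; as a stand-in, this sketch counts adjacent term pairs shared between the search query and each contextual window, which is sensitive to surrounding words such as "by." The function names and the scoring rule are assumptions.

```python
import re


def bigrams(text):
    """Return the set of adjacent lowercase term pairs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return set(zip(tokens, tokens[1:]))


def rank_by_context(query, windows_by_doc):
    """Order documents by how many query term pairs their window shares (stand-in score)."""
    query_pairs = bigrams(query)
    scores = {
        doc: len(query_pairs & bigrams(window))
        for doc, window in windows_by_doc.items()
    }
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)


windows_by_doc = {
    "first document": "some great children's books include",
    "second document": "although books by children authors are rare, some examples include",
}
ranking = rank_by_context("Books by children", windows_by_doc)
```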
  • Continuing, the contextual analysis component 228 is also configured to use the contextual windows to classify documents by category. For example, a document's contextual window may comprise the phrase “hammers and nails are useful tools to install home windows,” where “hammer” and “nail” are keyword terms. The contextual analysis component 228 may classify this document in the “home improvement” category. By contrast, a document's contextual window may contain the same keywords but may be classified in a different category. For instance, a document's contextual window may comprise the phrase “We sell hammers and nails at Miller Hardware.” The contextual analysis component 228 may classify this document in the “retail” category. The contextual analysis component 228 is further configured to cluster documents with similar categories and present these document clusters together on a search results page.
  • In yet another example, the contextual analysis component 228 is configured to identify documents whose contextual windows have keyword terms and surrounding terms that share structural pattern similarity with the search query. These documents may be ranked higher than documents whose contextual windows do not share structural pattern similarity with the search query. This is especially useful when a user inputs a long search query phrase. By way of illustrative example, a user may input a query such as “How do I get a taxi at the Beijing airport.” Documents that contain the same terms in substantially the same order as the search query may be ranked higher than those documents that contain the terms in a different order.
  • Turning now to FIG. 3, an exemplary data table 300 is shown representing relationships between document sections and term identifiers and/or cache indexes in a PDI. This data structure is illustrative and exemplary in nature and is not meant to be limiting in any way. The purpose of FIG. 3 is to portray concepts and relationships between various data elements and not the actual arrangement of data in a data store. As such, textual descriptions (e.g., “cacheIndex1”) are used in place of actual term identifier values and/or cache indexes.
  • As shown in table 300, a document 310 is stored in its PDI as a plurality of sections such as a document data section 312, a custom-dictionary section 314, a meta-words section 316, a body section 318, and a meta-streams section 320. The document data section 312 comprises terms or attributes that identify the document 310. Each of the terms in the document data section 312 is associated with a term identifier using, for example, a section-specific dictionary. The term identifiers are encoded in the order they appear in the document data section 312 as shown by term identifiers 322. In this case, the document data term identifiers are the same at positions one and four in the document data section (e.g., D.D.Id1), indicating that the same document data term appears at both of these positions in the document data section 312.
  • The custom-dictionary section 314 comprises an in-order arrangement of document-specific terms 324. Document-specific terms are considered unique to the document 310 and are generally not found in the section-specific dictionaries used for each of the sections. Thus, when one of the document-specific terms is encountered during the translation process, the custom-dictionary section 314 is accessed and the position of that term in the custom dictionary is identified.
  • The meta-words section 316 comprises an in-order arrangement of prefix term identifiers 326 and 328, and, optionally, value term identifiers 328 and 332 associated with identified attributes of the document 310. As discussed above, a prefix-specific dictionary may be used to determine term identifiers for one or more prefixes. The prefix-specific dictionary used to determine term identifiers for a prefix is also used to determine a term identifier for the value associated with the prefix. In one aspect, the meta-words section 316 may comprise multiple sections with each section sharing structural and/or semantic similarity.
  • The body section 318 comprises an in-order arrangement of either cache indexes 336, 338 or term identifiers 334, 340 associated with terms in the body of the document 310. A cache index, such as cache indexes 336 and 338, is stored when there is a cache hit when the term's associated term identifier is processed through the entry cache. However, if there is a cache miss when a term's associated term identifier is processed through the entry cache, the term identifier, such as term identifiers 334 and 340, is stored in the PDI.
  • Likewise, the meta-streams section 320 comprises an in-order arrangement of either cache indexes or term identifiers (represented by numeral 342) associated with terms in the title of the document 310, headers and section headings of the document 310, and/or URLs associated with the document 310. Like above, a cache index is stored when there is a cache hit when the term's associated term identifier is processed through the entry cache. However, the term identifier is stored in the PDI if there is a cache miss when a term's associated term identifier is processed through the entry cache. In one aspect, there may be more than one meta-streams section. For instance, there may be a meta-stream section for repetitive anchor URLs, a meta-stream section for single anchor URLs, a meta-stream section for title terms, and the like.
  • Turning now to FIG. 4, a flow diagram is depicted of an exemplary method 400 of generating a per-document index for semantic searching. At a step 410, a document is received by a receiving component such as the receiving component 218 of FIG. 2. The document may be received from a data store such as the data store 212 of FIG. 2, in response to a Web crawler extracting the document from the World Wide Web, from the document source, or from a third party. The document may comprise one or more Web pages, Web sites, representations of documents (e.g., a PDF file), and the like. The document includes the original document along with any added annotations.
  • At a step 412, the document is parsed into a plurality of sections by a parsing component such as the parsing component 220 of FIG. 2. The sections may comprise a document data section that comprises terms or attributes that identify the document, and a body section comprising terms from the body of the document. The sections also include a meta-streams section(s) that comprises: 1) terms associated with the title, header, footnotes, endnotes, and/or section headings of the document; 2) URLs that reference the document (e.g., anchor URLs), are associated with the document, or are embedded in the document as hyperlinks; and/or 3) context or descriptions found where the anchor URLs are located.
  • Parsing the document also includes identifying attributes associated with the document and associating those attributes with meta-word terms. Document attributes are numerous. Some representative examples include originating language of the document, fingerprints associated with the document, the country, state and zip code of the document, a category corresponding to the document, and the like. Each meta-word term includes a prefix that identifies the type of attribute (e.g., fingerprint, location, category, etc.) and, optionally, a value associated with the attribute. Using the category attribute as an example, an exemplary meta-word may be “meta_CATG_retail,” where “CATG” is the prefix and “retail” is the value.
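  • As a minimal sketch of the meta-word format, the following Python builds and parses strings shaped like “meta_CATG_retail.” The function names, the underscore-delimited layout beyond that single example, and the “LANG” prefix are assumptions made for illustration only.

```python
from typing import NamedTuple, Optional


class MetaWord(NamedTuple):
    prefix: str            # attribute type, e.g. "CATG"
    value: Optional[str]   # optional attribute value, e.g. "retail"


def make_meta_word(prefix: str, value: Optional[str] = None) -> str:
    """Build a meta-word string such as 'meta_CATG_retail'."""
    parts = ["meta", prefix]
    if value is not None:
        parts.append(value)
    return "_".join(parts)


def parse_meta_word(term: str) -> MetaWord:
    """Split a meta-word into its prefix and optional value.

    Assumes the prefix itself contains no underscore; the value may.
    """
    pieces = term.split("_", 2)
    if len(pieces) < 2 or pieces[0] != "meta":
        raise ValueError(f"not a meta-word: {term!r}")
    value = pieces[2] if len(pieces) == 3 else None
    return MetaWord(pieces[1], value)
```

  • Under these assumptions, a meta-word without a value (for example, a hypothetical “meta_LANG”) parses to a prefix with no value, matching the statement that the value is optional.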
  • Parsing the document also includes identifying known contexts associated with the documents. Known contexts may include blog postings that reference the document, social media comments, posts, and likes associated with the document, applications that routinely access the document, and the like. Parsing the document may further include creating a custom-dictionary section that comprises document-specific terms not found in the section dictionaries. Each of the document-specific terms is associated with a position in the custom dictionary.
  • At a step 414, each term in the document data, body, meta-words, and meta-streams sections is translated in order to either a cache index or a term identifier by a translation component such as the translation component 222 of FIG. 2. For each term, a section-specific dictionary is accessed, and the term is identified in the section-specific dictionary. The position of the term in the dictionary is used as the term identifier. If a term is not found in the section-specific dictionary, the custom dictionary is accessed to identify the term and its associated position. Term identifiers comprise numerical values that are 1-byte, 2-byte, or 3-byte. Terms that are frequently used are associated with smaller term identifiers, while terms that are infrequently used are associated with larger term identifiers.
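  • The translation at step 414 can be sketched as follows. This is an illustrative Python sketch, not the actual implementation: dictionaries are modeled as plain mappings from term to position, and tagging each output with the dictionary it came from ("section" or "custom") is an assumption, since the text does not say how the two kinds of position are told apart.

```python
def build_dictionary(terms_by_frequency):
    """Map each term to its position; the most frequent terms come first,
    so they receive the smallest identifiers."""
    return {term: pos for pos, term in enumerate(terms_by_frequency)}


def id_width_bytes(term_id):
    """Number of bytes (1, 2, or 3) needed to hold a term identifier."""
    if term_id < 1 << 8:
        return 1
    if term_id < 1 << 16:
        return 2
    if term_id < 1 << 24:
        return 3
    raise ValueError("term identifier does not fit in 3 bytes")


def translate_section(terms, section_dict, custom_dict):
    """Translate terms in order to (source, position) pairs.

    Terms missing from the section-specific dictionary are looked up in,
    or appended to, the document's custom dictionary.
    """
    out = []
    for term in terms:
        if term in section_dict:
            out.append(("section", section_dict[term]))
        else:
            pos = custom_dict.setdefault(term, len(custom_dict))
            out.append(("custom", pos))
    return out
```

  • Because output order follows input order, each term's position in the section is implied by its place in the list and does not need to be stored separately.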
  • As mentioned, each section has a section-specific dictionary. However, the body and meta-streams sections may share the same section-specific dictionary because these two sections commonly use the same terms. Further, with respect to the meta-words section, each prefix may have a corresponding dictionary, a dictionary may be shared by several prefixes, or a prefix may not have a corresponding dictionary. A prefix that does not have a corresponding dictionary will be encoded in its native form. The dictionary used for a prefix will also be used for any values associated with that prefix.
  • Term identifiers associated with terms in the body section and meta-streams section are further processed by passing the term identifiers through an entry cache. In one aspect, the entry cache holds 1,024 entries. If there is a cache miss, the original term identifier is output. However, if there is a cache hit, a cache index is output; because 1,024 equals 2 to the 10th power, the cache index fits in 10 bits. The use of term identifiers and cache indexes allows the large amount of data associated with a document to be compressed efficiently, which saves storage space.
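  • The entry cache behavior can be illustrated with the sketch below. The cache organization and replacement policy are not specified above; this Python sketch assumes a direct-mapped cache (slot = term identifier modulo 1,024) purely for illustration, and the class and method names are hypothetical.

```python
CACHE_SIZE = 1024  # 2**10 entries, so a cache index fits in 10 bits


class EntryCache:
    """Direct-mapped entry cache (an assumed policy, for illustration)."""

    def __init__(self, size=CACHE_SIZE):
        self.size = size
        self.slots = [None] * size

    def process(self, term_id):
        """Encoder side: ('cache', index) on a hit, ('term', id) on a miss."""
        index = term_id % self.size
        if self.slots[index] == term_id:
            return ("cache", index)
        self.slots[index] = term_id  # remember the identifier for later hits
        return ("term", term_id)

    def resolve(self, kind, value):
        """Decoder side: recover the term identifier, mirroring the
        encoder's cache updates so both caches stay in step."""
        if kind == "cache":
            return self.slots[value]
        self.slots[value % self.size] = value
        return value
```

  • With a deterministic policy of this kind, a decoder that replays the same stream through its own cache instance recovers every original term identifier, so a short cache index can stand in for a 2-byte or 3-byte identifier whenever a term recurs.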
  • At a step 416, each of the sections, including the custom-dictionary section, is group encoded by an encoding component such as the encoding component 224 of FIG. 2 to generate the PDI. In one aspect, the sections are encoded 32 bits at a time. Each 32 bits contains three operational codes that can indicate that the remaining bits comprise one or more 10 bit cache indexes, terms, or term identifiers, data used to identify term position or metadata associated with the term, section and stream boundaries, and the like.
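  • To make the arithmetic of 10-bit cache indexes within 32-bit groups concrete, the following Python sketch packs three cache indexes into one 32-bit word. The layout is a deliberate simplification and an assumption: a single 2-bit tag followed by three 10-bit slots, stored little-endian. It does not reproduce the operational-code scheme described above, whose exact bit layout is not given here, and the tag value is hypothetical.

```python
import struct

TAG_CACHE_INDEXES = 0b01  # hypothetical 2-bit tag meaning "three cache indexes"


def pack_cache_indexes(indexes):
    """Pack exactly three 10-bit cache indexes plus a 2-bit tag into 4 bytes."""
    if len(indexes) != 3 or any(not 0 <= i < 1024 for i in indexes):
        raise ValueError("need exactly three values in the range 0..1023")
    word = TAG_CACHE_INDEXES << 30
    for slot, index in enumerate(indexes):
        word |= index << (20 - 10 * slot)  # slots occupy bits 29-20, 19-10, 9-0
    return struct.pack("<I", word)


def unpack_cache_indexes(data):
    """Inverse of pack_cache_indexes; returns the three cache indexes."""
    (word,) = struct.unpack("<I", data)
    if word >> 30 != TAG_CACHE_INDEXES:
        raise ValueError("word does not hold cache indexes")
    return [(word >> (20 - 10 * slot)) & 0x3FF for slot in range(3)]
```

  • Three 10-bit values occupy 30 bits, which leaves 2 bits of a 32-bit group for signaling what the group contains.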
  • At a step 418, the PDI is stored in association with a data store such as a solid state drive (SSD). Each term in the PDI is associated with a position, which is implicitly encoded through in-order encoding, a term identifier or cache index, and any metadata associated with the term. Metadata may include font, capitalization, quantities or values, punctuation, bolding, italicizing, underlining, position within the document body, and the like.
  • Turning now to FIG. 5, a flow diagram is depicted of an exemplary method 500 of utilizing a PDI in combination with an inverted index to identify contextually-relevant search results. At a step 510, a search query comprising one or more keyword terms and one or more surrounding terms is received by a receiving component such as the receiving component 218 of FIG. 2. The search query may be received in response to a user inputting the query into a search box of a search engine page, or a search box associated with a Web page or Web site. The keyword terms and the surrounding terms impart a contextual meaning to the search query; this contextual meaning may be determined by a contextual analysis component such as the contextual analysis component 228 of FIG. 2.
  • At a step 512, an inverted index is accessed by the contextual analysis component. The inverted index comprises a mapping between keywords and documents that contain those keywords and may be stored in association with a first data store such as the data store 212 of FIG. 2. At a step 514, the contextual analysis component uses the keyword terms from the search query to identify a set of documents in the inverted index that contain one or more of the search query keyword terms. Each document in the set may be associated with a document identifier (DocId) such as the URL of the document, or other unique identifiers such as a cryptographic hash of the URL, or a unique sequence number assigned to the document when received by the system.
  • At a step 516, the contextual analysis component accesses the per-document indexes associated with the set of documents using, for example, the DocIds. The PDIs may be stored in association with a second data store. In one aspect, the second data store may be the same as the first data store, while in another aspect, the second data store may be different from the first data store. In one embodiment, the second data store may comprise an SSD. Once identified and retrieved, the set of documents may be decoded using a decoding component such as the decoding component 226 of FIG. 2.
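  • Steps 512 through 516 can be sketched in Python as a two-stage lookup. In this illustrative sketch, plain dictionaries stand in for the first and second data stores, the function names are hypothetical, and each stored PDI is represented as an already-decoded list of terms so that the decoding step is omitted.

```python
def candidate_doc_ids(inverted_index, keywords):
    """Return the DocIds of documents containing one or more of the keywords."""
    doc_ids = set()
    for keyword in keywords:
        doc_ids |= inverted_index.get(keyword, set())
    return doc_ids


def fetch_pdis(pdi_store, doc_ids):
    """Look up the per-document index for each DocId present in the store."""
    return {doc_id: pdi_store[doc_id] for doc_id in sorted(doc_ids) if doc_id in pdi_store}


# Toy data: keyword -> DocIds, and DocId -> in-order terms of the document.
inverted_index = {"books": {"doc1", "doc2"}, "children": {"doc2", "doc3"}}
pdi_store = {
    "doc1": "some great children's books include".split(),
    "doc3": "children at play".split(),
}
```

  • The inverted index narrows the corpus to candidate documents by keyword alone; the PDIs then supply the in-order terms needed for the contextual analysis that follows.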
  • Once the set of documents is decoded, at a step 518, the contextual analysis component analyzes each document in the set to determine one or more contextual meanings associated with each document. The contextual analysis involves locating instances of the keyword terms in a document and constructing a contextual window around the keyword term(s) that includes the keyword term(s) and one or more surrounding non-search query terms. The contextual window may include a predetermined number of terms or may include a variable number of terms sufficient to establish a contextual meaning for the contextual window. Once the contextual window is constructed, a contextual meaning is determined. A single document may contain multiple contextual windows. Each of the multiple contextual windows may share the same contextual meaning, different contextual meanings, or a combination of both. An overall contextual meaning of the document may be determined by analyzing in-order the multiple contextual windows.
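  • Window construction with a predetermined number of surrounding terms can be sketched as below. This Python sketch assumes the document's terms are available as an in-order list; the `radius` parameter and the choice to merge overlapping windows are assumptions for illustration, and determining a contextual meaning from each window is outside its scope.

```python
def contextual_windows(terms, keywords, radius=3):
    """Return in-order term windows around each keyword occurrence.

    Each window spans `radius` terms on either side of a keyword; windows
    that overlap or touch are merged into one.
    """
    keyword_set = {k.lower() for k in keywords}
    spans = []  # list of [start, end) index pairs
    for pos, term in enumerate(terms):
        if term.lower() in keyword_set:
            start = max(0, pos - radius)
            end = min(len(terms), pos + radius + 1)
            if spans and start <= spans[-1][1]:
                spans[-1][1] = max(spans[-1][1], end)
            else:
                spans.append([start, end])
    return [terms[start:end] for start, end in spans]
```

  • Because the PDI preserves term order, the windows come back in document order, which is what allows them to be analyzed in order to form an overall contextual meaning.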
  • At a step 520, each document in the set of documents is ranked based on the respective contextual meaning of the document as compared to the contextual meaning associated with the search query. Those documents having contextual meanings that are relevant to the search query contextual meaning are promoted in the ranking, while documents having contextual meanings that are less relevant or different from the search query contextual meaning are demoted in the ranking. At a step 524, the ranked set of documents is presented as a rank-ordered list on a search results page.
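  • The ranking step can be illustrated with a deliberately simple stand-in for contextual meaning. How contextual meanings are compared is not specified above; this Python sketch assumes, for illustration only, that relevance is the fraction of query terms that appear in a contextual window, and it scores each document by its best window. The function names are hypothetical.

```python
def relevance(query_terms, window_terms):
    """Fraction of distinct query terms that also occur in the window."""
    query = {t.lower() for t in query_terms}
    window = {t.lower() for t in window_terms}
    return len(query & window) / len(query) if query else 0.0


def rank_documents(query_terms, windows_by_doc):
    """Return DocIds ordered from most to least relevant.

    Each document is scored by its best-scoring contextual window; ties
    are broken by DocId so the order is deterministic.
    """
    scores = {
        doc_id: max((relevance(query_terms, w) for w in windows), default=0.0)
        for doc_id, windows in windows_by_doc.items()
    }
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

  • Applied to the earlier “Books by children” example, the window “although books by children authors are rare” contains all three query terms, whereas “some great children's books include” contains only “books,” so the second document ranks first under this stand-in measure.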
  • The method 500 may further comprise classifying each document in the set of documents into categories based on the contextual meaning associated with the contextual window of the document. For example, a contextual window in one document may comprise the phrase, “Traveling in Hong Kong is exciting,” where “Kong” is the search query keyword term. This document may be classified in the travel category. A contextual window in a second document may comprise the phrase, “Kong toys are a favorite of dogs,” where “Kong” is again the search query keyword term. This document may be classified in the pet category. Documents that share the same category are clustered together and are presented together on the search results page. Thus, the documents classified in the travel category would be presented in a group, and the documents classified in the pet category would be presented in a separate group.
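  • The classify-then-cluster flow can be sketched as follows. The classification method is not specified above; this Python sketch assumes a small hand-written table of cue words per category, which is purely illustrative, and the cue words themselves are hypothetical.

```python
from collections import defaultdict

# Hypothetical cue words per category (illustrative only).
CATEGORY_CUES = {
    "travel": {"traveling", "hotel", "airport"},
    "pet": {"dogs", "toys", "cats"},
}


def classify(window_terms, cues=CATEGORY_CUES):
    """Pick the category whose cue words overlap the window the most."""
    words = {t.lower().strip(".,") for t in window_terms}
    best = max(cues, key=lambda category: len(cues[category] & words))
    return best if cues[best] & words else "uncategorized"


def cluster_by_category(window_by_doc):
    """Group DocIds by the category assigned to each document's window."""
    clusters = defaultdict(list)
    for doc_id, window in window_by_doc.items():
        clusters[classify(window)].append(doc_id)
    return dict(clusters)
```

  • Both example windows contain the keyword “Kong,” yet the surrounding terms place one in the travel cluster and the other in the pet cluster, which is the distinction a keyword-only inverted index cannot make.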
  • The method 500 may also include analyzing the structural pattern of terms in the contextual window and comparing it to the structural pattern of terms in the search query. The structural pattern is the arrangement of terms within a contextual window and/or a search query. For instance, the structural pattern of the phrase “the dog ran after the cat” is different from the structural pattern of the phrase “the cat ran after the dog” although each contains the same terms. Documents whose contextual windows have terms in the same or similar structural pattern as that of the search query may be ranked higher than documents whose contextual windows have terms in a structural pattern different from that of the search query.
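  • One way to quantify structural pattern similarity is sketched below. The comparison method is not specified above; this Python sketch assumes, for illustration, that similarity is the length of the longest common subsequence of terms between the query and the window, divided by the query length. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two term lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pattern_similarity(query_terms, window_terms):
    """Share of query terms found in the window in the same relative order."""
    if not query_terms:
        return 0.0
    return lcs_length(query_terms, window_terms) / len(query_terms)
```

  • Under this measure, “the dog ran after the cat” compared with itself scores 1.0, while comparison with “the cat ran after the dog” scores lower even though both phrases contain the same terms, because only four of the six terms can be matched in order.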
  • The methods described above are representative examples of the different types of contextually-relevant search results that can result from using resources associated with the PDIs. The array of attributes, context, and in-order contextual windows present in the PDI can be utilized in new ways to produce ever more meaningful search results that match user intent.
  • The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present invention pertains without departing from its scope.

Claims (20)

What is claimed is:
1. One or more computer-storage media storing computer-useable instructions that, when used by one or more computing devices, cause the one or more computing devices to perform a method of generating a per-document index for semantic searching, the method comprising:
receiving a document, the document comprising the original document and annotations associated with the document;
parsing the document into a plurality of sections;
for each section of the plurality of sections, translating in order each term to at least one of its corresponding cache index, or its corresponding term identifier;
subsequent to translating in order the each term, group encoding the each section of the plurality of sections, the group-encoded sections comprising a per-document index for the document; and
storing the per-document index in a data store.
2. The media of claim 1, further comprising:
generating a custom-dictionary section for the document, the custom-dictionary section comprising document-specific terms at a specified position within the custom-dictionary section;
group encoding the custom-dictionary section; and
storing the encoded custom-dictionary section in association with the per-document index.
3. The media of claim 2, wherein parsing the document further comprises identifying one or more attributes associated with the document and generating meta-word terms for each of the one or more attributes.
4. The media of claim 3, wherein the one or more attributes comprise at least one or more of location information associated with the document, originating language of the document, and fingerprint information associated with the document.
5. The media of claim 3, wherein each of the meta-word terms includes a prefix identifying a type of attribute.
6. The media of claim 5, wherein each of the meta-word terms further includes a value associated with the prefix.
7. The media of claim 2, wherein the plurality of sections comprise at least a document data section, a body section, one or more meta-word sections, and one or more meta-stream sections.
8. A computerized method carried out by at least one server having at least one processor for generating a per-document index for semantic searching, the method comprising:
receiving a document, the document comprising the original document and annotations associated with the document;
parsing the document into a plurality of sections;
for each section of the plurality of sections, translating, using the at least one processor, in order each term to at least one of its corresponding cache index, or its corresponding term identifier;
subsequent to translating in order the each term, group encoding the each section of the plurality of sections, the group-encoded sections comprising a per-document index for the document; and
storing the per-document index in a data store.
9. The method of claim 8, further comprising:
generating a custom-dictionary section for the document, the custom-dictionary section comprising document-specific terms at a specified position within the custom-dictionary section;
group encoding the custom-dictionary section; and
storing the encoded custom-dictionary section in association with the per-document index.
10. The method of claim 9, wherein translating in order the each term to its corresponding term identifier comprises:
accessing at least one of a section-specific dictionary or the custom-dictionary section; and
identifying a position of the each term in the at least one of the section-specific dictionary or the custom-dictionary section, the position of the each term comprising the each term's term identifier.
11. The method of claim 10, wherein the plurality of sections comprise at least a document data section, a body section, one or more meta-word sections, and one or more meta-stream sections.
12. The method of claim 11, wherein the body section and the one or more meta-streams sections share the same section-specific dictionary.
13. The method of claim 8, wherein translating in order the each term to its corresponding cache index comprises processing the each term's corresponding term identifier through an entry cache.
14. The method of claim 8, wherein the term identifier is a numerical value.
15. The method of claim 14, wherein the numerical value is smaller when the each term is commonly used, and wherein the numerical value is larger when the each term is infrequently used.
16. A system for generating a per-document index for semantic searching, the system comprising:
a server having one or more processors and one or more computer-storage media;
a data store coupled with the server,
wherein the server:
receives a document, the document comprising the original document and annotations associated with the document;
parses the document into a plurality of sections;
for each section of the plurality of sections, translates in order each term to at least one of its corresponding cache index, or its corresponding term identifier;
subsequent to translating in order the each term, group encodes the each section of the plurality of sections, the group-encoded sections comprising a per-document index for the document; and
stores the per-document index in the data store.
17. The system of claim 16, wherein the plurality of sections comprise at least a document data section, a body section, one or more meta-word sections, and one or more meta-stream sections.
18. The system of claim 16, wherein the server further:
generates a custom-dictionary section for the document, the custom-dictionary section comprising document-specific terms at a specified position within the custom-dictionary section;
group encodes the custom-dictionary section; and
stores the encoded custom-dictionary section in association with the per-document index in the data store.
19. The system of claim 16, wherein the each term of the each section is associated with one or more of a position of the each term within the section, the term identifier, and metadata associated with the each term.
20. The system of claim 16, wherein the data store comprises a solid state drive (SSD).
US14/700,693 2012-11-28 2015-04-30 Per-document index for semantic searching Abandoned US20150234917A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/700,693 US20150234917A1 (en) 2012-11-28 2015-04-30 Per-document index for semantic searching

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US13/687,135 US9069857B2 (en) 2012-11-28 2012-11-28 Per-document index for semantic searching
US14/700,693 US20150234917A1 (en) 2012-11-28 2015-04-30 Per-document index for semantic searching

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US13/687,135 Continuation US9069857B2 (en) 2012-11-28 2012-11-28 Per-document index for semantic searching

Publications (1)

Publication Number Publication Date
US20150234917A1 true US20150234917A1 (en) 2015-08-20

Family

ID=50774177

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/687,135 Active 2033-03-15 US9069857B2 (en) 2012-11-28 2012-11-28 Per-document index for semantic searching
US14/700,693 Abandoned US20150234917A1 (en) 2012-11-28 2015-04-30 Per-document index for semantic searching

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US13/687,135 Active 2033-03-15 US9069857B2 (en) 2012-11-28 2012-11-28 Per-document index for semantic searching

Country Status (1)

Country Link
US (2) US9069857B2 (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9984110B2 (en) * 2014-08-21 2018-05-29 Dropbox, Inc. Multi-user search system with methodology for personalized search query autocomplete
US10353964B2 (en) * 2014-09-15 2019-07-16 Google Llc Evaluating semantic interpretations of a search query
US10579601B2 (en) 2014-12-12 2020-03-03 Aveva Software, Llc Data dictionary system in an event historian
US10769104B2 (en) * 2014-12-12 2020-09-08 Aveva Software, Llc Block data storage system in an event historian
US9384226B1 (en) 2015-01-30 2016-07-05 Dropbox, Inc. Personal content item searching system and method
US9183303B1 (en) 2015-01-30 2015-11-10 Dropbox, Inc. Personal content item searching system and method
US10146438B1 (en) 2016-06-29 2018-12-04 EMC IP Holding Company LLC Additive library for data structures in a flash memory
US10055351B1 (en) 2016-06-29 2018-08-21 EMC IP Holding Company LLC Low-overhead index for a flash cache
US10331561B1 (en) 2016-06-29 2019-06-25 Emc Corporation Systems and methods for rebuilding a cache index
US10261704B1 (en) 2016-06-29 2019-04-16 EMC IP Holding Company LLC Linked lists in flash memory
US10037164B1 (en) 2016-06-29 2018-07-31 EMC IP Holding Company LLC Flash interface for processing datasets
US10089025B1 (en) 2016-06-29 2018-10-02 EMC IP Holding Company LLC Bloom filters in a flash memory
US10726501B1 (en) 2017-04-25 2020-07-28 Intuit Inc. Method to use transaction, account, and company similarity clusters derived from the historic transaction data to match new transactions to accounts
US10891947B1 (en) 2017-08-03 2021-01-12 Wells Fargo Bank, N.A. Adaptive conversation support bot
US10592542B2 (en) * 2017-08-31 2020-03-17 International Business Machines Corporation Document ranking by contextual vectors from natural language query
US10956986B1 (en) 2017-09-27 2021-03-23 Intuit Inc. System and method for automatic assistance of transaction sorting for use with a transaction management service
US20190108276A1 (en) * 2017-10-10 2019-04-11 NEGENTROPICS Mesterséges Intelligencia Kutató és Fejlesztõ Kft Methods and system for semantic search in large databases
US10789282B1 (en) * 2018-01-31 2020-09-29 Dell Products L.P. Document indexing with cluster computing
US11841854B2 (en) * 2018-07-24 2023-12-12 MachEye, Inc. Differentiation of search results for accurate query output
US11853107B2 (en) 2018-07-24 2023-12-26 MachEye, Inc. Dynamic phase generation and resource load reduction for a query
US11816436B2 (en) 2018-07-24 2023-11-14 MachEye, Inc. Automated summarization of extracted insight data
US11463390B2 (en) * 2018-08-01 2022-10-04 Citrix Systems, Inc. Selecting attachments for electronic mail messages
US11232132B2 (en) * 2018-11-30 2022-01-25 Wipro Limited Method, device, and system for clustering document objects based on information content
GB2589254A (en) * 2019-05-31 2021-05-26 Collatr Ltd Digital document management system
US11222013B2 (en) * 2019-11-19 2022-01-11 Sap Se Custom named entities and tags for natural language search query processing
US11520782B2 (en) * 2020-10-13 2022-12-06 Oracle International Corporation Techniques for utilizing patterns and logical entities
US11651013B2 (en) * 2021-01-06 2023-05-16 International Business Machines Corporation Context-based text searching
US11657078B2 (en) * 2021-10-14 2023-05-23 Fmr Llc Automatic identification of document sections to generate a searchable data structure

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070156669A1 (en) * 2005-11-16 2007-07-05 Marchisio Giovanni B Extending keyword searching to syntactically and semantically annotated data
US7536389B1 (en) * 2005-02-22 2009-05-19 Yahoo ! Inc. Techniques for crawling dynamic web content
US20090204605A1 (en) * 2008-02-07 2009-08-13 Nec Laboratories America, Inc. Semantic Search Via Role Labeling
US7860312B2 (en) * 2004-10-22 2010-12-28 Xerox Corporation System and method for identifying and labeling fields of text associated with scanned business documents
US8005858B1 (en) * 2001-05-31 2011-08-23 Autonomy Corporation PLC Method and apparatus to link to a related document
US20120303640A1 (en) * 2011-05-24 2012-11-29 International Business Machines Corporation Self-Parsing XML Documents to Improve XML Processing
US8452850B2 (en) * 2000-12-14 2013-05-28 International Business Machines Corporation Method, apparatus and computer program product to crawl a web site
US8924395B2 (en) * 2010-10-06 2014-12-30 Planet Data Solutions System and method for indexing electronic discovery data
US9116895B1 (en) * 2011-08-25 2015-08-25 Infotech International Llc Document processing system and method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8396859B2 (en) * 2000-06-26 2013-03-12 Oracle International Corporation Subject matter context search engine
US6633868B1 (en) * 2000-07-28 2003-10-14 Shermann Loyall Min System and method for context-based document retrieval
US7856441B1 (en) * 2005-01-10 2010-12-21 Yahoo! Inc. Search systems and methods using enhanced contextual queries
US8024329B1 (en) 2006-06-01 2011-09-20 Monster Worldwide, Inc. Using inverted indexes for contextual personalized information retrieval
US8171031B2 (en) 2008-06-27 2012-05-01 Microsoft Corporation Index optimization for ranking using a linear model
WO2010082207A1 (en) * 2009-01-16 2010-07-22 Sanjiv Agarwal Dynamic indexing while authoring
US8620900B2 (en) 2009-02-09 2013-12-31 The Hong Kong Polytechnic University Method for using dual indices to support query expansion, relevance/non-relevance models, blind/relevance feedback and an intelligent search interface
US8452794B2 (en) * 2009-02-11 2013-05-28 Microsoft Corporation Visual and textual query suggestion
CN102023989B (en) 2009-09-23 2012-10-10 阿里巴巴集团控股有限公司 Information retrieval method and system thereof
WO2012050800A1 (en) 2010-09-29 2012-04-19 International Business Machines Corporation Context-based disambiguation of acronyms and abbreviations

Also Published As

Publication number Publication date
US20140149401A1 (en) 2014-05-29
US9069857B2 (en) 2015-06-30

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION