US20080222146A1 - System and method for creation, representation, and delivery of document corpus entity co-occurrence information - Google Patents

System and method for creation, representation, and delivery of document corpus entity co-occurrence information

Info

Publication number
US20080222146A1
Authority
US
United States
Prior art keywords
entity
occurring
sub
sparse matrix
occurrence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/062,096
Inventor
Daniel Frederick Gruhl
Daniel Norin Meredith
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US12/062,096 priority Critical patent/US20080222146A1/en
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION reassignment INTERNATIONAL BUSINESS MACHINES CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MEREDITH, DANIEL NORIN, GRUHL, DANIEL FREDERICK
Publication of US20080222146A1 publication Critical patent/US20080222146A1/en
Abandoned legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99937Sorting

Abstract

To respond to queries that relate to co-occurring entities on the Web, a compact sparse matrix representing entity co-occurrences is generated and then accessed to satisfy queries. The sparse matrix has groups of sub-rows, with each group corresponding to an entity in a document corpus. The groups are sorted from most occurring entity to least occurring entity. Each sub-row within a group corresponds to an entity that co-occurs in the document corpus, within a co-occurrence criterion, with the entity represented by the group, and to facilitate query response the sub-rows within a group are sorted from most occurring co-occurrence to least occurring co-occurrence.

Description

    FIELD OF THE INVENTION
  • The present invention relates generally to creating, representing, and delivering entity co-occurrence information pertaining to entities in a document corpus such as the World Wide Web.
  • BACKGROUND
  • The Internet is a ubiquitous source of information. Despite the presence of a large number of search engines, however, all of which are designed to respond to queries for information by returning what is hoped to be relevant query responses, it remains problematic to filter through search results for the answers to certain types of queries that existing search engines do not effectively account for. Among the types of queries that current search engines inadequately address are those that relate in general not just to a single entity, such as a single person, company, or product, but to entity combinations that are bounded by co-occurrence criteria between the entities. This is because it is often the case that the co-occurrence criteria can be unnamed in the sense that it may not be readily apparent why a particular co-occurrence exists.
  • For example, consider the sentence “in their speech Sam Palmisano and Steve Mills announced a new version of IBM's database product DB2 will ship by the end of third quarter.” This sentence contains the following example unnamed co-occurrences:
  • Sam Palmisano and Steve Mills, Sam Palmisano and IBM, Sam Palmisano and DB2, Steve Mills and IBM, Steve Mills and DB2.
  • One might wish to inquire of a large document corpus such as the Web, "which person co-occurs most often with IBM?", but present search engines largely cannot respond to even a simple co-occurrence query like this one. Other co-occurrence questions with important implications but no currently effective answers exist, such as which medical conditions are most often mentioned with a drug, or which technologies are most often mentioned with a company. With these critical observations in mind, the invention herein is provided.
  • SUMMARY OF THE INVENTION
  • A computer is programmed to execute logic that includes receiving a query, and in response to the query, accessing a sparse matrix that contains information which represents co-occurrences of entities in a document corpus. Information obtained in the accessing act is returned as a response to the query.
  • In one non-limiting implementation, the sparse matrix has groups of sub-rows, and each group corresponds to an entity in the document corpus. The groups are sorted in the sparse matrix from most occurring entity to least occurring entity, with each sub-row of a group corresponding to an entity co-occurring in the document corpus, within at least one co-occurrence criterion, with the entity represented by the group. The sub-rows within a group are sorted from most occurring co-occurrence to least occurring co-occurrence.
  • In the preferred non-limiting implementation, the logic can further include, in response to the query, accessing a row index that points to a starting position of a group of sub-rows in the sparse matrix. The logic can also include, in response to the query, accessing a header including at least two bytes, the first of which indicates a file version and the second byte of which indicates a number of bytes used for at least one cardinality representing a corresponding number of entity co-occurrences. The cardinality may be expressed exactly or using a two-byte approximation.
  • If desired, the logic can also include accessing a string table including an index and a corresponding data string. The index can be a concatenated list of integers representing offsets of entity-representing strings in the data string, and the entity-representing strings in the data string may be listed in descending order of frequency of occurrence in the document corpus.
  • In another aspect, a service includes receiving a query for information contained in the World Wide Web, and returning a response to the query at least in part by accessing a data structure including a sparse matrix.
  • In yet another aspect, a method for responding to queries for information in a document corpus includes receiving the query and using at least a portion of the query as an entering argument to access a sparse matrix. A response to the query is returned based on the access of the sparse matrix.
  • The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a schematic diagram of a non-limiting computer system that can be used to create and use the data structures shown herein to return responses to user queries;
  • FIG. 2 is a schematic representation of the present sparse matrix with row index, along with a counterpart dense matrix representation that is shown only for illustration;
  • FIG. 3 is a flow chart of the logic for establishing the sparse matrix; and
  • FIGS. 4 and 5 show various data structures that can be used as part of the logic of FIG. 3.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
  • Referring initially to FIG. 1, a system is shown, generally designated 10, that includes one or more computers 12 (only a single computer 12 shown in FIG. 1 for clarity of disclosure) that can communicate with a corpus 14 of documents. The corpus 14 may be the World Wide Web with computer-implemented Web sites, and the computer 12 can communicate with the Web by means of a software-implemented browser 15. The computer 12 includes input devices such as a keyboard 16 and/or mouse 18 or other input device for inputting programming data to establish the present data structures and/or for inputting subsequent user queries and accessing the data structures to return responses to the queries. The computer 12 can use one or more output devices 20 such as a computer monitor to display query results.
  • It is to be appreciated that the data structures below which facilitate co-occurrence querying can be provided to the computer 12 for execution thereof by a user of the computer so that a user can input a query and the computer can return a response. It is to be further understood that in other aspects, a user can access the Web or other network, input a query to a Web server or other network server, and the server can access the data structures herein to return a response to the query as a paid-for service. Yet again, the data structures, owing to their compact size, may be provided on the below-described removable portable data storage medium and vended to users, who may purchase the portable data storage medium and engage it with their own personal computers to query for co-occurrences.
  • The computer 12 can be, without limitation, a personal computer made by International Business Machines Corporation (IBM) of Armonk, N.Y. or equivalent. Other digital processors, however, may be used, such as a laptop computer, mainframe computer, palmtop computer, personal assistant, or any other suitable processing apparatus. Likewise, other input devices, including keypads, trackballs, and voice recognition devices can be used, as can other output devices, such as printers, other computers or data storage devices, and computer networks.
  • In any case, the computer 12 has a processor 22 that executes the logic shown herein. The logic may be implemented in software as a series of computer-executable instructions. The instructions may be contained on a data storage device with a computer readable medium, such as a computer diskette. Or, the instructions may be stored on random access memory (RAM) of the computers, on a hard disk drive, electronic read-only memory, optical storage device, or other appropriate data storage device. In an illustrative embodiment of the invention, the computer-executable instructions may be lines of JAVA code.
  • Indeed, the flow charts herein illustrate the structure of the logic of the present invention as embodied in computer program software. Those skilled in the art will appreciate that the flow charts illustrate the structures of computer program code elements, including logic circuits on an integrated circuit, that function according to this invention. Manifestly, the invention is practiced in its essential embodiment by a machine component that renders the program code elements in a form that instructs a digital processing apparatus (that is, a computer) to perform a sequence of function steps corresponding to those shown.
  • Completing the description of FIG. 1, owing to the relatively efficient, compact size (in some implementations, less than two gigabytes) of the sparse matrix and accompanying string table described herein that can be used to respond to user queries, the sparse matrix and string table may be stored on a removable data storage media 24 such as a DVD, CD, thumb drive, solid state portable memory device, etc.
  • Now referring to FIG. 2, a data structure that is generated for searching for co-occurrences of entities in the document corpus 14 is shown and is referred to herein as an "s-web". Essentially, in the preferred implementation an s-web includes a header (not shown), a string table which lists the names of the entities to be considered, and a sparse matrix 30 of the co-occurrences with row index 32. As can be seen by comparing the sparse matrix 30 with a corresponding dense matrix representation 34, the sparse matrix representation drops the zeroes in the dense matrix to make the resulting data structure as compact as possible. However, the sparse matrix 30 is not merely the dense matrix 34 with the zeroes dropped, but rather is a representation of the dense matrix with zeroes dropped and data rearranged. Details of the sparse matrix will be discussed further below, but first the header and string table will be described.
  • First considering the header, in a preferred non-limiting implementation the header includes two bytes, the first of which indicates the file version and the second of which indicates the number of bytes used for cardinalities and offsets. Smaller tables can use fewer bytes per entry.
  • As set forth further below, as used herein a “cardinality” refers to the number of co-occurrences between two entities. The header can indicate the largest cardinality in the sparse matrix, either exactly or using a two-byte approximation (reduced format) such as a 10+6 bit mantissa and order of magnitude exponent.
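  • By way of a purely illustrative sketch (in Java, consistent with the JAVA-code embodiment mentioned below), the reduced two-byte format described above might be realized as a 10-bit mantissa plus a 6-bit exponent; the exact bit layout, the use of a binary rather than decimal exponent, and all names are assumptions, since the text states only the 10+6 split.

    // Hypothetical sketch of the two-byte "reduced format": a 10-bit mantissa in the
    // low bits and a 6-bit exponent in the high bits. The bit layout, the use of a
    // base-2 (rather than decimal) exponent, and the rounding are assumptions; the
    // text states only the 10+6 split.
    public final class ReducedCardinality {

        // Encode a non-negative cardinality into 16 bits. Values above 1023 lose precision.
        public static short encode(long cardinality) {
            int exponent = 0;
            while (cardinality > 0x3FF) {   // shrink until the mantissa fits in 10 bits
                cardinality >>= 1;
                exponent++;
            }
            return (short) ((exponent << 10) | (int) cardinality);
        }

        // Recover an approximation of the original cardinality.
        public static long decode(short packed) {
            int bits = packed & 0xFFFF;
            long mantissa = bits & 0x3FF;
            int exponent = bits >>> 10;
            return mantissa << exponent;
        }

        public static void main(String[] args) {
            long original = 1234567L;
            short packed = encode(original);
            System.out.println(original + " ~ " + decode(packed)); // approximation held in two bytes
        }
    }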
  • The preferred non-limiting string table can have two parts, namely, an index and the corresponding data. The index is a concatenated list of integers (preferably represented using the minimum number of bytes) that provides the offsets of the various strings. String length may be calculated by subtraction from the next occurring string.
  • The index of the string table is followed by the per-string data, which lists each entity represented in the sparse matrix. The entities in the data portion of the string table preferably are listed in descending order of frequency of occurrence in the document corpus 14, for reasons that will become clear shortly. The string data can be compressed if desired, but it would then have to be compressed on a per-string basis, so it often makes more sense to simply compress the whole file at the file system level.
  • In generating the string table, the entities in the document corpus are obtained as set forth further below, sorted, and then concatenated to produce the string data portion of the string table, with their offsets calculated and recorded in the index portion. Thus, a portion of the string table might appear as follows:
  • data portion: Dan SmithUSPTOIBM . . . ,
    index 0 10 15 . . . , it being understood that “0” in the index points to just before “Dan Smith” (which starts at the zero position in the string data), “10” in the index points to just before “USPTO” (which starts at the tenth position in the data string), and “15” in the index points to just before “IBM” (which starts at the fifteenth position in the data string).
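  • A minimal sketch of how such a string table might be read back follows; the length of each entry is obtained by subtracting its offset from the next offset (or from the end of the data for the last entry). The class and method names are assumptions, and the sample offsets are computed directly from the concatenated example strings.

    import java.util.Arrays;

    // Illustrative reader for the string table described above: an index of offsets
    // followed by the concatenated entity strings. Class and method names are assumptions.
    public final class StringTable {
        private final int[] offsets;  // index portion: start offset of each entity string
        private final String data;    // data portion: all entity strings concatenated

        public StringTable(int[] offsets, String data) {
            this.offsets = Arrays.copyOf(offsets, offsets.length);
            this.data = data;
        }

        // The i-th entity string; its length is the gap to the next offset (or to the end).
        public String entity(int i) {
            int start = offsets[i];
            int end = (i + 1 < offsets.length) ? offsets[i + 1] : data.length();
            return data.substring(start, end);
        }

        public static void main(String[] args) {
            // Offsets computed from the concatenated example strings "Dan Smith", "USPTO", "IBM".
            StringTable table = new StringTable(new int[] {0, 9, 14}, "Dan SmithUSPTOIBM");
            System.out.println(table.entity(1)); // prints "USPTO"
        }
    }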
  • Returning to the sparse matrix 30, in the preferred implementation a row in the dense matrix, which represents a single entity, is broken into sub-rows in the sparse matrix, with each sub-row representing a column from the corresponding row in the dense matrix representation. Thus, a group of sub-rows in the sparse matrix corresponds to an entity in the document corpus. A column in the dense matrix representation (and hence a sub-row in the sparse matrix 30) corresponds to an entity that has satisfied the co-occurrence criteria with the row entity as further discussed below, and the value in the column indicates the number of co-occurrences of the two entities. Since most entities co-occur with only a small subset of all the entities in the corpus, the dense matrix representation is mostly composed of zeroes as shown. With this critical observation, the sparse matrix 30 is provided.
  • The groups of sub-rows in the sparse matrix 30 are sorted in two ways. First, the order of the groups themselves depends on the frequency of occurrence of the corresponding entities in the document corpus, i.e., the first group of sub-rows corresponds to the most commonly occurring entity in the document corpus 14, the second group of sub-rows represents the second-most commonly occurring entity, and so on. This method of sorting facilitates responding to queries such as "what is the most common cough syrup mentioned on the web?" Recall that the entities in the string table data portion are similarly sorted, i.e., the first string is the most commonly occurring entity and so on.
  • Thus, as shown in FIG. 2, the first group of sub-rows (those beginning with the numeral "1") corresponds to a single entity, in fact the most frequently occurring entity in the document corpus. To further conserve space, the first numeral of each sub-row of the sparse matrix 30 may be dropped in implementation, with the row index 32 being used to point to the beginning of each new group of sub-rows as shown.
  • The second numeral in each sub-row represents a non-zero column from the dense matrix representation, and the third numeral represents the value in the column. In the example shown in FIG. 2, there are four sub-rows in the first group, with the first sub-row indicating that a value of “3” corresponds to column “7”, the second sub-row indicating that a value of “2” corresponds to column “17”, the third sub-row indicating that a value of “1” corresponds to the first column, and the fourth sub-row indicating that a value of “1” corresponds to the thirteenth column.
  • Accordingly, the second way in which the sparse matrix 30 is sorted may now be appreciated. Not only are the groups of sub-rows intersorted by frequency of occurrence of the corresponding entities, but within each group, the sub-rows are intrasorted by cardinality, with the sub-row indicating the highest number of co-occurrences first, the sub-row indicating the second-highest number of co-occurrences second, and so on. This second way in which the sparse matrix 30 is sorted thus facilitates responding to queries such as “which cough syrups are most often co-mentioned with aspirin?”
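  • The following sketch illustrates, under assumed names and an assumed array layout, how the row index 32 and the cardinality-sorted sub-rows described above might be held in memory and queried; it mirrors the first group of FIG. 2 but is not the specification's exact on-disk format.

    // Minimal in-memory sketch of the sparse matrix 30 with row index 32. Each group of
    // sub-rows belongs to one entity (entities ordered by corpus frequency); the row
    // index points at the first sub-row of each group (with one extra terminal entry);
    // within a group the (column, cardinality) pairs are pre-sorted by descending
    // cardinality. Field and method names are assumptions made for illustration.
    public final class SparseCooccurrenceMatrix {
        private final int[] rowIndex;       // start of each entity's group, plus a terminal entry
        private final int[] columns;        // co-occurring entity (the dense-matrix column)
        private final int[] cardinalities;  // number of co-occurrences for that pair

        public SparseCooccurrenceMatrix(int[] rowIndex, int[] columns, int[] cardinalities) {
            this.rowIndex = rowIndex;
            this.columns = columns;
            this.cardinalities = cardinalities;
        }

        // All (column, cardinality) sub-rows of an entity's group, highest cardinality first.
        public int[][] group(int entityId) {
            int start = rowIndex[entityId];
            int end = rowIndex[entityId + 1];
            int[][] subRows = new int[end - start][2];
            for (int i = start; i < end; i++) {
                subRows[i - start][0] = columns[i];
                subRows[i - start][1] = cardinalities[i];
            }
            return subRows;
        }

        // Entity co-occurring most often with the given entity: simply the first sub-row.
        public int topCooccurringEntity(int entityId) {
            return columns[rowIndex[entityId]];
        }

        public static void main(String[] args) {
            // Mirrors the first group of FIG. 2: values 3, 2, 1, 1 in columns 7, 17, 1, 13.
            SparseCooccurrenceMatrix m = new SparseCooccurrenceMatrix(
                    new int[] {0, 4}, new int[] {7, 17, 1, 13}, new int[] {3, 2, 1, 1});
            System.out.println(m.topCooccurringEntity(0)); // prints 7
        }
    }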
  • FIGS. 3-5 illustrate how the data structures discussed above can be generated. Commencing at block 40, a hierarchical structure of entity classes may be established. More specifically, consider that entities can be regarded as annotations which have been placed on a document either manually or automatically via an algorithm. In a non-limiting implementation each entity can be an unstructured information management architecture (UIMA) annotation which records the unique identifier of the entity, its location on the document, and the number of tokens by which the entity is represented. This information is then compiled into a vector of annotations per document as set forth further below. Block 40 recognizes that many annotations fall into classes of annotation, and entities are no different. In the example in the background, “Sam Palmisano” and “Steve Mills” are both of the “People” class of entities, whereas the annotation “IBM” is of the Organization class and “DB2” can be considered part of the Product class of entities. This non-limiting illustrative classification allows for a simple hierarchical structure of entities to be created:
  • /Entity/People/Sam Palmisano
  • /Entity/People/Steve Mills
  • /Entity/Organizations/IBM
  • /Entity/Products/DB2
  • When annotations are classified and structured in this manner, the logic can move to block 42 to examine each document (or a relevant subset thereof) in the corpus and determine entities, their locations, and the number of tokens associated with each entity, to thereby establish annotation vectors. Multiple annotations may be produced at a given annotation location; e.g., at the location in a document of "Sam Palmisano", annotations for "Entity", "Entity/People", and "Entity/People/Sam Palmisano" can be produced, as sketched below.
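  • As an illustration of producing several annotations for one mention, the sketch below emits one record per level of the entity hierarchy, each carrying the identifier path, location, and token count noted above; the record fields and names are assumptions rather than the UIMA type system itself.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of per-location annotations: for one mention, a record is
    // emitted for each level of the entity hierarchy, carrying the identifier path,
    // location (token offset), and token count. Names are assumptions, not UIMA types.
    public final class Annotator {

        public record Annotation(String typePath, int tokenOffset, int tokenCount) {}

        // Emit one annotation per hierarchy level for a single mention.
        public static List<Annotation> annotateMention(String[] hierarchy, int offset, int tokens) {
            List<Annotation> out = new ArrayList<>();
            StringBuilder path = new StringBuilder();
            for (String level : hierarchy) {
                path.append('/').append(level);
                out.add(new Annotation(path.toString(), offset, tokens));
            }
            return out;
        }

        public static void main(String[] args) {
            // A "Sam Palmisano" mention at token offset 3, spanning 2 tokens, yields
            // annotations for /Entity, /Entity/People, and /Entity/People/Sam Palmisano.
            annotateMention(new String[] {"Entity", "People", "Sam Palmisano"}, 3, 2)
                    .forEach(System.out::println);
        }
    }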
  • FIG. 4 illustrates how annotation vectors are generated. While the example documents in FIG. 4 are in Web markup language, the invention is not limited to a particular format of document.
  • As shown, a raw document 44 with document ID, content, and other data known to those of skill in the art (crawl date, URL, etc.) can be stored at 46 and then operated on by an annotator 48 to produce an annotated document 50, which lists, among other things, various entities in the document as shown. The annotated document 50 may also be stored at 46. An index component 52 then accesses the annotated documents 50 to produce annotation vectors 54, showing, for each entity, the documents in which it appears.
  • Proceeding to block 56 in FIG. 3, the annotation vectors are inverted by a software-implemented indexer such that for each document, a table of unique annotations is produced and the locations on the document where the annotation occurred are recorded. Within a non-limiting indexer, the location, span and unique entity identifiers are recorded for each location. When a given annotation has occurred more than once on a document, the annotation locations are structured as a list of annotations, sorted by the order the individual annotations occurred in the document. If an annotation is unique on a document, the table can be considered to point at a location list with a size of one.
  • Briefly referencing FIG. 5, as more documents are processed by the indexer, a unique annotation table 58 (referred to herein as a dictionary) and the corresponding annotation lists are merged to produce the document table 60. Once all documents have been processed, a final index as shown in FIG. 5 is produced which contains all the unique annotations and lists of the documents in which they have occurred, also preferably with the location within a document of each occurrence. The data structure of FIG. 5 facilitates efficient entity (term) lookup, efficient Boolean operations, and efficient storage of a large number of data records.
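  • A simplified sketch of this inversion step is shown below: per-document annotation lists are folded into a dictionary mapping each unique annotation to the documents, and locations within them, where it occurred. The class and method names are assumptions, and the structure is deliberately reduced relative to FIG. 5.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Simplified sketch of the inverted index: each unique annotation maps to the
    // documents, and locations within them, where it occurred. Names are assumptions.
    public final class AnnotationIndexer {

        public record Occurrence(String docId, int location) {}

        // dictionary: annotation -> list of (document, location) occurrences
        private final Map<String, List<Occurrence>> dictionary = new LinkedHashMap<>();

        // Fold one document's annotations (location -> annotation string) into the index.
        public void addDocument(String docId, Map<Integer, String> annotationsByLocation) {
            annotationsByLocation.forEach((location, annotation) ->
                    dictionary.computeIfAbsent(annotation, a -> new ArrayList<>())
                              .add(new Occurrence(docId, location)));
        }

        public List<Occurrence> occurrencesOf(String annotation) {
            return dictionary.getOrDefault(annotation, List.of());
        }

        public static void main(String[] args) {
            AnnotationIndexer index = new AnnotationIndexer();
            index.addDocument("doc-1", Map.of(3, "/Entity/People/Sam Palmisano",
                                              9, "/Entity/Organizations/IBM"));
            index.addDocument("doc-2", Map.of(0, "/Entity/Organizations/IBM"));
            System.out.println(index.occurrencesOf("/Entity/Organizations/IBM"));
        }
    }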
  • Returning once again to FIG. 3, the logic next moves to block 62 to define a set of inner entities and a set of outer entities. Notionally, the inner entities define the sub-row groups and the outer entities define the sub-rows within a group in the sparse matrix 30 of FIG. 2.
  • Thus, the inner set is the class of entities of primary interest. The inner set can be the set of all entities, or a subset of all entities. The outer set is the class of entities of interest for determining if a relationship exists between that entity and an inner entity, and this set may also be the set of all entities or only a subset thereof.
  • Once the classes of entities are defined, the lists of document locations for those classes are retrieved from the indexer, i.e., the data structures of FIGS. 4 and 5 are accessed. At block 64 the lists are scanned sequentially to determine all the pairs of inner and outer entities which occur within a given proximity boundary. Proximity boundaries can be within the same sentence, paragraph, document, or within a fixed number of tokens.
  • When a pair is determined to be within the proximity constraint, at block 66 a loop is entered in which the unique entity identifiers stored within the two locations are compared to each other at decision diamond 68 to ensure that the entities are distinct. If they are the same, the process accesses the next pair (assuming the DO loop is not complete) at block 70 and loops back to decision diamond 68. On the other hand, if the entities are distinct from each other, the pair is appended at block 72 to a list of all pairs which have been discovered.
  • Once the lists of locations have been exhausted (i.e., the DO loop is complete), the list of pairs is processed at block 74 to produce a table of all unique pairs which occurred and the number of times the pair occurred. This table is sorted in accordance with principles discussed above into the sparse matrix 30 of FIG. 2. The string table is likewise produced using the lists in FIGS. 4 and 5.
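  • The pair-discovery loop of blocks 64-74 might be sketched as follows, here using a fixed token distance as the proximity boundary; the names, the proximity choice, and the pair-key encoding are all assumptions made for illustration.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the pair-discovery scan: keep pairs of distinct inner/outer entity
    // mentions within a fixed token distance, and tally how often each unique pair
    // occurs. Names, the proximity choice, and the pair key are assumptions.
    public final class CooccurrenceCounter {

        public record Mention(String entityId, int tokenOffset) {}

        // Count co-occurring (inner, outer) pairs within maxTokenDistance in one document.
        public static Map<String, Integer> countPairs(List<Mention> inner, List<Mention> outer,
                                                      int maxTokenDistance) {
            Map<String, Integer> counts = new HashMap<>();
            for (Mention a : inner) {
                for (Mention b : outer) {
                    boolean distinct = !a.entityId().equals(b.entityId());                // diamond 68
                    boolean near = Math.abs(a.tokenOffset() - b.tokenOffset()) <= maxTokenDistance;
                    if (distinct && near) {
                        counts.merge(a.entityId() + "|" + b.entityId(), 1, Integer::sum); // blocks 72, 74
                    }
                }
            }
            return counts;
        }

        public static void main(String[] args) {
            List<Mention> people = List.of(new Mention("Sam Palmisano", 3), new Mention("Steve Mills", 6));
            List<Mention> orgs = List.of(new Mention("IBM", 12));
            System.out.println(countPairs(people, orgs, 10)); // each person pairs once with IBM
        }
    }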
  • To execute a query, the sparse matrix 30 and string table may be used as follows. It is to be understood that other, less preferred sparse matrix formats may be used, but in the preferred implementation the sparse matrix 30, advantageously ordered as discussed above, is used.
  • For an example query such as "which 'N' medical conditions are most often mentioned with drug X?", the string table (which, recall, has the same order of entities as the sparse matrix) is accessed to locate drug X (and hence the position of its group of sub-rows in the sparse matrix). Then the sparse matrix is accessed using the drug entity as the entering argument, and the highest sub-rows in the group that correspond to medical conditions are retrieved. Since the sub-rows are in order of cardinality, the first sub-row indicates the entity in the corpus having the most co-occurrences with drug X, and it is examined to determine whether it corresponds to a co-occurring entity that is classified as a "condition". If not, the next sub-row is examined, and so on, until the "N" highest-cardinality sub-rows indicating the most frequently co-occurring conditions are identified. The result is then returned. For a simpler query, e.g., "which drug is most often mentioned on the Web?", the string table is accessed from the beginning to find the most frequently occurring entity that has been classified as a drug, and the result returned.
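  • A sketch of the first example query is given below: the sub-rows of drug X's group, already sorted by descending cardinality, are walked in order and the first "N" co-occurring entities classified as conditions are kept. The layout follows the earlier sparse-matrix sketch and the names are assumptions, not the specification's exact format.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.IntPredicate;

    // Sketch of the example query: drug X's sub-rows are already sorted by descending
    // cardinality, so walking them in order and keeping the first N co-occurring
    // entities classified as "condition" answers the query. Names are assumptions.
    public final class TopConditionsQuery {

        // cooccurringEntities: the column values of drug X's sub-rows, highest cardinality first.
        public static List<Integer> topN(int[] cooccurringEntities, IntPredicate isCondition, int n) {
            List<Integer> result = new ArrayList<>();
            for (int i = 0; i < cooccurringEntities.length && result.size() < n; i++) {
                if (isCondition.test(cooccurringEntities[i])) {  // skip entities of other classes
                    result.add(cooccurringEntities[i]);
                }
            }
            return result;
        }

        public static void main(String[] args) {
            // Entities 7 and 13 are hypothetically classified as conditions; 17 and 1 are not.
            IntPredicate isCondition = id -> id == 7 || id == 13;
            System.out.println(topN(new int[] {7, 17, 1, 13}, isCondition, 2)); // prints [7, 13]
        }
    }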
  • An s-web of around thirty thousand co-occurrence entries may be smaller than two gigabytes. This means that these "co-occurrence snapshots" can fit easily on removable media (DVD, CD, thumb drive, etc.). Applications can be included on the media as well, allowing stand-alone delivery of these facts, which customers can explore to discover actionable business insights.
  • While the particular SYSTEM AND METHOD FOR CREATION, REPRESENTATION, AND DELIVERY OF DOCUMENT CORPUS ENTITY CO-OCCURRENCE INFORMATION is herein shown and described in detail, it is to be understood that the subject matter which is encompassed by the present invention is limited only by the claims.

Claims (21)

1. A computer programmed to execute logic comprising:
receiving a query;
in response to the query, accessing at least one sparse matrix containing information representing co-occurrences of entities in a document corpus; and
returning information obtained in the accessing act as a response to the query.
2. The computer of claim 1, wherein the sparse matrix has groups of sub-rows, each group corresponding to an entity in the document corpus, the groups being sorted in the sparse matrix from most occurring entity to least occurring entity, each sub-row of a group corresponding to an entity co-occurring in the document corpus, within at least one co-occurrence criterion, with the entity represented by the group, the sub-rows within a group being sorted from most occurring co-occurrence to least occurring co-occurrence.
3. The computer of claim 1, wherein the logic further includes, in response to the query, accessing a row index that points to a starting position of a group of sub-rows in the sparse matrix.
4. The computer of claim 1, wherein the logic further includes, in response to the query, accessing a header including at least two bytes, the first of which indicates a file version and the second byte of which indicates a number of bytes used for at least one cardinality representing a corresponding number of entity co-occurrences.
5. The computer of claim 4, wherein the cardinality is expressed exactly.
6. The computer of claim 4, wherein the cardinality is expressed using a two-byte approximation.
7. The computer of claim 1, wherein the logic further comprises accessing a string table including an index and a corresponding data string.
8. The computer of claim 7, wherein the index is a concatenated list of integers representing offsets of entity-representing strings in the data string.
9. The computer of claim 8, wherein the entity-representing strings in the data string are listed in descending order of frequency of occurrence in the document corpus.
10. The computer of claim 1, wherein the document corpus is the World Wide Web.
11. A service, comprising:
receiving a query for information contained in the World Wide Web; and
returning a response to the query at least in part by accessing a data structure including a sparse matrix.
12. The service of claim 11, wherein the sparse matrix comprises entity representations representing entities in a document corpus, the entity representations being sorted by frequency of entity occurrence within the corpus and, within an entity representation, information being sorted by frequency of co-occurrence of other entities with the entity corresponding to the entity representation.
13. The service of claim 12, wherein the entity representations are groups of sub-rows in the sparse matrix, the groups are sorted from most occurring entity to least occurring entity, with each sub-row of a group corresponding to an entity co-occurring in the document corpus with the entity represented by the group, the sub-rows within a group being sorted from most occurring co-occurrence to least occurring co-occurrence.
14. The service of claim 11, wherein the data structure includes a row index that points to a starting position of a group of sub-rows in the sparse matrix.
15. The service of claim 11, wherein the data structure includes a header including at least two bytes, the first of which indicates a file version and the second byte of which indicates a number of bytes used for at least one cardinality representing a corresponding number of entity co-occurrences.
16. The service of claim 11, wherein the data structure includes a string table.
17. The service of claim 16, wherein the string table includes an index and a corresponding data string.
18. The service of claim 17, wherein the index is a concatenated list of integers representing offsets of entity-representing strings in the data string.
19. The service of claim 18, wherein the entity-representing strings in the data string are listed in descending order of frequency of occurrence in the document corpus.
20. A method for responding to queries for information in a document corpus, comprising:
receiving the query;
using at least a portion of the query as an entering argument to access a sparse matrix; and
returning a response to the query at least in part based on the access of the sparse matrix.
21. The method of claim 20, wherein the document corpus includes the World Wide Web and the sparse matrix includes entity representations that are respective groups of sub-rows in the sparse matrix, wherein the groups are sorted from most occurring entity to least occurring entity, with each sub-row of a group corresponding to an entity co-occurring in the document corpus with the entity represented by the group, the sub-rows within a group being sorted from most occurring co-occurrence to least occurring co-occurrence.
US12/062,096 2006-05-26 2008-04-03 System and method for creation, representation, and delivery of document corpus entity co-occurrence information Abandoned US20080222146A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/062,096 US20080222146A1 (en) 2006-05-26 2008-04-03 System and method for creation, representation, and delivery of document corpus entity co-occurrence information

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/442,377 US7593940B2 (en) 2006-05-26 2006-05-26 System and method for creation, representation, and delivery of document corpus entity co-occurrence information
US12/062,096 US20080222146A1 (en) 2006-05-26 2008-04-03 System and method for creation, representation, and delivery of document corpus entity co-occurrence information

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US11/442,377 Continuation US7593940B2 (en) 2006-05-26 2006-05-26 System and method for creation, representation, and delivery of document corpus entity co-occurrence information

Publications (1)

Publication Number Publication Date
US20080222146A1 true US20080222146A1 (en) 2008-09-11

Family

ID=38750731

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/442,377 Expired - Fee Related US7593940B2 (en) 2006-05-26 2006-05-26 System and method for creation, representation, and delivery of document corpus entity co-occurrence information
US12/062,096 Abandoned US20080222146A1 (en) 2006-05-26 2008-04-03 System and method for creation, representation, and delivery of document corpus entity co-occurrence information

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US11/442,377 Expired - Fee Related US7593940B2 (en) 2006-05-26 2006-05-26 System and method for creation, representation, and delivery of document corpus entity co-occurrence information

Country Status (2)

Country Link
US (2) US7593940B2 (en)
CN (1) CN101079070B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8977953B1 (en) * 2006-01-27 2015-03-10 Linguastat, Inc. Customizing information by combining pair of annotations from at least two different documents
US7587407B2 (en) * 2006-05-26 2009-09-08 International Business Machines Corporation System and method for creation, representation, and delivery of document corpus entity co-occurrence information
EP1939767A1 (en) * 2006-12-22 2008-07-02 France Telecom Construction of a large co-occurrence data file
CA2717462C (en) 2007-03-14 2016-09-27 Evri Inc. Query templates and labeled search tip system, methods, and techniques
US8594996B2 (en) 2007-10-17 2013-11-26 Evri Inc. NLP-based entity recognition and disambiguation
WO2009052308A1 (en) 2007-10-17 2009-04-23 Roseman Neil S Nlp-based content recommender
US8386461B2 (en) * 2008-06-16 2013-02-26 Qualcomm Incorporated Method and apparatus for generating hash mnemonics
US9710556B2 (en) * 2010-03-01 2017-07-18 Vcvc Iii Llc Content recommendation based on collections of entities
US8645125B2 (en) 2010-03-30 2014-02-04 Evri, Inc. NLP-based systems and methods for providing quotations
US8255399B2 (en) 2010-04-28 2012-08-28 Microsoft Corporation Data classifier
EP2616926A4 (en) * 2010-09-24 2015-09-23 Ibm Providing question and answers with deferred type evaluation using text with limited structure
US8725739B2 (en) 2010-11-01 2014-05-13 Evri, Inc. Category-based content recommendation
US9129007B2 (en) 2010-11-10 2015-09-08 Microsoft Technology Licensing, Llc Indexing and querying hash sequence matrices
US8484024B2 (en) * 2011-02-24 2013-07-09 Nuance Communications, Inc. Phonetic features for speech recognition
WO2014133473A1 (en) * 2013-02-28 2014-09-04 Vata Celal Korkut Combinational data mining
US9336280B2 (en) 2013-12-02 2016-05-10 Qbase, LLC Method for entity-driven alerts based on disambiguated features
CA2932401A1 (en) * 2013-12-02 2015-06-11 Qbase, LLC Systems and methods for in-memory database search
US9317565B2 (en) * 2013-12-02 2016-04-19 Qbase, LLC Alerting system based on newly disambiguated features
WO2015132446A1 (en) * 2014-03-04 2015-09-11 Nokia Technologies Oy Method and apparatus for secured information storage
CN105468605B (en) * 2014-08-25 2019-04-12 济南中林信息科技有限公司 Entity information map generation method and device
CN107220321B (en) * 2017-05-19 2021-02-09 重庆邮电大学 Method and system for three-dimensional materialization of entity in scene conversion
US10963514B2 (en) * 2017-11-30 2021-03-30 Facebook, Inc. Using related mentions to enhance link probability on online social networks
CN111427967B (en) * 2018-12-24 2023-06-09 顺丰科技有限公司 Entity relationship query method and device
CN112069366B (en) * 2020-08-28 2024-02-09 喜大(上海)网络科技有限公司 Recall determination method, recall determination device, recall determination equipment and storage medium
CN112199082B (en) * 2020-10-14 2023-04-14 杭州安恒信息技术股份有限公司 HTTP response processing method and device, electronic equipment and storage medium

Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5963642A (en) * 1996-12-30 1999-10-05 Goldstein; Benjamin D. Method and apparatus for secure storage of data
US5978792A (en) * 1995-02-24 1999-11-02 International Business Machines Corporation Method and apparatus for generating dynamic and hybrid sparse indices for workfiles used in SQL queries
US5987460A (en) * 1996-07-05 1999-11-16 Hitachi, Ltd. Document retrieval-assisting method and system for the same and document retrieval service using the same with document frequency and term frequency
US6058392A (en) * 1996-11-18 2000-05-02 Wesley C. Sampson Revocable Trust Method for the organizational indexing, storage, and retrieval of data according to data pattern signatures
US6442545B1 (en) * 1999-06-01 2002-08-27 Clearforest Ltd. Term-level text with mining with taxonomies
US20020165884A1 (en) * 2001-05-04 2002-11-07 International Business Machines Corporation Efficient storage mechanism for representing term occurrence in unstructured text documents
US6678679B1 (en) * 2000-10-10 2004-01-13 Science Applications International Corporation Method and system for facilitating the refinement of data queries
US20040064438A1 (en) * 2002-09-30 2004-04-01 Kostoff Ronald N. Method for data and text mining and literature-based discovery
US6785677B1 (en) * 2001-05-02 2004-08-31 Unisys Corporation Method for execution of query to search strings of characters that match pattern with a target string utilizing bit vector
US20040199495A1 (en) * 2002-07-03 2004-10-07 Sean Colbath Name browsing systems and methods
US6847966B1 (en) * 2002-04-24 2005-01-25 Engenium Corporation Method and system for optimally searching a document database using a representative semantic space
US20050143971A1 (en) * 2003-10-27 2005-06-30 Jill Burstein Method and system for determining text coherence
US20050262039A1 (en) * 2004-05-20 2005-11-24 International Business Machines Corporation Method and system for analyzing unstructured text in data warehouse
US7031910B2 (en) * 2001-10-16 2006-04-18 Xerox Corporation Method and system for encoding and accessing linguistic frequency data
US7117208B2 (en) * 2000-09-28 2006-10-03 Oracle Corporation Enterprise web mining system and method
US7149983B1 (en) * 2002-05-08 2006-12-12 Microsoft Corporation User interface and method to facilitate hierarchical specification of queries using an information taxonomy
US20070061348A1 (en) * 2001-04-19 2007-03-15 International Business Machines Corporation Method and system for identifying relationships between text documents and structured variables pertaining to the text documents
US7213198B1 (en) * 1999-08-12 2007-05-01 Google Inc. Link based clustering of hyperlinked documents
US20070185871A1 (en) * 2006-02-08 2007-08-09 Telenor Asa Document similarity scoring and ranking method, device and computer program product
US7289911B1 (en) * 2000-08-23 2007-10-30 David Roth Rigney System, methods, and computer program product for analyzing microarray data
US7302442B2 (en) * 2005-06-02 2007-11-27 Data Pattern Index Method for recording, identification, selection, and reporting network transversal paths
US20080189232A1 (en) * 2007-02-02 2008-08-07 Veoh Networks, Inc. Indicator-based recommendation system
US7587407B2 (en) * 2006-05-26 2009-09-08 International Business Machines Corporation System and method for creation, representation, and delivery of document corpus entity co-occurrence information
US7743058B2 (en) * 2007-01-10 2010-06-22 Microsoft Corporation Co-clustering objects of heterogeneous types

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5619709A (en) * 1993-09-20 1997-04-08 Hnc, Inc. System and method of context vector generation and retrieval
US7007015B1 (en) * 2002-05-01 2006-02-28 Microsoft Corporation Prioritized merging for full-text index on relational store
US7016914B2 (en) * 2002-06-05 2006-03-21 Microsoft Corporation Performant and scalable merge strategy for text indexing
US7139752B2 (en) * 2003-05-30 2006-11-21 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US7373102B2 (en) * 2003-08-11 2008-05-13 Educational Testing Service Cooccurrence and constructions

Patent Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5978792A (en) * 1995-02-24 1999-11-02 International Business Machines Corporation Method and apparatus for generating dynamic and hybrid sparse indices for workfiles used in SQL queries
US5987460A (en) * 1996-07-05 1999-11-16 Hitachi, Ltd. Document retrieval-assisting method and system for the same and document retrieval service using the same with document frequency and term frequency
US6058392A (en) * 1996-11-18 2000-05-02 Wesley C. Sampson Revocable Trust Method for the organizational indexing, storage, and retrieval of data according to data pattern signatures
US5963642A (en) * 1996-12-30 1999-10-05 Goldstein; Benjamin D. Method and apparatus for secure storage of data
US6442545B1 (en) * 1999-06-01 2002-08-27 Clearforest Ltd. Term-level text with mining with taxonomies
US7213198B1 (en) * 1999-08-12 2007-05-01 Google Inc. Link based clustering of hyperlinked documents
US7289911B1 (en) * 2000-08-23 2007-10-30 David Roth Rigney System, methods, and computer program product for analyzing microarray data
US7117208B2 (en) * 2000-09-28 2006-10-03 Oracle Corporation Enterprise web mining system and method
US6678679B1 (en) * 2000-10-10 2004-01-13 Science Applications International Corporation Method and system for facilitating the refinement of data queries
US6954750B2 (en) * 2000-10-10 2005-10-11 Content Analyst Company, Llc Method and system for facilitating the refinement of data queries
US20070061348A1 (en) * 2001-04-19 2007-03-15 International Business Machines Corporation Method and system for identifying relationships between text documents and structured variables pertaining to the text documents
US6785677B1 (en) * 2001-05-02 2004-08-31 Unisys Corporation Method for execution of query to search strings of characters that match pattern with a target string utilizing bit vector
US20020165884A1 (en) * 2001-05-04 2002-11-07 International Business Machines Corporation Efficient storage mechanism for representing term occurrence in unstructured text documents
US7031910B2 (en) * 2001-10-16 2006-04-18 Xerox Corporation Method and system for encoding and accessing linguistic frequency data
US6847966B1 (en) * 2002-04-24 2005-01-25 Engenium Corporation Method and system for optimally searching a document database using a representative semantic space
US7149983B1 (en) * 2002-05-08 2006-12-12 Microsoft Corporation User interface and method to facilitate hierarchical specification of queries using an information taxonomy
US20040199495A1 (en) * 2002-07-03 2004-10-07 Sean Colbath Name browsing systems and methods
US20040064438A1 (en) * 2002-09-30 2004-04-01 Kostoff Ronald N. Method for data and text mining and literature-based discovery
US20050143971A1 (en) * 2003-10-27 2005-06-30 Jill Burstein Method and system for determining text coherence
US20050262039A1 (en) * 2004-05-20 2005-11-24 International Business Machines Corporation Method and system for analyzing unstructured text in data warehouse
US7302442B2 (en) * 2005-06-02 2007-11-27 Data Pattern Index Method for recording, identification, selection, and reporting network transversal paths
US20070185871A1 (en) * 2006-02-08 2007-08-09 Telenor Asa Document similarity scoring and ranking method, device and computer program product
US7587407B2 (en) * 2006-05-26 2009-09-08 International Business Machines Corporation System and method for creation, representation, and delivery of document corpus entity co-occurrence information
US7743058B2 (en) * 2007-01-10 2010-06-22 Microsoft Corporation Co-clustering objects of heterogeneous types
US20080189232A1 (en) * 2007-02-02 2008-08-07 Veoh Networks, Inc. Indicator-based recommendation system

Also Published As

Publication number Publication date
US20070276830A1 (en) 2007-11-29
US7593940B2 (en) 2009-09-22
CN101079070B (en) 2010-09-15
CN101079070A (en) 2007-11-28

Similar Documents

Publication Publication Date Title
US7593940B2 (en) System and method for creation, representation, and delivery of document corpus entity co-occurrence information
US7587407B2 (en) System and method for creation, representation, and delivery of document corpus entity co-occurrence information
US7987189B2 (en) Content data indexing and result ranking
US9286377B2 (en) System and method for identifying semantically relevant documents
Dill et al. A case for automated large-scale semantic annotation
Tseng Automatic thesaurus generation for Chinese documents
Feinerer et al. Text mining infrastructure in R
Dill et al. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation
Nadkarni et al. UMLS concept indexing for production databases: a feasibility study
Adar et al. Information arbitrage across multi-lingual Wikipedia
US20100005061A1 (en) Information processing with integrated semantic contexts
US20100005087A1 (en) Facilitating collaborative searching using semantic contexts associated with information
US20090070322A1 (en) Browsing knowledge on the basis of semantic relations
US20130110839A1 (en) Constructing an analysis of a document
US20060080315A1 (en) Statistical natural language processing algorithm for use with massively parallel relational database management system
WO2003017143A2 (en) Method and system for enhanced data searching
WO2009100081A1 (en) System and method for utilizing advanced search and highlighting techniques for isolating subsets of relevant data
Liu et al. Information retrieval and Web search
WO2011072172A1 (en) System and method for quickly determining a subset of irrelevant data from large data content
Abramowicz et al. Filtering the Web to feed data warehouses
WO2010089248A1 (en) Method and system for semantic searching
Ananthanarayanan et al. Rule based synonyms for entity extraction from noisy text
Agichtein Extracting relations from large text collections
JP2004133564A (en) Document search system
WO2010089403A1 (en) Two-valued logic database management system with support for missing information

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GRUHL, DANIEL FREDERICK;MEREDITH, DANIEL NORIN;REEL/FRAME:020751/0651;SIGNING DATES FROM 20060523 TO 20060526

STCB Information on status: application discontinuation

Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION