WO2001042984A1 - Process and system for retrieval of documents using context-relevant semantic profiles - Google Patents
Process and system for retrieval of documents using context-relevant semantic profiles
- Publication number
- WO2001042984A1 (PCT/US2000/023784)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- query
- document
- semantic
- documents
- text
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99935—Query augmenting and refining, e.g. inexact access
Definitions
- This invention relates to computer-based document search and retrieval. It provides a computational means for learning semantic profiles of terms from a text corpus of known relevance and for cataloging and delivering references to documents with similar semantic profiles.
- The number of documents contained in computer-based information retrieval systems is growing at tremendous rates. For example, the World Wide Web is thought to contain more than 800 million documents already. People looking for specific information in that sea of documents are often frustrated by two factors. First, only a subset of these documents is indexed and, second, of those that are indexed, many are indexed ambiguously. The present invention does not directly address the first limitation of document retrieval, but this limitation nonetheless plays a role in addressing the second limitation.
- The main method currently used for document retrieval is keyword or free-text search. A user enters a search query consisting of one or a few words or phrases and the system returns all of the documents that have been indexed as containing those words or phrases. As more documents are indexed, more documents are expected to contain the specified search terms. For example, one World Wide Web search engine recently returned more than 755,000 documents in response to a query for the word
- The object of the invention is to provide a method for document search in a defined context, so as to obviate the problems related to polysemy and synonymy, and focus the search on documents relevant to the context.
- This invention provides means for context-relevant document retrieval that preferentially returns items that are relevant to a user's interests.
- It learns the semantic profiles of terms from a training body or corpus of text that is known to be relevant.
- A neural network is used to extract semantic profiles from the text of known relevance.
- A new set of documents, such as World Wide Web pages obtained from a keyword search of the Internet, is then submitted for processing to the same neural network, which computes a semantic profile representation for these pages using the semantic relations learned from profiling the training documents.
- These semantic profiles can be organized into clusters, i.e., groups of points in a multidimensional space forming relatively close associations, in order to minimize the time required to answer a query.
- Figure 1 is a block diagram showing the basic steps of the process of building and training a neural network, and of evaluating the semantic profiles of the documents searched.
- Figure 2 shows a more detailed block diagram of the steps used to process the base text and those used to build the term vectors.
- Figure 3 shows an example of a text vector that illustrates how words are represented by the vector; it also shows examples of word stemming.
- Figure 4 shows the steps of processing a user's query by the system.
- Figure 5 shows the outline of the neural network system used to compute the semantics of the words in their specific training context; it shows how the network can be used to estimate result vectors given a selected input text vector.
- Figure 6 shows a schematic of a Self Organizing Feature Map neural network used for clustering similar documents.
- The present invention uses the context in which words appear as part of the representation of what those words mean.
- Context serves to disambiguate polysemous words and to extract the commonality among synonymous word use, because synonyms tend to occur in the same context.
- The system extracts these meaningful relationships using brain-style computational systems as implemented in artificial neural networks (or isomorphic statistical processes). It then retrieves documents that are specifically relevant to the words used in the search query as those words are used in a specific context. When searching for documents in a "golf" context, the word "pitch," for example, will have a different meaning than when searching in a "music" context. In the golf context, documents relevant to musical pitch would simply be clutter.
- The context for the system comes from the analysis of a body of text, called a base text or a text corpus, that is representative of the interests of a particular community, as a guide to interpreting the words in their vocabulary, i.e., the aggregate of words in the text corpus.
- The editors of a magazine know what their readers want to know about. They have established a vernacular of communication with their audience. Their readers, in turn, have evolved a transparent conceptual framework for this material, categorizing things partially on the basis of the conventions established by the editors and partially on the basis of their own unique perspectives. When they search, they want their results to match these conceptual frameworks.
- The documents written by a corporation or another group also employ a specialized vocabulary and a consensual interpretation of those terms. Texts such as these could also serve as base texts for this system.
- This invention uses a corpus of texts with a single point of view to train a neural network to extract the context-specific meaning of the terms appearing in the corpus. For example, to learn about golf-related terms, the system would process a series of documents known to be about golf, such as those appearing in a golfing magazine. Other specialty magazines, or more generally, other texts that are produced either by or for a specific community could similarly be used to train the network to learn other contexts. A user searching from a golf-related context would then be offered preferentially golf-related documents. A user employing a search engine trained on musical materials would then preferentially be offered music-related documents in response to a search query.
- The first three steps of Figure 1 illustrate the sequence of steps employed by the invention in processing a text of known relevance, i.e., the base text or text corpus, to extract its vocabulary and semantic patterns (101, 102, 103).
- This text might consist of a set of articles published in a book or special interest magazine for some period of time, e.g., six months to a year. It could alternatively be any body of text produced either by or for a community such as a chat group, industry organization, corporation, or field of scholarship.
- This text provides a coherent view of the document collection from the perspective of the community.
- The ranges of word meanings in the text are relatively constrained compared to their general use in all texts; thus, the ambiguity of the terms is reduced substantially.
- Step 101, processing the text, is detailed on the left side of Figure 2 (201, 202, 204).
- The text is converted to ASCII (201), if necessary.
- Formatting is removed from the text, leaving hard returns to mark the ends of complete paragraphs (202).
- Short paragraphs (e.g., paragraphs of fewer than five words) are combined with the subsequent paragraphs (203).
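- The paragraph clean-up in steps 202 and 203 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `merge_short_paragraphs` and the `min_words` parameter are hypothetical, and the five-word cutoff is simply the example value mentioned above.

```python
def merge_short_paragraphs(paragraphs, min_words=5):
    """Combine paragraphs shorter than min_words with the paragraph that follows."""
    merged = []
    carry = ""  # text of short paragraphs waiting to be attached
    for para in paragraphs:
        text = (carry + " " + para).strip() if carry else para.strip()
        if len(text.split()) < min_words:
            carry = text          # still too short: hold it for the next paragraph
        else:
            merged.append(text)
            carry = ""
    if carry:                     # a trailing short paragraph has no successor
        merged.append(carry)
    return merged


# Hard returns mark paragraph ends; the title line is shorter than five words.
raw = "Golf Tips\nKeep your head down during the swing.\nPractice putting every single day."
paragraphs = merge_short_paragraphs(raw.split("\n"))
```

In this toy input the title is folded into the first full paragraph, so two paragraphs remain.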
- The second step (102) is to convert the text into term vectors.
- Stop words, such as "a," "an," "the," and other very common words, are removed from the vocabulary (204). Also removed are certain inflectional morphemes (205). For example, "trained" is transformed to "train" by removing the past-tense morpheme "-ed," "government" is reduced to "govern," and so forth. Stop words are among the most common words in the language.
- Their role is to provide grammatical structure to the sentence, rather than to contribute substantially to the meaning of the text.
- Many hundreds of stop words are used in database search applications.
- The current instantiation of the invention includes 379 stop words, listed in Appendix A, but the number is not critical. Having more or fewer stop words affects the efficiency of storage and computation, without changing the overall strategy of the process.
- Inflectional morphemes add little to the basic meaning of the words in the sentence.
- The inflections function primarily to signal grammatical roles.
- Changing "govern" to "government" transforms the word from a verb to a noun, but does not change its underlying meaning.
- The removal of inflections, or "stemming," is common in many database retrieval applications.
- The list of unique words in the text (i.e., all of the word stems with inflections and stop words removed) forms the vocabulary of the text. The length of this list is K.
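- Building the vocabulary from stop-word removal and stemming might look like the sketch below. The stop list and the suffix list are tiny stand-ins chosen for illustration; they are assumptions, not the 379-word list of Appendix A or the stemming rules actually used by the invention.

```python
import re

# Tiny illustrative stop list; the list referred to in the text has 379 entries.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "was"}

# Assumed toy suffix list, checked in this order; not the invention's stemming rules.
SUFFIXES = ("ing", "es", "ed", "s")


def stem(word):
    """Strip one suffix if the remaining stem keeps at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_vocabulary(paragraphs):
    """Map each unique stem to an index; the ordering is arbitrary but fixed."""
    vocabulary = {}
    for para in paragraphs:
        for token in re.findall(r"[a-z]+", para.lower()):
            if token in STOP_WORDS:
                continue
            vocabulary.setdefault(stem(token), len(vocabulary))
    return vocabulary


vocab = build_vocabulary(["The player trained and expands the game.",
                          "Dominoes expanding: a domino expands."])
K = len(vocab)  # length of every text vector
```

Different surface forms such as "expands" and "expanding" share one entry, so K counts stems rather than word forms.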
- The next step is to build the input vectors (206).
- Each element in the vector corresponds to a unique word in the vocabulary. For example, if the text corpus uses 10,000 different words, then the input vector would consist of 10,000 elements. Table 1 shows some examples of words and their index numbers.
- Table 1. Example words and their index into the text vector.
- Figure 3 shows a schematic of a text vector and a few of the words that go into it.
- The ordering of the words is constant but arbitrary.
- Element number 1640 is nonzero whenever the word "domino" or "dominoes" appears in the text being processed. Both of these words point to the same element in the text vector because "domino" is the stem for "dominoes."
- This pair, as well as the set "expand," "expands," and "expanding," illustrates the effect of stemming: multiple forms constitute only a single unique word entry. We may wish to keep track of the various forms even when they do not point to unique vector elements, because this makes computing the vector representation for real texts more efficient.
- Paragraphs tend to be about a single topic and they tend to use the words in them in the same way. It would be rare, for example, to find a paragraph that used the word "pitch" in both its golf and its musical sense. For each word in the paragraph, the vector element corresponding to that word is increased. It may be preferable to limit each vector element to some maximum value, or, more generally, to convert the count of each word to the corresponding vector element through some mathematical function. Vector elements corresponding to words not present in the paragraph are set to zero. These vectors are then used to train the neural network.
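- A paragraph vector with a capped count per element could be built as in the following sketch. The cap value `max_count=3` and the four-word vocabulary are hypothetical; the text only says that some maximum, or some other function of the count, may be preferable.

```python
def paragraph_vector(stems, vocabulary, max_count=3):
    """Count each in-vocabulary stem, capping every element at max_count."""
    vector = [0] * len(vocabulary)          # words absent from the paragraph stay 0
    for s in stems:
        index = vocabulary.get(s)
        if index is not None:               # out-of-vocabulary stems are skipped
            vector[index] = min(vector[index] + 1, max_count)
    return vector


# Hypothetical vocabulary (stem -> element index) and an already-stemmed paragraph.
vocabulary = {"pitch": 0, "green": 1, "club": 2, "melody": 3}
stems = ["pitch", "green", "pitch", "club", "pitch", "pitch", "fairway"]
x = paragraph_vector(stems, vocabulary)
```

Here "pitch" occurs four times but its element is held at the cap, and "melody" stays at zero because it does not occur in the paragraph.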
- Each input vector is presented to a neural network typically consisting of three layers of units (103).
- The input layer receives the input text vectors, the output layer produces a corresponding output or result vector, and the middle hidden layer intervenes between the input and output layers.
- The weights of the connections between each hidden unit and the elements of the input vector are adjusted according to a version of the Hebbian learning rule so that the input pattern is reproduced on the output units with minimum error.
- The network learns the general profile by which each word is used over all of the paragraphs in which it appears.
- The hidden units come to encode a summary of the relationships among the elements of the input vectors. This summary reflects the disambiguating effects of the context in order to deal with polysemy.
- The set of documents in the document base is processed through the same neural network to obtain semantic profiles for each document.
- These documents could be pages retrieved from the World Wide Web.
- Documents are retrieved (Figure 1, 104) and converted to text vectors, one vector for each document (Figure 1, 105).
- The inflections are removed according to the same rules used to process the original text.
- The stop words are removed as well.
- Words in the documents that are not in the original text are counted, but otherwise ignored.
- The terms are weighted using inverse document frequency (the inverse of the number of documents that contain the word), meaning that more common words, i.e., those that appear in more documents, are weighted less than words that appear in only a few documents. They are also weighted by the inverse length of the document: the longer the document, the more frequently words can appear in it, and this bias needs to be removed.
- Documents are rejected if they contain more than a certain number of characters, because such lengths usually indicate that the document is a digest of a large number of only weakly related items, such as want-ads or email archives. Such documents are too long and contain too many different kinds of items to be useful. We have empirically found that a good maximum length limit is 250,000 characters.
- Each document's identification information (e.g., its URL), its semantic profile, a brief summary of the document, and any other pertinent information are added to the collection of relevant documents.
- This collection can be stored on a server or on any other accessible computer, including the user's computer.
- A user query (Figure 4, 401) is processed through the same neural network to produce its semantic profile (Figure 4, 402, 403).
- Each word in the query is entered into the text vector and this vector is then fed to the neural network.
- The pattern of activation of the hidden units represents a semantic profile of the search terms. It is then a simple matter to compare the semantic profile of the search terms against the semantic profiles of each stored page (Figure 4, 405).
- The pages that match most closely are the most relevant to the search and should be displayed first for the user.
- The profiles for each of the cached documents can be organized into clusters using either self-organizing feature map neural networks or the equivalent K-means clustering statistical procedure. See TEUVO KOHONEN, SELF-ORGANIZATION AND ASSOCIATIVE MEMORY (2nd ed. Heidelberg Springer-Verlag 1988); Teuvo Kohonen, Self-Organized Formation of Topologically Correct Feature Maps, 43 BIOLOGICAL CYBERNETICS 59-
- Clustering takes advantage of the fact that vectors can be conceptualized as points in a high-dimensional space, one dimension corresponding to each element in the semantic profile. The proximity of one vector to another corresponds to the similarity between the two vectors, obtained using the dot product of the two vectors.
- The user receives a list of documents in which the most pertinent documents are listed first (Figure 4, 406, 407). The user can then click on one of the list items and the computer will retrieve the relevant document.
- The first embodiment to be described is based on a neural network that extracts the principal components from the co-occurrence matrix.
- The base text (the text corpus) is pre-processed to remove all formatting and all hard return characters except those between paragraphs (step 101). Very short paragraphs, such as titles, are combined with the subsequent paragraphs.
- A dictionary (vocabulary) is created that maps word forms and their uninflected stems to elements in a text vector (step 102).
- The number of unique entries is counted to give the length of the text vector (K), which is set to the number of unique words in the base text.
- One text vector of length K is constructed for each paragraph of the base text and these vectors are presented to the neural network (step 103) for learning (training).
- The network implements a principal components analysis of the collection of text vectors. This analysis reduces the data representation from a set of sparse vectors with length K to a collection of reduced vectors with length N. It projects the original data vectors onto another set of vectors, eliminating the redundancy, i.e., the correlation, among the elements of the original vectors.
- the first principal component is given by
- the matrix C is the K * K
- w vectors are all orthonormal.
- N is substantially smaller than K, producing the data reduction and capturing the regularities in the word usage patterns.
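- For orientation, the textbook definition of principal components can be written with the symbols K, N, C, and w used in this passage. Whether the patent's own equations take exactly this form, and whether C is meant as a covariance or an uncentered correlation matrix, is an assumption here.

```latex
\begin{aligned}
C   &= E\!\left[x\,x^{\mathsf T}\right] \in \mathbb{R}^{K \times K}, \\
w_1 &= \arg\max_{\lVert w \rVert = 1} \; w^{\mathsf T} C\, w, \\
w_n &= \arg\max_{\substack{\lVert w \rVert = 1 \\ w \perp w_1,\dots,w_{n-1}}} \; w^{\mathsf T} C\, w,
       \qquad n = 2,\dots,N, \\
y_n &= w_n^{\mathsf T} x .
\end{aligned}
```

Under this definition the vectors w_1, ..., w_N are orthonormal eigenvectors of C for its N largest eigenvalues, and the N outputs y_n form the reduced representation of the K-element text vector x.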
- the network consists of K inputs corresponding to the K elements of the text vector and N linear neurons. The output of the nth neuron in response
- The identical transform can be accomplished using well-known statistical techniques.
- The network and statistical techniques differ only in the details of the algorithm by which the N principal components are computed; the results are identical. See Erkki Oja, Principal Components, Minor Components, and Linear Neural Networks, 5 NEURAL NETWORKS 927-35 (1992); Erkki Oja, Principal Component Analysis, in BRAIN THEORY AND NEURAL NETWORKS
- The learning rate parameter, a function of the time step t, governs the speed of gradient ascent. Typically, these step sizes decrease slowly with time.
- The first term on the right is the Hebbian learning term, the product of the output y and the input x(t), and the remaining terms implement the orthonormality
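- A single-neuron version of this kind of Hebbian update, commonly known as Oja's rule, can be sketched as below. It is a simplified illustration for N = 1 and not the multi-neuron rule of the embodiment; the constant learning rate `eta`, the toy data, and the function name `oja_train` are assumptions.

```python
def oja_train(samples, w, eta=0.05, epochs=200):
    """Oja's rule: w += eta * y * (x - y * w), with y = w . x."""
    w = list(w)
    for _ in range(epochs):
        for x in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))      # neuron output
            # Hebbian term eta*y*x plus a decay term -eta*y*y*w that keeps |w| near 1
            w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


# Toy text vectors: words 0 and 1 always co-occur, word 2 never appears.
samples = [(1.0, 1.0, 0.0), (2.0, 2.0, 0.0)]
w = oja_train(samples, w=(0.6, 0.2, 0.3))
norm = sum(wi * wi for wi in w) ** 0.5
```

On this toy data the weight vector settles on a unit-length direction shared equally by the two co-occurring words, while the weight for the unused word decays toward zero.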
- The connection matrix w has K rows and N columns. Premultiplying the connection matrix by a text vector x yields the neural network activation pattern, which can be used as a
- input vector can be approximated by multiplying the activation vector or semantic profile by the transpose of the connection
- A query encoded into a text vector generates a semantic profile in the same way as the original text.
- This profile is a projection of the semantics of the query in the lower-dimensional basis space.
- Similar profiles can be obtained for documents by transforming the vocabulary in the document into a text vector and multiplying its transpose by the weight matrix.
- The relevance of the document to the query corresponds to the similarity in their semantic profiles, which can be assessed by taking the dot product of the query semantic profile and the document semantic profile. High products indicate high relatedness.
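- Projection into the profile space and ranking by dot product can be sketched as follows. The 4 x 2 weight matrix `W`, the document names, and the helper names are invented for illustration; a trained network would supply the real K x N matrix.

```python
def project(x, W):
    """Semantic profile y = x^T W for a K-element text vector and a K x N matrix."""
    n_cols = len(W[0])
    return [sum(x[k] * W[k][n] for k in range(len(x))) for n in range(n_cols)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def rank_documents(query_vec, doc_vecs, W):
    """Return (doc_id, score) pairs sorted by decreasing dot product."""
    q = project(query_vec, W)
    scores = {doc_id: dot(q, project(x, W)) for doc_id, x in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical 4-word vocabulary reduced to N = 2 profile dimensions.
W = [[1.0, 0.0],   # "pitch"
     [1.0, 0.0],   # "green"
     [0.0, 1.0],   # "melody"
     [0.0, 1.0]]   # "chord"
docs = {"golf_page": [0, 2, 0, 0], "music_page": [0, 0, 1, 2]}
ranking = rank_documents([1, 1, 0, 0], docs, W)
```

The golf page shares no literal word count with "pitch" yet still ranks first, because "green" and "pitch" load on the same profile dimension.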
- The documents are organized into clusters using either Kohonen's self-organizing feature map neural networks (Kohonen, 1982, 1988) or K-means clustering. For example, we might use 144 clusters to store profiles of a million or more documents. The exact number is of course not critical.
- Self-organizing maps, also called "topology preserving maps" because they preserve the topology of the similarity structure of the documents, begin with a two-dimensional sheet of neurons arranged in a grid (Figure 6). Each neuron has weighted connections to each of the inputs in the semantic profile. Initially, the weights of these connections are randomized.
- The dot product between the weight vector for each neuron and the profile vector is computed.
- The neuron with the highest dot product is called the winner.
- This neuron adjusts its weight vector to more closely approximate the profile vector that caused it to win. Its neighboring neurons also adjust their weights toward the profile vector, but by smaller amounts that decrease with increasing separation from the winning neuron.
- The neurons in the sheet come to represent the centroids of profile vector categories and nearby neurons come to represent nearby (in similarity space) clusters.
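- One update of such a map can be sketched as below. The Gaussian neighborhood function, the `rate` and `radius` values, and the 2 x 2 grid are assumptions chosen to keep the example small; the text does not specify them.

```python
import math


def som_step(weights, profile, rate=0.5, radius=1.0):
    """One self-organizing-map update on a grid of weight vectors.

    weights[r][c] is the weight vector of the neuron at row r, column c.
    Returns the (row, col) of the winning neuron.
    """
    # Winner: the neuron whose weight vector has the highest dot product.
    best, winner = None, None
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            score = sum(wi * pi for wi, pi in zip(w, profile))
            if best is None or score > best:
                best, winner = score, (r, c)
    # Move every neuron toward the profile, at a rate that shrinks with grid distance.
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist2 = (r - winner[0]) ** 2 + (c - winner[1]) ** 2
            h = math.exp(-dist2 / (2 * radius ** 2))   # assumed Gaussian neighborhood
            row[c] = [wi + rate * h * (pi - wi) for wi, pi in zip(w, profile)]
    return winner


grid = [[[1.0, 0.0], [0.5, 0.5]],
        [[0.0, 1.0], [0.2, 0.2]]]
winner = som_step(grid, profile=[0.0, 2.0])
```

Repeating this step over all document profiles, usually with a shrinking rate and radius, is what lets neighboring neurons end up representing similar clusters.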
- The query terms are transformed into a text vector (the query vector), which is submitted to the neural network.
- The resulting profile is then compared with the centroids of each of the categories learned by the self-organizing map.
- One of these centroids will provide the best match to the query vector, so the patterns represented by that neuron are likely to be the best match for the query.
- The semantic profiles of the documents in this cluster are compared with the query vector and ranked in order of decreasing dot products, corresponding to decreasing relevance to the query. Typically only the documents from one cluster will have to be compared one by one to the query vector and the remaining documents in the database will not need to be examined. If too few documents are retrieved in this step (e.g., fewer than 50 that meet a minimum relevance requirement or fewer than the user requests), the next closest cluster will also be searched.
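- The cluster-first search with fallback to the next closest cluster might be sketched as follows. The data layout, the function name `cluster_search`, and the toy default `min_results=2` (in place of the figure of 50 given in the text) are assumptions made for the example.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def cluster_search(query, centroids, clusters, min_results=2, min_score=0.0):
    """Search clusters in order of centroid match until enough documents are found.

    centroids: {cluster_id: centroid vector}
    clusters:  {cluster_id: {doc_id: profile vector}}
    """
    order = sorted(centroids, key=lambda cid: dot(query, centroids[cid]), reverse=True)
    hits = []
    for cid in order:
        for doc_id, profile in clusters[cid].items():
            score = dot(query, profile)
            if score > min_score:            # minimum relevance requirement
                hits.append((doc_id, score))
        if len(hits) >= min_results:         # enough documents: stop early
            break
    return sorted(hits, key=lambda h: h[1], reverse=True)


centroids = {"golf": [1.0, 0.0], "music": [0.0, 1.0]}
clusters = {
    "golf": {"g1": [0.9, 0.1], "g2": [0.8, 0.0]},
    "music": {"m1": [0.1, 0.9]},
}
results = cluster_search([1.0, 0.2], centroids, clusters)
wider = cluster_search([1.0, 0.2], centroids, clusters, min_results=3)
```

With the default setting only the golf cluster is examined; asking for three results forces the search to continue into the music cluster.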
- F_i is either 1 or 0 depending on whether or not the i-th word appeared in the corresponding paragraph.
- This learning rule increments the association between word i and the other words with which it appears without regard to the frequency of word i in the context. The maximum prevents total domination of the network by a small number of words and it allows the network to be stored in the form of 8-bit integers for substantial memory savings.
- The essence of this learning rule is a matrix in which the rows represent all of the unique words in the vocabulary and the columns represent the frequencies with which each other word appeared with that specific word. This representation is very simple, with additional nonlinearities typical of neural networks being incorporated elsewhere in the process, described later, for simplicity and speed of processing.
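- A capped co-occurrence count of this kind can be sketched as below. The cap of 255 (the largest unsigned 8-bit value), the nested-dictionary storage, and the exclusion of a word's pairing with itself are assumptions; the exact learning rule is not reproduced in this text.

```python
from itertools import permutations

MAX_WEIGHT = 255  # assumed cap so that each weight fits in an 8-bit integer


def train_associator(paragraph_word_sets):
    """Increment w[i][j] once for each paragraph in which words i and j co-occur."""
    weights = {}
    for words in paragraph_word_sets:
        # Each word is either present or absent in a paragraph (F_i is 1 or 0),
        # so a paragraph is represented here as a set of stems.
        for i, j in permutations(sorted(words), 2):   # all ordered pairs, i != j
            row = weights.setdefault(i, {})
            row[j] = min(row.get(j, 0) + 1, MAX_WEIGHT)
    return weights


paragraphs = [{"pitch", "green"}, {"pitch", "green", "club"}, {"melody", "chord"}]
w = train_associator(paragraphs)
```

Only pairs that actually co-occur get an entry, so unrelated words such as "pitch" and "melody" take no storage at all.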
- This matrix is extremely sparse - most of its entries are 0's because many word pairs do not occur; it is stored on disk as a sparse matrix. As described, this matrix implements what is called a linear associator network (Kohonen, 1988). In the linear associator network we can multiply the text vector x by the weight matrix w to produce the estimated or
- Candidate documents for the database are processed by parsing them into vectors similar to those derived from the basic text. Some of the words in a document are in the vocabulary being processed and some are not. Those words not in the vocabulary are counted, but are otherwise ignored. The frequency with which each word appears in the document is stored in a text vector. This vector is then submitted to the neural network and a result vector is obtained. A threshold is then calculated corresponding to nonlinear lateral inhibition in the neural network. It is computed on the output for mathematical convenience, but could equally well be done within the network. Only those items substantially above the mean are maintained; the others are set to 0. In doing this, we preserve the most important relations among the terms and neglect those relationships that are more idiosyncratic in the text. We typically set the threshold to be 2.5-3.5 times the standard deviation of the nonzero output vector elements. The specific value is not critical provided that it is substantially above the mean. The nonzero entries are then sorted into descending order
- the result vector x represents the neural
- Vocabulary words are those that are included in the base text dictionary. Ratios very far from 0.4 are anomalous and such documents can be discarded. Very high ratios indicate that the document being processed is very short and contains only a few words from the vocabulary making the determination of its true relevance difficult. Documents with very low ratios indicate that they are largely about other topics and so are irrelevant to the community's interests.
- The remaining documents are further processed to form a database.
- The database consists of two parts: a table of word frequencies by document, and a structure containing the identifying information about the documents, including an ID/serial number, its URL, its title, and a summary of the document.
- In the PCA network embodiment, we store the information about the document's content with the document identifier information. In this embodiment we store it as a separate table or matrix.
- The word-by-document matrix has as many rows as there are elements in the text vectors, i.e., K. Each row corresponds to one word or word stem.
- The first part of the initialization sequence loads the neural network, the word-to-index dictionary, and the word-by-document matrix. For each word we compute the number of pages that contain that word, Pw, and use this number to compute a log inverse document frequency, idf, according to the formula
- Words that occur in more pages receive lower weights than words that appear in fewer pages. Rare words are more selective than frequent words. This is a common technique in information retrieval systems.
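- One common form of log inverse document frequency is sketched below. The formula idf = log(P / Pw), with P the total number of pages, is an assumption borrowed from standard information retrieval practice; the exact formula used here is not reproduced in this text and may differ. The page counts are invented.

```python
import math


def log_idf(pages_containing, total_pages):
    """Assumed form idf = log(P / Pw); the source's exact formula may differ."""
    return {word: math.log(total_pages / p_w)
            for word, p_w in pages_containing.items() if p_w > 0}


# Hypothetical page counts Pw for three words in a collection of P = 1000 pages.
idf = log_idf({"golf": 1000, "pitch": 100, "stimpmeter": 1}, total_pages=1000)
```

A word found on every page gets a weight of zero under this form, while a word found on a single page gets the largest weight.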
- The user submits a query, which is stripped of stop words and extraneous nonalphabetic characters, and converted to lowercase. Words that are in the vocabulary cause the corresponding elements of the text vector to be set to 1. Words that are not in the vocabulary are submitted to the stemming algorithm. If a word form appeared in the text (e.g., "appears"), then it and its stem ("appear") would both have entries in the dictionary.
- The elements of the query vector corresponding to the query terms are set to 1.0.
- The query vector is then submitted to the neural network and a result vector is produced.
- This result vector is modified when more than one search term is included in the query.
- A separate temporary vector is maintained which is the product of the initial result vectors from each of the terms in the query.
- The idea here is to emphasize those terms that are common to the multiple search terms in order to emphasize the shared meaning of the terms.
- The effect of the multiplication is to "AND" the two vectors. For example, the two terms "mother" and "law" have in common the fact that they occur together in the context of family relations concerning mothers-in-law. (The word "in" is typically a stop word not included in processing.) It is more likely that a query containing "mother" and "law" would be about mothers-in-law than about legal issues and maternity unless these specific relations were contained in the text. This technique, as a result, allows the query terms to help disambiguate one another.
- The signum of the nonzero terms of the temporary vector times a constant, typically 3, is added to the sum of the initial result vectors. (The signum function returns -1 if the argument is less than 0, 0 if the argument is 0, and +1 if the argument is greater than 0.)
- The result vector is then weighted by the log-idf values computed earlier for those terms and thresholded as described earlier to eliminate weakly related terms.
- The elements of the result vector corresponding to the original search terms are then incremented to ensure that the original search terms play an important role in determining which documents are pertinent to the search.
- The entire resulting vector is then divided by its norm to normalize it.
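- The multi-term query steps above can be strung together as in the following sketch. Several details are assumptions made for the example: the cutoff is taken as `k` times the population standard deviation of the nonzero elements without adding the mean, the increment `boost` for the original terms is 1.0, the idf weights are uniform, and the per-term result vectors are invented toy values.

```python
import math
import statistics


def signum(v):
    return (v > 0) - (v < 0)


def threshold(vec, k):
    """Zero every element below k standard deviations of the nonzero elements."""
    nonzero = [v for v in vec if v != 0]
    if len(nonzero) < 2:
        return list(vec)
    cutoff = k * statistics.pstdev(nonzero)
    return [v if v >= cutoff else 0.0 for v in vec]


def query_profile(term_ids, result_vectors, idf, k=1.5, and_bonus=3.0, boost=1.0):
    size = len(idf)
    total = [0.0] * size      # sum of the per-term result vectors
    product = [1.0] * size    # temporary vector: element-wise product ("AND")
    for t in term_ids:
        for i, v in enumerate(result_vectors[t]):
            total[i] += v
            product[i] *= v
    if len(term_ids) > 1:     # only modified when there is more than one term
        total = [s + and_bonus * signum(p) for s, p in zip(total, product)]
    weighted = [s * w for s, w in zip(total, idf)]
    kept = threshold(weighted, k)
    for t in term_ids:        # make sure the original search terms stay prominent
        kept[t] += boost
    norm = math.sqrt(sum(v * v for v in kept))
    return [v / norm for v in kept] if norm else kept


# Indices: 0 "mother", 1 "law", 2 "family", 3 "court", 4 "child" (toy values).
result_vectors = {0: [0.0, 0.0, 4.0, 0.0, 3.0],
                  1: [0.0, 0.0, 2.0, 5.0, 0.0]}
profile = query_profile([0, 1], result_vectors, idf=[1.0] * 5)
```

In this toy case "family" is the only element associated with both query terms, so it receives the bonus and dominates the normalized profile, while the weakest association falls below the cutoff.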
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP00959600A EP1247207A1 (en) | 1999-12-08 | 2000-08-29 | Process and system for retrieval of documents using context-relevant semantic profiles |
CA002393480A CA2393480A1 (en) | 1999-12-08 | 2000-08-29 | Process and system for retrieval of documents using context-relevant semantic profiles |
AU70892/00A AU782200B2 (en) | 1999-12-08 | 2000-08-29 | Process and system for retrieval of documents using context-relevant semantic profiles |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/457,190 US6189002B1 (en) | 1998-12-14 | 1999-12-08 | Process and system for retrieval of documents using context-relevant semantic profiles |
US09/457,190 | 1999-12-08 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2001042984A1 (en) | 2001-06-14 |
Family
ID=23815788
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2000/023784 WO2001042984A1 (en) | 1999-12-08 | 2000-08-29 | Process and system for retrieval of documents using context-relevant semantic profiles |
Country Status (5)
Country | Link |
---|---|
US (1) | US6189002B1 (en) |
EP (1) | EP1247207A1 (en) |
AU (1) | AU782200B2 (en) |
CA (1) | CA2393480A1 (en) |
WO (1) | WO2001042984A1 (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2377046A (en) * | 2001-06-29 | 2002-12-31 | Ibm | Metadata generation |
EP1288792A1 (en) * | 2001-08-27 | 2003-03-05 | SER Systems AG | A method for automatically indexing documents |
EP1365331A2 (en) * | 2002-05-24 | 2003-11-26 | Océ-Technologies B.V. | Determination of a semantic snapshot |
GR1004748B (en) * | 2003-11-14 | 2004-12-02 | Βουτσινασαβασιλειοσα | A system and method for data mining from relational databases using a hybrid neural-symbolic system |
WO2005020093A1 (en) * | 2003-08-21 | 2005-03-03 | Idilia Inc. | Internet searching using semantic disambiguation and expansion |
WO2005020094A1 (en) * | 2003-08-21 | 2005-03-03 | Idilia Inc. | System and method for associating documents with contextual advertisements |
US7295991B1 (en) * | 2000-11-10 | 2007-11-13 | Erc Dataplus, Inc. | Employment sourcing system |
CN101079025B (en) * | 2006-06-19 | 2010-06-16 | 腾讯科技(深圳)有限公司 | File correlation computing system and method |
US7908430B2 (en) | 2000-08-18 | 2011-03-15 | Bdgb Enterprise Software S.A.R.L. | Associative memory |
US8276067B2 (en) | 1999-04-28 | 2012-09-25 | Bdgb Enterprise Software S.A.R.L. | Classification method and apparatus |
US8321357B2 (en) | 2009-09-30 | 2012-11-27 | Lapir Gennady | Method and system for extraction |
US9152883B2 (en) | 2009-11-02 | 2015-10-06 | Harry Urbschat | System and method for increasing the accuracy of optical character recognition (OCR) |
US9159584B2 (en) | 2000-08-18 | 2015-10-13 | Gannady Lapir | Methods and systems of retrieving documents |
US9158833B2 (en) | 2009-11-02 | 2015-10-13 | Harry Urbschat | System and method for obtaining document information |
US9213756B2 (en) | 2009-11-02 | 2015-12-15 | Harry Urbschat | System and method of using dynamic variance networks |
Families Citing this family (240)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6490579B1 (en) * | 1998-07-16 | 2002-12-03 | Perot Systems Corporation | Search engine system and method utilizing context of heterogeneous information resources |
JP3903610B2 (en) * | 1998-09-28 | 2007-04-11 | 富士ゼロックス株式会社 | Search device, search method, and computer-readable recording medium storing search program |
US6829610B1 (en) * | 1999-03-11 | 2004-12-07 | Microsoft Corporation | Scalable storage system supporting multi-level query resolution |
US6701305B1 (en) | 1999-06-09 | 2004-03-02 | The Boeing Company | Methods, apparatus and computer program products for information retrieval and document classification utilizing a multidimensional subspace |
US6611825B1 (en) * | 1999-06-09 | 2003-08-26 | The Boeing Company | Method and system for text mining using multidimensional subspaces |
US6578025B1 (en) | 1999-06-11 | 2003-06-10 | Abuzz Technologies, Inc. | Method and apparatus for distributing information to users |
US20040236721A1 (en) * | 2003-05-20 | 2004-11-25 | Jordan Pollack | Method and apparatus for distributing information to users |
US6546390B1 (en) | 1999-06-11 | 2003-04-08 | Abuzz Technologies, Inc. | Method and apparatus for evaluating relevancy of messages to users |
US6539385B1 (en) | 1999-06-11 | 2003-03-25 | Abuzz Technologies, Inc. | Dual-use email system |
US6571238B1 (en) * | 1999-06-11 | 2003-05-27 | Abuzz Technologies, Inc. | System for regulating flow of information to user by using time dependent function to adjust relevancy threshold |
US7072888B1 (en) * | 1999-06-16 | 2006-07-04 | Triogo, Inc. | Process for improving search engine efficiency using feedback |
US6477524B1 (en) * | 1999-08-18 | 2002-11-05 | Sharp Laboratories Of America, Incorporated | Method for statistical text analysis |
US6816857B1 (en) | 1999-11-01 | 2004-11-09 | Applied Semantics, Inc. | Meaning-based advertising and document relevance determination |
US6389418B1 (en) * | 1999-10-01 | 2002-05-14 | Sandia Corporation | Patent data mining method and apparatus |
US6393417B1 (en) * | 1999-10-15 | 2002-05-21 | De Le Fevre Patrick Y. | Method for providing a rapid internet search |
US6457029B1 (en) * | 1999-12-22 | 2002-09-24 | International Business Machines Corporation | Computer method and system for same document lookup with different keywords from a single view |
WO2001048582A2 (en) * | 1999-12-24 | 2001-07-05 | Ravenpack Ag | Method and device for presenting data to a user |
US6751621B1 (en) * | 2000-01-27 | 2004-06-15 | Manning & Napier Information Services, Llc. | Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors |
US6654739B1 (en) * | 2000-01-31 | 2003-11-25 | International Business Machines Corporation | Lightweight document clustering |
JP4070382B2 (en) * | 2000-02-08 | 2008-04-02 | 富士通株式会社 | Information retrieval apparatus and computer-readable recording medium on which information retrieval program is recorded |
US6311194B1 (en) * | 2000-03-15 | 2001-10-30 | Taalee, Inc. | System and method for creating a semantic web and its applications in browsing, searching, profiling, personalization and advertising |
EP1428142A2 (en) | 2000-03-22 | 2004-06-16 | Sidestep, Inc. | Method and apparatus for dynamic information connection engine |
US6564210B1 (en) * | 2000-03-27 | 2003-05-13 | Virtual Self Ltd. | System and method for searching databases employing user profiles |
US7567958B1 (en) * | 2000-04-04 | 2009-07-28 | Aol, Llc | Filtering system for providing personalized information in the absence of negative data |
US7356604B1 (en) * | 2000-04-18 | 2008-04-08 | Claritech Corporation | Method and apparatus for comparing scores in a vector space retrieval process |
US7912868B2 (en) * | 2000-05-02 | 2011-03-22 | Textwise Llc | Advertisement placement method and system using semantic analysis |
US6728695B1 (en) * | 2000-05-26 | 2004-04-27 | Burning Glass Technologies, Llc | Method and apparatus for making predictions about entities represented in documents |
WO2001098919A1 (en) * | 2000-06-21 | 2001-12-27 | Striking Eagle Investments, Ltd. | Locating information in a network based on user's evaluation |
US6895406B2 (en) * | 2000-08-25 | 2005-05-17 | Seaseer R&D, Llc | Dynamic personalization method of creating personalized user profiles for searching a database of information |
US6751614B1 (en) | 2000-11-09 | 2004-06-15 | Satyam Computer Services Limited Of Mayfair Centre | System and method for topic-based document analysis for information filtering |
US20030028564A1 (en) * | 2000-12-19 | 2003-02-06 | Lingomotors, Inc. | Natural language method and system for matching and ranking documents in terms of semantic relatedness |
US7440943B2 (en) * | 2000-12-22 | 2008-10-21 | Xerox Corporation | Recommender system and method |
EP1217539A1 (en) * | 2000-12-23 | 2002-06-26 | ALSTOM (Switzerland) Ltd | Method for nonlinear preparation and identification of information |
US7254773B2 (en) * | 2000-12-29 | 2007-08-07 | International Business Machines Corporation | Automated spell analysis |
US6751628B2 (en) * | 2001-01-11 | 2004-06-15 | Dolphin Search | Process and system for sparse vector and matrix representation of document indexing and retrieval |
US8219620B2 (en) * | 2001-02-20 | 2012-07-10 | Mcafee, Inc. | Unwanted e-mail filtering system including voting feedback |
US7426505B2 (en) * | 2001-03-07 | 2008-09-16 | International Business Machines Corporation | Method for identifying word patterns in text |
US7231381B2 (en) * | 2001-03-13 | 2007-06-12 | Microsoft Corporation | Media content search engine incorporating text content and user log mining |
US6820081B1 (en) | 2001-03-19 | 2004-11-16 | Attenex Corporation | System and method for evaluating a structured message store for message redundancy |
US6748398B2 (en) * | 2001-03-30 | 2004-06-08 | Microsoft Corporation | Relevance maximizing, iteration minimizing, relevance-feedback, content-based image retrieval (CBIR) |
USRE46973E1 (en) | 2001-05-07 | 2018-07-31 | Ureveal, Inc. | Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information |
US7536413B1 (en) | 2001-05-07 | 2009-05-19 | Ixreveal, Inc. | Concept-based categorization of unstructured objects |
US7627588B1 (en) | 2001-05-07 | 2009-12-01 | Ixreveal, Inc. | System and method for concept based analysis of unstructured data |
US7194483B1 (en) | 2001-05-07 | 2007-03-20 | Intelligenxia, Inc. | Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information |
US6769016B2 (en) * | 2001-07-26 | 2004-07-27 | Networks Associates Technology, Inc. | Intelligent SPAM detection system using an updateable neural analysis engine |
US7016939B1 (en) * | 2001-07-26 | 2006-03-21 | Mcafee, Inc. | Intelligent SPAM detection system using statistical analysis |
US6883008B2 (en) * | 2001-07-31 | 2005-04-19 | Ase Edge, Inc. | System for utilizing audible, visual and textual data with alternative combinable multimedia forms of presenting information for real-time interactive use by multiple users in different remote environments |
US6609124B2 (en) | 2001-08-13 | 2003-08-19 | International Business Machines Corporation | Hub for strategic intelligence |
US20030061232A1 (en) * | 2001-09-21 | 2003-03-27 | Dun & Bradstreet Inc. | Method and system for processing business data |
US7283992B2 (en) * | 2001-11-30 | 2007-10-16 | Microsoft Corporation | Media agent to suggest contextually related media content |
US20050154708A1 (en) * | 2002-01-29 | 2005-07-14 | Yao Sun | Information exchange between heterogeneous databases through automated identification of concept equivalence |
US20080027769A1 (en) | 2002-09-09 | 2008-01-31 | Jeff Scott Eder | Knowledge based performance management system |
US8589413B1 (en) | 2002-03-01 | 2013-11-19 | Ixreveal, Inc. | Concept-based method and system for dynamically analyzing results from search engines |
JP4082059B2 (en) * | 2002-03-29 | 2008-04-30 | ソニー株式会社 | Information processing apparatus and method, recording medium, and program |
US8015143B2 (en) * | 2002-05-22 | 2011-09-06 | Estes Timothy W | Knowledge discovery agent system and method |
US7054859B2 (en) * | 2002-06-13 | 2006-05-30 | Hewlett-Packard Development Company, L.P. | Apparatus and method for responding to search requests for stored documents |
US6892198B2 (en) * | 2002-06-14 | 2005-05-10 | Entopia, Inc. | System and method for personalized information retrieval based on user expertise |
JP2004094728A (en) * | 2002-09-02 | 2004-03-25 | Hitachi Ltd | Information distribution method and its system and program |
US7636709B1 (en) | 2002-10-03 | 2009-12-22 | Teradata Us, Inc. | Methods and systems for locating related reports |
US7200589B1 (en) * | 2002-10-03 | 2007-04-03 | Hewlett-Packard Development Company, L.P. | Format-independent advertising of data center resource capabilities |
US7792832B2 (en) * | 2002-10-17 | 2010-09-07 | Poltorak Alexander I | Apparatus and method for identifying potential patent infringement |
US7155427B1 (en) * | 2002-10-30 | 2006-12-26 | Oracle International Corporation | Configurable search tool for finding and scoring non-exact matches in a relational database |
US8375008B1 (en) | 2003-01-17 | 2013-02-12 | Robert Gomes | Method and system for enterprise-wide retention of digital or electronic data |
US8065277B1 (en) | 2003-01-17 | 2011-11-22 | Daniel John Gardner | System and method for a data extraction and backup database |
US8943024B1 (en) | 2003-01-17 | 2015-01-27 | Daniel John Gardner | System and method for data de-duplication |
US8630984B1 (en) | 2003-01-17 | 2014-01-14 | Renew Data Corp. | System and method for data extraction from email files |
US7421418B2 (en) * | 2003-02-19 | 2008-09-02 | Nahava Inc. | Method and apparatus for fundamental operations on token sequences: computing similarity, extracting term values, and searching efficiently |
US8055669B1 (en) * | 2003-03-03 | 2011-11-08 | Google Inc. | Search queries improved based on query semantic information |
US20040237037A1 (en) * | 2003-03-21 | 2004-11-25 | Xerox Corporation | Determination of member pages for a hyperlinked document with recursive page-level link analysis |
US20050188300A1 (en) * | 2003-03-21 | 2005-08-25 | Xerox Corporation | Determination of member pages for a hyperlinked document with link and document analysis |
US7401072B2 (en) | 2003-06-10 | 2008-07-15 | Google Inc. | Named URL entry |
GB2403636A (en) * | 2003-07-02 | 2005-01-05 | Sony Uk Ltd | Information retrieval using an array of nodes |
WO2005017682A2 (en) * | 2003-08-05 | 2005-02-24 | Cnet Networks, Inc. | Product placement engine and method |
US8600963B2 (en) * | 2003-08-14 | 2013-12-03 | Google Inc. | System and method for presenting multiple sets of search results for a single query |
US8548995B1 (en) * | 2003-09-10 | 2013-10-01 | Google Inc. | Ranking of documents based on analysis of related documents |
US7130819B2 (en) * | 2003-09-30 | 2006-10-31 | Yahoo! Inc. | Method and computer readable medium for search scoring |
US7240049B2 (en) * | 2003-11-12 | 2007-07-03 | Yahoo! Inc. | Systems and methods for search query processing using trend analysis |
US7844589B2 (en) * | 2003-11-18 | 2010-11-30 | Yahoo! Inc. | Method and apparatus for performing a search |
US20050149510A1 (en) * | 2004-01-07 | 2005-07-07 | Uri Shafrir | Concept mining and concept discovery-semantic search tool for large digital databases |
US7716158B2 (en) * | 2004-01-09 | 2010-05-11 | Microsoft Corporation | System and method for context sensitive searching |
US7822992B2 (en) * | 2004-04-07 | 2010-10-26 | Microsoft Corporation | In-place content substitution via code-invoking link |
US7890744B2 (en) * | 2004-04-07 | 2011-02-15 | Microsoft Corporation | Activating content based on state |
US7689585B2 (en) * | 2004-04-15 | 2010-03-30 | Microsoft Corporation | Reinforced clustering of multi-type data objects for search term suggestion |
US20050234973A1 (en) * | 2004-04-15 | 2005-10-20 | Microsoft Corporation | Mining service requests for product support |
WO2006014343A2 (en) * | 2004-07-02 | 2006-02-09 | Text-Tech, Llc | Automated evaluation systems and methods |
CA2574554A1 (en) * | 2004-07-21 | 2006-01-26 | Equivio Ltd. | A method for determining near duplicate data objects |
US20070266406A1 (en) * | 2004-11-09 | 2007-11-15 | Murali Aravamudan | Method and system for performing actions using a non-intrusive television with reduced text input |
US20060101504A1 (en) * | 2004-11-09 | 2006-05-11 | Veveo.Tv, Inc. | Method and system for performing searches for television content and channels using a non-intrusive television interface and with reduced text input |
WO2006053011A2 (en) * | 2004-11-09 | 2006-05-18 | Veveo, Inc. | Method and system for secure sharing, gifting, and purchasing of content on television and mobile devices |
US7895218B2 (en) * | 2004-11-09 | 2011-02-22 | Veveo, Inc. | Method and system for performing searches for television content using reduced text input |
US7620628B2 (en) * | 2004-12-06 | 2009-11-17 | Yahoo! Inc. | Search processing with automatic categorization of queries |
US8069151B1 (en) | 2004-12-08 | 2011-11-29 | Chris Crafford | System and method for detecting incongruous or incorrect media in a data recovery process |
US7444325B2 (en) * | 2005-01-14 | 2008-10-28 | Im2, Inc. | Method and system for information extraction |
US20060235843A1 (en) * | 2005-01-31 | 2006-10-19 | Textdigger, Inc. | Method and system for semantic search and retrieval of electronic documents |
US8527468B1 (en) | 2005-02-08 | 2013-09-03 | Renew Data Corp. | System and method for management of retention periods for content in a computing system |
US20060200461A1 (en) * | 2005-03-01 | 2006-09-07 | Lucas Marshall D | Process for identifying weighted contextural relationships between unrelated documents |
EP1875336A2 (en) | 2005-04-11 | 2008-01-09 | Textdigger, Inc. | System and method for searching for a query |
US20060253423A1 (en) * | 2005-05-07 | 2006-11-09 | Mclane Mark | Information retrieval system and method |
US7496549B2 (en) * | 2005-05-26 | 2009-02-24 | Yahoo! Inc. | Matching pursuit approach to sparse Gaussian process regression |
US7469251B2 (en) * | 2005-06-07 | 2008-12-23 | Microsoft Corporation | Extraction of information from documents |
US8122034B2 (en) | 2005-06-30 | 2012-02-21 | Veveo, Inc. | Method and system for incremental search with reduced text entry where the relevance of results is a dynamically computed function of user input search string character count |
US7725485B1 (en) * | 2005-08-01 | 2010-05-25 | Google Inc. | Generating query suggestions using contextual information |
US7788266B2 (en) * | 2005-08-26 | 2010-08-31 | Veveo, Inc. | Method and system for processing ambiguous, multi-term search queries |
US7779011B2 (en) * | 2005-08-26 | 2010-08-17 | Veveo, Inc. | Method and system for dynamically processing ambiguous, reduced text search queries and highlighting results thereof |
US7620607B1 (en) * | 2005-09-26 | 2009-11-17 | Quintura Inc. | System and method for using a bidirectional neural network to identify sentences for use as document annotations |
CN101351795B (en) * | 2005-10-11 | 2012-07-18 | Ix锐示公司 | System, method and device for concept based searching and analysis |
US7644054B2 (en) * | 2005-11-23 | 2010-01-05 | Veveo, Inc. | System and method for finding desired results by incremental search using an ambiguous keypad with the input containing orthographic and typographic errors |
US8429184B2 (en) | 2005-12-05 | 2013-04-23 | Collarity Inc. | Generation of refinement terms for search queries |
US8903810B2 (en) | 2005-12-05 | 2014-12-02 | Collarity, Inc. | Techniques for ranking search results |
US7756855B2 (en) * | 2006-10-11 | 2010-07-13 | Collarity, Inc. | Search phrase refinement by search term replacement |
US8694530B2 (en) | 2006-01-03 | 2014-04-08 | Textdigger, Inc. | Search system with query refinement and search method |
US7676485B2 (en) * | 2006-01-20 | 2010-03-09 | Ixreveal, Inc. | Method and computer program product for converting ontologies into concept semantic networks |
WO2007086059A2 (en) * | 2006-01-25 | 2007-08-02 | Equivio Ltd. | Determining near duplicate 'noisy' data objects |
US20070260703A1 (en) * | 2006-01-27 | 2007-11-08 | Sankar Ardhanari | Methods and systems for transmission of subsequences of incremental query actions and selection of content items based on later received subsequences |
US8601160B1 (en) | 2006-02-09 | 2013-12-03 | Mcafee, Inc. | System, method and computer program product for gathering information relating to electronic content utilizing a DNS server |
JP4923604B2 (en) * | 2006-02-13 | 2012-04-25 | ソニー株式会社 | Information processing apparatus and method, and program |
US8380726B2 (en) | 2006-03-06 | 2013-02-19 | Veveo, Inc. | Methods and systems for selecting and presenting content based on a comparison of preference signatures from multiple users |
US7698332B2 (en) * | 2006-03-13 | 2010-04-13 | Microsoft Corporation | Projecting queries and images into a similarity space |
US8073860B2 (en) | 2006-03-30 | 2011-12-06 | Veveo, Inc. | Method and system for incrementally selecting and providing relevant search engines in response to a user query |
WO2007114932A2 (en) | 2006-04-04 | 2007-10-11 | Textdigger, Inc. | Search system and method with text function tagging |
EP3822819A1 (en) | 2006-04-20 | 2021-05-19 | Veveo, Inc. | User interface methods and systems for selecting and presenting content based on user navigation and selection actions associated with the content |
US20080189273A1 (en) * | 2006-06-07 | 2008-08-07 | Digital Mandate, Llc | System and method for utilizing advanced search and highlighting techniques for isolating subsets of relevant content data |
US8150827B2 (en) * | 2006-06-07 | 2012-04-03 | Renew Data Corp. | Methods for enhancing efficiency and cost effectiveness of first pass review of documents |
US7779007B2 (en) * | 2006-06-08 | 2010-08-17 | Ase Edge, Inc. | Identifying content of interest |
US8140267B2 (en) * | 2006-06-30 | 2012-03-20 | International Business Machines Corporation | System and method for identifying similar molecules |
EP1876540A1 (en) * | 2006-07-06 | 2008-01-09 | British Telecommunications Public Limited Company | Organising and storing documents |
US8843475B2 (en) * | 2006-07-12 | 2014-09-23 | Philip Marshall | System and method for collaborative knowledge structure creation and management |
US8401841B2 (en) * | 2006-08-31 | 2013-03-19 | Orcatec Llc | Retrieval of documents using language models |
CA2989780C (en) * | 2006-09-14 | 2022-08-09 | Veveo, Inc. | Methods and systems for dynamically rearranging search results into hierarchically organized concept clusters |
US7577643B2 (en) * | 2006-09-29 | 2009-08-18 | Microsoft Corporation | Key phrase extraction from query logs |
US7925986B2 (en) | 2006-10-06 | 2011-04-12 | Veveo, Inc. | Methods and systems for a linear character selection display interface for ambiguous text input |
US8442972B2 (en) * | 2006-10-11 | 2013-05-14 | Collarity, Inc. | Negative associations for search results ranking and refinement |
WO2008063987A2 (en) * | 2006-11-13 | 2008-05-29 | Veveo, Inc. | Method of and system for selecting and presenting content based on user identification |
US20080183691A1 (en) * | 2007-01-30 | 2008-07-31 | International Business Machines Corporation | Method for a networked knowledge based document retrieval and ranking utilizing extracted document metadata and content |
US7610185B1 (en) * | 2007-02-26 | 2009-10-27 | Quintura, Inc. | GUI for subject matter navigation using maps and search terms |
US8396331B2 (en) * | 2007-02-26 | 2013-03-12 | Microsoft Corporation | Generating a multi-use vocabulary based on image data |
US7529743B1 (en) * | 2007-02-26 | 2009-05-05 | Quintura, Inc. | GUI for subject matter navigation using maps and search terms |
US8180633B2 (en) * | 2007-03-08 | 2012-05-15 | Nec Laboratories America, Inc. | Fast semantic extraction using a neural network architecture |
EP1973045A1 (en) * | 2007-03-20 | 2008-09-24 | British Telecommunications Public Limited Company | Organising and storing documents |
US7873640B2 (en) * | 2007-03-27 | 2011-01-18 | Adobe Systems Incorporated | Semantic analysis documents to rank terms |
US20080313574A1 (en) * | 2007-05-25 | 2008-12-18 | Veveo, Inc. | System and method for search with reduced physical interaction requirements |
US8549424B2 (en) | 2007-05-25 | 2013-10-01 | Veveo, Inc. | System and method for text disambiguation and context designation in incremental search |
US8296294B2 (en) * | 2007-05-25 | 2012-10-23 | Veveo, Inc. | Method and system for unified searching across and within multiple documents |
US20090012984A1 (en) * | 2007-07-02 | 2009-01-08 | Equivio Ltd. | Method for Organizing Large Numbers of Documents |
US20090049018A1 (en) * | 2007-08-14 | 2009-02-19 | John Nicholas Gross | Temporal Document Sorter and Method Using Semantic Decoding and Prediction |
US8280721B2 (en) | 2007-08-31 | 2012-10-02 | Microsoft Corporation | Efficiently representing word sense probabilities |
US8639708B2 (en) * | 2007-08-31 | 2014-01-28 | Microsoft Corporation | Fact-based indexing for natural language search |
US8229970B2 (en) * | 2007-08-31 | 2012-07-24 | Microsoft Corporation | Efficient storage and retrieval of posting lists |
US8463593B2 (en) * | 2007-08-31 | 2013-06-11 | Microsoft Corporation | Natural language hypernym weighting for word sense disambiguation |
US8229730B2 (en) * | 2007-08-31 | 2012-07-24 | Microsoft Corporation | Indexing role hierarchies for words in a search index |
US8316036B2 (en) | 2007-08-31 | 2012-11-20 | Microsoft Corporation | Checkpointing iterators during search |
US8868562B2 (en) * | 2007-08-31 | 2014-10-21 | Microsoft Corporation | Identification of semantic relationships within reported speech |
US8346756B2 (en) * | 2007-08-31 | 2013-01-01 | Microsoft Corporation | Calculating valence of expressions within documents for searching a document index |
US20090070322A1 (en) * | 2007-08-31 | 2009-03-12 | Powerset, Inc. | Browsing knowledge on the basis of semantic relations |
US8712758B2 (en) | 2007-08-31 | 2014-04-29 | Microsoft Corporation | Coreference resolution in an ambiguity-sensitive natural language processing system |
US9875298B2 (en) * | 2007-10-12 | 2018-01-23 | Lexxe Pty Ltd | Automatic generation of a search query |
US7984035B2 (en) * | 2007-12-28 | 2011-07-19 | Microsoft Corporation | Context-based document search |
US8615490B1 (en) | 2008-01-31 | 2013-12-24 | Renew Data Corp. | Method and system for restoring information from backup storage media |
US8392436B2 (en) * | 2008-02-07 | 2013-03-05 | Nec Laboratories America, Inc. | Semantic search via role labeling |
US8583639B2 (en) * | 2008-02-19 | 2013-11-12 | International Business Machines Corporation | Method and system using machine learning to automatically discover home pages on the internet |
US20090228296A1 (en) * | 2008-03-04 | 2009-09-10 | Collarity, Inc. | Optimization of social distribution networks |
US20100076984A1 (en) * | 2008-03-27 | 2010-03-25 | Alkis Papadopoullos | System and method for query expansion using tooltips |
US8061142B2 (en) * | 2008-04-11 | 2011-11-22 | General Electric Company | Mixer for a combustor |
US8032469B2 (en) * | 2008-05-06 | 2011-10-04 | Microsoft Corporation | Recommending similar content identified with a neural network |
US8082248B2 (en) * | 2008-05-29 | 2011-12-20 | Rania Abouyounes | Method and system for document classification based on document structure and written style |
US8438178B2 (en) | 2008-06-26 | 2013-05-07 | Collarity Inc. | Interactions among online digital identities |
US8185509B2 (en) * | 2008-10-15 | 2012-05-22 | Sap France | Association of semantic objects with linguistic entity categories |
US8290961B2 (en) * | 2009-01-13 | 2012-10-16 | Sandia Corporation | Technique for information retrieval using enhanced latent semantic analysis generating rank approximation matrix by factorizing the weighted morpheme-by-document matrix |
US8150974B2 (en) * | 2009-03-17 | 2012-04-03 | Kindsight, Inc. | Character differentiation system generating session fingerprint using events associated with subscriber ID and session ID |
US9245243B2 (en) * | 2009-04-14 | 2016-01-26 | Ureveal, Inc. | Concept-based analysis of structured and unstructured data using concept inheritance |
US8103672B2 (en) * | 2009-05-20 | 2012-01-24 | Detectent, Inc. | Apparatus, system, and method for determining a partial class membership of a data record in a class |
US8341108B2 (en) * | 2009-06-09 | 2012-12-25 | Microsoft Corporation | Kind classification through emergent semantic analysis |
US8352469B2 (en) * | 2009-07-02 | 2013-01-08 | Battelle Memorial Institute | Automatic generation of stop word lists for information retrieval and analysis |
US9235563B2 (en) | 2009-07-02 | 2016-01-12 | Battelle Memorial Institute | Systems and processes for identifying features and determining feature associations in groups of documents |
US8204988B2 (en) * | 2009-09-02 | 2012-06-19 | International Business Machines Corporation | Content-based and time-evolving social network analysis |
US9166714B2 (en) | 2009-09-11 | 2015-10-20 | Veveo, Inc. | Method of and system for presenting enriched video viewing analytics |
US9305089B2 (en) * | 2009-12-08 | 2016-04-05 | At&T Intellectual Property I, L.P. | Search engine device and methods thereof |
US20110145269A1 (en) * | 2009-12-09 | 2011-06-16 | Renew Data Corp. | System and method for quickly determining a subset of irrelevant data from large data content |
US8738668B2 (en) | 2009-12-16 | 2014-05-27 | Renew Data Corp. | System and method for creating a de-duplicated data set |
US20110154399A1 (en) * | 2009-12-22 | 2011-06-23 | Verizon Patent And Licensing, Inc. | Content recommendation engine |
US8875038B2 (en) | 2010-01-19 | 2014-10-28 | Collarity, Inc. | Anchoring for content synchronization |
US20110191330A1 (en) * | 2010-02-04 | 2011-08-04 | Veveo, Inc. | Method of and System for Enhanced Content Discovery Based on Network and Device Access Behavior |
US8706728B2 (en) * | 2010-02-19 | 2014-04-22 | Go Daddy Operating Company, LLC | Calculating reliability scores from word splitting |
US20110246465A1 (en) * | 2010-03-31 | 2011-10-06 | Salesforce.Com, Inc. | Methods and sysems for performing real-time recommendation processing |
US8924419B2 (en) * | 2010-03-31 | 2014-12-30 | Salesforce.Com, Inc. | Method and system for performing an authority analysis |
CN103038764A (en) * | 2010-04-14 | 2013-04-10 | 惠普发展公司,有限责任合伙企业 | Method for keyword extraction |
WO2011137386A1 (en) * | 2010-04-30 | 2011-11-03 | Orbis Technologies, Inc. | Systems and methods for semantic search, content correlation and visualization |
FR2961645A1 (en) * | 2010-06-17 | 2011-12-23 | Kindsight Inc | User characteristics identifying device for e.g. digital TV, has memory providing instructions to be executed by processor, where instructions store record of set of user characteristics |
US8380719B2 (en) * | 2010-06-18 | 2013-02-19 | Microsoft Corporation | Semantic content searching |
US8548989B2 (en) * | 2010-07-30 | 2013-10-01 | International Business Machines Corporation | Querying documents using search terms |
US8577915B2 (en) | 2010-09-10 | 2013-11-05 | Veveo, Inc. | Method of and system for conducting personalized federated search and presentation of results therefrom |
CA2812338C (en) | 2010-09-24 | 2019-08-13 | International Business Machines Corporation | Lexical answer type confidence estimation and application |
US8316030B2 (en) | 2010-11-05 | 2012-11-20 | Nextgen Datacom, Inc. | Method and system for document classification or search using discrete words |
DE102011009378A1 (en) * | 2011-01-25 | 2012-07-26 | SUPERWISE Technologies AG | Automatic extraction of information about semantic relationships from a pool of documents with a neural system |
US8732151B2 (en) | 2011-04-01 | 2014-05-20 | Microsoft Corporation | Enhanced query rewriting through statistical machine translation |
US9223769B2 (en) | 2011-09-21 | 2015-12-29 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US9721039B2 (en) * | 2011-12-16 | 2017-08-01 | Palo Alto Research Center Incorporated | Generating a relationship visualization for nonhomogeneous entities |
US9015080B2 (en) | 2012-03-16 | 2015-04-21 | Orbis Technologies, Inc. | Systems and methods for semantic inference and reasoning |
EP2709306B1 (en) * | 2012-09-14 | 2019-03-06 | Alcatel Lucent | Method and system to perform secure boolean search over encrypted documents |
US9189531B2 (en) | 2012-11-30 | 2015-11-17 | Orbis Technologies, Inc. | Ontology harmonization and mediation systems and methods |
US20140279748A1 (en) * | 2013-03-15 | 2014-09-18 | Georges Harik | Method and program structure for machine learning |
US9122681B2 (en) | 2013-03-15 | 2015-09-01 | Gordon Villy Cormack | Systems and methods for classifying electronic information using advanced active learning techniques |
US20150012448A1 (en) * | 2013-07-03 | 2015-01-08 | Icebox, Inc. | Collaborative matter management and analysis |
RU2540832C1 (en) * | 2013-09-24 | 2015-02-10 | Российская Федерация, от имени которой выступает Министерство обороны Российской Федерации | System for searching for different information in local area network |
RU2556425C1 (en) * | 2014-02-14 | 2015-07-10 | Закрытое акционерное общество "Эвентос" (ЗАО "Эвентос") | Method for automatic iterative clusterisation of electronic documents according to semantic similarity, method for search in plurality of documents clustered according to semantic similarity and computer-readable media |
CN108984650B (en) * | 2014-03-26 | 2020-10-16 | 上海智臻智能网络科技股份有限公司 | Computer-readable recording medium and computer device |
CN103995805B (en) * | 2014-06-05 | 2016-08-17 | 神华集团有限责任公司 | The word processing method of the big data of text-oriented |
US10229117B2 (en) | 2015-06-19 | 2019-03-12 | Gordon V. Cormack | Systems and methods for conducting a highly autonomous technology-assisted review classification |
US10839149B2 (en) | 2016-02-01 | 2020-11-17 | Microsoft Technology Licensing, Llc. | Generating templates from user's past documents |
US9922022B2 (en) * | 2016-02-01 | 2018-03-20 | Microsoft Technology Licensing, Llc. | Automatic template generation based on previous documents |
US9858263B2 (en) * | 2016-05-05 | 2018-01-02 | Conduent Business Services, Llc | Semantic parsing using deep neural networks for predicting canonical forms |
US9830315B1 (en) * | 2016-07-13 | 2017-11-28 | Xerox Corporation | Sequence-based structured prediction for semantic parsing |
CN106708929B (en) * | 2016-11-18 | 2020-06-26 | 广州视源电子科技股份有限公司 | Video program searching method and device |
US20180341686A1 (en) * | 2017-05-26 | 2018-11-29 | Nanfang Hu | System and method for data search based on top-to-bottom similarity analysis |
US10255273B2 (en) * | 2017-06-15 | 2019-04-09 | Microsoft Technology Licensing, Llc | Method and system for ranking and summarizing natural language passages |
US10936952B2 (en) | 2017-09-01 | 2021-03-02 | Facebook, Inc. | Detecting content items in violation of an online system policy using templates based on semantic vectors representing content items |
US11195099B2 (en) | 2017-09-01 | 2021-12-07 | Facebook, Inc. | Detecting content items in violation of an online system policy using semantic vectors |
US10599774B1 (en) * | 2018-02-26 | 2020-03-24 | Facebook, Inc. | Evaluating content items based upon semantic similarity of text |
US11776036B2 (en) * | 2018-04-19 | 2023-10-03 | Adobe Inc. | Generating and utilizing classification and query-specific models to generate digital responses to queries from client device |
RU2731658C2 (en) | 2018-06-21 | 2020-09-07 | Общество С Ограниченной Ответственностью "Яндекс" | Method and system of selection for ranking search results using machine learning algorithm |
US10452734B1 (en) | 2018-09-21 | 2019-10-22 | SSB Legal Technologies, LLC | Data visualization platform for use in a network environment |
US11200378B2 (en) | 2018-10-11 | 2021-12-14 | International Business Machines Corporation | Methods and systems for processing language with standardization of source data |
RU2733481C2 (en) | 2018-12-13 | 2020-10-01 | Общество С Ограниченной Ответственностью "Яндекс" | Method and system for generating feature for ranging document |
RU2744028C2 (en) * | 2018-12-26 | 2021-03-02 | Общество С Ограниченной Ответственностью "Яндекс" | Method and system for storing multiple documents |
RU2744029C1 (en) | 2018-12-29 | 2021-03-02 | Общество С Ограниченной Ответственностью "Яндекс" | System and method of forming training set for machine learning algorithm |
US11397776B2 (en) | 2019-01-31 | 2022-07-26 | At&T Intellectual Property I, L.P. | Systems and methods for automated information retrieval |
US11216459B2 (en) * | 2019-03-25 | 2022-01-04 | Microsoft Technology Licensing, Llc | Multi-layer semantic search |
US11120221B2 (en) | 2019-03-26 | 2021-09-14 | Tata Consultancy Services Limited | Method and system to resolve ambiguities in regulations |
US11227250B2 (en) | 2019-06-26 | 2022-01-18 | International Business Machines Corporation | Rating customer representatives based on past chat transcripts |
US11210677B2 (en) | 2019-06-26 | 2021-12-28 | International Business Machines Corporation | Measuring the effectiveness of individual customer representative responses in historical chat transcripts |
US11461788B2 (en) | 2019-06-26 | 2022-10-04 | International Business Machines Corporation | Matching a customer and customer representative dynamically based on a customer representative's past performance |
CN110570941B (en) * | 2019-07-17 | 2020-08-14 | 北京智能工场科技有限公司 | System and device for assessing psychological state based on text semantic vector model |
US11138212B2 (en) | 2019-07-23 | 2021-10-05 | International Business Machines Corporation | Natural language response recommendation clustering for rapid retrieval |
US11741098B2 (en) * | 2019-11-01 | 2023-08-29 | Applied Brain Research Inc. | Methods and systems for storing and querying database entries with neuromorphic computers |
US11468238B2 (en) * | 2019-11-06 | 2022-10-11 | ServiceNow Inc. | Data processing systems and methods |
US11481417B2 (en) | 2019-11-06 | 2022-10-25 | Servicenow, Inc. | Generation and utilization of vector indexes for data processing systems and methods |
US11455357B2 (en) | 2019-11-06 | 2022-09-27 | Servicenow, Inc. | Data processing systems and methods |
CN110991196B (en) * | 2019-12-18 | 2021-10-26 | 北京百度网讯科技有限公司 | Translation method and device for polysemous words, electronic equipment and medium |
JP7363929B2 (en) * | 2020-01-29 | 2023-10-18 | 日本電信電話株式会社 | Learning device, search device, learning method, search method and program |
WO2022123695A1 (en) * | 2020-12-09 | 2022-06-16 | 日本電信電話株式会社 | Learning device, search device, learning method, search method, and program |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5619709A (en) * | 1993-09-20 | 1997-04-08 | Hnc, Inc. | System and method of context vector generation and retrieval |
US6006221A (en) * | 1995-08-16 | 1999-12-21 | Syracuse University | Multilingual document retrieval system and method using semantic vector matching |
US6076088A (en) * | 1996-02-09 | 2000-06-13 | Paik; Woojin | Information extraction system and method using concept relation concept (CRC) triples |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5325298A (en) * | 1990-11-07 | 1994-06-28 | Hnc, Inc. | Methods for generating or revising context vectors for a plurality of word stems |
US5774845A (en) * | 1993-09-17 | 1998-06-30 | Nec Corporation | Information extraction processor |
1999
- 1999-12-08 US application US09/457,190, publication US6189002B1; status: Expired - Lifetime (not active)
2000
- 2000-08-29 WO application PCT/US2000/023784, publication WO2001042984A1; status: IP Right Grant (active)
- 2000-08-29 EP application EP00959600A, publication EP1247207A1; status: Withdrawn (not active)
- 2000-08-29 AU application AU70892/00A, publication AU782200B2; status: Ceased (not active)
- 2000-08-29 CA application CA002393480A, publication CA2393480A1; status: Abandoned (not active)
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8276067B2 (en) | 1999-04-28 | 2012-09-25 | Bdgb Enterprise Software S.A.R.L. | Classification method and apparatus |
US7908430B2 (en) | 2000-08-18 | 2011-03-15 | Bdgb Enterprise Software S.A.R.L. | Associative memory |
US9159584B2 (en) | 2000-08-18 | 2015-10-13 | Gannady Lapir | Methods and systems of retrieving documents |
US8209481B2 (en) | 2000-08-18 | 2012-06-26 | Bdgb Enterprise Software S.A.R.L | Associative memory |
US7295991B1 (en) * | 2000-11-10 | 2007-11-13 | Erc Dataplus, Inc. | Employment sourcing system |
GB2377046A (en) * | 2001-06-29 | 2002-12-31 | Ibm | Metadata generation |
EP1288792A1 (en) * | 2001-08-27 | 2003-03-05 | SER Systems AG | A method for automatically indexing documents |
US9141691B2 (en) | 2001-08-27 | 2015-09-22 | Alexander GOERKE | Method for automatically indexing documents |
AU2002331728B2 (en) * | 2001-08-27 | 2008-03-06 | Kofax International Switzerland Sàrl | A method for automatically indexing documents |
US8015198B2 (en) | 2001-08-27 | 2011-09-06 | Bdgb Enterprise Software S.A.R.L. | Method for automatically indexing documents |
EP1365331A2 (en) * | 2002-05-24 | 2003-11-26 | Océ-Technologies B.V. | Determination of a semantic snapshot |
EP1365331A3 (en) * | 2002-05-24 | 2003-12-17 | Océ-Technologies B.V. | Determination of a semantic snapshot |
WO2005020094A1 (en) * | 2003-08-21 | 2005-03-03 | Idilia Inc. | System and method for associating documents with contextual advertisements |
EP1665092A1 (en) * | 2003-08-21 | 2006-06-07 | Idilia Inc. | Internet searching using semantic disambiguation and expansion |
US7895221B2 (en) | 2003-08-21 | 2011-02-22 | Idilia Inc. | Internet searching using semantic disambiguation and expansion |
WO2005020093A1 (en) * | 2003-08-21 | 2005-03-03 | Idilia Inc. | Internet searching using semantic disambiguation and expansion |
US7509313B2 (en) | 2003-08-21 | 2009-03-24 | Idilia Inc. | System and method for processing a query |
US8024345B2 (en) | 2003-08-21 | 2011-09-20 | Idilia Inc. | System and method for associating queries and documents with contextual advertisements |
EP1665092A4 (en) * | 2003-08-21 | 2006-11-22 | Idilia Inc | Internet searching using semantic disambiguation and expansion |
US7774333B2 (en) | 2003-08-21 | 2010-08-10 | Idia Inc. | System and method for associating queries and documents with contextual advertisements |
GR1004748B (en) * | 2003-11-14 | 2004-12-02 | Βουτσινασαβασιλειοσα | A system and method for data mining from relational databases using a hybrid neural-symbolic system |
CN101079025B (en) * | 2006-06-19 | 2010-06-16 | 腾讯科技(深圳)有限公司 | File correlation computing system and method |
US8321357B2 (en) | 2009-09-30 | 2012-11-27 | Lapir Gennady | Method and system for extraction |
US9152883B2 (en) | 2009-11-02 | 2015-10-06 | Harry Urbschat | System and method for increasing the accuracy of optical character recognition (OCR) |
US9158833B2 (en) | 2009-11-02 | 2015-10-13 | Harry Urbschat | System and method for obtaining document information |
US9213756B2 (en) | 2009-11-02 | 2015-12-15 | Harry Urbschat | System and method of using dynamic variance networks |
Also Published As
Publication number | Publication date |
---|---|
CA2393480A1 (en) | 2001-06-14 |
US6189002B1 (en) | 2001-02-13 |
EP1247207A1 (en) | 2002-10-09 |
AU782200B2 (en) | 2005-07-07 |
AU7089200A (en) | 2001-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU782200B2 (en) | | Process and system for retrieval of documents using context-relevant semantic profiles |
Lee et al. | | Attribute extraction and scoring: A probabilistic approach |
Froud et al. | | Stemming and similarity measures for Arabic Documents Clustering |
Patil et al. | | A novel feature selection based on information gain using WordNet |
Shaalan et al. | | Query expansion based-on similarity of terms for improving Arabic information retrieval |
KR101429623B1 (en) | | Duplication news detection system and method for detecting duplication news |
Kılınç et al. | | An expansion and reranking approach for annotation-based image retrieval from web |
Najadat et al. | | Automatic keyphrase extractor from arabic documents |
Bhowmik | | Keyword extraction from abstracts and titles |
Freeman et al. | | Tree view self-organisation of web content |
Abdullah et al. | | The effectiveness of classification on information retrieval system (case study) |
Tohalino et al. | | Using virtual edges to extract keywords from texts modeled as complex networks |
Triwijoyo et al. | | Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms |
Plegas et al. | | Reducing information redundancy in search results |
Ramachandran et al. | | Document Clustering Using Keyword Extraction |
Lam et al. | | Semantically relevant image retrieval by combining image and linguistic analysis |
Pham et al. | | SOM-based clustering of multilingual documents using an ontology |
Abdelwahab et al. | | Arabic Text Summarization using Pre-Processing Methodologies and Techniques. |
Rui et al. | | A search-based web image annotation method |
Sharma et al. | | Improved stemming approach used for text processing in information retrieval system |
Sonawane | | Graph-based Text Document Extractive Summarization |
Vashist et al. | | Document clustering using improved k-means algorithm |
Kilinc et al. | | DEU at ImageCLEF 2009 WikipediaMM Task: Experiments with Expansion and Reranking Approaches. |
Stratogiannis et al. | | Related Entity Finding Using Semantic Clustering Based on Wikipedia Categories |
Tuni et al. | | Afaan Oromo Hybrid Modelling: A Case based Optimized Intelligence in Information Retrieval System’s Localization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
| | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
| | WWE | WIPO information: entry into national phase | Ref document number: 2393480; Country of ref document: CA |
| | WWE | WIPO information: entry into national phase | Ref document number: 70892/00; Country of ref document: AU |
| | WWE | WIPO information: entry into national phase | Ref document number: 2000959600; Country of ref document: EP |
| | WWP | WIPO information: published in national office | Ref document number: 2000959600; Country of ref document: EP |
| | REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
| | WWW | WIPO information: withdrawn in national office | Ref document number: 2000959600; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: JP |
| | WWG | WIPO information: grant in national office | Ref document number: 70892/00; Country of ref document: AU |