WO2004100130A2 - Information retrieval and text mining using distributed latent semantic indexing - Google Patents

Information retrieval and text mining using distributed latent semantic indexing

Info

Publication number
WO2004100130A2
WO2004100130A2 PCT/US2004/012462 US2004012462W
Authority
WO
WIPO (PCT)
Prior art keywords
sub
collection
collections
term
query
Prior art date
Application number
PCT/US2004/012462
Other languages
French (fr)
Other versions
WO2004100130A3 (en)
Inventor
Clifford A. Behrens
Devasis Bassu
Original Assignee
Telcordia Technologies, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telcordia Technologies, Inc.
Priority to EP04750497A (EP1618467A4)
Priority to JP2006513228A (JP4485524B2)
Priority to CA2523128A (CA2523128C)
Publication of WO2004100130A2
Publication of WO2004100130A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data; Database structures therefor; File system structures therefor
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99943 Generating database or data structure, e.g. via user interface

Example

An embodiment of the present invention was used to partition and query the NSF awards database, which contains abstracts of proposals funded by the NSF since 1989; a total of 22,877 abstracts selected from 378 different NSF programs were used, with 114,569 unique terms.
The distributed LSI method of the present invention provides a set of concept classes, the number of these dependent on the level of resolution (or similarity), along with a set of keywords to label each class. The actual selection of the final set of concept classes is an iterative process whereby the user tunes the level of resolution to suit his or her purpose; at each level, the algorithm provides some metrics for the current clusters. For example, concept classes (represented by their keywords) for two such levels of resolution are listed below.
Class 1 {ccr, automatically, implementations, techniques, project, algorithms, automatic, systems, abstraction, high-level}
Class 2 {university, award, support, students, at, universities, institutional, provides, attend, faculty}
Class 3 {study, constrain, meridional, thermohaline, ocean, climate, hemispheres, greenland, observations, eastward}
Class 4 {species, which, animals, how, genetic, animal, evolutionary, important, understanding, known}
At the higher level of resolution:

Class 1 {runtime, high-level, run-time, execution, concurrency, application-specific, software, object-based, object-oriented, dsm}
Class 2 {problems, approximation, algorithmic, algorithms, approximating, algorithm, computationally, solving, developed, algebraic}
Class 4 {materials, ceramic, fabricate, microstructures, fabrication, ceramics, fabricated, manufacture, composite, composites}
Class 5 {meridional, wind, magnetosphere, magnetospheric, circulation, hemispheres, imf, magnetohydrodynamic, field-aligned, observations}
Class 9 {species, evolutionary, deb, genus, populations, endangered, ecological, phylogeny, diversification, diversity}
The preliminary clusters and concept labels obtained using the present invention show that the algorithm is adept at finding new (or hidden) concepts when the level of resolution is increased. Further, the concept labels returned by the algorithm are accurate and become more refined as the level of resolution is increased.

Given a query (a set of terms), the algorithm produces a set of query terms for each LSI space in the distributed environment, which is further refined by a cut-off score. The algorithm uses a set of similarity metrics, as discussed earlier. Results from individual LSI queries are collected, thresholded and presented to the user, categorized by concept. A subset of the NSF awards database containing 250 documents was selected from each of seven NSF directorate codes, among them BIO (Biological Sciences). Through these selections, the entire collection of 1,750 documents was ensured to be semantically heterogeneous. Next, eight different LSI spaces were computed: one for the documents belonging to each directorate code, and a final one for the entire collection. The distributed query algorithm was run on the seven smaller LSI spaces and the usual query on the comprehensive space. For comparison purposes, the actual documents returned provided the final benchmark, since the distributed LSI query mechanism was expected to perform better.
The main query consisted of the terms {brain, simulation}, and this was fed to the query algorithm. Further, a similarity cut-off of 0.5 was set system-wide. Among the extended query sets (using the cut-off) generated by the algorithm was:

BIO {brain, simulations (2), extended, assessment}

The final query results were as follows: the query on the larger, comprehensive LSI space returned no results with similarity scores greater than 0.5, and its top ten contained a couple of documents related to brain simulation but with low scores.

Abstract

The use of latent semantic indexing (LSI) for information retrieval and text mining operations is adapted to work on large heterogeneous data sets by first partitioning the data set into a number of smaller partitions having similar concept domains. A similarity graph network is generated in order to expose links between concept domains which are then exploited in determining which domains to query as well as in expanding the query vector. LSI is performed on those partitioned data sets most likely to contain information related to the user query or text mining operation. In this manner LSI can be applied to datasets that heretofore presented scalability problems. Additionally, the computation of the singular value decomposition of the term-by-document matrix can be accomplished at various distributed computers increasing the robustness of the retrieval and text mining system while decreasing search times.

Description

INFORMATION RETRIEVAL AND TEXT MINING USING DISTRIBUTED LATENT SEMANTIC INDEXING
FIELD OF THE INVENTION
This invention is related to a method and system for the concept based retrieval and mining of information using a distributed architecture. More specifically, the present invention partitions a heterogeneous collection of data objects with respect to the conceptual domains found therein and indexes the content of each partitioned sub-collection with Latent Semantic Indexing (LSI), thereby enabling one to query over these distributed LSI vector spaces. Vector space representations of these sub-collections of data objects can be used to select appropriate sources of information needed to respond to a user query or mining operation.
BACKGROUND

Latent Semantic Indexing (LSI) is an advanced information retrieval (IR) technology, a variant of the vector retrieval method, that exploits dependencies or "semantic similarity" between terms. It is assumed that there exists some underlying or "latent" structure in the pattern of word usage across data objects, such as documents, and that this structure can be discovered statistically. One significant benefit of this approach is that, once a suitable reduced vector space is computed for a collection of documents, a query can retrieve documents similar in meaning or concepts even though the query and document have no matching terms.
An LSI approach to information retrieval is detailed in commonly assigned United States Patent No. 4,839,853, which applies a singular-value decomposition (SVD) to a term-document matrix for a collection, where each entry gives the number of times a term appears in a document. A large term-document matrix is typically decomposed to a set of approximately 150 to 300 orthogonal factors from which the original matrix can be approximated by linear combination. In the LSI-generated vector space, terms and documents are represented by continuous values on each of these orthogonal dimensions; hence, they are given numerical representation in the same space. Mathematically, assume a collection of m documents with n unique terms that, together, form an n x m sparse matrix E with terms as its rows and the documents as its columns; each entry in E gives the number of times a term appears in a document. In the usual case, log-entropy weighting (log(tf + 1) x entropy) is applied to these raw frequency counts before applying SVD. The structure attributed to document-document and term-term dependencies is expressed mathematically in equation (1) as the SVD of E:
E = U(E) Σ(E) V(E)^T    (1)

where U(E) is an n x n matrix such that U(E)^T U(E) = I_n, Σ(E) is an n x n diagonal matrix of singular values, and V(E) is an m x n matrix such that V(E)^T V(E) = I_n, assuming for simplicity that E has fewer terms than documents.

Of course, the attraction of SVD is that it can be used to decompose E to a lower-dimensional vector space as set forth in the rank-k reconstruction of equation (2):

E_k = U_k(E) Σ_k(E) V_k(E)^T    (2)

Because the number of factors k can be much smaller than the number of unique terms used to construct this space, words will not be independent. Words similar in meaning, and documents with similar content based on the words they contain, will be located near one another in the LSI space. These dependencies enable one to query documents with terms, but also terms with documents, terms with terms, and documents with other documents. In fact, the LSI approach merely treats a query as a "pseudo-document," a weighted vector sum based on the words it contains. In the LSI space, the cosine or dot product between term or document vectors corresponds to their estimated similarity, and this measure of similarity can be exploited in interesting ways to query and filter documents. This measure of correspondence between query vector q and document vector d is given by equation (3):

sim(U_k(E)^T q, U_k(E)^T d)    (3)
In "Using Linear Algebra for Intelligent Information Retrieval" by M. Berry et al., SIAM Review 37(4): pp. 573-595 the authors provide a formal justification for using the matrix of left singular vectors Uk(E) as a vector lexicon.
Widespread use of LSI has resulted in the identification of certain problems exhibited by LSI when attempting to query massive heterogeneous document collections. An SVD is difficult to compute for extremely large term-by-document matrices, and the precision-recall performance tends to degrade as collections become very large. Surprisingly, much of the technical discussion surrounding LSI has focused on linear algebraic methods and algorithms that implement these, particularly problems of applying SVD to massive, sparse term-document matrices. Evaluations of the effect of changing parameters, e.g., different term weightings and the number of factors extracted by SVD, on the performance of LSI have been performed. Most of the approaches to making LSI scale better have involved increasing the complexity of LSI's indexing and search algorithms.
LSI is limited as an information retrieval and text mining strategy when document collections grow, because with large collections there exists an increasing probability of drawing documents from different conceptual domains. This has the effect of increasing the semantic heterogeneity modeled in a single LSI vector space, thus introducing noise and "confusing" the LSI search algorithm. As polysemy becomes more pronounced in a collection, vectors for terms tend to be represented by the centroid of all vectors for each unique meaning of the term, and since document vectors are computed from the weighted sum of vectors for the terms they contain, the semantics of these are also confounded.
In general, the number of conceptual domains grows with the size of a document collection. This may result from new concepts being introduced into the information space, or an existing concept becoming extremely large (in number of documents) with further differentiation of its sub-concepts. In both cases, the compression factor in any vector space-based method has to be increased to accommodate this inflation.
The deleterious effects of training on a large conceptually undifferentiated document collection are numerous. For example, assume that documents drawn from two conceptual domains, technology and food, are combined without sourcing into a single training set and that LSI is applied to this set to create a single vector space. It is easy to imagine how the semantics of these two domains might become confused. Take for instance the location of vectors representing the terms "chip" and "wafer." In the technology domain, the following associations may be found: silicon chip, silicon wafer, silicon valley, and copper wafer. However, in the food domain the terms chip and wafer take on different meanings and there may be very different semantic relationships: potato chip, corn chip, corn sugar, sugar wafer. But these semantic distinctions become confounded in the LSI vector space. By training on this conceptually undifferentiated corpus, vectors are computed for the shared terms "chip" and "wafer" that really don't discriminate well the distinct meanings that these terms have in the two conceptual domains. Instead, two semantically "diluted" vectors that only represent the numerical average or "centroid" of each term's separate meanings in the two domains are indexed.
Therefore, it would be desirable to have a method and system for performing LSI-based information retrieval and text mining operations that can be efficiently scaled to operate on large heterogeneous sets of data.
Furthermore, it would be desirable to have a method and system for performing LSI-based information retrieval and text mining operations on large data sets quickly and accurately.
Additionally, it would be desirable to have a method and system for performing LSI-based information retrieval and text-mining operations on large data sets without the deleterious effects of mixing conceptually differentiated data.
Also, it would be desirable to have a method and system for the processing of large document collections into a structure that enables development of similarity graph networks of sub-collections having related concept domains. Additionally, it would be desirable to have a method and system that enables a user to query the document collection in a flexible manner so that the user can specify the degree of similarity necessary in search results.
SUMMARY

The present invention provides a method and system for taking a massive, heterogeneous set or collection of data objects (also referred to as a set of documents) and partitioning it into more semantically homogeneous concept spaces or sub-collections. This enables LSI to perform better in the respective vector spaces computed for each of these. Mathematically, this approach amounts to an approximate block-diagonalization of the term-document matrix and obtaining SVDs for each of these blocks. The query process is then a mapping onto the network of overlapping blocks, using similarity metrics to indicate how much these blocks actually overlap.
Preprocessing the heterogeneous document collection into sub-collections of documents sorted by conceptual domain, before computing a term-by-document matrix for each, permits each domain (sub-collection) to be processed independently with LSI. This reduces both storage and computational overhead and opens the possibility of distributing vector spaces (and searches of them) over a wider network of resources. An added benefit of this approach is greater semantic resolution for any one vector space gained from fewer dimensions, i.e., LSI models exhibiting greater parsimony.
A large data collection, or a plurality of data collections, is screened for the existence of grouping or clustering effects. If a data collection is known to be homogeneous, then the initial screening/clustering step may be skipped for that collection. This information is then used to segregate documents into more semantically homogeneous sub-collections before applying SVD to each. To determine whether a user's query is appropriate for a particular LSI vector space, i.e., whether the intended semantics of a query match those of a particular document collection, the paired similarity between the semantic structures of all LSI vector spaces is computed. This distance measure is based on the similarity of semantic graphs formed from words shared by each pair of vector spaces. The semantics of a query can be inferred from multiple query terms; by presenting a user with the different semantic contexts for query terms represented in all LSI vector spaces, the system can exploit this information to properly source queries and fuse hit lists. The main idea is to partition a large collection of documents into smaller sub-collections that are conceptually independent (or nearly independent) of each other, and then build LSI vector spaces for each of the sub-collections.
"Conceptual independence" may mean the presence of some terms common to two LSI spaces whose semantic similarity measure (defined later on) is approximately zero. In this case, the common terms represent polysemy (multiple meanings for a term) over the conceptual domains involved. A multi-resolution conceptual classification is performed on each of the resulting LSI vector spaces. In a realistic situation, there may be quite a few common terms present between any two conceptual domains. To address the possible problem of synonymy and polysemy in the query, a network/graph of the conceptual domains based on links via common terms is generated. Then this graph is examined at query time for terms that are nearest neighbors to ensure that each contextually appropriate LSI space is properly addressed for a user's query terms. The use of LSI in developing a query vector enables the user to select a level of similarity to the initial query. If a user prefers to receive additional documents that may be more peripherally related to the initial query, the system will expand the query vector using LSI techniques.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 depicts a flow diagram of the method of processing document collections in accordance with the present invention;
FIG. 2a and 2b depict flow diagrams of the method of processing document collections in accordance with the present invention, particularly the generation of data on the similarity of sub-collections;
FIG. 3 depicts a flow diagram of the method of querying the collection of documents processed in accordance with the methods of the present invention; and
FIG. 4 depicts a schematic diagram of one embodiment of a distributed LSI system in accordance with the present invention.
DETAILED DESCRIPTION
Referring to FIG. 1, the inventive document collection processing method of the present invention is set forth. At step 110, the method generates a frequency count for each term in each document in the collection (or set) of documents. The term "data objects" in this context refers to information such as documents, files, records, etc. Data objects may also be referred to herein as documents.
In an optional preprocessing step 100, the terms in each document are reduced to their canonical forms and a predetermined set of "stop" words is ignored. Stop words are typically those words that are used as concept connectors but provide no actual content, such as "a," "are," "do," "for," etc. The list of common stop words is well known in the art. Suffix strippers that reduce a set of similar words to their canonical forms are also well known in the art. Such a stripper or parser will reduce a set of words such as computed, computing and computer to a stem word "comput," thereby combining the frequency counts for such words and reducing the overall size of the set of terms.
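A sketch of optional step 100 and the counting of step 110 might look as follows; the four-word stop list is deliberately abbreviated, and the stem callable is a placeholder for any suffix stripper (for example, a Porter-style stemmer).

```python
import re
from collections import Counter

STOP_WORDS = {"a", "are", "do", "for"}   # abbreviated; real stop lists are much larger

def term_counts(document, stem=lambda w: w):
    # step 100/110: lowercase, drop stop words, canonicalize, count frequencies
    words = re.findall(r"[a-z]+", document.lower())
    return Counter(stem(w) for w in words if w not in STOP_WORDS)

# With a Porter-style stemmer passed as `stem`, "computed", "computing" and
# "computer" would all be counted under the single stem "comput".
```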
At step 120, the heterogeneous collection of data objects is partitioned by concept domain into sub-collections of like concept. If it is known that one or more separate sub-collections within a larger collection of data are homogeneous in nature, the initial partitioning need not be done for those known homogeneous data collections. For initial sorting of data objects into more conceptually homogeneous sub-collections, the bisecting k-means algorithm in a recursive form with k=2 at each stage to obtain k clusters is preferably used. Clustering techniques have been explored in "A Comparison of Document Clustering Techniques" by M. Steinbach et al., Technical Report 00-034, Department of Computer Science and Engineering, University of Minnesota. Although the bisecting k-means algorithm is preferred, the "standard" k-means algorithm, other spatial clustering algorithms, or hierarchical clustering algorithms may be utilized.
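The recursive bisecting 2-means partition of step 120 can be sketched as below, with scikit-learn's KMeans standing in for the 2-means routine at each split; splitting the largest remaining cluster on each pass is one reasonable policy that the text leaves open.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    # Split the document vectors X (one row per document) into k clusters by
    # repeated 2-means; assumes k does not exceed the number of documents.
    clusters = [np.arange(X.shape[0])]
    while len(clusters) < k:
        idx = clusters.pop(max(range(len(clusters)), key=lambda i: len(clusters[i])))
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        clusters.extend([idx[labels == 0], idx[labels == 1]])
    return clusters   # each entry: document indices of one sub-collection
```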
Preferably, the plurality of data object clusters can be further refined by performing a series of iterations of the bisecting k-means algorithm. At step 130, the Singular Value Decomposition (described below) is then applied to each of these k clusters or sub-collections of documents to generate a reduced vector space having approximately 200 orthogonal dimensions. At 200 dimensions, the size is manageable yet able to capture the semantic resolution of the sub-collection; different sizes may be used depending on constraints such as available computing power and time.
At step 140, using the k-means or other appropriate algorithm, clustering is then performed on each of these reduced vector spaces to discover vector clusters (representing core concepts) and their centroid vectors for each sub-collection. Alternatively, instead of applying the clustering algorithm to the reduced vector space, the vector clusters and centroid vectors could be obtained from the clustering data obtained at step 120. Once these centroid vectors are obtained, a predetermined number of terms closest to each of these centroid vectors are found at step 150. In a preferred embodiment of the present invention the number of keywords is set to 10 per cluster, although different numbers of keywords may be appropriate in different situations. These are used as keywords to label the sub-collection, thereby identifying the concept cluster therein. Each of the k vector spaces provides a different resolution of the underlying concepts present in the data, and the context of each one is represented by its own set of keywords.
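Steps 140 and 150 amount to a cosine nearest-neighbor search around each concept centroid. A sketch, assuming the projected term vectors are the rows of an array aligned with a list of term strings:

```python
import numpy as np

def concept_keywords(term_vecs, terms, centroid, n=10):
    # step 150: the n terms whose LSI vectors lie closest (cosine) to a centroid
    tv = term_vecs / (np.linalg.norm(term_vecs, axis=1, keepdims=True) + 1e-12)
    c = centroid / (np.linalg.norm(centroid) + 1e-12)
    nearest = np.argsort(-(tv @ c))[:n]
    return [terms[i] for i in nearest]
```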
Having computed the LSI vector space for each contextually related sub-collection of documents and having extracted the keywords that represent the core concepts in each, the next step 160 is to establish the contextual similarity between these spaces. Step 160 is necessary to select and search contextually appropriate LSI vector spaces in response to a query. Two graph-link measures are developed to establish a similarity graph network. A user query is passed on to the similarity graph network, where proper queries are generated for each LSI vector space, and then each works independently to retrieve useful documents.
This important step 160 is described in detail below. Sub-collections C_1, C_2, ..., C_k denote the k concept domains obtained as a partition of the document class C using the k-means clustering algorithm. Term sets T_1, T_2, ..., T_k denote the corresponding term sets for the k concept domains, with t_i denoting the cardinality of T_i for i = 1, 2, ..., k. Let V_1, V_2, ..., V_k be the corresponding eigen matrices for the k term spaces in the SVD representation; with f factors in each of these LSI spaces, equation (4) forms the rank-reduced term eigen basis for the i-th concept domain:

V_i = [v_1^i  v_2^i  ...  v_f^i]    (4)

Document sets D_1, D_2, ..., D_k are the corresponding document sets for the k concept domains; let d_i denote the cardinality of D_i for i = 1, 2, ..., k. Further, let U_1, U_2, ..., U_k be the corresponding eigen matrices for the k document spaces in the SVD representation. Here, U_i = [u_1^i  u_2^i  ...  u_f^i] forms the rank-reduced document eigen basis for the i-th concept domain. T_ij = T_i ∩ T_j is the set of common terms for the concept domains C_i and C_j, and t_ij is the cardinality of T_ij. In addition, m_i = V_i V_i^T is the term similarity matrix for the concept domain C_i, and m_i^Q is the restriction of m_i to the term set Q, obtained by selecting only those rows/columns of m_i corresponding to the terms appearing in Q (for example, m_i^Q = m_i for Q = T_i). The projection of a term vector v into the term space generated by the SVD is given by V_i^T v for the i-th concept domain.
The method of the present invention exploits two different ways in which the similarity between two concept domains can be measured, as set forth in FIGS. 2a and 2b. The first similarity measure is the number of terms common to each concept domain. With common terms, it is necessary to exclude high-frequency terms that act as constructs for the grammar rather than conveying any actual meaning. This is largely achieved during document preprocessing in step 100 by filtering them with a stop-word list, but if such preprocessing was not performed, the operation could be performed now in order to exclude unnecessary high-frequency terms.
The first measure captures the frequency of occurrence of common terms between any two concept domains. The underlying idea is that if many terms are common to their vector spaces, then they ought to be describing the same thing, i.e., they have similar semantic content and provide similar context. This process is described with reference to FIG. 2a. Considering the concept domains C_i and C_j, in the case where the common set T_ij is non-empty, the proximity between these two spaces is defined to be of order zero and the frequency measure ŝ1 is given by equation (5). At step 210 of FIG. 2a this frequency measure is determined for each pair of sub-collections. When T_ij is empty, there are no common terms between sub-collections C_i and C_j. There may be, however, some other space C_l which has common terms with both C_i and C_j, i.e., T_il and T_lj are both non-empty. Then, the concept spaces C_i and C_j can be linked via this intermediate space; at step 220 of FIG. 2a this is determined. In the case where there are several choices for the intermediate space, the "strongest link" is selected at step 230 using equations (6) and (7). Here, the proximity between C_i and C_j is stated as being of order one, the frequency measure ŝ1 is given by equation (6), and the similarity measure s1 by equation (7), where p is the proximity between the two spaces:
s1 = (ŝ1 + p)^(-1)    (7)
The similarity measure above takes into account the proximity between the two concept spaces along with the occurrence of common terms. Using the data from steps 210 and 230, a similarity graph network can be mapped at step 240, showing relationships between sub-collections either directly or through a linking sub-collection.
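The bodies of equations (5) and (6) did not survive in this text, so the sketch below substitutes the overlap coefficient on common terms for the frequency measure ŝ1, and scores a candidate order-one link by the product of its two legs; the patent's exact normalization and link scoring may differ.

```python
def freq_measure(Ti, Tj):
    # stand-in for eq. (5): overlap of the common terms of two concept domains
    common = Ti & Tj
    return len(common) / min(len(Ti), len(Tj)) if common else 0.0

def strongest_link(term_sets, i, j):
    # step 230: best intermediate domain l linking C_i and C_j (order-one proximity)
    candidates = [l for l in range(len(term_sets)) if l not in (i, j)
                  and term_sets[i] & term_sets[l] and term_sets[l] & term_sets[j]]
    if not candidates:
        return None
    return max(candidates, key=lambda l: freq_measure(term_sets[i], term_sets[l]) *
                                         freq_measure(term_sets[l], term_sets[j]))
```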
The second measure of similarity is more sensitive to the semantics of the common terms, not just how many are shared by two concept domains. The semantic relationships between the common terms (no matter how many there are) in each of the concept domains are examined to determine whether they are related in the same way.
At step 250 of FIG. 2b, the correlation between two matrices X and Y (both of dimensions m x n) is measured, preferably by use of equations (8), (9), (10) and (11):

r(X, Y) = [ (1/(m n)) Σ_{i=1..m} Σ_{j=1..n} (X_ij - X̄)(Y_ij - Ȳ) ] / (s_X s_Y)    (8)

where

X̄ = (1/(m n)) Σ_{i=1..m} Σ_{j=1..n} X_ij,    Ȳ = (1/(m n)) Σ_{i=1..m} Σ_{j=1..n} Y_ij    (9)

s_X = [ (1/(m n)) Σ_{i=1..m} Σ_{j=1..n} (X_ij - X̄)^2 ]^(1/2)    (10)

s_Y = [ (1/(m n)) Σ_{i=1..m} Σ_{j=1..n} (Y_ij - Ȳ)^2 ]^(1/2)    (11)
At step 260 one of the matrices (say X) is held fixed while the other one (Y) is permuted (rows/columns). For each such permutation, the Mantel test statistic is computed at step 265. At step 270, the number of times the obtained statistic is greater than or equal to the test statistic value obtained with the original X and Y (N_GE) is counted. The total number of such permutations is denoted by N_runs. Usually, around 1000 permutations are sufficient for a 5% level of significance and 5000 permutations for a 1% level of significance. The p-value for the test is then determined at step 275 by equation (12), and the results of the Mantel test are considered acceptable if the p-value is within a predetermined range considering the number of permutations used to achieve the level of significance. For 1000 permutations, the p-value should be less than approximately 0.05 to consider the test result acceptable.

p-value = (N_GE + 1) / N_runs    (12)
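A compact version of steps 250 through 275, following the reconstruction of equations (8) through (12) above; because the matrices compared here are square restrictions of term-similarity matrices, the sketch applies one permutation to rows and columns alike.

```python
import numpy as np

def matrix_corr(X, Y):
    # eqs. (8)-(11): Pearson correlation over all matrix entries
    return float(np.corrcoef(X.ravel(), Y.ravel())[0, 1])

def mantel_test(X, Y, n_runs=1000, seed=0):
    # steps 260-275: permute Y, count statistics >= the observed one, return p-value
    rng = np.random.default_rng(seed)
    observed = matrix_corr(X, Y)
    nge = 0
    for _ in range(n_runs):
        p = rng.permutation(Y.shape[0])
        if matrix_corr(X, Y[np.ix_(p, p)]) >= observed:
            nge += 1
    return observed, (nge + 1) / n_runs   # eq. (12)
```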
Corresponding to the first similarity measure, the semantic measure for a proximity of order zero is determined at step 280 by equation (13):

ŝ2 = r(m_i^{T_ij}, m_j^{T_ij})    (13)

Similarly, the measure for the first-order proximity is determined at step 285 by equation (14), maximizing over linking domains l:

ŝ2 = max_l [ r(m_i^{T_il}, m_l^{T_il}) r(m_l^{T_lj}, m_j^{T_lj}) ]    (14)

Then at step 290 the final semantic similarity measure s2 is given by equation (15), where p again is the proximity between the two spaces:

s2 = (ŝ2 + p)^(-1)    (15)
A preferred embodiment of the present invention uses the second similarity measure when comparing the semantics of LSI vector spaces. It should be noted, however, that its validity is given by the first similarity measure (the proportion of common terms). Suppose the second measure has a very high value (strong semantic relationship) but it turns out that there were only two common terms out of a total of 100 terms in each concept domain. Then the measure is subject to high error. In this situation, the first measure clearly exposes this fact and provides a metric for validating the semantic measure. Both measures are needed to obtain a clear indication of the semantic similarity (or lack thereof) between two concept domains. The most preferred measure of similarity, therefore, is the product of the two. Having measured the contextual similarity between vector spaces, the resulting similarity graph network and "identifying concept" terms are used in information retrieval or data mining operations. In order to perform an information retrieval, the similarity between the query and a concept domain's vector space must be determined so that useful documents in it may be retrieved.
With reference to FIG. 3, the usual user query Q is a set of terms drawn from the union of the term sets T_1, ..., T_k, input at step 310 by the user. The user may also specify the degree of similarity desired in search results. If a greater degree of searching freedom is desired, the system will expand the query vector as described below. A representative query vector is then generated at step 320 as the normalized sum of each of the projected term vectors in the LSI space. Note that there might be several possible cases, e.g., (1) all the terms in Q are present in the concept domain term set T_i, (2) some terms are present, or (3) none are present.
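Step 320 in sketch form: sum the projected vectors of whichever query terms the domain contains and normalize. Returning None signals case (3), where the domain holds none of the query terms; the row-indexed V and the term_index mapping are assumed interfaces.

```python
import numpy as np

def query_vector(V, term_index, query_terms):
    # step 320: normalized sum of the projected term vectors present in this domain
    vecs = [V[term_index[t]] for t in query_terms if t in term_index]
    if not vecs:
        return None          # case (3): no query term occurs in this concept domain
    q = np.sum(vecs, axis=0)
    return q / np.linalg.norm(q)
```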
At step 330, the sub-collections in which all the query terms exist in the term set for a concept domain (i.e., sub-collection) are identified. At step 340, if multiple such domains exist, a ranking of the domains, along with the "meaning" each conveys, is helpful in deciding which one to query. If a user has an idea of what he or she is looking for, then the "identifying concept" terms provided (as described above) become useful. On the other hand, for the explorative user without a fixed goal, the ranking supports serendipitous discovery.
The "identifying concept" terms are naturally terms associated with the closest (in cosine measure) projected term vectors to the query vector. Semantically, these terms are also closest to the query terms. As a member of this concept domain, this term set is the best candidate to represent the domain in trying to uncover what the user meant by the query. The ranking is just the value of the cosine measure between the "identifying concept" term vector and the query vector. A list can be presented to the user so that he or she is able to decide which domains should be searched for matching documents. Results are returned to the user in separate lists for each concept domain (sub-collection) at step 350 of FIG. 1. Once the user determines which sub- collections to query based on the lists of ranked sub-collections, at step 360, the information retrieval software uses the standard LSI approach of cosine based vector space similarity to retrieve document matches at step 370 which are then presented to the user at step 380. Alternatively, the selection of the best sub-collections to query can be performed automatically by selecting those with the highest rank first. This would tend to be used more in a strict information retrieval system rather in the more interactive text-mining environment. In a more complicated case some of the query terms are missing from the term set for the concept domain. Again, two approaches are used. In the first approach the process chooses to ignore those missing terms and just proceed as before with terms that are present. In the alternative approach, the process examines relationships between existing terms in the concept domain with the non-existent ones present in the query.
If missing terms are simply ignored, as before, an "identifying concept" term and a rank are presented to the user; but additional care must be taken, for in this case not all the query terms match. A possible solution is to scale down the rank by the proportion of query terms that were actually used to find the concept term. Then the concept term is obtained exactly as before. The other case, in which non-existent query terms are used, is actually a particular instance of the next one.
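The rank scaling suggested above is a one-line correction, shown here only for concreteness:

```python
def scaled_rank(cosine_rank, n_matched, n_query):
    # damp a domain's rank by the proportion of query terms actually used
    return cosine_rank * (n_matched / n_query)
```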
In the worst-case scenario none of the query terms are present in the term set for a concept domain. The question arises whether one would want to query this domain at all. One thing is sure - if there are concept domains that fall into the previous two cases, they should definitely be exploited before any domain falling into this case. One way that this domain can be queried is to examine associations of terms across concept domains to discover synonyms existing in this domain, starting with the query terms. In other words, the entire information space is explored to obtain not just the query terms themselves, but also terms that are strongly related to them semantically. To control the method, a first order association (degree one) is imposed to limit search (where zero order implies the first case described above).
This version of the method, depicted in the flow diagram of FIG. 3, differs from the above discussion only in that the query vector for a concept domain is computed at step 320 as the weighted sum of its projected term vectors, based on a similar concept domain that actually contains the query terms. The selection of this other concept domain is based on the domain similarity measure described above (the product measure performs well for this). Once the concept domain is selected that contains the query terms and is also closest in meaning to the one to be queried, the expanded query vector is constructed for the query domain. With this expanded query vector, it is easy to generate "identifying concept" terms, as before in steps 330 through 370.
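The expanded-query construction might be sketched as follows; the domain objects, the (name, name)-keyed similarity table, and the choice to weight borrowed term vectors by the donor's product-measure similarity are all assumptions layered on the text's "weighted sum".

```python
import numpy as np

def expanded_query(target, domains, sim, query_terms):
    # pick the domain most similar to `target` (product measure) that holds all
    # query terms, then build a weighted query vector from its projected terms
    donors = [d for d in domains
              if d is not target and all(t in d.term_index for t in query_terms)]
    if not donors:
        return None
    donor = max(donors, key=lambda d: sim[target.name, d.name])
    w = sim[target.name, donor.name]
    q = sum(w * donor.vectors[donor.term_index[t]] for t in query_terms)
    return q / np.linalg.norm(q)
```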
There are two main functions performed in the computation and querying of a distributed LSI space. The first function creates a classification scheme specifying the multiple LSI vector spaces and comprises steps 110 through 160 and, optionally, step 100 of FIG. 1 and, depending on the similarity technique used, the steps of FIG. 2a or 2b. The second function actually queries this distributed network of spaces as described by steps 310 through 370 of FIG. 3. From a functional perspective these two functions are independent of each other, and the first function can be performed at various locations in a distributed network as depicted in FIG. 4. In FIG. 4 a network configuration for a distributed LSI network is set forth in which an LSI hub processor 410 is used to control the various data object clustering and information query requests. LSI hub processor 410 has three functions: brokering queries, generating similarity graph networks, and indexing (or re-indexing) newly arrived documents. As one or more servers 421-423 are added to the network, each having access to a plurality of data objects in an associated database 431-433, the LSI hub processor 410 controls the distributed processing of the data objects in accordance with the method of the present invention shown in FIG. 1 and FIGS. 2a and/or 2b in order to develop a comprehensive network graph of all data objects across all servers and databases. It should be understood that LSI hub processor 410 may perform some or all of the steps set forth in the partitioning and similarity processing method described above, or it may only control the processing in one or more of the servers 421-423. LSI hub processor 410 can then respond to an information retrieval or data mining query from a user terminal 440. In response to a query from the user terminal 440, the LSI hub processor executes the query method of the present invention as described in FIG. 3 and sends the query results back to user terminal 440 by extracting the matching data objects from one or more of the databases 431-433. From user terminal 440 the user may also request LSI hub processor 410 to use the expanded query discussed above, providing extra flexibility to the user.
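The hub's query-brokering role can be sketched as a thin layer over the per-domain ranking above. Here query_domain() is a placeholder for the standard LSI cosine retrieval performed at the server holding a given sub-collection; it, the domain dict, and the 0.5 threshold are assumptions for the example, not part of the patent text.

    def query_domain(name, query_terms):
        # Placeholder for the per-server LSI cosine retrieval step;
        # assumed, not defined by the patent text.
        raise NotImplementedError

    def broker_query(query_terms, domains, threshold=0.5):
        # domains: assumed dict of sub-collection name -> its term-vector dict.
        ranked = sorted(((rank_domain(query_terms, tv), name)
                         for name, tv in domains.items()),
                        key=lambda x: x[0][1], reverse=True)
        results = {}
        for (terms, rank), name in ranked:
            if rank < threshold:
                continue  # skip domains unlikely to be responsive to the query
            # Standard LSI retrieval within the selected sub-collection.
            results[name] = query_domain(name, query_terms)
        return results  # separate result list per concept domain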
In this manner LSI hub processor 410 oversees the computationally intensive clustering operations, decomposition operations and generation of the centroid vectors. LSI hub processor 410 may also be used to partition data between databases more efficiently by redirecting the placement of similar clusters into the same database in order to create concept domains having a greater number of data objects, thereby making subsequent retrieval or text mining operations more efficient. LSI hub processor 410 may also be used to index new documents into the most relevant partition, either physically or virtually, in order to place documents having similar semantic attributes in the same conceptual domain. In presenting a result to a user, the LSI hub processor can be requested to present either a ranked list of results grouped by concept domain or a single ranked list of results across all queried domains, depending on user preference.
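Indexing a new document into the most relevant partition can be illustrated with the standard LSI fold-in. The sketch below assumes each partition keeps its truncated SVD factors Tk (terms x k) and singular values Sk together with a centroid vector, and, for simplicity, that the document's term counts are expressed over a shared vocabulary; these are assumptions for the example, not details given in the text.

    def fold_in(doc_counts, Tk, Sk):
        # Standard LSI fold-in of a new document: d_hat = d^T * Tk * Sk^-1.
        return (doc_counts @ Tk) / Sk

    def assign_partition(doc_counts, partitions):
        # partitions: assumed dict name -> (Tk, Sk, centroid). Place the
        # document in the partition whose centroid is closest in cosine
        # measure to the folded-in document vector.
        return max(partitions,
                   key=lambda name: cosine(
                       fold_in(doc_counts,
                               partitions[name][0], partitions[name][1]),
                       partitions[name][2]))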
An embodiment of the present invention was used to partition and query the NSF Awards database that contains abstracts of proposals funded by the NSF since 1989. Information on awards made prior to 1989 is only available if the award has since been amended. A total of 22,877 abstracts selected from 378 different NSF programs were used with a total count of 114,569 unique terms.
The distributed LSI method of the present invention provides a set of concept classes, the number of which depends on the level of resolution (or similarity), along with a set of keywords to label each class. The actual selection of the final set of concept classes is an iterative process whereby the user tunes the level of resolution to suit his or her purpose. To assist the user, the algorithm provides some metrics for the current clustering. For example, the concept classes (represented by their keywords) for two such levels of resolution are listed below.
Level of Resolution: Low

Class 1={ccr, automatically, implementations, techniques, project, algorithms, automatic, systems, abstraction, high-level}
Class 2={university, award, support, students, at, universities, institutional, provides, attend, faculty}
Class 3={study, constrain, meridional, thermohaline, ocean, climate, hemispheres, greenland, observations, eastward}
Class 4={species, which, animals, how, genetic, animal, evolutionary, important, understanding, known}

Level of Resolution: High

Class 1={runtime, high-level, run-time, execution, concurrency, application-specific, software, object-based, object-oriented, dsm}
Class 2={problems, approximation, algorithmic, algorithms, approximating, algorithm, computationally, solving, developed, algebraic}
Class 3={support, award, university, institutional, attend, universities, students, forum, faculty, committee}
Class 4={materials, ceramic, fabricate, microstructures, fabrication, ceramics, fabricated, manufacture, composite, composites}
Class 5={meridional, wind, magnetosphere, magnetospheric, circulation, hemispheres, imf, magnetohydrodynamic, field-aligned, observations}
Class 6={plate, tectonic, faulting, strike-slip, tectonics, uplift, compressional, extensional, geodetic, geodynamic}
Class 7={compositions, isotopic, composition, hydrous, carbonaceous, fractionation, carbon, minerals, dissolution, silicates}
Class 8={cells, protein, proteins, cell, which, regulation, gene, regulated, biochemical, expression}
Class 9={species, evolutionary, deb, genus, populations, endangered, ecological, phylogeny, diversification, diversity}
The preliminary clusters and concept labels obtained using the present invention show that the algorithm seems adept at finding new (or hidden) concepts when the level of resolution is increased. Further, the concept labels returned by the algorithm are accurate and become more refined as the level of resolution is increased.
In this case, a simple implementation of the query algorithm for distributed LSI was used. Given a query (a set of terms), the algorithm produces a set of query terms for each LSI space in the distributed environment, which is further refined by a cut-off score. The algorithm uses a set of similarity metrics, as discussed earlier. Results from the individual LSI queries are collected, thresholded and presented to the user, categorized by concept. A subset of the NSF Awards database was selected containing 250 documents from each of the following NSF directorate codes:
1. ENG Engineering
2. GEO Geosciences
3. SBE Social, Behavioral and Economic Sciences
4. EHR Education and Human Resources
5. MPS Mathematical and Physical Sciences
6. CSE Computer and Information Science and Engineering
7. BIO Biological Sciences

Through these selections, the entire collection of 1,750 documents was ensured to be semantically heterogeneous. Next, eight different LSI spaces were computed: one for the documents belonging to each directorate code, and a final one for the entire collection. The distributed query algorithm was run on the seven directorate LSI spaces, and the usual query was run on the comprehensive space. For comparison purposes, the actual documents returned provided the final benchmark, because the distributed LSI query mechanism was expected to perform better.
The main query consisted of the terms {brain, simulation}, and this was fed to the query algorithm. Further, a cut-off of 0.5 (similarity) was set system-wide. The extended query sets (using the cut-off) generated by the algorithm are listed below.
BIO: {brain, simulations (2), extended, assessment}
CSE: {neural, simulation}
EHR, ENG, GEO, MPS: {mechanisms, simulation}
SBE: {brain, simulation}
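Extended query sets like those above can be derived with a sketch along the following lines, reusing the helpers defined earlier; the function name is illustrative and the 0.5 cut-off mirrors the system-wide setting used in this experiment.

    def extended_query_set(query_terms, term_vectors, cutoff=0.5):
        present = [t for t in query_terms if t in term_vectors]
        if not present:
            return []
        q = np.sum([term_vectors[t] for t in present], axis=0)
        # Keep every term whose projected vector clears the similarity cut-off.
        return [t for t in term_vectors
                if cosine(q, term_vectors[t]) >= cutoff]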
The final query results were as follows. The query on the larger LSI space returned no results with similarity scores greater than 0.5. However, the top ten results did contain a couple of documents related to brain simulation, though with low scores.
These two documents were reported in the results from BIO and SBE with similarity scores greater than 0.5. Another document (not found earlier) was reported from the CSE space with a score above 0.5. This document turned out to be an abstract on neural network algorithms that indeed was related to the query. The other spaces returned no documents with similarity scores above 0.5.
The above description has been presented only to illustrate and describe the invention. It is not intended to be exhaustive or to limit the invention to any precise form disclosed. Many modifications and variations are possible in light of the above teaching. The applications described were chosen in order to best explain the principles of the invention and its practical application, and to enable others skilled in the art to best utilize the invention in various applications and with various modifications as are suited to the particular use contemplated.

Claims

We claim:
1. A method for processing a collection of data objects for use in information retrieval and data mining operations comprising the steps of: generating a frequency count for each term in each data object in the collection; partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information, wherein each sub-collection is based on the conceptual dependence of the data objects within; generating a term-by-data object matrix for each sub-collection; decomposing the term-by-data object matrix into a reduced singular value representation; determining the centroid vectors of each sub-collection; finding a predetermined number of terms in each sub-collection closest to the centroid vector; and, developing a similarity graph network to establish similarity between sub-collections.
2. The method of claim 1 further comprising the step of preprocessing the documents to remove a pre-selected set of stop words prior to generating the term frequency count for each data object.
3. The method of claim 2 wherein the step of preprocessing further comprises the reduction of various terms to a canonical form.
4. The method of claim 1 wherein the step of partitioning the collection is performed using a bisecting k-means clustering algorithm.
5. The method of claim 1 wherein the step of partitioning the collection is performed using a k-means clustering algorithm.
6. The method of claim 1 wherein the step of partitioning the collection is performed using hierarchical clustering.
7. The method of claim 1 wherein the predetermined number of terms is 10.
8. The method of claim 1 wherein the step of determining the centroid vectors of each sub-collection uses a clustering algorithm on the reduced singular value representation of the term-by-data object matrix for said sub-collection.
9. The method of claim 1 wherein the step of determining the centroid vectors of each sub-collection is based on the result of the partitioning step.
10. The method of claim 1 wherein the reduced singular value representation of the term-by-data object matrix for each sub-collection has approximately 200 orthogonal dimensions.
11. The method of claim 1 wherein the step of establishing similarity between sub-collections is based on the frequency of occurrence of common terms between sub-collections.
12. The method of claim 1 wherein the step of developing the similarity graph network is based on the semantic relationships between the common terms in each of the sub-collections.
13. The method of claim 1 wherein the step of developing the similarity graph network is based on the product of the frequency of occurrence of common terms between sub-collections and the semantic relationships between the common terms in each of the sub-collections.
14. The method of claim 11 wherein the step of developing the similarity graph network further comprises the steps of: determining if a first sub-collection and a second sub-collection that have no common terms both have terms in common with one or more linking sub-collections; and, choosing the linking sub-collection having the strongest link.
15. The method of claim 12 wherein the step of developing the similarity graph network further comprises the steps of: determining the correlation between a first sub-collection and a second sub-collection; permuting said first sub-collection against said second sub-collection; computing the Mantel test statistic for each permutation; counting the number of times that the Mantel test statistic is greater than or equal to the correlation between said first sub-collection and said second sub-collection; determining the p-value from said count; calculating the measure for a proximity of order zero; calculating the measure for the first order proximity; and, determining the semantic relationship based similarity measure s2 wherein

s2 = (s1 + p)^-1.
16. A method of information retrieval in response to a user query from a user comprising the steps of: partitioning a collection of data objects into a plurality of sub-collections based on the conceptual dependence of the data objects, wherein the relationship between such sub-collections is expressed by a similarity graph network; generating a query vector based on the user query; identifying all sub-collections likely to be responsive to the user query using the similarity graph network; and, identifying data objects similar to the query vector in each identified sub-collection.
17. The method of claim 16 wherein the step of partitioning the collection of data objects further comprises the steps of: generating a frequency count for each term in each data object in the collection; partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information; generating a term-by-data object matrix for each sub-collection; decomposing the term-by-data object matrix into a reduced singular value representation; determining the centroid vectors of each sub-collection; finding a predetermined number of terms in each sub-collection closest to the centroid vector; and, developing a similarity graph network to establish similarity between sub-collections.
18. The method of claim 17 wherein the step of determining the centroid vectors uses a clustering algorithm on the reduced singular value representation of the term-by-data object matrix for said sub-collection.
19. The method of claim 17 wherein the step of determining the centroid vectors of each sub-collection is based on the result of the partitioning step.
20. The method of claim 17 wherein the step of developing the similarity graph network further comprises the steps of: determining if a first sub-collection and a second sub-collection that have no common terms both have terms in common with one or more linking sub-collections; and, choosing the linking sub-collection having the strongest link.
21. The method of claim 17 wherein the step of developing the similarity graph network further comprises the steps of: determining the correlation between a first sub-collection and a second sub-collection; permuting said first sub-collection against said second sub-collection; computing the Mantel test statistic for each permutation; counting the number of times that the Mantel test statistic is greater than or equal to the correlation between said first sub-collection and said second sub-collection; determining the p-value from said count; calculating the measure for a proximity of order zero; calculating the measure for the first order proximity; and, determining the semantic relationship based similarity measure s2 wherein

s2 = (s1 + p)^-1.
22. The method of claim 17 further comprising the step of preprocessing the documents to remove a pre-selected set of stop words prior to generating the term frequency count for each data object.
23. The method of claim 16 wherein the step of partitioning the collection is performed using a bisecting k-means clustering algorithm.
24. The method of claim 16 further comprising the steps of: ranking the identified sub-collections based on the likelihood of each to contain data objects responsive to the user query; selecting which of the ranked sub-collections to query; presenting the ranked sub-collections to the user; and, inputting user selection of the ranked sub-collections to be queried.
25. The method of claim 16 wherein the step of generating a query vector based on the user query further comprises expanding the user query by computing the weighted sum of its projected term vectors in one or more concept domains that are similar to another concept domain that actually contains the query terms.
26. The method of claim 16 further comprising the step of presenting the identified data objects to the user ranked by concept domain.
27. A system for the retrieval of information from a collection of data objects in response to a user query comprising: means for inputting a user query; one or more data servers for storing said collection of data objects and for partitioning said collection of data objects into a plurality of sub-collections based on the conceptual dependence of the data objects within; an LSI processor hub in communication with each data server for: (i) developing a similarity graph network based on the similarity of the plurality of the partitioned sub-collections, (ii) generating a query vector based on the user query, (iii) identifying sub-collections likely to be responsive to the user query based on the similarity graph network, and (iv) coordinating the identification of data objects similar to the query vector in each selected sub-collection.
28. The system of claim 27 further comprising a means for presenting the identified data objects to the user.
29. A system for the processing of a collection of data objects for use in information retrieval and data mining operations comprising: means for generating a frequency count for each term in each data object in the collection; means for partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information; means for generating a term-by-data object matrix for each sub-collection; means for decomposing the term-by-data object matrix into a reduced singular value representation; means for determining the centroid vectors of each sub-collection; means for finding a predetermined number of terms in each sub-collection closest to the centroid vector; and, means for developing a similarity graph network to establish similarity between sub-collections.
PCT/US2004/012462 2003-05-01 2004-04-23 Information retrieval and text mining using distributed latent semantic indexing WO2004100130A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP04750497A EP1618467A4 (en) 2003-05-01 2004-04-23 Information retrieval and text mining using distributed latent semantic indexing
JP2006513228A JP4485524B2 (en) 2003-05-01 2004-04-23 Methods and systems for information retrieval and text mining using distributed latent semantic indexing
CA2523128A CA2523128C (en) 2003-05-01 2004-04-23 Information retrieval and text mining using distributed latent semantic indexing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/427,595 US7152065B2 (en) 2003-05-01 2003-05-01 Information retrieval and text mining using distributed latent semantic indexing
US10/427,595 2003-05-01

Publications (2)

Publication Number Publication Date
WO2004100130A2 true WO2004100130A2 (en) 2004-11-18
WO2004100130A3 WO2004100130A3 (en) 2005-03-24

Family ID: 33310195

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/012462 WO2004100130A2 (en) 2003-05-01 2004-04-23 Information retrieval and text mining using distributed latent semantic indexing

Country Status (6)

Country Link
US (1) US7152065B2 (en)
EP (1) EP1618467A4 (en)
JP (1) JP4485524B2 (en)
CA (1) CA2523128C (en)
TW (1) TWI242730B (en)
WO (1) WO2004100130A2 (en)


Families Citing this family (118)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8301503B2 (en) * 2001-07-17 2012-10-30 Incucomm, Inc. System and method for providing requested information to thin clients
US20040133574A1 (en) * 2003-01-07 2004-07-08 Science Applications International Corporaton Vector space method for secure information sharing
US10475116B2 (en) * 2003-06-03 2019-11-12 Ebay Inc. Method to identify a suggested location for storing a data entry in a database
US7870134B2 (en) * 2003-08-28 2011-01-11 Newvectors Llc Agent-based clustering of abstract similar documents
US8166039B1 (en) * 2003-11-17 2012-04-24 The Board Of Trustees Of The Leland Stanford Junior University System and method for encoding document ranking vectors
US8548170B2 (en) 2003-12-10 2013-10-01 Mcafee, Inc. Document de-registration
US7774604B2 (en) 2003-12-10 2010-08-10 Mcafee, Inc. Verifying captured objects before presentation
US7814327B2 (en) 2003-12-10 2010-10-12 Mcafee, Inc. Document registration
US7984175B2 (en) * 2003-12-10 2011-07-19 Mcafee, Inc. Method and apparatus for data capture and analysis system
US7899828B2 (en) 2003-12-10 2011-03-01 Mcafee, Inc. Tag data structure for maintaining relational data over captured objects
US8656039B2 (en) 2003-12-10 2014-02-18 Mcafee, Inc. Rule parser
US7930540B2 (en) 2004-01-22 2011-04-19 Mcafee, Inc. Cryptographic policy enforcement
US20070214133A1 (en) * 2004-06-23 2007-09-13 Edo Liberty Methods for filtering data and filling in missing data using nonlinear inference
US8560534B2 (en) 2004-08-23 2013-10-15 Mcafee, Inc. Database for a capture system
US7949849B2 (en) 2004-08-24 2011-05-24 Mcafee, Inc. File system for a capture system
US7529735B2 (en) * 2005-02-11 2009-05-05 Microsoft Corporation Method and system for mining information based on relationships
US20060200461A1 (en) * 2005-03-01 2006-09-07 Lucas Marshall D Process for identifying weighted contextural relationships between unrelated documents
US7684963B2 (en) * 2005-03-29 2010-03-23 International Business Machines Corporation Systems and methods of data traffic generation via density estimation using SVD
US9177248B2 (en) 2005-03-30 2015-11-03 Primal Fusion Inc. Knowledge representation systems and methods incorporating customization
US9104779B2 (en) 2005-03-30 2015-08-11 Primal Fusion Inc. Systems and methods for analyzing and synthesizing complex knowledge representations
US10002325B2 (en) 2005-03-30 2018-06-19 Primal Fusion Inc. Knowledge representation systems and methods incorporating inference rules
US9378203B2 (en) 2008-05-01 2016-06-28 Primal Fusion Inc. Methods and apparatus for providing information of interest to one or more users
US7849090B2 (en) 2005-03-30 2010-12-07 Primal Fusion Inc. System, method and computer program for faceted classification synthesis
US8849860B2 (en) 2005-03-30 2014-09-30 Primal Fusion Inc. Systems and methods for applying statistical inference techniques to knowledge representations
US8312034B2 (en) 2005-06-24 2012-11-13 Purediscovery Corporation Concept bridge and method of operating the same
US7809551B2 (en) * 2005-07-01 2010-10-05 Xerox Corporation Concept matching system
US20070016648A1 (en) * 2005-07-12 2007-01-18 Higgins Ronald C Enterprise Message Mangement
US7725485B1 (en) * 2005-08-01 2010-05-25 Google Inc. Generating query suggestions using contextual information
US7907608B2 (en) 2005-08-12 2011-03-15 Mcafee, Inc. High speed packet capture
US7818326B2 (en) 2005-08-31 2010-10-19 Mcafee, Inc. System and method for word indexing in a capture system and querying thereof
US20080215614A1 (en) * 2005-09-08 2008-09-04 Slattery Michael J Pyramid Information Quantification or PIQ or Pyramid Database or Pyramided Database or Pyramided or Selective Pressure Database Management System
US7730011B1 (en) 2005-10-19 2010-06-01 Mcafee, Inc. Attributes of captured objects in a capture system
US7657104B2 (en) 2005-11-21 2010-02-02 Mcafee, Inc. Identifying image type in a capture system
WO2007064375A2 (en) * 2005-11-30 2007-06-07 Selective, Inc. Selective latent semantic indexing method for information retrieval applications
US7756855B2 (en) * 2006-10-11 2010-07-13 Collarity, Inc. Search phrase refinement by search term replacement
US8903810B2 (en) 2005-12-05 2014-12-02 Collarity, Inc. Techniques for ranking search results
US8429184B2 (en) 2005-12-05 2013-04-23 Collarity Inc. Generation of refinement terms for search queries
US7627559B2 (en) * 2005-12-15 2009-12-01 Microsoft Corporation Context-based key phrase discovery and similarity measurement utilizing search engine query logs
US20070143307A1 (en) * 2005-12-15 2007-06-21 Bowers Matthew N Communication system employing a context engine
US7689559B2 (en) * 2006-02-08 2010-03-30 Telenor Asa Document similarity scoring and ranking method, device and computer program product
US20070220037A1 (en) * 2006-03-20 2007-09-20 Microsoft Corporation Expansion phrase database for abbreviated terms
US8504537B2 (en) 2006-03-24 2013-08-06 Mcafee, Inc. Signature distribution in a document registration system
US7716229B1 (en) 2006-03-31 2010-05-11 Microsoft Corporation Generating misspells from query log context usage
US7958227B2 (en) 2006-05-22 2011-06-07 Mcafee, Inc. Attributes of captured objects in a capture system
US7689614B2 (en) 2006-05-22 2010-03-30 Mcafee, Inc. Query generation for a capture system
CN101512521B (en) * 2006-06-02 2013-01-16 Tti发明有限责任公司 Concept based cross media indexing and retrieval of speech documents
CA2549536C (en) * 2006-06-06 2012-12-04 University Of Regina Method and apparatus for construction and use of concept knowledge base
US7752243B2 (en) * 2006-06-06 2010-07-06 University Of Regina Method and apparatus for construction and use of concept knowledge base
US7752557B2 (en) * 2006-08-29 2010-07-06 University Of Regina Method and apparatus of visual representations of search results
US7895210B2 (en) * 2006-09-29 2011-02-22 Battelle Memorial Institute Methods and apparatuses for information analysis on shared and distributed computing systems
US8442972B2 (en) 2006-10-11 2013-05-14 Collarity, Inc. Negative associations for search results ranking and refinement
US20080154878A1 (en) * 2006-12-20 2008-06-26 Rose Daniel E Diversifying a set of items
US7849104B2 (en) * 2007-03-01 2010-12-07 Microsoft Corporation Searching heterogeneous interrelated entities
US7552131B2 (en) 2007-03-05 2009-06-23 International Business Machines Corporation Autonomic retention classes
CN100442292C (en) * 2007-03-22 2008-12-10 华中科技大学 Method for indexing and acquiring semantic net information
US7636715B2 (en) * 2007-03-23 2009-12-22 Microsoft Corporation Method for fast large scale data mining using logistic regression
JP5045240B2 (en) * 2007-05-29 2012-10-10 富士通株式会社 Data division program, recording medium recording the program, data division apparatus, and data division method
US7921100B2 (en) * 2008-01-02 2011-04-05 At&T Intellectual Property I, L.P. Set similarity selection queries at interactive speeds
US8676732B2 (en) 2008-05-01 2014-03-18 Primal Fusion Inc. Methods and apparatus for providing information of interest to one or more users
CA2723179C (en) * 2008-05-01 2017-11-28 Primal Fusion Inc. Method, system, and computer program for user-driven dynamic generation of semantic networks and media synthesis
US9361365B2 (en) 2008-05-01 2016-06-07 Primal Fusion Inc. Methods and apparatus for searching of content using semantic synthesis
US8438178B2 (en) 2008-06-26 2013-05-07 Collarity Inc. Interactions among online digital identities
US8205242B2 (en) 2008-07-10 2012-06-19 Mcafee, Inc. System and method for data mining and security policy management
US9253154B2 (en) 2008-08-12 2016-02-02 Mcafee, Inc. Configuration management for a capture/registration system
EP2329406A1 (en) 2008-08-29 2011-06-08 Primal Fusion Inc. Systems and methods for semantic concept definition and semantic concept relationship synthesis utilizing existing domain definitions
US20100100547A1 (en) * 2008-10-20 2010-04-22 Flixbee, Inc. Method, system and apparatus for generating relevant informational tags via text mining
US20100114890A1 (en) * 2008-10-31 2010-05-06 Purediscovery Corporation System and Method for Discovering Latent Relationships in Data
US8850591B2 (en) * 2009-01-13 2014-09-30 Mcafee, Inc. System and method for concept building
US8706709B2 (en) 2009-01-15 2014-04-22 Mcafee, Inc. System and method for intelligent term grouping
US8473442B1 (en) 2009-02-25 2013-06-25 Mcafee, Inc. System and method for intelligent state management
US8447722B1 (en) 2009-03-25 2013-05-21 Mcafee, Inc. System and method for data mining and security policy management
US8667121B2 (en) 2009-03-25 2014-03-04 Mcafee, Inc. System and method for managing data and policies
US8533202B2 (en) 2009-07-07 2013-09-10 Yahoo! Inc. Entropy-based mixing and personalization
US8478749B2 (en) * 2009-07-20 2013-07-02 Lexisnexis, A Division Of Reed Elsevier Inc. Method and apparatus for determining relevant search results using a matrix framework
US9292855B2 (en) 2009-09-08 2016-03-22 Primal Fusion Inc. Synthesizing messaging using context provided by consumers
US9262520B2 (en) 2009-11-10 2016-02-16 Primal Fusion Inc. System, method and computer program for creating and manipulating data structures using an interactive graphical interface
JP5408658B2 (en) * 2009-11-16 2014-02-05 日本電信電話株式会社 Information consistency determination device, method and program thereof
US8428933B1 (en) 2009-12-17 2013-04-23 Shopzilla, Inc. Usage based query response
US8775160B1 (en) 2009-12-17 2014-07-08 Shopzilla, Inc. Usage based query response
US8875038B2 (en) 2010-01-19 2014-10-28 Collarity, Inc. Anchoring for content synchronization
CN102141978A (en) * 2010-02-02 2011-08-03 阿里巴巴集团控股有限公司 Method and system for classifying texts
US9235806B2 (en) 2010-06-22 2016-01-12 Primal Fusion Inc. Methods and devices for customizing knowledge representation systems
US10474647B2 (en) 2010-06-22 2019-11-12 Primal Fusion Inc. Methods and devices for customizing knowledge representation systems
US9240020B2 (en) 2010-08-24 2016-01-19 Yahoo! Inc. Method of recommending content via social signals
US8806615B2 (en) 2010-11-04 2014-08-12 Mcafee, Inc. System and method for protecting specified data combinations
US11294977B2 (en) 2011-06-20 2022-04-05 Primal Fusion Inc. Techniques for presenting content to a user based on the user's preferences
KR101776673B1 (en) * 2011-01-11 2017-09-11 삼성전자주식회사 Apparatus and method for automatically generating grammar in natural language processing
US9104749B2 (en) 2011-01-12 2015-08-11 International Business Machines Corporation Semantically aggregated index in an indexer-agnostic index building system
US9507816B2 (en) * 2011-05-24 2016-11-29 Nintendo Co., Ltd. Partitioned database model to increase the scalability of an information system
US9098575B2 (en) 2011-06-20 2015-08-04 Primal Fusion Inc. Preference-guided semantic processing
US8533195B2 (en) * 2011-06-27 2013-09-10 Microsoft Corporation Regularized latent semantic indexing for topic modeling
US8886651B1 (en) * 2011-12-22 2014-11-11 Reputation.Com, Inc. Thematic clustering
US20130246431A1 (en) 2011-12-27 2013-09-19 Mcafee, Inc. System and method for providing data protection workflows in a network environment
CN102750315B (en) * 2012-04-25 2016-03-23 北京航空航天大学 Based on the conceptual relation rapid discovery method of feature iterative search with sovereign right
US9208254B2 (en) * 2012-12-10 2015-12-08 Microsoft Technology Licensing, Llc Query and index over documents
US9355166B2 (en) * 2013-01-31 2016-05-31 Hewlett Packard Enterprise Development Lp Clustering signifiers in a semantics graph
US8788516B1 (en) * 2013-03-15 2014-07-22 Purediscovery Corporation Generating and using social brains with complimentary semantic brains and indexes
US10438254B2 (en) 2013-03-15 2019-10-08 Ebay Inc. Using plain text to list an item on a publication system
US10223401B2 (en) 2013-08-15 2019-03-05 International Business Machines Corporation Incrementally retrieving data for objects to provide a desired level of detail
US11204929B2 (en) 2014-11-18 2021-12-21 International Business Machines Corporation Evidence aggregation across heterogeneous links for intelligence gathering using a question answering system
US9892362B2 (en) 2014-11-18 2018-02-13 International Business Machines Corporation Intelligence gathering and analysis using a question answering system
US11244113B2 (en) 2014-11-19 2022-02-08 International Business Machines Corporation Evaluating evidential links based on corroboration for intelligence analysis
US10318870B2 (en) 2014-11-19 2019-06-11 International Business Machines Corporation Grading sources and managing evidence for intelligence analysis
US9727642B2 (en) 2014-11-21 2017-08-08 International Business Machines Corporation Question pruning for evaluating a hypothetical ontological link
US11836211B2 (en) 2014-11-21 2023-12-05 International Business Machines Corporation Generating additional lines of questioning based on evaluation of a hypothetical link between concept entities in evidential data
US10255323B1 (en) * 2015-08-31 2019-04-09 Google Llc Quantization-based fast inner product search
US10331659B2 (en) 2016-09-06 2019-06-25 International Business Machines Corporation Automatic detection and cleansing of erroneous concepts in an aggregated knowledge base
US10606893B2 (en) 2016-09-15 2020-03-31 International Business Machines Corporation Expanding knowledge graphs based on candidate missing edges to optimize hypothesis set adjudication
US10719509B2 (en) 2016-10-11 2020-07-21 Google Llc Hierarchical quantization for fast inner product search
TWI602068B (en) * 2016-10-17 2017-10-11 Data processing device and method thereof
TWI604322B (en) * 2016-11-10 2017-11-01 英業達股份有限公司 Solution searching system and method for operating a solution searching system
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US10552478B1 (en) 2016-12-28 2020-02-04 Shutterstock, Inc. Image search using intersected predicted queries
US10169331B2 (en) * 2017-01-29 2019-01-01 International Business Machines Corporation Text mining for automatically determining semantic relatedness
US20180341686A1 (en) * 2017-05-26 2018-11-29 Nanfang Hu System and method for data search based on top-to-bottom similarity analysis
US11163957B2 (en) 2017-06-29 2021-11-02 International Business Machines Corporation Performing semantic graph search
US20190243914A1 (en) * 2018-02-08 2019-08-08 Adam Lugowski Parallel query processing in a distributed analytics architecture
US11392596B2 (en) 2018-05-14 2022-07-19 Google Llc Efficient inner product operations

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6424971B1 (en) 1999-10-29 2002-07-23 International Business Machines Corporation System and method for interactive classification and analysis of data

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4839853A (en) * 1988-09-15 1989-06-13 Bell Communications Research, Inc. Computer information retrieval using latent semantic structure
US5301109A (en) * 1990-06-11 1994-04-05 Bell Communications Research, Inc. Computerized cross-language document retrieval using latent semantic indexing
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US6269376B1 (en) * 1998-10-26 2001-07-31 International Business Machines Corporation Method and system for clustering data in parallel in a distributed-memory multiprocessor system
US6701305B1 (en) * 1999-06-09 2004-03-02 The Boeing Company Methods, apparatus and computer program products for information retrieval and document classification utilizing a multidimensional subspace
AU2001286689A1 (en) * 2000-08-24 2002-03-04 Science Applications International Corporation Word sense disambiguation
US7024400B2 (en) * 2001-05-08 2006-04-04 Sunflare Co., Ltd. Differential LSI space-based probabilistic document classifier


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2463515A (en) * 2008-04-23 2010-03-24 British Telecomm Classification of online posts using keyword clusters derived from existing posts
US8255402B2 (en) 2008-04-23 2012-08-28 British Telecommunications Public Limited Company Method and system of classifying online data
US8825650B2 (en) 2008-04-23 2014-09-02 British Telecommunications Public Limited Company Method of classifying and sorting online content

Also Published As

Publication number Publication date
JP2006525602A (en) 2006-11-09
TWI242730B (en) 2005-11-01
US7152065B2 (en) 2006-12-19
CA2523128C (en) 2011-09-27
EP1618467A4 (en) 2008-09-17
US20040220944A1 (en) 2004-11-04
JP4485524B2 (en) 2010-06-23
TW200426627A (en) 2004-12-01
EP1618467A2 (en) 2006-01-25
WO2004100130A3 (en) 2005-03-24
CA2523128A1 (en) 2004-11-18


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DPEN Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed from 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2523128

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 2004750497

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2006513228

Country of ref document: JP

WWP Wipo information: published in national office

Ref document number: 2004750497

Country of ref document: EP