CA2523128A1 - Information retrieval and text mining using distributed latent semantic indexing

Info

Publication number: CA2523128A1
Application number: CA002523128A
Authority: CA
Inventors: Clifford A. Behrens; Devasis Bassu
Original assignee: Individual
Current assignee: Nytell Software LLC
Priority date: 2003-05-01
Filing date: 2004-04-23
Publication date: 2004-11-18
Anticipated expiration: 2024-04-23
Also published as: US20040220944A1; JP2006525602A; WO2004100130A3; WO2004100130A2; EP1618467A2; JP4485524B2; TWI242730B; CA2523128C; EP1618467A4; US7152065B2; TW200426627A

Abstract

Latent semantic indexing (LSI) for information retrieval and text mining is adapted to large heterogeneous data sets by first partitioning the data set into a number of smaller partitions having similar concept domains. A similarity graph network is generated to expose links between concept domains, and these links are then exploited both in determining which domains to query and in expanding the query vector. LSI is performed on those partitioned data sets most likely to contain information related to the user query or text mining operation. In this manner, LSI can be applied to data sets that previously presented scalability problems. Additionally, the singular value decomposition of the term-by-document matrix can be computed on various distributed computers, increasing the robustness of the retrieval and text mining system while decreasing search times.

Claims

1. A method for processing a collection of data objects for use in information retrieval and data mining operations comprising the steps of:
generating a frequency count for each term in each data object in the collection;
partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information, wherein each sub-collection is based on the conceptual dependence of the data objects within;
generating a term-by-data object matrix for each sub-collection;
decomposing the term-by-data object matrix into a reduced singular value representation;
determining the centroid vectors of each sub-collection;
finding a predetermined number of terms in each sub-collection closest to the centroid vector; and
developing a similarity graph network to establish similarity between sub-collections.
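The decomposition, centroid, and closest-term steps of claim 1 can be sketched for a single sub-collection as follows. This is a minimal illustration and not the patented implementation: it assumes NumPy, takes the centroid as the mean of the document vectors in the reduced space, and measures closeness by cosine similarity. The function name `lsi_top_terms`, the scaling of term and document vectors by the singular values, and the sample matrix are all assumptions made for the example.

```python
import numpy as np

def lsi_top_terms(A, terms, k=200, n_terms=10):
    """A is a term-by-data object count matrix (rows = terms, columns = documents).

    Returns the n_terms terms closest (by cosine) to the centroid of the
    documents in the k-dimensional reduced space, plus the centroid itself.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))                    # cannot keep more dimensions than exist
    term_vecs = U[:, :k] * s[:k]          # one row per term (assumed scaling)
    doc_vecs = Vt[:k, :].T * s[:k]        # one row per document (assumed scaling)
    centroid = doc_vecs.mean(axis=0)      # assumption: centroid = mean document vector
    norms = np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(centroid)
    sims = term_vecs @ centroid / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-sims)[:n_terms]   # indices of the most similar terms
    return [terms[i] for i in order], centroid

# Hypothetical 4-term, 3-document sub-collection.
terms = ["index", "query", "matrix", "graph"]
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])
top_terms, centroid = lsi_top_terms(A, terms, k=2, n_terms=2)
```

The defaults mirror the figures given in claims 7 and 10 (10 terms, approximately 200 dimensions); for a toy matrix like the one above, `k` is simply capped at the number of singular values.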

2. The method of claim 1 further comprising the step of preprocessing the documents to remove a pre-selected set of stop words prior to generating the term frequency count for each data object.

3. The method of claim 2 wherein the step of preprocessing further comprises the reduction of various terms to a canonical form.
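The preprocessing of claims 2 and 3, followed by the frequency count of claim 1, might look like the sketch below. The stop-word list and the crude plural-stripping rule are hypothetical stand-ins; the claims only require "a pre-selected set of stop words" and reduction "to a canonical form", and a real system would more likely use a proper stemmer or lemmatizer.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the claims do not specify one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are"}

def canonical(term: str) -> str:
    """Crude canonical form (an assumption): lowercase and strip a plural 's'."""
    term = term.lower()
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        term = term[:-1]
    return term

def term_counts(text: str) -> Counter:
    """Frequency count of canonical, non-stop-word terms in one data object."""
    tokens = re.findall(r"[A-Za-z]+", text)
    terms = (canonical(t) for t in tokens)
    return Counter(t for t in terms if t not in STOP_WORDS)

counts = term_counts("Clusters of documents and the cluster centroids")
```

Stacking one such counter per data object, with one row per distinct term, yields the term-by-data object matrix used in the later steps.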

4. The method of claim 1 wherein the step of partitioning the collection is performed using a bisecting k-means clustering algorithm.
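Claim 4 names bisecting k-means without giving details. One common variant, sketched here in plain Python, repeatedly splits the cluster with the largest within-cluster squared error using 2-means. The seeding rule (first point and the point farthest from it) and the choice of which cluster to split are assumptions made to keep the example deterministic; in the claimed method the points would be the term-frequency vectors of the data objects.

```python
def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)

def two_means(points, iters=20):
    """Split one cluster in two with plain k-means (k = 2)."""
    c0 = points[0]                                    # assumed seed 1
    c1 = max(points, key=lambda p: sq_dist(p, c0))    # assumed seed 2: farthest point
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break
        new0, new1 = centroid(left), centroid(right)
        if new0 == c0 and new1 == c1:                 # assignments have stabilized
            break
        c0, c1 = new0, new1
    return left, right

def bisecting_kmeans(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)                # split the loosest cluster
        left, right = two_means(worst)
        if not left or not right:
            break                                     # nothing left to split
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
clusters = bisecting_kmeans(points, 3)
```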

5. The method of claim 1 wherein the step of partitioning the collection is performed using a k-means clustering algorithm.

6. The method of claim 1 wherein the step of partitioning the collection is performed using hierarchical clustering.

7. The method of claim 1 wherein the predetermined number of terms is 10.

8. The method of claim 1 wherein the step of determining the centroid vectors of each sub-collection uses a clustering algorithm on the reduced singular value representation of the term-by-data object matrix for said sub-collection.

9. The method of claim 1 wherein the step of determining the centroid vectors of each sub-collection is based on the result of the partitioning step.

10. The method of claim 1 wherein the reduced singular value representation of the term-by-data object matrix for each sub-collection has approximately 200 orthogonal dimensions.

11. The method of claim 1 wherein the step of establishing similarity between sub-collections is based on the frequency of occurrence of common terms between sub-collections.

12. The method of claim 1 wherein the step of developing the similarity graph network is based on the semantic relationships between the common terms in each of the sub-collections.

13. The method of claim 1 wherein the step of developing the similarity graph network is based on the product of the frequency of occurrence of common terms between sub-collections and the semantic relationships between the common terms in each of the sub-collections.

14. The method of claim 11 wherein the step of developing the similarity graph network further comprises the steps of:
determining if a first sub-collection and a second sub-collection that have no common terms both have terms in common with one or more linking sub-collections; and
choosing the linking sub-collection having the strongest link.
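The linking step of claim 14 can be illustrated with a small sketch. The claims do not define how link strength is measured, so two assumptions are made here and should be read as placeholders: similarity between two sub-collections is the number of shared keywords divided by the size of the smaller keyword set, and the strength of an indirect link is the product of its two direct similarities. All names and the sample keyword sets are hypothetical.

```python
def overlap(a: set, b: set) -> float:
    """Assumed similarity: shared keywords / size of the smaller keyword set."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def best_link(keywords: dict, first: str, second: str):
    """Return (linking sub-collection, strength) for two sub-collections
    with no common terms, or None if they share terms or nothing links them."""
    if keywords[first] & keywords[second]:
        return None                      # directly related, no link needed
    best = None
    for name, kw in keywords.items():
        if name in (first, second):
            continue
        s_first = overlap(keywords[first], kw)
        s_second = overlap(kw, keywords[second])
        if s_first > 0 and s_second > 0:
            strength = s_first * s_second    # assumed path strength
            if best is None or strength > best[1]:
                best = (name, strength)
    return best

keywords = {
    "A": {"gene", "protein", "cell"},
    "B": {"stock", "bond", "market"},
    "C": {"gene", "market", "risk"},
    "D": {"gene", "protein", "stock"},
}
link = best_link(keywords, "A", "B")
```

In this toy data, both `C` and `D` share terms with `A` and with `B`, so each is a candidate linking sub-collection and the function keeps the stronger of the two.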

15. The method of claim 12 wherein the step of developing the similarity graph network further comprises the steps of:
determining the correlation between a first sub-collection and a second sub-collection;
permuting said first sub-collection against said second sub-collection;
computing the Mantel test statistic for each permutation;
counting the number of times where the Mantel test statistic is greater than or equal to the correlation between said first sub-collection and said second sub-collection;
determining the p-value from said count;
calculating the measure for a proximity of order zero;
calculating the measure for the first order proximity; and
determining the semantic relationship based similarity measure s2 wherein
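The permutation steps recited in claim 15 (correlation, permutation, Mantel statistic, count, p-value) follow the general shape of a Mantel test, which can be sketched in plain Python as below. This is an illustration under assumptions: the two inputs are taken to be symmetric distance matrices over the same set of common terms, one per sub-collection, the statistic is the Pearson correlation of their off-diagonal entries, and the p-value uses the common `(count + 1) / (permutations + 1)` convention. The claim does not state these details, and the proximity measures and s2 are not covered.

```python
import random
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mantel(d1, d2, permutations=999, seed=0):
    """Return (observed correlation, p-value) for two n-by-n distance matrices."""
    n = len(d1)
    pairs = list(combinations(range(n), 2))      # upper-triangle index pairs

    def stat(order):
        x = [d1[i][j] for i, j in pairs]
        y = [d2[order[i]][order[j]] for i, j in pairs]   # rows and columns permuted together
        return pearson(x, y)

    observed = stat(list(range(n)))
    rng = random.Random(seed)
    order = list(range(n))
    count = 0
    for _ in range(permutations):
        rng.shuffle(order)
        if stat(order) >= observed:              # permuted statistic at least as large
            count += 1
    p_value = (count + 1) / (permutations + 1)   # assumed convention
    return observed, p_value

# Hypothetical data: the same five terms laid out identically in both spaces.
positions = [0, 1, 3, 7, 15]
dist = [[abs(a - b) for b in positions] for a in positions]
observed, p_value = mantel(dist, dist, permutations=199)
```

A small p-value suggests the term-to-term structure in the two sub-collections is more alike than chance would explain, which is the kind of evidence a semantic-relationship measure could build on.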

16. A method of information retrieval in response to a user query from a user comprising the steps of:
partitioning a collection of data objects into a plurality of sub-collections based on conceptual dependence of data objects wherein the relationship between such sub-collections is expressed by a similarity graph network;
generating a query vector based on the user query;
identifying all sub-collections likely to be responsive to the user query using the similarity graph network; and
identifying data objects similar to the query vector in each identified sub-collection.
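The retrieval flow of claim 16 can be sketched as two functions: one that uses the similarity graph to choose sub-collections, and one that ranks data objects inside the chosen sub-collections. To stay short, this example compares raw term-count vectors by cosine similarity in place of the LSI projection the patent describes. The routing rule (sub-collections whose keywords contain a query term, plus graph neighbors above a threshold), the threshold value, and all data are assumptions.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query_terms, keywords, graph, threshold=0.5):
    """Sub-collections holding a query term, plus strongly linked neighbors."""
    direct = {name for name, kw in keywords.items() if kw & set(query_terms)}
    selected = set(direct)
    for name in direct:
        for other, weight in graph.get(name, {}).items():
            if weight >= threshold:              # assumed cut-off on edge weight
                selected.add(other)
    return selected

def search(query, collections, keywords, graph, top_n=3):
    q = Counter(query.lower().split())           # query vector as term counts
    hits = []
    for name in sorted(route(q, keywords, graph)):
        for doc_id, vec in collections[name].items():
            score = cosine(q, vec)
            if score > 0:
                hits.append((score, name, doc_id))
    hits.sort(key=lambda h: (-h[0], h[1], h[2]))
    return hits[:top_n]

# Hypothetical sub-collections, keywords, and similarity graph.
collections = {
    "bio": {"b1": Counter({"gene": 2, "protein": 1}), "b2": Counter({"cell": 3})},
    "med": {"m1": Counter({"gene": 1, "therapy": 2})},
    "fin": {"f1": Counter({"stock": 2, "market": 1})},
}
keywords = {
    "bio": {"gene", "protein", "cell"},
    "med": {"therapy", "drug"},
    "fin": {"stock", "market"},
}
graph = {"bio": {"med": 0.8, "fin": 0.1}, "med": {"bio": 0.8}, "fin": {"bio": 0.1}}
results = search("gene", collections, keywords, graph)
```

Here the `med` sub-collection does not list "gene" among its keywords and is reached only through its strong edge to `bio`, which is the role the similarity graph plays in widening a query.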

17. The method of claim 16 wherein the step of partitioning the collection of data objects further comprises the steps of:
generating a frequency count for each term in each data object in the collection;
partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information;
generating a term-by-data object matrix for each sub-collection;
decomposing the term-by-data object matrix into a reduced singular value representation;
determining the centroid vectors of each sub-collection;
finding a predetermined number of terms in each sub-collection closest to the centroid vector; and
developing a similarity graph network to establish similarity between sub-collections.

18. The method of claim 17 wherein the step of determining the centroid vectors uses a clustering algorithm on the reduced singular value representation of the term-by-data object matrix for said sub-collection.

19. The method of claim 17 wherein the step of determining the centroid vectors of each sub-collection is based on the result of the partitioning step.

20. The method of claim 17 wherein the step of developing the similarity graph network further comprises the steps of:
determining if a first sub-collection and a second sub-collection that have no common terms both have terms in common with one or more linking sub-collections; and
choosing the linking sub-collection having the strongest link.

21. The method of claim 17 wherein the step of developing the similarity graph network further comprises the steps of:
determining the correlation between a first sub-collection and a second sub-collection;
permuting said first sub-collection against said second sub-collection;
computing the Mantel test statistic for each permutation;
counting the number of times where the Mantel test statistic is greater than or equal to the correlation between said first sub-collection and said second sub-collection;
determining the p-value from said count;
calculating the measure for a proximity of order zero;
calculating the measure for the first order proximity; and
determining the semantic relationship based similarity measure s2 wherein

22. The method of claim 17 further comprising the step of preprocessing the documents to remove a pre-selected set of stop words prior to generating the term frequency count for each data object.

23. The method of claim 16 wherein the step of partitioning the collection is performed using a bisecting k-means clustering algorithm.

24. The method of claim 16 further comprising the steps of:
ranking the identified sub-collections based on the likelihood of each to contain data objects responsive to the user query;
selecting which of the ranked sub-collections to query;
presenting the ranked sub-collections to the user; and
inputting user selection of the ranked sub-collections to be queried.

25. The method of claim 16 wherein the step of generating a query vector based on the user query further comprises expanding the user query by computing the weighted sum of its projected term vectors in one or more concept domains that are similar to another concept domain that actually contains the query terms.

26. The method of claim 16 further comprising the step of presenting the identified data objects to the user ranked by concept domain.

27. A system for the retrieval of information from a collection of data objects in response to a user query comprising:
means for inputting a user query;
one or more data servers for storing said collection of data objects and for partitioning said collection of data objects into a plurality of sub-collections based on the conceptual dependence of data objects within;
an LSI processor hub in communication with each data server for: (i) developing a similarity graph network based on the similarity of the plurality of the partitioned sub-collections, (ii) generating a query vector based on the user query, (iii) identifying sub-collections likely to be responsive to the user query based on the similarity graph network; and for (iv) coordinating the identification of data objects similar to the query vector in each selected sub-collection.

28. The system of claim 27 further comprising a means for presenting the identified data objects to the user.

29. A system for the processing of a collection of data objects for use in information retrieval and data mining operations comprising:
means for generating a frequency count for each term in each data object in the collection;
means for partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information;
means for generating a term-by-data object matrix for each sub-collection;
means for decomposing the term-by-data object matrix into a reduced singular value representation;
means for determining the centroid vectors of each sub-collection;
means for finding a predetermined number of terms in each sub-collection closest to the centroid vector; and
means for developing a similarity graph network to establish similarity between sub-collections.