CA2523128A1 - Information retrieval and text mining using distributed latent semantic indexing - Google Patents
- Publication number
- CA2523128A1
- Authority
- CA
- Canada
- Prior art keywords
- sub
- collection
- collections
- term
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99935—Query augmenting and refining, e.g. inexact access
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99941—Database schema or data structure
- Y10S707/99943—Generating database or data structure, e.g. via user interface
Abstract
The use of latent semantic indexing (LSI) for information retrieval and text mining operations is adapted to work on large heterogeneous data sets by first partitioning the data set into a number of smaller partitions having similar concept domains. A similarity graph network is generated in order to expose links between concept domains which are then exploited in determining which domains to query as well as in expanding the query vector. LSI is performed on those partitioned data sets most likely to contain information related to the user query or text mining operation. In this manner LSI can be applied to datasets that heretofore presented scalability problems. Additionally, the computation of the singular value decomposition of the term-by-document matrix can be accomplished at various distributed computers increasing the robustness of the retrieval and text mining system while decreasing search times.
Claims (29)
1. A method for processing a collection of data objects for use in information retrieval and data mining operations comprising the steps of:
generating a frequency count for each term in each data object in the collection;
partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information, wherein each sub-collection is based on the conceptual dependence of the data objects within;
generating a term-by-data object matrix for each sub-collection;
decomposing the term-by-data object matrix into a reduced singular value representation;
determining the centroid vectors of each sub-collection;
finding a predetermined number of terms in each sub-collection closest to centroid vector;
and, developing a similarity graph network to establish similarity between sub-collections.
2. The method of claim 1 further comprising the step of preprocessing the documents to remove a pre-selected set of stop words prior to generating the term frequency count for each data object.
3. The method of claim 2 wherein the step of preprocessing further comprises the reduction of various terms to a canonical form.
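The preprocessing and counting steps of claims 1 to 3 can be illustrated with a minimal sketch. The stop-word list and the suffix-stripping rule below are hypothetical stand-ins chosen for illustration; the claims only require "a pre-selected set of stop words" and "a canonical form" and do not prescribe either.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the claims only require "a pre-selected set".
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def canonical(term):
    """Crude canonical form: strip one common English suffix.

    This is an assumed stand-in for a real stemmer, not the patent's rule.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def term_frequencies(text):
    """Count canonicalized, non-stop-word terms in one data object."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(canonical(t) for t in tokens if t not in STOP_WORDS)


# Toy collection of two data objects (illustrative only).
docs = {
    "d1": "Indexing the documents in a collection",
    "d2": "The index of indexed documents",
}
counts = {doc_id: term_frequencies(text) for doc_id, text in docs.items()}
print(counts)
```

The per-object counters are the raw material for the term-by-data object matrix: each distinct term becomes a row and each data object a column.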
4. The method of claim 1 wherein the step of partitioning the collection is performed using a bisecting k-means clustering algorithm.
5. The method of claim 1 wherein the step of partitioning the collection is performed using a k-means clustering algorithm.
6. The method of claim 1 wherein the step of partitioning the collection is performed using hierarchical clustering.
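As an illustration of the partitioning options in claims 4 to 6, the following is a minimal sketch of bisecting k-means on small dense vectors. It assumes Euclidean distance, always splits the largest cluster, and seeds each split deterministically; these choices are assumptions made for the example and are not taken from the claims.

```python
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(points, iterations=20):
    """Split points into two groups with plain 2-means (Euclidean)."""
    center = mean(points)
    a = max(points, key=lambda p: dist(p, center))  # seed far from the middle
    b = max(points, key=lambda p: dist(p, a))       # seed far from the first
    for _ in range(iterations):
        left = [p for p in points if dist(p, a) <= dist(p, b)]
        right = [p for p in points if dist(p, a) > dist(p, b)]
        if not left or not right:
            break  # cannot split (for example, all points identical)
        new_a, new_b = mean(left), mean(right)
        if new_a == a and new_b == b:
            break  # converged
        a, b = new_a, new_b
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        left, right = two_means(largest)
        if not left or not right:
            clusters.append(largest)
            break
        clusters += [left, right]
    return clusters


# Toy data with three visible groups (illustrative only).
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [20, 0], [21, 0]]
clusters = bisecting_kmeans(points, 3)
print(clusters)
```

Other split criteria, such as choosing the cluster with the largest within-cluster error, are equally compatible with the claim language.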
7. The method of claim 1 wherein the predetermined number of terms is 10.
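The centroid and closest-term steps can be sketched as follows. The sketch assumes that data objects and terms are already represented as vectors in the same reduced space, that the centroid is the plain mean of the data object vectors, and that closeness is cosine similarity; the vectors and names are invented for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Assumed definition: component-wise mean of the vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def closest_terms(term_vectors, center, n=10):
    """Return the n terms whose vectors are most similar to the centroid."""
    ranked = sorted(
        term_vectors,
        key=lambda term: cosine(term_vectors[term], center),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical two-dimensional reduced space for one sub-collection.
doc_vectors = [[1.0, 0.1], [0.9, 0.2]]
term_vectors = {
    "gene": [0.8, 0.1],
    "protein": [0.7, 0.3],
    "stock": [0.0, 1.0],
}
center = centroid(doc_vectors)
keywords = closest_terms(term_vectors, center, n=2)
print(keywords)
```

With `n=10` the function returns the ten keywords of claim 7; the toy data uses `n=2` only because it has three terms.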
8. The method of claim 1 wherein the step of determining the centroid vectors of each sub-collection uses a clustering algorithm on the reduced singular value representation of the term-by-data object matrix for said sub-collection.
9. The method of claim 1 wherein the step of determining the centroid vectors of each sub-collection is based on the result of the partitioning step.
10. The method of claim 1 wherein the reduced singular value representation of the term-by-data object for each sub-collection has approximately 200 orthogonal dimensions.
11. The method of claim 1 wherein the step of establishing similarity between sub-collections is based on the frequency of occurrence of common terms between sub-collections.
12. The method of claim 1 wherein the step of developing the similarity graph network is based on the semantic relationships between the common terms in each of the sub-collections.
13. The method of claim 1 wherein the step of developing the similarity graph network is based on the product of the frequency of occurrence of common terms between sub-collections and the semantic relationships between the common terms in each of the sub-collections.
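Claims 11 to 13 combine a common-term measure and a semantic measure into one edge weight. The sketch below shows only the shape of that computation: the common-term score is an assumed overlap ratio, and the semantic score is a caller-supplied placeholder, because the claims as reproduced here do not give either formula.

```python
def common_term_score(terms_a, terms_b):
    """Assumed measure: shared keywords divided by all distinct keywords."""
    common = set(terms_a) & set(terms_b)
    union = set(terms_a) | set(terms_b)
    return len(common) / len(union) if union else 0.0


def build_graph(keyword_sets, semantic_score):
    """Weight each pair of sub-collections by the product of both scores.

    semantic_score(a, b) is a hypothetical callable supplied by the caller.
    """
    edges = {}
    names = sorted(keyword_sets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            s1 = common_term_score(keyword_sets[a], keyword_sets[b])
            if s1 > 0:
                edges[(a, b)] = s1 * semantic_score(a, b)
    return edges


# Toy keyword sets for three sub-collections (illustrative only).
keyword_sets = {
    "bio": {"gene", "protein", "cell"},
    "med": {"cell", "drug", "trial"},
    "fin": {"stock", "bond"},
}
# Placeholder semantic score: a constant, purely for demonstration.
edges = build_graph(keyword_sets, lambda a, b: 0.5)
print(edges)
```

Pairs with no common keywords get no direct edge in this sketch, which is the situation the linking step of claim 14 addresses.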
14. The method of claim 11 wherein the step of developing the similarity graph network further comprises the steps of:
determining if a first sub-collection and a second sub-collection that have no common terms both have terms in common with one or more linking sub-collections;
and, choosing the linking sub-collection having the strongest link.
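The linking step of claim 14 can be sketched as a search over intermediate sub-collections. The strength of a link is assumed here to be the product of the two keyword overlaps; the claim says only "the strongest link", so this measure and all names are illustrative.

```python
def strongest_link(first, second, keyword_sets):
    """Pick the sub-collection that best bridges two unconnected ones.

    Returns None when the pair already shares terms or no bridge exists.
    Link strength is an assumed measure: overlap(first, c) * overlap(c, second).
    """
    a, b = keyword_sets[first], keyword_sets[second]
    if a & b:
        return None  # already directly linked
    best, best_strength = None, 0
    for name, terms in sorted(keyword_sets.items()):
        if name in (first, second):
            continue
        strength = len(a & terms) * len(terms & b)
        if strength > best_strength:
            best, best_strength = name, strength
    return best


# Toy keyword sets (illustrative only).
keyword_sets = {
    "genetics": {"gene", "dna", "mutation"},
    "finance": {"stock", "risk", "market"},
    "biostat": {"gene", "dna", "risk", "market"},
    "insurance": {"mutation", "risk"},
}
link = strongest_link("genetics", "finance", keyword_sets)
print(link)
```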
15. The method of claim 12 wherein the step of developing the similarity graph network further comprises the steps of:
determining the correlation between a first sub-collection and a second sub-collection;
permuting said first sub-collection against said second sub-collection;
computing the Mantel test statistic for each permutation;
counting the number of times where the Mantel test statistic is greater than or equal to the correlation between said first sub-collection and said second sub-collection;
determining the p-value from said count;
calculating the measure for a proximity of order zero;
calculating the measure for the first order proximity; and, determining the semantic relationship based similarity measure s2 wherein
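The permutation portion of claim 15 follows the general pattern of a Mantel test, sketched below for two square, symmetric matrices. The sketch assumes Pearson correlation over the upper triangles as the test statistic and the common (count + 1) / (permutations + 1) convention for the p-value; it does not reproduce the proximity measures or the formula for s2, which are not given in the text above.

```python
import random
from math import sqrt


def upper(matrix):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def mantel(m1, m2, permutations=999, seed=0):
    """Return (observed correlation, permutation p-value)."""
    rng = random.Random(seed)
    n = len(m1)
    observed = pearson(upper(m1), upper(m2))
    order = list(range(n))
    count = 0
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns of the second matrix together.
        permuted = [[m2[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper(m1), upper(permuted)) >= observed:
            count += 1
    return observed, (count + 1) / (permutations + 1)


# Toy matrices: the second is an exact multiple of the first.
positions = [0, 1, 3, 7, 15]
m1 = [[abs(a - b) for b in positions] for a in positions]
m2 = [[2 * d for d in row] for row in m1]
observed, p_value = mantel(m1, m2)
print(observed, p_value)
```

A small p-value indicates that the agreement between the two matrices is unlikely to arise from a random relabelling of rows and columns.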
16. A method of information retrieval in response to a user query from a user comprising the steps of:
partitioning a collection of data-objects into a plurality of sub-collections based on conceptual dependence of data-objects wherein the relationship between such sub-collections is expressed by a similarity graph network;
generating a query vector based on the user query;
identifying all sub-collections likely to be responsive to the user query using the similarity graph network;
and, identifying data objects similar to query vector in each identified sub-collection.
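The retrieval flow of claim 16 can be sketched end to end on toy data. For brevity the sketch scores data objects by cosine similarity on raw term counts rather than in a reduced LSI space, and it selects sub-collections by keyword match plus graph neighbours above a hypothetical `min_weight` threshold; both simplifications are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two term-count Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def select_subcollections(query_terms, keywords, graph, min_weight=0.1):
    """Sub-collections matching the query, plus their graph neighbours."""
    direct = {name for name, kw in keywords.items() if kw & query_terms}
    linked = set()
    for (a, b), weight in graph.items():
        if weight >= min_weight:
            if a in direct:
                linked.add(b)
            if b in direct:
                linked.add(a)
    return direct | linked


def search(query, subcollections, keywords, graph, top=3):
    """Rank data objects inside every selected sub-collection."""
    q = Counter(query.lower().split())
    results = {}
    for name in sorted(select_subcollections(set(q), keywords, graph)):
        scored = [
            (cosine(q, Counter(text.lower().split())), doc_id)
            for doc_id, text in subcollections[name].items()
        ]
        ranked = sorted(scored, reverse=True)
        results[name] = [doc_id for score, doc_id in ranked if score > 0][:top]
    return results


# Toy sub-collections, keywords and graph (illustrative only).
subcollections = {
    "bio": {"b1": "gene expression in cell", "b2": "protein folding"},
    "med": {"m1": "gene therapy trial", "m2": "drug dosage"},
    "fin": {"f1": "stock market"},
}
keywords = {
    "bio": {"gene", "protein", "cell"},
    "med": {"drug", "trial", "therapy"},
    "fin": {"stock", "market"},
}
graph = {("bio", "med"): 0.4}
results = search("gene", subcollections, keywords, graph)
print(results)
```

In the toy run the query term appears only among the "bio" keywords, yet "med" is also searched because the graph links the two sub-collections.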
17. The method of claim 16 wherein the step of partitioning the collection of data objects further comprises the steps of:
generating a frequency count for each term in each data object in the collection;
partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information;
generating a term-by-data object matrix for each sub-collection;
decomposing the term-by-data object matrix into a reduced singular value representation;
determining the centroid vectors of each sub-collection;
finding a predetermined number of terms in each sub-collection closest to centroid vector;
and, developing a similarity graph network to establish similarity between sub-collections.
18. The method of claim 17 wherein the step of determining the centroid vectors uses a clustering algorithm on the reduced singular value representation of the term-by-data object matrix for said sub-collection.
19. The method of claim 17 wherein the step of determining the centroid vectors of each sub-collection is based on the result of the partitioning step.
20. The method of claim 17 wherein the step of developing the similarity graph network further comprises the steps of:
determining if a first sub-collection and a second sub-collection that have no common terms both have terms in common with one or more linking sub-collections;
and, choosing the linking sub-collection having the strongest link.
21. The method of claim 17 wherein the step of developing the similarity graph network further comprises the steps of:
determining the correlation between a first sub-collection and a second sub-collection;
permuting said first sub-collection against said second sub-collection;
computing the Mantel test statistic for each permutation;
counting the number of times where the Mantel test statistic is greater than or equal to the correlation between said first sub-collection and said second sub-collection;
determining the p-value from said count;
calculating the measure for a proximity of order zero;
calculating the measure for the first order proximity; and, determining the semantic relationship based similarity measure s2 wherein
22. The method of claim 17 further comprising the step of preprocessing the documents to remove a pre-selected set of stop words prior to generating the term frequency count for each data object.
23. The method of claim 16 wherein the step of partitioning the collection is performed using a bisecting k-means clustering algorithm.
24. The method of claim 16 further comprising the steps of:
ranking the identified sub-collections based on the likelihood of each to contain data objects responsive to the user query;
selecting which of the ranked sub-collections to query;
presenting the ranked sub-collections to the user;
and, inputting user selection of the ranked sub-collections to be queried.
25. The method of claim 16 wherein the step of generating a query vector based on the user query further comprises expanding the user query by computing the weighted sum of its projected term vectors in one or more concept domains that are similar to another concept domain that actually contains the query terms.
26. The method of claim 16 further comprising the step of presenting the identified data objects to the user ranked by concept domain.
27. A system for the retrieval of information from a collection of data objects in response to a user query comprising:
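The weighted sum in claim 25 can be sketched with sparse term-weight dictionaries. The sketch assumes that each similar concept domain has already produced a projection of the query as term weights, and that each domain's weight is its similarity to the domain that contains the query terms; the projections and weights below are invented for illustration.

```python
from collections import defaultdict


def expand_query(projections, similarity):
    """Weighted sum of per-domain projected query vectors.

    projections: domain -> {term: weight} (hypothetical projected vectors)
    similarity:  domain -> similarity to the domain holding the query terms
    """
    expanded = defaultdict(float)
    for domain, term_weights in projections.items():
        w = similarity.get(domain, 0.0)
        for term, value in term_weights.items():
            expanded[term] += w * value
    return dict(expanded)


# Toy projections of the same query in two similar domains.
projections = {
    "med": {"gene": 1.0, "therapy": 0.5},
    "chem": {"gene": 0.5, "compound": 1.0},
}
similarity = {"med": 0.5, "chem": 0.25}
expanded = expand_query(projections, similarity)
print(expanded)
```

Terms that appear only in a neighbouring domain, such as "compound" in the toy data, enter the expanded query with a weight scaled by how similar that domain is.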
means for inputting a user query;
one or more data servers for storing said collection of data objects and for partitioning said collection of data objects into a plurality of sub-collections based on the conceptual dependence of data objects within;
an LSI processor hub in communication with each data server for: (i) developing a similarity graph network based on the similarity of the plurality of the partitioned sub-collections, (ii) generating a query vector based on the user query, (iii) identifying sub-collections likely to be responsive to the user query based on the similarity graph network; and for (iv) coordinating the identification of data objects similar to query vector in each selected sub-collection.
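The hub-and-server arrangement of claim 27 can be sketched as a fan-out of one query to several data servers. The `DataServer` and `Hub` classes are hypothetical, the servers are in-process objects rather than networked machines, and matching is a simple term-overlap count standing in for LSI similarity.

```python
from concurrent.futures import ThreadPoolExecutor


class DataServer:
    """Hypothetical server holding one sub-collection of data objects."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs

    def search(self, query_terms, top=2):
        # Stand-in scoring: number of query terms present in the text.
        scored = [
            (len(query_terms & set(text.lower().split())), doc_id)
            for doc_id, text in self.docs.items()
        ]
        ranked = sorted(scored, reverse=True)
        return [doc_id for score, doc_id in ranked if score > 0][:top]


class Hub:
    """Hypothetical hub that coordinates searches on selected servers."""

    def __init__(self, servers):
        self.servers = {server.name: server for server in servers}

    def query(self, text, selected):
        terms = set(text.lower().split())
        with ThreadPoolExecutor() as pool:
            futures = {
                name: pool.submit(self.servers[name].search, terms)
                for name in selected
            }
            return {name: future.result() for name, future in futures.items()}


hub = Hub([
    DataServer("bio", {"b1": "gene expression", "b2": "protein folding"}),
    DataServer("med", {"m1": "gene therapy trial", "m2": "drug therapy"}),
])
answers = hub.query("gene therapy", ["bio", "med"])
print(answers)
```

Because each server searches only its own sub-collection, the searches are independent and can run concurrently.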
28. The system of claim 27 further comprising a means for presenting the identified data objects to the user.
29. A system for the processing of a collection of data objects for use in information retrieval and data mining operations comprising:
means for generating a frequency count for each term in each data object in the collection;
means for partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information;
means for generating a term-by-data object matrix for each sub-collection;
means for decomposing the term-by-data object matrix into a reduced singular value representation;
means for determining the centroid vectors of each sub-collection;
means for finding a predetermined number of terms in each sub-collection closest to centroid vector;
and, means for developing a similarity graph network to establish similarity between sub-collections.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/427,595 US7152065B2 (en) | 2003-05-01 | 2003-05-01 | Information retrieval and text mining using distributed latent semantic indexing |
US10/427,595 | 2003-05-01 | ||
PCT/US2004/012462 WO2004100130A2 (en) | 2003-05-01 | 2004-04-23 | Information retrieval and text mining using distributed latent semantic indexing |
Publications (2)
Publication Number | Publication Date |
---|---|
CA2523128A1 (en) | 2004-11-18 |
CA2523128C (en) | 2011-09-27 |
Family
ID=33310195
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA2523128A Expired - Fee Related CA2523128C (en) | 2003-05-01 | 2004-04-23 | Information retrieval and text mining using distributed latent semantic indexing |
Country Status (6)
Country | Link |
---|---|
US (1) | US7152065B2 (en) |
EP (1) | EP1618467A4 (en) |
JP (1) | JP4485524B2 (en) |
CA (1) | CA2523128C (en) |
TW (1) | TWI242730B (en) |
WO (1) | WO2004100130A2 (en) |
Families Citing this family (120)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8301503B2 (en) * | 2001-07-17 | 2012-10-30 | Incucomm, Inc. | System and method for providing requested information to thin clients |
US20040133574A1 (en) * | 2003-01-07 | 2004-07-08 | Science Applications International Corporaton | Vector space method for secure information sharing |
US10475116B2 (en) * | 2003-06-03 | 2019-11-12 | Ebay Inc. | Method to identify a suggested location for storing a data entry in a database |
US7870134B2 (en) * | 2003-08-28 | 2011-01-11 | Newvectors Llc | Agent-based clustering of abstract similar documents |
US8166039B1 (en) * | 2003-11-17 | 2012-04-24 | The Board Of Trustees Of The Leland Stanford Junior University | System and method for encoding document ranking vectors |
US7899828B2 (en) | 2003-12-10 | 2011-03-01 | Mcafee, Inc. | Tag data structure for maintaining relational data over captured objects |
US7774604B2 (en) | 2003-12-10 | 2010-08-10 | Mcafee, Inc. | Verifying captured objects before presentation |
US7814327B2 (en) | 2003-12-10 | 2010-10-12 | Mcafee, Inc. | Document registration |
US8656039B2 (en) | 2003-12-10 | 2014-02-18 | Mcafee, Inc. | Rule parser |
US7984175B2 (en) * | 2003-12-10 | 2011-07-19 | Mcafee, Inc. | Method and apparatus for data capture and analysis system |
US8548170B2 (en) | 2003-12-10 | 2013-10-01 | Mcafee, Inc. | Document de-registration |
US7930540B2 (en) | 2004-01-22 | 2011-04-19 | Mcafee, Inc. | Cryptographic policy enforcement |
US20070214133A1 (en) * | 2004-06-23 | 2007-09-13 | Edo Liberty | Methods for filtering data and filling in missing data using nonlinear inference |
US8560534B2 (en) | 2004-08-23 | 2013-10-15 | Mcafee, Inc. | Database for a capture system |
US7949849B2 (en) | 2004-08-24 | 2011-05-24 | Mcafee, Inc. | File system for a capture system |
US7529735B2 (en) * | 2005-02-11 | 2009-05-05 | Microsoft Corporation | Method and system for mining information based on relationships |
US20060200461A1 (en) * | 2005-03-01 | 2006-09-07 | Lucas Marshall D | Process for identifying weighted contextural relationships between unrelated documents |
US7684963B2 (en) * | 2005-03-29 | 2010-03-23 | International Business Machines Corporation | Systems and methods of data traffic generation via density estimation using SVD |
US8849860B2 (en) | 2005-03-30 | 2014-09-30 | Primal Fusion Inc. | Systems and methods for applying statistical inference techniques to knowledge representations |
US9177248B2 (en) | 2005-03-30 | 2015-11-03 | Primal Fusion Inc. | Knowledge representation systems and methods incorporating customization |
US9378203B2 (en) | 2008-05-01 | 2016-06-28 | Primal Fusion Inc. | Methods and apparatus for providing information of interest to one or more users |
US10002325B2 (en) | 2005-03-30 | 2018-06-19 | Primal Fusion Inc. | Knowledge representation systems and methods incorporating inference rules |
US9104779B2 (en) | 2005-03-30 | 2015-08-11 | Primal Fusion Inc. | Systems and methods for analyzing and synthesizing complex knowledge representations |
US7849090B2 (en) | 2005-03-30 | 2010-12-07 | Primal Fusion Inc. | System, method and computer program for faceted classification synthesis |
US8312034B2 (en) * | 2005-06-24 | 2012-11-13 | Purediscovery Corporation | Concept bridge and method of operating the same |
US7809551B2 (en) * | 2005-07-01 | 2010-10-05 | Xerox Corporation | Concept matching system |
US20070016648A1 (en) * | 2005-07-12 | 2007-01-18 | Higgins Ronald C | Enterprise Message Mangement |
US7725485B1 (en) * | 2005-08-01 | 2010-05-25 | Google Inc. | Generating query suggestions using contextual information |
US7907608B2 (en) | 2005-08-12 | 2011-03-15 | Mcafee, Inc. | High speed packet capture |
US7818326B2 (en) | 2005-08-31 | 2010-10-19 | Mcafee, Inc. | System and method for word indexing in a capture system and querying thereof |
US20080215614A1 (en) * | 2005-09-08 | 2008-09-04 | Slattery Michael J | Pyramid Information Quantification or PIQ or Pyramid Database or Pyramided Database or Pyramided or Selective Pressure Database Management System |
US7730011B1 (en) | 2005-10-19 | 2010-06-01 | Mcafee, Inc. | Attributes of captured objects in a capture system |
US7657104B2 (en) | 2005-11-21 | 2010-02-02 | Mcafee, Inc. | Identifying image type in a capture system |
US7630992B2 (en) * | 2005-11-30 | 2009-12-08 | Selective, Inc. | Selective latent semantic indexing method for information retrieval applications |
US7756855B2 (en) * | 2006-10-11 | 2010-07-13 | Collarity, Inc. | Search phrase refinement by search term replacement |
US8903810B2 (en) | 2005-12-05 | 2014-12-02 | Collarity, Inc. | Techniques for ranking search results |
US8429184B2 (en) | 2005-12-05 | 2013-04-23 | Collarity Inc. | Generation of refinement terms for search queries |
US7627559B2 (en) * | 2005-12-15 | 2009-12-01 | Microsoft Corporation | Context-based key phrase discovery and similarity measurement utilizing search engine query logs |
US20070143307A1 (en) * | 2005-12-15 | 2007-06-21 | Bowers Matthew N | Communication system employing a context engine |
US7689559B2 (en) * | 2006-02-08 | 2010-03-30 | Telenor Asa | Document similarity scoring and ranking method, device and computer program product |
US20070220037A1 (en) * | 2006-03-20 | 2007-09-20 | Microsoft Corporation | Expansion phrase database for abbreviated terms |
US8504537B2 (en) | 2006-03-24 | 2013-08-06 | Mcafee, Inc. | Signature distribution in a document registration system |
US7716229B1 (en) | 2006-03-31 | 2010-05-11 | Microsoft Corporation | Generating misspells from query log context usage |
US7958227B2 (en) | 2006-05-22 | 2011-06-07 | Mcafee, Inc. | Attributes of captured objects in a capture system |
US7689614B2 (en) | 2006-05-22 | 2010-03-30 | Mcafee, Inc. | Query generation for a capture system |
CA2653932C (en) * | 2006-06-02 | 2013-03-19 | Telcordia Technologies, Inc. | Concept based cross media indexing and retrieval of speech documents |
US7752243B2 (en) * | 2006-06-06 | 2010-07-06 | University Of Regina | Method and apparatus for construction and use of concept knowledge base |
CA2549536C (en) * | 2006-06-06 | 2012-12-04 | University Of Regina | Method and apparatus for construction and use of concept knowledge base |
US7752557B2 (en) * | 2006-08-29 | 2010-07-06 | University Of Regina | Method and apparatus of visual representations of search results |
US7895210B2 (en) * | 2006-09-29 | 2011-02-22 | Battelle Memorial Institute | Methods and apparatuses for information analysis on shared and distributed computing systems |
US8442972B2 (en) | 2006-10-11 | 2013-05-14 | Collarity, Inc. | Negative associations for search results ranking and refinement |
US20080154878A1 (en) * | 2006-12-20 | 2008-06-26 | Rose Daniel E | Diversifying a set of items |
US7849104B2 (en) * | 2007-03-01 | 2010-12-07 | Microsoft Corporation | Searching heterogeneous interrelated entities |
US7552131B2 (en) | 2007-03-05 | 2009-06-23 | International Business Machines Corporation | Autonomic retention classes |
CN100442292C (en) * | 2007-03-22 | 2008-12-10 | 华中科技大学 | Method for indexing and acquiring semantic net information |
US7636715B2 (en) * | 2007-03-23 | 2009-12-22 | Microsoft Corporation | Method for fast large scale data mining using logistic regression |
JP5045240B2 (en) * | 2007-05-29 | 2012-10-10 | 富士通株式会社 | Data division program, recording medium recording the program, data division apparatus, and data division method |
US7921100B2 (en) * | 2008-01-02 | 2011-04-05 | At&T Intellectual Property I, L.P. | Set similarity selection queries at interactive speeds |
GB2463515A (en) * | 2008-04-23 | 2010-03-24 | British Telecomm | Classification of online posts using keyword clusters derived from existing posts |
GB2459476A (en) | 2008-04-23 | 2009-10-28 | British Telecomm | Classification of posts for prioritizing or grouping comments. |
US9361365B2 (en) | 2008-05-01 | 2016-06-07 | Primal Fusion Inc. | Methods and apparatus for searching of content using semantic synthesis |
JP5530425B2 (en) | 2008-05-01 | 2014-06-25 | プライマル フュージョン インコーポレイテッド | Method, system, and computer program for dynamic generation of user-driven semantic networks and media integration |
US8676732B2 (en) | 2008-05-01 | 2014-03-18 | Primal Fusion Inc. | Methods and apparatus for providing information of interest to one or more users |
US8438178B2 (en) | 2008-06-26 | 2013-05-07 | Collarity Inc. | Interactions among online digital identities |
US8205242B2 (en) | 2008-07-10 | 2012-06-19 | Mcafee, Inc. | System and method for data mining and security policy management |
US9253154B2 (en) | 2008-08-12 | 2016-02-02 | Mcafee, Inc. | Configuration management for a capture/registration system |
CN106250371A (en) | 2008-08-29 | 2016-12-21 | 启创互联公司 | For utilizing the definition of existing territory to carry out the system and method that semantic concept definition and semantic concept relation is comprehensive |
US20100100547A1 (en) * | 2008-10-20 | 2010-04-22 | Flixbee, Inc. | Method, system and apparatus for generating relevant informational tags via text mining |
US20100114890A1 (en) * | 2008-10-31 | 2010-05-06 | Purediscovery Corporation | System and Method for Discovering Latent Relationships in Data |
US8850591B2 (en) * | 2009-01-13 | 2014-09-30 | Mcafee, Inc. | System and method for concept building |
US8706709B2 (en) | 2009-01-15 | 2014-04-22 | Mcafee, Inc. | System and method for intelligent term grouping |
US8473442B1 (en) | 2009-02-25 | 2013-06-25 | Mcafee, Inc. | System and method for intelligent state management |
US8447722B1 (en) | 2009-03-25 | 2013-05-21 | Mcafee, Inc. | System and method for data mining and security policy management |
US8667121B2 (en) | 2009-03-25 | 2014-03-04 | Mcafee, Inc. | System and method for managing data and policies |
US8533202B2 (en) | 2009-07-07 | 2013-09-10 | Yahoo! Inc. | Entropy-based mixing and personalization |
US8478749B2 (en) * | 2009-07-20 | 2013-07-02 | Lexisnexis, A Division Of Reed Elsevier Inc. | Method and apparatus for determining relevant search results using a matrix framework |
US9292855B2 (en) | 2009-09-08 | 2016-03-22 | Primal Fusion Inc. | Synthesizing messaging using context provided by consumers |
US9262520B2 (en) | 2009-11-10 | 2016-02-16 | Primal Fusion Inc. | System, method and computer program for creating and manipulating data structures using an interactive graphical interface |
JP5408658B2 (en) * | 2009-11-16 | 2014-02-05 | 日本電信電話株式会社 | Information consistency determination device, method and program thereof |
US8775160B1 (en) | 2009-12-17 | 2014-07-08 | Shopzilla, Inc. | Usage based query response |
US8428933B1 (en) | 2009-12-17 | 2013-04-23 | Shopzilla, Inc. | Usage based query response |
US8875038B2 (en) | 2010-01-19 | 2014-10-28 | Collarity, Inc. | Anchoring for content synchronization |
CN102141978A (en) * | 2010-02-02 | 2011-08-03 | 阿里巴巴集团控股有限公司 | Method and system for classifying texts |
US10474647B2 (en) | 2010-06-22 | 2019-11-12 | Primal Fusion Inc. | Methods and devices for customizing knowledge representation systems |
US9235806B2 (en) | 2010-06-22 | 2016-01-12 | Primal Fusion Inc. | Methods and devices for customizing knowledge representation systems |
US9240020B2 (en) | 2010-08-24 | 2016-01-19 | Yahoo! Inc. | Method of recommending content via social signals |
US8806615B2 (en) | 2010-11-04 | 2014-08-12 | Mcafee, Inc. | System and method for protecting specified data combinations |
US11294977B2 (en) | 2011-06-20 | 2022-04-05 | Primal Fusion Inc. | Techniques for presenting content to a user based on the user's preferences |
KR101776673B1 (en) * | 2011-01-11 | 2017-09-11 | 삼성전자주식회사 | Apparatus and method for automatically generating grammar in natural language processing |
US9104749B2 (en) | 2011-01-12 | 2015-08-11 | International Business Machines Corporation | Semantically aggregated index in an indexer-agnostic index building system |
US9507816B2 (en) * | 2011-05-24 | 2016-11-29 | Nintendo Co., Ltd. | Partitioned database model to increase the scalability of an information system |
US20120324367A1 (en) | 2011-06-20 | 2012-12-20 | Primal Fusion Inc. | System and method for obtaining preferences with a user interface |
US8533195B2 (en) * | 2011-06-27 | 2013-09-10 | Microsoft Corporation | Regularized latent semantic indexing for topic modeling |
US8886651B1 (en) * | 2011-12-22 | 2014-11-11 | Reputation.Com, Inc. | Thematic clustering |
US20130246336A1 (en) | 2011-12-27 | 2013-09-19 | Mcafee, Inc. | System and method for providing data protection workflows in a network environment |
CN102750315B (en) * | 2012-04-25 | 2016-03-23 | 北京航空航天大学 | Based on the conceptual relation rapid discovery method of feature iterative search with sovereign right |
US9208254B2 (en) * | 2012-12-10 | 2015-12-08 | Microsoft Technology Licensing, Llc | Query and index over documents |
US9355166B2 (en) * | 2013-01-31 | 2016-05-31 | Hewlett Packard Enterprise Development Lp | Clustering signifiers in a semantics graph |
US8788516B1 (en) * | 2013-03-15 | 2014-07-22 | Purediscovery Corporation | Generating and using social brains with complimentary semantic brains and indexes |
US10438254B2 (en) | 2013-03-15 | 2019-10-08 | Ebay Inc. | Using plain text to list an item on a publication system |
US10223401B2 (en) * | 2013-08-15 | 2019-03-05 | International Business Machines Corporation | Incrementally retrieving data for objects to provide a desired level of detail |
US9892362B2 (en) | 2014-11-18 | 2018-02-13 | International Business Machines Corporation | Intelligence gathering and analysis using a question answering system |
US11204929B2 (en) | 2014-11-18 | 2021-12-21 | International Business Machines Corporation | Evidence aggregation across heterogeneous links for intelligence gathering using a question answering system |
US10318870B2 (en) | 2014-11-19 | 2019-06-11 | International Business Machines Corporation | Grading sources and managing evidence for intelligence analysis |
US11244113B2 (en) | 2014-11-19 | 2022-02-08 | International Business Machines Corporation | Evaluating evidential links based on corroboration for intelligence analysis |
US11836211B2 (en) | 2014-11-21 | 2023-12-05 | International Business Machines Corporation | Generating additional lines of questioning based on evaluation of a hypothetical link between concept entities in evidential data |
US9727642B2 (en) | 2014-11-21 | 2017-08-08 | International Business Machines Corporation | Question pruning for evaluating a hypothetical ontological link |
US10255323B1 (en) | 2015-08-31 | 2019-04-09 | Google Llc | Quantization-based fast inner product search |
US10331659B2 (en) | 2016-09-06 | 2019-06-25 | International Business Machines Corporation | Automatic detection and cleansing of erroneous concepts in an aggregated knowledge base |
US10606893B2 (en) | 2016-09-15 | 2020-03-31 | International Business Machines Corporation | Expanding knowledge graphs based on candidate missing edges to optimize hypothesis set adjudication |
US10719509B2 (en) | 2016-10-11 | 2020-07-21 | Google Llc | Hierarchical quantization for fast inner product search |
TWI602068B (en) * | 2016-10-17 | 2017-10-11 | | Data processing device and method thereof |
TWI604322B (en) * | 2016-11-10 | 2017-11-01 | 英業達股份有限公司 | Solution searching system and method for operating a solution searching system |
US11205103B2 (en) | 2016-12-09 | 2021-12-21 | The Research Foundation for the State University | Semisupervised autoencoder for sentiment analysis |
US10552478B1 (en) * | 2016-12-28 | 2020-02-04 | Shutterstock, Inc. | Image search using intersected predicted queries |
US10169331B2 (en) * | 2017-01-29 | 2019-01-01 | International Business Machines Corporation | Text mining for automatically determining semantic relatedness |
US20180341686A1 (en) * | 2017-05-26 | 2018-11-29 | Nanfang Hu | System and method for data search based on top-to-bottom similarity analysis |
US11163957B2 (en) | 2017-06-29 | 2021-11-02 | International Business Machines Corporation | Performing semantic graph search |
US20190243914A1 (en) * | 2018-02-08 | 2019-08-08 | Adam Lugowski | Parallel query processing in a distributed analytics architecture |
US11392596B2 (en) | 2018-05-14 | 2022-07-19 | Google Llc | Efficient inner product operations |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4839853A (en) | 1988-09-15 | 1989-06-13 | Bell Communications Research, Inc. | Computer information retrieval using latent semantic structure |
US5301109A (en) | 1990-06-11 | 1994-04-05 | Bell Communications Research, Inc. | Computerized cross-language document retrieval using latent semantic indexing |
US5675819A (en) * | 1994-06-16 | 1997-10-07 | Xerox Corporation | Document information retrieval using global word co-occurrence patterns |
US6269376B1 (en) * | 1998-10-26 | 2001-07-31 | International Business Machines Corporation | Method and system for clustering data in parallel in a distributed-memory multiprocessor system |
US6701305B1 (en) * | 1999-06-09 | 2004-03-02 | The Boeing Company | Methods, apparatus and computer program products for information retrieval and document classification utilizing a multidimensional subspace |
US6424971B1 (en) * | 1999-10-29 | 2002-07-23 | International Business Machines Corporation | System and method for interactive classification and analysis of data |
US7024407B2 (en) * | 2000-08-24 | 2006-04-04 | Content Analyst Company, Llc | Word sense disambiguation |
US7024400B2 (en) * | 2001-05-08 | 2006-04-04 | Sunflare Co., Ltd. | Differential LSI space-based probabilistic document classifier |
2003
- 2003-05-01: US application US10/427,595 (US7152065B2), not active, Expired - Lifetime
2004
- 2004-04-23: EP application EP04750497A (EP1618467A4), not active, Withdrawn
- 2004-04-23: CA application CA2523128A (CA2523128C), not active, Expired - Fee Related
- 2004-04-23: JP application JP2006513228A (JP4485524B2), not active, Expired - Fee Related
- 2004-04-23: WO application PCT/US2004/012462 (WO2004100130A2), active, Application Filing
- 2004-04-30: TW application TW093112343A (TWI242730B), active
Also Published As
Publication number | Publication date |
---|---|
US20040220944A1 (en) | 2004-11-04 |
JP2006525602A (en) | 2006-11-09 |
WO2004100130A3 (en) | 2005-03-24 |
WO2004100130A2 (en) | 2004-11-18 |
EP1618467A2 (en) | 2006-01-25 |
JP4485524B2 (en) | 2010-06-23 |
TWI242730B (en) | 2005-11-01 |
CA2523128C (en) | 2011-09-27 |
EP1618467A4 (en) | 2008-09-17 |
US7152065B2 (en) | 2006-12-19 |
TW200426627A (en) | 2004-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2523128A1 (en) | | Information retrieval and text mining using distributed latent semantic indexing |
US8073818B2 (en) | | Co-location visual pattern mining for near-duplicate image retrieval |
US8266121B2 (en) | | Identifying related objects using quantum clustering |
Huang | | Similarity measures for text document clustering |
CN101449271B (en) | | Annotated by search |
US7167823B2 (en) | | Multimedia information retrieval method, program, record medium and system |
US8359282B2 (en) | | Supervised semantic indexing and its extensions |
Boley et al. | | Partitioning-based clustering for web document categorization |
Doulamis et al. | | Event detection in twitter microblogging |
Alguliev et al. | | GenDocSum+ MCLR: Generic document summarization based on maximum coverage and less redundancy |
CN109885773A (en) | | A kind of article personalized recommendation method, system, medium and equipment |
JP2003030222A (en) | | Method and system for retrieving, detecting and identifying main cluster and outlier cluster in large scale database, recording medium and server |
Alguliev et al. | | Formulation of document summarization as a 0–1 nonlinear programming problem |
Song et al. | | Probabilistic correlation-based similarity measure on text records |
Lee et al. | | Efficient image retrieval using advanced SURF and DCD on mobile platform |
Jung | | Exploiting geotagged resources for spatial clustering on social network services |
Hare et al. | | Saliency-based models of image content and their application to auto-annotation by semantic propagation |
Bolelli et al. | | Clustering scientific literature using sparse citation graph analysis |
Wallace et al. | | Towards a context aware mining of user interests for consumption of multimedia documents |
Parekh et al. | | Web usage mining: frequent pattern generation using association rule mining and clustering |
CN111143400A (en) | | Full-stack type retrieval method, system, engine and electronic equipment |
Krusche et al. | | Efficient longest common subsequence computation using bulk-synchronous parallelism |
Husbands et al. | | Term norm distribution and its effects on latent semantic indexing |
Chaudhary et al. | | A novel multimodal clustering framework for images with diverse associated text |
CN114610859A (en) | | Product recommendation method, device and equipment based on content and collaborative filtering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | EEER | Examination request | |
 | MKLA | Lapsed | Effective date: 20130423 |