CA2523128A1 - Information retrieval and text mining using distributed latent semantic indexing - Google Patents

Information retrieval and text mining using distributed latent semantic indexing

Info

Publication number
CA2523128A1
CA2523128A1 CA002523128A CA2523128A CA2523128A1 CA 2523128 A1 CA2523128 A1 CA 2523128A1 CA 002523128 A CA002523128 A CA 002523128A CA 2523128 A CA2523128 A CA 2523128A CA 2523128 A1 CA2523128 A1 CA 2523128A1
Authority
CA
Canada
Prior art keywords
sub
collection
collections
term
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CA002523128A
Other languages
French (fr)
Other versions
CA2523128C (en)
Inventor
Clifford A. Behrens
Devasis Bassu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nytell Software LLC
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Publication of CA2523128A1
Application granted
Publication of CA2523128C
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99943 Generating database or data structure, e.g. via user interface

Abstract

The use of latent semantic indexing (LSI) for information retrieval and text mining operations is adapted to work on large heterogeneous data sets by first partitioning the data set into a number of smaller partitions having similar concept domains. A similarity graph network is generated in order to expose links between concept domains, which are then exploited in determining which domains to query as well as in expanding the query vector. LSI is performed on those partitioned data sets most likely to contain information related to the user query or text mining operation. In this manner, LSI can be applied to data sets that heretofore presented scalability problems. Additionally, the computation of the singular value decomposition of the term-by-document matrix can be accomplished at various distributed computers, increasing the robustness of the retrieval and text mining system while decreasing search times.
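The routing idea in the abstract (link the partitions in a similarity graph, then use the graph to decide which partitions to query) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the function names `top_terms`, `build_similarity_graph`, and `domains_to_query` are hypothetical, whitespace tokenization is used, raw term frequency stands in for "terms closest to the centroid" in the reduced LSI space, and the edge weight is simply the number of shared representative terms.

```python
from collections import Counter


def top_terms(docs, n=10):
    """Representative terms of one sub-collection.

    Assumption: the n most frequent terms stand in for the terms closest
    to the sub-collection centroid in the reduced LSI space.
    """
    counts = Counter(term for doc in docs for term in doc.lower().split())
    return {term for term, _ in counts.most_common(n)}


def build_similarity_graph(sub_collections, n=10):
    """Link sub-collections that share representative terms.

    Returns (reps, graph): reps maps each name to its representative terms,
    graph maps each name to {neighbour: number_of_shared_terms}.
    """
    reps = {name: top_terms(docs, n) for name, docs in sub_collections.items()}
    graph = {name: {} for name in reps}
    names = sorted(reps)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = len(reps[a] & reps[b])
            if shared:
                graph[a][b] = shared
                graph[b][a] = shared
    return reps, graph


def domains_to_query(query, reps, graph):
    """Sub-collections that contain a query term, plus their graph neighbours."""
    terms = set(query.lower().split())
    direct = {name for name, rep in reps.items() if terms & rep}
    linked = {nb for name in direct for nb in graph[name]}
    return sorted(direct | linked)


# Hypothetical toy data: three already-partitioned sub-collections.
sub_collections = {
    "astronomy": ["star orbit telescope", "planet orbit star"],
    "aerospace": ["rocket orbit launch", "rocket engine fuel"],
    "cooking": ["recipe oven flour", "flour sugar recipe"],
}
reps, graph = build_similarity_graph(sub_collections)
print(domains_to_query("telescope", reps, graph))  # -> ['aerospace', 'astronomy']
```

In this toy run the query term appears only in the astronomy partition, but the shared term "orbit" links astronomy to aerospace, so both are selected while the unrelated cooking partition is skipped. A fuller implementation would run LSI only on the selected partitions.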

Claims (29)

1. A method for processing a collection of data objects for use in information retrieval and data mining operations comprising the steps of:
generating a frequency count for each term in each data object in the collection;
partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information, wherein each sub-collection is based on the conceptual dependence of the data objects within;
generating a term-by-data object matrix for each sub-collection;
decomposing the term-by-data object matrix into a reduced singular value representation;
determining the centroid vectors of each sub-collection;
finding a predetermined number of terms in each sub-collection closest to centroid vector; and,
developing a similarity graph network to establish similarity between sub-collections.
2. The method of claim 1 further comprising the step of preprocessing the documents to remove a pre-selected set of stop words prior to generating the term frequency count for each data object.
3. The method of claim 2 wherein the step of preprocessing further comprises the reduction of various terms to a canonical form.
4. The method of claim 1 wherein the step of partitioning the collection is performed using a bisecting k-means clustering algorithm.
5. The method of claim 1 wherein the step of partitioning the collection is performed using a k-means clustering algorithm.
6. The method of claim 1 wherein the step of partitioning the collection is performed using hierarchical clustering.
7. The method of claim 1 wherein the predetermined number of terms is 10.
8. The method of claim 1 wherein the step of determining the centroid vectors of each sub-collection uses a clustering algorithm on the reduced singular value representation of the term-by-data object matrix for said sub-collection.
9. The method of claim 1 wherein the step of determining the centroid vectors of each sub-collection is based on the result of the partitioning step.
10. The method of claim 1 wherein the reduced singular value representation of the term-by-data object for each sub-collection has approximately 200 orthogonal dimensions.
11. The method of claim 1 wherein the step of establishing similarity between sub-collections is based on the frequency of occurrence of common terms between sub-collections.
12. The method of claim 1 wherein the step of developing the similarity graph network is based on the semantic relationships between the common terms in each of the sub-collections.
13. The method of claim 1 wherein the step of developing the similarity graph network is based on the product of the frequency of occurrence of common terms between sub-collections and the semantic relationships between the common terms in each of the sub-collections.
14. The method of claim 11 wherein the step of developing the similarity graph network further comprises the steps of:
determining if a first sub-collection and a second sub-collection that have no common terms both have terms in common with one or more linking sub-collections; and,
choosing the linking sub-collection having the strongest link.
15. The method of claim 12 wherein the step of developing the similarity graph network further comprises the steps of:
determining the correlation between a first sub-collection and a second sub-collection;
permuting said first sub-collection against said second sub-collection;
computing the Mantel test statistic for each permutation;
counting the number of times where the Mantel test statistic is greater than or equal to the correlation between said first sub-collection and said second sub-collection;
determining the p-value from said count;
calculating the measure for a proximity of order zero;
calculating the measure for the first order proximity; and,
determining the semantic relationship based similarity measure s2 wherein
16. A method of information retrieval in response to a user query from a user comprising the steps of:
partitioning a collection of data-objects into a plurality of sub-collections based on conceptual dependence of data-objects wherein the relationship between such sub-collections is expressed by a similarity graph network;
generating a query vector based on the user query;
identifying all sub-collections likely to be responsive to the user query using the similarity graph network; and,
identifying data objects similar to query vector in each identified sub-collection.
17. The method of claim 16 wherein the step of partitioning the collection of data objects further comprises the steps of:
generating a frequency count for each term in each data object in the collection;
partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information;
generating a term-by-data object matrix for each sub-collection;
decomposing the term-by-data object matrix into a reduced singular value representation;
determining the centroid vectors of each sub-collection;
finding a predetermined number of terms in each sub-collection closest to centroid vector; and,
developing a similarity graph network to establish similarity between sub-collections.
18. The method of claim 17 wherein the step of determining the centroid vectors uses a clustering algorithm on the reduced singular value representation of the term-by-data object matrix for said sub-collection.
19. The method of claim 17 wherein the step of determining the centroid vectors of each sub-collection is based on the result of the partitioning step.
20. The method of claim 17 wherein the step of developing the similarity graph network further comprises the steps of:
determining if a first sub-collection and a second sub-collection that have no common terms both have terms in common with one or more linking sub-collections; and,
choosing the linking sub-collection having the strongest link.
21. The method of claim 17 wherein the step of developing the similarity graph network further comprises the steps of:
determining the correlation between a first sub-collection and a second sub-collection;
permuting said first sub-collection against said second sub-collection;
computing the Mantel test statistic for each permutation;
counting the number of times where the Mantel test statistic is greater than or equal to the correlation between said first sub-collection and said second sub-collection;
determining the p-value from said count;
calculating the measure for a proximity of order zero;
calculating the measure for the first order proximity; and,
determining the semantic relationship based similarity measure s2 wherein
22. The method of claim 17 further comprising the step of preprocessing the documents to remove a pre-selected set of stop words prior to generating the term frequency count for each data object.
23. The method of claim 16 wherein the step of partitioning the collection is performed using a bisecting k-means clustering algorithm.
24. The method of claim 16 further comprising the steps of:
ranking the identified sub-collections based on the likelihood of each to contain data objects responsive to the user query;
selecting which of the ranked sub-collections to query;
presenting the ranked sub-collections to the user; and,
inputting user selection of the ranked sub-collections to be queried.
25. The method of claim 16 wherein the step of generating a query vector based on the user query further comprises expanding the user query by computing the weighted sum of its projected term vectors in one or more concept domains that are similar to another concept domain that actually contains the query terms.
26. The method of claim 16 further comprising the step of presenting the identified data objects to the user ranked by concept domain.
27. A system for the retrieval of information from a collection of data objects in response to a user query comprising:
means for inputting a user query;
one or more data servers for storing said collection of data objects and for partitioning said collection of data objects into a plurality of sub-collections based on the conceptual dependence of data objects within;
an LSI processor hub in communication with each data server for: (i) developing a similarity graph network based on the similarity of the plurality of the partitioned sub-collections, (ii) generating a query vector based on the user query, (iii) identifying sub-collections likely to be responsive to the user query based on the similarity graph network; and for (iv) coordinating the identification of data objects similar to query vector in each selected sub-collection.
28. The system of claim 27 further comprising a means for presenting the identified data objects to the user.
29. A system for the processing of a collection of data objects for use in information retrieval and data mining operations comprising:
means for generating a frequency count for each term in each data object in the collection;
means for partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information;
means for generating a term-by-data object matrix for each sub-collection;
means for decomposing the term-by-data object matrix into a reduced singular value representation;
means for determining the centroid vectors of each sub-collection;
means for finding a predetermined number of terms in each sub-collection closest to centroid vector; and,
means for developing a similarity graph network to establish similarity between sub-collections.
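The permutation procedure recited in claims 15 and 21 (compute the observed correlation, permute one sub-collection against the other, recompute the Mantel statistic each time, count the permutations that reach the observed value, and derive a p-value from the count) can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the patented method: the inputs are assumed to be symmetric term-by-term distance matrices over the terms common to both sub-collections, the Pearson correlation of the upper triangles is used as the Mantel statistic, and the p-value convention `(count + 1) / (permutations + 1)` is a common choice that the claims do not specify. The function names are hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def upper_triangle(matrix, order):
    """Upper-triangle entries after relabelling rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(a, b, permutations=999, seed=0):
    """Permutation test for correlation between two symmetric matrices.

    Returns (observed_correlation, p_value).
    """
    identity = list(range(len(a)))
    b_flat = upper_triangle(b, identity)
    observed = pearson(upper_triangle(a, identity), b_flat)

    rng = random.Random(seed)  # seeded so the sketch is reproducible
    count = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute the first matrix against the second
        if pearson(upper_triangle(a, order), b_flat) >= observed:
            count += 1

    # Assumed convention: the observed arrangement counts as one permutation.
    p_value = (count + 1) / (permutations + 1)
    return observed, p_value
```

A small p-value suggests that the term relationships in the two sub-collections are correlated more strongly than chance relabelling would produce. How that p-value feeds into the order-zero and first-order proximity measures and the similarity measure s2 is not shown here, because the defining expression is not reproduced in the claim text above.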
CA2523128A 2003-05-01 2004-04-23 Information retrieval and text mining using distributed latent semantic indexing Expired - Fee Related CA2523128C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US10/427,595 US7152065B2 (en) 2003-05-01 2003-05-01 Information retrieval and text mining using distributed latent semantic indexing
US10/427,595 2003-05-01
PCT/US2004/012462 WO2004100130A2 (en) 2003-05-01 2004-04-23 Information retrieval and text mining using distributed latent semantic indexing

Publications (2)

Publication Number Publication Date
CA2523128A1 (en) 2004-11-18
CA2523128C (en) 2011-09-27

Family

ID=33310195

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2523128A Expired - Fee Related CA2523128C (en) 2003-05-01 2004-04-23 Information retrieval and text mining using distributed latent semantic indexing

Country Status (6)

Country Link
US (1) US7152065B2 (en)
EP (1) EP1618467A4 (en)
JP (1) JP4485524B2 (en)
CA (1) CA2523128C (en)
TW (1) TWI242730B (en)
WO (1) WO2004100130A2 (en)

Families Citing this family (120)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8301503B2 (en) * 2001-07-17 2012-10-30 Incucomm, Inc. System and method for providing requested information to thin clients
US20040133574A1 (en) * 2003-01-07 2004-07-08 Science Applications International Corporaton Vector space method for secure information sharing
US10475116B2 (en) * 2003-06-03 2019-11-12 Ebay Inc. Method to identify a suggested location for storing a data entry in a database
US7870134B2 (en) * 2003-08-28 2011-01-11 Newvectors Llc Agent-based clustering of abstract similar documents
US8166039B1 (en) * 2003-11-17 2012-04-24 The Board Of Trustees Of The Leland Stanford Junior University System and method for encoding document ranking vectors
US7899828B2 (en) 2003-12-10 2011-03-01 Mcafee, Inc. Tag data structure for maintaining relational data over captured objects
US7774604B2 (en) 2003-12-10 2010-08-10 Mcafee, Inc. Verifying captured objects before presentation
US7814327B2 (en) 2003-12-10 2010-10-12 Mcafee, Inc. Document registration
US8656039B2 (en) 2003-12-10 2014-02-18 Mcafee, Inc. Rule parser
US7984175B2 (en) * 2003-12-10 2011-07-19 Mcafee, Inc. Method and apparatus for data capture and analysis system
US8548170B2 (en) 2003-12-10 2013-10-01 Mcafee, Inc. Document de-registration
US7930540B2 (en) 2004-01-22 2011-04-19 Mcafee, Inc. Cryptographic policy enforcement
US20070214133A1 (en) * 2004-06-23 2007-09-13 Edo Liberty Methods for filtering data and filling in missing data using nonlinear inference
US8560534B2 (en) 2004-08-23 2013-10-15 Mcafee, Inc. Database for a capture system
US7949849B2 (en) 2004-08-24 2011-05-24 Mcafee, Inc. File system for a capture system
US7529735B2 (en) * 2005-02-11 2009-05-05 Microsoft Corporation Method and system for mining information based on relationships
US20060200461A1 (en) * 2005-03-01 2006-09-07 Lucas Marshall D Process for identifying weighted contextural relationships between unrelated documents
US7684963B2 (en) * 2005-03-29 2010-03-23 International Business Machines Corporation Systems and methods of data traffic generation via density estimation using SVD
US8849860B2 (en) 2005-03-30 2014-09-30 Primal Fusion Inc. Systems and methods for applying statistical inference techniques to knowledge representations
US9177248B2 (en) 2005-03-30 2015-11-03 Primal Fusion Inc. Knowledge representation systems and methods incorporating customization
US9378203B2 (en) 2008-05-01 2016-06-28 Primal Fusion Inc. Methods and apparatus for providing information of interest to one or more users
US10002325B2 (en) 2005-03-30 2018-06-19 Primal Fusion Inc. Knowledge representation systems and methods incorporating inference rules
US9104779B2 (en) 2005-03-30 2015-08-11 Primal Fusion Inc. Systems and methods for analyzing and synthesizing complex knowledge representations
US7849090B2 (en) 2005-03-30 2010-12-07 Primal Fusion Inc. System, method and computer program for faceted classification synthesis
US8312034B2 (en) * 2005-06-24 2012-11-13 Purediscovery Corporation Concept bridge and method of operating the same
US7809551B2 (en) * 2005-07-01 2010-10-05 Xerox Corporation Concept matching system
US20070016648A1 (en) * 2005-07-12 2007-01-18 Higgins Ronald C Enterprise Message Mangement
US7725485B1 (en) * 2005-08-01 2010-05-25 Google Inc. Generating query suggestions using contextual information
US7907608B2 (en) 2005-08-12 2011-03-15 Mcafee, Inc. High speed packet capture
US7818326B2 (en) 2005-08-31 2010-10-19 Mcafee, Inc. System and method for word indexing in a capture system and querying thereof
US20080215614A1 (en) * 2005-09-08 2008-09-04 Slattery Michael J Pyramid Information Quantification or PIQ or Pyramid Database or Pyramided Database or Pyramided or Selective Pressure Database Management System
US7730011B1 (en) 2005-10-19 2010-06-01 Mcafee, Inc. Attributes of captured objects in a capture system
US7657104B2 (en) 2005-11-21 2010-02-02 Mcafee, Inc. Identifying image type in a capture system
US7630992B2 (en) * 2005-11-30 2009-12-08 Selective, Inc. Selective latent semantic indexing method for information retrieval applications
US7756855B2 (en) * 2006-10-11 2010-07-13 Collarity, Inc. Search phrase refinement by search term replacement
US8903810B2 (en) 2005-12-05 2014-12-02 Collarity, Inc. Techniques for ranking search results
US8429184B2 (en) 2005-12-05 2013-04-23 Collarity Inc. Generation of refinement terms for search queries
US7627559B2 (en) * 2005-12-15 2009-12-01 Microsoft Corporation Context-based key phrase discovery and similarity measurement utilizing search engine query logs
US20070143307A1 (en) * 2005-12-15 2007-06-21 Bowers Matthew N Communication system employing a context engine
US7689559B2 (en) * 2006-02-08 2010-03-30 Telenor Asa Document similarity scoring and ranking method, device and computer program product
US20070220037A1 (en) * 2006-03-20 2007-09-20 Microsoft Corporation Expansion phrase database for abbreviated terms
US8504537B2 (en) 2006-03-24 2013-08-06 Mcafee, Inc. Signature distribution in a document registration system
US7716229B1 (en) 2006-03-31 2010-05-11 Microsoft Corporation Generating misspells from query log context usage
US7958227B2 (en) 2006-05-22 2011-06-07 Mcafee, Inc. Attributes of captured objects in a capture system
US7689614B2 (en) 2006-05-22 2010-03-30 Mcafee, Inc. Query generation for a capture system
CA2653932C (en) * 2006-06-02 2013-03-19 Telcordia Technologies, Inc. Concept based cross media indexing and retrieval of speech documents
US7752243B2 (en) * 2006-06-06 2010-07-06 University Of Regina Method and apparatus for construction and use of concept knowledge base
CA2549536C (en) * 2006-06-06 2012-12-04 University Of Regina Method and apparatus for construction and use of concept knowledge base
US7752557B2 (en) * 2006-08-29 2010-07-06 University Of Regina Method and apparatus of visual representations of search results
US7895210B2 (en) * 2006-09-29 2011-02-22 Battelle Memorial Institute Methods and apparatuses for information analysis on shared and distributed computing systems
US8442972B2 (en) 2006-10-11 2013-05-14 Collarity, Inc. Negative associations for search results ranking and refinement
US20080154878A1 (en) * 2006-12-20 2008-06-26 Rose Daniel E Diversifying a set of items
US7849104B2 (en) * 2007-03-01 2010-12-07 Microsoft Corporation Searching heterogeneous interrelated entities
US7552131B2 (en) 2007-03-05 2009-06-23 International Business Machines Corporation Autonomic retention classes
CN100442292C (en) * 2007-03-22 2008-12-10 华中科技大学 Method for indexing and acquiring semantic net information
US7636715B2 (en) * 2007-03-23 2009-12-22 Microsoft Corporation Method for fast large scale data mining using logistic regression
JP5045240B2 (en) * 2007-05-29 2012-10-10 富士通株式会社 Data division program, recording medium recording the program, data division apparatus, and data division method
US7921100B2 (en) * 2008-01-02 2011-04-05 At&T Intellectual Property I, L.P. Set similarity selection queries at interactive speeds
GB2463515A (en) * 2008-04-23 2010-03-24 British Telecomm Classification of online posts using keyword clusters derived from existing posts
GB2459476A (en) 2008-04-23 2009-10-28 British Telecomm Classification of posts for prioritizing or grouping comments.
US9361365B2 (en) 2008-05-01 2016-06-07 Primal Fusion Inc. Methods and apparatus for searching of content using semantic synthesis
JP5530425B2 (en) 2008-05-01 2014-06-25 プライマル フュージョン インコーポレイテッド Method, system, and computer program for dynamic generation of user-driven semantic networks and media integration
US8676732B2 (en) 2008-05-01 2014-03-18 Primal Fusion Inc. Methods and apparatus for providing information of interest to one or more users
US8438178B2 (en) 2008-06-26 2013-05-07 Collarity Inc. Interactions among online digital identities
US8205242B2 (en) 2008-07-10 2012-06-19 Mcafee, Inc. System and method for data mining and security policy management
US9253154B2 (en) 2008-08-12 2016-02-02 Mcafee, Inc. Configuration management for a capture/registration system
CN106250371A (en) 2008-08-29 2016-12-21 启创互联公司 For utilizing the definition of existing territory to carry out the system and method that semantic concept definition and semantic concept relation is comprehensive
US20100100547A1 (en) * 2008-10-20 2010-04-22 Flixbee, Inc. Method, system and apparatus for generating relevant informational tags via text mining
US20100114890A1 (en) * 2008-10-31 2010-05-06 Purediscovery Corporation System and Method for Discovering Latent Relationships in Data
US8850591B2 (en) * 2009-01-13 2014-09-30 Mcafee, Inc. System and method for concept building
US8706709B2 (en) 2009-01-15 2014-04-22 Mcafee, Inc. System and method for intelligent term grouping
US8473442B1 (en) 2009-02-25 2013-06-25 Mcafee, Inc. System and method for intelligent state management
US8447722B1 (en) 2009-03-25 2013-05-21 Mcafee, Inc. System and method for data mining and security policy management
US8667121B2 (en) 2009-03-25 2014-03-04 Mcafee, Inc. System and method for managing data and policies
US8533202B2 (en) 2009-07-07 2013-09-10 Yahoo! Inc. Entropy-based mixing and personalization
US8478749B2 (en) * 2009-07-20 2013-07-02 Lexisnexis, A Division Of Reed Elsevier Inc. Method and apparatus for determining relevant search results using a matrix framework
US9292855B2 (en) 2009-09-08 2016-03-22 Primal Fusion Inc. Synthesizing messaging using context provided by consumers
US9262520B2 (en) 2009-11-10 2016-02-16 Primal Fusion Inc. System, method and computer program for creating and manipulating data structures using an interactive graphical interface
JP5408658B2 (en) * 2009-11-16 2014-02-05 日本電信電話株式会社 Information consistency determination device, method and program thereof
US8775160B1 (en) 2009-12-17 2014-07-08 Shopzilla, Inc. Usage based query response
US8428933B1 (en) 2009-12-17 2013-04-23 Shopzilla, Inc. Usage based query response
US8875038B2 (en) 2010-01-19 2014-10-28 Collarity, Inc. Anchoring for content synchronization
CN102141978A (en) * 2010-02-02 2011-08-03 阿里巴巴集团控股有限公司 Method and system for classifying texts
US10474647B2 (en) 2010-06-22 2019-11-12 Primal Fusion Inc. Methods and devices for customizing knowledge representation systems
US9235806B2 (en) 2010-06-22 2016-01-12 Primal Fusion Inc. Methods and devices for customizing knowledge representation systems
US9240020B2 (en) 2010-08-24 2016-01-19 Yahoo! Inc. Method of recommending content via social signals
US8806615B2 (en) 2010-11-04 2014-08-12 Mcafee, Inc. System and method for protecting specified data combinations
US11294977B2 (en) 2011-06-20 2022-04-05 Primal Fusion Inc. Techniques for presenting content to a user based on the user's preferences
KR101776673B1 (en) * 2011-01-11 2017-09-11 삼성전자주식회사 Apparatus and method for automatically generating grammar in natural language processing
US9104749B2 (en) 2011-01-12 2015-08-11 International Business Machines Corporation Semantically aggregated index in an indexer-agnostic index building system
US9507816B2 (en) * 2011-05-24 2016-11-29 Nintendo Co., Ltd. Partitioned database model to increase the scalability of an information system
US20120324367A1 (en) 2011-06-20 2012-12-20 Primal Fusion Inc. System and method for obtaining preferences with a user interface
US8533195B2 (en) * 2011-06-27 2013-09-10 Microsoft Corporation Regularized latent semantic indexing for topic modeling
US8886651B1 (en) * 2011-12-22 2014-11-11 Reputation.Com, Inc. Thematic clustering
US20130246336A1 (en) 2011-12-27 2013-09-19 Mcafee, Inc. System and method for providing data protection workflows in a network environment
CN102750315B (en) * 2012-04-25 2016-03-23 北京航空航天大学 Based on the conceptual relation rapid discovery method of feature iterative search with sovereign right
US9208254B2 (en) * 2012-12-10 2015-12-08 Microsoft Technology Licensing, Llc Query and index over documents
US9355166B2 (en) * 2013-01-31 2016-05-31 Hewlett Packard Enterprise Development Lp Clustering signifiers in a semantics graph
US8788516B1 (en) * 2013-03-15 2014-07-22 Purediscovery Corporation Generating and using social brains with complimentary semantic brains and indexes
US10438254B2 (en) 2013-03-15 2019-10-08 Ebay Inc. Using plain text to list an item on a publication system
US10223401B2 (en) * 2013-08-15 2019-03-05 International Business Machines Corporation Incrementally retrieving data for objects to provide a desired level of detail
US9892362B2 (en) 2014-11-18 2018-02-13 International Business Machines Corporation Intelligence gathering and analysis using a question answering system
US11204929B2 (en) 2014-11-18 2021-12-21 International Business Machines Corporation Evidence aggregation across heterogeneous links for intelligence gathering using a question answering system
US10318870B2 (en) 2014-11-19 2019-06-11 International Business Machines Corporation Grading sources and managing evidence for intelligence analysis
US11244113B2 (en) 2014-11-19 2022-02-08 International Business Machines Corporation Evaluating evidential links based on corroboration for intelligence analysis
US11836211B2 (en) 2014-11-21 2023-12-05 International Business Machines Corporation Generating additional lines of questioning based on evaluation of a hypothetical link between concept entities in evidential data
US9727642B2 (en) 2014-11-21 2017-08-08 International Business Machines Corporation Question pruning for evaluating a hypothetical ontological link
US10255323B1 (en) 2015-08-31 2019-04-09 Google Llc Quantization-based fast inner product search
US10331659B2 (en) 2016-09-06 2019-06-25 International Business Machines Corporation Automatic detection and cleansing of erroneous concepts in an aggregated knowledge base
US10606893B2 (en) 2016-09-15 2020-03-31 International Business Machines Corporation Expanding knowledge graphs based on candidate missing edges to optimize hypothesis set adjudication
US10719509B2 (en) 2016-10-11 2020-07-21 Google Llc Hierarchical quantization for fast inner product search
TWI602068B (en) * 2016-10-17 2017-10-11 Data processing device and method thereof
TWI604322B (en) * 2016-11-10 2017-11-01 英業達股份有限公司 Solution searching system and method for operating a solution searching system
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US10552478B1 (en) * 2016-12-28 2020-02-04 Shutterstock, Inc. Image search using intersected predicted queries
US10169331B2 (en) * 2017-01-29 2019-01-01 International Business Machines Corporation Text mining for automatically determining semantic relatedness
US20180341686A1 (en) * 2017-05-26 2018-11-29 Nanfang Hu System and method for data search based on top-to-bottom similarity analysis
US11163957B2 (en) 2017-06-29 2021-11-02 International Business Machines Corporation Performing semantic graph search
US20190243914A1 (en) * 2018-02-08 2019-08-08 Adam Lugowski Parallel query processing in a distributed analytics architecture
US11392596B2 (en) 2018-05-14 2022-07-19 Google Llc Efficient inner product operations

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4839853A (en) 1988-09-15 1989-06-13 Bell Communications Research, Inc. Computer information retrieval using latent semantic structure
US5301109A (en) 1990-06-11 1994-04-05 Bell Communications Research, Inc. Computerized cross-language document retrieval using latent semantic indexing
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US6269376B1 (en) * 1998-10-26 2001-07-31 International Business Machines Corporation Method and system for clustering data in parallel in a distributed-memory multiprocessor system
US6701305B1 (en) * 1999-06-09 2004-03-02 The Boeing Company Methods, apparatus and computer program products for information retrieval and document classification utilizing a multidimensional subspace
US6424971B1 (en) * 1999-10-29 2002-07-23 International Business Machines Corporation System and method for interactive classification and analysis of data
US7024407B2 (en) * 2000-08-24 2006-04-04 Content Analyst Company, Llc Word sense disambiguation
US7024400B2 (en) * 2001-05-08 2006-04-04 Sunflare Co., Ltd. Differential LSI space-based probabilistic document classifier

Also Published As

Publication number Publication date
US20040220944A1 (en) 2004-11-04
JP2006525602A (en) 2006-11-09
WO2004100130A3 (en) 2005-03-24
WO2004100130A2 (en) 2004-11-18
EP1618467A2 (en) 2006-01-25
JP4485524B2 (en) 2010-06-23
TWI242730B (en) 2005-11-01
CA2523128C (en) 2011-09-27
EP1618467A4 (en) 2008-09-17
US7152065B2 (en) 2006-12-19
TW200426627A (en) 2004-12-01

Legal Events

Date Code Title Description
EEER Examination request
MKLA Lapsed
Effective date: 2013-04-23