DOCUMENT CLUSTERING WITH CLUSTER REFINEMENT AND MODEL SELECTION CAPABILITIES
RELATED APPLICATIONS
[0001] This Application claims priority from co-pending U.S. Provisional Application Serial No. 60/350,948, filed Jan. 25, 2002, which is incorporated in its entirety by reference.
BACKGROUND OF THE INVENTION
[0002] 1. Field of the Invention
[0003] This invention relates to information retrieval methods and, more specifically, to a method for document clustering with cluster refinement and model selection capabilities.
[0004] 2. Background and Related Art
[0005] 1. References
[0006] The following papers provide useful background information, for which they are incorporated herein by reference in their entirety, and are selectively referred to in the remainder of the disclosure by their accompanying reference numbers in angle brackets (e.g. <3> for the third numbered paper, by L. Baker et al.):
[0007] <1> Tagged Brown Corpus: http://www.hit.uib.no/icame/brown/bcm.html, 1979.
[0008] <2> NIST Topic Detection and Tracking Corpus: http://www.nist.gov/speech/tests/tdt/tdt98/index.htm, 1998.
[0009] <3> L. Baker and A. McCallum. Distributional Clustering of Words for Text Classification. In Proceedings of ACM SIGIR, 1998.
[0010] <4> W. Croft. Clustering Large Files of Documents using the Single-link Method. Journal of the American Society of Information Science, 28:341-344, 1977.
[0011] <5> D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of ACM SIGIR, 1992.
[0012] <6> R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, second edition. Wiley, New York, 2000.
[0013] <7> W. A. Gale and K. W. Church. Identifying Word Correspondences in Parallel Texts. In Proceedings of the Speech and Natural Language Workshop, page 152, Pacific Grove, Calif., 1991.
[0014] <8> M. Goldszmidt and M. Sahami. A Probabilistic Approach to Full-text Document Clustering. In SRI Technical Report ITAD-433-MS-98-044, 1997.
[0015] <9> T. Hofmann. The Cluster-abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data. In Proceedings of IJCAI-99, 1999.
[0016] <10> D. Pelleg and A. Moore. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), June 2000.
[0017] <11> F. Pereira, N. Tishby, and L. Lee. Distributional Clustering of English Words. In Proceedings of the Association for Computational Linguistics, pp. 183-190, 1993.
[0018] <12> J. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Technical Report 98-14, Microsoft Research. http://www.research.microsoft.com/jplatt/smo.html, 1998.
[0019] <13> P. Willett. Recent Trends in Hierarchical Document Clustering: A Critical Review. Information Processing & Management, 24(5):577-597, 1988.
[0020] <14> P. Willett. Document Clustering using an Inverted File Approach. Journal of Information Science, 2:223-231, 1990.
[0021] 2. Related Art
[0022] Traditional text search engines accomplish document retrieval by taking a query from the user and returning a set of documents matching that query. The primary users of text search engines have shifted from expert librarians to ordinary people with little knowledge of information retrieval (IR) methods, and the number of accessible text documents on the Internet has grown explosively. As a result, traditional IR techniques are increasingly inadequate for meeting diversified information retrieval needs and for handling huge volumes of relevant text documents.
[0023] Traditional IR techniques suffer from numerous problems and limitations. The following examples provide some illustrative contexts in which these problems and limitations are manifested.
[0024] First, text retrieval results are sensitive to the keywords used by the user to form queries. To retrieve the documents of interest, the user must formulate the query using the keywords that appear in the documents. This is a difficult, if not impossible, task for ordinary people who are not familiar with the vocabulary of the data corpus.
[0025] Second, traditional text search engines cover only one end of the whole spectrum of information retrieval needs, which is a narrowly specified search for documents matching the user's query <5>. They are not capable of meeting the information retrieval needs from the remaining part of the spectrum in which the user has a rather broad or vague information need (e.g. what are the major international events in the year 2001), or has no well defined goals but wants to learn more about the general contents of the data corpus.
[0026] Third, with an ever-increasing number of on-line text documents available on the Internet, it has become quite common for a keyword-based text search by a traditional search engine to return hundreds, or even thousands of hits, by which the user is often overwhelmed. As a consequence, access to the desired documents has become a more difficult and arduous task than ever before.
[0027] The above problems can be lessened by clustering documents according to their topics and main contents. If the document clusters are appropriately created and each is assigned an informative label, then it is probable that the user can reach his/her documents of interest without having to worry about which keywords to choose to formulate a