5,765,150 6/1998 Burrows 707/5
5,905,980 5/1999 Masuichi et al 707/1
5,920,854 7/1999 Kirsch et al 707/3
Unger, E.A. et al. ("Entropy as a Measure of Database
Information", IEEE, 1990, pp. 80-87).
Primary Examiner—Thomas G. Black
Assistant Examiner—William Trinh
Attorney, Agent, or Firm—Townsend and Townsend and
Crew LLP; Kenneth R. Allen
A computer-based method and system for establishing topic words to represent a document, the topic words being suitable for use in document retrieval. The method includes determining document keywords from the document; classifying each of the document keywords into one of a plurality of preestablished keyword classes; and selecting words as the topic words, each selected word from a different one of the preestablished keyword classes, to minimize a cost function on proposed topic words. The cost function may be a metric of dissimilarity, such as cross-entropy, between a first distribution of likelihood of appearance by the plurality of document keywords in a typical document and a second distribution of likelihood of appearance by the plurality of document keywords in a typical document, the second distribution being approximated using proposed topic words. The cost function can be a basis for sorting the priority of the documents.
23 Claims, 6 Drawing Sheets
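The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes a categorical keyword distribution `p`, a hypothetical table `cond` giving the likelihood of each keyword given a topic word, an approximation that simply averages those likelihoods, and an exhaustive search that takes exactly one candidate from every class. All names and numbers are made up for illustration.

```python
import itertools
import math


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_k p[k] * log(q[k]); eps guards against log(0)."""
    return -sum(pk * math.log(max(q.get(k, 0.0), eps)) for k, pk in p.items())


def approximate(topic_words, cond):
    """Assumed model: average P(keyword | topic word) over the proposed topic words."""
    keywords = {k for t in topic_words for k in cond[t]}
    n = len(topic_words)
    return {k: sum(cond[t].get(k, 0.0) for t in topic_words) / n for k in keywords}


def select_topic_words(classes, p, cond):
    """Try one candidate per keyword class; keep the lowest-cost combination."""
    best, best_cost = None, math.inf
    for combo in itertools.product(*classes.values()):
        cost = cross_entropy(p, approximate(combo, cond))
        if cost < best_cost:
            best, best_cost = combo, cost
    return best, best_cost


# Hypothetical keyword classes and their candidate words.
classes = {
    "field": ["database", "network"],
    "method": ["entropy", "sorting"],
}

# First distribution: assumed likelihood of each document keyword.
p = {"database": 0.4, "entropy": 0.4, "index": 0.2}

# Hypothetical P(keyword | topic word) used to build the second distribution.
cond = {
    "database": {"database": 0.6, "index": 0.3, "entropy": 0.1},
    "network": {"network": 0.7, "database": 0.1, "index": 0.2},
    "entropy": {"entropy": 0.7, "database": 0.2, "index": 0.1},
    "sorting": {"sorting": 0.8, "index": 0.2},
}

best, cost = select_topic_words(classes, p, cond)
print(best, round(cost, 4))
```

With these made-up numbers the pair ("database", "entropy") is selected, because averaging its two rows of `cond` reproduces `p` exactly and cross-entropy is smallest when the approximation matches the first distribution. Exhaustive search grows multiplicatively with the number of candidates per class, so it is only practical for small inputs like this one.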