WO2001022280A2 - Determining trends using text mining - Google Patents

Determining trends using text mining

Info

Publication number
WO2001022280A2
Authority
WO
WIPO (PCT)
Prior art keywords
sub
entries
pairs
groups
occurrence
Prior art date
Application number
PCT/IL2000/000582
Other languages
French (fr)
Other versions
WO2001022280A3 (en)
Inventor
Ronen Feldman
Yehonatan Aumann
Yaron Ben-Yehuda
David Landau
Original Assignee
Clearforest Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Clearforest Ltd.
Priority to AU74428/00A
Publication of WO2001022280A2
Publication of WO2001022280A3

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/358Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/93Document management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/912Applications of a database
    • Y10S707/917Text
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99941Database schema or data structure
    • Y10S707/99943Generating database or data structure, e.g. via user interface

Definitions

  • the present invention relates generally to knowledge discovery in collections of data, and specifically to text mining.
  • BACKGROUND OF THE INVENTION
  • search engines have been developed which provide a user with documents which mention selected words or terms.
  • the user may use Boolean patterns with "and," "or" and "not" terms to more distinctly define the scope of the desired documents.
  • search engines do not provide an integrated picture of the distribution and impact of given terms in an entire corpus of documents.
  • Text mining is used to find hidden patterns in large textual collections. Text mining tools provide a human-tangible description of the information included in the textual collection. Because the amount of information is so large, a crucial feature of text mining tools is the way the information is organized and/or displayed. To limit the amount of information that a user must digest, it is common to define a context group which defines the information of interest to the particular user. Normally, the context group includes those documents which include one or more terms from a user-defined set.
  • a central tool in text mining is visualization of the complex patterns that are discovered.
  • One such visualization approach is described, for example, in an article by Feldman R., Klosgen W., and Zilberstien A., entitled “Visualization Techniques to Explore Data Mining Results for Document Collections," in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (1997), pp. 16-23, which is incorporated herein by reference.
  • This article describes a concept relationship analysis in which a set of concepts (or terms) are searched for in a corpus of textual data formed of a plurality of documents.
  • the concept relationship analysis searches for groups of concepts which appear together in relatively large numbers of documents, and these concepts are displayed together.
  • One method of representing concept relationships is by displaying context graphs.
  • in context graphs, the concepts (or terms) which appear together in large numbers of documents are designated by nodes. Each pair of nodes is connected by an edge whose weight is equal to the number of documents in which the terms of both nodes appear together. In order to limit the amount of data displayed, only edges which have a weight above a predetermined threshold are displayed.
  • the concepts which appear in nodes are chosen from a list of interesting terms defined by the user.
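As a rough illustration of the context graph described above, the following Python sketch counts, for every pair of user-selected terms, the number of documents in which both terms appear, and keeps only the edges whose weight is above a threshold. The function name `build_context_graph`, the representation of each document as a set of terms, and the sample data are assumptions made for illustration; none of them comes from the application.

```python
from collections import Counter
from itertools import combinations

def build_context_graph(documents, interesting_terms, threshold):
    """Return {(term_a, term_b): weight} for pairs co-occurring in more than `threshold` documents.

    `documents` is an iterable of term sets (a hypothetical representation).
    """
    wanted = set(interesting_terms)
    weights = Counter()
    for terms in documents:
        # Sort so that each pair has one canonical (alphabetical) orientation.
        present = sorted(set(terms) & wanted)
        for pair in combinations(present, 2):
            weights[pair] += 1
    # Keep only edges whose weight is above the threshold.
    return {pair: w for pair, w in weights.items() if w > threshold}

docs = [
    {"IBM", "Microsoft", "merger"},
    {"IBM", "Microsoft"},
    {"IBM", "Sun"},
]
graph = build_context_graph(docs, {"IBM", "Microsoft", "Sun"}, threshold=1)
print(graph)
```

With this sample data only the IBM/Microsoft edge survives, because the IBM/Sun pair appears in a single document.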
  • the corpus of documents is formed of several groups of documents, for example, documents from different dates, and it is desired to apprehend concept relationships as they develop in time.
  • the textual collection is mined for a group of combinations of words (referred to as phrases) which appear in the documents of the collection.
  • Each combination is given frequency-of-occurrence values for each time group.
  • a user requests to view the frequencies of occurrence of those combinations for which the occurrences follow a desired pattern.
  • this method does not give the user any feel for the development of trends in the textual documents as a whole.
  • the trends relate to appearances of terms found by text mining in groups of documents.
  • a corpus of documents is divided into sub-groups defined by a differentiating parameter, such as the dates of the documents, or their origin.
  • a separate context graph is prepared, and the relationship between the graphs is calculated.
  • the differentiating parameter defines an order of the context graphs.
  • the context graphs are preferably displayed sequentially, either one after another or one above the other.
  • Each graph is preferably displayed with indications which show the differences between the present graph and the previous graph.
  • each edge in the graph is marked to indicate a difference between its weight in the present graph and its weight in the previous graph.
  • each edge is marked to indicate the difference between its weight in the present graph and its average weight in a predetermined number of previous graphs.
  • edges are marked graphically, for example, using different colors, widths, and/or lengths to indicate the weight differences.
  • four indications are used for the following groups of edges: new edges, edges with increased weights, edges with decreased weights, and edges with substantially stable weights.
  • the differentiating parameter is the date of the documents.
  • all the documents from a single period are considered to belong to a single sub-group.
  • the periods may be of substantially any length, e.g., from minutes to years, according to a user selection.
  • the differentiating parameter comprises the origins of the documents, such as the authors, editors, countries of origin or the original languages of the documents.
  • substantially any other parameter may be used, such as the length of a document, or the average salary or number of employees of the company mentioned most frequently in a document.
  • the context graphs are displayed such that all nodes that are common to two or more of the graphs appear in substantially the same relative locations in the graphs. Therefore, the layout of the displayed form of the context graphs is prepared after all the nodes of all the graphs are known. Alternatively, the locations of the nodes and/or the distances between the nodes are used to indicate the importance of the terms of the nodes. In such cases, animation techniques are preferably used to aid the user to follow the changes in the positions of the nodes. In some preferred embodiments of the present invention, an animation sequence is used to display the changes between the context graphs.
  • the context graphs are listed, for example, in a list box, and the user can choose which context graph should be displayed relative to which other graphs. Further alternatively or additionally, a plurality of context graphs are superimposed one over the other, and each graph is displayed using a different color.
  • the corpus of documents includes a set of documents selected by a search engine, a clustering program, or by any other method of filtering and/or gathering of documents.
  • the trend graphs produced in accordance with preferred embodiments of the present invention may be used to select groups of documents on which additional filtering and/or other processing is to be performed.
  • a method for visualizing variations in a corpus of information including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, including: for each of the entries, extracting characteristics of information contained therein; finding pairs of different characteristics that appear together in at least one of the entries; determining an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear; comparing the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and providing an indication of the comparative occurrence values of the pairs.
  • the entries include text documents, and the characteristics include terms appearing in the documents. Further preferably, determining the occurrence value includes counting the number of entries in which the pair appears.
  • finding the pairs of characteristics includes finding pairs of characteristics which appear together in at least a predetermined number of the entries.
  • finding the pairs of characteristics includes finding pairs of characteristics which appear together in at least two of the sub-groups.
  • extracting the characteristics includes automatically mining the corpus to extract characteristics therefrom.
  • the differentiating parameter defines an order, and comparing the occurrence values includes comparing the occurrence values in a first sub-group with the occurrence values in one or more previous sub-groups in the order.
  • comparing the occurrence values includes comparing the occurrence values in the first sub-group with the occurrence values in a closest previous sub-group.
  • comparing the occurrence values includes comparing the occurrence values in the first sub-group with an average of the occurrence values in the one or more previous sub-groups.
  • providing the indication includes displaying a symbol which indicates a measure of evolution in the occurrence value in the first sub-group relative to the occurrence values in the one or more previous sub-groups in the order.
  • providing the indication includes displaying a table or graph.
  • displaying the graph includes displaying a graph in which each term is represented by a node, the pairs of characteristics that are found are represented by edges, and substantially each edge is associated with the indication of the comparative appearance of the respective pair.
  • displaying the graph includes displaying with substantially each edge a weight of the edge, which equals the occurrence value of the respective pair in a first sub-group.
  • displaying the graph includes displaying the graph such that the lengths of the edges represent the occurrence value of the respective pair in a first sub-group.
  • displaying the graph includes displaying for each two sub-groups a graph which compares the occurrence values in the two sub-groups.
  • displaying the graph for each two sub-groups includes displaying the graphs such that nodes which represent the same term are displayed in substantially the same relative location. Further preferably, the graphs of each two sub-groups are displayed as an animation sequence.
  • displaying the graph includes displaying a plurality of superimposed graphs, each of which represents the appearances of the pairs in a different sub-group. Further preferably, displaying the plurality of superimposed graphs includes displaying each of the graphs in a different color.
  • providing the indication of the comparative values of the pairs includes providing an indication in which pairs having a characteristic in common are grouped together.
  • apparatus for visualizing variations in a corpus of information including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, including: a processor which finds pairs of characteristics which appear together in at least one of the documents, determines an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear, and compares the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and a display which displays an indication of the comparative occurrence values of the pairs.
  • the processor finds characteristics selected from a group of automatically determined characteristics.
  • a method for selecting a range of values of a variable including:
  • providing a graphic user interface on a display, including a slide-piece that has an initial dimension and is translatable along an axis representing the variable such that each position of the slide-piece along the axis corresponds to a given value of the variable; positioning the slide-piece at a first position on the axis, so as to indicate a first value of the variable; and changing the dimension of the slide-piece so as to indicate a second value of the variable, whereby the first and second values of the variable define the selected range.
  • changing the dimension of the slide-piece includes changing a length of the slide-piece along the axis.
  • the first and second values of the variable include the extrema of the range.
  • a computer program product for visualizing variations in a corpus of information, including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, the documents including text, the program having computer-readable program instructions embodied therein, which instructions cause a computer to: for each of the entries, extract characteristics of information contained therein; find pairs of different characteristics that appear together in at least one of the entries; determine an occurrence value for each of the pairs of characteristics in each subgroup in which both of the characteristics appear; compare the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and provide an indication of the comparative occurrence values of the pairs.
  • a computer program product for selecting a range of values of a variable
  • the program having computer-readable program instructions embodied therein, which instructions cause a computer to: provide a graphic user interface on a display, including a slide-piece that has an initial dimension and is translatable along an axis representing the variable such that each position of the slide-piece along the axis corresponds to a given value of the variable; position the slide-piece at a first position on the axis, so as to indicate a first value of the variable; and change the dimension of the slide-piece so as to indicate a second value of the variable, whereby the first and second values of the variable define the selected range.
  • a method for extracting data from a corpus of information including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, including: for a first one of the entries in a first one of the sub-groups, extracting a characteristic of information contained therein; for a second one of the entries in a second one of the sub-groups, extracting the same characteristic of information; automatically determining respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and providing an indication of the occurrence values.
  • providing the indication includes providing a visual indication of the occurrence values.
  • the differentiating parameter includes a sequence, most preferably a time sequence.
  • apparatus for extracting data from a corpus of information including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, including: a processor, which (a) for a first one of the entries in a first one of the sub-groups, extracts a characteristic of information contained therein, (b) for a second one of the entries in a second one of the sub-groups, extracts the same characteristic of information, and (c) automatically determines respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and a display, which provides an indication of the occurrence values.
  • a computer program product for extracting data from a corpus of information, including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, the program having computer-readable program instructions embodied therein, which instructions, when read by a computer, cause the computer to: for a first one of the entries in a first one of the sub-groups, extract a characteristic of information contained therein; for a second one of the entries in a second one of the sub-groups, extract the same characteristic of information; automatically determine respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and provide an indication of the occurrence values.
  • FIG. 1 is a schematic illustration of a system for text mining, in accordance with a preferred embodiment of the present invention
  • Fig. 2 is a flow chart illustrating preparation of a trend graph from a corpus of documents, in accordance with a preferred embodiment of the present invention
  • Fig. 3 is a schematic view of a text mining input window display, in accordance with a preferred embodiment of the present invention
  • Fig. 4A is a schematic view of a trend graph, in accordance with a preferred embodiment of the present invention
  • Fig. 4B is a schematic view of a trend graph representing a period following the period represented by the graph of Fig. 4A, in accordance with a preferred embodiment of the present invention
  • Fig. 5 is a schematic view of a comparison graph, in accordance with a preferred embodiment of the present invention.
  • Fig. 6 is a schematic view of a graphic interface, in accordance with a preferred embodiment of the present invention.
  • Fig. 1 is a schematic illustration of a system 18 for text mining and visualization, in accordance with a preferred embodiment of the present invention.
  • System 18 preferably comprises a memory 22, which stores a corpus of documents from which information is mined.
  • system 18 comprises a modem 24 or other network connection, through which access is established to collections of documents, which include some or all of the documents in the corpus.
  • System 18 preferably further comprises a computer 20, which mines information from the documents, and a display 26, on which the mined information is displayed.
  • Fig. 2 is a flow chart illustrating the actions of computer 20 in preparing trend graphs from the corpus of documents, in accordance with a preferred embodiment of the present invention.
  • the documents in the corpus are dated and/or time-stamped, and the trend graph represents changes in the corpus as a function of time.
  • each document is associated with a different ordering parameter value, not necessarily time-related.
  • the corpus of documents may include articles drawn from The Wall Street Journal about high-tech companies, and the ordering parameter may be the average employee salary or the number of employees of the company mentioned most frequently in an article.
  • a database containing information about employees of high-tech companies would preferably be accessible, either locally or remotely, to computer 20.
  • a more complex ordering parameter such as (average employee salary) * (percentage of employees who use a PC) * (percentage of employees who have a college degree), may be used to aid a user in analyzing a very large collection of news articles.
  • computer 20 analyzes each document and prepares for each document a record which represents the document.
  • the record preferably comprises a set of terms which appear in the document, most preferably together with the numbers of occurrences of the terms and/or a parameter which represents the importance of the terms.
  • the records are preferably prepared in accordance with the method described in the article by Feldman, Klosgen, and Zilberstien, which is referenced in the Background of the Invention section of the present patent application.
  • the records are preferably stored in memory 22 for future text mining, and computer 20 preferably does not need to access the documents again in order to perform additional text mining sessions.
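A minimal sketch of such a per-document record follows, assuming a simple lowercase word tokenizer and a fixed vocabulary. The application relies on the term-extraction method of the cited article, which is not reproduced here; the function `make_record` and the sample text are hypothetical.

```python
import re
from collections import Counter

def make_record(text, vocabulary):
    """Map each vocabulary term found in `text` to its number of occurrences.

    Tokenization by a plain regular expression is an assumption for illustration only.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return {term: counts[term] for term in vocabulary if counts[term] > 0}

record = make_record(
    "IBM and Sun discuss a merger; IBM declines.",
    {"ibm", "sun", "merger", "microsoft"},
)
print(record)
```

Once records like this are stored, later mining sessions can work from the records alone rather than re-reading the documents.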
  • Fig. 3 is a schematic view of a text mining input window 38 on display 26, in accordance with a preferred embodiment of the present invention.
  • the user preferably defines a context group on which the session is performed.
  • the context group comprises those documents in which one or more selected terms appear in the accumulative or in the alternative, according to user selections.
  • Input window 38 comprises a selection window 40 in which the user selects the terms to define the context group and a selection pad 42, for selecting Boolean operations to be performed on the terms.
  • selection window 40 lists all the terms which appear in at least one of the documents of the corpus, and the user selects the terms from the list that will define the context group.
  • the terms in selection window 40 are determined automatically, as described, for example, in an article by Feldman, et al., entitled “Text Mining at the Term Level,” in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (1998), which is incorporated herein by reference.
  • the context group is defined by terms associated with "merger.”
  • the user may define the context group using any other parameters characterizing the documents in the corpus, including the authorship, origin, length, and date of the documents.
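The selection of a context group by terms taken "in the accumulative or in the alternative" can be sketched as a Boolean filter over the document records. The function `select_context_group`, its `mode` argument, and the sample records are assumptions for illustration.

```python
def select_context_group(records, terms, mode="any"):
    """Return indices of records containing any ("or") or all ("and") of `terms`.

    Each record is a mapping from term to occurrence count (hypothetical layout).
    """
    wanted = set(terms)

    def matches(record):
        found = wanted & set(record)
        if mode == "all":
            return found == wanted
        if mode == "any":
            return bool(found)
        raise ValueError("mode must be 'any' or 'all'")

    return [index for index, record in enumerate(records) if matches(record)]

records = [
    {"merger": 1, "ibm": 2},
    {"acquisition": 1},
    {"merger": 1, "acquisition": 3},
    {"sun": 4},
]
either = select_context_group(records, ["merger", "acquisition"], mode="any")
both = select_context_group(records, ["merger", "acquisition"], mode="all")
print(either, both)
```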
  • an additional selection window 44 enables the user to define types of terms that will be used in generating the context graphs. Ideally it would be desirable not to limit the terms appearing in the graphs. However, in most cases, such an unlimited approach would lead to an excess of meaningless data, for example, appearances of connection words ("and,” "the,” etc.), in the results. Therefore, window 44 allows the user to select the terms to appear in the results.
  • the terms appearing in the results are chosen according to predefined groups, such as companies, personal names, etc. Alternatively or additionally, the terms allowed to appear in the results may be chosen by excluding non-interesting terms.
  • the user preferably chooses in a window 46 a granularity for the time axis of the documents.
  • the granularity defines the period of time from which all documents are considered to belong to a single group.
  • the granularity may be on the order of months, as shown in Fig. 3, or on the order of hours, days, weeks or years, or substantially any time order.
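One way to picture the granularity setting is as a function that maps each document date to a period label, so that all documents with the same label fall into one sub-group. The sketch below is an assumption-laden illustration: `period_key`, `group_by_period`, the supported granularity names, and the use of ISO week numbers are not taken from the application.

```python
from collections import defaultdict
from datetime import date

def period_key(day, granularity):
    """Map a date to a tuple identifying the period that contains it."""
    if granularity == "year":
        return (day.year,)
    if granularity == "month":
        return (day.year, day.month)
    if granularity == "week":
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        return (iso[0], iso[1])
    if granularity == "day":
        return (day.year, day.month, day.day)
    raise ValueError(f"unsupported granularity: {granularity}")

def group_by_period(dated_docs, granularity):
    """Group (date, doc_id) pairs into chronologically ordered sub-groups."""
    groups = defaultdict(list)
    for day, doc_id in dated_docs:
        groups[period_key(day, granularity)].append(doc_id)
    return dict(sorted(groups.items()))

dated = [(date(2000, 1, 5), "a"), (date(2000, 2, 1), "c"), (date(2000, 1, 20), "b")]
by_month = group_by_period(dated, "month")
print(by_month)
```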
  • Computer 20 searches the records which represent the documents in the context group, in order to find documents in which pairs of two different terms from among the result terms of window 44 both appear. For each pair of terms, computer 20 counts separately for each period of time the number of documents in which the pair appears. Alternatively or additionally, computer 20 assigns each pair of terms an occurrence frequency value which is based on the number of documents in which the pair appears, the number of times each of the terms appears in the documents, and/or a weight given to each term according to its importance.
  • the search is preferably performed as described in the above-mentioned article of Feldman, Klosgen and Zilberstien.
  • the results are shown in a table 50, in which two columns 54 show the pairs of terms, and the rest of the columns show the number of documents for each time period.
  • the rows in table 50 are sorted such that pairs which include common terms appear next to each other.
  • the rows in table 50 are sorted according to the total number of documents in which the respective pairs of terms appear.
  • the rows in table 50 are sorted according to the appearance of the term pairs in a selected period, i.e., in a column or group of columns in the table.
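The per-period pair counts and one of the sort orders described above can be sketched as follows. This is a simplified illustration, not the search method of the cited article: `pair_table`, `sort_rows_by_total`, the dictionary layout, and the sample data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def pair_table(docs_by_period, result_terms):
    """Return {(term_a, term_b): [document count per period]}.

    `docs_by_period` maps a period label to a list of term sets; the column
    order follows the order of its keys.
    """
    wanted = set(result_terms)
    periods = list(docs_by_period)
    table = defaultdict(lambda: [0] * len(periods))
    for column, period in enumerate(periods):
        for terms in docs_by_period[period]:
            present = sorted(set(terms) & wanted)
            for pair in combinations(present, 2):
                table[pair][column] += 1
    return dict(table)

def sort_rows_by_total(table):
    """Sort rows by the total number of documents in which each pair appears."""
    return sorted(table.items(), key=lambda row: sum(row[1]), reverse=True)

docs_by_period = {
    "Jan": [{"IBM", "Sun"}, {"IBM", "Microsoft"}],
    "Feb": [{"IBM", "Microsoft"}, {"IBM", "Microsoft", "Sun"}],
}
table = pair_table(docs_by_period, {"IBM", "Microsoft", "Sun"})
rows = sort_rows_by_total(table)
print(rows)
```

Sorting by a selected period instead would only change the sort key, for example to the count in one chosen column.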
  • a button 52 allows the user to see the results in table 50 in a graphic format, as described hereinbelow.
  • Fig. 4A is a schematic view of a trend graph 60, in accordance with a preferred embodiment of the present invention.
  • Graph 60 preferably represents a single column of table 50, i.e., a single period. Rows of table 50 in which the entry of graph 60 in the single column has a value above a predetermined threshold are referred to herein as active rows. Each term which appears in one of the first two columns 54 of an active row is shown by a node 62 in graph 60. Each of the active rows appears as an edge 64 in graph 60.
  • each edge 64 is displayed along with a weight 66 which is equal to the number of documents in which the pair of terms connected by the edge appears.
  • a symbol 68 is displayed next to weight 66 designating the change in the value of the weight relative to the previous column.
  • the symbol designates the change relative to an average of a number of preceding columns.
  • symbol 68 is a "<" if the weight of its edge decreased, a ">" if the weight increased, and a "*" if the weight remains substantially stable.
  • weights are considered to increase or decrease only if the change is larger than a predetermined factor, for example, 25%. Edges which change by a factor smaller than the predetermined factor are considered stable.
  • new edges and/or edges with increased weights are designated by wider lines than edges which have decreased weights.
  • other sets of symbols may be used to indicate the changes in the graphs.
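The marking of edges can be sketched as a small classifier that compares the current weight with a baseline, taken either from the previous period alone or from the average of several preceding periods, and treats changes within the predetermined factor as stable. The function `change_symbol`, the symbols it returns, the `"new"` label for edges with no earlier weight, and the default factor of 25% are illustrative assumptions.

```python
def change_symbol(current, previous_weights, factor=0.25):
    """Classify an edge weight against the mean of one or more previous weights.

    Returns ">" (increased), "<" (decreased), "*" (stable) or "new".
    """
    if previous_weights:
        baseline = sum(previous_weights) / len(previous_weights)
    else:
        baseline = 0
    if baseline == 0:
        # No earlier occurrences: treat any positive weight as a new edge.
        return "new" if current > 0 else "*"
    relative_change = (current - baseline) / baseline
    if relative_change > factor:
        return ">"
    if relative_change < -factor:
        return "<"
    return "*"

print(change_symbol(10, [6]))     # compared with the previous period only
print(change_symbol(9, [4, 8]))   # compared with the average of two periods
```

Passing a single previous weight gives the comparison with the closest previous period; passing several gives the comparison with their average.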
  • Fig. 4B is a schematic view of a trend graph 80 representing a period following the period represented by graph 60, in accordance with a preferred embodiment of the present invention.
  • nodes 62 which appear in both graph 60 and 80 are positioned in the same locations in both graphs. Therefore, space is allocated for the nodes that will appear in the graphs representing all the columns of table 50, before displaying any of the graphs. For example, empty space 70 is left in graph 60, to leave room for nodes 72 in graph 80.
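Keeping common nodes in the same place across graphs implies that positions are assigned only after the nodes of all graphs are known. The sketch below collects the union of nodes and places them evenly on a circle; the circular layout, the function `shared_layout`, and the edge-dictionary representation are assumptions chosen for brevity, since the application does not prescribe a layout algorithm.

```python
import math

def shared_layout(graphs, radius=1.0):
    """Assign one fixed (x, y) position per node across all graphs.

    `graphs` is a list of {(term_a, term_b): weight} dictionaries.
    """
    nodes = sorted({node for edges in graphs for pair in edges for node in pair})
    step = 2 * math.pi / len(nodes) if nodes else 0.0
    return {
        node: (radius * math.cos(index * step), radius * math.sin(index * step))
        for index, node in enumerate(nodes)
    }

first_period = {("IBM", "Sun"): 3}
second_period = {("IBM", "Microsoft"): 2}
positions = shared_layout([first_period, second_period])
print(positions)
```

Each graph is then drawn using only the positions of its own nodes, which leaves empty space where a node appears in a later graph.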
  • the positions of nodes 62 are chosen separately for each graph, arbitrarily or according to the weights of the edges 64 incident on the nodes.
  • nodes 62 having relatively higher sums of weights of the incident edges may be positioned in the center or at the top of the graph.
  • the lengths of edges 64 may be used to indicate a desired parameter.
  • the length of the edge may indicate the weight of the edge, while the thickness of the edge indicates its weight relative to one or more previous periods.
  • the user can request more information by selecting areas of the graph.
  • a window may open with a bar graph, a table or any other indication which shows the weights of the edge as a function of time.
  • the documents contributing to the selected edge may be listed, allowing the user to read the documents and judge their relevance. Further preferably, the user may request to see the graphs as they change over time in an animation sequence.
  • Fig. 5 is a schematic view of a comparison graph 100, in accordance with a preferred embodiment of the present invention.
  • Graph 100 compares text mining results of recipe documents from different document groups, for example, documents from two different countries.
  • Each major ingredient in the recipes is designated by a node 102.
  • Nodes 102 which appear together in more than a predetermined threshold number of documents are connected by an edge 104.
  • Each edge is marked with two values, corresponding to appearance of the associated terms in documents from the two different countries.
  • the values indicate the percentage of documents from the respective country in which the pair of ingredients connected by the corresponding edge 104 both appear.
  • the edges 104 are marked with the absolute number of documents.
  • edges 106 which correspond to combinations that are more popular in country #1 are displayed differently from edges 108 for combinations which are more popular in country #2.
  • the edges and values for each country may be displayed in different colors.
  • only a single value designating the difference between the values of different document groups is displayed with each edge.
  • the user may select which type of display is desired.
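The two values shown on each edge of the comparison graph, and the single difference value of the alternative display, can be computed as in the following sketch. The function `compare_groups` and the recipe data are hypothetical.

```python
def compare_groups(group_a, group_b, term_a, term_b):
    """Return (percentage in group A, percentage in group B, difference).

    Each group is a list of term sets; a percentage is the share of documents
    in that group containing both terms.
    """
    def percentage(docs):
        if not docs:
            return 0.0
        hits = sum(1 for terms in docs if term_a in terms and term_b in terms)
        return 100.0 * hits / len(docs)

    pct_a = percentage(group_a)
    pct_b = percentage(group_b)
    return pct_a, pct_b, pct_a - pct_b

country_1 = [{"tomato", "basil"}, {"tomato", "garlic"}, {"tomato", "basil", "garlic"}, {"rice"}]
country_2 = [{"tomato", "basil"}, {"rice"}, {"rice", "garlic"}, {"garlic"}]
result = compare_groups(country_1, country_2, "tomato", "basil")
print(result)
```

A positive difference marks a combination that is more common in the first group, which is the kind of distinction the differently displayed edges convey.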
  • Fig. 6 is a schematic view of a graphic interface 120, showing sample graphs 122,
  • Graphs 122 and 124 are, respectively, a "single-term"-centered graph and a bar graph, in which the relationship between a single term ("Microsoft") and a set of other terms ("IBM," "Sun," etc.) is quantitatively displayed.
  • the quantitative relationship shown in graphs 122 and 124 may comprise, for example, the number of news articles containing both the term "Microsoft" and each of the other listed terms during a specified time period.
  • graph 126 is displayed to show the most significant relationships among all of the displayed terms.
  • graph 128 shows the number of appearances of the term "Microsoft," irrespective of the other companies, during a five week period extending from April 10 to May 15.
  • a slide-bar 130 is provided with interface 120, which enables the user to move an enhanced slide-piece 132 between two points on an axis of interest, e.g., time. Slide-bars which perform this limited function are widely available, for example, in Microsoft Windows 98. In prior art slide-bars, the slide-piece is typically moved to indicate, for example, a location in a document, a time, or a color from a range of pages, times, or colors, respectively.
  • the length of enhanced slide-piece 132 i.e., the distance between points 134 and 136 in Fig. 6, provides the user with additional information about a parameter of interest.
  • slide-bar 130 in the embodiment shown in Fig. 6 represents a set of relevant news articles spanning one year.
  • the length of enhanced slide-piece 132, as shown, is five weeks, i.e., approximately one tenth the total length of slide-bar 130.
  • graphs 122, 124, and 126 are continually updated responsive to whatever news articles are contained in a five-week period which is "covered" by the slide-piece.
  • the user is enabled to modify the length of the enhanced slide-piece in real time, so as to cause computer 20 to change the set of articles used in generating the graphs accordingly.
  • dashed lines 148 show a former setting of the slide-piece, in which approximately twelve weeks were represented by the slide-piece.
  • the user uses a mouse to grab onto the left side 144 or right side 146 of enhanced slide-piece 132, and changes its length, typically in a manner analogous to the way objects are re-sized in a Windows environment.
  • neither Windows nor any other software provides the improved and intuitive position control provided by enhanced slide-piece 132.
  • In light of this description of the operation of slide-piece 132, many applications not related to a time axis will become obvious to one skilled in the art. For example, scrolling through a document by moving slide-piece 132 could be enhanced by a "zoom" feature, effectively enabled by changing the size of the slide-piece.
  • a slide-bar which uses prior art technology would allow the user to select a single color from a spectrum
  • a user of embodiments of the present invention would be additionally enabled to select a range of neighboring colors in an intuitive fashion.
  • aspects of the present invention described hereinabove can be embodied in a computer running software, and that the software can be stored in tangible media, e.g., hard disks, floppy disks or compact disks, or in intangible media, e.g., in an electronic memory, or on a network such as the Internet.
  • tangible media e.g., hard disks, floppy disks or compact disks
  • intangible media e.g., in an electronic memory, or on a network such as the Internet.

Abstract

A method for visualizing variations in a corpus of information. The corpus includes a plurality of information entries, which are divided into a plurality of sub-groups according to a differentiating parameter of the entries. For each of the entries, characteristics of information contained therein are extracted and pairs of different characteristics that appear together in at least one of the entries are found. An occurrence value is determined for each of the pairs of characteristics in each sub-group in which both of the characteristics appear. The occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups are compared, and an indication of the comparative occurrence values of the pairs is provided.

Description

DETERMINING TRENDS USING TEXT MINING
FIELD OF THE INVENTION
The present invention relates generally to knowledge discovery in collections of data, and specifically to text mining.
BACKGROUND OF THE INVENTION
In recent years, the volume of text documents available on computers and computer networks has been growing rapidly. It is virtually impossible to read all the available documents containing information of importance on a given subject. In order to find desired information, search engines have been developed which provide a user with documents which mention selected words or terms. The user may use Boolean patterns with "and," "or" and "not" terms to more distinctly define the scope of the desired documents. However, the user cannot always define precisely which are the desired documents or keyword combinations. In addition, search engines do not provide an integrated picture of the distribution and impact of given terms in an entire corpus of documents.
Text mining is used to find hidden patterns in large textual collections. Text mining tools provide a human-tangible description of the information included in the textual collection. Because the amount of information is so large, a crucial feature of text mining tools is the way the information is organized and/or displayed. To limit the amount of information that a user must digest, it is common to define a context group which defines the information of interest to the particular user. Normally, the context group includes those documents which include one or more terms from a user-defined set.
A central tool in text mining is visualization of the complex patterns that are discovered. One such visualization approach is described, for example, in an article by Feldman R., Klosgen W., and Zilberstien A., entitled "Visualization Techniques to Explore Data Mining Results for Document Collections," in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (1997), pp. 16-23, which is incorporated herein by reference. This article describes a concept relationship analysis in which a set of concepts (or terms) are searched for in a corpus of textual data formed of a plurality of documents. The concept relationship analysis searches for groups of concepts which appear together in relatively large numbers of documents, and these concepts are displayed together.
One method of representing concept relationships is by displaying context graphs. In context graphs, the concepts (or terms) which appear together in large numbers of documents are designated by nodes. Each two nodes are connected by an edge which has a weight which is equal to the number of documents in which the terms of both nodes appear together. In order to limit the amount of data displayed, only edges which have a weight above a predetermined threshold are displayed. In some context graphs, the concepts which appear in nodes are chosen from a list of interesting terms defined by the user.
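The context graph described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of the cited article: the function name `build_context_graph`, the representation of each document as a set of terms, and the sample data are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def build_context_graph(documents, threshold):
    """Return {(term_a, term_b): weight} for pairs whose weight exceeds `threshold`.

    `documents` is an iterable of term collections (hypothetical input format).
    The weight of an edge is the number of documents containing both terms.
    """
    weights = Counter()
    for terms in documents:
        # Count each unordered pair at most once per document.
        for pair in combinations(sorted(set(terms)), 2):
            weights[pair] += 1
    # Keep only edges whose weight is above the threshold.
    return {pair: w for pair, w in weights.items() if w > threshold}


docs = [
    {"IBM", "Microsoft", "Sun"},
    {"IBM", "Microsoft"},
    {"Microsoft", "Sun"},
    {"IBM", "Microsoft"},
]
graph = build_context_graph(docs, threshold=1)
```

With this sample data the pair ("IBM", "Sun") co-occurs in only one document, so its edge is dropped by the threshold.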
In many cases, the corpus of documents is formed of several groups of documents, for example, documents from different dates, and it is desired to apprehend concept relationships as they develop in time. An article by Lent B., Agrawal R., and Srikant R., entitled "Discovering Trends in Text Databases," in Proceedings of the 3rd International Conference of Knowledge Discovery and Data Mining (1997), pp. 227-230, which is incorporated herein by reference, describes a method of detecting trends in textual collections formed of documents with timestamps, which are partitioned into time groups according to a selected granularity. The textual collection is mined for a group of combinations of words (referred to as phrases) which appear in the documents of the collection. Each combination is given frequency-of-occurrence values for each time group. A user requests to view the frequencies of occurrence of those combinations for which the occurrences follow a desired pattern. However, this method does not give the user any feel for the development of trends in the textual documents as a whole.
In an article entitled "Trend graphs: Visualizing the evolution of concept relationships in large document collections," by Feldman R., Aumann Y., Zilberstien A., and Ben-Yehuda Y., in Proceedings of the 4th International Conference of Knowledge Discovery and Data Mining (1998), which is incorporated herein by reference, a graphical tool is described for analyzing and visualizing dynamic changes in concept relationships over time.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide methods and apparatus for displaying trends that are discovered in large collections of information.
In some aspects of the present invention, the trends relate to appearances of terms found by text mining in groups of documents.
It is another object of some aspects of the present invention to provide methods and apparatus for displaying the evolution of concept relationships in groups of documents.
It is another object of some aspects of the present invention to provide methods and apparatus for displaying differences between patterns of term appearances in different groups of documents.
It is still another object of some aspects of the present invention to provide methods and apparatus for determining major changes in patterns of term appearances in groups of documents.
In preferred embodiments of the present invention, a corpus of documents is divided into sub-groups defined by a differentiating parameter, such as the dates of the documents, or their origin. For each sub-group of documents, a separate context graph is prepared, and the relationship between the graphs is calculated.
In some preferred embodiments of the present invention, the differentiating parameter defines an order of the context graphs. The context graphs are preferably displayed sequentially, either one after another or one above the other. Each graph is preferably displayed with indications which show the differences between the present graph and the previous graph. Preferably, each edge in the graph is marked to indicate a difference between its weight in the present graph and its weight in the previous graph. Alternatively or additionally, each edge is marked to indicate the difference between its weight in the present graph and its average weight in a predetermined number of previous graphs.
Preferably, the edges are marked graphically, for example, using different colors, widths, and/or lengths to indicate the weight differences. In a preferred embodiment of the present invention, four indications are used for the following groups of edges: new edges, edges with increased weights, edges with decreased weights, and edges with substantially stable weights.
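The four groups of edges named above can be illustrated with a short Python sketch. The function name `classify_edges`, the `tolerance` parameter used to decide what counts as "substantially stable", and the `STYLES` mapping of groups to colors and line widths are assumptions introduced for the example; the text does not prescribe them.

```python
def classify_edges(previous, current, tolerance=0.25):
    """Label each edge of `current` as new, increased, decreased, or stable.

    `previous` and `current` map (term_a, term_b) pairs to edge weights.
    `tolerance` is a hypothetical relative band inside which a weight
    is treated as substantially stable.
    """
    labels = {}
    for edge, weight in current.items():
        old = previous.get(edge)
        if old is None:
            labels[edge] = "new"
        elif weight > old * (1 + tolerance):
            labels[edge] = "increased"
        elif weight < old * (1 - tolerance):
            labels[edge] = "decreased"
        else:
            labels[edge] = "stable"
    return labels


# Hypothetical display styles, one per group of edges.
STYLES = {
    "new": {"color": "green", "width": 3},
    "increased": {"color": "blue", "width": 3},
    "decreased": {"color": "red", "width": 1},
    "stable": {"color": "gray", "width": 1},
}

previous = {("IBM", "Microsoft"): 10, ("Microsoft", "Sun"): 8, ("IBM", "Sun"): 4}
current = {
    ("IBM", "Microsoft"): 15,
    ("Microsoft", "Sun"): 4,
    ("IBM", "Sun"): 4,
    ("Intel", "Microsoft"): 6,
}
labels = classify_edges(previous, current)
```

A renderer could then look up `STYLES[labels[edge]]` when drawing each edge of the present graph.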
In some preferred embodiments of the present invention, the differentiating parameter is the date of the documents. Preferably, all the documents from a single period are considered to belong to a single sub-group. The periods may be of substantially any length, e.g., from minutes to years, according to a user selection. Alternatively or additionally, the differentiating parameter comprises the origins of the documents, such as the authors, editors, countries of origin or the original languages of the documents. Further alternatively or additionally, substantially any other parameter may be used, such as the length of a document, or the average salary or number of employees of the company mentioned most frequently in a document.
In a preferred embodiment of the present invention, the context graphs are displayed such that all nodes that are common to two or more of the graphs appear in substantially the same relative locations in the graphs. Therefore, the layout of the displayed form of the context graphs is prepared after all the nodes of all the graphs are known. Alternatively, the locations of the nodes and/or the distances between the nodes are used to indicate the importance of the terms of the nodes. In such cases, animation techniques are preferably used to aid the user to follow the changes in the positions of the nodes.
In some preferred embodiments of the present invention, an animation sequence is used to display the changes between the context graphs. Alternatively or additionally, the context graphs are listed, for example, in a list box, and the user can choose which context graph should be displayed relative to which other graphs. Further alternatively or additionally, a plurality of context graphs are superimposed one over the other, and each graph is displayed using a different color.
In some preferred embodiments of the present invention, the corpus of documents includes a set of documents selected by a search engine, a clustering program, or by any other method of filtering and/or gathering of documents. Furthermore, the trend graphs produced in accordance with preferred embodiments of the present invention may be used to select groups of documents on which additional filtering and/or other processing is to be performed.
Although preferred embodiments are described herein with reference to mining and analysis of text documents, those skilled in the art will appreciate that the principles of the present invention may also be applied to visualization of trends and other variations in collections of information of other types. For example, trends occurring among the records in a large database may be analyzed and visualized in similar fashion.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for visualizing variations in a corpus of information, including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, including: for each of the entries, extracting characteristics of information contained therein; finding pairs of different characteristics that appear together in at least one of the entries; determining an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear; comparing the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and providing an indication of the comparative occurrence values of the pairs.
Preferably, the entries include text documents, and the characteristics include terms appearing in the documents. Further preferably, determining the occurrence value includes counting the number of entries in which the pair appears.
Still further preferably, finding the pairs of characteristics includes finding pairs of characteristics which appear together in at least a predetermined number of the entries.
In a preferred embodiment, finding the pairs of characteristics includes finding pairs of characteristics which appear together in at least two of the sub-groups.
Preferably, extracting the characteristics includes automatically mining the corpus to extract characteristics therefrom.
In a preferred embodiment, the differentiating parameter defines an order, and comparing the occurrence values includes comparing the occurrence values in a first sub-group with the occurrence values in one or more previous sub-groups in the order.
Preferably, comparing the occurrence values includes comparing the occurrence values in the first sub-group with the occurrence values in a closest previous sub-group. Alternatively or additionally, comparing the occurrence values includes comparing the occurrence values in the first sub-group with an average of the occurrence values in the one or more previous sub-groups. Further alternatively or additionally, providing the indication includes displaying a symbol which indicates a measure of evolution in the occurrence value in the first sub-group relative to the occurrence values in the one or more previous sub-groups in the order.
In a preferred embodiment, providing the indication includes displaying a table or graph. Preferably, displaying the graph includes displaying a graph in which each term is represented by a node, the pairs of characteristics that are found are represented by edges, and substantially each edge is associated with the indication of the comparative appearance of the respective pair. Typically, displaying the graph includes displaying with substantially each edge a weight of the edge, which equals the occurrence value of the respective pair in a first sub-group. Alternatively or additionally, displaying the graph includes displaying the graph such that the lengths of the edges represent the occurrence value of the respective pair in a first sub-group.
In a preferred embodiment, displaying the graph includes displaying for each two sub-groups a graph which compares the occurrence values in the two sub-groups. Preferably, displaying the graph for each two sub-groups includes displaying the graphs such that nodes which represent the same term are displayed in substantially the same relative location. Further preferably, the graphs of each two sub-groups are displayed as an animation sequence.
Preferably, displaying the graph includes displaying a plurality of superimposed graphs, each of which represents the appearances of the pairs in a different sub-group. Further preferably, displaying the plurality of superimposed graphs includes displaying each of the graphs in a different color.
In a preferred embodiment, providing the indication of the comparative values of the pairs includes providing an indication in which pairs having a characteristic in common are grouped together.
There is also provided, in accordance with a preferred embodiment of the present invention, apparatus for visualizing variations in a corpus of information including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, including: a processor which finds pairs of characteristics which appear together in at least one of the documents, determines an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear, and compares the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and a display which displays an indication of the comparative occurrence values of the pairs.
In a preferred embodiment, the processor finds characteristics selected from a group of automatically determined characteristics.
There is further provided, in accordance with a preferred embodiment of the present invention, a method for selecting a range of values of a variable, including: providing a graphic user interface on a display, including a slide-piece that has an initial dimension and is translatable along an axis representing the variable such that each position of the slide-piece along the axis corresponds to a given value of the variable; positioning the slide-piece at a first position on the axis, so as to indicate a first value of the variable; and changing the dimension of the slide-piece so as to indicate a second value of the variable, whereby the first and second values of the variable define the selected range.
Preferably, changing the dimension of the slide-piece includes changing a length of the slide-piece along the axis. Further preferably, the first and second values of the variable include the extrema of the range.
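The range-selection method above can be modeled, independently of any particular GUI toolkit, as a small data structure. The following Python sketch is illustrative only; the class name `SlidePiece`, its method names, and the choice of weeks as the unit are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SlidePiece:
    """Hypothetical model of a resizable slide-piece on a one-dimensional axis."""

    position: float  # value of the variable at the near end of the slide-piece
    length: float    # dimension of the slide-piece along the axis

    def selected_range(self):
        """The two extrema of the range covered by the slide-piece."""
        return (self.position, self.position + self.length)

    def move_to(self, position):
        """Translate the slide-piece along the axis, keeping its length."""
        self.position = position

    def resize_far_end(self, new_end):
        """Drag the far end; the near end stays fixed."""
        self.length = max(0.0, new_end - self.position)

    def resize_near_end(self, new_start):
        """Drag the near end; the far end stays fixed."""
        end = self.position + self.length
        self.position = min(new_start, end)
        self.length = end - self.position


# Units are weeks along a one-year axis (an assumption for this example).
piece = SlidePiece(position=10.0, length=5.0)
initial = piece.selected_range()
piece.resize_near_end(3.0)
widened = piece.selected_range()
piece.move_to(20.0)
moved = piece.selected_range()
```

Positioning sets the first value of the variable, and changing the length sets the second, so the pair returned by `selected_range` corresponds to the selected range.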
There is still further provided, in accordance with a preferred embodiment of the present invention, a computer program product for visualizing variations in a corpus of information, including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, the documents including text, the program having computer-readable program instructions embodied therein, which instructions cause a computer to: for each of the entries, extract characteristics of information contained therein; find pairs of different characteristics that appear together in at least one of the entries; determine an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear; compare the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and provide an indication of the comparative occurrence values of the pairs.
There is also provided, in accordance with a preferred embodiment of the present invention, a computer program product for selecting a range of values of a variable, the program having computer-readable program instructions embodied therein, which instructions cause a computer to: provide a graphic user interface on a display, including a slide-piece that has an initial dimension and is translatable along an axis representing the variable such that each position of the slide-piece along the axis corresponds to a given value of the variable; position the slide-piece at a first position on the axis, so as to indicate a first value of the variable; and change the dimension of the slide-piece so as to indicate a second value of the variable, whereby the first and second values of the variable define the selected range.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a method for extracting data from a corpus of information, including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, including: for a first one of the entries in a first one of the sub-groups, extracting a characteristic of information contained therein; for a second one of the entries in a second one of the sub-groups, extracting the same characteristic of information; automatically determining respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and providing an indication of the occurrence values.
Preferably, providing the indication includes providing a visual indication of the occurrence values. Further preferably, the differentiating parameter includes a sequence, most preferably a time sequence.
There is still additionally provided, in accordance with a preferred embodiment of the present invention, apparatus for extracting data from a corpus of information including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, including: a processor, which (a) for a first one of the entries in a first one of the sub-groups, extracts a characteristic of information contained therein, (b) for a second one of the entries in a second one of the sub-groups, extracts the same characteristic of information, and (c) automatically determines respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and a display, which provides an indication of the occurrence values.
There is yet additionally provided, in accordance with a preferred embodiment of the present invention, a computer program product for extracting data from a corpus of information, including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, the program having computer-readable program instructions embodied therein, which instructions, when read by a computer, cause the computer to: for a first one of the entries in a first one of the sub-groups, extract a characteristic of information contained therein; for a second one of the entries in a second one of the sub-groups, extract the same characteristic of information; automatically determine respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and provide an indication of the occurrence values.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which:
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic illustration of a system for text mining, in accordance with a preferred embodiment of the present invention;
Fig. 2 is a flow chart illustrating preparation of a trend graph from a corpus of documents, in accordance with a preferred embodiment of the present invention;
Fig. 3 is a schematic view of a text mining input window display, in accordance with a preferred embodiment of the present invention;
Fig. 4A is a schematic view of a trend graph, in accordance with a preferred embodiment of the present invention;
Fig. 4B is a schematic view of a trend graph representing a period following the period represented by the graph of Fig. 4A, in accordance with a preferred embodiment of the present invention;
Fig. 5 is a schematic view of a comparison graph, in accordance with a preferred embodiment of the present invention; and
Fig. 6 is a schematic view of a graphic interface, in accordance with a preferred embodiment of the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Fig. 1 is a schematic illustration of a system 18 for text mining and visualization, in accordance with a preferred embodiment of the present invention. System 18 preferably comprises a memory 22, which stores a corpus of documents from which information is mined. Alternatively or additionally, system 18 comprises a modem 24 or other network connection, through which access is established to collections of documents, which include some or all of the documents in the corpus. System 18 preferably further comprises a computer 20, which mines information from the documents, and a display 26, on which the mined information is displayed.
Fig. 2 is a flow chart illustrating the actions of computer 20 in preparing trend graphs from the corpus of documents, in accordance with a preferred embodiment of the present invention. Preferably, the documents in the corpus are dated and/or time-stamped, and the trend graph represents changes in the corpus as a function of time. Alternatively or additionally, each document is associated with a different ordering parameter value, not necessarily time-related. For example, the corpus of documents may include articles drawn from The Wall Street Journal about high-tech companies, and the ordering parameter may be the average employee salary or the number of employees of the company mentioned most frequently in an article. In this example, a database containing information about employees of high-tech companies would preferably be accessible, either locally or remotely, to computer 20. Alternatively or additionally, a more complex ordering parameter, such as (average employee salary) * (percentage of employees who use a PC) * (percentage of employees who have a college degree), may be used to aid a user in analyzing a very large collection of news articles.
Preferably, computer 20 analyzes each document and prepares for each document a record which represents the document. The record preferably comprises a set of terms which appear in the document, most preferably together with the numbers of occurrences of the terms and/or a parameter which represents the importance of the terms. The records are preferably prepared in accordance with the method described in the article by Feldman, Klosgen, and Zilberstien, which is referenced in the Background of the Invention section of the present patent application. Alternatively or additionally, term extraction methods, term processing methods, and/or graphical display methods described in co-pending US patent application 09/323,491, "Term-Level Text Mining with Taxonomies," filed June 1, 1999, which is assigned to the assignee of the present patent application and is incorporated herein by reference, are used in implementing some embodiments of the present invention.
The records are preferably stored in memory 22 for future text mining, and computer 20 preferably does not need to access the documents again in order to perform additional text mining sessions.
Reference is also made to Fig. 3, which is a schematic view of a text mining input window 38 on display 26, in accordance with a preferred embodiment of the present invention. In defining a text mining session, the user preferably defines a context group on which the session is performed. Preferably, the context group comprises those documents in which one or more selected terms appear in the accumulative or in the alternative, according to user selections. Input window 38 comprises a selection window 40 in which the user selects the terms to define the context group and a selection pad 42, for selecting Boolean operations to be performed on the terms.
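Selecting a context group from the stored records can be sketched as a simple Boolean filter. In the Python example below, "in the accumulative" is read as requiring all selected terms and "in the alternative" as requiring at least one; that reading, the function name `context_group`, the `mode` parameter, and the sample records are assumptions made for illustration.

```python
def context_group(records, terms, mode="any"):
    """Return the ids of documents whose term sets match the selected terms.

    `records` maps a document id to the set of terms found in that document.
    mode="all" requires every selected term (Boolean AND);
    mode="any" requires at least one selected term (Boolean OR).
    """
    selected = set(terms)
    if mode == "all":
        return [doc for doc, found in records.items() if selected <= found]
    if mode == "any":
        return [doc for doc, found in records.items() if selected & found]
    raise ValueError(f"unknown mode: {mode}")


records = {
    "doc1": {"merger", "acquisition", "IBM"},
    "doc2": {"merger", "Microsoft"},
    "doc3": {"earnings", "Sun"},
}
either = context_group(records, ["merger", "acquisition"], mode="any")
both = context_group(records, ["merger", "acquisition"], mode="all")
```

Because the filter works on the per-document records rather than on the documents themselves, it is consistent with the statement above that the documents need not be accessed again for later sessions.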
Preferably, selection window 40 lists all the terms which appear in at least one of the documents of the corpus, and the user selects the terms from the list that will define the context group. Alternatively or additionally, the terms in selection window 40 are determined automatically, as described, for example, in an article by Feldman, et al., entitled "Text Mining at the Term Level," in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (1998), which is incorporated herein by reference. In the example shown in Fig. 3, the context group is defined by terms associated with "merger." Alternatively or additionally, the user may define the context group using any other parameters characterizing the documents in the corpus, including the authorship, origin, length, and date of the documents.
Preferably, an additional selection window 44 enables the user to define types of terms that will be used in generating the context graphs. Ideally it would be desirable not to limit the terms appearing in the graphs. However, in most cases, such an unlimited approach would lead to an excess of meaningless data, for example, appearances of connection words ("and," "the," etc.), in the results. Therefore, window 44 allows the user to select the terms to appear in the results. Preferably, the terms appearing in the results are chosen according to predefined groups, such as companies, personal names, etc. Alternatively or additionally, the terms allowed to appear in the results may be chosen by excluding non-interesting terms.
The user preferably chooses in a window 46 a granularity for the time axis of the documents. The granularity defines the period of time from which all documents are considered to belong to a single group. The granularity may be on the order of months, as shown in Fig. 3, or on the order of hours, days, weeks or years, or substantially any time order.
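Grouping documents by a user-selected time granularity can be sketched as follows. The function names `period_key` and `group_by_period`, the granularity labels, and the sample dates are hypothetical; the sketch covers only day, week, month and year granularities, and uses ISO week numbering as one possible definition of a week.

```python
from collections import defaultdict
from datetime import date


def period_key(day, granularity):
    """Map a date to a sortable key identifying its period."""
    if granularity == "year":
        return (day.year,)
    if granularity == "month":
        return (day.year, day.month)
    if granularity == "week":
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        return (iso[0], iso[1])
    if granularity == "day":
        return (day.year, day.month, day.day)
    raise ValueError(f"unsupported granularity: {granularity}")


def group_by_period(dated_docs, granularity):
    """Group (doc_id, date) pairs into periods, returned in chronological order."""
    groups = defaultdict(list)
    for doc_id, day in dated_docs:
        groups[period_key(day, granularity)].append(doc_id)
    return dict(sorted(groups.items()))


dated_docs = [
    ("a", date(1999, 4, 10)),
    ("b", date(1999, 4, 28)),
    ("c", date(1999, 5, 15)),
]
by_month = group_by_period(dated_docs, "month")
by_year = group_by_period(dated_docs, "year")
```

All documents that share a key are treated as belonging to a single group, which matches the role of the granularity described above.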
After making appropriate selections, the user preferably actuates a compute button 48, which initiates the text mining. Computer 20 searches the records which represent the documents in the context group, in order to find documents in which pairs of two different terms from among the result terms of window 44 both appear. For each pair of terms, computer 20 counts separately for each period of time the number of documents in which the pair appears. Alternatively or additionally, computer 20 assigns each pair of terms an occurrence frequency value which is based on the number of documents in which the pair appears, the number of times each of the terms appears in the documents, and/or a weight given to each term according to its importance. The search is preferably performed as described in the above-mentioned article of Feldman, Klosgen and Zilberstien. Preferably, the results are shown in a table 50, in which two columns 54 show the pairs of terms, and the rest of the columns show the number of documents for each time period.
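The per-period counting that produces a table like table 50 can be sketched as below. This is a simplified illustration that uses plain document counts as the occurrence value; it is not the search procedure of the cited article. The function name `pair_counts_by_period`, the record format, and the period labels are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def pair_counts_by_period(records, result_terms):
    """Count, per period, the documents in which each pair of result terms co-occurs.

    `records` is an iterable of (period, set_of_terms) tuples, one per document.
    Returns {(term_a, term_b): {period: document_count}}.
    """
    allowed = set(result_terms)
    table = defaultdict(lambda: defaultdict(int))
    for period, terms in records:
        # Only terms selected for the results may form pairs.
        for pair in combinations(sorted(set(terms) & allowed), 2):
            table[pair][period] += 1
    return {pair: dict(columns) for pair, columns in table.items()}


records = [
    ("1999-04", {"IBM", "Microsoft", "merger"}),
    ("1999-04", {"IBM", "Microsoft"}),
    ("1999-05", {"IBM", "Microsoft", "Sun"}),
]
table = pair_counts_by_period(records, result_terms={"IBM", "Microsoft", "Sun"})
```

Each key of the result corresponds to the two term columns of a row, and each inner dictionary holds that row's per-period counts.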
Preferably, the rows in table 50 are sorted such that pairs which include common terms appear next to each other. Alternatively, the rows in table 50 are sorted according to the total number of documents in which the respective pairs of terms appear. Further alternatively, the rows in table 50 are sorted according to the appearance of the term pairs in a selected period, i.e., in a column or group of columns in the table. Preferably, only pairs of terms with a relatively high number of appearances are displayed in table 50, and only a predetermined number of pairs of terms are displayed. Alternatively, all pairs which have a number of occurrences above a predefined threshold are displayed. Preferably a button 52 allows the user to see the results in table 50 in a graphic format, as described hereinbelow.
Fig. 4A is a schematic view of a trend graph 60, in accordance with a preferred embodiment of the present invention. Graph 60 preferably represents a single column of table 50, i.e., a single period. Rows of table 50 in which the entry of graph 60 in the single column has a value above a predetermined threshold are referred to herein as active rows. Each term which appears in one of the first two columns 54 of an active row is shown by a node 62 in graph 60. Each of the active rows appears as an edge 64 in graph 60. Alternatively or additionally, other rows of table 50 in which the entry at the column of graph 60 is non-zero are also considered active rows, to be represented by an edge, provided they had a value above the threshold in a previous column of table 50, typically corresponding to a preceding period. Preferably, each edge 64 is displayed along with a weight 66 which is equal to the number of documents in which the pair of terms connected by the edge appears.
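The alternative row orderings described above can be sketched with a single sorting helper. The function name `sort_rows`, its parameters, and the sample table are assumptions. Sorting lexicographically by the pair is used here as a simple approximation of "pairs which include common terms appear next to each other": it only groups rows that share their first term.

```python
def sort_rows(table, by="total", period=None, limit=None):
    """Return (pair, counts) rows of the table in the requested order.

    by="term"   groups rows sharing a first term (lexicographic order of pairs);
    by="total"  orders by total document count, highest first;
    by="period" orders by the count in the selected `period`, highest first.
    `limit` optionally keeps only the first N rows.
    """
    if by == "term":
        key = lambda row: row[0]
    elif by == "total":
        key = lambda row: -sum(row[1].values())
    elif by == "period":
        key = lambda row: -row[1].get(period, 0)
    else:
        raise ValueError(f"unknown sort order: {by}")
    return sorted(table.items(), key=key)[:limit]


table = {
    ("IBM", "Microsoft"): {"Apr": 5, "May": 2},
    ("Microsoft", "Sun"): {"Apr": 1, "May": 9},
    ("IBM", "Sun"): {"Apr": 3, "May": 0},
}
by_total = [pair for pair, _ in sort_rows(table, by="total")]
by_april = [pair for pair, _ in sort_rows(table, by="period", period="Apr")]
top_two = [pair for pair, _ in sort_rows(table, by="total", limit=2)]
```

The `limit` argument corresponds to displaying only a predetermined number of pairs.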
Further preferably, a symbol 68 is displayed next to weight 66 designating the change in the value of the weight relative to the previous column. Alternatively, the symbol designates the change relative to an average of a number of preceding columns. For example, symbol 68 is a "<" if the weight of its edge decreased, a ">" if the weight increased, and a "*" if the weight remains substantially stable. Preferably, weights are considered to increase or decrease only if the change is larger than a predetermined factor, for example, 25%. Edges which change by a factor smaller than the predetermined factor are considered stable. Preferably, new edges and/or edges with increased weights are designated by wider lines than edges which have decreased weights. Alternatively or additionally, other sets of symbols may be used to indicate the changes in the graphs.
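The choice of symbol can be illustrated with a short sketch. The function name `trend_symbol` is hypothetical, and the handling of an edge whose previous weight is zero (treated here as an increase) is an assumption, since the text does not address that case.

```python
from statistics import mean


def trend_symbol(current, previous, factor=0.25):
    """Return '>' for an increase, '<' for a decrease, '*' for a stable weight.

    A change counts only if it is larger than `factor` (25% in the example
    given in the text) relative to the previous weight.
    """
    if previous == 0:
        # Assumption: a new edge with no earlier occurrences counts as increasing.
        return ">" if current > 0 else "*"
    change = (current - previous) / previous
    if change > factor:
        return ">"
    if change < -factor:
        return "<"
    return "*"


stable = trend_symbol(10, 8)        # +25% exactly, not larger than the factor
rising = trend_symbol(11, 8)        # +37.5%
falling = trend_symbol(5, 8)        # -37.5%
# Comparison against the average of several preceding columns.
versus_average = trend_symbol(12, mean([8, 10]))
```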
Fig. 4B is a schematic view of a trend graph 80 representing a period following the period represented by graph 60, in accordance with a preferred embodiment of the present invention. Preferably, nodes 62 which appear in both graphs 60 and 80 are positioned in the same locations in both graphs. Therefore, space is allocated for the nodes that will appear in the graphs representing all the columns of table 50, before displaying any of the graphs. For example, empty space 70 is left in graph 60, to leave room for nodes 72 in graph 80. Thus, it is easy to follow the similarities and changes in the graphs as they are displayed, for example, when successive graphs are displayed in sequence or in pseudo-3D geometrical superposition.
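One way to keep node positions fixed across successive graphs is to compute a single layout over the union of all nodes before drawing anything. The sketch below places the nodes evenly on a unit circle. The circular layout and the function name `shared_layout` are assumptions made for illustration, since the text does not specify how positions are chosen.

```python
import math


def shared_layout(node_sets):
    """Assign one fixed (x, y) position to every node that appears in any graph.

    node_sets: a list with the set of nodes of each period's graph.
    Nodes are placed evenly on a unit circle (an illustrative choice).
    """
    all_nodes = sorted(set().union(*node_sets))
    n = len(all_nodes)
    return {
        node: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, node in enumerate(all_nodes)
    }


# Hypothetical node sets for two successive periods.
first_graph = {"A", "B"}
second_graph = {"A", "B", "C"}
layout = shared_layout([first_graph, second_graph])
# Each graph draws only its own nodes, leaving the other positions empty.
first_positions = {node: layout[node] for node in first_graph}
second_positions = {node: layout[node] for node in second_graph}
```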
Alternatively, the positions of nodes 62 are chosen separately for each graph, arbitrarily or according to the weights of the edges 64 incident on the nodes. For example, nodes 62 having relatively higher sums of weights of the incident edges may be positioned in the center or at the top of the graph. Further alternatively or additionally, the lengths of edges 64 may be used to indicate a desired parameter. For example, the length of the edge may indicate the weight of the edge, while the thickness of the edge indicates its weight relative to one or more previous periods. Preferably, the user can request more information by selecting areas of the graph.
For example, when the user double-clicks on one of edges 64, a window may open with a bar graph, a table or any other indication which shows the weights of the edge as a function of time. Alternatively or additionally, the documents contributing to the selected edge may be listed, allowing the user to read the documents and judge their relevance. Further preferably, the user may request to see the graphs as they change over time in an animation sequence.
Fig. 5 is a schematic view of a comparison graph 100, in accordance with a preferred embodiment of the present invention. Graph 100 compares text mining results of recipe documents from different document groups, for example, documents from two different countries. Each major ingredient in the recipes is designated by a node 102. Nodes 102 which appear together in more than a predetermined threshold number of documents are connected by an edge 104. Each edge is marked with two values, corresponding to appearance of the associated terms in documents from the two different countries. Preferably, the values indicate the percentage of documents from the respective country in which the pair of ingredients connected by the corresponding edge 104 both appear. Alternatively or additionally, the edges 104 are marked with the absolute number of documents. Preferably, edges 106 which correspond to combinations that are more popular in country #1 are displayed differently from edges 108 for combinations which are more popular in country #2. Alternatively or additionally, the edges and values for each country may be displayed in different colors. Thus, it is possible to compare documents from more than two groups. Further alternatively or additionally, only a single value designating the difference between the values of different document groups is displayed with each edge.
Preferably, the user may select which type of display is desired.
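The two values marked on each edge of a comparison graph, and the single difference value mentioned as an alternative, might be computed as in the following sketch. The function `pair_percentages`, the group labels, and the ingredient data are hypothetical.

```python
def pair_percentages(docs_by_group, pair):
    """Percentage of documents in each group that contain both terms of a pair.

    docs_by_group: {group_name: [set_of_terms, ...]} (assumed format).
    """
    first, second = pair
    result = {}
    for group, docs in docs_by_group.items():
        hits = sum(1 for terms in docs if first in terms and second in terms)
        result[group] = 100.0 * hits / len(docs) if docs else 0.0
    return result


# Hypothetical recipe data for two countries.
recipes = {
    "country1": [
        {"garlic", "tomato"},
        {"garlic", "basil"},
        {"tomato", "garlic", "basil"},
        {"rice"},
    ],
    "country2": [
        {"garlic", "tomato"},
        {"rice", "soy"},
        {"rice", "garlic"},
        {"soy", "ginger"},
    ],
}
values = pair_percentages(recipes, ("garlic", "tomato"))
difference = values["country1"] - values["country2"]
```

A positive `difference` would mark the combination as more popular in the first group.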
Fig. 6 is a schematic view of a graphic interface 120, showing sample graphs 122, 124, 126, and 128, for displaying results generated in part using some of the techniques described hereinabove, in accordance with a preferred embodiment of the present invention. Graphs 122 and 124 are, respectively, a "single-term"-centered graph and a bar graph, in which the relationship between a single term ("Microsoft") and a set of other terms ("IBM," "Sun," etc.) is quantitatively displayed. The quantitative relationship shown in graphs 122 and 124 may comprise, for example, the number of news articles containing both the term "Microsoft" and each of the other listed terms during a specified time period.
Using the same analysis as that which generated graphs 122 and 124, graph 126 is displayed to show the most significant relationships among all of the displayed terms. By contrast, graph 128 shows the number of appearances of the term "Microsoft," irrespective of the other companies, during a five week period extending from April 10 to May 15. Preferably, a slide-bar 130 is provided with interface 120, which enables the user to move an enhanced slide-piece 132 between two points on an axis of interest, e.g., time. Slide-bars which perform this limited function are widely available, for example, in Microsoft Windows 98. In prior art slide-bars, the slide-piece is typically moved to indicate, for example, a location in a document, a time, or a color from a range of pages, times, or colors, respectively.
In this embodiment, the length of enhanced slide-piece 132, i.e., the distance between points 134 and 136 in Fig. 6, provides the user with additional information about a parameter of interest. For example, slide-bar 130 in the embodiment shown in Fig. 6 represents a set of relevant news articles spanning one year. The length of enhanced slide-piece 132, as shown, is five weeks, i.e., approximately one tenth the total length of slide-bar 130. Preferably, as the user moves the enhanced slide-piece along the slide-bar, graphs 122, 124, and 126 are continually updated responsive to whatever news articles are contained in a five-week period which is "covered" by the slide-piece.
Further preferably, and completely unlike any slide-bar known in the art, the user is enabled to modify the length of the enhanced slide-piece in real time, so as to cause computer 20 to change the set of articles used in generating the graphs accordingly. For example, dashed lines 148 show a former setting of the slide-piece, in which approximately twelve weeks were represented by the slide-piece. Preferably, the user uses a mouse to grab onto the left side 144 or right side 146 of enhanced slide-piece 132, and changes its length, typically in a manner analogous to the way objects are re-sized in a Windows environment. Notably, however, neither Windows nor any other software provides the improved and intuitive position control provided by enhanced slide-piece 132.
In light of this description of the operation of slide-piece 132, many applications not related to a time axis will become obvious to one skilled in the art. For example, scrolling through a document by moving slide-piece 132 could be enhanced by a "zoom" feature, effectively enabled by changing the size of the slide-piece. Alternatively, whereas a slide-bar which uses prior art technology would allow the user to select a single color from a spectrum, a user of embodiments of the present invention would be additionally enabled to select a range of neighboring colors in an intuitive fashion.
It will be understood by one skilled in the art that aspects of the present invention described hereinabove can be embodied in a computer running software, and that the software can be stored in tangible media, e.g., hard disks, floppy disks or compact disks, or in intangible media, e.g., in an electronic memory, or on a network such as the Internet.
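The behavior of a resizable slide-piece over a time axis can be modeled with a small data structure, sketched below. The class `SlidePiece`, its method names, the half-open interval convention, and the year used in the sample dates are all assumptions. The sketch models only the selection logic, not the graphical control.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SlidePiece:
    """A window on a time axis, defined by its left edge and its length."""
    start: date
    length: timedelta

    @property
    def end(self):
        return self.start + self.length

    def move(self, delta):
        """Translate the whole piece along the axis without resizing it."""
        self.start += delta

    def drag_right_edge(self, delta):
        """Resize by moving the right edge; the length never goes negative."""
        self.length = max(self.length + delta, timedelta(0))

    def drag_left_edge(self, delta):
        """Resize by moving the left edge; the right edge stays in place."""
        delta = min(delta, self.length)
        self.start += delta
        self.length -= delta

    def covers(self, day):
        # Assumption: the window is half-open, [start, end).
        return self.start <= day < self.end


def selected(articles, piece):
    """Titles of the articles whose dates fall inside the slide-piece."""
    return [title for day, title in articles if piece.covers(day)]


# Hypothetical articles; the year 2000 is chosen only for illustration.
articles = [(date(2000, 4, 12), "a"), (date(2000, 5, 20), "b"), (date(2000, 6, 1), "c")]
piece = SlidePiece(start=date(2000, 4, 10), length=timedelta(weeks=5))
initial = selected(articles, piece)          # window: April 10 to May 15
piece.drag_right_edge(timedelta(weeks=1))
widened = selected(articles, piece)          # window: April 10 to May 22
piece.move(timedelta(weeks=2))
moved = selected(articles, piece)            # window: April 24 to June 5
```

After each change, the graphs would be regenerated from the articles returned by `selected`.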
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof that are not in the prior art, which would occur to persons skilled in the art upon reading the foregoing description.

Claims

1. A method for visualizing variations in a corpus of information, including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, comprising: for each of the entries, extracting characteristics of information contained therein; finding pairs of different characteristics that appear together in at least one of the entries; determining an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear; comparing the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and providing an indication of the comparative occurrence values of the pairs.
2. A method according to claim 1, wherein the entries comprise text documents, and wherein the characteristics comprise terms appearing in the documents.
3. A method according to claim 1, wherein determining the occurrence value comprises counting the number of entries in which the pair appears.
4. A method according to claim 1, wherein finding the pairs of characteristics comprises finding pairs of characteristics which appear together in at least a predetermined number of the entries.
5. A method according to claim 1, wherein finding the pairs of characteristics comprises finding pairs of characteristics which appear together in at least two of the sub-groups.
6. A method according to claim 1, wherein extracting the characteristics comprises automatically mining the corpus to extract characteristics therefrom.
7. A method according to any one of claims 1-6, wherein the differentiating parameter defines an order, and wherein comparing the occurrence values comprises comparing the occurrence values in a first sub-group with the occurrence values in one or more previous sub-groups in the order.
8. A method according to claim 7, wherein comparing the occurrence values comprises comparing the occurrence values in the first sub-group with the occurrence values in a closest previous sub-group.
9. A method according to claim 7, wherein comparing the occurrence values comprises comparing the occurrence values in the first sub-group with an average of the occurrence values in the one or more previous sub-groups.
10. A method according to claim 7, wherein providing the indication comprises displaying a symbol which indicates a measure of evolution in the occurrence value in the first sub-group relative to the occurrence values in the one or more previous sub-groups in the order.
11. A method according to any one of claims 1-6, wherein providing the indication comprises displaying a table.
12. A method according to any one of claims 1-6, wherein providing the indication comprises displaying a graph.
13. A method according to claim 12, wherein displaying the graph comprises displaying a graph in which each term is represented by a node, the pairs of characteristics that are found are represented by edges, and substantially each edge is associated with the indication of the comparative appearance of the respective pair.
14. A method according to claim 13, wherein displaying the graph comprises displaying with substantially each edge a weight of the edge, which equals the occurrence value of the respective pair in a first sub-group.
15. A method according to claim 13, wherein displaying the graph comprises displaying the graph such that the lengths of the edges represent the occurrence value of the respective pair in a first sub-group.
16. A method according to claim 12, wherein displaying the graph comprises displaying for each two sub-groups a graph which compares the occurrence values in the two sub-groups.
17. A method according to claim 16, wherein displaying the graph for each two sub-groups comprises displaying the graphs such that nodes which represent the same term are displayed in substantially the same relative location.
18. A method according to claim 16, wherein the graphs of each two sub-groups are displayed as an animation sequence.
19. A method according to claim 12, wherein displaying the graph comprises displaying a plurality of superimposed graphs, each of which represents the appearances of the pairs in a different sub-group.
20. A method according to claim 19, wherein displaying the plurality of superimposed graphs comprises displaying each of the graphs in a different color.
21. A method according to any one of claims 1-6, wherein providing the indication of the comparative values of the pairs comprises providing an indication wherein pairs having a characteristic in common are grouped together.
22. Apparatus for visualizing variations in a corpus of information including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, comprising: a processor which finds pairs of characteristics which appear together in at least one of the documents, determines an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear, and compares the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and a display which displays an indication of the comparative occurrence values of the pairs.
23. Apparatus according to claim 22, wherein the entries comprise text documents, and wherein the characteristics comprise terms appearing in the documents.
24. Apparatus according to claim 22, wherein the occurrence value comprises the number of entries in which the pair appears.
25. Apparatus according to claim 22, wherein the processor finds those pairs of characteristics which appear together in at least a predetermined number of entries.
26. Apparatus according to claim 22, wherein the processor finds those pairs of characteristics which appear together in at least two of the sub-groups.
27. Apparatus according to claim 22, wherein the processor finds characteristics selected from a group of automatically determined characteristics.
28. Apparatus according to any one of claims 22-27, wherein the differentiating parameter defines an order and wherein the processor compares the occurrence values in a first sub-group with the occurrence values in one or more previous sub-groups in the order.
29. Apparatus according to claim 28, wherein the processor compares the occurrence values in the first sub-group with the occurrence values in a closest previous sub-group.
30. Apparatus according to claim 28, wherein the processor compares the occurrence values in the first sub-group with an average of the occurrence values in the one or more previous sub-groups.
31. Apparatus according to claim 28, wherein the display displays a symbol which indicates a measure of evolution in the occurrence values in the first sub-group relative to the occurrence values in the one or more previous sub-groups in the order.
32. Apparatus according to any one of claims 22-27, wherein the display displays a table.
33. Apparatus according to any one of claims 22-27, wherein the display displays a graph.
34. Apparatus according to claim 33, wherein each node in the graph represents a term and each edge represents a found pair of characteristics, and substantially each edge is associated with the indication of the comparative appearance of the respective pair.
35. Apparatus according to claim 34, wherein the graph comprises with substantially each edge a weight of the edge which equals the occurrence value of the respective pair in a first sub-group.
36. Apparatus according to claim 34, wherein the graph comprises a graph in which the lengths of the edges represent the occurrence values of the respective pairs in a first sub-group.
37. Apparatus according to claim 33, wherein the graph comprises a plurality of graphs each of which compares the occurrence values of the pairs in two sub-groups.
38. Apparatus according to claim 37, wherein the plurality of graphs comprise graphs such that nodes which represent the same term are displayed in substantially the same relative location.
39. Apparatus according to claim 37, wherein the plurality of graphs are displayed as an animation sequence.
40. Apparatus according to claim 33, wherein the graph comprises a plurality of superimposed graphs each of which represents the occurrence values of the pairs in a different sub-group.
41. Apparatus according to claim 40, wherein the plurality of superimposed graphs comprise a plurality of superimposed graphs in which each of the graphs is displayed in a different color.
42. Apparatus according to any one of claims 22-27, wherein the display displays the pairs such that pairs which have common characteristics are grouped together.
43. A method for selecting a range of values of a variable, comprising: providing a graphic user interface on a display, including a slide-piece that has an initial dimension and is translatable along an axis representing the variable such that each position of the slide-piece along the axis corresponds to a given value of the variable; positioning the slide-piece at a first position on the axis, so as to indicate a first value of the variable; and changing the dimension of the slide-piece so as to indicate a second value of the variable, whereby the first and second values of the variable define the selected range.
44. A method according to claim 43, wherein changing the dimension of the slide-piece comprises changing a length of the slide-piece along the axis.
45. A method according to claim 43 or claim 44, wherein the first and second values of the variable comprise the extrema of the range.
46. A computer program product for visualizing variations in a corpus of information, including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, the documents including text, the program having computer-readable program instructions embodied therein, which instructions, when read by a computer, cause the computer to: for each of the entries, extract characteristics of information contained therein; find pairs of different characteristics that appear together in at least one of the entries; determine an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear; compare the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and provide an indication of the comparative occurrence values of the pairs.
47. A computer program product for selecting a range of values of a variable, the program having computer-readable program instructions embodied therein, which instructions, when read by a computer, cause the computer to: provide a graphic user interface on a display, including a slide-piece that has an initial dimension and is translatable along an axis representing the variable such that each position of the slide-piece along the axis corresponds to a given value of the variable; position the slide-piece at a first position on the axis, so as to indicate a first value of the variable; and change the dimension of the slide-piece so as to indicate a second value of the variable, whereby the first and second values of the variable define the selected range.
48. A method for extracting data from a corpus of information including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, comprising: for a first one of the entries in a first one of the sub-groups, extracting a characteristic of information contained therein; for a second one of the entries in a second one of the sub-groups, extracting the same characteristic of information; automatically determining respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and providing an indication of the occurrence values.
49. A method according to claim 48, wherein providing the indication comprises providing a visual indication of the occurrence values.
50. A method according to claim 48, wherein the entries comprise text documents, and wherein the characteristic comprises a term appearing in the documents.
51. A method according to any one of claims 48-50, wherein the differentiating parameter comprises a sequence.
52. A method according to claim 51, wherein the sequence comprises a time sequence.
53. Apparatus for extracting data from a corpus of information including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, comprising: a processor, which (a) for a first one of the entries in a first one of the sub-groups extracts a characteristic of information contained therein, (b) for a second one of the entries in a second one of the sub-groups, extracts the same characteristic of information, and (c) automatically determines respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and a display, which provides an indication of the occurrence values.
54. Apparatus according to claim 53, wherein the display provides a visual indication of the occurrence values.
55. Apparatus according to claim 53, wherein the entries comprise text documents, and wherein the characteristic comprises a term appearing in the documents.
56. Apparatus according to any one of claims 53-55, wherein the differentiating parameter comprises a sequence.
57. Apparatus according to claim 56, wherein the sequence comprises a time sequence.
58. A computer program product for extracting data from a corpus of information, including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, the program having computer-readable program instructions embodied therein, which instructions, when read by a computer, cause the computer to: for a first one of the entries in a first one of the sub-groups, extract a characteristic of information contained therein; for a second one of the entries in a second one of the sub-groups, extract the same characteristic of information; automatically determine respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and provide an indication of the occurrence values.
59. A product according to claim 58, wherein providing the indication comprises providing a visual indication of the occurrence values.
60. A product according to claim 58, wherein the entries comprise text documents, and wherein the characteristic comprises a term appearing in the documents.
61. A product according to any one of claims 58-60, wherein the differentiating parameter comprises a sequence.
62. A product according to claim 61, wherein the sequence comprises a time sequence.
PCT/IL2000/000582 1999-09-20 2000-09-19 Determining trends using text mining WO2001022280A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU74428/00A AU7442800A (en) 1999-09-20 2000-09-19 Determining trends using text mining

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/399,618 US6532469B1 (en) 1999-09-20 1999-09-20 Determining trends using text mining
US09/399,618 1999-09-20

Publications (2)

Publication Number Publication Date
WO2001022280A2 true WO2001022280A2 (en) 2001-03-29
WO2001022280A3 WO2001022280A3 (en) 2002-12-05

Family

ID=23580252

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2000/000582 WO2001022280A2 (en) 1999-09-20 2000-09-19 Determining trends using text mining

Country Status (3)

Country Link
US (1) US6532469B1 (en)
AU (1) AU7442800A (en)
WO (1) WO2001022280A2 (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1436737A1 (en) * 2001-10-18 2004-07-14 Handysoft Co., Ltd. Workflow mining system and method
US7283951B2 (en) 2001-08-14 2007-10-16 Insightful Corporation Method and system for enhanced data searching
US7398201B2 (en) 2001-08-14 2008-07-08 Evri Inc. Method and system for enhanced data searching
US7493253B1 (en) 2002-07-12 2009-02-17 Language And Computing, Inc. Conceptual world representation natural language understanding system and method
US7526425B2 (en) 2001-08-14 2009-04-28 Evri Inc. Method and system for extending keyword searching to syntactically and semantically annotated data
US7788251B2 (en) 2005-10-11 2010-08-31 Ixreveal, Inc. System, method and computer program product for concept-based searching and analysis
US7831559B1 (en) 2001-05-07 2010-11-09 Ixreveal, Inc. Concept-based trends and exceptions tracking
US8838633B2 (en) 2010-08-11 2014-09-16 Vcvc Iii Llc NLP-based sentiment analysis
US8856096B2 (en) 2005-11-16 2014-10-07 Vcvc Iii Llc Extending keyword searching to syntactically and semantically annotated data
US9092416B2 (en) 2010-03-30 2015-07-28 Vcvc Iii Llc NLP-based systems and methods for providing quotations
US9116995B2 (en) 2011-03-30 2015-08-25 Vcvc Iii Llc Cluster-based identification of news stories
US9245243B2 (en) 2009-04-14 2016-01-26 Ureveal, Inc. Concept-based analysis of structured and unstructured data using concept inheritance
US9405848B2 (en) 2010-09-15 2016-08-02 Vcvc Iii Llc Recommending mobile device activities
US9613004B2 (en) 2007-10-17 2017-04-04 Vcvc Iii Llc NLP-based entity recognition and disambiguation
US9710556B2 (en) 2010-03-01 2017-07-18 Vcvc Iii Llc Content recommendation based on collections of entities
US9934313B2 (en) 2007-03-14 2018-04-03 Fiver Llc Query templates and labeled search tip system, methods and techniques
USRE46973E1 (en) 2001-05-07 2018-07-31 Ureveal, Inc. Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information
US10049150B2 (en) 2010-11-01 2018-08-14 Fiver Llc Category-based content recommendation

Families Citing this family (74)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8352400B2 (en) 1991-12-23 2013-01-08 Hoffberg Steven M Adaptive pattern recognition based controller apparatus and method and human-factored interface therefore
US7966078B2 (en) 1999-02-01 2011-06-21 Steven Hoffberg Network media appliance system and method
US6757866B1 (en) * 1999-10-29 2004-06-29 Verizon Laboratories Inc. Hyper video: information retrieval using text from multimedia
US6569206B1 (en) * 1999-10-29 2003-05-27 Verizon Laboratories Inc. Facilitation of hypervideo by automatic IR techniques in response to user requests
US6996775B1 (en) * 1999-10-29 2006-02-07 Verizon Laboratories Inc. Hypervideo: information retrieval using time-related multimedia:
US6493707B1 (en) * 1999-10-29 2002-12-10 Verizon Laboratories Inc. Hypervideo: information retrieval using realtime buffers
JP2001318939A (en) * 2000-05-09 2001-11-16 Hitachi Ltd Method and device for processing document and medium storing processing program
US6718323B2 (en) * 2000-08-09 2004-04-06 Hewlett-Packard Development Company, L.P. Automatic method for quantifying the relevance of intra-document search results
GB2367225B (en) * 2000-09-26 2002-08-07 Oracle Corp User interface
US7627588B1 (en) 2001-05-07 2009-12-01 Ixreveal, Inc. System and method for concept based analysis of unstructured data
AU2003230990B2 (en) * 2002-04-19 2008-09-18 Computer Associates Think, Inc. System and method for navigating search results
US20030229470A1 (en) * 2002-06-10 2003-12-11 Nenad Pejic System and method for analyzing patent-related information
TWI221989B (en) * 2002-12-24 2004-10-11 Ind Tech Res Inst Example-based concept-oriented data extraction method
US8243636B2 (en) 2003-05-06 2012-08-14 Apple Inc. Messaging system and service
WO2005008358A2 (en) * 2003-07-22 2005-01-27 Kinor Technologies Inc. Information access using ontologies
US20050055265A1 (en) * 2003-09-05 2005-03-10 Mcfadden Terrence Paul Method and system for analyzing the usage of an expression
JP2005122295A (en) * 2003-10-14 2005-05-12 Fujitsu Ltd Relationship figure creation program, relationship figure creation method, and relationship figure generation device
US20050273839A1 (en) * 2004-06-02 2005-12-08 Nokia Corporation System and method for automated context-based data presentation
EP1835455A1 (en) * 2005-01-05 2007-09-19 Musicstrands, S.A.U. System and method for recommending multimedia elements
US7693887B2 (en) * 2005-02-01 2010-04-06 Strands, Inc. Dynamic identification of a new set of media items responsive to an input mediaset
US7734569B2 (en) 2005-02-03 2010-06-08 Strands, Inc. Recommender system for identifying a new set of media items responsive to an input set of media items and knowledge base metrics
US7797321B2 (en) 2005-02-04 2010-09-14 Strands, Inc. System for browsing through a music catalog using correlation metrics of a knowledge base of mediasets
US7840570B2 (en) * 2005-04-22 2010-11-23 Strands, Inc. System and method for acquiring and adding data on the playing of elements or multimedia files
US8290962B1 (en) * 2005-09-28 2012-10-16 Google Inc. Determining the relationship between source code bases
US7877387B2 (en) * 2005-09-30 2011-01-25 Strands, Inc. Systems and methods for promotional media item selection and promotional program unit generation
US20090070267A9 (en) * 2005-09-30 2009-03-12 Musicstrands, Inc. User programmed media delivery service
US7739254B1 (en) * 2005-09-30 2010-06-15 Google Inc. Labeling events in historic news
JP4955690B2 (en) 2005-10-04 2012-06-20 アップル インコーポレイテッド Method and apparatus for visualizing a music library
EP1963957A4 (en) 2005-12-19 2009-05-06 Strands Inc User-to-user recommender
US20070162546A1 (en) * 2005-12-22 2007-07-12 Musicstrands, Inc. Sharing tags among individual user media libraries
US8271542B1 (en) 2006-01-03 2012-09-18 Robert V London Metadata producer
US7676485B2 (en) * 2006-01-20 2010-03-09 Ixreveal, Inc. Method and computer program product for converting ontologies into concept semantic networks
US20070244880A1 (en) * 2006-02-03 2007-10-18 Francisco Martin Mediaset generation system
JP5075132B2 (en) 2006-02-10 2012-11-14 アップル インコーポレイテッド System and method for prioritizing mobile media player files
WO2007092053A1 (en) * 2006-02-10 2007-08-16 Strands, Inc. Dynamic interactive entertainment
US8521611B2 (en) * 2006-03-06 2013-08-27 Apple Inc. Article trading among members of a community
US20070213965A1 (en) * 2006-03-10 2007-09-13 American Chemical Society Method and system for preclassification and clustering of chemical substances
US20070211059A1 (en) * 2006-03-10 2007-09-13 American Chemical Society Method and system for substance relationship visualization
US7410921B2 (en) * 2006-04-11 2008-08-12 Corning Incorporated High thermal expansion cyclosilicate glass-ceramics
US20070244859A1 (en) * 2006-04-13 2007-10-18 American Chemical Society Method and system for displaying relationship between structured data and unstructured data
US20080033587A1 (en) * 2006-08-03 2008-02-07 Keiko Kurita A system and method for mining data from high-volume text streams and an associated system and method for analyzing mined data
CN101611401B (en) * 2006-10-20 2012-10-03 苹果公司 Personal music recommendation mapping
US8335998B1 (en) * 2006-12-29 2012-12-18 Global Prior Art, Inc. Interactive global map
US20080176618A1 (en) * 2007-01-19 2008-07-24 Waterleaf Limited Method and System for Presenting Electronic Casino Games to a Player
US8671000B2 (en) * 2007-04-24 2014-03-11 Apple Inc. Method and arrangement for providing content to multimedia devices
US8700604B2 (en) 2007-10-17 2014-04-15 Evri, Inc. NLP-based content recommender
JPWO2009101954A1 (en) * 2008-02-15 2011-06-09 日本電気株式会社 Text information analysis system
US20090276368A1 (en) * 2008-04-28 2009-11-05 Strands, Inc. Systems and methods for providing personalized recommendations of products and services based on explicit and implicit user data and feedback
US20090276351A1 (en) * 2008-04-30 2009-11-05 Strands, Inc. Scaleable system and method for distributed prediction markets
EP2304597A4 (en) * 2008-05-31 2012-10-31 Apple Inc Adaptive recommender technology
US20090299945A1 (en) * 2008-06-03 2009-12-03 Strands, Inc. Profile modeling for sharing individual user preferences
JP5536991B2 (en) * 2008-06-10 2014-07-02 任天堂株式会社 GAME DEVICE, GAME DATA DISTRIBUTION SYSTEM, AND GAME PROGRAM
US9496003B2 (en) 2008-09-08 2016-11-15 Apple Inc. System and method for playlist generation based on similarity data
US8332406B2 (en) 2008-10-02 2012-12-11 Apple Inc. Real-time visualization of user consumption of media items
US20100169328A1 (en) * 2008-12-31 2010-07-01 Strands, Inc. Systems and methods for making recommendations using model-based collaborative filtering with user communities and items collections
US8321398B2 (en) * 2009-07-01 2012-11-27 Thomson Reuters (Markets) Llc Method and system for determining relevance of terms in text documents
US20110015921A1 (en) * 2009-07-17 2011-01-20 Minerva Advisory Services, Llc System and method for using lingual hierarchy, connotation and weight of authority
US20110029928A1 (en) * 2009-07-31 2011-02-03 Apple Inc. System and method for displaying interactive cluster-based media playlists
US20110060738A1 (en) * 2009-09-08 2011-03-10 Apple Inc. Media item clustering based on similarity data
US9311392B2 (en) * 2010-02-12 2016-04-12 Nec Corporation Document analysis apparatus, document analysis method, and computer-readable recording medium
US8983905B2 (en) 2011-10-03 2015-03-17 Apple Inc. Merging playlists from multiple sources
CN104054075A (en) * 2011-12-06 2014-09-17 派赛普申合伙公司 Text mining, analysis and output system
JP6221323B2 (en) 2013-04-22 2017-11-01 カシオ計算機株式会社 Graph display device and control program thereof
JP6318615B2 (en) 2013-12-27 2018-05-09 カシオ計算機株式会社 Graph display control device, electronic device, and program
JP6287412B2 (en) 2014-03-19 2018-03-07 カシオ計算機株式会社 Graphic drawing apparatus, graphic drawing method and program
JP6318822B2 (en) * 2014-04-24 2018-05-09 カシオ計算機株式会社 Graph display control device, graph display control method, and program
JP6394163B2 (en) 2014-08-07 2018-09-26 カシオ計算機株式会社 Graph display device, graph display method and program
US10915543B2 (en) 2014-11-03 2021-02-09 SavantX, Inc. Systems and methods for enterprise data search and analysis
US10360229B2 (en) 2014-11-03 2019-07-23 SavantX, Inc. Systems and methods for enterprise data search and analysis
US11328128B2 (en) 2017-02-28 2022-05-10 SavantX, Inc. System and method for analysis and navigation of data
EP3590053A4 (en) * 2017-02-28 2020-11-25 SavantX, Inc. System and method for analysis and navigation of data
US9996527B1 (en) * 2017-03-30 2018-06-12 International Business Machines Corporation Supporting interactive text mining process with natural language and dialog
US10936653B2 (en) 2017-06-02 2021-03-02 Apple Inc. Automatically predicting relevant contexts for media items
CN108319698B (en) * 2018-02-02 2021-01-15 华中科技大学 Game-based flow graph dividing method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10143238A (en) * 1996-11-13 1998-05-29 Mitsubishi Electric Corp Plant monitoring device
WO1999005614A1 (en) * 1997-07-23 1999-02-04 Datops S.A. Information mining tool
JP2000172701A (en) * 1998-12-04 2000-06-23 Fujitsu Ltd Document data providing device, document data providing system, document data providing method and storage medium recording program providing document data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2175187A1 (en) 1993-10-28 1995-05-04 William K. Thomson Database search summary with user determined characteristics
US6029195A (en) * 1994-11-29 2000-02-22 Herz; Frederick S. M. System for customized electronic identification of desirable objects

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PATENT ABSTRACTS OF JAPAN vol. 1998, no. 10, 31 August 1998 (1998-08-31) & JP 10 143238 A (MITSUBISHI ELECTRIC CORP), 29 May 1998 (1998-05-29) *
PATENT ABSTRACTS OF JAPAN vol. 2000, no. 09, 13 October 2000 (2000-10-13) & JP 2000 172701 A (FUJITSU LTD), 23 June 2000 (2000-06-23) *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
USRE46973E1 (en) 2001-05-07 2018-07-31 Ureveal, Inc. Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information
US7831559B1 (en) 2001-05-07 2010-11-09 Ixreveal, Inc. Concept-based trends and exceptions tracking
US7890514B1 (en) 2001-05-07 2011-02-15 Ixreveal, Inc. Concept-based searching of unstructured objects
US8131540B2 (en) 2001-08-14 2012-03-06 Evri, Inc. Method and system for extending keyword searching to syntactically and semantically annotated data
US7283951B2 (en) 2001-08-14 2007-10-16 Insightful Corporation Method and system for enhanced data searching
US7398201B2 (en) 2001-08-14 2008-07-08 Evri Inc. Method and system for enhanced data searching
US7526425B2 (en) 2001-08-14 2009-04-28 Evri Inc. Method and system for extending keyword searching to syntactically and semantically annotated data
US7953593B2 (en) 2001-08-14 2011-05-31 Evri, Inc. Method and system for extending keyword searching to syntactically and semantically annotated data
EP1436737A1 (en) * 2001-10-18 2004-07-14 Handysoft Co., Ltd. Workflow mining system and method
EP1436737A4 (en) * 2001-10-18 2009-08-12 Handysoft Co Ltd Workflow mining system and method
US7493253B1 (en) 2002-07-12 2009-02-17 Language And Computing, Inc. Conceptual world representation natural language understanding system and method
US8442814B2 (en) 2002-07-12 2013-05-14 Nuance Communications, Inc. Conceptual world representation natural language understanding system and method
US8812292B2 (en) 2002-07-12 2014-08-19 Nuance Communications, Inc. Conceptual world representation natural language understanding system and method
US9292494B2 (en) 2002-07-12 2016-03-22 Nuance Communications, Inc. Conceptual world representation natural language understanding system and method
US7788251B2 (en) 2005-10-11 2010-08-31 Ixreveal, Inc. System, method and computer program product for concept-based searching and analysis
US8856096B2 (en) 2005-11-16 2014-10-07 Vcvc Iii Llc Extending keyword searching to syntactically and semantically annotated data
US9378285B2 (en) 2005-11-16 2016-06-28 Vcvc Iii Llc Extending keyword searching to syntactically and semantically annotated data
US9934313B2 (en) 2007-03-14 2018-04-03 Fiver Llc Query templates and labeled search tip system, methods and techniques
US9613004B2 (en) 2007-10-17 2017-04-04 Vcvc Iii Llc NLP-based entity recognition and disambiguation
US10282389B2 (en) 2007-10-17 2019-05-07 Fiver Llc NLP-based entity recognition and disambiguation
US9245243B2 (en) 2009-04-14 2016-01-26 Ureveal, Inc. Concept-based analysis of structured and unstructured data using concept inheritance
US9710556B2 (en) 2010-03-01 2017-07-18 Vcvc Iii Llc Content recommendation based on collections of entities
US9092416B2 (en) 2010-03-30 2015-07-28 Vcvc Iii Llc NLP-based systems and methods for providing quotations
US10331783B2 (en) 2010-03-30 2019-06-25 Fiver Llc NLP-based systems and methods for providing quotations
US8838633B2 (en) 2010-08-11 2014-09-16 Vcvc Iii Llc NLP-based sentiment analysis
US9405848B2 (en) 2010-09-15 2016-08-02 Vcvc Iii Llc Recommending mobile device activities
US10049150B2 (en) 2010-11-01 2018-08-14 Fiver Llc Category-based content recommendation
US9116995B2 (en) 2011-03-30 2015-08-25 Vcvc Iii Llc Cluster-based identification of news stories

Also Published As

Publication number Publication date
US6532469B1 (en) 2003-03-11
WO2001022280A3 (en) 2002-12-05
AU7442800A (en) 2001-04-24

Similar Documents

Publication Publication Date Title
US6532469B1 (en) Determining trends using text mining
US7840524B2 (en) Method and apparatus for indexing, searching and displaying data
JP3001460B2 (en) Document classification device
US8555182B2 (en) Interface for managing search term importance relationships
US5987460A (en) Document retrieval-assisting method and system for the same and document retrieval service using the same with document frequency and term frequency
US6772148B2 (en) Classification of information sources using graphic structures
EP1384170B1 (en) Search user interface with enhanced accessibility and ease-of-use features based on visual metaphors
US7111253B2 (en) Method and apparatus for displaying hierarchical information
US7280957B2 (en) Method and apparatus for generating overview information for hierarchically related information
EP0722145A1 (en) Information retrieval system and method of operation
JP3614618B2 (en) Document search support method and apparatus, and document search service using the same
US20040133555A1 (en) Systems and methods for organizing data
US20060230334A1 (en) Visual thesaurus as applied to media clip searching
WO2002054287A2 (en) Multi-query data visualization
CN106951554B (en) Hierarchical news hotspot and evolution mining and visualization method thereof
US7107550B2 (en) Method and apparatus for segmenting hierarchical information for display purposes
US5548699A (en) Apparatus for presenting information according to evaluations of units of the information
CN110852059B (en) Document content difference contrast visual analysis method based on grouping
JP2000010986A (en) Retrieval support method for document data base and storage medium where program thereof is stored
Hirzalla et al. A multimedia query specification language
Bonnel et al. Meaning metaphor for visualizing search results
Spangler et al. Mindmap: Utilizing multiple taxonomies and visualization to understand a document collection
Rao et al. Natural technologies for knowledge work: information visualization and knowledge extraction
JPH07121565A (en) Information presenting device
JPH10162011A (en) Information retrieval method, information retrieval system, information retrieval terminal equipment, and information retrieval device

Legal Events

Date Code Title Description
AK Designated states
    Kind code of ref document: A2
    Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW
AL Designated countries for regional patents
    Kind code of ref document: A2
    Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 2004-01-01)
122 EP: PCT application non-entry in European phase
AK Designated states
    Kind code of ref document: A3
    Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW
AL Designated countries for regional patents
    Kind code of ref document: A3
    Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

NENP Non-entry into the national phase
    Ref country code: JP