WO2016200359A1 - Term scores - Google Patents

Term scores

Info

Publication number
WO2016200359A1
Authority
WO
WIPO (PCT)
Prior art keywords
term
terms
text
engine
segment
Prior art date
Application number
PCT/US2015/034590
Other languages
French (fr)
Inventor
Shanchan WU
Steven J. Simske
Original Assignee
Hewlett-Packard Development Company, L.P
Priority date
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P.
Priority to PCT/US2015/034590
Publication of WO2016200359A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 - Browsing; Visualisation therefor
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Examples disclosed herein relate, among other things, to a method. The method may include extracting a term from a segment of a text, where the text includes a plurality of segments, adding the term to a network of terms by connecting the term to a set of other terms extracted from the segment, and determining a score of the term based on the term's centrality within the network and/or based on the term's frequency within a corpus of texts.

Description

TERM SCORES
BACKGROUND
[0001] With today's ever-growing availability of digital texts such as articles, textbooks, fiction or non-fiction books, etc., it may be useful to automatically obtain a set of key terms associated with each text. The key terms may allow users, for example, to quickly determine the main concepts and topics discussed in the text. Determining the set of representative key terms may also enable better indexing, classification, sequencing, and querying of digital texts.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] The following detailed description references the drawings, wherein:
[0003] FIG. 1 is a block diagram of an example computing system;
[0004] FIG. 2 is another block diagram of an example computing system;
[0005] FIG. 3A illustrates an example arrangement of terms in segments;
[0006] FIG. 3B illustrates an example network of terms;
[0007] FIG. 4A is another illustration of an example arrangement of terms in segments;
[0008] FIG. 4B is another illustration of an example network of terms;
[0009] FIG. 5 illustrates an example display of an example computing system;
[0010] FIG. 6 shows a flowchart of an example method; and
[0011] FIG. 7 is a block diagram of an example computing device.
DETAILED DESCRIPTION
[0012] As mentioned above, determining a set of key terms that accurately represent or categorize a digital text may allow the readers to quickly determine the main concepts and topics discussed in the text. Such key terms may be extracted from the text itself. To determine whether a particular term (e.g., a word or a sequence of words) in the text should be used as a key term, the term's frequency of appearance in the text may be compared to the term's frequency of appearance in a larger body or "a corpus" of texts. For example, a term's frequency (TF) within a given text may be multiplied by the term's inverse document frequency (IDF), where IDF corresponds to the inverse of the term's frequency in a corpus of texts. A higher TF*IDF value may indicate that the term appears in the text unusually often, which may indicate that the term has an unusually high importance in the current text and could potentially be a good (descriptive) key term. Relying only on the TF*IDF may not produce the best key terms for some types of texts. For example, a textbook may often use a very particular and specialized vocabulary, making it difficult to determine and obtain a corpus of texts that would enable the TF*IDF measure to produce descriptive key terms.
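The TF*IDF weighting sketched above can be illustrated with a few lines of Python. This is a minimal, generic sketch rather than the method of this disclosure; the tokenizer, the toy corpus, and the sample text are hypothetical.

```python
# Minimal TF*IDF sketch (illustrative only; tokenizer and corpus are toy examples).
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(term, text, corpus):
    tokens = tokenize(text)
    tf = Counter(tokens)[term] / max(len(tokens), 1)      # term frequency within the text
    docs_with_term = sum(term in tokenize(doc) for doc in corpus)
    idf = math.log(len(corpus) / (1 + docs_with_term))    # inverse document frequency
    return tf * idf

corpus = ["the kernel schedules processes", "the cache stores pages", "the scheduler runs tasks"]
print(tf_idf("kernel", "the kernel manages the cpu and the kernel schedules tasks", corpus))
```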
[0013] Examples disclosed herein describe, among other things, a computing system. The computing system may include, among other things, a term extraction engine to extract a plurality of terms from a text, where the text includes a plurality of segments; a network construction engine to connect each term in the plurality of terms to each term included in any segment, in the plurality of segments, in which the term is also included; and a term ranking engine to calculate a score for each term in the plurality of terms based on the term's connections to other terms in the plurality of terms.
[0014] FIG. 1 is a block diagram of an example computing system 100. Computing system 100 may include one or more computing devices, where a computing device may include a smartphone, cell phone, tablet, laptop, desktop, server, application-specific computing device, any other processing device or equipment, or a combination thereof. Computing system 100 may include a term extraction engine 112, a network construction engine 113, and a term ranking engine 114. FIG. 2 is a block diagram of another example computing system 100. As illustrated in FIG. 2, in some examples, computing system 100 may also include a network visualization engine 115, a memory 116, and a display 118.
[0015] Display 118 may be embedded into computing system 100 or communicatively coupled to computing system 100, and may be implemented using any suitable technology, such as LCD, LED, OLED, TFT, Plasma, etc. In some examples, display 118 may be a touch-sensitive display. Memory 116 may also be embedded in computing system 100 or communicatively coupled thereto, and may include any type of volatile or non-volatile memory, such as a random-access memory (RAM), flash memory, hard drive, memristor-based memory, and so forth. Engines 112, 113, 114, and 115 may each generally represent any combination of hardware and programming. Each of the engines is discussed in more detail below.
[0016] In some examples, term extraction engine 112 may obtain a text, where "text" may refer to any number of documents or portions thereof, any number of articles or portions thereof, any number of books (e.g., fiction or non-fiction books, textbooks, etc.) or portions thereof, or any other type of textual content. The text may be obtained from any type of machine-readable source, such as a file, a web page, an image of text, a data stream comprising text, etc. In some examples, the text may be obtained from a memory (e.g., 116) on computing system 100, or received from a remote device, e.g., via a local-area network, a wide-area network (e.g., the Internet), or any combination of these or other types of networks.
[0017] In some examples, after obtaining the text, engine 112 may pre-process the text, which may include removing non-text content, removing stop words (e.g., prepositions, pronouns, etc.), stemming the words, or otherwise formatting and/or filtering the text. After optionally pre-processing the text, engine 112 may extract from the text a plurality of terms, i.e., two or more terms. In some examples, each extracted term may be either a single word (e.g., "software") or a sequence of two or more words (e.g., "operating system"). For brevity, terms consisting of more than one word may be referred to herein as "multi-word terms" and terms consisting of only one word may be referred to as "single-word terms."
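A rough sketch of the kind of pre-processing described in this paragraph (removing non-text tokens, lowercasing, and dropping stop words) might look as follows. The stop-word list is a small illustrative sample; a real pipeline would typically use a full list and a stemmer.

```python
import re

# Small illustrative stop-word list; a full list and a stemmer would be used in practice.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "it", "he", "she"}

def preprocess(text):
    tokens = re.findall(r"[a-zA-Z]+", text)                # drop non-text content and punctuation
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]      # remove stop words

print(preprocess("The scheduler assigns a CPU to each of the runnable processes."))
```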
[0018] In some examples, engine 112 may extract from the text only labeled terms, i.e., only terms that have been at some point labeled by people as established concepts, terms of art, names of categories, etc. For example, engine 112 may extract from the text only terms that appear in a predefined set of taxonomies, indexes, encyclopedias, or other types of databases storing terms pre-labeled by people as concepts. For example, terms having dedicated Wikipedia articles may be considered by engine 112 as labeled terms. Thus, in some examples, engine 112 may disregard any unlabeled terms in the obtained text.
[0019] Alternatively or in addition, in some examples, engine 112 may disregard any single-word terms and extract only multi-word terms (e.g., labeled multi-word terms). Because the meaning of a multi-word term is less likely than that of a single-word term to change across various contexts, using only multi-word terms or as many multi-word terms as possible may help reduce any discrepancies associated with the same term meaning different things in different contexts.
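One way to approximate the extraction of labeled multi-word terms is to slide an n-gram window over the text and keep only the n-grams that appear in a set of pre-labeled concepts. The sketch below follows that assumption; the labeled-term set is hypothetical and stands in for a taxonomy, index, or encyclopedia lookup.

```python
import re

# Hypothetical set of labeled terms (e.g., titles drawn from a taxonomy or encyclopedia).
LABELED_TERMS = {"operating system", "virtual memory", "context switch", "file system"}

def extract_labeled_multiword_terms(text, max_len=4):
    words = re.findall(r"[a-zA-Z]+", text.lower())
    found = set()
    for n in range(2, max_len + 1):                        # multi-word terms only
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            if candidate in LABELED_TERMS:
                found.add(candidate)
    return found

sample = "The operating system uses virtual memory and performs a context switch per tick."
print(extract_labeled_multiword_terms(sample))
```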
[0020] In some examples, engine 112 may be configured to extract from the text at least M terms, where M may be a predefined number such as 10, 30, etc. In these examples, engine 112 may determine whether the text includes at least M labeled multi-word terms, and if so, it may not extract any additional terms. However, if the number of labeled multi-word terms in the text is less than M, engine 112 may extract additional terms.
[0021] To extract additional terms, engine 112 may first divide the text into a plurality of segments of fixed or variable sizes. For example, engine 112 may divide the text into segments such that each segment contains a certain number (e.g., 1, 10, etc.) of paragraphs, pages, sections, chapters, topics, or other units of text. After dividing the text into segments, engine 112 may apply a topic model such as a Latent Dirichlet Allocation (LDA) model on all segments of the text. In response, the topic model may generate a set of topics, where each topic may be described by a set of terms referred to herein as "topic terms." The topic model may also provide, for each topic term, a weight associated with that topic term.
[0022] In some examples, engine 112 may rank all topic terms by weight, and supplement the extracted labeled multi-word terms with highest ranking topic terms until the total number of obtained terms reaches M or until no topic terms remain. In some examples, some of the topic terms generated by the topic model may also be labeled terms, for example, labeled single-word terms. Such topic terms may be referred to herein as "labeled topic terms." In these examples, the topic model may first supplement the extracted labeled multi-word terms with the highest ranking labeled topic terms. In some examples, if all labeled topic terms are insufficient to bring the total number of obtained terms to M, engine 112 may also add highest ranking unlabeled topic terms until the total number of terms reaches M, or until no topic terms remain.
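The fallback described in the last two paragraphs (divide the text into segments, run a topic model, and supplement the labeled terms with the highest-ranking topic terms until M terms are obtained) could be sketched as follows. The sketch assumes the gensim library as one possible LDA implementation; M, the number of topics, and the number of words per topic are illustrative choices, and tokenized_segments is expected to be a list of token lists, one per segment.

```python
# Sketch of supplementing labeled terms with LDA topic terms (assumes gensim is installed).
from gensim import corpora, models

def supplement_with_topic_terms(labeled_terms, tokenized_segments, M=30, num_topics=5):
    terms = list(labeled_terms)
    if len(terms) >= M:
        return terms                                       # enough labeled terms already
    dictionary = corpora.Dictionary(tokenized_segments)
    bow = [dictionary.doc2bow(seg) for seg in tokenized_segments]
    lda = models.LdaModel(bow, num_topics=num_topics, id2word=dictionary)
    # Gather (term, weight) pairs across all topics and rank them by weight.
    weighted = []
    for _, topic in lda.show_topics(num_topics=num_topics, num_words=20, formatted=False):
        weighted.extend(topic)
    for term, _ in sorted(weighted, key=lambda tw: tw[1], reverse=True):
        if len(terms) >= M:
            break
        if term not in terms:
            terms.append(term)                             # supplement with topic terms
    return terms
```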
[0023] After extracting the terms, term extraction engine 112 may pass the extracted terms to network construction engine 113. Engine 113 may obtain the extracted terms from engine 112, and construct, based on the extracted terms, a network of terms. In some examples, engine 113 may first determine which terms are found in which segments of the text. As discussed above, the text may be divided (e.g., virtually) by engine 112 or any other engine of computing system 100 into a plurality of segments, where each segment may correspond to one or more paragraphs, pages, sections, chapters, topics, or other units of text. Some extracted terms may be found in one segment, and some extracted terms may be found in a plurality of segments.
[0024] After determining which extracted terms are found in which segments, engine 113 may construct a network of terms by adding each term as a node and connecting any pair of terms found together in at least one segment in the text. The connections between the terms may also be referred to herein as "edges." To illustrate, FIG. 3A shows an example arrangement 300 of five terms t1-t5 into four segments u1-u4. In this example, term t1 is found only in segment u1; term t2 is found in segments u1 and u2; term t3 is found in segments u1 and u2; term t4 is found in segments u1 and u3; and term t5 is found in segments u3 and u4. FIG. 3B illustrates an example network 310 constructed based on arrangement 300 of FIG. 3A. In network 310, any two terms found in the same segment are connected to each other. It is appreciated that a pair of terms may be found together in more than one segment. For example, terms t2 and t3 are found together in segment u1 and also in segment u2.
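The construction step can be illustrated directly with the arrangement of FIG. 3A: five terms distributed over four segments yield the edges of FIG. 3B. The sketch below uses a plain dictionary-based graph and is not the implementation of this disclosure.

```python
from itertools import combinations

# Segments of FIG. 3A: which extracted terms appear in each segment.
segments = {
    "u1": ["t1", "t2", "t3", "t4"],
    "u2": ["t2", "t3"],
    "u3": ["t4", "t5"],
    "u4": ["t5"],
}

def build_network(segments):
    nodes, edges = set(), set()
    for terms in segments.values():
        nodes.update(terms)
        for a, b in combinations(sorted(set(terms)), 2):   # connect co-occurring terms
            edges.add((a, b))
    return nodes, edges

nodes, edges = build_network(segments)
print(sorted(edges))    # includes ('t2', 't3'), ('t4', 't5'), etc., as in FIG. 3B
```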
[0025] In some examples, each segment may be associated with a weight. A segment's weight may reflect, for example, a total number of terms extracted from that segment. For example, a segment's weight may be inversely proportional to the number of terms extracted from that segment. For example, FIG. 4A illustrates the example arrangement 300 of FIG. 3A with weights w1-w4 assigned to segments u1-u4, respectively. In this example, each segment is assigned a weight that is a reciprocal of the number of terms extracted from that segment. Accordingly, segment u1 is assigned a weight of w1=0.25 because segment u1 includes four extracted terms t1-t4; segment u2 is assigned a weight of w2=0.5 because segment u2 includes two extracted terms t2 and t3; segment u3 is assigned a weight of w3=0.5 because segment u3 includes two extracted terms t4 and t5; and segment u4 is assigned a weight of w4=1 because segment u4 includes only one extracted term t5.
[0026] In some examples, engine 113 may assign a weight to each edge in the network. The weight of an edge may be associated with (e.g., be a sum of) weights of the segment(s) that include both terms connected by the edge. To illustrate, FIG. 4B shows network 310 with weights assigned to all of the network's edges. In this example, the edge between terms t1 and t2 is assigned a weight of 0.25 because terms t1 and t2 appear together only in segment u1 whose weight is w1=0.25, and the edge between terms t2 and t3 is assigned a weight of 0.75 (i.e., the sum of w1 and w2) because terms t2 and t3 appear together in segments u1 and u2. Accordingly, if segments having many extracted terms weigh less, an edge between two terms found in a segment with few other terms may tend to have a higher weight, reflecting a potentially stronger correlation between the terms in such segments. [0027] After network constructions engine 113 constructs the network of terms, term ranking engine 114 may score and rank the network's terms. In some examples, a score of a term may be calculated based on the term's centrality within the network. The term's centrality within the network may be determined, for example, based on the sum of the weights of the edges connecting the term to its adjacent terms. For example, a centrality of the term ft may be calculated by engine 1 13 based on the following formula:
Figure imgf000009_0001
where d is a predefined constant (e.g., 0.85), N is the total number of terms in the network, index j runs on all terms adjacent to the term ft, w(tj, ft is the weight from the edge ft to the edge ft, and w(th tk) is the weight from the edge ft to the edge tk. In some examples, the centrality of the term t may also depend on centralities of its adjacent terms. For example, engine 113 may calculate the term's centrality using the following formula:
Figure imgf000009_0002
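Putting the segment weights, the edge weights, and the centrality formulas together for the example of FIGS. 4A and 4B gives a short iterative computation. The code below is a sketch of one plausible reading of the formulas above (a weighted, PageRank-style iteration); the damping constant d and the iteration count are illustrative.

```python
from itertools import combinations
from collections import defaultdict

segments = {"u1": ["t1", "t2", "t3", "t4"], "u2": ["t2", "t3"],
            "u3": ["t4", "t5"], "u4": ["t5"]}

# Segment weight = reciprocal of the number of extracted terms in the segment (FIG. 4A).
seg_weight = {u: 1.0 / len(terms) for u, terms in segments.items()}

# Edge weight = sum of the weights of the segments containing both terms (FIG. 4B).
edge_weight = defaultdict(float)
for u, terms in segments.items():
    for a, b in combinations(sorted(set(terms)), 2):
        edge_weight[(a, b)] += seg_weight[u]
        edge_weight[(b, a)] += seg_weight[u]

def centrality(edge_weight, d=0.85, iters=50):
    # Weighted, PageRank-style iteration over the term network.
    nodes = {a for a, _ in edge_weight}
    n = len(nodes)
    out_sum = {t: sum(w for (a, _b), w in edge_weight.items() if a == t) for t in nodes}
    cent = {t: 1.0 / n for t in nodes}
    for _ in range(iters):
        new = {}
        for ti in nodes:
            s = 0.0
            for (tj, tk), w in edge_weight.items():
                if tk == ti:
                    s += w / out_sum[tj] * cent[tj]
            new[ti] = (1 - d) / n + d * s
        cent = new
    return cent

print(round(edge_weight[("t2", "t3")], 2))    # 0.75, as in FIG. 4B
print(centrality(edge_weight))
```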
[0030] Alternatively or in addition to being dependent on the term's centrality, the term's ranking may in some examples depend on how frequently (or infrequently) the term appears in a corpus of texts. In some examples, engine 113 may determine such frequency by calculating the inverse document frequency (IDF) of a term t using, for example, the following formula:
[0031] IDF(t) = log( (n - nt + 0.5) / (nt + 0.5) )
where n is the total number of documents in the corpus and nt is the number of documents in the corpus that contain the term t. In some examples, the corpus of texts may include only texts from the same type or category as the text being processed. For example, engine 114 may select a corpus from a plurality of corpuses, such that the selected corpus matches or is associated with the text. For example, if the text being processed is a textbook, term ranking engine 114 may select from a plurality of corpuses a corpus containing only or mostly textbooks.
[0032] As discussed above, in some examples engine 114 may score and rank each term based on the term's centrality, the term's inverse document frequency, or both. In some examples, engine 114 may calculate a score R of each term t based on a linear combination of the term's centrality and the term's inverse document frequency, for example, using the following formula:
[0033] R(t) = α · N · Centrality(t) + β · IDF(t)
where N is the number of the extracted terms, and α and β are two predetermined and calibrated constants. After calculating scores for all the extracted terms, engine 114 may rank all terms based on their score. For example, higher scoring terms may be ranked higher, and vice versa.
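The IDF and the combined score can be sketched in a few lines as well. The corpus, alpha, and beta values below are placeholders, since the description only states that the constants are predetermined and calibrated.

```python
import math

def idf(term, corpus_docs):
    n = len(corpus_docs)
    nt = sum(term in doc for doc in corpus_docs)           # documents containing the term
    return math.log((n - nt + 0.5) / (nt + 0.5))

def score(term, centrality, corpus_docs, num_terms, alpha=1.0, beta=1.0):
    # R(t) = alpha * N * Centrality(t) + beta * IDF(t)
    return alpha * num_terms * centrality + beta * idf(term, corpus_docs)

corpus_docs = ["virtual memory pages", "cpu scheduling basics", "file system layout"]
print(score("virtual memory", centrality=0.31, corpus_docs=corpus_docs, num_terms=5))
```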
[0034] In some examples, instead of extracting a fixed number of terms M, engine 112 may extract a larger number of terms, engine 113 may construct a network from the larger number of terms, and engine 114 may calculate scores for those terms. Engine 114 may then select all terms whose scores are above a predefined threshold. Next, engine 113 may construct a new network of selected terms only, after which engine 114 may recalculate the scores of the selected terms (whose centrality values may have changed) based on the new network.
[0035] After scoring and ranking the terms, engine 114 may, for example, store the terms along with their scoring, ranking, and any other information, in a memory (e.g., memory 116), and/or send them to another (e.g., remote) device. In some examples, engine 114 may only store and/or send information about a predefined number of top-ranked (and top-scoring) terms while discarding the other terms.
[0036] In some examples, instead of or in addition to storing the terms, engine 114 may represent the ranked terms (or only a predefined number of top-ranked terms) on a display. For example, engine 114 may send the term information to network visualization engine 115, which may process the term information and display the information on a display (e.g., display 118).
[0037] Network visualization engine 115 may represent the terms on a display, for example, in the form of a graph that may be similar to the network constructed by construction engine 113. Engine 115 may represent each term on the display using text and/or a predefined shape (e.g., a circle) of a certain size and color. In some examples, the size of the shape representing a particular term may be associated with (e.g., be proportional to) the term's score and/or rank. This may cause higher scoring and higher ranking terms to appear larger than lower scoring and lower ranking terms. Thus, engine 115 may provide a visual indication of each term's relative score or rank, that is, a visual indication of the term's score or rank relative to other terms' scores or ranks. Alternatively or in addition, in some examples, engine 115 may provide a visual indication of the term's actual (absolute) score or rank, e.g., in a numerical form.
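A graph along the lines of FIG. 5 could be drawn, for instance, with networkx and matplotlib (one possible choice, not specified by this disclosure), scaling node sizes by score and edge widths by connection weight. The scores and weights below are illustrative values echoing the terms shown in FIG. 5.

```python
# Sketch of a score-scaled term graph (assumes networkx and matplotlib are installed).
import networkx as nx
import matplotlib.pyplot as plt

scores = {"operating system": 0.9, "virtual memory": 0.8, "cpu": 0.7, "caching": 0.4}
weights = {("operating system", "virtual memory"): 0.9,
           ("virtual memory", "caching"): 0.6,
           ("operating system", "cpu"): 0.5}

G = nx.Graph()
for (a, b), w in weights.items():
    G.add_edge(a, b, weight=w)

pos = nx.spring_layout(G, seed=42)
node_sizes = [3000 * scores[n] for n in G.nodes()]             # size proportional to score
edge_widths = [5 * G[a][b]["weight"] for a, b in G.edges()]    # thickness proportional to weight
nx.draw(G, pos, with_labels=True, node_size=node_sizes, width=edge_widths)
plt.show()
```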
[0038] In some examples, engine 115 may also receive a user input associated with a particular term on the display, and in response to the user input, display to the user information associated with the particular term. For example, the user may click or touch on or around the displayed term, and engine 115 may, in response to the click or touch, display one or more portions of the text that contain the term. A portion may include, for example, any number of sentences, paragraphs, pages, sections, or other units of text containing the term. In some examples, to display a portion of text, engine 115 may open and display a document containing the text by launching an appropriate application, such as a word processor, a PDF viewer, a web browser, etc. In some examples, engine 115 may also present the text such that at least one portion containing the term is visible. In some examples, engine 115 may also select (e.g., highlight) the term within the visible portion.
[0039] In some examples, engine 115 may also display the connections between the terms, for example, using straight lines. In some examples, engine 115 may also visually indicate the connections' relative or absolute weights, because, as discussed above, a connection's weight may reflect the level of correlation between the two connected terms. In some examples, engine 115 may represent each connection with a straight line whose length, thickness, color, or another parameter may represent the connection's weight. For example, higher-weighted connections may be represented by thicker lines, by shorter lines, by lines of a particular color, and so forth.
[0040] FIG. 5 illustrates some of the examples discussed above. In the example of FIG. 5, network visualization engine 115 provides to display 118 a graph 510 representing an example network constructed by construction engine 113. In this example, the processed text is a textbook about operating systems, and the graph contains various terms extracted from the text and their connections to each other. In this example, the larger circles may correspond to higher ranked terms (e.g., "operating system," "virtual memory," and "CPU") and the thicker connection lines may indicate stronger correlations (e.g., a very strong correlation between the terms "operating system" and "virtual memory," and a strong correlation between the terms "virtual memory" and "caching"). As also illustrated in the example of FIG. 5, the user may click on a term (e.g., "context switch"), and network visualization engine 115 may open, in response to the click, a window 510 displaying the portion of the textbook containing the selected term. Engine 115 may also highlight every occurrence of the selected term in the text.
[0041] In the foregoing discussion, engines 112, 113, 114, and 115 were described as any combinations of hardware and programming. Such components may be implemented in a number of fashions. The programming may be processor executable instructions stored on a tangible, non-transitory computer-readable medium and the hardware may include a processing resource for executing those instructions. The processing resource, for example, may include one or multiple processors (e.g., central processing units (CPUs), semiconductor-based microprocessors, graphics processing units (GPUs), field-programmable gate arrays (FPGAs) configured to retrieve and execute instructions, or other electronic circuitry), which may be integrated in a single device or distributed across devices. The computer-readable medium can be said to store program instructions that when executed by the processor resource implement the functionality of the respective component. The computer-readable medium may be integrated in the same device as the processor resource or it may be separate but accessible to that device and the processor resource. In one example, the program instructions can be part of an installation package that when installed can be executed by the processor resource to implement the corresponding component. In this case, the computer-readable medium may be a portable medium such as a CD, DVD, or flash drive or a memory maintained by a server from which the installation package can be downloaded and installed. In another example, the program instructions may be part of an application or applications already installed, and the computer-readable medium may include integrated memory such as a hard drive, solid state drive, or the like.
[0042] FIG. 6 is a flowchart of an example method 600. Method 600 may be described below as being executed or performed by a system or by a computing device such as computing system 100 of FIG. 1. Other suitable systems and/or computing devices may be used as well. Method 600 may be implemented in the form of executable instructions stored on at least one non-transitory machine-readable storage medium of the system and executed by at least one processor of the system. Alternatively or in addition, method 600 may be implemented in the form of electronic circuitry (e.g., hardware). In alternate examples of the present disclosure, one or more blocks of method 600 may be executed substantially concurrently or in a different order than shown in FIG. 6. In alternate examples of the present disclosure, method 600 may include more or fewer blocks than are shown in FIG. 6. In some examples, one or more of the blocks of method 600 may, at certain times, be ongoing and/or may repeat.
[0043] At block 605, method 600 may extract a term from a segment of a text, where the text may include a plurality of segments. At block 610, the method may add the term to a network of terms by connecting the term to a set of other terms extracted from the segment. As discussed above, extracting the term from the segment of the text may include, for example, determining whether the term is included in a database (or a list) of labeled terms.
[0044] At block 615, the method may determine the term's centrality within the network of terms. As discussed above, each of the term's connections to the set of other terms extracted from the segment may be associated with a weight, and determining the term's centrality within the network may include calculating a sum of the weights of the term's connections to the set of other terms extracted from the segment. The weight associated with each of the term's connections may be calculated based on a number of segments in the plurality of segments that include both terms connected by the connection and/or based on a number of other terms extracted from those segments.
[0045] At block 620, the method may determine the term's frequency within a corpus of texts. At block 625, the method may determine the term's score based on the term's centrality and the term's frequency. As discussed above, in some examples, the method may display the term on a display (e.g., 118) and provide a visual indication of the term's score on the display. As also discussed above, in some examples, the method may also receive a user input associated with the term and, responsive to the user input, display a portion of the text, where the portion includes the term.
[0046] FIG. 7 is a block diagram of an example computing device 700. Computing device 700 may be similar to computing system 100 of FIG. 1. In the example of FIG. 7, computing device 700 includes a processor 710 and a non-transitory machine-readable storage medium 720. Although the following descriptions refer to a single processor and a single machine-readable storage medium, it is appreciated that multiple processors and multiple machine-readable storage mediums may be used in other examples. In such other examples, the instructions may be distributed (e.g., stored) across multiple machine-readable storage mediums and the instructions may be distributed (e.g., executed) across multiple processors.
[0047] Processor 710 may be one or more central processing units (CPUs), microprocessors, and/or other hardware devices suitable for retrieval and execution of instructions stored in non-transitory machine-readable storage medium 720. In the particular example shown in FIG. 7, processor 710 may fetch, decode, and execute instructions 722, 724, 726, 728, or any other instructions (not shown for brevity). As an alternative or in addition to retrieving and executing instructions, processor 710 may include one or more electronic circuits comprising a number of electronic components for performing the functionality of one or more of the instructions in machine-readable storage medium 720. With respect to the executable instruction representations (e.g., boxes) described and shown herein, it should be understood that part or all of the executable instructions and/or electronic circuits included within one box may, in alternate examples, be included in a different box shown in the figures or in a different box not shown.
[0048] Non-transitory machine-readable storage medium 720 may be any electronic, magnetic, optical, or other physical storage device that stores executable instructions. Thus, medium 720 may be, for example, Random Access Memory (RAM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like. Medium 720 may be disposed within computing device 700, as shown in FIG. 7. In this situation, the executable instructions may be "installed" on computing device 700. Alternatively, medium 720 may be a portable, external or remote storage medium, for example, that allows computing device 700 to download the instructions from the portable/external/remote storage medium. In this situation, the executable instructions may be part of an "installation package". As described herein, medium 720 may be encoded with executable instructions.
[0049] Referring to FIG. 7, instructions 722, when executed by a processor (e.g., 710), may cause a computing device (e.g., 700) to extract a term from a segment of a text, the text comprising a plurality of segments. Instructions 724, when executed by the processor, may cause the computing device to connect the term to a set of other terms extracted from the segment via a set of connections. Instructions 726, when executed by the processor, may cause the computing device to calculate a weight for each connection in the set of connections based on a number of other terms extracted from the segment. Instructions 728, when executed by the processor, may cause the computing device to determine a score of the term based on the weights of the set of connections and based on a frequency at which the term appears in a corpus of texts. As discussed above, in some examples, additional instructions (not shown for brevity) may cause the computing device to display the term and a visual indication of the term's score.

Claims

1. A method comprising:
extracting a term from a segment of a text, the text comprising a plurality of segments;
adding the term to a network of terms by connecting the term to a set of other terms extracted from the segment;
determining the term's centrality within the network of terms;
determining the term's frequency within a corpus of texts; and
based on the term's centrality and the term's frequency, determining a score of the term.
2. The method of claim 1, further comprising:
displaying the term on a display; and
providing a visual indication of the term's score on the display.
3. The method of claim 2, further comprising receiving a user input associated with the term, and responsive to the user input, displaying a portion of the text, where the portion includes the term.
4. The method of claim 1, wherein extracting the term from the segment of the text comprises determining whether the term is included in a database of labeled terms.
5. The method of claim 1, wherein each of the term's connections to the set of other terms extracted from the segment is associated with a weight, and wherein determining the term's centrality within the network comprises calculating a sum of the weights of the term's connections to the set of other terms extracted from the segment.
6. The method of claim 5, wherein the weight associated with each of the term's connections is calculated based on a number of segments in the plurality of segments that comprise both terms connected by the connection.
7. The method of claim 6, wherein the weight associated with each of the term's connections is calculated also based on a number of other terms extracted from the number of segments in the plurality of segments that comprise both terms connected by the connection.
8. A computing system comprising:
a term extraction engine to:
extract a plurality of terms from a text, wherein the text comprises a plurality of segments;
a network construction engine to:
connect each term in the plurality of terms to each term included in any segment, in the plurality of segments, in which the term is also included; and
a term ranking engine to:
calculate a score for each term in the plurality of terms based on the term's connections to other terms in the plurality of terms.
9. The computing system of claim 8, wherein the term ranking engine is to calculate the score for each term in the plurality of terms further based on an inverse document frequency associated with the term.
10. The computing system of claim 8, wherein the network construction engine is further to assign weights to all connections between the terms in the plurality of terms, and wherein the term ranking engine is to calculate the score for each term in the plurality of terms based on weights of the term's connections.
11. The computing system of claim 10, wherein the network construction engine is to assign a weight to each connection based on a number of segments in which terms connected by the connection are included together and based on a number of other terms included in the number of segments.
12. The computing system of claim 8, further comprising:
a display; and
a network visualization engine to:
represent a set of terms from the plurality of terms on the display, and represent scores of the set of terms on the display.
13. The computing system of claim 8, wherein extracting the plurality of terms comprises:
extracting from the text a set of multi-word terms included in a database of labeled terms; and
based on a determination that a size of the set of multi-word terms is less than a predefined size, extracting from the text a set of additional terms using a topic model.
14. A non-transitory machine-readable storage medium encoded with instructions executable by a processor of a computing device to cause the computing device to:
extract a term from a segment of a text, the text comprising a plurality of segments;
connect the term to a set of other terms extracted from the segment via a set of connections;
calculate a weight for each connection in the set of connections based on a number of other terms extracted from the segment; and
determine a score of the term based on the weights of the set of connections and based on a frequency at which the term appears in a corpus of texts.
15. The non-transitory machine-readable storage medium of claim 14, wherein the instructions further cause the computing device to display the term and a visual indication of the term's score.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2015/034590 WO2016200359A1 (en) 2015-06-06 2015-06-06 Term scores

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2015/034590 WO2016200359A1 (en) 2015-06-06 2015-06-06 Term scores

Publications (1)

Publication Number Publication Date
WO2016200359A1 true WO2016200359A1 (en) 2016-12-15

Family

ID=57503464

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/034590 WO2016200359A1 (en) 2015-06-06 2015-06-06 Term scores

Country Status (1)

Country Link
WO (1) WO2016200359A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006001906A2 (en) * 2004-06-14 2006-01-05 University Of North Texas Graph-based ranking algorithms for text processing
US20100145678A1 (en) * 2008-11-06 2010-06-10 University Of North Texas Method, System and Apparatus for Automatic Keyword Extraction
US20110060983A1 (en) * 2009-09-08 2011-03-10 Wei Jia Cai Producing a visual summarization of text documents
US20110270845A1 (en) * 2010-04-29 2011-11-03 International Business Machines Corporation Ranking Information Content Based on Performance Data of Prior Users of the Information Content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WILLYAN D. ABILHOA ET AL.: "A keyword extraction method from Twitter messages represented as graphs", APPLIED MATHEMATICS AND COMPUTATION, vol. 240, 2014, pages 308 - 325, XP028848232 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492110A (en) * 2018-11-28 2019-03-19 南京中孚信息技术有限公司 Document Classification Method and device
US20210264112A1 (en) * 2020-02-25 2021-08-26 Prosper Funding LLC Bot dialog manager
US11886816B2 (en) * 2020-02-25 2024-01-30 Prosper Funding LLC Bot dialog manager

Similar Documents

Publication Publication Date Title
CN107436922B (en) Text label generation method and device
US10025783B2 (en) Identifying similar documents using graphs
US10417335B2 (en) Automated quantitative assessment of text complexity
WO2011035389A1 (en) Document analysis and association system and method
KR20130142124A (en) Systems and methods regarding keyword extraction
KR101541306B1 (en) Computer enabled method of important keyword extraction, server performing the same and storage media storing the same
US9990359B2 (en) Computer-based analysis of virtual discussions for products and services
US9619457B1 (en) Techniques for automatically identifying salient entities in documents
CN110019669B (en) Text retrieval method and device
CN104871151A (en) Method for summarizing document
US20140289260A1 (en) Keyword Determination
CN111126060A (en) Method, device and equipment for extracting subject term and storage medium
US8954463B2 (en) Use of statistical language modeling for generating exploratory search results
US20140181097A1 (en) Providing organized content
CN109522275B (en) Label mining method based on user production content, electronic device and storage medium
US10255379B2 (en) System and method for displaying timeline search results
CN109614478A (en) Construction method, key word matching method and the device of term vector model
CN107315735B (en) Method and equipment for note arrangement
WO2016200359A1 (en) Term scores
JP5869948B2 (en) Passage dividing method, apparatus, and program
CN109948040A (en) Storage, recommended method and the system of object information, equipment and storage medium
CN105893397B (en) A kind of video recommendation method and device
KR20190050180A (en) keyword extraction method and apparatus for science document
CN114328895A (en) News abstract generation method and device and computer equipment
US20160070692A1 (en) Determining segments for documents

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15895084

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15895084

Country of ref document: EP

Kind code of ref document: A1