US20110213763A1 - Web content mining of pair-based data

Info

Publication number
US20110213763A1
Authority
US
United States
Prior art keywords
term
generating
couplet
seed
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/101,061
Inventor
Weizhu Chen
Long Jiang
Ming Zhou
Benyu Zhang
Zheng Chen
Jian Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Priority to US13/101,061
Publication of US20110213763A1
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC. Assignment of assignors interest (see document for details). Assignors: MICROSOFT CORPORATION

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • Chinese couplets use condensed language, but have deep and sometimes ambivalent or double meaning.
  • the two sentences making up the Chinese couplet are called a “first sentence” and a “second sentence.”
  • An example of the Chinese couplet is “ ” and “ ” where the first sentence is “ ” and the second sentence is “ .”
  • the correspondence between individual words of the first and second sentences is shown as follows:
  • the Chinese couplet can be of different length.
  • a short couplet can include one or two Chinese characters while a longer couplet can reach several hundred Chinese characters.
  • the Chinese couplets can also have diverse forms and/or meanings. For instance, one form of the Chinese couplet may include first and second sentences having similar meanings. Another form of the Chinese couplet may include sentences having opposite meanings.
  • the Chinese couplet conforms to the following rules or principles: First, the two sentences of the Chinese couplet have the same number of words and/or characters. Each Chinese character has one syllable when spoken. Each Chinese word can have one or more characters, and consequently, be pronounced with one or more syllables. Each word of the first sentence should have the same number of Chinese characters as the corresponding word of the second sentence.
  • tones of the Chinese couplet are generally coinciding and harmonious.
  • the traditional custom is that the character at the end of the first sentence should be pronounced in a sharp downward tone.
  • the character at the end of the second sentence should be pronounced with a level tone.
  • sequence of parts of speech in the second sentence should be identical to the sequence of parts of speech in the first sentence.
  • position of a noun in the first sentence should correspond to the same position as the noun in the second sentence.
  • the content of the second sentence should be mutually inter-related with the first sentence but cannot be duplicated.
  • the writing styles of the two sentences should be the same. For instance, if there is repetition of words, or characters, or pronunciation in the first sentence, there should be the same sort of repetition in the second sentence. And if a character is composed of two other characters or more in the first sentence, there should be a character that is composed of the same number of characters in the second sentence.
  • the seed for the Chinese couplet may be the first sentence and/or the second sentence.
  • a search result may be obtained.
  • the search result is then processed by using the parser 204 to generate the snippet set 206 associated with the first sentence of the Chinese couplet.
  • the snippet set 206 is subject to the filter 108 which passes through a subset of the snippet set 206 conforming to the features of the Chinese couplet.
  • FIG. 3 is a block diagram of an exemplary filter used for the online pair-based data mining system 200 of FIG. 2 , in accordance with an embodiment.
  • the filter 108 may include an identity filter 302 , a neighbor filter 304 , a length filter 306 , and a frequency filter 308 .
  • the identity filter 302 checks whether each snippet in the snippet set contains at least the first sentence. That is to say, in each snippet there should be at least one candidate pair whose first sentence matches the query (e.g., the first sentence of the pair-based data seed). If this is true for a snippet, that snippet is regarded as a good snippet for extracting pair candidates; otherwise the snippet is discarded.
  • the text may be divided into sentences based on a punctuation mark and/or arranged in an orderly manner. Then, the sentences may be paired up to form sentence pairs. The neighbor filter 304 passes through only pairs of neighboring sentences and discards the rest.
  • the SVM training 254 is conducted by subjecting the SVM classifier 112 to manually labeled Chinese couplets.
  • Features unique to the Chinese couplet e.g., a sentence length, a tone, a sequence, a content, and a writing style of the Chinese couplet
  • a SVM classifier model associated with the Chinese couplet may be generated based on the SVM training 254 and/or loaded to the SVM classifier 112 .
  • the SVM classifier 112 is then used to classify the candidate sentences into positive candidate sentences or negative candidate sentences.
  • the positive candidate sentences are regarded as high-quality candidate sentences, and/or used as pair-based data seeds (e.g., or used in the SVM training 254 ).
  • a client may harvest a list of sentences suitable for a Chinese couplet by iterating the processes described in the online pair-based data mining system 200 and/or the offline SVM training system 250 (e.g., by using the second sentence of the Chinese couplet).
  • the claimed subject matter is described in terms of these example environments. Description in these terms is provided for convenience only. It is not intended that the invention be limited to application in this example environment. In fact, after reading the following description, it will become apparent to a person skilled in the relevant art how to implement the claimed subject matter in alternative embodiments.
  • Table 1 illustrates the improvement in the accuracy of mining candidate sentences suitable for a Chinese couplet when the method and/or tool described by the online pair-based data mining system 200 and/or the offline SVM training system 250 is implemented.
  • FIG. 4 is a flowchart of an exemplary process for generating pair-based output data, in accordance with an embodiment. It is appreciated that not all steps of process 400 are necessary for the general goal of process 400 to be achieved. Moreover, it is appreciated that additional steps may also be included in process 400 in accordance with alternative embodiments.
  • Process 400 begins at step 401 where a SVM training is conducted with manually labeled pair-based data to generate a SVM classifier model.
  • the SVM classifier model is loaded into an online SVM classifier.
  • a pair-based input data is processed through a search engine.
  • a search result is parsed to obtain a snippet set.
  • one or more pair-based candidate data are generated by filtering the snippet set.
  • one or more pair-based output data are generated by using the online SVM classifier.
  • the process described in FIG. 4 may be embodied in a computer readable medium storing instructions that, when executed by a computer, cause the computer to perform a process comprising: generating a set of snippets by parsing a search result of a pair-based input data; subjecting the set of snippets to one or more filters to generate one or more pair-based candidate data (e.g., where the filter is associated with characteristics of the pair-based input data); and generating one or more pair-based output data by classifying the pair-based candidate data with a support vector machine classifier.
  • the term “computer readable medium” as used herein refers to a statutory article of manufacture that is not a signal or carrier wave per se.
  • FIG. 5 is a flowchart of an exemplary process for generating sentences suitable for a Chinese couplet, in accordance with an embodiment. It is appreciated that not all steps of process 500 are necessary for the general goal of process 500 to be achieved. Moreover, it is appreciated that additional steps may also be included in process 500 in accordance with alternative embodiments.
  • Process 500 begins at step 501 where a SVM training is conducted with manually labeled Chinese couplets to generate a SVM classifier model.
  • the SVM classifier model is loaded into an online SVM classifier.
  • a first sentence of a Chinese couplet is processed through a search engine.
  • a search result is parsed to obtain a snippet set.
  • one or more candidate sentences for the Chinese couplet are generated by filtering the snippet set.
  • one or more new sentences suitable for the Chinese couplet are generated by using the online SVM classifier.
  • embodiments provide technology for web-mining pair-based data.
  • the techniques, methods and/or tools described herein provide for filtering and classifying candidate data to generate more precise pair-based data meeting the criteria set by the user.
  • Such technology is ideal for generating pair-based data available on the web. Because of the efficiency of the technology described herein, it is possible for an algorithm implemented based on the technology to mine pair-based data which meets criteria set by the user within a threshold.
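The identity filter 302 and neighbor filter 304 described for FIG. 3 can be illustrated with a short sketch. This is only one possible reading of the text: the function names, the punctuation set used to split sentences, and the use of plain strings for snippets are all assumptions, and the length filter 306 and frequency filter 308 are not sketched because their rules are not detailed here.

```python
import re

# Punctuation used to divide a snippet into sentences (ASCII and full-width).
# The exact delimiter set is an assumption.
_DELIMITERS = re.compile(r"[,.;!?，。；！？、]")


def split_sentences(snippet):
    """Divide snippet text into sentences at punctuation marks, keeping order."""
    return [part.strip() for part in _DELIMITERS.split(snippet) if part.strip()]


def identity_filter(snippet_set, first_sentence):
    """Keep only snippets that contain the query (the first sentence)."""
    return [snippet for snippet in snippet_set if first_sentence in snippet]


def neighbor_pairs(snippet):
    """Pair each sentence with the sentence that immediately follows it."""
    sentences = split_sentences(snippet)
    return list(zip(sentences, sentences[1:]))


def candidate_pairs(snippet_set, first_sentence):
    """Apply both filters and keep neighboring pairs that start with the query."""
    pairs = []
    for snippet in identity_filter(snippet_set, first_sentence):
        for left, right in neighbor_pairs(snippet):
            if left == first_sentence:
                pairs.append((left, right))
    return pairs


snippets = ["alpha beta, gamma delta. other text", "nothing here. at all"]
pairs = candidate_pairs(snippets, "alpha beta")
```

With the placeholder snippets above, only the first snippet survives the identity filter, and only the neighboring pair that begins with the query is kept as a candidate.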

Abstract

Described herein is technology for, among other things, mining pair-based data on the web. The technology involves an online pair-based data mining system as well as an offline SVM training system. By subjecting a pair-based input data to the systems, one may grow a pool of pair-based data which share characteristics of the pair-based input data in a more efficient manner.

Description

    RELATED APPLICATIONS
  • This application is a Continuation of, and claims priority from, U.S. patent application Ser. No. 11/941,968, filed Nov. 19, 2007, which is incorporated herein by reference in its entirety.
  • BACKGROUND
  • Web mining is the application of data mining techniques to discover patterns from the web. The web mining may be divided into a web usage mining, web content mining or web structure mining. The web content mining is a process to discover useful information from the content of a web page. The useful information may include text, image, audio or video data.
  • Text mining refers to the process of deriving high quality information from text. In general, a web search engine may be used for the text mining. The web search engine searches for information on the World Wide Web based on a search term. The search engine may return search results which may contain a part or all of the search terms. Additionally, a filter may be used to refine the search result.
  • However, the web search engine and/or filter may not be effective when a user is looking for data which has a particular pair-based relationship to the search term. For example, the user may be looking to obtain a lower part (e.g., a second sentence) of a Chinese couplet when he or she enters a search term containing an upper part (e.g., a first sentence) of the Chinese couplet which goes together with the lower part. In this case, the search results, which simply list any web text containing the upper part, may not be adequate. The search results may be too abundant and random, so the user may have to spend time sorting them to find lower parts which can go with the upper part.
  • SUMMARY
  • This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
  • Described herein is technology for, among other things, mining pair-based data on the web. The associated online pair-based data mining system and offline SVM training system are also disclosed herein. The technology may be implemented via a web page.
  • The technology involves web-mining pair-based data based on a query by a user, where the query is pair-based data. Once the query is entered by the user, a search result produced by a search engine is parsed to generate a snippet set. The snippet set is then subjected to a filter to generate one or more pair-based candidate data. The pair-based candidate data are then subjected to a support vector machine classifier. The support vector machine classifier is trained offline with manually labeled pair-based data having features or characteristics unique to the pair-based data. Once the training is completed, the support vector machine classifier classifies the pair-based candidate data, thus generating one or more pair-based output data.
  • Thus, embodiments provide technology for extracting pair-based data on the web. The techniques and tools described herein provide for efficient data mining of the pair-based data. Such technology is ideal for a web application and/or a search application catered toward extracting pair-based data on the World Wide Web. Because of the efficiency of the technology described herein, it is possible to extract a pool of pair-based data available on the web that are more precisely associated with a search term.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments and, together with the description, serve to explain their principles:
  • FIG. 1 is a block diagram of an exemplary computing system environment for implementing embodiments.
  • FIG. 2 is a block diagram of an exemplary online pair-based data mining system aided by an offline SVM training system for implementing embodiments.
  • FIG. 3 is a block diagram of an exemplary filter used for the online pair-based data mining system of FIG. 2, in accordance with an embodiment.
  • FIG. 4 is a flowchart of an exemplary process for generating pair-based output data, in accordance with an embodiment.
  • FIG. 5 is a flowchart of an exemplary process for generating sentences suitable for a Chinese couplet, in accordance with an embodiment.
  • DETAILED DESCRIPTION
  • Reference will now be made in detail to the preferred embodiments of the claimed subject matter, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the preferred embodiments, it will be understood that they are not intended to limit the claimed subject matter to these embodiments. On the contrary, the claimed subject matter is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the claimed subject matter as defined by the claims. Furthermore, in the detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the claimed subject matter. However, it will be obvious to one of ordinary skill in the art that the claimed subject matter may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the claimed subject matter.
  • Some portions of the detailed descriptions that follow are presented in terms of procedures, logic blocks, processing, and other symbolic representations of operations on data bits within a computer or digital system memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, logic block, process, etc., is herein, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these physical manipulations take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system or similar electronic computing device. For reasons of convenience, and with reference to common usage, these signals are referred to as bits, values, elements, symbols, characters, terms, numbers, or the like with reference to the claimed subject matter.
  • It should be borne in mind, however, that all of these terms are to be interpreted as referencing physical manipulations and quantities and are merely convenient labels and are to be interpreted further in view of terms commonly used in the art. Unless specifically stated otherwise as apparent from the discussion herein, it is understood that throughout discussions of the present embodiment, discussions utilizing terms such as “determining” or “outputting” or “transmitting” or “recording” or “locating” or “storing” or “displaying” or “receiving” or “recognizing” or “utilizing” or “generating” or “providing” or “accessing” or “checking” or “notifying” or “delivering” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data. The data is represented as physical (electronic) quantities within the computer system's registers and memories and is transformed into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices.
  • Described herein is technology for, among other things, web-mining pair-based data based on a pair-based data seed. The associated filtering and/or classification schemes are also disclosed herein. The technology may be implemented via a web page.
  • The technology involves the generation of pair-based output data based on a pair-based input data from a user. During the process, a pool of the pair-based output data is generated by subjecting the pair-based input data to a search engine, a parser, a filter, and a support vector machine.
  • FIG. 1 is a block diagram of an exemplary computing system environment for implementing embodiments. With reference to FIG. 1, an exemplary system for implementing embodiments includes a general purpose computing system environment, such as computing system environment 100. Pair-based data include two parts: a first item and a second item. The two items may have an objective relationship (e.g., a semantic relationship found in a Chinese couplet and/or a translated term).
  • As illustrated in FIG. 1, a pair-based input data 102 (e.g., the first item or the second item) may be subject to a filtering stage 104 and a classification stage 110 to generate one or more pair-based output data 114. The filtering stage 104 includes a search module 106 and a filter 108. The search module 106 may produce a search result when the pair-based input data 102 is processed. The output of the search module 106 may be processed through the filter 108 to generate an input to the classification stage 110. The classification stage 110 includes a support vector machine (SVM) classifier 112. The SVM classifier 112 generates the pair-based output data 114.
  • FIG. 2 is a block diagram of an exemplary online pair-based data mining system 200 aided by an offline SVM training system 250 for implementing embodiments. As illustrated in FIG. 2, a pair-based input data 102 is prepared as a part of a pair-based data seed. For the pair-based data seed, there may be two items: the pair-based input data 102 and its counterpart. The pair-based input data 102 is processed as a query to a search engine 202 to generate a search result. The search engine 202 may be MSN Search Web®, Google®, Yahoo®, Baidu®, or any other search engine. Then, the search result is parsed (e.g., by using a parser 204) to extract a snippet set 206. The snippet set 206 may be one or more short excerpts of the text that match the query (e.g., the pair-based input data 102). Alternatively, the snippet set 206 may provide information associated with the query and/or ideas for terms to use in subsequent searches.
  • The snippet set 206 is then subject to the filter 108 to generate one or more pair-based candidate data. The filter 108 may be based on a number of criteria to generate the pair-based candidate data 208. In order to obtain high precision pair-based output data 210, the pair-based candidate data 208 are subject to a classification stage. The classification stage comprises an offline process as well as an online process. During the offline and online processes, the support vector machine (SVM) classifier 112 is used to classify the pair-based candidate data 208.
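The flow through the search engine 202, parser 204, filter 108, and classification stage can be sketched as a chain of stages. The following is a minimal illustration rather than the described implementation: the function `mine_pairs` and every `fake_*` stage are hypothetical stand-ins so the sketch runs without a network connection or a trained model.

```python
def mine_pairs(query, search, parse, filter_snippets, classify):
    """Run one pass: search -> parse -> filter -> classify."""
    search_result = search(query)                      # search engine 202
    snippet_set = parse(search_result)                 # parser 204 -> snippet set 206
    candidates = filter_snippets(snippet_set, query)   # filter 108 -> candidate data 208
    # Keep only the candidates the classifier accepts.
    return [pair for pair in candidates if classify(pair)]


# Hypothetical stand-in stages (assumptions, not a real search API).
def fake_search(query):
    return [{"snippet": query + ". matching line"}, {"snippet": "unrelated text"}]


def fake_parse(search_result):
    return [hit["snippet"] for hit in search_result]


def fake_filter(snippet_set, query):
    pairs = []
    for snippet in snippet_set:
        head, sep, tail = snippet.partition(". ")
        if sep and head == query:
            pairs.append((head, tail))
    return pairs


def fake_classify(pair):
    # Placeholder rule: accept pairs whose halves have equal word counts.
    return len(pair[0].split()) == len(pair[1].split())


output = mine_pairs("seed line", fake_search, fake_parse, fake_filter, fake_classify)
```

Passing the stages in as callables keeps the sketch independent of any particular search engine or classifier, which mirrors the statement that any search engine may be used.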
  • It is appreciated that the SVM classifier 112 is well-known to those skilled in the art of machine learning. The SVM classifier 112 may be a learning machine that attempts to maximize the margin between sets of data. The SVM classifier 112 may classify a given input of data without explicitly being told what features separate the classes of data. This may be necessary because humans are often unable to distinguish which features set two sets of data apart when there are hundreds or possibly thousands of different features that make up the data. The SVM classifier 112 may separate the pair-based candidate data 208 into positive candidate data and negative candidate data.
  • For the SVM classifier 112 to function properly, training of the SVM classifier 112 may be necessary. The offline SVM training system 250 is used to generate a SVM classifier model 256 by conducting SVM training 254 on manually labeled pair-based data 252. Positive examples of the manually labeled pair-based data 252 may share features unique to the pair-based input data 102. Then, the SVM classifier model 256 obtained by the SVM training 254 is loaded to the SVM classifier 112. Based on the SVM classifier model 256, the SVM classifier 112 classifies the pair-based candidate data 208, thus generating the pair-based output data 210 (e.g., by keeping the positive candidate data while dropping the negative candidate data).
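  • The offline training and online classification split described above can be sketched as follows. This is a minimal illustration only, assuming a linear SVM trained by subgradient descent on the regularized hinge loss; the source does not specify the kernel, the training algorithm, or the feature encoding. The function names, the two-dimensional feature vectors, and the hyperparameters are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Model = Tuple[List[float], float]  # (weights, bias)


def score(model: Model, x: Vector) -> float:
    """Signed score of a candidate: positive means 'valid pair'."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(examples: List[Tuple[Vector, int]],
                     lam: float = 0.01, lr: float = 0.1,
                     epochs: int = 200) -> Model:
    """Offline step: fit a linear SVM on manually labeled examples.

    Each example is (features, label) with label +1 or -1. The loop is
    plain subgradient descent on the L2-regularized hinge loss.
    """
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            violated = y * score((w, b), x) < 1.0
            # Regularization shrinks w, which favors a wide margin.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if violated:
                # Hinge-loss subgradient step for a margin violation.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def keep_positive(model: Model, candidates: List[Vector]) -> List[Vector]:
    """Online step: keep positive candidates and drop negative ones."""
    return [x for x in candidates if score(model, x) > 0.0]


# Hypothetical features: (length-agreement score, part-of-speech match ratio).
LABELED = [([1.0, 1.0], +1), ([0.9, 0.8], +1),
           ([0.0, 0.1], -1), ([0.1, 0.0], -1)]
MODEL = train_linear_svm(LABELED)           # offline: build the model
KEPT = keep_positive(MODEL, [[0.95, 0.9], [0.05, 0.05]])  # online: classify
```

    In this sketch the trained `MODEL` plays the role of the SVM classifier model 256, and `keep_positive` plays the role of the online SVM classifier 112 once the model is loaded.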
  • One or more of the pair-based output data 210 may be subjected to the online pair-based data mining system 200 as the pair-based input data 102 to generate additional pair-based output data 210. Additionally, a counterpart of the pair-based input data 102 may be subjected to the online pair-based data mining system 200 and the offline SVM training system 250 to mine more pair-based output data 210. In one example embodiment, the pair-based input data may be a term in a first language, and the counterpart may be a foreign term which corresponds (e.g., semantically) to the term in the first language.
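  • The feedback loop described above, in which output data is resubmitted as input data, can be sketched as a simple work queue. This is an illustrative sketch under assumptions: `mine_once` stands in for one full pass through the online system (search, parse, filter, classify), and `FAKE_WEB`, `harvest`, and `max_rounds` are hypothetical names that do not appear in the source.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set, Tuple


def harvest(seed: str,
            mine_once: Callable[[str], Iterable[str]],
            max_rounds: int = 10) -> Set[Tuple[str, str]]:
    """Iteratively mine pairs, feeding each new counterpart back in as a seed."""
    seen = {seed}                 # terms already queued, to avoid cycles
    pairs: Set[Tuple[str, str]] = set()
    queue = deque([seed])
    rounds = 0
    while queue and rounds < max_rounds:
        term = queue.popleft()
        rounds += 1
        for counterpart in mine_once(term):
            pairs.add((term, counterpart))
            if counterpart not in seen:
                seen.add(counterpart)
                queue.append(counterpart)
    return pairs


# Hypothetical stand-in for the online mining system: term -> mined counterparts.
FAKE_WEB: Dict[str, List[str]] = {"a": ["b"], "b": ["c", "a"], "c": []}
PAIRS = harvest("a", lambda term: FAKE_WEB.get(term, []))
```

    The `seen` set and the `max_rounds` cap are design choices of this sketch; they keep the loop from revisiting terms or running unbounded.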
  • In one example embodiment, the online pair-based data mining system 200 and the offline SVM training system 250 may be used to generate one or more new sentences suitable for a Chinese couplet by subjecting a seed of the Chinese couplet to the systems. The Chinese couplet includes two sentences written as calligraphy on vertical red banners, typically placed on either side of a door or in a large hall. Such couplets are often displayed during special occasions such as weddings or during the Chinese New Year. Other types of couplets include birthday couplets, elegiac couplets, decoration couplets, professional or other human association couplets, and the like.
  • Chinese couplets use condensed language, but have deep and sometimes ambivalent or double meaning. The two sentences making up the Chinese couplet are called a “first sentence” and a “second sentence.” An example of the Chinese couplet is “[Figure P00001]” and “[Figure P00002],” where the first sentence is “[Figure P00003]” and the second sentence is “[Figure P00004].” The correspondence between individual words of the first and second sentences is shown as follows:
  • [Figure P00005] (sky) : [Figure P00006] (sea)
    [Figure P00007] (high) : [Figure P00008] (wide)
    [Figure P00009] (enables) : [Figure P00010] (allows)
    [Figure P00011] (bird) : [Figure P00012] (fish)
    [Figure P00013] (fly) : [Figure P00014] (jump)
  • The Chinese couplet can vary in length. A short couplet can include one or two Chinese characters while a longer couplet can reach several hundred Chinese characters. The Chinese couplets can also have diverse forms and/or meanings. For instance, in one form of the Chinese couplet the first and second sentences have similar meanings. In another form, the sentences have opposite meanings.
  • In general, the Chinese couplet conforms to the following rules or principles: First, the two sentences of the Chinese couplet have the same number of words and/or characters. Each Chinese character has one syllable when spoken. Each Chinese word can have one or more characters, and consequently, be pronounced with one or more syllables. Each word of the first sentence should have the same number of Chinese characters as the corresponding word of the second sentence.
  • Second, tones of the Chinese couplet are generally coinciding and harmonious. The traditional custom is that the character at the end of the first sentence should be pronounced in a sharp downward tone. The character at the end of the second sentence should be pronounced with a level tone.
  • Third, the sequence of parts of speech in the second sentence should be identical to the sequence of parts of speech in the first sentence. For instance, a noun in the first sentence should occupy the same position as a noun in the second sentence.
  • Fourth, the content of the second sentence should be mutually inter-related with the first sentence but cannot be duplicated.
  • Fifth, the writing styles of the two sentences should be the same. For instance, if there is repetition of words, or characters, or pronunciation in the first sentence, there should be the same sort of repetition in the second sentence. And if a character is composed of two or more other characters in the first sentence, there should be a character that is composed of the same number of characters in the second sentence.
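  • Some of the rules above can be checked directly on the character strings. The sketch below covers only surface checks: equal length (first rule), a coarse non-duplication test (fourth rule), and matching character-repetition patterns (fifth rule). Tone and part-of-speech checks need a pronunciation dictionary and a tagger and are left out. All function names are hypothetical, and the sample pair is a well-known couplet chosen here as an assumption for illustration, since the characters in the source are available only as figure images.

```python
from typing import Dict, List


def repetition_pattern(sentence: str) -> List[int]:
    """Map each character to the index of its first occurrence.

    Two sentences repeat characters in the same way exactly when
    their patterns are equal, e.g. "ABA" and "CDC" both give [0, 1, 0].
    """
    first_seen: Dict[str, int] = {}
    pattern = []
    for index, char in enumerate(sentence):
        first_seen.setdefault(char, index)
        pattern.append(first_seen[char])
    return pattern


def check_surface_rules(first: str, second: str) -> Dict[str, bool]:
    """Surface-level checks for a candidate (first, second) sentence pair."""
    return {
        # First rule: same number of characters.
        "same_length": len(first) == len(second),
        # Fourth rule (coarse proxy): the second sentence is not a copy.
        "not_duplicated": first != second,
        # Fifth rule: repetition occurs at the same positions.
        "same_repetition": repetition_pattern(first) == repetition_pattern(second),
    }


# Illustrative pair (assumed, not taken from the source figures).
RESULT = check_surface_rules("海阔凭鱼跃", "天高任鸟飞")
```

    Values such as these booleans could serve as inputs to a classifier, although the source does not say how its features are encoded.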
  • The seed for the Chinese couplet may be the first sentence and/or the second sentence. When the first sentence is subject to the search engine 202, a search result may be obtained. The search result is then processed by using the parser 204 to generate the snippet set 206 associated with the first sentence of the Chinese couplet. The snippet set 206 is subject to the filter 108 which passes through a subset of the snippet set 206 conforming to the features of the Chinese couplet.
  • FIG. 3 is a block diagram of an exemplary filter used for the online pair-based data mining system 200 of FIG. 2, in accordance with an embodiment. As illustrated in FIG. 3, the filter 108 may include an identity filter 302, a neighbor filter 304, a length filter 306, and a frequency filter 308. The identity filter 302 is used to check whether each snippet in the snippet set contains at least the first sentence. That is to say, in each snippet there should be at least one candidate pair whose first sentence matches the query (e.g., the first sentence of the pair-based data seed). If this is true for a snippet, that snippet is regarded as a good snippet for extracting pair candidates; otherwise the snippet is discarded.
  • For the good snippet, the text may be divided into sentences based on punctuation marks and/or arranged in an orderly manner. Then, the sentences may be paired up to form sentence pairs. The neighbor filter 304 passes through only the pairs of neighboring sentences and/or discards the rest.
  • The length filter 306 is used to discard those pairs of neighboring sentences in which the first sentence and the second sentence do not have the same length. Of all the candidate pairs of neighboring sentences generated, those whose frequency in the snippet set is less than a threshold k (e.g., k=2) are discarded (e.g., by using the frequency filter 308).
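  • The four filters described above can be chained as in the following sketch. It is an illustration under assumptions, not the patented implementation: the punctuation set, the choice to keep only neighbor pairs whose first sentence equals the seed, and the sample snippets are all made up for this example, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple

# Assumed sentence delimiters (full-width and ASCII punctuation).
SENTENCE_BREAK = re.compile(r"[，。！？；：,.!?;:]+")


def split_sentences(snippet: str) -> List[str]:
    """Divide a snippet into sentences at punctuation marks."""
    return [s.strip() for s in SENTENCE_BREAK.split(snippet) if s.strip()]


def identity_filter(snippets: List[str], seed: str) -> List[str]:
    """Keep only snippets that contain the seed sentence."""
    return [s for s in snippets if seed in s]


def neighbor_pairs(snippet: str, seed: str) -> List[Tuple[str, str]]:
    """Pair each sentence with its right neighbor; keep pairs led by the seed."""
    sentences = split_sentences(snippet)
    return [(a, b) for a, b in zip(sentences, sentences[1:]) if a == seed]


def mine_candidates(snippets: List[str], seed: str, k: int = 2) -> List[Tuple[str, str]]:
    """Apply identity, neighbor, length, and frequency filters in order."""
    counts: Counter = Counter()
    for snippet in identity_filter(snippets, seed):
        for first, second in neighbor_pairs(snippet, seed):
            if len(first) == len(second):      # length filter
                counts[(first, second)] += 1
    # Frequency filter: drop pairs seen fewer than k times.
    return [pair for pair, n in counts.items() if n >= k]


SEED = "海阔凭鱼跃"
SNIPPETS = [
    "海阔凭鱼跃，天高任鸟飞。",          # valid pair, first occurrence
    "俗话说：海阔凭鱼跃，天高任鸟飞！",  # same pair, second occurrence
    "海阔凭鱼跃，这句话很有名。",        # neighbor has a different length
    "海阔凭鱼跃，山高任鸟飞。",          # appears only once
    "今天天气很好。",                    # does not contain the seed
]
CANDIDATES = mine_candidates(SNIPPETS, SEED, k=2)
```

    With k=2 only the pair that occurs in two snippets survives; lowering k to 1 would also admit the pair that occurs once.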
  • In the offline SVM training system 250, the SVM training 254 is conducted by subjecting the SVM classifier 112 to manually labeled Chinese couplets. Features unique to the Chinese couplet (e.g., a sentence length, a tone, a sequence, a content, and a writing style of the Chinese couplet) may be used in the SVM training 254. A SVM classifier model associated with the Chinese couplet may be generated based on the SVM training 254 and/or loaded to the SVM classifier 112. The SVM classifier 112 is then used to classify the candidate sentences into positive candidate sentences or negative candidate sentences. The positive candidate sentences are regarded as high-quality candidate sentences, and may be used as pair-based data seeds and/or in the SVM training 254.
  • A client may harvest a list of sentences suitable for a Chinese couplet by iterating the processes described in the online pair-based data mining system 200 and/or the offline SVM training system 250 (e.g., by using the second sentence of the Chinese couplet). The claimed subject matter is described in terms of these example environments. Description in these terms is provided for convenience only. It is not intended that the invention be limited to application in this example environment. In fact, after reading the following description, it will become apparent to a person skilled in the relevant art how to implement the claimed subject matter in alternative embodiments.
  • Table 1 illustrates the improvement in the accuracy of mining candidate sentences suitable for a Chinese couplet when the method and/or tool described by the online pair-based data mining system 200 and/or the offline SVM training system 250 is implemented.
  • TABLE 1
                                      Top-1      Top-3      Top-5      Top-10
                                      Precision  Precision  Precision  Precision
    conventional mining technique      6.22%     14.07%     19.35%     35.00%
    with the system(s)                17.05%     37.32%     32.64%     88.50%
    difference                       +10.83%    +23.25%    +13.29%    +53.50%

    As shown in Table 1, there was a 53.50 percentage-point improvement in top-10 precision (e.g., the first ten sentences generated which meet the criteria of being a suitable lower part of a Chinese couplet to the upper part being queried), a 23.25 percentage-point improvement in top-3 precision, a 13.29 percentage-point improvement in top-5 precision, and a 10.83 percentage-point improvement in top-1 precision when the method and/or system described in FIGS. 2 and 3 were used in place of the conventional mining technique.
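    The differences in Table 1 are absolute gains in percentage points, which is easy to confirm with a few lines of arithmetic. The sketch below recomputes them from the two data rows and also derives the gain relative to the conventional baseline; the relative figure is not reported in the source and is computed here only for comparison. The names `TABLE_1`, `point_gain`, and `relative_gain` are hypothetical.

```python
# Values copied from Table 1: (conventional, with the system), in percent.
TABLE_1 = {
    "Top-1": (6.22, 17.05),
    "Top-3": (14.07, 37.32),
    "Top-5": (19.35, 32.64),
    "Top-10": (35.00, 88.50),
}


def point_gain(conventional: float, with_system: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(with_system - conventional, 2)


def relative_gain(conventional: float, with_system: float) -> float:
    """Precision with the system as a multiple of the baseline."""
    return with_system / conventional


GAINS = {name: point_gain(*row) for name, row in TABLE_1.items()}

for name, (conventional, with_system) in TABLE_1.items():
    ratio = relative_gain(conventional, with_system)
    print(f"{name}: +{GAINS[name]:.2f} points, {ratio:.2f}x the baseline")
```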
  • FIG. 4 is a flowchart of an exemplary process for generating pair-based output data, in accordance with an embodiment. It is appreciated that not all steps of process 400 are necessary for the general goal of process 400 to be achieved. Moreover, it is appreciated that additional steps may also be included in process 400 in accordance with alternative embodiments.
  • Process 400 begins at step 401 where a SVM training is conducted with manually labeled pair-based data to generate a SVM classifier model. At step 402, the SVM classifier model is loaded to an online SVM classifier. At step 410, a pair-based input data is processed through a search engine. At step 420, a search result is parsed to obtain a snippet set. At step 430, one or more pair-based candidate data are generated by filtering the snippet set. At step 440, one or more pair-based output data are generated by using the online SVM classifier.
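  • The online portion of process 400 (steps 410 through 440) can be read as four functions applied in sequence. The skeleton below is a sketch under assumptions: every component is injected as a callable, and the stubs used in the example (a canned search result, a line-based parser, a prefix filter, and a length test standing in for the loaded SVM classifier model) are hypothetical placeholders rather than anything described in the source.

```python
from typing import Callable, List


def run_online_steps(seed: str,
                     search: Callable[[str], str],
                     parse: Callable[[str], List[str]],
                     filter_snippets: Callable[[List[str], str], List[str]],
                     is_positive: Callable[[str, str], bool]) -> List[str]:
    """Run steps 410-440 for one seed and return the accepted counterparts."""
    raw_result = search(seed)                          # step 410: query
    snippets = parse(raw_result)                       # step 420: snippet set
    candidates = filter_snippets(snippets, seed)       # step 430: candidates
    return [c for c in candidates if is_positive(seed, c)]  # step 440: classify


# --- Hypothetical stubs, for illustration only ---
def stub_search(seed: str) -> str:
    return "abc|xyz\nabc|toolong\nzzz|qqq"


def stub_parse(raw: str) -> List[str]:
    return raw.splitlines()


def stub_filter(snippets: List[str], seed: str) -> List[str]:
    prefix = seed + "|"
    return [s[len(prefix):] for s in snippets if s.startswith(prefix)]


def stub_classifier(seed: str, candidate: str) -> bool:
    # Stand-in for the online SVM classifier with a loaded model.
    return len(seed) == len(candidate)


OUTPUT = run_online_steps("abc", stub_search, stub_parse, stub_filter, stub_classifier)
```

    Steps 401 and 402 happen before this function is called; in the sketch they correspond to whatever produced the `is_positive` callable.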
  • In one example embodiment, the process described in FIG. 4 may be embedded in a computer readable medium such that, when executed by a computer, the computer readable medium causes the computer to perform a process comprising generating a set of snippets by parsing a search result of a pair-based input data, subjecting the set of snippets to one or more filters to generate one or more pair-based candidate data (e.g., where the filter is associated with characteristics of the pair-based input data), and generating one or more pair-based output data by classifying the pair-based candidate data with a support vector machine classifier. The term “computer readable medium” as used herein refers to a statutory article of manufacture that is not a signal or carrier wave per se.
  • FIG. 5 is a flowchart of an exemplary process for generating sentences suitable for a Chinese couplet, in accordance with an embodiment. It is appreciated that not all steps of process 500 are necessary for the general goal of process 500 to be achieved. Moreover, it is appreciated that additional steps may also be included in process 500 in accordance with alternative embodiments.
  • Process 500 begins at step 501 where a SVM training is conducted with manually labeled Chinese couplets to generate a SVM classifier model. At step 502, the SVM classifier model is loaded to an online SVM classifier. At step 510, a first sentence of a Chinese couplet is processed through a search engine. At step 520, a search result is parsed to obtain a snippet set. At step 530, one or more candidate sentences for the Chinese couplet are generated by filtering the snippet set. At step 540, one or more new sentences suitable for the Chinese couplet are generated by using the online SVM classifier.
  • Thus, embodiments provide technology for web mining of pair-based data. The techniques, methods and/or tools described herein provide for filtering and classifying candidate data to generate more precise pair-based data meeting the criteria set by the user. Such technology is well suited to harvesting pair-based data available on the web. Because of the efficiency of the technology described herein, it is possible for an algorithm implemented based on the technology to mine pair-based data which meets criteria set by the user within a threshold.
  • The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (20)

1. A system comprising:
a computer;
a parser implemented at least in part by the computer, configured for receiving results of a search based on a seed term, and further configured for generating a snippet set comprising at least one term from the received results that is associated with the seed term based on rules that include the seed term and the at least one term sharing a common number of characters, words, or syllables, being harmonious, having matching parts of speech, and sharing a common writing style; and
a filter implemented at least in part by the computer and configured for generating at least one couplet comprising the seed term and a second term from the snippet set that complies with the rules.
2. The system of claim 1 wherein the filter comprises a support vector machine classifier configured for generating, at least in part, the at least one couplet.
3. The system of claim 2 wherein the generating comprises keeping positive candidate data and dropping negative candidate data as determined by the support vector machine classifier.
4. The system of claim 1 wherein the search is performed via a search engine.
5. The system of claim 1 wherein the seed term is in a first language and the at least one term is in a second language.
6. The system of claim 1 wherein the filter comprises:
an identity filter configured for discarding the snippet set in response to the snippet set not including at least the seed term, or
a neighbor filter configured for generating term pairs based on the at least one term of the snippet set, or
a length filter configured for discarding ones of the term pairs that do not have a same length, or
a frequency filter configured for discarding ones of the term pairs having a frequency that is less than a threshold.
7. The system of claim 1 wherein the at least one couplet is a Chinese couplet.
8. A method comprising:
generating, by a computer, a snippet set comprising at least one term from results of a search based on a seed term, the at least one term associated with the seed term based on rules that include the seed term and the at least one term sharing a common number of characters, words, or syllables, being harmonious, having matching parts of speech, and sharing a common writing style; and
generating at least one couplet comprising the seed term and a second term from the snippet set that complies with the rules.
9. The method of claim 8 wherein the generating the at least one couplet is performed by a support vector machine classifier.
10. The method of claim 9 wherein the generating the at least one couplet comprises keeping positive candidate data and dropping negative candidate data as determined by the support vector machine classifier.
11. The method of claim 8 wherein the search is performed via a search engine.
12. The method of claim 8 wherein the seed term is in a first language and the at least one term is in a second language.
13. The method of claim 8 wherein the generating the at least one couplet comprises:
discarding the snippet set in response to the snippet set not including at least the seed term, or
generating term pairs based on the at least one term of the snippet set, or
discarding ones of the term pairs that do not have a same length, or
discarding ones of the term pairs having a frequency that is less than a threshold.
14. The method of claim 8 wherein the at least one couplet is a Chinese couplet.
15. At least one computer readable medium storing computer-executable instructions that, when executed by a computer, cause the computer to perform a method comprising:
generating a snippet set comprising at least one term from results of a search based on a seed term, the at least one term associated with the seed term based on rules that include the seed term and the at least one term sharing a common number of characters, words, or syllables, being harmonious, having matching parts of speech, and sharing a common writing style; and
generating at least one couplet comprising the seed term and a second term from the snippet set that complies with the rules.
16. The at least one computer readable medium of claim 15 wherein the generating the at least one couplet is performed by a support vector machine classifier.
17. The at least one computer readable medium of claim 16 wherein the generating the at least one couplet comprises keeping positive candidate data and dropping negative candidate data as determined by the support vector machine classifier.
18. The at least one computer readable medium of claim 15 wherein the search is performed via a search engine.
19. The at least one computer readable medium of claim 15 wherein the seed term is in a first language and the at least one term is in a second language.
20. The at least one computer readable medium of claim 15 wherein the generating the at least one couplet comprises:
discarding the snippet set in response to the snippet set not including at least the seed term, or
generating term pairs based on the at least one term of the snippet set, or
discarding ones of the term pairs that do not have a same length, or
discarding ones of the term pairs having a frequency that is less than a threshold.
US13/101,061 2007-11-19 2011-05-04 Web content mining of pair-based data Abandoned US20110213763A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/101,061 US20110213763A1 (en) 2007-11-19 2011-05-04 Web content mining of pair-based data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/941,968 US7962507B2 (en) 2007-11-19 2007-11-19 Web content mining of pair-based data
US13/101,061 US20110213763A1 (en) 2007-11-19 2011-05-04 Web content mining of pair-based data

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US11/941,968 Continuation US7962507B2 (en) 2007-11-19 2007-11-19 Web content mining of pair-based data

Publications (1)

Publication Number Publication Date
US20110213763A1 true US20110213763A1 (en) 2011-09-01

Family

ID=40643049

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/941,968 Active 2030-03-12 US7962507B2 (en) 2007-11-19 2007-11-19 Web content mining of pair-based data
US13/101,061 Abandoned US20110213763A1 (en) 2007-11-19 2011-05-04 Web content mining of pair-based data

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US11/941,968 Active 2030-03-12 US7962507B2 (en) 2007-11-19 2007-11-19 Web content mining of pair-based data

Country Status (1)

Country Link
US (2) US7962507B2 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI386822B (en) * 2007-09-05 2013-02-21 Shing Lung Chen A method for establishing a multilingual translation data base rapidly
WO2010105265A2 (en) * 2009-03-13 2010-09-16 Jean-Pierre Makeyev Text creation system and method
US9489350B2 (en) 2010-04-30 2016-11-08 Orbis Technologies, Inc. Systems and methods for semantic search, content correlation and visualization
US9015080B2 (en) 2012-03-16 2015-04-21 Orbis Technologies, Inc. Systems and methods for semantic inference and reasoning
JP5936698B2 (en) * 2012-08-27 2016-06-22 株式会社日立製作所 Word semantic relation extraction device
US9189531B2 (en) 2012-11-30 2015-11-17 Orbis Technologies, Inc. Ontology harmonization and mediation systems and methods
US9412048B2 (en) 2014-04-21 2016-08-09 Haier Us Appliance Solutions, Inc. Systems and methods for cookware detection
US9449220B2 (en) * 2014-04-21 2016-09-20 Haier Us Appliance Solutions, Inc. Systems and methods for cookware detection
CN104091174B (en) * 2014-07-13 2017-04-19 西安电子科技大学 portrait style classification method based on support vector machine
CN106126512A (en) * 2016-04-13 2016-11-16 北京天融信网络安全技术有限公司 The Web page classification method of a kind of integrated study and device
CN111126061B (en) * 2019-12-24 2023-07-14 北京百度网讯科技有限公司 Antithetical couplet information generation method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7689585B2 (en) 2004-04-15 2010-03-30 Microsoft Corporation Reinforced clustering of multi-type data objects for search term suggestion
US7617176B2 (en) * 2004-07-13 2009-11-10 Microsoft Corporation Query-based snippet clustering for search result grouping
US8000955B2 (en) 2006-12-20 2011-08-16 Microsoft Corporation Generating Chinese language banners

Patent Citations (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4712174A (en) * 1984-04-24 1987-12-08 Computer Poet Corporation Method and apparatus for generating text
US4942526A (en) * 1985-10-25 1990-07-17 Hitachi, Ltd. Method and system for generating lexicon of cooccurrence relations in natural language
US5805832A (en) * 1991-07-25 1998-09-08 International Business Machines Corporation System for parametric text to text language translation
US5774588A (en) * 1995-06-07 1998-06-30 United Parcel Service Of America, Inc. Method and system for comparing strings with entries of a lexicon
US5721939A (en) * 1995-08-03 1998-02-24 Xerox Corporation Method and apparatus for tokenizing text
US5806021A (en) * 1995-10-30 1998-09-08 International Business Machines Corporation Automatic segmentation of continuous text using statistical approaches
US5930746A (en) * 1996-03-20 1999-07-27 The Government Of Singapore Parsing and translating natural language sentences automatically
US6002997A (en) * 1996-06-21 1999-12-14 Tou; Julius T. Method for translating cultural subtleties in machine translation
US5946648A (en) * 1996-06-28 1999-08-31 Microsoft Corporation Identification of words in Japanese text by a computer system
US6173252B1 (en) * 1997-03-13 2001-01-09 International Business Machines Corp. Apparatus and methods for Chinese error check by means of dynamic programming and weighted classes
US6408266B1 (en) * 1997-04-01 2002-06-18 Yeong Kaung Oon Didactic and content oriented word processing method with incrementally changed belief system
US6081774A (en) * 1997-08-22 2000-06-27 Novell, Inc. Natural language information retrieval system and method
US6289302B1 (en) * 1998-10-26 2001-09-11 Matsushita Electric Industrial Co., Ltd. Chinese generation apparatus for machine translation to convert a dependency structure of a Chinese sentence into a Chinese sentence
US6311152B1 (en) * 1999-04-08 2001-10-30 Kent Ridge Digital Labs System for chinese tokenization and named entity recognition
US20050210058A1 (en) * 1999-11-01 2005-09-22 Kurzweil Cyberart Technologies, Inc., A Delaware Corporation Poet personalities
US7269802B1 (en) * 1999-11-01 2007-09-11 Kurzweil Cyberart Technologies, Inc. Poetry screen saver
US6941262B1 (en) * 1999-11-01 2005-09-06 Kurzweil Cyberart Technologies, Inc. Poet assistant's graphical user interface (GUI)
US7184949B2 (en) * 1999-11-01 2007-02-27 Kurzweil Cyberart Technologies, Inc. Basic poetry generation
US6647395B1 (en) * 1999-11-01 2003-11-11 Kurzweil Cyberart Technologies, Inc. Poet personalities
US20030036040A1 (en) * 1999-11-01 2003-02-20 Kurzweil Cyberart Technologies, Inc., A Delaware Corporation Basic poetry generation
US6505197B1 (en) * 1999-11-15 2003-01-07 International Business Machines Corporation System and method for automatically and iteratively mining related terms in a document through relations and patterns of occurrences
US6539376B1 (en) * 1999-11-15 2003-03-25 International Business Machines Corporation System and method for the automatic mining of new relationships
US6385629B1 (en) * 1999-11-15 2002-05-07 International Business Machine Corporation System and method for the automatic mining of acronym-expansion pairs patterns and formation rules
US20020052901A1 (en) * 2000-09-07 2002-05-02 Guo Zhi Li Automatic correlation method for generating summaries for text documents
US7117208B2 (en) * 2000-09-28 2006-10-03 Oracle Corporation Enterprise web mining system and method
US20020123877A1 (en) * 2001-01-10 2002-09-05 En-Dong Xun Method and apparatus for performing machine translation using a unified language model and translation model
US20020099712A1 (en) * 2001-01-23 2002-07-25 Neo-Core, L.L.C. Method of operating an extensible markup language database
US7113903B1 (en) * 2001-01-30 2006-09-26 At&T Corp. Method and apparatus for providing stochastic finite-state machine translation
US20030083861A1 (en) * 2001-07-11 2003-05-01 Weise David N. Method and apparatus for parsing text using mutual information
US20050080781A1 (en) * 2001-12-18 2005-04-14 Ryan Simon David Information resource taxonomy
US20030208354A1 (en) * 2002-05-03 2003-11-06 Industrial Technology Research Institute Method for named-entity recognition and verification
US6993534B2 (en) * 2002-05-08 2006-01-31 International Business Machines Corporation Data store for knowledge-based data mining system
US7219099B2 (en) * 2002-05-10 2007-05-15 Oracle International Corporation Data mining model building using attribute importance
US7305585B2 (en) * 2002-05-23 2007-12-04 Exludus Technologies Inc. Asynchronous and autonomous data replication
US20040006466A1 (en) * 2002-06-28 2004-01-08 Ming Zhou System and method for automatic detection of collocation mistakes in documents
US20040215598A1 (en) * 2002-07-10 2004-10-28 Jerzy Bala Distributed data mining and compression method and system
US20040034525A1 (en) * 2002-08-15 2004-02-19 Pentheroudakis Joseph E. Method and apparatus for expanding dictionaries during parsing
US20040111453A1 (en) * 2002-12-06 2004-06-10 Harris Christopher K. Effective multi-class support vector machine classification
US20040122660A1 (en) * 2002-12-20 2004-06-24 International Business Machines Corporation Creating taxonomies and training data in multiple languages
US20050049990A1 (en) * 2003-08-29 2005-03-03 Milenova Boriana L. Support vector machines processing system
US20050071148A1 (en) * 2003-09-15 2005-03-31 Microsoft Corporation Chinese word segmentation
US20070005345A1 (en) * 2005-07-01 2007-01-04 Microsoft Corporation Generating Chinese language couplets
US20070100680A1 (en) * 2005-10-21 2007-05-03 Shailesh Kumar Method and apparatus for retail data mining using pair-wise co-occurrence consistency
US20070204211A1 (en) * 2006-02-24 2007-08-30 Paxson Dana W Apparatus and method for creating literary macrames
US20070294223A1 (en) * 2006-06-16 2007-12-20 Technion Research And Development Foundation Ltd. Text Categorization Using External Knowledge
US20080015458A1 (en) * 2006-07-17 2008-01-17 Buarque De Macedo Pedro Steven Methods of diagnosing and treating neuropsychological disorders
US7814086B2 (en) * 2006-11-16 2010-10-12 Yahoo! Inc. System and method for determining semantically related terms based on sequences of search queries

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Feng, "Feature Selection Based on Genetic Algorithms and Support Vector Machines for Handwritten Similar Chinese Characters Recognition," Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August 2004, pp. 3600-3605 *

Also Published As

Publication number Publication date
US7962507B2 (en) 2011-06-14
US20090132530A1 (en) 2009-05-21

Similar Documents

Publication Publication Date Title
US7962507B2 (en) Web content mining of pair-based data
CN108280206B (en) Short text classification method based on semantic enhancement
US8892420B2 (en) Text segmentation with multiple granularity levels
CN106294320B (en) A kind of terminology extraction method and system towards academic paper
US20190303375A1 (en) Relevant passage retrieval system
WO2019228203A1 (en) Short text classification method and system
JP5751253B2 (en) Information extraction system, method and program
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
CN106651696B (en) Approximate question pushing method and system
Abbasi et al. Applying authorship analysis to Arabic web content
CN105224520B (en) A kind of Chinese patent document term automatic identifying method
CN111191022A (en) Method and device for generating short titles of commodities
Tandel et al. Multi-document text summarization-a survey
CN111078893A (en) Method for efficiently acquiring and identifying linguistic data for dialog meaning graph in large scale
CN108038099B (en) Low-frequency keyword identification method based on word clustering
CN110705247A (en) Based on χ²-C text similarity calculation method
CN109657043A (en) Automatically generate the method, apparatus, equipment and storage medium of article
CN113761128A (en) Event key information extraction method combining domain synonym dictionary and pattern matching
CN111460147A (en) Title short text classification method based on semantic enhancement
CN114416914B (en) Processing method based on picture question and answer
KR101265467B1 (en) Method for extracting experience and classifying verb in blog
AlShenaifi et al. Faheem at NADI shared task: Identifying the dialect of Arabic tweet
Shekhar et al. Computational linguistic retrieval framework using negative bootstrapping for retrieving transliteration variants
Al-Zyoud et al. Arabic stemming techniques: comparisons and new vision
Thanadechteemapat et al. Thai word segmentation for visualization of Thai web sites

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: EXPRESSLY ABANDONED -- DURING EXAMINATION

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034766/0509

Effective date: 20141014