CA1301345C - Methods for part-of-speech determination and usage - Google Patents
Methods for part-of-speech determination and usage
- Publication number
- CA1301345C CA000590100A
- Authority
- CA
- Canada
- Prior art keywords
- speech
- word
- words
- probability
- contextual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B19/00—Teaching not covered by other main groups of this subclass
- G09B19/04—Speaking
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Business, Economics & Management (AREA)
- Human Computer Interaction (AREA)
- Educational Technology (AREA)
- Educational Administration (AREA)
- Probability & Statistics with Applications (AREA)
- Entrepreneurship & Innovation (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
There are disclosed methods and applications for determination of parts of speech (part-of-speech tagging) and noun phrase parsing for text or other non-verbal record of a communication. The part-of-speech tagging method optimizes the product of individual word lexical probabilities and normalized three-word contextual probabilities. Normalization involves dividing by contained two-word contextual probabilities. The method for noun phrase parsing involves optimizing the choices of, typically non-recursive, noun phrases by considering all possible beginnings and endings thereof, preferably based on the output of the part-of-speech tagging method. The disclosed applications include text-to-speech synthesis and text searching and editing, among a broader range of possible applications.
Description
METHODS FOR PART-OF-SPEECH
DETERMINATION AND USAGE
Field of the Invention
This invention relates to methods for part-of-speech determination and to methods for usage of the results, including intermediate methods of noun-phrase parsing, and including speech synthesis, speech recognition, training of writers, proofreading, indexing and data retrieval.
Background of the Invention
It has long been recognized that the ability to determine the parts of speech, especially for words that can be used as different parts of speech, is relevant to many different problems in the use of the English language. For example, it is known that speech "stress", including pitch, duration and energy, is dependent on the particular parts of speech of words and their sentence order. Accordingly, speech synthesis needs parts-of-speech analysis of the input written or non-verbal text to produce a result that sounds like human speech.
Moreover, automatic part-of-speech determination can play an important role in automatic speech recognition, in the education and training of writers by computer-assisted methods, in editing and proofreading of documents generated at a word-processing work station, in the indexing of a document, and in various forms of retrieval of word-dependent data from a data base.
For example, some of these uses can be found in various versions of AT&T's Writer's Workbench®. See the article by Barbara Wallraff, "The Literate Computer," in The Atlantic Monthly, January 1988, pp. 64ff, especially page 68, the last two paragraphs. The relationship of parts of speech to indexing can be found in U.S. Patent No. 4,580,218 issued April 1, 1986, to C. L. Raye.
Heretofore, two principal methods for automatic part-of-speech determination have been discussed in the literature and, to some extent, employed.
The first depends on a variety of "ad hoc" rules designed to detect particular situations of interest. These rules may relate, for example, to using word endings to predict part-of-speech, or to some adaptation thereof. Some ad hoc rules for part-of-speech determination have been used in the Writer's Workbench® application program running under the UNIX™ Operating System. These rules tend to be very limited in the situations they can successfully resolve and to lack underlying unity. That technique is described in Computer Science Technical Report, No. 81, "PARTS - A System for Assigning Word Classes to English Text", by L. L. Cherry, June 1978, Bell Telephone Laboratories, Incorporated.
The second principal method, which potentially has greater underlying unity, is the "n-gram" technique described in the article "The Automatic Tagging of the LOB Corpus", in ICAME News, Vol. 7, pp. 13-33, by G. Leech et al., 1983, University of Lancaster, England. Part of the technique there described makes the assigned part of speech depend on the current best choices of parts of speech of certain preceding or following words, based on certain rules as to likely combinations of successive parts of speech. With this analysis, various ad hoc rules are also used, so that, overall, this method is still less accurate than desirable. In addition, this method fails to model lexical probabilities in a systematic fashion.
The foregoing techniques have not generated substantial interest among researchers in the art because of the foregoing considerations and because the results have been disappointing.
Indeed, it has been speculated that any "n-gram" technique will yield poor results because it cannot take a sufficiently wide, or overall, view of the likely structure of the sentence. On the other hand, it has not been possible to program robustly into a computer the kind of overall view a human mind takes in analyzing the parts of speech in a sentence. See the book A Theory of Syntactic Recognition for Natural Language, by M. Marcus, MIT Press, Cambridge, MA, 1980. Consequently, the "n-gram" type of part-of-speech determination, as contrasted to "n-gram" word frequency-of-occurrence analysis, has been largely limited to tasks such as helping to generate larger bodies of fully "tagged" text to be used in further research. For that purpose, the results must be corrected by the intervention of a very capable human.
Nevertheless, it would be desirable to be able to identify parts-of-speech with a high degree of likelihood with relatively simple techniques, like the "n-gram" technique, so that it may be readily applied in all the applications mentioned at the outset, above.
Summary of the Invention
According to one feature of my invention, parts of speech are assigned to words in a message by optimizing the product of individual word lexical probabilities and normalized three-word contextual probabilities.
Normalization employs the contained two-word contextual probabilities.
Endpoints of sentences (including multiple spaces between them), punctuation and words occurring with low frequency are assigned lexical probabilities and are otherwise treated as if they were words, so that discontinuities encountered in
prior n-gram part-of-speech assignment and the prior use of "ad hoc" rules tend to be avoided. The generality of the technique is thereby established.
According to another feature of my invention, a message in which the words have had parts-of-speech previously assigned has its noun phrases identified in a way that facilitates their use for speech synthesis. This noun phrase parsing also may have other applications. Specifically, the noun phrase parsing method is a highly probabilistic method that initially assigns beginnings and ends of noun phrases at every start or end of a word and progressively eliminates such assignments by eliminating the lowest probability assignments, until only very high probability non-recursive assignments remain. By non-recursive assignments, I mean that no noun phrase assignment is retained that is partly or wholly within another noun phrase.
Alternatively, the method of this feature of my invention can also retain some high-probability noun phrases that occur wholly within other noun phrases, since such assignments are useful in practice, for example, in speech synthesis.
Some noun phrase assignments which are always eliminated are endings without corresponding beginnings (e.g., at the start of a sentence), or beginnings without endings (e.g., at the end of a sentence), but my method further eliminates low-probability assignments of the beginnings and ends of noun phrases; or, to put it another way, retains only the highest probability assignments.
According to a subsidiary feature of my invention, other low-probability noun phrases are eliminated by repetitively scanning each sentence of a message from beginning to end and, on each scan, multiplying the probabilities for each pair of a beginning and an end, and then keeping those combinations with a product near or above the highest probability previously obtained for the region of the sentence, or that at least are not inconsistent with other high probability noun phrases.
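The pair-scoring idea in the preceding paragraph can be illustrated with a small Python sketch. It is a greedy simplification, not the method as claimed: the function name `select_noun_phrases`, the per-word `begin_prob` and `end_prob` lists, the `threshold` value and the tie-breaking rule are all assumptions made for illustration, and the repeated scanning is replaced by a single sorted pass.

```python
def select_noun_phrases(begin_prob, end_prob, threshold=0.5):
    """Return non-overlapping (start, end) word-index spans, highest product first.

    begin_prob[i]: assumed probability that a noun phrase begins before word i.
    end_prob[j]:   assumed probability that a noun phrase ends after word j.
    """
    n = len(begin_prob)
    # Score every pairing of a beginning with an end at or after it.
    candidates = [
        (begin_prob[i] * end_prob[j], i, j)
        for i in range(n)
        for j in range(i, n)
    ]
    # Highest product first; ties go to the earlier, then shorter, span (an assumption).
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    kept = []
    for score, i, j in candidates:
        if score < threshold:
            break  # everything after this point scores lower still
        # Non-recursive: reject any span that overlaps one already kept.
        if all(j < a or i > b for a, b in kept):
            kept.append((i, j))
    return sorted(kept)


# Toy probabilities for "the old man saw a dog" (made-up numbers).
begin_prob = [0.9, 0.1, 0.1, 0.0, 0.9, 0.1]
end_prob = [0.1, 0.1, 0.9, 0.0, 0.1, 0.9]
spans = select_noun_phrases(begin_prob, end_prob)
```

With these toy numbers the spans kept are words 0 to 2 ("the old man") and words 4 to 5 ("a dog"); the whole-sentence span has the same product but is rejected because it overlaps a span already kept.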
According to still another feature of my invention, the output of my parts-of-speech assignment method may be the input to my noun-phrase-parsing method. In this context the maximum likelihood optimization techniques used in both methods tend to reinforce each other, since each method, by itself, is superior in performance to that of its prior art.
In accordance with one aspect of the invention there is provided an automated method for assigning parts of speech to words in a message, of the type comprising the steps of: electronically reading stored representations of the message, generating lexical probabilities for each word to be a particular part of speech, and selecting, in response to the lexical probability for the subject word and in response to the contextual probabilities for at least one adjacent word to be a particular part of speech, the contextual probability for the subject word to be a particular part of speech, SAID METHOD BEING CHARACTERIZED IN THAT:
the generating step includes representing certain words, spaces before and after sentences and punctuation symbols as words having empirically-determined frequencies of occurrence in a non-verbal record of the message, including smoothing part of speech frequencies for at least certain words; and the selecting step includes maximizing the contextual probabilities referred to parts of speech of nearby words including at least the following word.
In accordance with another aspect of the invention there is provided an automated method for determining, in a message, beginnings and ends of noun phrases to which parts of speech have been assigned with reasonable probability, of the type including the steps of estimating whether the words around each noun in the message could be part of a noun phrase, and utilizing the resulting estimates, SAID METHOD BEING CHARACTERIZED BY the steps of: assigning all possible noun phrase boundaries, eliminating all non-paired boundaries, and optimizing contextual noun phrase boundary probabilities.
Brief Description of the Drawing
Further features and advantages of my invention will become apparent from the following detailed description, taken together with the drawing, in which:
FIG. 1 is a flow diagram of a parts-of-speech assignment method according to my invention;
FIG. 2 is a flow diagram of a noun phrase parsing method according to my invention;
FIG. 3 is a block-diagrammatic showing of a speech synthesizer employing the methods of FIGs. 1 and 2; and
FIG. 4 is a block-diagrammatic showing of a text editing employing the method of FIG. 1.
Description of Illustrative Embodiments
In the method of FIG. 1, we shall assume for purposes of illustration that the message was a text message which has been read and stored in an electronic form. The first step then becomes, as indicated in block 11, to read the stored text, sentence by sentence. This step requires determining sentence boundaries. There are many known techniques, but I prefer to make the initial assumption that every period ends a sentence and then to discard that sentence and its results when my method subsequently demonstrates that the period had a more likely use.
In any event, my method proceeds to operate on each sentence, starting from the end.
The subsequent steps can be grouped into three general steps:
token-izing the words (block 12);
computing the lexical part-of-speech probabilities (block 13), starting from the end of the sentence; and
optimizing the contextual part-of-speech probabilities (block 14),
with, of course, the general final step (15) of applying the result to any of the many possible uses of part-of-speech analysis.
These general steps can be broken down into many more detailed steps, as will now be explained.
In token-izing words, I make certain minor but important modifications of the usual linguistic approach to part-of-speech analysis.
Nevertheless, for convenience, I use the same designations of parts of speech as set out in the "List of Tags" in the book by W. Nelson Francis et al., Frequency Analysis of English Usage, Houghton Mifflin Co., 1982, at pages 6-8. They will be repeated herein wherever helpful to understanding examples.
Token-izing includes the identification of words and certain non-words, such as punctuation and parentheses. In addition, I have found it important to assign two blank spaces after every sentence period to generate a new set of frequencies for such spaces in a tagged body of text such as that which formed the basis for the Francis et al. book (the antecedent body of text is commonly called the "Brown Corpus"). Token types involved in the process are the actual words of a sentence and structural indicators which inform the process that the end of a sentence has been reached. Those structural indicators include, for example, an end-of-sentence indicator, such as the machine-readable character for a period, a heading or paragraph indicator represented by a corresponding formatting character stored in the manuscript, filed, or file, along with the text words, and an end-of-file indicator.
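A minimal Python sketch of this token-izing step follows. It assumes one sentence per call; the marker name `BLANK`, the function name `tokenize_sentence` and the regular expression are illustrative choices that do not appear in the patent, and heading, paragraph and end-of-file indicators are left out.

```python
import re

# Hypothetical marker for the blank space assigned after a sentence period.
BLANK = "<blank>"

# A word (letters or digits, with internal apostrophes or hyphens allowed),
# or else any single non-space character such as punctuation or a parenthesis.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize_sentence(sentence):
    """Split a sentence into word and non-word tokens.

    If the sentence ends in a period, two blank tokens are appended so that
    the period and the blanks can be treated like words later on.
    """
    tokens = TOKEN_PATTERN.findall(sentence)
    if tokens and tokens[-1] == ".":
        tokens += [BLANK, BLANK]
    return tokens


tokens = tokenize_sentence("I see a bird.")
```

Here `tokens` holds the four words, the period, and two `BLANK` markers, so the final word, the period and the first blank can later be read off as one three-token window.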
Looking ahead a bit, we shall see that each final word in a sentence will have its contextual probability measured together with that for the period and the following blank. These three form a "trigram"; and the probability analysis therefore is exploring the question: "How likely is it that this word, as a certain part of speech, can end a sentence?" In this case the contextual probability of observing the period in this position is very high (near 1.0); and the contextual probability for the blank is 1.0. In any event, those probabilities are the same in both numerator and denominator of the normalized probability, so the resultant contextual probability is just the measured probability of seeing the subject part of speech at the end of a sentence which, in turn, is a statistic that can be tabulated from the text corpus and stored in a permanent memory of the computer.
After token-izing the observed words and characters, as explained in connection with block 12, my method next computes the lexical part of speech probabilities (the probability of observing part of speech i given word j), dependent upon frequency of occurrence, as follows: If every sense of every word of interest appeared with a reasonably high frequency in the Brown Corpus, that calculation would be simply the quotient of the observed frequency of occurrence of the word as a particular part of speech, divided by its total frequency of occurrence, regardless of part of speech.
I replace this calculation, for words or characters of low frequency of occurrence, as follows: consider that, under Zipf's law, no matter how much text we look at, there will always be a large tail of words that appear only a few times.
In the Brown Corpus, for example, 40,000 words appear five times or less. If a word such as yawn appears once as a noun and once as a verb, what is the probability that it can be an adjective? It is impossible to say without more information. Fortunately, dictionaries can help alleviate this problem to some extent. We add one to the frequency count of possibilities in the dictionary.
For example, yawn happens to be listed in our dictionary as either a noun or a verb. Thus, we smooth the possibilities. In this case, the probabilities remain unchanged. Both before and after smoothing, we estimate yawn to be a noun 50% of the time, and a verb the rest. There is no chance that yawn is an adjective.
In some other cases, smoothing makes a big difference. Consider the word cans. This word appears 5 times as a plural noun and never as a verb in the Brown Corpus. The lexicon (and its morphological routines), fortunately, give both possibilities. Thus, the revised estimate is that cans appears 6/7 times as a plural noun and 1/7 times as a verb.
Thus, we add "one" to each observed frequency of occurrence as each possible part of speech, according to the training material, an unabridged dictionary; and calculate the lexical probabilities therefrom.
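The add-one smoothing just described can be checked against the yawn and cans examples with a short Python sketch. The function name `lexical_probabilities` and the tag labels ("NN", "VB", "NNS", "VBZ", "JJ") are illustrative assumptions; the counts are the ones quoted in the text.

```python
from fractions import Fraction


def lexical_probabilities(corpus_counts, dictionary_tags):
    """Estimate P(part of speech | word) with add-one dictionary smoothing.

    corpus_counts:   {tag: times the word was seen with that tag in the corpus}
    dictionary_tags: tags the dictionary lists as possible for the word
    """
    counts = dict(corpus_counts)
    # Add one to the count of every possibility the dictionary lists.
    for tag in dictionary_tags:
        counts[tag] = counts.get(tag, 0) + 1
    total = sum(counts.values())
    return {tag: Fraction(n, total) for tag, n in counts.items()}


# "yawn": seen once as a noun and once as a verb; dictionary lists both.
yawn = lexical_probabilities({"NN": 1, "VB": 1}, ["NN", "VB"])

# "cans": seen 5 times as a plural noun, never as a verb; dictionary lists both.
cans = lexical_probabilities({"NNS": 5}, ["NNS", "VBZ"])
```

Exact fractions are used so the results can be compared directly with the figures in the text: yawn stays at 1/2 and 1/2, while cans becomes 6/7 and 1/7. A tag that neither the corpus nor the dictionary offers, such as an adjective reading of yawn, simply never enters the table.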
To start to construct our probability search tree for this word, we now multiply that lexical probability by the normalized estimated contextual probability, i.e., the frequency of observing part of speech X given the succeeding parts of speech Y and Z, already determined, divided by the "bigram" frequency of observing part of speech Y given part of speech Z. The latter two data can be tabulated from an already tagged corpus, referenced by Francis et al. in their book.
The tabulated data are stored in a computer memory.
We proceed to repeat the above process for the subject word as every other part of speech it can be, keeping only the maximum probabilities from our prior sets of calculations. Before we proceed to the next to last word in the sentence, we have arrived at a maximum product probability for the last word.
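One step of this scoring can be sketched in Python as follows. This is only a single step for the last word of a sentence, not the full search over a sentence; the function names, the tag labels and every count in the toy tables are invented for illustration and are not taken from the Brown Corpus.

```python
def normalized_contextual(trigram_counts, bigram_counts, x, y, z):
    """Frequency of tags (x, y, z) in sequence divided by the frequency of (y, z)."""
    denominator = bigram_counts.get((y, z), 0)
    if denominator == 0:
        return 0.0
    return trigram_counts.get((x, y, z), 0) / denominator


def best_tag(lexical, trigram_counts, bigram_counts, y, z):
    """Try every candidate tag x for the subject word and keep the maximum product."""
    scores = {
        x: p * normalized_contextual(trigram_counts, bigram_counts, x, y, z)
        for x, p in lexical.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores[winner]


# Toy tables (made-up numbers): the word "cans" followed by a period and a blank.
lexical = {"NNS": 6 / 7, "VBZ": 1 / 7}
trigram_counts = {("NNS", ".", "BLANK"): 30, ("VBZ", ".", "BLANK"): 10}
bigram_counts = {(".", "BLANK"): 100}

tag, score = best_tag(lexical, trigram_counts, bigram_counts, ".", "BLANK")
```

With these toy tables the plural-noun reading wins, since (6/7) x 0.30 exceeds (1/7) x 0.10. Moving on to the next-to-last word would repeat the same product with the tags already chosen to its right.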
Two things can already be observed about the process. First, the lexical probabilities that are used in the product lie along a continuum and are not just one of three arbitrarily assigned values, as used in the Leech et al. reference.
Second, while the applications of the mathematics may seem trivial for words which in fact turn out to be at the end of a sentence, the important point is that it is the same mathematics which is used everywhere.
As we proceed to give a more complete, specific example, keep in mind that the probability estimates were obtained by training on the tagged Brown Corpus, which is referred to but not contained in the above-cited analysis by
Further features and advantages of my invention will become apparent from the following detailed description, taken together with the drawing, in which:
FIC;. 1 is a flow diagram of a parts-of-speech assignment method 5 according to my invention;
FIG. 2 is a flow diagra}n of a noun phrase parsing method according to my invention;
FIG. 3 is a block-diagrammatic showing of a speech synthesizer employ~ng the methods of FIGs. 1 and 2; and F~G. 4 is a block-diagramrnatic showing of a text editing employing the method of FIG. 1;
Descriptlon of Illustrative Embodiments In the method of FIG. 1, we shall assume for purposes of illustration that the message was a text message which has been read and stored in an 1~ electronic form. The first step then becomes, as indicated in block 11, to read the stored text, sentence by sentence. This step requires determining sentence boundaries. Ther0 are many known techniques, but I prefer to make the initial assumption that every period ends a sentence and then to discard that sentence and its results when my method subsequently demonstrates that the period had a more 20 likely use.
In any event, my method proceeds to operate on each sentence, starting frGrn the end.
The subsequent steps can be grouped into three general steps:
token-izing the words (block 12);
2S computing the lexical part-of-speech probabilities (block 13), starting from the end of the sentence; and optimizing the contextual part-of-speech probabilities (block 14~, with, of course, ~he general final step (lS) of applying the result to any of the manypossible uses of part-of-speech analysis.
These general steps can be broken down into many more detailed - steps, as will now be explained.
In token-izing words, I make certain minor but important modifications of the usual linguistic approach to part-of-speech analysis.
Nevertheless, for convenience, I use the same designations of parts of speech as set out in the "List of Tags" in the book by W. Nelson Francis et al., Frequency Analysis of English Usage, Houghton Mifflin Co., 1982, at pages 6-8. They will be repeated herein wherever helpful to understanding examples.
Token-izing includes the identification of words and certain non-words, such as punctuation and parentheses. In addition, I have found it important to assign two blank spaces after every sentence period to generate a new set of frequencies for such spaces in a tagged body of text such as that which formed the basis for the Francis et al. book (the antecedent body of text is commonly called the "Brown Corpus"). Token types involved in the process are the actual words of a sentence and structural indicators which inform the process that the end of a sentence has been reached. Those structural indicators include, for example, an end-of-sentence indicator, such as the machine-readable character for a period, a heading or paragraph indicator represented by a corresponding formatting character stored in the manuscript, filed, or file, along with the text words, and an end-of-file indicator.
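The token-izing step can be pictured with a minimal Python sketch. This is an illustration only, not the tokenizer of block 12: the function name `tokenize`, the regular expression, and the use of the literal string "blank" to stand for each of the two blank tokens after a sentence period are all assumptions made for the example.

```python
import re

def tokenize(sentence):
    """Split text into words and single punctuation marks (hypothetical helper).

    After every period, two "blank" tokens are appended, following the
    two-blank convention described in the text. The token spelling "blank"
    is an assumption of this sketch.
    """
    # A word is a run of letters with an optional apostrophe suffix;
    # any other non-space character is emitted as its own token.
    raw = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", sentence)
    tokens = []
    for tok in raw:
        tokens.append(tok)
        if tok == ".":
            tokens.extend(["blank", "blank"])
    return tokens

tokens = tokenize("I see a bird.")
```

Because every period is first assumed to end a sentence, a sketch like this would also add blanks after abbreviations; the text handles that case by discarding the sentence when the period later proves to have a more likely use.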
Looking ahead a bit, we shall see each final word in a sentence will have its contextual probability measured together with that for the period and the following blank. These three form a "trigram"; and the probability analysis therefore is exploring the question: "How likely is it that this word, as a certain part of speech, can end a sentence?" In this case the contextual probability of observing the period in this position is very high (near 1.0); and the contextual probability for the blank is 1.0. In any event, those probabilities are the same in both numerator and denominator of the normalized probability, so the resultant contextual probability is just the measured probability of seeing the subject part of speech at the end of a sentence which, in turn, is a statistic that can be tabulated from the text corpus and stored in a permanent memory of the computer.
After token-izing the observed words and characters, as explained in connection with block 12, my method next computes the lexical part-of-speech probabilities (the probability of observing part of speech i given word j), dependent upon frequency of occurrence, as follows: If every sense of every word of interest appeared with a reasonably high frequency in the Brown Corpus, that calculation would be simply the quotient of the observed frequency of occurrence of the word as a particular part of speech, divided by its total frequency of occurrence, regardless of part of speech.
I replace this calculation, for words or characters of low frequency of occurrence, as follows: consider that, under Zipf's law, no matter how much text we look at, there will always be a large tail of words that appear only a few times. In the Brown Corpus, for example, 40,000 words appear five times or less. If a word such as yawn appears once as a noun and once as a verb, what is the probability that it can be an adjective? It is impossible to say without more information. Fortunately, dictionaries can help alleviate this problem to some extent. We add one to the frequency count of each possibility listed in the dictionary. For example, yawn happens to be listed in our dictionary as either a noun or a verb. Thus, we smooth the possibilities. In this case, the probabilities remain unchanged. Both before and after smoothing, we estimate yawn to be a noun 50% of the time, and a verb the rest. There is no chance that yawn is an adjective.
In some other cases, smoothing makes a big difference. Consider the word cans. This word appears 5 times as a plural noun and never as a verb in the Brown Corpus. The lexicon (and its morphological routines), fortunately, give both possibilities. Thus, the revised estimate is that cans appears 6/7 times as a plural noun and 1/7 times as a verb.
Thus, we add "one" to each observed frequency of occurrence as each possible part of speech, according to the training material and an unabridged dictionary, and calculate the lexical probabilities therefrom.
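The add-one smoothing just described can be checked with a short Python sketch. The function name `smoothed_lexical_probs` and its two arguments are hypothetical; only the counts for yawn and cans come from the text.

```python
from fractions import Fraction

def smoothed_lexical_probs(corpus_counts, dictionary_tags):
    """Add one to the count of every tag the dictionary allows, then normalize.

    corpus_counts:   observed tag frequencies for one word (tag -> count)
    dictionary_tags: tags the dictionary lists as possible for that word
    """
    counts = dict(corpus_counts)
    for tag in dictionary_tags:
        counts[tag] = counts.get(tag, 0) + 1
    total = sum(counts.values())
    return {tag: Fraction(n, total) for tag, n in counts.items()}

# "cans": 5 observations as a plural noun, none as a verb.
cans = smoothed_lexical_probs({"NNS": 5}, ["NNS", "VB"])
# "yawn": one observation each as noun and verb.
yawn = smoothed_lexical_probs({"NN": 1, "VB": 1}, ["NN", "VB"])
```

With these inputs, cans comes out as 6/7 plural noun and 1/7 verb, and yawn stays at one half each, matching the figures above. A tag absent from both the corpus and the dictionary (adjective, for yawn) never receives any probability.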
To start to construct our probability search tree for this word, we now multiply that lexical probability by the normalized estimated contextual probability, i.e., the frequency of observing part of speech X given the succeeding parts of speech Y and Z, already determined, divided by the "bigram" frequency of observing part of speech Y given part of speech Z. The latter two data can be tabulated from an already tagged corpus, referenced by Francis et al. in their book.
The tabulated data are stored in a computer memory.
We proceed to repeat the above process for the subject word as every other part of speech it can be, keeping only the maximum probabilities from our prior sets of calculations. Before we proceed to the next to last word in the sentence, we have arrived at a maximum product probability for the last word.
Two things can already be observed about the process. First, the lexical probabilities that are used in the product lie along a continuum and are not just one of three arbitrarily assigned values, as used in the Leech et al. reference.
Second, while the applications of the mathematics may seem trivial for words which in fact turn out to be at the end of a sentence, the important point is that it is the same mathematics which is used everywhere.
As we proceed to give a more complete, specific example, keep in mind that the probability estimates were obtained by training on the tagged Brown Corpus, which is referred to but not contained in the above-cited analysis by Francis et al. It is a corpus of approximately 1,000,000 words with part of speech tags assigned and laboriously checked by hand.
Overall performance of my method has been surprisingly good, considering that its operation is strictly local in nature and that, in general, it has no way to look on both sides of a noun phrase, for example, to determine the usage of what may be an auxiliary verb, for instance.
If every possibility in the dictionary must be given equal weight, parsing is very difficult. Dictionaries tend to focus on what is possible, not on what is likely. Consider the trivial sentence, "I see a bird." For all practical purposes, every word in the sentence is unambiguous. According to Francis and Kucera, the word "I" appears as a pronoun in 5837 out of 5838 observations (100%), "see" appears as a verb in 771 out of 772 observations (100%), "a" appears as an article in 23013 out of 23019 observations (100%) and "bird" appears as a noun in 26 out of 26 observations (100%). However, according to Webster's Seventh New Collegiate Dictionary, every word is ambiguous. In addition to the desired assignments of tags (parts of speech), the first three words are listed as nouns and the last as an intransitive verb. One might hope that these spurious assignments could be ruled out by the parser as syntactically ill-formed.
Unfortunately, the prior art has no consistent way to achieve that result. If the parser is going to accept noun phrases of the form:
[NP [N city] [N school] [N committee] [N meeting]],
then it cannot rule out
[NP [N I] [N see] [N a] [N bird]]
(where "NP" stands for "noun phrase"; and "N" stands for "noun").
Similarly, the parser probably also has to accept bird as an intransitive verb, since there is nothing syntactically wrong with:
[S [NP [N I] [N see] [N a]] [VP [V bird]]],
where "S" stands for "sentence" and "VP" stands for "verb phrase" and "V" stands for "verb".
These part-of-speech assignments are not wrong; they are just extremely improbable.
Consider once again the sentence, "I see a bird." The problem is to find an assignment of parts of speech to words that optimizes both lexical and contextual probabilities, both of which are estimated from the Tagged Brown Corpus. The lexical probabilities are estimated from the following frequencies (PPSS = singular pronoun; NP = proper noun; VB = verb; UH = interjection; IN = preposition; AT = article; NN = noun):

Word    Parts of Speech
I       PPSS 5837     NP 1
see     VB 771        UH 1
a       AT 23013      IN (French) 6
bird    NN 26

The lexical probabilities are estimated in the obvious way. For example, the probability that "I" is a pronoun, Prob(PPSS | "I"), is estimated as the freq(PPSS "I")/freq("I") or 5837/5838. The probability that "see" is a verb is estimated to be 771/772. The other lexical probability estimates follow the same pattern.
The contextual probability, the probability of observing part of speech X, given the following two parts of speech Y and Z, is estimated by dividing the trigram part-of-speech frequency XYZ by the bigram part-of-speech frequency YZ.
Thus, for example, the probability of observing a verb before an article and a noun is estimated to be the ratio of the freq(VB, AT, NN) over the freq(AT, NN) or 3412/53091 = 0.064. The probability of observing a noun in the same context is estimated as the ratio of freq(NN, AT, NN) over 53091 or 629/53091 = 0.01.
The other contextual probability estimates follow the same pattern.
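As a check on the arithmetic, the trigram-over-bigram estimate can be written as a small Python function. The name `contextual_prob` and the dictionary-based frequency tables are assumptions of this sketch; the three frequencies are the ones quoted above.

```python
def contextual_prob(trigram_freq, bigram_freq, x, y, z):
    """Estimate P(x | following tags y, z) as freq(x, y, z) / freq(y, z)."""
    return trigram_freq[(x, y, z)] / bigram_freq[(y, z)]

# Frequencies quoted in the text (tagged Brown Corpus).
trigram_freq = {("VB", "AT", "NN"): 3412, ("NN", "AT", "NN"): 629}
bigram_freq = {("AT", "NN"): 53091}

p_verb = contextual_prob(trigram_freq, bigram_freq, "VB", "AT", "NN")
p_noun = contextual_prob(trigram_freq, bigram_freq, "NN", "AT", "NN")
```

Here `p_verb` rounds to 0.064 and `p_noun` to 0.01, so a verb is roughly five times as likely as a noun in front of an article-noun pair.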
A search is performed in order to find the assignment of part of speech tags to words that optimizes the product of the lexical and contextual probabilities. Conceptually, the search enumerates all possible assignments of parts of speech to input words. In this case, there are four input words, three of which are two ways ambiguous, producing a set of 2*2*2*1=8 possible assignments of parts of speech to input words:

I       see     a       bird
PPSS    VB      AT      NN
PPSS    VB      IN      NN
PPSS    UH      AT      NN
PPSS    UH      IN      NN
NP      VB      AT      NN
NP      VB      IN      NN
NP      UH      AT      NN
NP      UH      IN      NN

Each of the eight sequences is then scored by the product of the lexical probabilities and the contextual probabilities, and the best sequence is selected. In this case, the first sequence is by far the best.
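The conceptual enumeration can be reproduced in a few lines of Python. The `candidates` table is simply the set of tags listed for each word in the example; the variable names are chosen for this sketch.

```python
from itertools import product

# Candidate tags per word, as listed for the example sentence.
candidates = {
    "I": ["PPSS", "NP"],
    "see": ["VB", "UH"],
    "a": ["AT", "IN"],
    "bird": ["NN"],
}
words = ["I", "see", "a", "bird"]

# Cartesian product of the per-word candidate lists: 2 * 2 * 2 * 1 sequences.
assignments = list(product(*(candidates[w] for w in words)))
```

The list has eight entries in the same order as the table, beginning with PPSS VB AT NN. The count doubles with every additional two-way ambiguous word, which is why exhaustive enumeration is only a conceptual device.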
In fact, it is not necessary to enumerate all possible assignments because the scoring function cannot see more than two words away. In other words, in the process of enumerating part-of-speech sequences, it is possible in some cases to know that some sequence cannot possibly compete with another and can therefore be abandoned. Because of this fact, only O(n) paths will be enumerated. Let us illustrate this optimization with an example:
Find all assignments of parts of speech to "bird" and score the partial sequence. Henceforth, all scores are to be interpreted as log probabilities.
(-4.848072 "NN")
Find all assignments of parts of speech to "a" and score. At this point, there are two paths:
(-7.4453945 "AT" "NN")
(-15.01957 "IN" "NN")
Now, find assignments of "see" and score. At this point, the number of paths still seems to be growing exponentially.
(-10.1914 "VB" "AT" "NN")
(-18.54318 "VB" "IN" "NN")
(-29.974142 "UH" "AT" "NN")
(-36.53299 "UH" "IN" "NN")
Now, find assignments of "I" and score. Note that it is no longer necessary, though, to hypothesize that "a" might be a French preposition IN because all four paths, PPSS VB IN NN, NP VB IN NN, PPSS UH IN NN and NP UH IN NN score less well than some other path and there is no way that any additional input could change the relative score. In particular, the path PPSS VB IN NN scores lower than the path PPSS VB AT NN, and additional input will not help PPSS VB IN NN because the contextual scoring function has a limited window of three parts of speech, and that is not enough to see past the existing PPSS and VB.
(-12.927581 "PPSS" "VB" "AT" "NN")
(-24.177242 "NP" "VB" "AT" "NN")
(-35.667458 "PPSS" "UH" "AT" "NN")
(-44.33943 "NP" "UH" "AT" "NN")
The search continues two more iterations, assuming blank parts of speech for words out of range.
(-13.262333 blank "PPSS" "VB" "AT" "NN")
(-26.5196 blank "NP" "VB" "AT" "NN")
Finally, the result is: PPSS VB AT NN.
(-13.262333 blank blank "PPSS" "VB" "AT" "NN")
A slightly more interesting example is: "Can they can cans."
cans
(-5.456845 "NNS"), where "NNS" stands for "plural noun".
can
(-12.603266 "NN" "NNS")
(-15.935471 "VB" "NNS")
(-15.946739 "MD" "NNS"), where "MD" stands for "modal auxiliary".
they
(-18.02618 "PPSS" "MD" "NNS")
(-18.779934 "PPSS" "VB" "NNS")
(-21.411636 "PPSS" "NN" "NNS")
can
(-21.766554 "MD" "PPSS" "VB" "NNS")
(-26.45485 "NN" "PPSS" "MD" "NNS")
(-28.306572 "VB" "PPSS" "MD" "NNS")
(-21.932137 blank "MD" "PPSS" "VB" "NNS")
(-30.170452 blank "VB" "PPSS" "MD" "NNS")
(-31.453785 blank "NN" "PPSS" "MD" "NNS")
And the result is: Can/MD they/PPSS can/VB cans/NNS
For other details of the method -- optimizing probabilities, refer to Appendix A.
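The pruned right-to-left search can be sketched in Python as follows. This is a loose rendering of the idea in Appendix A, not the appendix code itself: the function name `best_tag_sequence`, the dictionary keyed on the two leftmost tags, and the omission of the two trailing "blank" iterations are choices made for the sketch. The lexical probabilities are those of the "I see a bird" example; the contextual function is a toy in which only the (VB, AT, NN) value comes from the text and the constant 0.05 is made up, so the log scores will not match the trace above.

```python
import math

BLANK = "blank"

def best_tag_sequence(words, lexical, contextual):
    """Tag a sentence right to left, keeping one best path per leading tag pair.

    lexical:    word -> {tag: probability}
    contextual: function (x, y, z) -> probability of x given following tags y, z
    """
    # Map from the two leftmost tags of a partial path to (log score, tags).
    paths = {(BLANK, BLANK): (0.0, [])}
    for word in reversed(words):
        new_paths = {}
        for (y, z), (score, tags) in paths.items():
            for x, lex_p in lexical[word].items():
                p = lex_p * contextual(x, y, z)
                if p <= 0.0:
                    continue  # impossible extension, drop it
                candidate = (score + math.log(p), [x] + tags)
                key = (x, y)  # later scoring can only see these two tags
                if key not in new_paths or candidate[0] > new_paths[key][0]:
                    new_paths[key] = candidate
        paths = new_paths
    return max(paths.values(), key=lambda c: c[0])[1]

lexical = {
    "I": {"PPSS": 5837 / 5838, "NP": 1 / 5838},
    "see": {"VB": 771 / 772, "UH": 1 / 772},
    "a": {"AT": 23013 / 23019, "IN": 6 / 23019},
    "bird": {"NN": 1.0},
}

def toy_contextual(x, y, z):
    # Only this one value is from the text; 0.05 is a made-up placeholder.
    if (x, y, z) == ("VB", "AT", "NN"):
        return 3412 / 53091
    return 0.05

best = best_tag_sequence(["I", "see", "a", "bird"], lexical, toy_contextual)
```

Because two paths that share their two leftmost tags are scored identically by everything still to come, only the better of the two needs to be kept. That bounds the number of live paths by the number of tag pairs, whatever the sentence length.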
This description completes the description of operation through that of block 14.
As an example of the utilization occurring in block 15, display is conceptually the simplest, but still practical, particularly in an interactive system with a human operator. More elaborate examples of utilization will be given hereinafter in the description of FIGs. 3 and 4. But first, it is desirable to describe one more tool. That tool is noun phrase parsing, using an extension of my method.
Similar stochastic methods have been applied to locate simple noun phrases with very high accuracy. The proposed method is a stochastic analog of precedence parsing. Recall that precedence parsing makes use of a table that says whether to insert an open or close bracket between any two categories (terminal or nonterminal). The proposed method makes use of a table that gives the probabilities of an open and close bracket between all pairs of parts of speech. A sample is shown below for the five parts of speech: AT (article), NN (singular noun), NNS (non-singular noun), VB (uninflected verb), IN (preposition). These probabilities were estimated from about 40,000 words of training material selected from the Brown Corpus. The training material was parsed into noun phrases by laborious semi-automatic means.
Probability of Starting a Noun Phrase, Between First and Second Words

                     Second Word
                AT     NN     NNS    VB     IN
First    NN     .9[?]  .01    0      0      0
Word     NNS    1.0    .02    .11    0      0
         VB     1.0    1.0    1.0    0      0
         IN     1.0    1.0    1.0    0      0

Probability of Ending a Noun Phrase, Between First and Second Words

                     Second Word
                AT     NN     NNS    VB     IN
         AT     0      0      0      0      1.0
First    NN     1.0    .01    0      1.0    1.0
Word     NNS    1.0    .02    .11    1.0    1.0
         IN     0      0      0      0      .02

The stochastic parser is given a sequence of parts of speech as input and is asked to insert brackets corresponding to the beginning and end of noun phrases. Conceptually, the parser enumerates all possible parsings of the input and scores each of them by the precedence probabilities. Consider, for example, the input sequence: NN VB. There are 5 possible ways to bracket this sequence (assuming no recursion):
NN VB
[NN] VB
[NN VB]
[NN] [VB]
NN [VB]
Each of these parsings is scored by multiplying 6 precedence probabilities, the probability of an open/close bracket appearing (or not appearing) in any one of the three positions (before the NN, after the NN or after the VB). The parsing with the highest score is returned as output.
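The count of five bracketings can be verified with a short Python sketch. It enumerates only; it does not score, since the probability tables above are incomplete as reproduced. The B/I/O labelling (begin a phrase, continue a phrase, outside any phrase) is a device of this sketch and is not taken from the text.

```python
from itertools import product

def bracketings(tags):
    """List every non-recursive noun-phrase bracketing of a tag sequence."""
    results = []
    for labels in product("BIO", repeat=len(tags)):
        # "I" continues a phrase, so it cannot come first or follow "O".
        if any(lab == "I" and (i == 0 or labels[i - 1] == "O")
               for i, lab in enumerate(labels)):
            continue
        parts = []
        for i, (tag, lab) in enumerate(zip(tags, labels)):
            opens = lab == "B"
            # A phrase closes after this tag unless the next tag continues it.
            closes = lab != "O" and (i + 1 == len(tags) or labels[i + 1] != "I")
            parts.append(("[" if opens else "") + tag + ("]" if closes else ""))
        results.append(" ".join(parts))
    return results

parses = bracketings(["NN", "VB"])
```

For the input NN VB this yields exactly the five parsings listed above. A scorer would then multiply, for each parsing, the open and close probabilities (or their complements) at each of the three positions.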
The method works remarkably well considering how simple it is.
There is some tendency to underestimate the number of brackets and run two noun phrases together.
It will be noted that noun phrase parsing, as described in FIG. 2, assumes the output from the part of speech assignment of FIG. 1 as its input. But it could also use the results of any other part of speech assignment technique.
In either event, in block 22, all possible noun phrase boundaries are assigned. In block 23, non-paired boundaries are eliminated. For each sentence, these would include an ending boundary at the start of the sentence, and a beginning boundary at the end of a sentence (including blanks).
The operation of block 24 involves laying out a probability tree for each self-consistent assignment of noun-phrase boundaries. The highest probability assignments are then retained for later processing, e.g., utilization of the results, as indicated in block 25.
Now, let us turn to a more specific application of my invention. Part of speech tagging is an important practical problem with potential applications in many areas including speech synthesis, speech recognition, spelling correction, proofreading, query answering, machine translation and searching large text databases (e.g., patents, newspapers). I am particularly interested in speech synthesis applications, where it is clear that pronunciation sometimes depends on part of speech. Consider the following three examples where pronunciation depends on part of speech.
First, there are words like "wind" where the noun has a different vowel than the verb. That is, the noun "wind" has a short vowel as in "the wind is strong," whereas the verb "wind" has a long vowel as in "Do not forget to wind your watch."
Secondly, the pronoun "that" is stressed as in "Did you see THAT?" unlike the complementizer "that," as in "It is a shame that he is leaving."
Thirdly, note the difference between "oily FLUID" and "TRANSMISSION fluid"; as a general rule, an adjective-noun sequence such as "oily FLUID" is typically stressed on the right whereas a noun-noun sequence such as "TRANSMISSION fluid" is typically stressed on the left, as stated, for example, by Erik Fudge in English Word Stress, George Allen & Unwin (Publishers) Ltd., London 1984. These are but three of the many constructions which would sound more natural if the synthesizer had access to accurate part of speech information.
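How a synthesizer might consume the tags can be suggested by a small Python sketch. Everything here is hypothetical scaffolding: the lookup table, the function name `stressed_word`, and the use of JJ as the adjective tag (taken from the Brown tag set, not from this description) are assumptions, and the rule covers only the two patterns named above.

```python
# Pronunciation keyed on (word, tag): the "wind" example from the text.
VOWEL_LENGTH = {
    ("wind", "NN"): "short",  # "the wind is strong"
    ("wind", "VB"): "long",   # "wind your watch"
}

def stressed_word(first, second):
    """Pick the stressed word of a two-word phrase from its tags.

    first, second: (word, tag) pairs. Returns None when neither of the
    two general rules applies.
    """
    (w1, t1), (w2, t2) = first, second
    if t1 == "JJ" and t2 == "NN":
        return w2  # adjective-noun: stress on the right
    if t1 == "NN" and t2 == "NN":
        return w1  # noun-noun: stress on the left
    return None

right = stressed_word(("oily", "JJ"), ("fluid", "NN"))
left = stressed_word(("transmission", "NN"), ("fluid", "NN"))
```

The point of the sketch is only that both decisions are driven by the tag and not by the spelling, which is why tagging accuracy matters for synthesis.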
In FIG. 3, the part of speech tagger 31 is a computer employing the method of FIG. 1. Noun phrase parser 32 is a computer employing the method of FIG. 2.
The outputs of tagger 31 and parser 32 are applied in a syntax analyzer to provide the input signals for the absolute stress signal generator 18 of FIG. 1 of U.S. Patent No. 3,704,345 issued to C. H. Coker, et al.
As an example of the rules under discussion, attention is directed to Appendix 5.1 at pages 144-149 of the Fudge book, which sets forth the rules for noun phrases.
In other respects, the operation of the embodiment of FIG. 3 is like that of the embodiment of FIG. 1 of the Coker patent.
Similarly, in the embodiment of FIG. 4, part of speech tagger 41 functions as described in FIG. 1; and noun phrase parser 42 functions as described in FIG. 2.
In that case, the noun phrase and parts of speech information is applied in the text editing system 43, which is of the type described in U.S. Patent No. 4,674,065 issued to Lange et al. Specifically, part-of-speech tagger 41 and noun phrase parser 42 provide a substitute for "parts of speech" Section 33 in the Lange et al. patent to assist in generating the editing displays therein. The accuracy inherent in my method of FIGs. 1 and 2 should yield more useful editing displays than is the case in the prior art.
Alternatively, text editing system 43 may be the Writer's Workbench® system described in Computer Science Technical Report, No. 91, "Writing Tools - The STYLE & DICTION Programs", by L. L. Cherry, et al., February 1981, Bell Telephone Laboratories, Incorporated. My methods would be a substitute for the method designated "PARTS" therein.
It should be apparent that various modifications of my invention can be made without departing from the spirit and scope thereof.
For example, one way of implementing the stress rules of the Fudge book would be by the algorithm disclosed by Jonathan Allen et al., in the book From Text to Speech: The MITalk System, the Cambridge University Press, Cambridge (1987), especially Chapter 10, "The Fundamental Frequency Generator".
Further, the lexical probabilities are not the only probabilities that could be improved by smoothing. Contextual frequencies also seem to follow Zipf's Law. That is, for the set of all sequences of three parts of speech, we have plotted the frequency of the sequence against its rank on log paper and observed the classic linear relationship and slope of almost -1. It is clear that smoothing techniques could well be applied to contextual frequencies also. The same can also be said for the precedence probabilities used in noun phrase parsing.
The techniques of my invention also have relevance to other applications, such as speech recognition. Part-of-speech contextual probabilities could make possible better choices for a spoken word which is to be recognized.
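Returning to the rank-frequency observation on contextual frequencies, the slope on log paper can be estimated with an ordinary least-squares fit. The following Python sketch is illustrative only: the function name `loglog_slope` is hypothetical, and the input is an idealized series f = C/rank and not actual trigram counts from the corpus.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two strictly positive frequencies.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Idealized Zipf series: frequency inversely proportional to rank.
ideal = [1000.0 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal)
```

For the idealized series the fitted slope is -1 up to rounding error. Applied to real trigram counts, a slope near -1 would indicate the same long tail of rare sequences that motivates smoothing for the lexical counts.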
My techniques can also be substituted directly for the described part-of-speech tagging in the system for interrogating a database disclosed in U.S. Patent No. 4,688,194, issued August 18, 1987 to C. W. Thompson et al.
Other modifications and applications of my invention are also within its spirit and scope.
APPENDIX A
INPUT a file of the form:
<word> <pos> <lex_prob> <pos> <lex_prob> <pos> <lex_prob> ...
<word> <pos> <lex_prob> <pos> <lex_prob> <pos> <lex_prob> ...
<word> <pos> <lex_prob> <pos> <lex_prob> <pos> <lex_prob> ...
<word> <pos> <lex_prob> <pos> <lex_prob> <pos> <lex_prob> ...
Each line corresponds to a word (token) in the sentence (in reverse order). The <pos> and <lex_prob> are parts of speech and lexical probabilities.
OUTPUT the best sequence of parts of speech.
new_active_paths := {} ; set of no paths
A path is a record of a sequence of parts of speech and a score. The variable old_active_paths is initialized to a set of 1 path; the path contains a sequence of no parts of speech and a likelihood score of 1.0.
old_active_paths := {<parts: [], score: 1.0>} ; set of 1 path
input:
    line := readline()
    if (line = end_of_file) goto finish
    word := popfield(line)
    while (line is not empty)
        pos := popfield(line)
        lex_prob := popfield(line)
        loop for old_path in old_active_paths
            old_parts := old_path->parts
            old_score := old_path->score
            new_parts := concatenate(old_parts, pos)
            new_score := lex_prob * old_score * contextual_prob(new_parts)
            new_path := make_record(new_parts, new_score)
            if (new_score > score of paths in new_active_paths with the same last two parts of speech)
                new_active_paths := add new_path to new_active_paths
    old_active_paths := new_active_paths
    new_active_paths := {}
    goto input
finish:
    find path in old_active_paths with best score
    output path->parts
contextual_prob([... x y z]):
    return(freq(x y z)/freq(x y))
Input file:
Word    Pos     Lex Prob        Pos     Lex Prob
blank   blank   1.0
blank   blank   1.0
bird    NN      1.0
a       AT      23013/23019     IN      6/23019
see     VB      771/772         UH      1/772
I       PPSS    5837/5838       NP      1/5838
blank   blank   1.0
blank   blank   1.0
Output file:
blank blank NN AT VB PPSS blank blank
Trace of old_active_paths:
(henceforth, scores should be interpreted as log probabilities)
After processing the word "bird," old_active_paths is
{<parts: [NN blank blank] score: -4.848072>}
After processing the word "a," old_active_paths is
{<parts: [AT NN blank blank] score: -7.4453945>
<parts: [IN NN blank blank] score: -15.01957>}
After the word "see"
{<parts: [VB AT NN blank blank] score: -10.1914>
<parts: [VB IN NN blank blank] score: -18.54318>
<parts: [UH AT NN blank blank] score: -29.974142>
<parts: [UH IN NN blank blank] score: -36.53299>}
After the word "I"
{<parts: [PPSS VB AT NN blank blank] score: -12.927581>
<parts: [NP VB AT NN blank blank] score: -24.177242>
<parts: [PPSS UH AT NN blank blank] score: -35.667458>
<parts: [NP UH AT NN blank blank] score: -44.33943>}
The search continues two more iterations, assuming blank parts of speech for words out of range.
{<parts: [blank PPSS VB AT NN blank blank] score: -13.262333>
<parts: [blank NP VB AT NN blank blank] score: -26.5196>}
finally
{<parts: [blank blank PPSS VB AT NN blank blank] score: -13.262333>}
Overall performance of my method has been surprisingly good, considering that its operation is strictly local in nature and that, in general, it has 5 no way to look on both sides of a noun phrase, for exarnple, to determine the usage of what may an auxiliary verb, for instance.
J~ If every possibility in the dictionary must be given equal weight, parsing is very difficult. Dictionaries tend to focus on what is possible, not on what is likely. Consider the trivial sentence, "I see a bird." For all practical10 purposes, every word in the sentence is unambiguous. According to Francis andKucera, the word "I" appears as a pronoun in 5837 out of 5838 observations (100%), "see" appears as a verb in 771 out of 772 observations (100%), "a"
appears as an article in 23013 out of 23019 observations (100%) and "bird"
appears as a noun in 26 out of 26 observations (100%). However, according to 15 Webster's Seventh New Collegiate Dicdonary, every word is ambiguous. In addition to the desired assignments of tags (parts of speech), the first three words are listed as nouns and the last as an intransitive verb. One might hope that these spurious assignments could be ruled out by the parser as syntacdcally ill-formed.
Unfortunately, the prior art has no consistent way to achieve that result. If the 20 parser is going to accept noun phrases of the form:
[NP [N city] [N school]~N committee][N meeting]], then it cannot rule out [NP[N Il[N see ] [N a] [N bird]] twhere "NP" stancls for "noun phrase"; and "N"
stands for "noun").
25 Similarly, the parser probably also has to accept bird as an intransitive verb, since there is nothing syntactically wrong with:
[S[NP[N I][N see]rN a]] [VP [V bird]]], where "S" stands for "subject" and "VP" stands for "verb phrase" and "V" stands for "verb".
These part-of-speech assignments are not wrong; they are just extremely 30 improbable.
~3~34S
Consider once again the sentence, "I see a bird." The problem is to find an assignment of parts of speech to words that optimizes both lexical and contextllal probabilities, both of which are estimated from the Tagged Brown Corpus. The lexical probabilities are estimated from the following frequencies S (PPSS = singular pronoun; NP = proper noun; VB - verb; UH = interjection; IN = preposition; AT = article; NN = noun):
Word Parts of Speech -- ...
see VB 771 UH
a AT 23013 In (French) 6 bird NN 26 15 The lexical probabilides ar estimated in the ~ ~bvious way. For example, the probability tha~ "I" is a pronoun, Prob(PPSS ¦ "I"), is estimated as the freq(PPSS "I")/freq("I") or 5837/5838. The probability that "see" is a verb is esdmated to be 77U772. The other lexical probability esdmates follow the same pattern.
The contextual probability, the probability of observing part of speech X, given the following two parts of speech Y and Z, is esdmated by dividing the trigram part-of-speech frequency XY~ by the bigram part-of-speech frequency YZ.
- Thus, ~or example, the probability of observing a verb before an article and a noun is estimated to be the ratio of the freq(VB, AT, NN) over the freq(AT, NN) 25 or3412/53091 = 0.064. The probability of observing a noun in the same contextis esdmated as the ratio of freq~NN, AT, NN) over 53091 or 629/~3091 = 0.01.
The other contextual probability esdmates follow the sarne pattern.
~ search is performed in order to find the assignment of part of speech tags to words that optimizes the product of the lexical and contextual 30 probabilides. Conceptually, the sea.rch enumerates all possible assignments of parts of speech to input words. In this case, there are four input words, three of which are two ways ambiguous, producing a set of 2*2*2*1=8 possible assignments of parts of speech to input words:
3~3~S
g I see a bird PPSS VB AT NN
PPSS VB IN NN
PPSS UH AT NN
PPSS UH IN NN
NP VB AT NN
NP VB IN NN
NP UH AT NN
NP UH IN NN
Each of the eight sequences are then scored by the product of the lexical probabilities and the contextual probabilities, and the best sequence isselected. In this case, the first sequence is by far the best.
In fa~t, it is not necessary to enumerate all possible assignments lS because the scoring funcdon cannot see more than tWO words away. In other words, in the process of enumerating part-o~-speech sequences, it is possible insome cases to know that some sequence cannot possibly compete with another and can therefore be abandoned. Because of this fact, only O(n) paths will be enumerated. Let us illustrate this optimization with an example:
Find all assignments of parts of speech to "bird" and score the partial sequence. Henceforth, all scores are to be interpreted as log probabilities.
(-~.848072 "NN") Find all assignments of parts of speech to "a" and score. At this point, there are two paths:
(-7.4453945 "AT" "NN") (-15.01957 "IN" "NN") Now, find assignments of "see" and score~ At this point, the number of paths still seems to be growing exponentially.
(-10.1914 "VB" "AT" "NN") (-18.54318 "VB" "lN" "NN") ~-29.974142 "UH" "AT" "NN"~
(-36.53299 "UH" "IN" "NN") Now, find assignments of "I" and score. Note that it is no longer necessary, though, to hypothesize that "a" might bc a French preposition IN
35 because all four paths, PPSS VB IN NN, NN VB IN NN, PPSS UH IN NN and NP UH ~ NN score less well than some other path and there is no way that any additional input could change the relative score. In particular, the path PPSS VB
,'` ' ,.
~3~ 5 IN NN scores lower than the path PPSS VB AT NN, and additional input will not help PPSS VB IN NN because the contextual scoring function has a limited window of three parts of speech, and that is not enough to see past the existingPPSS and VB.
5 (-12.927581 "PPSS" "VB" "AT" "NN") (-24.177242 "NP" "VB" "AT" "NN") (-35.667458 "PPSS" "UH" "AT" "NN") (-44.33943 "NP" "UH" "AT" "NN") The search continues two more iterations, assuming blank parts of speech for 10 words out of range.
` (-13.262333 blank "PPSS" "VB" "AT" "NN") (-26.5196 blank "NP" "VB" "AT" "NN") Finally, the result is: PPSS VB AT NN.
(-13.262333 blank blank "PPSS" "VB" "AT" "NN") 15 A slightly more interesting example is: "Can thcy can cans."
cans (-5.456845 "NNS"), where "NNS" stands for "plural noun".
can (-12.603266 "NN" "NNS") 20 (-15.935471 "VB" "NNS") (-15.946739 "MD" "NN3"~, where "MD" stands for "model auxiliary".
they (-18.02618 "PPSS" "MD" "NNS"~
25 (-18.779934"PPSS""VB""NNS") (-21.411636 "PP~S" "NN" "NNS") ~3~3~LS
can (-21.766554 "MD" "PPSS" "VB" "NNS") (-26.45485 "NN" "PPSS" "MD" "NNS") (-28.306572 "VB" "PPSS" "MD" "NNS") 5 (-21.932137 blank "MD" "PPSS" "VB" "NNS") (-30.170452 blank "VB" "PPSS" "MD" "NNS") (-31.453785 blank "NN" "PPSS" "MD" "NNS") And the result is: Can/MD they/PPSS can/VB cans/NNS
For other details of the method -- optimizing probabilities, refer to 10 Appendix A.
This description completes the descnption of operation through that of block 14.
As an example of the utilization occurring in block 15, display is conceptually the simplest, but sdll practical, particularly in an interactive system 15 with a human operator. More elaborate example of utilization will be given hereinafter in the description of PIGs. 3 and 4. But first, it is desirable to descnbe one more tool. That tool is noun phrase parsing, using an extension of my method.
Similar stochastic methods have been applied to locate simple noun 20 phrases with very high accuracy. The proposed method is a stochastic analog of precedence parsing. Recall that precedence parsing makes use of a table that says whether to insert an open or close bracket between any two categories (terrninal or nonterminal). The proposed method makes use of a table that gives the probabilities of an open and close bracket between all pairs of parts of speech. A
25 sample is shown below for the five parts of speech: AT (article), NN (singular noun), NNS (non-singular noun), VB (uninflected verb), IN ~preposition). These probabilities were estimated ~om about 40,000 words of training material selected from the Brown Corpus. The training material was parsed into noun phrases by laborious semi-automatic means.
~3~:~34S
Probability of Starting a Noun Phrase, Between First and Second Words

                      Second Word
               AT    NN    NNS   VB    IN
First   NN     .9?   .01   0     0     0
Word    NNS    1.0   .02   .11   0     0
        VB     1.0   1.0   1.0   0     0
        IN     1.0   1.0   1.0   0     0

Probability of Ending a Noun Phrase, Between First and Second Words

                      Second Word
               AT    NN    NNS   VB    IN
        AT     0     0     0     0     1.0
First   NN     1.0   .01   0     1.0   1.0
Word    NNS    1.0   .02   .11   1.0   1.0
        IN     0     0     0     0     .02

The stochastic parser is given a sequence of parts of speech as input and is asked to insert brackets corresponding to the beginning and end of noun phrases. Conceptually, the parser enumerates all possible parsings of the input and scores each of them by the precedence probabilities. Consider, for example, the input sequence: NN VB. There are 5 possible ways to bracket this sequence (assuming no recursion):
NN VB
[NN] VB
[NN VB]
[NN] [VB]
NN [VB]
Each of these parsings is scored by multiplying 6 precedence probabilities: the probability of an open/close bracket appearing (or not appearing) in any one of the three positions (before the NN, after the NN or after the VB). The parsing with the highest score is returned as output.
The method works remarkably well considering how simple it is.
There is some tendency to underestimate the number of brackets and run two noun phrases together.
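The scoring of the NN VB bracketings described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (NN, VB) entries are taken from the sample tables, but the tables give no values for the sentence edges, so the entries involving "blank" are assumed for the example, and the function names `bracketings`, `score` and `render` are hypothetical.

```python
from itertools import product

# Precedence probabilities keyed by (first tag, second tag). The ("NN", "VB")
# entries come from the sample tables; the entries involving "blank" (sentence
# edges) are NOT in the tables and are assumed values for illustration only.
P_START = {("blank", "NN"): 0.9, ("NN", "VB"): 0.0, ("VB", "blank"): 0.0}
P_END = {("blank", "NN"): 0.0, ("NN", "VB"): 1.0, ("VB", "blank"): 0.0}


def bracketings(n):
    """Yield each non-recursive bracketing of n words as (opens, closes).

    opens[g] / closes[g] say whether an open / close bracket sits in gap g,
    where gap 0 is before the first word and gap n is after the last word.
    """
    # Label each word O (outside), B (begins a phrase) or I (continues one).
    for labels in product("OBI", repeat=n):
        if any(lab == "I" and (i == 0 or labels[i - 1] == "O")
               for i, lab in enumerate(labels)):
            continue  # "I" may only follow "B" or "I"
        opens = [False] * (n + 1)
        closes = [False] * (n + 1)
        for i, lab in enumerate(labels):
            if lab == "B":
                opens[i] = True
            # A phrase ends after word i unless the next word continues it.
            if lab != "O" and (i == n - 1 or labels[i + 1] != "I"):
                closes[i + 1] = True
        yield opens, closes


def score(tags, opens, closes):
    """Multiply two probabilities (open, close) for every gap."""
    padded = ["blank"] + list(tags) + ["blank"]
    total = 1.0
    for gap in range(len(tags) + 1):
        pair = (padded[gap], padded[gap + 1])
        total *= P_START[pair] if opens[gap] else 1.0 - P_START[pair]
        total *= P_END[pair] if closes[gap] else 1.0 - P_END[pair]
    return total


def render(tags, opens, closes):
    """Format a bracketing as text, e.g. '[NN] VB'."""
    out = []
    for gap in range(len(tags) + 1):
        if closes[gap]:
            out[-1] += "]"
        if gap < len(tags):
            out.append(("[" if opens[gap] else "") + tags[gap])
    return " ".join(out)


tags = ["NN", "VB"]
candidates = list(bracketings(len(tags)))
best = max(candidates, key=lambda oc: score(tags, *oc))
best_text = render(tags, *best)
```

For two words the enumeration produces the five bracketings listed above, and each score is a product of six factors (two per gap), matching the description. Exhaustive enumeration is only practical for short inputs; it is used here because it mirrors the conceptual description rather than an optimized search.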
It will be noted that noun phrase parsing, as described in FIG. 2, assumes the output from the part of speech assignment of FIG. 1 as its input. But it could also use the results of any other part of speech assignment technique.
In either event, in block 22, all possible noun phrase boundaries are assigned. In block 23, non-paired boundaries are eliminated. For each sentence, these would include an ending boundary at the start of the sentence, and a beginning boundary at the end of a sentence (including blanks).
The operation of block 24 involves laying out a probability tree for each self-consistent assignment of noun-phrase boundaries. The highest probability assignments are then retained for later processing, e.g., utilization of the results, as indicated in block 25.
Now, let us turn to a more specific application of my invention. Part of speech tagging is an important practical problem with potential applications in many areas including speech synthesis, speech recognition, spelling correction, proofreading, query answering, machine translation and searching large text databases (e.g., patents, newspapers). I am particularly interested in speech synthesis applications, where it is clear that pronunciation sometimes depends on part of speech. Consider the following three examples where pronunciation depends on part of speech.
First, there are words like "wind" where the noun has a different vowel than the verb. That is, the noun "wind" has a short vowel as in "the wind is strong," whereas the verb "wind" has a long vowel as in "Do not forget to wind your watch."
Secondly, the pronoun "that" is stressed as in "Did you see THAT?" unlike the complementizer "that," as in "It is a shame that he is leaving."
Thirdly, note the difference between "oily FLUID" and "TRANSMISSION fluid"; as a general rule, an adjective-noun sequence such as "oily FLUID" is typically stressed on the right whereas a noun-noun sequence such as "TRANSMISSION fluid" is typically stressed on the left, as stated, for example, by Erik Fudge in English Word Stress, George Allen & Unwin (Publishers) Ltd., London 1984. These are but three of the many constructions which would sound more natural if the synthesizer had access to accurate part of speech information.
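The third example can be made concrete with a small Python sketch that picks the stressed word of a two-word sequence from its part-of-speech tags. It is a toy under stated assumptions: the function name `stress_pair` is hypothetical, the tag "JJ" for adjectives is assumed (the text above names only tags such as NN and NNS), and the general rule is applied with none of the exceptions a real stress assigner would need.

```python
# Tags treated as nouns come from the text; "JJ" for adjectives is an
# assumed tag name used only for this illustration.
NOUN_TAGS = {"NN", "NNS"}
ADJECTIVE_TAGS = {"JJ"}


def stress_pair(first, second):
    """Return the two-word phrase with the stressed word in upper case.

    `first` and `second` are (word, tag) pairs. Adjective-noun sequences are
    stressed on the right, noun-noun sequences on the left; any other
    sequence is returned unchanged because the rule says nothing about it.
    """
    (word1, tag1), (word2, tag2) = first, second
    if tag1 in ADJECTIVE_TAGS and tag2 in NOUN_TAGS:
        return f"{word1} {word2.upper()}"
    if tag1 in NOUN_TAGS and tag2 in NOUN_TAGS:
        return f"{word1.upper()} {word2}"
    return f"{word1} {word2}"


adjective_noun = stress_pair(("oily", "JJ"), ("fluid", "NN"))
noun_noun = stress_pair(("transmission", "NN"), ("fluid", "NN"))
```

The point of the sketch is that the decision needs nothing beyond the tags, which is why tagging accuracy bears directly on how natural the synthesized stress sounds.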
In FIG. 3, the part of speech tagger 31 is a computer employing the method of FIG. 1. Noun phrase parser 32 is a computer employing the method of FIG. 2.
The outputs of tagger 31 and parser 32 are applied in a syntax analyzer to provide the input signals for the absolute stress signal generator 18 of FIG. 1 of U.S. Patent No. 3,704,345 issued to C. H. Coker, et al.
As an example of the rules under discussion, attention is directed to Appendix 5.1 at pages 144-149 of the Fudge book, which sets forth the rules for noun phrases.
In other respects, the operation of the embodiment of FIG. 3 is like that of the embodiment of FIG. 1 of the Coker patent.
Similarly, in the embodiment of FIG. 4, part of speech tagger 41 functions as described in FIG. 1; and noun phrase parser 42 functions as described in FIG. 2.
In that case, the noun phrase and parts of speech information is applied in the text editing system 43, which is of the type described in U.S. Patent No. 4,674,065 issued to F. B. Lange et al. Specifically, part-of-speech tagger 41 and noun phrase parser 42 provide a substitute for "parts of speech" Section 33 in the Lange et al. patent to assist in generating the editing displays therein. The accuracy inherent in my method of FIGS. 1 and 2 should yield more useful editing displays than is the case in the prior art.
Alternatively, text editing system 43 may be the Writer's Workbench® system described in Computer Science Technical Report, No. 91 "Writing Tools - The STYLE & Diction Programs", by L. L. Cherry, et al., February 1981, Bell Telephone Laboratories, Incorporated. My methods would be a substitute for the method designated "PARTS" therein.
It should be apparent that various modifications of my invention can be made without departing from the spirit and scope thereof.
For example, one way of implementing the stress rules of the Fudge book would be by the algorithm disclosed by Jonathan Allen et al., in the book From Text to Speech: The MITalk System, the Cambridge University Press, Cambridge (1987), especially Chapter 10, "The Fundamental Frequency Generator".
Further, the lexical probabilities are not the only probabilities that could be improved by smoothing. Contextual frequencies also seem to follow Zipf's Law. That is, for the set of all sequences of three parts of speech, we have plotted the frequency of the sequence against its rank on log paper and observed the classic linear relationship and slope of almost -1. It is clear that smoothing techniques could well be applied to contextual frequency alternatives. The same can also be said for the precedence probabilities used in noun phrase parsing.
The techniques of my invention also have relevance to other applications, such as speech recognition. Part-of-speech contextual probabilities could make possible better choices for a spoken word which is to be recognized.
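The Zipf's Law check described above for contextual frequencies (plotting frequency against rank on log paper and reading off the slope) can be approximated numerically with a least-squares fit of log frequency against log rank. The sketch below is a minimal illustration; the function name `zipf_slope` is hypothetical, and the sample data is synthetic, generated to follow an ideal 1/rank curve rather than taken from any corpus.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Frequencies are ranked from most to least frequent; zero counts are
    dropped because their logarithm is undefined. At least two distinct
    ranks are needed for the fit.
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic counts that follow frequency = 1000 / rank exactly.
ideal_counts = [1000.0 / rank for rank in range(1, 51)]
slope = zipf_slope(ideal_counts)
```

Applied to real counts of three-tag sequences, a slope near -1 would be consistent with the observation above; a markedly different slope would suggest the distribution is not Zipf-like.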
My techniques can also be substituted directly for the described part-of-speech tagging in the system for interrogating a database disclosed in U.S. Patent No. 4,688,194, issued August 18, 1987 to C. W. Thompson et al.
Other modifications and applications of my invention are also within its spirit and scope.
APPENDIX A

INPUT a file of the form:
<word> <pos> <lex_prob> <pos> <lex_prob> <pos> <lex_prob>...
<word> <pos> <lex_prob> <pos> <lex_prob> <pos> <lex_prob>...
<word> <pos> <lex_prob> <pos> <lex_prob> <pos> <lex_prob>...
<word> <pos> <lex_prob> <pos> <lex_prob> <pos> <lex_prob>...
Each line corresponds to a word (token) in the sentence (in reverse order). The <pos> and <lex_prob> are parts of speech and lexical probabilities.

OUTPUT the best sequence of parts of speech.

new_active_paths := {}    ; set of no paths

A path is a record of a sequence of parts of speech and a score. The variable old_active_paths is initialized to a set of 1 path; the path contains a sequence of no parts of speech and a likelihood score of 1.0.

old_active_paths := {<parts: [], score: 1.0>}    ; set of 1 path

input:
    line := readline()
    if (line = end_of_file) goto finish
    word := popfield(line)
    while (line is not empty)
        pos := popfield(line)
        lex_prob := popfield(line)
        loop for old_path in old_active_paths
            old_parts := old_path->parts
            old_score := old_path->score
            new_parts := concatenate(old_parts, pos)
            new_score := lex_prob * old_score * contextual_prob(new_parts)
            new_path := make_record(new_parts, new_score)
            if (new_score > score of paths in new_active_paths with the same last two parts of speech)
                new_active_paths := add new_path to new_active_paths
    old_active_paths := new_active_paths
    new_active_paths := {}
    goto input

finish:
    find path in new_active_paths with best score
    output path->parts

contextual_prob([... x y z]):
    return(freq(x y z)/freq(x y))
Input file:
Word     Pos  Lex_Prob       Pos  Lex_Prob
blank    blank  1.0
blank    blank  1.0
bird     NN   1.0
a        AT   23013/23019    IN   6/23019
see      VB   771/772        UH   1/772
blank    blank  1.0
blank    blank  1.0

Output file:
blank blank NN AT VB PPSS blank blank

Trace of old_active_paths:
(henceforth, scores should be interpreted as log probabilities)

After processing the word "bird", old_active_paths is
{<parts: [NN blank blank] score: -4.848072>}

After processing the word "a", old_active_paths is
{<parts: [AT NN blank blank] score: -7.4453945>
 <parts: [IN NN blank blank] score: -15.01957>}

After the word "see"
{<parts: [VB AT NN blank blank] score: -10.1914>
 <parts: [VB IN NN blank blank] score: -18.54318>
 <parts: [UH AT NN blank blank] score: -29.974142>
 <parts: [UH IN NN blank blank] score: -36.53299>}

After the word "I"
{<parts: [PPSS VB AT NN blank blank] score: -12.927581>
 <parts: [NP VB AT NN blank blank] score: -24.177242>
 <parts: [PPSS UH AT NN blank blank] score: -35.667458>
 <parts: [NP UH AT NN blank blank] score: -44.33943>}

The search continues two more iterations, assuming blank parts of speech for words out of range.
{<parts: [blank PPSS VB AT NN blank blank] score: -13.262333>
 <parts: [blank NN VB AT NN blank blank] score: -26.5196>}

finally
{<parts: [blank blank PPSS VB AT NN blank blank] score: -13.262333>}
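The search in Appendix A can be sketched in Python roughly as follows. This is a hedged illustration, not the appendix program itself: the function name `tag_sentence` and its calling convention are hypothetical, scores are accumulated as log probabilities (as in the trace), each new part of speech is placed at the front of the path (as the trace shows), and the contextual probability is supplied by the caller because no trigram or bigram frequency table is reproduced in this text. The uniform `uniform` function in the usage lines is an assumption that makes the result depend on lexical probabilities only.

```python
import math


def tag_sentence(lines, contextual_prob):
    """Return the best part-of-speech sequence for a sentence.

    lines: [(word, [(pos, lex_prob), ...]), ...] in REVERSE sentence order.
    contextual_prob(x, y, z): caller-supplied estimate for tag x given the
    two following tags y and z (for example freq(x y z) / freq(y z)).
    """
    active = [((), 0.0)]  # one empty path; log score 0.0 is probability 1.0
    for _word, candidates in lines:
        best = {}  # (newest tag, following tag) -> (parts, log score)
        for pos, lex_prob in candidates:
            for parts, old_score in active:
                # Tags already chosen for the next two words; pad with
                # "blank" when the path is still too short.
                y = parts[0] if len(parts) > 0 else "blank"
                z = parts[1] if len(parts) > 1 else "blank"
                p = lex_prob * contextual_prob(pos, y, z)
                if p <= 0.0:
                    continue  # impossible extension; log is undefined
                new_parts = (pos,) + parts
                new_score = old_score + math.log(p)
                key = (pos, y)
                # Keep only the best path for each pair of newest tags.
                if key not in best or new_score > best[key][1]:
                    best[key] = (new_parts, new_score)
        active = list(best.values())
    parts, _score = max(active, key=lambda path: path[1])
    return list(parts)


# Usage with the lexical probabilities from the input file above and an
# assumed uniform contextual probability (NOT estimated from any corpus).
def uniform(x, y, z):
    return 0.5


lines = [
    ("blank", [("blank", 1.0)]),
    ("blank", [("blank", 1.0)]),
    ("bird", [("NN", 1.0)]),
    ("a", [("AT", 23013 / 23019), ("IN", 6 / 23019)]),
    ("see", [("VB", 771 / 772), ("UH", 1 / 772)]),
    ("blank", [("blank", 1.0)]),
    ("blank", [("blank", 1.0)]),
]
tags = tag_sentence(lines, uniform)
```

Pruning to one path per pair of newest tags keeps the number of active paths bounded by the square of the tag-set size, which is what lets the search run in time linear in sentence length when the contextual probability looks only two tags ahead.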
Claims (8)
1. An automated method for assigning parts of speech to words in a message, of the type comprising the steps of:
electronically reading stored representations of the message, generating lexical probabilities for each word to be a particular part of speech, and selecting, in response to the lexical probability for the subject word and in response to the contextual probabilities for at least one adjacent word to be a particular part of speech, the contextual probability for the subject word to be a particular part of speech, SAID METHOD BEING CHARACTERIZED IN THAT:
the generating step includes representing certain words, spaces before and after sentences and punctuation symbols as words having empirically-determined frequencies of occurrence in a non-verbal record of the message including, smoothing part of speech frequencies for at least certain word; and the selecting step includes maximizing the contextual probabilities referred to parts of speech of nearby words including at least the following word.
2. An automated method of the type claimed in claim 1, FURTHER CHARACTERIZED BY the steps:
assigning all possible noun phrase boundaries, eliminating all non-paired boundaries, and optimizing contextual noun phrase boundary probabilities.
3. An automated method of the type claimed in claim 2, FURTHER CHARACTERIZED BY the step:
assigning word stress dependent upon the results of the optimizing steps.
4. An automated method of the type claimed in claim 3, FURTHER CHARACTERIZED BY
means responsive to the assigned word stress for synthesizing speech corresponding to the message.
5. An automated method of the type claimed in claim 2, FURTHER CHARACTERIZED BY
employing the highest selected contextual probabilities for the words in a message to detect contextual errors in the message.
6. An automated method for determining, in a message, beginnings and ends of noun phrases to which parts of speech have been assigned with reasonable probability, of the type including the steps of estimating whether the words around each noun in the message could be part of a noun phrase, and utilizing the resulting estimates, SAID METHOD BEING CHARACTERIZED BY the steps of:
assigning all possible noun phrase boundaries, eliminating all non-paired boundaries, and optimizing contextual noun phrase boundary probabilities.
7. An automated method of the type claimed in claim 1 or 6, SAID METHOD BEING CHARACTERIZED BY
assigning parts of speech in the message by n-gram analysis with respect to the parts of speech of near-by words, including the steps of representing certain non-words as words having empirically determined frequencies of occurrence in a non-verbal record of the message, computing an optimum normalized contextual probability for each other near-by word in the message to be a particular part of speech in relationship to the contextual part-of-speech probabilities of differing uses of said non-words, where the normalized contextual probability is the trigram part-of-speech probability divided by the bigram part-of-speech probability, all determined by starting at the end of the sentence, including blank spaces.
8. An automated method of the type claimed in claim 1 SAID METHOD BEING FURTHER CHARACTERIZED IN THAT
the generating step includes smoothing frequencies by reference to a dictionary for parts-of-speech usage of words having relatively low frequencies of occurrence as a particular part of speech, and the selecting step further includes determining the product of lexical probability of the contextual probability, where the lexical probability is estimated as the quotient of the frequency of occurrence of the word as a particular part of speech, divided by its frequency of occurrence as all parts of speech, and the contextual probability is estimated by dividing the trigram frequency by the bigram frequency, where the trigram frequency is the frequency of occurrence of the particular part of speech in sequence with the two following parts of speech, as already determined for the two following words, and the bigram frequency is the frequency of occurrence of the particular part of speech of the following word in sequence with the next-following part of speech, as already determined for the next-following word; and reiterating the determining step for a number of possible part-of-speech combinations, including retaining products which exceed prior products for the same word.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US07/152,740 US5146405A (en) | 1988-02-05 | 1988-02-05 | Methods for part-of-speech determination and usage |
US152,740 | 1988-02-05 |
Publications (1)
Publication Number | Publication Date |
---|---|
CA1301345C true CA1301345C (en) | 1992-05-19 |
Family
ID=22544213
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA000590100A Expired - Fee Related CA1301345C (en) | 1988-02-05 | 1989-02-03 | Methods for part-of-speech determination and usage |
Country Status (9)
Country | Link |
---|---|
US (1) | US5146405A (en) |
EP (1) | EP0327266B1 (en) |
JP (1) | JPH0769910B2 (en) |
KR (1) | KR970006402B1 (en) |
AU (1) | AU617749B2 (en) |
CA (1) | CA1301345C (en) |
DE (1) | DE68923981T2 (en) |
ES (1) | ES2076952T3 (en) |
IN (1) | IN175380B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9152623B2 (en) | 2012-11-02 | 2015-10-06 | Fido Labs, Inc. | Natural language processing system and method |
US10956670B2 (en) | 2018-03-03 | 2021-03-23 | Samurai Labs Sp. Z O.O. | System and method for detecting undesirable and potentially harmful online behavior |
Families Citing this family (193)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5530863A (en) * | 1989-05-19 | 1996-06-25 | Fujitsu Limited | Programming language processing system with program translation performed by term rewriting with pattern matching |
US5157759A (en) * | 1990-06-28 | 1992-10-20 | At&T Bell Laboratories | Written language parser system |
US5418717A (en) * | 1990-08-27 | 1995-05-23 | Su; Keh-Yih | Multiple score language processing system |
JP2764343B2 (en) * | 1990-09-07 | 1998-06-11 | 富士通株式会社 | Clause / phrase boundary extraction method |
NL9100849A (en) * | 1991-05-16 | 1992-12-16 | Oce Nederland Bv | METHOD FOR CORRECTING AN ERROR IN A NATURAL LANGUAGE WITH THE USE OF A COMPUTER SYSTEM AND AN APPARATUS SUITABLE FOR CARRYING OUT THIS METHOD |
US5477451A (en) * | 1991-07-25 | 1995-12-19 | International Business Machines Corp. | Method and system for natural language translation |
US5475796A (en) * | 1991-12-20 | 1995-12-12 | Nec Corporation | Pitch pattern generation apparatus |
US5541836A (en) * | 1991-12-30 | 1996-07-30 | At&T Corp. | Word disambiguation apparatus and methods |
US5267345A (en) * | 1992-02-10 | 1993-11-30 | International Business Machines Corporation | Speech recognition apparatus which predicts word classes from context and words from word classes |
US5383120A (en) * | 1992-03-02 | 1995-01-17 | General Electric Company | Method for tagging collocations in text |
US5293584A (en) * | 1992-05-21 | 1994-03-08 | International Business Machines Corporation | Speech recognition system for natural language translation |
JPH06195373A (en) * | 1992-12-24 | 1994-07-15 | Sharp Corp | Machine translation system |
US5440481A (en) * | 1992-10-28 | 1995-08-08 | The United States Of America As Represented By The Secretary Of The Navy | System and method for database tomography |
JPH0756957A (en) * | 1993-08-03 | 1995-03-03 | Xerox Corp | Method for provision of information to user |
US5873056A (en) * | 1993-10-12 | 1999-02-16 | The Syracuse University | Natural language processing system for semantic vector representation which accounts for lexical ambiguity |
EP0680653B1 (en) * | 1993-10-15 | 2001-06-20 | AT&T Corp. | A method for training a tts system, the resulting apparatus, and method of use thereof |
JP2986345B2 (en) * | 1993-10-18 | 1999-12-06 | インターナショナル・ビジネス・マシーンズ・コーポレイション | Voice recording indexing apparatus and method |
US5510981A (en) * | 1993-10-28 | 1996-04-23 | International Business Machines Corporation | Language translation apparatus and method using context-based translation models |
SE513456C2 (en) * | 1994-05-10 | 2000-09-18 | Telia Ab | Method and device for speech to text conversion |
US5537317A (en) * | 1994-06-01 | 1996-07-16 | Mitsubishi Electric Research Laboratories Inc. | System for correcting grammer based parts on speech probability |
US5485372A (en) * | 1994-06-01 | 1996-01-16 | Mitsubishi Electric Research Laboratories, Inc. | System for underlying spelling recovery |
US5610812A (en) * | 1994-06-24 | 1997-03-11 | Mitsubishi Electric Information Technology Center America, Inc. | Contextual tagger utilizing deterministic finite state transducer |
US5850561A (en) * | 1994-09-23 | 1998-12-15 | Lucent Technologies Inc. | Glossary construction tool |
AU5969896A (en) * | 1995-06-07 | 1996-12-30 | International Language Engineering Corporation | Machine assisted translation tools |
US5721938A (en) * | 1995-06-07 | 1998-02-24 | Stuckey; Barbara K. | Method and device for parsing and analyzing natural language sentences and text |
US6330538B1 (en) | 1995-06-13 | 2001-12-11 | British Telecommunications Public Limited Company | Phonetic unit duration adjustment for text-to-speech system |
US5873660A (en) * | 1995-06-19 | 1999-02-23 | Microsoft Corporation | Morphological search and replace |
US5828991A (en) * | 1995-06-30 | 1998-10-27 | The Research Foundation Of The State University Of New York | Sentence reconstruction using word ambiguity resolution |
US5794177A (en) * | 1995-07-19 | 1998-08-11 | Inso Corporation | Method and apparatus for morphological analysis and generation of natural language text |
US5680628A (en) * | 1995-07-19 | 1997-10-21 | Inso Corporation | Method and apparatus for automated search and retrieval process |
US5721902A (en) * | 1995-09-15 | 1998-02-24 | Infonautics Corporation | Restricted expansion of query terms using part of speech tagging |
US5819260A (en) * | 1996-01-22 | 1998-10-06 | Lexis-Nexis | Phrase recognition method and apparatus |
SG49804A1 (en) * | 1996-03-20 | 1998-06-15 | Government Of Singapore Repres | Parsing and translating natural language sentences automatically |
US5999896A (en) * | 1996-06-25 | 1999-12-07 | Microsoft Corporation | Method and system for identifying and resolving commonly confused words in a natural language parser |
US5878386A (en) * | 1996-06-28 | 1999-03-02 | Microsoft Corporation | Natural language parser with dictionary-based part-of-speech probabilities |
US5802533A (en) * | 1996-08-07 | 1998-09-01 | Walker; Randall C. | Text processor |
US6279017B1 (en) * | 1996-08-07 | 2001-08-21 | Randall C. Walker | Method and apparatus for displaying text based upon attributes found within the text |
US7672829B2 (en) * | 1997-03-04 | 2010-03-02 | Hiroshi Ishikura | Pivot translation method and system |
WO1998039711A1 (en) * | 1997-03-04 | 1998-09-11 | Hiroshi Ishikura | Language analysis system and method |
DE69811921T2 (en) * | 1997-09-24 | 2003-11-13 | Lernout & Hauspie Speechprod | DEVICE AND METHOD FOR DISTINATING SIMILAR-SOUNDING WORDS IN VOICE RECOGNITION |
US6182028B1 (en) * | 1997-11-07 | 2001-01-30 | Motorola, Inc. | Method, device and system for part-of-speech disambiguation |
US6260008B1 (en) * | 1998-01-08 | 2001-07-10 | Sharp Kabushiki Kaisha | Method of and system for disambiguating syntactic word multiples |
US6098042A (en) * | 1998-01-30 | 2000-08-01 | International Business Machines Corporation | Homograph filter for speech synthesis system |
GB9806085D0 (en) * | 1998-03-23 | 1998-05-20 | Xerox Corp | Text summarisation using light syntactic parsing |
CN1159662C (en) | 1998-05-13 | 2004-07-28 | 国际商业机器公司 | Automatic punctuating for continuous speech recognition |
US6167370A (en) * | 1998-09-09 | 2000-12-26 | Invention Machine Corporation | Document semantic analysis/selection with knowledge creativity capability utilizing subject-action-object (SAO) structures |
US6185524B1 (en) * | 1998-12-31 | 2001-02-06 | Lernout & Hauspie Speech Products N.V. | Method and apparatus for automatic identification of word boundaries in continuous text and computation of word boundary scores |
CA2367320A1 (en) | 1999-03-19 | 2000-09-28 | Trados Gmbh | Workflow management system |
DE19942171A1 (en) * | 1999-09-03 | 2001-03-15 | Siemens Ag | Method for sentence end determination in automatic speech processing |
US20060116865A1 (en) | 1999-09-17 | 2006-06-01 | Www.Uniscape.Com | E-services translation utilizing machine translation and translation memory |
WO2001033409A2 (en) * | 1999-11-01 | 2001-05-10 | Kurzweil Cyberart Technologies, Inc. | Computer generated poetry system |
US7392185B2 (en) | 1999-11-12 | 2008-06-24 | Phoenix Solutions, Inc. | Speech based learning/training system using semantic decoding |
US6615172B1 (en) | 1999-11-12 | 2003-09-02 | Phoenix Solutions, Inc. | Intelligent query engine for processing voice based queries |
US7050977B1 (en) | 1999-11-12 | 2006-05-23 | Phoenix Solutions, Inc. | Speech-enabled server for internet website and method |
US9076448B2 (en) * | 1999-11-12 | 2015-07-07 | Nuance Communications, Inc. | Distributed real time speech recognition system |
US6633846B1 (en) | 1999-11-12 | 2003-10-14 | Phoenix Solutions, Inc. | Distributed realtime speech recognition system |
US6665640B1 (en) | 1999-11-12 | 2003-12-16 | Phoenix Solutions, Inc. | Interactive speech based learning/training system formulating search queries based on natural language parsing of recognized user queries |
US7725307B2 (en) | 1999-11-12 | 2010-05-25 | Phoenix Solutions, Inc. | Query engine for processing voice based queries including semantic decoding |
US7120574B2 (en) | 2000-04-03 | 2006-10-10 | Invention Machine Corporation | Synonym extension of search queries with validation |
US7962326B2 (en) * | 2000-04-20 | 2011-06-14 | Invention Machine Corporation | Semantic answering system and method |
SE517005C2 (en) * | 2000-05-31 | 2002-04-02 | Hapax Information Systems Ab | Segmentation of text |
US6684202B1 (en) * | 2000-05-31 | 2004-01-27 | Lexis Nexis | Computer-based system and method for finding rules of law in text |
US6941513B2 (en) | 2000-06-15 | 2005-09-06 | Cognisphere, Inc. | System and method for text structuring and text generation |
US6952666B1 (en) * | 2000-07-20 | 2005-10-04 | Microsoft Corporation | Ranking parser for a natural language processing system |
US6738765B1 (en) | 2000-08-11 | 2004-05-18 | Attensity Corporation | Relational text index creation and searching |
US6741988B1 (en) | 2000-08-11 | 2004-05-25 | Attensity Corporation | Relational text index creation and searching |
US6732097B1 (en) | 2000-08-11 | 2004-05-04 | Attensity Corporation | Relational text index creation and searching |
US7171349B1 (en) | 2000-08-11 | 2007-01-30 | Attensity Corporation | Relational text index creation and searching |
US6728707B1 (en) | 2000-08-11 | 2004-04-27 | Attensity Corporation | Relational text index creation and searching |
US6732098B1 (en) | 2000-08-11 | 2004-05-04 | Attensity Corporation | Relational text index creation and searching |
US8272873B1 (en) | 2000-10-16 | 2012-09-25 | Progressive Language, Inc. | Language learning system |
DE10057634C2 (en) * | 2000-11-21 | 2003-01-30 | Bosch Gmbh Robert | Process for processing text in a computer unit and computer unit |
US7263488B2 (en) * | 2000-12-04 | 2007-08-28 | Microsoft Corporation | Method and apparatus for identifying prosodic word boundaries |
US6978239B2 (en) * | 2000-12-04 | 2005-12-20 | Microsoft Corporation | Method and apparatus for speech synthesis without prosody modification |
US6910004B2 (en) * | 2000-12-19 | 2005-06-21 | Xerox Corporation | Method and computer system for part-of-speech tagging of incomplete sentences |
US20020129066A1 (en) * | 2000-12-28 | 2002-09-12 | Milward David R. | Computer implemented method for reformatting logically complex clauses in an electronic text-based document |
US6859771B2 (en) * | 2001-04-23 | 2005-02-22 | Microsoft Corporation | System and method for identifying base noun phrases |
US7177792B2 (en) * | 2001-05-31 | 2007-02-13 | University Of Southern California | Integer programming decoder for machine translation |
WO2003005166A2 (en) * | 2001-07-03 | 2003-01-16 | University Of Southern California | A syntax-based statistical translation model |
US9009590B2 (en) * | 2001-07-31 | 2015-04-14 | Invention Machines Corporation | Semantic processor for recognition of cause-effect relations in natural language documents |
JP2003242176A (en) * | 2001-12-13 | 2003-08-29 | Sony Corp | Information processing device and method, recording medium and program |
US6988063B2 (en) * | 2002-02-12 | 2006-01-17 | Sunflare Co., Ltd. | System and method for accurate grammar analysis using a part-of-speech tagged (POST) parser and learners' model |
WO2004001623A2 (en) | 2002-03-26 | 2003-12-31 | University Of Southern California | Constructing a translation lexicon from comparable, non-parallel corpora |
US20030191645A1 (en) * | 2002-04-05 | 2003-10-09 | Guojun Zhou | Statistical pronunciation model for text to speech |
US7286987B2 (en) * | 2002-06-28 | 2007-10-23 | Conceptual Speech Llc | Multi-phoneme streamer and knowledge representation speech recognition system and method |
US7567902B2 (en) * | 2002-09-18 | 2009-07-28 | Nuance Communications, Inc. | Generating speech recognition grammars from a large corpus of data |
US20040167887A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Integration of structured data with relational facts from free text for data mining |
US10733976B2 (en) * | 2003-03-01 | 2020-08-04 | Robert E. Coifman | Method and apparatus for improving the transcription accuracy of speech recognition software |
US7496498B2 (en) * | 2003-03-24 | 2009-02-24 | Microsoft Corporation | Front-end architecture for a multi-lingual text-to-speech system |
KR100481598B1 (en) * | 2003-05-26 | 2005-04-08 | 한국전자통신연구원 | Apparatus and method for analyzing compounded morpheme |
US8548794B2 (en) | 2003-07-02 | 2013-10-01 | University Of Southern California | Statistical noun phrase translation |
US7711545B2 (en) * | 2003-07-02 | 2010-05-04 | Language Weaver, Inc. | Empirical methods for splitting compound words with application to machine translation |
US7475010B2 (en) * | 2003-09-03 | 2009-01-06 | Lingospot, Inc. | Adaptive and scalable method for resolving natural language ambiguities |
US7813916B2 (en) | 2003-11-18 | 2010-10-12 | University Of Utah | Acquisition and application of contextual role knowledge for coreference resolution |
US7983896B2 (en) | 2004-03-05 | 2011-07-19 | SDL Language Technology | In-context exact (ICE) matching |
US20100262621A1 (en) * | 2004-03-05 | 2010-10-14 | Russ Ross | In-context exact (ice) matching |
WO2005089340A2 (en) * | 2004-03-15 | 2005-09-29 | University Of Southern California | Training tree transducers |
US8296127B2 (en) * | 2004-03-23 | 2012-10-23 | University Of Southern California | Discovery of parallel text portions in comparable collections of corpora and training using comparable texts |
US8666725B2 (en) | 2004-04-16 | 2014-03-04 | University Of Southern California | Selection and use of nonstatistical translation components in a statistical machine translation framework |
US7664748B2 (en) * | 2004-07-12 | 2010-02-16 | John Eric Harrity | Systems and methods for changing symbol sequences in documents |
GB2417103A (en) * | 2004-08-11 | 2006-02-15 | Sdl Plc | Natural language translation system |
US8600728B2 (en) | 2004-10-12 | 2013-12-03 | University Of Southern California | Training for a text-to-text application which uses string to tree conversion for training and decoding |
US20060122834A1 (en) * | 2004-12-03 | 2006-06-08 | Bennett Ian M | Emotion detection device & method for use in distributed systems |
US8886517B2 (en) | 2005-06-17 | 2014-11-11 | Language Weaver, Inc. | Trust scoring for language translation systems |
US8676563B2 (en) | 2009-10-01 | 2014-03-18 | Language Weaver, Inc. | Providing human-generated and machine-generated trusted translations |
US7974833B2 (en) | 2005-06-21 | 2011-07-05 | Language Weaver, Inc. | Weighted system of expressing language information using a compact notation |
JP2007024960A (en) | 2005-07-12 | 2007-02-01 | Internatl Business Mach Corp <Ibm> | System, program and control method |
US7389222B1 (en) | 2005-08-02 | 2008-06-17 | Language Weaver, Inc. | Task parallelization in a text-to-text system |
US7813918B2 (en) * | 2005-08-03 | 2010-10-12 | Language Weaver, Inc. | Identifying documents which form translated pairs, within a document collection |
JP2007058509A (en) * | 2005-08-24 | 2007-03-08 | Toshiba Corp | Language processing system |
US8700404B1 (en) * | 2005-08-27 | 2014-04-15 | At&T Intellectual Property Ii, L.P. | System and method for using semantic and syntactic graphs for utterance classification |
US7624020B2 (en) * | 2005-09-09 | 2009-11-24 | Language Weaver, Inc. | Adapter for allowing both online and offline training of a text to text system |
US10319252B2 (en) | 2005-11-09 | 2019-06-11 | Sdl Inc. | Language capability assessment and training apparatus and techniques |
US20100280818A1 (en) * | 2006-03-03 | 2010-11-04 | Childers Stephen R | Key Talk |
US8943080B2 (en) | 2006-04-07 | 2015-01-27 | University Of Southern California | Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections |
US7562811B2 (en) | 2007-01-18 | 2009-07-21 | Varcode Ltd. | System and method for improved quality management in a product logistic chain |
JP2009537038A (en) | 2006-05-07 | 2009-10-22 | バーコード リミティド | System and method for improving quality control in a product logistic chain |
US8886518B1 (en) | 2006-08-07 | 2014-11-11 | Language Weaver, Inc. | System and method for capitalizing machine translated text |
US8521506B2 (en) | 2006-09-21 | 2013-08-27 | Sdl Plc | Computer-implemented method, computer software and apparatus for use in a translation system |
US9984071B2 (en) | 2006-10-10 | 2018-05-29 | Abbyy Production Llc | Language ambiguity detection of text |
US8195447B2 (en) | 2006-10-10 | 2012-06-05 | Abbyy Software Ltd. | Translating sentences between languages using language-independent semantic structures and ratings of syntactic constructions |
US20080086298A1 (en) * | 2006-10-10 | 2008-04-10 | Anisimovich Konstantin | Method and system for translating sentences between langauges |
US9633005B2 (en) | 2006-10-10 | 2017-04-25 | Abbyy Infopoisk Llc | Exhaustive automatic processing of textual information |
US9645993B2 (en) | 2006-10-10 | 2017-05-09 | Abbyy Infopoisk Llc | Method and system for semantic searching |
US9047275B2 (en) | 2006-10-10 | 2015-06-02 | Abbyy Infopoisk Llc | Methods and systems for alignment of parallel text corpora |
US8548795B2 (en) * | 2006-10-10 | 2013-10-01 | Abbyy Software Ltd. | Method for translating documents from one language into another using a database of translations, a terminology dictionary, a translation dictionary, and a machine translation system |
US8214199B2 (en) * | 2006-10-10 | 2012-07-03 | Abbyy Software, Ltd. | Systems for translating sentences between languages using language-independent semantic structures and ratings of syntactic constructions |
US9235573B2 (en) | 2006-10-10 | 2016-01-12 | Abbyy Infopoisk Llc | Universal difference measure |
US8145473B2 (en) | 2006-10-10 | 2012-03-27 | Abbyy Software Ltd. | Deep model statistics method for machine translation |
US8433556B2 (en) | 2006-11-02 | 2013-04-30 | University Of Southern California | Semi-supervised training for statistical word alignment |
US9122674B1 (en) | 2006-12-15 | 2015-09-01 | Language Weaver, Inc. | Use of annotations in statistical machine translation |
EP2122506A4 (en) * | 2007-01-10 | 2011-11-30 | Sysomos Inc | Method and system for information discovery and text analysis |
US8468149B1 (en) | 2007-01-26 | 2013-06-18 | Language Weaver, Inc. | Multi-lingual online community |
US8615389B1 (en) | 2007-03-16 | 2013-12-24 | Language Weaver, Inc. | Generation and exploitation of an approximate language model |
US8959011B2 (en) | 2007-03-22 | 2015-02-17 | Abbyy Infopoisk Llc | Indicating and correcting errors in machine translation systems |
US8831928B2 (en) | 2007-04-04 | 2014-09-09 | Language Weaver, Inc. | Customizable machine translation service |
WO2008135962A2 (en) | 2007-05-06 | 2008-11-13 | Varcode Ltd. | A system and method for quality management utilizing barcode indicators |
KR100887726B1 (en) * | 2007-05-28 | 2009-03-12 | 엔에이치엔(주) | Method and System for Automatic Word Spacing |
US8825466B1 (en) | 2007-06-08 | 2014-09-02 | Language Weaver, Inc. | Modification of annotated bilingual segment pairs in syntax-based machine translation |
US8812296B2 (en) | 2007-06-27 | 2014-08-19 | Abbyy Infopoisk Llc | Method and system for natural language dictionary generation |
CA2694327A1 (en) | 2007-08-01 | 2009-02-05 | Ginger Software, Inc. | Automatic context sensitive language correction and enhancement using an internet corpus |
US8595642B1 (en) | 2007-10-04 | 2013-11-26 | Great Northern Research, LLC | Multiple shell multi faceted graphical user interface |
EP2218042B1 (en) | 2007-11-14 | 2020-01-01 | Varcode Ltd. | A system and method for quality management utilizing barcode indicators |
US11704526B2 (en) | 2008-06-10 | 2023-07-18 | Varcode Ltd. | Barcoded indicators for quality management |
US9262409B2 (en) | 2008-08-06 | 2016-02-16 | Abbyy Infopoisk Llc | Translation of a selected text fragment of a screen |
US8190423B2 (en) * | 2008-09-05 | 2012-05-29 | Trigent Software Ltd. | Word sense disambiguation using emergent categories |
GB2468278A (en) * | 2009-03-02 | 2010-09-08 | Sdl Plc | Computer assisted natural language translation outputs selectable target text associated in bilingual corpus with input target text from partial translation |
US9262403B2 (en) | 2009-03-02 | 2016-02-16 | Sdl Plc | Dynamic generation of auto-suggest dictionary for natural language translation |
KR20120009446A (en) * | 2009-03-13 | 2012-01-31 | 인벤션 머신 코포레이션 | System and method for automatic semantic labeling of natural language texts |
US8990064B2 (en) | 2009-07-28 | 2015-03-24 | Language Weaver, Inc. | Translating documents based on content |
WO2011035425A1 (en) * | 2009-09-25 | 2011-03-31 | Shady Shehata | Methods and systems for extracting keyphrases from natural text for search engine indexing |
US8380486B2 (en) | 2009-10-01 | 2013-02-19 | Language Weaver, Inc. | Providing machine-generated translations and corresponding trust levels |
US20110161067A1 (en) * | 2009-12-29 | 2011-06-30 | Dynavox Systems, Llc | System and method of using pos tagging for symbol assignment |
US20110161073A1 (en) * | 2009-12-29 | 2011-06-30 | Dynavox Systems, Llc | System and method of disambiguating and selecting dictionary definitions for one or more target words |
CN102884518A (en) * | 2010-02-01 | 2013-01-16 | 金格软件有限公司 | Automatic context sensitive language correction using an internet corpus particularly for small keyboard devices |
US10417646B2 (en) | 2010-03-09 | 2019-09-17 | Sdl Inc. | Predicting the cost associated with translating textual content |
US8788260B2 (en) * | 2010-05-11 | 2014-07-22 | Microsoft Corporation | Generating snippets based on content features |
US9128929B2 (en) | 2011-01-14 | 2015-09-08 | Sdl Language Technologies | Systems and methods for automatically estimating a translation time including preparation time in addition to the translation itself |
US11003838B2 (en) | 2011-04-18 | 2021-05-11 | Sdl Inc. | Systems and methods for monitoring post translation editing |
US8694303B2 (en) | 2011-06-15 | 2014-04-08 | Language Weaver, Inc. | Systems and methods for tuning parameters in statistical machine translation |
EP2546760A1 (en) | 2011-07-11 | 2013-01-16 | Accenture Global Services Limited | Provision of user input in systems for jointly discovering topics and sentiment |
US8676730B2 (en) * | 2011-07-11 | 2014-03-18 | Accenture Global Services Limited | Sentiment classifiers based on feature extraction |
US8620837B2 (en) | 2011-07-11 | 2013-12-31 | Accenture Global Services Limited | Determination of a basis for a new domain model based on a plurality of learned models |
US8886515B2 (en) | 2011-10-19 | 2014-11-11 | Language Weaver, Inc. | Systems and methods for enhancing machine translation post edit review processes |
US8942973B2 (en) | 2012-03-09 | 2015-01-27 | Language Weaver, Inc. | Content page URL translation |
US8989485B2 (en) | 2012-04-27 | 2015-03-24 | Abbyy Development Llc | Detecting a junction in a text line of CJK characters |
US8971630B2 (en) | 2012-04-27 | 2015-03-03 | Abbyy Development Llc | Fast CJK character recognition |
US10261994B2 (en) | 2012-05-25 | 2019-04-16 | Sdl Inc. | Method and system for automatic management of reputation of translators |
US9263059B2 (en) | 2012-09-28 | 2016-02-16 | International Business Machines Corporation | Deep tagging background noises |
US8807422B2 (en) | 2012-10-22 | 2014-08-19 | Varcode Ltd. | Tamper-proof quality management barcode indicators |
US9152622B2 (en) | 2012-11-26 | 2015-10-06 | Language Weaver, Inc. | Personalized machine translation via online adaptation |
US9811517B2 (en) | 2013-01-29 | 2017-11-07 | Tencent Technology (Shenzhen) Company Limited | Method and system of adding punctuation and establishing language model using a punctuation weighting applied to chinese speech recognized text |
CN103971684B (en) * | 2013-01-29 | 2015-12-09 | 腾讯科技(深圳)有限公司 | A kind of add punctuate method, system and language model method for building up, device |
CN104143331B (en) | 2013-05-24 | 2015-12-09 | 腾讯科技(深圳)有限公司 | A kind of method and system adding punctuate |
US9311299B1 (en) * | 2013-07-31 | 2016-04-12 | Google Inc. | Weakly supervised part-of-speech tagging with coupled token and type constraints |
US9213694B2 (en) | 2013-10-10 | 2015-12-15 | Language Weaver, Inc. | Efficient online domain adaptation |
DE202013104836U1 (en) | 2013-10-29 | 2014-01-30 | Foseco International Limited | feeder structure |
RU2592395C2 (en) | 2013-12-19 | 2016-07-20 | Общество с ограниченной ответственностью "Аби ИнфоПоиск" | Resolution semantic ambiguity by statistical analysis |
RU2586577C2 (en) | 2014-01-15 | 2016-06-10 | Общество с ограниченной ответственностью "Аби ИнфоПоиск" | Filtering arcs parser graph |
CA2949782C (en) * | 2014-04-25 | 2023-09-05 | Mayo Foundation For Medical Education And Research | Enhancing reading accuracy, efficiency and retention |
RU2596600C2 (en) | 2014-09-02 | 2016-09-10 | Общество с ограниченной ответственностью "Аби Девелопмент" | Methods and systems for processing images of mathematical expressions |
US9626358B2 (en) | 2014-11-26 | 2017-04-18 | Abbyy Infopoisk Llc | Creating ontologies by analyzing natural language texts |
WO2016144963A1 (en) * | 2015-03-10 | 2016-09-15 | Asymmetrica Labs Inc. | Systems and methods for asymmetrical formatting of word spaces according to the uncertainty between words |
US9703394B2 (en) * | 2015-03-24 | 2017-07-11 | Google Inc. | Unlearning techniques for adaptive language models in text entry |
JP6649472B2 (en) | 2015-05-18 | 2020-02-19 | バーコード リミティド | Thermochromic ink indicia for activatable quality labels |
EP3320315B1 (en) | 2015-07-07 | 2020-03-04 | Varcode Ltd. | Electronic quality indicator |
US10635863B2 (en) | 2017-10-30 | 2020-04-28 | Sdl Inc. | Fragment recall and adaptive automated translation |
US10817676B2 (en) | 2017-12-27 | 2020-10-27 | Sdl Inc. | Intelligent routing services and systems |
US10599767B1 (en) * | 2018-05-31 | 2020-03-24 | The Ultimate Software Group, Inc. | System for providing intelligent part of speech processing of complex natural language |
US11256867B2 (en) | 2018-10-09 | 2022-02-22 | Sdl Inc. | Systems and methods of machine learning for digital assets and message creation |
RU2721190C1 (en) | 2018-12-25 | 2020-05-18 | Общество с ограниченной ответственностью "Аби Продакшн" | Training neural networks using loss functions reflecting relationships between neighbouring tokens |
CN111353295A (en) * | 2020-02-27 | 2020-06-30 | 广东博智林机器人有限公司 | Sequence labeling method and device, storage medium and computer equipment |
US11594213B2 (en) * | 2020-03-03 | 2023-02-28 | Rovi Guides, Inc. | Systems and methods for interpreting natural language search queries |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3704345A (en) * | 1971-03-19 | 1972-11-28 | Bell Telephone Labor Inc | Conversion of printed text into synthetic speech |
JPS58175074A (en) * | 1982-04-07 | 1983-10-14 | Toshiba Corp | Analyzing system of sentence structure |
US4456973A (en) * | 1982-04-30 | 1984-06-26 | International Business Machines Corporation | Automatic text grade level analyzer for a text processing system |
US4674065A (en) * | 1982-04-30 | 1987-06-16 | International Business Machines Corporation | System for detecting and correcting contextual errors in a text processing system |
US4688195A (en) * | 1983-01-28 | 1987-08-18 | Texas Instruments Incorporated | Natural-language interface generating system |
US4580218A (en) * | 1983-09-08 | 1986-04-01 | At&T Bell Laboratories | Indexing subject-locating method |
US4692941A (en) * | 1984-04-10 | 1987-09-08 | First Byte | Real-time text-to-speech conversion system |
JPS6140672A (en) * | 1984-07-31 | 1986-02-26 | Hitachi Ltd | Processing system for dissolution of many parts of speech |
- 1988-02-05: US application US07/152,740, patent US5146405A (not active, Expired - Lifetime)
- 1989-01-27: DE application DE68923981T, patent DE68923981T2 (not active, Expired - Fee Related)
- 1989-01-27: EP application EP89300790A, patent EP0327266B1 (not active, Expired - Lifetime)
- 1989-01-27: ES application ES89300790T, patent ES2076952T3 (not active, Expired - Lifetime)
- 1989-02-01: AU application AU28990/89A, patent AU617749B2 (not active, Ceased)
- 1989-02-03: CA application CA000590100A, patent CA1301345C (not active, Expired - Fee Related)
- 1989-02-04: JP application JP1024794A, patent JPH0769910B2 (not active, Expired - Fee Related)
- 1989-02-04: KR application KR1019890001364A, patent KR970006402B1 (not active, IP Right Cessation)
- 1990-01-16: IN application IN46MA1990, patent IN175380B (status unknown)
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9152623B2 (en) | 2012-11-02 | 2015-10-06 | Fido Labs, Inc. | Natural language processing system and method |
US10956670B2 (en) | 2018-03-03 | 2021-03-23 | Samurai Labs Sp. Z O.O. | System and method for detecting undesirable and potentially harmful online behavior |
US11151318B2 (en) | 2018-03-03 | 2021-10-19 | SAMURAI LABS sp. z. o.o. | System and method for detecting undesirable and potentially harmful online behavior |
US11507745B2 (en) | 2018-03-03 | 2022-11-22 | Samurai Labs Sp. Z O.O. | System and method for detecting undesirable and potentially harmful online behavior |
US11663403B2 (en) | 2018-03-03 | 2023-05-30 | Samurai Labs Sp. Z O.O. | System and method for detecting undesirable and potentially harmful online behavior |
Also Published As
Publication number | Publication date |
---|---|
DE68923981T2 (en) | 1996-05-15 |
EP0327266A3 (en) | 1992-01-02 |
ES2076952T3 (en) | 1995-11-16 |
JPH0769910B2 (en) | 1995-07-31 |
US5146405A (en) | 1992-09-08 |
AU617749B2 (en) | 1991-12-05 |
AU2899089A (en) | 1989-08-10 |
DE68923981D1 (en) | 1995-10-05 |
IN175380B (en) | 1995-06-10 |
EP0327266B1 (en) | 1995-08-30 |
EP0327266A2 (en) | 1989-08-09 |
KR970006402B1 (en) | 1997-04-28 |
KR890013549A (en) | 1989-09-23 |
JPH01224796A (en) | 1989-09-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA1301345C (en) | | Methods for part-of-speech determination and usage |
Chelba et al. | | Retrieval and browsing of spoken content |
Gaizauskas et al. | | University of Sheffield: Description of the LaSIE system as used for MUC-6 |
US5510981A (en) | | Language translation apparatus and method using context-based translation models |
Silverman et al. | | ToBI: A standard for labeling English prosody. |
EP0525470B1 (en) | | Method and system for natural language translation |
US4868750A (en) | | Collocational grammar system |
Och | | Statistical machine translation: From single word models to alignment templates |
US5794177A (en) | | Method and apparatus for morphological analysis and generation of natural language text |
US6424983B1 (en) | | Spelling and grammar checking system |
US5680511A (en) | | Systems and methods for word recognition |
US6393388B1 (en) | | Example-based translation method and system employing multi-stage syntax dividing |
KR100734741B1 (en) | | Recognizing words and their parts of speech in one or more natural languages |
US6928448B1 (en) | | System and method to match linguistic structures using thesaurus information |
WO1997004405A9 (en) | | Method and apparatus for automated search and retrieval processing |
Fujii et al. | | A method for open-vocabulary speech-driven text retrieval |
Lee et al. | | Reestimation and best-first parsing algorithm for probabilistic dependency grammars |
Kaszkiel et al. | | TREC 7 Ad Hoc, Speech, and Interactive tracks at MDS/CSIRO |
Tapsai et al. | | Thai Language Segmentation by Automatic Ranking Trie with Misspelling Correction |
Boda | | From stochastic speech recognition to understanding: an hmm-based approach |
Hunyadi | | Linguistic analysis of large corpora: approaches to computational linguistics in Hungary |
KR950002705B1 (en) | | Speech synthetic system |
EP0741362A2 (en) | | Automatic construction of conditional exponential models from elementary feature |
Black et al. | | Probabilistic parsing of unrestricted english text, with a highly-detailed grammar |
Fournier | | Preprocessing on bilingual data for Statistical Machine Translation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MKLA | Lapsed |