EP0953970A2 - Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word

Info

Publication number
EP0953970A2
Authority
EP
European Patent Office
Prior art keywords
pronunciation
pronunciations
letter
sequence
phoneme
Prior art date
Legal status
Granted
Application number
EP99303390A
Other languages
German (de)
French (fr)
Other versions
EP0953970B1 (en)
EP0953970A3 (en)
Inventor
Roland Kuhn
Jean-Claude Junqua
Matteo Contolini
Current Assignee
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date
Filing date
Publication date
Priority claimed from US09/069,308 external-priority patent/US6230131B1/en
Priority claimed from US09/067,764 external-priority patent/US6016471A/en
Priority claimed from US09/070,300 external-priority patent/US6029132A/en
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Publication of EP0953970A2 publication Critical patent/EP0953970A2/en
Publication of EP0953970A3 publication Critical patent/EP0953970A3/en
Application granted granted Critical
Publication of EP0953970B1 publication Critical patent/EP0953970B1/en
Anticipated expiration
Legal status: Expired - Lifetime

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the present invention relates generally to speech processing. More particularly, the invention relates to a system for generating pronunciations of spelled words.
  • the invention can be employed in a variety of different contexts, including speech recognition, speech synthesis and lexicography.
  • Speech synthesizers convert text to speech by retrieving digitally-sampled sound units from a dictionary and concatenating these sound units to form sentences.
  • the present invention addresses the problem from a different angle.
  • the invention uses a specially constructed mixed-decision tree that encompasses both letter sequence and phoneme sequence decision-making rules. More specifically, the mixed-decision tree embodies a series of yes-no questions residing at the internal nodes of the tree. Some of these questions involve letters and their adjacent neighbors in a spelled word sequence; others involve phonemes and their neighboring phonemes in the word sequence.
  • the internal nodes ultimately lead to leaf nodes that contain probability data about which phonetic pronunciations of a given letter are most likely to be correct in pronouncing the word defined by its letter sequence.
  • the pronunciation generator of the invention uses this mixed-decision tree to score different pronunciation candidates, allowing it to select the most probable candidate as the best pronunciation for a given spelled word.
  • Generation of the best pronunciation is preferably a two-stage process in which a letter-only tree is used in the first stage to generate a plurality of pronunciation candidates. These candidates are then scored using the mixed-decision tree in the second stage to select the best candidate.
  • while the mixed-decision tree is advantageously used in a two-stage pronunciation generator, the mixed tree is also useful in solving some problems that do not require letter-only first-stage processing.
  • the mixed-decision tree can be used to score pronunciations generated by linguists using manual techniques.
  • Figure 1 shows a spelled letter-to-pronunciation generator.
  • the mixed-decision tree of the invention can be used in a variety of different applications in addition to the pronunciation generator illustrated here.
  • the pronunciation generator has been selected for illustration because it highlights many aspects and benefits of the mixed-decision tree structure.
  • the pronunciation generator employs two stages, the first stage employing a set of letter-only decision trees 10 and the second stage employing a set of mixed-decision trees 12 .
  • An input sequence 14 such as the sequence of letters B-I-B-L-E, is fed to a dynamic programming phoneme sequence generator 16 .
  • the sequence generator uses the letter-only trees 10 to generate a list of pronunciations 18 , representing possible pronunciation candidates of the spelled word input sequence.
  • the sequence generator sequentially examines each letter in the sequence, applying the decision tree associated with that letter to select a phoneme pronunciation for that letter based on probability data contained in the letter-only tree.
  • the set of letter-only decision trees includes a decision tree for each letter in the alphabet.
  • Figure 2 shows an example of a letter-only decision tree for the letter E.
  • the decision tree comprises a plurality of internal nodes (illustrated as ovals in the Figure) and a plurality of leaf nodes (illustrated as rectangles in the Figure).
  • Each internal node is populated with a yes-no question.
  • Yes-no questions are questions that can be answered either yes or no.
  • these questions are directed to the given letter (in this case the letter E) and its neighboring letters in the input sequence.
  • each internal node branches either left or right depending on whether the answer to the associated question is yes or no.
  • the null phoneme, i.e., silence, is represented by the symbol '-'.
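The traversal described above can be sketched as follows. This is a minimal illustration, not code from the patent: the node classes, the toy tree for the letter E, and its single question are invented assumptions.

```python
# Hypothetical sketch of walking a letter-only decision tree for one letter.
from dataclasses import dataclass

@dataclass
class Leaf:
    # maps candidate phonemes to probabilities; '-' is the null phoneme (silence)
    probs: dict

@dataclass
class Internal:
    question: callable   # yes-no question about the letter context
    yes: object          # subtree taken when the answer is yes
    no: object           # subtree taken when the answer is no

def walk(tree, letters, i):
    """Answer yes-no questions about letters[i] and its neighbors
    until a leaf with phoneme probability data is reached."""
    node = tree
    while isinstance(node, Internal):
        node = node.yes if node.question(letters, i) else node.no
    return node.probs

# toy tree for the letter E: "is the next letter an R?" (invented question/probabilities)
tree_E = Internal(
    question=lambda ls, i: i + 1 < len(ls) and ls[i + 1] == "R",
    yes=Leaf({"er": 0.9, "-": 0.1}),
    no=Leaf({"iy": 0.6, "eh": 0.3, "-": 0.1}),
)

print(walk(tree_E, list("HER"), 1))   # {'er': 0.9, '-': 0.1}
```

In a full generator there would be one such tree per letter of the alphabet, and the sequence generator would walk the applicable tree at each letter position.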
  • the sequence generator 16 (Fig. 1 ) thus uses the letter-only decision trees 10 to construct one or more pronunciation hypotheses that are stored in list 18 .
  • each pronunciation has associated with it a numerical score arrived at by combining the probability scores of the individual phonemes selected using the decision tree 10 .
  • Word pronunciations may be scored by constructing a matrix of possible combinations and then using dynamic programming to select the n-best candidates.
  • the n-best candidates may be selected using a substitution technique that first identifies the most probable word candidate and then generates additional candidates through iterative substitution, as follows.
  • the pronunciation with the highest probability score is selected first, by multiplying the respective scores of the highest-scoring phonemes (identified by examining the leaf nodes) and then using this selection as the most probable candidate or first-best word candidate. Additional (n-best) candidates are then selected by examining the phoneme data in the leaf nodes again to identify the phoneme, not previously selected, that has the smallest difference from an initially selected phoneme. This minimally-different phoneme is then substituted for the initially selected one to thereby generate the second-best word candidate. The above process may be repeated iteratively until the desired number of n-best candidates have been selected. List 18 may be sorted in descending score order, so that the pronunciation judged the best by the letter-only analysis appears first in the list.
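The substitution technique above can be sketched as follows, assuming each letter position carries a phoneme probability table read from the leaf nodes. The function name and data layout are invented for illustration.

```python
# Sketch of n-best generation by iterative substitution: start from the most
# probable phoneme at each position, then repeatedly swap in the not-yet-used
# phoneme whose probability is closest to the initially selected one.
def n_best(dists, n):
    """dists: one {phoneme: probability} table per letter position."""
    best = [max(d, key=d.get) for d in dists]     # first-best candidate
    candidates = [best]
    used = {(i, ph) for i, ph in enumerate(best)}
    while len(candidates) < n:
        swap = None   # (probability difference, position, phoneme)
        for i, d in enumerate(dists):
            top = d[best[i]]
            for ph, pr in d.items():
                if (i, ph) in used:
                    continue
                if swap is None or top - pr < swap[0]:
                    swap = (top - pr, i, ph)
        if swap is None:
            break     # every phoneme at every position has been used
        _, i, ph = swap
        used.add((i, ph))
        variant = list(best)
        variant[i] = ph          # minimally different substitution
        candidates.append(variant)
    return candidates

print(n_best([{"b": 0.9, "p": 0.1}, {"ih": 0.6, "iy": 0.4}], 3))
# [['b', 'ih'], ['b', 'iy'], ['p', 'ih']]
```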
  • a letter-only analysis will frequently produce poor results. This is because the letter-only analysis has no way of determining at each letter what phoneme will be generated by subsequent letters. Thus a letter-only analysis can generate a high scoring pronunciation that actually would not occur in natural speech. For example, the proper name, Achilles, would likely result in a pronunciation that phoneticizes both l's: ah-k-ih-l-l-iy-z. In natural speech, the second l is actually silent: ah-k-ih-l-iy-z.
  • the sequence generator using letter-only trees has no mechanism to screen out word pronunciations that would never occur in natural speech.
  • a mixed-tree score estimator 20 uses the set of mixed-decision trees 12 to assess the viability of each pronunciation in list 18 .
  • the score estimator works by sequentially examining each letter in the input sequence along with the phonemes assigned to each letter by sequence generator 16 .
  • the set of mixed trees has a mixed tree for each letter of the alphabet.
  • An exemplary mixed tree is shown in Figure 3 .
  • the mixed tree has internal nodes and leaf nodes.
  • the internal nodes are illustrated as ovals and the leaf nodes as rectangles in Figure 3 .
  • the internal nodes are each populated with a yes-no question and the leaf nodes are each populated with probability data.
  • although the tree structure of the mixed tree resembles that of the letter-only tree, there is one important difference.
  • the internal nodes of the mixed tree can contain two different classes of questions.
  • An internal node can contain a question about a given letter and its neighboring letters in the sequence, or it can contain a question about the phoneme associated with that letter and neighboring phonemes corresponding to that sequence.
  • the decision tree is thus mixed, in that it contains mixed classes of questions.
  • the abbreviations used in Figure 3 are similar to those used in Figure 2 , with some additional abbreviations.
  • the symbol L represents a question about a letter and its neighboring letters.
  • the symbol P represents a question about a phoneme and its neighboring phonemes.
  • the abbreviations CONS and SYL are phoneme classes, namely consonant and syllabic.
  • the numbers in the leaf nodes give phoneme probabilities as they did in the letter-only trees.
  • the mixed-tree score estimator rescores each of the pronunciations in list 18 based on the mixed-tree questions and using the probability data in the leaf nodes of the mixed trees. If desired, the list of pronunciations may be stored in association with the respective score as in list 22 . If desired, list 22 can be sorted in descending order so that the first listed pronunciation is the one with the highest score.
  • the pronunciation occupying the highest score position in list 22 will be different from the pronunciation occupying the highest score position in list 18 . This occurs because the mixed-tree score estimator, using the mixed trees 12 , screens out those pronunciations that do not contain self-consistent phoneme sequences or otherwise represent pronunciations that would not occur in natural speech.
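One way the rescoring step could be sketched is below, assuming each mixed tree can ask letter (L) or phoneme (P) questions and that per-letter leaf probabilities are combined as a sum of log probabilities. The tuple encoding and the toy tree for E are invented assumptions.

```python
# Sketch of mixed-tree rescoring of a candidate (letters, phonemes) pair.
import math

def walk(node, ctx, i):
    """node is ("Q", test, yes_subtree, no_subtree) or ("LEAF", probs);
    ctx is the (letters, phonemes) pair the questions may inspect."""
    while node[0] == "Q":
        node = node[2] if node[1](ctx, i) else node[3]
    return node[1]

def rescore(mixed_trees, letters, phonemes):
    """Combine the leaf probabilities of each assigned phoneme (log domain)."""
    log_score = 0.0
    for i, letter in enumerate(letters):
        probs = walk(mixed_trees[letter], (letters, phonemes), i)
        log_score += math.log(probs.get(phonemes[i], 1e-9))
    return log_score

# toy mixed tree for E: first a phoneme (P) question about the preceding
# phoneme, then a letter (L) question; two classes of question in one tree
tree_E = ("Q", lambda c, i: i > 0 and c[1][i - 1] == "l",        # P question
          ("LEAF", {"-": 0.8, "iy": 0.2}),
          ("Q", lambda c, i: i + 1 == len(c[0]),                 # L question: last letter?
           ("LEAF", {"-": 0.7, "iy": 0.3}),
           ("LEAF", {"eh": 0.6, "iy": 0.4})))
trees = {"E": tree_E}

print(rescore(trees, ["E"], ["iy"]))   # log(0.3)
```

Because the score can consult the phonemes already assigned, a candidate with a phonotactically implausible sequence receives a low leaf probability and is pushed down the list.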
  • selector module 24 can access list 22 to retrieve one or more of the pronunciations in the list. Typically selector 24 retrieves the pronunciation with the highest score and provides this as the output pronunciation 26 .
  • the pronunciation generator depicted in Figure 1 represents only one possible embodiment employing the mixed tree of the invention.
  • the dynamic programming phoneme sequence generator 16 and its associated letter-only decision trees 10 may be dispensed with in applications where one or more pronunciations for a given spelled word sequence are already available. This situation might be encountered where a previously developed pronunciation dictionary is available.
  • the mixed-tree score estimator 20 with its associated mixed trees 12 , may be used to score the entries in the pronunciation dictionary, identifying those having low scores, thereby flagging suspicious pronunciations in the dictionary being constructed.
  • Such a system may, for example, be incorporated into a lexicographer's productivity tool.
  • the output pronunciation or pronunciations selected from list 22 can be used to form pronunciation dictionaries for both speech recognition and speech synthesis applications.
  • the pronunciation dictionary may be used during the recognizer training phase by supplying pronunciations for words that are not already found in the recognizer lexicon.
  • the pronunciation dictionaries may be used to generate phoneme sounds for concatenated playback.
  • the system may be used, for example, to augment the features of an E-mail reader or other text-to-speech application.
  • the mixed-tree scoring system of the invention can be used in a variety of applications where a single one or list of possible pronunciations is desired. For example, in a dynamic on-line dictionary the user types a word and the system provides a list of possible pronunciations, in order of probability.
  • the scoring system can also be used as a user feedback tool for language learning systems.
  • a language learning system with speech recognition capability is used to display a spelled word and to analyze the speaker's attempts at pronouncing that word in the new language, and the system tells the user how probable or improbable his or her pronunciation is for that word.
  • the system for generating the letter-only trees and the mixed trees is illustrated in Figure 4 .
  • At the heart of the decision tree generation system is tree generator 40 .
  • the tree generator employs a tree-growing algorithm that operates upon a predetermined set of training data 42 supplied by the developer of the system.
  • the training data comprise aligned letter, phoneme pairs that correspond to known proper pronunciations of words.
  • the training data may be generated through the alignment process illustrated in Figure 5 .
  • Figure 5 illustrates an alignment process being performed on an exemplary word BIBLE.
  • the spelled word 44 and its pronunciation 46 are fed to a dynamic programming alignment module 48 which aligns the letters of the spelled word with the phonemes of the corresponding pronunciation. Note in the illustrated example the final E is silent.
  • the letter phoneme pairs are then stored as data 42 .
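The alignment module could be sketched as a standard dynamic-programming global alignment, as below. The match() heuristic and the gap penalty are invented for illustration and are not the scoring used in the patent.

```python
# Minimal sketch of a dynamic-programming aligner pairing letters with phonemes.
def align(letters, phonemes, match):
    """Global alignment; a letter paired with no phoneme gets '-' (silence)."""
    n, m = len(letters), len(phonemes)
    GAP = -1
    # score[i][j]: best score aligning the first i letters with the first j phonemes
    score = [[GAP * (i + j) for j in range(m + 1)] for i in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (score[i-1][j-1] + match(letters[i-1], phonemes[j-1]), "pair"),
                (score[i-1][j] + GAP, "silent"),   # letter aligned to '-'
                (score[i][j-1] + GAP, "insert"),   # phoneme with no letter
            ]
            score[i][j], back[i][j] = max(choices)
    # trace back to recover the letter/phoneme pairs
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j] if (i > 0 and j > 0) else ("silent" if i > 0 else "insert")
        if move == "pair":
            pairs.append((letters[i-1], phonemes[j-1])); i -= 1; j -= 1
        elif move == "silent":
            pairs.append((letters[i-1], "-")); i -= 1
        else:
            j -= 1
    return pairs[::-1]

# crude invented heuristic: reward a letter whose lowercase form equals the phoneme
m = lambda l, p: 3 if p == l.lower() else 0
print(align(list("BIBLE"), ["b", "ay", "b", "ah", "l"], m))
# [('B', 'b'), ('I', 'ay'), ('B', 'b'), ('L', 'l'), ('E', '-')] -- the final E is silent
```

Note that the crude heuristic simply drops the unmatched "ah" rather than attaching it to a letter; a realistic aligner would use learned letter-phoneme affinity scores.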
  • the tree generator works in conjunction with three additional components: a set of possible yes-no questions 50 , a set of rules 52 for selecting the best questions for each node or for deciding if the node should be a leaf node, and a pruning method 53 to prevent over-training.
  • the set of possible yes-no questions may include letter questions 54 and phoneme questions 56 , depending on whether a letter-only tree or a mixed tree is being grown. When growing a letter-only tree, only letter questions 54 are used; when growing a mixed tree both letter questions 54 and phoneme questions 56 are used.
  • the rules for selecting the best question to populate each node are, in the presently preferred embodiment, designed to follow the Gini criterion.
  • Other splitting criteria can be used instead.
  • For further details regarding splitting criteria, reference may be had to Breiman, Friedman et al., "Classification and Regression Trees."
  • the Gini criterion is used to select a question from the set of possible yes-no questions 50 and to employ a stopping rule that decides when a node is a leaf node.
  • the Gini criterion employs a concept called "impurity.” Impurity is always a non-negative number.
  • Gini impurity may be defined as follows. If C is the set of classes to which data items can belong, and T is the current tree node, let f(c|T) denote the proportion of training data items at node T that belong to class c. The impurity of node T is then i(T) = Σc∈C f(c|T)(1 - f(c|T)) = 1 - Σc∈C f(c|T)², which is zero when all items at T belong to a single class and largest when the classes are evenly represented.
  • the items that answer "yes” to Q 1 include four examples of "iy” and one example of "-” (the other five items answer “no” to Q 1 .)
  • the items that answer "yes” to Q 2 include three examples of "iy” and three examples of "eh” (the other four items answer "no” to Q 2 ).
  • Figure 6 diagrammatically compares these two cases.
  • the Gini criterion determines which question the system should choose for this node, Q 1 or Q 2 .
  • the Gini criterion for choosing the correct question is: find the question in which the drop in impurity in going from parent nodes to children nodes is maximized.
  • Q 1 gave the greatest drop in impurity. It will therefore be chosen instead of Q 2 .
  • the rule set 52 declares a best question for a node to be that question which brings about the greatest drop in impurity in going from the parent node to its children.
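The impurity computations behind this rule can be sketched directly from the definitions above. The "yes"-branch class counts come from the Q 1 and Q 2 example in the text; the "no"-branch counts (and hence the parent distribution) are invented assumptions that complete the ten-item scenario.

```python
# Gini impurity from class counts, and the drop in impurity for a split.
def gini(counts):
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def impurity_drop(parent, yes, no):
    n = sum(parent.values())
    weighted = (sum(yes.values()) / n) * gini(yes) + (sum(no.values()) / n) * gini(no)
    return gini(parent) - weighted

def merge(a, b):
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out

# "yes"-branch counts from the text; "no"-branch counts invented so both
# questions split the same ten-item parent node.
q1_yes, q1_no = {"iy": 4, "-": 1}, {"iy": 2, "eh": 3}
q2_yes, q2_no = {"iy": 3, "eh": 3}, {"iy": 3, "-": 1}

parent = merge(q1_yes, q1_no)            # {"iy": 6, "eh": 3, "-": 1}
assert parent == merge(q2_yes, q2_no)    # same node, two candidate splits

d1 = impurity_drop(parent, q1_yes, q1_no)
d2 = impurity_drop(parent, q2_yes, q2_no)
print(d1 > d2)   # Q1 yields the larger drop and is therefore chosen
```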
  • the tree generator applies the rules 52 to grow a decision tree of yes-no questions selected from set 50 .
  • the generator will continue to grow the tree until the optimal-sized tree has been grown.
  • Rules 52 include a set of stopping rules that will terminate tree growth when the tree is grown to a predetermined size. In the preferred embodiment the tree is grown to a size larger than ultimately desired.
  • pruning methods 53 are used to cut back the tree to its desired size.
  • the pruning method may implement the Breiman technique as described in the reference cited above.
  • the tree generator thus generates sets of letter-only trees, shown generally at 60 or mixed trees, shown generally at 70 , depending on whether the set of possible yes-no questions 50 includes letter-only questions alone or in combination with phoneme questions.
  • the corpus of training data 42 comprises letter, phoneme pairs, as discussed above. In growing letter-only trees, only the letter portions of these pairs are used in populating the internal nodes. Conversely, when growing mixed trees, both the letter and phoneme components of the training data pairs may be used to populate internal nodes. In both instances the phoneme portions of the pairs are used to populate the leaf nodes. Probability data associated with the phoneme data in the leaf nodes are generated by counting the number of times a given phoneme is aligned with a given letter over the training data corpus.
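This counting procedure can be sketched as relative-frequency estimation over the aligned pairs; the second training word below is invented to give the letter E two observed phonemes.

```python
# Sketch of turning aligned letter/phoneme pairs into leaf-node probabilities
# by counting how often each phoneme is aligned with each letter.
from collections import Counter, defaultdict

def leaf_probabilities(aligned_words):
    """aligned_words: list of [(letter, phoneme), ...] alignments."""
    counts = defaultdict(Counter)
    for word in aligned_words:
        for letter, phoneme in word:
            counts[letter][phoneme] += 1
    return {
        letter: {ph: c / sum(ctr.values()) for ph, c in ctr.items()}
        for letter, ctr in counts.items()
    }

corpus = [
    [("B", "b"), ("I", "ay"), ("B", "b"), ("L", "l"), ("E", "-")],  # BIBLE, final E silent
    [("B", "b"), ("E", "eh"), ("D", "d")],                          # invented second word
]
print(leaf_probabilities(corpus)["E"])   # {'-': 0.5, 'eh': 0.5}
```

In the full system these counts would be collected separately for the subset of training items reaching each leaf, rather than globally per letter as in this simplification.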
  • the letter-to-pronunciation decision trees generated by the above-described method can be stored in memory for use in a variety of different speech-processing applications. While these applications are many and varied, a few examples will next be presented to better highlight some of the capabilities and advantages of these trees.
  • Figure 7 illustrates the use of both the letter-only trees and the mixed trees to generate pronunciations from spelled-word letter sequences.
  • the illustrated embodiment employs both letter-only and mixed tree components together, other applications may use only one component and not the other.
  • the set of letter-only trees are stored in memory at 80 and the mixed trees are stored in memory at 82 . In many applications there will be one tree for each letter in the alphabet.
  • Dynamic programming sequence generator 84 operates upon input sequence 86 to generate a pronunciation at 88 based on the letter-only trees 80 . Essentially, each letter in the input sequence is considered individually and the applicable letter-only tree is used to select the most probable pronunciation for that letter.
  • the letter-only trees ask a series of yes-no questions about the given letter and its neighboring letters in the sequence. After all letters in the sequence have been considered, the resultant pronunciation is generated by concatenating the phonemes selected by the sequence generator.
  • the mixed tree set 82 can be used. Whereas letter-only trees ask only questions about letters, the mixed trees can ask questions about letters and also about phonemes. Scorer 90 may receive phoneme information from the output of sequence generator 84 . In this regard, sequence generator 84 , using the letter-only trees 80 , can generate a plurality of different pronunciations, sorting those pronunciations based on their respective probability scores. This sorted list of pronunciations may be stored at 92 for access by the scorer 90 .
  • Scorer 90 receives as input the same input sequence 86 as was supplied to sequence generator 84 . Scorer 90 applies the mixed-tree 82 questions to the sequence of letters, using data from store 92 when asked to respond to a phoneme question. The resulting output at 94 is typically a better pronunciation than provided at 88 . The reason for this is that the mixed trees tend to filter out pronunciations that would not occur in natural speech. For example, the proper name, Achilles, would likely result in a pronunciation that phoneticizes both l's: ah-k-ih-l-l-iy-z. In natural speech, the second l is actually silent: ah-k-ih-l-iy-z.
  • scorer 90 can also produce a sorted list of n possible pronunciations as at 96 .
  • the scores associated with each pronunciation represent the composite of the individual probability scores assigned to each phoneme in the pronunciation. These scores can, themselves, be used in applications where dubious pronunciations need to be identified. For example, the phonetic transcription supplied by a team of lexicographers could be checked using the mixed trees to quickly identify any questionable pronunciations.
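Such a lexicographer's check could be sketched as a simple threshold filter over the per-entry scores. The toy scorer below merely penalizes doubled phonemes and stands in for the mixed-tree score estimator; all names and thresholds are invented.

```python
# Sketch of flagging dubious dictionary entries by mixed-tree score.
def flag_suspicious(dictionary, score_fn, threshold):
    """dictionary: {spelling: phoneme sequence}. Returns low-scoring spellings."""
    return sorted(
        spelling for spelling, phones in dictionary.items()
        if score_fn(spelling, phones) < threshold
    )

# toy stand-in scorer: penalize adjacent duplicate phonemes such as "l l"
toy_score = lambda spelling, phones: 0.1 if any(
    a == b for a, b in zip(phones, phones[1:])) else 0.9

entries = {
    "ACHILLES": ["ah", "k", "ih", "l", "l", "iy", "z"],   # dubious: doubled l
    "BIBLE": ["b", "ay", "b", "ah", "l"],
}
print(flag_suspicious(entries, toy_score, 0.5))   # ['ACHILLES']
```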
  • Figure 8 shows a two stage spelled letter-to-pronunciation generator.
  • the mixed-decision tree approach of the invention can be used in a variety of different applications in addition to the pronunciation generator illustrated here.
  • the two stage pronunciation generator has been selected for illustration because it highlights many aspects and benefits of the mixed-decision tree structure.
  • the two stage pronunciation generator includes a first stage 116 which preferably employs a set of letter-syntax-context-dialect decision trees 110 and a second stage 120 which employs a set of phoneme-mixed decision trees 112 which examine input sequence 114 at a phoneme level.
  • Letter-syntax-context-dialect decision trees examine questions involving letters and their adjacent neighbors in a spelled word sequence (i.e., letter-related questions); other questions examine what words precede or follow a particular word (i.e., context-related questions); still other questions examine what part of speech the word has within a sentence as well as what syntax other words have in the sentence (i.e., syntax-related questions); still further questions examine which dialect is to be spoken.
  • a user selects which dialect is to be spoken by dialect selection device 150 .
  • An alternate embodiment of the present invention includes using letter-related questions and at least one of the word-level characteristics (i.e., syntax-related questions or context-related questions). For example, one embodiment utilizes a set of letter-syntax decision trees for the first stage. Another embodiment utilizes a set of letter-context-dialect decision trees which do not examine syntax of the input sequence.
  • the present invention is not limited to words occurring in a sentence, but includes other linguistical constructs which exhibit syntax, such as fragmented sentences or phrases.
  • An input sequence 114 such as the sequence of letters of a sentence, is fed to the text-based pronunciation generator 116 .
  • input sequence 114 could be the following sentence: "Did you know who read the autobiography?"
  • Syntax data 115 is an input to text-based pronunciation generator 116 . This input provides information for the text-based pronunciation generator 116 to correctly course through the letter-syntax-context-dialect decision trees 110 .
  • Syntax data 115 addresses what parts of speech each word has in the input sequence 114 . For example, the word "read” in the above input sequence example would be tagged as a verb (as opposed to a noun or an adjective) by syntax tagger software module 129 .
  • syntax tagger software technology is available from such institutions as the University of Pennsylvania under project "Xtag." Moreover, the following reference discusses syntax tagger software technology: George Foster, "Statistical Lexical Disambiguation", Master's Thesis in Computer Science, McGill University, Montreal, Canada (November 11, 1991).
  • the text-based pronunciation generator 116 uses decision trees 110 to generate a list of pronunciations 118 , representing possible pronunciation candidates of the spelled word input sequence.
  • Each pronunciation (e.g., pronunciation A) of list 118 represents a pronunciation of input sequence 114 including preferably how each word is stressed. Moreover, the rate at which each word is spoken is determined in the preferred embodiment.
  • Sentence rate calculator software module 152 is utilized by text-based pronunciation generator 116 to determine how quickly each word should be spoken. For example, sentence rate calculator 152 examines the context of the sentence to determine if certain words in the sentence should be spoken at a faster or slower rate than normal. For example, a sentence with an exclamation marker at the end produces rate data which indicates that a predetermined number of words before the end of the sentence are to have a shorter duration than normal to better convey the impact of an exclamatory statement.
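A sentence rate calculator of the kind described could be sketched as follows; the trailing-word count and the 0.8 duration multiplier are invented constants, not values from the patent.

```python
# Invented sketch of a sentence-rate calculator: end-of-sentence punctuation
# adjusts the relative duration of the last few words.
def word_rates(words, trailing_words=3):
    """Return a duration multiplier per word (1.0 = normal rate)."""
    rates = [1.0] * len(words)
    if words and words[-1].endswith("!"):
        # shorten the last few words to better convey an exclamation
        for i in range(max(0, len(words) - trailing_words), len(words)):
            rates[i] = 0.8
    return rates

print(word_rates("What a great idea!".split()))   # [1.0, 0.8, 0.8, 0.8]
```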
  • the text-based pronunciation generator 116 examines in order each letter and word in the sequence, applying the decision tree associated with that letter or word's syntax (or word's context) to select a phoneme pronunciation for that letter based on probability data contained in the decision tree.
  • the set of decision trees 110 includes a decision tree for each letter in the alphabet and syntax of the language involved.
  • Figure 9 shows an example of a letter-syntax-context-dialect decision tree 140 applicable to the letter "E" in the word "READ.”
  • the decision tree comprises a plurality of internal nodes (illustrated as ovals in the Figure) and a plurality of leaf nodes (illustrated as rectangles in the Figure). Each internal node is populated with a yes-no question. Yes-no questions are questions that can be answered either yes or no.
  • each internal node branches either left or right depending on whether the answer to the associated question is yes or no.
  • the first internal node inquires about the dialect to be spoken. Internal node 138 is representative of such an inquiry. If the southern dialect is to be spoken, then southern dialect decision tree 139 is coursed through which ultimately produces phoneme values at the leaf nodes which are more distinctive of a southern dialect.
  • the leaf nodes are populated with probability data that associate possible phoneme pronunciations with numeric values representing the probability that the particular phoneme represents the correct pronunciation of the given letter.
  • the null phoneme i.e., silence, is represented by the symbol '-'.
  • the "E” in the present-tense verbs "READ” and “LEAD” is assigned its correct pronunciation, "iy” at leaf node 142 with probability 1.0 by the decision tree 140 .
  • the "E” in the past tense of "read” (e.g., "Who read a book) is assigned pronunciation “eh” at leaf node 144 with probability 0.9.
  • Decision trees 110 preferably includes context-related questions.
  • a context-related question at an internal node may examine whether the word "you" is preceded by the word "did." In such a context, the "y" in "you" is typically pronounced in colloquial speech as "ja".
  • the present invention also generates prosody-indicative data, so as to convey stress, pitch, grave, or pause aspects when speaking a sentence. Syntax-related questions help to determine how the phoneme is to be stressed, pitched, or graved. For example, internal node 141 (of Figure 9 ) inquires whether the first word in the sentence is an interrogatory pronoun, such as "who" in the exemplary sentence "who read a book?" Since the first word in this example is an interrogatory pronoun, leaf node 144 with its phoneme stress is selected. Leaf node 146 illustrates the other option where the phonemes are not stressed.
  • the phonemes of the last syllable of the last word in the sentence would have a pitch mark so as to more naturally convey the questioning aspect of the sentence.
  • the present invention is able to accommodate natural pausing in speaking a sentence.
  • the present invention includes such pausing detail by asking questions about punctuation, such as commas and periods.
  • the text-based pronunciation generator 116 (Fig. 8 ) thus uses decision trees 110 to construct one or more pronunciation hypotheses that are stored in list 118 .
  • each pronunciation has associated with it a numerical score arrived at by combining the probability scores of the individual phonemes selected using decision trees 110 .
  • Word pronunciations may be scored by constructing a matrix of possible combinations and then using dynamic programming to select the n-best candidates.
  • the n-best candidates may be selected using a substitution technique that first identifies the most probable word candidate and then generates additional candidates through iterative substitution, as follows.
  • the pronunciation with the highest probability score is selected first, by multiplying the respective scores of the highest-scoring phonemes (identified by examining the leaf nodes) and then using this selection as the most probable candidate or first-best word candidate.
  • Additional (n-best) candidates are then selected by examining the phoneme data in the leaf nodes again to identify the phoneme, not previously selected, that has the smallest difference from an initially selected phoneme. This minimally-different phoneme is then substituted for the initially selected one to thereby generate the second-best word candidate.
  • the above process may be repeated iteratively until the desired number of n-best candidates have been selected.
  • List 118 may be sorted in descending score order, so that the pronunciation judged the best by the letter-only analysis appears first in the list.
  • Decision trees 110 frequently produce only moderately successful results. This is because these decision trees have no way of determining at each letter what phoneme will be generated by subsequent letters. Thus decision trees 110 can generate a high scoring pronunciation that actually would not occur in natural speech. For example, the proper name, Achilles, would likely result in a pronunciation that phoneticizes both l's: ah-k-ih-l-l-iy-z. In natural speech, the second l is actually silent: ah-k-ih-l-iy-z. The pronunciation generator using decision trees 110 has no mechanism to screen out word pronunciations that would never occur in natural speech.
  • a phoneme-mixed tree score estimator 120 uses the set of phoneme-mixed decision trees 112 to assess the viability of each pronunciation in list 118 .
  • the score estimator 120 works by sequentially examining each letter in the input sequence 114 along with the phonemes assigned to each letter by text-based pronunciation generator 116 .
  • the phoneme-mixed tree score estimator 120 rescores each of the pronunciations in list 118 based on the phoneme-mixed tree questions 112 and using the probability data in the leaf nodes of the mixed trees. If desired, the list of pronunciations may be stored in association with the respective score as in list 122 . If desired, list 122 can be sorted in descending order so that the first listed pronunciation is the one with the highest score.
  • the pronunciation occupying the highest score position in list 122 will be different from the pronunciation occupying the highest score position in list 118 . This occurs because the phoneme-mixed tree score estimator 120 , using the phoneme-mixed trees 112 , screens out those pronunciations that do not contain self-consistent phoneme sequences or otherwise represent pronunciations that would not occur in natural speech.
  • phoneme-mixed tree score estimator 120 utilizes sentence rate calculator 152 in order to determine rate data for the pronunciations in list 122 . Moreover, estimator 120 utilizes phoneme-mixed trees that allow questions about dialect to be examined and that also allow questions to determine stress and other prosody aspects at the leaf nodes in a manner similar to the aforementioned approach.
  • selector module 124 can access list 122 to retrieve one or more of the pronunciations in the list. Typically selector 124 retrieves the pronunciation with the highest score and provides this as the output pronunciation 126 .
  • the pronunciation generator depicted in Figure 8 represents only one possible embodiment employing the mixed tree approach of the invention.
  • the output pronunciation or pronunciations selected from list 122 can be used to form pronunciation dictionaries for both speech recognition and speech synthesis applications.
  • the pronunciation dictionary may be used during the recognizer training phase by supplying pronunciations for words that are not already found in the recognizer lexicon.
  • the pronunciation dictionaries may be used to generate phoneme sounds for concatenated playback.
  • the system may be used, for example, to augment the features of an E-mail reader or other text-to-speech application.
  • the mixed-tree scoring system (i.e., letter, syntax, context, and phoneme) of the invention can be used in a variety of applications where a single one or list of possible pronunciations is desired.
  • a user types a sentence, and the system provides a list of possible pronunciations for the sentence, in order of probability.
  • the scoring system can also be used as a user feedback tool for language learning systems.
  • a language learning system with speech recognition capability is used to display a spelled sentence and to analyze the speaker's attempts at pronouncing that sentence in the new language. The system indicates to the user how probable or improbable his or her pronunciation is for that sentence.
  • the technical effect of the present invention may be realised by a suitably programmed computer and the present invention also provides a computer program product comprising a computer readable storage medium having recorded thereon computer interpretable or compilable code that, when loaded onto a suitable computer and executed, will realise the technical effect.
  • the present invention also encompasses such code itself.
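The two-stage rescoring outlined in the items above can be sketched as follows. This is a minimal sketch: the `Node` class, the question signature, and the probability floor for unseen phonemes are illustrative assumptions, not structures specified in the text.

```python
# Hypothetical mixed-tree node: internal nodes carry a yes-no question
# (a predicate over the letter/phoneme context); leaf nodes carry a
# table of phoneme probabilities.
class Node:
    def __init__(self, question=None, yes=None, no=None, probs=None):
        self.question = question   # callable(letters, phonemes, i) -> bool, None at a leaf
        self.yes = yes
        self.no = no
        self.probs = probs         # {phoneme: probability} at a leaf

def rescore(letters, phonemes, trees):
    """Score one candidate pronunciation: for each letter, descend that
    letter's mixed tree to a leaf and multiply in the leaf probability
    of the phoneme the first stage assigned to that letter."""
    score = 1.0
    for i, (letter, phoneme) in enumerate(zip(letters, phonemes)):
        node = trees[letter]
        while node.question is not None:   # answer yes-no questions down to a leaf
            node = node.yes if node.question(letters, phonemes, i) else node.no
        score *= node.probs.get(phoneme, 1e-6)   # assumed floor for unseen phonemes
    return score
```

Candidates whose phoneme sequences are not self-consistent descend to leaves where the assigned phoneme has low probability, so their product score drops and they sink in the reordered list.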

Abstract

The mixed decision tree includes a network of yes-no questions about adjacent letters in a spelled word sequence and also about adjacent phonemes in the phoneme sequence corresponding to the spelled word sequence. Leaf nodes of the mixed decision tree provide information about which phonetic transcriptions are most probable. Using the mixed trees, scores are developed for each of a plurality of possible pronunciations, and these scores can be used to select the best pronunciation as well as to rank pronunciations in order of probability. The pronunciations generated by the system can be used in speech synthesis and speech recognition applications as well as lexicography applications.

Description

  • The present invention relates generally to speech processing. More particularly, the invention relates to a system for generating pronunciations of spelled words. The invention can be employed in a variety of different contexts, including speech recognition, speech synthesis and lexicography.
  • Spelled words accompanied by their pronunciations occur in many different contexts within the field of speech processing. In speech recognition, phonetic transcriptions for each word in the dictionary are needed to train the recognizer prior to use. Traditionally, phonetic transcriptions are manually created by lexicographers who are skilled in the nuances of phonetic pronunciation of the particular language of interest. Developing a good phonetic transcription for each word in the dictionary is time consuming and requires a great deal of skill. Much of this labor and specialized expertise could be dispensed with if there were a reliable system that could generate phonetic transcriptions of words based on their letter spelling. Such a system could extend current recognition systems to recognize words such as geographic locations and surnames that are not currently found in existing dictionaries.
  • Spelled words are also encountered frequently in the speech synthesis field. Present day speech synthesizers convert text to speech by retrieving digitally-sampled sound units from a dictionary and concatenating these sound units to form sentences.
  • As the above examples demonstrate, both the speech recognition and the speech synthesis fields of speech processing would benefit from the ability to generate accurate pronunciations from spelled words. The need for this technology is not limited to speech processing, however. Lexicographers have today completed fairly large and accurate pronunciation dictionaries for many of the major world languages. However, there still remain many hundreds of regional languages for which good phonetic transcriptions do not exist. Because the task of producing a good phonetic transcription has heretofore been largely a manual one, it may be years before some regional languages will be transcribed, if at all. The transcription process could be greatly accelerated if there were a good computer-implemented technique for scoring transcription accuracy. Such a scoring system would use an existing language transcription corpus to identify those entries in the transcription prototype whose pronunciations are suspect. This would greatly enhance the speed at which a quality transcription is generated.
  • Heretofore most attempts at spelled word-to-pronunciation transcription have relied solely upon the letters themselves. These techniques leave a great deal to be desired. For example, a letter-only pronunciation generator would have great difficulty properly pronouncing the word Bible. Based on the sequence of letters only the letter-only system would likely pronounce the word "Bib-l", much as a grade school child learning to read might do. The fault in conventional systems lies in the inherent ambiguity imposed by the pronunciation rules of many languages. The English language, for example, has hundreds of different pronunciation rules, making it difficult and computationally expensive to approach the problem on a word-by-word basis.
  • The present invention addresses the problem from a different angle. The invention uses a specially constructed mixed-decision tree that encompasses both letter sequence and phoneme sequence decision-making rules. More specifically, the mixed-decision tree embodies a series of yes-no questions residing at the internal nodes of the tree. Some of these questions involve letters and their adjacent neighbors in a spelled word sequence; other of these questions involve phonemes and their neighboring phonemes in the word sequence. The internal nodes ultimately lead to leaf nodes that contain probability data about which phonetic pronunciations of a given letter are most likely to be correct in pronouncing the word defined by its letter sequence.
  • The pronunciation generator of the invention uses this mixed-decision tree to score different pronunciation candidates, allowing it to select the most probable candidate as the best pronunciation for a given spelled word. Generation of the best pronunciation is preferably a two-stage process in which a letter-only tree is used in the first stage to generate a plurality of pronunciation candidates. These candidates are then scored using the mixed-decision tree in the second stage to select the best candidate.
  • Although the mixed-decision tree is advantageously used in a two-stage pronunciation generator, the mixed tree is useful in solving some problems that do not require letter-only first stage processing. For example, the mixed-decision tree can be used to score pronunciations generated by linguists using manual techniques.
  • For a more complete understanding of the invention, its objects and advantages, reference may be had to the following specification and to the accompanying drawings.
  • Figure 1 is a block diagram illustrating the components and steps of the invention;
  • Figure 2 is a tree diagram illustrating a letter-only tree;
  • Figure 3 is a tree diagram illustrating a mixed tree in accordance with the invention;
  • Figure 4 is a block diagram illustrating a presently preferred system for generating the mixed tree in accordance with the invention;
  • Figure 5 is a flowchart illustrating a method for generating training data through an alignment process;
  • Figure 6 is a block diagram illustrating use of the decision-tree in an exemplary pronunciation generator;
  • Figure 7 illustrates application of the Gini criterion in assessing which question to use in populating a node.
  • Figure 8 is a block diagram of a letter-to-sound pronunciation generator according to the invention; and
  • Figure 9 is a tree diagram illustrating a letter-syntax-context-dialect mixed decision tree.
  • To illustrate the principles of the invention the exemplary embodiment of Figure 1 shows a spelled letter-to-pronunciation generator. As will be explained more fully below, the mixed-decision tree of the invention can be used in a variety of different applications in addition to the pronunciation generator illustrated here. The pronunciation generator has been selected for illustration because it highlights many aspects and benefits of the mixed-decision tree structure.
  • The pronunciation generator employs two stages, the first stage employing a set of letter-only decision trees 10 and the second stage employing a set of mixed-decision trees 12. An input sequence 14, such as the sequence of letters B-I-B-L-E, is fed to a dynamic programming phoneme sequence generator 16. The sequence generator uses the letter-only trees 10 to generate a list of pronunciations 18, representing possible pronunciation candidates of the spelled word input sequence.
  • The sequence generator sequentially examines each letter in the sequence, applying the decision tree associated with that letter to select a phoneme pronunciation for that letter based on probability data contained in the letter-only tree.
  • Preferably the set of letter-only decision trees includes a decision tree for each letter in the alphabet. Figure 2 shows an example of a letter-only decision tree for the letter E. The decision tree comprises a plurality of internal nodes (illustrated as ovals in the Figure) and a plurality of leaf nodes (illustrated as rectangles in the Figure). Each internal node is populated with a yes-no question. Yes-no questions are questions that can be answered either yes or no. In the letter-only tree these questions are directed to the given letter (in this case the letter E) and its neighboring letters in the input sequence. Note in Figure 2 that each internal node branches either left or right depending on whether the answer to the associated question is yes or no.
  • Abbreviations are used in Figure 2 as follows: numbers in questions, such as "+1" or "-1", refer to positions in the spelling relative to the current letter. For example, "+1L=='R'?" means "Is the letter after the current letter (which in this case is the letter E) an R?" The abbreviations CONS and VOW represent classes of letters, namely consonants and vowels. The absence of a neighboring letter, or null letter, is represented by the symbol -, which is used as a filler or placeholder when aligning certain letters with corresponding phoneme pronunciations. The symbol # denotes a word boundary.
  • The leaf nodes are populated with probability data that associate possible phoneme pronunciations with numeric values representing the probability that the particular phoneme represents the correct pronunciation of the given letter. For example, the notation "iy=>0.51" means "the probability of phoneme 'iy' in this leaf is 0.51." The null phoneme, i.e., silence, is represented by the symbol '-'.
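The internal-node questions of Figure 2 might be evaluated as in the sketch below, with '#' standing in for positions beyond the word boundary as described above. The helper names are invented for illustration:

```python
def letter_at(word, i, offset):
    """The letter at position i + offset, or '#' beyond the word boundary."""
    j = i + offset
    return word[j] if 0 <= j < len(word) else '#'

def q_next_is_r(word, i):
    """The question "+1L=='R'?": is the letter after position i an 'R'?"""
    return letter_at(word, i, +1) == 'R'
```

Asked of the E in "HER" (position 1), `q_next_is_r` answers yes, so traversal would branch to the yes child of that node.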
  • The sequence generator 16 (Fig. 1) thus uses the letter-only decision trees 10 to construct one or more pronunciation hypotheses that are stored in list 18. Preferably each pronunciation has associated with it a numerical score arrived at by combining the probability scores of the individual phonemes selected using the decision tree 10. Word pronunciations may be scored by constructing a matrix of possible combinations and then using dynamic programming to select the n-best candidates. Alternatively, the n-best candidates may be selected using a substitution technique that first identifies the most probable word candidate and then generates additional candidates through iterative substitution, as follows.
  • The pronunciation with the highest probability score is selected first, by multiplying the respective scores of the highest-scoring phonemes (identified by examining the leaf nodes) and then using this selection as the most probable candidate or first-best word candidate. Additional (n-best) candidates are then selected by examining the phoneme data in the leaf nodes again to identify the phoneme, not previously selected, that has the smallest difference from an initially selected phoneme. This minimally-different phoneme is then substituted for the initially selected one to thereby generate the second-best word candidate. The above process may be repeated iteratively until the desired number of n-best candidates have been selected. List 18 may be sorted in descending score order, so that the pronunciation judged the best by the letter-only analysis appears first in the list.
  • As noted above, a letter-only analysis will frequently produce poor results. This is because the letter-only analysis has no way of determining at each letter what phoneme will be generated by subsequent letters. Thus a letter-only analysis can generate a high-scoring pronunciation that actually would not occur in natural speech. For example, the proper name Achilles would likely result in a pronunciation that phoneticizes both l's: ah-k-ih-l-l-iy-z. In natural speech, the second l is actually silent: ah-k-ih-l-iy-z. The sequence generator using letter-only trees has no mechanism to screen out word pronunciations that would never occur in natural speech.
  • The second stage of the pronunciation system addresses the above problem. A mixed-tree score estimator 20 uses the set of mixed-decision trees 12 to assess the viability of each pronunciation in list 18. The score estimator works by sequentially examining each letter in the input sequence along with the phonemes assigned to each letter by sequence generator 16.
  • Like the set of letter-only trees, the set of mixed trees has a mixed tree for each letter of the alphabet. An exemplary mixed tree is shown in Figure 3. Like the letter-only tree, the mixed tree has internal nodes and leaf nodes. The internal nodes are illustrated as ovals and the leaf nodes as rectangles in Figure 3. The internal nodes are each populated with a yes-no question and the leaf nodes are each populated with probability data. Although the tree structure of the mixed tree resembles that of the letter-only tree, there is one important difference. The internal nodes of the mixed tree can contain two different classes of questions. An internal node can contain a question about a given letter and its neighboring letters in the sequence, or it can contain a question about the phoneme associated with that letter and neighboring phonemes corresponding to that sequence. The decision tree is thus mixed, in that it contains mixed classes of questions.
  • The abbreviations used in Figure 3 are similar to those used in Figure 2, with some additional abbreviations. The symbol L represents a question about a letter and its neighboring letters. The symbol P represents a question about a phoneme and its neighboring phonemes. For example the question "+1L=='D'?" means "Is the letter in the +1 position a 'D'?" The abbreviations CONS and SYL are phoneme classes, namely consonant and syllabic. For example, the question "+1P==CONS?" means "Is the phoneme in the +1 position a consonant?" The numbers in the leaf nodes give phoneme probabilities as they did in the letter-only trees.
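A phoneme-class question such as "+1P==CONS?" might be evaluated as below. The syllabic-phoneme set here is a toy stand-in, since the full class definitions are not given in the text:

```python
# Toy syllabic-phoneme class; the real class membership is an assumption.
SYLLABIC = {'aa', 'ae', 'ah', 'ay', 'eh', 'ih', 'iy', 'ow', 'uw'}

def phoneme_at(phonemes, i, offset):
    """The phoneme at position i + offset, or '#' beyond the boundary."""
    j = i + offset
    return phonemes[j] if 0 <= j < len(phonemes) else '#'

def q_next_phoneme_is_consonant(phonemes, i):
    """The question "+1P==CONS?": is the phoneme one position ahead a consonant?"""
    ph = phoneme_at(phonemes, i, +1)
    return ph not in SYLLABIC and ph not in {'#', '-'}
```

Such questions are what make the tree "mixed": traversal for a letter can branch on the phonemes the candidate pronunciation assigns to neighboring letters, not just on the letters themselves.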
  • The mixed-tree score estimator rescores each of the pronunciations in list 18 based on the mixed-tree questions and using the probability data in the leaf nodes of the mixed trees. If desired, the list of pronunciations may be stored in association with the respective score as in list 22. If desired, list 22 can be sorted in descending order so that the first listed pronunciation is the one with the highest score.
  • In many instances the pronunciation occupying the highest score position in list 22 will be different from the pronunciation occupying the highest score position in list 18. This occurs because the mixed-tree score estimator, using the mixed trees 12, screens out those pronunciations that do not contain self-consistent phoneme sequences or otherwise represent pronunciations that would not occur in natural speech.
  • If desired a selector module 24 can access list 22 to retrieve one or more of the pronunciations in the list. Typically selector 24 retrieves the pronunciation with the highest score and provides this as the output pronunciation 26.
  • As noted above, the pronunciation generator depicted in Figure 1 represents only one possible embodiment employing the mixed tree of the invention. As an alternative embodiment, the dynamic programming phoneme sequence generator 16, and its associated letter-only decision trees 10 may be dispensed with in applications where one or more pronunciations for a given spelled word sequence are already available. This situation might be encountered where a previously developed pronunciation dictionary is available. In such case the mixed-tree score estimator 20, with its associated mixed trees 12, may be used to score the entries in the pronunciation dictionary, identifying those having low scores, thereby flagging suspicious pronunciations in the dictionary being constructed. Such a system may, for example, be incorporated into a lexicographer's productivity tool.
  • The output pronunciation or pronunciations selected from list 22 can be used to form pronunciation dictionaries for both speech recognition and speech synthesis applications. In the speech recognition context, the pronunciation dictionary may be used during the recognizer training phase by supplying pronunciations for words that are not already found in the recognizer lexicon. In the synthesis context the pronunciation dictionaries may be used to generate phoneme sounds for concatenated playback. The system may be used, for example, to augment the features of an E-mail reader or other text-to-speech application.
  • The mixed-tree scoring system of the invention can be used in a variety of applications where a single one or list of possible pronunciations is desired. For example, in a dynamic on-line dictionary the user types a word and the system provides a list of possible pronunciations, in order of probability. The scoring system can also be used as a user feedback tool for language learning systems. A language learning system with speech recognition capability is used to display a spelled word and to analyze the speaker's attempts at pronouncing that word in the new language, and the system tells the user how probable or improbable his or her pronunciation is for that word.
  • Generating the Decision Trees
  • The system for generating the letter-only trees and the mixed trees is illustrated in Figure 4. At the heart of the decision tree generation system is tree generator 40. The tree generator employs a tree-growing algorithm that operates upon a predetermined set of training data 42 supplied by the developer of the system. Typically the training data comprise aligned letter, phoneme pairs that correspond to known proper pronunciations of words. The training data may be generated through the alignment process illustrated in Figure 5. Figure 5 illustrates an alignment process being performed on an exemplary word BIBLE. The spelled word 44 and its pronunciation 46 are fed to a dynamic programming alignment module 48 which aligns the letters of the spelled word with the phonemes of the corresponding pronunciation. Note in the illustrated example the final E is silent. The letter phoneme pairs are then stored as data 42.
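The alignment step of Figure 5 can be sketched as a dynamic program that pairs each letter with a phoneme or with the null phoneme '-'. The scoring function and gap penalty below are assumptions (a trained system would use learned letter-phoneme association scores), and for simplicity this sketch only allows silent letters, not letters that produce multiple phonemes:

```python
def align(letters, phonemes, score):
    """Align a letter string with a phoneme list by dynamic programming;
    letters left unpaired receive the null phoneme '-'. `score(letter,
    phoneme)` rewards plausible pairings."""
    GAP = -1                                   # assumed penalty for a silent letter
    n, m = len(letters), len(phonemes)
    NEG = float('-inf')
    # best[i][j]: best score aligning the first i letters with the first j phonemes
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == NEG:
                continue
            if i < n and j < m:                # pair letter i with phoneme j
                s = best[i][j] + score(letters[i], phonemes[j])
                if s > best[i + 1][j + 1]:
                    best[i + 1][j + 1] = s
                    back[i + 1][j + 1] = (i, j, phonemes[j])
            if i < n:                          # letter i is silent
                s = best[i][j] + GAP
                if s > best[i + 1][j]:
                    best[i + 1][j] = s
                    back[i + 1][j] = (i, j, '-')
    pairs, i, j = [], n, m                     # trace back the chosen letter/phoneme pairs
    while (i, j) != (0, 0):
        pi, pj, ph = back[i][j]
        pairs.append((letters[i - 1], ph))
        i, j = pi, pj
    return pairs[::-1]

def toy_score(letter, phoneme):
    """Invented association scores, just for the BIBLE example."""
    good = {('B', 'b'), ('I', 'ay'), ('L', 'l'), ('E', 'iy')}
    return 2 if (letter, phoneme) in good else -2
```

On the word of Figure 5, `align("BIBLE", ['b', 'ay', 'b', 'l'], toy_score)` pairs each letter with its phoneme and leaves the final E silent, yielding the letter, phoneme pairs stored as training data 42.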
  • Returning to Figure 4, the tree generator works in conjunction with three additional components: a set of possible yes-no questions 50, a set of rules 52 for selecting the best question for each node or for deciding if the node should be a leaf node, and a pruning method 53 to prevent over-training.
  • The set of possible yes-no questions may include letter questions 54 and phoneme questions 56, depending on whether a letter-only tree or a mixed tree is being grown. When growing a letter-only tree, only letter questions 54 are used; when growing a mixed tree both letter questions 54 and phoneme questions 56 are used.
  • The rules for selecting the best question to populate each node in the presently preferred embodiment are designed to follow the Gini criterion. Other splitting criteria can be used instead. For more information regarding splitting criteria reference may be had to Breiman, Friedman et al, "Classification and Regression Trees." Essentially, the Gini criterion is used to select a question from the set of possible yes-no questions 50 and to employ a stopping rule that decides when a node is a leaf node. The Gini criterion employs a concept called "impurity." Impurity is always a non-negative number. It is applied to a node such that a node containing equal proportions of all possible categories has maximum impurity and a node containing only one of the possible categories has zero impurity (the minimum possible value). There are several functions that satisfy the above conditions. These depend upon the counts of each category within a node. Gini impurity may be defined as follows. If C is the set of classes to which data items can belong, and T is the current tree node, let f(1|T) be the proportion of training data items in node T that belong to class 1, f(2|T) the proportion of items belonging to class 2, etc. Then,
    i(T) = 1 - Σ_{c ∈ C} f(c|T)²
  • To illustrate by example, assume the system is growing a tree for the letter "E." In a given node T of that tree, the system may, for example, have 10 examples of how "E" is pronounced in words. In 5 of these examples, "E" is pronounced "iy" (the sound "ee" in "cheese"); in 3 of the examples "E" is pronounced "eh" (the sound of "e" in "bed"); and in the remaining 2 examples, "E" is "-" (i.e., silent as in "e" in "maple").
  • Assume the system is considering two possible yes-no questions, Q1 and Q2 that can be applied to the 10 examples. The items that answer "yes" to Q1 include four examples of "iy" and one example of "-" (the other five items answer "no" to Q1.) The items that answer "yes" to Q2 include three examples of "iy" and three examples of "eh" (the other four items answer "no" to Q2). Figure 6 diagrammatically compares these two cases.
  • The Gini criterion answers which question the system should choose for this node, Q1 or Q2. The Gini criterion for choosing the correct question is: find the question for which the drop in impurity in going from the parent node to its children is maximized. This impurity drop Δi is defined as Δi = i(T) - p_yes*i(yes) - p_no*i(no), where p_yes is the proportion of items going to the "yes" child and p_no is the proportion of items going to the "no" child.
  • Applying the Gini criterion to the above example:
    i(T) = 1 - 0.5² - 0.3² - 0.2² = 0.62
  • For Q1 we have: i(yes, Q1) = 1 - 0.8² - 0.2² = 0.32 and i(no, Q1) = 1 - 0.2² - 0.6² - 0.2² = 0.56. So Δi(Q1) = 0.62 - 0.5*0.32 - 0.5*0.56 = 0.18.
  • For Q2, we have i(yes, Q2) = 1 - 0.5² - 0.5² = 0.5, and likewise i(no, Q2) = 1 - 0.5² - 0.5² = 0.5. So Δi(Q2) = 0.62 - (0.6)*(0.5) - (0.4)*(0.5) = 0.12.
  • In this case, Q1 gave the greatest drop in impurity. It will therefore be chosen instead of Q2.
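The worked example above can be checked numerically with a direct transcription of the impurity formulas; the per-class count vectors are in the order (iy, eh, silent):

```python
def gini(counts):
    """Gini impurity of a node, computed from per-class example counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def impurity_drop(parent, yes, no):
    """Drop in impurity for a candidate question that splits `parent`
    into `yes` and `no` children."""
    p_yes = sum(yes) / sum(parent)
    return gini(parent) - p_yes * gini(yes) - (1 - p_yes) * gini(no)

# Parent node: 5 "iy", 3 "eh", 2 silent.
round(impurity_drop([5, 3, 2], [4, 0, 1], [1, 3, 1]), 2)   # Q1 → 0.18
round(impurity_drop([5, 3, 2], [3, 3, 0], [2, 0, 2]), 2)   # Q2 → 0.12
```

Q1 produces the larger drop (0.18 > 0.12), matching the conclusion in the text.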
  • The rule set 52 declares a best question for a node to be that question which brings about the greatest drop in impurity in going from the parent node to its children.
  • The tree generator applies the rules 52 to grow a decision tree of yes-no questions selected from set 50. The generator will continue to grow the tree until the optimal-sized tree has been grown. Rules 52 include a set of stopping rules that will terminate tree growth when the tree is grown to a predetermined size. In the preferred embodiment the tree is grown to a size larger than ultimately desired. Then pruning methods 53 are used to cut back the tree to its desired size. The pruning method may implement the Breiman technique as described in the reference cited above.
  • The tree generator thus generates sets of letter-only trees, shown generally at 60, or mixed trees, shown generally at 70, depending on whether the set of possible yes-no questions 50 includes letter-only questions alone or in combination with phoneme questions. The corpus of training data 42 comprises letter, phoneme pairs, as discussed above. In growing letter-only trees, only the letter portions of these pairs are used in populating the internal nodes. Conversely, when growing mixed trees, both the letter and phoneme components of the training data pairs may be used to populate internal nodes. In both instances the phoneme portions of the pairs are used to populate the leaf nodes. Probability data associated with the phoneme data in the leaf nodes are generated by counting the number of times a given phoneme is aligned with a given letter over the training data corpus.
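The leaf-node counting just described amounts to a relative-frequency estimate. A sketch, assuming the aligned training pairs that reach a given leaf are available as a list:

```python
from collections import Counter

def leaf_probabilities(aligned_pairs):
    """aligned_pairs: (letter, phoneme) training pairs that reached this
    leaf. Returns the phoneme probability table stored at the leaf, by
    relative frequency of each aligned phoneme."""
    counts = Counter(phoneme for _letter, phoneme in aligned_pairs)
    total = sum(counts.values())
    return {phoneme: count / total for phoneme, count in counts.items()}
```

Applied to the ten "E" examples from the Gini discussion (5 "iy", 3 "eh", 2 silent), this yields the table {'iy': 0.5, 'eh': 0.3, '-': 0.2}, the kind of data shown in the leaf nodes of Figure 2.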
  • The letter-to-pronunciation decision trees generated by the above-described method can be stored in memory for use in a variety of different speech-processing applications. While these applications are many and varied, a few examples will next be presented to better highlight some of the capabilities and advantages of these trees.
  • Figure 6 illustrates the use of both the letter-only trees and the mixed trees to generate pronunciations from spelled-word letter sequences. Although the illustrated embodiment employs both letter-only and mixed tree components together, other applications may use only one component and not the other. In the illustrated embodiment the set of letter-only trees are stored in memory at 80 and the mixed trees are stored in memory at 82. In many applications there will be one tree for each letter in the alphabet. Dynamic programming sequence generator 84 operates upon input sequence 86 to generate a pronunciation at 88 based on the letter-only trees 80. Essentially, each letter in the input sequence is considered individually and the applicable letter-only tree is used to select the most probable pronunciation for that letter. As explained above, the letter-only trees ask a series of yes-no questions about the given letter and its neighboring letters in the sequence. After all letters in the sequence have been considered, the resultant pronunciation is generated by concatenating the phonemes selected by the sequence generator.
  • To improve pronunciation, the mixed tree set 82 can be used. Whereas letter-only trees ask only questions about letters, the mixed trees can ask questions about letters and also about phonemes. Scorer 90 may receive phoneme information from the output of sequence generator 84. In this regard, sequence generator 84, using the letter-only trees 80, can generate a plurality of different pronunciations, sorting those pronunciations based on their respective probability scores. This sorted list of pronunciations may be stored at 92 for access by the scorer 90.
  • Scorer 90 receives as input the same input sequence 86 as was supplied to sequence generator 84. Scorer 90 applies the mixed-tree 82 questions to the sequence of letters, using data from store 92 when asked to respond to a phoneme question. The resulting output at 94 is typically a better pronunciation than provided at 88. The reason is that the mixed trees tend to filter out pronunciations that would not occur in natural speech. For example, the proper name Achilles would likely result in a pronunciation that phoneticizes both l's: ah-k-ih-l-l-iy-z. In natural speech, the second l is actually silent: ah-k-ih-l-iy-z.
  • If desired, scorer 90 can also produce a sorted list of n possible pronunciations as at 96. The scores associated with each pronunciation represent the composite of the individual probability scores assigned to each phoneme in the pronunciation. These scores can, themselves, be used in applications where dubious pronunciations need to be identified. For example, the phonetic transcription supplied by a team of lexicographers could be checked using the mixed trees to quickly identify any questionable pronunciations.
  • A Letter-to-Sound Pronunciation Generator
  • To illustrate the principles of the invention the exemplary embodiment of Figure 8 shows a two stage spelled letter-to-pronunciation generator. As will be explained more fully below, the mixed-decision tree approach of the invention can be used in a variety of different applications in addition to the pronunciation generator illustrated here. The two stage pronunciation generator has been selected for illustration because it highlights many aspects and benefits of the mixed-decision tree structure.
  • The two stage pronunciation generator includes a first stage 116 which preferably employs a set of letter-syntax-context-dialect decision trees 110 and a second stage 120 which employs a set of phoneme-mixed decision trees 112 which examine input sequence 114 at a phoneme level. The letter-syntax-context-dialect decision trees examine questions involving letters and their adjacent neighbors in a spelled word sequence (i.e., letter-related questions); other questions examine which words precede or follow a particular word (i.e., context-related questions); still other questions examine what part of speech the word has within a sentence as well as the syntax of other words in the sentence (i.e., syntax-related questions); still further questions examine which dialect is to be spoken. Preferably, a user selects which dialect is to be spoken by dialect selection device 150.
  • An alternate embodiment of the present invention includes using letter-related questions and at least one of the word-level characteristics (i.e., syntax-related questions or context-related questions). For example, one embodiment utilizes a set of letter-syntax decision trees for the first stage. Another embodiment utilizes a set of letter-context-dialect decision trees which do not examine syntax of the input sequence.
  • It should be understood that the present invention is not limited to words occurring in a sentence, but includes other linguistic constructs which exhibit syntax, such as fragmented sentences or phrases.
  • An input sequence 114, such as the sequence of letters of a sentence, is fed to the text-based pronunciation generator 116. For example, input sequence 114 could be the following sentence: "Did you know who read the autobiography?"
  • Syntax data 115 is an input to text-based pronunciation generator 116. This input provides the information that the text-based pronunciation generator 116 needs to correctly course through the letter-syntax-context-dialect decision trees 110. Syntax data 115 identifies the part of speech of each word in the input sequence 114. For example, the word "read" in the above input sequence example would be tagged as a verb (as opposed to a noun or an adjective) by syntax tagger software module 129. Syntax tagger software technology is available from such institutions as the University of Pennsylvania under project "Xtag." Moreover, the following reference discusses syntax tagger software technology: George Foster, "Statistical Lexical Disambiguation", Master's Thesis in Computer Science, McGill University, Montreal, Canada (November 11, 1991).
  • The text-based pronunciation generator 116 uses decision trees 110 to generate a list of pronunciations 118, representing possible pronunciation candidates of the spelled word input sequence. Each pronunciation (e.g., pronunciation A) of list 118 represents a pronunciation of input sequence 114 including preferably how each word is stressed. Moreover, the rate at which each word is spoken is determined in the preferred embodiment.
  • Sentence rate calculator software module 152 is utilized by text-based pronunciation generator 116 to determine how quickly each word should be spoken. For example, sentence rate calculator 152 examines the context of the sentence to determine if certain words in the sentence should be spoken at a faster or slower rate than normal. For example, a sentence with an exclamation marker at the end produces rate data which indicates that a predetermined number of words before the end of the sentence are to have a shorter duration than normal to better convey the impact of an exclamatory statement.
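A toy version of such a sentence rate calculator might look as follows; the window size and rate factors are illustrative assumptions, not values from the patent.

```python
def word_rates(words, shorten_last_n=3, fast=0.8, normal=1.0):
    """Toy sentence-rate calculator: if the sentence ends with '!',
    speak the last few words at a shorter duration (rate factor < 1.0).
    Returns one duration multiplier per word."""
    if not words or not words[-1].endswith("!"):
        return [normal] * len(words)
    rates = [normal] * len(words)
    # Shorten the last `shorten_last_n` words of an exclamation.
    for i in range(max(0, len(words) - shorten_last_n), len(words)):
        rates[i] = fast
    return rates

rates = word_rates(["What", "a", "great", "goal!"])
```

The same pattern extends to other punctuation-driven rate rules (e.g., a pause multiplier after commas).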
  • The text-based pronunciation generator 116 examines in order each letter and word in the sequence, applying the decision tree associated with that letter or word's syntax (or word's context) to select a phoneme pronunciation for that letter based on probability data contained in the decision tree. Preferably the set of decision trees 110 includes a decision tree for each letter in the alphabet and syntax of the language involved.
  • Figure 9 shows an example of a letter-syntax-context-dialect decision tree 140 applicable to the letter "E" in the word "READ." The decision tree comprises a plurality of internal nodes (illustrated as ovals in the Figure) and a plurality of leaf nodes (illustrated as rectangles in the Figure). Each internal node is populated with a yes-no question. Yes-no questions are questions that can be answered either yes or no. In the letter-syntax-context-dialect decision tree 140 these questions are directed to: a given letter (e.g., in this case the letter "E") and its neighboring letters in the input sequence; or the syntax of the word in the sentence (e.g., noun, verb, etc.); or the context and dialect of the sentence. Note in Figure 9 that each internal node branches either left or right depending on whether the answer to the associated question is yes or no.
  • Preferably, the first internal node inquires about the dialect to be spoken. Internal node 138 is representative of such an inquiry. If the southern dialect is to be spoken, then southern dialect decision tree 139 is coursed through, ultimately producing phoneme values at the leaf nodes which are more distinctive of a southern dialect.
  • The abbreviations used in Figure 9 are as follows: numbers in questions, such as "+1" or "-1" refer to positions in the spelling relative to the current letter. The symbol L represents a question about a letter and its neighboring letters. For example, "-1L=='R' or 'L'?" means "is the letter before the current letter (which is 'E') an 'L' or an 'R'?". Abbreviations 'CONS' and 'VOW' are classes of letters: consonant and vowel. The symbol '#' indicates a word boundary. The term 'tag(i)' denotes a question about the syntactic tag of the ith word, where i=0 denotes the current word, i=-1 denotes the preceding word, i=+1 denotes the following word, etc. Thus, "tag(0)==PRES?" means "is the current word a present-tense verb?".
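The traversal of such a tree can be sketched as follows. The node classes, predicate encoding, and the tiny tree for the letter "E" are illustrative assumptions modeled on the Figure 9 description, not the patent's actual data structures.

```python
class Leaf:
    """Leaf node: maps candidate phonemes to probabilities."""
    def __init__(self, probs):
        self.probs = probs

class Node:
    """Internal node: a yes-no question over the letter's context."""
    def __init__(self, question, yes, no):
        self.question = question
        self.yes, self.no = yes, no

def classify(node, ctx):
    """Walk from the root to a leaf, branching on each yes-no answer."""
    while isinstance(node, Node):
        node = node.yes if node.question(ctx) else node.no
    return node.probs

# Tiny tree for the letter 'E': "-1L=='R' or 'L'?" then "tag(0)==PRES?"
tree_E = Node(
    lambda c: c["letter-1"] in ("R", "L"),
    Node(lambda c: c["tag0"] == "PRES",
         Leaf({"iy": 1.0}),               # present tense: "READ" -> iy
         Leaf({"eh": 0.9, "iy": 0.1})),   # past tense: "read" -> eh
    Leaf({"eh": 0.5, "iy": 0.5}),
)

# The 'E' in present-tense "READ": preceding letter 'R', tag PRES.
probs = classify(tree_E, {"letter-1": "R", "tag0": "PRES"})
```

The context dictionary stands in for the spelled word, its neighbors, and the word's syntactic tag; a full system would hold one such tree per letter of the alphabet.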
  • The leaf nodes are populated with probability data that associate possible phoneme pronunciations with numeric values representing the probability that the particular phoneme represents the correct pronunciation of the given letter. The null phoneme, i.e., silence, is represented by the symbol '-'.
  • For example, the "E" in the present-tense verbs "READ" and "LEAD" is assigned its correct pronunciation, "iy" at leaf node 142 with probability 1.0 by the decision tree 140. The "E" in the past tense of "read" (e.g., "Who read a book") is assigned pronunciation "eh" at leaf node 144 with probability 0.9.
  • Decision trees 110 (of Figure 8) preferably include context-related questions. For example, a context-related question at an internal node may examine whether the word "you" is preceded by the word "did." In such a context, the "y" in "you" is typically pronounced in colloquial speech as "ja".
  • The present invention also generates prosody-indicative data, so as to convey stress, pitch, grave, or pause aspects when speaking a sentence. Syntax-related questions help to determine how the phoneme is to be stressed, pitched, or graved. For example, internal node 141 (of Figure 9) inquires whether the first word in the sentence is an interrogatory pronoun, such as "who" in the exemplary sentence "who read a book?" Since the first word in this example is an interrogatory pronoun, leaf node 144 with its phoneme stress is selected. Leaf node 146 illustrates the other option, where the phonemes are not stressed.
  • As another example, in an interrogative sentence, the phonemes of the last syllable of the last word in the sentence would carry a pitch mark so as to more naturally convey the questioning aspect of the sentence. As still another example, the present invention can accommodate natural pausing in speaking a sentence. The present invention captures such pausing detail by asking questions about punctuation, such as commas and periods.
  • The text-based pronunciation generator 116 (Fig. 8) thus uses decision trees 110 to construct one or more pronunciation hypotheses that are stored in list 118. Preferably each pronunciation has associated with it a numerical score arrived at by combining the probability scores of the individual phonemes selected using decision trees 110. Word pronunciations may be scored by constructing a matrix of possible combinations and then using dynamic programming to select the n-best candidates.
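The matrix-based selection can be sketched as a column-by-column search over per-letter phoneme candidates. This is an illustrative pruned (beam-style) stand-in for the dynamic programming the patent mentions; the matrix contents are hypothetical.

```python
from heapq import nlargest
import math

def n_best(matrix, n):
    """matrix: one list per letter of (phoneme, probability) candidates.
    Builds pronunciations column by column, keeping only the n
    highest-scoring partial sequences at each step, so the full
    combinatorial matrix is never enumerated."""
    beam = [(0.0, [])]  # (log score, phoneme sequence so far)
    for column in matrix:
        extended = [(score + math.log(p), seq + [ph])
                    for score, seq in beam
                    for ph, p in column]
        beam = nlargest(n, extended, key=lambda t: t[0])
    return [("-".join(seq), math.exp(score)) for score, seq in beam]

# Hypothetical candidate matrix for the word "read".
matrix = [
    [("r", 1.0)],
    [("iy", 0.6), ("eh", 0.4)],
    [("d", 1.0)],
]
top2 = n_best(matrix, 2)
```

Each returned pair carries the pronunciation string and its composite probability, which is the score stored alongside each entry of list 118.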
  • Alternatively, the n-best candidates may be selected using a substitution technique that first identifies the most probable word candidate and then generates additional candidates through iterative substitution, as follows. The pronunciation with the highest probability score is selected first, by multiplying the respective scores of the highest-scoring phonemes (identified by examining the leaf nodes) and then using this selection as the most probable candidate or first-best word candidate. Additional (n-best) candidates are then selected by examining the phoneme data in the leaf nodes again to identify the phoneme, not previously selected, that has the smallest difference from an initially selected phoneme. This minimally-different phoneme is then substituted for the initially selected one to thereby generate the second-best word candidate. The above process may be repeated iteratively until the desired number of n-best candidates have been selected. List 118 may be sorted in descending score order, so that the pronunciation judged the best by the letter-only analysis appears first in the list.
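The substitution technique can be sketched as follows. This is a greedy illustrative reading of the description above, with hypothetical data: the first-best candidate takes the top phoneme at every position, and each further candidate swaps in the unused alternative whose probability is closest to the phoneme it replaces.

```python
def substitution_n_best(matrix, n):
    """matrix: one list per letter of (phoneme, probability) candidates.
    Returns up to n pronunciations, best first, by iterative
    substitution of minimally-different phonemes."""
    cols = [sorted(col, key=lambda t: -t[1]) for col in matrix]
    best = [col[0] for col in cols]          # first-best candidate
    results = [[ph for ph, _ in best]]
    # Candidate swaps, ordered by how little probability they cost:
    # (probability drop, position, alternative phoneme).
    swaps = sorted(
        (best[i][1] - p, i, ph)
        for i, col in enumerate(cols)
        for ph, p in col[1:]
    )
    for _, i, ph in swaps[: n - 1]:
        candidate = [ph0 for ph0, _ in best]
        candidate[i] = ph                    # substitute one phoneme
        results.append(candidate)
    return ["-".join(seq) for seq in results]

matrix = [
    [("r", 1.0)],
    [("iy", 0.6), ("eh", 0.4)],
    [("d", 1.0)],
]
two_best = substitution_n_best(matrix, 2)
```

A full implementation would also consider substituting into already-generated candidates when iterating beyond the second-best, as the description's "repeated iteratively" suggests.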
  • Decision trees 110 frequently produce only moderately successful results. This is because these decision trees have no way of determining at each letter what phoneme will be generated by subsequent letters. Thus decision trees 110 can generate a high scoring pronunciation that actually would not occur in natural speech. For example, the proper name Achilles would likely result in a pronunciation that phoneticizes both l's: ah-k-ih-l-l-iy-z. In natural speech, the second l is actually silent: ah-k-ih-l-iy-z. The pronunciation generator using decision trees 110 has no mechanism to screen out word pronunciations that would never occur in natural speech.
  • The second stage 120 of the pronunciation system 108 addresses the above problem. A phoneme-mixed tree score estimator 120 uses the set of phoneme-mixed decision trees 112 to assess the viability of each pronunciation in list 118. The score estimator 120 works by sequentially examining each letter in the input sequence 114 along with the phonemes assigned to each letter by text-based pronunciation generator 116.
  • The phoneme-mixed tree score estimator 120 rescores each of the pronunciations in list 118 based on the phoneme-mixed tree questions 112 and using the probability data in the leaf nodes of the mixed trees. If desired, the list of pronunciations may be stored in association with the respective score as in list 122. If desired, list 122 can be sorted in descending order so that the first listed pronunciation is the one with the highest score.
  • In many instances the pronunciation occupying the highest score position in list 122 will be different from the pronunciation occupying the highest score position in list 118. This occurs because the phoneme-mixed tree score estimator 120, using the phoneme-mixed trees 112, screens out those pronunciations that do not contain self-consistent phoneme sequences or otherwise represent pronunciations that would not occur in natural speech.
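The second-stage rescoring can be sketched as follows. The mixed-tree lookup is replaced here by a toy function that sees each phoneme's neighbors; all names and probabilities are illustrative assumptions, not the patent's trained trees.

```python
import math

def rescore(pronunciations, mixed_tree_prob):
    """Re-rank first-stage pronunciations with a scorer that can see
    each phoneme's neighbors. `mixed_tree_prob(left, ph, right)`
    stands in for a phoneme-mixed tree leaf lookup."""
    rescored = []
    for phonemes in pronunciations:
        padded = ["#"] + phonemes + ["#"]    # '#' marks word boundary
        logp = sum(
            math.log(mixed_tree_prob(padded[i - 1], padded[i], padded[i + 1]))
            for i in range(1, len(padded) - 1)
        )
        rescored.append((logp, "-".join(phonemes)))
    rescored.sort(reverse=True)              # highest score first
    return [pron for _, pron in rescored]

# Toy stand-in: a doubled 'l' is implausible, everything else neutral.
def toy_prob(left, ph, right):
    return 0.05 if ph == "l" and left == "l" else 0.9

ranked = rescore(
    [["ah", "k", "ih", "l", "l", "iy", "z"],   # both l's pronounced
     ["ah", "k", "ih", "l", "iy", "z"]],       # second l silent
    toy_prob,
)
```

Because the scorer penalizes the phoneme sequence with the doubled l, the ranking flips relative to a letter-only first stage, which is exactly the effect described for list 122 versus list 118.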
  • In the preferred embodiment, phoneme-mixed tree score estimator 120 utilizes sentence rate calculator 152 in order to determine rate data for the pronunciations in list 122. Moreover, estimator 120 utilizes phoneme-mixed trees that allow questions about dialect to be examined and that also allow questions to determine stress and other prosody aspects at the leaf nodes in a manner similar to the aforementioned approach.
  • If desired a selector module 124 can access list 122 to retrieve one or more of the pronunciations in the list. Typically selector 124 retrieves the pronunciation with the highest score and provides this as the output pronunciation 126.
  • As noted above, the pronunciation generator depicted in Figure 8 represents only one possible embodiment employing the mixed tree approach of the invention. In an alternate embodiment, the output pronunciation or pronunciations selected from list 122 can be used to form pronunciation dictionaries for both speech recognition and speech synthesis applications. In the speech recognition context, the pronunciation dictionary may be used during the recognizer training phase by supplying pronunciations for words that are not already found in the recognizer lexicon. In the synthesis context the pronunciation dictionaries may be used to generate phoneme sounds for concatenated playback. The system may be used, for example, to augment the features of an E-mail reader or other text-to-speech application.
  • The mixed-tree scoring system (i.e., letter, syntax, context, and phoneme) of the invention can be used in a variety of applications where a single one or list of possible pronunciations is desired. For example, in a dynamic on-line language learning system, a user types a sentence, and the system provides a list of possible pronunciations for the sentence, in order of probability. The scoring system can also be used as a user feedback tool for language learning systems. A language learning system with speech recognition capability is used to display a spelled sentence and to analyze the speaker's attempts at pronouncing that sentence in the new language. The system indicates to the user how probable or improbable his or her pronunciation is for that sentence.
  • While the invention has been described in its presently preferred form, it will be understood that there are numerous applications for the mixed-tree pronunciation system. Accordingly, the invention is capable of certain modifications and changes without departing from the scope of the invention as set forth in the appended claims.
  • The technical effect of the present invention may be realised by a suitably programmed computer and the present invention also provides a computer program product comprising a computer readable storage medium having recorded thereon computer interpretable or compilable code that, when loaded onto a suitable computer and executed, will realise the technical effect. The present invention also encompasses such code itself.

Claims (23)

  1. An apparatus for generating at least one phonetic pronunciation for an input sequence of letters selected from a predetermined alphabet, comprising:
    a memory for storing a plurality of letter-only decision trees corresponding to said alphabet,
    said letter-only decision trees having internal nodes representing yes-no questions about a given letter and its neighboring letters in a given sequence;
    said memory further storing a plurality of mixed decision trees corresponding to said alphabet,
    said mixed decision trees having a first plurality of internal nodes representing yes-no questions about a given letter and its neighboring letters in said given sequence and having a second plurality of internal nodes representing yes-no questions about a phoneme and its neighboring phonemes in said given sequence,
    said letter-only decision trees and said mixed decision trees further having leaf nodes representing probability data that associates said given letter with a plurality of phoneme pronunciations;
    a phoneme sequence generator coupled to said letter-only decision tree for processing an input sequence of letters and generating a first set of phonetic pronunciations corresponding to said input sequence of letters;
    a score estimator coupled to said mixed decision tree for processing said first set to generate a second set of scored phonetic pronunciations, the scored phonetic pronunciations representing at least one phonetic pronunciation of said input sequence.
  2. The apparatus of claim 1 wherein said second set comprises a plurality of pronunciations each with an associated score derived from said probability data and further comprising a pronunciation selector receptive of said second set and operable to select one pronunciation from said second set based on said associated score.
  3. The apparatus of claim 1 or 2 wherein said phoneme sequence generator produces a predetermined number of different pronunciations corresponding to a given input sequence.
  4. The apparatus of claim 3 wherein said phoneme sequence generator produces a predetermined number of different pronunciations representing the n-best pronunciations according to said probability data.
  5. The apparatus of claim 4 wherein said score estimator rescores said n-best pronunciations based on said mixed decision trees.
  6. The apparatus of any one of claims 1 to 5 wherein said sequence generator constructs a matrix of possible phoneme combinations representing different pronunciations.
  7. The apparatus of claim 6 wherein said sequence generator selects the n-best phoneme combinations from said matrix using dynamic programming.
  8. The apparatus of claim 6 wherein said sequence generator selects the n-best phoneme combinations from said matrix by iterative substitution.
  9. The apparatus of any one of claims 1 to 8 further comprising a speech recognition system having a pronunciation dictionary used for recognizer training and wherein at least a portion of said second set populates said dictionary to supply pronunciations for words based on their spelling.
  10. The apparatus of any one of claims 1 to 9 further comprising a speech synthesis system receptive of at least a portion of said second set for generating an audible synthesized pronunciation of words based on their spelling.
  11. The apparatus of claim 10 wherein said speech synthesis system is incorporated into an e-mail reader.
  12. The apparatus of claim 10 wherein said speech synthesis system is incorporated into a dictionary for providing a list of possible pronunciations in order of probability.
  13. The apparatus of any one of claims 1 to 10 further comprising a language learning system that displays a spelled word and analyzes a speaker's attempt at pronouncing that word using at least one of said letter-only decision tree and said mixed decision tree to tell the speaker how probable his or her pronunciation was for that word.
  14. A method for processing spelling-to-pronunciation data, comprising the steps of:
    providing a first set of yes-no questions about letters and their relationship to neighboring letters in an input sequence;
    providing a second set of yes-no questions about phonemes and their relationship to neighboring phonemes in an input sequence;
    providing a corpus of training data representing a plurality of different sets of pairs each pair containing a letter sequence and a phoneme sequence, said letter sequence selected from an alphabet;
    using said first and second sets and said training data to generate decision trees for at least a portion of said alphabet, said decision trees each having a plurality of internal nodes and a plurality of leaf nodes;
    populating said internal nodes with questions selected from said first and second sets; and
    populating said leaf nodes with probability data that associates said portion of said alphabet with a plurality of phoneme pronunciations based on said training data.
  15. The method of claim 14 further comprising providing said corpus of training data as aligned letter sequence-phoneme sequence pairs.
  16. The method of claim 14 or 15 wherein said step of providing a corpus of training data further comprises providing a plurality of input sequences containing sequences of phonemes representing pronunciation of words formed by said sequence of letters; and aligning selected ones of said phonemes with selected ones of said letters to define aligned letter-phoneme pairs.
  17. The method of claim 14, 15 or 16 further comprising supplying an input string of letters with at least one associated phoneme pronunciation and using said decision trees to score said pronunciation based on said probability data.
  18. The method of claim 14, 15 or 16 further comprising supplying an input string of letters with a plurality of associated phoneme pronunciations and using decision trees to select one of said plurality of pronunciations based on said probability data.
  19. The method of claim 14, 15 or 16 further comprising supplying an input string of letters representing a word with a plurality of associated phoneme pronunciations and using said decision trees to generate a phonetic transcription of said word based on said probability data.
  20. The method of claim 19 further comprising using said phonetic transcription to populate a dictionary associated with a speech recognizer.
  21. The method of claim 14, 15 or 16 further comprising supplying an input string of letters representing a word with a plurality of associated phoneme pronunciations and using decision trees to assign a numerical score to each one of said plurality of pronunciations.
  22. An apparatus for generating at least one phonetic pronunciation for an input sequence of letters selected from a predetermined alphabet, said sequence of letters forming words which substantially adhere to a predetermined syntax, said apparatus comprising:
    an input device for receiving syntax data indicative of the syntax of said words in said input sequence;
    a computer storage device for storing a plurality of text-based decision trees having questions indicative of predetermined characteristics of said input sequence,
    said predetermined characteristics including letter-related questions about said input sequence, said predetermined characteristics also including characteristics selected from the group consisting of syntax-related questions, context-related questions, dialect-related questions or combinations thereof,
    said text-based decision trees having internal nodes representing questions about predetermined characteristics of said input sequence;
    said text-based decision trees further having leaf nodes representing probability data that associates each of said letters with a plurality of phoneme pronunciations; and
    a text-based pronunciation generator connected to said text-based decision trees for processing said input sequence of letters and generating a first set of phonetic pronunciations corresponding to said input sequence of letters based upon said text-based decision trees.
  23. The apparatus of claim 22 further comprising:
       a phoneme-mixed tree score estimator connected to said text-based pronunciation generator for processing said first set to generate a second set of scored phonetic pronunciations, the scored phonetic pronunciations representing at least one phonetic pronunciation of said input sequence.
EP99303390A 1998-04-29 1999-04-29 Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word Expired - Lifetime EP0953970B1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US69308 1993-05-28
US09/069,308 US6230131B1 (en) 1998-04-29 1998-04-29 Method for generating spelling-to-pronunciation decision tree
US09/067,764 US6016471A (en) 1998-04-29 1998-04-29 Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word
US67764 1998-04-29
US70300 1998-04-30
US09/070,300 US6029132A (en) 1998-04-30 1998-04-30 Method for letter-to-sound in text-to-speech synthesis

Publications (3)

Publication Number Publication Date
EP0953970A2 true EP0953970A2 (en) 1999-11-03
EP0953970A3 EP0953970A3 (en) 2000-01-19
EP0953970B1 EP0953970B1 (en) 2004-03-03

Family

ID=27371225

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99303390A Expired - Lifetime EP0953970B1 (en) 1998-04-29 1999-04-29 Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word

Country Status (7)

Country Link
EP (1) EP0953970B1 (en)
JP (1) JP3481497B2 (en)
KR (1) KR100509797B1 (en)
CN (1) CN1118770C (en)
AT (1) ATE261171T1 (en)
DE (1) DE69915162D1 (en)
TW (1) TW422967B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000054254A1 (en) * 1999-03-08 2000-09-14 Siemens Aktiengesellschaft Method and array for determining a representative phoneme
EP1221693A2 (en) * 2001-01-05 2002-07-10 Matsushita Electric Industries Co., Ltd. Prosody template matching for text-to-speech systems
EP1638080A3 (en) * 2004-08-11 2006-07-26 International Business Machines Corporation A text-to-speech system and method
US7124083B2 (en) 2000-06-30 2006-10-17 At&T Corp. Method and system for preselection of suitable units for concatenative speech
US7146319B2 (en) * 2003-03-31 2006-12-05 Novauris Technologies Ltd. Phonetically based speech recognition system and method
US7565291B2 (en) 2000-07-05 2009-07-21 At&T Intellectual Property Ii, L.P. Synthesis-based pre-selection of suitable units for concatenative speech
US7725309B2 (en) 2005-06-06 2010-05-25 Novauris Technologies Ltd. System, method, and technique for identifying a spoken utterance as a member of a list of known items allowing for variations in the form of the utterance
CN101452701B (en) * 2007-12-05 2011-09-07 株式会社东芝 Confidence degree estimation method and device based on inverse model
US20140365515A1 (en) * 2013-06-10 2014-12-11 Google Inc. Evaluation of substitution contexts
US9336771B2 (en) 2012-11-01 2016-05-10 Google Inc. Speech recognition using non-parametric models
US9858922B2 (en) 2014-06-23 2018-01-02 Google Inc. Caching speech recognition scores
US10204619B2 (en) 2014-10-22 2019-02-12 Google Llc Speech recognition using associative mapping
WO2020082992A1 (en) * 2018-10-25 2020-04-30 陈逸天 Method for learning word by using historical spelling experience, apparatus and electronic device
WO2022246782A1 (en) * 2021-05-28 2022-12-01 Microsoft Technology Licensing, Llc Method and system of detecting and improving real-time mispronunciation of words

Families Citing this family (15)

Publication number Priority date Publication date Assignee Title
WO2001048737A2 (en) * 1999-12-23 2001-07-05 Intel Corporation Speech recognizer with a lexical tree based n-gram language model
WO2002029612A1 (en) * 2000-09-30 2002-04-11 Intel Corporation Method and system for generating and searching an optimal maximum likelihood decision tree for hidden markov model (hmm) based speech recognition
CN100445046C (en) * 2000-10-13 2008-12-24 索尼公司 Robot device and behavior control method for robot device
US20040078191A1 (en) * 2002-10-22 2004-04-22 Nokia Corporation Scalable neural network-based language identification from written text
FI118062B (en) * 2003-04-30 2007-06-15 Nokia Corp Decision tree with a sparse memory
US7558389B2 (en) * 2004-10-01 2009-07-07 At&T Intellectual Property Ii, L.P. Method and system of generating a speech signal with overlayed random frequency signal
JP2009525492A (en) * 2005-08-01 2009-07-09 一秋 上川 A system of expression and pronunciation techniques for English sounds and other European sounds
JP4769223B2 (en) * 2007-04-26 2011-09-07 旭化成株式会社 Text phonetic symbol conversion dictionary creation device, recognition vocabulary dictionary creation device, and speech recognition device
KR101250897B1 (en) * 2009-08-14 2013-04-04 한국전자통신연구원 Apparatus for word entry searching in a portable electronic dictionary and method thereof
US20110238412A1 (en) * 2010-03-26 2011-09-29 Antoine Ezzat Method for Constructing Pronunciation Dictionaries
WO2013003772A2 (en) * 2011-06-30 2013-01-03 Google Inc. Speech recognition using variable-length context
US9741339B2 (en) * 2013-06-28 2017-08-22 Google Inc. Data driven word pronunciation learning and scoring with crowd sourcing based on the word's phonemes pronunciation scores
JP6234134B2 (en) * 2013-09-25 2017-11-22 三菱電機株式会社 Speech synthesizer
CN107767858B (en) * 2017-09-08 2021-05-04 科大讯飞股份有限公司 Pronunciation dictionary generating method and device, storage medium and electronic equipment
KR102605159B1 (en) * 2020-02-11 2023-11-23 주식회사 케이티 Server, method and computer program for providing voice recognition service

Citations (1)

Publication number Priority date Publication date Assignee Title
EP0562138A1 (en) * 1992-03-25 1993-09-29 International Business Machines Corporation Method and apparatus for the automatic generation of Markov models of new words to be added to a speech recognition vocabulary

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US4852173A (en) * 1987-10-29 1989-07-25 International Business Machines Corporation Design and construction of a binary-tree system for language modelling
KR100355393B1 (en) * 1995-06-30 2002-12-26 삼성전자 주식회사 Phoneme length deciding method in voice synthesis and method of learning phoneme length decision tree
JP3627299B2 (en) * 1995-07-19 2005-03-09 ソニー株式会社 Speech recognition method and apparatus
US5758024A (en) * 1996-06-25 1998-05-26 Microsoft Corporation Method and system for encoding pronunciation prefix trees

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
EP0562138A1 (en) * 1992-03-25 1993-09-29 International Business Machines Corporation Method and apparatus for the automatic generation of Markov models of new words to be added to a speech recognition vocabulary

Non-Patent Citations (1)

Title
ANDERSEN O ET AL: "Comparison of two tree-structured approaches for grapheme-to-phoneme conversion" PROCEEDINGS ICSLP 96. FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (CAT. NO.96TH8206), PROCEEDING OF FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING. ICSLP '96, PHILADELPHIA, PA, USA, 3-6 OCT. 1996, pages 1700-1703 vol.3, XP002123689 1996, New York, NY, USA, IEEE, USA ISBN: 0-7803-3555-4 *

Cited By (24)

Publication number Priority date Publication date Assignee Title
US6430532B2 (en) 1999-03-08 2002-08-06 Siemens Aktiengesellschaft Determining an adequate representative sound using two quality criteria, from sound models chosen from a structure including a set of sound models
WO2000054254A1 (en) * 1999-03-08 2000-09-14 Siemens Aktiengesellschaft Method and array for determining a representative phoneme
US8224645B2 (en) 2000-06-30 2012-07-17 At+T Intellectual Property Ii, L.P. Method and system for preselection of suitable units for concatenative speech
US7460997B1 (en) 2000-06-30 2008-12-02 At&T Intellectual Property Ii, L.P. Method and system for preselection of suitable units for concatenative speech
US8566099B2 (en) 2000-06-30 2013-10-22 At&T Intellectual Property Ii, L.P. Tabulating triphone sequences by 5-phoneme contexts for speech synthesis
US7124083B2 (en) 2000-06-30 2006-10-17 At&T Corp. Method and system for preselection of suitable units for concatenative speech
US7565291B2 (en) 2000-07-05 2009-07-21 At&T Intellectual Property Ii, L.P. Synthesis-based pre-selection of suitable units for concatenative speech
US6845358B2 (en) 2001-01-05 2005-01-18 Matsushita Electric Industrial Co., Ltd. Prosody template matching for text-to-speech systems
EP1221693A2 (en) * 2001-01-05 2002-07-10 Matsushita Electric Industries Co., Ltd. Prosody template matching for text-to-speech systems
EP1221693A3 (en) * 2001-01-05 2004-02-04 Matsushita Electric Industries Co., Ltd. Prosody template matching for text-to-speech systems
US7146319B2 (en) * 2003-03-31 2006-12-05 Novauris Technologies Ltd. Phonetically based speech recognition system and method
EP1638080A3 (en) * 2004-08-11 2006-07-26 International Business Machines Corporation A text-to-speech system and method
US7725309B2 (en) 2005-06-06 2010-05-25 Novauris Technologies Ltd. System, method, and technique for identifying a spoken utterance as a member of a list of known items allowing for variations in the form of the utterance
CN101452701B (en) * 2007-12-05 2011-09-07 株式会社东芝 Confidence degree estimation method and device based on inverse model
US9336771B2 (en) 2012-11-01 2016-05-10 Google Inc. Speech recognition using non-parametric models
US9875295B1 (en) 2013-06-10 2018-01-23 Goolge Inc. Evaluation of substitution contexts
US20140365455A1 (en) * 2013-06-10 2014-12-11 Google Inc. Evaluation of substitution contexts
US9384303B2 (en) * 2013-06-10 2016-07-05 Google Inc. Evaluation of substitution contexts
US9483581B2 (en) * 2013-06-10 2016-11-01 Google Inc. Evaluation of substitution contexts
US20140365515A1 (en) * 2013-06-10 2014-12-11 Google Inc. Evaluation of substitution contexts
US9858922B2 (en) 2014-06-23 2018-01-02 Google Inc. Caching speech recognition scores
US10204619B2 (en) 2014-10-22 2019-02-12 Google Llc Speech recognition using associative mapping
WO2020082992A1 (en) * 2018-10-25 2020-04-30 陈逸天 Method for learning word by using historical spelling experience, apparatus and electronic device
WO2022246782A1 (en) * 2021-05-28 2022-12-01 Microsoft Technology Licensing, Llc Method and system of detecting and improving real-time mispronunciation of words

Also Published As

Publication number Publication date
TW422967B (en) 2001-02-21
EP0953970B1 (en) 2004-03-03
JP3481497B2 (en) 2003-12-22
ATE261171T1 (en) 2004-03-15
CN1233803A (en) 1999-11-03
EP0953970A3 (en) 2000-01-19
KR19990083555A (en) 1999-11-25
DE69915162D1 (en) 2004-04-08
CN1118770C (en) 2003-08-20
KR100509797B1 (en) 2005-08-23
JPH11344990A (en) 1999-12-14

Similar Documents

Publication Publication Date Title
EP0953970B1 (en) Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word
US6029132A (en) Method for letter-to-sound in text-to-speech synthesis
US6016471A (en) Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word
US6363342B2 (en) System for developing word-pronunciation pairs
US6233553B1 (en) Method and system for automatically determining phonetic transcriptions associated with spelled words
Black et al. Issues in building general letter to sound rules
Pagel et al. Letter to sound rules for accented lexicon compression
US6230131B1 (en) Method for generating spelling-to-pronunciation decision tree
JP5014785B2 (en) Phonetic-based speech recognition system and method
US7418389B2 (en) Defining atom units between phone and syllable for TTS systems
US6684187B1 (en) Method and system for preselection of suitable units for concatenative speech
US6411932B1 (en) Rule-based learning of word pronunciations from training corpora
US8069045B2 (en) Hierarchical approach for the statistical vowelization of Arabic text
US6134528A (en) Method device and article of manufacture for neural-network based generation of postlexical pronunciations from lexical pronunciations
Watts Unsupervised learning for text-to-speech synthesis
EP1668628A1 (en) Method for synthesizing speech
Goronzy Robust adaptation to non-native accents in automatic speech recognition
Amrouche et al. Design and Implementation of a Diacritic Arabic Text-To-Speech System.
Pearson et al. Automatic methods for lexical stress assignment and syllabification.
Dutoit et al. TTSBOX: A MATLAB toolbox for teaching text-to-speech synthesis
Chinathimatmongkhon et al. Implementing Thai text-to-speech synthesis for hand-held devices
Akinwonmi Development of a prosodic read speech syllabic corpus of the Yoruba language
Hendessi et al. A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM
Khalil et al. Optimization of Arabic database and an implementation for Arabic speech synthesis system using HMM: HTS_ARAB_TALK
Toma et al. Automatic rule-based syllabication for Romanian

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

17P Request for examination filed

Effective date: 20000515

AKX Designation fees paid

Free format text: AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

17Q First examination report despatched

Effective date: 20020712

RIC1 Information provided on ipc code assigned before grant

Ipc: 7G 10L 13/08 A

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: NL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040303

Ref country code: LI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040303

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT;WARNING: LAPSES OF ITALIAN PATENTS WITH EFFECTIVE DATE BEFORE 2007 MAY HAVE OCCURRED AT ANY TIME BEFORE 2007. THE CORRECT EFFECTIVE DATE MAY BE DIFFERENT FROM THE ONE RECORDED.

Effective date: 20040303

Ref country code: FR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040303

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040303

Ref country code: CY

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040303

Ref country code: CH

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040303

Ref country code: BE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040303

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040303

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REF Corresponds to:

Ref document number: 69915162

Country of ref document: DE

Date of ref document: 20040408

Kind code of ref document: P

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20040429

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20040429

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MC

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20040430

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040603

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040603

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040603

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040604

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20040614

NLV1 Nl: lapsed or annulled due to failure to fulfill the requirements of art. 29p and 29m of the patents act
REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

GBPC Gb: european patent ceased through non-payment of renewal fee

Effective date: 20040603

REG Reference to a national code

Ref country code: IE

Ref legal event code: MM4A

EN Fr: translation not filed
26N No opposition filed

Effective date: 20041206

REG Reference to a national code

Ref country code: GB

Ref legal event code: 728V

REG Reference to a national code

Ref country code: GB

Ref legal event code: 728Y

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: PT

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20040803

REG Reference to a national code

Ref country code: GB

Ref legal event code: 732E

Free format text: REGISTERED BETWEEN 20140612 AND 20140618

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20180329

Year of fee payment: 20

REG Reference to a national code

Ref country code: GB

Ref legal event code: PE20

Expiry date: 20190428

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF EXPIRATION OF PROTECTION

Effective date: 20190428