USRE42647E1 - Text-to-speech conversion system for synchronizing between synthesized speech and a moving picture in a multimedia environment and a method of the same - Google Patents

Info

Publication number
USRE42647E1
Authority
US
United States
Prior art keywords
information
phoneme
prosody
synchronization
lip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US10/193,594
Inventor
Jung Chul Lee
Min Soo Hahn
Hang Seop Lee
Jae Woo Yang
YoungJik Lee
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electronics and Telecommunications Research Institute ETRI
Original Assignee
Electronics and Telecommunications Research Institute ETRI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electronics and Telecommunications Research Institute ETRI filed Critical Electronics and Telecommunications Research Institute ETRI
Priority to US10/193,594
Application granted
Publication of USRE42647E1
Anticipated expiration
Legal status: Expired - Lifetime (Current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G10L2021/105 - Synthesis of the lips movements from speech, e.g. for talking heads

Abstract

The present invention provides a text-to-speech conversion system (TTS) for synchronizing with multimedia and a method for organizing input data of the TTS which can enhance the naturalness of synthesized speech and accomplish the synchronization of multimedia with TTS by defining additional prosody information, the information required to synchronize TTS with multimedia, and an interface between this information and TTS for use in the production of the synthesized speech.

Description

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a text-to-speech conversion system (hereinafter referred to as TTS) for synchronizing synthesized speech and a moving picture, and a method for organizing input data of the same, and more particularly to a TTS for synchronizing synthesized speech and a moving picture and a method for organizing its input data which enhance the naturalness of the synthesized speech and accomplish synchronization between multimedia and TTS by defining additional prosody information, the information required to synchronize TTS with multimedia, and an interface between this information and TTS for use in the production of the synthesized speech.
2. Description of the Related Art
Generally, the function of a speech synthesizer is to provide different forms of information to a computer user. To this end, the speech synthesizer should provide the user with high-quality synthesized speech from a given text. Preferably, the speech synthesizer should also produce synthesized speech that is synchronized with video data such as a moving picture or animation, whether from a database produced in a multimedia environment or from a variety of media provided by a conversation partner. In particular, the synchronization of TTS in the multimedia environment is essential to provide the user with high-quality service.
As shown in FIG. 1, a conventional TTS typically goes through the following three steps to produce synthesized speech from an inputted text.
In a first step, a language processor 1 converts the text into a series of phonemes, estimates prosody information, and symbolizes this information. The symbolic prosody information is estimated from phrase and paragraph boundaries, the location of accents within words, the sentence pattern, and so on, using the result of syntax analysis.
In a second step, a prosody processor 2 calculates the values of the prosody control parameters from the symbolized prosody information using rules and tables. The prosody control parameters include the duration of each phoneme, the pitch contour, the energy contour, and pause interval information.
In a third step, a signal processor 3 produces the synthesized speech using a synthesis unit database 4 and the prosody control parameters. In other words, the conventional TTS must estimate all the information needed for natural-sounding speech in the language processor 1 and the prosody processor 2 from the inputted text alone.
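As a rough illustration only, this three-stage flow can be sketched as follows; the function names, data shapes, and placeholder rules are hypothetical and merely stand in for the language processor, prosody processor, and signal processor described above.

from dataclasses import dataclass, field

@dataclass
class Prosody:
    phonemes: list          # phoneme stream produced by the language processor
    durations_ms: list = field(default_factory=list)    # per-phoneme duration
    pitch_contour: list = field(default_factory=list)   # per-phoneme pitch values
    energy_contour: list = field(default_factory=list)  # per-phoneme energy values

def language_processor(text):
    # Step 1: text -> phoneme stream plus symbolic prosody (placeholder phonemization).
    return Prosody(phonemes=[ch for ch in text if not ch.isspace()])

def prosody_processor(p):
    # Step 2: symbolic prosody -> numeric prosody control parameters
    # (a real TTS would consult rules and tables here).
    p.durations_ms = [80] * len(p.phonemes)
    p.pitch_contour = [120.0] * len(p.phonemes)
    p.energy_contour = [60.0] * len(p.phonemes)
    return p

def signal_processor(p):
    # Step 3: prosody control parameters + synthesis unit database -> waveform
    # (placeholder: silence of the requested total duration).
    return b"\x00" * sum(p.durations_ms)

speech = signal_processor(prosody_processor(language_processor("hello world")))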
Further, the conventional TTS has only a simple function to output data inputted sentence by sentence as synthesized speech. Accordingly, in order to output, in succession, sentences stored in a file or sentences inputted through a communication network as synthesized speech, a main control program that reads sentences from the inputted data and transmits them to the TTS input is required. Such a main control program may separate the text from the inputted data and output the synthesized speech once from beginning to end, produce the synthesized speech in concert with a text editor, or look up sentences through a graphic interface and produce the synthesized speech; in every case, however, the input to which these methods apply is restricted to text.
At present, studies on TTS have advanced considerably for the vernacular languages of different countries, and commercial use has been achieved in some of them. However, such systems are used only to synthesize speech from inputted text. Moreover, since prior organizations of input data make it impossible to derive, from the text alone, the information required when a moving picture is to be dubbed using TTS or when the synthesized speech is to be naturally interlocked with multimedia such as animation, there has been no way to realize these functions. Furthermore, there has been no study on the use and organization of additional data for enhancing the naturalness of the synthesized speech.
SUMMARY OF THE INVENTION
Therefore, it is an object of the present invention to provide a text-to-speech conversion system (TTS) for synchronizing synthesized speech with a moving picture and a method for organizing input data therefor, which enhance the naturalness of synthesized speech and synchronize multimedia with TTS by defining additional prosody information, the information required to synchronize TTS with multimedia, and an interface between such information and TTS for producing synthesized speech.
In order to accomplish the above object, a TTS for interlocking with multimedia according to the present invention comprises: a multimedia information input unit for organizing text, prosody information, information on synchronization with a moving picture, lip-shape information, and individual property information; a data distributor for distributing the information of the multimedia information input unit by media; a language processor for converting the text distributed by the data distributor into a phoneme stream, estimating prosody information, and symbolizing that information; a prosody processor for calculating prosody control parameter values from the symbolized prosody information using rules and tables; a synchronization adjuster for adjusting the duration of each phoneme using the synchronization information distributed by the data distributor; a signal processor for producing synthesized speech using the prosody control parameters and data in a synthesis unit database; and a picture output apparatus for outputting the picture information distributed by the data distributor onto a screen.
In order to accomplish the above object, a method for organizing input data of a text-to-speech conversion system (TTS) for interlocking with multimedia comprises the steps of: classifying the multimedia input information, organized for enhancing the naturalness of synthesized speech and implementing the synchronization of multimedia with TTS, into text, prosody information, information on synchronization with a moving picture, lip-shape information, and individual property information in a multimedia information input unit; distributing the information classified in the multimedia information input unit by media in a data distributor; converting the text distributed by the data distributor into a phoneme stream, estimating prosody information, and symbolizing that information in a language processor; calculating the values of any prosody control parameters not included in the multimedia information in a prosody processor; adjusting the duration of each phoneme in a synchronization adjuster so that the processing result of the prosody processor is synchronized with the picture signal according to the inputted synchronization information; producing the synthesized speech in a signal processor using the prosody information from the data distributor, the processing result of the synchronization adjuster, and a synthesis unit database; and outputting the picture information distributed by the data distributor onto a screen in a picture output apparatus.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, features, and aspects of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings.
FIG. 1 is a constructional view of a conventional text-to-speech conversion system.
FIG. 2 is a constructional view of the hardware to which the present invention is applied.
FIG. 3 is a constructional view of a text-to-speech conversion system according to the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Now, the present invention will be described in detail by way of the preferred embodiment.
Referring to FIG. 2, a constructional view of the hardware to which the present invention is applied is shown. In FIG. 2, the hardware consists of a multimedia data input unit 5, a central processing unit 6, a synthesis database 7, a digital-to-analog (D/A) converter 8, and a picture output apparatus 9.
The multimedia data input unit 5 receives data composed of multimedia such as pictures and text and outputs this data to the central processing unit 6.
The central processing unit 6 distributes the multimedia data input of the present invention, adjusts synchronization, and executes the algorithms thereon to produce the synthesized speech.
The synthesis database 7 is a database used in the algorithm for producing the synthesized speech. This synthesis database 7 is stored in a storage device and transmits necessary data to the central processing unit 6.
The digital-to-analog (D/A) converter 8 converts the synthesized digital data into an analog signal and outputs the analog signal.
The picture output apparatus 9 outputs inputted picture information onto a screen.
Tables 1 and 2 show the syntax of the organized multimedia input information, which consists of text, prosody, information on synchronization with a moving picture, lip-shape, and individual property information.
TABLE 1
Syntax
TTS_Sequence( ) {
TTS_Sequence_Start_Code
TTS_Sentence_ID
Language_Code
Prosody_Enable
Video_Enable
Lip_Shape_Enable
Trick_Mode_Enable
do{
     TTS_Sentence ( )
     } while (next_bits( ) == TTS_Sentence_Start_Code)
}
Here, the TTS_Sequence_Start_Code is a bit string, represented in hexadecimal as ‘XXXXX’, which indicates the start of a TTS sequence.
The TTS_Sentence_ID is a 10-bit ID that represents a unique identifying number for each TTS data stream.
The Language_Code represents the language to be synthesized, such as Korean, English, German, Japanese, French, etc.
The Prosody_Enable is a 1-bit flag that has a value of ‘1’ when prosody data of the original sound is included in the organized data.
The Video_Enable is a 1-bit flag that has a value of ‘1’ when the TTS is interlocked with a moving picture.
The Lip_Shape_Enable is a 1-bit flag that has a value of ‘1’ when lip-shape data is included in the organized data.
The Trick_Mode_Enable is a 1-bit flag that has a value of ‘1’ when the data is organized to support trick modes such as stop, restart, forward and backward.
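A sketch of how a decoder might read the TTS_Sequence header of Table 1 follows. The patent leaves the start code (‘XXXXX’) and the exact bit packing unspecified, so the byte layout assumed below (a 16-bit start code, the 10-bit ID packed with an assumed 6-bit language code, and one flag byte) is illustrative only.

import io
import struct

def read_tts_sequence_header(stream):
    start_code, = struct.unpack(">H", stream.read(2))    # assumed 16-bit start code
    id_and_lang, = struct.unpack(">H", stream.read(2))   # assumed packing of ID + language
    flags, = struct.unpack(">B", stream.read(1))         # the four 1-bit enable flags
    return {
        "start_code": start_code,
        "tts_sentence_id": id_and_lang >> 6,       # upper 10 bits: TTS_Sentence_ID
        "language_code": id_and_lang & 0x3F,       # assumed 6-bit Language_Code
        "prosody_enable": bool(flags & 0x08),
        "video_enable": bool(flags & 0x04),
        "lip_shape_enable": bool(flags & 0x02),
        "trick_mode_enable": bool(flags & 0x01),
    }

header = read_tts_sequence_header(io.BytesIO(b"\x01\xb3\x00\x41\x0f"))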
TABLE 2
Syntax
TTS_Sentence ( ) {
TTS_Sentence_Start_Code
TTS_Sentence_ID
Silence
if (Silence) {
 Silence_Duration
 }
else {
 Gender
 Age
 if(!Video_Enable) {
  Speech_Rate
  }
 Length_of_Text
 TTS_Text( )
 if(Prosody_Enable) {
  Dur_Enable
  F0_Contour_Enable
  Energy_Contour_Enable
  Number_of_Phonemes
  for(j=0 ; j<Number_of_Phonemes ; j++) {
   Symbol_each_phoneme
   if(Dur_Enable) {
    Dur_each_phoneme
   }
   if(F0_Contour_Enable) {
    F0_contour_each_phoneme
   }
   if(Energy_Contour_Enable) {
    Energy_contour_each_phoneme
   }
  }
 }
 if(Video_Enable) {
  Sentence_Duration
  Position_in_Sentence
  Offset
 }
 if(Lip_Shape_Enable) {
  Number_of_Lip_Event
  for(j=0 ; j<Number_of_Lip_Event ; j++) {
   Lip_in_Sentence
   Lip_Shape
  }
 }
 }
}
Here, the TTS_Sentence_Start_Code is a bit string, represented in hexadecimal as ‘XXXXX’, which indicates the start of a TTS sentence.
The TTS_Sentence_ID is a 10-bit ID that represents a unique identifier for each TTS sentence present in the TTS data stream.
The Silence is a 1-bit flag which is set to ‘1’ when the present input frame is a silent speech section.
The Silence_Duration represents the duration of the present silent speech section in milliseconds.
The Gender indicates the desired gender of the synthesized speech.
The Age indicates the desired apparent age of the synthesized speech, categorized as baby, youth, middle age, or old age.
The Speech_Rate represents the desired output speech rate of the synthesized speech.
The Length_of_Text represents the length of the input text sentence in bytes.
The TTS_Text holds the sentence text, which may be of arbitrary length.
The Dur_Enable is a 1-bit flag set to ‘1’ when duration information is included in the organized data stream.
The F0_Contour_Enable is a 1-bit flag set to ‘1’ when pitch information for each phoneme is included in the organized data stream.
The Energy_Contour_Enable is a 1-bit flag set to ‘1’ when energy information for each phoneme is included in the organized data stream.
The Number_of_Phonemes represents the number of phonemes required to synthesize the sentence.
The Symbol_each_phoneme represents a symbol, such as an IPA symbol, that identifies each phoneme.
The Dur_each_phoneme represents the duration of each phoneme.
The F0_contour_each_phoneme represents the pitch pattern of the phoneme, given as the pitch values at the beginning point, mid point, and end point of the phoneme.
The Energy_contour_each_phoneme represents the energy pattern of the phoneme, given as the energy values in decibels (dB) at the beginning point, mid point, and end point of the phoneme.
The Sentence_Duration represents the total duration of the synthesized speech for the sentence.
The Position_in_Sentence represents the position of the present frame in the sentence.
The Offset represents a delay time used when the synthesized speech is synchronized with a moving picture and the beginning point of the sentence lies within a GOP (Group of Pictures); it identifies the time from the beginning point of the GOP to the beginning point of the sentence.
The Number_of_Lip_Event represents the number of lip-shape changing points in the sentence.
The Lip_in_Sentence represents the location of each lip-shape changing point in the sentence.
The Lip_Shape represents the lip shape at each lip-shape changing point of the sentence.
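For orientation, the TTS_Sentence fields of Table 2 can be mirrored in an ordinary in-memory structure; this sketch is purely illustrative, with field names following the table and types and defaults assumed.

from dataclasses import dataclass, field
from typing import Optional, List, Tuple

@dataclass
class PhonemeProsody:
    symbol: str                                    # Symbol_each_phoneme (e.g., IPA)
    dur_ms: Optional[int] = None                   # Dur_each_phoneme, if Dur_Enable
    f0: Optional[Tuple[float, float, float]] = None        # begin/mid/end pitch
    energy_db: Optional[Tuple[float, float, float]] = None # begin/mid/end energy (dB)

@dataclass
class LipEvent:
    position: int    # Lip_in_Sentence: where in the sentence the lip shape changes
    shape: int       # Lip_Shape: quantized, normalized lip-shape pattern index

@dataclass
class TTSSentence:
    sentence_id: int                               # TTS_Sentence_ID
    silence: bool = False                          # Silence flag
    silence_duration_ms: int = 0                   # Silence_Duration
    gender: str = "female"                         # Gender
    age: str = "youth"                             # baby / youth / middle age / old age
    speech_rate: float = 1.0                       # Speech_Rate (when not video-locked)
    text: str = ""                                 # TTS_Text
    phonemes: List[PhonemeProsody] = field(default_factory=list)
    sentence_duration_ms: int = 0                  # Sentence_Duration, if Video_Enable
    position_in_sentence: int = 0                  # Position_in_Sentence
    offset_ms: int = 0                             # Offset: GOP start -> sentence start
    lip_events: List[LipEvent] = field(default_factory=list)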
Text information includes a classification code for the language used and the sentence text. Prosody information includes the number of phonemes in the sentence, the phoneme stream, the duration of each phoneme, the pitch pattern of each phoneme, and the energy pattern of each phoneme, and is used to enhance the naturalness of the synthesized speech. The synchronization of the moving picture with the synthesized speech can be thought of as dubbing, and it can be realized in three ways.
Firstly, the moving picture and the synthesized speech can be synchronized on a sentence basis, in which case the duration of the synthesized speech is adjusted using information about the beginning point, duration, and beginning-point delay time of each sentence. The beginning point of each sentence indicates the scene within the moving picture at which output of the synthesized speech for the sentence starts. The duration of a sentence indicates the number of scenes over which the synthesized speech for the sentence lasts. In addition, a moving picture compressed in the MPEG-2 or MPEG-4 format, in which the Group of Pictures (GOP) concept is used, must start reproduction at the beginning scene of a GOP rather than at an arbitrary scene. Therefore, the delay time of the beginning point is the information required to synchronize the GOP with the TTS and indicates the delay between the beginning scene and the speech beginning point. This method is easy to realize and minimizes additional effort, but natural synchronization is difficult to accomplish with it.
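As a minimal sketch of this first, sentence-level method, the per-phoneme durations estimated by the prosody processor can simply be rescaled so that the sentence fills the interval given by the synchronization information; the function below is an illustration under that assumption, not the patent's normative procedure.

def scale_durations(durations_ms, sentence_duration_ms, offset_ms=0):
    # durations_ms: per-phoneme durations estimated by the prosody processor.
    # sentence_duration_ms: target duration from the synchronization information.
    # offset_ms: delay from the GOP beginning scene to the speech beginning point.
    factor = sentence_duration_ms / sum(durations_ms)
    return offset_ms, [round(d * factor) for d in durations_ms]

# Example: five phonemes estimated at 900 ms in total must fill 1200 ms of
# video, starting 40 ms after the beginning scene of the GOP.
delay_ms, scaled = scale_durations([150, 200, 180, 170, 200], 1200, offset_ms=40)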
Secondly, the synthesized speech can be produced on a phoneme basis: beginning point information, end point information, and phoneme information are marked for each phoneme within the interval associated with a speech signal in the moving picture, and this information is used to produce the synthesized speech. This method has the advantage of high accuracy, since synchronization between the moving picture and the synthesized speech can be attained at the phoneme level. However, it also has the disadvantage that considerable additional effort must be made to detect and record the duration information of each phoneme within the speech interval of the moving picture.
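Under this second, phoneme-level method, the per-phoneme durations fall directly out of the marked beginning and end points; a trivial illustration (the mark format is assumed):

# Phoneme marks taken from the speech interval of the moving picture:
# (phoneme symbol, beginning point in ms, end point in ms).
marks = [("h", 0, 90), ("e", 90, 210), ("l", 210, 300), ("o", 300, 480)]
durations_ms = [end - begin for _, begin, end in marks]  # handed to the synthesizer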
Thirdly, the synchronization information can be recorded on the basis of the beginning point of speech, the end point of speech, lip-shape information, and the time of each lip-shape change. Lip-shape information quantifies the distance (extent of opening) between the upper lip and the lower lip, the distance (extent of width) between the left and right end points of the lips, and the extent of protrusion of the lips, and is defined as a quantized and normalized pattern dependent upon the articulation location and articulation manner of the phoneme, on the basis of a highly discriminating pattern. This method improves the efficiency of synchronization while minimizing the additional effort needed to produce the synchronization information.
The organization of multimedia input information in accordance with the present invention allows an information provider to select and implement any of the three synchronization methods described above.
In addition, the organized multimedia input information is also used to implement lip animation. Lip animation can be implemented using the phoneme stream prepared from the inputted text in the TTS and the duration of each phoneme, using the phoneme stream distributed from the input information and the duration of each phoneme, or using the lip-shape information included in the inputted information.
The individual property information allows the user to change the gender, age, and speech rate of the synthesized speech. Gender is male or female, and age is classified into four categories, for example 6-7 years, 18 years, 40 years, and 65 years. The speech rate may take 10 steps between 0.7 and 1.6 times a standard rate. The quality of the synthesized speech can be diversified using this information.
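To make the rate setting concrete, one plausible reading (an assumption; the patent does not specify the spacing of the steps) divides the 0.7-1.6 range into 10 evenly spaced values:

# Hypothetical mapping of the 10 speech-rate steps onto 0.7x..1.6x of the
# standard rate; even spacing is an assumption, not stated by the patent.
rate_steps = [round(0.7 + i * (1.6 - 0.7) / 9, 2) for i in range(10)]
# -> [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]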
FIG. 3 is a constructional view of the text-to-speech conversion system (TTS) according to the present invention. In FIG. 3, the TTS consists of a multimedia information input unit 10, a data distributor 11, a language processor 12, a prosody processor 13, a synchronization adjuster 14, a signal processor 15, a synthesis unit database 16, and a picture output apparatus 17.
The multimedia information input unit 10 is organized in the form of Tables 1 and 2 and comprises text, prosody information, information on synchronization with a moving picture, and lip-shape information. Among these, the requisite information is the text; the other information can be provided optionally by an information provider for enhancing the individual property and the naturalness and for accomplishing synchronization with the multimedia, and, if needed, can be amended by a TTS user by means of a character input device (keyboard) or a mouse. This information is transmitted to the data distributor 11.
The data distributor 11 receives the multimedia information; the picture information is transmitted to the picture output apparatus 17, the text is transmitted to the language processor 12, and the synchronization information is converted into a data structure usable by the synchronization adjuster 14 and transmitted to the synchronization adjuster 14. If prosody information is included in the inputted multimedia information, it is converted into a data structure usable by the signal processor 15 and then transmitted to the prosody processor 13 and the synchronization adjuster 14. If individual property information is included in the inputted multimedia information, it is converted into a data structure usable by the synthesis unit database 16 and the prosody processor 13 within the TTS and then transmitted to the synthesis unit database 16 and the prosody processor 13.
The language processor 12 converts the text into a phoneme stream, estimates prosody information, symbolizes this information, and then transmits the symbolized information to the prosody processor 13. The symbolic prosody information is estimated from phrase and paragraph boundaries, the location of accents within words, the sentence pattern, and so on, using the result of syntax analysis.
The prosody processor 13 takes the processing result of the language processor 12 and calculates the values of any prosody control parameters not already included in the multimedia information. The prosody control parameters include the duration, pitch contour, energy contour, pause points, and pause lengths of the phonemes. The calculated result is transmitted to the synchronization adjuster 14.
The synchronization adjuster 14 takes the processing result of the prosody processor 13 and adjusts the duration of each phoneme in order to synchronize the result with the picture signal. The adjustment of each phoneme's duration utilizes the synchronization information transmitted from the data distributor 11. First, a lip shape is assigned to each phoneme depending on the articulation location and articulation manner of the phoneme; on this basis, the assigned lip-shape information is compared with the lip-shape information included in the synchronization information, and the phoneme stream is divided into small groups according to the number of lip shapes recorded in the synchronization information. Next, the duration of each phoneme in the small groups is recalculated using the duration information of the lip shapes included in the synchronization information. The adjusted duration information is incorporated into the processing result of the prosody processor and transmitted to the signal processor.
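A minimal sketch of this adjustment, assuming each phoneme has already been assigned a quantized lip-shape index and that the synchronization information supplies one target duration per lip-shape change point (both assumptions made for illustration):

def adjust_durations(phonemes, lip_events):
    # phonemes: list of (symbol, lip_shape_index, duration_ms) tuples from the
    #   prosody processor, lip shapes assigned by articulation place and manner.
    # lip_events: list of (lip_shape_index, group_duration_ms) tuples from the
    #   synchronization information, one entry per lip-shape change point.
    adjusted, i = [], 0
    for shape, group_duration_ms in lip_events:
        # Collect the consecutive phonemes whose assigned lip shape matches
        # the lip shape recorded in the synchronization information.
        group = []
        while i < len(phonemes) and phonemes[i][1] == shape:
            group.append(phonemes[i])
            i += 1
        # Recalculate the durations within the group so that the group as a
        # whole fits the duration carried by the lip-shape event.
        total = sum(d for _, _, d in group) or 1
        adjusted += [(s, l, round(d * group_duration_ms / total))
                     for s, l, d in group]
    return adjusted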
The signal processor 15 receives the prosody information from the data distributor 11 or the processing result of the synchronization adjuster 14 and produces and outputs the synthesized speech using the synthesis unit database 16.
The synthesis unit database 16 receives the individual property information from the data distributor 11, selects synthesis units suited to the gender and age, and then transmits the data required for synthesis to the signal processor 15 in response to a request from the signal processor 15.
As can be seen from the above description, according to the present invention, the individual property of the synthesized speech can be realized and its naturalness can be enhanced by organizing the individual property and prosody information, estimated from the analysis of actual speech data, along with the text information as multistage information. Furthermore, a foreign movie can be dubbed in Korean by synchronizing the synthesized speech with the moving picture through the direct use of the text information and the lip-shape information, which is estimated by analyzing actual speech data and the lip shapes in the moving picture, in producing the synthesized speech. Still furthermore, the present invention is applicable to a variety of fields, such as communication services, office automation, and education, by making synchronization between the picture information and the TTS in the multimedia environment possible.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
It is therefore intended by the appended claims to cover any and all such applications, modifications, and embodiments within the scope of the present invention.

Claims (32)

1. A text-to-speech conversion system for interlocking with multimedia comprising:
a multimedia information input unit for organizing text, prosody information, information on synchronization with a moving picture, lip-shape information, picture information, and individual property information including a gender, age, accent, pronunciation and speech rate of synthesized speech;
a data distributor for distributing the information from said multimedia information input unit into information for each media;
a language processor for converting the text distributed by said data distributor into a phoneme stream, presuming prosody information and symbolizing the presumed prosody information;
a prosody processor for calculating a prosody control parameter value from the symbolized prosody information from the language processor;
a synchronization adjuster for adjusting a duration of each phoneme using the synchronization information distributed by said data distributor;
a synthesis unit database for receiving the individual property information from said data distributor, selecting synthesis units adaptable to gender and age and outputting data required for synthesis;
a signal processor for producing a synthesized speech using the prosody control parameter and the data output from said synthesis unit database; and
a picture output apparatus for outputting the picture information distributed by said data distributor onto a screen.
2. A method for organizing input data of a text-to-speech conversion system for interlocking with multimedia, said method comprising the steps of:
(a) classifying multimedia input information organized for enhancing natural synthesized speech and implementing synchronization of multimedia with text-to-speech into text, prosody information, information on synchronization with a moving picture, lip-shaped information, picture information, and individual property information using a multimedia information input unit;
(b) distributing using a data distributor the multimedia input information classified in the multimedia information input unit based on respective information;
(c) converting the text distributed by the data distributor into a phoneme stream, presuming prosody information and symbolizing the presumed prosody information using a language processor;
(d) calculating a prosody control parameter value which is not included in the multimedia input information using a prosody processor;
(e) adjusting a duration of each phoneme using a synchronization adjuster so as to synchronize a processing result of the prosody processor with a picture signal according to the synchronization information distributed by the data distributor;
(f) selecting synthesis units adaptable to gender and age based on the individual property information from the data distributor using a synthesis unit database and outputting data required for synthesis;
(g) producing synthesized speech using a signal processor based on the prosody information distributed by the data distributor, a processing result of the synchronization adjuster, and the data from the synthesis unit database; and
(h) outputting the picture information distributed by the data distributor onto a screen using a picture output unit.
3. The method in accordance with claim 2, wherein the organized multimedia information comprises text information, prosody information, information on synchronization with a moving picture, lip-shaped information, and individual property information.
4. The method in accordance with claim 3, wherein the prosody information comprises a number of phoneme, phoneme stream information, duration of each phoneme, pitch pattern of the phoneme, and energy pattern of the phoneme.
5. The method in accordance with claim 4, wherein the duration time of the phoneme is indicative of a value of pitch at a beginning point, a mid point, and an end point within the phoneme.
6. The method in accordance with claim 5, wherein the energy pattern of the phoneme is indicative of a value of energy in decibels at the beginning point, the mid point, and the end point within the phoneme.
7. The method in accordance with claim 3, wherein the synchronization information comprises text, lip-shape, location information with a moving picture, and duration information.
8. The method in accordance with claim 3, wherein the synchronization information comprises a beginning point, duration and delay time information of a starting point, and duration of each phoneme is controlled by the synchronization information.
9. The method in accordance with claim 3, wherein the synchronization information is composed of a duration of a beginning point of a sentence, a duration information of a starting point, and duration of each phoneme is controlled by forecast lip-shape considered an articulation manner of the phoneme and articulation control of lip-shape within the synchronization and duration information of the synchronization information.
10. The method in accordance with claim 3, wherein the synthesized speech is produced based on beginning point information, end point information, and phoneme information for each phoneme within an interval associated with a speech signal.
11. The method in accordance with claim 3, wherein the synthesized speech is produced based on a distance of an opening between an upper lip and a lower lip, a distance between end points of the lips, and an extent of projection of a lip, and a lip-shape quantized and normalized pattern is defined depending on articulation location and articulation manner of the phoneme on a basis of pattern with discriminative property.
12. The method in accordance with claim 3, wherein if the multimedia input information comprises prosody information, further comprising the steps of:
(i) converting the prosody information into a data structure recognizable by the signal processor; and
(j) transmitting the converted prosody information to the prosody processor and the synchronization adjustor.
13. The method in accordance with claim 3, wherein if the multimedia input information includes individual property information, further comprising the steps of:
(k) converting the individual property information into a data structure recognizable by the synthesis unit database and the prosody processor within the text-to-speech;
(l) transmitting the converted individual property information to the synthesis unit database and the prosody processor.
14. A text-to-speech conversion system (TTS) for synchronizing synthesized speech and a moving picture which is to be displayed on a picture output apparatus which is connected with the TTS, the TTS including a language processor for converting the text into phoneme stream and presuming prosody information from the phoneme stream; a prosody processor for calculating prosody control parameter values from the prosody information using a predefined rule; and a signal processor for producing synthesized speech using the prosody control parameter values and synthetic data stored in a synthesis unit database, characterized in that the TTS comprises:
a multimedia information input unit for inputting a set of multimedia information, the set of multimedia information including moving picture information, text information, and synchronization information;
a data distributor for classifying the set of multimedia information into a plurality of subsets of the multimedia information to distribute each subset of the multimedia information into a corresponding one of the language processor, prosody processor, signal processor and the picture output apparatus; and
a synchronization adjuster for adjusting the duration of each phoneme in the phoneme stream using the synchronization information subset distributed by said data distributor to synchronize between synthesized speech to be produced by the signal processor and the moving picture to be displayed on the picture output apparatus.
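As a hypothetical illustration of the classification and distribution recited in claim 14 (the dictionary keys and routing below are assumptions, not the patent's data format), the data distributor might split one multimedia input record into per-component subsets:

    def distribute(multimedia: dict) -> dict:
        """Classify a set of multimedia information into subsets and route each
        subset to the component that consumes it."""
        return {
            "language_processor": {"text": multimedia.get("text")},
            "synchronization_adjuster": {"sync": multimedia.get("sync")},
            "picture_output_apparatus": {"video": multimedia.get("video")},
            # Optional subsets are passed through only when present.
            "prosody_processor": {"prosody": multimedia.get("prosody")},
        }

    subsets = distribute({"text": "Hello.", "sync": {"duration_ms": 1500}, "video": b""})
    print(subsets["synchronization_adjuster"])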
15. A method for synchronizing synthesized speech generated from a TTS and a moving picture which is to be displayed on a picture output apparatus which is connected with the TTS, the method comprising the steps of:
receiving a set of multimedia information which includes text information, moving picture information and synchronization information;
classifying the set of the received multimedia information into a plurality of subsets of the information including a synchronization information subset;
converting each classified text information subset into a phoneme stream;
presuming prosody information from each phoneme stream;
calculating prosody control parameter values based on the prosody information;
adjusting the duration of each phoneme of each phoneme stream using the respective classified synchronization information subset to synchronize between the synthesized speech and the moving picture; and
producing the synthesized speech using the prosody control parameter values and data in a synthesis unit database, in synchronism with the moving picture to be displayed.
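A minimal sketch of the duration-adjustment step in claim 15, assuming the synchronization information supplies a target duration for a synchronized interval and that the predicted phoneme durations are rescaled proportionally; the proportional rule is an assumption, since the claim does not prescribe a particular adjustment:

    def adjust_durations(phoneme_durations: list, target_total: float) -> list:
        """Rescale predicted phoneme durations so that the interval's total
        length matches the duration demanded by the synchronization information."""
        scale = target_total / sum(phoneme_durations)
        return [d * scale for d in phoneme_durations]

    # Example: the TTS predicted 1.2 s of speech, but the corresponding lip
    # movement in the moving picture spans 1.5 s.
    print(adjust_durations([0.3, 0.5, 0.4], target_total=1.5))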
16. The method according to claim 15, wherein said prosody control parameters comprise the number of phonemes, a duration time of each phoneme, a pitch pattern of each phoneme, and an energy pattern of each phoneme.
17. The method according to claim 16, wherein said pitch pattern of each phoneme is indicative of a value of pitch of the desired synthesized speech at a beginning point, a middle point, and an end point within each phoneme.
18. The method according to claim 16, wherein said energy pattern of each phoneme is indicative of a value of energy of the desired synthesized speech in decibels at a beginning point, a mid point and an end point within each phoneme.
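Claims 16-18 describe the prosody control parameters as, per phoneme, a duration plus three-point pitch and energy contours. A hypothetical container for these values (the field names and the Hz unit for pitch are assumptions; only the dB unit for energy is stated in the claims):

    from dataclasses import dataclass

    @dataclass
    class PhonemeProsody:
        duration_ms: float
        # Pitch value at the beginning, middle, and end points within the phoneme.
        pitch_hz: tuple
        # Energy in decibels at the beginning, mid, and end points.
        energy_db: tuple

    # Example: a 120 ms phoneme with falling pitch.
    p = PhonemeProsody(duration_ms=120.0,
                       pitch_hz=(180.0, 160.0, 140.0),
                       energy_db=(62.0, 65.0, 58.0))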
19. The system according to claim 14, wherein said synchronization information subset includes lip-shape information, said lip-shape information including a number of lip-shape change points, a location of lip-shape change points in a sentence and a lip-shape representation at every lip-shape change point.
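Claims 19 and 25 describe the lip-shape part of the synchronization information as a list of change points. A hypothetical encoding, reusing the quantized pattern idea from the earlier sketch (the location unit, here a character index, is an assumption):

    from dataclasses import dataclass

    @dataclass
    class LipShapeChangePoint:
        location: int   # position of the change point in the sentence (character index)
        pattern: tuple  # quantized lip-shape representation at this change point

    lip_info = [
        LipShapeChangePoint(location=0, pattern=(2, 5, 1)),
        LipShapeChangePoint(location=7, pattern=(6, 2, 7)),
    ]
    # len(lip_info) is the claimed "number of lip-shape change points".
    print(len(lip_info))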
20. The system according to claim 14, wherein a set of said multimedia information further includes individual property information, said individual property information subset including gender and age information of the synthesized speech.
21. The system according to claim 14, wherein if a set of said multimedia information further includes prosody control parameters, said prosody control parameters are capable of being utilized in said synchronization adjuster without the processing of said language processor and prosody processor.
22. The system according to claim 21, wherein said prosody control parameters include the number of phonemes in the data stream, a duration time of each phoneme, a pitch pattern of each phoneme and an energy pattern of each phoneme.
23. The system according to claim 22, wherein said pitch pattern of each phoneme is indicative of a value of pitch of the desired synthesized speech at a beginning point, a middle point, and an end point within each phoneme.
24. The system according to claim 22, wherein said energy pattern of each phoneme is indicative of a value of energy of the desired synthesized speech in decibels at a beginning point, a mid point and an end point within each phoneme.
25. The method of claim 15, wherein the classified synchronization information subset includes lip-shape information, said lip-shape information including a number of lip-shape change points, a location of lip-shape change points in a sentence and a lip-shape representation at every lip-shape change point.
26. The method according to claim 15, wherein said set of the received multimedia information further includes an individual property information subset, said individual property information subset including gender and age information of the synthesized speech.
27. A method for synchronizing synthesized speech generated from a TTS and a moving picture which is to be displayed on a picture output apparatus which is connected with the TTS, the method comprising the steps of:
receiving a set of multimedia information which includes text information, moving picture information, synchronization information and prosody control parameters, said prosody control parameters including a duration of each phoneme;
classifying a set of the received multimedia information into a plurality of subsets of the information including a synchronization information subset;
adjusting the duration of each phoneme using the classified synchronization information subset to synchronize between the synthesized speech and the moving picture; and
producing the synthesized speech using the prosody control parameter values included in the set of the received multimedia information and data in a synthesis unit database, in synchronism with the moving picture to be displayed on a screen of the picture output apparatus.
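Claim 27 differs from claim 15 in that the prosody control parameters arrive ready-made in the multimedia input, so text analysis and prosody estimation can be bypassed. A self-contained sketch of that short-circuit, with hypothetical names:

    def choose_prosody(multimedia: dict, estimate) -> list:
        """Use prosody control parameters supplied in the input when present
        (claim 27); otherwise estimate them from the text (claim 15)."""
        supplied = multimedia.get("prosody")
        return supplied if supplied is not None else estimate(multimedia["text"])

    # Example: parameters supplied as (phoneme, duration_ms) pairs, with a
    # trivial fallback estimator.
    print(choose_prosody({"text": "hi", "prosody": [("h", 80), ("i", 120)]},
                         estimate=lambda text: []))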
28. A process for producing synthesized speech in synchronism with an associated moving picture characterized in that the process comprises receiving a set of multimedia information including text information, moving picture information and synchronization information; and synthesizing speech from the received text information in synchronization with the received moving picture information using the received synchronization information.
29. A speech synthesizer for use in synchronizing synthesized speech generated from a TTS and a moving picture which is to be displayed on a picture output apparatus which is connected with the TTS, the speech synthesizer comprising:
means for receiving prosody control parameters including a duration of each phoneme, synchronization information and moving picture data;
means for adjusting the duration of each phoneme of each phoneme stream using the synchronization information to synchronize between the synthesized speech and the moving picture; and
means for producing the synthesized speech using the prosody control parameter values and data in a synthesis unit database, in synchronism with the moving picture to be displayed.
30. A synthesizer for producing synthesized speech using text information, comprising:
receiving means for receiving the text information usable for synthesizing speech, and synchronization information including a number of lip-shape change points and a lip-shape representation at every lip-shape change point;
synthesizing means for producing the synthesized speech from the text information using the synchronization information; and
outputting means for outputting the synthesized speech.
31. The synthesizer according to claim 30, wherein a picture output apparatus is connected to the synthesizer and operated in synchronization with the synthesizer.
32. The synthesizer according to claim 31, wherein the synchronization information is related to a moving picture outputted from the picture output apparatus.
US10/193,594 1997-05-08 2002-09-30 Text-to speech conversion system for synchronizing between synthesized speech and a moving picture in a multimedia environment and a method of the same Expired - Lifetime USRE42647E1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/193,594 USRE42647E1 (en) 1997-05-08 2002-09-30 Text-to speech conversion system for synchronizing between synthesized speech and a moving picture in a multimedia environment and a method of the same

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR97-17615 1997-05-08
KR1019970017615A KR100240637B1 (en) 1997-05-08 1997-05-08 Syntax for tts input data to synchronize with multimedia
US09/020,712 US6088673A (en) 1997-05-08 1998-02-09 Text-to-speech conversion system for interlocking with multimedia and a method for organizing input data of the same
US10/193,594 USRE42647E1 (en) 1997-05-08 2002-09-30 Text-to speech conversion system for synchronizing between synthesized speech and a moving picture in a multimedia environment and a method of the same

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/020,712 Reissue US6088673A (en) 1997-05-08 1998-02-09 Text-to-speech conversion system for interlocking with multimedia and a method for organizing input data of the same

Publications (1)

Publication Number Publication Date
USRE42647E1 true USRE42647E1 (en) 2011-08-23

Family

ID=19505142

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/020,712 Ceased US6088673A (en) 1997-05-08 1998-02-09 Text-to-speech conversion system for interlocking with multimedia and a method for organizing input data of the same
US10/193,594 Expired - Lifetime USRE42647E1 (en) 1997-05-08 2002-09-30 Text-to speech conversion system for synchronizing between synthesized speech and a moving picture in a multimedia environment and a method of the same

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/020,712 Ceased US6088673A (en) 1997-05-08 1998-02-09 Text-to-speech conversion system for interlocking with multimedia and a method for organizing input data of the same

Country Status (4)

Country Link
US (2) US6088673A (en)
JP (2) JP3599549B2 (en)
KR (1) KR100240637B1 (en)
DE (1) DE19753454C2 (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7076426B1 (en) * 1998-01-30 2006-07-11 At&T Corp. Advance TTS for facial animation
KR100395491B1 (en) * 1999-08-16 2003-08-25 한국전자통신연구원 Method Of Visual Communication On Speech Translating System Based On Avatar
JP4320487B2 (en) * 1999-09-03 2009-08-26 ソニー株式会社 Information processing apparatus and method, and program storage medium
US6557026B1 (en) * 1999-09-29 2003-04-29 Morphism, L.L.C. System and apparatus for dynamically generating audible notices from an information network
USRE42904E1 (en) * 1999-09-29 2011-11-08 Frederick Monocacy Llc System and apparatus for dynamically generating audible notices from an information network
JP4465768B2 (en) * 1999-12-28 2010-05-19 ソニー株式会社 Speech synthesis apparatus and method, and recording medium
JP4032273B2 (en) * 1999-12-28 2008-01-16 ソニー株式会社 Synchronization control apparatus and method, and recording medium
US6529586B1 (en) 2000-08-31 2003-03-04 Oracle Cable, Inc. System and method for gathering, personalized rendering, and secure telephonic transmission of audio data
US6975988B1 (en) * 2000-11-10 2005-12-13 Adam Roth Electronic mail method and system using associated audio and visual techniques
KR100379995B1 (en) * 2000-12-08 2003-04-11 야무솔루션스(주) Multicodec player having text-to-speech conversion function
US20030009342A1 (en) * 2001-07-06 2003-01-09 Haley Mark R. Software that converts text-to-speech in any language and shows related multimedia
US7487092B2 (en) * 2003-10-17 2009-02-03 International Business Machines Corporation Interactive debugging and tuning method for CTTS voice building
WO2005059895A1 (en) 2003-12-16 2005-06-30 Loquendo S.P.A. Text-to-speech method and system, computer program product therefor
US20050187772A1 (en) * 2004-02-25 2005-08-25 Fuji Xerox Co., Ltd. Systems and methods for synthesizing speech using discourse function level prosodic features
US20060136215A1 (en) * 2004-12-21 2006-06-22 Jong Jin Kim Method of speaking rate conversion in text-to-speech system
JP3955881B2 (en) * 2004-12-28 2007-08-08 松下電器産業株式会社 Speech synthesis method and information providing apparatus
KR100710600B1 * 2005-01-25 2007-04-24 우종식 The method and apparatus that create/playback auto synchronization of image, text, lip's shape using TTS
TWI341956B (en) * 2007-05-30 2011-05-11 Delta Electronics Inc Projection apparatus with function of speech indication and control method thereof for use in the apparatus
US8374873B2 (en) 2008-08-12 2013-02-12 Morphism, Llc Training and applying prosody models
US8731931B2 (en) 2010-06-18 2014-05-20 At&T Intellectual Property I, L.P. System and method for unit selection text-to-speech using a modified Viterbi approach
JP6069211B2 (en) * 2010-12-02 2017-02-01 アクセシブル パブリッシング システムズ プロプライアタリー リミテッド Text conversion and expression system
JP2012150363A (en) * 2011-01-20 2012-08-09 Kddi Corp Message image editing program and message image editing apparatus
KR101358999B1 (en) * 2011-11-21 2014-02-07 (주) 퓨처로봇 method and system for multi language speech in charactor
WO2014141054A1 (en) * 2013-03-11 2014-09-18 Video Dubber Ltd. Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
KR20220147276A (en) * 2021-04-27 2022-11-03 삼성전자주식회사 Electronic devcie and method for generating text-to-speech model for prosody control of the electronic devcie
WO2023166527A1 (en) * 2022-03-01 2023-09-07 Gan Studio Inc. Voiced-over multimedia track generation

Patent Citations (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AT72083B (en) 1912-12-18 1916-07-10 S J Arnheim Attachment for easily interchangeable locks.
US4260229A (en) * 1978-01-23 1981-04-07 Bloomstein Richard W Creating visual images of lip movements
US4305131A (en) * 1979-02-05 1981-12-08 Best Robert M Dialog between TV movies and human viewers
WO1985004747A1 (en) 1984-04-10 1985-10-24 First Byte Real-time text-to-speech conversion system
EP0225729B1 (en) 1985-11-14 1992-01-22 BRITISH TELECOMMUNICATIONS public limited company Image encoding and synthesis
JPH02234285A (en) 1989-03-08 1990-09-17 Kokusai Denshin Denwa Co Ltd <Kdd> Method and device for synthesizing picture
GB2231246A (en) 1989-03-08 1990-11-07 Kokusai Denshin Denwa Co Ltd Converting text input into moving-face picture
US5386581A (en) * 1989-03-28 1995-01-31 Matsushita Electric Industrial Co., Ltd. Multimedia data editing apparatus including visual graphic display of time information
US5111409A (en) * 1989-07-21 1992-05-05 Elon Gasper Authoring and use systems for sound synchronized animation
JPH03241399A (en) 1990-02-20 1991-10-28 Canon Inc Voice transmitting/receiving equipment
DE4101022A1 (en) 1991-01-16 1992-07-23 Medav Digitale Signalverarbeit Variable speed reproduction of audio signal without spectral change - dividing digitised audio signal into blocks, performing transformation, and adding or omitting blocks before reverse transformation
US5630017A (en) 1991-02-19 1997-05-13 Bright Star Technology, Inc. Advanced tools for speech synchronized animation
US5689618A (en) 1991-02-19 1997-11-18 Bright Star Technology, Inc. Advanced tools for speech synchronized animation
JPH04285769A (en) 1991-03-14 1992-10-09 Nec Home Electron Ltd Multi-media data editing method
JPH04359299A (en) 1991-06-06 1992-12-11 Sony Corp Image deformation method based on voice signal
US5313522A (en) * 1991-08-23 1994-05-17 Slager Robert P Apparatus for generating from an audio signal a moving visual lip image from which a speech content of the signal can be comprehended by a lipreader
JPH0564171A (en) 1991-09-03 1993-03-12 Hitachi Ltd Digital video/audio signal transmission system and digital audio signal reproduction method
JPH05188985A (en) 1992-01-13 1993-07-30 Hitachi Ltd Speech compression system, communication system, and radio communication device
JPH05313686A (en) 1992-04-02 1993-11-26 Sony Corp Display controller
US5615300A (en) * 1992-05-28 1997-03-25 Toshiba Corporation Text-to-speech synthesis with controllable processing time and speech quality
US5677993A (en) * 1992-08-31 1997-10-14 Hitachi, Ltd. Information processing apparatus using pointing input and speech input
US5636325A (en) * 1992-11-13 1997-06-03 International Business Machines Corporation Speech synthesis and analysis of dialects
US5500919A (en) * 1992-11-18 1996-03-19 Canon Information Systems, Inc. Graphics user interface for controlling text-to-speech conversion
US5751906A (en) * 1993-03-19 1998-05-12 Nynex Science & Technology Method for synthesizing speech from text and for spelling all or portions of the text by analogy
JPH06326967A (en) 1993-05-12 1994-11-25 Matsushita Electric Ind Co Ltd Data transmission method
US5860064A (en) * 1993-05-13 1999-01-12 Apple Computer, Inc. Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system
JPH06348811A (en) 1993-06-07 1994-12-22 Sharp Corp Moving image display device
JPH0738857A (en) 1993-07-16 1995-02-07 Pioneer Electron Corp Synchronization system for time-division video and audio signals
US5557661A (en) * 1993-11-02 1996-09-17 Nec Corporation System for coding and decoding moving pictures based on the result of speech analysis
US5608839A (en) 1994-03-18 1997-03-04 Lucent Technologies Inc. Sound-synchronized video system
JPH07306692A (en) 1994-05-13 1995-11-21 Matsushita Electric Ind Co Ltd Speech recognizer and sound inputting device
US5657426A (en) * 1994-06-10 1997-08-12 Digital Equipment Corporation Method and apparatus for producing audio-visual synthetic speech
EP0689362A2 (en) 1994-06-21 1995-12-27 AT&T Corp. Sound-synchronised video system
JPH0830287A (en) 1994-07-19 1996-02-02 Internatl Business Mach Corp <Ibm> Text-speech converting system
US5774854A (en) * 1994-07-19 1998-06-30 International Business Machines Corporation Text to speech system
EP0706170A2 (en) 1994-09-29 1996-04-10 CSELT Centro Studi e Laboratori Telecomunicazioni S.p.A. Method of speech synthesis by means of concatenation and partial overlapping of waveforms
US5677739A (en) 1995-03-02 1997-10-14 National Captioning Institute System and method for providing described television services
US5777612A (en) * 1995-03-20 1998-07-07 Fujitsu Limited Multimedia dynamic synchronization system
US5729694A (en) * 1996-02-06 1998-03-17 The Regents Of The University Of California Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
US5850629A (en) * 1996-09-09 1998-12-15 Matsushita Electric Industrial Co., Ltd. User interface controller for text-to-speech synthesizer
US5970459A (en) * 1996-12-13 1999-10-19 Electronics And Telecommunications Research Institute System for synchronization between moving picture and a text-to-speech converter
JP4359299B2 (en) 2006-09-13 2009-11-04 Tdk株式会社 Manufacturing method of multilayer ceramic electronic component

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Nakumura et al., "Speech Recognition and Lip Movement Synthesis"; HMM-Based Audio-Visual Integration; pp. 93-98, 1997.
Yamamoto et al., pp. 245-246, Nara Institute of Science and Technology, Sep. 1997.

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166276A1 (en) * 2005-10-26 2013-06-27 c/o Cortica, Ltd. System and method for context translation of natural language
US9087049B2 (en) * 2005-10-26 2015-07-21 Cortica, Ltd. System and method for context translation of natural language
WO2020161697A1 (en) * 2019-02-05 2020-08-13 Igentify Ltd. System and methodology for modulation of dynamic gaps in speech

Also Published As

Publication number Publication date
US6088673A (en) 2000-07-11
KR19980082608A (en) 1998-12-05
JP2004361965A (en) 2004-12-24
JP3599549B2 (en) 2004-12-08
KR100240637B1 (en) 2000-01-15
JP4344658B2 (en) 2009-10-14
DE19753454A1 (en) 1998-11-12
DE19753454C2 (en) 2003-06-18
JPH10320170A (en) 1998-12-04

Similar Documents

Publication Publication Date Title
USRE42647E1 (en) Text-to speech conversion system for synchronizing between synthesized speech and a moving picture in a multimedia environment and a method of the same
KR100236974B1 (en) Sync. system between motion picture and text/voice converter
US7145606B2 (en) Post-synchronizing an information stream including lip objects replacement
US5677739A (en) System and method for providing described television services
US20060285654A1 (en) System and method for performing automatic dubbing on an audio-visual stream
US20060136226A1 (en) System and method for creating artificial TV news programs
US20160021334A1 (en) Method, Apparatus and System For Regenerating Voice Intonation In Automatically Dubbed Videos
US11064245B1 (en) Piecewise hybrid video and audio synchronization
WO2005116992A1 (en) Method of and system for modifying messages
EP3935635A1 (en) System and method for simultaneous multilingual dubbing of video-audio programs
GB2231246A (en) Converting text input into moving-face picture
US11729475B2 (en) System and method for providing descriptive video
KR102567931B1 (en) Contents generation flatfrom device undating interactive scenario based on viewer reaction
JP2011186143A (en) Speech synthesizer, speech synthesis method for learning user's behavior, and program
US20210390937A1 (en) System And Method Generating Synchronized Reactive Video Stream From Auditory Input
KR102463283B1 (en) automatic translation system of video contents for hearing-impaired and non-disabled
TWI790705B (en) Method for adjusting speech rate and system using the same
KR102546559B1 (en) translation and dubbing system for video contents
EP4345814A1 (en) Video-generation system
Soens et al. Robust temporal alignment of spontaneous and dubbed speech and its application for Automatic Dialogue Replacement
JP2004071013A (en) Method, device and program for recording audio data with video
KR20230099934A (en) The text-to-speech conversion device and the method thereof using a plurality of speaker voices
KR20230043294A (en) Apparatus for conversing style of background using analysis song lyrics and method thereof

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FPAY Fee payment

Year of fee payment: 12