EP1271469A1 - Method for generating personality patterns and for synthesizing speech - Google Patents

Method for generating personality patterns and for synthesizing speech

Info

Publication number
EP1271469A1
Authority
EP
European Patent Office
Prior art keywords
speech
features
anyone
acoustical
synthesizing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP01115216A
Other languages
German (de)
French (fr)
Inventor
Krzysztof Marasek
Thomas Kemp
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Deutschland GmbH
Original Assignee
Sony International Europe GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony International Europe GmbH filed Critical Sony International Europe GmbH
Priority to EP01115216A priority Critical patent/EP1271469A1/en
Publication of EP1271469A1 publication Critical patent/EP1271469A1/en
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser

Abstract

To mimic the speaking behavior of a given speaker, a method for generating personality patterns, in particular for synthesizing speech, is proposed in which acoustical as well as non-acoustical speech features (SF) are extracted from a given speech input (SI).

Description

The present invention relates to a method for generating personality patterns and to a method for synthesizing speech.
Nowadays, a large variety of equipment and appliances employ man-machine dialogue systems to ensure easy and reliable use by a human user. These man-machine dialogue systems are enabled to receive and consider users' utterances, in particular orders and/or inquiries, and to react and respond in an appropriate way. Nevertheless, current speech synthesis systems involved in such man-machine dialogue systems suffer from a lack of personality and naturalness. Although the systems are enabled to deal with the context of the situation in an appropriate way, the prepared and output speech of the dialogue system often sounds monotonous and machine-like, and is not embedded into the particular situation.
It is an object of the present invention to provide a method for generating personality patterns, in particular for synthesizing speech, and a method for synthesizing speech in which naturalness of the speech and its features can be realized.
The object is achieved by a method for generating personality patterns, in particular for synthesizing speech, with the features of claim 1. Furthermore, the object is achieved by a method for synthesizing speech according to the characterizing features of claim 11. A system and a computer program product for carrying out the inventive methods are the subject-matter of claims 14 and 15, respectively. Preferred embodiments of the inventive methods are within the scope of the dependent subclaims.
In the inventive method for generating personality patterns, in particular for synthesizing speech, a speech input is received and/or preprocessed. From the speech input acoustical and/or non-acoustical speech features are extracted. Based on the extracted speech features and/or on models and/or parameters thereof, a personality pattern is generated and/or stored.
It is therefore a basic idea of the present invention to extract acoustical and, alternatively or simultaneously, non-acoustical speech features from a received speech input. The speech features are then directly or indirectly used to construct a personality pattern which can later on be used to reconstruct a speech output mimicking the speech input and its speaker. The speech features are therefore parameterized or modeled and included or described in certain models or units.
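Purely as an illustration of this idea (not part of the claimed method), the overall flow can be sketched as follows; the names PersonalityPattern, extract_acoustic_features, and extract_non_acoustic_features are hypothetical placeholders for whatever concrete extractors a system provides.

```python
# Illustrative sketch only; the two extractor callables are hypothetical placeholders.
from dataclasses import dataclass, field


@dataclass
class PersonalityPattern:
    acoustic: dict = field(default_factory=dict)      # e.g. pitch, loudness, timbre models
    non_acoustic: dict = field(default_factory=dict)  # e.g. word usage, syntax preferences


def generate_personality_pattern(speech_input, transcript,
                                 extract_acoustic_features,
                                 extract_non_acoustic_features):
    """Build a personality pattern from one speech input and its recognized transcript."""
    pattern = PersonalityPattern()
    pattern.acoustic = extract_acoustic_features(speech_input)
    pattern.non_acoustic = extract_non_acoustic_features(transcript)
    return pattern
```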
According to an embodiment of the inventive method for generating personality patterns, online input speech and/or speech of a speech data base for at least one given speaker are used for receiving said speech input. Using a speech data base enables a system involving the inventive method to generate the personality patterns in advance of an application. That means that, before the system is applied, for example in a speech synthesizing unit, a speech model for a single speaker or for a variety of speakers can be constructed. Within the application of the inventive method it is also possible to construct the personality patterns during the application in a speech synthesizing unit in a real-time or online manner, so as to adapt a speech output generated in a dialogue system during the application and/or during the dialogue with the user.
It is an aspect of the present invention to use a large variety of features from the speech input so as to model the personality patterns as well as possible and to achieve a particularly natural responding speech output in an application of a dialogue system.
It is therefore an aspect of a further embodiment of the present invention to use prosodic features, voice quality features, global statistical and/or spectral properties, and/or the like as acoustical features.
Within the class of prosodic features, pitch, pitch range, intonation attitude, loudness, speaking rate, phone duration, speech element duration features, and/or the like can be employed.
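As an illustration of how some of these prosodic measures might be approximated from a waveform, the sketch below uses the librosa library (an assumption of this example, not something prescribed here): pitch is estimated with the pYIN tracker, loudness with frame-wise RMS energy, and speaking rate from an externally supplied word count.

```python
# Sketch of prosodic feature estimation; librosa is assumed, not mandated by the method.
import numpy as np
import librosa


def prosodic_features(wav_path, n_recognized_words=None):
    y, sr = librosa.load(wav_path, sr=16000)
    # Pitch track via pYIN; unvoiced frames are returned as NaN and dropped.
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = f0[~np.isnan(f0)]
    rms = librosa.feature.rms(y=y)[0]            # frame-wise loudness proxy
    duration = len(y) / sr
    feats = {
        "pitch_mean_hz": float(np.mean(f0)) if f0.size else 0.0,
        "pitch_range_hz": float(np.ptp(f0)) if f0.size else 0.0,
        "loudness_rms_mean": float(np.mean(rms)),
        "duration_s": duration,
    }
    if n_recognized_words:                        # crude speaking-rate estimate
        feats["speaking_rate_wps"] = n_recognized_words / duration
    return feats
```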
Within the class of voice quality features, phonation type, articulation manner, voice timbre features, and/or the like can be employed.
According to a further advantageous embodiment of the present invention, contextual features and/or the like may be important within the class of non-acoustical features. In particular, syntactical, grammatical, semantical features, and/or the like can be used as contextual features.
As a human speaker has distinct preferences in constructing sentences, phrases, word combinations, and/or the like, according to a further preferred embodiment of the present invention, statistical features on the usage, distribution, and/or probability of speech elements - such as words, subword units, syllables, phonemes, phones, and/or the like - and/or combinations of them within said speech input can be used within the class of non-acoustical features. Additionally, sentence, phrase, and word combination preferences can be evaluated and included in said personality pattern.
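A minimal sketch of how such usage statistics could be gathered from recognized transcripts is given below; it counts word unigrams and bigrams with the Python standard library and is only one of many possible realizations.

```python
# Sketch: word and word-pair usage statistics from recognized utterances of one speaker.
from collections import Counter


def usage_statistics(transcripts):
    """transcripts: iterable of recognized utterance strings for one speaker."""
    unigrams, bigrams = Counter(), Counter()
    for text in transcripts:
        words = text.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total = sum(unigrams.values()) or 1
    return {
        "favorite_words": unigrams.most_common(20),
        "favorite_word_pairs": bigrams.most_common(20),
        "word_probabilities": {w: c / total for w, c in unigrams.items()},
    }
```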
To prepare for the extraction of contextual features or the like, a process of speech recognition is preferably carried out within the inventive method.
Alternatively or additionally, a process of speaker identification and/or adaptation can be performed, in particular so as to increase the matching rate of the feature extraction and/or the recognition rate of the process of speech recognition.
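One common realization of such speaker identification, assumed here only for illustration, is to compare a fixed-length speaker embedding (for example averaged spectral features or a neural embedding) against enrolled reference embeddings by cosine similarity.

```python
# Sketch: speaker identification by nearest cosine similarity over enrolled embeddings.
import numpy as np


def identify_speaker(embedding, enrolled, threshold=0.75):
    """embedding: 1-D vector; enrolled: dict mapping speaker_id -> reference vector."""
    best_id, best_score = None, -1.0
    for speaker_id, ref in enrolled.items():
        score = float(np.dot(embedding, ref)
                      / (np.linalg.norm(embedding) * np.linalg.norm(ref) + 1e-10))
        if score > best_score:
            best_id, best_score = speaker_id, score
    return (best_id, best_score) if best_score >= threshold else (None, best_score)
```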
In the inventive method for synthesizing speech, in particular for a man-machine dialogue system, the inventive method for generating personality patterns is employed.
According to a further embodiment of the inventive method for synthesizing speech, the method for generating personality patterns is essentially carried out in a preprocessing step, in particular based on a speech data base or the like.
Alternatively or additionally, the method for generating personality patterns can be carried out and/or continued in a continuous, real time, or online manner. This enables a system involving said method for synthesizing speech to adapt its speech output in accordance to the received input during the dialogue.
Both of the methods for generating personality patterns and/or for synthesizing speech can be configured to create a personality pattern or a speech output which is in some sense complementary to the personality pattern or character assigned to the speaker of the speech input. That means, for instance, that in the case of an emergency call system for activating ambulance or fire alarm services the speaker of the speech input might be excited and/or confused. It might therefore be necessary to calm down the speaking person, and this can be achieved by creating a personality pattern for the speech synthesis reflecting a strong, confident, and safe character. Additionally, it might also be possible to construct personality patterns for the synthesized speech output which reflect a gender complementary to the gender of the speaker of the speech input, i.e. in the case of a male speaker, the system might respond as a female speaker so as to make the dialogue most convenient for the speaking person.
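The complementary-personality idea could, for example, be realized by a simple rule table that maps detected properties of the caller to target properties of the synthetic voice; the rules below are purely illustrative assumptions, not a prescribed mapping.

```python
# Illustrative rule-based mapping from detected caller state to a complementary output voice.
def complementary_voice_profile(detected):
    """detected: dict with e.g. 'arousal' ('excited'/'calm') and 'gender' ('male'/'female')."""
    profile = {"speaking_rate": 1.0, "pitch_shift": 0.0, "character": "neutral"}
    if detected.get("arousal") == "excited":
        # An excited or confused caller is answered with a calm, confident, slower voice.
        profile.update(speaking_rate=0.85, character="calm-confident")
    if detected.get("gender") == "male":
        profile["voice_gender"] = "female"
    elif detected.get("gender") == "female":
        profile["voice_gender"] = "male"
    return profile
```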
It is a further aspect of the present invention to provide a system, an apparatus, a device, and/or the like for generating personality patterns and/or for synthesizing speech which is in each case capable of performing and/or realizing the inventive methods for generating personality patterns and/or for synthesizing speech and/or its steps.
According to a further aspect of the present invention, a computer program product is provided, comprising computer program means which is adapted to perform and/or to realize the inventive method for generating personality patterns and/or for synthesizing speech and/or the steps thereof when it is executed on a computer, a digital signal processing means, and/or the like.
The aspects of the present invention will be further elucidated by the following remarks:
After the identification of a speaker, both his relevant voice quality features and his speech itself - as described by any units, such as words, syllables, diphones, sentences, and/or the like - are automatically extracted according to the invention. Information about preferred sentence structure and word usage is also extracted and used to create a speech synthesis system with those characteristics in a completely unsupervised way.
The starting point for these inventive concepts is the lack of personality of current speech synthesis systems. Prior art systems are developed with text-to-speech (TTS) operation in mind, where intelligibility and naturalness of speech are most important. For dialogue systems, however, the personality of the dialogue partner is essential, too. Depending on the personality of the artificial dialogue partner, the speaker may or may not be interested in continuing the dialogue. Thus, adding a personality pattern to the speech generated by the device may be crucial for the success of the dialogue device.
Therefore, it is proposed to collect and store all information about the speaking style of the person making conversation with the system or device and to use said information to modify the speaking style of the device.
The proposed methods can be used not only to mimic the actual speaker talking to the device but also to equip the device with different personalities, e.g. gathered from the speaking style of famous people, movie stars, or the like. This can be very attractive for potential customers. The proposed system can be used not only to mimic the speaker's behavior but more generally to control the dialogue depending on the changing speaking style and emotions of the human partner.
The collection of features describing the speaker's personality can be done on different levels during the conversation of the human with a dialogue unit. In order to mimic the speaker's voice, the speech signal has to be recorded and segmented into phones, diphones, and/or other speech units or speech elements, depending on the speech synthesis method used in the system.
Prosodic features like pitch, pitch range, attitude of sentence intonation (monotonous or affected), loudness, speaking rate, durations of phones, and/or the like can be collected to characterize the speaker's prosody.
Voice quality features like phonation type, articulation manner, voice timbre, and/or the like can be automatically extracted from the collected speech data.
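Voice quality is harder to quantify automatically; as a rough approximation, assumed here only for illustration, one can compute spectral proxies such as spectral tilt (slope of the average log-magnitude spectrum) and spectral centroid, sketched below with numpy and librosa.

```python
# Sketch: crude spectral proxies for voice quality (tilt and centroid); librosa is assumed.
import numpy as np
import librosa


def voice_quality_proxies(y, sr):
    spec = np.abs(librosa.stft(y)) + 1e-10
    log_spec = np.mean(20.0 * np.log10(spec), axis=1)   # average log-magnitude spectrum
    freqs = librosa.fft_frequencies(sr=sr)
    tilt = np.polyfit(freqs, log_spec, 1)[0]             # dB per Hz; more negative = darker timbre
    centroid = float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)))
    return {"spectral_tilt_db_per_hz": float(tilt), "spectral_centroid_hz": centroid}
```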
Speaker identification or a speaker identification module is necessary for a proper functioning of the system.
The system can also collect all the words recognized from the utterances spoken by the speaker and generate and evaluate statistics on their usage. This can be used to find the most frequent phrases and words used by a given speaker, and/or the like. Syntactic information gathered from the recognized phrases can also enhance the quality of the personality description.
After all necessary information has been collected, the dialogue system can adjust parameters and units of acoustic output - for example the synthesized waveforms or the like - and modes of text generation to suit the recognized speaker's characteristics.
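One conceivable way to feed the collected pattern into the acoustic output stage, shown here only as a hypothetical sketch, is to translate a few of its values into prosody controls of a synthesizer, for example SSML prosody attributes.

```python
# Hypothetical mapping from a stored personality pattern to SSML prosody controls.
def pattern_to_ssml(text, pattern):
    """pattern: dict with optional 'pitch_mean_hz' and 'speaking_rate_wps' entries."""
    rate = pattern.get("speaking_rate_wps", 2.5) / 2.5   # 2.5 words/s taken as the neutral rate
    pitch = pattern.get("pitch_mean_hz", 120.0)
    return (f'<speak><prosody rate="{rate:.0%}" pitch="{pitch:.0f}Hz">'
            f"{text}</prosody></speak>")
```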
The parameterized personality can be stored for future use or can be preprogrammed in the dialogue device. The information can be used to recognize speakers and to change the personality of the system depending on the user's preference or mood, for example in the case of a system with a built-in emotion recognition engine.
The personality can be changed according to the user's wish, a preprogrammed sequence, or depending on the speaker's changing style and emotions.
The main advantage of such a system is the possibility of adapting the dialogue to the given speaker, making the dialogue more attractive, and/or the like. The possibility to mimic certain speakers or to switch between different personalities or speaking styles can be very entertaining and attractive for the user.
In the following, further advantages and aspects of the present invention will be described with reference to the accompanying figure.
Fig. 1
is a schematic block diagram describing a preferred embodiment of a method for synthesizing speech employing an embodiment of the inventive method for generating personality patterns.
The schematic block diagram of Fig. 1 shows a preferred embodiment of the inventive method for synthesizing speech employing an embodiment of the inventive method for generating personality patterns from a given received speech input SI.
In step S1, the speech input SI is received. In a first section S10 of the inventive method for synthesizing speech, non-acoustical features are extracted from the received speech input SI. In a second section S20 of the inventive method for synthesizing speech, acoustical features are extracted from the received speech input SI. The sections S10 and S20 can be performed in parallel or sequentially on a given device or apparatus.
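If sections S10 and S20 are to run in parallel, the orchestration might look like the following sketch using Python's standard thread pool; the two extractor callables are assumed to exist and are not defined by the figure.

```python
# Sketch: running the non-acoustical (S10) and acoustical (S20) sections in parallel.
from concurrent.futures import ThreadPoolExecutor


def run_sections(speech_input, extract_non_acoustic, extract_acoustic):
    with ThreadPoolExecutor(max_workers=2) as pool:
        s10 = pool.submit(extract_non_acoustic, speech_input)   # section S10
        s20 = pool.submit(extract_acoustic, speech_input)       # section S20
        return s10.result(), s20.result()
```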
In the first section S10 for extracting non-acoustical features from the speech input SI, speech parameters are extracted from said speech input SI in a first step S11. In a second step S12, the speech input SI is fed into a speech recognizer to analyze the content and the context of the received speech input SI.
Based on the recognition result, in a following step S13 contextual features are extracted from said speech input SI; in particular, syntactical, semantical, grammatical, and statistical information on particular speech elements is obtained.
In the embodiment of Fig. 1, the second section S20 of the inventive method for synthesizing speech consists of three steps S21, S22, and S23 to be performed independently of each other.
In the first step S21 of the second section S20 for extracting acoustical features, prosodic features are extracted from the received speech input SI. Said prosodic features may comprise pitch, pitch range, intonation attitude, loudness, speaking rate, speech element duration, and/or the like.
In a second step S22, voice quality features are extracted from the given received speech input SI, for instance phonation type, articulation manner, voice timbre features, and/or the like.
Finally, in a third and final step S23 of the second section S20, statistical/spectral features are extracted from the given speech input SI.
The non-acoustical features and the acoustical features obtained from sections S10 and S20 are merged in a following postprocessing step S30 to detect, model, and store a personality pattern PP for the given speaker.
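Step S30 could, for instance, merge the two feature dictionaries and persist them per speaker; the JSON file storage below is merely an assumed realization of the storing part of this step.

```python
# Sketch of step S30: merge feature sets and store the personality pattern per speaker.
import json
from pathlib import Path


def store_personality_pattern(speaker_id, non_acoustic, acoustic, directory="patterns"):
    pattern = {"speaker_id": speaker_id,
               "non_acoustic": non_acoustic,   # e.g. word statistics, syntax preferences
               "acoustic": acoustic}           # e.g. prosodic and voice quality features
    Path(directory).mkdir(exist_ok=True)
    (Path(directory) / f"{speaker_id}.json").write_text(json.dumps(pattern, indent=2))
    return pattern
```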
The data describing the personality pattern PP for the current speaker are fed into a following step S40 which includes the steps of speech synthesis, text generation, and dialogue managing from which a responsive speech output SO is generated and then output in a final step S50.

Claims (15)

  1. Method for generating personality patterns, in particular for synthesizing speech, wherein:
    speech input (SI) is received and/or preprocessed,
    acoustical and/or non-acoustical speech features (SF) are extracted from said speech input (SI),
    based on the extracted speech features (SF) or on models or parameters thereof a personality pattern (PP) is generated and/or stored.
  2. Method according to claim 1, wherein online input speech and/or speech of a speech data base for at least one given speaker are used for receiving said speech input (SI).
  3. Method according to any one of the preceding claims, wherein prosodic features, voice quality features, global statistical and/or spectral properties, and/or the like are used as acoustical features.
  4. Method according to claim 3, wherein pitch, pitch range, intonation attitude, loudness, speaking rate, phone duration, speech element duration features, and/or the like are used as prosodic features.
  5. Method according to any one of the claims 3 or 4, wherein phonation type, articulation manner, voice timbre features, and/or the like are used as voice quality features.
  6. Method according to any one of the preceding claims, wherein contextual features and/or the like are used as said non-acoustical features.
  7. Method according to claim 6, wherein syntactical, grammatical, semantical features, and/or the like are used as contextual features.
  8. Method according to any one of the claims 6 or 7, wherein statistical features on the usage, distribution, and/or probability of speech elements - such as words, subword units, syllables, phonemes, phones, and/or the like - and/or combinations of them within said speech input (SI) are used as non-acoustical features.
  9. Method according to any one of the preceding claims, wherein a process of speech recognition is carried out, in particular to prepare the extraction of contextual features and/or the like.
  10. Method according to any one of the preceding claims, wherein a process of speaker identification and/or adaptation is performed, in particular so as to increase the matching rate of the feature extraction and/or the recognition rate of the process of speech recognition.
  11. Method for synthesizing speech, in particular for a man-machine dialogue system, wherein the method for generating personality patterns according to any one of the claims 1 to 10 is employed.
  12. Method according to claim 11, wherein the method for generating personality patterns is essentially carried out in a preprocessing step, in particular based on a speech data base or the like.
  13. Method according to any one of the claims 11 or 12, wherein the method for generating personality patterns is carried out and/or continued in a continuous, real time, or online manner.
  14. System for generating personality patterns and/or for synthesizing speech which is capable of performing and/or realizing the method for generating personality patterns according to any one of the claims 1 to 10 and/or the method for synthesizing speech according to any one of the claims 11 to 13 and/or the steps thereof.
  15. Computer program product, comprising computer program means adapted to perform and/or to realize the method for generating personality patterns according to any one of the claims 1 to 10 and/or the method for synthesizing speech according to any one of the claims 11 to 13 and/or the steps thereof when it is executed on a computer, a digital signal processing means, and/or the like.
EP01115216A 2001-06-22 2001-06-22 Method for generating personality patterns and for synthesizing speech Withdrawn EP1271469A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP01115216A EP1271469A1 (en) 2001-06-22 2001-06-22 Method for generating personality patterns and for synthesizing speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP01115216A EP1271469A1 (en) 2001-06-22 2001-06-22 Method for generating personality patterns and for synthesizing speech

Publications (1)

Publication Number Publication Date
EP1271469A1 true EP1271469A1 (en) 2003-01-02

Family

ID=8177799

Family Applications (1)

Application Number Title Priority Date Filing Date
EP01115216A Withdrawn EP1271469A1 (en) 2001-06-22 2001-06-22 Method for generating personality patterns and for synthesizing speech

Country Status (1)

Country Link
EP (1) EP1271469A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5278943A (en) * 1990-03-23 1994-01-11 Bright Star Technology, Inc. Speech animation and inflection system
WO1999012324A1 (en) * 1997-09-02 1999-03-11 Jack Hollins Natural language colloquy system simulating known personality activated by telephone card
US6144938A (en) * 1998-05-01 2000-11-07 Sun Microsystems, Inc. Voice user interface with personality

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JANET E. CAHN: "The Generation of Affect in Synthesized Speech", JOURNAL OF THE AMERICAN VOICE I/O SOCIETY, vol. 8, July 1990 (1990-07-01), pages 1 - 19, XP002183399, Retrieved from the Internet <URL:http://www.media.mit.edu/~cahn/masters-thesis.htm> [retrieved on 20011120] *
KLASMEYER ET AL: "The perceptual importance of selected voice quality parameters", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1997. ICASSP-97., 1997 IEEE INTERNATIONAL CONFERENCE ON MUNICH, GERMANY 21-24 APRIL 1997, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 21 April 1997 (1997-04-21), pages 1615 - 1618, XP010226301, ISBN: 0-8186-7919-0 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7873390B2 (en) 2002-12-09 2011-01-18 Voice Signal Technologies, Inc. Provider-activated software for mobile communication devices
WO2004068466A1 (en) * 2003-01-24 2004-08-12 Voice Signal Technologies, Inc. Prosodic mimic synthesis method and apparatus
US8768701B2 (en) 2003-01-24 2014-07-01 Nuance Communications, Inc. Prosodic mimic method and apparatus
CN1742321B (en) * 2003-01-24 2010-08-18 语音信号科技公司 Prosodic mimic method and apparatus
WO2005081508A1 (en) * 2004-02-17 2005-09-01 Voice Signal Technologies, Inc. Methods and apparatus for replaceable customization of multimodal embedded interfaces
US8285549B2 (en) 2007-05-24 2012-10-09 Microsoft Corporation Personality-based device
US8131549B2 (en) * 2007-05-24 2012-03-06 Microsoft Corporation Personality-based device
AU2008256989B2 (en) * 2007-05-24 2012-07-19 Microsoft Technology Licensing, Llc Personality-based device
EP2147429A4 (en) * 2007-05-24 2011-10-19 Microsoft Corp Personality-based device
EP2147429A1 (en) * 2007-05-24 2010-01-27 Microsoft Corporation Personality-based device
WO2014024399A1 (en) * 2012-08-10 2014-02-13 Casio Computer Co., Ltd. Content reproduction control device, content reproduction control method and program
US9363378B1 (en) 2014-03-19 2016-06-07 Noble Systems Corporation Processing stored voice messages to identify non-semantic message characteristics
US9865281B2 (en) 2015-09-02 2018-01-09 International Business Machines Corporation Conversational analytics
US9922666B2 (en) 2015-09-02 2018-03-20 International Business Machines Corporation Conversational analytics
US11074928B2 (en) 2015-09-02 2021-07-27 International Business Machines Corporation Conversational analytics
CN110751940A (en) * 2019-09-16 2020-02-04 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for generating voice packet

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

AKX Designation fees paid
REG Reference to a national code

Ref country code: DE

Ref legal event code: 8566

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20030703