WO2014092666A1 - Personalized speech synthesis - Google Patents

Personalized speech synthesis

Info

Publication number
WO2014092666A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech synthesis
user
records
database
Prior art date
Application number
PCT/TR2013/000383
Other languages
French (fr)
Inventor
Mustafa Levent Arslan
Original Assignee
Sestek Ses Ve Iletisim Bilgisayar Teknolojileri Sanayii Ve Ticaret Anonim Sirketi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sestek Ses Ve Iletisim Bilgisayar Teknolojileri Sanayii Ve Ticaret Anonim Sirketi filed Critical Sestek Ses Ve Iletisim Bilgisayar Teknolojileri Sanayii Ve Ticaret Anonim Sirketi
Publication of WO2014092666A1 publication Critical patent/WO2014092666A1/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination

Abstract

The invention relates to a personalized speech synthesis system requiring the use of a voice recorder in order to record the user voice; wherein it comprises a voice database (1) where voice records having a sufficient length, belonging to a sufficient number of individuals, and performed in a soundproof and isolated environment are stored; a sufficient number of source speech synthesis modules (2) which allow synthesizing the texts in the electronic environment in a human-like way by using the obtained voice records; and a voice identification module (3) which determines the best matching record with the user voice by comparing the voice record of the user with the ones in the voice database (1), and which sets the source speech synthesis module (2) of the determined record as the personalized speech synthesis module (4).

Description

DESCRIPTION
PERSONALIZED SPEECH SYNTHESIS
TECHNICAL FIELD
The invention relates to a method for performing personalized speech synthesis using a voice identification technique, and to a system allowing the applicability thereof.
STATE OF THE ART
Text to Speech (TTS) means transforming texts in an electronic environment into speech in a human-like way. Today, the unit selection synthesis method, which performs synthesis by combining phonemes, is commonly used. In this method, a database is created from speech recordings segmented into phonemes; during synthesis, the best matching (i.e. fitting) phonemes are extracted from this database and joined together, thereby transforming text to speech.
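The unit-selection scheme just described can be sketched in a few lines (a simplified illustration: the unit records, their pitch fields, and the greedy search are hypothetical stand-ins for a real synthesiser's cost functions):

```python
# Toy unit-selection synthesis: pick, for each target phoneme, the stored
# unit that joins most smoothly with the previously chosen unit.

def join_cost(prev_unit, unit):
    """Toy join cost: pitch mismatch at the boundary between two units."""
    if prev_unit is None:
        return 0.0
    return abs(prev_unit["end_pitch"] - unit["start_pitch"])

def unit_selection(target_phonemes, unit_database):
    """Greedy selection: per phoneme, take the candidate with the lowest join cost."""
    selected, prev = [], None
    for ph in target_phonemes:
        best = min(unit_database[ph], key=lambda u: join_cost(prev, u))
        selected.append(best)
        prev = best
    return selected

# Tiny illustrative database with two candidate units for the phoneme "i".
db = {
    "h": [{"id": "h1", "start_pitch": 100, "end_pitch": 110}],
    "i": [{"id": "i1", "start_pitch": 108, "end_pitch": 112},
          {"id": "i2", "start_pitch": 140, "end_pitch": 150}],
}
print([u["id"] for u in unit_selection(["h", "i"], db)])  # ['h1', 'i1']
```

Real systems additionally use a target cost and a Viterbi search over the whole utterance rather than a greedy pass; the sketch only shows the join-cost idea.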
Concatenative speech synthesis performed by techniques such as unit selection is natural and intelligible only with databases comprising long-term human voice recordings. Therefore, long working hours as well as a stable speaker performance are required. Hence, speech synthesis can only be performed for a particular speaker, and personalized voice production cannot be achieved.

Within the state of the art, several studies on personalized speech synthesis have been conducted and there are some patents regarding the subject matter in the literature. For instance, the patent application no. US 20060074672 discloses a method comprising a database of stored speech segments for performing personalized speech synthesis; the method is based on comparing the speech segments in the database to pre-recorded voices of the user and storing the best matching ones in a new database. To that end, the user is asked to repeat a list of meaningless words. The required speech segments are obtained from the words spoken by the user and compared with the recorded speech segments. The best matching ones are labeled and transferred to a new database. This operation is performed for all meaningless words in the list. The database formed after all the words in the list are processed is then ready for personalized speech synthesis.

Another patent application disclosing a different method for personalized speech synthesis is the patent document no. US 20110165912. Said patent discloses a method wherein personalized speech features are obtained when preset keywords are used by the speaker (i.e. user), and wherein personalized speech synthesis is performed using the obtained speech features.
Both patents mentioned above are based on recording the voice of a user via a microphone belonging to the respective device in order to perform personalized speech synthesis. However, speech synthesis will fail if the voices are not recorded in an environment free of external noises, e.g. voice recording rooms/labs. Therefore, the applicability of said methods is low. Moreover, the speech synthesis modules need training data comprising long-term voice records in order to perform personalized speech synthesis. This, in turn, is time-consuming and demanding for the user.

Another patent document in the literature regarding the subject matter is the patent no. WO 02/069323. In this patent document, the speaker-dependent parameters are adapted using enrollment data from the new speaker, and after the adaptation, the speaker-dependent parameters are combined with the speaker-independent parameters, thereby producing personalized speech. This patent, just as the two US patents mentioned above, requires a considerable amount of training data in order to perform a new speech synthesis. This is tiresome for the user; besides, synthesis errors are inevitable unless a voice recording environment adequately free of external noises is provided.
As a result, an improvement in the related technical field mentioned above is deemed necessary.
BRIEF DESCRIPTION OF THE INVENTION
The present invention relates to a personalized speech synthesis method, and to a system having components allowing the applicability of said method, in order to eliminate the aforementioned disadvantages and to provide new advantages in the related technical field.
The primary object of the invention is to introduce a method for performing personalized speech synthesis in a quick manner, without requiring such processes as voice training, feature extraction, and encoding. Accordingly, the voice record belonging to an individual is compared with different voice records, for each of which a separate speech synthesis module is formed and which are stored in a database; the speech synthesis module of the best matching voice record is set as the personalized speech synthesis module.

Another object of the invention is to introduce a method for performing personalized speech synthesis in which synthesis errors resulting from noise are prevented. In this regard, the voice records in the voice database to be used for personalized speech synthesis are made in recording environments free of noise.
The present invention, in order to achieve all the objects mentioned above and to be understood from the detailed description below, relates to a personalized speech synthesis method. The present method is characterized by comprising the steps of;
a. recording voices having a sufficient length and belonging to a sufficient number of individuals in a soundproof and isolated environment, and storing them in a voice database,
b. forming separate source speech synthesis modules from voice records for each individual,
c. recording the voice of a user for a sufficient length,
d. comparing the voice record of the user with the voice records in the voice database by means of a voice identification module, and determining the best matching voice with the voice of the user, and
e. setting the source speech synthesis module of the determined voice record as the personalized speech synthesis module.
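Steps c to e amount to a similarity search over the database followed by a module lookup. A minimal sketch, assuming each stored record is summarized by a precomputed feature (here a single mean-pitch value, a hypothetical simplification) and a caller-supplied similarity function:

```python
# Steps d and e: find the stored voice most similar to the user's recording,
# then return that speaker's source synthesis module as the personalized module.

def pick_personalized_module(user_feature, voice_database, synthesis_modules, similarity):
    best_speaker = max(voice_database,
                       key=lambda spk: similarity(user_feature, voice_database[spk]))
    return synthesis_modules[best_speaker]

# Toy data: one mean-pitch feature per stored speaker, one module per speaker.
db = {"spk_a": 120.0, "spk_b": 210.0}
modules = {"spk_a": "TTS(spk_a)", "spk_b": "TTS(spk_b)"}
similarity = lambda u, v: -abs(u - v)  # closer pitch -> higher similarity

print(pick_personalized_module(205.0, db, modules, similarity))  # TTS(spk_b)
```

Because the synthesis modules are built once per stored speaker (steps a and b), selecting a personalized voice for a new user costs only one comparison pass, not a new training run.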
Another preferred embodiment of the method according to the invention comprises performing filtering according to the age, gender, accent, etc. of the user among the voice records in the voice database, prior to process step d.

Another preferred embodiment of the method according to the invention is that it is repeatable starting from process step c, so as to perform speech synthesis suited for different users. In other words, once the voice database is created, process steps a and b need not be performed again when determining the suitable voice for another user.

The present invention, in order to achieve all the objects mentioned above and to be understood from the detailed description below, also relates to a personalized speech synthesis system requiring the use of a voice recorder in order to record the user voice. The present system is characterized by comprising the following:
- a voice database where voice records having a sufficient length, belonging to a sufficient number of individuals, and performed in a soundproof and isolated environment are stored,
- a sufficient number of source speech synthesis modules for synthesizing the texts in the electronic environment in a human-like way by using the obtained voice records, and
- a voice identification module which determines the best matching record with the user voice by comparing the voice records of the user with the ones in voice database, and which sets the source speech synthesis module of the determined record as the personalized speech synthesis module.
In another preferred embodiment of the invention, at least one module capable of performing filtering among the voice records inside the voice database according to criteria such as age, gender, accent, etc. is comprised. Hence, a search in a narrower area in accordance with user features is performed in order to find the best matching voice with that of the user, thereby achieving the result in a shorter time.

In another preferred embodiment of the invention, the voice database comprises voice records belonging to a wide population of individuals so as to include different tones of voice, age groups, genders, and accents. Thus, the likelihood of finding the voice record within the voice database that resembles the voice of any speaker to the maximum degree is increased.
In order for the embodiment of the present invention and the advantages thereof, together with the additional elements, to be best understood, it should be evaluated with reference to the figures described below.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic view of the system allowing the applicability of the speech synthesis method according to the invention.
Fig. 2 is the flow diagram of the speech synthesis method according to the invention.
REFERENCE NUMERALS
1 Voice Database
2 Source Speech Synthesis Module
3 Voice Identification Module
4 Personalized Speech Synthesis Module
5 Voice Recorder
DETAILED DESCRIPTION OF THE INVENTION
In this detailed description, the novelty according to the invention is explained, without any limitations, by way of illustrations in order for the invention to be better understood. Accordingly, the description below and the figures disclose a method for performing personalized speech synthesis using a voice identification technique, and a system having components that allow the applicability thereof.
In Fig. 1, a schematic view showing the components allowing the applicability of the speech synthesis method is given. Referring to this figure, a voice database (1) is provided which stores the voice records of several individuals, said records being made in special environments with soundproofing properties. Separate source speech synthesis modules (2) have been developed for each one of the voice records, for synthesizing the texts in the electronic environment in a human-like way. Said source speech synthesis modules (2) do not require the use of any particular synthesis technique. However, the HMM-based speech synthesis technique is preferentially used because it allows speaker adaptation and requires only a small amount of voice training data.
The voice of a speaker, the personalized speech synthesis of whom is desired to be performed, is received by means of a voice recorder. Said voice recorder may be chosen from the devices having a microphone, for example smartphones, tablet computers, laptops, and desktops. The voice record of the speaker is a short record made up of a few words. The obtained voice record is compared with the voice records in the voice database (1) by means of a voice identification module (3) and the best matching voice with the voice of the user is determined.
The voice database (1) consists of voice records belonging to a wide population of individuals so as to include different tones of voice, age groups, genders, and accents. Thus, the likelihood of finding the voice record within the voice database (1) that resembles the voice of any user to the maximum degree is increased.
Modules capable of performing filtering among the voice records inside the voice database (1) according to criteria such as age, gender, accent, etc. may also be used. These modules determine the voice records that are most likely to resemble the user voice inside the voice database (1), in accordance with the features of the user. Afterwards, the voice identification module (3) performs the comparison between these determined voice records and the voice record of the user. In this case, a search in a narrower area in accordance with user features is performed, rather than conducting a search across the whole voice database (1), in order to find the best matching voice. Thus, the best matching voice with that of the user is found in a shorter time.
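The pre-filtering step can be sketched as a plain metadata match (the field names gender, age_group, and accent are hypothetical; the patent only names the criteria):

```python
# Restrict the identification search to records whose metadata matches
# every supplied criterion (age group, gender, accent, ...).

def filter_records(voice_database, **criteria):
    return {
        rec_id: rec
        for rec_id, rec in voice_database.items()
        if all(rec["meta"].get(key) == value for key, value in criteria.items())
    }

db = {
    "r1": {"meta": {"gender": "f", "age_group": "30-40", "accent": "aegean"}},
    "r2": {"meta": {"gender": "m", "age_group": "30-40", "accent": "aegean"}},
    "r3": {"meta": {"gender": "m", "age_group": "50-60", "accent": "black_sea"}},
}
print(sorted(filter_records(db, gender="m", age_group="30-40")))  # ['r2']
```

The voice identification module (3) then compares the user record only against the shortlist, which is what shortens the search.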
In accordance with the explanations above, the method according to the invention which enables the personalized speech synthesis to be performed comprises the following process steps:
- obtaining different voice records from a sufficiently wide population of individuals,
- forming separate source speech synthesis modules (2) from voice records for each individual,
- obtaining a voice record of a few words from the user by means of the voice recorder,
- optionally, performing filtering inside the voice database (1) in accordance with user features such as age, gender, and accent,
- comparing the voice record of the user with the filtered voice records, or with the voice records in the voice database (1) by means of the voice identification module (3), and determining the best matching voice record with that of the user, and
- setting the source speech synthesis module (2) of the determined voice record as the personalized speech synthesis module (4).
Subsequent to these process steps, any text in the electronic environment is synthesized with the voice most closely resembling that of the user, by means of the source speech synthesis module (2) set as the personalized speech synthesis module (4).
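One common way to realize the comparison performed by the voice identification module (3) (an assumption for illustration; the patent does not prescribe a particular technique) is to summarize each voice as a feature vector, e.g. averaged spectral features, and rank the stored records by cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(user_vector, database_vectors):
    """Id of the stored voice whose feature vector is closest to the user's."""
    return max(database_vectors,
               key=lambda rec_id: cosine_similarity(user_vector, database_vectors[rec_id]))

# Toy three-dimensional feature vectors for two stored speakers.
db_vectors = {"spk_a": [1.0, 0.1, 0.0], "spk_b": [0.2, 0.9, 0.4]}
print(best_match([0.9, 0.2, 0.1], db_vectors))  # spk_a
```

A production system would extract such vectors from the short user recording (a few words suffice, as the description notes), which is why no long enrollment session is needed.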
One of the important fields of application of the personalized speech synthesis method according to the invention is the health sector. In this field, the system serves to obtain voice records of individuals suffering from full or partial voice loss, particularly those with larynx cancer, before surgery, and to use them for determining the best matching voice with that of the patient, thus performing speech synthesis. In addition, a text translated from a source language can be read in the target language with the voice of the user himself/herself. For example, after translating a text in Turkish into English, the voice most closely resembling that of the speaker is determined by means of a voice record of a few words belonging to the speaker, and the English text can be synthesized as if it were read by the speaker himself/herself in English. Moreover, the personalized speech synthesis method according to the invention can be used practically and in a short time in any field where personalized vocalization is necessary, for example in various mobile applications and in the field of entertainment.

Claims

1. A personalized speech synthesis method, characterized in comprising the following process steps: a. recording the voices having a sufficient length and belonging to a sufficient number of individuals in a soundproof and isolated environment, and storing them in a voice database (1),
b. forming separate source speech synthesis modules (2) from voice records for each individual,
c. recording the voice of a user to be long enough,
d. comparing the voice record of the user with the voice records in the voice database (1) by means of a voice identification module (3), and determining the best matching voice with the voice of the user, and
e. setting the source speech synthesis module (2) of the determined voice record as the personalized speech synthesis module (4).
2. The personalized speech synthesis method according to Claim 1, characterized in that filtering is performed according to the age, gender, accent, etc. of the user among the voice records in the voice database (1), prior to process step d.
3. The personalized speech synthesis method according to Claim 1, characterized in that it is repeatable starting from process step c, so as to perform speech synthesis suited for different users.
4. A personalized speech synthesis system requiring the use of a voice recorder in order to record the user voice, characterized in comprising;
- a voice database (1) where voice records having a sufficient length, belonging to a sufficient number of individuals, and performed in a soundproof and isolated environment are stored,
- a sufficient number of source speech synthesis modules (2) for synthesizing the texts in the electronic environment in a human-like way by using the obtained voice records, and
- a voice identification module (3) which determines the best matching record with the user voice by comparing the voice records of the user with the ones in voice database (1), and which sets the source speech synthesis module (2) of the determined record as the personalized speech synthesis module (4).
5. The personalized speech synthesis system according to Claim 4, characterized in comprising at least one module capable of performing filtering among voice records inside the voice database (1) according to criteria such as age, gender, accent, etc.
6. The personalized speech synthesis system according to Claim 4, characterized in that the voice database (1) comprises voice records belonging to a wide population of individuals so as to include different tones of voice, age groups, genders, and accents.
PCT/TR2013/000383 2012-12-13 2013-12-13 Personalized speech synthesis WO2014092666A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
TR2012/14614 2012-12-13
TR201214614 2012-12-13

Publications (1)

Publication Number Publication Date
WO2014092666A1 true WO2014092666A1 (en) 2014-06-19

Family

ID=50073414

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/TR2013/000383 WO2014092666A1 (en) 2012-12-13 2013-12-13 Personalized speech synthesis

Country Status (1)

Country Link
WO (1) WO2014092666A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5911129A (en) * 1996-12-13 1999-06-08 Intel Corporation Audio font used for capture and rendering
WO2002069323A1 (en) 2001-02-26 2002-09-06 Matsushita Electric Industrial Co., Ltd. Voice personalization of speech synthesizer
US20040111271A1 (en) * 2001-12-10 2004-06-10 Steve Tischer Method and system for customizing voice translation of text to speech
US20090043583A1 (en) * 2007-08-08 2009-02-12 International Business Machines Corporation Dynamic modification of voice selection based on user specific factors
US20110165912A1 (en) 2010-01-05 2011-07-07 Sony Ericsson Mobile Communications Ab Personalized text-to-speech synthesis and personalized speech feature extraction
US20110238407A1 (en) * 2009-08-31 2011-09-29 O3 Technologies, Llc Systems and methods for speech-to-speech translation



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13829021

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13829021

Country of ref document: EP

Kind code of ref document: A1