WO2014092666A1 - Personalized speech synthesis - Google Patents

Personalized speech synthesis

Info

Publication number
WO2014092666A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech synthesis
user
records
database
Prior art date
Application number
PCT/TR2013/000383
Other languages
French (fr)
Inventor
Mustafa Levent Arslan
Original Assignee
Sestek Ses Ve Iletisim Bilgisayar Teknolojileri Sanayii Ve Ticaret Anonim Sirketi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sestek Ses Ve Iletisim Bilgisayar Teknolojileri Sanayii Ve Ticaret Anonim Sirketi filed Critical Sestek Ses Ve Iletisim Bilgisayar Teknolojileri Sanayii Ve Ticaret Anonim Sirketi
Publication of WO2014092666A1 publication Critical patent/WO2014092666A1/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination

Abstract

The invention relates to a personalized speech synthesis system requiring the use of a voice recorder in order to record the user voice; wherein it comprises a voice database (1) where voice records having a sufficient length, belonging to a sufficient number of individuals, and performed in a soundproof and isolated environment are stored; a sufficient number of source speech synthesis modules (2) which allow synthesizing the texts in the electronic environment in a human-like way by using the obtained voice records; and a voice identification module (3) which determines the best matching record with the user voice by comparing the voice record of the user with the ones in the voice database (1), and which sets the source speech synthesis module (2) of the determined record as the personalized speech synthesis module (4).

Description

DESCRIPTION
PERSONALIZED SPEECH SYNTHESIS
TECHNICAL FIELD
The invention relates to a method for performing personalized speech synthesis using a voice identification technique, and to a system allowing the applicability thereof.
STATE OF THE ART
Text to Speech (TTS) means transforming texts in an electronic environment into speech in a human-like way. Today, the unit selection synthesis method, which performs synthesis by combining phonemes, is commonly used. In this method, a database is created from speech recordings segmented into phonemes; during synthesis, the best matching (i.e. fitting) phonemes are extracted from this database and joined together, thereby transforming text to speech.
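The unit-selection scheme just described can be sketched in a few lines (a simplified illustration: the unit records, their pitch fields, and the greedy search are hypothetical stand-ins for a real synthesiser's cost functions):

```python
# Toy unit-selection synthesis: pick, for each target phoneme, the stored
# unit that joins most smoothly with the previously chosen unit.

def join_cost(prev_unit, unit):
    """Toy join cost: pitch mismatch at the boundary between two units."""
    if prev_unit is None:
        return 0.0
    return abs(prev_unit["end_pitch"] - unit["start_pitch"])

def unit_selection(target_phonemes, unit_database):
    """Greedy selection: per phoneme, take the candidate with the lowest join cost."""
    selected, prev = [], None
    for ph in target_phonemes:
        best = min(unit_database[ph], key=lambda u: join_cost(prev, u))
        selected.append(best)
        prev = best
    return selected

# Tiny illustrative database with two candidate units for the phoneme "i".
db = {
    "h": [{"id": "h1", "start_pitch": 100, "end_pitch": 110}],
    "i": [{"id": "i1", "start_pitch": 108, "end_pitch": 112},
          {"id": "i2", "start_pitch": 140, "end_pitch": 150}],
}
print([u["id"] for u in unit_selection(["h", "i"], db)])  # ['h1', 'i1']
```

Real systems additionally use a target cost and a Viterbi search over the whole utterance rather than a greedy pass; the sketch only shows the join-cost idea.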
Concatenative speech synthesis performed by techniques such as unit selection is natural and intelligible only with databases comprising long-term human voice recordings. Therefore, long working hours as well as a stable speaker performance are required. Hence, speech synthesis can only be performed for a particular speaker, and personalized voice production cannot be achieved.

Within the state of the art, several studies on personalized speech synthesis have been conducted and there are some patents regarding the subject matter in the literature. For instance, the patent application no. US 20060074672 discloses a method comprising a database of stored speech segments for performing personalized speech synthesis; the method is based on comparing the speech segments in the database to pre-recorded voices of the user and storing the best matching ones in a new database. To that end, the user is asked to repeat a list of meaningless words. The required speech segments are obtained from the words spoken by the user and compared with the recorded speech segments. The best matching ones are labeled and transferred to a new database. This operation is performed for all meaningless words in the list. The database formed after all the words in the list are processed is then ready for personalized speech synthesis.

Another patent application disclosing a different method for personalized speech synthesis is the patent document no. US 20110165912. Said patent discloses a method wherein personalized speech features are obtained when preset keywords are used by the speaker (i.e. user), and wherein personalized speech synthesis is performed using the obtained speech features.
Both patents mentioned above are based on recording the voice of a user via a microphone belonging to the respective device in order to perform personalized speech synthesis. However, speech synthesis will fail if the voices are not recorded in an environment free of external noises, e.g. voice recording rooms/labs. Therefore, the applicability of said methods is low. Moreover, the speech synthesis modules need training data comprising long-term voice records in order to perform personalized speech synthesis. This, in turn, is time-consuming and demanding for the user.

Another patent document in the literature regarding the subject matter is the patent no. WO 02/069323. In this patent document, the speaker-dependent parameters are adapted using enrollment data from the new speaker, and after the adaptation, the speaker-dependent parameters are combined with the speaker-independent parameters, thereby producing personalized speech. This patent, just as the two US patents mentioned above, requires a considerable amount of training data in order to perform a new speech synthesis. This is tiresome for the user; besides, synthesis errors are inevitable unless a voice recording environment adequately free of external noises is provided.
As a result, an improvement in the related technical field mentioned above is deemed necessary.
BRIEF DESCRIPTION OF THE INVENTION
The present invention relates to a personalized speech synthesis method, and to a system having components allowing the applicability of said method, in order to eliminate the aforementioned disadvantages and to provide new advantages in the related technical field.
The primary object of the invention is to introduce a method for performing personalized speech synthesis in a quick manner, without requiring such processes as voice training, feature extraction, and encoding. Accordingly, the voice record belonging to an individual is compared with different voice records, for each of which a separate speech synthesis module is formed and which are stored in a database; the speech synthesis module of the best matching voice record is set as the personalized speech synthesis module.

Another object of the invention is to introduce a method for performing personalized speech synthesis in which synthesis errors resulting from noise are prevented. In this regard, the voice records in the voice database to be used for personalized speech synthesis are made in recording environments free of noise.
The present invention, in order to achieve all the objects mentioned above and to be understood from the detailed description below, relates to a personalized speech synthesis method. The present method is characterized by comprising the steps of;
a. recording voices having a sufficient length and belonging to a sufficient number of individuals in a soundproof and isolated environment, and storing them in a voice database,
b. forming separate source speech synthesis modules from voice records for each individual,
c. recording the voice of a user for a sufficient length,
d. comparing the voice record of the user with the voice records in the voice database by means of a voice identification module, and determining the best matching voice with the voice of the user, and
e. setting the source speech synthesis module of the determined voice record as the personalized speech synthesis module.
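Steps c to e amount to a similarity search over the database followed by a module lookup. A minimal sketch, assuming each stored record is summarized by a precomputed feature (here a single mean-pitch value, a hypothetical simplification) and a caller-supplied similarity function:

```python
# Steps d and e: find the stored voice most similar to the user's recording,
# then return that speaker's source synthesis module as the personalized module.

def pick_personalized_module(user_feature, voice_database, synthesis_modules, similarity):
    best_speaker = max(voice_database,
                       key=lambda spk: similarity(user_feature, voice_database[spk]))
    return synthesis_modules[best_speaker]

# Toy data: one mean-pitch feature per stored speaker, one module per speaker.
db = {"spk_a": 120.0, "spk_b": 210.0}
modules = {"spk_a": "TTS(spk_a)", "spk_b": "TTS(spk_b)"}
similarity = lambda u, v: -abs(u - v)  # closer pitch -> higher similarity

print(pick_personalized_module(205.0, db, modules, similarity))  # TTS(spk_b)
```

Because the synthesis modules are built once per stored speaker (steps a and b), selecting a personalized voice for a new user costs only one comparison pass, not a new training run.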
Another preferred embodiment of the method according to the invention comprises performing filtering according to the age, gender, accent, etc. of the user among the voice records in the voice database, prior to process step d.

Another preferred embodiment of the method according to the invention is that it is repeatable starting from process step c, so as to perform speech synthesis suited for different users. In other words, once the voice database is created, process steps a and b need not be performed again when determining the suitable voice for another user.

The present invention, in order to achieve all the objects mentioned above and to be understood from the detailed description below, also relates to a personalized speech synthesis system requiring the use of a voice recorder in order to record the user voice. The present system is characterized by comprising the following:
- a voice database where voice records having a sufficient length, belonging to a sufficient number of individuals, and performed in a soundproof and isolated environment are stored,
- a sufficient number of source speech synthesis modules for synthesizing the texts in the electronic environment in a human-like way by using the obtained voice records, and
- a voice identification module which determines the best matching record with the user voice by comparing the voice records of the user with the ones in voice database, and which sets the source speech synthesis module of the determined record as the personalized speech synthesis module.
In another preferred embodiment of the invention, at least one module capable of performing filtering among the voice records inside the voice database according to criteria such as age, gender, accent, etc. is comprised. Hence, a search in a narrower area in accordance with user features is performed in order to find the best matching voice with that of the user, thereby achieving the result in a shorter time.

In another preferred embodiment of the invention, the voice database comprises voice records belonging to a wide population of individuals so as to include different tones of voice, age groups, genders, and accents. Thus, the likelihood of finding the voice record within the voice database that resembles the voice of any speaker to the maximum degree is increased.
In order for the embodiment of the present invention and the advantages thereof, together with the additional elements, to be best understood, it should be evaluated with reference to the figures described below.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic view of the system allowing the applicability of the speech synthesis method according to the invention.
Fig. 2 is the flow diagram of the speech synthesis method according to the invention.
REFERENCE NUMERALS
1 Voice Database
2 Source Speech Synthesis Module
3 Voice Identification Module
4 Personalized Speech Synthesis Module
5 Voice Recorder
DETAILED DESCRIPTION OF THE INVENTION
In this detailed description, the novelty according to the invention is explained, without any limitations, by way of illustrations in order for the invention to be better understood. Accordingly, the description below and the figures disclose a method for performing personalized speech synthesis using a voice identification technique, and a system having components that allow the applicability thereof.
In Fig. 1, a schematic view showing the components allowing the applicability of the speech synthesis method is given. Referring to this figure, a voice database (1) is provided which stores the voice records of several individuals, said records being made in special environments with soundproofing properties. Separate source speech synthesis modules (2) have been developed for each one of the voice records, for synthesizing the texts in the electronic environment in a human-like way. Said source speech synthesis modules (2) do not require the use of any particular synthesis technique. However, the HMM-based speech synthesis technique is preferentially used because it allows speaker adaptation and requires only a small amount of voice training data.
The voice of a speaker, the personalized speech synthesis of whom is desired to be performed, is received by means of a voice recorder. Said voice recorder may be chosen from the devices having a microphone, for example smartphones, tablet computers, laptops, and desktops. The voice record of the speaker is a short record made up of a few words. The obtained voice record is compared with the voice records in the voice database (1) by means of a voice identification module (3) and the best matching voice with the voice of the user is determined.
The voice database (1) consists of voice records belonging to a wide population of individuals so as to include different tones of voice, age groups, genders, and accents. Thus, the likelihood of finding the voice record within the voice database (1) that resembles the voice of any user to the maximum degree is increased.
Modules capable of performing filtering among the voice records inside the voice database (1) according to criteria such as age, gender, accent, etc. may also be used. These modules determine the voice records that are most likely to resemble the user voice inside the voice database (1), in accordance with the features of the user. Afterwards, the voice identification module (3) performs the comparison between these determined voice records and the voice record of the user. In this case, a search in a narrower area in accordance with user features is performed, rather than conducting a search across the whole voice database (1), in order to find the best matching voice. Thus, the best matching voice with that of the user is found in a shorter time.
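The pre-filtering step can be sketched as a plain metadata match (the field names gender, age_group, and accent are hypothetical; the patent only names the criteria):

```python
# Restrict the identification search to records whose metadata matches
# every supplied criterion (age group, gender, accent, ...).

def filter_records(voice_database, **criteria):
    return {
        rec_id: rec
        for rec_id, rec in voice_database.items()
        if all(rec["meta"].get(key) == value for key, value in criteria.items())
    }

db = {
    "r1": {"meta": {"gender": "f", "age_group": "30-40", "accent": "aegean"}},
    "r2": {"meta": {"gender": "m", "age_group": "30-40", "accent": "aegean"}},
    "r3": {"meta": {"gender": "m", "age_group": "50-60", "accent": "black_sea"}},
}
print(sorted(filter_records(db, gender="m", age_group="30-40")))  # ['r2']
```

The voice identification module (3) then compares the user record only against the shortlist, which is what shortens the search.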
In accordance with the explanations above, the method according to the invention which enables the personalized speech synthesis to be performed comprises the following process steps:
- obtaining different voice records from a sufficiently wide population of individuals,
- forming separate source speech synthesis modules (2) from voice records for each individual,
- obtaining a voice record of a few words from the user by means of the voice recorder,
- optionally, performing filtering inside the voice database (1) in accordance with user features such as age, gender, and accent,
- comparing the voice record of the user with the filtered voice records, or with the voice records in the voice database (1) by means of the voice identification module (3), and determining the best matching voice record with that of the user, and
- setting the source speech synthesis module (2) of the determined voice record as the personalized speech synthesis module (4).
Subsequent to these process steps, any text in the electronic environment is synthesized with the voice most closely resembling that of the user, by means of the source speech synthesis module (2) set as the personalized speech synthesis module (4).
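One common way to realize the comparison performed by the voice identification module (3) (an assumption for illustration; the patent does not prescribe a particular technique) is to summarize each voice as a feature vector, e.g. averaged spectral features, and rank the stored records by cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(user_vector, database_vectors):
    """Id of the stored voice whose feature vector is closest to the user's."""
    return max(database_vectors,
               key=lambda rec_id: cosine_similarity(user_vector, database_vectors[rec_id]))

# Toy three-dimensional feature vectors for two stored speakers.
db_vectors = {"spk_a": [1.0, 0.1, 0.0], "spk_b": [0.2, 0.9, 0.4]}
print(best_match([0.9, 0.2, 0.1], db_vectors))  # spk_a
```

A production system would extract such vectors from the short user recording (a few words suffice, as the description notes), which is why no long enrollment session is needed.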
One of the important fields of application of the personalized speech synthesis method according to the invention is the health sector. In this field, the system serves to obtain voice records of individuals suffering from full or partial voice loss, particularly those with larynx cancer, before surgery, and to use them for determining the best matching voice with that of the patient, thus performing speech synthesis. In addition, a text translated from a source language can be read in the target language with the voice of the user himself/herself. For example, after translating a text in Turkish into English, the voice most closely resembling that of the speaker is determined by means of a voice record of a few words belonging to the speaker, and the English text can be synthesized as if it were read by the speaker himself/herself in English. Moreover, the personalized speech synthesis method according to the invention can be used practically and in a short time in any field where personalized vocalization is necessary, for example in various mobile applications and in the field of entertainment.

Claims

1. A personalized speech synthesis method, characterized in comprising the following process steps: a. recording the voices having a sufficient length and belonging to a sufficient number of individuals in a soundproof and isolated environment, and storing them in a voice database (1),
b. forming separate source speech synthesis modules (2) from voice records for each individual,
c. recording the voice of a user to be long enough,
d. comparing the voice record of the user with the voice records in the voice database (1) by means of a voice identification module (3), and determining the best matching voice with the voice of the user, and
e. setting the source speech synthesis module (2) of the determined voice record as the personalized speech synthesis module (4).
2. The personalized speech synthesis method according to Claim 1, characterized in that filtering is performed according to the age, gender, accent, etc. of the user among the voice records in the voice database (1), prior to process step d.
3. The personalized speech synthesis method according to Claim 1, characterized in that it is repeatable starting from process step c, so as to perform speech synthesis suited for different users.
4. A personalized speech synthesis system requiring the use of a voice recorder in order to record the user voice, characterized in comprising;
- a voice database (1) where voice records having a sufficient length, belonging to a sufficient number of individuals, and performed in a soundproof and isolated environment are stored,
- a sufficient number of source speech synthesis modules (2) for synthesizing the texts in the electronic environment in a human-like way by using the obtained voice records, and
- a voice identification module (3) which determines the best matching record with the user voice by comparing the voice records of the user with the ones in voice database (1), and which sets the source speech synthesis module (2) of the determined record as the personalized speech synthesis module (4).
5. The personalized speech synthesis system according to Claim 4, characterized in comprising at least one module capable of performing filtering among voice records inside the voice database (1) according to criteria such as age, gender, accent, etc.
6. The personalized speech synthesis system according to Claim 4, characterized in that the voice database (1) comprises voice records belonging to a wide population of individuals so as to include different tones of voice, age groups, genders, and accents.
PCT/TR2013/000383 2012-12-13 2013-12-13 Personalized speech synthesis WO2014092666A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
TR2012/14614 2012-12-13
TR201214614 2012-12-13

Publications (1)

Publication Number Publication Date
WO2014092666A1 true WO2014092666A1 (en) 2014-06-19

Family

ID=50073414

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/TR2013/000383 WO2014092666A1 (en) 2012-12-13 2013-12-13 Personalized speech synthesis

Country Status (1)

Country Link
WO (1) WO2014092666A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5911129A (en) * 1996-12-13 1999-06-08 Intel Corporation Audio font used for capture and rendering
WO2002069323A1 (en) 2001-02-26 2002-09-06 Matsushita Electric Industrial Co., Ltd. Voice personalization of speech synthesizer
US20040111271A1 (en) * 2001-12-10 2004-06-10 Steve Tischer Method and system for customizing voice translation of text to speech
US20090043583A1 (en) * 2007-08-08 2009-02-12 International Business Machines Corporation Dynamic modification of voice selection based on user specific factors
US20110165912A1 (en) 2010-01-05 2011-07-07 Sony Ericsson Mobile Communications Ab Personalized text-to-speech synthesis and personalized speech feature extraction
US20110238407A1 (en) * 2009-08-31 2011-09-29 O3 Technologies, Llc Systems and methods for speech-to-speech translation



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13829021

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13829021

Country of ref document: EP

Kind code of ref document: A1