WO2004079719A1

WO2004079719A1 - Device for indexing a continuous audio signal of undetermined length

Info

Publication number: WO2004079719A1
Application number: PCT/FR2004/000152
Authority: WO
Inventors: Ghislain Moncomble; Thierry Milin
Original assignee: France Telecom
Priority date: 2003-01-30
Filing date: 2004-01-23
Publication date: 2004-09-16
Also published as: FR2850783A1

Abstract

The invention relates to a device which determines the contexts of an audio signal (SA). The inventive device comprises a filter (1) which filters the audio signal into a voice signal (SV) and a noisy signal, an analyser (2) which analyses the voice signal in order to produce voice parameters, and a voice-recognition module (3) which converts the voice signal into a text signal (ST). The aforementioned text signal is divided into periodic time text segments (Sn). According to the invention, one unit (5) determines a context (CSn) of the current segment according to the voice parameters and the text segment. Another unit (6) determines an upper general-context time boundary which merges with an upper time boundary of the current segment when the contexts of the current segment and the preceding segment are similar and which remains merged with an upper time boundary of the preceding segment when the contexts are not similar.

Description

Device for indexing a continuous audio signal of indefinite duration

The present invention relates to a device for indexing a continuous audio signal of indefinite duration.

The development of telecommunications has led to the explosion of the quantity of information to be processed and in parallel, the need for automatic classification of information. While techniques have existed for a long time for processing textual information, techniques for processing audio information are currently in full development. Speech recognition or automatic translation is based on techniques resulting in part from linguistic studies. These notably use vocabulary dictionaries, the application of grammatical rules and the conjugation of verbs, and more recently the definition of contexts.

Determining the context of a multimedia document consists in defining, by analyzing the document, its subject and its meaning in order to improve transcriptions of the document into a text or audio document. Instead of simply applying correspondences, for example between a sequence of phonemes and its textual representation, a general context of the multimedia document is also considered in order to minimize the risk of misinterpreting the sequence of phonemes. For example, if the general context of the multimedia document is "the days of the week", the sequence of phonemes "[s] [a] [m] [d] [i]" will be interpreted by a context-aware speech recognition engine as the word "Saturday" (French "samedi") and not as the expression "ça me dire".
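The disambiguation described above can be sketched as follows. This is only an illustration: the function name `pick_transcription` and the word-overlap score are assumptions made for the example and are not taken from the description, which does not specify how a recognition engine weighs the context.

```python
# Hypothetical illustration: the scoring rule is an assumption, not the patented method.
def pick_transcription(candidates, context_keywords):
    """Return the candidate sharing the most words with the context keywords.

    Ties are resolved in favour of the earliest candidate in the list.
    """
    keywords = {k.lower() for k in context_keywords}

    def score(candidate):
        # Count the words of the candidate that belong to the context.
        return sum(1 for word in candidate.lower().split() if word in keywords)

    return max(candidates, key=score)


# Two textual readings of the same phoneme sequence [s][a][m][d][i].
candidates = ["ça me dire", "samedi"]
days_of_week = ["lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche"]
best = pick_transcription(candidates, days_of_week)
```

With the "days of the week" context, the second reading is preferred; without any context, the sketch simply falls back to the first candidate.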

A context is made up of a list of key words or expressions and their equivalents. Each key word or expression characterizes a context that can be addressed in any multimedia document. Certain contexts are combinations of contexts, or in the case of current or regional contexts, combinations of contexts specified by a proper name, such as for example: Brittany Weather, Afghanistan War, etc.
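One possible in-memory representation of such a context is sketched below. The class `Context`, the helper `combine` and the sample key words are hypothetical and serve only to illustrate "key words and their equivalents" and the combination of contexts such as "Brittany Weather".

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical container: a named list of key words with their equivalents."""
    name: str
    # Maps each (lower-case) key word or expression to its set of equivalents.
    keywords: dict = field(default_factory=dict)

    def canonical(self, term):
        """Return the key word that `term` stands for, or None if unrelated."""
        term = term.lower()
        for key, equivalents in self.keywords.items():
            if term == key or term in equivalents:
                return key
        return None


def combine(name, *contexts):
    """Build a combined context without modifying the contexts it is made of."""
    merged = {}
    for ctx in contexts:
        for key, equivalents in ctx.keywords.items():
            merged.setdefault(key, set()).update(equivalents)
    return Context(name, merged)


weather = Context("Weather", {"rain": {"shower", "downpour"}, "wind": {"gale"}})
brittany = Context("Brittany", {"brittany": {"bretagne"}})
brittany_weather = combine("Brittany Weather", brittany, weather)
```

Looking up `brittany_weather.canonical("Shower")` returns the key word `"rain"`, while a term belonging to neither component context returns `None`.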

US Patent 6,434,520 discloses a system for indexing segments of a multimedia document, particularly an audio document, in a database according to information characterizing the document, such as the identity of the speaker and the sound environment of the document, but also according to the context of the speaker's words. US Patent 6,212,494 describes a process based on linguistic analyses of an online technical document in order to extract and catalog the essential information of the document so as to constitute, for example, a glossary, an index or an aid to understanding the document. This process is also based on a morphological, lexical and syntactic analysis of the document, as well as on a context analysis at the level of each sentence.

US patent application publication US 2002/0091509 A1 relates to a method for the automatic translation of text sentences which is also based on identifying the context of the sentences by analyzing and browsing the text step by step, and on taking the analyses into account to continuously improve the quality of the translation.

In the following description, reference is made to audio signals extracted from multimedia documents.

The context determination technique defined above for transcriptions of multimedia documents cannot be applied as it stands to a continuous audio signal of indefinite duration. Indeed, in the prior techniques cited above, a context is determined on a known syntactic element, for example a sentence. However, when a context is to be determined for a continuous audio signal of indefinite duration, it is impossible to predict the end of a sentence that does not yet exist. Unlike the processing of audio signals of fixed duration, which constitute audio documents of limited duration, the processing of continuous signals of indefinite duration is subject to a strong time scrolling constraint. The technique for determining context for audio signals of fixed duration is therefore not applicable to audio signals of indefinite duration.

The objective of the present invention is to determine the contexts of a continuous audio signal of indefinite duration and thus to overcome the time scrolling constraint, so that processing techniques specific to audio signals of fixed duration can be applied to audio signals of indefinite duration. More specifically, the invention relates to a device for indexing a continuous audio signal of indefinite duration, comprising means for filtering the continuous audio signal into a voice signal and a noisy signal, means for analyzing the voice signal in order to produce voice parameters, and a voice recognition means converting the voice signal into a text signal.

The means set out above of the indexing device according to the invention constitute a cascade of known individual modules used for the voice processing of an audio signal.

To achieve the above-mentioned objective, the indexing device of the invention is characterized in that it comprises a means for segmenting the continuous text signal into periodic temporal text segments, a first means for determining a context of the current text segment as a function of the averages of the voice parameters over the duration of the current segment and of the respective text segment, and a second means for determining a general context which is deduced from similar contexts of consecutive preceding segments. An upper time bound of this general context coincides with an upper time bound of the current text segment when the contexts of the current text segment and of the text segment preceding it are similar, and is kept coincident with an upper time bound of the text segment preceding the current text segment when the context of the current text segment is not similar to the context of the previous text segment.

The voice recognition means can produce a text signal as a function of the contexts determined by the first and second means. Other characteristics and advantages of the present invention will appear more clearly on reading the following description of several preferred embodiments of the invention with reference to the corresponding appended drawings in which:

- Figure 1 is a schematic block diagram of an indexing device according to a first embodiment of the invention;

- Figure 2 is a schematic block diagram of an indexing device according to a second embodiment of the invention; and

- Figure 3 is an algorithm of steps executed by the indexing device according to the first embodiment for determining a context from a current segment and a previous segment in a continuous audio signal of indefinite duration.

The invention will be described below in the context of audio signals, regardless of the origin of these audio signals. An audio signal is extracted from a multi-component signal such as an audio/video or multimedia signal, or comes directly from an audio-only signal. Sources capable of providing audio signals, with or without filtering, are for example television receivers, radio receivers, or personal terminals such as computers, digital assistants, mobile telephones or radiotelephone terminals. The invention can be implemented in a terminal, in a server, or in both, depending on the characteristics of the application which implements the invention.

With reference to FIG. 1, an indexing device according to the invention comprises a filter 1, a voice analyzer 2, a voice recognition module 3, a segmentation unit 4, a segment context determination unit 5 and a general context determination unit 6. The filter 1 receives as input a continuous audio signal SA of indefinite duration. It will be assumed that the audio signal SA is digital; if not, the received audio signal is analog and is converted by an analog-to-digital converter included in filter 1.

The filter 1 filters the audio signal SA by spectral subtraction or adaptive filtering in order to separate it into a signal comprising only voice, called the voice signal SV, and a signal comprising background noise, called the "noisy signal" or residual signal SB. The filter 1 is for example based on a linear predictive analysis LPC (Linear Predictive Coding) and isolates different acoustic components in an audio signal, such as voice, noisy voice and pure music. The noisy signal SB, which is likely to disturb the voice analysis and the subsequent voice recognition, is not processed in the indexing device according to the first embodiment shown in FIG. 1. The voice signal SV is then processed in parallel by the voice analyzer 2 and the voice recognition module 3.

The voice analyzer 2 analyzes the voice signal SV in order to continuously determine a list of parameters PVS characterizing the voice signal SV, called the "list of voice parameters". This list is not fixed but includes, among others, acoustic and in particular prosodic parameters such as the vibration frequency, intensity, rate of speech and timbre, as well as other parameters such as the relative age of the speaker.

In parallel with the voice analysis, the voice signal SV is submitted to the voice recognition module 3. In the embodiment shown in FIG. 1, the language of the audio signal is considered to be known. The voice recognition module 3 transforms the voice signal SV into a text signal ST.

In a variant, module 3 takes into account the results of a context study carried out beforehand in order to refine the recognition and transcription of the voice signal. The context is translated into syntactic elements, that is to say key words and expressions, which have a high probability of being included in a portion of the voice signal. For example, the context of a relatively periodic or frequent advertising spot or news bulletin in an audio signal emitted by a radio broadcasting station is predicted from the detailed program of this station, or is deduced from previous advertising spots or news bulletins. Various contexts in the form of key words and expressions, as defined above, constitute contexts that are pre-stored, or deduced from text segments preceding the current segment and/or from a context study, and are managed in a contextual database 45 linked to module 3 and to units 5 and 6. The contexts in the database 45 are gradually improved during the processing of the audio signal SA to facilitate voice recognition in the voice recognition module 3 and the determination of the context of the current text segment in unit 5. The contexts in the database 45 are also completed and refined by automatic consultation of external databases as a function of recently detected contexts. Module 3 can be based on Natural Language Understanding (NLU) software.

The segmentation unit 4 segments the text signal ST into periodic temporal text segments ..., Sn, ... as the audio signal SA is received in a buffer memory. The segmentation unit 4 comprises a buffer memory which continuously stores the audio signal SA for a duration greater than a predetermined duration DS of the audio signal segments. In practice, the capacity of the buffer memory is such that it records at most a portion of the audio signal SA whose duration is approximately at least ten times the duration DS of the segments. The predetermined duration DS of the text signal segments depends on the trade-off between the indexing quality of the device, that is to say the relevance of the indexing with respect to the meaning of the words contained in the text signal, and the indexing speed of the device. For example, a segment duration DS of 20 seconds, compared with a segment duration of 1 minute, increases the indexing frequency of the device to the detriment of the indexing quality. A minimum duration of 15 seconds is typically sufficient for the device to ensure a minimum quality.
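The periodic segmentation can be sketched as a generator that works on a stream whose end is unknown. The input format, a sequence of (time in seconds, word) pairs, and the function name `segment_text` are assumptions made for this example; the description does not specify how the text signal ST is time-stamped.

```python
def segment_text(timed_words, ds=20.0):
    """Group (time_in_seconds, word) pairs into consecutive segments of duration ds.

    Yields (lower_bound, upper_bound, words) as soon as a segment is complete,
    so it can run on a stream of unknown length. Assumes non-decreasing times.
    """
    index, words = 0, []
    for t, word in timed_words:
        # Close every segment that ended before this word was spoken.
        while t >= (index + 1) * ds:
            yield (index * ds, (index + 1) * ds, words)
            index, words = index + 1, []
        words.append(word)
    if words:
        # Flush the partial last segment if the input does come to an end.
        yield (index * ds, (index + 1) * ds, words)


stream = [(1.0, "good"), (2.5, "evening"), (21.0, "the"), (22.0, "weather"), (65.0, "cinema")]
segments = list(segment_text(stream, ds=20.0))
```

In this example the silent interval between 40 s and 60 s still produces an (empty) segment, which keeps the segment bounds strictly periodic.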

In another preferred embodiment of the invention, the segmentation is not based on a temporal characteristic but depends on a syntactic element such as a word, or a group of words or a sentence.

The unit 5 determines one or more contexts CSn of the current text segment Sn as a function of the average PVSn of each voice parameter PVS over the current text segment and of the content of the current text segment Sn. In a preferred variant, contexts established and stored previously are also used for determining the context in unit 5 and help increase the relevance of new segment contexts, which will in turn take part in determining the contexts of the following segments.
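The two inputs of unit 5 can be illustrated as follows. The description does not say how the averaged voice parameters and the text content are combined, so this sketch simply computes both; the parameter names (`pitch_hz`, `intensity_db`), the keyword-count rule and the sample context base are assumptions.

```python
from statistics import mean


def average_parameters(frames):
    """Average each voice parameter over the frames of one segment (PVSn)."""
    names = frames[0].keys()
    return {name: mean(frame[name] for frame in frames) for name in names}


def segment_contexts(words, context_base, min_hits=1):
    """Return the names of contexts whose key words occur at least min_hits times."""
    lowered = [w.lower() for w in words]
    hits = {
        name: sum(1 for w in lowered if w in keywords)
        for name, keywords in context_base.items()
    }
    return sorted(name for name, count in hits.items() if count >= min_hits)


# Hypothetical contextual base and analyzer output for one segment.
context_base = {
    "weather": {"rain", "wind", "sunny"},
    "cinema": {"film", "actor", "premiere"},
}
frames = [
    {"pitch_hz": 110.0, "intensity_db": 60.0},
    {"pitch_hz": 130.0, "intensity_db": 64.0},
]
pvs_n = average_parameters(frames)
cs_n = segment_contexts(
    "Rain and wind are expected before the film premiere".split(), context_base
)
```

Here the segment receives two contexts at once, which corresponds to the "one or more contexts CSn" mentioned above.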

In another variant, an initial general context is determined before any indexing of the audio signal SA from parameters external to the indexing device and linked, among other things, to the source of the audio signal, such as a radio receiver, television receiver, telephone or radiotelephone terminal, or telephone conversation recorder. When the audio signal SA to be processed is received by a radio or television receiver, program schedules or information about them, as well as any information capable of indicating the context of the first text segments, enrich the contextual database 45. Unit 5 bases this general context on the textual context of a predetermined number of text segments preceding the current text segment Sn when the context of the immediately preceding segment is not determined.

The general context determination unit 6 compares the context CSn of the current text segment Sn with the context CSn-1 of the previous text segment Sn-1 in order to determine the time bounds of a current general context CGm. Unlike a segment context, the general context CGm remains unchanged over one or more consecutive text segments whose contexts are similar and jointly define the general context. The set of consecutive text segments defining the general context CGm is delimited by time bounds that coincide respectively with the lower bound, also called the anterior bound, of the first processed text segment of the set and with the upper bound BSm, also called the posterior bound, of the last processed text segment of the set.

To optimize the indexing of the audio signal SA, periodic portions of the voice signal SV, each having a duration greater than and proportional to the duration DS of the periodic text segments Sn, are processed K times by the voice analyzer 2, the voice recognition module 3, the segment context determination unit 5 and the general context determination unit 6 in order to refine the relevance of the contexts of the portions. For example, passing a portion of the voice signal SV two to K times through means 2 to 6 refines the relevance of the contexts of this portion. The number K of processing cycles for a portion of the audio signal, shown diagrammatically at 26 in FIG. 1, depends on the time constraints, on the quality of each processing operation in means 2 to 6 and on the capacity of the buffer memory in the segmentation unit 4. The faster the indexing device has to process the audio signal, the smaller the number K.

Also to optimize indexing, the unit 5 determines a few contexts of the current text segment Sn so that the unit 6 can further divide the text signal ST into different general contexts and juxtapose several general contexts over at least one text segment. Intervals of different general contexts, whose lower and upper time bounds do not a priori coincide, are thus juxtaposed by the unit 6 over at least the interval of one text segment, which increases the accuracy of the general information relating to the audio signal.

According to a second preferred embodiment shown in FIG. 2, the indexing device also comprises an audio comparator 7. The audio comparator 7 is connected to an audio database 71 in which pieces of audio data such as music, songs, advertising jingles, news flashes and sound effects are stored. More generally, the database 71 has previously recorded any piece of audio data, preferably qualified by audio parameters PASp and contexts CAp whose time bounds are staggered with respect to a fixed reference point in the audio data, such as the beginning of a song or a jingle. The database 71 thus contains typed pieces of audio data which are used to interrupt the continuous audio signal SA with respect to a general context, as will be seen below with regard to the "context jump".

The audio comparator 7 comprises a buffer memory and a segmentation unit. The comparator compares a sample of the audio signal SA with samples of the audio pieces contained in the audio database 71. Substantially identical samples allow the comparator to determine portions of the audio signal SA corresponding to complete pieces or to parts of audio pieces contained in the database 71. The parameters PASp and the context CAp of the identified portion of the audio signal SA are applied to unit 5 over the entire duration of the determined portion, replacing the averages PVSn of the voice parameters over the current segment and the content of the text segment Sn. The text segments Sn of the text signal ST are thus qualified respectively by parameters PASp and audio contexts CAp read from the database 71, which inhibits the processing of these segments Sn by the voice analyzer 2 and the voice recognition module 3, as indicated by link 72.
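A very simplified sketch of this comparison is given below. It slides a short sample over each stored piece and treats two windows as "substantially identical" when every value differs by at most a tolerance. The record layout of the audio base, the tolerance and the function names are assumptions; a practical comparator would more likely compare compact audio fingerprints than raw sample values.

```python
def substantially_identical(a, b, tolerance=0.05):
    """True when two equally long sample windows differ by at most `tolerance`."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def identify(sample, audio_base, tolerance=0.05):
    """Return (params, context) of the first stored piece containing the sample.

    Returns None when no stored piece matches, in which case the segment
    would be processed by the voice analyzer and recognition module as usual.
    """
    n = len(sample)
    for piece in audio_base:
        data = piece["samples"]
        for start in range(len(data) - n + 1):
            if substantially_identical(sample, data[start:start + n], tolerance):
                return piece["params"], piece["context"]
    return None


# Hypothetical content of the audio base: one short jingle.
audio_base = [
    {
        "samples": [0.0, 0.5, 1.0, 0.5, 0.0, -0.5],
        "params": {"kind": "jingle"},
        "context": "advertising",
    }
]
match = identify([0.52, 0.98, 0.49], audio_base)
no_match = identify([0.9, 0.9, 0.9], audio_base)
```

A match returns the stored parameters and context, which would then qualify the corresponding text segments in place of the averaged voice parameters.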

The audio comparator 7 also helps improve the quality of context determination, since the parameters PASp and the contexts CAp associated with the audio data and contained in the audio database 71 are determined both manually, and therefore very precisely, and automatically. In order to improve the determination of contexts, the noisy signal SB comprising the residual non-voice part of the audio signal SA produced by the filter 1 is applied by the filter 1 to the audio comparator 7. The comparator then compares portions of the noisy signal SB in order to try to qualify the noisy signal SB by parameters PASp and contexts CAp coming from the audio database 71, and thus to improve the determination of context in the unit 5 and to supply the contextual database 45 with new contexts. In order to quickly constitute audio data in the database 71, the machines hosting the management means managing the audio database 71 can be used. In another variant, the management means is associated with the audio comparator 7 in the indexing device.

A known language determination unit 8 is inserted between the filter 1 and the voice recognition module 3 in order to determine the language of the voice signal SV if this is not previously known. For multi-language information, for example, the language is recognized continuously.

We now refer to FIG. 3 to describe the main steps E1 to E82 executed by the indexing device to determine the contexts of an indeterminate continuous audio signal SA in the case of the first embodiment shown in FIG. 1.

The audio signal SA is filtered by the filter 1 in step E1 in order to constitute a voice signal SV composed solely of the voice part of the signal SA without any background noise. The voice signal SV is then simultaneously analyzed in the analyzer 2 in step E2 and processed by the voice recognition module 3 in step E3. Following the analysis of the signal SV in step E2, the analyzer 2 continuously produces voice parameters PVS of the audio signal SA, and following the voice recognition processing in step E3, the module 3 produces a text signal ST deduced from the voice signal SV. In the fourth step E4, the unit 4 stores the text signal ST in the buffer memory, possibly after digital transformation. The time that digital samples of the text signal ST remain in the buffer memory depends on the predetermined duration DS of the segments Sn, and is at least equal to the segment duration DS.

The temporal and periodic segmentation of the text signal ST occurs in the fifth step E5. The text signal ST is segmented by the unit 4 into consecutive text segments Sn of duration DS. In FIG. 3, the processing of an n-th current segment Sn is considered, although each segment of the text signal ST is subjected to the same following steps as and when the audio signal SA is received by the indexing device.

As a function of the averages PVSn of the voice parameters over the current segment and of the text segment Sn, the unit 5 determines a segment context CSn of the text segment Sn in step E6. The time bounds of the context CSn of the segment Sn are known since they coincide with the bounds BSn of the temporal segment Sn. The context CSn and the voice parameters PVSn are stored in step E7 in the contextual database 45. As a variant, this storage is temporary, the retention time in memory depending on the duration of the text segments Sn and on the time taken by the context determination units 5 and 6 to process a segment. The key words and expressions characterizing a context are determined in step E6 by different methods of analysis, such as recovering the subjects of a sentence after deleting the propositions, adjectives or other elements. Alternatively, any existing methods of determining context, alone or in combination, are used in the present invention.

The unit 6 then compares the context CSn with the context CSn-1 of the previous segment Sn-1 in step E8. When the two contexts CSn and CSn-1 are not similar, that is to say have few or almost no key words and expressions in common, step E81 deduces that the upper bound BSn-1 of the previous segment Sn-1 is equal to the upper bound BCGm of the current general context CGm, whose last text segment is the segment Sn-1. The lower bound of the current segment Sn then defines the lower bound of the following general context CGm+1, which relates to the segment Sn and possibly to the segments following the segment Sn.

When, in step E8, the contexts CSn and CSn-1 are similar, that is to say have a number of identical or synonymous key words and expressions greater than a predetermined threshold, for example equal to 2 or 3, the upper bound BCGm of the current general context CGm is momentarily merged with the upper bound BSn of the current segment Sn in step E82. The segment Sn can be the last text segment relating to the general context CGm if, later, the contexts of the text segments Sn and Sn+1 are not similar.
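Steps E8, E81 and E82 can be summarized by the sketch below. The class `GeneralContextTracker` and its attribute names are hypothetical; representing a context as a set of key words, ignoring synonyms, and taking the general context as the union of the key words of its segments are simplifying assumptions. The similarity test follows the rule stated above: more than `threshold` key words in common.

```python
def similar(cs_a, cs_b, threshold=2):
    """Step E8: contexts are similar when they share more than `threshold` key words."""
    return len(set(cs_a) & set(cs_b)) > threshold


class GeneralContextTracker:
    """Hypothetical tracker of general-context bounds over a stream of segments."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.closed = []      # finished general contexts: (lower, upper, key words)
        self.current = None   # open general context: [lower, upper, key words]
        self.previous = None  # key words of the previous segment context CSn-1

    def push(self, lower, upper, keywords):
        """Process one segment Sn with bounds (lower, upper) and context CSn."""
        keywords = set(keywords)
        if self.current is not None and similar(keywords, self.previous, self.threshold):
            # Step E82: the upper bound BCGm follows the upper bound BSn.
            self.current[1] = upper
            self.current[2] |= keywords
        else:
            # Step E81: BCGm stays at BSn-1; a new general context starts at Sn.
            if self.current is not None:
                self.closed.append(tuple(self.current))
            self.current = [lower, upper, set(keywords)]
        self.previous = keywords


tracker = GeneralContextTracker(threshold=2)
tracker.push(0, 30, {"election", "vote", "minister", "news"})
tracker.push(30, 60, {"election", "vote", "minister", "law"})
tracker.push(60, 90, {"film", "actor", "premiere"})
```

After the third segment, the first general context is closed with bounds 0 s and 60 s, and a new general context is open from 60 s to 90 s. Because the tracker only ever compares the current segment with the previous one, it never needs to know where the signal ends.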

As the individual text segments ..., Sn-1, Sn, Sn+1, ... are indexed by the respective contexts ..., CSn-1, CSn, CSn+1, ... in step E7, the continuous audio signal SA is indexed by successive general contexts ..., CGm, ..., each of which relates to one or more consecutive indexed text segments. For example, for a segment duration DS of 30 seconds, the signal SA is indexed by a subject A up to the 8th minute from a reference time at which segmentation starts in the unit 4, by a subject B from the 6th to the 12th minute, then by a subject C for 1 minute, then by subject B again, and so on. Subject B is thus present twice in the signal SA, having been interrupted for 1 minute by subject C, which was recognized by the audio comparator 7 in the audio database 71. This phenomenon is called a context jump. Subjects A, B and C are, for example, news, a section on cinema and a set of advertising spots.

In this example, when the comparator 7 detects, by comparison with the samples of audio data in the audio database 71, the set of consecutive samples in the audio signal SA relating to subject C, the context determination units 5 and 6 control the writing of the context of the last temporal text segment Sn of subject B preceding subject C, as well as of the general context of subject B. After the last segment of subject C, which is the last segment still having the same general context as the set detected from the beginning of subject C, at least the unit 6 retrieves the general context of the segment preceding said detected set. This retrieval saves the indexing device from determining again a general context for the first segments of subject B following subject C, this general context being in this case the one that preceded subject C.
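The context jump can be sketched as a save-and-restore of the general context around an interruption identified by the comparator. The function name, the per-segment input format and the choice to restore the saved context only for the first segment after the interruption are assumptions made for this illustration.

```python
def apply_context_jumps(segments):
    """Assign a general context to each segment, resuming after an interruption.

    `segments` is a list of (segment_context, identified_piece) pairs, where
    identified_piece is the context read from the audio base for segments the
    comparator recognised, or None for ordinary segments.
    """
    result, current, saved, interrupted = [], None, None, False
    for segment_context, identified_piece in segments:
        if identified_piece is not None:
            if not interrupted:
                # Write the general context that precedes the interruption.
                saved, interrupted = current, True
            current = identified_piece
        elif interrupted:
            # Retrieve the saved context instead of determining it again.
            interrupted = False
            current = saved if saved is not None else segment_context
        else:
            current = segment_context
        result.append(current)
    return result


timeline = [
    ("B", None), ("B", None),      # cinema section
    (None, "C"), (None, "C"),      # advertising spots found in the audio base
    ("unclear", None),             # first segment after the interruption
    ("B", None),
]
contexts = apply_context_jumps(timeline)
```

In this run the segment following the advertising spots is indexed with subject B even though its own context is still unclear, which mirrors the retrieval described above.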

In another embodiment, the time bounds deduced for the general context CGm are stored in the contextual database 45. Second contexts and their parameters contained in the contextual database 45 are linked to the general context when the general context has parameters in common with the parameters of the second contexts. Thus the context CGm, defined by a few key words, is refined by matching it with other contexts contained in the contextual database 45. The contextual database is established beforehand and contains a list of referenced subjects and associated key words, as well as other parameters qualifying a context. As a variant, the second contexts are stored in a second contextual database shared between indexing devices according to the invention.
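The linking of second contexts to a general context can be illustrated as follows. Treating the "common parameters" as shared key words, the `min_common` threshold and the sample contextual base are assumptions for this sketch.

```python
def link_contexts(general_keywords, context_base, min_common=1):
    """Return the names of stored contexts sharing at least min_common key words."""
    general = set(general_keywords)
    return sorted(
        name
        for name, keywords in context_base.items()
        if len(general & set(keywords)) >= min_common
    )


def refine(general_keywords, context_base, min_common=1):
    """Enrich a general context with the key words of the contexts linked to it."""
    refined = set(general_keywords)
    for name in link_contexts(general_keywords, context_base, min_common):
        refined |= set(context_base[name])
    return refined


# Hypothetical content of the contextual base.
context_base = {
    "Brittany Weather": {"brittany", "rain", "wind"},
    "Cinema": {"film", "actor"},
    "Sailing": {"wind", "regatta"},
}
cg_m = {"rain", "wind"}
linked = link_contexts(cg_m, context_base)
refined = refine(cg_m, context_base, min_common=2)
```

Raising `min_common` makes the linking stricter: with a value of 2, only "Brittany Weather" contributes its key words to the refined general context.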

Claims

1 - Device for indexing a continuous audio signal (SA) of indefinite duration, comprising means (1) for filtering the continuous audio signal into a voice signal (SV) and a noisy signal (SB), means (2) for analyzing the voice signal (SV) in order to produce voice parameters (PVS), and a voice recognition means (3) converting the voice signal (SV) into a text signal (ST), characterized in that it comprises means (4) for segmenting the continuous text signal (ST) into periodic temporal text segments (Sn), a first means (5) for determining a context (CSn) of the current text segment (Sn) as a function of the averages (PVSn) of the voice parameters over the duration of the current segment and of the respective text segment (Sn), and a second means (6) for determining a general context (BCGm) which is deduced from similar contexts of consecutive preceding segments and of which an upper time bound coincides (E82) with an upper time bound (BSn) of the current text segment (Sn) when the contexts (CSn, CSn-1) of the current text segment and of the text segment preceding the current text segment are similar, and is kept coincident (E81) with an upper time bound (BSn-1) of the text segment (Sn-1) preceding the current text segment when the context (CSn) of the current text segment is not similar to the context (CSn-1) of the previous text segment.

2 - Device according to claim 1, wherein the voice recognition means (3) produces a text signal (ST) according to the contexts determined by the first and second means.

3 - Device according to claim 1 or 2, wherein an initial general context is determined initially from parameters external to the device and is based by the first means for determining (5) on the textual context of text segments preceding the current text segment when the context of the immediately preceding text segment is not determined.

4 - Device according to any one of claims 1 to 3, in which periodic portions of duration greater than and proportional to the duration of the text segments (Sn) are processed K times by the means for analyzing (2), the voice recognition means (3) and the first and second means for determining (5, 6) in order to refine the relevance of the contexts of said portions.

5 - Device according to any one of claims 1 to 4, wherein the second means for determining (6) juxtaposes several general contexts on at least one text segment.

6 - Device according to any one of claims 1 to 5, further comprising means (71) for previously storing consecutive pieces of audio data with respective parameters (PAS) and contexts (CA), and means (7) for comparing a sample of the audio signal (SA) with samples of pieces of audio data, in order to qualify a current portion of the audio signal (SA) by voice parameters (PASp) and context (CAp) of pieces of audio data when the sample of the audio signal and a sample of the pieces of audio data are substantially identical.

7 - Device according to claim 6, wherein the means for comparing (7) detects a set of consecutive samples in the audio signal (SA) by comparison with the samples of audio data in the means for storing (71), and the second means for determining (6) retrieves the general context of the segment preceding said detected set following the last segment still having the general context of said set.

8 - Device according to claim 6 or 7, wherein the means for comparing (7) compares portions of the noisy signal (SB) produced by the means for filtering (1) in order to improve the determination of context in the first means for determining (5).

9 - Device according to any one of claims 1 to 8, comprising means (8) between the means for filtering (1) and the voice recognition means (3) for determining a language of the voice signal (SV).

10 - Device according to any one of claims 1 to 9, comprising means (45) for storing and managing contexts deduced from text segments preceding the current text segment (Sn) and/or from a context study in order to facilitate the voice recognition in the voice recognition means (3) and the determination of the context of the current text segment in the first means for determining (5).