WO2012080651A1

WO2012080651A1 - Enrichment of the audio content of an audiovisual program by means of speech synthesis

Info

Publication number: WO2012080651A1
Application number: PCT/FR2011/052967
Authority: WO
Inventors: Roberto Agro; Halim Bendiabdallah
Original assignee: France Telecom
Priority date: 2010-12-16
Filing date: 2011-12-13
Publication date: 2012-06-21
Also published as: FR2969361A1

Abstract

The invention relates to a method for enriching the audio content of an audiovisual stream (F), including: obtaining (105) at least one first basic data stream (F_txt), including text enrichment data (d_txt), and a second basic data stream (F_audio), including initial audio data (d_audio), from the audiovisual stream (F); converting (109) the text enrichment data (d_txt), extracted from the first basic data stream (F_txt), into supplementary audio data (d_sup); and mixing (113) the supplementary audio data (d_sup) with the initial audio data (d_audio) extracted from the second basic data stream (F_audio) so as to obtain enriched audio data (d'_audio). The invention also relates to a device (1) intended for enriching the audio content of an audiovisual stream and suitable for implementing said enrichment method.

Description

Enrichment of the audio content of an audiovisual program by speech synthesis

The invention relates to the field of audio enrichment of audiovisual programs, and in particular that of audio description applied to audiovisual programs transmitted in the form of digital data streams.

In the field of digital television broadcasting, television programs are usually broadcast in the form of audiovisual streams bringing together a certain number of elementary video and audio streams associated and synchronized with each other. MPEG2-TS and DVB standards are commonly used to allow the transport and broadcasting of such audiovisual streams.

The structure of an MPEG2-TS audiovisual stream is simple and generic. It consists of audio elementary streams, video elementary streams and elementary data streams, as well as signaling tables based on MPEG2-TS and DVB standards. In particular, the same audiovisual stream may comprise a single elementary video stream associated with several audio elementary streams, respectively corresponding to different languages, which makes it possible to switch between these languages during the viewing of the program broadcast by such an audiovisual stream.

In order to improve accessibility to audiovisual content, in particular for the blind or the visually impaired, one solution consists in enriching the soundtrack of the broadcast programs. The use of such enrichment methods may even be required, now or in the near future, in some countries where public bodies ensure that a quota of audiovisual programs is accessible to the blind and visually impaired.

These enrichment methods are commonly referred to as "audio description" or "audio vision". They consist of describing the scenes of a film or program by a voice-over inserted between the original dialogues of the film or program, providing additional information that allows the visually impaired to better understand its context.

At present, such audio description methods consist in transmitting an audio track containing both the sound of the associated film (voice and sound effects) and the voice over of the description of this film. These processes use a pre-mix of the original audio track with the voice-over, before the actual broadcast of the film, which is performed in an editing studio with the participation of an actor who brings his voice to the description of the film.

The currently used audio description methods, however, suffer from a number of disadvantages:

First, the creation of the enriched audio track uses a complex chain for its implementation.

In addition, as different people and different equipment are used to perform this type of descriptive dubbing, these methods generate significant additional costs.

Finally, the enriched audio track obtained has a bit rate equivalent to that of any other audio component, that is to say between 128 and 256 kb/s. This enriched audio track is therefore bandwidth-consuming and obliges broadcasters to delete other audio tracks, for example "multi-language" audio tracks, in order to insert such a descriptive audio track and meet audio accessibility standards.

The present invention aims to overcome the aforementioned drawbacks and to provide an enrichment method that is less cumbersome to produce, limits additional costs and consumes less bandwidth.

To this end, it proposes a method of enriching the audio content of an audiovisual stream, comprising obtaining, from the audiovisual stream, a first elementary data stream comprising textual enrichment data and a second elementary data stream comprising initial audio data, converting the textual enrichment data extracted from the first elementary data stream into enrichment audio data, and mixing the enrichment audio data with the initial audio data extracted from the second elementary data stream in order to obtain enriched audio data.

Advantageously, the enrichment method furthermore comprises the synchronization of the enrichment audio data with the initial audio data before their mixing, in order to prevent an accidental temporal overlap of the original audio and enrichment channels during their mixing, which would render inaudible the enriched audio track associated with the program transmitted by the audiovisual stream.

In particular, this synchronization of the enrichment audio data with the initial audio data is performed by means of at least one time stamp inserted in the header of at least one textual data packet belonging to the first elementary data stream.

According to an embodiment in which the second elementary data stream comprises at least one audio data packet comprising initial audio data and a time stamp, the synchronization of the enrichment audio data with the initial audio data is performed by synchronizing the time stamp inserted in the textual data packet with the time stamp of the audio data packet.

In a particularly advantageous embodiment where the audiovisual stream is transmitted according to the MPEG2-TS standard, the obtaining step comprises obtaining the first and second elementary data streams by demultiplexing the audiovisual stream by means of identifiers respectively associated with these first and second elementary data streams in a PMT table, which allows a simple separation of the different elementary streams.

In this embodiment, particularly advantageously, the textual enrichment data are inserted beforehand into the first elementary stream in accordance with the teletext functionality defined in the DVB standard, which makes it possible to reuse an already existing functionality for transmitting the enrichment data without having to add a new feature specific to this type of application.

Advantageously, descriptive data specifically associated with the enrichment of audio content are inserted into a specific data field of at least one elementary stream packet belonging to the first elementary stream, to indicate that the textual enrichment data are used only as part of the enrichment of audio content, which distinguishes the use of the teletext functionality for enriching the audio content of an audiovisual stream from a conventional use.

The enrichment of audio content is, for example, requested by a user of the playback equipment on which this audiovisual stream is intended to be rendered or is being rendered. This user request triggers a search, by a decoding unit, for teletext data accompanied by descriptive data specifically associated with the enrichment of audio content: if these descriptive data are present, then the attached teletext data are used for the generation of enrichment audio data. In the absence of such descriptive data, the teletext data are decoded and exploited conventionally, as currently defined by the teletext functionality.

In particular, when this specific data field is the PES_data_field field of an elementary stream packet, defined according to the DVB standard and comprising a first elementary field data_identifier and a second elementary field data_unit_id, the descriptive data specifically associated with the enrichment of audio content consist of at least one value selected in a range of values from 0x80 to 0xFF and inserted in the data_identifier elementary field and in the data_unit_id elementary field.

When the specific data field is a descriptive data field belonging to the PMT table and defined according to the MPEG2-TS standard, the descriptive data specifically associated with the enrichment of audio content then advantageously consist of at least one value chosen in a range of values from 0x06 to 0x1F and inserted into said specific data field of the PMT.

Particularly advantageously, the textual enrichment data are formulated in XML format and comprise at least one configuration parameter for the conversion of said textual enrichment data into enrichment audio data, chosen among the reading speed, the type of voice, the intonation of the phrasing, the accentuation and the language, which makes it possible to configure the vocal conversion of the textual enrichment data from the transmitter of the television program.

In one embodiment of the enrichment method, the audiovisual stream includes, in the first elementary data stream, the textual enrichment data and at least one time stamp according to which the enrichment audio data are to be synchronized with the initial audio data, the first elementary data stream being multiplexed with the second elementary data stream.

The present invention also proposes a method for generating an audiovisual stream suitable for enriching audio content, comprising a step of inserting textual enrichment data and at least one time stamp into a first elementary data stream, and a step of multiplexing the first elementary data stream with at least a second elementary data stream comprising initial audio data to obtain the audiovisual stream.

The present invention furthermore proposes a device for enriching the audio content of an audiovisual stream, comprising a demultiplexing unit adapted to obtain, from the audiovisual stream, at least a first elementary data stream comprising textual enrichment data and a second elementary data stream comprising initial audio data, a decoding unit configured to convert the textual enrichment data extracted from the first elementary data stream into enrichment audio data, and an audio mixing unit configured to mix the enrichment audio data with the original audio data extracted from the second elementary data stream to obtain enriched audio data.

In an advantageous embodiment, the decoding unit comprises a speech synthesis unit configured to vocally synthesize the enrichment audio data from the textual enrichment data extracted from the first elementary data stream, and a synchronization unit configured to synchronize the enrichment audio data with the original audio data extracted from the second elementary data stream before providing them to the audio mixing unit, to prevent an accidental temporal overlap of the original audio and enrichment channels during their mixing.

In particular, the demultiplexing unit is further adapted to obtain a third elementary data stream comprising video data from the audiovisual stream, the decoding unit comprising an audio decoding unit configured to extract the original audio data from the second elementary data stream in order to provide them to the audio mixing unit, and a video decoding unit configured to extract the video data from the third elementary data stream in order to output them from the enrichment device.

Advantageously, when the audiovisual stream is transmitted according to the MPEG2-TS standard, the enrichment device is able to implement the steps of the audio content enrichment method above.

The subject of the invention is also a signal conveying an audiovisual stream intended to be transmitted to an audiovisual stream decoding unit, this signal comprising:

a first elementary data stream comprising textual enrichment data;

a second elementary data stream comprising initial audio data of the audiovisual stream,

the textual enrichment data being intended to be converted by the decoding unit into audio enrichment data adapted to be mixed with the initial audio data.

In one embodiment of the signal according to the invention, the first elementary data stream comprises at least one time stamp according to which the enrichment audio data are to be synchronized with the initial audio data during the mixing of the enrichment audio data with the initial audio data.

In one embodiment of the signal according to the invention, the textual enrichment data are inserted in the first elementary stream in accordance with a teletext functionality defined in a standard for coding and/or transporting audiovisual streams.

In one embodiment, the signal according to the invention comprises descriptive data specifically associated with audio content enrichment, said descriptive data being inserted into a specific data field of at least one elementary stream packet belonging to the first elementary stream to indicate that the textual enrichment data are used only for the purpose of enriching audio content.

The invention also relates to a data carrier readable by a computer or data processor, and comprising a signal according to the invention.

The information medium may be any entity or device capable of storing a signal. For example, the medium may comprise storage means, such as a ROM or RAM memory, a CD-ROM, or a magnetic recording means, for example a floppy disk or a computer hard drive.

On the other hand, the information medium can be a transmissible medium in the form of a carrier wave, such as an electromagnetic signal (electrical, radio or optical signal), which can be conveyed via an appropriate wired or wireless transmission means: electrical or optical cable, radio or infrared link, or other means.

The invention also relates to a method comprising a step of generating and/or a step of sending a signal according to the invention.

The method and the device for enriching the audio content of an audiovisual stream, which are the subject of the invention, will be better understood on reading the description and examining the following drawings, in which:

- Figure 1 illustrates the steps of a method of enriching the audio content of an audiovisual stream according to the present invention; and

- Figure 2 schematically shows a device for enriching the audio content of an audiovisual stream according to the present invention.

Reference is first made to FIG. 1, in which the steps of a method 100 for enriching the audio content of an audiovisual stream according to the present invention are illustrated.

This enrichment method is carried out more particularly in a device for enriching the audio content of an audiovisual stream, described in more detail in relation to FIG. 2, which is capable of receiving a digital audiovisual stream F using, for example, the MPEG2-TS standard for the transport of audiovisual streams.

This enrichment method notably comprises obtaining (step 105), from the audiovisual stream F, at least a first elementary data stream F_txt comprising textual enrichment data d_txt and a second elementary data stream F_audio comprising initial audio data d_audio.

Such elementary streams are, for example, multiplexed beforehand with a video elementary stream F_video comprising video data d_video, during the preparation (step 102) of the audiovisual stream F by the broadcaster of the digital television program, before the broadcasting (step 103) of the prepared audiovisual stream F.

In particular, this obtaining step 105 may include the separation of the audiovisual stream F received by the digital reception device into:

a first elementary stream F_txt consisting of textual data packets P_txt(1), ..., P_txt(i) comprising the textual enrichment data d_txt;

a second elementary stream F_audio consisting of audio data packets P_audio(1), ..., P_audio(j) comprising the initial audio data d_audio; and

a third elementary stream F_video consisting of video data packets P_video(1), ..., P_video(k) comprising video data d_video.

During this prior generation of the audiovisual stream F, the textual enrichment data d_txt are inserted in a first elementary data stream F_txt (step 101), for example in the form of ASCII character strings inserted into a number of textual data packets P_txt(1), ..., P_txt(i) of this first elementary stream F_txt.

This insertion can be carried out simply by an operator, for example by means of word processing tools, upstream of the audiovisual stream transmitter. It avoids having to use the services of an actor to read a voice-over to be mixed directly with the original audio track, which would generate additional costs, and it can also shorten the production time.

During this insertion step, time stamps are also advantageously inserted in the textual data packets P_txt(1), ..., P_txt(i) where the textual data d_txt are inserted. These time stamps may be used in particular during a possible synchronization of the textual enrichment data with the audio data to be enriched, as described later in the description.

Once the textual data are included in the first elementary data stream F_txt, this first elementary stream is multiplexed with the other elementary streams, namely the audio stream F_audio comprising the original audio data d_audio to be enriched and the video stream F_video, in a multiplexing step 102, in order to obtain the audiovisual stream F described above.

Once the audiovisual stream F has been generated, this audiovisual stream F is broadcast (step 103) in order to be received by a certain number of digital reception devices.

To come back to the separation step 105 mentioned above, this can be done by demultiplexing these different elementary streams from the audiovisual stream F in which they were previously multiplexed.

At the end of this step 105, the first elementary stream F_txt, consisting of a certain number of textual data packets {P_txt(i)}, 1 ≤ i ≤ N, including the textual enrichment data d_txt, on the one hand, and the second elementary stream F_audio, consisting of a number of audio data packets {P_audio(j)}, 1 ≤ j ≤ J, comprising initial audio data d_audio, on the other hand, are available separately.

The textual enrichment data d_txt are then extracted (step 107) from the first elementary stream F_txt, and more particularly from one or more textual data packet(s) {P_txt(i)}, 1 ≤ i ≤ N, contained in this stream, and converted (step 109) into enrichment audio data d_sup by means of a speech synthesis process.

Once these enrichment audio data d_sup have been obtained, they are mixed (step 113) with the initial audio data d_audio, themselves extracted from the second elementary stream F_audio, more particularly from one or more audio data packet(s) {P_audio(j)}, 1 ≤ j ≤ J, contained in this stream, in order to obtain enriched audio data d'_audio.

These enriched audio data d'_audio can then be used in combination with the video data d_video extracted, by video decoding, from the video elementary stream F_video, to provide a television program whose soundtrack is enriched by the textual enrichment data d_txt.

Thus, to the extent that the enrichment data are transmitted in textual form (e.g. as ASCII characters) rather than as audio data already mixed with the original audio track, as is the case in the prior art, a substantial gain in bandwidth is obtained, since text data consume significantly less bandwidth than audio data.

In an advantageous embodiment, a step 111 of synchronizing the enrichment audio data d_sup with the original audio data d_audio is performed prior to mixing the enrichment audio data d_sup with the initial audio data d_audio.

This ensures that the audio enrichment channel is synchronized with the original audio track and prevents an accidental overlap of the two types of audio data when they are mixed, which would render inaudible the enriched audio track associated with the program transmitted by the audiovisual stream F.

Such synchronization of the enrichment audio data d_sup with the initial audio data d_audio can be achieved by means of one or more time stamp(s) inserted in the header of at least one data packet P_txt(i) belonging to the first elementary data stream F_txt and containing textual enrichment data d_txt.
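By way of non-limiting illustration, the synchronization (step 111) and mixing (step 113) described above might be sketched as follows. The sketch assumes 16-bit PCM samples held in Python lists, time stamps already expressed in seconds, and a mix performed by sample-wise addition with clipping; the function names `offset_in_samples` and `mix_at_offset` are hypothetical and are not taken from the description.

```python
def offset_in_samples(t_txt, t_audio, sample_rate):
    """Convert the gap between two time stamps (in seconds) into a sample offset."""
    return round((t_txt - t_audio) * sample_rate)


def mix_at_offset(original, enrichment, offset, gain=1.0):
    """Mix enrichment samples (d_sup) into original samples (d_audio).

    Assumption: both inputs are lists of signed 16-bit PCM samples at the
    same sample rate; the result is clipped to the int16 range.
    """
    mixed = list(original)
    for n, sample in enumerate(enrichment):
        pos = offset + n
        if pos < 0:
            continue  # enrichment starts before the original buffer
        if pos >= len(mixed):
            break  # enrichment runs past the end of the original buffer
        value = mixed[pos] + int(gain * sample)
        mixed[pos] = max(-32768, min(32767, value))
    return mixed
```

With an 8 Hz toy sample rate, a text time stamp 0.5 s after the audio time stamp places the first enrichment sample at index 4 of the original buffer.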

In a particularly advantageous embodiment, the audiovisual stream F is composed according to the MPEG-2 TS standard and transmitted according to the same standard, that is to say by means of transport packets described in this standard.

In such an embodiment, the demultiplexing previously described in relation to step 105 of separating the elementary streams can advantageously be performed on the basis of distinct PID identifiers associated respectively with these different elementary streams, which are listed in a PMT (Program Map Table), commonly used in the MPEG-2 TS standard and transmitted with the audiovisual stream F.

By reading this PMT table in order to find the PID identifiers associated with the different elementary streams, it is then possible to distinguish the elementary streams F_audio, F_video and F_txt from one another, and thus to separate them simply when the audiovisual stream F is received.
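A minimal sketch of such a PID-based separation is given below for illustration only. It relies on two facts of the MPEG-2 TS format (188-byte transport packets starting with the sync byte 0x47, and a 13-bit PID in the second and third header bytes); the PID values in `PMT_PIDS` are hypothetical and simply stand in for the result of reading the PMT, which is not parsed here.

```python
from collections import defaultdict

TS_PACKET_SIZE = 188  # fixed transport packet size in MPEG-2 TS
SYNC_BYTE = 0x47


def packet_pid(packet: bytes) -> int:
    """Return the 13-bit PID carried in a transport packet header."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid transport packet")
    return ((packet[1] & 0x1F) << 8) | packet[2]


def demultiplex(stream: bytes, pid_to_name: dict) -> dict:
    """Group transport packets by elementary stream name, using the given PID map."""
    out = defaultdict(list)
    for start in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[start:start + TS_PACKET_SIZE]
        name = pid_to_name.get(packet_pid(packet))
        if name is not None:  # packets with unlisted PIDs are ignored
            out[name].append(packet)
    return dict(out)


def make_packet(pid: int) -> bytes:
    """Build a dummy packet with the given PID (illustration helper, zero payload)."""
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + bytes(TS_PACKET_SIZE - len(header))


# Hypothetical PID assignment, standing in for values read from the PMT.
PMT_PIDS = {0x100: "F_video", 0x101: "F_audio", 0x102: "F_txt"}
```

Calling `demultiplex(stream, PMT_PIDS)` on a concatenation of transport packets returns one list of packets per elementary stream name.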

Still in this same advantageous embodiment where the MPEG-2 TS standard is used to formulate and transmit the audiovisual stream F, the synchronization mentioned above can be performed using time stamps of the "PTS" (Presentation Time Stamp) type.

It is indeed usual to place a unique time stamp PTS in the header of each audio data packet P_audio(j) of the audiovisual stream. This time stamp makes it possible to synchronize the audio output even when the previous time stamp has not been captured, for example when an audio packet is lost.
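For illustration, a PTS can be read from the five-byte field of a PES header and converted into a sample offset as sketched below. The bit layout (a 33-bit value split 3/15/15 with marker bits) and the 90 kHz clock are those of MPEG-2 systems; the helper names, the `0x20` prefix used by the encoder and the absence of wrap-around handling are assumptions made for the example.

```python
PTS_CLOCK_HZ = 90_000  # PTS values count ticks of a 90 kHz clock


def encode_pts(pts: int) -> bytes:
    """Pack a 33-bit PTS into the 5-byte PES header field (PTS-only prefix '0010')."""
    pts &= (1 << 33) - 1
    return bytes([
        0x20 | (((pts >> 30) & 0x07) << 1) | 0x01,  # bits 32..30 + marker bit
        (pts >> 22) & 0xFF,                          # bits 29..22
        (((pts >> 15) & 0x7F) << 1) | 0x01,          # bits 21..15 + marker bit
        (pts >> 7) & 0xFF,                           # bits 14..7
        ((pts & 0x7F) << 1) | 0x01,                  # bits 6..0 + marker bit
    ])


def decode_pts(field: bytes) -> int:
    """Recover the 33-bit PTS from the 5-byte PES header field."""
    if len(field) != 5:
        raise ValueError("a PTS field is exactly 5 bytes long")
    return (
        (((field[0] >> 1) & 0x07) << 30)
        | (field[1] << 22)
        | ((field[2] >> 1) << 15)
        | (field[3] << 7)
        | (field[4] >> 1)
    )


def pts_delta_to_samples(pts_txt: int, pts_audio: int, sample_rate: int) -> int:
    """Offset, in audio samples, between a text packet PTS and an audio packet PTS.

    Assumption: both time stamps belong to the same time base and no
    33-bit wrap-around occurs between them.
    """
    return round((pts_txt - pts_audio) * sample_rate / PTS_CLOCK_HZ)
```

A difference of 90 000 ticks therefore corresponds to one second, i.e. 48 000 samples at a 48 kHz sample rate.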

In this embodiment, a time stamp PTS is further placed in the header of the textual data packets P_txt(i) comprising textual enrichment data corresponding to a unit sentence. Since the textual audio description can advantageously be contained entirely in a single textual data packet P_txt(i), a single time stamp PTS may suffice here. Synchronization of the enrichment audio data d_sup with the original audio data d_audio is then managed from the start of audio decoding, by means of the PTS time stamps inserted into the audio data packets P_audio(j) and the textual data packets P_txt(i).

Still in the advantageous embodiment where the MPEG-2 TS standard is used to transmit the audiovisual stream F, the textual enrichment data d_txt are inserted beforehand (step 101), before the broadcasting of the audiovisual stream F (step 103), in a number of textual data packets P_txt(1), ..., P_txt(i) belonging to the first elementary stream F_txt, which are defined as elementary stream packets (otherwise designated "PES", for Packetized Elementary Stream) within the meaning of the MPEG2-TS standard.

In this embodiment, these textual enrichment data d_txt can then advantageously be inserted as teletext in these elementary packets of the first elementary stream F_txt, in accordance with the part of the DVB standard describing the mode of insertion of teletext in a DVB stream (ETSI EN 300 472). This makes it possible to reuse the existing teletext transmission functionality within the framework of the MPEG2-TS standard to transmit these textual enrichment data d_txt simply, without the need to implement new features specific to the audio vision application.

Advantageously, descriptive data specifically associated with the enrichment of audio content are defined in advance, in order to be able to indicate to the receiving devices of the audiovisual stream F that they are receiving an audiovisual stream whose audio content can be enriched. These descriptive data are then inserted into a specific data field of one or more elementary stream packet(s) P_txt(i) belonging to the first elementary stream F_txt, in order to be read and/or extracted by the receiving devices upon reception of the audiovisual stream F.

Thus, as an example specific to the MPEG2-TS standard, the textual enrichment data d_txt can be inserted into a specific field of the "PES_data_field" type of "PES" elementary stream packets, which is structured in the following form by the DVB standard:

PES_data_field ()
{
    data_identifier
    for (i = 0; i < N; i++)
    {
        data_unit_id
        data_unit_length
        data_field ()
    }
}

The "PES_data_field" field breaks down into a number of elementary fields:

• The "data_identifier" elementary field indicated above is used to define the type of data stored in the elementary stream packet in question. The descriptive data specifically associated with the enrichment of audio content can therefore be inserted into such an elementary field.

Values between 0x10 and 0x1F are already defined so that they can be inserted into this "data_identifier" elementary field to designate EBU (European Broadcasting Union) data. Therefore, such values should not be used to designate textual enrichment data.

The DVB standard offers a range of values from 0x80 to 0xFF reserved for user-defined needs. One or more value(s) chosen in this range can therefore advantageously be used as descriptive data specifically associated with the enrichment of audio content, to indicate the insertion of textual enrichment data into the PES elementary stream, which avoids activating the standard teletext function unnecessarily.

Alternatively, since the DVB standard reserves the ranges of values [0x00, 0x0F] and [0x20, 0x7F] for future use, values chosen in these specific ranges can be used in the "data_identifier" field to designate the insertion of textual enrichment data intended to enrich the audio content of the audiovisual stream F, without unnecessarily activating the standard teletext function.

• In addition, the "data_unit_id" elementary field above is used to define the type and nature of the transmitted data. The DVB standard offers a range of free values between 0x80 and 0xFF, which can be used to designate textual enrichment data. Descriptive data specifically associated with the enrichment of audio content can thus also be inserted into such an elementary field, for example to designate a subtype of information concerning the transmitted textual enrichment data, such as the language used during speech synthesis for audio enrichment or the nature of the audio enrichment data packets.

Here again, alternatively, since the DVB standard reserves the ranges of values [0x00, 0x01] and [0x04, 0x7F] for future use, values selected in these specific ranges can be used in the "data_unit_id" field to designate the insertion of textual enrichment data intended to enrich the audio content of the audiovisual stream F, without unnecessarily activating the standard teletext function.

• The "data_unit_length" elementary field is used to indicate the size in bytes of the "data_field ()" field, which cannot exceed 44 bytes.

• Finally, the "data_field ()" field provides a space to insert the textual enrichment data into the elementary stream packet P_txt(i) in question.

Still in the embodiment where the teletext functionality offered by the DVB standard is used to transmit the textual enrichment data d_txt, it can also be advantageous to insert descriptive data specifically associated with the enrichment of audio content in the teletext descriptor present in the PMT table defined above, in the form of a specific identifier, in order to differentiate this specific use of teletext for the purpose of enriching audio content from its usual use.
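As an illustrative sketch only, the "PES_data_field" structure described above could be built and read back as follows. The sketch assumes that "data_identifier", "data_unit_id" and "data_unit_length" each occupy one byte, that the text is placed directly in "data_field ()" as ASCII, and that the value 0x80 (taken from the user-defined range 0x80 to 0xFF) is used for both identifiers; the function names are hypothetical.

```python
MAX_DATA_FIELD = 44  # upper bound on data_field() stated in the description


def is_user_defined(value: int) -> bool:
    """True when the value lies in the user-defined range 0x80..0xFF."""
    return 0x80 <= value <= 0xFF


def chunk_text(text: str, size: int = MAX_DATA_FIELD) -> list:
    """Split ASCII text into pieces that each fit in one data_field()."""
    data = text.encode("ascii")
    return [data[i:i + size] for i in range(0, len(data), size)]


def build_pes_data_field(data_identifier: int, units: list) -> bytes:
    """Serialize data_identifier followed by (data_unit_id, payload) units."""
    out = bytearray([data_identifier])
    for unit_id, payload in units:
        if len(payload) > MAX_DATA_FIELD:
            raise ValueError("data_field() exceeds 44 bytes")
        out += bytes([unit_id, len(payload)]) + payload
    return bytes(out)


def parse_pes_data_field(buf: bytes):
    """Return (data_identifier, [(data_unit_id, payload), ...])."""
    if not buf:
        raise ValueError("empty PES_data_field")
    data_identifier, units, pos = buf[0], [], 1
    while pos < len(buf):
        if pos + 2 > len(buf):
            raise ValueError("truncated data unit header")
        unit_id, length = buf[pos], buf[pos + 1]
        payload = buf[pos + 2:pos + 2 + length]
        if len(payload) != length:
            raise ValueError("truncated data_field()")
        units.append((unit_id, payload))
        pos += 2 + length
    return data_identifier, units
```

A receiver following this sketch would test `is_user_defined(data_identifier)` before routing the payload to speech synthesis rather than to the conventional teletext decoder.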

A descriptive data field is provided, according to the MPEG2-TS and DVB standards, in the PMT table to indicate the type of teletext component present in a particular elementary stream, and specify, among other things, whether this elementary stream corresponds to a subtitle, the language used, etc.

It is thus possible, by means of this descriptive data field provided for in the PMT table, to indicate that the textual enrichment data added in the form of teletext correspond to a specific audio vision type application.

This allows the enrichment device to recognize the use of an audio vision method when it receives the audiovisual stream F and, thanks to this PMT table, allows a menu to be implemented at the level of the enrichment device to indicate to the user that enrichment of audio content by audio vision is available.

Advantageously, and in order to allow fine management of this enrichment method at the level of the enrichment device itself, the textual enrichment data d_txt inserted in the form of teletext can be formulated in XML format and include one or more configuration parameters for step 109 of converting the textual enrichment data d_txt into enrichment audio data d_sup.

Such configuration parameters, added to the textual enrichment data d_txt, may relate to the setting of the following elements at the level of the enrichment device:

- the reading speed to be used during speech synthesis (i.e. the speech rate),

- the type of voice to use during speech synthesis (i.e. a male voice, a female voice, a child's voice, etc.),

- the intonation or accentuation of the pronounced sentences.

These configuration settings can also be used to:

- embed multiple languages in the same audiovisual stream,

- embed texts that make it possible to find one's way around the video when using trick modes in the case of a recording. Such "trick" modes can correspond to fast forward, fast rewind, pause, stop or play modes, among others.

By way of illustration, an example of textual enrichment data, in teletext mode and in XML format, is provided below:

<AUDIO_VISION>
    <TEXT TYPE="NORMAL" SPEED="1">Hello World my name is E.T</TEXT>
    <TEXT TYPE="TRICK MODE" SPEED="1.5">Scene of the meeting with the alien</TEXT>
</AUDIO_VISION>

In this example, configuration parameters are inserted to set the reading speed used by the speech synthesis. In particular, a first sentence "Hello World my name is E.T" is to be pronounced at normal speed, while a second sentence "Scene of the meeting with the alien" is to be pronounced at a speed 50% higher than the normal speed.
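For illustration, such XML enrichment data could be read on the receiver side with the Python standard library as sketched below. The element and attribute names (AUDIO_VISION, TEXT, TYPE, SPEED) come from the example above; the function name `parse_enrichment` and the default values applied when an attribute is missing are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<AUDIO_VISION>
  <TEXT TYPE="NORMAL" SPEED="1">Hello World my name is E.T</TEXT>
  <TEXT TYPE="TRICK MODE" SPEED="1.5">Scene of the meeting with the alien</TEXT>
</AUDIO_VISION>"""


def parse_enrichment(xml_text: str) -> list:
    """Extract each sentence with its speech synthesis configuration parameters."""
    root = ET.fromstring(xml_text)
    sentences = []
    for node in root.iter("TEXT"):
        sentences.append({
            # Assumed defaults: normal type and normal speed when unspecified.
            "type": node.get("TYPE", "NORMAL"),
            "speed": float(node.get("SPEED", "1")),
            "text": (node.text or "").strip(),
        })
    return sentences
```

Each resulting dictionary can then be handed to whichever speech synthesis engine the receiver provides, with "speed" used as a multiplier of the normal speech rate.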

A tag that can be used in trick mode is also inserted here by means of complementary metadata to the textual enrichment data.

This "trick" mode allows the user to browse a recorded program quickly by jumping directly from one tag to another. A text identified by the type "TRICK MODE" is also inserted after this tag. Depending on the capabilities of the digital receiver used, when it detects such a tag, it can either emit a beep, in a limited mode, or, in a more elaborate mode, emit a voice saying "Scene of the meeting with the alien", which indicates the current position in the recorded program.
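The jump from one tag to another might, purely as an illustration, be implemented as a search in a time-ordered list of tags. In the sketch below, tags are assumed to be `(time_in_seconds, text)` pairs sorted by time; the function names and every tag other than "Scene of the meeting with the alien" are hypothetical.

```python
from bisect import bisect_left, bisect_right


def next_tag(tags, position):
    """Return the first tag strictly after `position`, or None if there is none."""
    times = [t for t, _ in tags]
    idx = bisect_right(times, position)
    return tags[idx] if idx < len(tags) else None


def previous_tag(tags, position):
    """Return the last tag strictly before `position`, or None if there is none."""
    times = [t for t, _ in tags]
    idx = bisect_left(times, position) - 1
    return tags[idx] if idx >= 0 else None


# Hypothetical tag list for a recorded program (times in seconds).
TAGS = [
    (0.0, "Opening"),
    (95.0, "Scene of the meeting with the alien"),
    (300.0, "Chase"),
]
```

On a fast-forward command issued at 10 s, `next_tag(TAGS, 10.0)` returns the tag at 95 s, whose text can then be spoken by the speech synthesis unit or signalled by a beep.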

Reference is now made to Figure 2, which schematically illustrates a device 1 for enriching the audio content of an audiovisual stream F according to the present invention.

Such a device can in particular take the form of a "Set Top Box" digital reception device, a digital reception device integrated into a digital television set or any other digital terminal compatible with the DVB standard.

In addition to reception means Rx capable of receiving an audiovisual stream F transmitted by a digital broadcasting antenna or coming from a satellite antenna by means of a cable, the enrichment device 1 comprises a demultiplexing unit 10, arranged to demultiplex the received audiovisual stream F into at least a first elementary stream F_txt composed of a number of data packets P_txt(i) comprising textual enrichment data d_txt, a second elementary audio stream F_audio composed of a number of packets P_audio(j) carrying audio data d_audio, and a third elementary video stream F_video composed of a number of packets P_video(k) carrying video data d_video.

In the advantageous embodiment where the audiovisual stream F is composed and transmitted according to the MPEG2-TS standard, this demultiplexing unit 10 may comprise a PID filtering module capable of reading the PMT table transmitted with the audiovisual stream F and of finding the PID identifiers specifically associated with the different elementary streams, in order to distinguish them and to separate them by demultiplexing.
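The PID filtering can be sketched as follows in Python. The sketch relies only on the fixed layout of the MPEG2-TS packet header (188-byte packets, sync byte 0x47, 13-bit PID in the second and third bytes). It assumes that the PID-to-stream mapping has already been read from the PMT table; parsing the PMT itself is not shown, and the function names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

TS_PACKET_SIZE = 188  # fixed MPEG2-TS packet length in bytes
SYNC_BYTE = 0x47      # first byte of every MPEG2-TS packet


def packet_pid(packet: bytes) -> int:
    """Extract the 13-bit PID from the header of one transport packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid MPEG2-TS packet")
    return ((packet[1] & 0x1F) << 8) | packet[2]


def demultiplex(packets: Iterable[bytes],
                pid_to_stream: Dict[int, str]) -> Dict[str, List[bytes]]:
    """Group packets by elementary stream name; other PIDs are dropped."""
    streams: Dict[str, List[bytes]] = defaultdict(list)
    for packet in packets:
        name = pid_to_stream.get(packet_pid(packet))
        if name is not None:
            streams[name].append(packet)
    return dict(streams)
```

A caller would pass a mapping such as `{0x101: "F_audio", 0x102: "F_txt"}`, where the PID values are hypothetical and would in practice come from the PMT table.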

The enrichment device 1 further comprises a decoding unit 20 which receives the different elementary streams F_audio, F_video and F_txt demultiplexed by the demultiplexing unit 10.

This decoding unit 20 comprises, on the one hand, an audio decoding unit 25 which receives the individual packets P_audio(j) of the second elementary stream F_audio carrying the initial audio data d_audio and extracts these initial audio data d_audio in a format allowing them to be output to a speaker, for example in a PCM format, in order to provide these initial audio data d_audio to the mixing unit 30 described later.

The PCM format is mentioned here for illustrative purposes as the output format of the initial audio data d_audio, but any other audio output format, such as AC3, can of course also be used, depending on the input format used by the mixing unit 30.

This decoding unit 20 comprises, on the other hand, a video decoding unit 27 which receives the different packets P_video(k) of the third elementary stream F_video carrying the video data d_video and extracts the video data d_video in a video image format allowing them to be output to a display screen, such as a television, at the output of the enrichment device 1.

With regard to the processing of the packets P_txt(i) comprising textual enrichment data d_txt and belonging to the first elementary stream F_txt, the decoding unit 20 comprises an extraction unit 21 arranged to extract the textual enrichment data d_txt from these packets P_txt(i).

The decoding unit 20 further comprises a speech synthesis unit 22 which receives these textual enrichment data d_txt and converts them into enrichment audio data d_sup, typically by means of a speech synthesis process. This speech synthesis unit 22 can thus convert an ASCII character string representing the textual enrichment data d_txt into enrichment audio data d_sup in a PCM format.

Here again, the PCM format is mentioned for illustrative purposes as the output format of the enrichment audio data d_sup, but any other audio output format, such as AC3, can of course also be used, depending on the input format used by the mixing unit 30.

The device 1 also comprises an audio mixing unit 30 receiving, on the one hand, the enrichment audio data d_sup converted by the speech synthesis unit 22 and, on the other hand, the initial audio data d_audio from the audio decoding unit 25. The audio mixing unit 30 mixes the enrichment audio data d_sup with the initial audio data d_audio, in order to enrich the latter with the additional information contained in the enrichment audio data d_sup, which results in enriched audio data d'_audio. These enriched audio data d'_audio can then be provided by the audio mixing unit 30 on an "audio out" output of the enrichment device 1, together with the video data d_video from the video decoding unit 27, which are provided on a "video out" output.
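The description does not specify how the mixing is carried out. As one possible reading, the following Python sketch sums the two signals sample by sample and clips the result. It assumes that both signals are mono 16-bit PCM at the same sample rate; the function name `mix_pcm` and the `sup_gain` parameter are hypothetical.

```python
from typing import List

INT16_MIN, INT16_MAX = -32768, 32767  # range of a signed 16-bit PCM sample


def mix_pcm(d_audio: List[int], d_sup: List[int], sup_gain: float = 1.0) -> List[int]:
    """Mix two mono 16-bit PCM sample sequences sample by sample, with clipping."""
    length = max(len(d_audio), len(d_sup))
    mixed = []
    for n in range(length):
        a = d_audio[n] if n < len(d_audio) else 0  # treat missing samples as silence
        s = d_sup[n] if n < len(d_sup) else 0
        value = int(round(a + sup_gain * s))
        mixed.append(max(INT16_MIN, min(INT16_MAX, value)))  # clip to the int16 range
    return mixed
```

Clipping keeps the sum within the sample range at the cost of distortion on loud passages; an actual mixer might instead attenuate the initial audio while the synthesized voice is playing.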

When the enrichment device 1 takes the form of a digital television decoder, otherwise referred to as a Set Top Box (STB), these "audio out" and "video out" outputs can be connected by an external cable to a television set in order to play the program contained in the audiovisual stream, whose audio track is enriched with additional information. When the enrichment device 1 takes the form of a module internal to a digital television set, these "audio out" and "video out" outputs can be connected by internal connections to the speakers and the screen of this digital television set in order to play this enriched program.

In an advantageous embodiment, the decoding unit 20 of the enrichment device 1 further comprises a synchronization unit 23, connected between the speech synthesis unit 22 and the audio mixing unit 30.

The synchronization unit 23 receives the enrichment audio data d_sup from the speech synthesis unit 22 and synchronizes them with the initial audio data d_audio, in order to ensure that they do not overlap during the mixing performed by the audio mixing unit 30.

When the audiovisual stream F is composed and transmitted according to the MPEG2-TS standard and time stamps of the "PTS" type have been inserted into the packets P_txt(i) of the elementary stream F_txt, the synchronization unit 23 uses these PTS time stamps to align the enrichment audio data d_sup relative to the start of the audio decoding performed by the audio decoding unit 25, using the clock of this audio decoding unit 25 as necessary.
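As an illustration of this alignment, the Python sketch below converts the difference between two PTS values into a number of PCM samples. It relies on the fact that MPEG-2 PTS values are 33-bit counters of a 90 kHz clock. The sample rate of 48 kHz, the function names, and the assumption that the enrichment time stamp lies after the start of audio decoding are all hypothetical.

```python
PTS_CLOCK_HZ = 90_000  # MPEG-2 presentation time stamps count ticks of a 90 kHz clock
PTS_MODULO = 1 << 33   # PTS values are 33 bits wide and wrap around


def pts_to_sample_offset(pts_txt: int, pts_audio_start: int,
                         sample_rate_hz: int = 48_000) -> int:
    """Number of PCM samples between the start of audio decoding and the
    point at which the enrichment audio should begin."""
    # The modulo tolerates a PTS wrap-around, assuming pts_txt follows pts_audio_start.
    delta_ticks = (pts_txt - pts_audio_start) % PTS_MODULO
    return delta_ticks * sample_rate_hz // PTS_CLOCK_HZ


def place_enrichment(d_sup: list, offset_samples: int) -> list:
    """Pad the enrichment samples with leading silence so they start at the offset."""
    return [0] * offset_samples + list(d_sup)
```

For example, a time stamp one second (90 000 ticks) after the start of audio decoding corresponds to an offset of 48 000 samples at 48 kHz.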

Of course, the invention is not limited to the embodiments described and shown above, from which other modes and other embodiments can be envisaged without departing from the scope of the invention.

Thus, the example described above is that of an audio enrichment intended to improve the accessibility of a television program for visually impaired people. However, the present invention can also be used in the more general context of audio enrichment of any content comprising both audio and video, such as video services offered on the Internet.

In addition, the XML format has been previously indicated as being able to be used to insert textual enrichment data along with metadata. However, the invention is not limited to this type of format, but can be put into practice with any other type of format in which text data may be accompanied by metadata, such as for example.

Claims

1. A method of enriching the audio content of an audiovisual stream (F), comprising:

a step of obtaining (105) at least a first elementary data stream (F_txt) comprising textual enrichment data (d_txt) and a second elementary data stream (F_audio) comprising initial audio data (d_audio) from the audiovisual stream (F);

a step of converting (109) the textual enrichment data (d_txt) into enrichment audio data (d_sup);

a step of mixing (113) the enrichment audio data (d_sup) with the initial audio data (d_audio) in order to obtain enriched audio data (d'_audio).

2. An enrichment method according to claim 1, further comprising a step of synchronizing (111) the enrichment audio data (d_sup) with the initial audio data (d_audio) prior to mixing.

3. An enrichment method according to claim 2, wherein the synchronization of the enrichment audio data (d_sup) with the initial audio data (d_audio) is performed in accordance with at least one time stamp inserted into at least one textual data packet (P_txt(i)) belonging to the first elementary data stream (F_txt).

4. An enrichment method according to claim 3, wherein the second elementary data stream (F_audio) comprises at least one audio data packet (P_audio(j)) comprising initial audio data (d_audio) and a time stamp, the synchronization of the enrichment audio data (d_sup) with the initial audio data (d_audio) being performed by synchronizing the time stamp inserted in the textual data packet (P_txt(i)) with the time stamp of the audio data packet (P_audio(j)).

5. Enrichment method according to one of claims 1 to 4, wherein the audiovisual stream (F) is transmitted according to the MPEG2-TS standard, and wherein the obtaining step comprises obtaining the first and second elementary data streams (F_txt, F_audio) by demultiplexing the audiovisual stream (F) by means of identifiers respectively associated with said first and second elementary data streams in a PMT table.

6. Enrichment method according to one of claims 1 to 5, wherein the textual enrichment data (d_txt) are previously inserted (101) into the first elementary stream (F_txt) according to a teletext feature defined in an audiovisual stream encoding and/or transport standard.

7. An enrichment method according to claim 6, wherein descriptive data specifically associated with audio content enrichment are inserted into a specific data field of at least one elementary stream packet (P_txt(i)) belonging to the first elementary stream (F_txt) to indicate that the textual enrichment data are used only as part of the enrichment of audio content.

8. The enrichment method according to claim 7, wherein the specific data field is the PES_data_field field of an elementary stream packet, defined according to the DVB standard and comprising a first elementary field data_identifier and a second elementary field data_unit_id, and wherein the descriptive data specifically associated with the audio content enrichment consist of at least one value selected in a range of values from 0x80 to 0xFF and inserted in the data_identifier elementary field and/or in the data_unit_id elementary field.

9. An enrichment method according to claim 7, wherein the specific data field is a descriptive data field belonging to the PMT table and defined according to the MPEG2-TS standard, and wherein the descriptive data specifically associated with the enrichment of audio content consist of at least one value selected in a range of values from 0x06 to 0x1F and inserted into said specific data field of the PMT table.

10. The enrichment method according to one of claims 6 to 9, wherein the textual enrichment data (d_txt) comprise at least one configuration parameter for the conversion of said textual enrichment data (d_txt) into enrichment audio data (d_sup), chosen from among the reading speed, the voice type, the intonation of the phrasing, the accentuation and the language.

11. An enrichment method according to any one of claims 1 to 10, wherein the audiovisual stream comprises, in the first elementary data stream, textual enrichment data and at least one time stamp according to which the enrichment audio data are to be synchronized with the initial audio data, the first elementary data stream being multiplexed with the second elementary data stream.

12. Device (1) for enriching the audio content of an audiovisual stream (F), comprising:

a demultiplexing unit (10) adapted to obtain at least a first elementary data stream (F_txt) comprising textual enrichment data (d_txt) and a second elementary data stream (F_audio) comprising initial audio data (d_audio) from the audiovisual stream (F);

a decoding unit (20) configured to convert the textual enrichment data (d_txt) extracted from the first elementary data stream (F_txt) into enrichment audio data (d_sup); and

an audio mixing unit (30) configured to mix the enrichment audio data (d_sup) with the initial audio data (d_audio) extracted from the second elementary data stream (F_audio) to obtain enriched audio data (d'_audio).

13. Device for enriching the audio content of an audiovisual stream according to claim 12, wherein the decoding unit (20) comprises a speech synthesis unit (22) configured to synthesize the enrichment audio data (d_sup) by voice from the textual enrichment data (d_txt) extracted from the first elementary data stream (F_txt), and a synchronization unit (23) configured to synchronize the enrichment audio data (d_sup) with the initial audio data (d_audio) extracted from the second elementary data stream (F_audio) before supplying them to the audio mixing unit (30).

14. A device for enriching the audio content of an audiovisual stream according to claim 12 or 13, wherein the demultiplexing unit (10) is further adapted to obtain a third elementary data stream (F_video) comprising video data (d_video) from the audiovisual stream (F), and wherein the decoding unit (20) comprises an audio decoding unit (25), configured to extract the initial audio data (d_audio) from the second elementary data stream (F_audio) in order to supply them to the audio mixing unit (30), and a video decoding unit (27), configured to extract the video data (d_video) from the third elementary data stream (F_video) in order to provide them at the output of the enrichment device.

15. Device for enriching the audio content of an audiovisual stream according to one of claims 12 to 14, wherein the audiovisual stream (F) is transmitted according to the MPEG2-TS standard, in which the enrichment device comprises means for implementing the steps of the audio content enrichment method according to one of claims 4 to 10.

16. Signal conveying an audiovisual stream, intended to be transmitted to an audiovisual stream decoding unit, said signal comprising:

a first elementary data stream (F_txt) comprising textual enrichment data (d_txt);

a second elementary data stream (F_audio) comprising initial audio data (d_audio) of the audiovisual stream,

the textual enrichment data (d_txt) being intended to be converted by the decoding unit into enrichment audio data (d_sup) adapted to be mixed with the initial audio data (d_audio).

17. Signal according to claim 16, wherein the first elementary data stream (F_txt) comprises at least one time stamp according to which the enrichment audio data are to be synchronized with the initial audio data during the mixing of the enrichment audio data with the initial audio data.

18. Signal according to claim 16 or 17, wherein the textual enrichment data (d_txt) are inserted (101) into the first elementary stream (F_txt) according to a teletext feature defined in an audiovisual stream encoding and/or transport standard.

19. Signal according to claim 16 or 18, comprising descriptive data specifically associated with the enrichment of audio content, said descriptive data being inserted into a specific data field of at least one elementary stream packet (P_txt(i)) belonging to the first elementary stream (F_txt) to indicate that the textual enrichment data are used only in the context of the enrichment of audio content.