A method, a device and a system for compressing a musical and voice signal
BACKGROUND OF THE INVENTION
The invention relates to a method, a device and a system for compressing a musical and voice signal.
With today's advances in digital communication technology, the transmission of data across the Internet and over mobile networks has made information available to the user almost immediately, even over decentralized communication networks such as the Internet.
This technology has also shaped the way people enjoy themselves.
In the audio entertainment field, a user nowadays usually expects audio enjoyment on an on-demand basis. The Internet has served as a very useful highway to transport and distribute musical and voice signals to the user anywhere and anytime.
The Internet phenomenon is only in its infancy, and yet it has already experienced enormous growth. Even so, the increasing number of users and new applications entering the Internet need a bandwidth which goes far beyond the bandwidth which is currently available from communication networks.
Compression technology, which reduces the bandwidth required for the transmission of data in general, and for the transmission of musical and voice signals in particular, is therefore a topic of focus.
To the user, data compression means a shorter time to download the data and a need for smaller storage space, and therefore saved money and time.
MP3 (Moving Picture Experts Group (MPEG) Audio Layer 3) has therefore been widely adopted by the industry as the de-facto standard for the transmission of audio data across the Internet.
MP3 provides a compression ratio of about 10 times over uncompressed data for CD quality audio (44.1 kHz, 16 bit). Transmitting a three-minute uncompressed song as a musical and voice signal across the Internet over a 56 kbps line would take (44.1 kHz * 16 bit * 2 channels * 3 min * 60 s/min) / 56 kbps = 4536 s = 75.6 min, which is more than an hour.
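The bandwidth arithmetic above can be verified with a short calculation (a minimal sketch; the 56 kbps modem rate is taken from the example):

```python
# Download time of a three-minute CD-quality song over a 56 kbps line.
SAMPLE_RATE = 44_100      # Hz
BITS_PER_SAMPLE = 16
CHANNELS = 2
DURATION_S = 3 * 60       # three minutes
MODEM_BPS = 56_000        # 56 kbps

uncompressed_bits = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS * DURATION_S
seconds = uncompressed_bits / MODEM_BPS
print(f"uncompressed: {seconds:.0f} s = {seconds / 60:.1f} min")   # 4536 s = 75.6 min
print(f"with 10x MP3 compression: {seconds / 10 / 60:.2f} min")    # 7.56 min
```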
MP3 would reduce that to only 7.56 minutes, which is an amazing feat.
However, transmitting an album of 10 MP3 songs as musical and voice signals would again take more than an hour. Therefore, a compression method achieving more than 10 times would be desirable if Internet music is to become a reality.
As shown in Fig.1, professional music is usually recorded in a studio within a soundproof room.
The sound from the musical instruments, also referred to as musical signals 101, and the vocals, i.e. speech signals, also referred to as voice signals 102, are recorded on separate tracks.
If the data (comprising the analog musical signals 101 and the voice signals 102) is to be compressed using a digital method, the analog signals are first converted to a digital form through an analog to digital conversion device, i.e. the analog musical signals 101 are converted into digitized musical signals 103 and the analog voice signals 102 are converted into digitized voice signals 104.
The separate signals are then mixed down (through a mixer) onto a master track (the audio signal), which is symbolized by block 105 in Fig.1, which master track forms the compression source for most compression methods (including MP3).
MP3 audio compression belongs to a class of data compression schemes called perceptual coding.
This is based on the sub band / transform coding technique. Perceptual coding analyses the frequency and amplitude content of the input signal, and compares it to a model of human auditory perception.
Information that is audible is coded and everything that is inaudible can be discarded.
The advantage of sub band / transform coding is that it works in the frequency domain. The uncorrelated nature of the spectral components makes it possible to quantise the spectral components in different frequency bands with a different number of bits, provided that the resulting quantization noise is unperceived.
This advantage is further exploited by MP3 using the masking phenomenon of the human auditory system. The MP3 encoder analyses the frequency and amplitude content of the input audio signal and compares it to a psychoacoustic model of the human auditory system.
Alternative forms of audio compression include ADPCM (Adaptive Differential Pulse Code Modulation), wavelet compression, etc.
After the audio signal is compressed (step 106), the compressed data are stored in a storage device (step 107), e.g. a hard disk, a CD-ROM or a semiconductor device like a Flash memory or a Read-Only Memory (ROM). The data could also be stored on a server computer, from where it would be transmitted over a transmission line (such as the Internet) to a user on demand and stored within the user's storage device in the user's client computer.
When the user wishes to listen to the piece of audio, the compressed audio data is decompressed (step 108) and output to a digital-to-analog conversion device (step 109), with the analog signal driving a loudspeaker producing music for listening pleasure.
SUMMARY OF THE INVENTION
An object of the invention is to compress a musical and voice signal with an improved compression ratio.
The object is achieved with a method, a device and a system for compressing a musical and voice signal with the features according to the independent claims.
In a method for compressing a musical and voice signal, which musical and voice signal comprises a musical signal and a voice signal, the sound from the musical instruments, also referred to as the musical signal, and the vocals, i.e. the speech signal, also referred to as the voice signal, are recorded on separate tracks.
The analog musical signal is then converted into a digitized musical signal and the analog voice signal is converted into a digitized voice signal.
For the digitized musical signal, notes parameters of the musical signal are determined. In this context, notes parameters are e.g.
• the fundamental frequency of the notes of the musical signal, and/or
• the amplitude of the musical signal, and/or
• the type of instrument or instruments, which are involved in generating the musical signal.
The fundamental frequency in this context is the frequency with which the notes of the reconstructed signal will later be played.
For the digitized voice signal, a compressed digitized voice signal is generated, using e.g. a speech recognition algorithm or a Linear Prediction Coding algorithm (LPC) .
The determination of the notes parameters and the compression of the digitized voice signal are executed independently from each other.
In a last step, the musical notes parameters are stored together with the compressed voice signal in a memory, so that it is possible to restore and decompress the musical notes parameters and the compressed voice signal, thereby generating a synthesized musical and voice signal.
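The claimed steps can be sketched as a minimal pipeline; both branch functions are hypothetical placeholders for the note-parameter extraction and voice coding described later:

```python
# Minimal sketch of the claimed method; the two branch functions are
# hypothetical placeholders, not the actual analysis algorithms.

def extract_note_parameters(music_samples):
    # Placeholder: would determine fundamental frequency, amplitude
    # and instrument type for the digitized musical signal.
    return {"notes": [], "velocities": [], "instrument": "piano"}

def compress_voice(voice_samples):
    # Placeholder: would run LPC analysis or a speech recognition algorithm.
    return b""

def compress(music_samples, voice_samples):
    # The two branches are computed independently from each other ...
    notes = extract_note_parameters(music_samples)
    voice = compress_voice(voice_samples)
    # ... and stored together so that both can later be restored.
    return {"notes": notes, "voice": voice}
```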
The invention provides a much higher compression rate than the known compression algorithms. The compression rate is even further improved when using the speech recognition algorithm, e.g. using Hidden Markov Models, for compressing the voice signal.
The stored musical notes parameters and compressed voice signal may be transmitted from a server computer over a communication network, e.g. via the Internet to a client computer, where it is restored and decompressed, thereby
generating the synthesized musical and voice signal, which is presented to a user of the client computer.
Alternatively, the compressed data may be stored in a storage device (step 107), e.g. a hard disk, a CD-ROM or a semiconductor device like a Flash memory or a ROM (Read-Only Memory), and restored and decompressed from that respective storage device.
When using the speech recognition algorithm for compressing the speech signal, the restoring of the compressed voice signal may comprise the step of text-to-phoneme conversion of the compressed voice signal into a speech synthesis signal, which is used for generating the synthesized musical and voice signal.
Furthermore, a device for compressing a musical and voice signal comprises a processing unit for executing the above mentioned steps .
Thus, the device includes e.g.
• a musical notes determination unit for determining musical notes parameters of the musical signal,
• a voice signal compression unit for compressing the voice signal independently from the musical signal, and
• a memory for storing the musical notes parameters together with the compressed voice signal, so that it is possible to restore and decompress the musical notes parameters and the compressed voice signal, thereby generating a synthesized musical and voice signal, the memory being connected to the musical notes determination unit and the voice signal compression unit.
Furthermore, a system for compressing and decompressing a musical and voice signal comprises a processing unit for executing the above mentioned steps.
Thus, the system includes e.g.
• a musical notes determination unit for determining musical notes parameters of the musical signal,
• a voice signal compression unit for compressing the voice signal independently from the musical signal,
• a memory for storing the musical notes parameters together with the compressed voice signal, so that it is possible to restore and decompress the musical notes parameters and the compressed voice signal, thereby generating a synthesized musical and voice signal, the memory being connected to the musical notes determination unit and the voice signal compression unit, and
• a musical and voice signal synthesizing unit for restoring and decompressing the musical notes parameters and the compressed voice signal, thereby generating a synthesized musical and voice signal.
The invention may be implemented using a special electronic circuit, i.e. in hardware, or using computer programs, i.e. in software.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram showing an example of a method for compressing a musical and voice signal;
Figure 2 is a block diagram showing a model of human speech production;
Figure 3 is a block diagram showing an LPC voice coding unit, also referred to as a vocoder; and
Figure 4 is a block diagram showing a system and a method for compressing a musical and voice signal according to a preferred embodiment of the invention.
DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION
Preferred embodiments of the invention and modifications thereof will now be described with reference to the accompanying drawings.
According to the embodiments of the invention, an improved compression ratio is achieved by synthesizing an audio signal, i.e. a musical and voice signal, instead of modeling it.
In order to properly synthesize the audio signal, a complete model of the instrument and of the vocal cords is required.
Music synthesis has been available, e.g. in equipment which can synthesize musical instruments (music synthesizers). Such a synthesizer is provided with a standard keyboard input and produces musical output from musical notes.
Such a synthesizer e.g. uses a Wavetable method, recording all the notes from a musical instrument and storing them in a semiconductor storage (ROM). Given the instrument, note and velocity (the information about how hard and how fast the key of the keyboard is pressed), the particular musical note can be played.
Although music synthesis is popular for producing audio signals, it should be mentioned that it has never been used as a compression methodology according to the state of the art.
Furthermore, voice can be synthesized using a text-to-speech generating method.
Furthermore, it is known to extract the vocal parameters to mimic a person's voice. Since a human is quite perceptive to a singer's voice, a compression method that models the general vocal cords would be sufficient and will be described instead.
The vocal compression according to an embodiment of the invention uses a method called Linear Predictive Coding (LPC).
In LPC, the way in which human speech is generated is modeled.
Speech is produced by cooperation of lungs, glottis (with vocal cords) and articulation tract (mouth and nose cavity) .
For the production of voiced sounds, the lungs press air through the epiglottis and the vocal cords vibrate; they interrupt the air stream and produce a quasi-periodic pressure wave.
In the case of unvoiced sounds, the excitation of the vocal tract is more noise-like.
A model 200 illustrates the human speech production, as shown in Fig.2.
The lungs are modeled by a DC source 201, the vocal cords by an impulse generator 202 and the articulation tract by a linear filter system 203. A noise generator 204 produces the unvoiced excitation. Speech sounds 205 consist of both voiced and unvoiced signals mixed together.
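The source-filter model of Fig.2 can be sketched as follows; the filter coefficients and pitch period are illustrative assumptions, not values taken from the embodiment:

```python
import random

# Sketch of the source-filter model of Fig.2: a quasi-periodic impulse
# train (vocal cords) or white noise (unvoiced excitation) drives a
# linear recursive filter (articulation tract).

def excitation(n_samples, pitch_period, voiced):
    """Impulse train for voiced sounds, white noise for unvoiced sounds."""
    if voiced:
        return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n_samples)]
    return [random.uniform(-1, 1) for _ in range(n_samples)]

def all_pole_filter(x, a):
    """y[n] = x[n] + sum_k a[k] * y[n-1-k] -- a linear recursive filter."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc += ak * y[n - 1 - k]
        y.append(acc)
    return y

# Illustrative (stable) filter coefficients and an 80-sample pitch period.
voiced_sound = all_pole_filter(excitation(160, pitch_period=80, voiced=True), a=[0.5, -0.3])
unvoiced_sound = all_pole_filter(excitation(160, 80, voiced=False), a=[0.5, -0.3])
```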
A great advantage of an LPC coder is its manipulation facilities and its close analogy to human speech. By manipulating the parameters of the LPC vocoder, it is for example possible to transform a male voice into a female voice or a child's voice. An LPC vocoder can be used as the engine for text-to-speech synthesis, which will be described later in detail.
Fig.3 shows a block diagram of an LPC vocoder 300.
The first step is to perform an LPC and speech analysis on the digital voice data, i.e. an LPC analysis (block 301) and a pitch analysis (block 302) .
Both sets of the determined LPC coefficients 303 and the determined pitch values 304 are then stored in the parameter memory (block 305) . These parameters are then used to control the synthesis part of the LPC vocoder 300.
In other words, the stored parameters are fed into a pitch generator 306, which generates reconstructed pitch values 307, and into a digital filter 308. Furthermore, noise signals 310 are generated by a noise generator 309. The reconstructed pitch values 307 and the noise signals 310 are amplified (block 311) and the amplified signals 312 are fed into the digital filter 308, thereby generating a reconstructed voice signal 313.
In this context, it should be mentioned that the LPC compression can only be used for human speech compression, i.e. for compressing a voice signal. It is not suitable for compression of a musical signal. The compression ratio achieved by the LPC is much higher than any audio compression (MP3, ADPCM or Wavelet) so far.
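The LPC analysis of block 301 may be sketched with the textbook autocorrelation method and Levinson-Durbin recursion (the vocoder's actual implementation is not specified in the text, so this is one standard way to obtain the LPC coefficients):

```python
# Sketch of LPC analysis: autocorrelation followed by the
# Levinson-Durbin recursion for the predictor coefficients.

def autocorr(x, order):
    """Autocorrelation r[0..order] of signal x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve for LPC coefficients a[1..order] and the residual energy."""
    a = [0.0] * (order + 1)
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1 - k * k)
    return a[1:], e

# Synthetic first-order test signal x[n] = 0.9 * x[n-1]; the recovered
# first-order predictor coefficient should be close to 0.9.
x = [0.9 ** n for n in range(200)]
coeffs, err = levinson_durbin(autocorr(x, 1), 1)
```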
Fig.4 shows a system for compressing and decompressing a musical and voice signal according to a preferred embodiment of the invention.
The system 400 comprises a server computer 401 and a plurality of client computers 402, one of them being shown in Fig.4.
The respective steps which are executed during the method are symbolized as blocks in the server computer 401 and the client computer 402, respectively.
The server computer 401 and the client computer 402 are connected to each other via the Internet 403 as a communication network.
As shown in Fig.4, an analog musical signal 404 and an analog voice signal 405 are recorded on separate tracks using a microphone (not shown) .
The analog musical signal 404 and the analog voice signal 405 are converted into a digital musical signal 406 and a digital voice signal 407 using an analog to digital conversion device.
The digital signal from the musical instrument, i.e. the digital musical signal 406 is fed into a frequency analyzer, which determines the fundamental frequency of the notes played. The amplitude and the type of instrument are also recorded.
In order to determine these parameters, the digital musical signal 406 is transformed from the time domain to the frequency domain. The fundamental frequency is selected and its amplitude is noted, i.e. stored. The fundamental frequency is the frequency with which the notes will be played. The frequency and the amplitude are recorded as described in the General MIDI standard: the frequency is stored as the note and the amplitude is stored as the velocity. According to this embodiment, the determined values are normalized to fit in the required predetermined range.
Together, the fundamental frequency of the notes played, the amplitude and the type of instrument form the musical notes parameters (block 408) .
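The mapping of the measured fundamental frequency and amplitude to General MIDI note and velocity numbers may be sketched as follows; the equal-temperament note formula is standard, while the amplitude normalization range is an assumption for illustration:

```python
import math

# Sketch of frequency-to-note and amplitude-to-velocity mapping
# following General MIDI conventions.

def freq_to_midi_note(f_hz):
    """MIDI note number: 69 corresponds to A4 = 440 Hz, 12 notes per octave."""
    return round(69 + 12 * math.log2(f_hz / 440.0))

def amplitude_to_velocity(amp, amp_max=1.0):
    """Normalize an amplitude into the MIDI velocity range 0..127
    (amp_max is an assumed full-scale reference)."""
    return min(127, max(0, round(127 * amp / amp_max)))

print(freq_to_midi_note(440.0))    # 69 (A4)
print(freq_to_midi_note(261.63))   # 60 (middle C)
print(amplitude_to_velocity(1.0))  # 127
```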
The digital voice signal 407 is fed into an LPC vocoder 409. The LPC vocoder 409 determines the LPC coefficients as described above, thereby generating a compressed voice signal 411.
A speech recognition algorithm can alternatively be used to replace the LPC. When using a speech recognition algorithm, Hidden Markov Models may be used.
The musical notes parameters 410 and the compressed voice signal 411 are multiplexed and stored in a storage device of the server computer 401 (block 412), or alternatively on any other storage medium such as a CD-ROM.
The term "multiplexed" is to be understood in the sense that a rather small portion of the musical notes parameters 410 and a rather small portion of the compressed voice signal 411 are loaded into a small memory space sufficient to store those two portions, which respectively form a sub portion of the whole musical notes parameters 410 and compressed voice signal 411.
With this optional feature, it is possible to reduce the required memory space in the client computer, which is especially advantageous if the client computer is a cheap and rather low-end device such as a mobile phone or a PDA having an audio player, with which it is possible to reconstruct and play the reconstructed audio signal.
Another advantage of the storing of a small portion of the musical notes parameters 410 and a small portion of the compressed voice signal 411 together is that in this case it is not necessary to transmit the entire musical notes parameters 410 and the entire compressed voice signal 411 before beginning to reconstruct and play the audio signal, i.e. the song. This is particularly advantageous when using a rather slow communication network such as the Internet 403 using a slow telephone modem line between the server computer 401 and the client computer 402.
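The interleaved storage and transmission described above may be sketched as follows; the chunk size and length-prefix framing are assumptions for illustration, not the format used by the embodiment:

```python
# Sketch of "multiplexed" storage: small portions of the note parameters
# and of the compressed voice signal are interleaved so that playback
# can start before the whole song has been transmitted.

def multiplex(notes_bytes, voice_bytes, chunk=1024):
    stream = bytearray()
    for i in range(0, max(len(notes_bytes), len(voice_bytes)), chunk):
        n = notes_bytes[i:i + chunk]
        v = voice_bytes[i:i + chunk]
        # Length-prefix each sub-portion so the client can split them again.
        stream += len(n).to_bytes(2, "big") + n
        stream += len(v).to_bytes(2, "big") + v
    return bytes(stream)

def demultiplex(stream):
    notes, voice, i = bytearray(), bytearray(), 0
    while i < len(stream):
        ln = int.from_bytes(stream[i:i + 2], "big"); i += 2
        notes += stream[i:i + ln]; i += ln
        lv = int.from_bytes(stream[i:i + 2], "big"); i += 2
        voice += stream[i:i + lv]; i += lv
    return bytes(notes), bytes(voice)
```

The client only ever needs to buffer one interleaved chunk pair at a time, which matches the small-memory argument made above.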
The data 413 is transmitted across the Internet 403 on an on-demand basis.
The received data 413 is then stored within the client computer 402 (block 414) .
When the user of the client computer 402 wishes to listen to a piece of music, the compressed data 413 is extracted and decompressed e.g. in real-time.
In other words, the stored musical notes parameters 410 are extracted (block 415) and a decompressed digital musical signal 416 is generated using the Wavetable method used in a usual keyboard synthesizer.
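A Wavetable-style lookup may be sketched as follows, assuming a single recorded cycle per instrument for brevity; real synthesizers store complete sampled notes in ROM:

```python
import math

# Sketch of Wavetable playback: step through a stored single-cycle
# waveform at a rate proportional to the desired note frequency,
# scaled by the MIDI velocity.

TABLE_SIZE = 64
wavetable = {"piano": [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]}

def play_note(instrument, freq_hz, velocity, n_samples, sample_rate=44_100):
    table = wavetable[instrument]
    gain = velocity / 127.0                       # velocity scales loudness
    step = freq_hz * TABLE_SIZE / sample_rate     # table increment per sample
    return [gain * table[int(i * step) % TABLE_SIZE] for i in range(n_samples)]

samples = play_note("piano", 440.0, velocity=100, n_samples=100)
```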
Furthermore, the stored compressed voice signal 411 is decompressed (block 417) and a decompressed digital voice signal 418 is generated.
When using the LPC, the decompressed digital voice signal 418 is generated in the way described with reference to the LPC vocoder of Fig.3.
In general, text-to-speech conversion is used for the synthesis of the digital voice signal 418. This means that a stored dictionary of text and corresponding phonemes is used.
Each phoneme has a corresponding voice. The intonation and stress of the voice are adjusted based on the particular context of the reconstructed digital voice signal 418. The intonation and the stress may be provided by the melody encoded in the musical notes parameters 410, using the note pitch and its amplitude.
Using the speech recognition algorithm and the corresponding text-to-speech conversion will usually not reproduce the sound of the original singer. However, using the speech recognition algorithm provides the higher compression ratio.
The "raw" musical signals and voices signals are combined either by digital or analog means.
For analog combination, a digital-to-analog conversion process will convert the digital signals to analog signals. In other words, the decompressed digital musical signal 416 is converted into a decompressed analog musical signal 420 (block 419) . The reconstructed digital voice signal 418 is converted into a reconstructed analog voice signal 422 (block 421) .
The analog musical signal 420 and the analog voice signal 422 are then combined through a summing operational amplifier, i.e. a mixer 423, thereby generating a reconstructed analog musical and voice signal 424.
The analog musical and voice signal 424 is output to a power amplifier 425 and the thereby generated amplified analog musical and voice signal 426 is used to drive a speaker in order to generate the audio signal 427 output to the user of the client computer 402.