CA1231473A - Voice activity detection process and means for implementing said process - Google Patents
Voice activity detection process and means for implementing said process
- Publication number
- CA1231473A
- Authority
- CA
- Canada
- Prior art keywords
- block
- threshold
- vadth
- voice
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04J—MULTIPLEX COMMUNICATION
- H04J3/00—Time-division multiplex systems
- H04J3/16—Time-division multiplex systems in which the time allocation to individual channels within a transmission cycle is variable, e.g. to accommodate varying complexity of signals, to vary number of channels transmitted
- H04J3/1682—Allocation of channels according to the instantaneous demands of the users, e.g. concentrated multiplexers, statistical multiplexers
- H04J3/1688—Allocation of channels according to the instantaneous demands of the users, e.g. concentrated multiplexers, statistical multiplexers the demands of the users being taken into account after redundancy removal, e.g. by predictive coding, by variable sampling
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04J—MULTIPLEX COMMUNICATION
- H04J3/00—Time-division multiplex systems
- H04J3/17—Time-division multiplex systems in which the transmission channel allotted to a first user may be taken away and re-allotted to a second user if the first user becomes inactive, e.g. TASI
- H04J3/175—Speech activity or inactivity detectors
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Time-Division Multiplex Systems (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
Abstract
In a transmission system for transmitting voice signals, each signal being sampled and coded to derive therefrom blocks of samples and short term power spectrum characteristics of each block, a voice activity detection process is proposed wherein an energy representative information is derived from each block of samples, which energy representative information is compared with a predetermined threshold, and said threshold is adjusted accordingly. Then an active/ambiguous decision is taken based on the relative magnitude of the energy information with respect to the adjusted threshold, which ambiguity is then eventually resolved through analysis of the magnitude of variation of the short term power spectrum characteristics.
Description
A VOICE ACTIVITY DETECTION PROCESS AND MEANS FOR IMPLEMENTING SAID PROCESS
Brief background of the invention
Field of invention
This invention deals with digital speech transmission and more particularly with means for efficiently processing speech signals to enable effective use of channel bandwidth.
Technical background
In view of the high cost of transmission channels, it might be wise to take advantage of any speech characteristics which would enable concentrating the traffic of a maximum number of telephone users over a same channel.
During a telephone conversation, each subscriber speaks less than half the time during which the connection is established. The remaining idle time is devoted to listening, gaps between words and syllables, and pauses. A number of systems have already been proposed which take advantage of this idle time. For instance, additional users, up to twice the overall channel capacity, are assigned to a same channel in TASI systems (see J. Campanella, "Digital TASI", Comsat Technical Review, 1975). These systems obviously need means for detecting any user's inactivity to be able to assign the channel to a different user.
Unfortunately, voice activity determination is far from being straightforward. In general, the method for detecting the voice activity of a given speaker is based on measurement of the speech signal energy over short periods of time. The measured energy is then compared with a preset threshold level. Speech is declared present if the measured energy exceeds the threshold; otherwise the period is declared idle, i.e. the concerned speaker is considered silent for said period of time. The problem lies in the threshold determination, due to the fact that different speakers usually speak at different levels and also due to the fact that the losses vary from one transmission line to another. A threshold set too high would result in speech signals being clipped, so that the received speech signal would be of rather poor quality, while a low threshold would obviously lead to poor TASI system efficiency. In addition, one also has to take into account the presence of noise, which should be discriminated from voice signals.
An object of the present invention is to provide for an improved speech activity detection process.
Another object of the invention is to provide for means for detecting low level speech activity within high level background noise.
Summary of Invention
In a transmission system for transmitting voice signals, each signal being sampled and coded to derive therefrom blocks of samples and short term power spectrum characteristics of each block, a voice activity detection process is proposed wherein an energy representative information is derived from each block of samples, which energy representative information is compared with a predetermined threshold, and said threshold is adjusted accordingly. Then an active/ambiguous decision is taken based on the relative magnitude of the energy information with respect to the adjusted threshold, which ambiguity, if any, is then resolved through analysis of the magnitude of variation of the short term power spectrum characteristics.
The foregoing and other features and advantages of this invention will be apparent from the following more particular description of the preferred embodiment of the invention as illustrated in the accompanying drawing.
Brief description of the drawing
Fig. 1 is a block diagram of a TASI system,
Fig. 2 - 4 are block diagrams of prior art coders based on linear prediction theory,
Fig. 5 and 6 respectively summarize the characteristics of the linear prediction coders and decoders to be used for this invention,
Fig. 7 and 8 summarize the various steps of the invention process,
Fig. 9 is a block diagram of a device for implementing the invention.
Description of a preferred embodiment of the invention
Represented in figure 1 is a block diagram of a TASI type of system. P users, i.e. voice terminal sources, are respectively attached to individual input channels through coders (CODE 1, CODE 2,..., CODE P) attached to ports (PORT 1, PORT 2,..., PORT P). Each coder converts the analog voice signal fed by a user through a port into digital data. The digital data are then concentrated over a single output channel L to be forwarded toward a remote receiving location (not shown) where they would be redistributed to the designated terminals (not shown) to which they are respectively assigned. The concentration operation is performed through a Time Division Multiplexer (TDM-MPX) 10.
Under normal TDM conditions the number of users is selected such that the total number of bits per second provided by the P sources matches the output line transmission rate capability. But, as mentioned above, such an arrangement would not take full advantage of a number of speech characteristics, e.g. silences, as TASI systems do.
For TASI operation, the number P of users attached to the system is purposely made higher than it would be in a conventional multiplex system. In other words, assuming all the users were in operation at a given instant, the multiplexer and, more particularly, the output line would be unable to handle the resulting data traffic without taking into consideration the above mentioned silences or other inactivities. This is why a Voice Activity Detector (VAD) 12 is connected to the output of each coder. Said VAD 12 is made to permanently scan the coders' outputs to detect those coders which may be considered active and gate their outputs into the multiplexer 10 through gates G1, G2,..., GP. The voice activity detector 12 also provides the multiplexer with the active coders' address indications to be inserted within the multiplexed message and transmitted over the output line for each time frame. A voice terminal is considered active whenever its output level is above a threshold level preset within the Voice Activity Detector 12.
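The gating step described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `gate_active_coders`, the list-based frame layout and the 1-based port numbering are hypothetical and are not taken from the patent.

```python
def gate_active_coders(coder_outputs, vad_flags):
    """Select the coded blocks of the ports flagged active for one time frame.

    coder_outputs: per-port coded blocks, index 0 standing for PORT 1 (assumed layout).
    vad_flags:     per-port 0/1 activity flags as delivered by the detector.
    Returns (addresses, payloads): the active port numbers and their blocks.
    """
    addresses = []
    payloads = []
    for port, (block, flag) in enumerate(zip(coder_outputs, vad_flags), start=1):
        if flag:
            # Only active ports are gated into the multiplexed message.
            addresses.append(port)
            payloads.append(block)
    return addresses, payloads
```

For example, with three ports of which the first and third are active, the sketch would forward two blocks together with the addresses 1 and 3.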
A voice activity detector is proposed here which not only enables adequately adjusting its threshold to the speaker's environmental conditions, but in addition takes full advantage of the coder characteristics. As already mentioned, voice activity detection requires energy measurements. The proposed Voice Activity Detector (VAD) achieves cost effectiveness by using data already available within the coder for performing the energy measuring operation. It applies to a number of coders based on the linear prediction theory, which assumes a modeling of the vocal tract by an all-pole filter. Reference on this subject is provided by J. MARKEL and A.H. GRAY in their book "Linear Prediction of Speech", published in 1976 by Springer Verlag, New York.
The modeling applies to a wide range of digital speech compression systems, e.g. adaptive predictive coders (APC), voice excited predictive coders (VEPC), linear prediction vocoders (LPC). For references on these coders one should refer to a number of publications including:
- "Adaptive Predictive Coding of Speech Signals", by B.S. ATAL and M.R. SCHROEDER, in Bell Syst. Tech. Journal, Vol. 49, pp 1973-1986, October 1970;
- "9.6/7.2 Kbps Voice Excited Predictive Coder (VEPC)", by D. ESTEBAN, C. GALAND, D. MAUDUIT and J. MENEZ, in IEEE ICASSP, Tulsa, April 1978 (wherein Kbps stands for Kilobits per second); and,
- "A Linear Prediction Vocoder Simulation Based on the Autocorrelation Method", by J.D. MARKEL and A.H. GRAY, in IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-22, no 2, pp 124-134, April 1974.
In APC coders, the speech signal is inverse filtered by an optimum predictor, resulting in a so-called excitation signal which is quantized, transmitted and used at the synthesis location to excite an all-pole filter. Both inverse and all-pole filter characteristics are derived from the voice signal characteristics. Shown in figure 2 is a block diagram summarizing the basic elements of an APC coder. The voice signal samples x(n), provided by blocks of N samples (see BCPCM or Block Companded PCM techniques), are fed into a predictor filter 1, the coefficients of which are derived in 2 from the voice signal analysis. An excitation (residual) signal is then derived in 3, which residual signal is coded into an EN information. The voice signal is thus finally converted into coefficient and EN coded information.
In VEPC coders, the excitation signal is approximated by considering only the lowest frequency band, or Base Band BB (e.g. 0-1 KHz), of the original excitation signal. A block diagram summarizing the VEPC coder functions is represented in fig. 3. The difference with the APC coder lies in the fact that only the base-band is finally coded into SIGNAL in 5, while the upper band, e.g. 1-3 KHz, is represented by its energy (ENERGY).
The upper band components will be synthesized, when needed (i.e. at the receiving station, not shown), by means of non-linear distortion, high-pass filtering, and energy matching. Additional information on VEPC coding is also available in US Patent 4,216,354 to this assignee.
A block diagram for an LPC coder is represented in fig. 4. In this case, the excitation signal is represented by a voiced/unvoiced (V/UV) decision (1 bit), a coded pitch period representation (e.g. 5 bits) and an energy indication (e.g. 4 bits) coded in (8).
In LPC decoders, and for synthesis purposes (not shown), the excitation will be approximated either by a pulse train at the pitch frequency in the case of voiced signals, or by a white noise in the case of unvoiced signals.
A common block diagram for the analysis part of the coders based on the three above cited techniques is shown in figure 5. The input speech signal is analyzed by blocks of samples x(n), based on the assumption that the signal is stationary within each block. The upper path of the analyser includes means for the determination of the autocorrelation function (DAF) 14, which means extracts spectral information R(i), based on autocorrelation coefficients, from the input signal. This spectral information is then processed, in (DPC) 16, for the determination of prediction coefficients, which coefficients are to be transmitted for being used for synthesis purposes within the corresponding receiver. Both devices 14 and 16 may be included within the device 2 of figures 2 through 4. In addition, algorithms for converting one type of coefficients into another, and vice-versa, are already known in the art.
In the lower portion of figure 5, extraction of excitation data EN is performed in 18. The excitation data contents differ from one type of coder to another. When using Adaptive Predictive Coding (APC) means, the EN parameters will include the encoded excitation signal. With Linear Prediction Vocoders (LPC), the EN parameters include: pitch period indication; voiced/unvoiced decision indication; and block energy indication. With Voice Excited Predictive Coders (VEPC), the EN parameters include: encoded base-band signal and high-frequency energy indication, respectively designated by SIGNAL and ENERGY in the above mentioned patent.
A common block diagram of the synthesizing means for the three above techniques, i.e. APC, LPC and VEPC, is shown in figure 6. The received EN parameters are used for generating the excitation signal in 20. Said excitation signal is used to excite a model digital filter 22 whose coefficients are adjusted by the received prediction coefficients. Reconstructed voice samples are provided by filter 22.
The above mentioned coders may be used to achieve compression of a speech signal originally coded at 64 Kbps (CCITT PCM) into 2.4 to 32 Kbps. The resulting quality would range from synthetic quality (2.4 Kbps) to communications quality (16 Kbps) and toll quality (32 Kbps). For a full understanding of the above comments one can refer to "Speech Coding", published by J.L. FLANAGAN, M.R. SCHROEDER, B.S. ATAL, R.E. CROCHIERE, N.S. JAYANT and J.M. TRIBOLET in IEEE Trans. on Communications, Vol. COM-27, No 4, pp 710-737, April 1979. Such compression already enables a more efficient use of the communication channel. The use of TASI techniques roughly doubles this efficiency at no substantial extra cost, which is particularly true with this improved voice activity detection method.
The activity decision at the output of each voice coder CODE 1,..., CODE P (see fig. 1) is naturally based on an evaluation, for each block of N input speech samples, of the signal energy, and on the comparison of this energy with an activity threshold.
The characteristics of possibly existing background noise in any normal environment will also be taken into account by continuously evaluating the power spectrum of said noise.
In addition, the proposed process keeps the required processing workload significantly low when associated with a speech coder based on linear prediction.
Indeed, the short term power spectrum of the signal in a block of samples is directly related to the autocorrelation function of this signal, and the energy of the signal is well approximated by the magnitude of the largest sample within the block. Both pieces of information are already available within the coder. One is already used for the computation of the predictor coefficients, the other for intermediate signal scaling in a fixed point implementation. For instance, in coders operating according to Block Companding PCM techniques, the block characteristic term (C), or scaling factor, already available is directly related to the magnitude of the largest sample within the block. In other words, given a block of N samples x(n) with n = 1, 2,..., N, the magnitude XMAX of the largest sample is normally determined within the coder independently of any voice activity detection requirement.
$$C = \mathrm{XMAX} = \max_{n = 1, \dots, N} \lvert x(n) \rvert$$
In practice, the C coefficient is used for normalizing the input signal prior to performing the autocorrelation coefficients determination, and thus "C" is already available within the coder, apart from any voice activity determination concern.
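As an illustration of the block characteristic term, the following Python sketch computes the peak magnitude of a block and uses it as a scaling factor. The function names `block_peak` and `scale_block` are hypothetical, and floating-point division stands in for the fixed point scaling a real coder would use.

```python
def block_peak(samples):
    """Return the largest sample magnitude found in one block (XMAX)."""
    return max(abs(x) for x in samples)


def scale_block(samples):
    """Normalize a block by its peak; return (scaled_samples, xmax).

    An all-zero block is returned unchanged to avoid dividing by zero.
    """
    xmax = block_peak(samples)
    if xmax == 0:
        return list(samples), 0
    return [x / xmax for x in samples], xmax
```

Since the peak is produced as a by-product of scaling, a detector built on it needs no separate energy computation.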
For the duration of each block of samples (e.g. 20 ms), and based on each XMAX value measured, the Voice Activity Detection (VAD) operations will be performed on the following principles. If XMAX is smaller than a predetermined threshold level, then the threshold should be adjusted to XMAX rapidly; otherwise, the threshold adjustment will be made progressively from one block of signal samples to the following blocks. This threshold adjustment helps tracking background noises with increasing energy levels.
The second principle is based on a measurement of XMAX with respect to the current threshold value. If XMAX is substantially larger than the threshold level, i.e. XMAX > k.(threshold value) with k > 1, the block of samples being processed is considered as deriving from a voice signal, i.e. the corresponding channel is considered "active". Otherwise, an ambiguity still remains which should be resolved.
It should be understood that instead of XMAX one might consider any block energy representative information.
The ambiguity resolution is based on two assumptions. First, if the time delay between the block of samples being presently processed and the last "active" block provided by the considered channel is less than a given hangover delay, then the block is classified as being an "active" block (i.e. provided by an active channel). Otherwise, the system would rely on an additional test based on spectral analysis of the signal. In other words, the system would then rely on the short-term power spectrum of the signal in a block of samples, which is directly related to the autocorrelation function (R(i)) of this signal. Assuming the R(i) function variation is significantly large, then the block is considered "active"; otherwise the block is considered "inactive", i.e. equivalent to a silence.
The hangover delay consideration helps bridging short intersyllabic silences (0.1 to 1 second for instance) while it does not significantly increase the speech activity (less than 5 %). This hangover enables avoiding possible intersyllabic clipping, which is unpleasant.
The threshold adjustment combined with spectral variations analysis enables rejecting large steady background noise.
For example, assuming the speaker operates in a white noise environment, if a blower is turned on, thus producing a high acoustic energy, the Voice Activity Detector will adapt itself, detect low energy voice segments such as fricatives in utterance attacks, and reject non-speech segments.
Figures 7 and 8 summarize the various steps of the Voice Activity Detection method according to which each block of samples is processed. The current autocorrelation coefficients R(i) as well as XMAX are previously stored.
XMAX is first compared with a predetermined threshold level VADTH, initially set empirically. The level of said threshold is then dynamically adjusted based on this XMAX versus VADTH test. If XMAX is smaller than VADTH, then the threshold is rapidly updated to the XMAX value. Otherwise, VADTH is updated by a small increment by setting the new VADTH to VADTH + 1, with the decimal value of said "1" increment being equal to 1/2^11 or 1/2048.
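The threshold adjustment rule can be sketched in a few lines of Python. This is a minimal illustration, assuming XMAX and VADTH are expressed as fractions of full scale so that the increment is 1/2048; the function name `update_threshold` and the constant name `STEP` are hypothetical.

```python
STEP = 1 / 2048  # decimal value of the "1" increment, i.e. 1/2^11


def update_threshold(vadth, xmax, step=STEP):
    """Return the adjusted threshold for the current block.

    The threshold drops at once to XMAX when XMAX is below it,
    and otherwise creeps upward by one small increment per block.
    """
    if xmax < vadth:
        return xmax
    return vadth + step
```

The asymmetry is the point of the rule: a quiet block pulls the threshold down immediately, whereas a rise in background noise is followed only slowly, one increment per 20 ms block.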
The next test determines whether XMAX is substantially greater than VADTH. For that purpose, XMAX is compared with k.VADTH, with k = 2 or 4 as indicated in connection with fig. 8. If this is the case, i.e. XMAX > k.VADTH, then the block is said to be active, i.e. to belong to a speech signal, and a flag (VADFLAG) is set to one. Simultaneously, a hangover counter, i.e. timer VADTOUT, is set to a predetermined time delay value, say 3 to 50 block durations (20 ms each). If XMAX is not substantially greater than VADTH, then an ambiguity persists: the block might be active or inactive. The hangover counter is decreased by one unit for the currently processed block. As long as the counter contents are positive, the block is classified as an active block.
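The energy test and the hangover counter described above can be sketched as follows. This is an illustrative Python reading of the text, not the patented implementation: the function name `energy_decision`, the string labels and the default of 3 hangover blocks are assumptions chosen from the ranges quoted in the text.

```python
def energy_decision(xmax, vadth, timer, k=2, hangover_blocks=3):
    """Classify one block from its peak value; return (label, new_timer).

    label is "active" when the energy test passes or the hangover is still
    running, and "ambiguous" when the spectral test is needed to decide.
    """
    if xmax > k * vadth:
        # Clear speech energy: flag the block and re-arm the hangover timer.
        return "active", hangover_blocks
    timer -= 1  # one block duration has elapsed since the last clear speech
    if timer > 0:
        return "active", timer
    return "ambiguous", timer
```

With a 3-block hangover, the two blocks following a clearly active block are still declared active; the next one is handed over to the spectral variation test.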
Now, assuming the hangover time has elapsed, the short term power spectrum function variation is computed by measuring
$$\mathrm{SUM} = \sum_{i} \lvert R(i) - R_{old}(i) \rvert$$
wherein $\sum$ symbolizes a summing operation, and | | indicates that magnitude is considered. If SUM is greater than a predetermined value RX, empirically set to, say, the decimal value 1280/2048 or 640/2048, then again the block is considered active. Otherwise, the block is classified "non active", i.e. corresponding to a speaker's silence. The VADFLAG is then set to zero.
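The spectral variation test lends itself to a short Python sketch. The names `spectral_variation` and `resolve_ambiguity` are hypothetical, and the default reference value of 640/2048 is just one of the two values quoted in the text.

```python
def spectral_variation(r, r_old):
    """Sum of magnitude differences between current and stored coefficients."""
    return sum(abs(a - b) for a, b in zip(r, r_old))


def resolve_ambiguity(r, r_old, rx=640 / 2048):
    """Return True (active) when the spectrum moved by more than RX."""
    return spectral_variation(r, r_old) > rx
```

A steady background noise keeps its autocorrelation coefficients close to the stored snapshot, so the sum stays small and the block is rejected, whereas low-level speech changes the spectrum and pushes the sum above the reference value.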
The short term power spectrum information may also be derived differently, e.g. using prediction coefficients rather than R(i)'s.
While figure 7 summarizes the main steps of the Voice Activity Detection process, the short term power spectrum information computing process and the various parameter updatings are more particularly addressed by figure 8.
According to this figure, several tests are performed. The first test (VADTOUT > -3 ?) enables setting k to 2 or 4 and the RX decimal value to 0.3 or 0.6.
The second test is intended to decide when a snapshot should be taken of the autocorrelation function, which will later on be used to update the Rold(i) terms. For instance, the updating operation may be performed at the 25th inactive (silence) block, in other words after 25 consecutive detections of inactive blocks. But the effective Rold updating operation is delayed by 5 additional consecutive ambiguous blocks. Also, assuming more ambiguous blocks are detected subsequently, the VADTOUT is arbitrarily set to a fixed value to avoid any counter overflow.
A block diagram of a system for implementing the Voice Activity Detection process is shown in figure 9. An input buffer (BUF) stores blocks of samples x(n). Assuming the input signal is sampled at 8 KHz, and assuming each block of samples represents a segment of signal 20 ms long, then each block contains 160 samples. These samples are sorted in 26 to derive, for each block of samples, the XMAX information therefrom. In a fixed point implementation, the XMAX determination is already performed within the coder to scale the samples, and need not be repeated for Voice Activity Detection (VAD) purposes.
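The block size arithmetic above (8 KHz sampling, 20 ms blocks, 160 samples) and the buffering into blocks can be checked with a small Python sketch. The helper names `samples_per_block` and `split_into_blocks` are hypothetical, and dropping a trailing partial block is an assumption made only to keep the example simple.

```python
def samples_per_block(sample_rate_hz, block_ms):
    """Number of samples in one block of the given duration."""
    return sample_rate_hz * block_ms // 1000


def split_into_blocks(samples, block_len):
    """Yield consecutive full blocks; a trailing partial block is dropped."""
    for start in range(0, len(samples) - block_len + 1, block_len):
        yield samples[start:start + block_len]
```

Calling `samples_per_block(8000, 20)` gives the 160 samples per block stated in the text.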
XMAX is then moved into a threshold adjusting device (TH. ADJUST) 28 where it is first compared with a previously set threshold VADTH. Based on the result of said comparison, VADTH is adjusted by either being slightly incremented or by being forced to the XMAX value.
The test comparing XMAX with k.VADTH (k = 2 or 4, for example) is then performed in (COMPAR.) 30.
A bit S1 is set to one whenever this test shows XMAX to be greater than k.VADTH. S1 is used to set a VADFLAG latch 32 and to set the timer VADTOUT 34 to, say, 3 units (i.e. 60 ms). Whenever S1 = 0, the VADTOUT timer is decremented by one unit (i.e. 20 ms).
The timer 34 provides a gating bit whenever the timer contents are equal to -25. This bit is used to open a gate 36 to update the contents of an autocorrelation memory 38. The normalized autocorrelation coefficients R(i) to be moved into memory 38 are provided by a device 40 which belongs to the autocorrelation function determinator (DAF) 14. This updating is done through a buffer R(i)RSv, and is confirmed when the counter VADTOUT is equal to -30. The R(i) coefficients need not be computed especially for the Voice Activity Detection operation. They have already been computed within the coder, for each block of samples.
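For readers who want to experiment without a coder at hand, normalized autocorrelation coefficients of a block can be computed as in the Python sketch below. It is only an illustration: the normalization by the zero-lag term, the default order of 8 and the function name `normalized_autocorrelation` are assumptions, since the text does not specify how the coder normalizes R(i) or how many coefficients it keeps.

```python
def normalized_autocorrelation(samples, order=8):
    """Return [R(0)/R(0), R(1)/R(0), ..., R(order)/R(0)] for one block.

    An all-zero block yields all-zero coefficients to avoid dividing by zero.
    """
    r0 = sum(x * x for x in samples)
    if r0 == 0:
        return [0.0] * (order + 1)
    n = len(samples)
    return [
        sum(samples[j] * samples[j + lag] for j in range(n - lag)) / r0
        for lag in range(order + 1)
    ]
```

The resulting list can serve both as the current R(i) and, once stored during a silence, as the reference snapshot against which later blocks are compared.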
Whenever the VADTOUT timer contents are equal to zero, the R(i) function variation computation is started in SUM device 42. Said device 42, being connected to devices 38 and 40, computes $\sum_{i} \lvert R(i) - R_{old}(i) \rvert$ and thus determines the magnitude of variation of the short term power spectrum characteristics.
The device 42 also compares SUM to a short term power spectrum variation reference value RX. A positive test SUM > RX sets a bit S2 to logic level 1 (active channel). This logic level is used to set the VADFLAG to 1. The VADFLAG = 1 indication is also forwarded to the multiplexer 10 (fig. 1), wherein its PORT origin is identified. Otherwise S2 = 0, and said S2 bit is inverted and used to reset VADFLAG to zero, in which case the channel is considered inactive or idle.
FRY
JEANS FOR IMPLE~lE~TING SAID PROCESS
grief background of the invention Field of invention This invention deals with digital speech transmission and more particularly with means for efficiently processing speech signals to enable effective use of channel bandwidth.
Technical background In view of the high cost of transmission channels, it might be wise to take advantage of any speech characteristics which would enable concentrating the traffic of a maxim.
number of telephone users over a same channel.
During a telephone conversation, each subscriber speaks less than half the time during which the correction is established. The remaining idle time is devoted to listening, gaps between words and syllables, and pauses. A
number of systems have already been proposed which take advantage o- this idle time. For instance, additional users, up to twice the overall channel capacity, are assigned to a same channel in TAXI systems see J. Campanella digital TAXI", Combat Technical Revue of 1975). These souses obviously need means for detecting any user's inactivity to be able to assign the channel to a different user.
Unfortunately, the voice activity determination is far Crow being straightforward. In general, the method or detections voice activity of a given speaker is based on measurement o' the speech signal energy over short periods ox tire. The measured energy is -then compared With a presupposed threshold level. Speech is declared present to the measured energy exceeds the threshold, otherwise the period is declared idle, i.e. the concerned speaker is considered silent Lo said period of time. The prickly lies it the threshold determination due to the rot that dire rent speakers usually speak at different levels and a 50 Guy 'O
the fact that the losses vary from one transition line to FRY
I
another. A too highly set threshold would result into Tao speech signals being clipped. Thus the received speech signal would be of rather poor quality. While a low threshold would obviously lead to a poor TAXI system efficiency. In addition, one has also to take into account the presence of noise which should be discriminated from voice signals.
An object of the present invention is to provide for an improved speech activity detection process.
Another object of the invention is to provide for means for detecting low level speech activity within high level background noise.
Summary of Invention In a transmission system for transmitting voice signals each signal being sampled and coded to derive therefrom blocs of samples and short term power spectrum characteristics of each block, a voice activity detection process is proposed wherein an energy representative information is derived prom each block of samples, which energy representative information is compared with a predetermined threshold an said threshold is adjusted accordingly. Then an active/ambiguous decision is taken based on tune relative magnitude of the energy information with respect to the adjusted threshold, josh arrogate if any is then resolved through analysis OX the magnitude of variation of the short term power spectrum characteristics.
The foregoing and other features and advantages of this invention will be apparent from the following more particular description of the preferred embodiment of the invention as illustrated in the accompanying drawings.
Brief Description of the Drawings
Fig. 1 is a block diagram of a TASI system.
Figs. 2 to 4 are block diagrams of prior art coders based on linear prediction theory.
Figs. 5 and 6 respectively summarize the characteristics of the linear prediction coders and decoders to be used for this invention.
Figs. 7 and 8 summarize the various steps of the invention process.
Fig. 9 is a block diagram of a device for implementing the invention.
Description of a Preferred Embodiment of the Invention
Represented in figure 1 is a block diagram of a TASI type of system. P users, i.e. voice terminal sources, are respectively attached to individual input channels through coders (CODE 1, CODE 2, ..., CODE P) attached to ports (PORT 1, PORT 2, ..., PORT P). Each coder converts the analog voice signal fed by a user through a port into digital data. The digital data are then concentrated over a single output channel L to be forwarded toward a remote receiving location (not shown), where they would be redistributed to the designated terminals (not shown) to which they are respectively assigned. The concentration operation is performed through a Time Division Multiplexer (TDM-MPX) 10.
Under normal TDM conditions the number of users is selected such that the total number of bits per second provided by the P sources matches the output line transmission rate capability. But, as mentioned above, such an arrangement would not take full advantage of a number of speech characteristics, e.g. silences, as TASI systems do.
For TASI operation, the number P of users attached to the system is purposely made higher than it would be in a conventional multiplex system. In other words, assuming all the users were in operation at a given instant, the multiplexer and, more particularly, the output line would be unable to handle the resulting data traffic without taking into consideration the above mentioned silences or other inactivities. This is why a Voice Activity Detector (VAD) 12 is connected to the output of each coder. Said VAD 12 is made to permanently scan the coders' outputs to detect those coders which may be considered active and gate their outputs into the multiplexer 10 through gates G1, G2, ..., GP. The voice activity detector 12 also provides the multiplexer with the active coders' address indications to be inserted within the multiplexed message and transmitted over the output line for each time frame. A voice terminal is considered active whenever its output level is above a threshold level preset within the Voice Activity Detector 12.
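By way of illustration only, the gating performed by the VAD 12 and the multiplexer 10 for one time frame may be sketched as follows. The names CoderOutput, build_frame and capacity are hypothetical and do not appear in the figures. In particular, the handling of more active coders than the output line can carry (here, the excess is simply left out of the frame) is an assumption, not something stated above.

```python
from dataclasses import dataclass


@dataclass
class CoderOutput:
    port: int       # port address of the coder (1..P), hypothetical field
    payload: bytes  # coded block delivered by the coder for this frame
    active: bool    # activity decision taken by the voice activity detector


def build_frame(outputs, capacity):
    """Gate the active coder outputs into one multiplexed frame.

    Returns (addresses, payloads): the port addresses of the admitted
    coders and their coded blocks, in scanning order. Active coders
    beyond `capacity` are dropped in this sketch (an assumption).
    """
    admitted = [o for o in outputs if o.active][:capacity]
    addresses = [o.port for o in admitted]
    payloads = [o.payload for o in admitted]
    return addresses, payloads
```

With four coders of which three are active and a line able to carry two blocks per frame, only the first two active ports are admitted, and their addresses accompany the data so that the receiving end can redistribute the blocks.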
A voice activity detector is proposed here which not only enables adequately adjusting its threshold to the speaker environmental conditions, but in addition takes full advantage of the coder characteristics. As already mentioned, voice activity detection requires energy measurements. The proposed Voice Activity Detector (VAD) achieves cost effectiveness by using data already available within the coder for performing the energy measuring operation. It applies to a number of coders based on the linear prediction theory, which assumes a modeling of the linear vocal tract by an all-pole filter. Reference on this subject is provided by J. D. Markel and A. H. Gray in their book "Linear Prediction of Speech", published in 1976 by Springer Verlag, New York.
The modeling applies to a wide range of digital speech compression systems, e.g. adaptive predictive coders (APC), voice excited predictive coders (VEPC), linear prediction vocoders (LPC). For references on these coders one should refer to a number of publications including:
- "Adaptive Predictive Coding of Speech Signals", by B. S. Atal and M. R. Schroeder, in Bell Syst. Tech. Journal, Vol. 49, pp. 1973-1986, October 1970;
- "9.6/7.2 Kbps Voice Excited Predictive Coder (VEPC)", by D. Esteban, C. Galand, D. Mauduit and J. Menez, in IEEE ICASSP, Tulsa, April 1978, wherein Kbps stands for kilobits per second; and,
- "A Linear Prediction Vocoder Simulation Based on the Autocorrelation Method", by J. D. Markel and A. H. Gray, in IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-22, no. 2, pp. 124-134, April 1974.
In APC coders, the speech signal is inverse filtered by an optimum predictor, resulting in a so-called excitation signal which is quantized, transmitted and used at the synthesis location to excite an all-pole filter. Both inverse and all-pole filter characteristics are derived from the voice signal characteristics. Shown in figure 2 is a block diagram summarizing the basic elements of an APC coder. The voice signal samples x(n), provided by blocks of N samples (see BCPCM or Block Companded PCM techniques), are fed into a predictor filter 1, the coefficients of which are derived in 2 from the voice signal analysis. An excitation (residual) signal is then derived in 3, which residual signal is coded into an EN information. The voice signal is thus finally converted into prediction coefficients and EN coded information.
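As a minimal sketch of the inverse filtering step just described, the residual of a block may be obtained by subtracting a linear prediction from each sample. The function name prediction_residual and the direct-form coefficients a(1), ..., a(p) are assumptions made for illustration: the text does not specify the filter structure, and samples preceding the block are taken as zero here.

```python
def prediction_residual(x, coeffs):
    """Inverse filter one block: e(n) = x(n) - sum_i a(i) * x(n - i).

    `x` is the block of samples and `coeffs` the assumed direct-form
    predictor coefficients [a(1), ..., a(p)]. Samples before the start
    of the block are treated as zero.
    """
    residual = []
    for n in range(len(x)):
        predicted = sum(
            a * x[n - i]
            for i, a in enumerate(coeffs, start=1)
            if n - i >= 0
        )
        residual.append(x[n] - predicted)
    return residual
```

For a block that the predictor models exactly, such as a decaying exponential with a first order predictor, the residual reduces to its first sample, which is why the excitation is cheaper to code than the signal itself.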
In VEPC coders, the excitation signal is approximated by considering only the lowest frequency band, or base band BB (e.g., 0-1 KHz), of the original excitation signal. A block diagram summarizing the VEPC coder functions is represented in fig. 3. The difference with the APC coder lies in the fact that only the base band is finally coded into SIGNAL in 5, while the upper band, e.g. 1-3 KHz, is represented by its energy (ENERG).
The upper band components will be synthesized, when needed (i.e. at the receiving station, not shown), by means of non-linear distortion, high-pass filtering, and energy matching. Additional information on VEPC coding is also available in US Patent 4,216,354 to this assignee.
A block diagram for an LPC coder is represented in fig. 4. In this case, the excitation signal is represented by a voiced/unvoiced (V/UV) decision (1 bit), a pitch period representation (e.g. 5 bits) and an energy indication (e.g. 4 bits) coded in (8).
In LPC decoders, and for synthesis purposes (not shown), the excitation will be approximated either by a pulse train at the pitch frequency in the case of voiced signals, or by a white noise in the case of unvoiced signals.
A common block diagram for the analysis part of the coders based on the three above cited techniques is shown in figure 5. The input speech signal is analyzed by blocks of samples x(n), based on the assumption that the signal is stationary within each block. The upper path of the analyser includes means for the determination of the autocorrelation function (DAF) 14, which means extracts spectral information R(i), based on autocorrelation coefficients, from the input signal. This spectral information is then processed, in (DPC) 16, for the determination of prediction coefficients, which coefficients are to be transmitted for being used for synthesis purposes within the corresponding receiver. Both devices 14 and 16 are eventually included within the device 2 of figures 2 through 4. In addition, algorithms for converting autocorrelation coefficients into prediction coefficients, and vice versa, are already known in the art.
In the lower portion of figure 5, extraction of excitation data EN is performed in device 18. The excitation data contents differ from one type of coder to another. When using Adaptive Predictive Coding (APC) means, the EN parameters will include the encoded excitation signal. With Linear Prediction Vocoders (LPC), the EN parameters include: pitch period indication; voiced/unvoiced decision indication; and block energy indication. With Voice Excited Predictive Coders (VEPC), the EN parameters include: encoded base-band signal and high-frequency energy indication, respectively designated by SIGNAL and ENERG in the above mentioned patent.
A common block diagram of the synthesizing means for the three above techniques, i.e. APC, LPC and VEPC, is shown in figure 6. The received EN parameters are used for generating the excitation signal in 20. Said excitation signal is used to excite a model digital filter 22 whose coefficients are adjusted by the received prediction coefficients. Reconstructed voice samples are provided by filter 22.
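The synthesis path of figure 6 may be sketched, for the LPC case, as an excitation generator followed by an all-pole filter. This is a hedged illustration only: the function names, the gain parameter, the use of uniformly distributed noise as the white noise source and the direct-form recursion are all assumptions, since the text does not detail the internals of devices 20 and 22.

```python
import random


def lpc_excitation(n_samples, voiced, pitch_period, gain, seed=0):
    """Generate an LPC-style excitation for one block.

    Voiced: a pulse train with one pulse every `pitch_period` samples.
    Unvoiced: white noise, drawn here from a uniform distribution
    (an assumption) and scaled by `gain`.
    """
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(excitation, coeffs):
    """Excite an all-pole model: y(n) = e(n) + sum_i a(i) * y(n - i).

    `coeffs` holds assumed direct-form coefficients [a(1), ..., a(p)];
    the filter memory is taken as zero at the start of the block.
    """
    y = []
    for n in range(len(excitation)):
        feedback = sum(
            a * y[n - i]
            for i, a in enumerate(coeffs, start=1)
            if n - i >= 0
        )
        y.append(excitation[n] + feedback)
    return y
```

A single pulse fed into a first order filter with coefficient 0.5 yields the decaying sequence 1, 0.5, 0.25, ..., which is the impulse response of the model.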
The above mentioned coders may be used to achieve compression of a speech signal originally coded at 64 Kbps (CCITT PCM) into a 2.4 to 32 Kbps signal. The resulting quality would range from synthetic quality (2.4 Kbps) to communications quality (16 Kbps) and toll quality (32 Kbps). For a full understanding of the above comments one can refer to "Speech Coding", published by J. L. Flanagan, M. R. Schroeder, B. S. Atal, R. E. Crochiere, N. S. Jayant and J. M. Tribolet in IEEE Trans. on Communications, Vol. COM-27, No. 4, pp. 710-737, April 1979. Such compression already enables a more efficient use of the communication channel. The use of TASI techniques roughly doubles this efficiency at no substantial extra cost, which is particularly true with this improved voice activity detection method.
The activity decision at the output of each voice coder CODE 1, ..., CODE P (see fig. 1) is naturally based on an evaluation, for each block of N input speech samples, of the signal energy, and on the comparison of this energy with an activity threshold.
The characteristics of possibly existing background noise in any normal environment will also be taken into account by continuously evaluating the power spectrum of said noise.
In addition, the proposed process will keep significantly low the processing workload required once associated with a speech coder based on linear prediction.
Indeed, the short term power spectrum of the signal in a block of samples is directly related to the autocorrelation function of this signal, and the energy of the signal is well approximated by the magnitude of the largest sample within the block. Both pieces of information are already available within the coder. One is already used for the computation of the predictor coefficients, the other for intermediate signal scaling in a fixed point implementation. For instance, in coders operating according to Block Companding PCM techniques, the block characteristic term (C), or scaling factor, already available is directly related to the magnitude of the largest sample within the block. In other words, given a block of N samples x(n) with n = 1, 2, ..., N, the magnitude XMAX of the largest sample is normally determined within the coder independently of any voice activity detection requirement:
C = XMAX = MAX ( |x(n)| ), n = 1, 2, ..., N
In practice, the C coefficient is used for normalizing the input signal prior to performing the autocorrelation coefficients determination, and thus "C" is already available within the coder, apart from any voice activity determination concern.
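The two quantities reused by the detector may be sketched as follows. The function names are hypothetical, and dividing the autocorrelation terms by R(0) is an assumed normalization: the text only states that the block is scaled by C before the autocorrelation coefficients are determined.

```python
def block_xmax(block):
    """XMAX: magnitude of the largest sample within the block."""
    return max(abs(s) for s in block)


def normalized_autocorrelation(block, order):
    """Autocorrelation terms R(0..order) of a block scaled by XMAX.

    The block is first divided by C = XMAX, as a fixed point coder
    would do to avoid overflow, and the result is normalized so that
    R(0) = 1 (an assumption). A silent block yields all zeros.
    """
    c = block_xmax(block)
    if c == 0:
        return [0.0] * (order + 1)
    scaled = [s / c for s in block]
    r = [
        sum(scaled[n] * scaled[n - i] for n in range(i, len(scaled)))
        for i in range(order + 1)
    ]
    return [v / r[0] for v in r]
```

Both values come out of the same pass over the block, which is the point made above: the detector needs no energy or spectrum computation beyond what the coder already performs.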
For the duration of each block of samples (e.g. 20 ms), and based on each XMAX value measured, the Voice Activity Detection (VAD) operations will be performed on the following principles. If XMAX is smaller than a predetermined threshold level, then the threshold should be adjusted to XMAX rapidly; otherwise, the threshold adjustment will be made progressively from one block of signal samples to the following blocks. This threshold adjustment helps tracking background noises with increasing energy levels.
The second principle is based on a measurement of XMAX with respect to the current threshold value. If XMAX is substantially larger than the threshold level, i.e. XMAX > k.(threshold value), with k > 1, the block of samples being processed is considered as deriving from a voice signal, i.e. the corresponding channel is considered "active". Otherwise, an ambiguity still remains which should be resolved.
It should be understood that instead of XMAX one might consider any block energy representative information XM.
The ambiguity resolution is based on two assumptions. First, if the time delay between the block of samples being presently processed and the last "active" block provided by the considered channel is less than a given hangover delay, then the block is classified as being an "active" block (i.e. provided by an active channel). Otherwise, the system would rely on an additional test based on spectral analysis of the signal. In other words, the system would then rely on the short-term power spectrum of the signal in a block of samples, which is directly related to the autocorrelation function (R(i)) of this signal. Assuming the R(i) function variation is significantly large, then the block is considered "active"; otherwise the block is considered "inactive", i.e. equivalent to a silence.
The hangover delay consideration will help bridging short intersyllabic silences (0.1 to 1 second for instance), while it does not increase significantly the speech activity (less than 5 %). This hangover enables avoiding the possible intersyllabic clipping, which is unpleasant.
The threshold adjustment combined with spectral variations analysis enables rejecting large steady background noise. For example, assuming the speaker operates in a white noise environment, if a blower is turned on, thus producing a high acoustic energy, the Voice Activity Detector will adapt itself and detect low energy voice segments such as fricatives in utterance attacks and reject non-speech segments.
Figures 7 and 8 summarize the various steps of the Voice Activity Detection method according to which each block of samples is processed. The current autocorrelation coefficients R(i) as well as XMAX are previously stored.
XMAX is first compared with a predetermined threshold level VADTH initially set empirically. The level of said threshold is then dynamically adjusted based on this XMAX versus VADTH test. If XMAX is smaller than VADTH, then the threshold is rapidly updated to the XMAX value. Otherwise, VADTH is updated by a small increment, by setting the new VADTH to VADTH + 1, with the decimal value of said "1" increment being equal to 1/2^11 or 1/2048.
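The threshold adjustment rule may be sketched as below. Amplitudes are assumed to be expressed as fractions of full scale, so that the "1" increment of the fixed point representation is worth 1/2048. The function name adjust_threshold is hypothetical.

```python
STEP = 1 / 2048  # decimal value of the "1" increment, i.e. 1/2**11


def adjust_threshold(xmax, vadth, step=STEP):
    """Return the adjusted threshold VADTH for the current block.

    If XMAX is below the threshold, the threshold is forced down to
    XMAX at once; otherwise it creeps up by one small step per block.
    """
    if xmax < vadth:
        return xmax
    return vadth + step
```

The asymmetry is deliberate in this reading of the text: the threshold drops immediately to any quieter block, yet needs many blocks to climb, so it settles near the level of a steady background noise rather than near speech peaks.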
The next test determines whether XMAX is substantially greater than VADTH. For that purpose, XMAX is compared with k.VADTH, with k = 2 or 4 as indicated in connection with fig. 8. If this is the case, i.e. XMAX > k.VADTH, then the block is said to be active, i.e. to belong to a speech signal, and a flag (VADFLAG) is set to one. Simultaneously, a hangover counter, i.e. timer VADTOUT, is set to a predetermined time delay value, say 3 to 50 block durations (20 ms each). If XMAX is not substantially greater than VADTH, then an ambiguity persists. The block might be active or inactive. The hangover counter is decreased by one unit for the currently processed block. As long as the counter contents is positive, the block is classified as an active block.
Now, assuming the hangover time has elapsed, then the short term power spectrum function variation is computed by measuring
SUM = Σ | R(i) - Rold(i) |
wherein Σ symbolizes a summing operation over the index i, and | | indicates that the magnitude is considered. If SUM is greater than a predetermined value RX empirically set to, say, the decimal value 1280/2048 or to 640/2048, then again the block is considered active. Otherwise, the block is classified "non active", or corresponding to a speaker's silence. The VADFLAG is then set to zero.
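The per-block decision summarized by figure 7 may be sketched as follows. This is an illustration under stated assumptions, not the circuit of figure 9: VadState and classify_block are hypothetical names, k and RX are held fixed instead of being switched as in figure 8, the hangover is set to 3 blocks, and the updating of the Rold(i) terms is left out.

```python
from dataclasses import dataclass


@dataclass
class VadState:
    vadth: float    # current amplitude threshold VADTH
    rold: list      # reference autocorrelation terms Rold(i)
    timer: int = 0  # hangover counter VADTOUT


def classify_block(state, xmax, r, k=2, hangover=3, rx=640 / 2048, step=1 / 2048):
    """Return True (VADFLAG = 1) or False (VADFLAG = 0) for one block."""
    # Threshold adjustment: force down to XMAX, or creep up by one step.
    state.vadth = xmax if xmax < state.vadth else state.vadth + step
    # Energy test against k.VADTH: clearly active, re-arm the hangover.
    if xmax > k * state.vadth:
        state.timer = hangover
        return True
    # Ambiguous block: run the hangover counter down first.
    state.timer -= 1
    if state.timer > 0:
        return True
    # Hangover elapsed: resolve the ambiguity on spectral variation.
    total = sum(abs(a - b) for a, b in zip(r, state.rold))
    return total > rx
```

In this sketch, a loud block followed by quiet blocks whose R(i) match the reference stays active for the hangover duration and then turns inactive, while a quiet block whose R(i) differ markedly from the reference is still declared active, which is how low level speech is separated from steady noise.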
The short term power spectrum information may also be derived differently, e.g. using prediction coefficients rather than the R(i)'s.
While figure 7 summarizes the main steps of the Voice Activity Detection process, the short term power spectrum information computing process and the various parameter updatings are more particularly addressed by figure 8. According to this figure, several tests are performed. The first test (VADTOUT > -3 ?) enables setting k to 2 or 4 and the RX decimal value to 0.3 or 0.6.
The second test is intended to decide when a snapshot should be taken of the autocorrelation function which will later on be used to update the Rold(i) terms. For instance, the updating operation may be performed at the 25th inactive (silence) block, in other words after 25 consecutive detections of inactive blocks. But the effective Rold updating operation is delayed by 5 additional consecutive ambiguous blocks. Also, assuming more ambiguous blocks are detected subsequently, the VADTOUT is arbitrarily set to a fixed value to avoid any counter overflow.
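The snapshot and delayed update of the Rold(i) terms may be sketched as below, keyed on the hangover counter value. The class name RoldUpdater, the counter values -25 and -30 used as triggers, and the clamp value -31 are assumptions chosen for illustration: the text states the 25 and 5 block delays but does not give the fixed value at which VADTOUT is held.

```python
SNAPSHOT_AT = -25  # counter value at which R(i) is snapshotted (assumed)
COMMIT_AT = -30    # counter value at which the snapshot becomes Rold (assumed)
FLOOR = -31        # hypothetical value at which the counter is held


class RoldUpdater:
    def __init__(self, rold):
        self.rold = list(rold)  # reference terms Rold(i) in use
        self.pending = None     # snapshot awaiting confirmation

    def step(self, timer, r):
        """Process one ambiguous block, after the counter was decremented.

        Returns the counter value to keep, clamped to avoid overflow.
        """
        if timer > SNAPSHOT_AT:
            # The run of ambiguous blocks was interrupted: drop the snapshot.
            self.pending = None
        elif timer == SNAPSHOT_AT:
            self.pending = list(r)
        elif timer == COMMIT_AT and self.pending is not None:
            self.rold = self.pending
            self.pending = None
        return max(timer, FLOOR)
```

Delaying the commit by five further blocks guards against adopting, as the noise reference, a snapshot taken just before speech resumes.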
A block diagram of a system for implementing the Voice Activity Detection process is shown in figure 9. An input buffer (BUF) stores blocks of samples x(n). Assuming the input signal is sampled at 8 KHz, and assuming each block of samples represents a segment of signal 20 ms long, then each block contains 160 samples. These samples are sorted in 26 to derive, for each block of samples, the XMAX information therefrom. In a fixed point implementation, the XMAX determination is already performed within the coder to scale the samples, and need not be repeated for the Voice Activity Detection (VAD) purposes.
XMAX is then moved into a threshold adjusting device (TH. ADJUST) 28 where it is first compared with a previously set threshold VADTH. Based on said compare result, VADTH is adjusted by either being slightly incremented or by being forced to the XMAX value.
The XMAX versus k.VADTH test (k = 2 or 4 for example) is then performed in (COMPAR.) 30. A bit S0 is set to one when this test shows XMAX to be larger than k.VADTH. S0 is used to set a VADFLAG latch 32 and to set the timer VADTOUT 34 to, say, 3 units (i.e. 60 ms). Whenever S0 = 0, the VADTOUT timer is decremented by one unit (i.e. 20 ms).
The timer 34 provides a gating bit whenever the timer contents is equal to -25. This bit is used to open a gate 36 to update the contents of an autocorrelation memory 38. The normalized autocorrelation coefficients R(i) to be moved into memory 38 are provided by a device 40 which belongs to the autocorrelation function determinator (DAF) 14. This updating is done through a buffer R(i)RSV, and is confirmed when the counter VADTOUT is equal to -30. The R(i) coefficients need not be computed especially for the Voice Activity Detection operation. They have already been computed within the coder, for each block of samples.
Whenever the VADTOUT timer contents is equal to zero, the R(i) function variation computation is started in SUM device 42. Said device 42, being connected to devices 38 and 40, computes Σ | R(i) - Rold(i) | and thus determines the magnitude of variation of the short term power spectrum characteristics.
The device 42 also compares SUM to a short term power spectrum variation reference value RX. A positive test SUM > RX sets a bit S1 to logic level 1 (active channel). This logic level is used to set the VADFLAG to 1. The VADFLAG = 1 indication is also forwarded to the multiplexer 10 (fig. 1) together with an indication identifying its PORT origin. Otherwise S1 = 0, and said S1 bit is inverted and used to reset VADFLAG to zero, in which case the channel is considered inactive or idle.
Claims (8)
1. A system wherein at least one voice signal provided by a source through an input channel is coded to derive therefrom blocks of samples x(n) of predetermined duration, and, short term power spectrum information, a voice activity detection process for discriminating active voice blocks from non active voice blocks said process including, for each block of samples :
a) Setting an amplitude threshold VADTH ;
b) Processing the block of x(n) values to derive therefrom a signal energy representative information XM ;
c) First comparing XM to VADTH and adjusting said threshold accordingly ;
d) Second comparing XM to k.VADTH, where k is a predetermined numerical value and VADTH is the adjusted threshold, to derive therefrom a channel activity indication when XM is larger than k.VADTH, or an ambiguity indication otherwise, whereby a hangover timer is set upon activity detection or ambiguity resolution operations are to be performed upon ambiguity detection, which ambiguity resolution includes :
- decreasing and testing said timer contents whereby a positive timer contents is indicative of an active voice block and a negative timer contents is still indicative of an ambiguity situation ;
- computing short term power spectrum information variation between the currently processed block and at least one previously processed block ; and,
- comparing said short term power spectrum variation with a preset reference level whereby the currently processed ambiguous block is considered inactive or active based on said comparison indication.
2. A system according to claim 1 wherein said threshold adjustment is performed by either forcing the VADTH
value to XM or by progressively adjusting said threshold from one block to the next, depending upon the first XM to VADTH compare result.
3. A system according to claim 2 wherein k = 2 or 4 according to said counter contents.
4. A system according to claim 3 wherein said timer is made to be set for a maximum value equal to three to fifty blocks durations, and to be decrementable by increments equal to a block duration.
5. A system according to claim 4 wherein said short term power spectrum information is provided by the set of autocorrelation coefficients R(i) derived from the processed block of samples.
6. A system according to claim 5 wherein said voice signal is coded according to the so-called linear prediction theory.
7. A system according to claim 6 wherein said signal energy representative information is approximated by the amplitude XMAX of the largest sample within the block.
8. A system according to any one of claims 1, 2 or 7 wherein said preset reference level is adjusted to a first or a second predetermined value based upon said hangover timer contents.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP83430018.8 | 1983-06-07 | ||
EP83430018A EP0127718B1 (en) | 1983-06-07 | 1983-06-07 | Process for activity detection in a voice transmission system |
Publications (1)
Publication Number | Publication Date |
---|---|
CA1231473A true CA1231473A (en) | 1988-01-12 |
Family
ID=8191498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA000454771A Expired CA1231473A (en) | 1983-06-07 | 1984-05-18 | Voice activity detection process and means for implementing said process |
Country Status (5)
Country | Link |
---|---|
US (1) | US4672669A (en) |
EP (1) | EP0127718B1 (en) |
JP (1) | JPS603240A (en) |
CA (1) | CA1231473A (en) |
DE (1) | DE3370423D1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110767236A (en) * | 2018-07-10 | 2020-02-07 | 上海智臻智能网络科技股份有限公司 | Voice recognition method and device |
Families Citing this family (77)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4764966A (en) * | 1985-10-11 | 1988-08-16 | International Business Machines Corporation | Method and apparatus for voice detection having adaptive sensitivity |
US5276765A (en) * | 1988-03-11 | 1994-01-04 | British Telecommunications Public Limited Company | Voice activity detection |
KR0161258B1 (en) * | 1988-03-11 | 1999-03-20 | 프레드릭 제이 비스코 | Voice activity detection |
FR2631147B1 (en) * | 1988-05-04 | 1991-02-08 | Thomson Csf | METHOD AND DEVICE FOR DETECTING VOICE SIGNALS |
FR2643523A1 (en) * | 1989-02-22 | 1990-08-24 | Applic Electro Tech Avance | Discriminator for digital transmissions |
CA1290868C (en) * | 1989-09-28 | 1991-10-15 | Maurizio Cecarelli | Voice data discriminator |
US5226108A (en) * | 1990-09-20 | 1993-07-06 | Digital Voice Systems, Inc. | Processing a speech signal with estimated pitch |
US5216747A (en) * | 1990-09-20 | 1993-06-01 | Digital Voice Systems, Inc. | Voiced/unvoiced estimation of an acoustic signal |
FR2670065B1 (en) * | 1990-11-30 | 1993-01-22 | Lmt Radio Professionelle | METHOD FOR THE DIGITAL TRANSMISSION OF SPEECH IN AN ASYNCHRONOUS NETWORK. |
ES2240252T3 (en) * | 1991-06-11 | 2005-10-16 | Qualcomm Incorporated | VARIABLE SPEED VOCODIFIER. |
EP0538536A1 (en) * | 1991-10-25 | 1993-04-28 | International Business Machines Corporation | Method for detecting voice presence on a communication line |
US5410632A (en) * | 1991-12-23 | 1995-04-25 | Motorola, Inc. | Variable hangover time in a voice activity detector |
SE501305C2 (en) * | 1993-05-26 | 1995-01-09 | Ericsson Telefon Ab L M | Method and apparatus for discriminating between stationary and non-stationary signals |
US5559832A (en) * | 1993-06-28 | 1996-09-24 | Motorola, Inc. | Method and apparatus for maintaining convergence within an ADPCM communication system during discontinuous transmission |
IN184794B (en) * | 1993-09-14 | 2000-09-30 | British Telecomm | |
US5586126A (en) * | 1993-12-30 | 1996-12-17 | Yoder; John | Sample amplitude error detection and correction apparatus and method for use with a low information content signal |
JP3484757B2 (en) * | 1994-05-13 | 2004-01-06 | ソニー株式会社 | Noise reduction method and noise section detection method for voice signal |
TW271524B (en) | 1994-08-05 | 1996-03-01 | Qualcomm Inc | |
US5742734A (en) * | 1994-08-10 | 1998-04-21 | Qualcomm Incorporated | Encoding rate selection in a variable rate vocoder |
US5497337A (en) * | 1994-10-21 | 1996-03-05 | International Business Machines Corporation | Method for designing high-Q inductors in silicon technology without expensive metalization |
AU696092B2 (en) * | 1995-01-12 | 1998-09-03 | Digital Voice Systems, Inc. | Estimation of excitation parameters |
US5822726A (en) * | 1995-01-31 | 1998-10-13 | Motorola, Inc. | Speech presence detector based on sparse time-random signal samples |
US5754974A (en) * | 1995-02-22 | 1998-05-19 | Digital Voice Systems, Inc | Spectral magnitude representation for multi-band excitation speech coders |
US5701390A (en) * | 1995-02-22 | 1997-12-23 | Digital Voice Systems, Inc. | Synthesis of MBE-based coded speech using regenerated phase information |
JPH08263099A (en) * | 1995-03-23 | 1996-10-11 | Toshiba Corp | Encoder |
GB2317084B (en) * | 1995-04-28 | 2000-01-19 | Northern Telecom Ltd | Methods and apparatus for distinguishing speech intervals from noise intervals in audio signals |
US5598466A (en) * | 1995-08-28 | 1997-01-28 | Intel Corporation | Voice activity detector for half-duplex audio communication system |
US6175634B1 (en) | 1995-08-28 | 2001-01-16 | Intel Corporation | Adaptive noise reduction technique for multi-point communication system |
US5844994A (en) * | 1995-08-28 | 1998-12-01 | Intel Corporation | Automatic microphone calibration for video teleconferencing |
FI100840B (en) * | 1995-12-12 | 1998-02-27 | Nokia Mobile Phones Ltd | Noise attenuator and method for attenuating background noise from noisy speech and a mobile station |
US5774849A (en) * | 1996-01-22 | 1998-06-30 | Rockwell International Corporation | Method and apparatus for generating frame voicing decisions of an incoming speech signal |
US5765130A (en) * | 1996-05-21 | 1998-06-09 | Applied Language Technologies, Inc. | Method and apparatus for facilitating speech barge-in in connection with voice recognition systems |
CN1225736A (en) | 1996-07-03 | 1999-08-11 | 英国电讯有限公司 | Voice activity detector |
US5751901A (en) * | 1996-07-31 | 1998-05-12 | Qualcomm Incorporated | Method for searching an excitation codebook in a code excited linear prediction (CELP) coder |
US5864793A (en) * | 1996-08-06 | 1999-01-26 | Cirrus Logic, Inc. | Persistence and dynamic threshold based intermittent signal detector |
US6708146B1 (en) | 1997-01-03 | 2004-03-16 | Telecommunications Research Laboratories | Voiceband signal classifier |
KR100302370B1 (en) * | 1997-04-30 | 2001-09-29 | 닛폰 호소 교카이 | Speech interval detection method and system, and speech speed converting method and system using the speech interval detection method and system |
US6023674A (en) * | 1998-01-23 | 2000-02-08 | Telefonaktiebolaget L M Ericsson | Non-parametric voice activity detection |
JP3273599B2 (en) * | 1998-06-19 | 2002-04-08 | 沖電気工業株式会社 | Speech coding rate selector and speech coding device |
US6351731B1 (en) | 1998-08-21 | 2002-02-26 | Polycom, Inc. | Adaptive filter featuring spectral gain smoothing and variable noise multiplier for noise reduction, and method therefor |
US6453285B1 (en) | 1998-08-21 | 2002-09-17 | Polycom, Inc. | Speech activity detector for use in noise reduction system, and methods therefor |
US6691084B2 (en) | 1998-12-21 | 2004-02-10 | Qualcomm Incorporated | Multiple mode variable rate speech coding |
US6556967B1 (en) | 1999-03-12 | 2003-04-29 | The United States Of America As Represented By The National Security Agency | Voice activity detector |
US6618701B2 (en) * | 1999-04-19 | 2003-09-09 | Motorola, Inc. | Method and system for noise suppression using external voice activity detection |
US6381568B1 (en) | 1999-05-05 | 2002-04-30 | The United States Of America As Represented By The National Security Agency | Method of transmitting speech using discontinuous transmission and comfort noise |
US7161931B1 (en) * | 1999-09-20 | 2007-01-09 | Broadcom Corporation | Voice and data exchange over a packet based network |
US6757301B1 (en) * | 2000-03-14 | 2004-06-29 | Cisco Technology, Inc. | Detection of ending of fax/modem communication between a telephone line and a network for switching router to compressed mode |
GB0007655D0 (en) * | 2000-03-29 | 2000-05-17 | Simoco Int Ltd | Digital transmission |
JP4201471B2 (en) * | 2000-09-12 | 2008-12-24 | パイオニア株式会社 | Speech recognition system |
JP4201470B2 (en) * | 2000-09-12 | 2008-12-24 | パイオニア株式会社 | Speech recognition system |
CN1311424C (en) * | 2001-03-06 | 2007-04-18 | 株式会社Ntt都科摩 | Audio data interpolation apparatus and method, audio data-related information creation apparatus and method, audio data interpolation information transmission apparatus and method, program and |
US7356464B2 (en) * | 2001-05-11 | 2008-04-08 | Koninklijke Philips Electronics, N.V. | Method and device for estimating signal power in compressed audio using scale factors |
US7941313B2 (en) * | 2001-05-17 | 2011-05-10 | Qualcomm Incorporated | System and method for transmitting speech activity information ahead of speech features in a distributed voice recognition system |
US7203643B2 (en) * | 2001-06-14 | 2007-04-10 | Qualcomm Incorporated | Method and apparatus for transmitting speech activity in distributed voice recognition systems |
US7746797B2 (en) * | 2002-10-09 | 2010-06-29 | Nortel Networks Limited | Non-intrusive monitoring of quality levels for voice communications over a packet-based network |
US20040234067A1 (en) * | 2003-05-19 | 2004-11-25 | Acoustic Technologies, Inc. | Distributed VAD control system for telephone |
US7269252B2 (en) * | 2003-08-06 | 2007-09-11 | Polycom, Inc. | Method and apparatus for improving nuisance signals in audio/video conference |
US8315865B2 (en) * | 2004-05-04 | 2012-11-20 | Hewlett-Packard Development Company, L.P. | Method and apparatus for adaptive conversation detection employing minimal computation |
US7752050B1 (en) * | 2004-09-03 | 2010-07-06 | Stryker Corporation | Multiple-user voice-based control of devices in an endoscopic imaging system |
US8443279B1 (en) | 2004-10-13 | 2013-05-14 | Stryker Corporation | Voice-responsive annotation of video generated by an endoscopic camera |
WO2006077559A2 (en) * | 2005-01-21 | 2006-07-27 | Koninklijke Philips Electronics N.V. | Method and apparatus for detecting the presence of a digital television signal |
US7346502B2 (en) * | 2005-03-24 | 2008-03-18 | Mindspeed Technologies, Inc. | Adaptive noise state update for a voice activity detector |
WO2006105275A2 (en) * | 2005-03-29 | 2006-10-05 | Sonim Technologies, Inc. | Push to talk over cellular (half-duplex) to full-duplex voice conferencing |
US7962340B2 (en) * | 2005-08-22 | 2011-06-14 | Nuance Communications, Inc. | Methods and apparatus for buffering data for use in accordance with a speech recognition system |
BRPI0807703B1 (en) | 2007-02-26 | 2020-09-24 | Dolby Laboratories Licensing Corporation | Method for improving speech in entertainment audio and non-transitory computer-readable media |
JP5229217B2 (en) * | 2007-02-27 | 2013-07-03 | 日本電気株式会社 | Speech recognition system, method and program |
EP2107553B1 (en) * | 2008-03-31 | 2011-05-18 | Harman Becker Automotive Systems GmbH | Method for determining barge-in |
EP2148325B1 (en) * | 2008-07-22 | 2014-10-01 | Nuance Communications, Inc. | Method for determining the presence of a wanted signal component |
GB0919672D0 (en) * | 2009-11-10 | 2009-12-23 | Skype Ltd | Noise suppression |
US8762150B2 (en) * | 2010-09-16 | 2014-06-24 | Nuance Communications, Inc. | Using codec parameters for endpoint detection in speech recognition |
US9502050B2 (en) | 2012-06-10 | 2016-11-22 | Nuance Communications, Inc. | Noise dependent signal processing for in-car communication systems with multiple acoustic zones |
DE112012006876B4 (en) | 2012-09-04 | 2021-06-10 | Cerence Operating Company | Method and speech signal processing system for formant-dependent speech signal amplification |
US9613633B2 (en) | 2012-10-30 | 2017-04-04 | Nuance Communications, Inc. | Speech enhancement |
US9530433B2 (en) * | 2014-03-17 | 2016-12-27 | Sharp Laboratories Of America, Inc. | Voice activity detection for noise-canceling bioacoustic sensor |
CN105321528B (en) * | 2014-06-27 | 2019-11-05 | 中兴通讯股份有限公司 | Microphone array speech detection method and device |
US9467569B2 (en) | 2015-03-05 | 2016-10-11 | Raytheon Company | Methods and apparatus for reducing audio conference noise using voice quality measures |
CN106599110A (en) * | 2016-11-29 | 2017-04-26 | 百度在线网络技术(北京)有限公司 | Artificial intelligence-based voice search method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA1130920A (en) * | 1979-03-05 | 1982-08-31 | William G. Crouse | Speech detector with variable threshold |
US4351983A (en) * | 1979-03-05 | 1982-09-28 | International Business Machines Corp. | Speech detector with variable threshold |
- 1983-06-07: EP application EP83430018A, patent EP0127718B1 (not active, Expired)
- 1983-06-07: DE application DE8383430018T, patent DE3370423D1 (not active, Expired)
- 1984-03-14: JP application JP59047325A, patent JPS603240A (active, Granted)
- 1984-05-18: CA application CA000454771A, patent CA1231473A (not active, Expired)
- 1984-05-31: US application US06/616,021, patent US4672669A (not active, Expired - Lifetime)
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110767236A (en) * | 2018-07-10 | 2020-02-07 | 上海智臻智能网络科技股份有限公司 | Voice recognition method and device |
Also Published As
Publication number | Publication date |
---|---|
JPS603240A (en) | 1985-01-09 |
JPH0226901B2 (en) | 1990-06-13 |
DE3370423D1 (en) | 1987-04-23 |
EP0127718A1 (en) | 1984-12-12 |
EP0127718B1 (en) | 1987-03-18 |
US4672669A (en) | 1987-06-09 |
Similar Documents
Publication | Title |
---|---|
CA1231473A (en) | Voice activity detection process and means for implementing said process |
US5812965A (en) | Process and device for creating comfort noise in a digital speech transmission system |
EP0786760B1 (en) | Speech coding |
RU2146394C1 (en) | Method and device for alternating rate voice coding using reduced encoding rate |
US8019599B2 (en) | Speech codecs |
US5933803A (en) | Speech encoding at variable bit rate |
EP0877355B1 (en) | Speech coding |
US5915235A (en) | Adaptive equalizer preprocessor for mobile telephone speech coder to modify nonideal frequency response of acoustic transducer |
KR100575193B1 (en) | A decoding method and system comprising an adaptive postfilter |
EP0785541B1 (en) | Usage of voice activity detection for efficient coding of speech |
US20010034601A1 (en) | Voice activity detection apparatus, and voice activity/non-activity detection method |
EP1214705B1 (en) | Method and apparatus for maintaining a target bit rate in a speech coder |
KR100798668B1 (en) | Method and apparatus for coding of unvoiced speech |
US7590532B2 (en) | Voice code conversion method and apparatus |
EP1554717B1 (en) | Preprocessing of digital audio data for mobile audio codecs |
WO2000010307A2 (en) | Adaptive rate network communication system and method |
US20050143984A1 (en) | Multirate speech codecs |
US6104994A (en) | Method for speech coding under background noise conditions |
CA2317969C (en) | Method and apparatus for decoding speech signal |
Ferrer-Ballester et al. | Efficient adaptive vector quantization of LPC parameters |
EP1557820A1 (en) | Voice activity detection operating with compressed speech signal parameters |
Viswanathan et al. | Medium and low bit rate speech transmission |
Legal Events
Code | Title |
---|---|
MKEX | Expiry |