CN104123934A - Speech composition recognition method and system - Google Patents


Info

Publication number
CN104123934A
CN104123934A (application CN201410353819.6A)
Authority
CN
China
Prior art keywords
signal
voice
characteristic parameters
short
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410353819.6A
Other languages
Chinese (zh)
Inventor
黄昭鸣
周林灿
李宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tai Ge Electronics (shanghai) Co Ltd
Original Assignee
Tai Ge Electronics (shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tai Ge Electronics (shanghai) Co Ltd filed Critical Tai Ge Electronics (shanghai) Co Ltd
Priority to CN201410353819.6A priority Critical patent/CN104123934A/en
Publication of CN104123934A publication Critical patent/CN104123934A/en
Pending legal-status Critical Current

Abstract

The invention discloses a speech composition (articulation) recognition method. The method includes: obtaining a sample signal, filtering and denoising it, quantizing it into a binary sample signal through an A/D (analog-to-digital) converter, and extracting from the binary sample signal a speech signal containing speech; extracting acoustic characteristic parameters from the speech signal; selecting and training an acoustic model, and estimating the parameters of the acoustic model from each acoustic characteristic parameter to obtain the optimal model parameters corresponding to the maximum likelihood values; and performing speech composition recognition by collecting a signal to be recognized and calculating, from the optimal model parameters, a probability value for each acoustic characteristic parameter of that signal to obtain a recognition result. The method can accurately recognize not only the content of speech but also the specific syllable-and-tone combination of each monosyllable. The invention further discloses a speech composition recognition system.

Description

Articulation ("speech composition") recognition method and system
Technical field
The present invention relates to speech recognition, and in particular to an articulation recognition method and system.
Background art
Articulation is the basis of speech production and is produced by the coordinated movement of the articulators (e.g., lower jaw, lips, tongue, soft palate). The smallest speech unit produced by articulatory movement is the phoneme; phonetics divides phonemes into two classes, vowels and consonants. The articulation recognition result for standard Mandarin comprises two parts: the syllable synthesized from a set of phonemes, and its tone. However, current articulation recognition technology cannot accurately distinguish words with identical syllables but different tones, and does not recognize at the phoneme level, so the recognition results are unsuitable for speech-language education.
To overcome these defects of the prior art — the inability to accurately recognize the content of speech, the inability to distinguish identical syllables carrying different tones, and recognition that is not phoneme-based, which make the results unsuitable for speech-language education — an articulation recognition method and system are proposed.
Summary of the invention
The present invention proposes an articulation recognition method comprising the following steps: acquiring a sample signal, filtering and denoising it, quantizing it into a binary sample signal by A/D conversion, and extracting from the binary sample signal a speech signal containing speech; extracting acoustic characteristic parameters from the speech signal, the parameters being used to identify syllables and tones; selecting and training an acoustic model, computing for each acoustic characteristic parameter its maximum likelihood under a hidden Markov model, and obtaining the optimal model parameters corresponding to the maximum likelihood values; and performing articulation recognition by collecting a signal to be recognized and computing, from the optimal model parameters, the probability value of each acoustic characteristic parameter of the signal to be recognized to obtain a recognition result.
In the articulation recognition method proposed by the present invention, the step of extracting the speech signal containing speech comprises: cutting the binary sample signal into a plurality of frames; computing the mean of the short-time autocorrelation maxima over at least one frame; using this mean to set the threshold of the short-time threshold-crossing rate for judging the current frame; judging from the short-time threshold-crossing rate whether the current frame is unvoiced or voiced; and judging all frames one by one until a start frame and an end frame are obtained, yielding the speech signal.
In the articulation recognition method proposed by the present invention, the short-time autocorrelation function is:

R̂_n(k) = Σ_{m=0}^{N−1} x_n(m) · x′_n(m+k);

where k is the lag index (up to the maximum lag count), R̂_n(k) is the short-time autocorrelation function, x_n(m) is the m-th sample of the speech signal, x′_n is the three-level quantized speech signal, and N is the number of speech-signal samples.
In the articulation recognition method proposed by the present invention, the short-time threshold-crossing rate is:

Z_n = Σ_{m=n−N+1}^{n} { |sgn[x_n(m) − T] − sgn[x_n(m−1) − T]| + |sgn[x_n(m) + T] − sgn[x_n(m−1) + T]| };

where sgn(x) = 1 for x ≥ 0 and −1 for x < 0;

and Z_n is the short-time threshold-crossing rate, T is the set threshold (a positive number), x_n(m) is the m-th sample of the speech signal, N is the number of speech-signal samples, and n is the frame index.
In the articulation recognition method proposed by the present invention, the method further comprises, after extracting the speech signal: emphasizing the high-frequency components of the speech signal; and windowing the speech signal with a window function.
In the articulation recognition method proposed by the present invention, the acoustic characteristic parameters comprise the first 12 Mel cepstral coefficients together with their first-order and second-order difference results, computed as follows: compute the power spectrum of the speech signal by fast Fourier transform; pass the power spectrum through a Mel filter bank to obtain the Mel spectrum; apply a discrete cosine transform to the Mel spectrum to obtain the Mel cepstral coefficients; and successively difference the Mel cepstral coefficients with respect to time to obtain the first-order and second-order difference results.
In the articulation recognition method proposed by the present invention, the acoustic characteristic parameters comprise the short-time log energy, expressed as:

E = log Σ_{n=1}^{N} s_n²;

where s_n is the discrete speech sequence, N is the total number of samples, and n is the sample index.
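As a rough illustration (not part of the patent text), the short-time log energy above can be sketched in Python; the small floor value guarding against log(0) on silent frames is an added assumption:

```python
import math

def short_time_log_energy(frame, floor=1e-10):
    """Short-time log energy E = log(sum of s_n^2) of one speech frame.

    `frame` is a list of samples; `floor` (an assumption, not in the
    patent) avoids log(0) on an all-zero frame.
    """
    energy = sum(s * s for s in frame)
    return math.log(max(energy, floor))
```

A frame with samples [1, 2, 2] has energy 9 and log energy log 9 ≈ 2.197.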
In the articulation recognition method proposed by the present invention, the step of obtaining the optimal model parameters comprises: computing the mean and covariance of the acoustic characteristic parameters; replacing the initial mean and covariance of the acoustic model with the mean and covariance of the acoustic characteristic parameters; estimating the model parameters of the acoustic model to obtain parameter estimates; substituting the parameter estimates into the acoustic model; and computing for each acoustic characteristic parameter its maximum likelihood under the hidden Markov model to obtain the optimal model parameters corresponding to the maximum likelihood values.
In the articulation recognition method proposed by the present invention, the parameter estimates are obtained by estimation according to the Baum-Welch algorithm.
In the articulation recognition method proposed by the present invention, the computation of the recognition result comprises: segmenting the signal to be recognized to obtain a word sequence of a plurality of words; extracting a plurality of acoustic characteristic parameters of the current word; computing, with the hidden Markov model under the optimal model parameters, the probability value of each acoustic characteristic parameter, and taking the parameter with the largest probability value as the recognition result of that word; and computing the recognition result of each word in turn to obtain the recognition result of the signal to be recognized.
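The decision rule in this step — score the features against every candidate model and keep the most probable one — can be sketched as follows; the `score` callback and model names are hypothetical placeholders, not part of the patent:

```python
def recognize(features, models, score):
    """Return the name of the model (e.g. a demisyllable HMM) under which
    the feature sequence is most probable.

    `models` maps a name to a model object; `score(model, features)` is
    assumed to return a (log-)likelihood, e.g. via the forward algorithm.
    """
    return max(models, key=lambda name: score(models[name], features))
```

With toy models whose "score" is just a stored number, the highest-scoring candidate wins.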
In the articulation recognition method proposed by the present invention, the method further comprises, after obtaining the recognition result: comparing the recognition result with a preset target sound to identify the initials, finals and tones in the signal to be recognized that exhibit articulation disorders.
The invention also proposes an articulation recognition system comprising: a voice acquisition device for collecting the sample signal and the signal to be recognized; a voice processing device for performing data conversion and pre-processing on the sample signal and the signal to be recognized and extracting their respective acoustic characteristic parameters; and an articulation recognition device for training the acoustic model on the acoustic characteristic parameters of the sample signal to obtain the optimal model parameters, and for computing the recognition result from the acoustic characteristic parameters of the signal to be recognized under the optimal model parameters.
In the articulation recognition system proposed by the present invention, the articulation recognition device is further used to judge from the recognition result which initials, finals and tones in the signal to be recognized exhibit articulation disorders.
The articulation recognition method of the present invention can not only accurately recognize the content of speech but also identify the specific syllable combination and tone of each monosyllable, and can be used in fields such as assessment and rehabilitation training for speech articulation disorders, speech recognition and encryption, and communication aids.
The present invention can further assess articulation disorders: by assessing the articulation clarity of a patient's speech, it can judge which specific initials, finals and tones the patient can articulate normally, and give the specific type of articulation disorder.
Brief description of the drawings
Fig. 1 is a flowchart of the articulation recognition method of the present invention.
Fig. 2 is a flowchart of Mel cepstral coefficient extraction.
Fig. 3 is a schematic diagram of computing the first-order and second-order difference results of the Mel cepstral coefficients.
Fig. 4 is a schematic diagram of articulation recognition based on a hidden Markov model.
Fig. 5 is a schematic flowchart of judging articulation disorders with the articulation recognition method in an embodiment.
Fig. 6 is a schematic structural diagram of the articulation recognition system of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below in conjunction with specific embodiments and the accompanying drawings. Except for the contents specially mentioned below, the processes, conditions and experimental methods for implementing the present invention are general and common knowledge in the art, and the present invention is not particularly limited in this respect.
Fig. 1 shows the articulation recognition method of the present invention, which comprises the following steps:
Acquire a sample signal; after filtering and denoising it, quantize it into a binary sample signal by A/D conversion, and extract from the binary sample signal a speech signal containing speech;
Extract the acoustic characteristic parameters from the speech signal; the parameters used to identify syllables and tones comprise the first 12 Mel cepstral coefficients and the short-time log energy, together with their first-order and second-order difference results, 39 parameters in total.
Select and train an acoustic model: estimate the parameter estimates of the acoustic model from each acoustic characteristic parameter, substitute the estimates into the acoustic model, compute the maximum likelihood of each acoustic characteristic parameter under the hidden Markov model, and obtain the optimal model parameters corresponding to the maximum likelihood values;
Perform articulation recognition: collect the signal to be recognized and compute, from the optimal model parameters, the probability value of each of its acoustic characteristic parameters to obtain the recognition result.
Each step of the articulation recognition method of the present invention is described in detail below.
(Acquiring the sample signal)
When acquiring the sample signal or the signal to be recognized, the quality of the input speech signal has a significant impact on the accuracy of system recognition, so higher requirements are placed on the recording quality and noise immunity of the sample signal and the signal to be recognized. After the sample signal is acquired, it is first filtered. The purpose of filtering is to suppress components of the sample signal whose frequency exceeds half the sampling frequency, f_s/2, to prevent aliasing interference, and at the same time to suppress interference at the 50 Hz mains frequency. The filtering can be implemented with a band-pass filter.
(bandpass filtering and A/D conversion)
The filtered sample signal is digitally sampled at a sampling frequency of f_s = 44100 Hz, producing a discrete-time sequence of the sample signal. This sequence is still not in a form the computer can recognize; it must be quantized into a binary signal by the A/D conversion operation. Uniform 12-bit quantization may be adopted, converting each sampling pulse of the sample signal into a 12-bit binary number for computer processing and recognition.
(Extracting the speech signal)
To recognize the articulation-related part of the sample signal, the effective start and end positions of the speech must be determined within the sample signal, and the effective speech data segment between the determined start point and end point is cut out. The present invention adopts an improved endpoint detection of effective speech that combines the maximum of the short-time autocorrelation of the speech signal with the short-time threshold-crossing rate. The short-time autocorrelation function and the short-time threshold-crossing rate introduced in the present invention are described further below.
(short-time autocorrelation function)
In this example, let x(m) be the time-domain expression of the windowed sample signal, with the n-th frame denoted x_n(m) and frame length N. The short-time autocorrelation function of this frame of the speech signal is:

R_n(k) = Σ_{m=0}^{N−1−k} x_n(m) · x_n(m+k), 0 ≤ k ≤ K;

where K is the maximum lag count.
The short-time autocorrelation function distinguishes voiced speech, unvoiced speech and noise very clearly: the short-time autocorrelation waveform of voiced speech has an obvious quasi-periodicity, and the short-time autocorrelation of unvoiced speech also differs considerably from that of noise, the latter's waveform more closely resembling a pulse. Because the time-domain waveform of speech changes rapidly, the length N of the window function should be chosen as small as possible; at the same time, the short-time periodicity of the speech signal can only be exhibited if the window function is of sufficient length (a frame of speech should contain at least two periods of the waveform). To resolve these two conflicting demands, the present invention adopts a modified short-time autocorrelation function using two window functions of different lengths: the speech signal is windowed to obtain x_n(m) and x′_n(m+k) respectively, and their product is taken, the lengths of the two windows differing by the maximum lag count K. Its expression is:

R̂_n(k) = Σ_{m=0}^{N−1} x_n(m) · x′_n(m+k), 0 ≤ k ≤ K;

where k is the lag index, R̂_n(k) is the short-time autocorrelation function, x_n(m) is the m-th sample of the speech signal, x′_n is the three-level quantized speech signal, and N is the number of speech-signal samples. Because the autocorrelation value at each lag is computed from N samples, the autocorrelation function is prevented from decaying as k increases.
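The modified autocorrelation above can be sketched as follows (an illustration under stated assumptions, not the patent's implementation): `xw` is the frame weighted by the length-N window and `xw2` the same frame weighted by a window K samples longer, so the lagged index m+k never runs off the end:

```python
def modified_autocorr(xw, xw2, N, K):
    """Modified short-time autocorrelation
    R_n(k) = sum_{m=0}^{N-1} x_n(m) * x'_n(m+k), for 0 <= k <= K.

    xw  : frame windowed with the length-N window
    xw2 : frame windowed with a window at least N+K samples long
    Returns the list [R(0), R(1), ..., R(K)]; every lag is a sum of
    exactly N products, so R does not decay as k grows.
    """
    return [sum(xw[m] * xw2[m + k] for m in range(N))
            for k in range(K + 1)]
```

For example, with xw = [1, 2, 3] and xw2 = [1, 2, 3, 4, 5] (N = 3, K = 2), R(0) = 14, R(1) = 20, R(2) = 26.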
(Short-time threshold-crossing rate)
The short-time zero-crossing rate is the number of times the speech waveform within a frame passes through the time axis (signal amplitude equal to zero); a "zero crossing" occurs wherever two adjacent samples in the signal have amplitude values of opposite sign. The short-time zero-crossing rate of the n-th frame x_n(m) is:

Z_n = (1/2) · Σ_{m=0}^{N−1} |sgn[x_n(m)] − sgn[x_n(m−1)]|, where sgn(x) = 1 for x ≥ 0 and −1 for x < 0.
Research shows that the short-time energy and short-time zero-crossing rate of voiced and unvoiced speech differ markedly. In practical application, however, noise may produce spurious zero crossings in the signal. This method therefore revises the short-time zero-crossing rate by setting a threshold range ±T around the zero level, obtaining the short-time threshold-crossing rate:

Z_n = Σ_{m=n−N+1}^{n} { |sgn[x_n(m) − T] − sgn[x_n(m−1) − T]| + |sgn[x_n(m) + T] − sgn[x_n(m−1) + T]| };

where Z_n is the short-time threshold-crossing rate, T is the set threshold (a positive number), x_n(m) is the m-th sample of the speech signal, N is the number of speech-signal samples, and n is the frame index. This index reflects the number of times the signal crosses the positive and negative thresholds. If noise is present in the signal, then as long as the noise values do not exceed [−T, T], spurious crossing counts are largely avoided, improving the noise immunity of the whole system.
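A minimal sketch of the threshold-crossing count above (illustrative only; frame boundary handling and scaling conventions are assumptions):

```python
def sgn(x):
    """Sign function: 1 for x >= 0, -1 for x < 0."""
    return 1 if x >= 0 else -1

def short_time_threshold_rate(frame, T):
    """Short-time threshold-crossing rate Z_n: counts crossings of the
    +T and -T levels within one frame. Noise that stays inside [-T, T]
    produces no spurious crossings, unlike a plain zero-crossing count.
    """
    Z = 0
    for m in range(1, len(frame)):
        Z += abs(sgn(frame[m] - T) - sgn(frame[m - 1] - T))
        Z += abs(sgn(frame[m] + T) - sgn(frame[m - 1] + T))
    return Z
```

Low-amplitude noise inside ±T yields Z = 0, while a waveform swinging well past both thresholds yields a large Z.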
(Embodiment of speech-signal extraction)
Suppose the first 5 frames of the sample signal input to the system are all noise. The mean of the short-time autocorrelation maxima of these noise frames is computed and taken as the threshold T of the short-time threshold-crossing rate. The current frame is then judged to be unvoiced or voiced according to its short-time threshold-crossing rate. With the positive and negative thresholds ±T set around the zero level, the short-time threshold-crossing rate Z_n is expressed as:

Z_n = Σ_{m=−∞}^{∞} { |sgn[x(m) − T] − sgn[x(m−1) − T]| + |sgn[x(m) + T] − sgn[x(m−1) + T]| } · ω(n−m);

where Z_n is the short-time threshold-crossing rate, T is the set threshold (a positive number), x(m) is the m-th sample of the speech signal, ω is the window function, and n is the frame index. Because voiced energy is concentrated below 3 kHz, while unvoiced speech resembles white noise with most of its energy at higher frequencies, the short-time threshold-crossing rates of voiced and unvoiced speech differ greatly — that of unvoiced speech is far higher than that of voiced speech — and this can serve as the basis for classifying each frame of speech as voiced or unvoiced.
(Pre-processing: pre-emphasis)
The present invention further pre-processes the speech signal, adding a "pre-emphasis" step to the pre-processing. Its purpose is to increase the energy of the high-frequency components so that the spectrum of the whole speech signal becomes flat, effectively improving the signal-to-noise ratio; a uniform signal-to-noise ratio can then be used in spectral analysis and vocal-tract parameter computation, reducing computational difficulty. Pre-emphasis is applied after digital sampling of the speech signal and before characteristic parameter extraction, using a first-order digital filter that boosts the high-frequency components of the signal at 6 dB/octave. The response equation of this digital filter is H(z) = 1 − μz⁻¹, where μ is a value close to but not greater than 1; in this method, μ = 0.97.
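The filter H(z) = 1 − μz⁻¹ above is, in the time domain, y[n] = x[n] − μ·x[n−1]; a minimal sketch (the handling of the very first sample is an assumption):

```python
def pre_emphasize(samples, mu=0.97):
    """First-order pre-emphasis filter H(z) = 1 - mu*z^(-1):
    y[n] = x[n] - mu*x[n-1], boosting high frequencies ~6 dB/octave.
    The first sample is passed through unchanged (an assumption).
    """
    return [samples[0]] + [samples[n] - mu * samples[n - 1]
                           for n in range(1, len(samples))]
```

A constant (DC, i.e. lowest-frequency) input is attenuated to 1 − μ = 0.03 of its amplitude, while rapid sample-to-sample changes pass almost unchanged, which is exactly the high-frequency boost described.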
(Pre-processing: windowing and framing)
The average power spectrum of the speech signal is affected by the glottal excitation and by mouth and nose radiation, and the high band falls off at roughly 6 dB/octave above 800 Hz. When characteristic parameters are extracted from the speech signal, its spectrum must be computed; the higher the frequency, the smaller the corresponding components, so the spectrum of the high-frequency part is harder to obtain than that of the low-frequency part. This is why pre-emphasis is performed: it lifts the high-frequency part so that the spectrum of the signal becomes flat over the whole band from low to high frequency and can be computed with the same signal-to-noise ratio, facilitating characteristic parameter extraction. In addition, because speech is quasi-periodic in the short term, the speech signal must be windowed in order to exploit this property in processing and to better reflect the characteristics of the signal; this is the essential basis of short-time speech processing.
On time scales of the order of milliseconds, certain physical characteristics of the speech signal remain essentially unchanged, so short-time analysis of the speech signal greatly simplifies the computation compared with analysing the signal as a whole, and also makes it convenient to relate the analysis of the speech signal to the physiological process of speech production. A segment of speech with short-time stationarity must be cut out of the speech signal as the object of analysis; such a segment is one "frame" of speech, its duration is the frame length, and the frame length is generally 10-30 ms. Each frame of speech has certain fixed characteristics, so the framing process turns the analysis of a whole continuous utterance into frame-by-frame analysis. The window function, denoted ω(n), is the tool for extracting a "speech frame" from continuous speech: it sets the speech outside the region to be processed entirely to zero, thereby extracting the speech frame; this process is the "framing" of the speech signal. Framing of speech multiplies the speech-signal expression s(n) by the window function ω(n), so the windowed speech signal is s_ω(n) = s(n)·ω(n). The Hamming window is the most commonly used, for two reasons: in the time domain, the waveform amplitude of Hamming-windowed speech decreases smoothly to zero, reducing the truncation effect and the spectral leakage caused by sharp drops at the two ends of the window; in the frequency domain, the frequency response of Hamming-windowed speech has a smoother low-pass characteristic, better reflecting the short-time frequency characteristics of the speech signal. The N-point Hamming window is expressed as:

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1;

where w(n) is the window function, N is the length of the window function, and n is the sample index.
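The framing and windowing above can be sketched as follows (an illustration; the hop size and frame length are assumed parameters, not values fixed by the patent):

```python
import math

def hamming(N):
    """N-point Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

def frame_and_window(signal, frame_len, hop):
    """Cut the signal into overlapping frames and apply the Hamming
    window to each: s_w(n) = s(n) * w(n). At f_s = 44100 Hz, a
    10-30 ms frame is roughly 441-1323 samples (illustrative figures).
    """
    w = hamming(frame_len)
    return [[signal[i + n] * w[n] for n in range(frame_len)]
            for i in range(0, len(signal) - frame_len + 1, hop)]
```

Note the window's smooth taper: its end samples are 0.08 while the centre reaches 1.0, which is what suppresses the truncation effect at the frame edges.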
(acoustic model)
An acoustic model is in essence a characteristic-parameter model, generated by extracting characteristic parameters from a body of speech and training with a specified training algorithm. The input signal to be recognized is first converted into a sequence of feature parameter vectors; this feature sequence is then matched against the acoustic model, and by comparing the two, the distance between the feature vector sequence of the signal to be recognized and the acoustic model is computed to obtain the best recognition result. The articulation recognition method of the present invention performs model matching and comparison with a probabilistic model based on the hidden Markov model (HMM), because the HMM adapts well to the variations in phonetic features across speakers and is well suited to speaker-independent recognition. An acoustic model is a modelling unit of the HMM; a typical acoustic model comprises a plurality of states and may correspond to a phone, a syllable, a demisyllable, and so on. Training the acoustic model comprises the following steps:
(Selecting the acoustic model)
In the present invention, the selectable acoustic models include the demisyllable acoustic model and the syllable acoustic model. Because a syllable acoustic model recognizes the initial and the final as a single whole and cannot reflect the articulation characteristics of, and differences between, initials and finals, this embodiment takes the demisyllable acoustic model as an example. When a syllable is pronounced, the tension of the muscles participating in articulation passes through three phases — rising, peak and falling — and the corresponding sound is likewise divided into three parts: onset, nucleus and coda. By this rule a syllable can be divided into several parts, forming demisyllable units. Such a unit may consist of the consonant at the start of a syllable plus part of a vowel, or of part of a vowel plus the consonant at the end of a syllable. English has nearly 2000 such demisyllable units. In standard Mandarin the onset is normally a consonant (denoted C) and the nucleus is normally a vowel (denoted V), while the coda is not limited to vowel or consonant. The basic syllable structures of Mandarin are V, CV, VC, CVC, etc. Because the number of demisyllable combinations is small, a recognition system using them as acoustic models places lower demands on software and hardware; the demisyllable unit is therefore the most commonly used acoustic model in Mandarin speech recognition.
(Feature extraction: short-time log energy)
In this embodiment, the acoustic characteristic parameters extracted for the demisyllable acoustic model comprise the first 12 Mel cepstral coefficients and the short-time log energy, together with their first-order and second-order difference results, 39 parameters in total. The characteristic parameters extracted for identifying syllables are the short-time log energy and the first 12 Mel-frequency cepstral coefficients (MFCC), 13 base parameters, plus their first-order and second-order difference results. The short-time log energy is chosen as one of the characteristic parameters of speech recognition:

E = log Σ_{n=1}^{N} s_n²;

where s_n is the discrete speech sequence, N is the total number of samples, n is the sample index, and E is the short-time log energy. The log energy is chosen because it can distinguish unvoiced and silent components of small amplitude, avoiding the confusion that a linear energy parameter may cause, and it also resolves the excessive computational load of the linear energy parameter, better separating unvoiced, voiced and silent components.
(Feature extraction: Mel cepstral coefficients and their first- and second-order differences)
The basic procedure for extracting Mel cepstral coefficients is shown in Fig. 2. First, each windowed speech frame is passed through a fast Fourier transform (FFT) to obtain its power spectrum. The power spectrum is then passed through a Mel filter bank — in effect a set of normalized triangular band-pass filters — to obtain the Mel spectrum, and the log spectrum is taken. The design principle of the Mel filter bank is to smooth the spectrum, highlight the formants of the speech signal, and reasonably reduce the amount of feature information. Finally, the cepstrum of the Mel spectrum is obtained by discrete cosine transform (DCT), giving the Mel cepstrum; the Mel cepstral coefficients form a feature vector:

c_k = Σ_{i=1}^{N} m_i · cos(πk(i − 0.5)/N);

where c_k is the k-th Mel cepstral coefficient, m_i is the log output of the i-th Mel filter, k is the index of the cepstral coefficient, i is the filter index, and N is the number of triangular filters in the Mel filter bank. In this embodiment N = 20 is taken, and the 12 values c_k for k = 2, 3, …, 13 are taken as the MFCC coefficients. However, the 13 base parameters alone cannot meet the requirement of a high recognition rate in practical systems; therefore, on the basis of these parameters, the first-order and second-order differences of the 13 base parameters with respect to time are further computed, giving the first- and second-order difference log energy and the first- and second-order difference MFCC, for a total of 39 dimensions used as the characteristic parameters for syllable identification, as shown in Fig. 3.
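The time-difference step above — differencing the base features and then differencing again for the second order — can be sketched with a standard delta-regression formula; the regression width of 2 and the edge padding are assumptions, since the patent does not fix them:

```python
def delta(features, width=2):
    """First-order time difference (delta) of a per-frame feature
    sequence, using the common regression form
      d_t = sum_{w=1..width} w*(f_{t+w} - f_{t-w}) / (2*sum w^2).
    Frame indices are clamped at the sequence edges (an assumption).
    Applying delta() to its own output gives the second-order difference.
    """
    T = len(features)
    denom = 2 * sum(w * w for w in range(1, width + 1))
    out = []
    for t in range(T):
        num = sum(w * (features[min(t + w, T - 1)] - features[max(t - w, 0)])
                  for w in range(1, width + 1))
        out.append(num / denom)
    return out
```

On a linearly increasing feature track the interior deltas come out as the constant slope, as expected of a first-order difference.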
(Extracting characteristic parameters: fundamental frequency)
The characteristic parameter extracted for tone identification is the pitch frequency (abbreviated "F0"). The Sum of Magnitude Difference Squared Function (SMDSF) is selected to perform the fundamental-frequency extraction. This algorithm can accurately extract the fundamental frequency of speech at any sampling frequency. The SMDSF is expressed by the following formula:
$$D_S(\tau)=\sum_{j=0}^{L-1}\left[s_{w_2}(j+\tau)-s_{w_1}(j)\right]^{2}$$
where $s_{w_1}(j)$ is the discrete speech sequence $s(j)$ windowed by $w_1(j)$, and likewise $s_{w_2}(j)$ is $s(j)$ windowed by $w_2(j)$; $\tau=0,1,\dots,L-1$, and $L$ is the number of sampling points in each speech frame. The window functions $w_1(j)$ and $w_2(j)$ are respectively:
$$w_1(j)=\begin{cases}1,& j=0,1,\dots,L-1\\[2pt]0,&\text{otherwise}\end{cases}\qquad w_2(j)=\begin{cases}1,& j=0,1,\dots,2(L-1)\\[2pt]0,&\text{otherwise}\end{cases}$$
To evaluate the aperiodicity of articulated speech, the SMDSF also needs to be normalized, that is:
$$\hat{D}_S(\tau)=\frac{D_S(\tau)}{\sum_{j=0}^{L-1}\left[s_{w_2}(j+\tau)+s_{w_1}(j)\right]^{2}}$$
where $\tau=0,1,\dots,L-1$ and $L$ is the number of sampling points in each speech frame.
For a quasi-periodic signal with pitch period $P$, $D_S(P)$ is proportional to the energy of the aperiodic component in the signal, while the denominator is proportional to the total signal energy. The value $\hat{D}_S(P)$ therefore reflects the ratio of aperiodic-component energy to total signal energy: the weaker the periodicity at pitch period $P$, the larger the value; the more pronounced the periodicity, the smaller the value; and for a strictly periodic signal $\hat{D}_S(P)=0$. Hence $\hat{D}_S(\tau)$ can be used as a measure of the aperiodicity of the signal.
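A minimal sketch of pitch extraction with the normalized SMDSF might look like this; the search band, the 0.3 aperiodicity cutoff, and the exact normalization are assumptions for illustration, not the patent's implementation:

```python
import numpy as np

def smdsf_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """F0 estimate by minimizing the normalized SMDSF described above.

    frame: one frame of speech samples (long enough for two windows);
    returns f0 in Hz, or 0.0 when the frame looks aperiodic.
    """
    s = np.asarray(frame, dtype=float)
    L = len(s) // 2                       # w1 spans L points, w2 spans ~2L
    lo, hi = int(fs / fmax), int(fs / fmin)
    best_tau, best_val = 0, np.inf
    for tau in range(lo, min(hi, L)):
        num = np.sum((s[tau:tau + L] - s[:L]) ** 2)          # D_S(tau)
        den = np.sum((s[tau:tau + L] + s[:L]) ** 2) + 1e-10  # ~ total energy
        val = num / den                   # aperiodicity measure in [0, 1]
        if val < best_val:
            best_tau, best_val = tau, val
    return fs / best_tau if best_tau and best_val < 0.3 else 0.0

fs = 8000
t = np.arange(2 * 400) / fs
f0 = smdsf_pitch(np.sin(2 * np.pi * 125 * t), fs)   # 125 Hz test tone
assert abs(f0 - 125.0) < 5.0
```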
(Training the acoustic model)
The mean and covariance of all the acoustic feature parameters described above are used as the initial mean and covariance of the acoustic model. Estimates of the model parameters are obtained according to the Baum-Welch algorithm; the original model parameters are then replaced by these estimates and the estimation is performed again. The maximum likelihood value of each acoustic feature parameter under the hidden Markov model is computed, yielding the optimal model parameters in the maximum-likelihood sense, i.e., those corresponding to the maximum likelihood value.
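The Baum-Welch re-estimation loop can be illustrated with a discrete-observation HMM; the patent trains continuous-density (Gaussian) HMMs, but the forward-backward re-estimation idea is the same. All names and the toy data below are hypothetical:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step for a discrete-observation HMM.

    A: (S,S) transition matrix, B: (S,V) emission matrix, pi: (S,)
    initial probabilities, obs: sequence of symbol indices. Returns
    updated (A, B, pi) plus the log-likelihood under the OLD parameters.
    """
    S, T = A.shape[0], len(obs)
    alpha = np.zeros((T, S)); beta = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]                      # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0                                    # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    like = alpha[-1].sum()
    gamma = alpha * beta / like                       # state posteriors
    xi = np.zeros((S, S))                             # expected transitions
    for t in range(T - 1):
        xi += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / like
    A_new = xi / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for v in range(B.shape[1]):
        B_new[:, v] = gamma[np.array(obs) == v].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, gamma[0], np.log(like)

# EM guarantee: likelihood never decreases across re-estimation steps.
rng = np.random.default_rng(0)
A = np.full((2, 2), 0.5); B = rng.dirichlet(np.ones(3), size=2)
pi = np.array([0.5, 0.5]); obs = [0, 1, 2, 1, 0, 0, 1]
A1, B1, pi1, ll0 = baum_welch_step(A, B, pi, obs)
_, _, _, ll1 = baum_welch_step(A1, B1, pi1, obs)
assert ll1 >= ll0 - 1e-9
```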
(Articulation recognition)
The HMM-based articulation recognition process of the present invention treats the recognition of Mandarin articulation as a sequence-labeling problem, similar to a decoding problem: the trained parameters are used to find the optimal word-sequence labeling of the current input, i.e., the state sequence of maximum probability. The articulation recognition process of the present invention, shown in Figure 4, uses a left-to-right hidden Markov model (HMM) without state skips. A hidden Markov model is a probabilistic model describing the statistical characteristics of a random process and consists of two parts: a Markov chain and a general stochastic process. The Markov chain describes state transitions via transition probabilities; the stochastic process describes the relation between states and the observation sequence via observation probabilities. In this left-to-right, no-skip HMM, the number of states equals the number of phonemes in the entry, that is, each state corresponds to one phoneme.
Traditionally, speech articulation is assessed subjectively: a professional such as a speech rehabilitation therapist listens to the patient pronounce the entries of a prescribed assessment vocabulary and rates the articulation clarity. The entries of a dysarthria assessment vocabulary normally cover all Mandarin initials, finals, and tones, each entry being a combination of these initials, finals, and tones. The vocabulary comprises 50 entries covering 21 initials, 13 finals, and 4 tones, and includes 18 phoneme contrasts and 36 minimal phoneme-contrast pairs, so it can reflect the patient's innate ability with each phoneme, each phoneme-contrast ability, and overall articulation clarity. The present method uses HMMs to build the core recognition engine of the articulation recognition system. Suppose the recognition vocabulary of the system contains $V$ entries, the HMM of word $v$ is $\Phi_v$, and each word model has $N$ states; $K$ different pronunciations are used when training the acoustic model of each word. The sequence of characteristic parameters $X=(x_1,x_2,\dots,x_T)$ extracted from the input speech serves as the observation sequence. After feature extraction, the probability $P(X\mid\Phi_v)$ is computed under each HMM, and finally the entry $v^{*}$ with the maximum likelihood probability over all entries is taken as the recognition result, thereby identifying the initial, final, and tone of the articulation.
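The final argmax over per-word HMM likelihoods can be sketched as follows; the scoring functions here are toy stand-ins for $P(X\mid\Phi_v)$, and all names are hypothetical:

```python
import numpy as np

def recognize(feature_seq, word_models):
    """Pick the vocabulary entry v* whose model gives the highest score.

    word_models: dict mapping entry name -> scoring function returning
    a stand-in for log P(X | Phi_v) given a feature sequence X.
    """
    scores = {v: score(feature_seq) for v, score in word_models.items()}
    return max(scores, key=scores.get)

def make_scorer(mean):
    # Toy stand-in: negative squared distance to a word's mean vector.
    return lambda X: -float(np.sum((np.asarray(X) - mean) ** 2))

models = {"bao1": make_scorer(np.array([1.0, 0.0])),
          "mao1": make_scorer(np.array([0.0, 1.0]))}
X = [[0.9, 0.1], [1.1, -0.1]]          # feature frames resembling /bao1/
assert recognize(X, models) == "bao1"
```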
(Dysarthria assessment)
The present invention further uses the trained standard acoustic model and the articulation recognition method to assess dysarthria. By comparing the recognition result with a preset target sound, the articulation clarity of a patient is assessed: the method can judge which specific initials, finals, and tones the patient cannot articulate normally and report the specific dysarthria type (phoneme omission, substitution, or distortion).
For example, referring to Figure 5, suppose the target sound is the initial /b/ and the preset word for this target sound is "bāo" (bag). If the input speech is "māo" (cat), the entry obtained by the articulation recognition method of the present invention shows that the initial /b/ has been replaced by /m/; this case is a "substitution" disorder. If the input speech is "āo" (recessed), the initial /b/ is dropped in pronunciation; this case is an "omission" disorder. If, after the articulation recognition method of the present invention, no corresponding Chinese entry can be found for the input speech, the articulation is a "distortion" disorder.
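The comparison of the recognition result against a preset target sound might be organized as below; the lexicon, entry names, and function name are hypothetical, but the three error types follow the example above:

```python
def classify_articulation_error(target_initial, recognized_entry, lexicon):
    """Classify a dysarthric error against a preset target initial.

    lexicon maps entry name -> its initial consonant ('' when none);
    a recognized result outside the lexicon counts as distortion.
    """
    if recognized_entry not in lexicon:
        return "distortion"              # no matching Chinese entry
    produced = lexicon[recognized_entry]
    if produced == target_initial:
        return "correct"
    return "omission" if produced == "" else "substitution"

lexicon = {"bao1": "b", "mao1": "m", "ao1": ""}
assert classify_articulation_error("b", "bao1", lexicon) == "correct"
assert classify_articulation_error("b", "mao1", lexicon) == "substitution"
assert classify_articulation_error("b", "ao1", lexicon) == "omission"
assert classify_articulation_error("b", "???", lexicon) == "distortion"
```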
As shown in Figure 6, an articulation recognition system of the present invention comprises a voice acquisition device 1, a voice processing device 2, and an articulation recognition device 3.
The voice acquisition device 1 is an omnidirectional microphone for collecting the sample signal and the signal to be identified. The voice processing device 2, connected to the voice acquisition device 1, performs data conversion and preprocessing on the sample signal and the signal to be identified, and extracts the acoustic characteristic parameters of each. The articulation recognition device 3, connected to the voice processing device 2, trains the acoustic model with the acoustic characteristic parameters of the sample signal to obtain the optimal model parameters, computes the acoustic characteristic parameters of the signal to be identified according to the optimal model parameters, and obtains the recognition result. The articulation recognition device 3 of the present invention is further configured to judge the recognition result: by comparing it with the preset target sound, it determines which initials, finals, and tones in the signal to be identified are dysarthric.
The protection of the present invention is not limited to the above embodiments. Without departing from the spirit and scope of the inventive concept, variations and advantages that may occur to those skilled in the art are all included in the present invention, with the appended claims defining the scope of protection.

Claims (13)

1. An articulation recognition method, characterized by comprising the steps of:
obtaining a sample signal, filtering and denoising the sample signal, quantizing it into a binary sample signal by A/D conversion, and extracting from the binary sample signal a voice signal containing speech;
extracting acoustic characteristic parameters from the voice signal, the acoustic characteristic parameters being used to identify syllables and tones;
selecting and training an acoustic model, computing the maximum likelihood probability value of each of the acoustic characteristic parameters under a hidden Markov model, and obtaining the optimal model parameters corresponding to the maximum likelihood value;
performing articulation recognition: collecting a signal to be identified, computing the probability value of each acoustic characteristic parameter of the signal to be identified according to the optimal model parameters, and obtaining a recognition result.
2. The articulation recognition method as claimed in claim 1, characterized in that the step of extracting the voice signal containing speech comprises:
cutting the binary sample signal into a plurality of frames;
computing the mean value of the short-time autocorrelation function of at least one frame;
computing, according to the mean value, a threshold for judging the short-time threshold-crossing rate of the current frame;
judging, according to the short-time threshold-crossing rate, whether the current frame is unvoiced or voiced;
judging all frames one by one until a start frame and an end frame are obtained, thereby obtaining the voice signal.
3. The articulation recognition method as claimed in claim 2, characterized in that the short-time autocorrelation function is:
$$R_n(k)=\sum_{m=0}^{N-1-k}x_n(m)\,x'_n(m+k)$$
where $k$ denotes the number of delay points, up to a maximum delay, $R_n(k)$ is the short-time autocorrelation function, $x_n$ denotes the sampling points of the voice signal, $m$ is the index of the sampling point, $x'_n$ is the three-level quantized version of the voice signal, and $N$ is the number of sampling points of the voice signal.
4. The articulation recognition method as claimed in claim 2, characterized in that the short-time threshold-crossing rate is:
$$Z_n=\frac{1}{2}\sum_{m=1}^{N-1}\Big(\big|\operatorname{sgn}[x_n(m)-T]-\operatorname{sgn}[x_n(m-1)-T]\big|+\big|\operatorname{sgn}[x_n(m)+T]-\operatorname{sgn}[x_n(m-1)+T]\big|\Big)$$
wherein
$$\operatorname{sgn}[x]=\begin{cases}1,&x\ge 0\\[2pt]-1,&x<0\end{cases}$$
where $Z_n$ is the short-time threshold-crossing rate, $T$ is a preset threshold and is a positive number, $x_n$ denotes the sampling points of the voice signal, $m$ is the index of the sampling point, $N$ is the number of sampling points of the voice signal, and $n$ is the index of the speech frame.
5. The articulation recognition method as claimed in claim 1, characterized by further comprising, after extracting the voice signal:
emphasizing the high-frequency components of the voice signal;
performing a windowing operation on the voice signal with a window function.
6. The articulation recognition method as claimed in claim 1, characterized in that the acoustic characteristic parameters comprise Mel cepstral coefficients and their first-order and second-order difference results, and the computation of the Mel cepstral coefficients and their first-order and second-order difference results comprises:
computing the power spectrum of the voice signal by a fast Fourier transform;
passing the power spectrum through a Mel filter bank to obtain the Mel spectrum;
computing the Mel cepstral coefficients from the Mel spectrum by a discrete cosine transform;
successively differencing the Mel cepstral coefficients with respect to time to obtain the first-order and second-order difference results.
7. The articulation recognition method as claimed in claim 1, characterized in that the acoustic characteristic parameters comprise the short-time logarithmic energy, expressed as:
$$E=\log\sum_{n=1}^{N}s_n^{2}$$
where $s_n$ is the discrete speech-signal sequence, $N$ is the total number of sampling points, and $n$ is the sampling-point index.
8. The articulation recognition method as claimed in claim 1, characterized in that the step of obtaining the optimal model parameters comprises:
computing the mean and covariance of the acoustic characteristic parameters;
replacing the initial mean and covariance of the acoustic model with the mean and covariance of the acoustic characteristic parameters;
estimating the model parameters of the acoustic model to obtain parameter estimates;
substituting the parameter estimates for the parameters of the acoustic model, computing the maximum likelihood probability value of each of the acoustic characteristic parameters under the hidden Markov model, and obtaining the optimal model parameters corresponding to the maximum likelihood value.
9. The articulation recognition method as claimed in claim 1, characterized in that the parameter estimates are obtained by estimation according to the Baum-Welch algorithm.
10. The articulation recognition method as claimed in claim 1, characterized in that the computation of the recognition result comprises:
dividing the signal to be identified to obtain a word sequence formed of a plurality of words;
extracting a plurality of acoustic characteristic parameters of the current word;
computing, with the hidden Markov model and according to the optimal model parameters, the probability value of each acoustic characteristic parameter, and taking the acoustic characteristic parameter of maximum probability value as the recognition result of the word;
computing in turn the recognition result of each word in the signal to be identified to obtain the recognition result of the signal to be identified.
11. The articulation recognition method as claimed in claim 1, characterized by further comprising, after obtaining the recognition result:
comparing the recognition result with a preset target sound to obtain the dysarthric initials, finals, and tones in the signal to be identified.
12. An articulation recognition system, characterized by comprising:
a voice acquisition device for collecting a sample signal and a signal to be identified;
a voice processing device for performing data conversion and preprocessing on the sample signal and the signal to be identified, and for extracting the acoustic characteristic parameters of the sample signal and of the signal to be identified respectively;
an articulation recognition device for training an acoustic model with the acoustic characteristic parameters of the sample signal to obtain optimal model parameters, and for computing the acoustic characteristic parameters of the signal to be identified according to the optimal model parameters to obtain a recognition result.
13. The articulation recognition system as claimed in claim 12, characterized in that the articulation recognition device is further configured to judge the recognition result and determine the dysarthric initials, finals, and tones in the signal to be identified.
CN201410353819.6A 2014-07-23 2014-07-23 Speech composition recognition method and system Pending CN104123934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410353819.6A CN104123934A (en) 2014-07-23 2014-07-23 Speech composition recognition method and system


Publications (1)

Publication Number Publication Date
CN104123934A true CN104123934A (en) 2014-10-29

Family

ID=51769324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410353819.6A Pending CN104123934A (en) 2014-07-23 2014-07-23 Speech composition recognition method and system

Country Status (1)

Country Link
CN (1) CN104123934A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766607A (en) * 2015-03-05 2015-07-08 广州视源电子科技股份有限公司 Television program recommendation method and system
CN105719662A (en) * 2016-04-25 2016-06-29 广东顺德中山大学卡内基梅隆大学国际联合研究院 Dysarthrosis detection method and dysarthrosis detection system
CN105810192A (en) * 2014-12-31 2016-07-27 展讯通信(上海)有限公司 Speech recognition method and system thereof
CN106384587A (en) * 2015-07-24 2017-02-08 科大讯飞股份有限公司 Voice recognition method and system thereof
CN106846581A (en) * 2017-01-25 2017-06-13 胡建军 Door access control system and method
CN107358963A (en) * 2017-07-14 2017-11-17 中航华东光电(上海)有限公司 One kind removes breathing device and method in real time
WO2018014537A1 (en) * 2016-07-22 2018-01-25 百度在线网络技术(北京)有限公司 Voice recognition method and apparatus
CN107886941A (en) * 2016-09-29 2018-04-06 亿览在线网络技术(北京)有限公司 A kind of audio mask method and device
CN110232913A (en) * 2019-06-19 2019-09-13 桂林电子科技大学 A kind of sound end detecting method
CN110876609A (en) * 2019-07-01 2020-03-13 上海慧敏医疗器械有限公司 Voice treatment instrument and method for frequency band energy concentration rate measurement and audio-visual feedback
CN111276156A (en) * 2020-01-20 2020-06-12 深圳市数字星河科技有限公司 Real-time voice stream monitoring method
CN111276130A (en) * 2020-01-21 2020-06-12 河南优德医疗设备股份有限公司 MFCC cepstrum coefficient calculation method for computer language knowledge education system
CN111599347A (en) * 2020-05-27 2020-08-28 广州科慧健远医疗科技有限公司 Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis
CN111696530A (en) * 2020-04-30 2020-09-22 北京捷通华声科技股份有限公司 Target acoustic model obtaining method and device
CN111599347B (en) * 2020-05-27 2024-04-16 广州科慧健远医疗科技有限公司 Standardized sampling method for extracting pathological voice MFCC (functional peripheral component interconnect) characteristics for artificial intelligent analysis

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0575815A1 (en) * 1992-06-25 1993-12-29 Atr Auditory And Visual Perception Research Laboratories Speech recognition method
US5890111A (en) * 1996-12-24 1999-03-30 Technology Research Association Of Medical Welfare Apparatus Enhancement of esophageal speech by injection noise rejection
CN1346126A (en) * 2000-09-27 2002-04-24 中国科学院自动化研究所 Three-tone model with tune and training method
CN1946029A (en) * 2006-10-30 2007-04-11 北京中星微电子有限公司 Method and its system for treating audio signal
CN101515456A (en) * 2008-02-18 2009-08-26 三星电子株式会社 Speech recognition interface unit and speed recognition method thereof
CN102063903A (en) * 2010-09-25 2011-05-18 中国科学院深圳先进技术研究院 Speech interactive training system and speech interactive training method
CN102208186A (en) * 2011-05-16 2011-10-05 南宁向明信息科技有限责任公司 Chinese phonetic recognition method
CN102237083A (en) * 2010-04-23 2011-11-09 广东外语外贸大学 Portable interpretation system based on WinCE platform and language recognition method thereof
CN102543073A (en) * 2010-12-10 2012-07-04 上海上大海润信息系统有限公司 Shanghai dialect phonetic recognition information processing method
CN102982799A (en) * 2012-12-20 2013-03-20 中国科学院自动化研究所 Speech recognition optimization decoding method integrating guide probability
CN103065629A (en) * 2012-11-20 2013-04-24 广东工业大学 Speech recognition system of humanoid robot
CN103383845A (en) * 2013-07-08 2013-11-06 上海昭鸣投资管理有限责任公司 Multi-dimensional dysarthria measuring system and method based on real-time vocal tract shape correction
CN103405217A (en) * 2013-07-08 2013-11-27 上海昭鸣投资管理有限责任公司 System and method for multi-dimensional measurement of dysarthria based on real-time articulation modeling technology
CN103705218A (en) * 2013-12-20 2014-04-09 中国科学院深圳先进技术研究院 Dysarthria identifying method, system and device




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141029