US5953699A - Speech recognition using distance between feature vector of one sequence and line segment connecting feature-variation-end-point vectors in another sequence - Google Patents

Speech recognition using distance between feature vector of one sequence and line segment connecting feature-variation-end-point vectors in another sequence Download PDF

Info

Publication number
US5953699A
US5953699A (application US08/959,465)
Authority
US
United States
Prior art keywords
feature vectors
input speech
feature
line segment
time series
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
US08/959,465
Inventor
Keizaburo Takagi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Assigned to NEC CORPORATION. Assignors: TAKAGI, KEIZABURO
Application granted
Publication of US5953699A
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 - Adaptation
    • G10L15/07 - Adaptation to the speaker
    • G10L15/08 - Speech classification or search
    • G10L15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates

Definitions

  • According to the present invention, an event which continuously changes at each time point is expressed by a line segment defined by two end points in a pattern space, or by a set of such line segments, and matching computes the optimal distance or likelihood between a vector and the line segment, thereby providing a high-precision speech recognition apparatus.
  • The multi-template scheme based on the conventional method requires many sampling points to express a sufficiently wide range of changes, resulting in a large memory capacity and a large amount of distance calculation, so an inexpensive apparatus cannot be provided. If, for example, an event at each time point is expressed by 10 sampling points, 10 distance calculations and memory corresponding to the 10 points are required.
  • In contrast, the amount of calculation required to obtain the distance between one line segment and a vector corresponds to three distance calculations, and the required memory capacity corresponds to two points. A more inexpensive apparatus can therefore be provided.

Abstract

A speech recognition apparatus has an analysis section that outputs features of input speech as a time sequence of feature vectors, each defined for a discrete time point corresponding to a processed speech frame. Standard (reference) speaker utterances are converted into a time sequence of standard feature vectors. The possible continuous variation of the standard feature vectors at each point in time is expressed by a line segment, or a set of line segments, connecting the feature vectors at the two end points of the "movable" range within which the feature can change, rather than by a larger set of reference vectors as in a conventional multi-template approach to speech recognition. For example, the continuous range of possible background noise levels in input speech defines a line segment connecting the two feature vectors at the two limiting SNR values. A matching section calculates the distance between the input speech feature vector at each time point and the end points of each reference line segment defined for that time, together with the perpendicular distance to the line segment itself (where a perpendicular can be drawn). The distance between the input feature and the standard (reference) feature at a given time, represented by its line segment, is defined as the smallest of these three (or two) computed distance values.

Description

BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition apparatus and, more particularly, to a speech recognition apparatus which can approximate the movable range of the feature vector at each time point and execute a distance calculation such that the optimal one of all combinations within that range is obtained as the distance value.
According to a conventional matching method for speech recognition, input speech is converted into time series data of feature vectors of one type, and standard speech is analyzed and converted into feature vectors of one type by the same method as that used for the input speech and stored as a standard pattern. The distances between these vectors are then calculated by using a matching method such as DP matching, which allows nonlinear expansion/contraction in the time axis direction. The category of the standard pattern which exhibits the minimum distance is output as the recognition result. At each time point in matching, the distance or similarity between one feature vector of the input speech and one feature vector of a standard pattern is calculated.
According to this method, however, since the voices generated by different speakers greatly vary in feature in many cases even if the contents of utterances are the same, high performance cannot be obtained with respect to speech from a speaker different from standard speakers. Even with respect to speech from a single speaker, stable performance cannot be obtained when the feature of the speech changes due to the speaker's physical condition, a psychological factor, and the like. To solve this problem, a method called a multi-template scheme has been used.
A multi-template is designed such that a plurality of standard speaker voices are converted into a plurality of feature vectors to form standard patterns, and the feature of the standard pattern at each time point is expressed by a plurality of feature vectors. A distance calculation is then performed by a so-called Viterbi algorithm, which obtains the distances or similarities between all combinations of one input feature vector and the plurality of standard-pattern feature vectors at each time point and uses the optimal one of them; by a so-called Baum-Welch algorithm, which expresses the feature at each time point by the weighted sum of all distances or similarities; by a semi-continuous scheme; or the like.
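As a minimal illustrative sketch (not part of the patent text), the two local-scoring rules can be written as follows, assuming squared Euclidean distance, NumPy arrays, and hypothetical function names:

import numpy as np

def viterbi_local_distance(x, templates):
    # Optimal (smallest) squared distance between the input frame x and
    # any of the template vectors defined for this time point.
    return min(float(np.sum((x - t) ** 2)) for t in templates)

def baum_welch_local_distance(x, templates, weights):
    # Weighted sum of the squared distances to all template vectors.
    dists = np.array([np.sum((x - t) ** 2) for t in templates])
    return float(np.dot(weights, dists))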
In the conventional distance calculating methods for speech recognition, even in the method using a multi-template, differences between voices are expressed by only discrete points in a space, and distances or similarities are calculated from only finite combinations of those points. In many cases, these methods cannot satisfactorily express all events that continuously change, and hence high recognition performance cannot be obtained.
For example, such events include speech recognition performed in the presence of large ambient noise. In the presence of ambient noise, noise spectra are superimposed additively on the input speech spectrum. In addition, the level of the noise varies at the respective time points, and hence cannot be predicted. In one known conventional scheme for this problem, standard-pattern speech is formed at a finite number of SNRs (Signal-to-Noise Ratios), and speech recognition is performed by using multi-templates with different SNR conditions.
Since infinite combinations of SNRs of input speech are present and are difficult to predict, it is theoretically impossible to solve the above problem by using a finite number of templates. It is seemingly possible to express a continuous change by a sufficient number of discrete points so as to improve the calculation precision to such an extent that the error can be approximately neglected. In practice, however, it is impossible in terms of data-collection cost to collect enough voices from many speakers to cover all SNR conditions in all noise environments. Even if such data collection were possible, the memory capacity and the amount of distance calculation required to express continuous events by many points are enormous. In this case, therefore, an inexpensive apparatus cannot be provided.
Other events in which the feature of speech continuously changes include the case of a so-called Lombard effect in which speech generated in the presence of noise itself changes when the speaker hears the noise, and a case in which the features of voices from an extremely large number of speakers change.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide a high-performance speech recognition apparatus in which the range of changes of an event in which the feature of speech continuously changes at each time point is described by a pair of vectors at its two ends, and a distance calculation can be performed, with respect to the line segment defined by the two vectors, by using a vector which can freely move within the range defined by the line segment.
It is another object of the present invention to provide an inexpensive, high-performance speech recognition apparatus.
In order to achieve the above objects, according to the present invention, there is provided a speech recognition apparatus comprising analysis means for outputting a feature of input speech at each time point as time series data of a feature vector, multiplexed (In the present specification, the term `multiplexed` is used to indicate that the plurality of different vectors that are defined for a single time point share the same position in the time series data.) standard patterns obtained by converting a feature of standard speaker voices into a plurality of different feature vectors, and storing the vectors as time series data of multiplexed feature vectors, and matching means for calculating a similarity or distance value, in matching between the feature vector of the input speech from the analysis means and the time series data of the plurality of feature vectors as the multiplexed standard patterns, between a line segment connecting two points of the multiplexed feature vectors of the multiplexed standard patterns and the feature vector of the input speech.
A scheme of performing a distance calculation by using a vector which can freely move within the range defined by a line segment is disclosed in Japanese Patent Laid-Open No. 58-115490. A method of using the above distance calculation scheme is disclosed in detail in Japanese Patent Laid-Open No. 58-111099. The object of the two prior arts is to perform a high-precision distance calculation with respect to a feature vector discretely expressed in the time axis direction. This object is the same as that of the present invention. However, the present invention essentially differs from the two prior arts in that an event which continuously changes at the same time point is properly handled. The arrangement of the present invention also differs from those of the prior arts.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing a speech recognition apparatus according to the first embodiment of the present invention;
FIG. 2 is a block diagram showing a speech recognition apparatus according to the second embodiment of the present invention;
FIG. 3 is a block diagram showing a speech recognition apparatus according to the fourth and fifth embodiments of the present invention;
FIG. 4 is a block diagram showing a speech recognition apparatus according to the sixth embodiment of the present invention; and
FIGS. 5A and 5B are views showing the principle of distance calculations in the speech recognition apparatus of the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
The principle of the present invention will be described first.
The present invention has been made to solve the problem in the conventional scheme, i.e., the problem that the feature of a standard pattern or input speech at each time point can be expressed by only a set of discrete points in a pattern space, and an event which continuously changes cannot be expressed. More specifically, in the present invention, the feature of input speech or a standard pattern at each time point is expressed as a line segment having two ends in a pattern space, and a distance calculation is performed between the points and the line segment in matching. Therefore, an event which continuously changes at each time point can be handled with a sufficiently high precision, and high speech recognition performance can be obtained.
A speech recognition apparatus according to the present invention is designed to express the feature of a standard pattern at each time point by a line segment defined by two end points in a pattern space, or by a set of such line segments, instead of by a set of discrete points as in the conventional multi-template scheme. More specifically, an analysis section outputs the feature of input speech at each time point as the time series data of a feature vector. Various known analysis methods exist; although not all of them will be described, any scheme can be used as long as it outputs the feature vectors of speech.
Standard speaker voices are analyzed by an analysis method similar to that used in the analysis section. This analysis is performed such that the feature at each time point is expressed as two end points, or a set of such points, of the range within which the feature can change. This operation will be described in consideration of, for example, ambient noise levels. For example, the range of possible SNRs in input speech is set to the range of 0 dB to 40 dB. Two types of voices, with SNRs of 0 dB and 40 dB as the two ends, are converted into feature vectors. These vectors are stored as the multiplexed standard patterns. In this case, a feature is expressed by one pair of two end points. As is obvious, however, the range of 0 dB to 40 dB may instead be divided into four ranges to express a feature as four pairs of two end points. When a feature is expressed by one pair of two ends, the time series data of the multiplexed feature vectors at the two end points of the multiplexed standard patterns are represented by Y1(j) and Y2(j) (j = 0, 1, ..., J), and the time series data of the feature vectors of the input speech are represented by X(i) (i = 0, 1, ..., I).
The matching section performs matching such that nonlinear expansion in the time axis direction is allowed between two patterns having different lengths. As an algorithm for this matching, for example, a DP matching or HMM algorithm is available. In any of the methods using these algorithms, the distance at each lattice point of the two-dimensional lattice defined by the input and standard-pattern time axes must be obtained. Consider a distance calculation at a lattice point (i, j). In this case, the distance between the vector X(i) and the line segment represented by the two end points Y1(j) and Y2(j) in the pattern space is obtained. As in the scheme used in the above conventional technique, the squared distances between the three points are obtained first according to equations (1) given below:
A = d(Y1(j), Y2(j)),  B = d(X(i), Y2(j)),  C = d(X(i), Y1(j))    (1)
where d(V, W) denotes the squared distance between two points V and W. Subsequently, a squared distance Z, valid when a perpendicular can be drawn from the vector X(i) to the line segment (Y1(j), Y2(j)), is calculated on the basis of these distances according to equation (2):
Z = {BC - (B + C - A)² / 4} / A                            (2)
The final squared distance D is determined from the magnitude relationships in Table 1, which distinguish the case wherein a perpendicular can be drawn to the segment from the cases wherein it cannot.
              TABLE 1
______________________________________
D           Condition
______________________________________
B or C      A = 0
B           A ≠ 0, B - C ≤ -A
C           A ≠ 0, B - C ≥ A
Z           otherwise
______________________________________
By using such a distance calculation method, optimal standard patterns can always be selected, even for input speech with an intermediate SNR between 0 dB and 40 dB. A high-precision distance calculation can therefore be performed, and hence a high-performance speech recognition apparatus can be provided.
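For illustration only, the following is a sketch of this distance calculation (equations (1), (2), and Table 1) in NumPy; the function name and the use of plain squared Euclidean distance are assumptions, not part of the patent text:

import numpy as np

def segment_sq_distance(x, y1, y2):
    # Equations (1): squared distances between the three points.
    A = float(np.sum((y1 - y2) ** 2))
    B = float(np.sum((x - y2) ** 2))
    C = float(np.sum((x - y1) ** 2))
    if A == 0.0:              # Y1 and Y2 coincide: D = B (= C)
        return B
    if B - C <= -A:           # perpendicular foot falls beyond Y2: D = B
        return B
    if B - C >= A:            # perpendicular foot falls beyond Y1: D = C
        return C
    # Equation (2): squared length of the perpendicular to the segment.
    return (B * C - (B + C - A) ** 2 / 4.0) / A

Quick check: with Y1 = (0, 0), Y2 = (2, 0), and X = (1, 1), a perpendicular can be drawn and the function returns Z = 1; with X = (3, 0), it returns B = 1, the squared distance to the nearer end point Y2.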
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The first embodiment of the present invention will be described first. FIG. 1 shows a speech recognition apparatus according to the first embodiment of the present invention. This speech recognition apparatus includes an analysis section 11 for outputting the feature of input speech at each time point as the time series data of a feature vector; multiplexed standard patterns 12, obtained by converting the features of standard speaker voices at each time point into a plurality of different feature vectors and storing the vectors in a storage means as the time series data of multiplexed feature vectors; and a matching section 13 for calculating the similarity or distance value at each time point, in matching between the time series data of the feature vector of the input speech from the analysis section 11 and the time series data of the plurality of feature vectors of the multiplexed standard patterns 12, between a line segment connecting two points of the multiplexed feature vectors of the multiplexed standard patterns 12 and the feature vector of the input speech.
This speech recognition apparatus is designed to express the feature of a standard pattern at each time point, which is expressed as a set of discrete points in the conventional multi-template scheme, as a line segment defined by two end points in a pattern space or as a set of such line segments. The analysis section 11 outputs the feature of input speech at each time point as the time series data of a feature vector. Various known analysis methods exist; although not all of them will be described, any method can be used as long as it outputs the feature vectors of speech.
Standard speaker voices are analyzed by an analysis method similar to that used in the analysis section 11. This analysis is performed such that the feature at each time point is expressed as two end points, or a set of such points, of the range within which the feature can change. This operation will be described in consideration of, for example, ambient noise levels. For example, the range of possible SNRs in input speech is set to the range of 0 dB to 40 dB. Two types of voices, with SNRs of 0 dB and 40 dB as the two ends, are converted into feature vectors. These vectors are stored as the multiplexed standard patterns 12.
There are various conceivable events in which the feature at each time point continuously changes. For example, such events include a phenomenon in which the utterance of a speaker himself/herself changes in a high-noise environment (the so-called Lombard effect), changes in speech within the acoustic space constituted by many speakers, and the like. In this case, a feature is expressed by one pair of two end points. As is obvious, however, the range of 0 dB to 40 dB may be divided into four ranges to express a feature as four pairs of two end points, or the points may be connected by a broken-line (polyline) approximation.
In this case, feature vectors themselves are used as standard patterns. As in HMM and the like, however, standard patterns may instead be expressed by average vectors, the variance therein, and the like. When the time series data of the multiplexed feature vectors at the two end points of the multiplexed standard patterns 12 are expressed as one pair of two end points, the time series data of the vectors at the two end points are stored as Y1(j) and Y2(j) (j = 0, 1, ..., J).
The matching section 13 performs matching such that nonlinear expansion in the time axis direction is allowed between two patterns having different lengths. As an algorithm for this matching, for example, a DP matching or HMM algorithm is available. In any of the methods using these algorithms, the distance at each lattice point of the two-dimensional lattice defined by the input and standard-pattern time axes must be obtained. The similarity or distance value at each lattice point (i, j) is calculated between a line segment connecting two points of the multiplexed feature vectors of the multiplexed standard patterns and the feature vector of the input speech. Feature vector multiplexing may be performed for all or only some of the vectors. The matching section 13 outputs, as the recognition result, the category or category sequence of the standard patterns which finally exhibit the maximum cumulative similarity or the minimum cumulative distance.
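A simplified DP-matching sketch using the segment distance at every lattice point follows (symmetric path, no slope constraints; an assumption for illustration, with segment_sq_distance as defined earlier):

import numpy as np

def dp_match(inputs, y1_seq, y2_seq):
    # Cumulative DP matching of input frames against a multiplexed
    # standard pattern given by its two endpoint sequences Y1(j), Y2(j).
    I, J = len(inputs), len(y1_seq)
    g = np.full((I, J), np.inf)
    for i in range(I):
        for j in range(J):
            d = segment_sq_distance(inputs[i], y1_seq[j], y2_seq[j])
            if i == 0 and j == 0:
                g[i, j] = d
                continue
            best = np.inf
            if i > 0:
                best = min(best, g[i - 1, j])
            if j > 0:
                best = min(best, g[i, j - 1])
            if i > 0 and j > 0:
                best = min(best, g[i - 1, j - 1])
            g[i, j] = d + best
    return g[I - 1, J - 1]

# Recognition: the category whose standard pattern yields the minimum
# cumulative distance is output as the result.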
The second embodiment of the present invention will be described next. FIG. 2 shows a speech recognition apparatus according to the second embodiment of the present invention. This speech recognition apparatus includes a multiplex analysis section 21 for converting the feature of input speech at each time point into a plurality of different feature vectors, and outputting the vectors as the time series data of multiplexed feature vectors, standard patterns 22 obtained by converting standard speaker voices into the time series data of feature vectors and storing them in advance, and a matching section 23 for calculating the similarity or distance value at each time point, in matching between the time series data of the multiplexed feature vector of the input speech from the multiplex analysis section 21 and the time series data of the corresponding feature vector as the standard patterns 22, between a line segment connecting two points of the multiplexed feature vectors of the input speech and the corresponding feature vector as the standard patterns 22.
In the prior art, input speech at each time point is expressed as one type of time series feature vector at one point in a space. In this speech recognition apparatus, however, input speech at each time point is expressed by a line segment defined by two end points, or by a set of such line segments, covering the range within which the input speech can change, thereby performing speech recognition. The multiplex analysis section 21 expresses the feature of the input speech at each time point by two end points or a set of two end points, and outputs the feature as the time series data of a multiplexed feature vector. Any analysis method can be used as long as it outputs the feature vectors of speech. Standard speaker voices are analyzed by an analysis method similar to that used in the multiplex analysis section 21. However, multiplexing is not performed, and the standard speaker voices are stored as standard patterns for DP matching, standard patterns for HMM, or the like.
A speech recognition apparatus according to the third embodiment of the present invention will be described next with reference to FIG. 2. In this speech recognition apparatus, the multiplexed feature vectors of input speech or multiplexed feature vectors as multiplexed standard patterns in the speech recognition apparatus according to the first or second embodiment are generated by adding noise of different levels to vectors.
An example of input speech multiplexing is a technique of using an additional noise level that changes continuously, as in the speech recognition apparatus of this embodiment. In this technique, a multiplex analysis section 21 expresses two end points in a space by subtracting white noise whose upper and lower limits are determined from the input speech, on the assumption that the input speech is the sum of true speech (free from noise) and white noise of an unknown level. Letting y(j) be the time series data of the spectrum of the input speech, the time series data Y1(j) and Y2(j) of the feature vectors generated at the two ends of the white-noise levels subtracted from the input speech are generated according to, for example, equations (3) given below:
Y1(j) = C{y(j) - w1},  Y2(j) = C{y(j) - w2}                (3)
where C{·} is the function for converting a spectrum into the final feature vector, and w1 and w2 are the upper and lower limits of the white-noise level. With this operation, even if the white-noise level of the input speech is unknown, there is always some point within the predetermined range at which the noise is properly removed. Although this operation has been described by taking white noise as an example, noise observed where no input speech is present may be used instead. In this embodiment, one pair of multiplexed feature vectors is used; however, the feature may be expressed by a plurality of pairs of vectors.
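A sketch of this multiplexing step per equation (3) follows; clamping the subtracted spectrum at zero is an added assumption here (negative components are treated more carefully in the fourth and fifth embodiments), and to_feature stands in for the conversion function C{·}:

import numpy as np

def multiplex_input(spectra, w1, w2, to_feature):
    # Equation (3): subtract the upper and lower white-noise limits w1, w2
    # from every input spectrum frame y(j), then apply C{.}.
    Y1 = [to_feature(np.maximum(y - w1, 0.0)) for y in spectra]
    Y2 = [to_feature(np.maximum(y - w2, 0.0)) for y in spectra]
    return Y1, Y2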
A matching section 23 performs matching such that nonlinear expansion in the time axis direction is allowed between two patterns having different lengths. As an algorithm for this matching, for example, a DP matching or HMM algorithm is available. In any of the methods using these algorithms, the distance at each lattice point of the two-dimensional lattice defined by the input and standard-pattern time axes must be obtained. The similarity or distance value at each lattice point (i, j) is calculated between a line segment connecting two points of the multiplexed feature vectors of the input speech and a feature vector of the standard patterns 22. The matching section 23 outputs, as the recognition result, the category or category sequence of the standard patterns which finally exhibit the maximum cumulative similarity or the minimum cumulative distance.
The fourth and fifth embodiments of the present invention will be described next. FIG. 3 shows a speech recognition apparatus according to the fourth and fifth embodiments of the present invention. The speech recognition apparatus according to the fourth embodiment includes a spectrum subtraction section 30 for performing spectrum subtraction upon conversion of input speech into a spectrum in addition to the arrangement of the speech recognition apparatus according to the second embodiment (see FIG. 2). A multiplex analysis section 31 generates the multiplexed feature vector of the input speech by adding white noise of different levels to the spectrum output from the spectrum subtraction section 30.
Referring to FIG. 3, in the speech recognition apparatus according to the fifth embodiment, the multiplexed feature vector of input speech is generated by using flooring values of different levels for the spectrum output from the spectrum subtraction section 30.
The spectrum subtraction section 30 subtracts a spectrum n of estimated ambient noise from spectrum time series data y(j) of input speech to generate a spectrum y'(j) after noise removal according to equation (4):
y'(j) = y(j) - n                                           (4)
Various methods of estimating ambient noise have been proposed, although not all of them will be described here. These methods include a method of using the average spectrum of ambient noise immediately before the utterance upon detection of speech, a method of using a regression average with a sufficiently large time constant regardless of speech detection, and the like. Any other method can be used as long as it is suitable for spectrum subtraction.
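A sketch of equation (4) using the first estimation method mentioned above (average spectrum of the frames immediately preceding the utterance); the frame layout and names are assumptions:

import numpy as np

def spectrum_subtraction(speech_frames, pre_utterance_frames):
    # Estimate n as the mean spectrum of the pre-utterance frames,
    # then apply equation (4): y'(j) = y(j) - n.
    n = np.mean(np.asarray(pre_utterance_frames), axis=0)
    return [y - n for y in speech_frames]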
In the speech recognition apparatus according to the fourth embodiment, the multiplex analysis section 31 generates the multiplexed feature vector of the input speech by adding white noise of different levels to the spectrum output from the spectrum subtraction section 30. More specifically, the spectrum y'(j) after spectrum subtraction can include negative components. For this reason, when a cepstrum or a logarithmic spectrum is to be used as the feature vector, the negative components must be converted into positive real numbers, as required for logarithmic input values. As an example of this operation, the operation based on equation (5) given below is performed:
y"(j)=Clip y'(j)!+θ                                  (5)
where Clip[·] indicates an operation of replacing every component equal to or smaller than zero (or smaller than a predetermined positive value) with that value, and θ is the additional white noise. The white noise θ is added to adjust the operating point of the logarithmic processing performed for conversion to a feature vector. If, for example, θ takes a large value, the operating point rises and the unevenness of the pattern after logarithmic conversion is reduced. If θ takes a small value, the unevenness of the pattern increases. With this effect, the operating point is raised for portions, e.g., noise, which are not required for speech recognition; that is, the noise can be suppressed by reducing the unevenness of the pattern after logarithmic conversion. For speech, the unevenness of the pattern after logarithmic conversion is increased (the operating point is lowered) to obtain the feature effectively.
At the time of execution of this processing, however, it cannot be determined whether the input is noise or speech. Even if such determination were possible, it could not be perfect. For this reason, in the present invention, such uncertain determination processing is not used; instead, the two end points at which suppression is maximized and minimized, respectively, are expressed by multiplexed feature vectors, and the optimal suppression amount is determined in matching. That is, two types of spectra for multiplexing are obtained by using θ1, by which suppression is maximized, and θ2, by which suppression is minimized, according to equations (6) and (7) given below. These spectra are then converted into the final multiplexed feature vectors.
Y".sub.1 (j)=Clip y"(j)!+θ.sub.1                     (6)
Y".sub.Z (j)=Clip y"(j)!+θ.sub.2                     (7)
In the speech recognition apparatus according to the fifth embodiment, the multiplex analysis section 31 generates the multiplexed feature vector of the input speech by using flooring values of different levels for the spectrum output from the spectrum subtraction section 30. More specifically, the spectrum y'(j) after spectrum subtraction can include negative components. For this reason, when a spectrum or a logarithmic spectrum is to be used as the feature vector, the negative components must be converted into positive real numbers, as required for logarithmic input values. As an example of this operation, a so-called flooring operation based on the method disclosed in, e.g., M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of Speech Corrupted by Acoustic Noise", ICASSP, pp. 208-211, 1979, may be performed. In this method, a minimum value βnk is set for each component k, and every component below this value is replaced with that minimum value, as indicated by equation (8):
y"(j)k = y'(j)k  (if y'(j)k > βnk);  y"(j)k = βnk  (otherwise)    (8)
where k is the suffix indicating the component of the spectrum, n is the spectrum of the estimated noise, and β is a constant sufficiently smaller than 1. With this processing, a positive value is always provided as the logarithmic input value, thereby preventing a calculation failure. However, changing the magnitude of β changes the unevenness of the spectrum after logarithmic conversion, which makes it difficult to determine the value of β. This problem is essentially the same as the one solved by the speech recognition apparatus of the present invention: the optimal value of β differs between noise portions and speech portions, and it also changes depending on the overall SNR of the speech, so the value of β cannot be uniquely determined. By using β1, by which suppression is maximized, and β2, by which suppression is minimized, two types of spectra for multiplexing are therefore obtained according to equations (9) and (10) given below:
y"1(j)k = y'(j)k  (if y'(j)k > β1nk);  y"1(j)k = β1nk  (otherwise)    (9)
y"2(j)k = y'(j)k  (if y'(j)k > β2nk);  y"2(j)k = β2nk  (otherwise)    (10)
These spectra are then converted into the final multiplexed feature vectors. In this case, the method disclosed in the above reference has been described as the flooring method; other flooring definitions, such as that indicated by equation (11), can also be used. Any method can be used as long as it is suitable for spectrum subtraction processing.
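A sketch of the flooring operation of equations (8) to (10); β1 and β2 bound the flooring strength, and the names are hypothetical:

import numpy as np

def floor_spectrum(y_sub, noise, beta):
    # Equation (8): raise every component below beta * n_k to that minimum.
    floor = beta * noise
    return np.where(y_sub > floor, y_sub, floor)

def multiplex_by_flooring(y_sub_frames, noise, beta1, beta2):
    # Equations (9) and (10): two flooring strengths give the two
    # endpoint spectra of the multiplexed input feature.
    Y1 = [floor_spectrum(y, noise, beta1) for y in y_sub_frames]
    Y2 = [floor_spectrum(y, noise, beta2) for y in y_sub_frames]
    return Y1, Y2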
A speech recognition apparatus according to the sixth embodiment of the present invention will be described with reference to FIG. 4. FIG. 4 shows the speech recognition apparatus according to the sixth embodiment of the present invention. This speech recognition apparatus includes an analysis section 41 for outputting the feature of input speech at each time point as the time series data of feature vectors, a noise extraction section 42 for extracting ambient noise from the input speech, standard patterns 43 obtained by converting standard speaker voices into the time series data of feature vectors and storing the time series data in advance, a standard pattern conversion section 44 for generating a plurality of different feature vectors by changing the level of the noise extracted by the noise extraction section 42 and adding the resultant noise to the standard patterns 43, and for storing them as multiplexed standard patterns 45, and a matching section 46 for calculating, in matching between the time series data of the feature vectors of the input speech from the analysis section 41 and the time series data of the plurality of feature vectors serving as the multiplexed standard patterns 45, the similarity or distance value between the feature vector of the input speech and a line segment connecting two points of the multiplexed feature vectors of the multiplexed standard patterns 45.
This speech recognition apparatus uses a method of estimating noise from the input speech, e.g., from the spectrum shape immediately before the utterance, and converting the standard patterns by using that noise so that the same noise environment as that of the input is set, thereby performing recognition. Since the relative relationship (i.e., the SNR) between speech and noise is unknown at noise estimation time, the SNRs of the standard patterns cannot be uniquely determined. For this reason, the standard patterns 43 are written as multiplexed feature vectors defined by two end points corresponding to the maximum and minimum SNRs. If the spectrum of the noise obtained, for example, immediately before the utterance of the input speech is represented by n, the standard pattern conversion section 44 converts the time series data y(j) of the spectra of the standard patterns 43 into the spectra y′₁(j) and y′₂(j) of the multiplexed standard patterns by using coefficients α₁ and α₂ at the two end points, as indicated by equations (12) and (13) given below:
y′₁(j) = y(j) + α₁n                           (12)
y′₂(j) = y(j) + α₂n                           (13)
The multiplexed spectrum time series data are converted into the final time series data of feature vectors and stored as the multiplexed standard patterns 45, and matching is then performed.
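As an illustrative sketch only (the names and the omitted final feature conversion are assumptions, not the patent's implementation), the conversion of equations (12) and (13) could be written as:

    import numpy as np

    def multiplex_standard_pattern(pattern_spectra, noise, alpha1, alpha2):
        # pattern_spectra: time series y(j) of standard-pattern spectra.
        # noise: spectrum n estimated immediately before the utterance.
        y1 = [y + alpha1 * noise for y in pattern_spectra]  # equation (12)
        y2 = [y + alpha2 * noise for y in pattern_spectra]  # equation (13)
        # Conversion to the final feature vectors (e.g., logarithmic
        # analysis) would follow before storage as multiplexed patterns.
        return y1, y2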
FIGS. 5A and 5B show the principle of the matching section of the speech recognition apparatus according to the present invention. In the matching sections of the speech recognition apparatuses according to the first to sixth embodiments, the similarity or distance value at each time point is obtained between one vector (denoted by reference symbol X in FIGS. 5A and 5B) and the line segment represented by two end point vectors (denoted by reference symbols Y1 and Y2 in FIGS. 5A and 5B). As shown in FIG. 5A, when a perpendicular can be drawn from the vector to the line segment, the similarity or distance value is calculated by using the length of the perpendicular. In contrast, when a perpendicular cannot be drawn, as shown in FIG. 5B, the similarity or distance value is calculated by using the shorter of the distances from the vector to the two end points of the line segment.
More specifically, the distance between a vector X(i) and the line segment represented by the two end points Y₁(j) and Y₂(j) in the pattern space is obtained. First, the squared distances among the three points are obtained according to equation (14), as in the conventional scheme described above:
A = d(Y₁(j), Y₂(j)),  B = d(X(i), Y₂(j)),  C = d(X(i), Y₁(j))        (14)
where d(V, M) indicates an operation of obtaining the square distance between two points V and M. When a perpendicular can be drawn from X(i) to the line segment (Y₁(j), Y₂(j)), a square distance Z (FIG. 5A) is calculated from these distances according to equation (15):
Z = {BC − (B + C − A)²/4}/A                   (15)
The final square distance D is then determined by the magnitude relationships shown in Table 2, which distinguish the case wherein a perpendicular can be drawn from the case wherein it cannot (FIG. 5B); an illustrative calculation is given after Table 2.
              TABLE 2
______________________________________
D           Condition
______________________________________
B or C      A = 0
B           A ≠ 0, B - C ≦ -A
C           A ≠ 0, B - C ≧ A
Z           otherwise
______________________________________
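The calculation of equations (14) and (15) together with the case analysis of Table 2 can be illustrated with the following Python sketch, assuming d(V, M) is the squared Euclidean distance; the function names are introduced for this example only:

    import numpy as np

    def sq_dist(v, m):
        # d(V, M): squared Euclidean distance between two points.
        diff = np.asarray(v, dtype=float) - np.asarray(m, dtype=float)
        return float(diff @ diff)

    def segment_sq_distance(x, y1, y2):
        a = sq_dist(y1, y2)   # A in equation (14)
        b = sq_dist(x, y2)    # B
        c = sq_dist(x, y1)    # C
        if a == 0.0:          # degenerate segment: Y1 = Y2, so B = C
            return b
        if b - c <= -a:       # foot of the perpendicular lies beyond Y2
            return b
        if b - c >= a:        # foot of the perpendicular lies beyond Y1
            return c
        return (b * c - (b + c - a) ** 2 / 4.0) / a   # equation (15)

For example, for X = (1, 1) and the segment from Y1 = (0, 0) to Y2 = (2, 0), the function returns 1.0, the square of the perpendicular length.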
As has been described above, according to the present invention, an event which changes continuously at each time point is expressed by a line segment defined by two end points in a pattern space, or by a set of such line segments, and a distance calculation is performed by obtaining the optimal distance or likelihood between a vector and the line segment, thereby providing a high-precision speech recognition apparatus.
The multi-template scheme based on the conventional method requires many sampling points to express a sufficiently wide range of changes, resulting in a large memory capacity and a large amount of distance calculation, so an inexpensive apparatus cannot be provided. If, for example, an event at each time point is expressed by 10 sampling points, 10 distance calculations and memory corresponding to the 10 points are required. According to the present invention, however, the amount of calculation required to obtain the distance between one line segment and a vector corresponds to only three distance calculations, and the required memory capacity corresponds to only two points. A more inexpensive apparatus can therefore be provided.

Claims (20)

What is claimed is:
1. A speech recognition apparatus comprising:
means for outputting a feature of input speech as a time series of feature vectors;
means for storing a plurality of standard voice patterns as a time series of multiplexed feature vectors converted from a plurality of standard speaker voices, wherein the multiplexed feature vectors include, at each time point, a plurality of different feature vectors; and
matching means for calculating at a particular time point, a similarity or distance value between the feature vector of the input speech and the feature vectors of the standard voice patterns, by calculating a distance between the feature vector of the input speech defined for said particular time point and a line segment connecting end points of two of the multiplexed feature vectors that are defined for said particular time point.
2. An apparatus according to claim 1, wherein the plurality of different feature vectors of the standard voice patterns are generated by adding noise of different levels.
3. An apparatus according to claim 1, further comprising:
noise extraction means for extracting ambient noise from the input speech; and
means for changing a level of the ambient noise extracted by said noise extraction means, and generating the plurality of different feature vectors by adding in the noise having the changed level.
4. An apparatus according to claim 1, wherein the similarity or distance value at said particular time point is obtained between an end of the feature vector of the input speech and the line segment connecting the end points of the two multiplexed feature vectors; and
wherein the similarity or distance value is calculated by using a length of a perpendicular line when the perpendicular line can be drawn from the end of the feature vector of the input speech to intersect the line segment, or else by using a shorter one of distances from the end of the feature vector of the input speech to the two end points of the line segment when the perpendicular line cannot be drawn.
5. A speech recognition apparatus comprising:
multiplex analysis means for converting a feature of input speech at each time point into a plurality of different feature vectors, and outputting the plurality of different feature vectors as a time series of multiplexed feature vectors;
means for storing standard voice patterns as a time series of feature vectors converted from standard speaker voices; and
matching means for calculating at a particular time point, a similarity or distance value between the feature vector of the standard voice patterns defined for said particular time point and a line segment connecting end points of two of the multiplexed feature vectors of the input speech defined for said particular time point.
6. An apparatus according to claim 5, wherein the plurality of different feature vectors of the input speech are generated by adding noise of different levels.
7. An apparatus according to claim 5, further comprising spectrum subtraction means for converting the input speech into a spectrum time series, and then performing a spectrum subtraction therefrom, wherein said multiplex analysis means generates the plurality of different feature vectors of the input speech by adding white noise of different levels to the output from said spectrum subtraction means.
8. An apparatus according to claim 7, wherein the spectrum subtraction means subtracts a spectrum of ambient noise from the spectrum time series.
9. An apparatus according to claim 5, further comprising spectrum subtraction means for converting the input speech into a spectrum time series, and then performing a spectrum subtraction therefrom, wherein said multiplex analysis means generates the plurality of different feature vectors of the input speech by adding flooring values of different levels to the output from said spectrum subtraction means.
10. An apparatus according to claim 9, wherein the spectrum subtraction means subtracts a spectrum of ambient noise from the spectrum time series.
11. An apparatus according to claim 5, wherein the similarity or distance value at said particular time point is obtained between an end of the feature vector of the standard voice patterns and the line segment connecting the end points of the two multiplexed feature vectors of the input speech; and
wherein the similarity or distance value is calculated by using a length of a vertical line when the vertical line can be drawn from the end of the feature vector of the standard voice patterns to the line segment, or else by using a shorter one of distances from the end of the feature vector of the standard voice patterns to the two end points of the line segment when a vertical line cannot be drawn.
12. A speech recognition method comprising the steps of:
converting an input speech into a time series of feature vectors;
converting standard voice patterns into a time series of multiplexed feature vectors; and
matching the input speech to the standard voice patterns by calculating at a particular time point, a similarity or distance value between the feature vector of the input speech defined for said particular time point and a line segment connecting end points of two of the multiplexed feature vectors of the standard voice patterns defined for said particular time point.
13. A method according to claim 12, wherein the step of converting the standard voice patterns into the time series of multiplexed feature vectors includes the step of adding noise of different levels.
14. A method according to claim 13, further comprising the steps of:
extracting ambient noise from the input speech;
changing a level of the extracted ambient noise;
generating a plurality of different feature vectors by adding the extracted ambient noise having the changed level to a time series of feature vectors converted from the plurality of standard speaker voices; and
storing the different feature vectors as the multiplexed feature vectors.
15. A method according to claim 14, wherein the step of matching includes the step of selecting as the similarity or distance value a length of a perpendicular line when the perpendicular line can be drawn from an end of the feature vector of the input speech to intersect the line segment, or else, a shorter one of distances from the end of the feature vector of the input speech to the two end points of the line segment when the perpendicular line cannot be drawn.
16. A speech recognition method comprising the steps of:
converting a feature of input speech into a time series of multiplexed feature vectors;
storing standard voice patterns as a time series of feature vectors converted from standard speaker voices; and
matching the input speech to the standard voice patterns by calculating at a particular time point, a similarity or distance value between the feature vector of the standard voice patterns defined for said particular time point and a line segment connecting end points of two of the multiplexed feature vectors of the input speech defined for said particular time point.
17. A method according to claim 16, wherein the step of converting the input speech into the time series of multiplexed feature vectors includes the step of adding noise of different levels.
18. A method according to claim 17, further comprising the steps of converting the input speech into a spectrum time series, and then performing a spectrum subtraction therefrom, wherein the step of converting a feature of the input speech includes the step of adding white noise of different levels to the result of the spectrum subtraction.
19. A method according to claim 17, further comprising the steps of converting the input speech into a spectrum time series, and then performing a spectrum subtraction therefrom, wherein the step of converting a feature of the input speech includes the step of adding flooring values of different levels to the result of the spectrum subtraction.
20. A method according to claim 16, wherein the step of matching includes the step of selecting as the similarity or distance value a length of a perpendicular line when the perpendicular line can be drawn from an end of the feature vector of the standard voice patterns to intersect the line segment, or else, a shorter one of distances from the end of the feature vector of the standard voice patterns to the two end points of the line segment when the perpendicular line cannot be drawn.
US08/959,465 1996-10-28 1997-10-28 Speech recognition using distance between feature vector of one sequence and line segment connecting feature-variation-end-point vectors in another sequence Expired - Fee Related US5953699A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP8-285532 1996-10-28
JP8285532A JP3039623B2 (en) 1996-10-28 1996-10-28 Voice recognition device

Publications (1)

Publication Number Publication Date
US5953699A 1999-09-14

Family

ID=17692757

Family Applications (1)

Application Number Title Priority Date Filing Date
US08/959,465 Expired - Fee Related US5953699A (en) 1996-10-28 1997-10-28 Speech recognition using distance between feature vector of one sequence and line segment connecting feature-variation-end-point vectors in another sequence

Country Status (4)

Country Link
US (1) US5953699A (en)
EP (1) EP0838803B1 (en)
JP (1) JP3039623B2 (en)
DE (1) DE69715343T2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893389A (en) * 2015-01-26 2016-08-24 阿里巴巴集团控股有限公司 Voice message search method, device and server

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4181821A (en) * 1978-10-31 1980-01-01 Bell Telephone Laboratories, Incorporated Multiple template speech recognition system
JPS58111099A (en) * 1981-12-24 1983-07-01 日本電気株式会社 Pattern matching device
US4608708A (en) * 1981-12-24 1986-08-26 Nippon Electric Co., Ltd. Pattern matching system
JPS58115490A (en) * 1981-12-29 1983-07-09 日本電気株式会社 Pattern-to-pattern distance calculator
US4571697A (en) * 1981-12-29 1986-02-18 Nippon Electric Co., Ltd. Apparatus for calculating pattern dissimilarity between patterns
US4783802A (en) * 1984-10-02 1988-11-08 Kabushiki Kaisha Toshiba Learning system of dictionary for speech recognition
US4737976A (en) * 1985-09-03 1988-04-12 Motorola, Inc. Hands-free control system for a radiotelephone
US4933973A (en) * 1988-02-29 1990-06-12 Itt Corporation Apparatus and methods for the selective addition of noise to templates employed in automatic speech recognition systems
JPH04264596A (en) * 1991-02-20 1992-09-21 N T T Data Tsushin Kk Voice recognizing method in noisy enviroment
EP0526347A2 (en) * 1991-08-01 1993-02-03 Fujitsu Limited A number-of-recognition candidates determining system in a speech recognizing device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Berouti et al., "Enhancement of Speech Corrupted by Acoutistic Noise", Bolt Baranek and Newman Inc., Cambridge, Mass., IEEE, 208-211 (1979).
Ohno et al., "Utterance Normalization Using Vowel Features in a Spoken Word Recognition System for Multiple Speakers," IEEE Speech Processing, Apr. 27, 1993, pp. II-578 to II-581.

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020138263A1 (en) * 2001-01-31 2002-09-26 Ibm Corporation Methods and apparatus for ambient noise removal in speech recognition
US6754623B2 (en) * 2001-01-31 2004-06-22 International Business Machines Corporation Methods and apparatus for ambient noise removal in speech recognition
US7216075B2 (en) * 2001-06-08 2007-05-08 Nec Corporation Speech recognition method and apparatus with noise adaptive standard pattern
US20100169089A1 (en) * 2006-01-11 2010-07-01 Nec Corporation Voice Recognizing Apparatus, Voice Recognizing Method, Voice Recognizing Program, Interference Reducing Apparatus, Interference Reducing Method, and Interference Reducing Program
US8150688B2 (en) * 2006-01-11 2012-04-03 Nec Corporation Voice recognizing apparatus, voice recognizing method, voice recognizing program, interference reducing apparatus, interference reducing method, and interference reducing program
US20100094626A1 (en) * 2006-09-27 2010-04-15 Fengqin Li Method and apparatus for locating speech keyword and speech recognition system
US8255215B2 (en) 2006-09-27 2012-08-28 Sharp Kabushiki Kaisha Method and apparatus for locating speech keyword and speech recognition system
US20080147394A1 (en) * 2006-12-18 2008-06-19 International Business Machines Corporation System and method for improving an interactive experience with a speech-enabled system through the use of artificially generated white noise
US9324323B1 (en) 2012-01-13 2016-04-26 Google Inc. Speech recognition using topic-specific language models
US8775177B1 (en) 2012-03-08 2014-07-08 Google Inc. Speech recognition process

Also Published As

Publication number Publication date
JP3039623B2 (en) 2000-05-08
EP0838803A2 (en) 1998-04-29
JPH10133688A (en) 1998-05-22
EP0838803A3 (en) 1998-12-23
EP0838803B1 (en) 2002-09-11
DE69715343T2 (en) 2003-06-05
DE69715343D1 (en) 2002-10-17

Similar Documents

Publication Publication Date Title
US5749068A (en) Speech recognition apparatus and method in noisy circumstances
EP0686965B1 (en) Speech recognition apparatus with speaker adaptation using acoustic category mean value calculus
US8370139B2 (en) Feature-vector compensating apparatus, feature-vector compensating method, and computer program product
US4570232A (en) Speech recognition apparatus
AU720511B2 (en) Pattern recognition
US20070276662A1 (en) Feature-vector compensating apparatus, feature-vector compensating method, and computer product
EP1195744B1 (en) Noise robust voice recognition
US20080300875A1 (en) Efficient Speech Recognition with Cluster Methods
EP0779609B1 (en) Speech adaptation system and speech recognizer
US20080208578A1 (en) Robust Speaker-Dependent Speech Recognition System
JPH0850499A (en) Signal identification method
JPH11133992A (en) Feature extracting device and feature extracting method, and pattern recognizing device and pattern recognizing method
JP3298858B2 (en) Partition-based similarity method for low-complexity speech recognizers
WO2010035892A1 (en) Speech recognition method
US5953699A (en) Speech recognition using distance between feature vector of one sequence and line segment connecting feature-variation-end-point vectors in another sequence
US6381572B1 (en) Method of modifying feature parameter for speech recognition, method of speech recognition and speech recognition apparatus
JP2002366192A (en) Method and device for recognizing voice
JP2001067094A (en) Voice recognizing device and its method
JP2000259198A (en) Device and method for recognizing pattern and providing medium
KR100614932B1 (en) Channel normalization apparatus and method for robust speech recognition
JPH11212588A (en) Speech processor, speech processing method, and computer-readable recording medium recorded with speech processing program
JP2001083978A (en) Speech recognition device
JP3698511B2 (en) Speech recognition method
CA2229113C (en) Inhibiting reference pattern generating method
JPH09305195A (en) Speech recognition device and speech recognition method

Legal Events

Date Code Title Description
AS Assignment

Owner name: NEC CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:TAKAGI, KEIZABURO;REEL/FRAME:008868/0009

Effective date: 19971020

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20030914