WO2014108890A1 - Method and apparatus for phoneme separation in an audio signal - Google Patents



Publication number
WO2014108890A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
computer
wavelet transform
value
scalogram
Prior art date
Application number
PCT/IL2013/051014
Other languages
French (fr)
Inventor
Yossef BEN EZRA
Original Assignee
Novospeech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Novospeech Ltd filed Critical Novospeech Ltd
Publication of WO2014108890A1 publication Critical patent/WO2014108890A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer-implemented method and apparatus for analyzing signals such as audio signals. The method comprises receiving a signal; performing wavelet transform of the signal to obtain a multiplicity of coefficients, using a multiplicity of values of a scaling parameter and a multiplicity of values for a shift parameter; and analyzing at least one graph providing coefficients related to a fixed scaling parameter value and varying shift parameter values to obtain a location associated with a phenomenon.

Description

METHOD AND APPARATUS FOR PHONEME SEPARATION IN AN
AUDIO SIGNAL
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional Application No. 61/750,479 filed January 9, 2013, entitled "METHOD AND APPARATUS FOR PHONEME SEPARATION IN AN AUDIO SIGNAL", which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
[0002] The present disclosure relates to audio processing in general, and to a method and apparatus for phoneme edge detection in audio signals, in particular.
BACKGROUND
[0003] Audio processing generally refers to enhancing or analyzing audio signals for all kinds of purposes, such as improving audio quality, removing noise or echo, providing full duplex communication systems, improving results of audio analysis engines and tools such as continuous speech recognition, word spotting, emotion analysis, speaker recognition or verification, and others. On a higher level, improving the results of such audio analysis engines and tools, and optionally using further tools such as text analysis tools to process the resulting text, may assist in retrieving information associated with or embedded within captured or recorded audio signals. Such information may prove useful in achieving many personal or business purposes, such as trend analysis, competition analysis, quality assurance, improving service, root cause analysis of problems conveyed over voice interactions, or others.
[0004] One of the most required yet challenging audio analysis applications is continuous speech recognition. In many systems, continuous speech recognition requires a training corpus in which audio signals and corresponding text extracted therefrom are provided, with accompanying information such as the location within the audio of each extracted phoneme. The system uses the training set to learn the specific parameters of the environment, speaker groups, particular speakers, or the like, and then applies these parameters in runtime, when actual recognition is required for new audio signals. It will be appreciated that the larger and more representative the training set, the better the results achieved by the speech recognition. However, generating such training sets is highly labor intensive and requires either manual transcription, or automated speech recognition by an existing system, followed by intensive error correction.
[0005] Speech recognition systems may comprise a phoneme recognition engine, and in particular classification engines such as a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM) for extracting phonemes out of an audio signal, wherein the engine outputs the most probable phoneme sequence for an input audio signal, or others.
[0006] In order to recognize the phonemes, a preliminary stage of phoneme separation, such as detecting the phoneme edges, may be carried out in order to separate the input audio signal into phonemes, such that each phoneme may be recognized.
[0007] Thus, in order to improve recognition, there is a need for a method and apparatus for improving phoneme separation.
BRIEF SUMMARY
[0008] One aspect of the disclosed subject matter relates to a computer-implemented method performed by a computerized device, comprising: receiving a signal; performing wavelet transform of the signal to obtain a multiplicity of coefficients, using a multiplicity of values of a scaling parameter and a multiplicity of values for a shift parameter; and analyzing at least one graph providing coefficients related to a fixed scaling parameter value and varying shift parameter values to obtain a location associated with a phenomenon. The method may further comprise creating a scalogram having a first axis associated with the scaling parameter, and a second axis associated with the shifting parameter, such that a value of the scalogram for each point associated with a scaling value and a shifting value is associated with the coefficient obtained for the scaling value and a shifting value, wherein the at least one graph represents a line in the scalogram. Within the method, the signal is optionally an audio signal. Within the method, the phenomenon is optionally phoneme edges. Within the method, the phenomenon is optionally detected by identifying areas in the graph in which the signal amplitude decays substantially to zero and then rises. Within the method, the wavelet transform is optionally a continuous wavelet transform. Within the method, the wavelet transform is optionally a discrete wavelet transform. The method may further comprise performing additional audio analysis activities on the signal. The method is optionally performed online as the signal is being captured or offline for a pre-captured signal. The method is optionally performed as part of preparing a training set.
[0009] Another aspect of the disclosure relates to an apparatus having a processing unit and a storage device, the apparatus comprising: a signal receiving component for receiving a signal; a wavelet transform component for transforming the signal to obtain a multiplicity of coefficients, using a multiplicity of values of a scaling parameter and a shift parameter; and a wavelet coefficient analysis component for analyzing at least one graph providing coefficients related to a fixed scaling parameter value and varying shift parameter values to obtain a location associated with a phenomenon. The apparatus may further comprise a scalogram generation component for creating a scalogram having a first axis associated with the scaling parameter, and a second axis associated with the shifting parameter, such that a value of the scalogram for each point associated with a first scaling value and a first shifting value is associated with the coefficient obtained for the first scaling value and a first shifting value, the at least one graph represents a line in the scalogram. Within the apparatus, the signal is optionally an audio signal. Within the apparatus, the phenomenon is optionally phoneme edges. Within the apparatus, the wavelet transform is optionally continuous wavelet transform or discrete wavelet transform. The apparatus may further comprise an additional audio analysis component for performing additional audio analysis actions on the signal. The apparatus may be activated as part of preparing a training set.
[0010] Yet another aspect of the disclosure relates to a computer program product comprising: a non-transitory computer readable medium; a first program instruction for receiving a signal; a second program instruction for performing wavelet transform of the signal to obtain a multiplicity of coefficients, using a multiplicity of values of a scaling parameter and a multiplicity of values for a shift parameter; and a third program instruction for analyzing at least one graph providing coefficients related to a fixed scaling parameter value and varying shift parameter values to obtain a location associated with a phenomenon; wherein said first, second, and third program instructions are stored on said non-transitory computer readable medium.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0011] The present disclosed subject matter will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings in which corresponding or like numerals or characters indicate corresponding or like components. Unless indicated otherwise, the drawings provide exemplary embodiments or aspects of the disclosure and do not limit the scope of the disclosure. In the drawings:
[0012] Fig. 1 shows a flowchart of steps in a method for separating phonemes;
[0013] Fig. 2A shows an exemplary audio signal;
[0014] Fig. 2B shows a corresponding spectrogram created based upon the audio signal of Fig. 2A;
[0015] Fig. 3 shows a flowchart of steps in a method for separating phonemes, in accordance with some exemplary embodiments of the disclosed subject matter;
[0016] Fig. 4A shows the exemplary audio signal of Fig. 2A;
[0017] Fig. 4B shows a corresponding scalogram created based upon the audio signal of Fig. 4A, in accordance with some exemplary embodiments of the disclosed subject matter;
[0018] Fig. 4C shows a zoomed-in part of the audio signal of Fig. 4A, in accordance with some exemplary embodiments of the disclosed subject matter;
[0019] Fig. 4D shows a graph based upon a horizontal line of the scalogram of Fig. 4B, in accordance with some exemplary embodiments of the disclosed subject matter;
[0020] Fig. 5 is a schematic illustration of an apparatus for separating phonemes, in accordance with some exemplary embodiments of the disclosed subject matter;
[0021] Fig. 6A shows the exemplary audio signal of Fig. 2A with a higher level of noise;
[0022] Fig. 6B shows a spectrogram created based upon the audio signal of Fig. 6A; and
[0023] Fig. 6C shows a scalogram created based upon the audio signal of Fig. 6A, in accordance with some exemplary embodiments of the disclosed subject matter.
DETAILED DESCRIPTION
[0024] Definitions
[0025] Short Term Fourier Transform (STFT), in the continuous-time case, relates to a calculation in which the function to be transformed is multiplied by a window function which is nonzero for only a short period of time. The Fourier transform of the resulting signal, which is a one-dimensional function, is taken as the window is slid along the time axis, resulting in a two-dimensional representation of the signal. The mathematical representation may be:

STFT{x(t)}(τ, ω) ≡ X(τ, ω) = ∫ x(t) w(t−τ) e^(−jωt) dt
wherein x(t) is the signal to be transformed, and w(t) is a window function, such as a Hann window or Gaussian bell centered around zero. X(τ, ω) is essentially the Fourier Transform of x(t)w(t−τ), a complex function representing the phase and magnitude of the signal over time and frequency. Time windows in STFT may be taken with overlap therebetween, in order to enhance continuity.
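The windowed transform above may be sketched numerically as follows; this is an illustrative discrete-time approximation, with the window size, hop, sampling rate and test tone chosen as assumed example values, and NumPy's FFT standing in for the continuous integral:

```python
import numpy as np

def stft(x, win_size=256, hop=128):
    """Discrete STFT: slide a Hann window along x and take the FFT
    of each windowed frame, giving a 2-D time-frequency array."""
    w = np.hanning(win_size)                      # w(t): Hann window
    n_frames = 1 + (len(x) - win_size) // hop     # overlapping frames enhance continuity
    frames = np.stack([x[i*hop : i*hop + win_size] * w for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)            # X(tau, omega), complex-valued

# A spectrogram is the squared magnitude of the STFT
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                   # 440 Hz tone
X = stft(x)
S = np.abs(X) ** 2
peak_bin = S.mean(axis=0).argmax()
print(peak_bin * fs / 256)                        # frequency of the strongest bin, near 440 Hz
```

The strongest bin lands within one bin spacing (fs/256 ≈ 31 Hz) of the true tone frequency, illustrating the finite frequency resolution discussed later.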
[0026] A spectrogram is a time-varying spectral representation, i.e., an image that shows how the spectral density of a signal varies over time. Thus, a spectrogram may be used, among other purposes, to identify phonetic sounds. Typically, the spectrogram has two axes: the horizontal axis represents time while the vertical axis may represent frequency. A third dimension indicating the amplitude of a particular frequency at a particular time is represented by the intensity or color of each point in the image. The third dimension may be represented by a color scheme wherein darker colors indicate higher amplitudes or the other way around, or by height of a three dimensional histogram, or the like. A spectrogram may be constructed based on the results of STFT applied to an audio signal.
[0027] A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation".
[0028] The continuous wavelet transform of a signal x(t) is:

X_w(a, b) = (1/√|a|) ∫ x(t) ψ*((t−b)/a) dt

wherein ψ(t) is a continuous function in the time domain and the frequency domain referred to as the mother wavelet and * represents operation of a complex conjugate, a is a scaling parameter, i.e., is associated with the size of the time window, and b is a shifting parameter.
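A direct numerical approximation of this transform may be sketched as follows, using the Haar function (introduced below) as an example mother wavelet; the step signal and parameter values are illustrative, and the integral is approximated by a Riemann sum over the sampled signal:

```python
import numpy as np

def haar(t):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1), -1.0, 0.0))

def cwt_coeff(x, fs, a, b, psi=haar):
    """One coefficient X_w(a, b) = 1/sqrt(|a|) * integral x(t) psi*((t-b)/a) dt,
    approximated by a Riemann sum with dt = 1/fs."""
    t = np.arange(len(x)) / fs
    return np.sum(x * psi((t - b) / a)) / (np.sqrt(abs(a)) * fs)

# A step at t = 0.5 s yields a large coefficient when the wavelet straddles it
fs = 1000
x = np.concatenate([np.zeros(500), np.ones(500)])
on_edge = cwt_coeff(x, fs, a=0.1, b=0.45)   # wavelet centered near the step
off_edge = cwt_coeff(x, fs, a=0.1, b=0.80)  # wavelet over a flat region
print(abs(on_edge) > abs(off_edge))
```

The coefficient magnitude is large where the signal changes within the wavelet's support and near zero over flat regions, which is the property the later edge-detection steps rely on.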
[0029] It will be appreciated that each value of a provides a different filter, thus varying the value of a provides a filter bank. The transform having a=l and b=0 may be referred to as a "mother wavelet", while transforms having other a and b values may be referred to as "daughter wavelets". Wavelet transform may be continuous or discrete, wherein discrete wavelet transform may be used when the wavelet function is sampled rather than continuous.
[0030] A Haar wavelet is a sequence of rescaled "square-shaped" functions which together form a wavelet family or basis. A Haar wavelet may be used in discrete wavelet transform.
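One level of a Haar-based discrete wavelet transform may be sketched as pairwise scaled averages and differences; the input values below are illustrative:

```python
import numpy as np

def haar_dwt_level(x):
    """One level of the Haar DWT: pairwise scaled sums give the
    approximation coefficients, pairwise scaled differences the detail."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)
    detail = (even - odd) / np.sqrt(2)
    return approx, detail

a, d = haar_dwt_level([4.0, 2.0, 5.0, 5.0])
print(a)  # scaled pair averages
print(d)  # scaled pair differences; zero where neighboring samples are equal
```

The detail coefficients vanish exactly where adjacent samples are equal, so abrupt changes in the signal show up as large detail values.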
[0031] A scalogram may be referred to as the equivalent of a spectrogram for wavelets. Thus, a scalogram is a visual representation of a wavelet transform of a function. Similarly to a spectrogram, in a scalogram the x axis usually represents the time as determined by the shift parameter b, the y axis represents the scale, i.e., the a parameter, and the color, brightness, intensity, height or another parameter of the graph at a particular x and y combination represents the output of the wavelet for the particular combination of a and b.
[0032] The disclosed subject matter is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the subject matter. It will be understood that blocks of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to one or more processors of a general purpose computer, special purpose computer, a tested processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block or blocks in the block diagram.
[0033] These computer program instructions may also be stored in a non-transient computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the non-transient computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
[0034] The computer program instructions may also be loaded onto a device such as a computer or other programmable data processing apparatus, to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0035] One technical problem dealt with by the disclosed subject matter is the difficulty to generate spectral representations of audio signals which provide for efficient analysis.
[0036] Traditionally, audio signals are spectrally represented as spectrograms, showing how the spectral density of a signal varies with time. Spectrograms are generated from the time signal using short-time Fourier transform (STFT).
[0037] A common format of a spectrogram is a graph having a horizontal axis representing the time, a vertical axis representing the frequency, and a third dimension, indicating the amplitude, i.e., the intensity of a particular frequency at a particular time, which may be represented by the intensity or color of each point in the image. Thus, by analyzing the intensity, generally vertical areas of lower intensity, i.e., areas covering a wide range of frequencies over a period of time, may represent silent periods, while thinner areas, or areas in which the amplitude drops relative to neighboring areas even if not to complete silence, may be interpreted as phoneme edges, i.e., areas in which one phoneme ends and another begins.
[0038] However, in STFT, a tradeoff exists between the resolution in the frequency domain and the resolution in the time domain, since according to the Nyquist principle, the resolution cannot be better than half the window size. A wide time window provides for better frequency resolution but poor time resolution, while a narrower time window provides for good time resolution but poor frequency resolution.
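This tradeoff may be illustrated numerically: for a window of N samples at sampling rate fs, the window spans N/fs seconds while adjacent frequency bins are fs/N apart, so improving one resolution necessarily degrades the other. The sampling rate below is an assumed example value:

```python
fs = 16000  # assumed sampling rate (Hz)

for n in (128, 512, 2048):                 # window sizes in samples
    time_res = n / fs                      # window duration in seconds
    freq_res = fs / n                      # FFT bin spacing in Hz
    print(f"N={n}: {time_res*1000:.1f} ms window, {freq_res:.1f} Hz per bin")
```

The product of the two resolutions is constant (equal to 1), which is why no single fixed window suits both short and long phonemes.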
[0039] Thus, for a given size of time window, the frequency resolution is limited. This limits the applicability of STFT to speech analysis since normal speech comprises phonemes of all lengths, and taking any fixed time window may often miss effects such as phoneme edges which are crucial for phoneme separation and hence phoneme recognition.
[0040] In addition, Fast Fourier Transform (FFT) may miss transition effects, due for example to the Gibbs phenomenon, which relates to the Fourier sums overshooting at a jump discontinuity, wherein this overshoot does not die out as the frequency increases.
[0041] One technical solution comprises analyzing the audio signal using wavelet transform instead of STFT, and presenting the information in a scalogram instead of a spectrogram. The wavelet transform may be carried out with a variety of scaling parameters and shifting factors, with any required resolution. The resulting scalogram may then be processed using image analysis techniques to identify artifacts such as phoneme edges.
[0042] For example, such analysis may comprise considering one or more horizontal lines of the scalogram, i.e., graphs related to a constant value of a in which the x axis relates to varying values of b and the y axis relates to the intensity.
[0043] One technical effect of the disclosed subject matter relates to retrieving more information from automatically analyzing an audio signal by generating and analyzing a scalogram, based on wavelet transforms of the audio signal. The scalogram provides more information, such as information about transition areas including the phoneme edges, than a spectrogram based on STFT of the audio signal. Using a family of wavelet transforms, having varying shift and scaling parameter values, provides for achieving any required frequency resolution for any time window, thus obtaining high-resolution information associated with varying time frames, such as phoneme edges, the distances between which may vary due to the variations in phoneme lengths.
[0044] Another technical effect of the disclosed subject matter relates to identifying with higher accuracy the phoneme edges when analyzing speech, thus enhancing the speech analysis accuracy.
[0045] Yet another technical effect of the disclosed subject matter relates to identifying the phoneme edges when preparing training sets for speech recognition. Preparing training sets is a labor-intensive task, which may be made easier when using as a starting point text obtained by previous speech analysis techniques or products. By identifying phoneme edges in the training set, transcribing the training set may be done with higher quality and thus require less user intervention.
[0046] Referring now to Fig. 1, showing a prior art method for generating and analyzing a spectrogram of an audio signal.
[0047] On step 100 an audio signal to be analyzed may be received. The signal may be any signal, such as but not limited to an audio signal, or any other signal which may comprise noise.
[0048] On step 104 the input signal may undergo preprocessing intended for improving its quality, assessing the signal quality and discarding noisy parts, removing silent periods, or the like.
[0049] On step 108, the audio signal may undergo Short Term Fourier Transform. The STFT is performed with predetermined parameters, and may thus miss transition effects.
[0050] The complexity of STFT may generally be represented as N * log2N, wherein N is the number of samples contained in a time window.
[0051] On step 112, a spectrogram may be generated from the results of the STFT. The spectrogram may use a color or brightness scale to indicate the amplitude of a particular frequency at a particular point in time.
[0052] On step 116 the spectrogram may be analyzed using any image analysis technique such as edge detection, color analysis, brightness analysis or others, in order to identify substantially vertical areas of lower amplitude, representing a silence period or phoneme edges.
[0053] The retrieved information, such as the phoneme edges may then be used for identifying the phonemes and words contained in the audio signal.
[0054] Referring now to Fig. 2A and Fig. 2B. Fig. 2A shows an exemplary audio signal, wherein the horizontal axis represents the time and the vertical axis represents the amplitude. Fig. 2B shows a corresponding spectrogram created based upon STFT of the audio signal of Fig. 2A. The horizontal axis of the graph shown in Fig. 2B is the time along the signal, the vertical axis is the normalized frequency, and the value at each point is visualized by intensity or darkness of the point. It may be seen that while the phoneme boundaries in the audio signal are noticeable, they are much less noticeable in the spectrogram; they may be seen or extracted using automated tools, but such extraction is not straightforward.
[0055] Referring now to Fig. 3, showing a method of generating and analyzing a scalogram of an audio signal.
[0056] On step 300 an audio signal to be analyzed may be received. The signal may be any signal, such as but not limited to an audio signal, or any other signal which may comprise noise.
[0057] On step 304 the input signal may undergo preprocessing intended for improving its quality, assessing the signal quality and discarding noisy parts, removing silent periods, or the like.
[0058] On step 308, the audio signal may undergo wavelet transform. The wavelet transform may be performed with various values of the scaling parameter, such as between about 1 and about 200, depending on the sampling rate and the sampling window, while the time shift parameter changes in accordance with the time along the sampled signal, using any resolution in the time shift parameter, for example between about 0 mSec and about 2000 mSec.
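The sweep of step 308 over scaling and shift values may be sketched as follows, using the Haar function as an example mother wavelet; the signal, scale range and shift resolution are illustrative choices, not values mandated by the method:

```python
import numpy as np

def haar(t):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1), -1.0, 0.0))

def scalogram(x, fs, scales, shifts):
    """Coefficient matrix: rows sweep the scaling parameter a (seconds of
    wavelet support), columns sweep the shift parameter b (seconds)."""
    t = np.arange(len(x)) / fs
    S = np.empty((len(scales), len(shifts)))
    for i, a in enumerate(scales):
        for j, b in enumerate(shifts):
            S[i, j] = np.sum(x * haar((t - b) / a)) / (np.sqrt(a) * fs)
    return S

fs = 1000
x = np.concatenate([np.sin(2 * np.pi * 50 * np.arange(300) / fs),
                    np.zeros(400)])          # a tone followed by silence
scales = np.arange(1, 20) * 0.01             # illustrative a values
shifts = np.arange(70) * 0.01                # b values along the signal, seconds
S = scalogram(x, fs, scales, shifts)
print(S.shape)                               # (scales, shifts) coefficient grid
```

Coefficients over the silent tail come out near zero at every scale, while the tone region produces nonzero values, so the transition is visible along the shift axis.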
[0059] It will also be appreciated that the wavelet transform may be continuous (CWT) or discrete (DWT), and the wavelet transform may be performed with any appropriate wavelet, such as but not limited to the Haar wavelet in the case of DWT.
[0060] The complexity of performing the wavelet transform is (N-1) x L, wherein N is the number of samples contained in a time window and L is the length of the filter, i.e., the number of coefficients. For example, the Haar wavelet is of length 2 and a Daubechies wavelet is of length 4 or more.
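The two operation counts quoted above may be compared directly; the window size below is an illustrative example:

```python
import math

def stft_ops(n):
    """Operations for STFT of a window of n samples: N * log2(N)."""
    return n * math.log2(n)

def wavelet_ops(n, filter_len):
    """Operations for the wavelet transform: (N-1) x L."""
    return (n - 1) * filter_len

n = 1024
print(stft_ops(n))        # 10240.0 operations
print(wavelet_ops(n, 2))  # 2046 operations for the Haar filter (L = 2)
```

For short filters such as Haar, the wavelet transform of a window is cheaper than the corresponding FFT-based STFT frame.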
[0061] On step 312, a scalogram may be generated from the results of the wavelet transform. The scalogram may use a color or brightness scale to indicate the amplitude of a particular frequency at a particular point in time.
[0062] On step 316 the scalogram may be analyzed using any analysis technique, including image analysis techniques such as edge detection, color analysis, brightness analysis or others, in order to identify substantially vertical areas of lower amplitude, representing a silence period or phoneme edges.
[0063] One possible analysis comprises analyzing different horizontal lines of the scalogram, such as the line shown in Fig. 4D discussed below. Such a horizontal line refers to a constant value of a and varying values of b, and provides, for each combination, the associated intensity. In such a graph the phoneme edges are indicated as those areas in which the intensity decays to zero and then increases again. A multiplicity of such line graphs may be considered, since each phoneme contains different spectral components. For example, graphs with a values varying in low resolution may be selected and analyzed, for example a = 10, 20, ..., 100, and for those a values in which clear boundaries are located, further a values having higher resolution may be considered. For example, if clearer bounds are found for a=50 than for other a values, further a values of 45, 46, 47, ..., 54, 55 may be analyzed.
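The decay-to-zero-then-rise criterion on a single horizontal line may be sketched as follows; the relative threshold, minimum dip width and synthetic row are assumed illustrative values, not parameters prescribed by the method:

```python
import numpy as np

def edge_locations(row, b_values, threshold=0.05, min_width=3):
    """Find phoneme-edge candidates on one scalogram line (fixed a):
    runs of at least `min_width` consecutive points whose magnitude falls
    below `threshold` of the row's peak, with above-threshold values
    both before and after the run (decay, then rise)."""
    low = np.abs(row) < threshold * np.abs(row).max()
    edges = []
    start = None
    for i, flag in enumerate(low):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_width and start > 0:      # decayed, then rose again
                edges.append(b_values[(start + i - 1) // 2])  # midpoint of the dip
            start = None
    return edges

b = np.arange(100) * 0.01                    # shift values in seconds
row = np.concatenate([np.full(40, 1.0),      # one phoneme's energy
                      np.full(10, 0.001),    # near-zero dip at the boundary
                      np.full(50, 0.8)])     # the next phoneme's energy
edges = edge_locations(row, b)
print(edges)                                 # b value at the dip's midpoint
```

The returned b values are the candidate phoneme-edge times on that line; in practice the same scan would be repeated over several a values, as the following paragraphs describe.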
[0064] The areas having the most noticeable decay and rise for one or more particular values of the scaling parameter a indicate the b values which provide the times of the phoneme edges within the original signal.
[0065] Referring now to Figs. 4A-4D. Fig. 4A shows the same exemplary audio signal of Fig. 2A, and Fig. 4B shows a corresponding scalogram created based upon wavelet transforms of the audio signal of Fig. 4A, with a values of 110, 109, 108, ..., 1, and b values within the time range of the signal, i.e., the samples multiplied by the time per sample. The value for each a and b combination is indicated by the color intensity of the corresponding point in the graph.
[0066] The thick vertical lines in Fig. 4B indicate the phoneme boundaries along the time of the signal as marked by a human expert. It may be intuitively seen that the boundaries correspond to areas of change in the scalogram.
[0067] Fig. 4C shows the initial part of the audio signal of Fig. 4A with a larger zoom, in which the boundaries between the phonemes are marked by a human expert. Fig. 4D shows a graph based upon a horizontal line of the scalogram shown in Fig. 4B, relating to an exemplary a value of 105. The horizontal axis of the graph of Fig. 4D is the b values, while the vertical axis relates to the intensity, expressed in Fig. 4B by the color intensity.
[0068] It may be seen that the areas of the graph of Fig. 4D at which the graph amplitude decays to values close to zero and then starts increasing again correspond to the phoneme edges as marked by the human expert. Thus, identifying areas in which the graph amplitude decays and then starts rising again may be used to identify the phoneme edges.
[0069] It will be appreciated that the a value that provides the best results is not known a priori, and may also vary for different phonemes. Since different phonemes have different spectral components, different phonemes may be better separated with different scaling factors. Thus, a multiplicity of line graphs similar to that of Fig. 4D and having various a values may be selected and analyzed for the decay and rise. For example, about 5% to about 20% of the graphs, optionally having substantially equally spaced a values, may be analyzed. One or more of these graphs, exhibiting the clearest fall and rise of the signal amplitude, may be selected, and additional graphs having relatively close a values may be analyzed, and the areas most noticeable in one or more of the graphs may be identified.
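The coarse-then-fine sweep over a values may be sketched as follows; the dip-scoring function and synthetic rows are illustrative stand-ins for the "clearest fall and rise" criterion, and the step and refinement radius are assumed example values:

```python
import numpy as np

def dip_score(row):
    """Score how clearly a line exhibits a fall-and-rise: contrast
    between the row's typical amplitude and its deepest dip."""
    r = np.abs(row)
    return (np.median(r) - r.min()) / (r.max() + 1e-12)

def best_scale(scalogram_rows, a_values, coarse_step=10, refine=4):
    """Coarse sweep over equally spaced a values, then a finer sweep
    in the neighborhood of the coarse winner."""
    coarse = a_values[::coarse_step]
    best_a = max(coarse, key=lambda a: dip_score(scalogram_rows[a]))
    fine = [a for a in a_values if abs(a - best_a) <= refine]
    return max(fine, key=lambda a: dip_score(scalogram_rows[a]))

# Synthetic rows: a=50 has the deepest dip, a=51 (seen in the coarse
# sweep) has a shallower one, all other scales are flat
rows = {a: np.ones(100) for a in range(1, 101)}
rows[51] = np.concatenate([np.ones(45), np.full(10, 0.5), np.ones(45)])
rows[50] = np.concatenate([np.ones(45), np.zeros(10), np.ones(45)])
print(best_scale(rows, list(range(1, 101))))  # refinement recovers a=50
```

The coarse sweep finds the approximate neighborhood of a useful scale, and the fine sweep then locates the a value with the clearest dip, mirroring the two-resolution procedure described in the paragraph above.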
[0070] Referring now to Fig. 5 showing a block diagram of components of an apparatus for improving the results of phoneme identification in an audio signal.
[0071] The apparatus comprises a computing device 500, which may comprise one or more processors 504. Any of processors 504 may be a Central Processing Unit (CPU), a microprocessor, an electronic circuit, an Integrated Circuit (IC) or the like. Alternatively, computing device 500 can be implemented as firmware written for or ported to a specific processor such as a digital signal processor (DSP) or microcontrollers, or can be implemented as hardware or configurable hardware such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC). Processors 504 may be utilized to perform computations required by computing device 500 or any of its subcomponents.
[0072] In some exemplary embodiments of the disclosed subject matter, computing device 500 may comprise MMI module 508. MMI module 508 may be utilized to provide communication between the apparatus and a user for providing input, receiving output or the like.
[0073] In some embodiments, computing device 500 may comprise an input- output (I/O) device 512 such as a terminal, a display, a keyboard, an input device or the like, used to interact with the system, to invoke the system and to receive results.
[0074] Computing device 500 may comprise one or more storage devices 516 for storing executable components, and which may also contain data during execution of one or more components. Storage device 516 may be persistent or volatile. For example, storage device 516 can be a Flash disk, a Random Access Memory (RAM), a memory chip, an optical storage device such as a CD, a DVD, or a laser disk; a magnetic storage device such as a tape, a hard disk, storage area network (SAN), a network attached storage (NAS), or others; a semiconductor storage device such as Flash device, memory stick, or the like. In some exemplary embodiments, storage device 516 may retain program code operative to cause any of processors 504 to perform acts associated with any of the steps shown in Fig. 3 above, for example receiving an audio file, performing a wavelet transform, creating a scalogram, or the like.
[0075] The components detailed below may be implemented as one or more sets of interrelated computer instructions, executed for example by any of processors 504 or by another processor. The components may be arranged as one or more executable files, dynamic libraries, static libraries, methods, functions, services, or the like, programmed in any programming language and under any computing environment. Storage device 516 may comprise or be loaded with one or more of the components, which can be executed on computing platform 500 by any one or more of processors 504. Alternatively, any of the executable components may be executed on any other computing device which may be in direct or indirect communication with computing platform 500.
[0076] Storage device 516 may comprise audio receiving component 524 for receiving audio signals. Audio signals can be received from any capturing device, from a storage device storing previously recorded or generated audio, from another computing device over a communication channel, or from any other source.
[0077] Storage device 516 may also comprise optional audio preprocessing component 528 for preprocessing the received audio, for example removing noise, silent periods, DTMF tones, or the like.
[0078] Yet another component loaded to storage device 516 may be wavelet transform component 532 for receiving the input signal, optionally after it has been preprocessed, optionally receiving ranges or resolutions of the shift or scaling parameters, executing a multiplicity of wavelet transforms, and providing the obtained coefficients.
[0079] The wavelet transform may be continuous or discrete, and may use any relevant mother wavelet function.
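As an illustrative sketch only (no such code appears in the disclosure), a continuous wavelet transform using a Ricker (Mexican hat) mother wavelet may be computed as follows, yielding one row of coefficients per scaling value a and one column per shift value b. The toy two-phoneme signal, the scale range, and the choice of mother wavelet are all assumptions made for demonstration:

```python
import numpy as np

def ricker(points, a):
    """Ricker (Mexican hat) mother wavelet, sampled at `points` positions for scale a."""
    t = np.arange(points) - (points - 1) / 2.0
    x = t / a
    return (2 / (np.sqrt(3 * a) * np.pi ** 0.25)) * (1 - x ** 2) * np.exp(-x ** 2 / 2)

def cwt(signal, scales):
    """Continuous wavelet transform: one row of coefficients per scaling value a,
    one column per shift value b (obtained via convolution with the scaled wavelet)."""
    coeffs = np.empty((len(scales), len(signal)))
    for i, a in enumerate(scales):
        n = min(10 * int(a), len(signal))          # support grows with the scale
        coeffs[i] = np.convolve(signal, ricker(n, a), mode="same")
    return coeffs

# Toy input: two "phonemes" (bursts of different frequencies) separated by silence.
t = np.linspace(0, 1, 2000)
sig = (np.where(t < 0.45, np.sin(2 * np.pi * 50 * t), 0.0)
       + np.where(t > 0.55, np.sin(2 * np.pi * 120 * t), 0.0))
scales = np.arange(1, 31)
coeffs = cwt(sig, scales)       # shape (30, 2000): one row per scaling value
scalogram = np.abs(coeffs)      # the value visualized per (scale, shift) point
```

The matrix of absolute coefficient values is the data a scalogram such as the one of Fig. 4B visualizes via color or brightness; the coefficients collapse toward zero over the silent gap between the two bursts.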
[0080] Storage device 516 may also comprise scalogram generation component 536 for generating a scalogram from the obtained coefficients.
[0081] The scalogram may use color or brightness to indicate the relevant value for each scaling and shift combination. However, scalogram generation component 536 may be omitted if only certain lines of the scalogram are analyzed.
[0082] Yet another component loaded to storage device 516 may be wavelet coefficient analysis component 540 for analyzing the wavelet coefficients, which may be part of the scalogram generated by scalogram generation component 536. Wavelet coefficient analysis component 540 may, for example, analyze values along one or more lines, each having a constant value of the scaling parameter a. Analysis may be targeted at locating phoneme edges by identifying areas in which the signal decays to zero or near-zero amplitude and then starts rising.
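One possible reading of the edge-locating analysis above, sketched in code: scan a single line of coefficients taken at a fixed scaling value, and report positions where the magnitude decays to near zero for some minimum duration and then rises again. The relative threshold and minimum-gap parameters below are arbitrary illustrative choices, not values taken from the disclosure:

```python
import numpy as np

def find_edges(line, rel_threshold=0.05, min_gap=20):
    """Locate candidate phoneme edges on one scalogram line (fixed scaling value a):
    midpoints of runs where |coefficient| decays below a relative threshold for at
    least `min_gap` samples and then rises again."""
    mag = np.abs(line)
    quiet = mag < rel_threshold * mag.max()
    edges = []
    i = 0
    while i < len(quiet):
        if quiet[i]:
            j = i
            while j < len(quiet) and quiet[j]:
                j += 1
            # A long quiet run followed by a rise marks a boundary between phonemes.
            if j - i >= min_gap and j < len(quiet):
                edges.append((i + j) // 2)
            i = j
        else:
            i += 1
    return edges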
[0083] It will be appreciated that the obtained coefficients, the generated scalogram or the information retrieved from the scalogram may be stored on storage device 516 or on any other storage device.
[0084] Storage device 516 may also comprise additional audio analysis components 544 for retrieving additional information from the input audio signal, such as recognizing the phonemes, recognizing words, extracting additional information such as emotion, or any other analysis.
[0085] Storage device 516 may also comprise data and control flow management component 548 for invoking and passing information between the various components, for example receiving results for various values of the scaling parameter a and determining the values for further graphs to be analyzed, or the like.
[0086] Referring now to Figs. 6A, 6B and 6C, Fig. 6A is a graph representing the audio signal of Figs. 2A and 4A, but with a lower SNR, i.e., a higher level of noise.
[0087] Fig. 6B is a spectrogram created based upon the short-time Fourier transform (STFT) of the signal of Fig. 6A. It can be seen that the edges in the spectrogram shown in Fig. 6B are less noticeable and are harder to extract than those of Fig. 2B, which represents the spectrogram based upon the signal with lower noise. Thus, a spectrogram is more vulnerable to noise, and the ability to detect information such as phoneme edges from a spectrogram degrades as the noise level increases.
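For reference, the STFT-based magnitude spectrogram underlying a figure such as Fig. 6B can be sketched as follows; the window and hop sizes here are illustrative assumptions, not parameters taken from the disclosure:

```python
import numpy as np

def stft_spectrogram(signal, win=256, hop=64):
    """Magnitude spectrogram via a windowed short-time Fourier transform (STFT):
    one column per analysis frame, one row per frequency bin."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1)).T  # (freq bins, frames)
```

For example, a 50 Hz tone sampled at 2000 Hz produces a spectrogram whose energy concentrates near the expected frequency bin (50 Hz / (2000 Hz / 256) ≈ bin 6) in every frame.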
[0088] Fig. 6C shows a scalogram based upon the audio signal of Fig. 6A. It may be seen that the scalogram of Fig. 6C is only insignificantly less clear than the scalogram of Fig. 4B, which was based upon the signal with the higher SNR.
[0089] It will thus be appreciated that the scalogram based on wavelet transforms of the audio signal is more robust and less vulnerable to noise than the spectrogram based upon the STFT of the audio signal, and provides better results also when the signal is noisier.

[0090] It will be appreciated that the disclosed method and apparatus may be used for detecting edges of phonemes in a received audio signal during speech recognition, whether online as the audio signal is being received, or offline when receiving a pre-captured audio signal.
[0091] Alternatively or additionally, the method and apparatus may be used for detecting phoneme edges when constructing a training set for speech recognition or other audio analysis tasks. The method may also be used for retrieving values or value ranges for the wavelet transform shift and scaling parameters, which provide satisfactory or optimal results, such that when recognizing speech the relevant parameters may be used for improving the obtained results. The method may also be used for improving phoneme edges received from other sources or alternatively providing initial phoneme edges to be improved by another system or by a user.
[0092] It will be appreciated that the disclosed method and apparatus can be used, mutatis mutandis, for other signals or purposes, such as medical signals, musical signals, or others. It will be appreciated that the disclosed method and apparatus can be used for detecting phenomena other than phoneme edges, such as a QRS complex or its duration in an ECG signal, a musical rhythm, or others, and that the method and apparatus may also be applied towards processing signals other than audio signals.
[0093] The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart and some of the blocks in the block diagrams may represent a module, segment, or portion of program code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
[0094] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0095] As will be appreciated by one skilled in the art, the disclosed subject matter may be embodied as a system, method or computer program product. Accordingly, the disclosed subject matter may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, the present disclosure may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
[0096] Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, any non-transitory computer-readable medium, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, and the like.
[0097] Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on a first computer, partly on the first computer, as a stand-alone software package, partly on the first computer and partly on a second computer or entirely on the second computer or server. In the latter scenario, the second computer may be connected to the first computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
[0098] The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The embodiment was chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

Claims

CLAIMS

What is claimed is:
1. A computer-implemented method performed by a computerized device, comprising: receiving a signal; performing wavelet transform of the signal to obtain a multiplicity of coefficients, using a multiplicity of values of a scaling parameter and a multiplicity of values for a shift parameter; and analyzing at least one graph providing coefficients related to a fixed scaling parameter value and varying shift parameter values to obtain a location associated with a phenomenon.
2. The computer-implemented method of Claim 1, further comprising creating a scalogram having a first axis associated with the scaling parameter, and a second axis associated with the shifting parameter, such that a value of the scalogram for each point associated with a scaling value and a shifting value is associated with the coefficient obtained for the scaling value and the shifting value, wherein the at least one graph represents a line in the scalogram.
3. The computer-implemented method of Claim 1, wherein the signal is an audio signal.
4. The computer-implemented method of Claim 3 wherein the phenomenon is phoneme edges.
5. The computer-implemented method of Claim 1 wherein the phenomenon is detected by identifying areas in the graph in which the signal amplitude decays substantially to zero and then rises.
6. The computer-implemented method of Claim 1 wherein the wavelet transform is continuous wavelet transform.
7. The computer-implemented method of Claim 1 wherein the wavelet transform is discrete wavelet transform.
8. The computer-implemented method of Claim 2 further comprising performing additional audio analysis activities on the signal.
9. The computer-implemented method of Claim 1 wherein the method is performed online as the signal is being captured.
10. The computer-implemented method of Claim 1 wherein the method is performed offline for a pre-captured signal.
11. The computer-implemented method of Claim 1 wherein the method is performed as part of preparing a training set.
12. An apparatus having a processing unit and a storage device, the apparatus comprising: a signal receiving component for receiving a signal; a wavelet transform component for transforming the signal to obtain a multiplicity of coefficients, using a multiplicity of values of a scaling parameter and a shift parameter; and a wavelet coefficient analysis component for analyzing at least one graph providing coefficients related to a fixed scaling parameter value and varying shift parameter values to obtain a location associated with a phenomenon.
13. The apparatus of Claim 12, further comprising a scalogram generation component for creating a scalogram having a first axis associated with the scaling parameter, and a second axis associated with the shifting parameter, such that a value of the scalogram for each point associated with a first scaling value and a first shifting value is associated with the coefficient obtained for the first scaling value and the first shifting value, wherein the at least one graph represents a line in the scalogram.
14. The apparatus of Claim 12, wherein the signal is an audio signal.
15. The apparatus of Claim 12 wherein the phenomenon is phoneme edges.
16. The apparatus of Claim 12 wherein the wavelet transform is continuous wavelet transform.
17. The apparatus of Claim 12 wherein the wavelet transform is discrete wavelet transform.
18. The apparatus of Claim 12 further comprising an additional audio analysis component for performing additional audio analysis actions on the signal.
19. The apparatus of Claim 12 wherein the apparatus is activated as part of preparing a training set.
20. A computer program product comprising: a non-transitory computer readable medium; a first program instruction for receiving a signal; a second program instruction for performing wavelet transform of the signal to obtain a multiplicity of coefficients, using a multiplicity of values of a scaling parameter and a multiplicity of values for a shift parameter; and a third program instruction for analyzing at least one graph providing coefficients related to a fixed scaling parameter value and varying shift parameter values to obtain a location associated with a phenomenon; wherein said first, second, and third program instructions are stored on said non-transitory computer readable medium.
PCT/IL2013/051014 2013-01-09 2013-12-09 Method and apparatus for phoneme separation in an audio signal WO2014108890A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361750479P 2013-01-09 2013-01-09
US61/750,479 2013-01-09

Publications (1)

Publication Number Publication Date
WO2014108890A1 true WO2014108890A1 (en) 2014-07-17

Family

ID=51166593

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2013/051014 WO2014108890A1 (en) 2013-01-09 2013-12-09 Method and apparatus for phoneme separation in an audio signal

Country Status (1)

Country Link
WO (1) WO2014108890A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112420075A (en) * 2020-10-26 2021-02-26 四川长虹电器股份有限公司 Multitask-based phoneme detection method and device
US11205442B2 (en) * 2019-03-18 2021-12-21 Electronics And Telecommunications Research Institute Method and apparatus for recognition of sound events based on convolutional neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6253175B1 (en) * 1998-11-30 2001-06-26 International Business Machines Corporation Wavelet-based energy binning cepstal features for automatic speech recognition
US20030086341A1 (en) * 2001-07-20 2003-05-08 Gracenote, Inc. Automatic identification of sound recordings
US20110270904A1 (en) * 2010-04-30 2011-11-03 Nellcor Puritan Bennett Llc Systems And Methods For Estimating A Wavelet Transform With A Goertzel Technique

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 13870582; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 09/11/2015))
122 Ep: pct application non-entry in european phase (Ref document number: 13870582; Country of ref document: EP; Kind code of ref document: A1)