DE19629946A1

DE19629946A1 - LPC analysis and synthesis method for basic frequency descriptive functions

Info

Publication number: DE19629946A1
Application number: DE19629946A
Authority: DE
Inventors: Joachim Dipl Ing Mersdorf; Marc Brueggen; Ansgar Dipl Ing Rinscheid
Original assignee: Individual
Current assignee: Individual
Priority date: 1996-07-25
Filing date: 1996-07-25
Publication date: 1998-01-29

Abstract

The analysis method uses filter parameterisation and a simplified approximation of the residual signal. LPC analysis is applied to functions that represent a fundamental frequency contour. The synthesis method generates fundamental frequency contours from residual signals by LPC residual synthesis, using the filter parameters and the residual signal approximations obtained in the LPC analysis.

Description

Field of use

The method serves the parametric analysis and the synthetic generation of fundamental frequency contours of speech, for both natural and synthetic speech signals. The fundamental frequency contour of speech, referred to below as the F0 contour or F0 curve, is treated as a low-frequency time signal that describes the speech intonation over a certain period of time. On the one hand, the method permits an automatic analysis of fundamental frequency data with respect to particular intonational accents by approximating the LPC residual signal. On the other hand, it permits a flexible synthesis of intonation contours for controlling the fundamental frequency of speech output, using the LPC filter parameters and a suitable excitation function based on the approximated LPC residual signal. The method can be used for intonation control in speech synthesis, for manipulating the fundamental frequency in speech signals, for the intonational analysis of speech signals, or for data-reduced parametric coding of complex intonation contours, e.g. in automatic announcement systems.

State of the art and its disadvantages

Existing systems for generating synthetic speech have reached a high degree of intelligibility. Their user acceptance and range of applications, however, also depend directly on the naturalness achieved. This is currently still very low in general and can largely be attributed to prosodic deficiencies. There is therefore still a great need for knowledge, data and methods for generating the prosodic properties of speech.

An essential feature of prosody is intonation. Suitable intonation control, however, requires complex knowledge of how intonation is used in natural language, as well as suitable methods for generating intonation contours in synthetic speech. Methods that permit both a signal-based prosodic analysis and a synthesis are scarcely available.

On the analysis side, reduction and statistical processing of large volumes of speech signal data is absolutely necessary. At present this processing is often still done by laborious manual or semi-automatic phonetic transcription. That is, on the basis of a graphical representation of a computed fundamental frequency contour, prosodic accents are assigned in various linguistic units and levels, usually in the form of a symbolic description. These transcriptions, which follow recognized linguistic-phonetic models (Silverman, K. et al., TOBI: A Standard for Transcribing English Prosody, in: Proceedings of International Conference of Speech and Natural Language Processing, 1992), require extensive phonetic knowledge and experience. Rules, models or databases for controlling synthesis systems are then derived from the symbolically transcribed data.

Frequently, however, the computation of a fundamental frequency contour, i.e. the conversion of the phonetic transcription into suitable F0 contours, also causes some difficulties. Even shifting individual intonational syllable, word or sentence accents is laborious and involves considerable modelling and computational effort, or leads to nonlinear optimization problems. Yet it is precisely this point that can, on the one hand, enable a better understanding of intonation and, on the other, simplify the synthesis of contours.

For generating F0 contours, the model of Fujisaki et al. has been applied successfully (Fujisaki, H.: Dynamic Characteristics of Voice Fundamental Frequency in Speech and Singing, in: MacNeilage (Ed.): The Production of Speech, Springer, New York 1983, pp. 39-55; B. Möbius et al., Analysis and Synthesis of German F0 Contours by Means of Fujisaki's Model, Speech Communication 13, North-Holland 1993, pp. 53-61). It is based on rectangular and impulse-shaped excitation functions. These excitation functions drive several second-order recursive filters, at whose weighted output a fundamental frequency contour is then available. In the Fujisaki model, an accent can be shifted quite simply by shifting the corresponding excitation.

A method according to 't Hart, Adriaens et al. ('t Hart, J., Collier, J., Cohen, A., A Perceptual Study of Intonation, Cambridge Studies in Speech Science and Communication, Cambridge University Press, 1990) leads in a very simple way to so-called copy contours. A reduced description of the contour is produced by piecewise straight-line approximation.

The two models presented, approximation by straight-line segments ('t Hart et al.) and approximation by filter responses (Fujisaki et al.), have the following disadvantages:

In the Fujisaki model the approximation procedure is laborious, because the approximation is an analysis-by-synthesis system: starting from an initial configuration of the excitation and a base value, the parameters have to be improved step by step by nonlinear optimization.

A disadvantage of the line model is that changes to the F0 contour, e.g. by shifting accents, are difficult compared with the Fujisaki model. That is, temporal adjustment is laborious.

Many of the parameterizations that have been implemented are thus either hard to generate because they are too complex or, like the straight-line segments, not very well suited to modification.

Object of the invention

Starting from the idea that intonation is controlled, whether consciously or unconsciously, the method described here serves to extract this control from the fundamental frequency contour of speech signals by means of linear prediction, to simplify it suitably and then to resynthesize it. A particular advantage is that no a priori knowledge of the system or model is required, apart from the assumption of a source-filter model for intonation, and yet a parametric model is generated. Initially, therefore, the model is motivated neither by speech physiology nor by speech psychology, and it does not take into account the form in which intonation is actually controlled in reality.

Advantages

Known methods of fundamental frequency analysis and synthesis either lack the necessary reliability of mapping (individual manual transcriptions are often used) or require an impracticable optimization and computational effort. The new method makes it possible to analyse, transform and parameterize large volumes of speech data automatically and reproducibly.

The essential idea of the invention is to regard the error signal of the LPC analysis as the control signal of the intonation. A simply described control is in many respects more advantageous than a direct description of the F0 contour. First, the control is easier to analyse because it contains less information than the complete F0 contour. Above all, however, the control is easier to manipulate (e.g. with respect to its temporal sequence). To this end, the error signal is suitably approximated and summarized in simplified form.

The parametric description and the approximated excitation function can be obtained very simply by LPC analysis of a fundamental frequency contour. More complex fundamental frequency contours (e.g. at sentence level) can be generated very simply by filtering simplified excitation functions. Compared with the known methods, the synthesis filter parameters are obtained in a very simple manner.

By suitably approximating the remaining residual signal, the laborious, model-dependent detour via a symbolic transcription can, where appropriate, also be reduced.

A further major advantage is a considerable data reduction through parameterization and approximation of the residual signal. Complex fundamental frequency contours can be reduced to a few features and parameters.

Shifts and temporal adjustments of intonational accents become very easy, simply by shifting the approximated excitation functions in time.

The excitation signals are excellently suited to editing and can also make it easier to interpret intonational features.

Solution of the task

The fundamental frequency contours represent the periodicity of voiced speech signals as a function of time, F0(t) (Fig. 3). The fundamental frequency contour may have been determined either by digital or analog signal processing or by manual phonetic transcription (Fig. 2.1). The input data of the method merely have to represent, in some way, the time course of the fundamental frequency of speech or of a quantity correlated with it (e.g. the time course of the periodicity, (sub)harmonics of F0, or their mixing products). In general, the method also works for n-dimensional input quantities.

A sequence of period marks represents an optimal sampling (Fig. 3), since each period of a speech signal is reconstructed unambiguously. In preprocessing, however, interpolation (e.g. cubic spline interpolation) is used to obtain a sufficiently continuous and smooth F0 contour between the F0 samples and, in particular, between the voiced signal sections (Fig. 3 & 1.1). That is, unvoiced speech signal sections can be given a virtual fundamental frequency or periodicity by interpolation (Fig. 2.2, Fig. 1.1). Generating an equidistant sample sequence is practical but not an absolute prerequisite for the principle of the method. Cutting out the unvoiced signal sections is also possible but not advisable, since it alters the temporal structure and compresses the sequence.

The result is a continuous, differentiable sample sequence of a low-frequency time signal (Fig. 2.2). The LPC algorithm is now applied to this sequence (Fig. 1.1) (Fig. 1.2 & 2.3).

In principle the method works for arbitrary, including non-equidistant, sample sequences, as long as no information is lost as a result, e.g. by violating the Nyquist condition. In principle the method works for temporal units of a fundamental frequency signal of any size, provided that suitable prosodic units are captured and a meaningful prediction of the parameters is possible.

The LPC order, the frame length and the parameters describing the system are in principle arbitrary or variable, provided that suitable prosodic units are captured and a meaningful prediction of the parameters is possible.

Next, the information must be reduced by approximating the LPC residual signal (Fig. 1.3 & 2.4 & 4). The approximation, i.e. the representation, simplification and further processing of the residual signal by signal processing methods (e.g. filtering, smoothing, interpolation, nonlinearities, ...), is in principle arbitrary. The approximated LPC residual signal represents the "intonational excitation". That is, given a suitable approximation, the temporal shape of the residual signal can reflect, among other things, the position and size of intonational gestures such as main and secondary accents or the focus.

In principle it is by no means fixed a priori what nature the analysis and synthesis filters must have (Fig. 5 & 6 & 7). Nor is it evident whether different filter coefficients obtained from different signal sections originate in general intonation or, for example, in speaker-specific features. This depends on the contours and on the time units considered. A necessary prerequisite for the LPC analysis is that the time sections carry enough information, i.e. contain enough samples.

With a view to simple synthesis, filters that are as time-invariant as possible are desirable. The LPC resynthesis step proceeds correspondingly in reverse (Fig. 2.5). The approximated residual signal (Fig. 1.3) describes the excitation and is applied to the synthesis filter. The synthesis (Fig. 2.5) also works in principle for arbitrary, including non-equidistant, sample sequences.

In principle the method works for temporal units of a fundamental frequency signal of any size, provided that suitable prosodic units are captured.

The LPC order, the frame length and the order of magnitude of the parameters describing the system derive from the analysis described above, but are in principle arbitrary and variable, provided that the filters are stable and suitable prosodic units can be generated. The result is a sample sequence of a low-frequency time signal that describes an approximated fundamental frequency contour (Fig. 1.4).

Description of an embodiment

The method will now be described with reference to a concrete example that has been tested. The fundamental frequency contour is extracted from a speech signal in the form of period marks. From the reciprocals of the periods, a non-equidistantly sampled sequence of fundamental frequency values F0(tν) can first be computed (Fig. 3). The resulting F0 contour is conditioned and smoothed by median filtering, in particular at the boundaries between unvoiced and voiced sections. Outliers can be detected and removed using adapted threshold values for the minimum and maximum F0.
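The preprocessing steps named above (reciprocals of the periods, threshold-based outlier removal, median filtering) can be sketched roughly as follows. This is only an illustration in Python: the function names, the thresholds of 50 Hz and 500 Hz and the median window of 3 are assumptions and are not taken from the patent.

```python
from statistics import median


def f0_from_period_marks(marks):
    """Turn period marks (seconds) into (times, f0) with one F0 value per period."""
    times, f0 = [], []
    for start, end in zip(marks, marks[1:]):
        period = end - start
        if period > 0:
            times.append(start)          # time stamp of the period start
            f0.append(1.0 / period)      # F0 is the reciprocal of the period
    return times, f0


def remove_outliers(times, f0, f0_min=50.0, f0_max=500.0):
    """Drop F0 values outside [f0_min, f0_max]; the limits are illustrative."""
    kept = [(t, f) for t, f in zip(times, f0) if f0_min <= f <= f0_max]
    return [t for t, _ in kept], [f for _, f in kept]


def median_filter(values, width=3):
    """Median filter whose window shrinks at the edges (an assumed edge rule)."""
    half = width // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]
```

A call such as `median_filter(remove_outliers(*f0_from_period_marks(marks))[1])` yields a cleaned, still non-equidistant F0 sequence that can then be interpolated.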

The fundamental frequency contour is now interpolated using cubic splines (Fig. 1.1 & 2.2) and the interpolated contour is sampled equidistantly, so that the LPC analysis can be carried out without further adaptation (Fig. 2.3).
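As an illustration of this step, the following Python sketch interpolates non-equidistant F0 samples with a cubic spline and resamples the result on an equidistant grid. The patent does not state the boundary conditions of the spline; natural boundary conditions (zero second derivative at both ends) are assumed here, and all function names are hypothetical.

```python
from bisect import bisect_right


def natural_spline_second_derivs(x, y):
    """Second derivatives of a natural cubic spline at the knots (Thomas algorithm)."""
    n = len(x)
    m = [0.0] * n                       # natural spline: m[0] = m[-1] = 0
    if n < 3:
        return m
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)]
    rhs = [6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
           for i in range(1, n - 1)]
    for k in range(1, len(diag)):       # forward elimination
        w = h[k] / diag[k - 1]
        diag[k] -= w * h[k]
        rhs[k] -= w * rhs[k - 1]
    sol = [0.0] * len(diag)
    sol[-1] = rhs[-1] / diag[-1]
    for k in range(len(diag) - 2, -1, -1):   # back substitution
        sol[k] = (rhs[k] - h[k + 1] * sol[k + 1]) / diag[k]
    m[1:n - 1] = sol
    return m


def spline_eval(x, y, m, t):
    """Evaluate the spline at time t (the end segments are used outside the knots)."""
    i = min(max(bisect_right(x, t) - 1, 0), len(x) - 2)
    h = x[i + 1] - x[i]
    a, b = x[i + 1] - t, t - x[i]
    return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h)
            + (y[i] / h - m[i] * h / 6.0) * a
            + (y[i + 1] / h - m[i + 1] * h / 6.0) * b)


def resample_equidistant(x, y, step=0.006):
    """Sample the spline every `step` seconds, starting at the first knot."""
    m = natural_spline_second_derivs(x, y)
    count = int((x[-1] - x[0]) / step + 1e-9) + 1
    grid = [x[0] + k * step for k in range(count)]
    return grid, [spline_eval(x, y, m, t) for t in grid]
```

The default step of 6 ms matches the sampling period used in the example; any other step that respects the bandwidth of the contour would serve equally well.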

In the example the sampling period is Tp = 6 ms (fs = 166 Hz), which of course constitutes oversampling, since the fundamental frequency contour signals considered contain spectral components only up to about 5 Hz. Smoothing is then performed by a non-causal averaging filter (FIR filter) of order n = 11.
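A non-causal averaging filter can be sketched as a moving average centred on the current sample, so that the smoothed contour is not delayed. In the Python sketch below, n = 11 is read as an 11-sample symmetric window, and the window is shortened at the signal edges; both points are assumptions, since the patent does not specify them.

```python
def centered_moving_average(x, taps=11):
    """Non-causal (centred) moving average; the window shrinks at the edges."""
    if taps % 2 == 0:
        raise ValueError("taps must be odd for a symmetric window")
    half = taps // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]   # samples before and after i
        out.append(sum(window) / len(window))
    return out
```

Because the window uses future as well as past samples, an accent peak keeps its position in time after smoothing, which matters when the residual is later interpreted in terms of accent positions.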

The preprocessing carried out up to this point prepares for the LPC analysis in the sense that a continuous, smooth and equidistantly sampled function is produced (Fig. 1.1). These steps are necessary so that linear prediction can be applied without further adaptation. That is, the contour now to be analysed can be treated like any low-frequency time signal.

Pre-emphasis, taking the logarithm and subtraction of constant values are also possible as further preprocessing of the resulting contour. In the example, only the first value is subtracted from all samples, so that the analysis filter does not first have to settle.

A fourth-order LPC analysis is now carried out on the complete F0 contour (Fig. 2.3). A prediction order of n = 4 proved advantageous for the analysis-resynthesis system. A higher order is possible in principle but not necessary, since it neither reduces the energy of the error signal nor makes it less correlated. The order may depend on the speech material used or on the speech units considered.
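The patent does not say which LPC algorithm is used. As one common possibility, the Python sketch below applies the autocorrelation method with the Levinson-Durbin recursion to the whole contour. The function names and the predictor convention x̂(n) = a1·x(n-1) + ... + ap·x(n-p) are assumptions made for this illustration.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation sums r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations; return (coefficients a1..ap, error energy)."""
    a = [0.0] * (order + 1)             # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:                  # degenerate (e.g. all-zero) input
            break
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err                 # reflection coefficient
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], err


def lpc(x, order=4):
    """LPC analysis of a whole contour, order 4 by default as in the example."""
    return levinson_durbin(autocorrelation(x, order), order)
```

Calling `lpc(contour, 4)` on the preprocessed contour returns four predictor coefficients and the remaining prediction error energy, which can be compared across orders to check the statement that higher orders bring no further reduction.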

The original error signal e(n) now produced by the analysis (Fig. 1.2) will, when applied to a filter 1/H(z) that is the inverse of the analysis filter H(z), resynthesize the original interpolated F0 contour. In the following, the error signal is referred to as the excitation or residual signal.
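The relationship between the analysis filter and its inverse can be made concrete with a short Python sketch. It assumes the usual convention H(z) = 1 - a1·z^-1 - ... - ap·z^-p, which the patent does not spell out, and zero initial conditions; the function names are hypothetical.

```python
def analysis_filter(x, a):
    """e(n) = x(n) - sum_k a[k-1] * x(n-k): prediction error of the contour."""
    e = []
    for n in range(len(x)):
        pred = sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        e.append(x[n] - pred)
    return e


def synthesis_filter(e, a):
    """x(n) = e(n) + sum_k a[k-1] * x(n-k): the inverse filter 1/H(z)."""
    x = []
    for n in range(len(e)):
        pred = sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        x.append(e[n] + pred)
    return x
```

Feeding the unmodified residual into `synthesis_filter` with the same coefficients reproduces the input contour up to rounding. Feeding a simplified excitation instead gives an approximated contour, which is the use the method makes of the synthesis filter.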

The resulting residual signal e(n) is a noise signal. Initially it is therefore about as difficult to describe as a fundamental frequency curve itself. A control signal would thus have been obtained, but not one that is easy to describe. For this reason, an approximated error signal p(n) is to be formed from the error signal e(n) (Fig. 1.3, 2.4 and 4) in such a way that, after resynthesis with 1/H(z), this approximation again yields a good representation of the FO curve. To this end, the error signal is low-pass filtered (difference equation: y(n) = x(n) + 0.8 * y(n-1)) or a moving average (e.g. n = 11) is applied. The sign of the smoothed version is then examined (Fig. 4). The stronger the smoothing, the smaller the number of rectangles, and the contours become correspondingly worse. The number of resulting rectangles can be influenced through the minimum length, the minimum amplitude and the degree of smoothing. The approximation is now easier to describe and can also be edited.
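The smoothing and sign examination can be sketched as follows. The low-pass recursion is the difference equation given in the text. Everything else is an assumption made for illustration: that each run of constant sign in the smoothed signal becomes one rectangle, that its amplitude is the mean of the residual over that run, and the names `min_length` and `min_amplitude` together with the sample values.

```python
def lowpass(signal, feedback=0.8):
    """First-order recursion from the text: y(n) = x(n) + feedback * y(n-1)."""
    out, prev = [], 0.0
    for x in signal:
        prev = x + feedback * prev
        out.append(prev)
    return out


def rectangles_from_sign(residual, smoothed, min_length=2, min_amplitude=0.0):
    """Turn runs of constant sign in `smoothed` into (start, length, amplitude).

    Assumption: the amplitude is the mean of the residual over the run, so the
    rectangle has the same area as the residual in that section.
    """
    rects, start, n = [], 0, len(smoothed)
    for i in range(1, n + 1):
        if i == n or (smoothed[i] >= 0) != (smoothed[start] >= 0):
            length = i - start
            amplitude = sum(residual[start:i]) / length
            if length >= min_length and abs(amplitude) >= min_amplitude:
                rects.append((start, length, amplitude))
            start = i
    return rects


def render(rects, n_samples):
    """Build p(n): rectangle amplitude inside rectangles, zero elsewhere."""
    p = [0.0] * n_samples
    for start, length, amplitude in rects:
        for i in range(start, start + length):
            p[i] = amplitude
    return p


# Hypothetical residual values, for illustration only.
residual = [1.0, 3.0, 2.0, -1.0, -2.0, -3.0, 0.5, 4.0, 1.5, 2.0]
smoothed = lowpass(residual)
rects = rectangles_from_sign(residual, smoothed, min_length=2, min_amplitude=1.0)
p = render(rects, len(residual))
```

Raising `min_length` or `min_amplitude` discards more runs and leaves more of p(n) at zero, which is the trade-off between a compact description and contour quality noted above.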

In the example, the approximation p(n) of the residual signal e(n) is modeled by sections of constant amplitude, i.e. rectangles. At points where no rectangles are found, the approximation is set to zero (Fig. 1.3).

For this purpose it is useful to allow a relatively large number of rectangles at first and then to simplify the rectangle sequence further. First, successive rectangles of the same sign are sought. If they lie very close together, they can be combined into a single rectangle. Successive rectangles of opposite sign can be deleted, since they often cancel each other out. All rectangles are assigned priorities derived from their area and position. Initial or final rectangles can be assigned higher priorities. These steps can be carried out automatically and iteratively.
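Two of these simplification steps, merging close rectangles of the same sign and ranking rectangles by priority, might look like the sketch below. The area-preserving merge rule, the priority formula (absolute area, weighted more heavily for the first and last rectangle) and the parameters `max_gap`, `edge_boost` and `count` are assumptions made for illustration; the text does not fix them. Deleting cancelling rectangles of opposite sign is not covered here.

```python
def merge_close(rects, max_gap=2):
    """Merge successive same-sign rectangles separated by at most max_gap samples.

    Rectangles are (start, length, amplitude). The merged rectangle spans both
    originals and keeps their combined area (an assumption of this sketch).
    """
    merged = []
    for start, length, amplitude in sorted(rects):
        if merged:
            p_start, p_length, p_amplitude = merged[-1]
            gap = start - (p_start + p_length)
            if gap <= max_gap and (p_amplitude > 0) == (amplitude > 0):
                area = p_amplitude * p_length + amplitude * length
                new_length = start + length - p_start
                merged[-1] = (p_start, new_length, area / new_length)
                continue
        merged.append((start, length, amplitude))
    return merged


def keep_top(rects, count, edge_boost=2.0):
    """Keep the `count` rectangles with the highest priority, in time order."""
    def priority(index):
        _, length, amplitude = rects[index]
        weight = edge_boost if index in (0, len(rects) - 1) else 1.0
        return weight * abs(amplitude) * length

    best = sorted(range(len(rects)), key=priority, reverse=True)[:count]
    return [rects[i] for i in sorted(best)]


# Hypothetical rectangle sequence, for illustration only.
rects = [(0, 4, 2.0), (5, 3, 1.0), (12, 2, -3.0), (20, 5, 0.5), (30, 4, 1.5)]
merged = merge_close(rects, max_gap=2)
kept = keep_top(merged, count=3)
```

Both functions can be applied repeatedly, for example merging with a growing `max_gap` until the desired number of rectangles remains.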

The LPC filter parameters obtained for the speech material examined here have a strongly integrating effect (Fig. 7). The influence of the waveform of the excitation signal is therefore negligible; what matters is the area of the excitation function. The replacement by rectangles carried out in the example is thus probably the simplest form.
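The claim that the area matters more than the waveform can be checked numerically. The sketch below uses a first-order recursive filter with a pole close to 1 as a stand-in for a strongly integrating synthesis filter. The filter, its coefficient of 0.98 and the two pulse shapes are assumptions chosen for illustration, not the filters from Fig. 7.

```python
def integrate_leaky(excitation, pole=0.98):
    """Strongly integrating first-order filter: y(n) = e(n) + pole * y(n-1)."""
    out, prev = [], 0.0
    for e in excitation:
        prev = e + pole * prev
        out.append(prev)
    return out


n_samples = 20
# Two excitation pulses with the same area (6.0) but different shapes.
rectangle = [1.5, 1.5, 1.5, 1.5] + [0.0] * (n_samples - 4)
triangle = [1.0, 2.0, 2.0, 1.0] + [0.0] * (n_samples - 4)

rect_out = integrate_leaky(rectangle)
tri_out = integrate_leaky(triangle)

# Relative difference between the two outputs once both pulses have ended.
relative_difference = abs(rect_out[-1] - tri_out[-1]) / abs(rect_out[-1])
```

After both pulses have ended, the two outputs differ only by a small fraction of their value, while during the pulses they differ visibly. The closer the pole is to 1, the smaller the remaining difference.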

For the speech material analyzed in the example (short declarative and interrogative sentences), the filter coefficients vary very little (Fig. 5). Larger variances arise only for the third and fourth PARCOR coefficients (Fig. 6), and these are not perceptually relevant either. Likewise, only small variances exist between the filter coefficients of male and female contours, or between linearly and logarithmically scaled contours, and above all they carry no perceptual weight.

The FO curve is then resynthesized using the filter parameters obtained from the LPC analysis (Fig. 1.4 and 2.5). The resynthesized FO contour can subsequently be imposed on the original or a synthetic speech signal, for example by means of the PSOLA technique (Fig. 2.6). In this way the intonation, for example, can be compared and assessed perceptually. A synthesized FO curve is considered successfully generated if it cannot be distinguished perceptually from the originally analyzed curve.

Listening tests showed that as few as 4 to 6 rectangles are sufficient to generate a perceptually identical and completely natural contour. The rectangles are described solely by their position, duration and amplitude.
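How compact this description is can be seen in a minimal sketch that generates a contour from rectangle parameters alone. The `Rectangle` type, the four rectangles, the single filter coefficient and the number of samples are hypothetical values for illustration; the output is the filter response only, without any offset or scaling to Hz.

```python
from dataclasses import dataclass


@dataclass
class Rectangle:
    position: int     # first sample of the rectangle
    duration: int     # length in samples
    amplitude: float  # constant excitation value


def excitation_from(rects, n_samples):
    """Build p(n) from rectangle parameters; zero where no rectangle lies."""
    signal = [0.0] * n_samples
    for r in rects:
        for i in range(r.position, min(r.position + r.duration, n_samples)):
            signal[i] = r.amplitude
    return signal


def all_pole(excitation, coeffs):
    """Synthesis filter 1/H(z): y(n) = p(n) + sum_k a_k * y(n-k)."""
    out = []
    for n, e in enumerate(excitation):
        feedback = sum(a * out[n - k]
                       for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(e + feedback)
    return out


# Four hypothetical rectangles and a hypothetical first-order filter.
rects = [Rectangle(2, 5, 1.0), Rectangle(15, 4, -0.8),
         Rectangle(28, 6, 0.6), Rectangle(40, 5, -1.2)]
coeffs = [0.95]

contour = all_pole(excitation_from(rects, 50), coeffs)
parameter_count = 3 * len(rects)  # position, duration, amplitude per rectangle
```

In this sketch, 12 rectangle parameters plus the filter coefficients stand in for 50 contour samples, and each rectangle can be edited individually.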

Claims

1. A method of LPC analysis of functions describing fundamental frequency by means of filter parameterization and simplifying residual signal approximation, characterized in that an LPC analysis is applied to functions which represent a fundamental frequency profile.

2. A method of synthesizing fundamental frequency profiles from (approximated) residual signals by LPC resynthesis, characterized in that fundamental frequency curves are synthesized on the basis of the filter parameters and the residual signal approximation obtained according to claim 1.

3. A method of preprocessing by interpolation, characterized in that preprocessing is performed in order to carry out an LPC analysis according to claim 1.

4. Use of the filter parameters, characterized in that functions describing fundamental frequency are synthesized on the basis of the filter parameters and structures obtained according to claim 1.

5. A method of approximating the residual signal, characterized in that a linear or non-linear information reduction is applied to the LPC residual signal obtained according to claim 1 or used according to claim 2.