WO2003049039A1 - Performance-driven facial animation techniques - Google Patents


Info

Publication number
WO2003049039A1
Authority
WO
WIPO (PCT)
Prior art keywords
facial
sequence
image
image sequence
ordinate
Prior art date
Application number
PCT/GB2002/005418
Other languages
French (fr)
Inventor
Glyn Cowe
Alan Johnston
Original Assignee
University College London
Priority date
Filing date
Publication date
Application filed by University College London filed Critical University College London
Priority to AU2002349141A priority Critical patent/AU2002349141A1/en
Publication of WO2003049039A1 publication Critical patent/WO2003049039A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings


Abstract

A method of generating a facial animation sequence comprising the steps of: a) observing a real facial image sequence - the original image sequence - and capturing the information generated thereby; b) aligning another facial image - the end image - in an appropriate manner co-ordinate-wise with the original one; c) analysing the information from the original image sequence mathematically; and d) using the results so obtained to drive the movements which generate the end image sequence. Principal components analysis is applied to successive frames from the original image sequence to generate the necessary co-ordinate frames characterising an individual's permissible facial actions and the resulting new sequence is then projected into the thus-defined co-ordinate frames to drive the end image accordingly. Alternatively the necessary vectorisations are generated by non-PCA-based analytical techniques in which the analysis proceeds from vectorial bases.

Description

PERFORMANCE-DRIVEN FACIAL ANIMATION TECHNIQUES
Field of the Invention
The invention relates to performance-driven facial animation techniques and to films or other media products generated by the operation of such techniques.
Review of Art known to the Applicants
Performance-driven facial animation has previously been approached with a variety of tracking techniques and deformable facial models. Parke introduced the first computer generated facial model (Parke 1972; Parke 1974). A polygonal mesh was painted onto a face, the face was photographed from two angles and the 3D location of each vertex was calculated by measurement and geometry. The polygonal mesh has remained the most popular representation, and Parke's model was animated simply by interpolating between key-frames.
Williams tracked coloured markers on his own face and manually defined corresponding points on a three-dimensional laser-scanned polygonal mesh of a head, with Hanning window (cosine fall-off) influence zones governing the motion of the mesh around each point (Williams 1990). Markers have often been used since in variations on this theme, and the introduction of automated dot tracking techniques and multiple cameras for 3D motion estimation has led to high quality results (Guenter, Grimm et al. 1998). Commercial packages with automated dot tracking are even available (e.g. Famous Faces).
Sophisticated underlying muscle models, based on those of Platt and Badler (Platt and Badler 1981), have furthermore been incorporated to enable more anatomically realistic movements (Waters 1987; Terzopoulos and Waters 1993), and alternative tracking strategies, such as active contour models ('snakes') (Terzopoulos and Waters 1993), deformable templates (Yuille 1991) and simpler feature trackers (for cartoon animation) (Buck, Finkelstein et al. 2000), have been employed. Essa and Pentland (Essa and Pentland 1997) and Black and Yacoob (Black and Yacoob 1995) sought denser motion information by tracking facial motion with optical flow for recognition of expression.
It is, however, very difficult to fool a human observer into believing that a computer model is a real face. Tiddeman and Perrett demonstrated a technique, based on prototyping, which allows them to transform existing facial image sequences in dimensions such as age, race and sex (Tiddeman and Perrett 2001). Prototypes are generated by averaging shape and texture information from a set of similar images (same race, for example). Each frame from the sequence can then be transformed towards this prototype. Points (179) must be located in each image and, although this can be automated using active shape models (Cootes, Taylor et al. 1995), a set of examples must first be delineated; so manual intervention cannot be avoided. Although this is not technically performance-driven facial animation, a new face is generated driven by the original motion.
Video Rewrite, a system developed by Bregler et al., automatically lip-synchs existing footage to a new audio track (Bregler, Covell et al. 1997). The video sequence remains the same, except for the mouth. Only the voice of the other actor drives the mouth and their facial expressions are ignored. They track the lips using eigenpoints (Covell and Bregler 1996) and employ hidden Markov models to learn the deformations associated with phonemes from the original audio track. Mouth shapes are predicted for each frame from the new audio track, and these are incorporated into the existing sequence by warping and blending. An extension of this approach is described by Cosatto and Graf, driven by a text-to-speech synthesiser (Cosatto and Graf 2000). Ezzat and Poggio also describe similar work (Ezzat and Poggio 1998; Ezzat and Poggio 2000).
Development of the Invention
The invention is preferably developed around the application of principal components analysis to vectors representing faces; we now proceed to discuss previous work applying PCA in this area.
Principal components analysis (PCA) is a mathematical technique that extracts uncorrelated vectors from a set of correlated vectors, in order of the variance they account for within the set. Early components thus provide strong descriptors of change within the set and later vectors have less relevance.
Sirovich and Kirby were the first to apply PCA to vectorised images of faces, principally as a means of data compression (Sirovich and Kirby 1987). Face images were turned into vectors by concatenating rows of pixel-wise grey level intensity values and transposing. They demonstrated how the weighted sum of just a small number of principal components can be used to reconstruct recognisable faces, requiring only the storage of the weights. The principal components extracted from sets of facial images in this way are often termed eigenfaces and have been successfully applied since, particularly for facial recognition (Turk and Pentland 1991; Pentland, Moghaddam et al. 1994). A problem with the application of PCA on intensity values of images is blurring, since linearly combining images results in deterioration of sharp edges. By first aligning face images onto an average shape, blurring can be dramatically reduced. Shape and texture information can thus be separated for an improved vectorisation. Beymer, and Vetter and Troje, presented such improved vectorisations using optic flow to find pixel-to-pixel correspondences between images (Beymer 1995; Vetter and Troje 1995). Once flow fields were extracted from each face to a chosen reference face, these could be averaged to define the mean shape and, for each face, shape could be encoded as the flow field deviation from this mean. By then warping faces onto the average, shape was removed, leaving only texture.
Although PCA has often been used to find axes of variation between people, variations within people have been considered less often. PCA has been applied to dot tracking data from facial sequences. Arslan et al. used it simply for dimensionality reduction in building codebooks relating acoustic data and phonemes to three-dimensional positions of dots for speech-driven facial animation (Arslan and Talkin 1998). Kshirsagar et al. used PCA on these vectors of dot positions and mapped a configuration associated with each phoneme into the principal component space (Kshirsagar, Molet et al. 2001).
Kuratate et al. captured laser scans of a face in eight different poses and used PCA to reduce the dimensionality of the data (Kuratate, Yehia et al. 1998). By relating the positions of a small number of points on the meshes to their principal component scores via a linear estimator, they were able to drive the 3D mesh by tracking points positioned analogously on an actor.
Summary of the Invention
It will be appreciated from the review above that most work in the field to date has centred around tracking the motion of an actor's face and transferring this on to a computer-generated model. The invention takes a new approach, applying mathematical analysis techniques to the information available from a real facial image sequence in order to enable that information to be used to drive the movements of another face appropriately co-ordinately aligned with the original. Given a set of examples of a particular face in motion, each example can be vectorised in a chosen manner. Once having established a set of high dimensional vectors representing examples of facial movements, one can define a subspace therein, with a co-ordinate system constraining deformations to these observed permissible actions. The resultant virtual avatar can be controlled by projecting novel deformations into this co-ordinate frame if the new sequence of movements is appropriately aligned with the original in position and scale (although this alignment need not be precise).
Specifically therefore the invention envisages an essentially space-based performance-driven facial animation technique in which frames from a preexisting real facial image sequence are analysed to generate a co-ordinate frame characterising an individual's permissible facial actions with a new sequence then being projected into the thus-defined co-ordinate frame to animate the end image accordingly.
The subsequent claims which will define the boundaries of the invention clearly include within their scope an image, for example a film image sequence, or other media product, generated by applying techniques in accordance with the invention in its broad conceptual scope.
In accordance with that overall approach there will now be described certain currently preferred embodiments of the inventive concept which demonstrate this space-based approach to performance-driven facial animation.
Brief Description of the Figures
The accompanying Figures 1 through 7 are derived from facial image sequences of two male subjects Glyn (the younger man) and Harry (the elder) with the recorded facial movements of Glyn being used to drive the facial end image of Harry as will be described below.
The detailed content of each individual Figure will become apparent as the description proceeds as will their individual relevance to the text with which they are inter-referenced.
Description of Currently Preferred Techniques for putting the Invention into Practice
This detailed description begins by presenting an example of vectorisation, demonstrating how images from a facial image sequence can be represented in vector form as pixel-wise intensity variations from their mean. The space-based approach underlying the invention is then outlined, and is finally shown to be generalisable to more sophisticated vectorisations, all essentially by way of example of current work on the concept.
Proceeding then to this detailed explanation and expansion of the concept:
A simple vectorisation: Facial motion as pixel-wise intensity variations
Consider an $n \times m$ image to be a vector of grey level intensity values, one value for each pixel of the image. These vectors (of length $N = n \times m$) can be thought of as representing locations in an N-dimensional image space. Now consider a set of M frames from a continuous recorded sequence of a face, $x_1, x_2, \ldots, x_M$, where each frame has been converted to a long vector by concatenating each row and transposing (Figure 1).
Since frames from a continuous recorded facial sequence tend to vary smoothly, these images will generally be clustered together in this space, centred approximately on their mean, $\mu = \frac{1}{M}\sum_{i=1}^{M} x_i$. Figure 2 shows how each face, $x$, in the set can be considered as a linear translation, $\phi$, from the mean, $\phi = x - \mu$ (note that $\phi$ is displayed in a different range with zero as mid-level grey, so that negative and positive values are visible).
In order to move around the subspace occupied by these particular vectors, we can set up a co-ordinate system that spans it using the examples as a basis. This space will necessarily have dimensionality of at most M, but it is unlikely that this will form a good description, since two or more example faces may be of a similar configuration and image noise will be responsible for most of the variance in those dimensions. By application of a mathematical technique, known as principal components analysis, we can define a new improved orthonormal coordinate system centred on μ, which optimally spans this subspace, with axes chosen in order of descriptive importance. That is, basis vectors are defined sequentially, each chosen to point in the direction of maximum variance unaccounted for so far by their predecessors, subject always to the constraint of orthonormality. Since noise tends to be uncorrelated, vectors describing this will be of low importance in the hierarchy and can be later discarded by truncation to a lower dimensionality.
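By way of illustration only, the following NumPy sketch (not part of the original disclosure, with names of our own choosing) shows one way the vectorisation and mean-centring just described might be carried out:

```python
import numpy as np

def vectorise_sequence(frames):
    """Stack M greyscale frames (each n x m) as columns of a data matrix.

    Each frame is flattened row by row into a length-N vector (N = n*m),
    mirroring the concatenate-rows-and-transpose vectorisation above.
    """
    X = np.stack([f.reshape(-1) for f in frames], axis=1)   # N x M data matrix
    mu = X.mean(axis=1, keepdims=True)                      # mean face, N x 1
    Phi = X - mu                                            # mean-centred columns, phi_i = x_i - mu
    return X, mu, Phi
```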
Principal Components Analysis - Creating a Puppet
Principal components analysis is a mathematical technique that seeks to linearly transform a set of correlated N-dimensional variables, $\{x_1, x_2, \ldots, x_M\}$, into an uncorrelated set that best describes the data, termed principal components, $\{u_1, u_2, \ldots, u_M\}$ (Chatfield and Collins 1980). For simplicity, we translate the data to a new set, $\{\phi_1, \phi_2, \ldots, \phi_M\}$, centred on the set's mean, $\mu$, simply by subtracting it from each datum, $\phi_i = x_i - \mu$. We define $\Phi = [\phi_1, \phi_2, \ldots, \phi_M]$, the matrix with columns consisting of the $\phi_i$'s. We proceed to show that these principal components, sequentially chosen to maximise the variance thus far accounted for, subject to the constraints of orthonormality, turn out simply to be the eigenvectors of the covariance matrix for $\{x_1, x_2, \ldots, x_M\}$.
First principal component
Consider first $u_1$. This is our first principal component and so must point in the direction of maximum variance of the data set. We thus wish to choose our first basis vector such that the magnitude of the projection of each member of the dataset onto $u_1$ is maximal,

$$\frac{\sum_{i=1}^{M}(\phi_i \cdot u_1)^2}{\|u_1\|^2} = \sum_{i=1}^{M}(\phi_i \cdot u_1)^2 \quad (3.1)$$

(since orthonormality of our basis set dictates that $u_1$ must have a magnitude of 1). This can be represented in matrix form as

$$(u_1^T\Phi)(u_1^T\Phi)^T = u_1^T\Sigma u_1 \quad (3.2)$$

where $\Sigma = \Phi\Phi^T$. It should be noted that $\frac{1}{M-1}\Sigma$ is, by definition, the covariance matrix of the set of image vectors (recall that the $\phi_i$'s are centred on their mean) and $\frac{1}{M-1}u_1^T\Sigma u_1$ gives a measure of the variance in the set that $u_1$ accounts for.

Orthonormality adds the constraint that $u_1^T u_1 = 1$. Introducing a Lagrange multiplier, $\lambda_1$, we can define a new function, $L_1(u_1)$,

$$L_1(u_1) = u_1^T\Sigma u_1 - \lambda_1(u_1^T u_1 - 1) \quad (3.3)$$

Employing the procedure of Lagrange multipliers, maximisation is now a case of finding when $\frac{\partial L_1}{\partial u_1} = 0$,

$$\frac{\partial L_1}{\partial u_1} = 2\Sigma u_1 - 2\lambda_1 u_1 \quad (3.4)$$

Setting $\frac{\partial L_1}{\partial u_1} = 0$, we have

$$\Sigma u_1 = \lambda_1 u_1 \quad (3.5)$$

This leaves us with an eigenvalue problem, where candidate solutions are the eigenvectors of $\Sigma$. Pre-multiplying by $u_1^T$ yields

$$u_1^T\Sigma u_1 = \lambda_1 u_1^T u_1 = \lambda_1 \quad (3.6)$$

which is the very function we sought to maximise (3.2), so the optimal solution is necessarily the eigenvector associated with the largest eigenvalue of $\Sigma$.
Second principal component
To find the second principal component, $u_2$, we need, similarly, to maximise

$$u_2^T\Sigma u_2 \quad (3.7)$$

subject to

$$u_2^T u_2 = 1 \quad \text{and} \quad u_1^T u_2 = 0 \quad (3.8)$$

With two Lagrange multipliers, $\lambda_2$ and $\delta$, we define the new function $L_2(u_2)$,

$$L_2(u_2) = u_2^T\Sigma u_2 - \lambda_2(u_2^T u_2 - 1) - \delta u_1^T u_2 \quad (3.9)$$

Again, maximisation is now a case of finding when $\frac{\partial L_2}{\partial u_2} = 0$, which leaves us with

$$\frac{\partial L_2}{\partial u_2} = 2\Sigma u_2 - 2\lambda_2 u_2 - \delta u_1 = 0 \quad (3.10)$$

Pre-multiplying by $u_1^T$,

$$2u_1^T\Sigma u_2 - 2\lambda_2 u_1^T u_2 - \delta u_1^T u_1 = 0 \quad (3.11)$$

this reduces to

$$2u_1^T\Sigma u_2 = \delta \quad (3.12)$$

due to the orthonormality constraints (3.8). Rearranging this, we can use the symmetry of $\Sigma$, (3.5) and (3.8) together, to show that $\delta = 0$:

$$\delta = 2u_1^T\Sigma u_2 = 2(\Sigma u_1)^T u_2 = 2(\lambda_1 u_1)^T u_2 = 2\lambda_1 u_1^T u_2 = 0 \quad (3.13)$$

This reduces (3.10) to

$$\frac{\partial L_2}{\partial u_2} = 2\Sigma u_2 - 2\lambda_2 u_2 = 0 \quad (3.14)$$

which leaves us again with the eigensystem,

$$\Sigma u_2 = \lambda_2 u_2 \quad (3.15)$$

Since $u_1$ is already the eigenvector associated with the largest eigenvalue, the next best solution will be the eigenvector associated with the second largest eigenvalue.
The other principal components
By continuing this process for each $j \in [1, M]$, with the constraints $u_j^T u_j = 1$ and $u_i^T u_j = 0$ for all $i < j$, it is apparent that the principal components are the eigenvectors of $\Sigma$ (or, equivalently, the eigenvectors of the set's covariance matrix) ordered by magnitude of their associated eigenvalues, $\lambda_j$,

$$\Sigma u_j = \lambda_j u_j \quad (3.16)$$

Pre-multiplying (3.16) by $u_j^T$, we see that

$$u_j^T\Sigma u_j = \lambda_j u_j^T u_j = \lambda_j \quad (3.17)$$

Since the variance accounted for by $u_j$ is given by $\frac{1}{M-1}u_j^T\Sigma u_j$, the corresponding eigenvalues provide a measure of this, differing only by scaling. With consideration towards these variances, lower order components can be discarded as noise, thus reducing the dimensionality to some $P < M$.
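The derivation above reduces to an eigen-decomposition of $\Sigma = \Phi\Phi^T$. As an illustrative sketch only (our own, with assumed names), this can be computed directly when N is manageable; the next section gives the shortcut used for full-size images.

```python
import numpy as np

def pca_direct(Phi, P):
    """Principal components as the top-P eigenvectors of Sigma = Phi Phi^T (eq. 3.16).

    Only practical when N (the vector length) is small; see the M x M
    shortcut in the following section for full-resolution images.
    """
    Sigma = Phi @ Phi.T                              # N x N scatter matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)         # returned in ascending order
    order = np.argsort(eigvals)[::-1]                # sort by descending eigenvalue
    U = eigvecs[:, order[:P]]                        # first P principal components (columns)
    lam = np.maximum(eigvals[order[:P]], 0.0)        # clip tiny negative values from round-off
    sigma = np.sqrt(lam / (Phi.shape[1] - 1))        # per-component standard deviations (as in Figure 3)
    return U, lam, sigma
```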
Figure 3 shows the first five principal components from an image sequence of Harry speaking, vectorised as described previously. Together, these mere five principal components account for 75% of the variance in the sequence of 317 frames. In each case, the central column always shows the mean image from the sequence. The left and right columns show the images two standard deviations, $2\sigma$, away from the mean in the negative and positive directions respectively for each principal component. More explicitly, for row $j$, from left to right, the images are $\mu - 2\sigma_j u_j$, $\mu$ and $\mu + 2\sigma_j u_j$, where the standard deviation is $\sigma_j = \sqrt{\lambda_j/(M-1)}$.
Reducing computation
Computationally, finding the eigenvalues and eigenvectors of the $N \times N$ matrix, $\Sigma = \Phi\Phi^T$, is difficult due to its large size. We therefore look to find the eigenvalues and eigenvectors of the $M \times M$ matrix $\Phi^T\Phi$ when $M \ll N$,

$$\Phi^T\Phi v = \lambda v \quad (3.18)$$

This can be exploited, since, pre-multiplying each side by $\Phi$,

$$\Phi\Phi^T\Phi v = \lambda\Phi v \quad (3.19)$$

and adding some parentheses,

$$(\Phi\Phi^T)(\Phi v) = \lambda(\Phi v) \quad (3.20)$$

we see that $\Phi^T\Phi$ and $\Phi\Phi^T$ share the same eigenvalues, and that, if $v$ is an eigenvector of $\Phi^T\Phi$, then $u = \Phi v$ will be an eigenvector of $\Phi\Phi^T$. This provides us with a useful computational shortcut.

For particularly large values of $M$ and $N$, however, memory constraints sometimes make it impractical to store the matrix of outer products, whether it be $\Phi\Phi^T$ or $\Phi^T\Phi$. In such situations, the first $P$ principal components can be learned by a neural network (Sanger 1989), or can be extracted using a convergence algorithm (Roweis 1998).
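A hedged sketch of the $M \times M$ shortcut of equations (3.18)-(3.20) follows (our own illustration, with assumed names; a thin SVD of $\Phi$ would give the same components). For a few hundred frames this keeps the eigenproblem trivially small even when N runs to tens of thousands of pixels.

```python
import numpy as np

def pca_snapshot(Phi, P):
    """Top-P eigenvectors of Phi Phi^T obtained via the small matrix Phi^T Phi."""
    small = Phi.T @ Phi                            # M x M instead of N x N
    eigvals, V = np.linalg.eigh(small)             # same eigenvalues as Phi Phi^T
    order = np.argsort(eigvals)[::-1][:P]
    lam = np.maximum(eigvals[order], 0.0)
    U = Phi @ V[:, order]                          # u = Phi v is an eigenvector of Phi Phi^T (eq. 3.20)
    U /= np.linalg.norm(U, axis=0, keepdims=True)  # re-normalise each component to unit length
    return U, lam
```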
Projecting into face space
Having found a new co-ordinate system representing an individual's face space, we can project any facial movement, ξ, from a sequence of any individual into this space, provided it is vectorised in the same manner, centred on its own sequence mean and roughly aligned (just performing a simple affine transform to bring the two eye positions and two mouth corners into correspondence has been found to be sufficient).
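As an illustration of such a rough alignment (our own sketch, not prescribed by the text), a $2 \times 3$ affine transform can be estimated by least squares from the four landmark positions (two eye centres and two mouth corners), assuming those positions are available for each sequence; the resampling of the images through this transform is omitted here.

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine transform mapping src landmarks onto dst landmarks.

    src_pts, dst_pts: arrays of shape (K, 2) with K >= 3, here the two eye
    centres and the two mouth corners of the driving and training faces.
    """
    K = src_pts.shape[0]
    A = np.hstack([src_pts, np.ones((K, 1))])         # K x 3, homogeneous source points
    T, *_ = np.linalg.lstsq(A, dst_pts, rcond=None)   # solve A @ T ~= dst_pts
    return T.T                                        # 2 x 3 matrix [a b tx; c d ty]
```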
Given a set of $M_{train}$ training vectors from individual one (the face we wish to drive), $x_1, x_2, \ldots, x_{M_{train}}$, and a set of $M_{drive}$ driving vectors from individual two (the face that will be doing the driving), $y_1, y_2, \ldots, y_{M_{drive}}$, we centre them both on their means and find matrices $\Phi$ and $\Psi$, such that $\Phi = [\phi_1, \phi_2, \ldots, \phi_{M_{train}}]$, where $\phi_i = x_i - \mu_{train}$, and $\Psi = [\psi_1, \psi_2, \ldots, \psi_{M_{drive}}]$, where $\psi_i = y_i - \mu_{drive}$. Principal components analysis provides us with a set of basis vectors, $u_1, u_2, \ldots, u_R$, where $R \le M_{train}$. We project into the new lower dimensional co-ordinate frame provided by the principal components by employing the basis transformation matrix, $U = [u_1, u_2, \ldots, u_R]$. For example, to project the N-dimensional vector, $\psi_i$, into the R-dimensional subspace described by the principal components basis, we apply

$$c_i = U^T\psi_i \quad (3.21)$$
Each element of $c_i$ represents a weighting on the respective basis vector.
An optional rescaling step can be included, where the distribution of the $c_i$'s can be transformed so the means and standard deviations of the weights associated with each basis vector match those for the training set. The distribution can also be rescaled for exaggeration or anti-exaggeration purposes.
In order to transform the projection, $c_i$, back to N-dimensional space translated to the standard origin, we apply the inverse transformation and add the training mean,

$$z_i = U c_i + \mu_{train} \quad (3.22)$$
In the case of the pixel-wise intensity vectorisation, the new $N \times 1$ vector, $z_i$, is then rearranged into n rows of m elements, to form an $n \times m$ image. Figures 3 and 4 demonstrate typical results from this process for the vectorisation defined.
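Putting equations (3.21) and (3.22) together with the optional rescaling, a minimal sketch might look as follows (our own, with illustrative names; `c_train` would be the projections $U^T\Phi$ of the training set). Each column of the result is reshaped to $n \times m$ to give an output frame.

```python
import numpy as np

def drive_face(U, mu_train, Psi, c_train=None):
    """Project mean-centred driving vectors Psi (N x M_drive) into the training
    face space (eq. 3.21), optionally rescale, and reconstruct (eq. 3.22)."""
    C = U.T @ Psi                                             # R x M_drive weights
    if c_train is not None:
        # optional moment matching: give each row of C the mean and standard
        # deviation of the corresponding training weights
        C = (C - C.mean(axis=1, keepdims=True)) / (C.std(axis=1, keepdims=True) + 1e-12)
        C = C * c_train.std(axis=1, keepdims=True) + c_train.mean(axis=1, keepdims=True)
    Z = U @ C + mu_train                                      # back to image space, one column per frame
    return Z
```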
A face space is defined for Harry (the first five dimensions of which are shown in
Figure 3). The top row of Figure 4 shows five frames from a real image sequence of Glyn telling a joke. These are then projected into Harry's face space using the procedure defined, and the resulting images, transformed back to image space, are shown below their corresponding frames.
Alternative Vectorisations
Alternative vectorisations could be employed to define the original space of an individual's facial movement. By concatenating the three colour planes, RGB colour images can be vectorised and the procedure outlined above can be applied.
A clear problem with the examples presented previously, however, is the blur inherent in linearly combining images. Given a facial image sequence, one approach for evading this drawback is to choose an arbitrary frame to be a reference and define the remaining frames in terms of warps from this single frame.
Figure 5 demonstrates this warping approach. Here we represent each frame, I, as a matrix containing the colour information for each pixel of the image, for example as an RGB triple. We write I(x, y) to represent the colour information for the pixel at (x, y). We choose the image shown in (a) as a reference; although this is a somewhat arbitrary choice, we select the image closest to the luminance mean, additionally ensuring it is in a 'neutral' pose with eyes open and mouth slightly open; this is because, for example, an open mouth can be warped onto a closed mouth, but a closed mouth cannot be warped onto an open mouth.
In order to warp from one image to another, it is necessary to be able to find pixel-wise correspondences between them. There are a variety of approaches for estimating these correspondences, but in these examples we apply an optic flow algorithm.
The Multi-channel Gradient Model (McGM)
The Multi-channel Gradient Model (McGM) is an optic flow algorithm modelled on the processing of the human visual system (Johnston, McOwan et al. 1999). We chose to apply the model to just two images for each frame, the reference and the target, since optic flow provides only an estimate and errors would be disproportionately magnified for frames temporally further from the reference, were fields to be combined over time. Some adaptation is thus required for this application, since the McGM would usually have a large temporal buffer of images to work with and, in this case, we have only two. This can be overcome by replacing the zeroth and first temporal derivatives with their average and difference, respectively, and discarding all those of higher order.
A coarse-to-fine implementation of the McGM was applied at three spatial scales, 0.25, 0.5 and 1.0, progressively warping a reference facial image onto the target frame.
Warping a reference
By application of the McGM, we can find the flow field (U, V), that takes us from a reference Q to the target frame, P, (shown in (b)), where U and V are matrices containing the horizontal and vertical components of the field, respectively, for each location (x, y). The target can be approximately reconstructed from the reference and the flow field, by backward mapping:
$$P(x, y) \approx R(x, y) = Q\bigl(x - U(x, y),\, y - V(x, y)\bigr) \quad (1.1)$$

where R is the reconstruction. Since $(x - U(x, y),\, y - V(x, y))$ will rarely correspond exactly to pixel locations in Q, an interpolation technique is employed. Here we use bilinear interpolation. All images in the sequence can be represented as warps from Q and the entire sequence can be reconstructed by warping this one reference frame. Each vector field (U, V) can be vectorised by concatenating each row of U and V, joining them and transposing to form one long vector.
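A minimal sketch of this backward mapping and of the flow-field vectorisation follows (ours, not part of the original text), assuming SciPy's bilinear `map_coordinates` resampler in place of a hand-written interpolator:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_reference(Q, U, V):
    """Backward-map the reference image Q through the flow field (U, V), as in eq. (1.1).

    Q is an n x m (or n x m x 3) array; U and V hold the horizontal and
    vertical flow components. order=1 gives bilinear interpolation at
    non-integer sample positions.
    """
    n, m = U.shape
    ys, xs = np.mgrid[0:n, 0:m].astype(float)
    sample_y = ys - V                                 # y - V(x, y)
    sample_x = xs - U                                 # x - U(x, y)
    if Q.ndim == 2:
        return map_coordinates(Q, [sample_y, sample_x], order=1, mode='nearest')
    # warp each colour channel independently
    return np.stack([map_coordinates(Q[..., c], [sample_y, sample_x], order=1, mode='nearest')
                     for c in range(Q.shape[-1])], axis=-1)

def vectorise_flow(U, V):
    """Concatenate the rows of U and V into one long vector, as described above."""
    return np.concatenate([U.reshape(-1), V.reshape(-1)])
```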
Once facial image sequences from two individuals have been vectorised in this manner, one can be driven by the other by application of the procedure as described previously.
Results
Figure 6 shows the first five principal components from Harry's sequence, vectorised as warps from a reference as described above. The middle column shows the chosen reference image and the left and right columns show the warp -2 standard deviations and +2 standard deviations respectively in the direction of each shown component. Together, these five components account for 85% of the variance in the whole set.
Projecting vectors from Glyn's sequence into this space results in a new sequence with Harry mimicking Glyn's facial movements. Five frames from this are shown below their corresponding frames in Figure 7 (using only 20 principal components, which account for 94% of the variance).
A difficulty with this vectorisation is the appearance of features of the face that were previously obscured. If there is no evidence of a feature in the reference image, then it is not possible to generate these features with a warp only. Teeth, for example, are often occluded by the lips. We refer to such changes as iconic. A vectorisation based on the luminance or RGB values in an image will capture such iconic changes, although with the disadvantage of blurring, so a combined approach can be applied. Figure 8 outlines such an approach, where images from the training and driving sets are reverse warped onto their respective references to provide stabilised sequences where remaining changes should essentially consist of iconic and lighting variations only. A first basis can be extracted from the set of forward training warps and a second basis can be extracted from the image-based vectorisations (luminance, or RGB, etc.) of the stabilised training sequence. We refer to the first basis as the configural basis, and the second as the image basis. There should be little blurring in the image basis, since the stabilised images that it is generated from will be aligned. Once the driving flow fields are projected onto the configural basis, and the stabilised driving images are projected onto the image basis, a new sequence can be generated by applying the projected flow fields onto the projected stabilised images, thus incorporating iconic changes with minimal blurring.
Alternatively, the feature aligned texture information and the configural information (in the form of the flow fields), can be combined together into one single vector for each frame of the sequence. The basis can then be extracted as before from this information. Such vectors can then be converted into images by simply warping the texture component by its configural component.
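As an illustrative sketch of this combined vectorisation (ours; the relative weighting `alpha` between the flow and texture parts is an assumption not prescribed by the text, and `warp_reference` is the helper from the earlier warping sketch; greyscale texture is assumed for brevity):

```python
import numpy as np

def combine_frame(flow_vec, texture_vec, alpha=1.0):
    """Concatenate configural (flow) and image (texture) information into one vector.

    alpha is an assumed scale balancing the two parts of the vector.
    """
    return np.concatenate([alpha * flow_vec, texture_vec])

def decode_frame(z, shape, alpha=1.0):
    """Split a decoded combined vector back into a flow field and a stabilised
    texture image, then warp the texture by its own configural component."""
    n, m = shape
    flow_len = 2 * n * m
    flow = z[:flow_len] / alpha
    U = flow[:n * m].reshape(n, m)
    V = flow[n * m:].reshape(n, m)
    texture = z[flow_len:].reshape(n, m)
    return warp_reference(texture, U, V)      # from the earlier warping sketch
```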
Mapping between vectorisations
Since the basis sets discussed are generated from linear combinations of the examples, it is possible to apply the same weights to the examples in an alternative vectorisation, thus producing a (non-orthogonal) version of the basis set in a second encoding. Each frame from a novel sequence can then be encoded in the first mean-centred vectorisation as $\psi_i$, projected onto the R-dimensional basis for that vectorisation, $A$, then decoded using the second R-dimensional basis, $B$. The encoding step would thus be (from 3.21)

$$c_i = A^T\psi_i \quad (3.23)$$
The decoding step would then be (from 3.22)
$$z_i = B c_i + \mu_{train} \quad (3.24)$$

Despite the non-orthogonality of the second basis set, this technique has been found to work well and allows the possibility of encoding fast in a low quality, low resolution basis, then projecting onto a much better quality, high resolution basis set, crucially enabling high quality real-time implementation of the technology.
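A minimal sketch of this cross-basis mapping, equations (3.23) and (3.24), follows (ours; `mu_train_2` is assumed to denote the training mean expressed in the second vectorisation):

```python
import numpy as np

def map_between_bases(psi, A, B, mu_train_2):
    """Encode a mean-centred frame psi in basis A (eq. 3.23) and decode it in the
    alternative basis B (eq. 3.24), e.g. low-resolution in, high-resolution out."""
    c = A.T @ psi                 # weights in the first (fast, low quality) basis
    return B @ c + mu_train_2     # reconstruction in the second vectorisation
```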
Conclusions
It is possible to generate a low-dimensional co-ordinate frame in high- dimensional space that encapsulates the dimensions of movement for an actor. Another's movements can then be projected into this space.
Only movements that can be made by combinations of those experienced in the training phase can be projected onto another face, since only those are represented by the basis set. This may seem to be a limitation, but can equally be considered advantageous, since only movements faithful to the target's repertoire can be made. It would, for example, be unnatural to see someone wink, or raise their eyebrows, were they not normally able to.
It is to be noted that PCA is not the only way to generate a set of bases. The original vectors from the training sequence could be used, for example (the transformation matrix, U, would then be the matrix generated with these vectors as its columns, after normalisation, and the inverse transformation matrix would be its pseudoinverse rather than $U^T$). PCA happens to be particularly good because it orders the bases in terms of descriptive importance, so noise can be truncated away, and orthogonality is enforced, so no pseudoinverses need to be calculated.
The scope of the claims which follow is to be interpreted accordingly.
The whole process, involving the vectorisations presented, requires no manual intervention other than the approximate alignment of the two sequences in position, rotation and scale. This is very desirable in a field where tracking often necessitates much tedious manual labour.
Specific Prior Art References
Arslan, L. M. and D. Talkin (1998). 3-D face point trajectory synthesis using an automatically derived visual phoneme similarity matrix. Auditory-Visual Speech Processing, Terrigal, NSW, Australia.
Beymer, D. (1995). Vectorizing face images by interleaving shape and texture computations. Massachusetts Institute of Technology.
Black, M. J. and Y. Yacoob (1995). Tracking and recognising rigid and non-rigid facial motions using local parametric models of image motions. International Conference on Computer Vision.
Bregler, C., M. Covell, et al. (1997). Video Rewrite: driving visual speech with audio. SIGGRAPH Conference on Computer Graphics, Los Angeles, California.
Buck, I., A. Finkelstein, et al. (2000). Performance-driven hand-drawn animation. Proceedings of the First International Symposium on Non-photorealistic Animation and Rendering.
Chatfield, C. and A. J. Collins (1980). Introduction to Multivariate Analysis. London, Chapman and Hall.
Cootes, T., C. Taylor, et al. (1995). "Active shape models - their training and application." Computer Vision, Graphics and Image Understanding 61(1): 38-59.
Cosatto, E. and H. P. Graf (2000). "Photo-realistic talking heads from image samples." IEEE Transactions on Multimedia 2(3): 152-163.
Covell, M. and C. Bregler (1996). Eigenpoints. International Conference on Image Processing, Lausanne, Switzerland.
Essa, I. and A. Pentland (1997). "Coding, analysis, interpretation, and recognition of facial expressions." IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7): 757-763.
Ezzat, T. and T. Poggio (1998). MikeTalk: A talking facial display based on morphing visemes. Computer Animation Conference, Philadelphia, Pennsylvania.
Ezzat, T. and T. Poggio (2000). "Visual speech synthesis by morphing visemes." International Journal of Computer Vision 38(1): 45-57.
Guenter, B., C. Grimm, et al. (1998). Making faces. ACM SIGGRAPH, Orlando, FL.
Johnston, A., P. W. McOwan, et al. (1999). "Robust velocity computation from a biologically motivated model of motion perception." Proceedings of the Royal Society of London B 266: 509-518.
Kshirsagar, S., T. Molet, et al. (2001). Principal components of expressive speech animation. Computer Graphics International, Hong Kong, IEEE Computer Society.
Kuratate, T., H. Yehia, et al. (1998). Kinematics-based synthesis of realistic talking faces. Auditory-Visual Speech Processing, Terrigal, NSW, Australia.
Parke, F. I. (1972). Computer generated animation of faces. Salt Lake City, University of Utah.
Parke, F. I. (1974). A parametric model for human faces. Salt Lake City, University of Utah.
Pentland, A., B. Moghaddam, et al. (1994). View-based and modular eigenspaces for face recognition. Proc. Computer Vision and Pattern Recognition Conference.
Platt, S. M. and N. I. Badler (1981). "Animating facial expression." Computer Graphics 15(3): 245-252.
Roweis, S. (1998). "EM algorithms for PCA and SPCA." Advances in Neural Information Processing Systems 10.
Sanger, T. D. (1989). "Optimal unsupervised learning in a single-layer linear feedforward neural network." Neural Networks 2: 459-473.
Sirovich, L. and M. Kirby (1987). "Low-dimensional procedure for the characterization of human faces." Journal of the Optical Society of America A 4(3): 519-524.
Terzopoulos, D. and K. Waters (1993). "Analysis and synthesis of facial image sequences using physical and anatomical models." IEEE Transactions on Pattern Analysis and Machine Intelligence 15(6): 569-579.
Tiddeman, B. and D. Perrett (2001). Moving facial image transformations using static 2D prototypes. The 9th International Conference in Central Europe on Computer Graphics, Visualisation and Computer Vision, Plzen, Czech Republic.
Turk, M. and A. Pentland (1991). "Eigenfaces for recognition." Journal of Cognitive Neuroscience 3: 71-86.
Vetter, T. and N. Troje (1995). A separated linear shape and texture space for modeling two-dimensional images of human faces. Max-Planck-Institut für biologische Kybernetik.
Waters, K. (1987). "A muscle model for animating three-dimensional facial expression." Computer Graphics 21(4): 17-24.
Williams, L. (1990). "Performance driven facial animation." Computer Graphics 24(4): 235-242.
Yuille, A. L. (1991). "Deformable templates for face recognition." Journal of Cognitive Neuroscience 3(1): 59-70.

Claims

1. A method of generating a facial animation sequence comprising the steps of:
a) observing a real facial image sequence - the original image sequence - and capturing the information generated thereby;
b) aligning another facial image - the end image - in an appropriate manner co-ordinate-wise with the original one;
c) analysing the information from the original image sequence mathematically; and
d) using the results so obtained to drive the movements which generate the end image sequence.
2. A method in accordance with Claim 1 and in which principal components analysis is applied to successive frames from the original image sequence to generate respective co-ordinate frames characterising an individual's permissible facial actions and the resulting new sequence then being projected into the thus-defined co-ordinate frames to drive and animate the end image accordingly.
3. A method in accordance with Claim 1 and in which the necessary vectorisations are generated by non-PCA-based analytical techniques in which the analysis proceeds from vectorial bases.
4. A method in accordance with Claims 1, 2 and 3 and further incorporating a post-processing step wherein the distribution of the resulting weights on each basis vector is statistically transformed to match the mean and standard deviation of its equivalents in the training set, or is statistically altered for exaggeration or anti-exaggeration purposes.
5. A facial animation technique incorporating all the essential steps of any one of the methods embodied in an appropriate combination of the teachings disclosed herein.
6. An image, for example a film image sequence, or other media product, generated by applying techniques in accordance with any of the preceding Claims.
7. A method employing multiple vectorisations as a means of combining separate encodings of both configural and iconic change as defined in the text.
8. A method employing two separate co-ordinate frames, with related axes, where the first is used for encoding novel movements and the second is used for constructing the end images.
PCT/GB2002/005418 2001-12-01 2002-11-29 Performance-driven facial animation techniques WO2003049039A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2002349141A AU2002349141A1 (en) 2001-12-01 2002-11-29 Performance-driven facial animation techniques

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0128863.8 2001-12-01
GB0128863A GB0128863D0 (en) 2001-12-01 2001-12-01 Performance-driven facial animation techniques and media products generated therefrom

Publications (1)

Publication Number Publication Date
WO2003049039A1 true WO2003049039A1 (en) 2003-06-12

Family

ID=9926877

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2002/005418 WO2003049039A1 (en) 2001-12-01 2002-11-29 Performance-driven facial animation techniques

Country Status (3)

Country Link
AU (1) AU2002349141A1 (en)
GB (1) GB0128863D0 (en)
WO (1) WO2003049039A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8279228B2 (en) 2006-04-24 2012-10-02 Sony Corporation Performance driven facial animation
US10574883B2 (en) 2017-05-31 2020-02-25 The Procter & Gamble Company System and method for guiding a user to take a selfie
US10614623B2 (en) 2017-03-21 2020-04-07 Canfield Scientific, Incorporated Methods and apparatuses for age appearance simulation
US10621771B2 (en) 2017-03-21 2020-04-14 The Procter & Gamble Company Methods for age appearance simulation
US10818007B2 (en) 2017-05-31 2020-10-27 The Procter & Gamble Company Systems and methods for determining apparent skin age
US11055762B2 (en) 2016-03-21 2021-07-06 The Procter & Gamble Company Systems and methods for providing customized product recommendations

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6285794B1 (en) * 1998-04-17 2001-09-04 Adobe Systems Incorporated Compression and editing of movies by multi-image morphing

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6285794B1 (en) * 1998-04-17 2001-09-04 Adobe Systems Incorporated Compression and editing of movies by multi-image morphing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JOCKUSCH S ET AL: "Analysis-by-synthesis and example based animation with topology conserving neural nets", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP) AUSTIN, NOV. 13 - 16, 1994, LOS ALAMITOS, IEEE COMP. SOC. PRESS, US, vol. 3 CONF. 1, 13 November 1994 (1994-11-13), pages 953 - 957, XP010146492, ISBN: 0-8186-6952-7 *
LIU Z ET AL: "EXPRESSIVE EXPRESSION MAPPING WITH RATIO IMAGES", COMPUTER GRAPHICS. SIGGRAPH 2001. CONFERENCE PROCEEDINGS. LOS ANGELES, CA, AUG. 12 - 17, 2001, COMPUTER GRAPHICS PROCEEDINGS. SIGGRAPH, NEW YORK, NY: ACM, US, 12 August 2001 (2001-08-12), pages 271 - 275, XP001049896, ISBN: 1-58113-374-X *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8279228B2 (en) 2006-04-24 2012-10-02 Sony Corporation Performance driven facial animation
US11055762B2 (en) 2016-03-21 2021-07-06 The Procter & Gamble Company Systems and methods for providing customized product recommendations
US10614623B2 (en) 2017-03-21 2020-04-07 Canfield Scientific, Incorporated Methods and apparatuses for age appearance simulation
US10621771B2 (en) 2017-03-21 2020-04-14 The Procter & Gamble Company Methods for age appearance simulation
US10574883B2 (en) 2017-05-31 2020-02-25 The Procter & Gamble Company System and method for guiding a user to take a selfie
US10818007B2 (en) 2017-05-31 2020-10-27 The Procter & Gamble Company Systems and methods for determining apparent skin age

Also Published As

Publication number Publication date
AU2002349141A1 (en) 2003-06-17
GB0128863D0 (en) 2002-01-23

Similar Documents

Publication Publication Date Title
Thies et al. Headon: Real-time reenactment of human portrait videos
Noh et al. A survey of facial modeling and animation techniques
Blanz et al. Reanimating faces in images and video
Ichim et al. Dynamic 3D avatar creation from hand-held video input
Thies et al. Real-time expression transfer for facial reenactment.
US6556196B1 (en) Method and apparatus for the processing of images
Jones et al. Multidimensional morphable models: A framework for representing and matching object classes
Vlasic et al. Face transfer with multilinear models
Essa et al. Modeling, tracking and interactive animation of faces and heads//using input from video
Chai et al. Vision-based control of 3 D facial animation
Bickel et al. Multi-scale capture of facial geometry and motion
US6967658B2 (en) Non-linear morphing of faces and their dynamics
Vetter et al. Estimating coloured 3D face models from single images: An example based approach
Bronstein et al. Calculus of nonrigid surfaces for geometry and texture manipulation
US6400828B2 (en) Canonical correlation analysis of image/control-point location coupling for the automatic location of control points
Pighin et al. Modeling and animating realistic faces from images
WO2021228183A1 (en) Facial re-enactment
Pighin et al. Realistic facial animation using image-based 3D morphing
Kwolek Model based facial pose tracking using a particle filter
Fidaleo et al. Classification and volume morphing for performance-driven facial animation
Paier et al. Example-based facial animation of virtual reality avatars using auto-regressive neural networks
Paier et al. Neural face models for example-based visual speech synthesis
WO2003049039A1 (en) Performance-driven facial animation techniques
Hou et al. Smooth adaptive fitting of 3D face model for the estimation of rigid and nonrigid facial motion in video sequences
Cowe Example-based computer-generated facial mimicry

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP