US20150189318A1 - Feature-Based Video Compression - Google Patents

Feature-Based Video Compression

Info

Publication number
US20150189318A1
Authority
US
United States
Prior art keywords
feature
video
model
frames
instances
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/592,898
Inventor
Charles P. Pace
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Euclid Discoveries LLC
Original Assignee
Euclid Discoveries LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US11/191,562 external-priority patent/US7158680B2/en
Priority claimed from US11/230,686 external-priority patent/US7426285B2/en
Priority claimed from US11/280,625 external-priority patent/US7457435B2/en
Priority claimed from US11/336,366 external-priority patent/US7436981B2/en
Priority claimed from US11/396,010 external-priority patent/US7457472B2/en
Priority claimed from PCT/US2008/000090 external-priority patent/WO2008091483A2/en
Application filed by Euclid Discoveries LLC filed Critical Euclid Discoveries LLC
Priority to US14/592,898 priority Critical patent/US20150189318A1/en
Assigned to EUCLID DISCOVERIES, LLC reassignment EUCLID DISCOVERIES, LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: PACE, CHARLES P.
Publication of US20150189318A1 publication Critical patent/US20150189318A1/en
Abandoned legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/537Motion estimation other than block-based
    • H04N19/54Motion estimation other than block-based using feature points or meshes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/001Model-based coding, e.g. wire frame
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • H04N19/23Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with coding of regions that are present throughout a whole video segment, e.g. sprites, background or mosaic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/537Motion estimation other than block-based
    • H04N19/543Motion estimation other than block-based using regions

Definitions

  • Image and video compression techniques generally attempt to exploit redundancy in the data that allows the most important information in the data to be captured in a “small” number of parameters. “Small” is defined relative to the size of the original raw data. It is not known in advance which parameters will be important for a given data set. Because of this, conventional image/video compression techniques compute (or measure) a relatively large number of parameters before selecting those that will yield the most compact encoding. For example, the JPEG and JPEG 2000 image compression standards are based on linear transforms (typically the discrete cosine transform [DCT] or discrete wavelet transform [DWT]) that convert image pixels into transform coefficients, resulting in a number of transform coefficients equal to the number of original pixels.
  • DCT: discrete cosine transform
  • DWT: discrete wavelet transform
  • The important coefficients can then be selected by various techniques.
  • One example is scalar quantization. When taken to an extreme, this is equivalent to magnitude thresholding. While the DCT and DWT can be computed efficiently, the need to compute the full transform before data reduction causes inefficiency. The computation requires a number of measurements equal to the size of the input data for these two transforms. This characteristic of conventional image/video compression techniques makes them impractical for use when high computational efficiency is required.
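To make the transform-then-select pattern concrete, here is a minimal NumPy sketch (not from the patent; the block contents and keep ratio are illustrative assumptions) that computes an orthonormal 2-D DCT of an 8×8 block and applies magnitude thresholding before inverting:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; rows are the cosine basis vectors."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    C = np.cos(np.pi * k * (2 * x + 1) / (2 * n)) * np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2.0)
    return C

def threshold_block(block, keep_ratio):
    """Full 2-D DCT, then zero all but the largest-magnitude
    coefficients (magnitude thresholding) and invert."""
    C = dct_matrix(block.shape[0])
    coeffs = C @ block @ C.T                      # forward 2-D DCT
    cutoff = np.sort(np.abs(coeffs).ravel())[int((1 - keep_ratio) * coeffs.size)]
    coeffs[np.abs(coeffs) < cutoff] = 0.0         # keep only the "important" coefficients
    return C.T @ coeffs @ C                       # inverse 2-D DCT

# Smooth synthetic 8x8 block: most of its energy sits in low frequencies.
block = np.outer(np.linspace(0.0, 1.0, 8), np.linspace(1.0, 2.0, 8))
approx = threshold_block(block, keep_ratio=0.1)
err = np.linalg.norm(block - approx) / np.linalg.norm(block)
```

Note that all 64 coefficients are computed even though only a handful survive the thresholding, which is precisely the inefficiency the passage describes.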
  • Compressed sensing (CS) algorithms directly exploit much of the redundancy in the data during the measurement (“sensing”) step. Redundancy in the temporal, spatial, and spectral domains is a major contributor to higher compression rates.
  • The key result for all compressed sensing algorithms is that a compressible signal can be sensed with a relatively small number of random measurements, far fewer than the number required by conventional compression algorithms. The signal can then be reconstructed accurately and reliably: given known statistical characteristics, a subset of the visual information is used to infer the rest of the data.
  • The precise number of measurements required by a given CS algorithm depends on the type of signal as well as the “recovery algorithm” that reconstructs the signal from the measurements (coefficients). Note that the number of measurements a CS algorithm requires to reconstruct signals with some certainty is not directly related to the computational complexity of the algorithm. For example, a class of CS algorithms that uses L1-minimization to recover the signal requires a relatively small number of measurements, but the L1-minimization algorithm is very slow (not real-time). Practical compressed sensing algorithms therefore seek to balance the number of required measurements against the accuracy of the reconstruction and the computational complexity. CS provides a radically different model of codec design compared to conventional codecs.
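As a concrete (and deliberately small) illustration of L1-minimization recovery, the sketch below poses basis pursuit (minimize ||x||_1 subject to Ax = y) as a linear program and solves it with SciPy's HiGHS-backed linprog. The sizes, seed, and Gaussian sensing matrix are assumptions for illustration only:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 60, 25, 4                      # signal length, measurements, sparsity

# k-sparse ground truth and a random Gaussian sensing matrix
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k) + 2.0
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true                           # m << n random measurements

# Basis pursuit as an LP over z = [x, u]: minimize sum(u)
# subject to A x = y and -u <= x <= u.
c = np.concatenate([np.zeros(n), np.ones(n)])
I = np.eye(n)
res = linprog(c,
              A_ub=np.block([[I, -I], [-I, -I]]),   # x - u <= 0, -x - u <= 0
              b_ub=np.zeros(2 * n),
              A_eq=np.hstack([A, np.zeros((m, n))]),
              b_eq=y,
              bounds=[(None, None)] * n + [(0, None)] * n,
              method="highs")
x_hat = res.x[:n]
```

With these sizes the recovery is typically exact to solver precision, but the LP is far slower than a transform-and-threshold encoder, which is the trade-off the passage notes.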
  • Basri and Jacobs (“Lambertian Reflectance and Linear Subspaces,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2/03), henceforth referred to as LRLS, have shown that Lambertian objects (whose surfaces reflect light in all directions) can be well approximated by a small (9-dimensional) linear subspace of LRLS “basis images” based on spherical harmonic functions.
  • The LRLS basis images can be visualized as versions of the object under different lighting conditions and textures.
  • The LRLS basis images thus depend on the structure of the object (through its surface normals), the albedo of the object at its different reflection points, and the illumination model (which follows Lambert's cosine law, integrated over direction, to produce spherical harmonic functions). Under the assumptions of the model, the 9-D subspace captures more than 99% of the energy intensity in the object image.
  • The low dimensionality of the appearance subspace indicates a greater redundancy in the data than is available to conventional compression schemes.
  • IC: inverse compositional algorithm
  • A common dimensionality reduction technique involves the application of linear transformations on norm-preserving bases.
  • Reduction of an SVD representation refers to the deletion of certain singular value/singular vector pairs in the SVD to produce a more computationally and representationally efficient representation of the data.
  • The SVD factorization is effectively reduced by zeroing all singular values below a certain threshold and deleting the corresponding singular vectors. This magnitude thresholding results in a reduced SVD with r singular values (r < N) that is the best r-dimensional approximation of the data matrix D from an L2-norm perspective.
  • The reduced SVD is given by D_r = U_r S_r V_r^T, where U_r is M × r, S_r is an r × r diagonal matrix, and V_r is N × r.
  • The singular value decomposition is a factorization of a data matrix that leads naturally to minimal (compact) descriptions of the data.
  • D: the M × N data matrix
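The reduction can be sketched in a few lines of NumPy (the matrix sizes, noise level, and threshold are illustrative assumptions). By the Eckart-Young theorem, the Frobenius-norm truncation error equals the root sum of squares of the dropped singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 12, 8

# Data matrix D: low-rank structure plus small noise
D = (rng.standard_normal((M, 3)) @ rng.standard_normal((3, N))
     + 0.01 * rng.standard_normal((M, N)))

U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Zero singular values below a threshold and delete their vectors
keep = s >= 0.1 * s[0]
Ur, Sr, Vr = U[:, keep], np.diag(s[keep]), Vt[keep].T
D_r = Ur @ Sr @ Vr.T                     # D ~ Ur Sr Vr^T  (M x r)(r x r)(r x N)

# Best rank-r approximation: error comes only from the dropped values
err = np.linalg.norm(D - D_r, "fro")
expected = np.sqrt(np.sum(s[~keep] ** 2))
```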
  • Matching pursuit is an iterative algorithm for deriving efficient signal representations. Given the problem of representing a signal vector s in terms of a dictionary D of basis functions (not necessarily orthogonal), MP selects functions for the representation via the iterative process described here.
  • The first basis function in the representation (denoted d1) is selected as the one having maximum correlation with the signal vector. The projection of d1 onto s is subtracted from s to form the residual r1.
  • The next function in the representation (d2) is selected as the one having maximum correlation with the residual r1.
  • The projection of d2 onto r1 is subtracted from r1 to form another residual r2. The same process is then repeated until the norm of the residual falls below a certain threshold.
  • Orthogonal matching pursuit follows the same iterative procedure as MP, except that an extra step is taken to ensure that the residual is orthogonal to every function already in the representation ensemble. While the OMP recursion is more complicated than in MP, the extra computations ensure that OMP converges to a solution in no more than Nd steps, where Nd is the number of functions in the dictionary D.
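The MP/OMP iteration just described can be written compactly; the NumPy sketch below (dictionary size, seed, and the three-atom signal are illustrative assumptions) re-solves a least-squares fit over the selected atoms at each step, which is what keeps the residual orthogonal to the representation ensemble:

```python
import numpy as np

def omp(D, s, tol=1e-8):
    """Orthogonal matching pursuit over dictionary D (columns = atoms)."""
    support, coef, r = [], np.array([]), s.copy()
    for _ in range(D.shape[1]):          # converges in at most Nd steps
        if np.linalg.norm(r) < tol:
            break
        # atom with maximum correlation with the current residual
        support.append(int(np.argmax(np.abs(D.T @ r))))
        # re-fit: keeps the residual orthogonal to every selected atom
        coef, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)
        r = s - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
x_true = np.zeros(128)
x_true[[5, 40, 99]] = [1.5, -2.0, 1.0]
s = D @ x_true
x_hat = omp(D, s)
```

Plain MP would instead subtract only the newest atom's projection, which is cheaper per step but can revisit directions already used.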
  • The present invention extends conventional video compression, especially in cases where the redundancy of visual phenomena exceeds the modeling capabilities of the conventional video codec.
  • The present invention extends, and may entirely replace, the existing methods of conventional video compression by employing robust Computer Vision and Pattern Recognition algorithms.
  • The present invention includes feature modeling methods and systems that focus on the segmentation, normalization, and integration of a feature occurring in one or more of the previously decoded frames of the video.
  • Feature-based video compression considers a greater number of previously decoded frames, and within each of those frames a greater area and a much higher number of pels, than conventional compression, which considers fewer frames, smaller areas, and fewer pels.
  • Conventional compression provides an implicit form of segmentation at the macroblock level, by utilizing multiple reference frames, macroblock partitioning, sub-macroblock partitioning, and motion compensated prediction. Further, conventional compression utilizes motion compensated prediction to model the spatial deformation occurring in the video and transform coding to model the appearance variations.
  • The present invention extends these modeling techniques of disparate signal elements with more complex models, including spatial segmentation masks, regular mesh deformation, feature affine motion, three-dimensional feature motion, three-dimensional illumination, and other Computer Vision and Pattern Recognition modeling techniques. Note that throughout the present text, “individual modes” and “disparate signal elements” are equivalent.
  • The present invention facilitates the identification and segmentation of individual modes of the video signal.
  • The concept of reference frame processing that is used in conventional motion compensated prediction is utilized in the present invention to facilitate this identification and segmentation.
  • The conventional motion compensated prediction process selects, at the macroblock level, portions of the signal from one or more reference frames. Note that the conventional motion compensated prediction process typically makes such a selection based on some rate-distortion metric.
  • The present invention is able to apply analysis to the past frames to determine the frames that will have the highest probability of providing matches for the current frame. Additionally, the number of reference frames can be much greater than the typical one to sixteen reference frame maximum found in conventional compression.
  • The reference frames may number up to the limit of system memory, assuming that there are a sufficient number of useful matches in those frames. Further, the intermediate form of the data generated by the present invention can reduce the amount of memory required for storing the same number of reference frames.
  • The present invention infers the segmentation of the video signal based on this reference frame processing.
  • The macroblocks (blocks of pixels) in the current frame may select, through the motion compensated prediction process, tiles of pels from previously decoded frames such that those tiles are separated both spatially and temporally, meaning that the source tiles used in the motion compensated prediction process may come from different frames.
  • The separation implied by selecting source tiles from different frames, for predicting current frame macroblocks, indicates the potential that different signal modes are being identified. When the identified separate signal modes can be encoded in a more compact manner, this further verifies that separate modes have been identified.
  • These separate modes are called “features.”
  • When these features are persistent over many frames of the video and can be correlated, a new type of redundancy in the video has been identified.
  • The present invention leverages this redundancy through the creation of appearance and deformation models in order to achieve further compression beyond what is available to conventional compression. Further, as features are identified within reference frames, reference frame processing is biased toward using reference frames containing features; this yields an increased probability that the reference frame processing will further yield a segmentation of the modes present in the signal.
  • Video data formed of a series of video frames may be received and encoded.
  • One or more instances of a candidate feature may be detected in one or more of the video frames.
  • The detection of the candidate feature involves determining positional information for instances in the one or more previously decoded video frames.
  • The positional information includes a frame number, a position within that frame, and a spatial perimeter of the instance.
  • The candidate feature can be a set of one or more detected instances.
  • A motion compensated prediction process can be used to predict a portion of a current video frame in the series using one or more previously decoded video frames.
  • The motion compensated prediction process can be initialized with positional predictions.
  • The positional predictions can provide positional information from detected feature instances in previously decoded video frames.
  • One or more of the instances can be transformed by augmenting the motion compensated prediction process.
  • A feature along with the transformed instances can be defined.
  • The one or more of the instances may be transformed using a linear transform.
  • The defined feature including the transformed instances can be used to create a first feature-based model.
  • The first feature-based model can enable prediction, in the current frame, of an appearance and a source position of a substantially matching feature instance.
  • The substantially matching feature is the best match determined using a rate-distortion metric.
  • The substantially matching feature instance can be a key feature instance.
  • The key feature instance can be the first feature-based model's synthesis of the feature instance in the current frame.
  • The first feature-based model can be compared to a conventional video encoding model of the one or more defined features, and the comparison can be used to determine which model enables greater encoding compression.
  • The results of the comparing and determining step can be used to guide the encoding process in applying feature-based encoding to portions of one or more of the video frames, and applying conventional video encoding to other portions of the one or more video frames.
  • An instance of a candidate feature can be detected by identifying a spatially continuous group of pels having substantially close spatial proximity.
  • The identified pels can be used to define a portion of one of the one or more video frames.
  • The group of pels can include one or more macroblocks or portions of one or more macroblocks.
  • The motion compensated prediction process can be used to select, from a plurality of candidate feature instances, one or more instances that are predicted to provide encoding efficiency.
  • A segmentation of the current instance of the candidate feature can be determined from other features and non-features in the current video frame. The segmentation can be based on the motion compensated prediction process's selection of predictions from unique previously decoded video frames.
  • The motion compensated prediction process can be initialized using positional information for feature instances belonging to one or more features (such features having instances in the current frame coincident with the video portion), where the video portion is in the current frame and the positional information corresponds to feature instances associated with the same feature in previously decoded video frames.
  • A second feature-based model can be formed.
  • The second feature-based model can be formed using the first feature-based model as a target of prediction for one or more motion compensated predictions from one or more feature instances. This second feature-based model yields a set of predictions of the first feature-based model. Once the set of predictions is combined with the first feature-based model, the set of predictions can become the second feature-based model.
  • The second feature-based model can be used to model the residual from the first feature-based model. Structural variation and appearance variation can be modeled from the second feature-based model relative to the residual.
  • The residual can be encoded with the feature instance, which yields appearance and deformation parameters. The parameters can be used to reduce the encoding size of the residual.
  • One or more features can include one or more aggregate features.
  • The aggregate features are based on one or more of the instances of the candidate feature.
  • The aggregate features can be created by aggregating the instances of different candidate features into an aggregate candidate feature.
  • The set of instances of the aggregate candidate features can be used to form a region substantially larger than the original instances of un-aggregated candidate features.
  • The larger region can be formed through the identification of coherency among the instances of the candidate feature in the set.
  • Coherency can be defined as appearance correspondences in the instances substantially approximated by a lower parameter motion model.
  • The second feature-based model can provide an optional rectangular area extent of pels associated with that instance in the decoded frame relative to the spatial position.
  • The second feature-based model can be derived by modeling prior normalized instances of the feature.
  • The prior normalized instances can be any one of the following: the instance in the current frame; an instance from a previously decoded frame that is substantially recent temporally; or an average of the instances from the previously decoded video frames.
  • The appearance model can be represented by a PCA decomposition of the normalized second feature-based model instances.
  • A deformation model can be determined using the spatial variation of correspondences in the feature instances of each set as compared to their second feature-based model instances. For each feature instance in the set, one or more of the following can be used to approximate variation in the deformation instances for the deformation model: a motion compensated prediction process; mesh deformation; and a motion model with a substantially reduced parameterization.
  • The deformation instances can be integrated into the deformation model.
  • The variation in the deformation model can be represented by a PCA decomposition.
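A toy NumPy sketch of such a PCA model (the synthetic instances stand in for normalized pels of a tracked feature; all sizes and data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pels, n_instances, n_comp = 256, 20, 3

# Normalized feature instances (one per row): a few underlying
# appearance modes plus a little noise.
modes = rng.standard_normal((n_comp, n_pels))
weights = rng.standard_normal((n_instances, n_comp))
instances = weights @ modes + 0.01 * rng.standard_normal((n_instances, n_pels))

mean = instances.mean(axis=0)
_, _, Vt = np.linalg.svd(instances - mean, full_matrices=False)
basis = Vt[:n_comp]                      # PCA basis of the variation

def encode(instance):
    """Compact appearance parameters for one instance."""
    return basis @ (instance - mean)

def decode(params):
    """Synthesize the instance back from its parameters."""
    return mean + params @ basis

new_instance = weights[0] @ modes        # an instance explained by the model
params = encode(new_instance)
rel_err = (np.linalg.norm(new_instance - decode(params))
           / np.linalg.norm(new_instance))
```

Each instance is then carried as just three parameters instead of 256 pels; a deformation model can be decomposed the same way, with per-pel displacements in place of intensities.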
  • Appearance parameters and deformation parameters may be predicted.
  • The predicted parameters can be used during the synthesis of the current instance using a feature-based model.
  • The appearance and deformation models, as well as temporally recent parameters, can be used to interpolate and extrapolate parameters from the feature-based model to predict pels in the current frame.
  • The values of the synthesis for the temporally recent feature instances may be either linearly interpolated or linearly extrapolated based on which method has yielded the most accurate approximation for those instances.
  • The actual parameters for the model can optionally be differentially encoded relative to the predicted parameters.
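A minimal sketch of that choice between linear interpolation and extrapolation, followed by differential encoding of the actual parameters (the parameter trajectory is fabricated for illustration):

```python
import numpy as np

def extrapolate(p_prev, p_last):
    return 2.0 * p_last - p_prev              # linear extrapolation

def interpolate(p_prev, p_last):
    return 0.5 * (p_prev + p_last)            # linear interpolation (midpoint)

def predict(history):
    """Choose the method that approximated the most recent known
    instance more accurately, then apply it to predict the next one."""
    p0, p1, p2 = history[-3], history[-2], history[-1]
    err_ex = np.linalg.norm(extrapolate(p0, p1) - p2)
    err_in = np.linalg.norm(interpolate(p0, p1) - p2)
    method = extrapolate if err_ex <= err_in else interpolate
    return method(p1, p2)

# Smoothly drifting appearance/deformation parameters (3 per instance)
history = np.arange(6)[:, None] + np.array([0.3, -0.1, 0.2])
actual = 6.0 + np.array([0.3, -0.1, 0.2])     # true parameters of the next instance
pred = predict(history)
residual = actual - pred                      # differentially encoded; small when prediction is good
```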
  • The motion compensated prediction process can operate on a selection of a substantially larger number of the previously decoded video frames than in conventional video data encoding.
  • The selection of previously decoded video frames need not rely on user supervision.
  • Conventional video encoding can be augmented by an instance prediction process that enables greater compression of portions of one or more of the video frames in memory, when forming a prediction of portions of the current frame.
  • The instance prediction process can use the feature-based model to determine one or more instances of the defined feature that are incident to a target macroblock being encoded. In this way, the instance prediction process can create the predicted portions of the current frame.
  • The feature-based model can be used to synthesize pels to predict portions of the current frame.
  • A probability can be assigned to each of the previously decoded video frames.
  • The probability can be based on the combined predicted encoding performance improvement for the frame, determined using positional predictions from the motion compensated prediction process.
  • The probability can be defined as the combined encoding performance of the motion compensated prediction process, which was utilized during the analysis of the first feature-based model and a second feature-based model for the current frame.
  • An indexing based on sorting the previously decoded video frames can be created based on their probability, from best to worst. The indexed list can be truncated based on computational and memory requirements.
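The sorting-and-truncation step can be sketched in a few lines (the per-frame probabilities and the budget are hypothetical values):

```python
# Assigned probability per previously decoded frame: the combined
# predicted encoding-performance improvement from positional predictions.
frame_probability = {0: 0.10, 1: 0.35, 2: 0.05, 3: 0.80, 4: 0.40}
max_refs = 3                              # computational/memory budget

# Index frames from best to worst probability, then truncate the list.
ranked = sorted(frame_probability, key=frame_probability.get, reverse=True)
reference_index = ranked[:max_refs]
```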
  • A feature-based model may be formed using one or more of the defined features.
  • The feature-based model may include positional information for the defined features.
  • The positional information may include a position and a spatial perimeter of defined features from the previously decoded video frames.
  • The positional information may include information regarding the spatial position of a region within a specific frame, and a rectangular extent of the region in that frame.
  • The feature-based model may specify which previously decoded video frames (or portions thereof) are associated with the defined feature.
  • The defined features may be normalized and segmented from the video data using macroblock motion compensated prediction.
  • The defined features may be normalized using the feature-based model.
  • The macroblock motion compensated prediction may use the feature position in the previously decoded image frame as a positional prediction.
  • The resulting normalization provides the prediction of the feature in the current video frame.
  • The feature-based model may be compared to another model resulting from conventional encoding of the same video data.
  • The comparison can be used to determine which model enables greater encoding compression efficiency.
  • Different encoding techniques may be applied to the different parts of the video data depending on the results of the encoding comparison. In this way, differential encoding can be provided such that the system is capable of selecting a different video encoding scheme for each portion of video data depending on whether feature-based encoding or conventional based encoding provides more compression efficiency.
  • A defined feature may be represented as a set of instances of the feature in one or more video frames.
  • Each instance may include: a reference to a frame in which the instance occurs; a spatial position associated with the instance within that frame; and an optional rectangular area extent of pels associated with that instance in that frame relative to the spatial position.
  • The spatial position may provide a prediction of matches for encoding portions of one or more of the video frames.
  • An appearance model may be provided for each defined feature to model variation of the defined feature from instance to instance in the set.
  • The appearance model may be derived by modeling prior normalized instances of the feature.
  • The prior normalized instances may be normalized using any combination of a motion compensated prediction process, mesh deformation, and parameter-reduced motion modeling (e.g. affine).
  • The normalization can be used to build a deformation model that may be used to model the spatial variation of correspondences in the feature instances of each set. For each feature instance in the set, one or more of the following may be used to determine deformation instances for the deformation model: a motion compensated prediction process, mesh deformation, and parameter-reduced motion modeling.
  • The deformation instances may be integrated into the deformation model.
  • The deformation model may be represented by a decomposition using Principal Component Analysis (PCA).
  • PCA: Principal Component Analysis
  • The deformation model may be represented by a decomposition using any decomposing algorithm.
  • The motion compensation prediction process may operate, without supervision, on a substantially greater number of the previously decoded video frames than in conventional video data encoding.
  • The conventional video encoding may include motion-compensated block-based compression.
  • The conventional video encoding can be augmented by a residual reduction process that enables greater compression of portions of the video frames in memory when forming a residual frame.
  • The residual reduction process can use the feature-based model to determine one or more instances of the defined feature that are incident to a target macroblock being encoded to form the residual frame.
  • Pels may be synthesized using the feature-based models to predict the residual frame.
  • The feature-based model may be used for reference frame index prediction. The synthesized pels may be reused for other residual reductions in response to determining that one or more instances of the defined feature overlap more than one macroblock in the current frame.
  • The synthesized pels may be reused for other residual reductions in response to determining that one or more instances of the defined feature represent one macroblock, when one or more instances of the defined feature substantially match positional information for a macroblock in the current frame.
  • Appearance and deformation may be modeled based on the feature-based model.
  • The appearance model and deformation model may be used, along with a historical set of parameters in those models, to interpolate and extrapolate parameters from the feature-based model to predict pels in the current frame.
  • Higher-order quadratic and even extended Kalman filter models can be used to predict the appearance and deformation parameters.
  • The prediction of the parameters from the feature-based model enables a reduction in the magnitude of the residual parameters, resulting in a lower precision and therefore lower bit rate representation of the parameters required to predict pels in the current frame.
  • One or more macroblocks from one or more frames may be selected using the motion compensated prediction process.
  • Pels from macroblocks in a PCA model may be linearly combined, and the PCA model parameters may be interpolated. Equivalently, any decomposing algorithm can be used in place of PCA, utilized based on its relative benefit.
  • Substantially small spatial regions may be identified in the video frames. Coherency criteria may be used to identify spatial regions that can be combined into substantially larger spatial regions. For a larger spatial region, the suitability of the larger spatial region to be a defined feature can be determined by encoding a feature-based model of the larger spatial region. The smaller region may be a defined feature, and the larger region may be a defined feature.
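One way to sketch the coherency test is to merge regions whose motion agrees with a single low-parameter (here purely translational) model; the regions, motion vectors, and tolerance below are hypothetical:

```python
import numpy as np

def merge_coherent(regions, tol=1.0):
    """Greedily group regions whose motion vector stays within tol of a
    group's mean motion, approximating 'coherency' with one
    translational model per group."""
    groups = []
    for pos, mv in regions:
        for group in groups:
            mean_mv = np.mean([m for _, m in group], axis=0)
            if np.linalg.norm(np.asarray(mv) - mean_mv) <= tol:
                group.append((pos, mv))
                break
        else:
            groups.append([(pos, mv)])
    return groups

# (top-left position, motion vector) for small candidate regions
regions = [((0, 0), (2.0, 1.0)), ((0, 16), (2.1, 0.9)),
           ((16, 0), (1.9, 1.1)), ((64, 64), (-5.0, 0.0))]
groups = merge_coherent(regions)          # three coherent regions merge; one stays alone
```

A fuller implementation would fit an affine or other reduced-parameter model per group and test each merged region's encoding cost, as the passage describes.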
  • Feature-based compression can include object-based compression processes.
  • Object based detection, tracking, and segmentation may be applied to a feature instance in the current frame or in previously decoded frames.
  • An intermediate form of the feature instance may be derived using spatial segmentation.
  • The spatial segmentation process may segment a foreground object from the non-object background.
  • The resulting segmentation may provide a pel-level correspondence of a given object in a feature instance as it exists in one frame to its occurrence in a next frame.
  • The pel data associated with the object is resampled, and subsequently the spatial positions of the resampled pel data are restored using models.
  • The resampling effectively normalizes the object pel data from one frame to the next and provides an intermediate form of the video data that has computational and analytical advantages for video processing purposes.
  • Object-based normalization and modeling processes may be applied to a feature instance (or portions thereof) in the current frame or in previously decoded frames during the feature-based encoding process.
  • Correspondence modeling, deformation modeling, appearance modeling, contour modeling, and structural modeling may be used to model a feature instance (or portions thereof) in the current frame or in previously decoded frames.
  • a defined feature may be free of correspondence to salient entities (objects, sub-objects).
  • the salient entities may be determined through supervised labeling of detected features as belonging to or not belonging to an object.
  • the defined features may contain elements of two or more salient objects, background, or other parts of the video frames.
  • One or more features may constitute an object.
  • a defined feature may not correspond to an object.
  • a defined feature may not be included in any object. In this way, feature-based compression can be more flexible and versatile than object-based detection.
  • although defined features can include objects and be included in objects, defined features do not need to be object-based and can take any form.
  • Compressed Sensing is applied to the feature based encoding technique.
  • CS is applied to pels in the video frames having working or defined features.
  • CS may also be applied to the conventional encoding of the remaining pels of the video frames.
  • the video data may be made sparse to increase the effectiveness of the application of CS.
  • CS may be applied to resolve the model parameters from partial parameter measurements.
  • CS can be applied to the residual of the second feature-based model prediction.
  • the application of CS can utilize the average appearance as a measurement and predict the video signal from it.
  • Variance associated with the CS prediction can be removed from the second feature-based model.
  • the feature-based model can be used to focus on a more compact encoding of the remaining signal; CS encoding can be applied to the remaining pels in the one or more video frames and to remaining video frames.
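The patent does not name a particular CS recovery algorithm. As one hedged illustration of the general technique, orthogonal matching pursuit recovers a sparse signal from fewer linear measurements than the signal's dimension, which is the sense in which CS can "resolve model parameters from partial parameter measurements":

```python
import numpy as np

def omp(A, y, sparsity):
    """Orthogonal matching pursuit: recover a sparse x with y ~= A x
    from m < n measurements -- the core recovery step in CS."""
    residual = y.copy()
    support = []
    for _ in range(sparsity):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares fit restricted to the chosen support
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
    x = np.zeros(A.shape[1])
    x[support] = coeffs
    return x

rng = np.random.default_rng(1)
n, m, k = 64, 24, 3            # signal length, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)   # random measurement matrix
y = A @ x_true                                 # m compressed measurements
x_hat = omp(A, y, k)
```

The recovered `x_hat` has at most `k` nonzero entries, reconstructed from only 24 of the 64 "measurements" the dense signal would otherwise require.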
  • a hybrid codec decoder may be provided that uses feature-based decompression for decoding video data.
  • Encoded video data may be decoded by determining on a macroblock level whether there is an encoded feature in the encoded video data.
  • the encoded feature may include feature-based models. Where an encoded feature does not exist, macroblocks in the encoded video data may be decoded using conventional video decompression.
  • the decoder may respond to the detection of an encoded feature in the encoded video data by separating the feature encoded parts from the encoded video data. By separating the feature encoded parts, the system is able to synthesize the encoded feature separately from the conventionally encoded parts in the video stream.
  • Feature parameters from the encoded feature parts may be associated with feature models included with the encoded feature.
  • the feature parameters may be used by the decoder to synthesize the encoded feature.
  • the conventionally compressed parts/portions of the video data may be combined with the synthesized feature to reconstruct the original video frame.
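The decode-side flow in the bullets above might be sketched as follows; every helper here is a hypothetical stand-in for the conventional and feature-based decode paths, not the patent's implementation:

```python
def decode_frame(encoded_macroblocks):
    """Hybrid decode: per macroblock, choose feature-based or
    conventional decompression, then composite the results."""
    conventional, synthesized = [], []
    for mb in encoded_macroblocks:
        if mb.get("feature_model") is not None:
            # feature parameters drive the model to synthesize pels
            pels = synthesize_feature(mb["feature_model"], mb["params"])
            synthesized.append(pels)
        else:
            conventional.append(decode_conventional(mb["bits"]))
    # combine conventionally decoded parts with synthesized features
    return composite(conventional, synthesized)

def synthesize_feature(model, params):
    return [model * p for p in params]   # stand-in synthesis

def decode_conventional(bits):
    return list(bits)                    # stand-in conventional decode

def composite(conventional, synthesized):
    return conventional + synthesized

# One feature-encoded and one conventionally encoded macroblock.
frame = decode_frame([
    {"feature_model": 2, "params": [1, 2]},
    {"feature_model": None, "bits": [3]},
])
```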
  • a video codec is capable of handling a plurality of compressed video signal modes.
  • a codec encoder provides feature-based video compression.
  • the codec encoder provides conventional video compression.
  • a codec decoder is responsive to different video signal modes and is capable of providing feature-based video decompression, and conventional video decompression, depending on the contents of the video signal (e.g. the video signal mode).
  • the codec may determine which type of video compression is appropriate based on whether feature-based encoding or conventional based encoding provides more compression efficiency for one or more features in video frames of the video signal.
  • FIG. 1 is a schematic diagram of an embodiment (hybrid codec) of the present invention.
  • FIG. 2 is a block diagram of a video compression architecture embodied in encoders of the present invention.
  • FIG. 3 is a schematic diagram of a computer network environment in which embodiments of the present invention are deployed.
  • FIG. 4 is a block diagram of the computer nodes in the network of FIG. 3 .
  • FIG. 5 is a diagram depicting feature modeling representing one embodiment of the present invention.
  • FIG. 6 is a diagram describing the prediction process according to an embodiment of the present invention.
  • FIG. 7 is a block diagram of an embodiment (codec) of the present invention.
  • FIG. 8 is a diagram depicting feature tracking according to an embodiment of the present invention.
  • the invention is implemented in a software or hardware environment.
  • FIG. 3 illustrates a computer network or similar digital processing environment in which the present invention may be implemented.
  • Client computer(s)/devices 350 and server computer(s) 360 provide processing, storage, and input/output devices executing application programs and the like.
  • Client computer(s)/devices 350 can also be linked through communications network 370 to other computing devices, including other client devices/processes 350 and server computer(s) 360 .
  • Communications network 370 can be part of a remote access network, a global network (e.g., the Internet), a worldwide collection of computers, Local area or Wide area networks, and gateways that currently use respective protocols (TCP/IP, Bluetooth, etc.) to communicate with one another.
  • Other electronic device/computer network architectures are suitable.
  • FIG. 4 is a diagram of the internal structure of a computer (e.g., client processor/device 350 or server computers 360 ) in the computer system of FIG. 3 .
  • Each computer 350 , 360 contains a system bus 479 , where a bus is a set of actual or virtual hardware lines used for data transfer among the components of a computer or processing system.
  • Bus 479 is essentially a shared conduit that connects different elements of a computer system (e.g., processor, disk storage, memory, input/output ports, etc.) that enables the transfer of information between the elements.
  • Attached to system bus 479 is I/O device interface 482 for connecting various input and output devices (e.g., keyboard, mouse, displays, printers, speakers, etc.) to the computer 350 , 360 .
  • Network interface 486 allows the computer to connect to various other devices attached to a network (for example the network illustrated at 370 of FIG. 3 ).
  • Memory 490 provides volatile storage for computer software instructions 492 and data 494 used to implement an embodiment of the present invention (e.g., hybrid codec, video encoder compression code and decoder code/program routine detailed above).
  • Disk storage 495 provides non-volatile storage for computer software instructions 492 (equivalently “OS program”) and data 494 used to implement an embodiment of the present invention.
  • Central processor unit 484 is also attached to system bus 479 and provides for the execution of computer instructions. Note that throughout the present text, “computer software instructions” and “OS program” are equivalent.
  • the processor routines 492 and data 494 are a computer program product (generally referenced 492 ), including a computer readable medium (e.g., a removable storage medium such as one or more DVD-ROM's, CD-ROM's, diskettes, tapes, etc.) that provides at least a portion of the software instructions for the invention system.
  • Computer program product 492 can be installed by any suitable software installation procedure, as is well known in the art.
  • at least a portion of the software instructions may also be downloaded over a cable, communication and/or wireless connection.
  • the invention programs are a computer program propagated signal product 307 embodied on a propagated signal on a propagation medium (e.g., a radio wave, an infrared wave, a laser wave, a sound wave, or an electrical wave propagated over a global network such as the Internet, or other network(s)).
  • Such carrier medium or signals provide at least a portion of the software instructions for the present invention routines/program 492 .
  • the propagated signal is an analog carrier wave or digital signal carried on the propagated medium.
  • the propagated signal may be a digitized signal propagated over a global network (e.g., the Internet), a telecommunications network, or other network.
  • the propagated signal is transmitted over the propagation medium over a period of time, such as the instructions for a software application sent in packets over a network over a period of milliseconds, seconds, minutes, or longer.
  • the computer readable medium of computer program product 492 is a propagation medium that the computer system 350 may receive and read, such as by receiving the propagation medium and identifying a propagated signal embodied in the propagation medium, as described above for computer program propagated signal product.
  • carrier medium or transient carrier encompasses the foregoing transient signals, propagated signals, propagated medium, storage medium and the like.
  • the present invention provides a hybrid (feature-based and conventional) codec method ( FIG. 1 ) with a means of detecting 113 , separating 115 , modeling 117 , encoding 111 , and decoding 124 features in video while allowing a conventional codec 118 to encode and decode the non-features as well as the features that cannot be advantageously processed through the feature encoder/decoder.
  • FIG. 1 illustrates that a subject video signal input (video data formed of a series of image frames) 110 is encoded by the invention hybrid codec 121 .
  • the hybrid codec contains the encoding decision heuristics and processes the video signal as follows:
  • the detection of features is primarily accomplished through the identification of groups of pels in close proximity that exhibit complexity.
  • Complexity is generally defined as any metric indicating that the encoding of the pels exceeds a level that would be encoded efficiently by conventional video compression.
  • This grouping of pels in close proximity provides segmentation of the detected feature (at 115 ) from the background and other features. The grouping is subsequently analyzed to determine if the complexity can be advantageously modeled using the invention's feature modeling 117 .
  • the conventional video encoding mechanism employing reference frame processing used in the motion compensated prediction of the current frame is utilized in this process. Because the comparison 119 has employed conventional reference frame processing, a segmentation of the current frame is yielded (based on the selection of predictions from different reference frames). The selection of pels (more typically as macroblocks) in one reference frame versus another reference frame indicates a segmentation of the features in the frame, and in the subject video 110 itself.
  • the resulting encoding 120 of the subject video signal input 110 includes a conventional video encoding stream (output of conventional encoder 118 ) accompanied by the additional encoded information needed to regenerate the features in the reference frames.
  • the hybrid codec decoder 122 illustrates decoding the encoded video in order to synthesize (approximate) the input video signal 110 .
  • the hybrid codec decoder makes a determination 123 on a sub-frame level (e.g., the macroblock level) whether or not there is an encoded feature in the encoded video. If an encoded feature does not exist, the conventional macroblock, or non-feature macroblock, is decoded conventionally. If an encoded feature is encountered in the encoded video stream, the hybrid codec separates 125 the feature-encoded parts from the conventionally encoded parts in order to synthesize each separately, combining the parts after synthesis occurs.
  • the hybrid codec uses the encoded feature parameters with the feature models that were created by the decoder (models made exactly and in parallel to those made in the encoder) to synthesize the feature 124 . Then the conventionally encoded feature parts and the feature-encoded parts are composited 126 to produce a complete feature synthesis. Next the hybrid codec at 127 combines the feature synthesis with the non-feature synthesis to yield a fully synthesized video signal 128 .
  • FIG. 7 is a representation of an embodiment of the present invention which utilizes feature-based encoding as a replacement in part, and in some cases in full, for conventional encoding. Detecting 710 , tracking 720 , comparing 730 , modeling 740 , encoding 750 , and decoding 760 features in a video signal are illustrated.
  • the detection of features is primarily accomplished through the identification of spatially proximate groups of pels that exhibit complexity such that they can be encoded/modeled more efficiently than conventional means. These groups of pels effectively separate the detected feature ( 710 ) from the non-feature pels around it, as also noted in 115 .
  • the detected features, or feature instance candidates, or simply feature candidates are further analyzed to correlate the groups of pels over two or more frames. This correlation confirms that the feature instances are representative of a discrete entity in the video frames that can be tracked 720 thereby confirming additional redundancy in the video that can be potentially reduced through modeling the feature 740 .
  • the feature is tracked via the identification of the feature's instance (equivalently region) within the current frame along with instances of the feature in one or more other frames, also noted in 117 .
  • the term “feature instance” is equivalent to “region”.
  • the term “instance” is equivalent to “feature instance” and “region” when it references them.
  • each individual feature instance is considered a candidate feature, and candidates are combined into a full-fledged feature by grouping them into feature sets, or simply features. These instances are analyzed, compared, and classified into feature sets in step 730 through the identification of correspondences between the instances.
  • feature candidates and feature instances are equivalent.
  • the feature sets are analyzed to obtain a model of the deformation variation and appearance variation of the feature instances.
  • the deformation variation between feature instances is determined through a deformation modeling process.
  • the deformation modeling process compares two or more instances in order to determine the spatial pel resampling that would be required to reduce the per pel differences between the instances.
  • Feature candidates are modeled within step 740 , which applies multiple analysis techniques to refine the sampled regions.
  • Feature encoding 750 of the video stream utilizes the feature models and encodes the video stream in part, or in full, without the use of conventional video encoding.
  • the decoding 760 synthesizes the features using the feature models in the inverse of the modeling operations 750 to decode the encoded features into a synthesis of the pels 770 of each feature instance, approximating the feature as it appeared originally in the video.
  • FIG. 6 demonstrates the process of predicting elements within the current video frame by utilizing information from one or more past frames that has been placed in one or more reference frames.
  • the prediction, Method 1 , 640 replicates regions from one or more prior decoded frames 610 into a reference frame 620 .
  • Method 2 , 650 additionally places feature instances 660 , comprised of feature regions 630 - 1 , 630 - 2 , . . . 630 - n , into the reference frame.
  • the insertion of the feature instance directly into the reference frame represents a simple form of the present invention, where, in one further embodiment, the segmentation is simply a rectangular region, and the model of the feature is the feature instance. Additional compression gains can be realized as further modeling techniques are applied to the identified features 660 and used within the reference frames.
  • Prediction Segmentation is the method by which conventional compression's motion compensated prediction method is extended to allow a more accurate prediction.
  • Conventional compression uses the additional reference frames that are generated using the invention's feature modeling methods to increase the accuracy. When parts of these feature reference frames are utilized by the conventional compression scheme, a gain in compression is achieved when the feature encoding is smaller than the conventional encoding would have been.
  • features are represented as a set of elements or feature instances.
  • the feature instances are realized as rectangular regions, each one providing a reference to a unique frame, a spatial position within that frame, and a rectangular extent of the region in that frame.
  • Each instance of the feature represents a sampled image of the feature. Variation in the appearance of the feature from instance to instance is modeled by the feature modeling method.
  • the reference frames are populated with one or more sub frame samples from previously synthesized frames.
  • the sub-frame samples are based on feature instance correspondences between those sub-frame regions in the previously synthesized (decoded) frame and the current frame.
  • the multiple image planes are consolidated into fewer image planes. These fewer image planes have the feature located close to the position expected in the frame to be predicted. Frame reduction is based on consolidating non-overlapping or near-zero spatially overlapping features into the same plane.
  • each consolidated frame is equal to the size of the frame being predicted and the features are spatially close to if not exactly at the position expected by the conventional motion compensated prediction mechanism.
  • FIG. 5 depicts a feature, 510 - 1 , 510 - 2 , . . . 510 - n that has been detected in one or more frames of the video 520 - 1 , 520 - 2 , . . . 520 - n .
  • a feature would be detected using several different criteria based on both structural information derived from pels and complexity criteria indicating that conventional compression utilizes a disproportionate amount of resources to encode the feature.
  • each feature can further be identified spatially in a frame 520 - 1 , 520 - 2 , . . . 520 - n by a corresponding spatial extent, perimeter, shown in the figure as a “region” 530 - 1 , 530 - 2 , . . . 530 - n.
  • These regions 530 - 1 , 530 - 2 , . . . 530 - n can be extracted, for instance as a simple rectangular region of pel data, and placed into an ensemble, 540 , the whole of the ensemble representing a feature.
  • Each instance of a feature in a frame is a sample of the appearance of the feature. Note that when a sufficient number of these samples are coalesced into an ensemble, they can be used to model the appearance of the feature in those frames, and also in other frames from which the feature was not sampled. Such a model is able to transform the appearance into an encoded set of parameters that can further be decoded through the inverse model to create a synthesis of the feature.
  • Small spatial regions are identified and analyzed to determine if they can be combined based on some coherency criteria into larger spatial regions. These larger spatial regions are then analyzed to determine their suitability as candidate features. Should the region's feature modeling not provide a beneficial encoding, the candidate feature is either discarded or retained for modeling future instances of that feature with subsequent frames. The detection process proceeds until only those candidate features exhibiting an advantageous modeling remain.
  • Spatial regions vary in size from small groups of pels or sub-pels to larger areas that may correspond to actual objects or parts of objects, as implicitly segmented through the macroblock or sub-macroblock partitioning steps of conventional video compression algorithms.
  • the detected features may not correspond to discretely unique and separable entities such as objects and sub-objects. There is no requirement that the features correspond to such entities.
  • a single feature may contain elements of two or more objects or no object elements at all.
  • the critical factor is that the current invention has the potential to process these signal components with greater efficiency than conventional methods; they satisfy the definition of a feature purely by being efficiently modeled by feature-based video compression techniques.
  • Small spatial regions may be aggregated into larger regions in order to identify these larger regions as features. Small regions are aggregated into larger ones through the identification of coherency among them. There are several ways that coherency can be identified including coherent motion, motion compensated prediction, and encoding complexity.
  • Coherent motion may be discovered through higher order motion models. For example, the translational motion of each individual small region is integrated into an affine motion model that is able to approximate the simpler motion models of all of the small regions.
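As a minimal sketch of this integration step (illustrative only; the patent does not prescribe a fitting method), the per-region translations can be fit by least squares to a single affine model v = A x + t:

```python
import numpy as np

def fit_affine(centers, translations):
    """Least-squares affine motion model v = A x + t integrating the
    translational motions of many small regions (one (x, y) center and
    one (vx, vy) translation per region)."""
    n = len(centers)
    # design matrix for the parameter vector [a11 a12 a21 a22 tx ty]
    M = np.zeros((2 * n, 6))
    for i, (x, y) in enumerate(centers):
        M[2 * i]     = [x, y, 0, 0, 1, 0]
        M[2 * i + 1] = [0, 0, x, y, 0, 1]
    b = np.asarray(translations).reshape(-1)
    p, *_ = np.linalg.lstsq(M, b, rcond=None)
    return p[:4].reshape(2, 2), p[4:]

# A coherent pure-translation field fits exactly: A ~ 0, t = the shift.
centers = [(0, 0), (16, 0), (0, 16), (16, 16)]
translations = [(2.0, -1.0)] * 4
A, t = fit_affine(centers, translations)
```

A low residual from the fit is one possible coherency signal: the small regions move as one larger region.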
  • Encoding complexity can be determined through analysis of the bandwidth required by conventional compression to represent one or more of the small regions. Where there is a disproportionate allocation of bandwidth to a certain set of small regions that conventional encoding cannot efficiently compress and additionally may not be able to correlate as being redundant from frame to frame, these regions can potentially be aggregated into a feature whose encoding complexity may indicate the presence of a phenomenon that feature modeling would better represent.
  • a set of known frames are each completely partitioned into uniform tiles arranged in a non-overlapping pattern.
  • Each tile is analyzed as an independent sampled region of pels that is determined in practice to contain enough information to characterize the feature.
  • the current invention uses these sampled regions to produce multiple classifications which, in turn, are used in training a classifier. Note that the final position of any feature may differ from this initial positioning.
  • a further embodiment generates sampled regions from the defined tiles and a tiling that overlaps those tiles.
  • the overlapping sampling may be offset so that the center of each overlapping tile occurs at the intersection of the corners of every four underlying tiles. This over-complete partitioning is meant to increase the likelihood that an initial sampling position will yield a detected feature.
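A small sketch of this over-complete partitioning (illustrative only): the base tiling is non-overlapping, and a second tiling is offset by half a tile so its tiles are centered on the corners where four base tiles meet.

```python
def tilings(width, height, tile):
    """Base non-overlapping tiling plus an overlapping tiling offset by
    half a tile, so each overlapping tile is centered on a corner shared
    by four base tiles.  Positions are (x, y) of the tile's top-left."""
    base = [(x, y) for y in range(0, height - tile + 1, tile)
                   for x in range(0, width - tile + 1, tile)]
    half = tile // 2
    overlap = [(x, y) for y in range(half, height - tile + 1, tile)
                      for x in range(half, width - tile + 1, tile)]
    return base, overlap

# A 64x64 frame with 16x16 tiles: 4x4 base tiles, 3x3 overlapping tiles.
base, overlap = tilings(64, 64, 16)
```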
  • Other, possibly more complex, topological partitioning methods are also anticipated.
  • a feature modeling predictor classifies sampled regions into clusters with significant probability that a region will have some correspondence to other regions in that same cluster.
  • the feature modeling predictor uses pattern examples derived from the sampled region(s).
  • the features are detected with assistance from spectral profiling (described below in Spectral Profiling section).
  • Spectral profiling provides regions of the frame that may be part of a single feature. This is used as a means of combining sampled regions into a feature.
  • a pattern feature is defined as a spectral feature.
  • the spectral feature is found by transforming the region from its original color space into the HSV color space. The transformed region is then repeatedly sub-sampled until the image vector space of the derived region is of a much smaller dimension than that of the original region. These derived regions are considered the spectral features.
  • the spectral features are clustered using a modified K-means algorithm. The K-means clusters are used to label the original regions based on their spectral classification.
  • a classifier is built based on the edge content of the sampled regions. Each region is transformed into DCT space. The derived feature's DCT coefficients are then summed for the upper triangular matrix and the lower triangular matrix. These sums are then used to form an edge feature space. The feature space is then clustered using K-means, and the original regions are labeled according to the classification of their derived region clusters.
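The edge feature in the bullet above can be sketched as follows (an illustrative reading of the text, not the patent's implementation): a 2-D DCT of the region, followed by sums over the upper- and lower-triangular coefficient matrices to form a two-component edge feature space.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II transform matrix."""
    k = np.arange(n)[:, None].astype(float)
    x = np.arange(n)[None, :].astype(float)
    C = np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    C[0] *= 1 / np.sqrt(n)
    C[1:] *= np.sqrt(2 / n)
    return C

def edge_feature(region):
    """Two-component edge feature: sums of the upper- and
    lower-triangular DCT coefficients of a square region."""
    n = region.shape[0]
    C = dct_matrix(n)
    coeffs = C @ region @ C.T                   # 2-D DCT of the region
    upper = np.sum(np.triu(coeffs, k=1))        # above the diagonal
    lower = np.sum(np.tril(coeffs, k=-1))       # below the diagonal
    return np.array([upper, lower])

# A region with only vertical stripes varies along x only, so all DCT
# energy lands in the first row (upper triangle); the lower sum is ~0.
region = np.tile(np.array([0.0, 1.0] * 4), (8, 1))
f = edge_feature(region)
```

Regions would then be clustered in this feature space (the text names K-means) and labeled by cluster.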
  • the spectral feature and edge pattern feature classifier are used to generate multiple classifications for each region.
  • One embodiment uses a combination of newly detected and previously tracked features as the basis for determining the instances of the same corresponding feature in the current frame.
  • the identification of this feature's instance in the current frame and the inclusion of this instance along with previously occurring instances of the region constitute the tracking of the feature.
  • FIG. 8 demonstrates the use of a feature tracker 830 along with the combination of newly detected and previously tracked features 810 to track and classify features 820 - 1 , 820 - 2 , . . . 820 - n .
  • a general feature detector 850 is used to identify features. Correspondence is determined based on the current frame 840 being matched to the previously detected features 810 .
  • the tracked features are organized into sets of features, or classified as belonging to a previously assembled feature set or to a new feature set.
  • Feature correspondence can initially be determined through conventional gradient descent minimizing an estimate of mean squared error.
  • the resulting spatial displacement gives an approximate position of the feature in the current frame.
  • the template that is used in the search need not be a single region of the feature, but can be any of the regions associated to the feature.
  • the final match is then evaluated in a robust manner as a count of non-overlapping region tiles that meet an MSE threshold.
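This robust evaluation might be sketched as follows (illustrative; tile size and threshold are hypothetical parameters): instead of trusting one whole-region error value, count how many non-overlapping tiles individually fall under an MSE threshold.

```python
import numpy as np

def robust_match_score(candidate, template, tile, mse_threshold):
    """Count the non-overlapping tiles whose per-tile MSE falls under
    the threshold -- a match score robust to localized mismatch."""
    h, w = template.shape
    count = 0
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            a = candidate[y:y + tile, x:x + tile]
            b = template[y:y + tile, x:x + tile]
            if np.mean((a - b) ** 2) < mse_threshold:
                count += 1
    return count

# A candidate matching the template everywhere except one quadrant
# still scores 3 of 4 tiles, rather than failing outright on total MSE.
template = np.zeros((16, 16))
candidate = template.copy()
candidate[:8, :8] += 10.0       # corrupt one quadrant
score = robust_match_score(candidate, template, tile=8, mse_threshold=1.0)
```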
  • By imposing a spatial constraint on the coincidence of two or more regions, the tracker is able to decrease the number of features being tracked and therefore increase the computational efficiency of the tracking.
  • the spatial coincidence of two or more features can also indicate additional feature cases, where the feature may have actually been two features in the past or some other complex feature topology.
  • the tracker modes allow for temporary degenerate tracking states that allow the feature to continue being tracked, but assign the tracked regions a lower priority.
  • the region to be predicted is used to traverse the Region Correspondence Model (RCM) in order to determine regions within the model that would be used to construct a region prediction model.
  • the target region is used to update the RCM thereby generating translational and mid-point normalized correspondences between other regions contained within the RCM and the target region.
  • the resulting pair-wise region correspondences identify the other regions most likely to yield a prediction model for the target region.
  • the present invention includes the assembly of one or more of the best correspondences for a particular target region into a set termed an ensemble of regions.
  • the ensemble of regions can be spatially normalized toward one key region in the ensemble.
  • the region closest to the target region temporally is selected as the key region.
  • the deformations required to perform these normalizations are collected into a deformation ensemble, and the resulting normalized images are collected into an appearance ensemble, as described in U.S. Pat. Nos. 7,508,990, 7,457,472, 7,457,435, 7,426,285, 7,158,680, 7,424,157, and 7,436,981 and U.S. application Ser. No. 12/522,322, all by Assignee.
  • the entire teachings of the above listed patents and application are incorporated by reference.
  • the appearance ensemble is processed to yield an appearance model
  • the deformation ensemble is processed to yield a deformation model.
  • the appearance and deformation models in combination become the feature model for the target region.
  • the method of model formation is a Principal Component Analysis (PCA) decomposition of the ensemble followed by a truncation of the resulting basis vectors.
  • the criterion for truncation may be the intra-ensemble reconstruction error.
  • the method of model formation is Compressed Sensing (CS), described elsewhere, wherein the model parameters are resolved from partial parameter measurements.
  • the target region is projected onto the feature model, yielding the feature parameters for the deformation and appearance modeling of the region. These feature parameters are the encoding of the target region.
  • the feature model parameters for two or more intra-ensemble regions are selected using temporal criteria. These parameters are used to predict the state of the target region given the known interval between the regions themselves and the target regions.
  • One example of a state model is a linear extrapolation of two or more feature parameters given temporal steps. The linear model is used to predict the feature parameters for the target region.
  • where the extrapolated values provide a suitable synthesis (decoding) of the target region, the specification of the target region's feature parameters is not required, or they can be differentially specified relative to the extrapolated parameters.
  • the state model for extrapolation can be of higher order than a simple linear model.
  • an extended Kalman filter is used to estimate the feature parameter state.
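The linear state model named above can be sketched in a few lines (illustrative; frame times and parameter values are invented for the example): two observed parameter vectors are extrapolated to the target region's time, so only a small residual would need encoding.

```python
import numpy as np

def extrapolate_params(params_a, t_a, params_b, t_b, t_target):
    """Linear state model: extrapolate feature parameters observed at
    times t_a and t_b to the target region's time t_target."""
    slope = (params_b - params_a) / (t_b - t_a)
    return params_b + slope * (t_target - t_b)

p1 = np.array([1.0, 4.0])      # feature parameters at frame 1
p2 = np.array([2.0, 6.0])      # feature parameters at frame 2
pred = extrapolate_params(p1, 1, p2, 2, 3)   # predict frame 3
observed = np.array([2.9, 8.2])
residual = observed - pred     # only this small residual is encoded
```

A higher-order model (quadratic, or an extended Kalman filter as the text notes) would replace the linear slope with a richer state estimate.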
  • the combination of classification, registration, and deformation analysis provides a set of information that indicates the probability that two or more regions can be combined into a joint model of appearance and deformation, called a Region Correspondence Model (RCM).
  • the feature detection method (described above) analyzes novel features incrementally.
  • One result of this analysis is the higher probability that a region would correspond to other regions used to construct one of the feature detectors.
  • regions are classified into clusters as in the above-discussed feature detection, and given their respective cluster labels, the inter-cluster regions are analyzed to determine the per region correspondence between region pairs.
  • the classifier described above is used to define clusters of sampled regions whose region source pels are further analyzed and defined through region translational refinement (described below).
  • region correspondences can be further defined in terms of their region deformation analysis (discussed below).
  • the construction of the RCM is achieved incrementally. Two or more regions are used to initially seed the combined classifier/deformation analysis mechanism. The RCM is then updated with new regions that alter the classifiers and the deformation analysis elements.
  • the incremental update of the RCM described above is constructed such that region correspondences for a given model are processed in a traversal order dependent on the base complexity analysis detailed below.
  • the traversal order discussed above, dependent on a base complexity analysis, is part of an iterative process that updates the RCM subject to traversal termination criteria.
  • the termination criteria halt processing at a level that maximizes the RCM's ability to represent correspondences with the greatest probability of reducing complexity when appearance/deformation models are derived from the correspondences.
  • sampled regions are gathered together into a set of training sampled regions. The spatial position of these regions in each frame is refined.
  • a refinement includes an exhaustive comparison of each sampled region to every other sampled region. This comparison comprises two tile registrations: one registration compares a first region to a second region, and the second registration compares the second region to the first region. Each registration is performed at the position of the regions in their respective images. The resulting registration offset, along with the corresponding positional offset, is retained and referred to as a correlation.
  • the correlations are analyzed to determine if multiple registrations indicate that a sampled region's position should be refined. If the refined position in the source frame would yield a lower error match for one or more other regions, then that region position is adjusted to the refined position.
  • the refined position of the region in the source frame is determined through a linear interpolation of the positions of other region correspondences that temporally span the region in the source frame.
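By way of illustration only (not the claimed implementation), the tile registration step underlying the refinement above could be sketched as follows. The sketch assumes square tiles, an exhaustive integer-offset search window, and a sum-of-squared-differences (SSD) match criterion; all function and parameter names are hypothetical.

```python
import numpy as np

def register_tile(frame, tile, center, radius=2):
    """Exhaustively search integer offsets around `center` in `frame` for
    the position whose patch best matches `tile` (minimum SSD). A full
    pairwise refinement would run this in both directions (region A
    registered to region B, and B to A) and retain both correlations."""
    h, w = tile.shape
    best, best_err = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = center[0] + dy, center[1] + dx
            if y < 0 or x < 0 or y + h > frame.shape[0] or x + w > frame.shape[1]:
                continue  # candidate patch falls outside the frame
            patch = frame[y:y + h, x:x + w]
            err = float(np.sum((patch - tile) ** 2))
            if err < best_err:
                best_err, best = err, (dy, dx)
    return best, best_err
```

A region position would be adjusted to the refined position only when the refined match yields a lower error for one or more other regions, as stated above.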
  • the Spectral Profiling method is a statistical “mean tracking and fitting” method. Other examples of such methods described in the literature are CAMSHIFT, mean shift, medoid shift, and their derived methods as applied to detection, tracking, and modeling of spatial probability distributions occurring in images and video frames.
  • the Spectral Profiling method of the present invention starts with analyzing intensity elements, pels of the spectral (color) planes of a region of an image plane, across one or more frames. The intensity elements are processed first through a discretization of the values via a histogram binning method. Then the histogram for a region is used with a tracking mechanism to identify more corresponding regions in subsequent frames that have a similar histogram.
  • the region's set of elements (position, discretization criteria, and histograms) is iteratively refined so it converges on a common set of these elements.
  • the refined set of elements is the spectral profile.
  • the spectral profile method is a feature detection method.
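A minimal sketch of the histogram-binning and tracking steps described above might look like the following. The sketch is illustrative only: it assumes a single intensity plane, fixed-size regions, an L1 histogram distance, and a discrete candidate search rather than full mean-shift iteration; names are hypothetical.

```python
import numpy as np

def histogram_profile(region, bins=8, lo=0.0, hi=256.0):
    """Discretize a region's pel intensities into a normalized histogram
    (the discretization criteria of the spectral profile)."""
    h, _ = np.histogram(region, bins=bins, range=(lo, hi))
    return h / max(h.sum(), 1)

def track_by_profile(frame, profile, size, candidates):
    """Among candidate top-left positions in `frame`, return the one whose
    region histogram is closest (L1 distance) to `profile` -- a crude
    stand-in for the tracking mechanism that finds corresponding regions
    with a similar histogram in subsequent frames."""
    best, best_d = None, np.inf
    for (y, x) in candidates:
        region = frame[y:y + size[0], x:x + size[1]]
        d = float(np.abs(histogram_profile(region) - profile).sum())
        if d < best_d:
            best_d, best = d, (y, x)
    return best, best_d
```

Iterating this match-and-update loop until the position, discretization criteria, and histogram stop changing would yield the converged spectral profile.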
  • the core basis functions for the present invention utilize preexisting data to derive models for the new data.
  • the preexisting data can be obtained through any encoding/decoding scheme and is assumed to be available.
  • the invention analyzes this data to determine a set of candidate pattern data, referred to as feature data, which can include data for both the appearance and deformation of a spatially localized component of the video signal.
  • the preexisting feature data is referred to as the candidate feature vectors and the target data point is referred to as the target vector. Further, the process is applicable to one or more target vectors.
  • a minimal subset of the candidate feature vectors is selected to synthesize the target vector with low error, resulting in a manifold representation that is both compact and accurate.
  • the present invention aggregates a set of candidate feature vectors into what is termed the feature ensemble.
  • the first step in creating the feature ensemble is to select a key vector, a feature vector determined to be a good approximation of the target vector.
  • the key vector is the first vector in the feature ensemble.
  • Other candidate feature vectors are selected for the feature ensemble in the order of their correlation with the key vector (so the second vector in the feature ensemble is the feature vector having next-highest correlation with the key vector). Ordering a feature ensemble in this way is termed key-correlation ordered (KCO).
  • KCO key-correlation ordered
  • the feature ensemble is created using the target vector itself.
  • Candidate feature vectors are selected for the feature ensemble based on their correlation with the target vector. Any ordering method making use of target vector correlation is termed target-correlation ordered (TCO).
  • TCO target-correlation ordered
  • the first feature vector in a TCO feature ensemble is the candidate feature having largest correlation with the target vector.
  • Ur ensemble-to-date
  • the approximate target reconstruction via the ensemble-to-date (Ur) is computed as Ur*Ur′*t and then subtracted from the target vector t to form a residual vector.
  • the next feature vector for the ensemble is then selected as the candidate feature having largest correlation with the residual vector. This TCO method is termed sequential target-correlation ordering (STCO).
  • STCO sequential target-correlation ordering
  • residual vectors are not computed and all candidate feature vectors are selected for the feature ensemble based on their correlation with the target vector itself.
  • This TCO method, termed global target-correlation ordering (GTCO), is faster and simpler than STCO but may result in redundancies in the ensemble.
  • GTCO global target-correlation ordering
  • both TCO methods are generally far superior to the KCO method for selecting the ensemble.
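The two TCO orderings described above can be sketched in a few lines. This is a simplified illustration, not the patented implementation: it assumes unit-norm candidate feature vectors stored as matrix columns, and uses Ur*Ur′*t as the approximate reconstruction from the ensemble-to-date.

```python
import numpy as np

def stco_ensemble(candidates, t, k):
    """Sequential target-correlation ordering: greedily pick the candidate
    most correlated with the current residual, then recompute the residual
    against the ensemble-to-date Ur."""
    chosen = []
    for _ in range(k):
        Ur = candidates[:, chosen]              # ensemble-to-date (M x len(chosen))
        r = t - Ur @ (Ur.T @ t)                 # residual of target vs. Ur*Ur'*t
        corr = np.abs(candidates.T @ r)
        corr[chosen] = -1.0                     # never pick a vector twice
        chosen.append(int(np.argmax(corr)))
    return chosen

def gtco_ensemble(candidates, t, k):
    """Global target-correlation ordering: rank all candidates once by
    correlation with the target itself (faster, but possibly redundant)."""
    corr = np.abs(candidates.T @ t)
    return list(np.argsort(-corr)[:k])
```

For orthogonal candidates the two orderings agree; STCO diverges from GTCO exactly when candidates are mutually correlated, which is when its residual step removes redundancy.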
  • a bitmask is used to transmit the feature vectors that were selected for the feature ensemble.
  • the feature vectors in the feature ensemble and the target vector itself are passed through a discrete wavelet transform (DWT) before SVD-based encoding.
  • DWT discrete wavelet transform
  • the DWT is a well known method for compacting signal information over multiple scales.
  • the DWT is applied with the Daubechies 9-7 bi-orthogonal wavelet.
  • the DWT is applied to each component separately, as the feature vectors are in YUV color space. For example, length-384 YUV vectors require a length-256 DWT on the Y component and length-64 DWTs on the U and V components.
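The per-component transform above can be illustrated with the following sketch. Note the substitution: a single-level Haar DWT stands in for the Daubechies 9-7 bi-orthogonal wavelet named in the text, purely to keep the example self-contained; function names are hypothetical.

```python
import numpy as np

def haar_dwt(x):
    """One level of a Haar DWT (a simple stand-in for the Daubechies 9-7
    wavelet): scaled sums and differences of adjacent samples."""
    x = np.asarray(x, dtype=float).reshape(-1, 2)
    approx = (x[:, 0] + x[:, 1]) / np.sqrt(2)
    detail = (x[:, 0] - x[:, 1]) / np.sqrt(2)
    return np.concatenate([approx, detail])

def dwt_yuv_vector(vec):
    """Apply the DWT to each colour component of a length-384 YUV feature
    vector separately: a length-256 transform on Y, length-64 transforms
    on U and V, as described in the text."""
    y, u, v = vec[:256], vec[256:320], vec[320:384]
    return np.concatenate([haar_dwt(y), haar_dwt(u), haar_dwt(v)])
```

Because the Haar transform is orthonormal, the transformed vector preserves the feature's energy while compacting it into fewer significant coefficients.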
  • Compressed Sensing is employed as the method of model formation (appearance and deformation models) in the Feature Modeling (described elsewhere) process.
  • OMP Orthogonal Matching Pursuit
  • L1M L1 Minimization
  • CP Chaining Pursuit
  • the effectiveness of CS algorithms is limited in practice by computation time, memory limits, or total number of measurements.
  • the present invention uses one or more of several possible methods. Briefly, the methods achieve benefit through: (1) reducing the number of measurements specified in the literature to attain a precise reconstruction; (2) increasing sparsity in the input data by one or more specific data reduction techniques; (3) partitioning the data to ease memory limitations; and (4) adaptively building an expectation of error into the reconstruction algorithm.
  • One embodiment exploits the fact that, typically, the mathematical requirements for reconstruction are stricter than necessary. It is possible to achieve “good” reconstruction of image data consistently with fewer measurements than specified in the literature. “Good” reconstruction means that, to the human eye, there is little visible difference compared with a “full” reconstruction. For example, applying Chaining Pursuit (CP) with half the number of measurements specified still achieves “good” reconstruction.
  • CP Chaining Pursuit
  • the input data is “reduced” to make it sparser, which reduces the number of measurements required.
  • Data reduction techniques include passing the data through a discrete wavelet transform (DWT), because data is often more sparse in the wavelet domain; physically reducing the total size of the input data by truncation, also known as down-sampling; and thresholding the data (removing all components that are less than some threshold).
  • DWT transformation is the least “invasive” and theoretically allows full recovery of the input data.
  • the other two reduction techniques are “lossy” and do not allow full signal recovery.
  • DWT works well with CP but not with Orthogonal Matching Pursuit (OMP) or L1 Minimization (L1M). So the ideal combination for this data reduction embodiment is Chaining Pursuit algorithm with the Discrete Wavelet Transform data reduction technique.
  • OMP Orthogonal Matching Pursuit
  • L1M L1 Minimization
  • the input data is partitioned into segments (or, for 2-D images, into tiles), and each segment is processed separately with a smaller number of required measurements.
  • This approach works well for both OMP and L1M, which typically are impeded by a memory limitation.
  • the size of the required measurement matrix causes the memory limitation for both OMP and L1M.
  • the process builds some expectation of error into the reconstruction algorithm.
  • the expected error could be due to above normal noise or inaccurate measurements.
  • the process compensates either by relaxing the optimization constraint or by stopping the iteration prior to completion of the reconstruction process.
  • the reconstruction is then an approximate fit to the data, but such approximate solutions may be sufficient or may be the only solutions possible when the input data is noisy or inaccurate.
  • FIG. 2 displays a notional video compression architecture that implements compressed sensing measurements at the encoder.
  • the raw video stream 200 is sent through a motion compensated prediction algorithm 202 to register the data 203, thereby establishing correspondences between groups of pels in multiple frames such that the redundancies due to motion can be factored out.
  • preprocessing 204 is applied to make the data as sparse as possible (at 205 ) so that CS measurements and the reconstruction that follow will be as effective as possible.
  • CS measurements are taken 206 and become the CS encoding 207 (ready for transmission). Later during synthesis, the CS algorithm is used to decode the measurements.
  • the present invention identifies, separates, and preprocesses signal components from raw video streams into sparse signals that are well suited to CS processing.
  • CS algorithms are naturally compatible with embodiments of the invention. It should be noted that certain aspects of FIG. 2 are related to embodiments discussed in U.S. Pat. Nos. 7,508,990, 7,457,472, 7,457,435, 7,426,285, 7,158,680, 7,424,157, and 7,436,981 and U.S. application Ser. No. 12/522,322, all by Assignee. The entire teachings of the above listed patents and patent application are incorporated herein by reference.
  • CS delivers a significant benefit when the input image has some sparsity, or compressibility. If the input image is dense, then CS is not the correct approach for compression or reconstruction. CS algorithms can compress and reconstruct sparse input images with fewer measurements than required by conventional compression algorithms (which require a number of measurements equal to the number of pixels in the image). Note that signal sparsity or compressibility is assumed by most compression techniques, so the images for which CS provides improvement are the images for which most compression techniques are designed.
  • Representative sampled video regions can be analyzed using a base method.
  • One such method would be conventional block-based compression, such as MPEG-4.
  • the recently proposed IC implementation by Xu and Roy-Chowdhury uses the Inverse Compositional (IC) algorithm to estimate 3D motion and lighting parameters from a sequence of video frames.
  • IC Inverse Compositional
  • a 2D-to-3D-to-2D warping function is used to align (target) images from different frames with a “key” frame (template) at a canonical pose.
  • the 2D-to-3D map determines which 3D points (facets/vertices) in the 3D model correspond to which image pixels.
  • the object's pose is shifted in 3D by the previous frame's pose estimate, thereby aligning the current frame with the key frame.
  • the shifted object in 3D is then mapped back to 2D using the 3D-to-2D (projection) map to form a “pose normalized” image frame (PNF).
  • the resulting pose-normalized frame is used to estimate 15 parameters, corresponding to 9 illumination variables and 6 motion variables.
  • the illumination variables are estimated via a least-squares fit of the PNF to the LRLS (illumination) basis images.
  • the illumination component estimated by the LRLS basis images is then subtracted from the PNF, and the residual is used to estimate 6 motion parameters (3 translation and 3 rotation) via least-squares fit to the motion functions.
  • the PNF can then be reconstructed from the 15-dimensional “bilinear” illumination/motion basis and its corresponding parameter vector.
  • the present invention uses aspects of the Xu/Roy-Chowdhury IC implementation to aid with image registration applications.
  • the 2D-to-3D-to-2D mapping is used as a computationally efficient substitute for midpoint normalization of feature regions.
  • the mapping process is especially useful for features where accurate 3D models (such as the Vetter model for faces) exist.
  • the model points are specified at some pose (the “model pose”) and both the key frame (the template) and the current frame (or target frame) are registered to the model pose.
  • the SVD is reduced using a variation of the common magnitude thresholding method, termed here percentage thresholding.
  • In percentage thresholding, the total energy E of the singular values in a given SVD factorization is computed as the sum of the singular values.
  • a grouping of the singular values, referred to in the present text as a “reduced set,” is created when singular values are added sequentially (in decreasing order of magnitude, largest to smallest) until the sum of the singular values in the reduced set exceeds some percentage threshold of E.
  • This reduction method is equivalent to magnitude thresholding (see Prior Art), except the magnitude threshold does not need to be known ahead of time.
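The percentage-thresholding reduction described above can be sketched as follows (an illustrative implementation, not the claimed one; the function name and the default threshold are hypothetical):

```python
import numpy as np

def reduce_svd_percentage(D, pct=0.95):
    """Reduce an SVD by percentage thresholding: singular values are added
    in decreasing order of magnitude until their sum exceeds pct of the
    total energy E (the sum of all singular values)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    E = s.sum()
    # first index at which the running sum reaches pct * E
    r = int(np.searchsorted(np.cumsum(s), pct * E)) + 1
    return U[:, :r], s[:r], Vt[:r, :]
```

As the text notes, this is equivalent to magnitude thresholding except that no magnitude threshold needs to be known in advance; the rank r adapts to the energy distribution of each factorization.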
  • the singular value decomposition is applied to feature data as follows.
  • the M×N data matrix D consists of an ensemble of feature vectors, derived from the regions (tiles) of a given video image frame.
  • the M×1 feature vectors are column-vectorized from 2D image tiles and are concatenated to form the columns of the data matrix D.
  • the left singular vectors are then used to encode the M×1 target vector t, the feature to be transmitted, with the final encoding given by Ur′*t.
  • the incremental SVD is used to update the SVD based on the existing singular value decomposition and the data update.
  • a small number of feature vectors is grouped together to form an initial data matrix D0, for which the conventional SVD is easily computed.
  • the ISVD is used to update the SVD for the augmented data matrix.
  • a linear independence test is applied to the new data vectors before they are added to the existing ensemble.
  • the SVD is reduced using the correlations of the left singular vectors (the columns of Ur) with the target vector t.
  • the total correlation energy CE is computed as the sum of the correlations.
  • a grouping of the singular values, referred to in the present text as a “reduced set,” is created when correlations are added sequentially (in decreasing order of magnitude, largest to smallest) until the sum of the correlations in the reduced set exceeds some percentage threshold of CE.
  • This method of reducing the SVD, termed target-correlation percentage thresholding, follows the same methodology as the basic SVD reduction method of percentage thresholding, except that target correlations (of left singular vectors with the target vector) are used instead of singular values for the computations.
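As a rough sketch (illustrative only; names hypothetical), target-correlation percentage thresholding could be implemented with the correlation of each left singular vector with the target t playing the role that singular values play in basic percentage thresholding:

```python
import numpy as np

def reduce_svd_target_correlation(U, t, pct=0.95):
    """Keep left singular vectors (columns of U), ordered by the magnitude
    of their correlation with the target vector t, until the retained
    correlations exceed pct of the total correlation energy CE."""
    corr = np.abs(U.T @ t)
    order = np.argsort(-corr)          # decreasing magnitude, largest first
    CE = corr.sum()
    cum = np.cumsum(corr[order])
    r = int(np.searchsorted(cum, pct * CE)) + 1
    return U[:, order[:r]]
```

The retained columns form the reduced basis actually correlated with the feature to be transmitted, rather than the basis that best explains the training ensemble overall.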
  • the present invention performs empirical feature classification on video frame data in transform space.
  • a set of Nt features from a reference frame is presented as input to the classifier.
  • Each of the features is transformed from pel space to transform space using the linear transform of choice (possible transforms include the discrete wavelet transform [DWT] and curvelet transform [CuT]).
  • the indices corresponding to the largest P coefficients for each feature are tabulated, and the P most commonly occurring indices across all the coefficient lists are used to create a (P×1) classification vector (CV) for each feature (a total of Nt “reference” CVs in all).
  • CV classification vector
  • each new feature vector v is classified by transforming the vector, extracting the CV indices for v, and computing a similarity measure between the CV for v and each of the reference CVs.
  • the test feature is classified as the feature whose reference CV maximizes the similarity measure.
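The transform-space classification described above can be sketched as follows. Two simplifications are assumed, so this is not the claimed method: the real-FFT magnitude stands in for the DWT or curvelet transform, and each CV is built from that feature's own top-P coefficient indices rather than the P most common indices across all features; set overlap serves as the similarity measure.

```python
import numpy as np

def classification_vector(feature, P=4):
    """Transform a feature to transform space (here |rfft| stands in for
    the DWT/CuT) and return the indices of its P largest coefficients."""
    coeffs = np.abs(np.fft.rfft(feature))
    return set(np.argsort(-coeffs)[:P].tolist())

def classify(test_feature, reference_cvs, P=4):
    """Classify the test feature as the reference whose CV index set has
    the greatest overlap with the test feature's CV."""
    cv = classification_vector(test_feature, P)
    scores = [len(cv & ref) for ref in reference_cvs]
    return int(np.argmax(scores))
```

Because classification operates on a handful of coefficient indices instead of full pel data, the comparison is cheap and relatively insensitive to small appearance changes.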
  • Information from two or more linear transforms with different strengths and weaknesses can be combined using orthogonal matching pursuit to improve the performance of the empirical transform-based feature classifier.
  • the dictionary D combines basis vectors from the DWT, which is effective at representing textures, with basis vectors from the CuT, which is effective at representing edges.
  • OMP is used to compute a signal representation using the functions in D for each of Nt features, as well as a representation for the “test” feature vector.
  • the classifier then proceeds as in the basic transform-based classifier described above. Combining the information from multiple transforms in this way can improve classifier performance over that achieved by each of the individual classifiers.
  • Linear transforms can also be used for compression and coding of features.
  • the transform coefficients are ordered by magnitude and thresholded according to an energy retention criterion (e.g., enough coefficients are kept such that 99% of the feature energy is retained). Typically, many fewer transform coefficients are needed to retain 99% of signal energy than pels are needed in pel space.
  • the transform coefficient values represent the encoding of the feature, and the compression gain is given by the percentage of transform coefficients kept relative to the number of pixels in the feature.
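The energy-retention coding step above can be sketched as follows (an illustrative implementation operating on already-transformed coefficients; names hypothetical):

```python
import numpy as np

def encode_feature(coeffs, retain=0.99):
    """Keep the largest-magnitude transform coefficients until `retain` of
    the feature energy (sum of squared coefficients) is retained. Returns
    the kept (index, value) pairs -- the encoding -- and the compression
    gain expressed as the fraction of coefficients kept."""
    order = np.argsort(-np.abs(coeffs))        # magnitude-ordered indices
    energy = np.cumsum(coeffs[order] ** 2)
    k = int(np.searchsorted(energy, retain * energy[-1])) + 1
    kept = [(int(i), float(coeffs[i])) for i in order[:k]]
    gain = k / coeffs.size
    return kept, gain
```

For a compressible feature, k is typically far smaller than the number of pels, which is exactly the compression gain described above.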
  • information from multiple transforms can again be combined using OMP to improve compression gain.

Abstract

Systems and methods of processing video data are provided. Video data having a series of video frames is received and processed. One or more instances of a candidate feature are detected in the video frames. The previously decoded video frames are processed to identify potential matches of the candidate feature. When a substantial number of portions of previously decoded video frames include instances of the candidate feature, the instances of the candidate feature are aggregated into a set. The candidate feature set is used to create a feature-based model. The feature-based model includes a model of deformation variation and a model of appearance variation of instances of the candidate feature. The feature-based model compression efficiency is compared with the conventional video compression efficiency.

Description

    RELATED APPLICATIONS
  • This application is a continuation of U.S. application Ser. No. 13/341,482, filed Dec. 30, 2011, which is a continuation of U.S. application Ser. No. 13/121,904, filed Mar. 30, 2011, now U.S. Pat. No. 8,942,283, which is the U.S. National Stage of International Application No. PCT/US2009/059653, filed Oct. 6, 2009, which designates the U.S. and published in English, which claims the benefit of U.S. Provisional Application No. 61/103,362, filed 7 Oct. 2008 and U.S. application Ser. No. 13/121,904 is also a continuation-in-part of Ser. No. 12/522,322, filed Jul. 7, 2009, now U.S. Pat. No. 8,908,766, which is the U.S. National Stage of International Application No. PCT/US2008/000090, filed Jan. 4, 2008, which designates the U.S. and published in English which claims the benefit of U.S. Provisional Application No. 60/881,966, filed Jan. 23, 2007, is related to U.S. Provisional Application No. 60/811,890, filed Jun. 8, 2006, and is a continuation-in-part of U.S. application Ser. No. 11/396,010, filed Mar. 31, 2006, now U.S. Pat. No. 7,457,472, which is a continuation-in-part of U.S. application Ser. No. 11/336,366, filed Jan. 20, 2006, now U.S. Pat. No. 7,436,981, which is a continuation-in-part of U.S. application Ser. No. 11/280,625, filed Nov. 16, 2005, now U.S. Pat. No. 7,457,435, which is a continuation-in-part of U.S. application Ser. No. 11/230,686, filed Sep. 20, 2005, now U.S. Pat. No. 7,426,285, which is a continuation-in-part of U.S. application Ser. No. 11/191,562, filed Jul. 28, 2005, now U.S. Pat. No. 7,158,680. U.S. application Ser. No. 11/396,010, now U.S. Pat. No. 7,457,472, also claims priority to U.S. Provisional Application No. 60/667,532, filed Mar. 31, 2005 and U.S. Provisional Application No. 60/670,951, filed Apr. 13, 2005.
  • The entire teachings of the above applications are incorporated herein by reference.
  • BACKGROUND OF THE INVENTION Prediction Segmentation [Primary]
  • Conventional video compression standards, for example MPEG-4 and H.264, have facilities for specifying a number of reference frames to use during the motion compensated prediction process in order to predict the current frame. These standards typically restrict the reference frames to one or more consecutive past frames, and in some cases to any set of frames that has been previously decoded. Usually, there is a limit on the number of reference frames and also a limit on how far back in the stream of decoded frames the selection process may draw.
  • Compressed Sensing (CS)
  • Image and video compression techniques generally attempt to exploit redundancy in the data that allows the most important information in the data to be captured in a “small” number of parameters. “Small” is defined relative to the size of the original raw data. It is not known in advance which parameters will be important for a given data set. Because of this, conventional image/video compression techniques compute (or measure) a relatively large number of parameters before selecting those that will yield the most compact encoding. For example, the JPEG and JPEG 2000 image compression standards are based on linear transforms (typically the discrete cosine transform [DCT] or discrete wavelet transform [DWT]) that convert image pixels into transform coefficients, resulting in a number of transform coefficients equal to the number of original pixels. In transform space, the important coefficients can then be selected by various techniques. One example is scalar quantization. When taken to an extreme, this is equivalent to magnitude thresholding. While the DCT and DWT can be computed efficiently, the need to compute the full transform before data reduction causes inefficiency. The computation requires a number of measurements equal to the size of the input data for these two transforms. This characteristic of conventional image/video compression techniques makes them impractical for use when high computational efficiency is required.
  • Conventional compression allows for the blending of multiple matches from multiple frames to predict regions of the current frame. The blending is often linear, or a log scaled linear combination of the matches. One example of when this bi-prediction method is effective is when there is a fade from one image to another over time. The process of fading is a linear blending of two images, and the process can sometimes be effectively modeled using bi-prediction. Further, the MPEG-2 Interpolative mode allows for the interpolation of linear parameters to synthesize the bi-prediction model over many frames.
  • Conventional compression allows for the specification of one or more reference frames from which predictions for the encoding of the current frame can be drawn. While the reference frames are typically temporally adjacent to the current frame, there is also accommodation for the specification of reference frames from outside the set of the temporally adjacent frames.
  • In contrast with conventional transform-based image/video compression algorithms, compressed sensing (CS) algorithms directly exploit much of the redundancy in the data during the measurement (“sensing”) step. Redundancy in the temporal, spatial, and spectral domains is a major contributor to higher compression rates. The key result for all compressed sensing algorithms is that a compressible signal can be sensed with a relatively small number of random measurements, much smaller than the number required by conventional compression algorithms. The images can then be reconstructed accurately and reliably. Given known statistical characteristics, a subset of the visual information is used to infer the rest of the data.
  • The precise number of measurements required in a given CS algorithm depends on the type of signal as well as the “recovery algorithm” that reconstructs the signal from the measurements (coefficients). Note that the number of measurements required by a CS algorithm to reconstruct signals with some certainty is not directly related to the computational complexity of the algorithm. For example, a class of CS algorithms that uses L1-minimization to recover the signal requires a relatively small number of measurements, but the L1-minimization algorithm is very slow (not real-time). Thus, practical compressed sensing algorithms seek to balance the number of required measurements with the accuracy of the reconstruction and with computational complexity. CS provides a radically different model of codec design compared to conventional codecs.
  • In general, there are three major steps in a typical CS algorithm: (1) create the measurement matrix M; (2) take measurements of the data using the measurement matrix, also known as creating an encoding of the data; and (3) recover the original data from the encoding, also known as the decoding step. The recovery algorithm (decoder) can be complex, and because there are fewer limits to computational power at the receiver, the overall CS algorithm is usually named after its decoder. There are three practical CS algorithms of interest in the prior art: Orthogonal Matching Pursuit (OMP), L1 Minimization (L1M), and Chaining Pursuit (CP). In general, the L1M in practice is prohibitively computationally inefficient for most video processing applications. The more efficient OMP and CP algorithms provide many of the same benefits as the L1M, and, as such, they are the two CS algorithms of choice for most applications.
  • Image Alignment via Inverse Compositional Algorithm
  • Basri and Jacobs (“Lambertian Reflectance and Linear Subspaces,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2/03), henceforth referred to as LRLS, have shown that Lambertian objects (whose surfaces reflect light in all directions) can be well-approximated by a small (9-dimensional) linear subspace of LRLS “basis images” based on spherical harmonic functions. The LRLS basis images can be visualized as versions of the object under different lighting conditions and textures. The LRLS basis images thus depend on the structure of the object (through its surface normals), the albedo of the object at its different reflection points, and the illumination model (which follows Lambert's cosine law, integrated over direction, to produce spherical harmonic functions). Under the assumptions of the model, the 9-D subspace captures more than 99% of the energy intensity in the object image. The low dimensionality of the appearance subspace indicates a greater redundancy in the data than is available to conventional compression schemes.
  • The inverse compositional algorithm (IC) was first proposed as an efficient implementation of the Lucas-Kanade algorithm for 2D motion estimation and image registration. Subsequent implementations have used the IC algorithm to fit 3D models such as Active Appearance Models and the 3D morphable model (3DMM) to face images.
  • Application of Incremental Singular Value Decomposition (ISVD) Algorithm
  • A common dimensionality reduction technique involves the utilization of linear transformations on norm preserving bases. Reduction of an SVD representation refers to the deletion of certain singular value/singular vector pairs in the SVD to produce a more computationally and representationally efficient representation of the data. Most commonly, the SVD factorization is effectively reduced by zeroing all singular values below a certain threshold and deleting the corresponding singular vectors. This magnitude thresholding results in a reduced SVD with r singular values (r<N) that is the best r-dimensional approximation of the data matrix D from an L2-norm perspective. The reduced SVD is given by

  • D=Ur*Sr*Vr′,   Equation 1
  • where Ur is M×r, Sr is r×r diagonal, and Vr is N×r.
  • The singular value decomposition (SVD) is a factorization of a data matrix that leads naturally to minimal (compact) descriptions of the data. Given a data matrix D of size M×N, the SVD factorization is given by D=U*S*V′ where U is an M×N column-orthogonal matrix of (left) singular vectors, S is an N×N diagonal matrix with singular values (s1, s2, . . . sN) along the diagonal, and V is an N×N orthogonal matrix of (right) singular vectors.
  • Compact Manifold Prediction
  • Matching pursuit (MP) is an iterative algorithm for deriving efficient signal representations. Given the problem of representing a signal vector s in terms of a dictionary D of basis functions (not necessarily orthogonal), MP selects functions for the representation via the iterative process described here. The first basis function in the representation (denoted as d1) is selected as the one having maximum correlation with the signal vector. Next, a residual vector r1 is computed by subtracting the projection of d1 onto the signal from the signal itself: r1=s−(d1′*s)*d1. Then, the next function in the representation (d2) is selected as the one having maximum correlation with the residual r1. The projection of d2 onto r1 is subtracted from r1 to form another residual r2. The same process is then repeated until the norm of the residual falls below a certain threshold.
  • Orthogonal matching pursuit (OMP) follows the same iterative procedure as MP, except that an extra step is taken to ensure that the residual is orthogonal to every function already in the representation ensemble. While the OMP recursion is more complicated than in MP, the extra computations ensure that OMP converges to a solution in no more than Nd steps, where Nd is the number of functions in the dictionary D.
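The OMP procedure described above can be sketched as follows (an illustrative implementation, assuming unit-norm dictionary columns; the least-squares re-fit at each step is the "extra step" that keeps the residual orthogonal to every selected function):

```python
import numpy as np

def omp(D, s, tol=1e-6, max_iter=None):
    """Orthogonal Matching Pursuit: greedily select the dictionary column
    of D most correlated with the residual, then re-solve a least-squares
    fit over all selected columns so the residual stays orthogonal to
    every function already in the representation ensemble."""
    max_iter = max_iter or D.shape[1]
    support, r = [], s.astype(float).copy()
    coeffs = np.zeros(0)
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break                                    # residual small enough
        idx = int(np.argmax(np.abs(D.T @ r)))        # best-correlated column
        if idx not in support:
            support.append(idx)
        # orthogonalizing re-fit over the current support
        coeffs, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)
        r = s - D[:, support] @ coeffs
    return support, coeffs
```

The re-fit is why OMP converges in at most Nd steps: once a column is selected, the residual has no remaining component along it, so it cannot be selected again (up to numerical error).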
  • SUMMARY OF THE INVENTION
  • The present invention extends conventional video compression, especially in cases where the redundancy of visual phenomena exceeds the modeling capabilities of the conventional video codec. The present invention extends, and may entirely replace, the existing methods of conventional video compression by employing robust Computer Vision and Pattern Recognition algorithms. Specifically, the present invention includes feature modeling methods and systems that focus on the segmentation, normalization, and integration of a feature occurring in one or more of the previously decoded frames of the video. Feature-based video compression considers a greater number of previously decoded frames, and within each of those frames, a greater area and a much higher number of pels compared with conventional compression which considers fewer frames, smaller areas, and fewer pels.
  • Conventional compression provides an implicit form of segmentation at the macroblock level, by utilizing multiple reference frames, macroblock partitioning, sub-macroblock partitioning, and motion compensated prediction. Further, conventional compression utilizes motion compensated prediction to model the spatial deformation occurring in the video and transform coding to model the appearance variations. The present invention extends these modeling techniques of disparate signal elements with more complex models including spatial segmentation masks, regular mesh deformation, feature affine motion, three dimensional feature motion, three dimensional illumination, and other Computer Vision and Pattern Recognition modeling techniques. Note that throughout the present text, “individual modes” and “disparate signal elements” are equivalent.
  • The present invention facilitates the identification and segmentation of individual modes of the video signal. The concept of reference frame processing that is used in conventional motion compensated prediction is utilized in the present invention to facilitate this identification and segmentation. The conventional motion compensated prediction process selects, at the macroblock level, portions of the signal from one or more reference frames. Note that the conventional motion compensated prediction process typically does such a selection based on some rate-distortion metric. The present invention is able to apply analysis to the past frames to determine the frames that will have the highest probability of providing matches for the current frame. Additionally, the number of reference frames can be much greater than the typical one to sixteen reference frame maximum found in conventional compression. Depending on system resources, the reference frames may number up to the limit of system memory, assuming that there are a sufficient number of useful matches in those frames. Further, the intermediate form of the data generated by the present invention can reduce the required amount of memory for storing the same number of reference frames.
  • In one embodiment, the present invention infers the segmentation of the video signal based on this reference frame processing. The macroblocks (blocks of pixels) in the current frame may select, through the motion compensated prediction process, tiles of pels from previously decoded frames such that those tiles are separated both spatially and also temporally, meaning that the source of tiles used in the motion compensated prediction process may come from different frames. The separation implied by selection of source tiles, for predicting current frame macroblocks, from different frames indicates the potential that different signal modes are being identified. When the identified separate signal modes can be encoded in a more compact manner, this further verifies that separate modes have been identified. In the present invention these separate modes are called “features.” When these features are persistent over many frames of the video and the features can be correlated, a new type of redundancy in the video has been identified. The present invention leverages this redundancy through the creation of appearance and deformation models in order to create further compression beyond what is available to conventional compression. Further, as features are identified within reference frames, reference frame processing is biased toward using reference frames containing features; this yields an increased probability that the reference frame processing will further yield a segmentation of the modes present in the signal.
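As a hypothetical illustration of this inference, macroblocks of the current frame can be grouped by the reference frame that supplied their motion-compensated prediction; blocks drawn from the same reference frame are then treated as one candidate signal mode. The frame numbers and block coordinates below are invented for the example:

```python
from collections import defaultdict

def infer_modes(block_predictions):
    """Group macroblocks of the current frame by the previously decoded
    frame that supplied their best motion-compensated prediction; each
    group is a candidate signal mode (a potential 'feature')."""
    modes = defaultdict(list)
    for block_xy, ref_frame in block_predictions.items():
        modes[ref_frame].append(block_xy)
    return dict(modes)

# Hypothetical motion-search output: block coordinates -> chosen reference frame.
predictions = {(0, 0): 7, (0, 16): 7, (16, 0): 3, (16, 16): 7}
modes = infer_modes(predictions)
# Blocks predicted from frame 7 form one candidate mode; frame 3 forms another.
```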
  • Systems and methods may be provided for processing video data. Video data formed of a series of video frames may be received and encoded. One or more instances of a candidate feature may be detected in one or more of the video frames. The detection of the candidate feature involves determining positional information for instances in the one or more previously decoded video frames. The positional information includes a frame number, a position within that frame, and a spatial perimeter of the instance. The candidate feature can be a set of one or more detected instances. A motion compensated prediction process can be used to predict a portion of a current video frame in the series using one or more previously decoded video frames. The motion compensated prediction process can be initialized with positional predictions. The positional predictions can provide positional information from detected feature instances in previously decoded video frames. One or more of the instances can be transformed by augmenting the motion compensated prediction process. A feature along with the transformed instances can be defined. The one or more of the instances may be transformed using a linear transform. The defined feature including the transformed instances can be used to create a first feature-based model. The first feature-based model can enable prediction in the current frame of an appearance and a source position of a substantially matching feature instance. Preferably, the substantially matching feature is the best match determined using a rate-distortion metric. The substantially matching feature instance can be a key feature instance. The key feature instance can serve as the first feature-based model's synthesis of the feature instance in the current frame. The first feature-based model can be compared to a conventional video encoding model of the one or more defined features, and the comparison can be used to determine which model enables greater encoding compression.
The results of the comparing and determining step can be used to guide the encoding process in applying feature-based encoding to portions of one or more of the video frames, and applying conventional video encoding to other portions of the one or more video frames.
  • An instance of a candidate feature can be detected by identifying a spatially continuous group of pels having substantially close spatial proximity. The identified pels can be used to define a portion of one of the one or more video frames. The group of pels can include one or more macroblocks or portions of one or more macroblocks.
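The grouping of spatially continuous pels can be sketched as a simple 4-connected flood fill over the set of pels flagged as "complex" (the mask contents and function name here are hypothetical):

```python
def candidate_instances(mask):
    """Collect spatially continuous groups of flagged pels
    (4-connected flood fill) as candidate feature instances.
    `mask` is a set of (y, x) coordinates of complex pels."""
    seen, groups = set(), []
    for start in mask:
        if start in seen:
            continue
        stack, group = [start], []
        while stack:
            y, x = stack.pop()
            if (y, x) in seen or (y, x) not in mask:
                continue
            seen.add((y, x))
            group.append((y, x))
            # visit the four spatial neighbours
            stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
        groups.append(sorted(group))
    return groups
```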
  • The motion compensated prediction process can be used to select, from a plurality of candidate feature instances, one or more instances that are predicted to provide encoding efficiency. A segmentation of the current instance of the candidate feature can be determined from other features and non-features in the current video frame. The segmentation can be based on the motion compensated prediction process' selection of predictions from unique previously decoded video frames. The motion compensated prediction process can be initialized using positional information for feature instances belonging to one or more features (such features having instances in the current frame coincident with the video portion) where the video portion is in the current frame, and the positional information corresponds to feature instances associated with the same feature in previously decoded video frames.
  • A second feature-based model can be formed. The second feature-based model can be formed using the first feature-based model as a target of prediction for one or more motion compensated predictions from one or more feature instances. This second feature-based model yields a set of predictions of the first feature-based model. Once the set of predictions is combined with the first feature-based model, the set of predictions can become the second feature-based model. The second feature-based model can be used to model the residual from the first feature-based model. Structural variation and appearance variation can be modeled from the second feature-based model relative to the residual. The residual can be encoded with the feature instance, which yields appearance and deformation parameters. The parameters can be used to reduce the encoding size of the residual.
  • One or more features can include one or more aggregate features. The aggregate features are based on one or more of the instances of the candidate feature. The aggregate features can be created by aggregating the instances of different candidate features into an aggregate candidate feature. The set of instances of the aggregate candidate features can be used to form a region substantially larger than the original instances of un-aggregated candidate features. The larger region can be formed through the identification of coherency among the instances of the candidate feature in the set. Coherency can be defined as appearance correspondences in the instances substantially approximated by a lower parameter motion model. The second feature-based model can provide an optional rectangular area extent of pels associated with that instance in the decoded frame relative to the spatial position. The second feature-based model can be derived by modeling prior normalized instances of the feature. The prior normalized instances can be any one of the following: the instance in the current frame; an instance that is from a previously decoded frame that is substantially recent temporally; or an average of the instances from the previously decoded video frames.
  • The appearance model can be represented by a PCA decomposition of the normalized second feature-based model instances. A deformation model can be determined using the spatial variation of correspondences in the feature instances of each set as compared to their second feature-based model instances. For each feature instance in the set, one or more of the following can be used to approximate variation in the deformation instances for the deformation model: a motion compensated prediction process; mesh deformation; and a motion model with a substantially reduced parameterization. The deformation instances can be integrated into the deformation model. The variation in the deformation model can be represented by a PCA decomposition.
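A minimal sketch of such a PCA decomposition of normalized instances follows (NumPy; the function names and the flattened-pel row-vector representation are assumptions for illustration, not the invention's implementation):

```python
import numpy as np

def build_appearance_model(instances, n_components):
    """PCA decomposition of normalized feature instances.
    Rows of `instances` are flattened pel vectors of the same feature
    after normalization (one row per instance)."""
    mean = instances.mean(axis=0)
    centered = instances - mean
    # SVD yields the principal components without forming a covariance matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]          # principal appearance modes
    return mean, basis

def project(instance, mean, basis):
    """Appearance parameters: coordinates of an instance in the PCA basis."""
    return basis @ (instance - mean)

def synthesize(params, mean, basis):
    """Reconstruct (synthesize) an instance from its appearance parameters."""
    return mean + basis.T @ params
```

A deformation model could be decomposed the same way, with rows holding per-pel displacement fields instead of pel values.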
  • Appearance parameters and deformation parameters may be predicted. The predicted parameters can be used during the synthesis of the current instance using a feature-based model. The appearance and deformation models as well as temporally recent parameters can be used to interpolate and extrapolate parameters from the feature-based model to predict pels in the current frame. The values of the synthesis for the temporally recent feature instances may be either linearly interpolated or linearly extrapolated based on which method has yielded the most accurate approximation for those instances. The actual parameters for the model can be optionally differentially encoded relative to the predicted parameters.
  • The motion compensated prediction process can operate on a selection of a substantially larger number of the previously decoded video frames than in conventional video data encoding. The selection of previously decoded video frames need not rely on user supervision.
  • Conventional video encoding can be augmented by an instance prediction process that enables greater compression of portions of one or more of the video frames in memory, when forming a prediction of portions of the current frame. The instance prediction process can use the feature-based model to determine one or more instances of the defined feature that are incident to a target macroblock being encoded. In this way, the instance prediction process can create the predicted portions of the current frame. The feature-based model can be used to synthesize pels to predict portions of the current frame.
  • A probability for the previously decoded video frames can be assigned. The probability can be based on the combined predicted encoding performance improvement for the frame, determined using positional predictions from the motion compensated prediction process. The probability can be defined as the combined encoding performance of the motion compensated prediction process, which was utilized during the analysis of the first feature-based model and a second feature-based model for the current frame. An index of the previously decoded video frames can be created by sorting them by probability, from best to worst. The indexed list can be truncated based on computational and memory requirements.
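The sort-and-truncate step above reduces to a short sketch (the per-frame scores below are hypothetical placeholders for the predicted encoding benefit):

```python
def rank_reference_frames(frame_scores, max_frames):
    """Sort previously decoded frames by their predicted encoding
    benefit (best first) and truncate to the resource budget."""
    ranked = sorted(frame_scores, key=frame_scores.get, reverse=True)
    return ranked[:max_frames]

# Hypothetical per-frame benefit scores: frame number -> probability of use.
scores = {12: 0.9, 3: 0.4, 27: 0.7}
index = rank_reference_frames(scores, 2)   # best two frames only
```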
  • A feature-based model may be formed using one or more of the defined features. The feature-based model may include positional information for the defined features. The positional information may include a position and a spatial perimeter of defined features from the previously decoded video frames. For example, the positional information may include information regarding the spatial position of a region within a specific frame, and a rectangular extent of the region in that frame. The feature-based model may specify which previously decoded video frames (or portions thereof) are associated with the defined feature.
  • The defined features may be normalized and segmented from the video data using macroblock motion compensated prediction. The defined features may be normalized using the feature-based model. The macroblock motion compensated prediction may use the feature position in the previously decoded image frame as a positional prediction. The resulting normalization provides the prediction of the feature in the current video frame.
  • The feature-based model may be compared to another model resulting from conventional encoding of the same video data. The comparison can be used to determine which model enables greater encoding compression efficiency. Different encoding techniques may be applied to the different parts of the video data depending on the results of the encoding comparison. In this way, differential encoding can be provided such that the system is capable of selecting a different video encoding scheme for each portion of video data depending on whether feature-based encoding or conventional encoding provides more compression efficiency.
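The per-portion mode decision can be sketched as comparing the two encoders' output sizes for each portion (the encoder callables here are stand-ins; a real codec would compare a rate-distortion metric rather than raw length):

```python
def choose_encoding(portion, feature_encoder, conventional_encoder):
    """Per-portion mode decision: apply whichever encoder yields the
    smaller bitstream for this portion of the frame."""
    fb = feature_encoder(portion)
    cv = conventional_encoder(portion)
    if len(fb) < len(cv):
        return ("feature", fb)
    return ("conventional", cv)
```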
  • A defined feature may be represented as a set of instances of the feature in one or more video frames. Each instance may include: a reference to a frame in which the instance occurs; a spatial position associated with the instance within that frame; and an optional rectangular area extent of pels associated with that instance in that frame relative to the spatial position. The spatial position may provide a prediction of matches for encoding portions of one or more of the video frames. An appearance model may be provided for each defined feature to model variation of the defined feature from instance to instance in the set. The appearance model may be derived by modeling prior normalized instances of the feature. The prior normalized instances may be normalized using any combination of motion compensated prediction process, mesh deformation, and parameter reduced motion modeling (e.g. affine).
  • The normalization can be used to build a deformation model that may be used to model the spatial variation of correspondences in the feature instances of each set. For each feature instance in the set, one or more of the following may be used to determine deformation instances for the deformation model: a motion compensated prediction process, mesh deformation, and parameter reduced motion modeling. The deformation instances may be integrated into the deformation model. The deformation model may be represented by a decomposition using Principal Component Analysis (PCA). The deformation model may be represented by a decomposition using any decomposing algorithm. The motion compensated prediction process may operate on a substantially greater number of the previously decoded video frames than in conventional video data encoding, without supervision.
  • The conventional video encoding may include motion-compensated block-based compression. The conventional video encoding can be augmented by a residual reduction process that enables greater compression of portions of the video frames in memory when forming a residual frame. The residual reduction process can include the feature-based model to determine one or more instances of the defined feature that are incident to a target macroblock being encoded to form the residual frame. Pels may be synthesized using the feature-based models to predict the residual frame. The feature-based model may be used for reference frame index prediction. The synthesized pels may be reused for other residual reductions in response to determining that one or more instances of the defined feature overlaps more than one macroblock in the current frame. The synthesized pels may be reused for other residual reductions in response to determining that one or more instances of the defined feature represents one macroblock when one or more instances of the defined feature substantially matches positional information for a macroblock in the current frame. Appearance and deformation may be modeled based on the feature-based model. The appearance model and deformation model may be used along with a historical set of parameters in those models to interpolate and extrapolate parameters from the feature-based model to predict pels in the current frame. Furthermore, higher order quadratic and even extended Kalman filter models can be used to predict the appearance and deformation parameters. The prediction of the parameters from the feature-based model enables a reduction in the magnitude of the residual parameters, resulting in a lower precision and therefore lower bit rate representation of the parameters required to predict pels in the current frame.
  • One or more macroblocks from one or more frames may be selected using the motion compensated prediction process. Pels from macroblocks in a PCA model may be linearly combined, and the PCA model parameters may be interpolated. Equivalently, any decomposing algorithm can be used in place of PCA, and utilized based on its substantially relative benefit.
  • Substantially small spatial regions may be identified in the video frames. Coherency criteria may be used to identify spatial regions that can be combined into substantially larger spatial regions. For a larger spatial region, the suitability of the larger spatial region to be a defined feature can be determined by encoding a feature-based model of the larger spatial region. The smaller region may be a defined feature, and the larger region may be a defined feature.
  • Feature-based compression can include object-based compression processes. Object based detection, tracking, and segmentation may be applied to a feature instance in the current frame or in previously decoded frames. An intermediate form of the feature instance may be derived using spatial segmentation. For example, the spatial segmentation process may segment a foreground object from the non-object background. The resulting segmentation may provide a pel level correspondence of a given object in a feature instance as it exists in one frame to its occurrence in a next frame. The pel data associated with the object is resampled, and subsequently the spatial positions of the resampled pel data are restored using models. The resampling effectively normalizes the object pel data from one frame to a next frame and results in providing an intermediate form of the video data which has computational and analytical advantages for video processing purposes. In this way, object-based normalization and modeling processes may be applied to a feature instance (or portions thereof) in the current frame or in previously decoded frames during the feature-based encoding process. Correspondence modeling, deformation modeling, appearance modeling, contour modeling, and structural modeling may be used to model a feature instance (or portions thereof) in the current frame or in previously decoded frames.
  • A defined feature may be free of correspondence to salient entities (objects, sub-objects). For example, the salient entities may be determined through supervised labeling of detected features as belonging to or not belonging to an object. The defined features may contain elements of two or more salient objects, background, or other parts of the video frames. One or more features may constitute an object. Also, a defined feature may not correspond to an object. A defined feature may not be included in any object. In this way, feature-based compression can be more flexible and versatile than object-based detection. Although defined features can include objects and be included in objects, defined features do not need to be object-based and can take any form.
  • In another embodiment, Compressed Sensing (CS) is applied to the feature based encoding technique. CS is applied to pels in the video frames having working or defined features. CS may also be applied to conventional encoding to the remaining pels of the video frames. The video data may be made sparse to increase the effectiveness of the application of CS. During model formation (appearance and deformation models), CS may be applied to resolve the model parameters from partial parameter measurements.
  • CS can be applied to the residual of the second feature-based model prediction. The application of CS can utilize the average appearance as a measurement and predict the video signal from it. Variance associated with the CS prediction can be removed from the second feature-based model. The feature-based model can be used to focus on a more compact encoding of the remaining variance. CS encoding can be applied to the remaining pels in the one or more video frames and to remaining video frames.
  • A hybrid codec decoder may be provided that uses feature-based decompression for decoding video data. Encoded video data may be decoded by determining on a macroblock level whether there is an encoded feature in the encoded video data. The encoded feature may include feature-based models. Where an encoded feature does not exist, macroblocks in the encoded video data may be decoded using conventional video decompression. Where an encoded feature does exist, the decoder may respond to the detection of an encoded feature in the encoded video data by separating the feature encoded parts from the encoded video data. By separating the feature encoded parts, the system is able to synthesize the encoded feature separately from the conventionally encoded parts in the video stream. Feature parameters from the encoded feature parts may be associated with feature models included with the encoded feature. The feature parameters may be used by the decoder to synthesize the encoded feature. The conventionally compressed parts/portions of the video data may be combined with the synthesized feature to reconstruct the original video frame.
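The decoder's macroblock-level dispatch can be sketched as follows (the unit record layout and the decoder callables are hypothetical; in the invention the feature models are rebuilt at the decoder in parallel with the encoder):

```python
def decode_stream(units, feature_decoder, conventional_decoder, composite):
    """Macroblock-level dispatch: feature-encoded units are synthesized
    from their feature parameters, conventionally encoded units are
    decoded as usual, and the two are composited into the output frame."""
    synthesized, conventional = [], []
    for unit in units:
        if unit["is_feature"]:
            # synthesize from feature parameters using the feature models
            synthesized.append(feature_decoder(unit["params"]))
        else:
            # non-feature macroblock: conventional decompression
            conventional.append(conventional_decoder(unit["data"]))
    return composite(synthesized, conventional)
```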
  • In another embodiment, a video codec is capable of handling a plurality of compressed video signal modes. In one of the video signal modes, a codec encoder provides feature-based video compression. In another mode, the codec encoder provides conventional video compression. Similarly, a codec decoder is responsive to different video signal modes and is capable of providing feature-based video decompression and conventional video decompression, depending on the contents of the video signal (e.g. the video signal mode).
  • The codec may determine which type of video compression is appropriate based on whether feature-based encoding or conventional encoding provides more compression efficiency for one or more features in video frames of the video signal.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the present invention.
  • FIG. 1 is a schematic diagram of an embodiment (hybrid codec) of the present invention.
  • FIG. 2 is a block diagram of a video compression architecture embodied in encoders of the present invention.
  • FIG. 3 is a schematic diagram of a computer network environment in which embodiments of the present invention are deployed.
  • FIG. 4 is a block diagram of the computer nodes in the network of FIG. 3.
  • FIG. 5 is a diagram depicting feature modeling representing one embodiment of the present invention.
  • FIG. 6 is a diagram describing the prediction process according to an embodiment of the present invention.
  • FIG. 7 is a block diagram of an embodiment (codec) of the present invention.
  • FIG. 8 is a diagram depicting feature tracking according to an embodiment of the present invention.
  • DETAILED DESCRIPTION OF THE INVENTION Introduction Section
  • A description of example embodiments of the invention follows.
  • The teachings of all patents, published applications and references cited herein are incorporated by reference in their entirety.
  • Digital Processing Environment and Network
  • Preferably, the invention is implemented in a software or hardware environment. One such environment is shown in FIG. 3, which illustrates a computer network or similar digital processing environment in which the present invention may be implemented.
  • Client computer(s)/devices 350 and server computer(s) 360 provide processing, storage, and input/output devices executing application programs and the like. Client computer(s)/devices 350 can also be linked through communications network 370 to other computing devices, including other client devices/processes 350 and server computer(s) 360. Communications network 370 can be part of a remote access network, a global network (e.g., the Internet), a worldwide collection of computers, Local area or Wide area networks, and gateways that currently use respective protocols (TCP/IP, Bluetooth, etc.) to communicate with one another. Other electronic device/computer network architectures are suitable.
  • FIG. 4 is a diagram of the internal structure of a computer (e.g., client processor/device 350 or server computers 360) in the computer system of FIG. 3. Each computer 350, 360 contains a system bus 479, where a bus is a set of actual or virtual hardware lines used for data transfer among the components of a computer or processing system. Bus 479 is essentially a shared conduit that connects different elements of a computer system (e.g., processor, disk storage, memory, input/output ports, etc.) that enables the transfer of information between the elements. Attached to system bus 479 is I/O device interface 482 for connecting various input and output devices (e.g., keyboard, mouse, displays, printers, speakers, etc.) to the computer 350, 360. Network interface 486 allows the computer to connect to various other devices attached to a network (for example the network illustrated at 370 of FIG. 3). Memory 490 provides volatile storage for computer software instructions 492 and data 494 used to implement an embodiment of the present invention (e.g., hybrid codec, video encoder compression code and decoder code/program routine detailed above). Disk storage 495 provides non-volatile storage for computer software instructions 492 (equivalently “OS program”) and data 494 used to implement an embodiment of the present invention. Central processor unit 484 is also attached to system bus 479 and provides for the execution of computer instructions. Note that throughout the present text, “computer software instructions” and “OS program” are equivalent.
  • In one embodiment, the processor routines 492 and data 494 are a computer program product (generally referenced 492), including a computer readable medium (e.g., a removable storage medium such as one or more DVD-ROM's, CD-ROM's, diskettes, tapes, etc.) that provides at least a portion of the software instructions for the invention system. Computer program product 492 can be installed by any suitable software installation procedure, as is well known in the art. In another embodiment, at least a portion of the software instructions may also be downloaded over a cable, communication and/or wireless connection. In other embodiments, the invention programs are a computer program propagated signal product 307 embodied on a propagated signal on a propagation medium (e.g., a radio wave, an infrared wave, a laser wave, a sound wave, or an electrical wave propagated over a global network such as the Internet, or other network(s)). Such carrier medium or signals provide at least a portion of the software instructions for the present invention routines/program 492.
  • In alternate embodiments, the propagated signal is an analog carrier wave or digital signal carried on the propagated medium. For example, the propagated signal may be a digitized signal propagated over a global network (e.g., the Internet), a telecommunications network, or other network. In one embodiment, the propagated signal is transmitted over the propagation medium over a period of time, such as the instructions for a software application sent in packets over a network over a period of milliseconds, seconds, minutes, or longer. In another embodiment, the computer readable medium of computer program product 492 is a propagation medium that the computer system 350 may receive and read, such as by receiving the propagation medium and identifying a propagated signal embodied in the propagation medium, as described above for computer program propagated signal product.
  • Generally speaking, the term “carrier medium” or transient carrier encompasses the foregoing transient signals, propagated signals, propagated medium, storage medium and the like.
  • Overview—Feature-Based Video Compression
  • The present invention provides a hybrid (feature-based and conventional) codec method (FIG. 1) with a means of detecting 113, separating 115, modeling 117, encoding 111, and decoding 124 features in video while allowing a conventional codec 118 to encode and decode the non-features as well as the features that cannot be advantageously processed through the feature encoder/decoder. FIG. 1 illustrates that a subject video signal input (video data formed of a series of image frames) 110 is encoded by the invention hybrid codec 121. The hybrid codec contains the encoding decision heuristics and processes the video signal as follows: At step 113, the detection of features is primarily accomplished through the identification of groups of pels in close proximity that exhibit complexity. Complexity is generally defined as any metric indicating that the encoding of the pels exceeds a level that would be encoded efficiently by conventional video compression. This grouping of pels in close proximity provides segmentation of the detected feature (at 115) from the background and other features. The grouping is subsequently analyzed to determine if the complexity can be advantageously modeled using the invention's feature modeling 117.
  • Once features are detected and tracked and models of the features are generated (at 117), the feature modeling and conventional modeling are compared (at comparator 119) to determine which one is of greater benefit. The conventional video encoding mechanism (at 118) employing reference frame processing used in the motion compensated prediction of the current frame is utilized in this process. Because the comparison 119 has employed conventional reference frame processing, a segmentation of the current frame is yielded (based on the selection of predictions from different reference frames). The selection of pels (more typically as macroblocks) in one reference frame versus another reference frame indicates a segmentation of the features in the frame, and in the subject video 110 itself. The resulting encoding 120 of the subject video signal input 110 includes a conventional video encoding stream (output of conventional encoder 118) accompanied by the additional encoded information needed to regenerate the features in the reference frames.
  • The hybrid codec decoder 122 illustrates decoding the encoded video in order to synthesize (approximate) the input video signal 110. When examining the stream of information contained in the encoded video, the hybrid codec decoder makes a determination 123 on a sub-frame (macroblock) level whether or not there is an encoded feature in the encoded video. If an encoded feature does not exist, the conventional macroblock, or non-feature macroblock, is decoded conventionally. If an encoded feature is encountered in the encoded video stream, the hybrid codec separates 125 the feature-encoded parts from the conventionally encoded parts in order to synthesize each separately, combining the parts after synthesis occurs. The hybrid codec uses the encoded feature parameters with the feature models that were created by the decoder (models made exactly and in parallel to those made in the encoder) to synthesize the feature 124. Then the conventionally encoded feature parts and the feature-encoded parts are composited 126 to produce a complete feature synthesis. Next the hybrid codec at 127 combines the feature synthesis with the non-feature synthesis to yield a fully synthesized video signal 128.
  • FIG. 7 is a representation of an embodiment of the present invention which utilizes feature-based encoding as a replacement in part, and in some cases in full, for conventional encoding. Detecting 710, tracking 720, comparing 730, modeling 740, encoding 750, and decoding 760 features in a video signal are illustrated.
  • At step 710, as in 113, the detection of features is primarily accomplished through the identification of spatially proximate groups of pels that exhibit complexity such that they can be encoded/modeled more efficiently than by conventional means. These groups of pels effectively separate the detected feature (710) from the non-feature pels around it, as also noted in 115. The detected features, or feature instance candidates (or simply feature candidates), are further analyzed to correlate the groups of pels over two or more frames. This correlation confirms that the feature instances represent a discrete entity in the video frames that can be tracked 720, thereby confirming additional redundancy in the video that can potentially be reduced through modeling the feature 740. Within step 720, the feature is tracked via the identification of the feature's instance (equivalently, region) within the current frame along with instances of the feature in one or more other frames, as also noted in 117. Note that throughout the present text, "feature instance" is equivalent to "region"; likewise, "instance" is equivalent to "feature instance" and "region" when it references them.
  • The instances of each individual feature are considered candidate features and are combined into a full-fledged feature by grouping them into a feature set (or simply "feature"). These instances are analyzed, compared, and classified into feature sets in step 730 through the identification of correspondences between the instances.
  • In the present text, feature candidates and feature instances are equivalent. The feature sets are analyzed to obtain a model of the deformation variation and appearance variation of the feature instances. The deformation variation between feature instances is determined through a deformation modeling process, which compares two or more instances in order to determine the spatial pel resampling that would be required to reduce the per-pel differences between the instances.
  • Feature candidates are modeled within step 740, which applies multiple analysis techniques to refine the sampled regions. Feature encoding 750 of the video stream utilizes the feature models and encodes the video stream in part, or in full, without the use of conventional video encoding. The decoding 760 synthesizes the features using the feature models in the inverse of the modeling operations 750 to decode the encoded features into a synthesis of the pels 770 of each feature instance, approximating the feature as it appeared originally in the video.
  • Prediction Segmentation [Primary]
  • FIG. 6 demonstrates the process of predicting elements within the current video frame by utilizing information from one or more past frames that has been placed in one or more reference frames. In one embodiment, the prediction, Method 1, 640, replicates regions from one or more prior decoded frames 610 into a reference frame 620. Method 2, 650, additionally places feature instances 660, comprised of feature regions 630-1, 630-2, . . . 630-n, into the reference frame. The insertion of the feature instance directly into the reference frame represents a simple form of the present invention, where, in one further embodiment, the segmentation is simply a rectangular region, and the model of the feature is the feature instance itself. Additional compression gains can be realized as further modeling techniques are applied to the identified features 660 and used within the reference frames.
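The reference-frame population described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes frames are numpy arrays, feature instances are rectangular regions with known positions, and all function names are hypothetical.

```python
import numpy as np

def build_reference_frame(decoded_frame, feature_instances):
    """Method 1: start from a prior decoded frame. Method 2: additionally
    paste feature-instance regions at their expected positions."""
    ref = decoded_frame.copy()
    for region, (y, x) in feature_instances:
        h, w = region.shape[:2]
        ref[y:y+h, x:x+w] = region  # simplest model: the instance itself
    return ref

# toy example: 16x16 frame, one 4x4 feature instance pasted at (2, 3)
frame = np.zeros((16, 16), dtype=np.uint8)
feat = np.full((4, 4), 255, dtype=np.uint8)
ref = build_reference_frame(frame, [(feat, (2, 3))])
```

The conventional motion compensated predictor can then select macroblocks from this reference frame exactly as it would from any other.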
  • Prediction Segmentation is the method by which conventional compression's motion compensated prediction is extended to allow a more accurate prediction. Conventional compression uses the additional reference frames, generated using the invention's feature modeling methods, to increase that accuracy. When parts of these feature reference frames are utilized by the conventional compression scheme, a gain in compression is achieved whenever the feature encoding is smaller than the conventional encoding would have been.
  • In one embodiment, features are represented as a set of elements or feature instances. In one embodiment, the feature instances are realized as rectangular regions, each one providing a reference to a unique frame, a spatial position within that frame, and a rectangular extent of the region in that frame. Each instance of the feature represents a sampled image of the feature. Variation in the appearance of the feature from instance to instance is modeled by the feature modeling method.
  • In one embodiment, the reference frames are populated with one or more sub frame samples from previously synthesized frames. The sub-frame samples are based on feature instance correspondences between those sub-frame regions in the previously synthesized (decoded) frame and the current frame.
  • In a further embodiment, the multiple image planes are consolidated into fewer image planes. These fewer image planes have the feature located close to the position expected in the frame to be predicted. Frame reduction is based on consolidating non-overlapping or near-zero spatially overlapping features into the same plane.
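The plane-consolidation step above can be sketched as a greedy packing of non-overlapping feature rectangles into as few image planes as possible; a simplified illustration with hypothetical function names, treating each feature as an axis-aligned rectangle `(x, y, w, h)`.

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap test; rect = (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def consolidate(regions):
    """Greedily place each feature region into the first plane where it
    does not overlap any previously placed region; open a new plane if none fits."""
    planes = []
    for r in regions:
        for plane in planes:
            if not any(overlaps(r, q) for q in plane):
                plane.append(r)
                break
        else:
            planes.append([r])
    return planes

# three features: the first two overlap, the third is disjoint from the first
planes = consolidate([(0, 0, 8, 8), (4, 4, 8, 8), (20, 0, 8, 8)])
```

A "near-zero overlap" tolerance, as the text allows, could be added by shrinking the rectangles slightly before the overlap test.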
  • Applicant's reduction to practice has gone further with this as well, by estimating a bounding box of the feature based on the feature information (previous matches, tracking information, modeling information).
  • In another non-limiting embodiment, each consolidated frame is equal to the size of the frame being predicted and the features are spatially close to if not exactly at the position expected by the conventional motion compensated prediction mechanism.
  • Feature Detection
  • FIG. 5 depicts a feature, 510-1, 510-2, . . . 510-n, that has been detected in one or more frames of the video 520-1, 520-2, . . . 520-n. Typically, such a feature would be detected using several different criteria based on both structural information derived from pels and complexity criteria indicating that conventional compression utilizes a disproportionate amount of resources to encode the feature as compared with feature encoding. Each feature can further be identified spatially in a frame 520-1, 520-2, . . . 520-n by a corresponding spatial extent, or perimeter, shown in the figure as a "region" 530-1, 530-2, . . . 530-n.
  • These regions 530-1, 530-2, . . . 530-n can be extracted, for instance as a simple rectangular region of pel data, and placed into an ensemble, 540, the whole of the ensemble representing a feature.
  • Each instance of a feature in a frame is a sample of the appearance of the feature. Note that when a sufficient number of these samples are coalesced into an ensemble, they can be used to model the appearance of the feature in those frames, and also in other frames from which the feature was not sampled. Such a model is able to transform the appearance into an encoded set of parameters that can further be decoded through the inverse model to create a synthesis of the feature.
  • Small spatial regions are identified and analyzed to determine if they can be combined based on some coherency criteria into larger spatial regions. These larger spatial regions are then analyzed to determine their suitability as candidate features. Should the region's feature modeling not provide a beneficial encoding, the candidate feature is either discarded or retained for modeling future instances of that feature with subsequent frames. The detection process proceeds until only those candidate features exhibiting an advantageous modeling remain.
  • Spatial regions vary in size from small groups of pels or sub-pels to larger areas that may correspond to actual objects or parts of those objects, such as tend to be implicitly segmented through the macroblock or sub-macroblock partitioning steps of conventional video compression algorithms. However, it is important to note that the detected features need not correspond to discretely unique and separable entities such as objects and sub-objects; there is no requirement that they do. A single feature may contain elements of two or more objects, or no object elements at all. The critical factor is that the current invention has the potential to process these signal components more efficiently than conventional methods, and that they satisfy the definition of a feature purely by being efficiently modeled by feature-based video compression techniques.
  • Small spatial regions may be aggregated into larger regions in order to identify these larger regions as features. Small regions are aggregated into larger ones through the identification of coherency among them. There are several ways that coherency can be identified including coherent motion, motion compensated prediction, and encoding complexity.
  • Coherent motion may be discovered through higher order motion models. For example, the translational motion of each individual small region is integrated into an affine motion model, which approximates the simpler motion model of each of the small regions.
  • If the small regions' motion can be integrated into more complex models on a consistent basis, this implies a dependency among the regions that may provide an advantage over a conventional motion compensated prediction method, and also indicates a coherency between the small regions that could be exploited through feature modeling.
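The integration of per-region translational motion into a single affine model can be sketched as a least-squares fit; a minimal illustration (function names are hypothetical), where a small residual indicates the regions move coherently.

```python
import numpy as np

def fit_affine(centers, translations):
    """Least-squares fit of an affine motion model v = A @ p + t to
    per-region translational motion vectors v observed at region centers p."""
    centers = np.asarray(centers, dtype=float)       # region centers, shape (N, 2)
    v = np.asarray(translations, dtype=float)        # per-region motion, shape (N, 2)
    X = np.hstack([centers, np.ones((len(centers), 1))])   # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(X, v, rcond=None)   # (3, 2): A stacked over t
    residual = v - X @ params
    return params, float(np.abs(residual).max())

# coherent example: a uniform zoom, v = 0.1 * p, fits the affine model exactly
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
motion = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
params, err = fit_affine(centers, motion)
```

If `err` stays small across frames, the regions are candidates for aggregation into one feature.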
  • Encoding complexity can be determined through analysis of the bandwidth required by conventional compression to represent one or more of the small regions. Where a disproportionate amount of bandwidth is allocated to a certain set of small regions that conventional encoding cannot efficiently compress, and that it may not be able to correlate as being redundant from frame to frame, those regions can be aggregated into a feature whose encoding complexity may indicate the presence of a phenomenon that feature modeling would better represent.
  • A set of known frames are each completely partitioned into uniform tiles arranged in a non-overlapping pattern. Each tile is analyzed as an independent sampled region of pels that is determined in practice to contain enough information to characterize the feature. The current invention uses these sampled regions to produce multiple classifications which, in turn, are used in training a classifier. Note that the final position of any feature may differ from this initial positioning.
  • A further embodiment generates sampled regions from the defined tiles and from a second tiling that overlaps those tiles. The overlapping sampling may be offset so that the centers of the overlapping tiles occur at the intersections of each set of four underlying tiles' corners. This over-complete partitioning is meant to increase the likelihood that an initial sampling position will yield a detected feature. Other, possibly more complex, topological partitioning methods are also anticipated.
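The base tiling plus half-tile-offset overlapping tiling can be sketched as follows; an illustrative enumeration of tile origins only (the function name is hypothetical).

```python
def tile_origins(frame_w, frame_h, tile, overlap=True):
    """Uniform non-overlapping tiling of a frame plus, optionally, a second
    tiling offset by half a tile so its centers sit on the base tiles' corners."""
    base = [(x, y) for y in range(0, frame_h - tile + 1, tile)
                   for x in range(0, frame_w - tile + 1, tile)]
    if not overlap:
        return base
    half = tile // 2
    offset = [(x, y) for y in range(half, frame_h - tile + 1, tile)
                     for x in range(half, frame_w - tile + 1, tile)]
    return base + offset

# a 32x32 frame with 16x16 tiles: four base tiles plus one offset tile
origins = tile_origins(32, 32, 16)
```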
  • A feature modeling predictor classifies sampled regions into clusters with significant probability that a region will have some correspondence to other regions in that same cluster. The feature modeling predictor uses pattern examples derived from the sampled region(s).
  • In a preferred embodiment, the features are detected with assistance from spectral profiling (described below in Spectral Profiling section). Spectral profiling provides regions of the frame that may be part of a single feature. This is used as a means of combining sampled regions into a feature.
  • In one embodiment, a pattern feature is defined as a spectral feature. The spectral feature is found by transforming the region from its original color space into HSV color space. The transformed region is then repeatedly sub-sampled until the image vector space of the derived region is of much smaller dimension than that of the original region. These derived regions are considered the spectral features. The spectral features are clustered using a modified K-means algorithm, and the K-means clusters are used to label the original regions based on their spectral classification.
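The sub-sample-then-cluster step can be sketched as below. This is a simplified stand-in: it assumes the region has already been converted to a single HSV plane, and it uses a plain K-means rather than the modified K-means the text names; all function names are hypothetical.

```python
import numpy as np

def downsample(region, target_dim=16):
    """Repeated 2x2 mean-pooling until the flattened vector is small."""
    r = np.asarray(region, dtype=float)
    while r.size > target_dim and r.shape[0] % 2 == 0 and r.shape[1] % 2 == 0:
        r = 0.25 * (r[0::2, 0::2] + r[1::2, 0::2] + r[0::2, 1::2] + r[1::2, 1::2])
    return r.ravel()

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal K-means; returns a cluster label per input vector."""
    rng = np.random.default_rng(seed)
    X = np.asarray(vectors, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# two bright regions and two dark regions (stand-ins for a hue plane)
regions = [np.full((8, 8), v) for v in (200, 210, 10, 20)]
labels = kmeans([downsample(r) for r in regions], k=2)
```

The cluster labels then classify the original regions, as the text describes.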
  • In one embodiment, a classifier is built based on the edge content of the sampled regions. Each region is transformed into DCT space. The derived feature's DCT coefficients are then summed for the upper triangular matrix and the lower triangular matrix. These sums are then used to form an edge feature space. The feature space is then clustered using K-means, and the original regions are labeled according to the classification of their derived region clusters.
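The DCT-triangle edge descriptor can be sketched as below; an illustrative reading of the text (function names hypothetical), with the diagonal excluded from both sums so the two components separate horizontal from vertical frequency energy.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2.0)
    return C

def edge_feature(region):
    """Sum the 2-D DCT coefficients of the upper and lower triangles
    (excluding the diagonal) to form a two-component edge descriptor."""
    r = np.asarray(region, dtype=float)
    C = dct_matrix(r.shape[0])
    coeffs = C @ r @ C.T               # separable 2-D DCT
    upper = np.triu(coeffs, k=1).sum()
    lower = np.tril(coeffs, k=-1).sum()
    return upper, lower

# a vertical edge puts its energy entirely in horizontal-frequency coefficients
region = np.zeros((8, 8))
region[:, 4:] = 255.0
upper, lower = edge_feature(region)
```

Such two-component descriptors can then be clustered with K-means, as described.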
  • In yet another embodiment, the spectral feature and edge pattern feature classifier are used to generate multiple classifications for each region.
  • Feature Tracking
  • One embodiment uses a combination of newly detected and previously tracked features as the basis for determining the instances of the same corresponding feature in the current frame. The identification of this feature's instance in the current frame and the inclusion of this instance along with previously occurring instances of the region constitute the tracking of the feature.
  • FIG. 8 demonstrates the use of a feature tracker 830 along with the combination of newly detected and previously tracked features 810 to track and classify features 820-1, 820-2, . . . 820-n. Initially, a general feature detector 850 is used to identify features. Correspondence is determined based on the current frame 840 being matched to the previously detected features 810. The tracked features are organized into sets of features, or classified as belonging to a previously assembled feature set or to a new feature set.
  • Feature correspondence can initially be determined through conventional gradient descent minimizing an estimate of mean squared error. The resulting spatial displacement gives an approximate position of the feature in the current frame. The template used in the search need not be a single region of the feature; it can be any of the regions associated with the feature. The final match is then evaluated in a robust manner as a count of non-overlapping region tiles that meet an MSE threshold.
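The robust match evaluation at the end of the paragraph can be sketched as a per-tile MSE count; a minimal illustration with hypothetical names and an arbitrary threshold.

```python
import numpy as np

def robust_match_score(candidate, template, tile=4, mse_thresh=100.0):
    """Count non-overlapping tiles of the candidate match whose MSE against
    the template falls under a threshold; a high count is a robust match."""
    c = np.asarray(candidate, dtype=float)
    t = np.asarray(template, dtype=float)
    count = 0
    for y in range(0, c.shape[0], tile):
        for x in range(0, c.shape[1], tile):
            diff = c[y:y+tile, x:x+tile] - t[y:y+tile, x:x+tile]
            if (diff ** 2).mean() <= mse_thresh:
                count += 1
    return count

# identical 8x8 regions: all four 4x4 tiles pass; dissimilar regions: none do
score = robust_match_score(np.ones((8, 8)), np.ones((8, 8)))
miss = robust_match_score(np.zeros((8, 8)), np.full((8, 8), 50.0))
```

Counting passing tiles, rather than averaging one global MSE, keeps a single badly mismatched tile from rejecting an otherwise good correspondence.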
  • By imposing a spatial constraint on the coincidence of two or more regions, the tracker is able to decrease the number of features being tracked and therefore increase the computational efficiency of the tracking. The spatial coincidence of two or more features can also indicate additional feature cases, where the feature may have actually been two features in the past or some other complex feature topology. The tracker modes allow for temporary degenerate tracking states that allow the feature to be tracked, but make the tracked regions of a lower priority.
  • Feature Modeling
  • During some analysis phase, the region to be predicted is used to traverse the Region Correspondence Model (RCM) in order to determine regions within the model that would be used to construct a region prediction model.
  • In one embodiment, the target region is used to update the RCM thereby generating translational and mid-point normalized correspondences between other regions contained within the RCM and the target region. The resulting pair-wise region correspondences identify the other regions most likely to yield a prediction model for the target region.
  • The present invention includes the assembly of one or more of the best correspondences for a particular target region into a set termed an ensemble of regions. The ensemble of regions can be spatially normalized toward one key region in the ensemble. In one embodiment, the region closest to the target region temporally is selected as the key region. The deformations required to perform these normalizations are collected into a deformation ensemble, and the resulting normalized images are collected into an appearance ensemble, as described in U.S. Pat. Nos. 7,508,990, 7,457,472, 7,457,435, 7,426,285, 7,158,680, 7,424,157, and 7,436,981 and U.S. application Ser. No. 12/522,322, all by Assignee. The entire teachings of the above listed patents and application are incorporated by reference.
  • The appearance ensemble is processed to yield an appearance model, and the deformation ensemble is processed to yield a deformation model. The appearance and deformation models in combination become the feature model for the target region. In one embodiment, the method of model formation is a Principal Component Analysis (PCA) decomposition of the ensemble followed by a truncation of the resulting basis vectors. In a further embodiment, the criteria for truncation may be the intra-ensemble reconstruction.
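The PCA-with-truncation model formation can be sketched via the SVD; a minimal illustration (names hypothetical) in which the truncated basis serves as the feature model and the projection coefficients serve as the feature parameters.

```python
import numpy as np

def pca_model(ensemble, keep):
    """PCA of an appearance (or deformation) ensemble via SVD,
    truncated to `keep` basis vectors."""
    X = np.asarray(ensemble, dtype=float)    # one region vector per row
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:keep]                   # mean plus truncated basis

def encode(region, mean, basis):
    return basis @ (region - mean)           # feature parameters

def decode(params, mean, basis):
    return mean + basis.T @ params           # synthesis of the region

rng = np.random.default_rng(1)
ensemble = rng.normal(size=(10, 16))         # 10 normalized region vectors
mean, basis = pca_model(ensemble, keep=4)
params = encode(ensemble[0], mean, basis)
recon = decode(params, mean, basis)
```

The truncation point would be chosen by a criterion such as the intra-ensemble reconstruction error mentioned in the text.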
  • In another embodiment, the method of model formation (appearance and deformation models) is Compressed Sensing (CS), described elsewhere, wherein the model parameters are resolved from partial parameter measurements.
  • The target region is projected onto the feature model, yielding the feature parameters, which constitute the deformation and appearance modeling of the region. These feature parameters are also the encoding of the target region.
  • The feature model parameters for two or more intra-ensemble regions are selected using temporal criteria. These parameters are used to predict the state of the target region given the known interval between the regions themselves and the target regions. One example of a state model is a linear extrapolation of two or more feature parameters given temporal steps. The linear model is used to predict the feature parameters for the target region.
  • If the extrapolated values provide a suitable synthesis (decoding) of the target region, the specification of the target region's feature parameters is not required, or they can be differentially specified relative to the extrapolated parameters.
  • The state model for extrapolation can be of higher order than a simple linear model. In one embodiment, an extended Kalman filter is used to estimate the feature parameter state.
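The linear state model described above (the simplest of the extrapolation options, short of a Kalman filter) can be sketched as a per-parameter line fit; names are hypothetical.

```python
import numpy as np

def extrapolate_params(times, params, target_time):
    """Fit a linear state model to feature parameters observed at known
    times and predict the parameters at the target region's time."""
    t = np.asarray(times, dtype=float)
    P = np.asarray(params, dtype=float)              # one parameter vector per row
    A = np.vstack([t, np.ones_like(t)]).T
    coeff, *_ = np.linalg.lstsq(A, P, rcond=None)    # slope and intercept per parameter
    return coeff[0] * target_time + coeff[1]

# parameters drifting linearly over frames 0, 1, 2; predict frame 3
pred = extrapolate_params([0, 1, 2], [[1.0, 0.0], [1.5, 0.2], [2.0, 0.4]], 3)
```

If the prediction synthesizes the target region well enough, its parameters need not be sent, or can be sent as a small differential, as the text notes.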
  • Region Correspondence Model
  • The combination of classification, registration, and deformation analysis provides a set of information that indicates the probability that two or more regions can be combined into a joint model of appearance and deformation, called a Region Correspondence Model (RCM).
  • In one preferred embodiment, the feature detection method (described above) analyzes novel features incrementally. One result of this analysis is the higher probability that a region would correspond to other regions used to construct one of the feature detectors.
  • Once regions are classified into clusters as in the above-discussed feature detection, and given their respective cluster labels, the inter-cluster regions are analyzed to determine the per region correspondence between region pairs.
  • In a preferred embodiment, the classifier described above is used to define clusters of sampled regions whose region source pels are further analyzed and defined through region translational refinement (described below).
  • Additionally, in a preferred embodiment, subsequent to translational region refinement, region correspondences can be further defined in terms of their region deformation analysis (discussed below).
  • In one embodiment, the construction of the RCM is achieved incrementally. Two or more regions are used to initially seed the combined classifier/deformation analysis mechanism. The RCM is then updated with new regions that alter the classifiers and the deformation analysis elements.
  • In one embodiment, the incremental update of the RCM described above is constructed such that the region correspondences for a given model are processed in a traversal order dependent on the base complexity analysis detailed below.
  • In one embodiment, the traversal order discussed above, dependent on a base complexity analysis (described below), is part of an iterative process that updates the RCM with traversal termination criteria. The termination criteria leave the processing completed to a level that maximizes the RCM's ability to represent correspondences with the greatest probability of reducing complexity when appearance/deformation models are derived from the correspondences.
  • Region Translational Refinement
  • In one embodiment, sampled regions are gathered together into a set of training sampled regions. The spatial position of these regions in each frame is refined.
  • A refinement includes an exhaustive comparison of each sampled region to every other sampled region. This comparison comprises two tile registrations: the first registers a first region to a second region, and the second registers the second region to the first. Each registration is performed at the position of the regions in their respective images. The resulting registration offset, along with the corresponding positional offset, is retained; these are referred to as correlations.
  • The correlations are analyzed to determine if multiple registrations indicate that a sampled region's position should be refined. If the refined position in the source frame would yield a lower error match for one or more other regions, then that region position is adjusted to the refined position.
  • The refined position of the region in the source frame is determined through a linear interpolation of the positions of other region correspondences that temporally span the region in the source frame.
  • Spectral Profiling
  • The Spectral Profiling method is a statistical "mean tracking and fitting" method. Other examples of such methods described in the literature are CAMSHIFT, mean shift, medoid shift, and their derived methods as applied to the detection, tracking, and modeling of spatial probability distributions occurring in images and video frames. The Spectral Profiling method of the present invention starts by analyzing intensity elements (pels of the spectral (color) planes) of a region of an image plane, across one or more frames. The intensity elements are first discretized via a histogram binning method. The histogram for a region is then used with a tracking mechanism to identify corresponding regions in subsequent frames that have a similar histogram. The region's set of elements (position, discretization criteria, and histograms) is iteratively refined so that it converges on a common set of these elements. This refined set of elements is the spectral profile. The Spectral Profiling method is a feature detection method.
  • There is an advantage to using a one-dimensional K-means classification, so the Hue channel of an HSV color space is utilized in the formation of the classifier. Additionally, the pels are classified, histogram bins are filled, and spatial invariant moments are determined.
  • The core basis functions for the present invention utilize preexisting data to derive models for the new data. The preexisting data can be obtained through any encoding/decoding scheme and is assumed to be available. The invention analyzes this data to determine a set of candidate pattern data, referred to as feature data, which can include data for both the appearance and deformation of a spatially localized component of the video signal.
  • Given a particular set of preexisting feature data and a novel target data point, analysis is performed to determine a minimal description of the feature data required to build a model for representing the target data point. Without loss of generality, the preexisting feature data is referred to as the candidate feature vectors and the target data point is referred to as the target vector. Further, the process is applicable to one or more target vectors.
  • Given a target vector and a set of candidate feature vectors (all deemed to be part of the same feature), a minimal subset of the candidate feature vectors is selected to synthesize the target vector with low error, resulting in a manifold representation that is both compact and accurate.
  • The present invention aggregates a set of candidate feature vectors into what is termed the feature ensemble. In one embodiment, the first step in creating the feature ensemble is to select a key vector, a feature vector determined to be a good approximation of the target vector. The key vector is the first vector in the feature ensemble. Other candidate feature vectors are selected for the feature ensemble in the order of their correlation with the key vector (so the second vector in the feature ensemble is the feature vector having next-highest correlation with the key vector). Ordering a feature ensemble in this way is termed key-correlation ordered (KCO).
  • In another embodiment, the feature ensemble is created using the target vector itself. Candidate feature vectors are selected for the feature ensemble based on their correlation with the target vector. Any ordering method making use of target vector correlation is termed target-correlation ordered (TCO). The first feature vector in a TCO feature ensemble is the candidate feature having largest correlation with the target vector. In a preferred embodiment, every time a feature vector "enters" the ensemble, the approximate reconstruction of the target vector t via the ensemble-to-date (Ur) is computed as Ur*Ur^T*t and then subtracted from t to form a residual vector. The next feature vector for the ensemble is then selected as the candidate feature having largest correlation with the residual vector. This iterative process of computing the residual vector and then selecting the best match to the residual is thus termed sequential target-correlation ordering (STCO). STCO ensures the most efficient representation of the target vector for a given ensemble size. It is functionally equivalent to orthogonal matching pursuit (see Prior Art) but more computationally efficient for small ensemble sizes.
  • In another embodiment, residual vectors are not computed and all candidate feature vectors are selected for the feature ensemble based on their correlation with the target vector itself. This TCO method, termed global target-correlation ordering (GTCO) is faster and simpler than STCO but may result in redundancies in the ensemble. However, both TCO methods are generally far superior to the KCO method for selecting the ensemble.
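The STCO selection loop can be sketched as below; an illustrative reading (names hypothetical) that orthonormalizes the ensemble-to-date with a QR factorization so that Ur*Ur^T*t is the projection of the target onto the ensemble subspace.

```python
import numpy as np

def stco_ensemble(candidates, target, size):
    """Sequential target-correlation ordering: repeatedly pick the candidate
    most correlated with the current residual, then project the target onto
    the orthonormalized ensemble-to-date (Ur) to recompute the residual."""
    X = np.asarray(candidates, dtype=float)          # one candidate per row
    t = np.asarray(target, dtype=float)
    chosen, residual = [], t.copy()
    for _ in range(size):
        norms = np.linalg.norm(X, axis=1)
        corr = np.abs(X @ residual) / np.where(norms > 0, norms, 1.0)
        corr[chosen] = -np.inf                       # never reselect a vector
        chosen.append(int(np.argmax(corr)))
        Ur, _ = np.linalg.qr(X[chosen].T)            # orthonormal ensemble basis
        residual = t - Ur @ (Ur.T @ t)               # t minus its projection
    return chosen

# target lies in the span of candidates 0 and 2, so STCO selects exactly those
cands = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [1.0, 1.0, 0]])
picks = stco_ensemble(cands, np.array([2.0, 0, 1.0]), size=2)
```

GTCO would skip the residual update and rank all candidates once against `t` itself, which is faster but can select redundant vectors.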
  • A bitmask is used to transmit the feature vectors that were selected for the feature ensemble.
  • In one embodiment, the feature vectors in the feature ensemble and the target vector itself are passed through a discrete wavelet transform (DWT) before SVD-based encoding. This makes the information in the target vector more compact and more easily represented by a small subspace of SVD vectors. The DWT is a well known method for compacting signal information over multiple scales. In a preferred embodiment, the DWT is applied with the Daubechies 9-7 bi-orthogonal wavelet. The DWT is applied to each component separately, as the feature vectors are in YUV color space. For example, length-384 YUV vectors require a length-256 DWT on the Y component and length-64 DWTs on the U and V components.
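The per-component DWT step can be sketched as follows; a simplified illustration that substitutes a one-level Haar transform for the Daubechies 9-7 wavelet named in the text (whose lifting coefficients are omitted here), with hypothetical function names.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar DWT: scaled pairwise averages
    (approximation) followed by scaled pairwise differences (detail)."""
    x = np.asarray(x, dtype=float)
    avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    diff = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return np.concatenate([avg, diff])

def transform_yuv_vector(vec, y_len=256, uv_len=64):
    """Apply the DWT to each component of a length-384 YUV feature vector
    separately: length-256 on Y, length-64 on each of U and V."""
    y = vec[:y_len]
    u = vec[y_len:y_len + uv_len]
    v = vec[y_len + uv_len:]
    return np.concatenate([haar_dwt(y), haar_dwt(u), haar_dwt(v)])

vec = np.ones(384)                  # a constant vector compacts entirely
out = transform_yuv_vector(vec)     # into the approximation coefficients
```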
  • Compressed Sensing (CS)
  • In one embodiment of the present invention, Compressed Sensing (CS) is employed as the method of model formation (appearance and deformation models) in the Feature Modeling (described elsewhere) process.
  • There are three practical applications of CS algorithms of interest in the present invention: Orthogonal Matching Pursuit (OMP), L1 Minimization (L1M), and Chaining Pursuit (CP). Each algorithm has its own strengths and weaknesses, but L1M is prohibitively slow for most video processing applications; in this field, OMP and CP are the two CS algorithms of choice, and L1M is used infrequently.
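For concreteness, the textbook form of OMP (one of the three algorithms named above) can be sketched as follows; a minimal, generic implementation with a hand-constructed measurement matrix, not the patent's configuration.

```python
import numpy as np

def omp(Phi, y, sparsity):
    """Orthogonal Matching Pursuit: greedily add the measurement-matrix
    column most correlated with the residual, then least-squares re-fit
    all selected coefficients and update the residual."""
    support, residual = [], y.astype(float)
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x = np.zeros(Phi.shape[1])
    x[support] = coef
    return x

# recover a 2-sparse signal from 3 measurements of a length-4 signal
Phi = np.array([[1.0, 0.0, 0.0, 0.6],
                [0.0, 1.0, 0.0, 0.8],
                [0.0, 0.0, 1.0, 0.0]])
x_true = np.array([2.0, 0.0, 1.0, 0.0])
x_hat = omp(Phi, Phi @ x_true, sparsity=2)
```

The least-squares re-fit over the whole support at each step is what distinguishes OMP from plain matching pursuit.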
  • The effectiveness of CS algorithms is limited in practice by computation time, memory limits, or total number of measurements. To combat these limitations and improve the performance of CS algorithms in practice, the present invention uses one or more of several possible methods. Briefly, the methods achieve benefit through: (1) reducing the number of measurements specified in the literature to attain a precise reconstruction; (2) increasing sparsity in the input data by one or more specific data reduction techniques; (3) partitioning the data to ease memory limitations; and (4) adaptively building an expectation of error into the reconstruction algorithm.
  • One embodiment exploits the fact that, typically, the mathematical requirements for reconstruction are stricter than necessary. It is possible to achieve “good” reconstruction of image data consistently with fewer measurements than specified in the literature. “Good” reconstruction means that to the human eye there is little difference visually compared with a “full” reconstruction. For example, applying Chaining Pursuit (CP) with half the number of measurements specified still achieves “good” reconstruction.
  • In another embodiment, the input data is "reduced" to make it sparser, which reduces the number of measurements required. Data reduction techniques include passing the data through a discrete wavelet transform (DWT), because data is often sparser in the wavelet domain; physically reducing the total size of the input data by truncation (also known as down-sampling); and thresholding the data (removing all components below some threshold). Of these techniques, DWT transformation is the least "invasive" and theoretically allows full recovery of the input data; the other two are "lossy" and do not allow full signal recovery. DWT works well with CP but not with Orthogonal Matching Pursuit (OMP) or L1 Minimization (L1M), so the ideal combination for this data reduction embodiment is the Chaining Pursuit algorithm with the Discrete Wavelet Transform data reduction technique.
  • In another embodiment, especially well-suited to parallel processing architectures, the input data is partitioned into segments (or, for 2-D images, into tiles), and each segment is processed separately with a smaller number of required measurements. This approach works well for both OMP and L1M, which are typically impeded by a memory limitation caused by the size of the required measurement matrix. One can compute the amount by which the measurement matrix exceeds the memory of the system; this excess memory requirement is an "oversampling" factor, and it sets a lower limit on the number of segments into which the signal is divided.
  • In another embodiment, the process builds some expectation of error into the reconstruction algorithm. The expected error could be due to above normal noise or inaccurate measurements. The process compensates either by relaxing the optimization constraint or by stopping the iteration prior to completion of the reconstruction process. The reconstruction is then an approximate fit to the data, but such approximate solutions may be sufficient or may be the only solutions possible when the input data is noisy or inaccurate.
  • FIG. 2 displays a notional video compression architecture that implements compressed sensing measurements at the encoder. The raw video stream 200 is sent through a motion compensated prediction algorithm 202 to register the data 203 thereby establishing correspondences between groups of pels in multiple frames such that the redundancies due to motion can be factored out. Then preprocessing 204 is applied to make the data as sparse as possible (at 205) so that CS measurements and the reconstruction that follow will be as effective as possible. CS measurements are taken 206 and become the CS encoding 207 (ready for transmission). Later during synthesis, the CS algorithm is used to decode the measurements.
  • The present invention identifies, separates, and preprocesses signal components from raw video streams into sparse signals that are well suited to CS processing. CS algorithms are naturally compatible with embodiments of the invention. It should be noted that certain aspects of FIG. 2 are related to embodiments discussed in U.S. Pat. Nos. 7,508,990, 7,457,472, 7,457,435, 7,426,285, 7,158,680, 7,424,157, and 7,436,981 and U.S. application Ser. No. 12/522,322, all by Assignee. The entire teachings of the above listed patents and patent application are incorporated herein by reference.
  • In the context of video compression, CS delivers a significant benefit when the input image has some sparsity, or compressibility. If the input image is dense, then CS is not the correct approach for compression or reconstruction. CS algorithms can compress and reconstruct sparse input images with fewer measurements than required by conventional compression algorithms (which require a number of measurements equal to the number of pixels in the image). Note that signal sparsity or compressibility is assumed by most compression techniques, so the images for which CS provides improvement are the images for which most compression techniques are designed.
  • Note also that adding noise to a sparse image makes it denser mathematically but does not make it less sparse “informationally.” It is still a sparse signal, and using CS with one or more of the above practical implementations can produce useful reconstructions of these kinds of signals.
  • Base Complexity Analysis
  • Representative sampled video regions can be analyzed using a base method. One such method is conventional block-based compression, such as MPEG-4.
  • Image Alignment via Inverse Compositional Algorithm
  • Xu and Roy-Chowdhury (“Integrating Motion, Illumination, and Structure in Video Sequences . . . ,” IEEE Trans. Pattern Analysis and Machine Intelligence, May 2007) extended the LRLS framework to moving objects (e.g., in video sequences), showing that such objects are well-approximated by a 15-dimensional bilinear basis of 9 illumination functions (the original LRLS basis images) and 6 motion functions that reflect the effect of motion on the LRLS basis images.
  • The implementation recently proposed by Xu and Roy-Chowdhury (“Inverse Compositional Estimation of 3D Pose and Lighting in Dynamic Scenes,” IEEE Trans. Pattern Analysis and Machine Intelligence, to be published) uses the Inverse Compositional (IC) algorithm to estimate 3D motion and lighting parameters from a sequence of video frames. A 2D-to-3D-to-2D warping function is used to align (target) images from different frames with a “key” frame (template) at a canonical pose. Given a frame of image data and an underlying 3D model of the object being imaged, the 2D-to-3D map determines which 3D points (facets/vertices) in the 3D model correspond to which image pixels. Once the 2D-to-3D map has been defined, the object's pose is shifted in 3D by the previous frame's pose estimate, thereby aligning the current frame with the key frame. The shifted object in 3D is then mapped back to 2D using the 3D-to-2D (projection) map to form a “pose normalized” image frame.
  • Once the target frame has been registered to the template (key frame) using the 2D-to-3D-to-2D map, the resulting pose-normalized frame (PNF) is used to estimate 15 parameters, corresponding to 9 illumination and 6 motion variables. The illumination variables are estimated via a least-squares fit of the PNF to the LRLS (illumination) basis images. In one embodiment, the illumination component estimated by the LRLS basis images is then subtracted from the PNF, and the residual is used to estimate 6 motion parameters (3 translation and 3 rotation) via least-squares fit to the motion functions. The PNF can then be reconstructed from the 15-dimensional “bilinear” illumination/motion basis and its corresponding parameter vector.
  • The present invention uses aspects of the Xu/Roy-Chowdhury IC implementation to aid with image registration applications. In one embodiment, the 2D-to-3D-to-2D mapping is used as a computationally efficient substitute for midpoint normalization of feature regions. The mapping process is especially useful for features where accurate 3D models (such as the Vetter model for faces) exist. In this embodiment, the model points are specified at some pose (the “model pose”) and both the key frame (the template) and the current frame (or target frame) are registered to the model pose.
  • Application of Incremental Singular Value Decomposition (ISVD) Algorithm
  • In the present invention, the SVD is reduced using a variation of the common magnitude thresholding method, termed here percentage thresholding. In one embodiment, the total energy E of the singular values in a given SVD factorization is computed as the sum of the singular values. A grouping of the singular values, referred to in the present text as a “reduced set,” is created when singular values are added sequentially (in decreasing order of magnitude, largest to smallest) until the sum of the singular values in the reduced set exceeds some percentage threshold of E. This reduction method is equivalent to magnitude thresholding (see Prior Art), except the magnitude threshold does not need to be known ahead of time.
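The percentage-thresholding rule above can be sketched in a few lines, assuming the singular values are available as a list (illustrative code and threshold, not the patent's implementation):

```python
# Percentage thresholding sketch: accumulate singular values largest to
# smallest until the running sum exceeds a percentage of the total
# energy E (here, the sum of singular values).
def reduced_set_size(singular_values, threshold=0.90):
    """Number of singular values retained by percentage thresholding."""
    svals = sorted(singular_values, reverse=True)
    total_energy = sum(svals)            # E = sum of singular values
    running, kept = 0.0, 0
    for s in svals:
        running += s
        kept += 1
        if running > threshold * total_energy:
            break
    return kept

# Example: four of six values carry more than 90% of the energy.
print(reduced_set_size([10.0, 5.0, 2.0, 1.0, 0.5, 0.5], 0.90))  # 4
```

The magnitude of the smallest retained singular value is the implied magnitude threshold, which is why this is equivalent to magnitude thresholding without knowing the threshold ahead of time.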
  • In the present invention, the singular value decomposition (SVD) is applied to feature data as follows. The M×N data matrix D consists of an ensemble of feature vectors, derived from the regions (tiles) of a given video image frame. The M×1 feature vectors are column-vectorized from 2D image tiles and are concatenated to form the columns of the data matrix D. In one embodiment, the data matrix is then factorized into its SVD and then reduced, Dr=Ur*Sr*Vr′, where the reduction is via percentage thresholding. The left singular vectors are then used to encode the M×1 target vector t, the feature to be transmitted, with the final encoding given by Ur′*t. Typical dimensions might be M=384, N=20, and r=10, so that a length-384 target vector is compressed (encoded) with 10 coefficients.
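The encoding step Ur′*t can be sketched with a toy orthonormal Ur standing in for a real SVD factorization of feature data (all dimensions and values below are illustrative, far smaller than the typical M=384, r=10):

```python
# SVD-based encoding sketch: a toy orthonormal basis (columns of Ur)
# stands in for the reduced left singular vectors of the data matrix.
M, r = 4, 2
Ur = [[1.0, 0.0],
      [0.0, 1.0],
      [0.0, 0.0],
      [0.0, 0.0]]
t = [3.0, -1.0, 0.2, 0.0]     # M x 1 target feature vector

# Encoding: c = Ur' * t  (r coefficients instead of M pel values)
c = [sum(Ur[i][k] * t[i] for i in range(M)) for k in range(r)]

# Decoding: t_hat = Ur * c, the best approximation of t within the
# r-dimensional subspace spanned by the retained singular vectors.
t_hat = [sum(Ur[i][k] * c[k] for k in range(r)) for i in range(M)]
print(c)      # [3.0, -1.0]
print(t_hat)  # [3.0, -1.0, 0.0, 0.0]
```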
  • Because not all feature vectors in the ensemble data matrix D are available at once, the incremental SVD (ISVD) is used to update the SVD based on the existing singular value decomposition and the data update. In one embodiment, a small number of feature vectors is grouped together to form an initial data matrix D0, for which the conventional SVD is easily computed. Then, as additional feature data vectors are added to the ensemble data matrix, the ISVD is used to update the SVD for the augmented data matrix. In a further embodiment, because new feature data vectors can sometimes be redundant with the subspace already represented in the ensemble data matrix, a linear independence test is applied to the new data vectors before they are added to the existing ensemble. Once the full set of feature data vectors has been added to the ensemble, the SVD is updated and reduced (via percentage thresholding) to provide the final SVD-based encoding.
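The linear independence test can be sketched as a Gram-Schmidt-style residual check; this is an assumed formulation (the patent does not specify the exact test), shown against an orthonormal basis for simplicity:

```python
import math

# Linear-independence test sketch (assumed formulation): project the new
# feature vector onto the existing orthonormal subspace and admit it
# only if the residual energy is non-negligible.
def is_novel(new_vec, ortho_basis, tol=1e-6):
    """True if new_vec is not (numerically) in span(ortho_basis)."""
    residual = list(new_vec)
    for q in ortho_basis:
        coeff = sum(qi * ri for qi, ri in zip(q, residual))
        residual = [ri - coeff * qi for qi, ri in zip(q, residual)]
    return math.sqrt(sum(ri * ri for ri in residual)) > tol

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(is_novel([2.0, -3.0, 0.0], basis))  # False: redundant with subspace
print(is_novel([0.0, 0.0, 1.0], basis))   # True: adds a new direction
```

Redundant vectors are skipped, so the ISVD update is only performed for vectors that enlarge the represented subspace.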
  • In another embodiment, the SVD is reduced using the correlations of the left singular vectors (the columns of Ur) with the target vector t. The total correlation energy CE is computed as the sum of the correlations. A grouping of the singular values, referred to in the present text as a “reduced set,” is created when correlations are added sequentially (in decreasing order of magnitude, largest to smallest) until the sum of the correlations in the reduced set exceeds some percentage threshold of CE. This method of reducing the SVD, termed target-correlation percentage thresholding, follows the same methodology as the basic SVD reduction method of percentage thresholding, except that target correlations (of left singular vectors with the target vector) are used instead of singular values for the computations.
  • Transform-Based Processing
  • The present invention performs empirical feature classification on video frame data in transform space. In one embodiment, a set of Nt features from a reference frame is presented as input to the classifier. Each of the features is transformed from pel space to transform space using the linear transform of choice (possible transforms include the discrete wavelet transform [DWT] and curvelet transform [CuT]). Then, the indices corresponding to the largest P coefficients for each feature are tabulated, and the P most commonly occurring indices across all the index lists are used to create a (P×1) classification vector (CV) for each feature (a total of Nt “reference” CVs in all). Each new feature vector v is then classified by transforming the vector, extracting the CV indices for v, and computing a similarity measure between the CV for v and each of the reference CVs. The test feature is classified as the feature whose reference CV maximizes the similarity measure.
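The classification step can be sketched as follows, with the transform stage elided (the coefficient values, the choice of index-set overlap as the similarity measure, and P=3 are illustrative assumptions):

```python
# Transform-space classifier sketch: each feature is summarized by the
# indices of its P largest-magnitude transform coefficients, and a test
# feature is assigned to the reference it overlaps most.
def classification_vector(coeffs, P):
    """Indices of the P largest-magnitude transform coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))
    return set(order[:P])

def classify(test_coeffs, reference_coeffs, P=3):
    test_cv = classification_vector(test_coeffs, P)
    sims = [len(test_cv & classification_vector(ref, P))
            for ref in reference_coeffs]
    return max(range(len(sims)), key=lambda k: sims[k])

refs = [[9.0, 0.1, 8.0, 0.2, 7.0],    # feature 0: energy at indices 0, 2, 4
        [0.1, 9.0, 0.2, 8.0, 0.3]]    # feature 1: energy at indices 1, 3
print(classify([5.0, 0.0, 4.0, 0.1, 3.0], refs))  # 0
```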
  • Information from two or more linear transforms with different strengths and weaknesses can be combined using orthogonal matching pursuit to improve the performance of the empirical transform-based feature classifier. In one embodiment, basis vectors from the DWT, which is effective at representing textures, and from the CuT, which is effective at representing edges, are combined into a dictionary D. Then, OMP is used to compute a signal representation using the functions in D for each of Nt features, as well as a representation for the “test” feature vector. The classifier then proceeds as in the basic transform-based classifier described above. Combining the information from multiple transforms in this way can improve classifier performance over that achieved by each of the individual classifiers.
  • Linear transforms (e.g., DWT and CuT) can also be used for compression and coding of features. In one embodiment, once a feature is transformed, the transform coefficients are ordered by magnitude and thresholded according to an energy retention criterion (e.g., enough coefficients are kept such that 99% of the feature energy is retained). Typically, many fewer transform coefficients are needed to retain 99% of signal energy than pels are needed in pel space. The transform coefficient values represent the encoding of the feature, and the compression gain is given by the percentage of transform coefficients kept relative to the number of pixels in the feature. In a further embodiment, information from multiple transforms can again be combined using OMP to improve compression gain.
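The energy-retention thresholding can be sketched as follows (illustrative code, assuming energy is measured as the sum of squared coefficients):

```python
# Energy-retention coding sketch: keep the largest-magnitude transform
# coefficients until the requested fraction of total energy is retained.
def keep_for_energy(coeffs, retain=0.99):
    """Indices of the smallest set of largest-magnitude coefficients
    retaining the requested fraction of total (squared) energy."""
    order = sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))
    total = sum(c * c for c in coeffs)
    kept, energy = [], 0.0
    for i in order:
        kept.append(i)
        energy += coeffs[i] ** 2
        if energy >= retain * total:
            break
    return kept

coeffs = [10.0, 0.1, -6.0, 0.05, 0.2, 0.1, 3.0, 0.05]
kept = keep_for_energy(coeffs)
print(len(kept))  # 3 of 8 coefficients retain >= 99% of the energy
```

Here the compression gain would be 3/8: the three retained coefficient values (and their indices) encode the feature in place of its eight pel values.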
  • While this invention has been particularly shown and described with references to example embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.

Claims (1)

What is claimed is:
1. A computer implemented method of processing video data formed of a series of video frames comprising:
using a first video encoding process and a feature-based encoding process, processing the video frames by:
encoding the video frames with the first video encoding process;
processing one or more of the frames to detect one or more instances of a feature by searching the one or more frames for a region of pels having coherency and computational complexity as compared to other pels in the one or more frames;
modeling variation of the feature instance relative to other instances of the feature to create a feature-based encoding of the feature instance; and
comparing compression efficiency of the feature-based encoding of the feature instance relative to an encoding of the feature instance resulting from the first video encoding process.
US14/592,898 2005-03-31 2015-01-08 Feature-Based Video Compression Abandoned US20150189318A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/592,898 US20150189318A1 (en) 2005-03-31 2015-01-08 Feature-Based Video Compression

Applications Claiming Priority (15)

Application Number Priority Date Filing Date Title
US66753205P 2005-03-31 2005-03-31
US67095105P 2005-04-13 2005-04-13
US11/191,562 US7158680B2 (en) 2004-07-30 2005-07-28 Apparatus and method for processing video data
US11/230,686 US7426285B2 (en) 2004-09-21 2005-09-20 Apparatus and method for processing video data
US11/280,625 US7457435B2 (en) 2004-11-17 2005-11-16 Apparatus and method for processing video data
US11/336,366 US7436981B2 (en) 2005-01-28 2006-01-20 Apparatus and method for processing video data
US11/396,010 US7457472B2 (en) 2005-03-31 2006-03-31 Apparatus and method for processing video data
US88196607P 2007-01-23 2007-01-23
PCT/US2008/000090 WO2008091483A2 (en) 2007-01-23 2008-01-04 Computer method and apparatus for processing image data
US10336208P 2008-10-07 2008-10-07
US52232209A 2009-07-07 2009-07-07
PCT/US2009/059653 WO2010042486A1 (en) 2008-10-07 2009-10-06 Feature-based video compression
US201113121904A 2011-03-30 2011-03-30
US13/341,482 US8964835B2 (en) 2005-03-31 2011-12-30 Feature-based video compression
US14/592,898 US20150189318A1 (en) 2005-03-31 2015-01-08 Feature-Based Video Compression

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US13/341,482 Continuation US8964835B2 (en) 2005-03-31 2011-12-30 Feature-based video compression

Publications (1)

Publication Number Publication Date
US20150189318A1 true US20150189318A1 (en) 2015-07-02

Family

ID=41528424

Family Applications (3)

Application Number Title Priority Date Filing Date
US13/121,904 Active 2026-08-29 US8942283B2 (en) 2005-03-31 2009-10-06 Feature-based hybrid video codec comparing compression efficiency of encodings
US13/341,482 Expired - Fee Related US8964835B2 (en) 2005-03-31 2011-12-30 Feature-based video compression
US14/592,898 Abandoned US20150189318A1 (en) 2005-03-31 2015-01-08 Feature-Based Video Compression

Country Status (7)

Country Link
US (3) US8942283B2 (en)
EP (1) EP2345256B1 (en)
JP (1) JP5567021B2 (en)
CN (1) CN102172026B (en)
CA (1) CA2739482C (en)
TW (1) TW201016016A (en)
WO (1) WO2010042486A1 (en)


Family Cites Families (188)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH082107B2 (en) * 1990-03-02 1996-01-10 国際電信電話株式会社 Method and apparatus for moving picture hybrid coding
JP2606523B2 (en) 1992-02-28 1997-05-07 日本ビクター株式会社 Predictive encoding device and decoding device
JPH0738873B2 (en) 1992-07-14 1995-05-01 株式会社日本水処理技研 Method and tool for antibacterial treatment of tile joints
US6018771A (en) 1992-11-25 2000-01-25 Digital Equipment Corporation Dynamic assignment of multicast network addresses
US5592228A (en) 1993-03-04 1997-01-07 Kabushiki Kaisha Toshiba Video encoder using global motion estimation and polygonal patch motion estimation
JPH0795587A (en) * 1993-06-30 1995-04-07 Ricoh Co Ltd Method for detecting moving vector
US5586200A (en) * 1994-01-07 1996-12-17 Panasonic Technologies, Inc. Segmentation based image compression system
JPH07288789A (en) 1994-04-15 1995-10-31 Hitachi Ltd Intelligent encoder and picture communication equipment
US5710590A (en) * 1994-04-15 1998-01-20 Hitachi, Ltd. Image signal encoding and communicating apparatus using means for extracting particular portions of an object image
KR100235343B1 (en) * 1994-12-29 1999-12-15 전주범 Apparatus for calculating motion vector in encoder using segmentation method
JP2739444B2 (en) 1995-03-01 1998-04-15 株式会社エイ・ティ・アール通信システム研究所 Motion generation device using three-dimensional model
JP2727066B2 (en) 1995-03-20 1998-03-11 株式会社エイ・ティ・アール通信システム研究所 Plastic object feature detector
KR0171151B1 (en) * 1995-03-20 1999-03-20 배순훈 Improved apparatus for approximating a control image using curvature calculation technique
AU711488B2 (en) * 1995-09-12 1999-10-14 Koninklijke Philips Electronics N.V. Hybrid waveform and model-based encoding and decoding of image signals
US5959673A (en) * 1995-10-05 1999-09-28 Microsoft Corporation Transform coding of dense motion vector fields for frame and object based video coding applications
US5774591A (en) * 1995-12-15 1998-06-30 Xerox Corporation Apparatus and method for recognizing facial expressions and facial gestures in a sequence of images
US5969755A (en) * 1996-02-05 1999-10-19 Texas Instruments Incorporated Motion based event detection system and method
US6037988A (en) 1996-03-22 2000-03-14 Microsoft Corp Method for generating sprites for object-based coding sytems using masks and rounding average
US6614847B1 (en) * 1996-10-25 2003-09-02 Texas Instruments Incorporated Content-based video compression
US6088484A (en) * 1996-11-08 2000-07-11 Hughes Electronics Corporation Downloading of personalization layers for symbolically compressed objects
US6044168A (en) * 1996-11-25 2000-03-28 Texas Instruments Incorporated Model based faced coding and decoding using feature detection and eigenface coding
US6047088A (en) 1996-12-16 2000-04-04 Sharp Laboratories Of America, Inc. 2D mesh geometry and motion vector compression
US5826165A (en) * 1997-01-21 1998-10-20 Hughes Electronics Corporation Advertisement reconciliation system
KR100309086B1 (en) 1997-02-13 2001-12-17 다니구찌 이찌로오, 기타오카 다카시 Moving image prediction system and method
US5991447A (en) * 1997-03-07 1999-11-23 General Instrument Corporation Prediction and coding of bi-directionally predicted video object planes for interlaced digital video
IL122194A0 (en) 1997-11-13 1998-06-15 Scidel Technologies Ltd Method and apparatus for personalized images inserted into a video stream
US6061400A (en) * 1997-11-20 2000-05-09 Hitachi America Ltd. Methods and apparatus for detecting scene conditions likely to cause prediction errors in reduced resolution video decoders and for using the detected information
US6625316B1 (en) * 1998-06-01 2003-09-23 Canon Kabushiki Kaisha Image processing apparatus and method, and image processing system
JP3413720B2 (en) * 1998-06-26 2003-06-09 ソニー株式会社 Image encoding method and apparatus, and image decoding method and apparatus
US6711278B1 (en) * 1998-09-10 2004-03-23 Microsoft Corporation Tracking semantic objects in vector image sequences
US6256423B1 (en) * 1998-09-18 2001-07-03 Sarnoff Corporation Intra-frame quantizer selection for video compression
US7124065B2 (en) * 1998-10-26 2006-10-17 Speech Technology And Applied Research Corporation Determining a tangent space and filtering data onto a manifold
US6546117B1 (en) * 1999-06-10 2003-04-08 University Of Washington Video object segmentation using active contour modelling with global relaxation
EP1185106A4 (en) * 1999-01-29 2006-07-05 Mitsubishi Electric Corp Method of image feature encoding and method of image search
US6751354B2 (en) * 1999-03-11 2004-06-15 Fuji Xerox Co., Ltd Methods and apparatuses for video segmentation, classification, and retrieval using image class statistical models
US6774917B1 (en) * 1999-03-11 2004-08-10 Fuji Xerox Co., Ltd. Methods and apparatuses for interactive similarity searching, retrieval, and browsing of video
GB9909362D0 (en) 1999-04-23 1999-06-16 Pace Micro Tech Plc Memory database system for encrypted progarmme material
US6307964B1 (en) * 1999-06-04 2001-10-23 Mitsubishi Electric Research Laboratories, Inc. Method for ordering image spaces to represent object shapes
US6870843B1 (en) 1999-06-22 2005-03-22 World Multicast.Com, Inc. Self implementing multicast level escalation
US7352386B1 (en) * 1999-06-22 2008-04-01 Microsoft Corporation Method and apparatus for recovering a three-dimensional scene from two-dimensional images
KR100611999B1 (en) 1999-08-27 2006-08-11 삼성전자주식회사 Motion compensating method in object based quad-tree mesh using greedy algorithm
JP2001100731A (en) 1999-09-28 2001-04-13 Toshiba Corp Object picture display device
US6731813B1 (en) 1999-10-07 2004-05-04 World Multicast.Com, Inc. Self adapting frame intervals
US6792154B1 (en) 1999-10-07 2004-09-14 World Multicast.com, Inc Video compression system and method using time
US7356082B1 (en) 1999-11-29 2008-04-08 Sony Corporation Video/audio signal processing method and video-audio signal processing apparatus
JP3694888B2 (en) * 1999-12-03 2005-09-14 ソニー株式会社 Decoding device and method, encoding device and method, information processing device and method, and recording medium
US6738424B1 (en) * 1999-12-27 2004-05-18 Objectvideo, Inc. Scene model generation from video for use in video processing
US6574353B1 (en) * 2000-02-08 2003-06-03 University Of Washington Video object tracking using a hierarchy of deformable templates
US7054539B2 (en) 2000-02-09 2006-05-30 Canon Kabushiki Kaisha Image processing method and apparatus
WO2001063555A2 (en) * 2000-02-24 2001-08-30 Massachusetts Institute Of Technology Image deconvolution techniques for probe scanning apparatus
JP4443722B2 (en) * 2000-04-25 2010-03-31 富士通株式会社 Image recognition apparatus and method
US6731799B1 (en) * 2000-06-01 2004-05-04 University Of Washington Object segmentation with background extraction and moving boundary techniques
US6795875B2 (en) * 2000-07-31 2004-09-21 Microsoft Corporation Arbitrating and servicing polychronous data requests in direct memory access
US8005145B2 (en) * 2000-08-11 2011-08-23 Nokia Corporation Method and apparatus for transferring video frame in telecommunication system
FR2814312B1 (en) * 2000-09-07 2003-01-24 France Telecom METHOD FOR SEGMENTATION OF A VIDEO IMAGE SURFACE BY ELEMENTARY OBJECTS
GB2367966B (en) * 2000-10-09 2003-01-15 Motorola Inc Method and apparatus for determining regions of interest in images and for image transmission
JP4310916B2 (en) * 2000-11-08 2009-08-12 コニカミノルタホールディングス株式会社 Video display device
JP2002182961A (en) 2000-12-13 2002-06-28 Nec Corp Synchronization system for database and method of the synchronization
US20040135788A1 (en) * 2000-12-22 2004-07-15 Davidson Colin Bruce Image processing system
US20020085633A1 (en) * 2001-01-03 2002-07-04 Kim Hyun Mun Method of performing video encoding rate control
US7061483B2 (en) * 2001-02-08 2006-06-13 California Institute Of Technology Methods for computing barycentric coordinates generalized to irregular n-gons and applications of the same
US6614466B2 (en) 2001-02-22 2003-09-02 Texas Instruments Incorporated Telescopic reconstruction of facial features from a speech pattern
US6625310B2 (en) * 2001-03-23 2003-09-23 Diamondback Vision, Inc. Video segmentation using statistical pixel modeling
US7043058B2 (en) * 2001-04-20 2006-05-09 Avid Technology, Inc. Correcting motion vector maps for image processing
US20020164068A1 (en) * 2001-05-03 2002-11-07 Koninklijke Philips Electronics N.V. Model switching in a communication system
US6909745B1 (en) 2001-06-05 2005-06-21 At&T Corp. Content adaptive video encoder
US6496217B1 (en) 2001-06-12 2002-12-17 Koninklijke Philips Electronics N.V. Video communication system using model-based coding and prioritization techniques
US7003039B2 (en) * 2001-07-18 2006-02-21 Avideh Zakhor Dictionary generation method for video and image compression
US7173925B1 (en) 2001-07-18 2007-02-06 Cisco Technology, Inc. Method and system of control signaling for a wireless access network
US7457359B2 (en) 2001-09-26 2008-11-25 Mabey Danny L Systems, devices and methods for securely distributing highly-compressed multimedia content
GB2382289B (en) * 2001-09-28 2005-07-06 Canon Kk Method and apparatus for generating models of individuals
EP1309181A1 (en) 2001-11-06 2003-05-07 Thomson Licensing S.A. Device, method and system for multimedia content adaption
US7130446B2 (en) * 2001-12-03 2006-10-31 Microsoft Corporation Automatic detection and tracking of multiple individuals using multiple cues
US20030122966A1 (en) * 2001-12-06 2003-07-03 Digeo, Inc. System and method for meta data distribution to customize media content playback
US6842177B2 (en) 2001-12-14 2005-01-11 University Of Washington Macroblock padding
US7673136B2 (en) 2002-02-26 2010-03-02 Stewart Ian A Method for secure multicast repeating on the public Internet
JP2003253190A (en) 2002-03-06 2003-09-10 Kansai Paint Co Ltd Aqueous coating composition for can interior
US6950123B2 (en) * 2002-03-22 2005-09-27 Intel Corporation Method for simultaneous visual tracking of multiple bodies in a closed structured environment
US7136505B2 (en) * 2002-04-10 2006-11-14 National Instruments Corporation Generating a curve matching mapping operator by analyzing objects of interest and background information
US7483487B2 (en) 2002-04-11 2009-01-27 Microsoft Corporation Streaming methods and systems
US7203356B2 (en) * 2002-04-11 2007-04-10 Canesta, Inc. Subject segmentation and tracking using 3D sensing technology for video compression in multimedia applications
KR100491530B1 (en) * 2002-05-03 2005-05-27 엘지전자 주식회사 Method of determining motion vector
US7505604B2 (en) * 2002-05-20 2009-03-17 Simmonds Precision Products, Inc. Method for detection and recognition of fog presence within an aircraft compartment using video images
AU2003237289A1 (en) 2002-05-29 2003-12-19 Pixonics, Inc. Maintaining a plurality of codebooks related to a video signal
US8752197B2 (en) * 2002-06-18 2014-06-10 International Business Machines Corporation Application independent system, method, and architecture for privacy protection, enhancement, control, and accountability in imaging service systems
JP3984191B2 (en) 2002-07-08 2007-10-03 株式会社東芝 Virtual makeup apparatus and method
US7031499B2 (en) * 2002-07-22 2006-04-18 Mitsubishi Electric Research Laboratories, Inc. Object recognition system
US6925122B2 (en) * 2002-07-25 2005-08-02 National Research Council Method for video-based nose location tracking and hands-free computer input devices based thereon
JP2004356747A (en) 2003-05-27 2004-12-16 Kddi Corp Method and apparatus for matching image
EP1387588A2 (en) 2002-08-02 2004-02-04 KDDI Corporation Image matching device and method for motion estimation
US20040028139A1 (en) 2002-08-06 2004-02-12 Andre Zaccarin Video encoding
US20040113933A1 (en) 2002-10-08 2004-06-17 Northrop Grumman Corporation Split and merge behavior analysis and understanding using Hidden Markov Models
TW200407799A (en) 2002-11-05 2004-05-16 Ind Tech Res Inst Texture partition and transmission method for network progressive transmission and real-time rendering by using the wavelet coding algorithm
KR100455294B1 (en) 2002-12-06 2004-11-06 삼성전자주식회사 Method for detecting user and detecting motion, and apparatus for detecting user within security system
WO2004061702A1 (en) * 2002-12-26 2004-07-22 The Trustees Of Columbia University In The City Of New York Ordered data compression system and methods
US7003117B2 (en) * 2003-02-05 2006-02-21 Voltage Security, Inc. Identity-based encryption system for secure data distribution
US7606305B1 (en) * 2003-02-24 2009-10-20 Vixs Systems, Inc. Method and system for transcoding video data
FR2852773A1 (en) 2003-03-20 2004-09-24 France Telecom Video image sequence coding method, involves applying wavelet coding on different images obtained by comparison between moving image and estimated image corresponding to moving image
US7574406B2 (en) 2003-03-31 2009-08-11 Satyam Computer Services Limited Of Mayfair Centre System and method maximizing video license utilization using billboard services
US7184073B2 (en) 2003-04-11 2007-02-27 Satyam Computer Services Limited Of Mayfair Centre System and method for warning drivers based on road curvature
US7424164B2 (en) 2003-04-21 2008-09-09 Hewlett-Packard Development Company, L.P. Processing a detected eye of an image to provide visual enhancement
US7956889B2 (en) * 2003-06-04 2011-06-07 Model Software Corporation Video surveillance system
US7415527B2 (en) 2003-06-13 2008-08-19 Satyam Computer Services Limited Of Mayfair Centre System and method for piecewise streaming of video using a dedicated overlay network
US7603022B2 (en) 2003-07-02 2009-10-13 Macrovision Corporation Networked personal video recording system
JPWO2005006766A1 (en) * 2003-07-09 2007-09-20 日本電気株式会社 Moving picture coding method, moving picture decoding method, moving picture coding apparatus, moving picture decoding apparatus and computer program
US7296030B2 (en) * 2003-07-17 2007-11-13 At&T Corp. Method and apparatus for windowing in entropy encoding
US7383180B2 (en) * 2003-07-18 2008-06-03 Microsoft Corporation Constant bitrate media encoding techniques
KR20050040712A (en) 2003-10-28 2005-05-03 삼성전자주식회사 2-dimensional graphic decoder including graphic display accelerating function based on commands, graphic display accelerating method therefor and reproduction apparatus
WO2005055602A1 (en) 2003-12-04 2005-06-16 Telefonaktiebolaget Lm Ericsson (Publ) Video application node
GB2409029A (en) 2003-12-11 2005-06-15 Sony Uk Ltd Face detection
US7535515B2 (en) 2003-12-23 2009-05-19 Ravi Ananthapur Bacche Motion detection in video signals
US8175412B2 (en) * 2004-02-17 2012-05-08 Yeda Research & Development Co. Ltd. Method and apparatus for matching portions of input images
US7447331B2 (en) * 2004-02-24 2008-11-04 International Business Machines Corporation System and method for generating a viewable video index for low bandwidth applications
JP2005244585A (en) 2004-02-26 2005-09-08 Alps Electric Co Ltd Isolator
WO2006002299A2 (en) * 2004-06-22 2006-01-05 Sarnoff Corporation Method and apparatus for recognizing 3-d objects
US8902971B2 (en) 2004-07-30 2014-12-02 Euclid Discoveries, Llc Video compression repository and model reuse
US7436981B2 (en) * 2005-01-28 2008-10-14 Euclid Discoveries, Llc Apparatus and method for processing video data
US7508990B2 (en) 2004-07-30 2009-03-24 Euclid Discoveries, Llc Apparatus and method for processing video data
US9578345B2 (en) 2005-03-31 2017-02-21 Euclid Discoveries, Llc Model-based video encoding and decoding
US9743078B2 (en) 2004-07-30 2017-08-22 Euclid Discoveries, Llc Standards-compliant model-based video encoding and decoding
CA2575211C (en) 2004-07-30 2012-12-11 Euclid Discoveries, Llc Apparatus and method for processing video data
US7457472B2 (en) 2005-03-31 2008-11-25 Euclid Discoveries, Llc Apparatus and method for processing video data
WO2008091483A2 (en) * 2007-01-23 2008-07-31 Euclid Discoveries, Llc Computer method and apparatus for processing image data
US7457435B2 (en) 2004-11-17 2008-11-25 Euclid Discoveries, Llc Apparatus and method for processing video data
US8724891B2 (en) 2004-08-31 2014-05-13 Ramot At Tel-Aviv University Ltd. Apparatus and methods for the detection of abnormal motion in a video stream
EP1800238A4 (en) * 2004-09-21 2012-01-25 Euclid Discoveries Llc Apparatus and method for processing video data
WO2006055512A2 (en) 2004-11-17 2006-05-26 Euclid Discoveries, Llc Apparatus and method for processing video data
US20060120571A1 (en) 2004-12-03 2006-06-08 Tu Peter H System and method for passive face recognition
WO2007044044A2 (en) * 2004-12-21 2007-04-19 Sarnoff Corporation Method and apparatus for tracking objects over a wide area using a network of stereo sensors
US7715597B2 (en) * 2004-12-29 2010-05-11 Fotonation Ireland Limited Method and component for image recognition
JP2008529414A (en) 2005-01-28 2008-07-31 ユークリッド・ディスカバリーズ・エルエルシー Apparatus and method for processing video data
CN101167363B (en) * 2005-03-31 2010-07-07 欧几里得发现有限责任公司 Method for processing video data
US20060274949A1 (en) 2005-06-02 2006-12-07 Eastman Kodak Company Using photographer identity to classify images
CN101223786A (en) 2005-07-13 2008-07-16 皇家飞利浦电子股份有限公司 Processing method and device with video temporal up-conversion
US7672306B2 (en) 2005-07-18 2010-03-02 Stewart Ian A Method for secure reliable point to multi-point bi-directional communications
US8867618B2 (en) * 2005-07-22 2014-10-21 Thomson Licensing Method and apparatus for weighted prediction for scalable video coding
US7689021B2 (en) * 2005-08-30 2010-03-30 University Of Maryland, Baltimore Segmentation of regions in measurements of a body based on a deformable model
US20080232477A1 (en) * 2005-09-01 2008-09-25 Koninklijke Philips Electronics, N.V. Method and Device For Coding and Decoding of Video Error Resilience
CA2622744C (en) 2005-09-16 2014-09-16 Flixor, Inc. Personalizing a video
US9258519B2 (en) 2005-09-27 2016-02-09 Qualcomm Incorporated Encoder assisted frame rate up conversion using various motion models
JP4654864B2 (en) 2005-09-30 2011-03-23 パナソニック株式会社 Method for manufacturing plasma display panel
US8019170B2 (en) * 2005-10-05 2011-09-13 Qualcomm, Incorporated Video frame motion-based automatic region-of-interest detection
US20070153025A1 (en) * 2005-12-29 2007-07-05 Mitchell Owen R Method, apparatus, and system for encoding and decoding a signal on a viewable portion of a video
US8150155B2 (en) * 2006-02-07 2012-04-03 Qualcomm Incorporated Multi-mode region-of-interest video object segmentation
US7630522B2 (en) 2006-03-08 2009-12-08 Microsoft Corporation Biometric measurement using interactive display systems
US20070268964A1 (en) 2006-05-22 2007-11-22 Microsoft Corporation Unit co-location-based motion estimation
JP2009540675A (en) * 2006-06-08 2009-11-19 ユークリッド・ディスカバリーズ・エルエルシー Apparatus and method for processing video data
US20080027917A1 (en) * 2006-07-31 2008-01-31 Siemens Corporate Research, Inc. Scalable Semantic Image Search
CN101513069B (en) * 2006-09-30 2011-04-13 汤姆逊许可公司 Method and equipment for encoding and decoding video color enhancement layer
EP2090110A2 (en) * 2006-10-13 2009-08-19 Thomson Licensing Reference picture list management syntax for multiple view video coding
EP2098078A1 (en) 2006-12-11 2009-09-09 THOMSON Licensing Method of encoding an image and device implementing said method
EP2105029A2 (en) * 2006-12-15 2009-09-30 Thomson Licensing Distortion estimation
US8804829B2 (en) 2006-12-20 2014-08-12 Microsoft Corporation Offline motion description for video generation
EP2106663A2 (en) * 2007-01-23 2009-10-07 Euclid Discoveries, LLC Object archival systems and methods
US8243118B2 (en) * 2007-01-23 2012-08-14 Euclid Discoveries, Llc Systems and methods for providing personal video services
KR101366242B1 (en) * 2007-03-29 2014-02-20 삼성전자주식회사 Method for encoding and decoding motion model parameter, and method and apparatus for video encoding and decoding using motion model parameter
EP2191402A4 (en) 2007-08-20 2014-05-21 Nokia Corp Segmented metadata and indexes for streamed multimedia data
US8036464B2 (en) 2007-09-07 2011-10-11 Satyam Computer Services Limited System and method for automatic segmentation of ASR transcripts
US8065293B2 (en) 2007-10-24 2011-11-22 Microsoft Corporation Self-compacting pattern indexer: storing, indexing and accessing information in a graph-like data structure
US8091109B2 (en) 2007-12-18 2012-01-03 At&T Intellectual Property I, Lp Set-top box-based TV streaming and redirecting
US20100322300A1 (en) 2008-03-18 2010-12-23 Zhen Li Method and apparatus for adaptive feature of interest color model parameters estimation
JP5429445B2 (en) 2008-04-08 2014-02-26 富士フイルム株式会社 Image processing system, image processing method, and program
JP2009284298A (en) * 2008-05-23 2009-12-03 Hitachi Ltd Moving image encoding apparatus, moving image decoding apparatus, moving image encoding method and moving image decoding method
US8140550B2 (en) 2008-08-20 2012-03-20 Satyam Computer Services Limited Of Mayfair Centre System and method for bounded analysis of multimedia using multiple correlations
US8065302B2 (en) 2008-08-27 2011-11-22 Satyam Computer Services Limited System and method for annotation aggregation
US8259794B2 (en) * 2008-08-27 2012-09-04 Alexander Bronstein Method and system for encoding order and frame type selection optimization
US8086692B2 (en) 2008-08-27 2011-12-27 Satyam Computer Services Limited System and method for efficient delivery in a multi-source, multi destination network
US8090670B2 (en) 2008-09-19 2012-01-03 Satyam Computer Services Limited System and method for remote usage modeling
US8392942B2 (en) 2008-10-02 2013-03-05 Sony Corporation Multi-coded content substitution
EP2345256B1 (en) * 2008-10-07 2018-03-14 Euclid Discoveries, LLC Feature-based video compression
BRPI0923200A2 (en) 2008-12-01 2016-01-26 Nortel Networks Ltd method and apparatus for providing a video representation of a three-dimensional computer generated virtual environment
US8386318B2 (en) 2008-12-30 2013-02-26 Satyam Computer Services Ltd. System and method for supporting peer interactions
EP2216750A1 (en) 2009-02-06 2010-08-11 Thomson Licensing Method and apparatus for encoding 3D mesh models, and method and apparatus for decoding encoded 3D mesh models
WO2010118254A1 (en) 2009-04-08 2010-10-14 Watchitoo, Inc. System and method for image compression
US20100316131A1 (en) 2009-06-12 2010-12-16 Motorola, Inc. Macroblock level no-reference objective quality estimation of video
TWI442777B (en) 2009-06-23 2014-06-21 Acer Inc Method for spatial error concealment
US8068677B2 (en) 2009-08-25 2011-11-29 Satyam Computer Services Limited System and method for hierarchical image processing
US8848802B2 (en) * 2009-09-04 2014-09-30 Stmicroelectronics International N.V. System and method for object based parametric video coding
US20110087703A1 (en) 2009-10-09 2011-04-14 Satyam Computer Services Limited Of Mayfair Center System and method for deep annotation and semantic indexing of videos
WO2011061709A1 (en) 2009-11-19 2011-05-26 Nokia Corporation Method and apparatus for tracking and recognition with rotation invariant feature descriptors
US8290038B1 (en) 2009-11-30 2012-10-16 Google Inc. Video coding complexity estimation
US9313465B2 (en) 2010-06-07 2016-04-12 Thomson Licensing Learned transform and compressive sensing for video coding
US8577179B2 (en) 2010-08-19 2013-11-05 Stmicroelectronics International N.V. Image processing arrangement illuminating regions of an image based on motion
WO2012033970A1 (en) 2010-09-10 2012-03-15 Thomson Licensing Encoding of a picture in a video sequence by example - based data pruning using intra- frame patch similarity
US8661076B2 (en) 2010-09-23 2014-02-25 Salesforce.Com, Inc. Business networking information feed alerts
US8531535B2 (en) 2010-10-28 2013-09-10 Google Inc. Methods and systems for processing a video for stabilization and retargeting
US8804815B2 (en) 2011-07-29 2014-08-12 Dialogic (Us) Inc. Support vector regression based video quality prediction
US20130035979A1 (en) 2011-08-01 2013-02-07 Arbitron, Inc. Cross-platform audience measurement with privacy protection

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9743078B2 (en) 2004-07-30 2017-08-22 Euclid Discoveries, Llc Standards-compliant model-based video encoding and decoding
US9532069B2 (en) 2004-07-30 2016-12-27 Euclid Discoveries, Llc Video compression repository and model reuse
US9578345B2 (en) 2005-03-31 2017-02-21 Euclid Discoveries, Llc Model-based video encoding and decoding
US10091507B2 (en) 2014-03-10 2018-10-02 Euclid Discoveries, Llc Perceptual optimization for model-based video encoding
US9621917B2 (en) 2014-03-10 2017-04-11 Euclid Discoveries, Llc Continuous block tracking for temporal prediction in video encoding
US10097851B2 (en) 2014-03-10 2018-10-09 Euclid Discoveries, Llc Perceptual optimization for model-based video encoding
US20180053057A1 (en) * 2016-08-18 2018-02-22 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
US9946933B2 (en) * 2016-08-18 2018-04-17 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
US20180189143A1 (en) * 2017-01-03 2018-07-05 International Business Machines Corporation Simultaneous compression of multiple stored videos
KR20180129339A (en) * 2017-05-26 2018-12-05 라인 가부시키가이샤 Method for image compression and method for image restoration
KR102256110B1 (en) 2017-05-26 2021-05-26 라인 가부시키가이샤 Method for image compression and method for image restoration
KR20210060411A (en) * 2017-05-26 2021-05-26 라인 가부시키가이샤 Method for image compression and method for image restoration
KR102343325B1 (en) 2017-05-26 2021-12-27 라인 가부시키가이샤 Method for image compression and method for image restoration
CN110113607A (en) * 2019-04-25 2019-08-09 长沙理工大学 A kind of compressed sensing video method for reconstructing based on part and non-local constraint
WO2021201642A1 (en) * 2020-04-03 2021-10-07 엘지전자 주식회사 Video transmission method, video transmission device, video reception method, and video reception device
US11227396B1 (en) * 2020-07-16 2022-01-18 Meta Platforms, Inc. Camera parameter control using face vectors for portal
US20230334674A1 (en) * 2020-07-16 2023-10-19 Meta Platforms, Inc. Camera parameter control using face vectors for portal
US20230162372A1 (en) * 2021-11-24 2023-05-25 Microsoft Technology Licensing, Llc Feature Prediction for Efficient Video Processing

Also Published As

Publication number Publication date
JP5567021B2 (en) 2014-08-06
WO2010042486A1 (en) 2010-04-15
CN102172026B (en) 2015-09-09
JP2012505600A (en) 2012-03-01
EP2345256A1 (en) 2011-07-20
US20110182352A1 (en) 2011-07-28
US8964835B2 (en) 2015-02-24
CN102172026A (en) 2011-08-31
US20120155536A1 (en) 2012-06-21
CA2739482C (en) 2017-03-14
TW201016016A (en) 2010-04-16
CA2739482A1 (en) 2010-04-15
EP2345256B1 (en) 2018-03-14
US8942283B2 (en) 2015-01-27

Similar Documents

Publication Publication Date Title
US8964835B2 (en) Feature-based video compression
US7508990B2 (en) Apparatus and method for processing video data
CA2676219C (en) Computer method and apparatus for processing image data
JP4928451B2 (en) Apparatus and method for processing video data
US9743078B2 (en) Standards-compliant model-based video encoding and decoding
US7457472B2 (en) Apparatus and method for processing video data
JP4573895B2 (en) Apparatus and method for processing video data
CA2654513A1 (en) Apparatus and method for processing video data
US20130107948A1 (en) Context Based Video Encoding and Decoding
JP2008514136A (en) Apparatus and method for processing video data
WO2013148002A2 (en) Context based video encoding and decoding
US20130114703A1 (en) Context Based Video Encoding and Decoding
JP2008529414A (en) Apparatus and method for processing video data
JP2008521347A (en) Apparatus and method for processing video data
WO2023069337A1 (en) Systems and methods for optimizing a loss function for video coding for machines
Georgiadis Scene representations for video compression

Legal Events

Date Code Title Description
AS Assignment

Owner name: EUCLID DISCOVERIES, LLC, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:PACE, CHARLES P.;REEL/FRAME:035195/0915

Effective date: 20100201

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION