WO2000018128A1 - System and method for semantic video object segmentation - Google Patents

System and method for semantic video object segmentation

Info

Publication number
WO2000018128A1
Authority
WO
WIPO (PCT)
Prior art keywords
regions
video
segmenting
region
image data
Prior art date
Application number
PCT/US1999/022264
Other languages
French (fr)
Inventor
Shih-Fu Chang
Di Zhong
Original Assignee
The Trustees Of Columbia University In The City Of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Trustees Of Columbia University In The City Of New York filed Critical The Trustees Of Columbia University In The City Of New York
Priority to AU62654/99A priority Critical patent/AU6265499A/en
Priority to JP2000571664A priority patent/JP2002525988A/en
Publication of WO2000018128A1 publication Critical patent/WO2000018128A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 - Image coding
    • G06T 9/20 - Contour coding, e.g. using detection of edges
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 - Image coding
    • G06T 9/001 - Model-based coding, e.g. wire frame
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/20 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 - Motion estimation or motion compensation
    • H04N 19/537 - Motion estimation other than block-based

Definitions

  • the color merging process can be iterated until color distances between every two connected regions are above the color threshold. As edge pixels are not labeled, two regions separated by edge pixels are not connected as neighbors. Thus, the growth of each region is naturally stopped at its edge pixels. After the color merging process is complete, the edge pixels can simply be assigned to their neighboring regions with the smallest color distances.
  • a motion-based segmentation using the dense optical flow can be applied to the segmented color regions to check the uniformity of the motion distribution.
  • a similar iterative spatially constrained clustering process can also be used to group pixels inside a color region according to their motion vectors and the given motion threshold. Because an objective is to track the object, this process need only be applied to those regions that intersect with the initial object boundary.
  • the region aggregation module 220 receives homogeneous regions from the segmentation module 214 and the initial object boundary from the snake module 204 or user input 220 directly.
  • Region aggregation at the starting frame is relatively simple compared with that for the subsequent frames, as all regions are newly generated (not tracked) and the initial outline is usually not far from the real object boundary.
  • a region is classified as foreground if more than a certain percentage (e.g., 90%) of the region is included by the initial object outline. On the other hand, if less than a certain percentage (e.g., 30%) of a region is covered by the initial object outline, the region is considered as background. Regions between the low and high thresholds are split into foreground and background regions according to the intersection with the initial object outline.
  • affine estimation 222 is performed.
  • Affine motion parameters of all regions, including both foreground and background, can be estimated by a multivariate linear regression process over the dense optical flow inside each region.
  • a two-dimensional affine model with six parameters, as set forth in Equation 3 (not reproduced in this text), can be used as a good first-order approximation of a distant object undergoing three-dimensional translation and linear deformation; a sketch of such a model appears after this list.
  • An inter-frame segmentation process can be employed to segment a frame into homogeneous regions. Unlike in the starting frame where all regions are tagged as new, these regions are classified in the inter-frame segmentation process as either foreground, background or new, according to their relationship with the existing foreground and background regions of the previous frame.
  • the estimated, or projected, object boundary can then be used to group the tagged regions into the object and background. For regions around the object boundary with the tag "new," their visual features are also examined to determine whether they belong to the object or not.
  • In a motion projection module 402, segmented regions from the previous frame, including both foreground and background regions, are projected onto the current frame (virtually) using their individual affine motion models. Projected regions keep their labels and original classifications. For video shots with a static or homogeneous background (i.e., only one moving object), users can choose not to project background regions to save time and processing overhead. An expanded bounding box of all projected foreground regions can be computed. Similarly, the following segmentation and aggregation processes are generally applied only to the area inside the bounding box.
  • the edge map 404, color map 406 and motion field 408 are generally created using the same methods as described above.
  • the existing color palette computed at the starting frame is directly used to quantize the current frame.
  • In this way, the color consistency of segmented regions between successive frames is enhanced, which improves the performance of region-based tracking.
  • As object tracking is limited to single video shots in which there is no abrupt scene change, using one color palette is generally valid.
  • a new quantization palette can be generated automatically, such as when a large quantization error is encountered.
  • three feature maps are preferably fused to track existing regions and segment the current frame.
  • First, an inter-frame labeling process is performed in which non-edge pixels are sequentially labeled, such as by labeling the pixels one by one from left to right and top to bottom (a sketch of this labeling loop appears after this list).
  • If a pixel has the same color as that of its labeled neighboring pixels, it is assigned the label of its neighbors.
  • Otherwise, its color is compared with all projected regions that cover its coordinate at the current frame. If the pixel's color distance to the closest region is under the given color threshold, this pixel is "tracked," and assigned the label and classification (foreground or background) of the closest projected region. Otherwise, a new label is generated and assigned to the pixel. Regions identified by this new label are classified as "new" regions.
  • interframe color merge, edge labeling, motion split and small region elimination processes which take place in the tracking module 410 are similar to those previously described, with some additional constraints. Foreground or background regions tracked from the previous frame are allowed to be merged only with regions of the same class or new regions. Merging of the foreground and background regions is forbidden. New regions can be merged with each other or merged with foreground/background regions. When a new region is merged with a tracked region, the merging result inherits the label and classification from the tracked region. In motion segmentation, split regions remain in their original classes. After the inter-frame tracking process 410, a list of regions temporarily tagged as either foreground, background, or new is obtained.
  • the temporarily tagged regions are then passed to an iterative region aggregation module 412.
  • the region aggregation module 412 receives two inputs: the homogeneous regions from tracking module 410 and the estimated object boundary from the motion projection module 402.
  • the object boundary can be estimated from projected foreground regions. Foreground regions from the previous frame are projected independently of one another and the combination of projected regions forms the mask of the estimated object.
  • the mask can be refined using a morphological closing operation (i.e. dilation followed by erosion) with a size of several pixels in order to close tiny holes and smooth boundaries. To tolerate motion estimation error, the result can be further dilated for the tracking buffer size, which is specified by users at the beginning of the tracking process.
  • the region aggregation module 412 implements a region grouping and boundary alignment algorithm based on the estimated object boundary as well as the edge and motion features of the region.
  • An exemplary algorithm includes the following steps (a code sketch of these steps appears after this list):
    1. For every segmented region, do step 2 to step 8.
    2. If the region is tagged as background, keep it as background. Go to step 8.
    3. Compute an intersection ratio of the region with the object mask. Go to step 5 if the region is tagged as new.
    4. If the region (foreground) is covered by the object mask by more than a given threshold, such as about 80%, the region belongs to the semantic object. Otherwise, the region is intersected with the object mask and split: a) split regions inside the object mask are still kept as foreground; b) split regions outside the object mask are tagged as new. Go to step 8.
    5. If the region (new) is covered by the object mask by less than a threshold, such as about 30%, keep it as new. Go to step 8.
    6. If the region is covered by the object mask by more than a threshold, such as about 80%, classify it as foreground. Go to step 8.
    7. Compute numbers of edge pixels (using the edge map) between this region and the current background and foreground regions. Compute differences between the mean motion vector of this region and those of its neighboring regions and find the closest motion neighbor. If the region is separated from background regions by more edge pixels than from foreground regions (or this region is not connected to any background regions) and its closest motion neighbor is a foreground region, intersect it with the object mask and split: a) split regions inside the object mask are classified as foreground; b) split regions outside the object mask are tagged as new. Otherwise, keep the region as new.
    8. Advance to the next region. Go to step 2.
  • In the above steps, the object mask refers to the estimated object projection. A relatively lower ratio (e.g., 80%) can be used to include a foreground or new region; this is to improve handling of motion projection errors.
  • the above aggregation and boundary alignment process can be iterated multiple times. This is useful in correcting errors which can result from the estimation of rapid motion. At the end of the last iteration, all remaining new regions are classified as background regions.
  • affine models of all regions can be estimated by a linear regression process over the optical flow. As described before, these affine models are used to project regions into a future frame in the motion projection module 112.
  • FIG. 8 is a block diagram which illustrates the essential operational components of a system for practicing the image segmentation and tracking methods described herein.
  • the system can take the form of a personal computer with an appropriate high-speed processor and video handling peripherals, a suitable digital video editing workstation, or the like.
  • Such a system will generally include a processor 802 which is coupled to a video source 804.
  • the processor 802 generally includes a microprocessor, memory, control logic, and input/output circuits operatively interconnected to perform the described processing operations.
  • the video source 804 can take the form of magnetic data storage media, optical data storage media, streaming video content from a remote source and the like.
  • the input device is used to provide user input for, inter alia, establishing object boundaries and providing threshold assignments for object definition.
  • a display device 808 is also coupled to the processor 802 to provide a visual display of the digital video.
  • the present invention has shown very good segmentation and tracking results on general video sources.
  • As illustrated in Figures 6A-E, five different types of video sequences were used for subjective and objective evaluation of the present system.
  • the first sequence (akiyo), illustrated in Figure 6A, contains an anchor person with small motion and strong color difference with background.
  • the second sequence (foreman) illustrated in Figure 6B contains an object with relatively large motion and very similar brightness to background.
  • In Figure 6C, a skater having many different fast-moving parts (i.e., body, arms and legs) is depicted.
  • the characteristic of the fourth sequence shown in Figure 6D is that the shape of the flying bird changes abruptly from frame to frame.
  • In Figure 6E, a plane is flying away from the camera.
  • FIGS. 6A-E show object tracking results of the five test sequences after 3 user inputs.
  • segmented objects were superimposed onto gray background with random noise and played in real time for users to see whether there were observable errors.
  • To reduce boundary jitter (i.e., high-frequency noise), a temporal median filtering process was applied to the binary object masks before they were put onto the background with random noise.
  • the following user inputs bring only a small improvement because only a few frames need to be modified.
  • the bird sequence of Figure 6D contains a small and fast moving object.
  • the abrupt shape change between successive frames causes part of the wings to be missed in some frames.
  • the new background region is falsely classified as foreground.
  • the above two types of errors can be corrected by one user input at the frame where the error starts to happen. Restarting the tracking process after the correction will correct errors in the successive frames. As shown in Figures 7A and 7B, for the foreman and skater sequence, the numbers of missing and false pixels drop rapidly after the first 2 user inputs.
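The affine motion model referred to above (the patent's Equation 3, not reproduced in this text) is commonly parameterised as vx = a1 + a2·x + a3·y and vy = a4 + a5·x + a6·y. The Python sketch below is a hedged illustration of how such a model could be fitted by linear regression over the dense optical flow inside a region and then used to project that region into the next frame; the function names, parameter ordering and least-squares formulation are assumptions for illustration, not the patent's exact procedure.

```python
import numpy as np

def estimate_affine(points, flow):
    """Least-squares fit of a 6-parameter affine motion model to dense flow.

    points -- N x 2 array of (x, y) pixel coordinates inside one region
    flow   -- N x 2 array of motion vectors (vx, vy) at those pixels
    Assumed model: vx = a1 + a2*x + a3*y,  vy = a4 + a5*x + a6*y
    """
    ones = np.ones((points.shape[0], 1))
    A = np.hstack([ones, points])                  # N x 3 design matrix [1, x, y]
    ax, *_ = np.linalg.lstsq(A, flow[:, 0], rcond=None)
    ay, *_ = np.linalg.lstsq(A, flow[:, 1], rcond=None)
    return np.concatenate([ax, ay])                # (a1, a2, a3, a4, a5, a6)

def project_region(points, params):
    """Project region pixels into the next frame with the fitted affine model."""
    a1, a2, a3, a4, a5, a6 = params
    x, y = points[:, 0], points[:, 1]
    return np.stack([x + a1 + a2 * x + a3 * y,
                     y + a4 + a5 * x + a6 * y], axis=1)
```

In the tracking stage, one such model would be fitted per region and its pixels projected forward to seed the inter-frame segmentation.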
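The inter-frame pixel labeling loop described above (Fig. 5) can be sketched as follows. This is a simplified, assumed rendering: the data layout, the single left/top neighbour scan and the function name are illustrative, not the patented implementation.

```python
import numpy as np

def interframe_label(luv, edge_mask, projected, color_threshold, distance):
    """Sequential pixel labeling of the current frame, seeded by projected regions.

    luv       -- H x W x 3 quantized L*u*v* color map of the current frame
    edge_mask -- H x W array, 1 for edge pixels (left unlabeled here)
    projected -- list of (label, classification, mean_color, mask) tuples for regions
                 projected from the previous frame; "mask" covers the projected area
    distance  -- color-distance function, e.g. the weighted metric of equation (1)
    Returns (labels, classes); -1 in labels marks unlabeled edge pixels.
    """
    h, w = edge_mask.shape
    labels = np.full((h, w), -1, np.int32)
    classes = {lab: cls for lab, cls, _, _ in projected}
    next_new = max(classes, default=0) + 1
    for y in range(h):                                   # top to bottom
        for x in range(w):                               # left to right
            if edge_mask[y, x]:
                continue
            # Same color as an already-labeled neighbour: inherit its label.
            for ny, nx in ((y, x - 1), (y - 1, x)):
                if ny >= 0 and nx >= 0 and labels[ny, nx] != -1 \
                        and (luv[ny, nx] == luv[y, x]).all():
                    labels[y, x] = labels[ny, nx]
                    break
            if labels[y, x] != -1:
                continue
            # Otherwise try to "track" the pixel from the closest projected region covering it.
            best = None
            for lab, cls, mean_color, mask in projected:
                if mask[y, x]:
                    d = distance(luv[y, x], mean_color)
                    if best is None or d < best[0]:
                        best = (d, lab)
            if best is not None and best[0] < color_threshold:
                labels[y, x] = best[1]                   # tracked: keep label and class
            else:
                labels[y, x] = next_new                  # untracked pixel: new region
                classes[next_new] = "new"
                next_new += 1
    return labels, classes
```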
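The grouping and boundary-alignment steps listed above reduce, per region, to a small decision procedure. The sketch below captures only that decision skeleton under assumed data structures (coverage ratios, edge-contact counts and motion-neighbour classes are presumed to be precomputed); actually splitting a region along the object mask is left as a placeholder, and the thresholds are the example values quoted in the steps.

```python
def aggregate_tracked_regions(regions, mask_ratio, edge_contacts, closest_motion_class,
                              hi=0.80, lo=0.30):
    """Skeleton of the per-region grouping / boundary-alignment decision.

    regions              -- dict: label -> tag, one of "foreground", "background", "new"
    mask_ratio           -- dict: label -> fraction of the region covered by the object mask
    edge_contacts        -- dict: label -> (edge pixels shared with background regions,
                                            edge pixels shared with foreground regions)
    closest_motion_class -- dict: label -> class of the region's closest motion neighbour
    Returns dict: label -> "foreground", "background", "new" or "split" (split along mask).
    """
    result = {}
    for label, tag in regions.items():
        ratio = mask_ratio[label]
        if tag == "background":                                     # step 2
            result[label] = "background"
        elif tag == "foreground":                                   # step 4
            result[label] = "foreground" if ratio > hi else "split"
        elif ratio < lo:                                            # step 5 (new region)
            result[label] = "new"
        elif ratio > hi:                                            # step 6
            result[label] = "foreground"
        else:                                                       # step 7
            bg_edges, fg_edges = edge_contacts[label]
            if bg_edges >= fg_edges and closest_motion_class[label] == "foreground":
                result[label] = "split"   # intersect with mask; inside part -> foreground
            else:
                result[label] = "new"
    return result
```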

Abstract

The segmenting and actively tracking of semantic video objects from a sequence of frames of digital video information involves receiving an initial frame of video image data and at least one succeeding frame of video image data from a sequence of frames of digital video information. From the first frame, at least one object boundary is defined by a user. The initial frame of video image data is then partitioned into a plurality of regions based on selected substantially homogenous video features. Tracking is then performed on the object boundary and plurality of regions which are projected to at least one succeeding frame of the video image data.

Description

SYSTEM AND METHOD FOR SEMANTIC VIDEO OBJECT SEGMENTATION
SPECIFICATION
This application claims priority to United States Provisional Application Serial No. 60/101,675, entitled An Active System and Algorithm for Semantic Video Object Segmentation, filed on September 24, 1998.
FIELD OF THE INVENTION The present invention relates generally to digital image processing and more particularly relates to systems and methods for performing object segmentation and image tracking in digital video image sequences.
BACKGROUND OF THE INVENTION Object segmentation and tracking is a fundamental step for many digital video applications, such as object-based coding used in MPEG-4; content-based video indexing, for example as described in S. F. Chang, et al., "VideoQ: An Automated Content-Based Video Search System Using Visual Cues", ACM 5th Multimedia Conference, Seattle, WA, Nov. 1997; video editing and authoring. There have been several attempts in this field at decomposing images into regions with uniform features. See, for example, M.P. Dubuisson et al., "Contour Extraction of Moving Objects in Complex Outdoor Scenes," 14 Int'l J. of Computer Vision 83-105 (1995); Nikhil R. Pal et al., "A review on Image Segmentation
Techniques," 26 Pattern Recognition 1277-1294 (1993); Mark Tabb et al., "Multiscale Image Segmentation by Integrated Edge and Region Detection," 6 IEEE Trans, on Image Proc. (1997); Alain Tremeau et al., "A Region Growing and Merging Algorithm to Color Segmentation," 30 Pattern Recognition 1191-1203 (1997); and E. Saber et al., "Fusion of Color and Edge Information for Improved Segmentation and Edge Linking," IEEE ICASSP '96. However, no robust techniques for tracking video objects through video sequences are presently available. Known image segmentation techniques include thresholding methods, such as described in the Pal et al. article; region growing methods, for example as described in the Dubuisson et al., Pal et al., and Tremeau et al. articles; split-and-merge techniques as described in "Picture Segmentation by a directed split-and-merge procedure," by S. Horowitz et al., Proc. 2nd Intl. Joint Conf. on Pattern Recognition, pp. 424-433 (1974); watershed techniques, for example as described by Luc Vincent et al., in "Watersheds in Digital Spaces: An efficient Algorithm Based on Immersion Simulations", 13 IEEE Trans. Pattern Analysis & Machine Intelligence (1991); and certain stochastic models, for example as described by Geman, "Stochastic relaxation, Gibbs distributions and the Bayesian Restoration of Images", PAMI-6 IEEE Trans. Pattern Analysis and Machine Intelligence 721-741 (1984). These various approaches extract homogeneous image regions based on visual features such as color, edge or their combinations.
Although there is no unified method that is good for all kinds of images, generally, image segmentation techniques are context independent and can be adapted to various applications. On the contrary, video object tracking techniques are usually directed to meaningful objects (e.g. walking people, moving cars) and are domain specific. As uniform motion often indicates a semantic object in many applications, motion is widely used as the main feature in object segmentation and tracking techniques.
Several approaches have been developed to segment moving objects using either a motion field or optical flow. As the motion field tends to be very noisy in real-world scenes, direct segmentation of optical flow can be erroneous and unstable. Model based motion estimation and segmentation methods, such as the technique described in J.R. Bergen et al., "Hierarchical model-based motion estimation," ECCV-92, Santa Margherita Ligure, pp. 237-252 (1992) may be more stable. In the affine-clustering based algorithm presented in John Y.A. Wang et al., "Representing Moving Images with Layers", 3 IEEE Trans. on Image Processing (1994), motion layers are extracted from the original motion field by iteratively estimating and refining affine models. A method which avoids use of optical flow by estimating motion models and their layer support simultaneously is proposed in A. Ayer et al., "Layered Representation of Motion Video Using Robust Maximum-Likelihood Estimation of Mixture Models and MDL Encoding," Proc. Fifth Int'l Conf. Computer Vision (1995). In this article, the number of layers is decided by a formulation based on a maximum likelihood estimation of mixture models and the minimum description length encoding principle. In Francois G. Meyer et al., "Region Based Tracking using Affine Motion Models in Long Image Sequences," 60 CVGIP: Image Understanding 119-140 (1994), a pursuit algorithm is proposed to track an object based on the multi-resolution estimation of an affine model from the motion field within the object. In general, the above methods concentrate on segmenting moving objects and cannot readily be applied to track static objects or objects with intermittent motion, such as people crossing the street. Furthermore, due to the limited accuracy of motion estimation, motion segmentation generally does not provide clear object boundaries. In the context of machine vision, some research has focused on tracking feature points or contour segments. For example, the methods described in R. Deriche et al., "Tracking line segments", 8 Image & Vision Computing 261-270 (1990) and V.S. Hwang, "Tracking feature points in time-varying images using an opportunistic selection approach", 22 Pattern Recognition 247-256 (1989) generate good results when moving objects have strong and stable localized features such as corners and edges. However, these techniques are highly sensitive to object deformation, object occlusion and image noise.
For the tracking of non-rigid objects, deformable models have been studied. For example, M. Kass, et al., "Snakes: Active contour models," Int'l J. of Computer Vision, pp 321-331 (1988) describes the use of active contour (snakes) as a basic energy-minimizing elastic contour models. Given the energy function, an elastic contour (i.e., object) at the current frame can be moved and deformed iteratively to the best position at the next frame using the snake algorithm. The use of snakes generally requires very accurate initialization, can handle only slow motion, and can be sensitive to textured image regions. Recently, in light of the demand for object tracking in general videos and the requirement of providing more accurate segmentation boundaries, region-based tracking algorithms, which combine common image segmentation techniques with motion estimation methods, have been explored. For example, techniques described in the Dubuisson et al. article, Chuang Gu et al., "Semantic Video Object Segmentation and Tracking Using Mathematical Morphology and Perspective Motion Model," ICIP'97, and in Kui Zhang et al., "Motion Based Image Segmentation for Video Coding," Proc. Int'l Conf. on Image Processing, Washington (ICIP'95) pp.476-479, exploit more region features than feature points, segments or boundary based tracking, and are more robust in handling real-world problems such as occlusion. In the Dubuisson et al. article, an approach to combine motion segmentation using image subtraction with static color segmentation using the split-and-merge paradigm is presented. Color segmentation regions are examined against motion masks to form the final object mask based on certain confidence measures. The Zhang et al. article proposes an algorithm to match edge detection and line approximation results with motion segmentation, and to determine the final object boundary. The Gu et al. article proposes a semantic object tracking system using mathematical morphology and perspective motion. That system uses a modified morphological watershed procedure to segment uncertain areas between the interior and exterior outlines. Flooding seeds are sampled on both interior and exterior outlines, and regions growing from interior seeds define the segmented object boundary. In the first frame, the interior outline is defined by users, and the exterior outline is obtained by dilation of the interior one. For subsequent frames, the interior and exterior outlines are created by erosion and dilation of the motion projected boundary from the previous frame.
The aforementioned region-based techniques may achieve satisfactory results for a specific class of video content, i.e., video which includes only rigid objects exhibiting simple motion. However, these techniques all track a single contour or video object, ignoring possibly complex components and motions within the semantic object. In real-world video sources, a semantic object usually contains several parts of differing motion, some of which are non-rigid and/or exhibit rapid motion changes. Thus, a single motion model is generally inadequate to track a semantic object.
In addition, the above-discussed prior art techniques rely on field motion as the main feature for tracking purposes. Static color or gray-level segmentation is fulfilled separately, and the fusion of the two segmentation results is typically done only at the final stage using certain heuristic rules. Due to the noisy nature of a motion field in real-world scenes, tracking results may also be error prone.
Because no constraints are applied between motion and static segmentation, the generation of a final object mask is difficult when the two results are widely separated. Furthermore, these techniques tend to ignore the background content while tracking the foreground object. This can cause problems in tracking regions near the boundary of the object.
SUMMARY OF THE INVENTION To solve the above problems for real-world video sources, the present invention provides an innovative method that combines low level automatic region segmentation and tracking with an active method for defining and tracking video objects at a higher level. A semantic object is modeled as a set of regions with corresponding spatial and visual features. The present invention directly links the semantic object to its underlying regions and is able to handle many real-world situations, such as complex objects, fast and/or intermittent motion, complicated backgrounds, multiple moving objects and partial occlusion, which cannot be satisfactorily modeled using existing techniques.
The instant technique can be carried out in two stages: an initial object segmentation stage where user input at the starting frame is used to create a semantic object with underlying homogeneous regions; and an object tracking stage where homogeneous regions and the semantic object are tracked through successive frames. At the first stage, the object model is constructed through a region segmentation process and a region aggregation process. The segmentation process decomposes the frame image into a set of non-overlapping regions with homogeneous visual features including color, edge and motion. The aggregation process groups these regions into foreground and background objects according to the initial object boundary given by users. At the second stage, innovative methods are deployed to map regions to the future frame as seeds for an inter-frame region tracking and segmentation process. An estimated object boundary is extracted from the tracked regions and then an aggregation process is used to classify regions into foreground objects or background based on their tracks as well as feature similarity.
In accordance with the present invention, a method for segmenting and actively tracking semantic video objects from a sequence of frames of digital video information, includes the steps of: receiving an initial frame of video image data and at least one succeeding frame of video image data from said sequence of frames of digital video information; defining at least one object boundary in the initial frame of video; partitioning the initial frame of video image data into a plurality of regions based on substantially homogenous video features; and tracking the object boundary and plurality of regions in the at least one succeeding frame of video image data. In such a method, the defining step can include receiving an approximation of the object boundary from a user.
Also, the partitioning step can include edge, color and a motion field as the applicable video features. An aggregation step can be performed after the partitioning step in order to classify the regions as background and foreground regions.
Preferably, the plurality of regions do not overlap. The classification of regions can be determined by a percentage of the region which resides within the object boundary. The classification of regions after a first frame can also be determined by the classification of the region in a previous frame. The classification of regions can also be determined by features such as color, motion and edge.
BRIEF DESCRIPTION OF THE DRAWING Further objects, features and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the invention, in which Fig. 1 is a system diagram of an illustrative embodiment of the present invention;
Fig. 2 is a block diagram of an object segmentation system in accordance with a preferred embodiment of the present invention; Fig. 3 is a flow diagram of an iterative region merging process useful in the object segmentation process of Fig. 2;
Fig. 4 is a block drawing of an automatic semantic object tracking process in accordance with a preferred embodiment of the present invention;
Fig. 5 is an illustrative diagram showing the inter-frame pixel labeling process useful in the process of Fig. 4;
Fig. 6 is an illustrative drawing showing the results of the application of the present invention to five different video sequences;
Figs. 7a and 7b are graphs plotting the average number of false pixels and missed pixels, respectively, against the number of user inputs for the video sequences shown in Fig. 6; and
Fig. 8 is a block diagram of a computer system suitable for practicing the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS Figure 1 illustrates an embodiment of a system which combines automatic region segmentation methods with an active method for defining and tracking video objects at a higher level. The system generally contains two stages: an initial object segmentation stage, where user input 100 at the starting frame 102 is used to create a semantic object with underlying homogeneous regions 104; and an object tracking stage 110, where homogeneous regions 104 and the object are tracked through the successive frames.
In the first stage, based on a rough object boundary provided by user input in an object definition block 100 at the starting frame 102, the semantic object is generated through a region segmentation process 106 and aggregation process. A region segmentation method can be applied to an area inside a slightly expanded bounding box of the user-specified object which effectively fuses color and edge features in a region-growing process that produces homogeneous color regions with accurate boundaries.
To extract homogeneous regions 104 in both color and motion, motion segmentation based on a dense motion field can be used to further split the color regions. As color segmentation is more robust and gives more accurate region boundaries, it is beneficial if color segmentation is applied before motion segmentation. Motion is generally used as a secondary feature due to the noisy nature of the motion field. This is important for the following region aggregation procedure 108 where homogeneous regions are classified as either foreground or background to form the semantic object with an accurate boundary. Region aggregation 108 is based on the coverage of each region by the initial object boundary: regions that are covered by more than a certain percentage are grouped into the foreground object. The final contour of the semantic object is computed from foreground regions. Foreground regions belonging to the object and background regions are stored and can be tracked over time in the successive frames.
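As an illustration of this coverage-based grouping at the starting frame, the following Python sketch builds a foreground mask from the segmented regions; the 90% and 30% cut-offs echo the example percentages quoted elsewhere in this document, and the function and argument names are illustrative assumptions rather than the patent's terminology.

```python
import numpy as np

def aggregate_initial_regions(label_map, object_mask, hi=0.90, lo=0.30):
    """Build a per-pixel foreground mask from homogeneous regions.

    label_map   -- 2-D int array, one label per homogeneous region (-1 = unlabeled)
    object_mask -- 2-D bool array, True inside the user-defined object outline
    hi, lo      -- example coverage thresholds (e.g. 90% and 30%)
    """
    fg_mask = np.zeros_like(object_mask, dtype=bool)
    for label in np.unique(label_map):
        if label < 0:
            continue                               # skip unlabeled (edge) pixels
        region = (label_map == label)
        coverage = np.count_nonzero(region & object_mask) / np.count_nonzero(region)
        if coverage >= hi:                         # mostly covered: whole region is foreground
            fg_mask |= region
        elif coverage <= lo:                       # barely covered: whole region is background
            continue
        else:                                      # in between: split along the initial outline
            fg_mask |= (region & object_mask)
    return fg_mask
```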
Tracking at both the region and object levels is the main task in the second stage of processing. Segmented regions from the previous frame are projected to the current frame using their individual affine motion models. An expanded bounding box including all projected foreground regions can then be computed. Then the area inside the bounding box can be split into homogeneous color and motion regions following a region tracking process. Unlike in other existing approaches, projected regions are not used directly as the new segmentation. Rather, the projected regions are used as seeds in another color based region growing process, which is similar to the fusing algorithm in the first stage, to track existing regions. Pixels that cannot be tracked from any old regions are labeled as new regions. Thus, the resulting homogeneous regions are tagged as either foreground, which are tracked from a foreground region, or background, which are tracked from a background region, or new, which are not tracked. The homogeneous regions 104 are then passed to a region aggregation process 108 and are classified as either belonging to the foreground object or the background. Here, instead of being provided directly by user input, the approximated object boundary is obtained from the projected foreground regions. To better handle possible motion estimation errors, the region aggregation process 108 can be carried out iteratively. The object contour can then be computed from foreground regions and all regions are advanced to the motion projection process 112 for a succeeding frame 114 of video image data.
Figure 2 is a system diagram further outlining a semantic object segmentation operation at an initial frame. Semantic object segmentation preferably utilizes four processes which are discussed below, i.e., object definition and threshold specification, feature map generation, region segmentation, and object composition.
For the first process, object definition 202 and threshold specification, users initially define a semantic object by using a tracing interface, such as a digital pointer, light pen, touch screen, mouse or the like. The input can be a polygon whose vertices and edges are roughly placed along the desired object boundary. To improve system tolerance to user-input error, a snake algorithm 204, such as that described in the Kass et al. article, which is hereby incorporated by reference in its entirety, can be used to align the user-specified polygon to the actual object boundary (i.e. edges). (The snake algorithm can be based on minimizing a specific energy function associated with edge pixels.) Users can also choose to skip the snake module 204 if a relatively accurate outline of the object is provided. The final accurate object boundary will be obtained by the subsequent region segmentation and aggregation processes.
After the object definition process 202, users can start the tracking process by specifying a set of thresholds. These thresholds include a color merging threshold, weights on the three color channels, a motion merging threshold and a tracking buffer size. These thresholds can be chosen based on the characteristic of a given video shot and experimental results. For example, for a video shot where foreground objects have similar luminance with background regions, users may put a lower weight on the luminance channel. Users can perform the tracking process over a small number of frames using predetermined default thresholds which are automatically generated by the system, and then adjust the thresholds based on the segmentation and tracking results. The user can stop the tracking process at any frame, modify the object boundary that is being tracked and then restart the tracking process from the modified frame.
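A minimal sketch of how these user-specified thresholds might be gathered follows; the field names and the default values are illustrative assumptions, not values prescribed by the patent.

```python
from dataclasses import dataclass

@dataclass
class TrackingThresholds:
    """User-tunable parameters gathered before tracking starts (illustrative)."""
    color_merge_threshold: float = 10.0   # max weighted L*u*v* distance for merging
    weight_L: float = 1.0                 # luminance channel weight
    weight_u: float = 2.0                 # chrominance weights (often weighted higher)
    weight_v: float = 2.0
    motion_merge_threshold: float = 2.0   # max motion-vector difference for merging
    tracking_buffer_px: int = 10          # dilation size applied to the projected mask

# Example: de-emphasise luminance for a shot whose foreground and background
# have similar brightness, as suggested in the text above.
params = TrackingThresholds(weight_L=0.5)
```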
Given the initial object boundary from a user, or the snake module, a slightly extended (approximately 15 pixels) bounding box 206 surrounding the arbitrarily shaped object is computed. The following segmentation procedures are fulfilled inside this bounding box.
The second process involves the creation of three feature maps, i.e., an edge map 208, color map 210 and a motion field 212, within the bounding box from the original frame image. The color map is the major feature map in the following segmentation module 214. A process to generate the color map 210 can include the following steps. The original image is converted into CIE L*u*v* color space. It is well known that perceptually non-uniform color spaces such as RGB are not well suited for color segmentation, as distance measure in these spaces is not proportional to perceptual difference. L*u*v* is a perceptually uniform color space which divides color into one luminance channel (L*) and two chrominance channels (u* and v*). This channel separation also allows different weights to be placed on luminance and chrominance distance measures. The color difference is thus given by the weighted Euclidean distance in the three dimensional space, shown in equation (1):
\Delta s = \sqrt{ w_L (\Delta L^*)^2 + w_u (\Delta u^*)^2 + w_v (\Delta v^*)^2 } \qquad (1)

where w_L, w_u and w_v are weights assigned by a user. Experimental results indicate that it is generally preferable to put more weighting on the chrominance channels (e.g., about two times more) rather than the luminance channel.
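For concreteness, a small Python helper implementing the weighted distance of equation (1); the default weights simply follow the suggestion above to weight the chrominance channels roughly twice as heavily as luminance, and the function name is illustrative.

```python
import math

def color_distance(c1, c2, w_L=1.0, w_u=2.0, w_v=2.0):
    """Weighted Euclidean distance between two L*u*v* colors (equation 1)."""
    dL = c1[0] - c2[0]
    du = c1[1] - c2[1]
    dv = c1[2] - c2[2]
    return math.sqrt(w_L * dL ** 2 + w_u * du ** 2 + w_v * dv ** 2)

# Example: distance between two similar L*u*v* colors
print(color_distance((52.0, 10.0, -3.0), (50.0, 12.0, -1.0)))
```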
The L*u*v* image is preferably simplified and smoothed to remove possible noise as well as minor details for the purpose of region merging. This can be accomplished by an adaptive quantization and a median filtering process. Quantization can be performed to produce homogeneous regions in which pixels having similar colors are mapped into a common bin. To quantize an image into a very limited number of colors (e.g., 32 or 16 bins), common fixed-level quantization methods can not always preserve color information under variant situations. Thus, the use of a clustering based method, such as for example a K-means technique, to analyze the input image and determine an adaptive quantization scale in the L*u*v* space is preferred. The above weighted Euclidean distance (Eq. 1) can be used as the distance measure in the clustering process. After quantization, a median filtering process can be applied on each of the L*u*v* channels. Such non-linear median filtering substantially reduces image noise and emphasizes prominent color regions while preserving salient image features, such as edges. This process can be used to produce the final color map 210.
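One possible realisation of this adaptive quantization and smoothing step, sketched with common scientific-Python tools; the use of scikit-learn, scikit-image and SciPy is an assumption of this sketch (the patent names no library), and scaling the channels by the square roots of the weights is one simple way to approximate the weighted distance inside standard K-means.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.ndimage import median_filter
from skimage.color import rgb2luv

def quantized_color_map(rgb_image, n_bins=32, weights=(1.0, 2.0, 2.0), median_size=3):
    """Convert to L*u*v*, quantize to n_bins adaptive colors, then median-filter."""
    luv = rgb2luv(rgb_image)                                  # H x W x 3, perceptually uniform
    w = np.sqrt(np.asarray(weights))
    pixels = luv.reshape(-1, 3) * w                           # emulate the weighted distance
    km = KMeans(n_clusters=n_bins, n_init=4, random_state=0).fit(pixels)
    quantized = (km.cluster_centers_ / w)[km.labels_].reshape(luv.shape)
    # Median filtering on each channel removes noise while preserving edges.
    for ch in range(3):
        quantized[..., ch] = median_filter(quantized[..., ch], size=median_size)
    return quantized
```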
The edge map 208 is a binary mask wherein edge pixels are distinguished from non-edge pixels, such as by setting the edge pixels to 1 and setting the non-edge pixels to 0. The edge map 208 can be generated by applying edge detection 214, such as the Canny edge detection algorithm. The Canny edge detector performs 2-D Gaussian pre-smoothing on the image and then takes directional derivatives in the horizontal and vertical directions. These derivatives are used to calculate a gradient. Local gradient maxima can then be defined as candidate edge pixels. This output is preferably run through a two-level thresholding synthesis process to produce the final edge map 208. A simple algorithm can be used to select the two threshold levels in the synthesis process based on a histogram of the edge gradient.
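As a hedged illustration, the edge map could be produced with an off-the-shelf Canny implementation such as skimage.feature.canny, which already includes Gaussian pre-smoothing, gradient computation, non-maximum suppression and two-level hysteresis; choosing the two thresholds as quantiles of the gradient magnitude stands in for the histogram-based selection described above and is an assumption of the example.

```python
import numpy as np
from skimage.feature import canny

def make_edge_map(luminance, sigma=1.0, low_q=0.7, high_q=0.9):
    """Binary edge mask (1 = edge, 0 = non-edge) from the luminance channel."""
    edges = canny(luminance.astype(float), sigma=sigma,
                  low_threshold=low_q, high_threshold=high_q,
                  use_quantiles=True)   # thresholds as gradient quantiles
    return edges.astype(np.uint8)
```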
The motion field 212 can be generated by a hierarchical block matching algorithm, such as described in M. Bierling, "Displacement Estimation by Hierarchical Block Matching", 1001 SPIE Visual Communication & Image Processing (1988) which is hereby incorporated by reference. This algorithm uses distinct sizes of measurement windows for block matching at different grid resolutions, and thus is able to generate a homogeneous motion field that is closer to the true motion than block based estimation algorithms. A 3-level hierarchy, such as suggested in Bierling, is appropriate for use in the instant process.
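The sketch below is a simplified, unoptimized interpretation of hierarchical block matching: three passes with shrinking measurement windows and search ranges, each initialized by the previous pass. The block sizes, search ranges and the sum-of-absolute-differences criterion are assumptions for illustration, not the specific settings of Bierling or of the described system.

```python
import numpy as np

def block_match(prev, curr, block, search, init_flow=None):
    """One block-matching pass; returns a dense (H, W, 2) displacement field
    mapping each current-frame block back into the previous frame."""
    H, W = prev.shape
    flow = np.zeros((H, W, 2))
    for y0 in range(0, H - block + 1, block):
        for x0 in range(0, W - block + 1, block):
            ref = curr[y0:y0 + block, x0:x0 + block].astype(float)
            if init_flow is None:
                iy = ix = 0
            else:
                iy, ix = init_flow[y0 + block // 2, x0 + block // 2].astype(int)
            best, best_v = np.inf, (iy, ix)
            for dy in range(iy - search, iy + search + 1):
                for dx in range(ix - search, ix + search + 1):
                    ys, xs = y0 + dy, x0 + dx
                    if ys < 0 or xs < 0 or ys + block > H or xs + block > W:
                        continue
                    cand = prev[ys:ys + block, xs:xs + block].astype(float)
                    sad = np.abs(ref - cand).sum()
                    if sad < best:
                        best, best_v = sad, (dy, dx)
            flow[y0:y0 + block, x0:x0 + block] = best_v
    return flow

def hierarchical_block_matching(prev, curr, levels=((32, 8), (16, 4), (8, 2))):
    """Three passes with shrinking measurement windows, each refining the last."""
    flow = None
    for block, search in levels:
        flow = block_match(prev, curr, block, search, init_flow=flow)
    return flow
```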
The region segmentation process is developed based on the three feature maps: the color map 210, the edge map 208 and the motion field 212. Because the color map 210 and edge map 208 are complementary, i.e., the former captures low-frequency information (means) while the latter captures high-frequency details (edges) of an image, fusion of these maps greatly improves segmentation results. Referring to Figure 2, during segmentation 214 a color-based pixel labeling process is applied to the color map. Labeling is a process in which one label is assigned to a group of neighboring pixels which exhibit substantially the same color. To prevent assigning one label to two regions with the same color but separated by edge pixels, only non-edge pixels (i.e., pixels that are set to 0 in the edge mask) are labeled in this process. Edge pixels remain unlabeled. This process generates an initial group of regions (i.e., pixels with the same label) as well as their connection graph. Two regions are linked as neighbors if pixels in one region have neighboring pixels (4-connection) in the other region. As edge pixels are not labeled, two regions separated by edge pixels are not linked as neighboring regions.
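A minimal sketch of this labeling step is given below: a 4-connected flood fill over non-edge pixels that share a quantized color bin, which also records the region connection graph. The data structures and names are illustrative, not part of the original disclosure.

```python
import numpy as np
from collections import deque

def label_color_regions(color_idx, edge_map):
    """4-connected labeling of non-edge pixels sharing a quantized color.

    color_idx : (H, W) integer map of quantized color bins
    edge_map  : (H, W) binary mask, 1 for edge pixels (left unlabeled)
    Returns a label map (0 = unlabeled edge pixel) and the neighbor graph.
    """
    H, W = color_idx.shape
    labels = np.zeros((H, W), dtype=int)
    neighbors = {}                       # label -> set of adjacent labels
    next_label = 1
    for sy in range(H):
        for sx in range(W):
            if edge_map[sy, sx] or labels[sy, sx]:
                continue
            lab, col = next_label, color_idx[sy, sx]
            next_label += 1
            neighbors[lab] = set()
            labels[sy, sx] = lab
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if not (0 <= ny < H and 0 <= nx < W) or edge_map[ny, nx]:
                        continue        # edge pixels never link regions
                    if labels[ny, nx] == 0 and color_idx[ny, nx] == col:
                        labels[ny, nx] = lab
                        queue.append((ny, nx))
                    elif labels[ny, nx] not in (0, lab):
                        # A different region touches this one: record adjacency.
                        neighbors[lab].add(labels[ny, nx])
                        neighbors[labels[ny, nx]].add(lab)
    return labels, neighbors
```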
Referring to Fig. 3, the color merging process is an iterative spatially constrained clustering process. Color distances (Eq. 1) between every two connected regions are computed in block 302. Two connected regions are merged if the color distance between them is 1) smaller than the given color threshold; and 2) a local minimum, i.e., smaller than all other color distances associated with the two regions. Once a new region is generated from two adjoining regions, the mean color of the new region can be computed (block 306) by taking a weighted average of the mean colors of the two old regions m1 and m2 according to the equation:
m_new = (s1·m1 + s2·m2) / (s1 + s2)    (2)
where the sizes of the two old regions, s1 and s2, are used as weights. The region connections are also updated for all neighbors of the two old regions (block 308). The new region can be assigned a label (block 310), such as the label of the larger one of the two merged regions. The two old regions can then be dropped (block 312).
The color merging process can be iterated until color distances between every two connected regions are above the color threshold. As edge pixels are not labeled, two regions separated by edge pixels are not connected as neighbors. Thus, the growth of each region is naturally stopped at its edge pixels. After the color merging process is complete, the edge pixels can simply be assigned to their neighboring regions with the smallest color distances.
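The following sketch illustrates the merging loop under the stated stopping rule. For simplicity it always merges the globally closest adjacent pair below the threshold (which in particular satisfies the local-minimum condition) and updates mean colors with equation (2); the dictionary-based region representation and function name are assumptions of the example.

```python
import numpy as np

def merge_color_regions(mean_color, size, neighbors, dist_fn, color_thresh):
    """Iterative spatially constrained merging of adjacent regions.

    mean_color : dict label -> mean L*u*v* color
    size       : dict label -> pixel count
    neighbors  : dict label -> set of adjacent labels
    dist_fn    : color distance, e.g. the weighted distance of Eq. 1
    """
    while True:
        best = None
        for a in neighbors:
            for b in neighbors[a]:
                if a < b:
                    d = dist_fn(mean_color[a], mean_color[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
        if best is None or best[0] > color_thresh:
            break                                  # all adjacent pairs too dissimilar
        _, a, b = best
        keep, drop = (a, b) if size[a] >= size[b] else (b, a)   # keep larger label
        total = size[a] + size[b]
        mean_color[keep] = (size[a] * np.asarray(mean_color[a]) +
                            size[b] * np.asarray(mean_color[b])) / total   # Eq. 2
        size[keep] = total
        # Re-wire the adjacency graph and drop the absorbed region.
        for n in neighbors[drop]:
            neighbors[n].discard(drop)
            if n != keep:
                neighbors[n].add(keep)
                neighbors[keep].add(n)
        del neighbors[drop], mean_color[drop], size[drop]
    return mean_color, size, neighbors
```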
To ensure homogeneous motion, a motion-based segmentation using the dense optical flow can be applied to the segmented color regions to check the uniformity of the motion distribution. A similar iterative spatially constrained clustering process can also be used to group pixels inside a color region according to their motion vectors and the given motion threshold. Because an objective is to track the object, this process need only be applied to those regions that intersect with the initial object boundary.
Relatively small regions can be merged with their neighboring regions based again on the color distance. Generally, this simplifies the segmentation result and reduces the computation complexity.
For object composition, the region aggregation module 220 receives homogeneous regions from the segmentation module 214 and the initial object boundary from the snake module 204 or directly from user input 220. Region aggregation at the starting frame is relatively simple compared with that for the subsequent frames, as all regions are newly generated (not tracked) and the initial outline is usually not far from the real object boundary. A region is classified as foreground if more than a certain percentage (e.g., 90%) of the region is covered by the initial object outline. On the other hand, if less than a certain percentage (e.g., 30%) of a region is covered by the initial object outline, the region is considered background. Regions between the low and high thresholds are split into foreground and background regions according to their intersection with the initial object outline.
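A minimal sketch of this starting-frame aggregation rule, with the 90%/30% coverage thresholds as defaults, might look as follows; label 0 is assumed to mean an unassigned (edge) pixel, and the function name is illustrative.

```python
import numpy as np

def aggregate_start_frame(labels, object_mask, high=0.9, low=0.3):
    """Classify regions at the starting frame by overlap with the initial
    outline; object_mask is a binary image of the object interior."""
    foreground, background, split = [], [], []
    for lab in np.unique(labels):
        if lab == 0:                       # 0 = unlabeled / edge pixels
            continue
        region = labels == lab
        coverage = (region & object_mask).sum() / region.sum()
        if coverage >= high:
            foreground.append(lab)
        elif coverage <= low:
            background.append(lab)
        else:
            split.append(lab)              # split along the outline afterwards
    return foreground, background, split
```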
After region aggregation 220, affine estimation 222 is performed. Affine motion parameters of all regions, including both foreground and background, can be estimated by a multivariate linear regression process over the dense optical flow inside each region. Preferably, a two dimensional affine model with six parameters, as set forth in Equation 3, can be used as a good first-order approximation of a distant object undergoing three dimensional translation and linear deformation:
x' = sx·x + rx·y + tx
y' = ry·x + sy·y + ty    (3)
where [x, y] and [x', y'] are the original and transformed coordinates, respectively, and tx, ty, sx, sy, rx and ry are the six affine transformation parameters. These affine models will be used to track the regions and the object in the subsequent frames.
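For illustration, the six parameters can be obtained by ordinary least squares over the flow vectors of a region, as sketched below; the coordinate convention and the returned parameter layout follow the reconstruction of equation (3) above and are otherwise assumptions of the example.

```python
import numpy as np

def estimate_affine(xs, ys, us, vs):
    """Least-squares fit of the six-parameter affine model (Eq. 3) to the
    dense optical flow of one region.

    xs, ys : pixel coordinates inside the region
    us, vs : optical-flow displacements at those pixels
    Returns the 2x3 matrix [[sx, rx, tx], [ry, sy, ty]].
    """
    A = np.column_stack([xs, ys, np.ones_like(xs)]).astype(float)
    row_x, *_ = np.linalg.lstsq(A, np.asarray(xs, float) + us, rcond=None)
    row_y, *_ = np.linalg.lstsq(A, np.asarray(ys, float) + vs, rcond=None)
    return np.vstack([row_x, row_y])     # rows: (sx, rx, tx) and (ry, sy, ty)
```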
Given the object with homogeneous regions constructed at the starting frame, tracking at both the region and object levels is the main task in the successive frames. The objective of the tracking process is to avoid losing foreground regions and also to avoid including false background regions. This task can be performed through the following steps. An inter-frame segmentation process can be employed to segment a frame into homogeneous regions. Unlike in the starting frame, where all regions are tagged as new, these regions are classified in the inter-frame segmentation process as either foreground, background or new, according to their relationship with the existing foreground and background regions of the previous frame. In the region aggregation process, the estimated, or projected, object boundary can then be used to group the tagged regions into the object and background. For regions around the object boundary with the tag "new," their visual features are also examined to determine whether or not they belong to the object.
As shown in Figure 4, by way of a motion projection module 402, segmented regions from the previous frame, including both foreground and background regions, are projected onto the current frame (virtually) using their individual affine motion models. Projected regions keep their labels and original classifications. For video shots with a static or homogeneous background (i.e. only one moving object), users can choose not to project background regions to save time and processing overhead. An expanded bounding box of all projected foreground regions can be computed. As before, the following segmentation and aggregation processes are generally applied only to the area inside the bounding box.
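A minimal sketch of this projection step is shown below: each region's pixels are warped with its affine model and rasterized into a mask for the current frame. The nearest-pixel rounding and clipping to the frame are simplifications of the example, and the affine layout matches the estimate_affine() sketch above.

```python
import numpy as np

def project_region(labels, lab, affine, out_shape):
    """Warp one region's pixels into the next frame with its affine model.

    The projected mask keeps the region's label and foreground/background
    classification (handled by the caller in this sketch).
    """
    ys, xs = np.nonzero(labels == lab)
    pts = np.column_stack([xs, ys, np.ones_like(xs)]).astype(float)
    proj = pts @ affine.T                                  # (N, 2) -> (x', y')
    xp = np.clip(np.round(proj[:, 0]).astype(int), 0, out_shape[1] - 1)
    yp = np.clip(np.round(proj[:, 1]).astype(int), 0, out_shape[0] - 1)
    mask = np.zeros(out_shape, dtype=bool)
    mask[yp, xp] = True
    return mask
```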
Generation of the projected feature maps, the edge map 404, color map 406 and motion field 408, generally utilizes the same methods as described above. The main difference is that, in the quantization step, the existing color palette computed at the starting frame is used directly to quantize the current frame. Using a consistent quantization palette enhances the color consistency of segmented regions between successive frames, which improves the performance of region-based tracking. As object tracking is limited to single video shots in which there is no abrupt scene change, using one color palette is generally valid. If necessary, a new quantization palette can be generated automatically, such as when a large quantization error is encountered.

With reference to Figures 4 and 5, in the tracking module 410, the three feature maps are fused, based on the projected foreground and background regions, to track existing regions and segment the current frame. First, an inter-frame labeling process is performed in which non-edge pixels are sequentially labeled, such as by labeling the pixels one by one from left to right and top to bottom. As in the common labeling process, if a pixel has the same color as that of its labeled neighboring pixels, it is assigned the label of its neighbors. When the color of a pixel differs from the colors of its labeled neighboring pixels, its color is compared with all projected regions that cover its coordinate in the current frame. If the pixel's color distance to the closest region is under the given color threshold, the pixel is "tracked" and assigned the label and classification (foreground or background) of the closest projected region. Otherwise, a new label is generated and assigned to the pixel. Regions identified by such new labels are classified as "new" regions.
The inter-frame color merging, edge labeling, motion split and small region elimination processes which take place in the tracking module 410 are similar to those previously described, with some additional constraints. Foreground or background regions tracked from the previous frame are allowed to be merged only with regions of the same class or with new regions. Merging of foreground and background regions is forbidden. New regions can be merged with each other or with foreground or background regions. When a new region is merged with a tracked region, the merging result inherits the label and classification of the tracked region. In motion segmentation, split regions remain in their original classes. After the inter-frame tracking process 410, a list of regions temporarily tagged as either foreground, background or new is obtained. The temporarily tagged regions are then passed to an iterative region aggregation module 412.

As shown in Figure 4, the region aggregation module 412 receives two inputs: the homogeneous regions from the tracking module 410 and the estimated object boundary from the motion projection module 402. The object boundary can be estimated from the projected foreground regions. Foreground regions from the previous frame are projected independently of one another, and the combination of the projected regions forms the mask of the estimated object. The mask can be refined using a morphological closing operation (i.e. dilation followed by erosion) with a size of several pixels in order to close tiny holes and smooth boundaries. To tolerate motion estimation error, the result can be further dilated by the tracking buffer size, which is specified by the user at the beginning of the tracking process.
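The mask refinement can be expressed with standard morphology, for example with SciPy as sketched below; the structuring-element sizes are placeholders, with the dilation radius standing in for the user-specified tracking buffer.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_dilation

def refine_projected_mask(projected_fg_masks, close_size=3, buffer_size=5):
    """Combine projected foreground regions into the estimated object mask,
    close small holes, and dilate by the tracking buffer size."""
    mask = np.logical_or.reduce(projected_fg_masks)
    mask = binary_closing(mask, structure=np.ones((close_size, close_size)))
    mask = binary_dilation(mask, structure=np.ones((buffer_size, buffer_size)))
    return mask
```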
The region aggregation module 412 implements a region grouping and boundary alignment algorithm based on the estimated object boundary as well as the edge and motion features of the region. An exemplary algorithm includes the following steps:
1. For every segmented region, perform steps 2 to 8.
2. If the region is tagged as background, keep it as background. Go to step 8.
3. Compute an intersection ratio of the region with the object mask. Go to step 5 if the region is tagged as new.
4. If the region (foreground) is covered by the object outline by more than a given threshold, such as about 80%, the region belongs to the semantic object. Otherwise, the region is intersected with the object mask and split: a) split regions inside the object mask are still kept as foreground; b) split regions outside the object mask are tagged as new. Go to step 8.
5. If the region (new) is covered by the object mask by less than a threshold, such as about 30%, keep it as new. Go to step 8.
6. If the region is covered by the object mask by more than a threshold, such as about 80%, classify it as foreground. Go to step 8.
7. Compute the numbers of edge pixels (using the edge map) between this region and the current background and foreground regions. Compute the differences between the mean motion vector of this region and those of its neighboring regions, and find the closest motion neighbor. If the region is separated from background regions by more edge pixels than from foreground regions (or the region is not connected to any background regions), and its closest motion neighbor is a foreground region, intersect it with the object mask and split: a) split regions inside the object mask are classified as foreground; b) split regions outside the object mask are tagged as new. Otherwise, keep the region as new.
8. Advance to the next region. Go to step 2.
Here the object mask refers to the estimated object projection. Compared with the aggregation process described above, a relatively lower ratio (e.g., 80%) can be used to include a foreground or new region. This is to improve handling of motion projection errors. As it is possible to have multiple layers of new regions emerge between the foreground and the background, the above aggregation and boundary alignment process can be iterated multiple times. This is useful in correcting errors which can result from the estimation of rapid motion. At the end of the last iteration, all remaining new regions are classified into background regions.
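A condensed sketch of one pass of steps 1-8 is given below. The per-region edge-pixel counts and closest motion neighbor are assumed to be precomputed, a region marked "split" would subsequently be intersected with the object mask as described in steps 4 and 7, and the "not connected to any background regions" case of step 7 is omitted for brevity; all helper names and data layouts are illustrative.

```python
import numpy as np

def align_regions(labels, tags, object_mask, edge_counts, closest_motion,
                  fg_ratio=0.8, new_ratio=0.3):
    """One pass of the region grouping / boundary alignment steps above.

    tags           : dict label -> "foreground" | "background" | "new"
    edge_counts    : dict label -> {"fg": n, "bg": n} edge pixels to each class
    closest_motion : dict label -> label of the closest motion neighbor
    """
    result = {}
    for lab, tag in tags.items():
        region = labels == lab
        cov = (region & object_mask).sum() / region.sum()
        if tag == "background":
            result[lab] = "background"                       # step 2
        elif tag == "foreground":
            result[lab] = "foreground" if cov > fg_ratio else "split"   # step 4
        else:                                                # "new" region
            if cov < new_ratio:
                result[lab] = "new"                          # step 5
            elif cov > fg_ratio:
                result[lab] = "foreground"                   # step 6
            elif (edge_counts[lab]["bg"] > edge_counts[lab]["fg"] and
                  tags.get(closest_motion.get(lab)) == "foreground"):
                result[lab] = "split"                        # step 7
            else:
                result[lab] = "new"
    return result
```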
Finally, affine models of all regions, including both foreground and background, can be estimated by a linear regression process over the optical flow. As described before, these affine models are used to project regions into a future frame in the motion projection module 112.
The methods and modules described in Figures 1-4 can be implemented on conventional computer-based systems operating under the control of appropriate software designed in accordance with the present invention. Figure 8 is a block diagram which illustrates the essential operational components of a system for practicing the image segmentation and tracking methods described herein. The system can take the form of a personal computer with an appropriately high-speed processor and video handling peripherals, a suitable digital video editing workstation, and the like. Such a system will generally include a processor 802 which is coupled to a video source 804. The processor 802 generally includes a microprocessor, memory, control logic, and input/output circuits operatively interconnected to perform the described processing operations. The video source 804 can take the form of magnetic data storage media, optical data storage media, streaming video content from a remote source, and the like. An input device 806, such as a digital pointer, light pen, computer "mouse" and the like, is coupled to the processor 802. The input device is used to provide user input for, inter alia, establishing object boundaries and providing threshold assignments for object definition. A display device 808 is also coupled to the processor 802 to provide a visual display of the digital video.
EXAMPLES

The present invention has shown very good segmentation and tracking results on general video sources. As shown in Figures 6A-E, five different types of video sequences were used for subjective and objective evaluation of the present system. The first sequence (akiyo), illustrated in Figure 6A, contains an anchor person with small motion and a strong color difference from the background. The second sequence (foreman), illustrated in Figure 6B, contains an object with relatively large motion and very similar brightness to the background. In the third sequence, Figure 6C, a skater having many different fast-moving parts (i.e. body, arms and legs) is depicted. The characteristic of the fourth sequence, shown in Figure 6D, is that the shape of the flying bird changes abruptly from frame to frame. In the last sequence, Figure 6E, a plane is flying away from the camera. One hundred successive frames were used for each sequence in the experiments. After the first user input at the starting frame, subsequent user inputs are applied at frames where the largest tracking errors occur or start to occur. Generally, the number of frames tracked after earlier user inputs is larger than that after later user inputs. Figures 6A-E show object tracking results of the five testing sequences after 3 user inputs. For subjective evaluation, segmented objects were superimposed onto a gray background with random noise and played in real time for users to see whether there were observable errors. To remove boundary jitter (i.e. high-frequency noise), a temporal median filtering process (with a radius of 2 frames) was applied to the binary object masks before they were put onto the background with random noise. In the sequence of Figure 6A, there were no noticeable errors after only a single user input at the starting frame. In Figures 6B, 6D and 6E, the sequence of three user inputs provided outputs without noticeable errors. The skater sequence of Figure 6C, which has fast and complex motion, required four user inputs to remove noticeable errors.
In the objective evaluation, objects for the five sequences were extracted with the help of the snake algorithm, and then the average numbers of missing and false pixels over the 100 frames at different numbers of user inputs (from 1 to 10) were computed. The average numbers of missing and false pixels versus the number of user inputs are plotted in Figure 7. The average size of each object is shown in the legends. As shown in the plots, tracking errors are usually reduced to a very low level after three inputs. Considering errors caused by boundary blur (especially for the MPEG-1 sequences with fast motion) and by manual segmentation, the present system and methods generate very good tracking results. The sequence of Figure 6A requires only one user input to obtain a very good tracking result over 100 frames. Missing and false pixels are mainly around the hair, which is not clearly separated from the background by either color or edges. Subsequent user inputs bring very little improvement because only a small number of frames need to be modified. The bird sequence of Figure 6D contains a small and fast-moving object. The abrupt shape change between successive frames causes part of the wings to be missed in some frames. These errors are substantially corrected after 2-3 more user inputs.
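The per-frame error counts used in this evaluation reduce to simple set differences between the tracked object mask and the reference mask, as in the following sketch (the masks are assumed to be binary arrays of equal size, and the function name is illustrative):

```python
import numpy as np

def tracking_errors(tracked_mask, reference_mask):
    """Count missing pixels (in the reference object but not tracked) and
    false pixels (tracked but not in the reference object) for one frame."""
    missing = np.logical_and(reference_mask, ~tracked_mask).sum()
    false = np.logical_and(tracked_mask, ~reference_mask).sum()
    return int(missing), int(false)
```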
For the plane sequence of Figure 6E, after the first input, most false pixels come from a black strip at the left frame border. Part of the strip is merged into the plane in some successive frames. This error is easily corrected after 1-2 more user inputs.
The foreman and skater sequences, Figures 6B and 6C, respectively, have relatively large errors at the first user input due to complex object motion. Missing pixels are mainly caused by the emergence of new foreground regions around the frame border, e.g., the left shoulder in the foreman sequence. These new regions connect with old object regions by only a few pixels when they appear and also abruptly change the shape of the object. Thus they are classified as background regions during the tracking process. False pixels are included mainly because of uncovered background regions enclosed by the object. In the skater sequence shown in Figure 6C, frame 50 has a background region included in the object. The background region is uncovered when the two legs move apart. As it is enclosed by the object and the frame border, and thus not connected to any existing background regions, the new background region is falsely classified as foreground. The above two types of errors can be corrected by one user input at the frame where the error starts to happen. Restarting the tracking process after the correction will correct errors in the successive frames. As shown in Figures 7A and 7B, for the foreman and skater sequences, the numbers of missing and false pixels drop rapidly after the first two user inputs.
Although the present invention has been described in connection with specific exemplary embodiments, it should be understood that various changes, substitutions and alterations can be made to the disclosed embodiments without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims

1. A method for segmenting and actively tracking semantic video objects from a sequence of frames of digital video information, comprising the steps of: a. receiving an initial frame of video image data and at least one succeeding frame of video image data from said sequence of frames of digital video information; b. defining at least one object boundary in the initial frame of video; c. partitioning the initial frame of video image data into a plurality of regions based on substantially homogenous video features; and d. tracking the object boundary and plurality of regions in the at least one succeeding frame of video image data.
2. The method for segmenting and actively tracking semantic video objects of claim 1, wherein said defining step includes receiving an approximation of the object boundary from a user.
3. The method for segmenting and actively tracking semantic video objects of claim 1, wherein the video features of the partitioning step include at least one of edge, color and motion field video features.
4. The method for segmenting and actively tracking semantic video objects of claim 1, wherein the plurality of regions do not overlap.
5. The method for segmenting and actively tracking semantic video objects of claim 1, wherein an aggregation step is performed after the partitioning step to classify the regions as one of background and foreground regions.
6. The method for segmenting and actively tracking semantic video objects of claim 5, wherein the classification of regions is determined at least in part by a percentage of the region which resides within the object boundary.
7. The method for segmenting and actively tracking semantic video objects of claim 5, wherein the classification of regions is determined at least in part by a classification of the region in a previous frame.
8. The method for segmenting and actively tracking semantic video objects of claim 5, wherein the classification of regions around the object boundary is determined at least in part by at least one of a color feature, an edge feature and a motion feature.
9. A system for segmenting and actively tracking semantic video objects from a sequence of frames of digital video information, comprising: a processor, the processor receiving an initial frame of video image data and at least one succeeding frame of video image data from said sequence of frames of digital video information; an input device, said input device being operatively coupled to said processor for defining at least one object boundary in the initial frame of video image data; a display device, said display device being operatively coupled to said processor for displaying the video image data; and wherein said processor partitions the initial frame of video image data into a plurality of regions based on substantially homogenous video features and tracks the object boundary and plurality of regions in the at least one succeeding frame of video image data.
10. The system for segmenting and actively tracking semantic video objects of claim 9, wherein said input device is a user input interface for providing an approximation of the object boundary from a user.
11. The system for segmenting and actively tracking semantic video objects of claim 9, wherein the video features used by the processor include at least one of edge, color and motion field video features.
12. The system for segmenting and actively tracking semantic video objects of claim 9, wherein said processor classifies the regions as one of background and foreground regions.
13. The system for segmenting and actively tracking semantic video objects of claim 12, wherein the processor performs the classification of regions at least in part in accordance with a percentage of the region which resides within the object boundary.
14. The system for segmenting and actively tracking semantic video objects of claim 12, wherein the processor performs the classification of regions at least in part in accordance with a classification of the region in a previous frame of the video image data.
15. The system for segmenting and actively tracking semantic video objects of claim 12, wherein the processor performs the classification of regions around the object boundary at least in part in accordance with at least one of a color feature, an edge feature and a motion feature.
PCT/US1999/022264 1998-09-24 1999-09-24 System and method for semantic video object segmentation WO2000018128A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU62654/99A AU6265499A (en) 1998-09-24 1999-09-24 System and method for semantic video object segmentation
JP2000571664A JP2002525988A (en) 1998-09-24 1999-09-24 System and method for semantic video object segmentation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10167598P 1998-09-24 1998-09-24
US60/101,675 1998-09-24

Publications (1)

Publication Number Publication Date
WO2000018128A1 true WO2000018128A1 (en) 2000-03-30

Family

ID=22285841

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/022264 WO2000018128A1 (en) 1998-09-24 1999-09-24 System and method for semantic video object segmentation

Country Status (3)

Country Link
JP (1) JP2002525988A (en)
AU (1) AU6265499A (en)
WO (1) WO2000018128A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006131970A1 (en) * 2005-06-09 2006-12-14 Techno Visual Corporation Moving picture compression program and moving picture decompression program
CN101911124B (en) * 2007-12-26 2013-10-23 皇家飞利浦电子股份有限公司 Image processor for overlaying graphics object

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0579319A2 (en) * 1992-07-16 1994-01-19 Philips Electronics Uk Limited Tracking moving objects
EP0587329A2 (en) * 1992-09-05 1994-03-16 International Business Machines Corporation Image processing system
WO1998033323A1 (en) * 1997-01-29 1998-07-30 Levent Onural Rule-based moving object segmentation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GEIGER D ET AL: "DYNAMIC PROGRAMMING FOR DETECTING, TRACKING, AND MATCHING DEFORMABLE CONTOURS", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,US,IEEE INC. NEW YORK, vol. 17, no. 3, 1 March 1995 (1995-03-01), pages 294 - 302, XP000498121, ISSN: 0162-8828 *
GUNSEL B ET AL: "TEMPORAL VIDEO SEGMENTATION USING UNSUPERVISED CLUSTERING AND SEMANTIC OBJECT TRACKING", JOURNAL OF ELECTRONIC IMAGING,US,SPIE + IS&T, vol. 7, no. 3, 1 July 1998 (1998-07-01), pages 592 - 604, XP000771766, ISSN: 1017-9909 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7428315B2 (en) 2001-12-03 2008-09-23 Microsoft Corporation Automatic detection and tracking of multiple individuals using multiple cues
US7433495B2 (en) 2001-12-03 2008-10-07 Microsoft Corporation Automatic detection and tracking of multiple individuals using multiple cues
GB2426137A (en) * 2005-05-10 2006-11-15 Thomson Licensing Sa Tracking object contours in a sequence of images
US8027513B2 (en) 2007-03-23 2011-09-27 Technion Research And Development Foundation Ltd. Bitmap tracker for visual tracking under very general conditions
US9720937B2 (en) * 2008-12-22 2017-08-01 Koninklijke Philips N.V. Relevance feedback on a segment of a data object
US10121254B2 (en) 2013-08-29 2018-11-06 Disney Enterprises, Inc. Methods and systems of detecting object boundaries
US9389767B2 (en) 2013-08-30 2016-07-12 Cyberlink Corp. Systems and methods for object tracking based on user refinement input
US9633446B2 (en) 2014-02-20 2017-04-25 Nokia Technologies Oy Method, apparatus and computer program product for segmentation of objects in media content
US20190311480A1 (en) * 2018-04-10 2019-10-10 Facebook, Inc. Automated cinematic decisions based on descriptive models
WO2019199904A1 (en) * 2018-04-10 2019-10-17 Facebook, Inc. Automated decisions based on descriptive models
US10511808B2 (en) 2018-04-10 2019-12-17 Facebook, Inc. Automated cinematic decisions based on descriptive models
US10523864B2 (en) 2018-04-10 2019-12-31 Facebook, Inc. Automated cinematic decisions based on descriptive models
US10659731B2 (en) 2018-04-10 2020-05-19 Facebook, Inc. Automated cinematic decisions based on descriptive models
CN112335256A (en) * 2018-04-10 2021-02-05 脸谱公司 Automatic decision making based on descriptive model
US10979669B2 (en) 2018-04-10 2021-04-13 Facebook, Inc. Automated cinematic decisions based on descriptive models
EP3979199A4 (en) * 2019-05-30 2022-08-03 Panasonic Intellectual Property Management Co., Ltd. Image processing method, image processing apparatus, and program
CN110782469A (en) * 2019-10-25 2020-02-11 北京达佳互联信息技术有限公司 Video frame image segmentation method and device, electronic equipment and storage medium
CN112800850A (en) * 2020-12-31 2021-05-14 上海商汤智能科技有限公司 Video processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
JP2002525988A (en) 2002-08-13
AU6265499A (en) 2000-04-10


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
ENP Entry into the national phase

Ref country code: JP

Ref document number: 2000 571664

Kind code of ref document: A

Format of ref document f/p: F

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase