WO2000018128A1 - System and method for semantic video object segmentation - Google Patents

System and method for semantic video object segmentation

Info

Publication number
WO2000018128A1
Authority
WO
WIPO (PCT)
Prior art keywords
regions
video
segmenting
region
image data
Prior art date
Application number
PCT/US1999/022264
Other languages
French (fr)
Inventor
Shih-Fu Chang
Di Zhong
Original Assignee
The Trustees Of Columbia University In The City Of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Trustees Of Columbia University In The City Of New York filed Critical The Trustees Of Columbia University In The City Of New York
Priority to AU62654/99A priority Critical patent/AU6265499A/en
Priority to JP2000571664A priority patent/JP2002525988A/en
Publication of WO2000018128A1 publication Critical patent/WO2000018128A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 - Image coding
    • G06T 9/20 - Contour coding, e.g. using detection of edges
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 - Image coding
    • G06T 9/001 - Model-based coding, e.g. wire frame
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/20 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 - Motion estimation or motion compensation
    • H04N 19/537 - Motion estimation other than block-based

Definitions

  • the color merging process can be iterated until color distances between every two connected regions are above the color threshold. As edge pixels are not labeled, two regions separated by edge pixels are not connected as neighbors. Thus, the growth of each region is naturally stopped at its edge pixels. After the color merging process is complete, the edge pixels can simply be assigned to their neighboring regions with the smallest color distances.
  • a motion-based segmentation using the dense optical flow can be applied to the segmented color regions to check the uniformity of the motion distribution.
  • a similar iterative spatially constrained clustering process can also be used to group pixels inside a color region according to their motion vectors and the given motion threshold. Because an objective is to track the object, this process need only be applied to those regions that intersect with the initial object boundary.
  • the region aggregation module 220 receives homogeneous regions from the segmentation module 214 and the initial object boundary from the snake module 204 or user input 220 directly.
  • Region aggregation at the starting frame is relatively simple compared with that for the subsequent frames, as all regions are newly generated (not tracked) and the initial outline is usually not far from the real object boundary.
  • a region is classified as foreground if more than a certain percentage (e.g., 90%) of the region is included by the initial object outline. On the other hand, if less than a certain percentage (e.g., 30%) of a region is covered by the initial object outline, the region is considered as background. Regions between the low and high thresholds are split into foreground and background regions according to the intersection with the initial object outline.
  • affine estimation 222 is performed.
  • Affine motion parameters of all regions, including both foreground and background, can be estimated by a multivariate linear regression process over the dense optical flow inside each region.
  • a two-dimensional affine model with six parameters, as set forth in Equation 3 (not reproduced in this text), can be used as a good first-order approximation of a distant object undergoing three-dimensional translation and linear deformation; a sketch of such a model appears after this list.
  • An inter-frame segmentation process can be employed to segment a frame into homogeneous regions. Unlike in the starting frame where all regions are tagged as new, these regions are classified in the inter-frame segmentation process as either foreground, background or new, according to their relationship with the existing foreground and background regions of the previous frame.
  • the estimated, or projected, object boundary can then be used to group the tagged regions into the object and background. For regions around the object boundary with the tag "new," their visual features are also examined to determine whether they belong to the object or not.
  • In a motion projection module 402, segmented regions from the previous frame, including both foreground and background regions, are projected onto the current frame (virtually) using their individual affine motion models. Projected regions keep their labels and original classifications. For video shots with a static or homogeneous background (i.e., only one moving object), users can choose not to project background regions to save time and processing overhead. An expanded bounding box of all projected foreground regions can be computed. Similarly, the following segmentation and aggregation processes are generally applied only to the area inside the bounding box.
  • the edge map 404, color map 406 and motion field 408 are generally created using the same methods as described above.
  • the existing color palette computed at the starting frame is directly used to quantize the current frame.
  • In this way, the color consistency of segmented regions between successive frames is enhanced, which improves the performance of region-based tracking.
  • As object tracking is limited to single video shots in which there is no abrupt scene change, using one color palette is generally valid.
  • a new quantization palette can be generated automatically, such as when a large quantization error is encountered.
  • three feature maps are preferably fused to track existing regions and segment the current frame.
  • First, an inter-frame labeling process is performed in which non-edge pixels are sequentially labeled, such as by labeling the pixels one by one from left to right and top to bottom (a sketch of this labeling loop appears after this list).
  • If a pixel has the same color as that of its labeled neighboring pixels, it is assigned the label of its neighbors.
  • Otherwise, its color is compared with all projected regions that cover its coordinate at the current frame. If the pixel's color distance to the closest region is under the given color threshold, this pixel is "tracked," and assigned the label and classification (foreground or background) of the closest projected region. Otherwise, a new label is generated and assigned to the pixel. Regions identified by this new label are classified as "new" regions.
  • interframe color merge, edge labeling, motion split and small region elimination processes which take place in the tracking module 410 are similar to those previously described, with some additional constraints. Foreground or background regions tracked from the previous frame are allowed to be merged only with regions of the same class or new regions. Merging of the foreground and background regions is forbidden. New regions can be merged with each other or merged with foreground/background regions. When a new region is merged with a tracked region, the merging result inherits the label and classification from the tracked region. In motion segmentation, split regions remain in their original classes. After the inter-frame tracking process 410, a list of regions temporarily tagged as either foreground, background, or new is obtained.
  • the temporarily tagged regions are then passed to an iterative region aggregation module 412.
  • the region aggregation module 412 receives two inputs: the homogeneous regions from tracking module 410 and the estimated object boundary from the motion projection module 402.
  • the object boundary can be estimated from projected foreground regions. Foreground regions from the previous frame are projected independently of one another and the combination of projected regions forms the mask of the estimated object.
  • the mask can be refined using a morphological closing operation (i.e. dilation followed by erosion) with a size of several pixels in order to close tiny holes and smooth boundaries. To tolerate motion estimation error, the result can be further dilated for the tracking buffer size, which is specified by users at the beginning of the tracking process.
  • the region aggregation module 412 implements a region grouping and boundary alignment algorithm based on the estimated object boundary as well as the edge and motion features of the region.
  • An exemplary algorithm includes the following steps (a code sketch of these steps appears after this list):
    1. For every segmented region, do step 2 to step 8.
    2. If the region is tagged as background, keep it as background. Go to step 8.
    3. Compute an intersection ratio of the region with the object mask. Go to step 5 if the region is tagged as new.
    4. If the region (foreground) is covered by the object mask by more than a given threshold, such as about 80%, the region belongs to the semantic object. Otherwise, the region is intersected with the object mask and split: a) split regions inside the object mask are still kept as foreground; b) split regions outside the object mask are tagged as new. Go to step 8.
    5. If the region (new) is covered by the object mask by less than a threshold, such as about 30%, keep it as new. Go to step 8.
    6. If the region is covered by the object mask by more than a threshold, such as about 80%, classify it as foreground. Go to step 8.
    7. Compute numbers of edge pixels (using the edge map) between this region and the current background and foreground regions. Compute differences between the mean motion vector of this region and those of its neighboring regions and find the closest motion neighbor. If the region is separated from background regions by more edge pixels than from foreground regions (or this region is not connected to any background regions) and its closest motion neighbor is a foreground region, intersect it with the object mask and split: a) split regions inside the object mask are classified as foreground; b) split regions outside the object mask are tagged as new. Otherwise, keep the region as new.
    8. Advance to the next region. Go to step 2.
  • In the above steps, the object mask refers to the estimated object projection. A relatively lower ratio (e.g., 80%) can be used to include a foreground or new region; this is to improve handling of motion projection errors.
  • the above aggregation and boundary alignment process can be iterated multiple times. This is useful in correcting errors which can result from the estimation of rapid motion. At the end of the last iteration, all remaining new regions are classified as background regions.
  • affine models of all regions can be estimated by a linear regression process over the optical flow. As described before, these affine models are used to project regions into a future frame in the motion projection module 112.
  • FIG. 8 is a block diagram which illustrates the essential operational components of a system for practicing the image segmentation and tracking methods described herein.
  • the system can take the form of a personal computer with an appropriate high-speed processor and video handling peripherals, a suitable digital video editing workstation, or the like.
  • Such a system will generally include a processor 802 which is coupled to a video source 804.
  • the processor 802 generally includes a microprocessor, memory, control logic, and input/output circuits operatively interconnected to perform the described processing operations.
  • the video source 804 can take the form of magnetic data storage media, optical data storage media, streaming video content from a remote source and the like.
  • the input device is used to provide user input for, inter alia, establishing object boundaries and providing threshold assignments for object definition.
  • a display device 808 is also coupled to the processor 802 to provide a visual display of the digital video.
  • the present invention has shown very good segmentation and tracking results on general video sources.
  • As illustrated in Figures 6A-E, five different types of video sequences were used for subjective and objective evaluation of the present system.
  • the first sequence (akiyo), illustrated in Figure 6A, contains an anchor person with small motion and strong color difference with background.
  • the second sequence (foreman) illustrated in Figure 6B contains an object with relatively large motion and very similar brightness to background.
  • In Figure 6C, a skater having many different fast-moving parts (i.e., body, arms and legs) is depicted.
  • the characteristic of the fourth sequence shown in Figure 6D is that the shape of the flying bird changes abruptly from frame to frame.
  • In Figure 6E, a plane is flying away from the camera.
  • FIGS. 6A-E show object tracking results of the five test sequences after 3 user inputs.
  • segmented objects were superimposed onto gray background with random noise and played in real time for users to see whether there were observable errors.
  • To reduce boundary jitter (i.e., high-frequency noise), a temporal median filtering process was applied to the binary object masks before they were put onto the background with random noise.
  • the following user inputs bring only a small improvement because only a few frames need to be modified.
  • the bird sequence of Figure 6D contains a small and fast moving object.
  • the abrupt shape change between successive frames causes part of the wings to be missed in some frames.
  • the new background region is falsely classified as foreground.
  • the above two types of errors can be corrected by one user input at the frame where the error starts to happen. Restarting the tracking process after the correction will correct errors in the successive frames. As shown in Figures 7A and 7B, for the foreman and skater sequence, the numbers of missing and false pixels drop rapidly after the first 2 user inputs.
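The affine motion model referred to above (the patent's Equation 3, not reproduced in this text) is commonly parameterised as vx = a1 + a2·x + a3·y and vy = a4 + a5·x + a6·y. The Python sketch below is a hedged illustration of how such a model could be fitted by linear regression over the dense optical flow inside a region and then used to project that region into the next frame; the function names, parameter ordering and least-squares formulation are assumptions for illustration, not the patent's exact procedure.

```python
import numpy as np

def estimate_affine(points, flow):
    """Least-squares fit of a 6-parameter affine motion model to dense flow.

    points -- N x 2 array of (x, y) pixel coordinates inside one region
    flow   -- N x 2 array of motion vectors (vx, vy) at those pixels
    Assumed model: vx = a1 + a2*x + a3*y,  vy = a4 + a5*x + a6*y
    """
    ones = np.ones((points.shape[0], 1))
    A = np.hstack([ones, points])                  # N x 3 design matrix [1, x, y]
    ax, *_ = np.linalg.lstsq(A, flow[:, 0], rcond=None)
    ay, *_ = np.linalg.lstsq(A, flow[:, 1], rcond=None)
    return np.concatenate([ax, ay])                # (a1, a2, a3, a4, a5, a6)

def project_region(points, params):
    """Project region pixels into the next frame with the fitted affine model."""
    a1, a2, a3, a4, a5, a6 = params
    x, y = points[:, 0], points[:, 1]
    return np.stack([x + a1 + a2 * x + a3 * y,
                     y + a4 + a5 * x + a6 * y], axis=1)
```

In the tracking stage, one such model would be fitted per region and its pixels projected forward to seed the inter-frame segmentation.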
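The inter-frame pixel labeling loop described above (Fig. 5) can be sketched as follows. This is a simplified, assumed rendering: the data layout, the single left/top neighbour scan and the function name are illustrative, not the patented implementation.

```python
import numpy as np

def interframe_label(luv, edge_mask, projected, color_threshold, distance):
    """Sequential pixel labeling of the current frame, seeded by projected regions.

    luv       -- H x W x 3 quantized L*u*v* color map of the current frame
    edge_mask -- H x W array, 1 for edge pixels (left unlabeled here)
    projected -- list of (label, classification, mean_color, mask) tuples for regions
                 projected from the previous frame; "mask" covers the projected area
    distance  -- color-distance function, e.g. the weighted metric of equation (1)
    Returns (labels, classes); -1 in labels marks unlabeled edge pixels.
    """
    h, w = edge_mask.shape
    labels = np.full((h, w), -1, np.int32)
    classes = {lab: cls for lab, cls, _, _ in projected}
    next_new = max(classes, default=0) + 1
    for y in range(h):                                   # top to bottom
        for x in range(w):                               # left to right
            if edge_mask[y, x]:
                continue
            # Same color as an already-labeled neighbour: inherit its label.
            for ny, nx in ((y, x - 1), (y - 1, x)):
                if ny >= 0 and nx >= 0 and labels[ny, nx] != -1 \
                        and (luv[ny, nx] == luv[y, x]).all():
                    labels[y, x] = labels[ny, nx]
                    break
            if labels[y, x] != -1:
                continue
            # Otherwise try to "track" the pixel from the closest projected region covering it.
            best = None
            for lab, cls, mean_color, mask in projected:
                if mask[y, x]:
                    d = distance(luv[y, x], mean_color)
                    if best is None or d < best[0]:
                        best = (d, lab)
            if best is not None and best[0] < color_threshold:
                labels[y, x] = best[1]                   # tracked: keep label and class
            else:
                labels[y, x] = next_new                  # untracked pixel: new region
                classes[next_new] = "new"
                next_new += 1
    return labels, classes
```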
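The grouping and boundary-alignment steps listed above reduce, per region, to a small decision procedure. The sketch below captures only that decision skeleton under assumed data structures (coverage ratios, edge-contact counts and motion-neighbour classes are presumed to be precomputed); actually splitting a region along the object mask is left as a placeholder, and the thresholds are the example values quoted in the steps.

```python
def aggregate_tracked_regions(regions, mask_ratio, edge_contacts, closest_motion_class,
                              hi=0.80, lo=0.30):
    """Skeleton of the per-region grouping / boundary-alignment decision.

    regions              -- dict: label -> tag, one of "foreground", "background", "new"
    mask_ratio           -- dict: label -> fraction of the region covered by the object mask
    edge_contacts        -- dict: label -> (edge pixels shared with background regions,
                                            edge pixels shared with foreground regions)
    closest_motion_class -- dict: label -> class of the region's closest motion neighbour
    Returns dict: label -> "foreground", "background", "new" or "split" (split along mask).
    """
    result = {}
    for label, tag in regions.items():
        ratio = mask_ratio[label]
        if tag == "background":                                     # step 2
            result[label] = "background"
        elif tag == "foreground":                                   # step 4
            result[label] = "foreground" if ratio > hi else "split"
        elif ratio < lo:                                            # step 5 (new region)
            result[label] = "new"
        elif ratio > hi:                                            # step 6
            result[label] = "foreground"
        else:                                                       # step 7
            bg_edges, fg_edges = edge_contacts[label]
            if bg_edges >= fg_edges and closest_motion_class[label] == "foreground":
                result[label] = "split"   # intersect with mask; inside part -> foreground
            else:
                result[label] = "new"
    return result
```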

Abstract

The segmenting and actively tracking of semantic video objects from a sequence of frames of digital video information involves receiving an initial frame of video image data and at least one succeeding frame of video image data from a sequence of frames of digital video information. From the first frame, at least one object boundary is defined by a user. The initial frame of video image data is then partitioned into a plurality of regions based on selected substantially homogenous video features. Tracking is then performed on the object boundary and plurality of regions which are projected to at least one succeeding frame of the video image data.

Description

SYSTEM AND METHOD FOR SEMANTIC VIDEO OBJECT SEGMENTATION
SPECIFICATION
This application claims priority to United States Provisional Application Serial No. 60/101,675, entitled An Active System and Algorithm for Semantic Video Object Segmentation, filed on September 24, 1998.
FIELD OF THE INVENTION The present invention relates generally to digital image processing and more particularly relates to systems and methods for performing object segmentation and image tracking in digital video image sequences.
BACKGROUND OF THE INVENTION Object segmentation and tracking is a fundamental step for many digital video applications, such as object-based coding used in MPEG-4; content-based video indexing, for example as described in S. F. Chang, et al., "VideoQ: An Automated Content-Based Video Search System Using Visual Cues", ACM 5th Multimedia Conference, Seattle, WA, Nov. 1997; video editing and authoring. There have been several attempts in this field at decomposing images into regions with uniform features. See, for example, M.P. Dubuisson et al., "Contour Extraction of Moving Objects in Complex Outdoor Scenes," 14 Int'l J. of Computer Vision 83-105 (1995); Nikhil R. Pal et al., "A review on Image Segmentation
Techniques," 26 Pattern Recognition 1277-1294 (1993); Mark Tabb et al., "Multiscale Image Segmentation by Integrated Edge and Region Detection," 6 IEEE Trans, on Image Proc. (1997); Alain Tremeau et al., "A Region Growing and Merging Algorithm to Color Segmentation," 30 Pattern Recognition 1191-1203 (1997); and E. Saber et al., "Fusion of Color and Edge Information for Improved Segmentation and Edge Linking," IEEE ICASSP '96. However, no robust techniques for tracking video objects through video sequences are presently available. Known image segmentation techniques include thresholding methods, such as described in the Pal et al. article; region growing methods, for example as described in the Dubuisson et al., Pal et al., and Tremeau et al. articles; split-and-merge techniques as described in "Picture Segmentation by a directed split-and-merge procedure," by S. Horowitz et al., Proc. 2nd Intl. Joint Conf. on Pattern Recognition, pp. 424-433 (1974); watershed techniques, for example as described by Luc Vincent et al., in "Watersheds in Digital Spaces: An efficient Algorithm Based on Immersion Simulations", 13 IEEE Trans. Pattern Analysis & Machine Intelligence (1991); and certain stochastic models, for example as described by Geman, "Stochastic relaxation, Gibbs distributions and the Bayesian Restoration of Images", PAMI-6 IEEE Trans. Pattern Analysis and Machine Intelligence 721-741 (1984). These various approaches extract homogeneous image regions based on visual features such as color, edge or their combinations.
Although there is no unified method that is good for all kinds of images, generally, image segmentation techniques are context independent and can be adapted to various applications. On the contrary, video object tracking techniques are usually directed to meaningful objects (e.g. walking people, moving cars) and are domain specific. As uniform motion often indicates a semantic object in many applications, motion is widely used as the main feature in object segmentation and tracking techniques.
Several approaches have been developed to segment moving objects using either a motion field or optical flow. As the motion field tends to be very noisy in real-world scenes, direct segmentation of optical flow can be erroneous and unstable. Model based motion estimation and segmentation methods, such as the technique described in J.R. Bergen et al., "Hierarchical model-based motion estimation," ECCV-92, Santa Margherita Ligure, pp. 237-252 (1992) may be more stable. In the affine-clustering based algorithm presented in John Y.A. Wang et al., "Representing Moving Images with Layers", 3 IEEE Trans. on Image Processing (1994), motion layers are extracted from the original motion field by iteratively estimating and refining affine models. A method which avoids use of optical flow by estimating motion models and their layer support simultaneously is proposed in A. Ayer et al., "Layered Representation of Motion Video Using Robust Maximum-Likelihood Estimation of Mixture Models and MDL Encoding," Proc. Fifth Int'l Conf. Computer Vision (1995). In this article, the number of layers is decided by a formulation based on a maximum likelihood estimation of mixture models and the minimum description length encoding principle. In Francois G. Meyer et al., "Region Based Tracking using Affine Motion Models in Long Image Sequences," 60 CVGIP: Image Understanding 119-140 (1994), a pursuit algorithm is proposed to track an object based on the multi-resolution estimation of an affine model from the motion field within the object. In general, the above methods concentrate on segmenting moving objects and cannot readily be applied to track static objects or objects with intermittent motion, such as people crossing the street. Furthermore, due to the limited accuracy of motion estimation, motion segmentation generally does not provide clear object boundaries. In the context of machine vision, some research has focused on tracking feature points or contour segments. For example, the methods described in R. Deriche et al., "Tracking line segments", 8 Image & Vision Computing 261-270 (1990) and V.S. Hwang, "Tracking feature points in time-varying images using an opportunistic selection approach", 22 Pattern Recognition 247-256 (1989) generate good results when moving objects have strong and stable localized features such as corners and edges. However, these techniques are highly sensitive to object deformation, object occlusion and image noise.
For the tracking of non-rigid objects, deformable models have been studied. For example, M. Kass, et al., "Snakes: Active contour models," Int'l J. of Computer Vision, pp 321-331 (1988) describes the use of active contour (snakes) as a basic energy-minimizing elastic contour models. Given the energy function, an elastic contour (i.e., object) at the current frame can be moved and deformed iteratively to the best position at the next frame using the snake algorithm. The use of snakes generally requires very accurate initialization, can handle only slow motion, and can be sensitive to textured image regions. Recently, in light of the demand for object tracking in general videos and the requirement of providing more accurate segmentation boundaries, region-based tracking algorithms, which combine common image segmentation techniques with motion estimation methods, have been explored. For example, techniques described in the Dubuisson et al. article, Chuang Gu et al., "Semantic Video Object Segmentation and Tracking Using Mathematical Morphology and Perspective Motion Model," ICIP'97, and in Kui Zhang et al., "Motion Based Image Segmentation for Video Coding," Proc. Int'l Conf. on Image Processing, Washington (ICIP'95) pp.476-479, exploit more region features than feature points, segments or boundary based tracking, and are more robust in handling real-world problems such as occlusion. In the Dubuisson et al. article, an approach to combine motion segmentation using image subtraction with static color segmentation using the split-and-merge paradigm is presented. Color segmentation regions are examined against motion masks to form the final object mask based on certain confidence measures. The Zhang et al. article proposes an algorithm to match edge detection and line approximation results with motion segmentation, and to determine the final object boundary. The Gu et al. article proposes a semantic object tracking system using mathematical morphology and perspective motion. That system uses a modified morphological watershed procedure to segment uncertain areas between the interior and exterior outlines. Flooding seeds are sampled on both interior and exterior outlines, and regions growing from interior seeds define the segmented object boundary. In the first frame, the interior outline is defined by users, and the exterior outline is obtained by dilation of the interior one. For subsequent frames, the interior and exterior outlines are created by erosion and dilation of the motion projected boundary from the previous frame.
The aforementioned region-based techniques may achieve satisfactory results for a specific class of video content, i.e., video which includes only rigid objects exhibiting simple motion. However, these techniques all track a single contour or video object, ignoring possibly complex components and motions within the semantic object. In real-world video sources, a semantic object usually contains several parts of differing motion, some of which are non-rigid and/or exhibit rapid motion changes. Thus, a single motion model is generally inadequate to track a semantic object.
In addition, the above-discussed prior art techniques rely on field motion as the main feature for tracking purposes. Static color or gray-level segmentation is fulfilled separately, and the fusion of the two segmentation results is typically done only at the final stage using certain heuristic rules. Due to the noisy nature of a motion field in real-world scenes, tracking results may also be error prone.
Because no constraints are applied between motion and static segmentation, the generation of a final object mask is difficult when the two results are widely separated. Furthermore, these techniques tend to ignore the background content while tracking the foreground object. This can cause problems in tracking regions near the boundary of the object.
SUMMARY OF THE INVENTION To solve the above problems for real-world video sources, the present invention provides an innovative method that combines low level automatic region segmentation and tracking with an active method for defining and tracking video objects at a higher level. A semantic object is modeled as a set of regions with corresponding spatial and visual features. The present invention directly links the semantic object to its underlying regions and is able to handle many real-world situations, such as complex objects, fast and/or intermittent motion, complicated backgrounds, multiple moving objects and partial occlusion, which cannot be satisfactorily modeled using existing techniques.
The instant technique can be carried out in two stages: an initial object segmentation stage where user input at the starting frame is used to create a semantic object with underlying homogeneous regions; and an object tracking stage where homogeneous regions and the semantic object are tracked through successive frames. At the first stage, the object model is constructed through a region segmentation process and a region aggregation process. The segmentation process decomposes the frame image into a set of non-overlapping regions with homogeneous visual features including color, edge and motion. The aggregation process groups these regions into foreground and background objects according to the initial object boundary given by users. At the second stage, innovative methods are deployed to map regions to the future frame as seeds for an inter-frame region tracking and segmentation process. An estimated object boundary is extracted from the tracked regions and then an aggregation process is used to classify regions into foreground objects or background based on their tracks as well as feature similarity.
In accordance with the present invention, a method for segmenting and actively tracking semantic video objects from a sequence of frames of digital video information, includes the steps of: receiving an initial frame of video image data and at least one succeeding frame of video image data from said sequence of frames of digital video information; defining at least one object boundary in the initial frame of video; partitioning the initial frame of video image data into a plurality of regions based on substantially homogenous video features; and tracking the object boundary and plurality of regions in the at least one succeeding frame of video image data. In such a method, the defining step can include receiving an approximation of the object boundary from a user.
Also, the partitioning step can include edge, color and a motion field as the applicable video features. An aggregation step can be performed after the partitioning step in order to classify the regions as background and foreground regions.
Preferably, the plurality of regions do not overlap. The classification of regions can be determined by a percentage of the region which resides within the object boundary. The classification of regions after a first frame can also be determined by the classification of the region in a previous frame. The classification of regions can also be determined by features such as color, motion and edge.
BRIEF DESCRIPTION OF THE DRAWING Further objects, features and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the invention, in which Fig. 1 is a system diagram of an illustrative embodiment of the present invention;
Fig. 2 is a block diagram of an object segmentation system in accordance with a preferred embodiment of the present invention; Fig. 3 is a flow diagram of an iterative region merging process useful in the object segmentation process of Fig. 2;
Fig. 4 is a block drawing of an automatic semantic object tracking process in accordance with a preferred embodiment of the present invention;
Fig. 5 is an illustrative diagram showing the inter-frame pixel labeling process useful in the process of Fig. 4;
Fig. 6 is an illustrative drawing showing the results of the application of the present invention to five different video sequences;
Figs. 7a and 7b are graphs plotting the average number of false pixels and missed pixels, respectively, against the number of user inputs for the video sequences shown in Fig. 6; and
Fig. 8 is a block diagram of a computer system suitable for practicing the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS Figure 1 illustrates an embodiment of a system which combines automatic region segmentation methods with an active method for defining and tracking video objects at a higher level. The system generally contains two stages: an initial object segmentation stage, where user input 100 at the starting frame 102 is used to create a semantic object with underlying homogeneous regions 104; and an object tracking stage 110, where homogeneous regions 104 and the object are tracked through the successive frames.
In the first stage, based on a rough object boundary provided by user input in an object definition block 100 at the starting frame 102, the semantic object is generated through a region segmentation process 106 and aggregation process. A region segmentation method can be applied to an area inside a slightly expanded bounding box of the user-specified object which effectively fuses color and edge features in a region-growing process that produces homogeneous color regions with accurate boundaries.
To extract homogeneous regions 104 in both color and motion, motion segmentation based on a dense motion field can be used to further split the color regions. As color segmentation is more robust and gives more accurate region boundaries, it is beneficial if color segmentation is applied before motion segmentation. Motion is generally used as a secondary feature due to the noisy nature of the motion field. This is important for the following region aggregation procedure 108 where homogeneous regions are classified as either foreground or background to form the semantic object with an accurate boundary. Region aggregation 108 is based on the coverage of each region by the initial object boundary: regions that are covered by more than a certain percentage are grouped into the foreground object. The final contour of the semantic object is computed from foreground regions. Foreground regions belonging to the object and background regions are stored and can be tracked over time in the successive frames.
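As an illustration of this coverage-based grouping at the starting frame, the following Python sketch builds a foreground mask from the segmented regions; the 90% and 30% cut-offs echo the example percentages quoted elsewhere in this document, and the function and argument names are illustrative assumptions rather than the patent's terminology.

```python
import numpy as np

def aggregate_initial_regions(label_map, object_mask, hi=0.90, lo=0.30):
    """Build a per-pixel foreground mask from homogeneous regions.

    label_map   -- 2-D int array, one label per homogeneous region (-1 = unlabeled)
    object_mask -- 2-D bool array, True inside the user-defined object outline
    hi, lo      -- example coverage thresholds (e.g. 90% and 30%)
    """
    fg_mask = np.zeros_like(object_mask, dtype=bool)
    for label in np.unique(label_map):
        if label < 0:
            continue                               # skip unlabeled (edge) pixels
        region = (label_map == label)
        coverage = np.count_nonzero(region & object_mask) / np.count_nonzero(region)
        if coverage >= hi:                         # mostly covered: whole region is foreground
            fg_mask |= region
        elif coverage <= lo:                       # barely covered: whole region is background
            continue
        else:                                      # in between: split along the initial outline
            fg_mask |= (region & object_mask)
    return fg_mask
```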
Tracking at both the region and object levels is the main task in the second stage of processing. Segmented regions from the previous frame are projected to the current frame using their individual affine motion models. An expanded bounding box including all projected foreground regions can then be computed. Then the area inside the bounding box can be split into homogeneous color and motion regions following a region tracking process. Unlike in other existing approaches, projected regions are not used directly as the new segmentation. Rather, the projected regions are used as seeds in another color based region growing process, which is similar to the fusing algorithm in the first stage, to track existing regions. Pixels that cannot be tracked from any old regions are labeled as new regions. Thus, the resulting homogeneous regions are tagged as either foreground, which are tracked from a foreground region, or background, which are tracked from a background region, or new, which are not tracked. The homogeneous regions 104 are then passed to a region aggregation process 108 and are classified as either belonging to the foreground object or the background. Here, instead of being provided directly by user input, the approximated object boundary is obtained from the projected foreground regions. To better handle possible motion estimation errors, the region aggregation process 108 can be carried out iteratively. The object contour can then be computed from foreground regions and all regions are advanced to the motion projection process 112 for a succeeding frame 114 of video image data.
Figure 2 is a system diagram further outlining a semantic object segmentation operation at an initial frame. Semantic object segmentation preferably utilizes four processes which are discussed below, i.e., object definition and threshold specification, feature map generation, region segmentation, and object composition.
For the first process, object definition 202 and threshold specification, users initially define a semantic object by using a tracing interface, such as a digital pointer, light pen, touch screen, mouse or the like. The input can be a polygon whose vertices and edges are roughly placed along the desired object boundary. To improve system tolerance to user-input error, a snake algorithm 204, such as that described in the Kass et al. article, which is hereby incorporated by reference in its entirety, can be used to align the user-specified polygon to the actual object boundary (i.e. edges). (The snake algorithm can be based on minimizing a specific energy function associated with edge pixels.) Users can also choose to skip the snake module 204 if a relatively accurate outline of the object is provided. The final accurate object boundary will be obtained by the subsequent region segmentation and aggregation processes.
After the object definition process 202, users can start the tracking process by specifying a set of thresholds. These thresholds include a color merging threshold, weights on the three color channels, a motion merging threshold and a tracking buffer size. These thresholds can be chosen based on the characteristic of a given video shot and experimental results. For example, for a video shot where foreground objects have similar luminance with background regions, users may put a lower weight on the luminance channel. Users can perform the tracking process over a small number of frames using predetermined default thresholds which are automatically generated by the system, and then adjust the thresholds based on the segmentation and tracking results. The user can stop the tracking process at any frame, modify the object boundary that is being tracked and then restart the tracking process from the modified frame.
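A minimal sketch of how these user-specified thresholds might be gathered follows; the field names and the default values are illustrative assumptions, not values prescribed by the patent.

```python
from dataclasses import dataclass

@dataclass
class TrackingThresholds:
    """User-tunable parameters gathered before tracking starts (illustrative)."""
    color_merge_threshold: float = 10.0   # max weighted L*u*v* distance for merging
    weight_L: float = 1.0                 # luminance channel weight
    weight_u: float = 2.0                 # chrominance weights (often weighted higher)
    weight_v: float = 2.0
    motion_merge_threshold: float = 2.0   # max motion-vector difference for merging
    tracking_buffer_px: int = 10          # dilation size applied to the projected mask

# Example: de-emphasise luminance for a shot whose foreground and background
# have similar brightness, as suggested in the text above.
params = TrackingThresholds(weight_L=0.5)
```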
Given the initial object boundary from a user, or the snake module, a slightly extended (approximately 15 pixels) bounding box 206 surrounding the arbitrarily shaped object is computed. The following segmentation procedures are fulfilled inside this bounding box.
The second process involves the creation of three feature maps, i.e., an edge map 208, color map 210 and a motion field 212, within the bounding box from the original frame image. The color map is the major feature map in the following segmentation module 214. A process to generate the color map 210 can include the following steps. The original image is converted into CIE L*u*v* color space. It is well known that perceptually non-uniform color spaces such as RGB are not well suited for color segmentation, as distance measure in these spaces is not proportional to perceptual difference. L*u*v* is a perceptually uniform color space which divides color into one luminance channel (L*) and two chrominance channels (u* and v*). This channel separation also allows different weights to be placed on luminance and chrominance distance measures. The color difference is thus given by the weighted Euclidean distance in the three dimensional space, shown in equation (1):
\Delta s = \sqrt{ w_L (\Delta L^*)^2 + w_u (\Delta u^*)^2 + w_v (\Delta v^*)^2 } \qquad (1)

where w_L, w_u and w_v are weights assigned by a user. Experimental results indicate that it is generally preferable to put more weighting on the chrominance channels (e.g., about two times more) rather than the luminance channel.
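For concreteness, a small Python helper implementing the weighted distance of equation (1); the default weights simply follow the suggestion above to weight the chrominance channels roughly twice as heavily as luminance, and the function name is illustrative.

```python
import math

def color_distance(c1, c2, w_L=1.0, w_u=2.0, w_v=2.0):
    """Weighted Euclidean distance between two L*u*v* colors (equation 1)."""
    dL = c1[0] - c2[0]
    du = c1[1] - c2[1]
    dv = c1[2] - c2[2]
    return math.sqrt(w_L * dL ** 2 + w_u * du ** 2 + w_v * dv ** 2)

# Example: distance between two similar L*u*v* colors
print(color_distance((52.0, 10.0, -3.0), (50.0, 12.0, -1.0)))
```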
The L*u*v* image is preferably simplified and smoothed to remove possible noise as well as minor details for the purpose of region merging. This can be accomplished by an adaptive quantization and a median filtering process. Quantization can be performed to produce homogeneous regions in which pixels having similar colors are mapped into a common bin. To quantize an image into a very limited number of colors (e.g., 32 or 16 bins), common fixed-level quantization methods can not always preserve color information under variant situations. Thus, the use of a clustering based method, such as for example a K-means technique, to analyze the input image and determine an adaptive quantization scale in the L*u*v* space is preferred. The above weighted Euclidean distance (Eq. 1) can be used as the distance measure in the clustering process. After quantization, a median filtering process can be applied on each of the L*u*v* channels. Such non-linear median filtering substantially reduces image noise and emphasizes prominent color regions while preserving salient image features, such as edges. This process can be used to produce the final color map 210.
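One possible realisation of this adaptive quantization and smoothing step, sketched with common scientific-Python tools; the use of scikit-learn, scikit-image and SciPy is an assumption of this sketch (the patent names no library), and scaling the channels by the square roots of the weights is one simple way to approximate the weighted distance inside standard K-means.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.ndimage import median_filter
from skimage.color import rgb2luv

def quantized_color_map(rgb_image, n_bins=32, weights=(1.0, 2.0, 2.0), median_size=3):
    """Convert to L*u*v*, quantize to n_bins adaptive colors, then median-filter."""
    luv = rgb2luv(rgb_image)                                  # H x W x 3, perceptually uniform
    w = np.sqrt(np.asarray(weights))
    pixels = luv.reshape(-1, 3) * w                           # emulate the weighted distance
    km = KMeans(n_clusters=n_bins, n_init=4, random_state=0).fit(pixels)
    quantized = (km.cluster_centers_ / w)[km.labels_].reshape(luv.shape)
    # Median filtering on each channel removes noise while preserving edges.
    for ch in range(3):
        quantized[..., ch] = median_filter(quantized[..., ch], size=median_size)
    return quantized
```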
The edge map 208 is a binary mask wherein edge pixels are distinguished from non-edge pixels, such as by setting the edge pixels to 1 and setting the non-edge pixels to 0. The edge map 208 can be generated by applying edge detection 214, such as the Canny edge detection algorithm. The Canny edge detector performs 2-D Gaussian pre-smoothing on the image and then takes directional derivatives in the horizontal and vertical directions. These derivatives are used to calculate a gradient. Local gradient maxima can then be defined as candidate edge pixels. This output is preferably run through a two-level thresholding synthesis process to produce the final edge map 208. A simple algorithm can be used to select the two threshold levels in the synthesis process based on a histogram of the edge gradient.
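As a hedged illustration, the edge map could be produced with an off-the-shelf Canny implementation such as skimage.feature.canny, which already includes Gaussian pre-smoothing, gradient computation, non-maximum suppression and two-level hysteresis; choosing the two thresholds as quantiles of the gradient magnitude stands in for the histogram-based selection described above and is an assumption of the example.

```python
import numpy as np
from skimage.feature import canny

def make_edge_map(luminance, sigma=1.0, low_q=0.7, high_q=0.9):
    """Binary edge mask (1 = edge, 0 = non-edge) from the luminance channel."""
    edges = canny(luminance.astype(float), sigma=sigma,
                  low_threshold=low_q, high_threshold=high_q,
                  use_quantiles=True)   # thresholds as gradient quantiles
    return edges.astype(np.uint8)
```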
The motion field 212 can be generated by a hierarchical block matching algorithm, such as described in M. Bierling, "Displacement Estimation by Hierarchical Block Matching", 1001 SPIE Visual Communication & Image Processing (1988) which is hereby incorporated by reference. This algorithm uses distinct sizes of measurement windows for block matching at different grid resolutions, and thus is able to generate a homogeneous motion field that is closer to the true motion than block based estimation algorithms. A 3-level hierarchy, such as suggested in Bierling, is appropriate for use in the instant process.
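The sketch below is a simplified, unoptimized interpretation of hierarchical block matching: three passes with shrinking measurement windows and search ranges, each initialized by the previous pass. The block sizes, search ranges and the sum-of-absolute-differences criterion are assumptions for illustration, not the specific settings of Bierling or of the described system.

```python
import numpy as np

def block_match(prev, curr, block, search, init_flow=None):
    """One block-matching pass; returns a dense (H, W, 2) displacement field
    mapping each current-frame block back into the previous frame."""
    H, W = prev.shape
    flow = np.zeros((H, W, 2))
    for y0 in range(0, H - block + 1, block):
        for x0 in range(0, W - block + 1, block):
            ref = curr[y0:y0 + block, x0:x0 + block].astype(float)
            if init_flow is None:
                iy = ix = 0
            else:
                iy, ix = init_flow[y0 + block // 2, x0 + block // 2].astype(int)
            best, best_v = np.inf, (iy, ix)
            for dy in range(iy - search, iy + search + 1):
                for dx in range(ix - search, ix + search + 1):
                    ys, xs = y0 + dy, x0 + dx
                    if ys < 0 or xs < 0 or ys + block > H or xs + block > W:
                        continue
                    cand = prev[ys:ys + block, xs:xs + block].astype(float)
                    sad = np.abs(ref - cand).sum()
                    if sad < best:
                        best, best_v = sad, (dy, dx)
            flow[y0:y0 + block, x0:x0 + block] = best_v
    return flow

def hierarchical_block_matching(prev, curr, levels=((32, 8), (16, 4), (8, 2))):
    """Three passes with shrinking measurement windows, each refining the last."""
    flow = None
    for block, search in levels:
        flow = block_match(prev, curr, block, search, init_flow=flow)
    return flow
```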
The region segmentation process is developed based on the three feature maps: the color map 210, the edge map 208 and the motion field 212. Because the color map 210 and edge map 208 are complementary, i.e., the former captures low-frequency information (means) while the latter captures high-frequency details (edges) of an image, fusion of these maps greatly improves segmentation results. Referring to Figure 2, during segmentation 214 a color-based pixel labeling process is applied to the color map. Labeling is a process in which one label is assigned to a group of neighboring pixels which exhibit substantially the same color. To prevent assigning one label to two regions with the same color but separated by edge pixels, only non-edge pixels (i.e., pixels that are set to 0 in the edge mask) are labeled in this process. Edge pixels remain unlabeled. This process generates an initial group of regions (i.e., pixels with the same label) as well as their connection graph. Two regions are linked as neighbors if pixels in one region have neighboring pixels (4-connection) in the other region. As edge pixels are not labeled, two regions separated by edge pixels are not linked as neighboring regions.
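A minimal sketch of this labeling step is given below: a 4-connected flood fill over non-edge pixels that share a quantized color bin, which also records the region connection graph. The data structures and names are illustrative, not part of the original disclosure.

```python
import numpy as np
from collections import deque

def label_color_regions(color_idx, edge_map):
    """4-connected labeling of non-edge pixels sharing a quantized color.

    color_idx : (H, W) integer map of quantized color bins
    edge_map  : (H, W) binary mask, 1 for edge pixels (left unlabeled)
    Returns a label map (0 = unlabeled edge pixel) and the neighbor graph.
    """
    H, W = color_idx.shape
    labels = np.zeros((H, W), dtype=int)
    neighbors = {}                       # label -> set of adjacent labels
    next_label = 1
    for sy in range(H):
        for sx in range(W):
            if edge_map[sy, sx] or labels[sy, sx]:
                continue
            lab, col = next_label, color_idx[sy, sx]
            next_label += 1
            neighbors[lab] = set()
            labels[sy, sx] = lab
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if not (0 <= ny < H and 0 <= nx < W) or edge_map[ny, nx]:
                        continue        # edge pixels never link regions
                    if labels[ny, nx] == 0 and color_idx[ny, nx] == col:
                        labels[ny, nx] = lab
                        queue.append((ny, nx))
                    elif labels[ny, nx] not in (0, lab):
                        # A different region touches this one: record adjacency.
                        neighbors[lab].add(labels[ny, nx])
                        neighbors[labels[ny, nx]].add(lab)
    return labels, neighbors
```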
Referring to Fig. 3, the color merging process is an iterative spatially constrained clustering process. Color distances (Eq. 1) between every two connected regions are computed in block 302. Two connected regions are merged if the color distance between them is 1) smaller than the given color threshold; and 2) a local minimum, i.e., smaller than all other color distances associated with the two regions. Once a new region is generated from two adjoining regions, the mean color of the new region can be computed (block 306) by taking a weighted average of the mean colors of the two old regions m1 and m2 according to the equation:
m_new = (s1·m1 + s2·m2) / (s1 + s2)    (2)
where the sizes of the two old regions, s1 and s2, are used as weights. The region connections are also updated for all neighbors of the two old regions (block 308). The new region can be assigned a label (block 310), such as the label of the larger one of the two merged regions. The two old regions can then be dropped (block 312).
The color merging process can be iterated until color distances between every two connected regions are above the color threshold. As edge pixels are not labeled, two regions separated by edge pixels are not connected as neighbors. Thus, the growth of each region is naturally stopped at its edge pixels. After the color merging process is complete, the edge pixels can simply be assigned to their neighboring regions with the smallest color distances.
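The following sketch illustrates the merging loop under the stated stopping rule. For simplicity it always merges the globally closest adjacent pair below the threshold (which in particular satisfies the local-minimum condition) and updates mean colors with equation (2); the dictionary-based region representation and function name are assumptions of the example.

```python
import numpy as np

def merge_color_regions(mean_color, size, neighbors, dist_fn, color_thresh):
    """Iterative spatially constrained merging of adjacent regions.

    mean_color : dict label -> mean L*u*v* color
    size       : dict label -> pixel count
    neighbors  : dict label -> set of adjacent labels
    dist_fn    : color distance, e.g. the weighted distance of Eq. 1
    """
    while True:
        best = None
        for a in neighbors:
            for b in neighbors[a]:
                if a < b:
                    d = dist_fn(mean_color[a], mean_color[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
        if best is None or best[0] > color_thresh:
            break                                  # all adjacent pairs too dissimilar
        _, a, b = best
        keep, drop = (a, b) if size[a] >= size[b] else (b, a)   # keep larger label
        total = size[a] + size[b]
        mean_color[keep] = (size[a] * np.asarray(mean_color[a]) +
                            size[b] * np.asarray(mean_color[b])) / total   # Eq. 2
        size[keep] = total
        # Re-wire the adjacency graph and drop the absorbed region.
        for n in neighbors[drop]:
            neighbors[n].discard(drop)
            if n != keep:
                neighbors[n].add(keep)
                neighbors[keep].add(n)
        del neighbors[drop], mean_color[drop], size[drop]
    return mean_color, size, neighbors
```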
To ensure homogeneous motion, a motion-based segmentation using the dense optical flow can be applied to the segmented color regions to check the uniformity of the motion distribution. A similar iterative spatially constrained clustering process can also be used to group pixels inside a color region according to their motion vectors and the given motion threshold. Because an objective is to track the object, this process need only be applied to those regions that intersect with the initial object boundary.
Relatively small regions can be merged with their neighboring regions based again on the color distance. Generally, this simplifies the segmentation result and reduces the computation complexity.
For object composition, the region aggregation module 220 receives homogeneous regions from the segmentation module 214 and the initial object boundary from the snake module 204 or directly from user input 220. Region aggregation at the starting frame is relatively simple compared with that for the subsequent frames, as all regions are newly generated (not tracked) and the initial outline is usually not far from the real object boundary. A region is classified as foreground if more than a certain percentage (e.g., 90%) of the region is covered by the initial object outline. On the other hand, if less than a certain percentage (e.g., 30%) of a region is covered by the initial object outline, the region is considered background. Regions between the low and high thresholds are split into foreground and background regions according to their intersection with the initial object outline.
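A minimal sketch of this starting-frame aggregation rule, with the 90%/30% coverage thresholds as defaults, might look as follows; label 0 is assumed to mean an unassigned (edge) pixel, and the function name is illustrative.

```python
import numpy as np

def aggregate_start_frame(labels, object_mask, high=0.9, low=0.3):
    """Classify regions at the starting frame by overlap with the initial
    outline; object_mask is a binary image of the object interior."""
    foreground, background, split = [], [], []
    for lab in np.unique(labels):
        if lab == 0:                       # 0 = unlabeled / edge pixels
            continue
        region = labels == lab
        coverage = (region & object_mask).sum() / region.sum()
        if coverage >= high:
            foreground.append(lab)
        elif coverage <= low:
            background.append(lab)
        else:
            split.append(lab)              # split along the outline afterwards
    return foreground, background, split
```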
After region aggregation 220, affine estimation 222 is performed. Affine motion parameters of all regions, including both foreground and background, can be estimated by a multivariate linear regression process over the dense optical flow inside each region. Preferably, a two dimensional affine model with six parameters, as set forth in Equation 3, can be used as a good first-order approximation of a distant object undergoing three dimensional translation and linear deformation:
x' = sx·x + rx·y + tx
y' = ry·x + sy·y + ty    (3)
where [x, y] and [x', y'] are the original and transformed coordinates, respectively, and tx, ty, sx, sy, rx and ry are the six affine transformation parameters. These affine models will be used to track the regions and the object in the subsequent frames.
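For illustration, the six parameters can be obtained by ordinary least squares over the flow vectors of a region, as sketched below; the coordinate convention and the returned parameter layout follow the reconstruction of equation (3) above and are otherwise assumptions of the example.

```python
import numpy as np

def estimate_affine(xs, ys, us, vs):
    """Least-squares fit of the six-parameter affine model (Eq. 3) to the
    dense optical flow of one region.

    xs, ys : pixel coordinates inside the region
    us, vs : optical-flow displacements at those pixels
    Returns the 2x3 matrix [[sx, rx, tx], [ry, sy, ty]].
    """
    A = np.column_stack([xs, ys, np.ones_like(xs)]).astype(float)
    row_x, *_ = np.linalg.lstsq(A, np.asarray(xs, float) + us, rcond=None)
    row_y, *_ = np.linalg.lstsq(A, np.asarray(ys, float) + vs, rcond=None)
    return np.vstack([row_x, row_y])     # rows: (sx, rx, tx) and (ry, sy, ty)
```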
Given the object with homogeneous regions constructed at the starting frame, tracking at both the region and object levels is the main task in the successive frames. The objective of the tracking process is to avoid losing foreground regions and also to avoid including false background regions. This task can be performed through the following steps. An inter-frame segmentation process can be employed to segment a frame into homogeneous regions. Unlike in the starting frame, where all regions are tagged as new, these regions are classified in the inter-frame segmentation process as either foreground, background or new, according to their relationship with the existing foreground and background regions of the previous frame. In the region aggregation process, the estimated, or projected, object boundary can then be used to group the tagged regions into the object and background. For regions around the object boundary with the tag "new," their visual features are also examined to determine whether or not they belong to the object.
As shown in Figure 4, by way of a motion projection module 402, segmented regions from the previous frame, including both foreground and background regions, are projected onto the current frame (virtually) using their individual affine motion models. Projected regions keep their labels and original classifications. For video shots with a static or homogeneous background (i.e. only one moving object), users can choose not to project background regions to save time and processing overhead. An expanded bounding box of all projected foreground regions can be computed. As before, the following segmentation and aggregation processes are generally applied only to the area inside the bounding box.
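A minimal sketch of this projection step is shown below: each region's pixels are warped with its affine model and rasterized into a mask for the current frame. The nearest-pixel rounding and clipping to the frame are simplifications of the example, and the affine layout matches the estimate_affine() sketch above.

```python
import numpy as np

def project_region(labels, lab, affine, out_shape):
    """Warp one region's pixels into the next frame with its affine model.

    The projected mask keeps the region's label and foreground/background
    classification (handled by the caller in this sketch).
    """
    ys, xs = np.nonzero(labels == lab)
    pts = np.column_stack([xs, ys, np.ones_like(xs)]).astype(float)
    proj = pts @ affine.T                                  # (N, 2) -> (x', y')
    xp = np.clip(np.round(proj[:, 0]).astype(int), 0, out_shape[1] - 1)
    yp = np.clip(np.round(proj[:, 1]).astype(int), 0, out_shape[0] - 1)
    mask = np.zeros(out_shape, dtype=bool)
    mask[yp, xp] = True
    return mask
```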
Generation of the projected feature maps, the edge map 404, color map 406 and motion field 408, generally utilizes the same methods as described above. The main difference is that, in the quantization step, the existing color palette computed at the starting frame is used directly to quantize the current frame. Using a consistent quantization palette enhances the color consistency of segmented regions between successive frames, which improves the performance of region-based tracking. As object tracking is limited to single video shots in which there is no abrupt scene change, using one color palette is generally valid. If necessary, a new quantization palette can be generated automatically, such as when a large quantization error is encountered.

With reference to Figures 4 and 5, in the tracking module 410, the three feature maps are fused, based on the projected foreground and background regions, to track existing regions and segment the current frame. First, an inter-frame labeling process is performed in which non-edge pixels are sequentially labeled, such as by labeling the pixels one by one from left to right and top to bottom. As in the common labeling process, if a pixel has the same color as that of its labeled neighboring pixels, it is assigned the label of its neighbors. When the color of a pixel differs from the colors of its labeled neighboring pixels, its color is compared with all projected regions that cover its coordinate in the current frame. If the pixel's color distance to the closest region is under the given color threshold, the pixel is "tracked" and assigned the label and classification (foreground or background) of the closest projected region. Otherwise, a new label is generated and assigned to the pixel. Regions identified by such new labels are classified as "new" regions.
The inter-frame color merging, edge labeling, motion split and small region elimination processes which take place in the tracking module 410 are similar to those previously described, with some additional constraints. Foreground or background regions tracked from the previous frame are allowed to be merged only with regions of the same class or with new regions. Merging of foreground and background regions is forbidden. New regions can be merged with each other or with foreground or background regions. When a new region is merged with a tracked region, the merging result inherits the label and classification of the tracked region. In motion segmentation, split regions remain in their original classes. After the inter-frame tracking process 410, a list of regions temporarily tagged as either foreground, background or new is obtained. The temporarily tagged regions are then passed to an iterative region aggregation module 412.

As shown in Figure 4, the region aggregation module 412 receives two inputs: the homogeneous regions from the tracking module 410 and the estimated object boundary from the motion projection module 402. The object boundary can be estimated from the projected foreground regions. Foreground regions from the previous frame are projected independently of one another, and the combination of the projected regions forms the mask of the estimated object. The mask can be refined using a morphological closing operation (i.e. dilation followed by erosion) with a size of several pixels in order to close tiny holes and smooth boundaries. To tolerate motion estimation error, the result can be further dilated by the tracking buffer size, which is specified by the user at the beginning of the tracking process.
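The mask refinement can be expressed with standard morphology, for example with SciPy as sketched below; the structuring-element sizes are placeholders, with the dilation radius standing in for the user-specified tracking buffer.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_dilation

def refine_projected_mask(projected_fg_masks, close_size=3, buffer_size=5):
    """Combine projected foreground regions into the estimated object mask,
    close small holes, and dilate by the tracking buffer size."""
    mask = np.logical_or.reduce(projected_fg_masks)
    mask = binary_closing(mask, structure=np.ones((close_size, close_size)))
    mask = binary_dilation(mask, structure=np.ones((buffer_size, buffer_size)))
    return mask
```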
The region aggregation module 412 implements a region grouping and boundary alignment algorithm based on the estimated object boundary as well as the edge and motion features of the region. An exemplary algorithm includes the following steps:
1. For every segmented region, perform steps 2 to 8.
2. If the region is tagged as background, keep it as background. Go to step 8.
3. Compute an intersection ratio of the region with the object mask. Go to step 5 if the region is tagged as new.
4. If the region (foreground) is covered by the object outline by more than a given threshold, such as about 80%, the region belongs to the semantic object. Otherwise, the region is intersected with the object mask and split: a) split regions inside the object mask are still kept as foreground; b) split regions outside the object mask are tagged as new. Go to step 8.
5. If the region (new) is covered by the object mask by less than a threshold, such as about 30%, keep it as new. Go to step 8.
6. If the region is covered by the object mask by more than a threshold, such as about 80%, classify it as foreground. Go to step 8.
7. Compute the numbers of edge pixels (using the edge map) between this region and the current background and foreground regions. Compute the differences between the mean motion vector of this region and those of its neighboring regions, and find the closest motion neighbor. If the region is separated from background regions by more edge pixels than from foreground regions (or the region is not connected to any background regions), and its closest motion neighbor is a foreground region, intersect it with the object mask and split: a) split regions inside the object mask are classified as foreground; b) split regions outside the object mask are tagged as new. Otherwise, keep the region as new.
8. Advance to the next region. Go to step 2.
Here the object mask refers to the estimated object projection. Compared with the aggregation process described above, a relatively lower ratio (e.g., 80%) can be used to include a foreground or new region. This is to improve handling of motion projection errors. As it is possible to have multiple layers of new regions emerge between the foreground and the background, the above aggregation and boundary alignment process can be iterated multiple times. This is useful in correcting errors which can result from the estimation of rapid motion. At the end of the last iteration, all remaining new regions are classified into background regions.
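A condensed sketch of one pass of steps 1-8 is given below. The per-region edge-pixel counts and closest motion neighbor are assumed to be precomputed, a region marked "split" would subsequently be intersected with the object mask as described in steps 4 and 7, and the "not connected to any background regions" case of step 7 is omitted for brevity; all helper names and data layouts are illustrative.

```python
import numpy as np

def align_regions(labels, tags, object_mask, edge_counts, closest_motion,
                  fg_ratio=0.8, new_ratio=0.3):
    """One pass of the region grouping / boundary alignment steps above.

    tags           : dict label -> "foreground" | "background" | "new"
    edge_counts    : dict label -> {"fg": n, "bg": n} edge pixels to each class
    closest_motion : dict label -> label of the closest motion neighbor
    """
    result = {}
    for lab, tag in tags.items():
        region = labels == lab
        cov = (region & object_mask).sum() / region.sum()
        if tag == "background":
            result[lab] = "background"                       # step 2
        elif tag == "foreground":
            result[lab] = "foreground" if cov > fg_ratio else "split"   # step 4
        else:                                                # "new" region
            if cov < new_ratio:
                result[lab] = "new"                          # step 5
            elif cov > fg_ratio:
                result[lab] = "foreground"                   # step 6
            elif (edge_counts[lab]["bg"] > edge_counts[lab]["fg"] and
                  tags.get(closest_motion.get(lab)) == "foreground"):
                result[lab] = "split"                        # step 7
            else:
                result[lab] = "new"
    return result
```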
Finally, affine models of all regions, including both foreground and background, can be estimated by a linear regression process over the optical flow. As described before, these affine models are used to project regions into a future frame in the motion projection module 112.
The methods and modules described in Figures 1-4 can be implemented on conventional computer-based systems operating under the control of appropriate software designed in accordance with the present invention. Figure 8 is a block diagram which illustrates the essential operational components of a system for practicing the image segmentation and tracking methods described herein. The system can take the form of a personal computer with an appropriately high-speed processor and video handling peripherals, a suitable digital video editing workstation, and the like. Such a system will generally include a processor 802 which is coupled to a video source 804. The processor 802 generally includes a microprocessor, memory, control logic, and input/output circuits operatively interconnected to perform the described processing operations. The video source 804 can take the form of magnetic data storage media, optical data storage media, streaming video content from a remote source, and the like. An input device 806, such as a digital pointer, light pen, computer "mouse" and the like, is coupled to the processor 802. The input device is used to provide user input for, inter alia, establishing object boundaries and providing threshold assignments for object definition. A display device 808 is also coupled to the processor 802 to provide a visual display of the digital video.
EXAMPLES

The present invention has shown very good segmentation and tracking results on general video sources. As shown in Figures 6A-E, five different types of video sequences were used for subjective and objective evaluation of the present system. The first sequence (akiyo), illustrated in Figure 6A, contains an anchor person with small motion and a strong color difference from the background. The second sequence (foreman), illustrated in Figure 6B, contains an object with relatively large motion and very similar brightness to the background. In the third sequence, Figure 6C, a skater having many different fast-moving parts (i.e. body, arms and legs) is depicted. The characteristic of the fourth sequence, shown in Figure 6D, is that the shape of the flying bird changes abruptly from frame to frame. In the last sequence, Figure 6E, a plane is flying away from the camera. One hundred successive frames were used for each sequence in the experiments. After the first user input at the starting frame, subsequent user inputs are applied at frames where the largest tracking errors occur or start to occur. Generally, the number of frames tracked after earlier user inputs is larger than that after later user inputs. Figures 6A-E show object tracking results of the five testing sequences after 3 user inputs. For subjective evaluation, segmented objects were superimposed onto a gray background with random noise and played in real time for users to see whether there were observable errors. To remove boundary jitter (i.e. high-frequency noise), a temporal median filtering process (with a radius of 2 frames) was applied to the binary object masks before they were put onto the background with random noise. In the sequence of Figure 6A, there were no noticeable errors after only a single user input at the starting frame. In Figures 6B, 6D and 6E, the sequence of three user inputs provided outputs without noticeable errors. The skater sequence of Figure 6C, which has fast and complex motion, required four user inputs to remove noticeable errors.
In the objective evaluation, objects for the five sequences were extracted with the help of the snake algorithm, and then the average numbers of missing and false pixels over the 100 frames at different numbers of user inputs (from 1 to 10) were computed. The average numbers of missing and false pixels versus the number of user inputs are plotted in Figure 7. The average size of each object is shown in the legends. As shown in the plots, tracking errors are usually reduced to a very low level after three inputs. Considering errors caused by boundary blur (especially for the MPEG-1 sequences with fast motion) and by manual segmentation, the present system and methods generate very good tracking results. The sequence of Figure 6A requires only one user input to obtain a very good tracking result over 100 frames. Missing and false pixels are mainly around the hair, which is not clearly separated from the background by either color or edges. Subsequent user inputs bring very little improvement because only a small number of frames need to be modified. The bird sequence of Figure 6D contains a small and fast-moving object. The abrupt shape change between successive frames causes part of the wings to be missed in some frames. These errors are substantially corrected after 2-3 more user inputs.
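The per-frame error counts used in this evaluation reduce to simple set differences between the tracked object mask and the reference mask, as in the following sketch (the masks are assumed to be binary arrays of equal size, and the function name is illustrative):

```python
import numpy as np

def tracking_errors(tracked_mask, reference_mask):
    """Count missing pixels (in the reference object but not tracked) and
    false pixels (tracked but not in the reference object) for one frame."""
    missing = np.logical_and(reference_mask, ~tracked_mask).sum()
    false = np.logical_and(tracked_mask, ~reference_mask).sum()
    return int(missing), int(false)
```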
For the plane sequence of Figure 6E, after the first input, most false pixels come from a black strip at the left frame border. Part of the strip is merged into the plane in some successive frames. This error is easily corrected after 1-2 more user inputs.
The foreman and skater sequences, Figures 6B and 6C, respectively, have relatively large errors at the first user input due to complex object motion. Missing pixels are mainly caused by the emergence of new foreground regions around the frame border, e.g., the left shoulder in the foreman sequence. These new regions connect with old object regions by only a few pixels when they appear and also abruptly change the shape of the object. Thus they are classified as background regions during the tracking process. False pixels are included mainly because of uncovered background regions enclosed by the object. In the skater sequence shown in Figure 6C, frame 50 has a background region included in the object. The background region is uncovered when the two legs move apart. As it is enclosed by the object and the frame border, and thus not connected to any existing background regions, the new background region is falsely classified as foreground. The above two types of errors can be corrected by one user input at the frame where the error starts to happen. Restarting the tracking process after the correction will correct errors in the successive frames. As shown in Figures 7A and 7B, for the foreman and skater sequences, the numbers of missing and false pixels drop rapidly after the first two user inputs.
Although the present invention has been described in connection with specific exemplary embodiments, it should be understood that various changes, substitutions and alterations can be made to the disclosed embodiments without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims

1. A method for segmenting and actively tracking semantic video objects from a sequence of frames of digital video information, comprising the steps of: a. receiving an initial frame of video image data and at least one succeeding frame of video image data from said sequence of frames of digital video information; b. defining at least one object boundary in the initial frame of video; c. partitioning the initial frame of video image data into a plurality of regions based on substantially homogenous video features; and d. tracking the object boundary and plurality of regions in the at least one succeeding frame of video image data.
2. The method for segmenting and actively tracking semantic video objects of claim 1, wherein said defining step includes receiving an approximation of the object boundary from a user.
3. The method for segmenting and actively tracking semantic video objects of claim 1, wherein the video features of the partitioning step include at least one of edge, color and motion field video features.
4. The method for segmenting and actively tracking semantic video objects of claim 1, wherein the plurality of regions do not overlap.
5. The method for segmenting and actively tracking semantic video objects of claim 1, wherein an aggregation step is performed after the partitioning step to classify the regions as one of background and foreground regions.
6. The method for segmenting and actively tracking semantic video objects of claim 5, wherein the classification of regions is determined at least in part by a percentage of the region which resides within the object boundary.
7. The method for segmenting and actively tracking semantic video objects of claim 5, wherein the classification of regions is determined at least in part by a classification of the region in a previous frame.
8. The method for segmenting and actively tracking semantic video objects of claim 5, wherein the classification of regions around the object boundary is determined at least in part by at least one of a color feature, an edge feature and a motion feature.
9. A system for segmenting and actively tracking semantic video objects from a sequence of frames of digital video information, comprising: a processor, the processor receiving an initial frame of video image data and at least one succeeding frame of video image data from said sequence of frames of digital video information; an input device, said input device being operatively coupled to said processor for defining at least one object boundary in the initial frame of video image data; a display device, said display device being operatively coupled to said processor for displaying the video image data; and wherein said processor partitions the initial frame of video image data into a plurality of regions based on substantially homogenous video features and tracks the object boundary and plurality of regions in the at least one succeeding frame of video image data.
10. The system for segmenting and actively tracking semantic video objects of claim 9, wherein said input device is a user input interface for providing an approximation of the object boundary from a user.
11. The system for segmenting and actively tracking semantic video objects of claim 9, wherein the video features used by the processor include at least one of edge, color and motion field video features.
12. The system for segmenting and actively tracking semantic video objects of claim 9, wherein said processor classifies the regions as one of background and foreground regions.
13. The system for segmenting and actively tracking semantic video objects of claim 12, wherein the processor performs the classification of regions at least in part in accordance with a percentage of the region which resides within the object boundary.
14. The system for segmenting and actively tracking semantic video objects of claim 12, wherein the processor performs the classification of regions at least in part in accordance with a classification of the region in a previous frame of the video image data.
15. The system for segmenting and actively tracking semantic video objects of claim 12, wherein the processor performs the classification of regions around the object boundary at least in part in accordance with at least one of a color feature, an edge feature and a motion feature.
PCT/US1999/022264 1998-09-24 1999-09-24 System and method for semantic video object segmentation WO2000018128A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU62654/99A AU6265499A (en) 1998-09-24 1999-09-24 System and method for semantic video object segmentation
JP2000571664A JP2002525988A (en) 1998-09-24 1999-09-24 System and method for semantic video object segmentation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10167598P 1998-09-24 1998-09-24
US60/101,675 1998-09-24

Publications (1)

Publication Number Publication Date
WO2000018128A1 true WO2000018128A1 (en) 2000-03-30

Family

ID=22285841

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/022264 WO2000018128A1 (en) 1998-09-24 1999-09-24 System and method for semantic video object segmentation

Country Status (3)

Country Link
JP (1) JP2002525988A (en)
AU (1) AU6265499A (en)
WO (1) WO2000018128A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006131970A1 (en) * 2005-06-09 2006-12-14 Techno Visual Corporation Moving picture compression program and moving picture decompression program
CN101911124B (en) * 2007-12-26 2013-10-23 皇家飞利浦电子股份有限公司 Image processor for overlaying graphics object

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0579319A2 (en) * 1992-07-16 1994-01-19 Philips Electronics Uk Limited Tracking moving objects
EP0587329A2 (en) * 1992-09-05 1994-03-16 International Business Machines Corporation Image processing system
WO1998033323A1 (en) * 1997-01-29 1998-07-30 Levent Onural Rule-based moving object segmentation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GEIGER D ET AL: "DYNAMIC PROGRAMMING FOR DETECTING, TRACKING, AND MATCHING DEFORMABLE CONTOURS", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,US,IEEE INC. NEW YORK, vol. 17, no. 3, 1 March 1995 (1995-03-01), pages 294 - 302, XP000498121, ISSN: 0162-8828 *
GUNSEL B ET AL: "TEMPORAL VIDEO SEGMENTATION USING UNSUPERVISED CLUSTERING AND SEMANTIC OBJECT TRACKING", JOURNAL OF ELECTRONIC IMAGING,US,SPIE + IS&T, vol. 7, no. 3, 1 July 1998 (1998-07-01), pages 592 - 604, XP000771766, ISSN: 1017-9909 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7428315B2 (en) 2001-12-03 2008-09-23 Microsoft Corporation Automatic detection and tracking of multiple individuals using multiple cues
US7433495B2 (en) 2001-12-03 2008-10-07 Microsoft Corporation Automatic detection and tracking of multiple individuals using multiple cues
GB2426137A (en) * 2005-05-10 2006-11-15 Thomson Licensing Sa Tracking object contours in a sequence of images
US8027513B2 (en) 2007-03-23 2011-09-27 Technion Research And Development Foundation Ltd. Bitmap tracker for visual tracking under very general conditions
US9720937B2 (en) * 2008-12-22 2017-08-01 Koninklijke Philips N.V. Relevance feedback on a segment of a data object
US10121254B2 (en) 2013-08-29 2018-11-06 Disney Enterprises, Inc. Methods and systems of detecting object boundaries
US9389767B2 (en) 2013-08-30 2016-07-12 Cyberlink Corp. Systems and methods for object tracking based on user refinement input
US9633446B2 (en) 2014-02-20 2017-04-25 Nokia Technologies Oy Method, apparatus and computer program product for segmentation of objects in media content
US20190311480A1 (en) * 2018-04-10 2019-10-10 Facebook, Inc. Automated cinematic decisions based on descriptive models
WO2019199904A1 (en) * 2018-04-10 2019-10-17 Facebook, Inc. Automated decisions based on descriptive models
US10511808B2 (en) 2018-04-10 2019-12-17 Facebook, Inc. Automated cinematic decisions based on descriptive models
US10523864B2 (en) 2018-04-10 2019-12-31 Facebook, Inc. Automated cinematic decisions based on descriptive models
US10659731B2 (en) 2018-04-10 2020-05-19 Facebook, Inc. Automated cinematic decisions based on descriptive models
CN112335256A (en) * 2018-04-10 2021-02-05 脸谱公司 Automatic decision making based on descriptive model
US10979669B2 (en) 2018-04-10 2021-04-13 Facebook, Inc. Automated cinematic decisions based on descriptive models
EP3979199A4 (en) * 2019-05-30 2022-08-03 Panasonic Intellectual Property Management Co., Ltd. Image processing method, image processing apparatus, and program
CN110782469A (en) * 2019-10-25 2020-02-11 北京达佳互联信息技术有限公司 Video frame image segmentation method and device, electronic equipment and storage medium
CN112800850A (en) * 2020-12-31 2021-05-14 上海商汤智能科技有限公司 Video processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
JP2002525988A (en) 2002-08-13
AU6265499A (en) 2000-04-10


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
ENP Entry into the national phase

Ref country code: JP

Ref document number: 2000 571664

Kind code of ref document: A

Format of ref document f/p: F

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase