WO1999005865A1 - Content-based video access

Content-based video access

Info

Publication number
WO1999005865A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
video segment
accessing
vector
keyframe
Application number
PCT/US1998/015063
Other languages
French (fr)
Inventor
Francis Quek
Original Assignee
The Board Of Trustees Of The University Of Illinois
Application filed by The Board Of Trustees Of The University Of Illinois filed Critical The Board Of Trustees Of The University Of Illinois
Publication of WO1999005865A1 publication Critical patent/WO1999005865A1/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00: Details of television systems
    • H04N 5/76: Television signal recording
    • H04N 5/91: Television signal processing therefor

Definitions

  • the field of the invention relates to imaging and more particularly to video imaging.
  • Video is at least 30 pictures per second. Video has the potential of being a rich and compelling source of information.
  • Many challenges have to be met before this vision can be realized.
  • Much research has gone into technologies and standards for video compression, synchronization between video and sound, transport of video over networks and other communication channels and operating system issues of the storage and retrieval of video media on demand. In our research, we address the challenge of how video may be accessed and manipulated in an interactive fashion, and how video may be segmented by content to provide conceptual-level access to the data.
  • Video has to be packaged in a way that makes it amenable for search and rapid browsing.
  • Video is capable of generating vast amounts of raw footage—and such footage is by far the most voluminous and ever growing data trove available.
  • Given the sheer volume of video data it is desirable to automate the majority of the task of video segmentation. Given the variability in the semantic content of video, it is necessary to facilitate user interaction with the segmentation process.
  • the sequential nature of video also makes its access cumbersome.
  • Anyone who owns a video camera is acquainted with the frustration of tapes gathering dust on the shelf because it is too time-consuming to sift through the hours of video.
  • Video Access Interaction and Manipulation
  • Content-Based Video Segmentation
  • The former addresses the higher-level system needs for handling video not as a stream of video frames but as conceptual objects based on content; the latter addresses the challenge of automating the production of these content-based video objects.
  • a method and apparatus are provided for accessing a video segment of a plurality of video frames.
  • the method includes the steps of segmenting the plurality of video frames into a plurality of video segments based upon semantic content and designating a frame of each video segment of the plurality of segments as a keyframe and as an index to the segment.
  • the method further includes the steps of ordering the keyframes and placing at least a portion of the ordered keyframes in an ordered display with a predetermined location of the ordered display defining a selected location.
  • a keyframe may be designated as a selected keyframe.
  • the ordered keyframes may be precessed through the ordered display until the selected keyframe occupies the selected location.
  • FIG. 1 is a block diagram of a content-based video access system in accordance with an embodiment of the invention;
  • FIG. 2 depicts the shot hierarchical organization of the system of FIG. 1;
  • FIG. 3 depicts a shot hierarchy data model of the system of FIG. 1;
  • FIG. 4 depicts a multiply-linked representation for video access of the system of FIG. 1;
  • FIG. 5 depicts a summary representation of a standard keyframe and filmstrip representation of the system of FIG. 1;
  • FIG. 6 depicts a keyframe-annotation box representation of the system of FIG. 1;
  • FIG. 7 depicts a multiresolution animated representation showing shot ordering in boxed arrows of the system of FIG. 1;
  • FIG. 8 depicts an animation sequence in ordered mode of the system of FIG. 1;
  • FIG. 9 depicts an animation sequence in straight line mode of the system of FIG. 1;
  • FIG. 10 depicts a multiresolution animated browser with a magnified view of the system of FIG. 1;
  • FIG. 11 depicts a screen setup for scanning video frames of the system of FIG. 1;
  • FIG. 12 depicts a plot of total search time and animation time versus animation speed for the system of FIG. 1;
  • FIG. 13 depicts histograms showing time clustering for different animation speeds of the system of FIG. 1;
  • FIG. 14 depicts a hierarchical shot representation control panel of the system of FIG. 1;
  • FIG. 15 depicts a VCR-like control panel of the system of FIG. 1;
  • FIG. 16 depicts a timeline representation control panel of the system of FIG. 1;
  • FIG. 17 depicts a hierarchy of events inherent in the processing of video frames by the system of FIG. 1;
  • FIG. 18 depicts VCM output of a pan-right sequence showing the local vector field as fine lines and a dominant vector of a video frame processed by the system of FIG. 1;
  • FIG. 19 depicts a four-quadrant approach to zoom detection of the system of FIG. 1;
  • FIG. 20 depicts VCM output of a zoom-in sequence showing the local vector field as fine lines and a dominant vector of a video frame processed by the system of FIG. 1;
  • FIG. 21 depicts a VCM output of a pan-zoom sequence showing the local vector field as fine lines and a dominant vector of a video frame processed by the system of FIG. 1;
  • FIG. 22 depicts a method of creating an NCM for a given interest point of a simulated video frame processed by the system of FIG. 1;
  • FIG. 23 depicts a vector field obtained using ADC from a video frame processed by the system of FIG. 1;
  • FIG. 24 depicts a vector field obtained using VCM from a video frame processed by the system of FIG. 1;
  • FIG. 25 depicts a global VCM computed for a vertical pan sequence of a video frame processed by the system of FIG. 1;
  • FIG. 26 depicts a vertical camera pan of a video frame analyzed using VCM by the system of FIG. 1;
  • FIG. 27 depicts a vector field obtained without using temporal prediction by the system of FIG. 1;
  • FIG. 28 depicts a vector field obtained using temporal prediction by the system of FIG. 1;
  • FIG. 29 depicts an example of vector clustering combining a moving object and camera pan by the system of FIG. 1;
  • FIG. 30 depicts hand movements detected using VCM and clustering by the system of FIG. 1;
  • FIG. 31 depicts an example of a noisy frame processed using ADC alone by the system of FIG. 1;
  • FIG. 32 depicts another example of a noisy frame, processed using VCM by the system of FIG. 1;
  • FIG. 33 depicts a pure zoom sequence analyzed with VCM by the system of FIG. 1;
  • FIG. 34 depicts a vector field obtained for a pan and zoom sequence by the system of FIG. 1;
  • FIG. 35 depicts a vector field obtained for a camera rotating about its optical axis by the system of FIG. 1.
  • FIG. 1 is a block diagram of a video processing system 10, generally, that may be used to segment a video file based upon a semantic content of the frames. Segmentation based upon semantics allows the system 10 to function as a powerful tool in organizing and accessing the video data.
  • MAR: Multiresolution Animated Representation
  • MMI: man-machine interface
  • FIG. 2 shows the shot architecture of our system that may be displayed on a display 16 of the system 10.
  • FIG. 2 shows a portion of the MAR (to be described in more detail later) at level 0 along with subgroups of the MAR at levels 1 and 2. In the figure, each shot at the highest level (we call this level 0) is numbered starting from 1.
  • Shot 2 is shown to have four subshots which are numbered 2.1 to 2.4. These shots are said to be in level 1 (as are shots 6.1 to 6.3).
  • Shot 2.3 in turn contains three subshots, 2.3.1 to 2.3.3. In our organization, shot 2 spans its subshots. In other words, all the video frames in shots 2.1 to 2.4 are also frames of shot 2. Similarly, all the frames of shots 2.3.1 to 2.3.3 are also in shot 2.3. Hence the same frame may be contained in a shot in each level of the hierarchy. One could, for example, select shot 2 and begin playing the video from its first frame. The frames would play through shot 2, and eventually enter shot 3. If one begins playing from the first frame of shot 2.2, the video would play through shot 2.4 and proceed to shot 3.
  • the system is situated at one level, and only the shots at that level appear in the visual summary representation (to be described later) .
  • the system would be in level 2 although the current frame is also in shot 2.3 and shot 2 in levels 1 and 0 respectively.
  • This organization may, for example, be used to break a video archive of a courtroom video into the opening statements, a series of witnesses, closing statements, jury instructions, and verdict at the highest level.
  • a witness testimony unit may be broken into direct and cross-examination sessions at the next level, and may be further broken into a series of attorney-witness exchanges, etc.
  • FIG. 3 shows the data model which makes up our shot database.
  • Each shot defines its first and final frame within the video (represented by F and L respectively) and its resources for visual summary (the keyframe and filmstrip representations represented by the K) .
  • Each shot unit is linked to its predecessor and successor in the shot sequence.
  • Each shot may also be comprised of a list of subshots. This allows the shots to be organized in a recursive hierarchy.
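The shot data model just described (first and final frame, keyframe resources, predecessor/successor links, and a recursive subshot list) maps naturally onto a small linked record. The following sketch is illustrative only; the class and field names are assumptions, not taken from the patent:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Shot:
    """One node of the recursive shot hierarchy (illustrative sketch)."""
    first_frame: int                  # F: index of the first frame spanned by the shot
    last_frame: int                   # L: index of the final frame spanned by the shot
    keyframe: Optional[int] = None    # K: frame index used for the visual summary
    annotation: str = ""              # classification and keyword text for the shot
    prev: Optional["Shot"] = None     # predecessor in the shot sequence at this level
    next: Optional["Shot"] = None     # successor in the shot sequence at this level
    subshots: List["Shot"] = field(default_factory=list)  # recursive hierarchy

    def spans(self, frame: int) -> bool:
        """A shot spans all frames of its subshots, so containment is a range test."""
        return self.first_frame <= frame <= self.last_frame
```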
  • Interface consistency and metaphor realism are critical to helping the user form a good mental model of both the system of the MAR and the data (e.g., the shots and subshots of the MAR) .
  • Our system features a multiply-linked representation set that provides this consistency and realism. By maintaining this realism and consistency, we provide a degree of tangibleness to the system and video data that allows the user to think of our representational units as 'real' objects. The consistency among the representational components also permits the user to maintain a degree of 'temporal situatedness' through the video elements displayed.
  • the system is made up of four major interlinked representational components:
  • the system interface showing these components is shown in FIG. 4. While all these components involve detailed design and implementation, the key to making them consistent and real is that they are multiply-linked.
  • the hierarchical video shot representation, the VCR-like video screen display, and the timeline representation are updated to show the video segment in each of their representations .
  • the timeline moves to the temporal displacement of the selected shot, the VCR-like video screen displays the first frame in the shot, and the hierarchical representation indicates where the shot is in the shot hierarchy.
  • the fundamental summary element in our system is the keyframe. It has been shown that this simple representation is a surprisingly effective way to provide the user with a sense of the content of a video sequence. Our key contribution is in the way such keyframes are organized and presented within the context of a video access and browsing environment. We provide three summary organizations: the standard keyframe presentation, the keyframe-annotation box representation, and the multiresolution animated representation (MAR).
  • FIG. 5 shows the standard keyframe representation. Each shot is summarized by its keyframe, and the keyframes are displayed in a scrollable table. The keyframe representing the "current shot” is highlighted with a boundary. A shot can be selected as the current shot from this interface by selecting its keyframe with the computer pointing device (e.g., a mouse) .
  • the position of the current shot in the shot hierarchy appears in the shot hierarchy representation
  • the first frame of the shot appears as the "current frame” in the display of the VCR-like interface
  • the timeline representation is updated to show the offsets of the first frame of the shot in the video data.
  • the video can be played in the VCR display by activating the "Play" button on the VCR-like interface.
  • the current keyframe highlight boundary blinks.
  • the frame displayed in the VCR display is the current frame.
  • the next shot becomes the current shot, and the current shot highlight boundary is updated.
  • Beneath the keyframe window is a subwindow showing the annotation information of the current shot.
  • This current frame annotation panel contains the classification of the shot and a textual description that includes a set of keywords.
  • the classification and keywords are dependent on the domain to which the system is applied. For example, for a video database for courtroom archival, the classification may be "prosecution questioning", "testimony", etc. If editing is enabled, the user may modify the annotation content of the current shot.
  • a filmstrip presentation of any shot may be activated by clicking on its keyframe with the keyboard shift key depressed (other key combinations or alternate mouse buttons may be chosen for this function) .
  • the filmstrip representation may be similarly activated from all the visual summary modes.
  • Filmstrip presentation of the current shot may be as shown in FIG. 5. This filmstrip shows every nth frame at a lower resolution (we use the 10th frame in our implementation) in a horizontally scrollable window. The user can select any frame in the filmstrip and make it the current frame. All other representation components are updated accordingly (i.e., the VCR-like display will jump to the frame and enter pause mode and the timeline representation will be updated).
  • the filmstrip representation provides higher temporal resolution access into the video and is particularly useful for editing the shot hierarchy.
  • FIG. 6 shows the visual summary mode that is particularly useful for browsing and updating annotation content.
  • the keyframes of the shots are displayed beside their annotation panel subwindows in a scrollable window.
  • the annotation panel is identical to the current frame annotation panel in the standard keyframe representation.
  • the keyframes behave exactly the same way as in the standard keyframe representation (i.e., selection of the current shot, activation of the filmstrip representation of any shot, and the interlinking with the other representation components). While this representation does not permit the user to see as many keyframes, its purpose is to let the user browse and edit the shot annotations .
  • Annotation information from neighboring shots is available, and one does not have to make a particular shot the current shot to see its annotation or to edit the annotation content.
  • FIG. 7 shows the layout of our MAR.
  • the purpose of using the MAR is to permit more keyframes to fit within the same screen real-estate.
  • Our current implementation shown in FIG. 7 may present 75 keyframes in a 640x480-pixel window. These comprise 72 low-resolution keyframes displayed at 64x48 pixels, two intermediate-resolution keyframes displayed at 128x96 pixels, and one high-resolution keyframe (representing the current shot) displayed at 256x192 pixels. This is in contrast with the standard keyframe representation, which displays 40 keyframes at 128x96 pixels and takes more than twice the screen real-estate (a 750x610-pixel window). Even at the lowest resolution, the thumbnail images provide sufficient information for the user to recognize their contents.
  • the animation control panel above the MAR allows the user to select the animation mode and speed. We shall discuss the different path modes in the next sections. The user is also able to select either the "real image” or "box animation” modes from this panel. The former mode animates the thumbnail images while the latter replaces them with empty boxes during animation.
  • box animation mode may be necessary for machines with less computational capacity. We shall also discuss the need for being able to set the animation speed in a later section.
  • FIG. 7 shows the ordering of the shots in the MAR in the boxed arrows .
  • the highest resolution keyframe in the center is always the current shot keyframe .
  • a highlighting boundary appears around the keyframe (see FIG. 10) . This indicates that the corresponding shot will be selected if the mouse selection button is depressed.
  • the interface animates so that the selected frame moves and expands to the current frame location.
  • the other keyframes move to their respective new locations according to their shot order.
  • FIG. 8 shows the ordered mode animation.
  • the keyframes scroll along the path determined by the shot order shown in FIG. 7, expanding as they reach the center keyframe location and contracting as they pass the center. This animation proceeds until the selected keyframe arrives at the current keyframe location at the center of the interface.
  • Ordered mode animation is useful for understanding the shot order in the browser and for scanning the keyframes by scrolling through the keyframe database. Ordered mode animation is impractical for random shot access because of the excessive time taken in the animation.
  • In accordance with our overall system design, one can access the filmstrip representation in exactly the same way as in the other visual summary modes.
  • The filmstrip, of course, functions the same way no matter which visual summary mode is active when it is launched.
  • the current frame annotation panel is also accessible in the MAR as with the standard keyframe representation.
  • the second animation mode is the straight line animation illustrated in FIG. 9.
  • the system determines the final location of every keyframe within the shot database and animates the motion of each keyframe to its destination in a straight line. If the resolution of the keyframe changes between its origin and destination, the keyframe grows or shrinks in a linear fashion along its animated path. As is evident in the animation sequence shown in FIG. 9, new keyframes move into and some keyframes move off the browser panel to maintain the shot order.
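The straight-line mode amounts to linearly interpolating each keyframe's screen position and size between its origin and destination over a fixed number of animation steps. A minimal sketch under that reading, with illustrative names (the patent does not prescribe an implementation):

```python
def animate_straight_line(origin, destination, steps=30):
    """Yield intermediate (x, y, w, h) rectangles for one keyframe.

    origin, destination: (x, y, w, h) tuples before and after the selection;
    steps corresponds to the selectable animation speed (e.g., 30, 45, 60, 75).
    """
    for i in range(1, steps + 1):
        t = i / steps  # interpolation parameter in (0, 1]
        yield tuple(o + (d - o) * t for o, d in zip(origin, destination))

# Example: a thumbnail growing toward the high-resolution center slot.
for rect in animate_straight_line((10, 10, 64, 48), (192, 144, 256, 192)):
    pass  # in the real interface each rectangle would be drawn per animation tick
```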
  • the advantage of this animation mode is its speed. As will be seen later, the trade-off achieved between animation time and visual search time determines the total time it takes to access a particular shot through the interface.
  • thumbnail images generally provide sufficient information to select the required shot, we have found that it is sometimes convenient to be able to see a higher resolution display of a keyframe before selecting the shot. This may be characterized as a magnifying glass feature. Being able to see a higher resolution display of a keyframe is especially important if the shot keyframes are very similar or where the keyframes are complex. If the mouse pointer stays over a particular keyframe for a specified period (2 seconds in one implementation) , a magnified view of the keyframe appears. This magnified view disappears once the cursor leaves the keyframe.
  • FIG. 10 shows a highlighted keyframe under the cursor and a magnified display of the same keyframe.
  • the first assumption is that the resolution-for-screen real-estate tradeoff gives the user an advantage in viewing greater scope without seriously impairing her ability to recognize the content of the keyframe.
  • the second assumption is that the animation assists the user to select a sequence of shots.
  • the scope advantage of the first assumption is self-evident since one can indeed see more shot keyframes for any given window size using the multiresolution format. Whether the reduction in resolution proves to be a serious impediment is harder to determine. Thumbnail representations are used by such programs as Adobe's Photoshop, which uses much lower resolution images as image file icons on the Macintosh computer. The subjects who have tested our system demonstrated that they can easily recognize the images in the low-resolution thumbnails.
  • We tested the second assumption by conducting a response time experiment which will be detailed in this section. The goal of the experiment was to determine if animation improves user performance, and how different animation speeds may impact selection time.
  • FIG. 11 shows the screen layout of our experiment.
  • Each subject was presented with the interface and a window showing five target keyframes .
  • a window showing the list of targets is displayed above the MAR selection window. Since our experiment tests the situation in which the user knows which images she was interested in, the same target sequence of five keyframes was used throughout the experiment. The subjects were required to select the five target images displayed in order from left to right. The keyframe to be selected is highlighted in the target list window. If the subject makes a correct selection, the interface functions as designed (i.e., the animation occurs). Erroneous selections were recorded. The interface does not animate when an error is made. We shall call each such sequence of five selections a selection set.
  • the experiment was performed by five subjects. Each subject was required to perform 8 sets of selections at each of five animation speeds. Hence each subject made 40 selections per animation speed yielding a total of 32 between selection intervals which were recorded. This constitutes a total of 200 selections and 128 time interval measurements per subject.
  • the subjects rested after each selection set.
  • the fastest animation speed was the 'jump' animation in which the keyframes moved instantaneously to their destinations .
  • animation speed 1 the keyframes moved to their destinations in 30 steps.
  • Animation speeds 2 to 4 correspond to 45, 60, and 75 animation steps respectively.
  • the actual times for each animation are shown in animation time histograms in FIG. 13.
  • FIG. 12 plots the average animation, search and total selection times against animation speed for all subjects.
  • the plots show that the total time for selection decreases at animation speed 1 (from 2.87 sec to 2.36 sec) and increases steadily thereafter (2.78 sec, 3.35 sec and 3.67 sec for animation speeds 2, 3 and 4 respectively).
  • the break-even point for animation appears to be around animation speed 2.
  • Our plots also show that the search time decreases from animation speed 0 to 2. Thereafter, increased animation speed seems to have little effect on search time. The animation time, however, increases steadily from speed 0 to 4.
  • An ANOVA single-factor analysis of the results shows that these averages are reliable (see Table 1).
  • the histograms for each of the measurements for the different animation speeds in FIG. 13 give us an insight to what is happening.
  • selection time is the sum of an animation time, visual search time, and motor action time (time for the user to move the cursor to the target and make the selection) .
  • With animation, we observe that subjects move the cursor during tracking before the animation terminates. This actually permits faster selection with animation.
  • subjects move the cursor as a pointer in tandem with visual search (even at animation time 0 or jump mode) so that it is not easy to separate pure visual search and motion time.
  • the histogram of the total selection time for animation speed 0 (this is the same as search time since animation time is zero) in FIG. 13 is more broadly distributed than for the other animation speeds.
  • animation speed 1 (average of 1.5 sec) was optimal for the subjects who participated in the experiments. We caution, however, that this may not be generalizable to the general population. Our subjects were all between 24 and 35 years of age. It is likely that people from other age groups will have a different average optimal animation speed. Furthermore, for our subjects, animation speeds above animation speed 2 (average of 2.24 sec) did not appreciably decrease search time while it increased total selection time. We observe that at higher animation times, the subjects appear to be waiting for the animation to cease before making their selections . This delay may lead to frustration or a loss of concentration in users of the system if the animation time is too long. For this reason, we have made the animation speed selectable by the user.
  • the Hierarchical Video Shot Representation panel is designed to allow the user to navigate, to view, and to organize video in our hierarchical shot model shown in FIG. 2. It comprises two panels for shot manipulation, a panel which shows the ancestry of the current shot and permits navigation through the hierarchy, a panel for switching among visual summary modes, and a panel that controls the shot display in the visual summary.
  • the hierarchical shot representation is tied intimately to the other representational components. It shows the hierarchical context of the current shot in the visual summary which is in turn ganged to the VCR-like display and the timeline representation.
  • the shot editing panel permits manipulation of the shot sequence at the hierarchical level of the current shot. It allows the user to split the current shot at the current frame position. Since the user can make any frame in the filmstrip representation of the visual summary the current frame, the filmstrip representation along with the VCR-like control panel are useful tools in the shot manipulation process. As illustrated by the graphic icon, the new shot created by shot splitting becomes the current shot. In the same panel, shots may be merged into the current shot and groups of shots (marked in the annotation visual summary interface) may be merged into a single shot.
  • This panel also allows the user to create new shots by setting the first and last frames in the shot and capturing its keyframe.
  • When the "Set Start" and "Set End" buttons are pressed, the current frame (displayed in the VCR display) becomes the start and end frame, respectively, of the new shot.
  • the default keyframe is the first frame in the new shot, but the user can select any frame within the shot by pausing the VCR display at that shot and activating the "Grab Keyframe” button.
  • the second shot manipulation panel is designed for subshot editing. As is obvious from the bottom icons in this panel, it permits the deletion of a subshot sequence (the subshot data remains within the first and last frames of the supershot, which is the current shot) .
  • the "promote subshot” button permits the current shot to be replaced by its subshots, and the "group to subshot” permits the user to create a new shot as the current shot and make all shots marked in the annotation visual summary interface its subshots.
  • the subshot navigation panel in FIG. 14 permits the user to view the ancestry of the current shot and to navigate the hierarchy.
  • the "Down” button in this panel indicates that the current shot has subshots, and the user can descend the hierarchy by clicking on the button. If the current shot has no subshots, the button becomes blank (i.e., the icon disappears) . Since the current shot in the figure (labeled as "Ingrid Bergman") is a top level shot, the "Up” button (above the "down” button) is left blank. The user can also navigate the hierarchy from the visual summary window. In our implementation, if a shot contains subshots, the user can click on the shot keyframe with the right mouse button to descend one level into the hierarchy. The user can ascend the hierarchy by clicking on any keyframe with the middle mouse button. These button assignments are, however, arbitrary and can be replaced by any mouse and key-chord combinations .
  • the hierarchical shot representation panel also permits the user to hide the current shot or a series of marked shots in the visual summary display. This makes it easier for the user to view a greater temporal span of the video in the video summary window, at the expense of access to some shots in the sequence.
  • the hide feature can, of course, be switched off to reveal all shots.
  • FIG. 15 shows the VCR-like control panel that provides a handle to the data using the familiar video metaphor.
  • the user can operate the video just as though it were a video tape or video disk while remaining situated in the shot hierarchy.
  • When the video is played, one might think of the image in the VCR display as the current frame.
  • As the current frame advances (or reverses, jumps, fast forwards/reverses, etc.) across shot boundaries, all the other representation components respond accordingly to maintain situatedness.
  • the user can also loop the video through the current shot and jump to the current shot boundaries .
  • the VCR display will jump to the new current frame and the VCR panel will be ready to play the video from that point.
  • The current frame may also be changed from any other representational component, e.g., by selecting a shot in the visual summary, selecting a frame in a filmstrip representation, navigating the shot hierarchy in the hierarchical shot representation, or changing the temporal offset in the timeline representation.
  • FIG. 16 shows the timeline representation of the current video state.
  • the lower timeline shows the offset of the current frame in the entire video.
  • the vertical bars represent the shot boundaries in the video, and the numeric value shows the percent offset of the current frame into the video .
  • the slider is strongly coupled to the other representational components.
  • the location of the current frame is tracked on the timeline. Any change in the current frame by interaction from either the visual summary representation or the hierarchical shot representation is reflected by the timeline as well.
  • the timeline also serves as a slider control. As the slider is moved, the current frame is updated in the VCR-like representation, and the current shot is changed in the visual summary and hierarchical shot representations .
  • the timeline on the top shows the offset of the current frame in the video sequence of the currently active hierarchical level. It functions in exactly the same way as the global timeline.
  • the current frame is within the first shot in the highest level of the hierarchy, offset 5% into the global video sequence.
  • the same frame is in the eleventh (last) subshot of shot 1 in the highest level, and is offset 96% into the subshots of shot 1.
  • semantic content of a video sequence may include any symbolic or real information element in a frame (or frames) of a video sequence. Semantic content, in fact, may be defined on any of a number of levels.
  • segmentation based upon semantic content may be triggered based upon detection of a scene change. Segmentation may also be triggered based upon camera pans or zooms, or upon pictorial or audio changes among frames of a video sequence.
  • segmentation need not be perfect for it to be useful.
  • a tourist takes six hours of video on a trip. If a system is able to detect every time the scene changed (i.e., the camera went off and on) and divided the video into shots at scene change boundaries, we would have fewer than 200 scene changes if the average time the camera stayed on was about 2 minutes.
  • the tourist would be able to browse this initial shot segmentation using our interface. She would be able to quickly find the shot she took at the Great Wall of China by just scanning the keyframe database. Given that she has a sense of the temporal progression of her trip, this database would prove useful with scene change-based segmentation.
  • the video would now be accessible as higher level objects rather than as a sequence of image frames that must be played. She could further organize the archive of her trip using the hierarchical editing capability of our system.
  • video may be segmented by detecting video events which serve as cues that the semantic content has changed.
  • As in the scene change detection example, one may use some discontinuity in the video content as such a cue.
  • FIG. 17 shows the event hierarchy that one might detect in video data.
  • Domain events are strongly dependent on the domain in which the video is taken.
  • the task may be to detect each time a different person is speaking, and when different witnesses enter and leave the witness box.
  • In a general tourist and home video domain, one might apply a 'drama model' and detect every time a person is added to the scene. One may then locate a specific person by having the system display the keyframes of all 'new actor' shots. The desired person will invariably appear.
  • Camera/photography events are essentially events of perspective change in the video.
  • the two that we detect are camera pans and zooms.
  • We have already discussed scene change events.
  • a scene change is a discontinuity in the video stream where the video preceding and succeeding the discontinuity differ abruptly in location and time.
  • Examples of scene change events in video include cuts, fades, dissolves and wipes. Cuts, by far the most common in raw video footage, involve abrupt scene changes in the video. The other modes may be classified as forms of more gradual scene change.
  • some scene changes are inherently undetectable . For example, one may cut a video stream and restart it viewing exactly the same video content . It is not possible to detect such changes from the video alone. Hence, we modify our definition to include the constraint that scene change events must include significant alteration in the video content before and after the scene change event.
  • Boreczky and Rowe provide a good review and evaluation of various algorithms to detect scene changes. They compare five different scene boundary detection algorithms using four different types of video: television programs, news programs, movies, and television commercials.
  • the five scene boundary detection algorithms evaluated were global histograms, region histograms, running histograms, motion-compensated pixel differences, and discrete cosine transform coefficient differences.
  • Their test data had 2507 cuts and 506 gradual transitions. They concluded that global histograms, region histograms, and running histograms performed well overall. The more complex algorithms actually performed more poorly. In their evaluation, running histograms produced the best results using the criterion of 'least number of transitions missed'. It detected above 93% of the cuts and 88% of the gradual transitions.
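As a concrete illustration of the simplest of these measures, a global histogram comparison declares a cut when the histograms of consecutive frames differ by more than a threshold. The sketch below is a generic version of that idea rather than any of the specific algorithms evaluated by Boreczky and Rowe; the bin count and threshold are arbitrary assumptions:

```python
import numpy as np

def histogram_cut_detector(frames, bins=64, threshold=0.4):
    """Return indices i where a cut is declared between frame i and frame i+1.

    frames: iterable of grayscale images (2-D numpy arrays).
    Dissimilarity is the L1 distance between normalized global histograms.
    """
    cuts, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
        hist = hist / max(hist.sum(), 1)           # normalize to a distribution
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            cuts.append(i - 1)                     # cut between frames i-1 and i
        prev_hist = hist
    return cuts
```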
  • Camera or photography events are related to the change of camera perspective.
  • camera pans and zooms Such perspective changes can be detected by examining the motion field in the video sequence.
  • If the camera pans, we expect the video to exhibit a translation field in the video data in the direction opposite the pan.
  • If the camera zooms in or out, we expect a field diverging from or converging toward the camera focal axis, respectively.
  • Pan-zoom combinations are dominated in the vector field by the pan effect.
  • VCM: Vector Coherence Mapping
  • the dominant vector computed is denoted V_p.
  • the dominant vector may indicate either a global translation field or a dominant object motion in the frame. One may distinguish between the two by noting that the global translation field is distributed across the entire image while the dominant object motion is likely to be clustered. Further, a true translation field is likely to persist across several frames .
  • We detect global translation fields by taking the average of the dot products between all scene vectors and the dominant vector:
  • L_pan is the likelihood that the field belongs to a camera pan at time t.
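The equation itself is not reproduced in this text. A plausible form of the measure described, assuming the local vectors and the dominant vector are unit-normalized and the sum is averaged over the N local vectors, is:

```latex
L_{pan}(t) = \frac{1}{N}\sum_{i=1}^{N} \hat{v}_i^{\,t} \cdot \hat{V}_p^{\,t}
```

A value near 1 indicates a coherent global translation consistent with a camera pan.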
  • FIG. 18 shows the output of VCM for a pan of a computer workbench. The local vectors are shown as fine lines and the dominant translational vector is shown in the thick line in the center of the image.
  • FIG. 19 illustrates our approach for zoom detection. Since we assume pure zooms to converge or diverge from the optical axis of the camera, we detect a zoom by dividing all vectors computed into four quadrants Q1 to Q4. In each quadrant, we take a 45° unit vector in that quadrant, V1 to V4 respectively, and compute the dot product between it and all the vectors in that quadrant. Our likelihood measure for zoom is the average of the absolute values of the dot products:
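Again the equation does not survive in this text. One consistent way to write the measure just described, with the averaging convention an assumption, is:

```latex
L_{zoom}(t) = \frac{1}{N}\sum_{q=1}^{4}\sum_{i \in Q_q} \left| \hat{v}_i^{\,t} \cdot \hat{V}_q \right|
```

where Q_1 to Q_4 are the four quadrants, V_q is the 45° unit vector of quadrant q, and N is the total number of local vectors; a strongly divergent or convergent field yields a value near 1.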
  • FIG. 20 shows the output of VCM for a zoom of the back of a computer workbench. The local vectors are shown as fine lines and the dominant translational vector is shown as the thick line in the center of the image.
  • Table 2 and Table 3 show the first four pan and zoom likelihoods for the pan and zoom sequences of FIG. 18 and FIG. 20 respectively. As can be seen, the pan likelihood is consistently high for the pan sequence and the zoom likelihood is high for the zoom sequence.
  • FIG. 21 shows the VCM output for a pan-zoom sequence.
  • the pan and zoom likelihood results are tabulated in Table 3. As we predicted, the pan effect dominates that of the zoom in the vector field.
  • the pan likelihood values are consistently above 80% throughout the sequence. Since we are interested only in video segmentation, it is not important that the system distinguish between pan-zooms and pure pans.
  • VCM: Vector Coherence Mapping
  • VCM is a completely parallel voting algorithm in "vector parameter space.” The voting is distributed with each vector being influenced by elements in its neighborhood. Since the voting takes place in vector space, it is relatively immune to noise in image space. Our results show that VCM functions under both synthetic and natural (e.g., motion blur) conditions.
  • the fuzzy combination process provides a handle by which high-level constraint information is used to guide the correlation process. Hence, no thresholds need to be applied early in the algorithm, and there are no non-linear decisions (and consequently no errors) to propagate to later processes.
  • VCM extends the state of the art in feature-based optical flow computation.
  • the algorithm is straight-forward and easily implementable in either serial or parallel computation.
  • VCM is able to compute dominant translation fields in video sequences and multiple vector fields representing various combinations of moving cameras and moving objects.
  • the algorithm can compute good vector fields from noisy image sequences. Our results on real image sequences evidence this robustness.
  • VCM has good noise immunity. Unlike other approaches, which use such techniques as M-estimators to enforce robustness, the robustness of VCM lies in the fact that correlation errors owing to noise occur in image space and have little support in the parameter space of the vectors.
  • VCM provides a framework for the implementation of various constraints using fuzzy image processing. A number of other constraints may be added within this paradigm (e.g., color, texture, and other model-based constraints) .
  • VCM is an effective way of generating flow fields from video data. Researchers can use the algorithm to produce flow fields that serve as input to dynamic flow recognition problems like video segmentation and gesture analysis.
  • Barron et al. provide a good review of optical flow techniques and their performance. We shall adopt their taxonomy of the field to put our work in context. They classify optical flow computation techniques into four classes. The first of these, pioneered by Horn and Schunck, computes optical fields using the spatial-temporal gradients in an image sequence by the application of an image flow equation. The second class performs "region-based matching" by explicit correlation of feature points and computing coherent fields by maximizing some similarity measure. The third class comprises "energy-based" methods which extract image flow fields in the frequency domain by the application of "velocity-tuned" filters. The fourth class comprises "phase-based" methods which extract optical flow using the phase behavior of band-pass filters. Barron et al. include zero-crossing approaches such as that due to Hildreth under this category. Under this classification, our approach falls into the second (region-based matching) category.
  • region or feature-based correlation approaches involve three computational stages: pre-processing to obtain a set of trackable regions, correlation to obtain an initial set of correspondences, and the application of some constraints (usually using calculus of variations to minimize deviations from vector field smoothness, or using relaxation labeling to find optimal set of disparities) to obtain the final flow field.
  • the kinds of features selected in the first stage are often related to the domain in which the tracking occurs. Essentially, good features to be tracked should have good localization properties and must be reliably detectable. Shi and Tomasi provide an evaluation of the effectiveness of various features. They contend that the texture properties that make features unique are also the ones that enhance tracking. Tracking systems have been presented that use corners, local texture measures, mixed edge, region and textural features, local spatial frequency along two orthogonal directions, and a composite of color, pixel position and spatiotemporal intensity gradient.
  • Simple correlation using any feature type typically results in a very noisy vector field.
  • Such correlation is usually performed using such techniques as template matching, absolute difference correlation (ADC) , and the sum of squared differences (SSD) .
  • a key trade-off in correlation is the size of the correlation region or template. Larger regions yield higher confidence matches while smaller ones are better for localization.
  • Zheng and Chellappa apply a weighted correlation that assigns greater weights to the center of the correlation area to overcome this problem.
  • a further reference also claims that by applying subpixel matching estimation and using affine predictions of image motion given previous ego-motion estimates, they can compute good ego-motion fields without requiring post processing to smooth the field.
  • constraints include rigid-body assumptions, spatial field coherence, and temporal path coherence. These constraints may be enforced using such techniques as relaxation labeling, greedy vector exchange and competitive learning for clustering. These algorithms are typically iterative and converge on a consistent coherent solution.
  • VCM performs a voting process in vector parameter space and biases this voting by likelihood distributions that enforce the spatial and temporal constraints.
  • VCM is similar to the Hough-based approaches. The difference is that in VCM the voting is distributed and the constraints enforced on each vector are local to the region of the vector. Furthermore, in VCM the correlation and constraint enforcement functions are integrated in such a way that the constraints "guide" the correlation process through the likelihood distribution.
  • the Hough methods apply a global voting space.
  • One reference for example, first computes the set of vectors and estimates the parameters of dominant motion in an image using such a global Hough space. To track multiple objects, one reference divides the image into patches and computes parameters in each patch, and applies M-estimators to exclude outliers from the Hough voting.
  • VCM has good noise immunity. Unlike other approaches which use such techniques as M- estimators to enforce robustness, the robustness of VCM lies in the fact that correlation errors owing to noise occur in image space, and have little support in the parameter space of the vectors .
  • Let P_t be the set of interest points detected in image I_t at time t. These may be computed by any suitable interest operator. Since VCM is feature-agnostic, we apply a simple image gradient operator and select high-gradient points.
  • N(p_j^t) denotes the normalized correlation map (ncm) obtained for interest point p_j^t by correlating the region around p_j^t in image I_t against a search area in the next frame.
  • The weighting applied to neighboring correlation maps falls off with distance and may be written as W_i^t(p_j^t) = s(k_1, k_2, max(|x' - x_j|, |y' - y_j|)) (3), where s is a sigmoidal membership function whose transition region is controlled by the parameters k_1 and k_2.
  • The vcm implements a voting scheme by which neighborhood point correlations affect the vector v_j at point p_j.
  • a global vcm is computed corresponding to the dominant translation occurring in the frame (e.g., due to camera pan) .
  • a vcm can be computed for ANY point in image I_t, whether or not it is an interest point.
  • a vcm built in this way can be used to estimate optical flow at any point, so a dense optic flow field can be computed.
  • H_i^t = S(T_w - ε, T_w + ε, N(p_i^t)) (7)
  • where T_w is a threshold and ε controls the steepness of the sigmoidal function S.
  • Given vcm(p_i^t) and N(p_i^t), we can then apply a fuzzy-AND operation of the two maps, where ⊗ denotes pixel-wise multiplication.
  • This scatter template T(p_i^t) is fuzzy-ANDed with N(p_i^t) to obtain a new temporal ncm Ñ(p_i^t):
  • Ñ(p_i^t) = N(p_i^t) ⊗ T(p_i^t) (11)
  • where ⊗ denotes pixel-wise multiplication. This applies the highest weight to the area of the temporally predicted position of the point.
  • A motion-sensitive edge detector (the motion detector) provides strong evidence for a moving object in a region that does not have an existing vector. This is similar to case 1, except that it applies only to the region of interest, and not to the whole image.
  • When equation 13 yields no suitable match for a vector being tracked, three situations may be the cause: (1) rapid acceleration/deceleration pushed the point beyond the search region; (2) the point has been occluded; or (3) an error occurred in previous tracking.
  • the Vector Coherence Mapping algorithm implementation consists of three main parts:
  • To ensure an even distribution of the flow field across the frame, we subdivide it into 16x16 subwindows and pick the two points with the highest gradient from each subwindow (their gradients have to be above a certain threshold), which gives 600 interest points in our implementation. If a given 16x16 subwindow does not contain any pixel with a high enough gradient magnitude, no pixels are chosen from that subwindow. Instead, more pixels are chosen from other subwindows, so the number of interest points remains constant.
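A minimal sketch of this interest-point selection, assuming 16x16-pixel subwindows and a simple gradient-magnitude operator (the operator, the threshold value, and the make-up policy for weak subwindows are assumptions):

```python
import numpy as np

def select_interest_points(frame, win=16, per_window=2, grad_thresh=30.0):
    """Pick up to `per_window` highest-gradient pixels from each win x win subwindow."""
    gy, gx = np.gradient(frame.astype(float))
    grad = np.hypot(gx, gy)                          # gradient magnitude image
    points, deficit = [], 0
    rows, cols = frame.shape
    for r in range(0, rows - win + 1, win):
        for c in range(0, cols - win + 1, win):
            block = grad[r:r + win, c:c + win]
            order = np.dstack(np.unravel_index(
                np.argsort(block, axis=None)[::-1], block.shape))[0]
            picked = 0
            for br, bc in order[:per_window]:
                if block[br, bc] >= grad_thresh:     # only sufficiently strong gradients
                    points.append((r + br, c + bc))
                    picked += 1
            deficit += per_window - picked           # shortfall from weak subwindows
    if deficit > 0:                                  # make up the count elsewhere
        for idx in np.argsort(grad, axis=None)[::-1]:
            if deficit == 0:
                break
            rc = tuple(np.unravel_index(idx, grad.shape))
            if rc not in points:
                points.append(rc)
                deficit -= 1
    return points
```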
  • the initial set (feature point array) of 600 feature points is computed ONLY for the first frame of the analyzed sequence of frames, and then the set is updated during vector tracking.
  • At least one permanent feature point array may be maintained over the entire sequence of frames.
  • the array is filled with the initial set of feature points found for the first frame of the sequence. After calculating vcm's for all feature points from the permanent array, the array may be updated as follows:
  • the 5x5 region around each point p_j in the current frame is correlated against a 65x49 area of the NEXT frame, centered at the coordinates of p_j, as shown in FIG. 22.
  • the resulting 65x49 array serves as N(p_j).
  • the hot spot found in N(p_j) could be a basis for vector computation, but the vector field obtained is usually noisy (see FIG. 23). This is precisely the result of the ADC process alone.
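The ADC correlation just described can be sketched as follows; the sign convention (negating the sum of absolute differences so that the best match appears as a peak, or "hot spot") and the border handling are assumptions:

```python
import numpy as np

def compute_ncm(curr, nxt, p, tmpl=5, search_w=65, search_h=49):
    """Absolute-difference correlation map N(p) for interior point p = (row, col)."""
    tr, hr, hw = tmpl // 2, search_h // 2, search_w // 2
    template = curr[p[0] - tr:p[0] + tr + 1, p[1] - tr:p[1] + tr + 1].astype(float)
    ncm = np.zeros((search_h, search_w))
    for dr in range(-hr, hr + 1):
        for dc in range(-hw, hw + 1):
            r, c = p[0] + dr, p[1] + dc
            if (r - tr < 0 or c - tr < 0 or
                    r + tr + 1 > nxt.shape[0] or c + tr + 1 > nxt.shape[1]):
                continue                                    # skip off-image positions
            patch = nxt[r - tr:r + tr + 1, c - tr:c + tr + 1].astype(float)
            ncm[dr + hr, dc + hw] = -np.abs(patch - template).sum()
    return ncm - ncm.min()                                  # shift so the map is non-negative
```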
  • the vcm for a given feature point is created according to equations 2 and 4. For efficiency, our implementation considers only the N(p_k)'s of the points p_k within a 65x65 neighborhood of p_j when computing vcm(p_j). Each vcm is then normalized. The vector v_j is computed as the vector starting at the center of the vcm and ending at the coordinates of the peak value of the vcm. If the maximal value in the vcm is smaller than a certain threshold, the hot spot is considered to be too weak, the whole vcm is reset to 0, the vector related to p_j is labeled UNDEFINED, and a new interest point is selected as detailed above.
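Under the same assumptions, the vcm of a point can be sketched as a distance-weighted sum of its neighbors' ncm's, with the peak giving the vector. The weighting and threshold below are simplifications of equations 2, 4 and 7, which are not fully reproduced in this text:

```python
import numpy as np

def compute_vcm(p, points, ncms, neighborhood=65, peak_thresh=0.25):
    """Return (vector, vcm) for point p, or (None, vcm) when the peak is too weak.

    points: list of (row, col) interest points; ncms: dict mapping each point
    to its ncm array (all the same shape).
    """
    half = neighborhood // 2
    vcm, count = None, 0
    for q in points:
        dist = max(abs(q[0] - p[0]), abs(q[1] - p[1]))
        if dist > half:
            continue                                   # outside the 65x65 neighborhood
        weight = 1.0 / (1.0 + dist)                    # simplified distance fall-off
        vcm = weight * ncms[q] if vcm is None else vcm + weight * ncms[q]
        count += 1
    if vcm is None:
        return None, None
    vcm = vcm / count                                  # normalize the accumulated votes
    if vcm.max() < peak_thresh:                        # threshold value is an assumption
        return None, vcm                               # hot spot too weak: UNDEFINED
    peak = np.unravel_index(np.argmax(vcm), vcm.shape)
    center = (vcm.shape[0] // 2, vcm.shape[1] // 2)
    return (peak[0] - center[0], peak[1] - center[1]), vcm
```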
  • Feature drift arises because features in real space do not always appear at integer positions of the image pixel grid. While some attempts to solve this problem by subpixel estimation have been described, we explicitly allow feature locations to vary by integral pixel displacements. We want to avoid assigning a vector v_j to p_j if it does not correspond to a high correlation in N(p_j). Hence, we inspect N(p_j) (the ADC response) to see if the value corresponding to the vcm(p_j) hot spot is above the threshold T_w (see equation 7). Secondly, to improve the tracking accuracy for subsequent frames, we want to ensure that the tracked position of p_j in the next frame is also a feature point.
  • FIG. 24 shows VCM algorithm performance. It shows the same frame as FIG. 23.
  • Dominant translation may be computed according to equation 6. Since we do not want any individual ncm to dominate the computation, they are normalized before summing. Hence, the dominant motion is computed based on the number of vectors pointing in a certain direction, and not on the quality of match which produced these vectors.
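A short sketch of this dominant-translation computation, summing individually normalized ncm's into a global map and reading the offset of its peak (consistent with the description of equation 6, which is not reproduced here):

```python
import numpy as np

def dominant_translation(ncms):
    """Global vcm from a dict of ncm arrays; returns the dominant (d_row, d_col)."""
    global_vcm = None
    for ncm in ncms.values():
        norm = ncm / ncm.max() if ncm.max() > 0 else ncm    # normalize each ncm first
        global_vcm = norm if global_vcm is None else global_vcm + norm
    if global_vcm is None:
        return (0, 0)                                       # no interest points at all
    peak = np.unravel_index(np.argmax(global_vcm), global_vcm.shape)
    center = (global_vcm.shape[0] // 2, global_vcm.shape[1] // 2)
    return (peak[0] - center[0], peak[1] - center[1])
```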
  • FIG. 25 shows the example image of a global vcm computed for the pan sequence shown in FIG. 26.
  • the hot spot is visible in the lower center of FIG. 25. This corresponds to the global translation vector shown as a stronger line in the middle of the image on FIG. 26.
  • the intensity distribution in FIG. 25 is evidence of an aperture problem existing along the arm and dress boundaries (long edges with relatively low texture). These appear as high-intensity ridges in the vcm. In the VCM voting scheme, such edge points still contribute to the correct vector value, and the hot spot is still well defined.
  • Temporal coherence may now be considered. As discussed above, we compute two likelihood maps for each feature point using equations 8 and 13 to compute the spatial and spatial-temporal likelihood maps, respectively.
  • the fact that the scatter template is applied to ncm's and not only vcm's allows the neighboring points' temporal prediction to affect each other.
  • a given point's movement history affects predicted positions of its neighbors. This way, when there is a new feature point in some area and this point does not have ANY movement history, its neighbors (through temporally weighted ncm's) can affect the predicted position of that point.
  • FIG. 27 shows the vector field obtained without temporal prediction
  • FIG. 28 shows vector field obtained for the same data with temporal prediction. Without temporal prediction, we can see a lot of false vectors between objects as they pass each other. Temporal prediction solves this problem.
  • the vectors can be clustered according to three features: origin location, direction and magnitude. The importance of each feature used during clustering can be adjusted. It is also possible to cluster vectors with respect to only two features or one feature. We use a one-pass clustering method. An example of vector clustering is shown in FIG. 29.
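A minimal sketch of such a one-pass clustering, grouping vectors by origin, direction and magnitude with adjustable feature weights; the distance measure and threshold are assumptions, since they are not specified in the text:

```python
import math

def cluster_vectors(vectors, w_origin=1.0, w_dir=20.0, w_mag=1.0, threshold=30.0):
    """One-pass clustering of flow vectors given as (origin_row, origin_col, d_row, d_col).

    Each vector joins the first cluster whose representative lies within
    `threshold` under the weighted distance; otherwise it starts a new cluster.
    """
    clusters = []                                   # each entry: [representative, members]
    for v in vectors:
        r, c, dr, dc = v
        mag, ang = math.hypot(dr, dc), math.atan2(dr, dc)
        target = None
        for cluster in clusters:
            rr, rc_, rdr, rdc = cluster[0]
            rmag, rang = math.hypot(rdr, rdc), math.atan2(rdr, rdc)
            d_origin = math.hypot(r - rr, c - rc_)
            d_dir = abs(math.atan2(math.sin(ang - rang), math.cos(ang - rang)))
            d_mag = abs(mag - rmag)
            if w_origin * d_origin + w_dir * d_dir + w_mag * d_mag < threshold:
                target = cluster
                break                               # one pass: first close-enough cluster
        if target is None:
            clusters.append([v, [v]])
        else:
            target[1].append(v)
    return [members for _, members in clusters]
```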
  • FIG. 23 shows a noisy vector field obtained from the ADC response (ncm's) on a hand motion video sequence.
  • the vector field computed for the same sequence produced a vector field characterized by the smooth field shown in FIG. 24.
  • FIG. 26 presents the performance of VCM on a video sequence with an up-panning camera, and where the aperture problem is evident.
  • FIG. 25 shows the global vcm computed on a frame in the sequence.
  • the bold line in the center of FIG. 26 shows the correct image vector corresponding to an upward pan.
  • FIG. 28 shows the results of the VCM algorithm for a synthetic image sequence in which two identical objects move in opposite directions at 15 pixels per frame.
  • the correct field computed in FIG. 28 shows the efficacy of the temporal cohesion. Without this constraint, most of the vectors were produced by false correspondences between the two objects (FIG. 27) .
  • This experiment also shows that VCM is not prone to boundary oversmoothing .
  • FIG. 29 shows the vector fields computed for a video sequence with a down-panning camera and a moving hand.
  • the sequence contains significant motion blur.
  • VCM and vector clustering algorithms extracted two distinct vector fields with no visible boundary oversmoothing .
  • the subject is gesturing with both hands and nodding his head.
  • Three distinct motion fields were correctly extracted by the VCM algorithm.
  • FIGs. 31 and 32 show the efficacy of the VCM algorithm on noisy data.
  • the video sequence was corrupted with uniform random additive noise to give a S/N ratio of 21.7 dB .
  • FIG. 31 shows the result of ADC correlation (i.e., using the ncm's alone).
  • FIG. 32 shows the vector field computed by the VCM algorithm. The difference in vector field quality is easily visible .
  • FIGs. 33, 34 and 35 show analysis of video sequences with various camera motions .
  • the zoom-out sequence resulted in the anticipated convergent field (FIG. 33) .
  • Fig. 34 shows the vector field for a combined up-panning and zooming sequence.
  • FIG. 35 shows the rotating field obtained from a camera rotating about its optical axis .
  • the algorithm features a voting scheme in vector parameter space, making it robust to image noise.
  • the spatial and temporal coherence constraints are applied using a fuzzy image-processing technique by which the constraints bias the correlation process. Hence, the algorithm does not require the typical iterative second-stage process of constraint application.
  • VCM is capable of extracting vector fields out of image sequences with significant synthetic and real noise (e.g., motion blur) . It produced good results on videos with multiple independent or composite (e.g., moving camera with moving object) motion fields.
  • Our method performs well for aperture problems and permits the extraction of vector fields containing sharp discontinuities with no discernible over-smoothing effects.
  • the technology described in this patent application facilitates a family of applications that involve the organization and access of video data. The commonality of these applications is that they require segmentation of the video into semantically significant units, the ability to access, annotate, refine, and organize these video segments, and the ability to access the video data segments in an integrated fashion. Following are a number of examples of such applications.
  • One application of the techniques described above may be to video databases of sporting events . Professionals and serious amateurs of organized sports often need to study video of sporting events.
  • a game may be organized into halves, series and plays.
  • Each series (or drive) may be characterized by the roles of the teams (offense or defense) , the distance covered, the time consumed, number of plays, and the result of the drive.
  • Each play may be described by the kind of play (passing, rushing, or kicking) , the outcome, the down, and the distance covered.
  • the segmentation may be obtained by analysis of the image flow fields obtained by an algorithm like our VCM to detect the most atomic of these units (the play) .
  • the other units may be built up from these units.
  • VCM facilitates the application of various constraints in the computation of the vector fields.
  • vector fields may be computed for the movement of players on each team.
  • the team on offense (apart from the man-in-motion) must be set for a whole second before the snap of the ball. In regular video this is at least 30 frames.
  • the defensive team is permitted to move.
  • This detection may also be used to obtain the direction, duration, and distance of the play. The fact that the ground is green with yard markers will also be useful in this segmentation.
  • Specialized event detectors may be used for kickoffs, and to track the path of the ball in pass plays. What is important is that a set of domain event detectors may be fashioned for such a class of video. The outcome of this detection is a shot-subshot hierarchy reflecting the game, half, series, and play hierarchy of football. Once the footage has been segmented, our interface permits the refinement of the segmentation using the multiply-linked interface representation. Each play may be annotated, labeled, and modified interactively. Furthermore, since the underlying synchronization of all the interface components is time, the system may handle multiple video viewpoints (e.g. endzone view, blimp downward view, press box view) . Each of these views may be displayed in different keyframe representation windows.
  • the multiresolution representation is particularly useful because it optimizes the use of screen real-estate, and so permits a user to browse shots from different viewpoints simultaneously.
  • the animation of the keyframes in each MAR is synchronized so that when one selects a keyframe from one window to be the centralized current shot, all MAR representations will centralize the concomitant keyframes.
  • the same set of interfaces may be used to view and study the resulting organized video. Another application may be in a court of law.
  • Video may serve as legal courtroom archives either in conjunction with or in lieu of stenographically generated records.
  • the domain events to be detected in the video are the transitions between witnesses and the identity of the speaker (judge, lawyer, or witness).
  • between witnesses, the witness-box is vacant.
  • a witness-box camera may be set up to capture the vacant witness-box before the proceedings and provide a background template from which occupants may be detected by simple image-subtraction change detection.
  • 'Witness sessions' may be defined as time segments between witness-box vacancies.
  • Witness changes must occur in one of these transitions.
  • witness identity may be determined by an algorithm that locates the face and compares facial features. Since we are interested only in witness change, almost any face recognizer will be adequate (all we need is to determine whether the current 'witness session' face is the same as the one in the previous session).
  • a standard first-order approach that compares the face width, face length, mouth width, nose width, and the distance between the eyes and the nostrils, each as a ratio to the eye separation, comes immediately to mind (a sketch of such a comparison appears after this list).
  • the lawyer may be identified by tracking her in the video. Speaker identification may be achieved by detecting the frequency of lip movements and correlating it with the location of sound RMS power amplitude clusters in the audio track.
  • the multiple video streams may be represented in the interface as different keyframe windows. This allows us to organize, annotate and access the multiple video streams in the semantic structure of the courtroom proceedings. This may be a hierarchy of the (possibly cross-session) testimonies of particular witnesses, direct and cross examination, witness sessions, question and witness response alternations, and individual utterances by courtroom participants. Hence the inherent structure, hierarchy, and logic of the proceedings may be accessible from the video content.
  • Home video is another application.
  • One of the greatest impediments to the wider use of home video is the difficulty of accessing potentially many hours of video in a convenient way.
  • a particularly useful domain event in home video is the 'new actor detector' .
  • the most significant moving objects in most typical home videos are people.
  • the same head-detector described earlier for witness detection in our courtroom example may be used to determine if a new vector cluster entering the scene represents a new individual.
  • Home videos can then be browsed to find people by viewing the new-actor keyframes the way one browses a photograph album.
  • the techniques described herein may also be applied to special event videos. Certain special event videos are common enough to warrant the development of dedicated domain detectors for them. An example might be that of formal weddings. Wedding videos are either taken by professional or amateur photographers, and one can imagine the application of our technology to produce annotated keepsakes of such video.
  • the techniques described above also have application in the business environment. A variety of business applications could benefit from our technology. We describe two kinds of video as being representative of these applications. First, business meetings could benefit from video processing. Business meetings (when executed properly) exhibit an organizational structure that may be preserved and accessed by our technology. Depending on the kind of meeting being considered, video segments may be classified as moderator-speaking, member-speaking, voting and presentation.
  • a camera may be trained on her to locate all moderator utterances. These may be correlated with the RMS power peaks in the audio stream. The same process will detect presentations to the membership. Members who speak or rise to speak may be detected by analyzing the image flow fields or by image change detection, and correlated with the audio RMS peaks as discussed earlier. If the moderator sits with the participants at a table, she will be detected as a speaking member. Since the location of each member is a by-product of speaker detection, it is trivial to label the moderator once the video has been segmented.
  • This process will provide a speaker-wise decomposition of the meeting that may be presented in our multiply-linked interface technology.
  • a user may enhance the structure in our hierarchical editor and annotator to group sub-shots under agenda items, proposals and discussion, and amendments and their discussion. If multiple cameras are used, these may be accessed as synchronized multi-view databases as in our previous examples.
  • the product will be a set of video meeting minutes that can be reviewed using our interaction technology. Private copies of these minutes may be further organized and annotated by individuals for their own reference. Certain areas of marketing may also benefit under the embodiments described above. Mirroring the success of desktop publishing in the 1980s, we anticipate immense potential in the production of marketing and business videos.
  • a marketer may use our interaction technology to further organize and annotate the video. These video segments may further be resequenced to produce a marketing video. Home buyers may view a home of interest using our multiply-linked interaction technology to see different aspects of the home. This random-access capability will make home comparison faster and more effective.
  • a specific embodiment of a method and apparatus for providing content-based video access according to the present invention has been described for the purpose of illustrating the manner in which the invention is made and used. It should be understood that the implementation of other variations and modifications of the invention and its various aspects will be apparent to one skilled in the art, and that the invention is not limited by the specific embodiments described. Therefore, it is contemplated to cover by the present invention any and all modifications, variations, or equivalents that fall within the true spirit and scope of the basic underlying principles disclosed and claimed herein.
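As referenced in the face-comparison item above, the following is a minimal sketch of such a first-order facial-feature comparison. It assumes the measurements (face width, face length, mouth width, nose width, eye-to-nostril distance, and eye separation) have already been produced by some face locator; the function names, the dictionary keys, and the tolerance are illustrative assumptions rather than part of the original disclosure.

    def feature_ratios(face):
        """Express each measured facial feature as a ratio to the eye separation,
        so the comparison is insensitive to the apparent size of the face."""
        eye_sep = face["eye_separation"]
        keys = ("face_width", "face_length", "mouth_width", "nose_width", "eye_nostril_distance")
        return [face[k] / eye_sep for k in keys]

    def same_witness(face_a, face_b, tolerance=0.15):
        """First-order test: the two faces are treated as the same witness if every
        feature ratio agrees to within the given relative tolerance."""
        return all(abs(a - b) <= tolerance * max(abs(a), abs(b), 1e-9)
                   for a, b in zip(feature_ratios(face_a), feature_ratios(face_b)))

    # Example usage with hypothetical measurements (in pixels):
    prev_session = {"eye_separation": 60, "face_width": 150, "face_length": 200,
                    "mouth_width": 55, "nose_width": 35, "eye_nostril_distance": 45}
    curr_session = {"eye_separation": 66, "face_width": 168, "face_length": 222,
                    "mouth_width": 60, "nose_width": 38, "eye_nostril_distance": 50}
    witness_changed = not same_witness(prev_session, curr_session)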

Abstract

A method, and apparatus (see figure 1), for accessing a video segment of a plurality of video frames. The method includes the steps of segmenting the plurality of video frames into a plurality of video segments based upon semantic content and designating a frame of each segment of the plurality of segments as a keyframe and as an index to the segment (see figure 7). The method further includes the steps of ordering the keyframes and placing at least a portion of the ordered keyframes in an ordered display with a predetermined location of the ordered display defining a selected location (see figure 7). A keyframe may be designated as a selected keyframe. The ordered keyframes may be processed through the ordered display until the selected keyframe occupies the selected location (see figure 7).

Description

CONTENT-BASED VIDEO ACCESS
Field of the Invention
The field of the invention relates to imaging and more particularly to video imaging.
Background of the Invention It has been said that a picture is worth a thousand words. Video is at least 30 pictures per second. Video has the potential of being a rich and compelling source of information. One can imagine, for example, the use of video transcripts as a replacement for the archaic stenograph for recording the proceedings in a court of law. Many challenges have to be met before this vision can be realized. Much research has gone into technologies and standards for video compression, synchronization between video and sound, transport of video over networks and other communication channels and operating system issues of the storage and retrieval of video media on demand. In our research, we address the challenge of how video may be accessed and manipulated in an interactive fashion, and how video may be segmented by content to provide conceptual-level access to the data. We predict that one of the key impediments to making video a "first class" information source will be the availability of data in a useable form. Video has to be packaged in a way that makes it amenable for search and rapid browsing. Two characteristics inherent to video data stand in the way of its effective use as information conduit: its sheer volume and its sequential nature. Video is capable of generating vast amounts of raw footage—and such footage is by far the most voluminous and ever growing data trove available. In order to convert such raw data into usable information, one has to package the data into semantically significant accessible units. Given the sheer volume of video data, it is desirable to automate the majority of the task of video segmentation. Given the variability in the semantic content of video, it is necessary to facilitate user interaction with the segmentation process. The sequential nature of video also makes its access cumbersome. Anyone who owns a video camera is acquainted with the frustration of tapes gathering dust on the shelf because it is too time consuming to sift through the hours of video.
Key to the widespread use of video as information is a stable architecture and set of tools that could make video access viable. This is illustrated by the universal appeal of table-based relational databases. If a custom database has to be developed for each installation, the cost would be prohibitive for all but governmental agencies and select mega-corporation. The availability of a technology base in relational databases makes it possible to purchase an off-the-shelf database that provides a large part of the solution. Market realities show that the need to customize such databases and to organize one's data to meet one's requirements is acceptable once such a stable technology substrate exists . In this document, we disclose the results of our research in content-based video access and describe the technology and interface components we have developed. We shall present our work in terms of the two major technologies necessary to realize our vision: Video Access, Interaction and Manipulation, and Content-Based Video Segmentation . The former addresses the higher level system needs for handling video not as a stream of video frames but as conceptual objects based on content. The latter address the challenge of automating the production of these content-based video objects.
Summary
A method and apparatus are provided for accessing a video segment of a plurality of video frames . The method includes the steps of segmenting the plurality of video frames into a plurality of video segments based upon semantic content and designating a frame of each video segment of the plurality of segments as a keyframe and as an index to the segment. The method further includes the steps of ordering the keyframes and placing at least a portion of the ordered keyframes in an ordered display with a predetermined location of the ordered display defining a selected location. A keyframe may be designated as a selected keyframe. The ordered keyframes may be precessed through the ordered display until the selected keyframe occupies the selected location.
Brief Description of the Drawings
FIG. 1 is a block diagram of a content based video access system in accordance with an embodiment of the invention;
FIG. 2 depicts the shot hierarchical organization of the system of FIG. 1;
FIG. 3 depicts a shot hierarchy data model of the system of FIG. 1;
FIG. 4 depicts a multiply-linked representation for video access of the system of FIG. 1;
FIG. 5 depicts a summary representation of a standard keyframe and filmstrip representation of the system of FIG. 1;
FIG. 6 depicts a keyframe-annotation box representation of the system of FIG. 1;
FIG. 7 depicts a multiresolution animated representation showing shot ordering in boxed arrows of the system of FIG. 1;
FIG. 8 depicts an animation sequence in ordered mode of the system of FIG. 1;
FIG. 9 depicts an animation sequence in straight line mode of the system of FIG. 1;
FIG. 10 depicts a multiresolution animated browser with a magnified view of the system of FIG. 1;
FIG. 11 depicts a screen setup for scanning video frames of the system of FIG. 1;
FIG. 12 depicts a total search time and animation time versus animation speed plot of the system of FIG. 1;
FIG. 13 depicts histograms showing time clustering for different animation speeds of the system of FIG. 1;
FIG. 14 depicts a hierarchical shot representation control panel of the system of FIG. 1;
FIG. 15 depicts a VCR-like control panel of the system of FIG. 1;
FIG. 16 depicts a timeline representation control panel of the system of FIG. 1;
FIG. 17 depicts a hierarchy of events inherent in the processing of video frames by the system of FIG. 1;
FIG. 18 depicts VCM output of a pan-right sequence showing the local vector field as fine lines and a dominant vector of a video frame processed by the system of FIG. 1;
FIG. 19 depicts a four-quadrant approach to zoom detection of the system of FIG. 1;
FIG. 20 depicts VCM output of a zoom-in sequence showing the local vector field as fine lines and a dominant vector of a video frame processed by the system of FIG. 1;
FIG. 21 depicts VCM output of a pan-zoom sequence showing the local vector field as fine lines and a dominant vector of a video frame processed by the system of FIG. 1;
FIG. 22 depicts a method of creating an NCM for a given interest point of a simulated video frame processed by the system of FIG. 1;
FIG. 23 depicts a vector field obtained using ADC (NCMs) from a video frame processed by the system of FIG. 1;
FIG. 24 depicts a vector field obtained using VCM from a video frame processed by the system of FIG. 1;
FIG. 25 depicts a global VCM computed for a vertical pan sequence of a video frame processed by the system of FIG. 1;
FIG. 26 depicts a vertical camera pan of a video frame analyzed using VCM by the system of FIG. 1;
FIG. 27 depicts a vector field obtained without using temporal prediction by the system of FIG. 1;
FIG. 28 depicts a vector field obtained using temporal prediction by the system of FIG. 1;
FIG. 29 depicts an example of vector clustering combining a moving object and camera pan by the system of FIG. 1;
FIG. 30 depicts hand movements detected using VCM and clustering by the system of FIG. 1;
FIG. 31 depicts a noisy frame (S/N=21.7dB) analyzed using ADC (NCMs) by the system of FIG. 1;
FIG. 32 depicts another example of a noisy frame (S/N=21.7dB) analyzed using VCM by the system of FIG. 1;
FIG. 33 depicts a pure zoom sequence analyzed with VCM by the system of FIG. 1;
FIG. 34 depicts a vector field obtained for a pan and zoom sequence by the system of FIG. 1; and
FIG. 35 depicts a vector field obtained for a camera rotating about its optical axis by the system of FIG. 1.
Detailed Description of a Preferred Embodiment
1. Video Access, Interaction and Manipulation.
The goal of our technologies is to transform video from a sequential stream of frame data into a rich and compelling information source. For this to be realized, we have to address the challenges posed by both the immense data volume and impediments posed by the inherent sequential nature of video. The video access environment that we are developing is designed to help the user to make sense of this data. Our approach is to provide a rapid browsing and video manipulation interface to facilitate organization and access of the video data as semantic data units. FIG. 1 is a block diagram of a video processing system 10, generally, that may be used to segment a video file based upon a semantic content of the frames. Segmentation based upon semantics allows the system 10 to function as a powerful tool in organizing and accessing the video data.
We shall discuss the design philosophy behind our interface, and the particular elements of the interface provided by the system 10 with a video database. Of particular importance is the Multiresolution Animated Representation (MAR), which we present along with the results of our user study. For purposes of preliminary explanation, the MAR may be thought of as a man-machine interface (MMI) which facilitates a user's temporal grasp of the semantic content.
1.1 Shot Architecture.
Before we proceed, we need to define several terms to facilitate our discussion. A video sequence can be thought of as a series of video frames. These frames can be organized into shots. We define a shot as any sequential series of video frames delineated by a first and a last frame. The shot is the basic video representational unit in our system. Our system permits shots to be organized into a hierarchical shot model. FIG. 2 shows the shot architecture of our system that may be displayed on a display 16 of the system 10. FIG. 2 shows a portion of the MAR (to be described in more detail later) at level 0 along with subgroups of the MAR at levels 1 and 2. In the figure, each shot at the highest level (we call this level 0) is numbered starting from 1. Shot 2 is shown to have four subshots which are numbered 2.1 to 2.4. These shots are said to be in level 1 (as are shots 6.1 to 6.3). Shot 2.3 in turn contains three subshots, 2.3.1 to 2.3.3, which are in level 2. In our organization, shot 2 spans its subshots. In other words, all the video frames in shots 2.1 to 2.4 are also frames of shot 2. Similarly, all the frames of shots 2.3.1 to 2.3.3 are also in shot 2.3. Hence the same frame may be contained in a shot in each level of the hierarchy. One could, for example, select shot 2 and begin playing the video from its first frame. The frames would play through shot 2, and eventually enter shot 3. If one begins playing from the first frame of shot 2.2, the video would play through shot 2.4 and proceed to shot 3.
Next, we define a series of concepts which define the temporal-situatedness of the system. The user may select any shot to be the current shot. The video system will play the frames in that shot, and the frame being displayed at any instant is known as the current frame. These are dynamic concepts since the current frame and current shot change as the video is being played. Suppose we are at level 2 of the hierarchy and select shot 2.3.3 as the current shot (shown as a shaded box in FIG. 2). Shot 2.3 at level 1 and shot 2 at level 0 would conceptually become the current shot at those levels. This could lead to confusion for the user, and hence we introduce the concept of the current level. At any moment in the interface, the system is situated at one level, and only the shots at that level appear in the visual summary representation (to be described later). In our current example, the system would be in level 2 although the current frame is also in shot 2.3 and shot 2 in levels 1 and 0 respectively.
This organization may, for example, be used to break a video archive of a courtroom video into the opening statements, a series of witnesses, closing statements, jury instructions, and verdict at the highest level. A witness testimony unit may be broken into direct and cross-examination sessions at the next level, and may be further broken into a series of attorney-witness exchanges, etc.
FIG. 3 shows the data model which makes up our shot database. Each shot defines its first and final frame within the video (represented by F and L respectively) and its resources for visual summary (the keyframe and filmstrip representations, represented by K). Each shot unit is linked to its predecessor and successor in the shot sequence. Each shot may also comprise a list of subshots. This allows the shots to be organized in a recursive hierarchy.
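As a rough illustration of this data model, the following sketch expresses the shot record of FIG. 3 as a small Python structure. The field and method names are hypothetical and merely mirror the description above (first frame F, last frame L, keyframe resource K, predecessor and successor links, and a recursive subshot list); they are not part of the original disclosure.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Shot:
        first_frame: int                      # F: first frame of the shot in the video
        last_frame: int                       # L: last frame of the shot in the video
        keyframe: int = -1                    # K: frame used for the visual summary
        prev: Optional["Shot"] = None         # predecessor in the shot sequence
        next: Optional["Shot"] = None         # successor in the shot sequence
        subshots: List["Shot"] = field(default_factory=list)  # recursive hierarchy

        def spans(self, frame: int) -> bool:
            """A shot spans its subshots, so containment is a simple range test."""
            return self.first_frame <= frame <= self.last_frame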
1.2 Multiply-Linked Representations in Video Access.
Interface consistency and metaphor realism are critical to helping the user form a good mental model of both the system of the MAR and the data (e.g., the shots and subshots of the MAR). Our system features a multiply-linked representation set that provides this consistency and realism. By maintaining this realism and consistency, we provide a degree of tangibleness to the system and video data that allows the user to think of our representational units as 'real' objects. The consistency among the representational components also permits the user to maintain a degree of 'temporal situatedness' through the video elements displayed. The system is made up of four major interlinked representational components:
• A multi-mode visual summary representation that allows the user to get a glimpse of the content of the video
• A hierarchical video shot representation that permits access and manipulation of video objects into a conceptually cogent information source
• A VCR-like representation with a video display screen
• A timeline representation that facilitates the user's access to video in terms of time displacement into a shot sequence.
The system interface showing these components is shown in FIG. 4. While all these components involve detailed design and implementation, the key to making them consistent and real is that they are multiply-linked. When one selects a shot representation in the visual summary, the hierarchical video shot representation, the VCR-like video screen display, and the timeline representation are updated to show the video segment in each of their representations. The timeline moves to the temporal displacement of the selected shot, the VCR-like video screen displays the first frame in the shot, and the hierarchical representation indicates where the shot is in the shot hierarchy.
1.3 Multi-Mode Visual Summary.
The fundamental summary element in our system is the keyframe. It has been shown that this simple representation is a surprisingly effective way to provide the user with a sense of the content of a video sequence. Our key contribution is in the way such keyframes are organized and presented within the context of a video access and browsing environment. We provide three summary organizations:
• The standard keyframe presentation
• The annotation presentation
• The multiresolution animated keyframe presentation
These summary representations are interchangeable keyframe presentation modes. Apart from the obvious differences in organization, they interact with the other representational components in exactly the same way. The keyframes have the same 'behavior' which we shall present as we describe each representation mode.
1.3.1 Standard Keyframe and Filmstrip Representation.
FIG. 5 shows the standard keyframe representation. Each shot is summarized by its keyframe, and the keyframes are displayed in a scrollable table. The keyframe representing the "current shot" is highlighted with a boundary. A shot can be selected as the current shot from this interface by selecting its keyframe with the computer pointing device (e.g., a mouse). In accordance with the multiply-linked representation principle, the position of the current shot in the shot hierarchy appears in the shot hierarchy representation, the first frame of the shot appears as the "current frame" in the display of the VCR-like interface, and the timeline representation is updated to show the offset of the first frame of the shot in the video data. The video can be played in the VCR display by activating the "Play" button on the VCR-like interface. When the video is being played, the current keyframe highlight boundary blinks. At any time, the frame displayed in the VCR display is the current frame. When the current frame advances beyond the selected shot, the next shot becomes the current shot, and the current shot highlight boundary is updated.
Beneath the keyframe window is a subwindow showing the annotation information of the current shot. This current frame annotation panel contains the classification of the shot and a textual description that includes a set of keywords. The classification and keywords are dependent on the domain to which the system is applied. For example, for a video database for courtroom archival, the classification may be "prosecution questioning", "testimony", etc. If editing is enabled, the user may modify the annotation content of the current shot.
A filmstrip presentation of any shot may be activated by clicking on its keyframe with the keyboard shift key depressed (other key combinations or alternate mouse buttons may be chosen for this function) . The filmstrip representation may be similarly activated from all the visual summary modes. Filmstrip presentation of the current shot may be as shown in FIG. 5. This filmstrip shows every nth frame at a lower resolution (we use the 10th frame in our implementation) in a horizontally scrollable window. The user can select any frame in the filmstrip and make it the current frame . All other representation components are updated accordingly (i.e., the VCR-like display will jump to the frame and enter pause mode and the timeline representation will be updated) . As will be seen later, the filmstrip representation provides higher temporal resolution access into the video and is particularly useful for editing the shot hierarchy.
1.3.2 Keyframe-Annotation Representation.
FIG. 6 shows the visual summary mode that is particularly useful for browsing and updating annotation content. The keyframes of the shots are displayed beside their annotation subwindows in a scrollable window. The annotation panel is identical to the current frame annotation panel in the standard keyframe representation. The keyframes behave exactly the same way as in the standard keyframe representation (i.e., selection of the current shot, activation of the filmstrip representation of any shot, and the interlinking with the other representation components). While this representation does not permit the user to see as many keyframes, its purpose is to let the user browse and edit the shot annotations. Annotation information from neighboring shots is available, and one does not have to make a particular shot the current shot to see its annotation or to edit the annotation content.
The trade-off here is between annotation context and the number of shots that are accessible on one screen.
1.3.3 Multiresolution Animated Representation.
We function in a dynamic and spatial world, and humans have immense capacity for spatial and temporal reasoning, memory and tracking. One of the key attractions of the familiar computer desktop interface of the Apple Macintosh is that it exploits the user's ability to remember the location of a document in a spatial environment. The prospect that this interface will attract a larger market has compelled others in the industry to imitate this interface. The internet, for example, had been in existence long before the arrival of web technology. It was not until the adoption of spatial navigation, graphical interaction, hyperlinking and multimedia that the web caught the imagination of the market. Our Multiresolution Animated Representation (MAR) is designed to exploit these capabilities for video browsing. The goal is to produce an interaction methodology that is economical for on-screen real-estate and that helps the user form an accurate mental model of the video shot architecture, interaction methodology and video content.
FIG. 7 shows the layout of our MAR. The purpose of using the MAR is to permit more keyframes to fit within the same screen real-estate. Our current implementation shown in FIG. 7 may present 75 keyframes in a 640x480-pixel window. These may comprise 72 low resolution keyframes, two intermediate resolution keyframes displayed at 128x96 pixels, and one high resolution keyframe (representing the current shot) displayed at 256x192 pixels. This is in contrast with the standard keyframe representation, which displays 40 keyframes at 128x96 pixels and takes more than twice the screen real-estate (a 750x610 pixel window). Even at the lowest resolution, the thumbnail images provide sufficient information for the user to recognize their contents. The animation control panel above the MAR allows the user to select the animation mode and speed. We shall discuss the different path modes in the next sections. The user is also able to select either the "real image" or "box animation" modes from this panel. The former mode animates the thumbnail images while the latter replaces them with empty boxes during animation.
While the "real image" mode is vastly superior, the
"box animation" mode may be necessary for machines with less computational capacity. We shall also discuss the need for being able to set the animation speed in a later section.
We shall first describe the operation of the MAR and how it integrates with the other three representation components. We shall also present the results of our user experiments which show the efficacy of the representation and highlight the tradeoffs inherent in the interface.
Turning now to MAR operation, FIG. 7 shows the ordering of the shots in the MAR in the boxed arrows. One may think of the shot sequence of the video as scrolling (e.g., processing) "through" the MAR interface. The highest resolution keyframe in the center is always the current shot keyframe. As the cursor moves over each keyframe, a highlighting boundary appears around the keyframe (see FIG. 10). This indicates that the corresponding shot will be selected if the mouse selection button is depressed. If a shot is selected, the interface animates so that the selected frame moves and expands to the current frame location. The other keyframes move to their respective new locations according to their shot order. Our system implements two animation modes that determine the paths the frames take in the animation: ordered mode and straight line mode. We shall first describe the interface before we discuss the rationale for our design and the results of our access time studies. As will be evident in our discussion, apart from the obvious difference between the MAR and the other two visual summary modes, care has been taken to keep the interaction methodology consistent throughout all visual summary modes.
FIG. 8 shows the ordered mode animation. The keyframes scroll along the path determined by the shot order shown in FIG. 7, expanding as they reach the center keyframe location and contracting as they pass the center. This animation proceeds until the selected keyframe arrives at the current keyframe location at the center of the interface. Ordered mode animation is useful for understanding the shot order in the browser and for scanning the keyframes by scrolling through the keyframe database. Ordered mode animation is impractical for random shot access because of the excessive time taken in the animation.
In accordance with our overall system design, one can access the filmstrip representation in exactly the same way as in the other visual summary modes. The filmstrip, of course, functions the same way no matter which visual summary mode is active when it is launched.
The current frame annotation panel is also accessible in the MAR as with the standard keyframe representation. The second animation mode is the straight line animation illustrated in FIG. 9. When the keyframe of a shot is selected, the system determines the final location of every keyframe within the shot database and animates the motion of each keyframe to its destination in a straight line. If the resolution of the keyframe changes between its origin and destination, the keyframe grows or shrinks in a linear fashion along its animated path. As is evident in the animation sequence shown in FIG. 9, new keyframes move onto, and some keyframes move off, the browser panel to maintain the shot order. The advantage of this animation mode is its speed. As will be seen later, the trade-off achieved between animation time and visual search time determines the total time it takes to access a particular shot through the interface.
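The straight line animation described above amounts to linearly interpolating each keyframe's position and size over a fixed number of animation steps. The following is a minimal sketch of that interpolation; the function name, coordinates, and step count are illustrative assumptions.

    def interpolate_keyframe(start, end, steps):
        """Linearly interpolate a keyframe's (x, y, width, height) from its
        origin to its destination over a fixed number of animation steps."""
        x0, y0, w0, h0 = start
        x1, y1, w1, h1 = end
        frames = []
        for i in range(1, steps + 1):
            t = i / steps  # interpolation parameter in (0, 1]
            frames.append((
                x0 + t * (x1 - x0),
                y0 + t * (y1 - y0),
                w0 + t * (w1 - w0),   # keyframe grows or shrinks linearly
                h0 + t * (h1 - h0),
            ))
        return frames

    # Example: a thumbnail moving to the 256x192 current-shot position in 30 steps.
    path = interpolate_keyframe((500, 400, 64, 48), (192, 144, 256, 192), steps=30)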
1.3.3.1 Magnifying Glass Feature
Although the lowest resolution thumbnail images generally provide sufficient information to select the required shot, we have found that it is sometimes convenient to be able to see a higher resolution display of a keyframe before selecting the shot. This may be characterized as a magnifying glass feature. Being able to see a higher resolution display of a keyframe is especially important if the shot keyframes are very similar or where the keyframes are complex. If the mouse pointer stays over a particular keyframe for a specified period (2 seconds in one implementation), a magnified view of the keyframe appears. This magnified view disappears once the cursor leaves the keyframe.
FIG. 10 shows a highlighted keyframe under the cursor and a magnified display of the same keyframe.
1.3.4 Experimental Results for MAR
Two assumptions drive the design of the MAR browser 10. The first is that the resolution-for-screen real-estate tradeoff gives the user an advantage in viewing greater scope without seriously impairing her ability to recognize the content of the keyframe. The second assumption is that the animation assists the user to select a sequence of shots.
The scope advantage of the first assumption is self-evident since one can indeed see more shot keyframes for any given window size using the multiresolution format. Whether the reduction in resolution proves to be a serious impediment is harder to determine. Thumbnail representations are used by such programs as Adobe's Photoshop, which uses much lower resolution images as image file icons on the Macintosh computer. The subjects who have tested our system demonstrated that they can easily recognize the images in the low resolution thumbnails. We tested the second assumption by conducting a response time experiment which will be detailed in this section. The goal of the experiment was to determine if animation improves user performance, and how different animation speeds may impact selection time. The scenario we are testing is that in which the user 'knows' which keyframe she would like to see next (e.g., when one spots multiple targets of interest and would like to view them in turn). In order to avoid the effects of scrolling through yet unseen portions of the shot database, we ensure that the next keyframe of interest (beyond the one currently to be selected) is always in the current representation. Our experiment is also designed to test the performance of practiced users and to elicit subjective responses from subjects.
FIG. 11 shows the screen layout of our experiment. Each subject was presented with the interface and a window showing five target keyframes. A window showing the list of targets is displayed above the MAR selection window. Since our experiment tests the situation in which the user knows which images she was interested in, the same target sequence of five keyframes was used throughout the experiment. The subjects were required to select the five target images displayed in order from left to right. The keyframe to be selected is highlighted in the target list window. If the subject makes a correct selection, the interface functions as designed (i.e., the animation occurs). Erroneous selections were recorded. The interface does not animate when an error is made. We shall call each such sequence of five selections a selection set.
The experiment was performed by five subjects. Each subject was required to perform 8 sets of selections at each of five animation speeds. Hence each subject made 40 selections per animation speed, yielding a total of 32 between-selection intervals which were recorded. This constitutes a total of 200 selections and 128 time interval measurements per subject. The subjects rested after each selection set. The fastest animation speed was the 'jump' animation in which the keyframes moved instantaneously to their destinations. We call this animation speed zero (or no animation). In animation speed 1, the keyframes moved to their destinations in 30 steps. Animation speeds 2 to 4 correspond to 45, 60, and 75 animation steps respectively. The actual times for each animation are shown in the animation time histograms in FIG. 13. We recorded the animation time and the post-animation search time for each selection. The sum of the animation and search times constitutes the total selection time for each target.
Since we used only five subjects, we could not randomize the animation-speed order which was presented to them. We applied a counterbalanced experiment design in which each subject was tested at a different animation-speed order. No ordering effects were seen. To prevent the effect of subjects remembering the shot ordering, the shot (or keyframe) ordering in the interface was randomized for each selection set. Before the experiment, each subject was briefed about the interface and the experiment procedure. They were allowed to play with the system at various animation speeds until they felt comfortable with the system. The same five-keyframe target sequence that was used in the experiment was used in the familiarization phase .
FIG. 12 plots the average animation, search and total selection times against animation speed for all subjects. The plots show that the total time for selection decreases at animation speed 1 (from 2.87 sec to 2.36 sec) and increases steadily thereafter (2.78 sec, 3.35 sec and 3.67 sec for animation speeds 2, 3 and 4 respectively). For our subjects, the break-even point for animation appears to be around animation speed 2. Our plots also show that the search time decreases from animation speed 0 to 2. Thereafter, increased animation speed seems to have little effect on search time. The animation time, however, increases steadily from speed 0 to 4. An ANOVA single factor analysis on the results shows that these averages are reliable (see Table 1).
Table 1. ANOVA analysis of experiment results
The histograms for each of the measurements for the different animation speeds in FIG. 13 give us an insight into what is happening. One may think of selection time as the sum of an animation time, visual search time, and motor action time (the time for the user to move the cursor to the target and make the selection). In reality, with animation, we observe that subjects move the cursor during tracking before animation terminates. This actually permits faster selection with animation. We also notice subjects move the cursor as a pointer in tandem with visual search (even at animation speed 0, or jump mode), so that it is not easy to separate pure visual search and motion time. The histogram of the total selection time for animation speed 0 (this is the same as search time since animation time is zero) in FIG. 13 is more broadly distributed than for the other animation speeds. This reflects the greater variance in the visual search and motion times without animation. What this suggests is that a significantly greater amount of visual and motor search takes place with no animation. It is observed that even if the subject locates the next target during the current selection, he or she is not able to anticipate its new location after the jump. Each search appears to begin sequentially from the top or bottom of the window. With animation, visual search time is greatly reduced because the subjects begin moving the cursor even before the animation ends. The result is more tightly clustered selection and search times.
In post-experiment interviews, all subjects expressed a preference for animation. The common sentiment of the subjects and others who have tried the interface informally is that animation makes it easier to select the required keyframes.
Our results show that animation speed 1 (average of 1.5 sec) was optimal for the subjects who participated in the experiments. We caution, however, that this may not be generalizable to the general population. Our subjects were all between 24 and 35 years of age. It is likely that people from other age groups will have a different average optimal animation speed. Furthermore, for our subjects, animation speeds above animation speed 2 (average of 2.24 sec) did not appreciably decrease search time while they increased total selection time. We observe that at the higher animation times, the subjects appear to be waiting for the animation to cease before making their selections. This delay may lead to frustration or a loss of concentration in users of the system if the animation time is too long. For this reason, we have made the animation speed selectable by the user.
1.4 Hierarchical Video Shot Representation.
The second component of our multiply-linked representation system, the hierarchical shot representation, is shown in FIG. 14. The Hierarchical Video Shot Representation panel is designed to allow the user to navigate, to view, and to organize video in our hierarchical shot model shown in FIG. 2. It comprises two panels for shot manipulation, a panel which shows the ancestry of the current shot and permits navigation through the hierarchy, a panel for switching among visual summary modes, and a panel that controls the shot display in the visual summary.
In accordance with our MLR philosophy, the hierarchical shot representation is tied intimately to the other representational components. It shows the hierarchical context of the current shot in the visual summary, which is in turn ganged to the VCR-like display and the timeline representation. The shot editing panel permits manipulation of the shot sequence at the hierarchical level of the current shot. It allows the user to split the current shot at the current frame position. Since the user can make any frame in the filmstrip representation of the visual summary the current frame, the filmstrip representation along with the VCR-like control panel are useful tools in the shot manipulation process. As illustrated by the graphic icon, the new shot created by shot splitting becomes the current shot. In the same panel, shots may be merged into the current shot, and groups of shots (marked in the annotation visual summary interface) may be merged into a single shot.
This panel also allows the user to create new shots by setting the first and last frames in the shot and capturing its keyframe. When the "Set Start" and "Set End" buttons are pressed, the current frame (displayed in the VCR display) becomes the start and end frame respectively of the new shot. The default keyframe is the first frame in the new shot, but the user can select any frame within the shot by pausing the VCR display at that frame and activating the "Grab Keyframe" button.
The second shot manipulation panel is designed for subshot editing. As is obvious from the bottom icons in this panel, it permits the deletion of a subshot sequence (the subshot data remains within the first and last frames of the supershot, which is the current shot). The "promote subshot" button permits the current shot to be replaced by its subshots, and the "group to subshot" button permits the user to create a new shot as the current shot and make all shots marked in the annotation visual summary interface its subshots.
The subshot navigation panel in FIG. 14 permits the user to view the ancestry of the current shot and to navigate the hierarchy. The "Down" button in this panel indicates that the current shot has subshots, and the user can descend the hierarchy by clicking on the button. If the current shot has no subshots, the button becomes blank (i.e., the icon disappears) . Since the current shot in the figure (labeled as "Ingrid Bergman") is a top level shot, the "Up" button (above the "down" button) is left blank. The user can also navigate the hierarchy from the visual summary window. In our implementation, if a shot contains subshots, the user can click on the shot keyframe with the right mouse button to descend one level into the hierarchy. The user can ascend the hierarchy by clicking on any keyframe with the middle mouse button. These button assignments are, however, arbitrary and can be replaced by any mouse and key-chord combinations .
The hierarchical shot representation panel also permits the user to hide the current shot or a series of marked shots in the visual summary display. This makes it easier for the user to view a greater temporal span of the video in the video summary window at the expense of access to some shots in the sequence. The hide feature can, of course, be switched off to reveal all shots.
1.5 VCR-Like Representation with Video Display.
FIG. 15 shows the VCR-like control panel that provides a handle to the data using the familiar video metaphor. Through this panel, the user can operate the video just as though it were a video tape or video disk while remaining situated in the shot hierarchy. When the video is played, one might think of the image in the VCR display as the current frame. As the current frame advances (or reverses, jumps, fast forwards/reverses, etc.) across shot boundaries, all the other representation components respond accordingly to maintain situatedness. The user can also loop the video through the current shot and jump to the current shot boundaries. Whenever the current frame is changed through any other representational component (e.g., by selecting a shot in the video summary, selecting a frame in a filmstrip representation, navigating the shot hierarchy in the hierarchical shot representation, or changing the temporal offset in the timeline representation), the VCR display will jump to the new current frame and the VCR panel will be ready to play the video from that point. We have found the frame-level control in the VCR control panel to be invaluable in the creation and organization of shots.
1.6 Timeline Representation.
FIG. 16 shows the timeline representation of the current video state. The lower timeline shows the offset of the current frame in the entire video. The vertical bars represent the shot boundaries in the video, and the numeric value shows the percent offset of the current frame into the video. Consistent with the rest of the interface, the slider is strongly coupled to the other representational components. As the video plays in the VCR-like representation, the location of the current frame is tracked on the timeline. Any change in the current frame by interaction with either the visual summary representation or the hierarchical shot representation is reflected by the timeline as well. The timeline also serves as a slider control. As the slider is moved, the current frame is updated in the VCR-like representation, and the current shot is changed in the visual summary and hierarchical shot representations. The timeline on the top shows the offset of the current frame in the video sequence of the currently active hierarchical level. It functions in exactly the same way as the global timeline. In the example shown in FIG. 16, the current frame is within the first shot in the highest level of the hierarchy, offset 5% into the global video sequence. The same frame is in the eleventh (last) subshot of shot 1 in the highest level, and is offset 96% into the subshots of shot 1.
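The two timeline readouts described above reduce to simple percent-offset computations over a frame range (the whole video for the lower timeline, the span of the currently active level for the upper timeline). The sketch below illustrates the computation under that assumption; the function name and the example frame numbers are hypothetical.

    def percent_offset(frame, first_frame, last_frame):
        """Percent offset of the current frame within a frame range
        (the whole video for the lower timeline, the active level's
        span for the upper timeline)."""
        span = last_frame - first_frame
        if span <= 0:
            return 0.0
        return 100.0 * (frame - first_frame) / span

    # Example mirroring FIG. 16: the same frame may be 5% into the whole video
    # but 96% into the subshots of shot 1 (frame numbers are illustrative).
    global_pct = percent_offset(1500, 0, 30000)   # 5%
    level_pct = percent_offset(1500, 60, 1560)    # 96%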
2. Content-Based Video Segmentation
We have described our approach and system for managing, editing, manipulating and accessing video data. As it stands, this system is adequate for generating semantic shot databases manually. While the interface makes it easier to do this, it would be preferable if such a database could be generated in an automatic or semi-automatic fashion. We have already observed that the challenge of video is its volume and sequential nature. Without automated video segmentation, the task of producing a video database would be very labor intensive, slow and costly. In this section, we detail our approach to semantic content-based video segmentation. As used herein, the semantic content of a video sequence may include any symbolic or real information element in a frame (or frames) of a video sequence. Semantic content, in fact, may be defined on any of a number of levels. For example, segmentation based upon semantic content (on the simplest level) may be triggered based upon detection of a scene change. Segmentation may also be triggered based upon camera pans or zooms, or upon pictorial or audio changes among frames of a video sequence.
We note that segmentation need not be perfect for it to be useful. Consider the situation where a tourist takes six hours of video on a trip. If a system is able to detect every time the scene changed (i.e., the camera went off and on) and divided the video into shots at scene change boundaries, we would have fewer than 200 scene changes if the average time the camera stayed on was about 2 minutes. If we extracted the first frame of each shot as a keyframe, the tourist would be able to browse this initial shot segmentation using our interface. She would be able to quickly find the shot she took at the Great Wall of China by just scanning the keyframe database. Given that she has a sense of the temporal progression of her trip, this database would prove useful even with only scene change-based segmentation. Furthermore, the video would now be accessible as higher level objects rather than as a sequence of image frames that must be played. She could further organize the archive of her trip using the hierarchical editing capability of our system.
2.1 Segmentation Strategy.
The above example illustrates our approach to semantic content-based video segmentation. First, the purpose of such segmentation is to provide an initial shot database which may then be further organized by interactive editing. Our fundamental strategy is first to apply computer vision and image processing technology to produce a good initial content-based video segmentation. Next, we use a rapid browsing and video manipulation interface to further organize and access the video as semantic data units.
Second, we may exploit the natural organization intrinsic to the way video is taken to provide this initial segmentation. This natural organization is illustrated by our scene change example in that one expects that each time the camera goes on, something interesting is happening. Another such organizational cue may be the change of camera perspectives between scene changes. Assuming that the cameraman is rational, one may infer that she has a reason for panning the camera. A slow pan may indicate that she is recording a panorama of the scene. The extrema of such pans may suggest that the video has settled on semantically interesting information. Other events of interest in video may be detectable depending on the domain in which the video is extracted. Video from a football game may, for example, be segmented into plays by detecting the scrimmages. We shall discuss other organizational cues that may be detected in the video in a later section.
Third, video may be segmented by detecting video events which serve as cues that the semantic content has changed. For our scene change detection example, one may use some discontinuity in the video content as such a cue.
2.2 Segmentation Hierarchy.
FIG. 17 shows the event hierarchy that one might detect in video data. At the highest level of the hierarchy are domain events . Such events are strongly dependent on the domain in which the video is taken. In addition to the football video example given above, one might conceive of other domains in which events may be detected in the video. In a courtroom domain, the task may be to detect each time a different person is speaking, and when different witnesses enter and leave the witness box. In a general tourist and home video domain, one might apply a 'drama model' and detect every time a person is added to the scene. One may then locate a specific person by having the system display the keyframes of all 'new actor' shots. The desired person will invariably appear.
The next two levels involve events that are domain independent. Camera/photography events are essentially events of perspective change in the video. The two that we detect are camera pans and zooms. We have already discussed scene change events.
The detection of such events from video content involves image and vision-based computation. The question then becomes one of finding appropriate computational algorithms to detect visual cues that signal these video events. In the next sections, we shall discuss our algorithmic approaches and show some results of our work.
2.3 Scene Change Events.
A scene change is a discontinuity in the video stream where the video preceding and succeeding the discontinuity differ abruptly in location and time. Examples of scene change events in video include cuts, fades, dissolves and wipes. Cuts, by far the most common in raw video footage, involve abrupt scene changes in the video. The other modes may be classified as forms of more gradual scene change. By our definition, some scene changes are inherently undetectable. For example, one may cut a video stream and restart it viewing exactly the same video content. It is not possible to detect such changes from the video alone. Hence, we modify our definition to include the constraint that scene change events must include significant alteration in the video content before and after the scene change event.
Boreczky and Rowe provide a good review and evaluation of various algorithms to detect scene changes. They compare five different scene boundary detection algorithms using four different types of video: television programs, news programs, movies, and television commercials. The five scene boundary detection algorithms evaluated were global histograms, region histograms, running histograms, motion compensated pixel differences, and discrete cosine transform coefficient differences. Their test data had 2507 cuts and 506 gradual transitions. They concluded that global histograms, region histograms, and running histograms performed well overall. The more complex algorithms actually performed more poorly. In their evaluation, running histograms produced the best results using the criterion of 'least number of transitions missed'. It detected above 93% of the cuts and 88% of the gradual transitions.
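As a rough illustration of the histogram-based family of detectors discussed above, the sketch below flags a cut when the difference between the gray-level histograms of consecutive frames exceeds a threshold. It is only a minimal global-histogram sketch, not the evaluated algorithms themselves, and the bin count and threshold are arbitrary illustrative values.

    import numpy as np

    def histogram_cut_detector(frames, bins=64, threshold=0.4):
        """Flag frame indices where the normalized histogram difference between
        consecutive frames exceeds a threshold (a simple global-histogram cut test)."""
        cuts = []
        prev_hist = None
        for i, frame in enumerate(frames):          # each frame: 2-D gray-level array
            hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
            hist = hist / hist.sum()                # normalize so the difference is scale-free
            if prev_hist is not None:
                # L1 distance between successive histograms, in [0, 2]
                if np.abs(hist - prev_hist).sum() > threshold:
                    cuts.append(i)
            prev_hist = hist
        return cuts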
In our implementation, we adopted both the global and running histogram approaches. In our own preliminary tests, the running histogram detected 17 out of 17 cuts and 9 out of 10 gradual scene transitions. The global histogram approach is efficient and simple, and we believe that a combination of both approaches may produce better results.
2.4 Camera/Photography Events.
Camera or photography events are related to the change of camera perspective. In our work, we address camera pans and zooms. Such perspective changes can be detected by examining the motion field in the video sequence. When the camera pans, we expect the video to exhibit a translation field in the video data in the opposite direction from the pan. When the camera zooms in or out, we expect a field that respectively diverges from or converges toward the camera focal axis. We note that if the scene is far from the focal point of the camera, pan-zoom combinations are dominated in the vector field by the pan effect.
We have developed a novel Vector Coherence Mapping (VCM) algorithm to compute image flow vector fields in video data. VCM is a parallelizable algorithm that applies a fuzzy voting scheme to compute both global and local vectors in an image sequence. Using our fuzzy approach, we can impose various spatial and temporal constraints on the vector computation. Hence, the vectors we compute are coherent across space and time. Our results show that VCM is a robust flow field algorithm that is able to handle moving cameras, moving objects, and combinations of moving cameras and scenes. VCM produces a set of local vectors and a dominant global translation vector between successive pairs of adjacent frames. These vectors can be used to determine whether the flow field satisfies the flow conditions of either camera pans or zooms.
2.4.1 Camera Pan Detection.
We have already noted that camera pans result in a global translation field across the image. VCM produces a set of vectors: vj, i = 1, 2... n between frames at time t and t + δt. The dominant vector computed is denoted Vp . The dominant vector may indicate either a global translation field or a dominant object motion in the frame. One may distinguish between the two by noting that the global translation field is distributed across the entire image while the dominant object motion is likely to be clustered. Further, a true translation field is likely to persist across several frames . We detect global translation fields by taking the average of the dot products between all scene vectors and the dominant vector:
$$L_{pan}(t) = \frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{v}_i^t \cdot \mathbf{V}_D^t}{\left\|\mathbf{v}_i^t\right\|\,\left\|\mathbf{V}_D^t\right\|}$$

where L_pan(t) is the likelihood that the field belongs to a camera pan at time t.
If L_pan remains high (above 80%) over a specified time interval (we do not expect real camera pans of less than 1 second or 30 frames), then a determination may be made that the camera is panning. FIG. 18 shows the output of VCM for a pan of a computer workbench. The local vectors are shown as fine lines and the dominant translational vector is shown as the thick line in the center of the image.
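A minimal sketch of this pan test follows, assuming the VCM output is available as NumPy arrays of local vectors and a dominant translation vector; the 80% threshold and the 30-frame (one second) window mirror the description above:

    import numpy as np

    def pan_likelihood(local_vectors, dominant_vector, eps=1e-9):
        # Average of the normalized dot products between each local vector
        # and the dominant translation vector (high for a coherent pan field).
        v = np.asarray(local_vectors, dtype=float)    # shape (n, 2)
        d = np.asarray(dominant_vector, dtype=float)  # shape (2,)
        cosines = (v @ d) / (np.linalg.norm(v, axis=1) * np.linalg.norm(d) + eps)
        return float(cosines.mean())

    def is_panning(likelihoods, threshold=0.8, window=30):
        # Declare a pan when the likelihood stays above the threshold for a
        # full window (about one second of 30 frame-per-second video).
        return len(likelihoods) >= window and \
               all(l > threshold for l in likelihoods[-window:])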
2.4.2 Camera Zoom Detection.
FIG. 19 illustrates our approach for zoom detection. Since we assume pure zooms to converge toward or diverge from the optical axis of the camera, we detect a zoom by dividing all computed vectors into four quadrants Q1 to Q4. In each quadrant we take a 45° unit vector, V1 to V4 respectively, and compute the dot product between it and all the vectors in that quadrant. Our likelihood measure for zoom is the average of the absolute values of these dot products:
$$L_{zoom}(t) = \frac{1}{n}\sum_{q=1}^{4}\;\sum_{\mathbf{v}_i^t \in Q_q}\left|\frac{\mathbf{v}_i^t}{\left\|\mathbf{v}_i^t\right\|}\cdot\hat{\mathbf{V}}_q\right|$$
If L_zoom remains high (above 80%) over a specified time interval, then a determination may be made that the camera is zooming. As with pans, we do not expect real camera zooms of less than 1 second or 30 frames. FIG. 20 shows the output of VCM for a zoom of the back of a computer workbench. The local vectors are shown as fine lines and the dominant translational vector is shown as the thick line in the center of the image.
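A corresponding sketch for the zoom test is given below. It is a simplified illustration; vector origins are assumed to be expressed relative to the image center (the optical axis), and the quadrant convention is an assumption:

    import numpy as np

    # Assumed convention: 45-degree unit vectors for quadrants Q1..Q4,
    # with x to the right and y upward, relative to the optical axis.
    QUADRANT_UNITS = {
        1: np.array([ 1.0,  1.0]) / np.sqrt(2),
        2: np.array([-1.0,  1.0]) / np.sqrt(2),
        3: np.array([-1.0, -1.0]) / np.sqrt(2),
        4: np.array([ 1.0, -1.0]) / np.sqrt(2),
    }

    def quadrant(origin):
        x, y = origin
        if x >= 0 and y >= 0: return 1
        if x < 0 and y >= 0:  return 2
        if x < 0 and y < 0:   return 3
        return 4

    def zoom_likelihood(origins, vectors, eps=1e-9):
        # Average of |unit(vector) . quadrant unit vector| over all vectors;
        # high when the field diverges from or converges toward the center.
        dots = []
        for o, v in zip(origins, vectors):
            v = np.asarray(v, dtype=float)
            u = QUADRANT_UNITS[quadrant(o)]
            dots.append(abs(float(np.dot(v, u))) / (np.linalg.norm(v) + eps))
        return float(np.mean(dots))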
Table 2 and Table 3 show the first four pan and zoom likelihoods for the pan and zoom sequences of FIG. 18 and FIG. 20 respectively. As can be seen, the pan likelihood is consistently high for the pan sequence and the zoom likelihood is high for the zoom sequence.
Table 2. Global Vector, Pan and Zoom Likelihoods for the pan sequence of FIG. 18 [tabulated values not reproduced]

Table 3. Global Vector, Pan and Zoom Likelihoods for the zoom sequence of FIG. 20 [tabulated values not reproduced]
2.4.3 Pan-Zoom Combination.
FIG. 21 shows the VCM output for a pan-zoom sequence. The pan and zoom likelihood results are tabulated in Table 3. As we predicted, the pan effect dominates the zoom effect in the vector field. The pan likelihood values are consistently above 80% throughout the sequence. Since we are interested only in video segmentation, it is not important that the system distinguish between pan-zooms and pure pans.
We have discussed the technology we developed for Content-Based Video Access. We have developed a demonstrable interaction technology based on sound psychological principles in Multiply-linked Representations and human perception principles. The multi-dimensional access to video data for organization, access and manipulation is a novel and compelling way to work with such data. Of particular importance is the Multiresolution Animated Representation. The approach is completely new and our user experiments have shown that it is effective for managing the complexities of video.

Turning now to semantic segmentation, we will present a new parallel approach for the computation of an optical flow field from a video image sequence. This approach incorporates the various local smoothness, spatial and temporal coherence constraints transparently by the application of fuzzy image processing techniques. Our Vector Coherence Mapping (VCM) approach accomplishes this by a weighted voting process in "local vector space," where the weights provide high level guidance to the local voting process. Our results show that VCM is capable of extracting flow fields for video streams with global dominant fields (e.g., owing to camera pan or translation), moving camera and moving object(s), and multiple moving objects. Our results also show that VCM is able to operate under strong image noise and motion blur, and is not susceptible to boundary oversmoothing.

What is the original contribution of this work? The Vector Coherence Mapping (VCM) approach makes several contributions to feature-based optical flow computation. First, it combines the correlation and constraint-based processes into a set of fuzzy image processing operations. It obviates the typical iterative post-process of constraint application (e.g., by relaxation labeling, greedy vector exchange, calculus of variations, or Kalman filters). We show the application of both spatial and temporal coherence constraints. Second, VCM is a completely parallel voting algorithm in "vector parameter space." The voting is distributed, with each vector being influenced by elements in its neighborhood. Since the voting takes place in vector space, it is relatively immune to noise in image space. Our results show that VCM functions under both synthetic and natural (e.g., motion blur) conditions. Third, the fuzzy combination process provides a handle by which high level constraint information is used to guide the correlation process. Hence, no thresholds need to be applied early in the algorithm, and there are no non-linear decisions (and consequently errors) to propagate to later processes.
Why should this contribution be considered important? VCM extends the state of the art in feature-based optical flow computation. The algorithm is straightforward and easily implementable in either serial or parallel computation. VCM is able to compute dominant translation fields in a video sequence and multiple vector fields representing various combinations of moving cameras and moving objects. The algorithm can compute good vector fields from noisy image sequences. Our results on real image sequences evidence this robustness. These qualities combine to make VCM a well-founded and practical algorithm for real world applications.

What is the most closely related work by others and how does this work differ? Because it is a voting-based algorithm, VCM is similar to the Hough-based approaches. The difference is that in VCM, the voting is distributed and the constraints enforced on each vector are local to the region of the vector. Furthermore, in VCM the correlation and constraint enforcement functions are integrated in such a way that the constraints "guide" the correlation process by the likelihood distribution. Our results show that VCM has good noise immunity. Unlike other approaches which use such techniques as M-estimators to enforce robustness, the robustness of VCM lies in the fact that correlation errors owing to noise occur in image space, and have little support in the parameter space of the vectors.

How can other researchers make use of the results of this work? First, VCM provides a framework for the implementation of various constraints using fuzzy image processing. A number of other constraints may be added within this paradigm (e.g., color, texture, and other model-based constraints). Second, VCM is an effective way of generating flow fields from video data. Researchers can use the algorithm to produce flow fields that serve as input to dynamic flow recognition problems like video segmentation and gesture analysis.
Barron et al. provide a good review of optical flow techniques and their performance. We shall adopt their taxonomy of the field to put our work in context. They classify optical flow computation techniques into four classes. The first of these, pioneered by Horn and Schunck, computes optical flow fields from the spatial-temporal gradients in an image sequence by the application of an image flow equation. The second class performs "region-based matching" by explicit correlation of feature points and computes coherent fields by maximizing some similarity measure. The third class comprises "energy-based" methods, which extract image flow fields in the frequency domain by the application of "velocity-tuned" filters. The fourth class comprises "phase-based" methods, which extract optical flow using the phase behavior of band-pass filters. Barron et al. include zero-crossing approaches such as that due to Hildreth under this category. Under this classification, our approach falls into the second (region-based matching) category.
Most region or feature-based correlation approaches involve three computational stages: pre-processing to obtain a set of trackable regions, correlation to obtain an initial set of correspondences, and the application of constraints (usually using the calculus of variations to minimize deviations from vector field smoothness, or using relaxation labeling to find an optimal set of disparities) to obtain the final flow field. One may, therefore, further classify such approaches according to the strategy taken at each of the three stages.
The kinds of features selected in the first stage are often related to the domain in which the tracking occurs. Essentially, good features to track should have good localization properties and must be reliably detectable. Shi and Tomasi provide an evaluation of the effectiveness of various features. They contend that the texture properties that make features unique are also the ones that enhance tracking. Tracking systems have been presented that use corners, local texture measures, mixed edge, region and textural features, local spatial frequency along two orthogonal directions, and a composite of color, pixel position and spatiotemporal intensity gradient.
Simple correlation using any feature type typically results in a very noisy vector field. Such correlation is usually performed using such techniques as template matching, absolute difference correlation (ADC) , and the sum of squared differences (SSD) . A key trade-off in correlation is the size of the correlation region or template. Larger regions yield higher confidence matches while smaller ones are better for localization. Zheng and Chellappa apply a weighted correlation that assigns greater weights to the center of the correlation area to overcome this problem. A further reference also claims that by applying subpixel matching estimation and using affine predictions of image motion given previous ego-motion estimates, they can compute good ego-motion fields without requiring post processing to smooth the field.
The final step in most approaches is to apply certain constraints to smooth the flow field. Such constraints include rigid-body assumptions, spatial field coherence, and temporal path coherence. These constraints may be enforced using such techniques as relaxation labeling, greedy vector exchange and competitive learning for clustering. These algorithms are typically iterative and converge on a consistent coherent solution.
The contribution of the VCM approach presented here is that it combines the correlation and constraint-based smoothing processes into a set of fuzzy image processing operations. The algorithm is completely parallel and obviates the iterative post-process. In essence, VCM performs a voting process in vector parameter space and biases this voting by likelihood distributions that enforce the spatial and temporal constraints. Hence, VCM is similar to the Hough-based approaches. The difference is that in VCM, the voting is distributed and the constraints enforced on each vector are local to the region of the vector. Furthermore, in VCM the correlation and constraint enforcement functions are integrated in such a way that the constraints "guide" the correlation process by the likelihood distribution. The Hough methods, on the other hand, apply a global voting space. One reference, for example, first computes the set of vectors and estimates the parameters of dominant motion in an image using such a global Hough space. To track multiple objects, one reference divides the image into patches and computes parameters in each patch, and applies M-estimators to exclude outliers from the Hough voting.
Our results show that VCM has good noise immunity. Unlike other approaches which use such techniques as M-estimators to enforce robustness, the robustness of VCM lies in the fact that correlation errors owing to noise occur in image space, and have little support in the parameter space of the vectors.

Turning now to spatial coherence, let

$$P_t = \left\{p_i^t\right\}, \quad i = 1, \ldots, n$$

be the set of interest points detected in image I_t at time t. These may be computed by any suitable interest operator. Since VCM is feature-agnostic, we apply a simple image gradient operator and select high gradient points.
For a particular interest point p_i^t in image I_t, we estimate its new position in image I_{t+δt} by computing the correlation of the neighborhood of p_i^t with I_{t+δt}. This estimate, however, is very susceptible to image noise and chance correlations. Our approach uses the weighted aggregates of neighboring correlations to obtain a stable vector field.
We define a Normal Correlation Map ncm for some point p_i^t to image I_{t+δt} to be the correlation response of the region around p_i^t in I_t with a rectangular region in I_{t+δt} centered at the coordinates of p_i^t. We use absolute difference correlation (ADC) to perform the correlation. Hence the Normal Correlation Map is given by:

$$N(p_i^t)[m,n] = \sum_{j=-N}^{N}\sum_{k=-N}^{N}\left|I_t[x_i+j,\,y_i+k] - I_{t+\delta t}[x_i+m+j,\,y_i+n+k]\right| \qquad (1)$$

where -D_x ≤ m ≤ D_x, -D_y ≤ n ≤ D_y, p_i^t ≡ (x_i, y_i), 2N + 1 is the x and y dimension of the correlation template, and D_x and D_y define the maximal expected x and y displacement of p_i^t at t + δt, respectively.
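The following sketch computes such an ADC-based normal correlation map for one interest point. It is a simplified serial illustration; frames are assumed to be grayscale NumPy arrays, border handling is omitted, and the default template and search sizes mirror the 5x5 template and 65x49 search area used in our implementation (described below):

    import numpy as np

    def normal_correlation_map(I_t, I_next, p, N=2, Dx=32, Dy=24):
        # Absolute difference correlation (equation 1): the (2N+1)x(2N+1)
        # template around p in I_t is compared against displacements (m, n)
        # in I_next, with -Dx <= m <= Dx and -Dy <= n <= Dy. Low values
        # indicate good matches under ADC.
        x, y = p
        template = I_t[y - N:y + N + 1, x - N:x + N + 1].astype(float)
        ncm = np.zeros((2 * Dy + 1, 2 * Dx + 1))
        for n in range(-Dy, Dy + 1):
            for m in range(-Dx, Dx + 1):
                window = I_next[y + n - N:y + n + N + 1,
                                x + m - N:x + m + N + 1].astype(float)
                ncm[n + Dy, m + Dx] = np.abs(template - window).sum()
        return ncm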
We define the Vector Coherence Map vcm at p_i^t to be:

$$vcm(p_i^t)[m,n] = \sum_{p_j^t \in P_t} W_i^t(p_j^t)\, N(p_j^t)[m,n] \qquad (2)$$

where 0 ≤ W_i^t(p_j^t) ≤ 1 is some weighting function of the contribution of the ncm N(p_j^t) of point p_j^t to the vector at p_i^t, and P_t is the set of interest points in I_t.
By manipulating W_i^t(p_j^t), we can enforce a variety of spatial coherence constraints on the vector field. To enforce a spatial proximity coherence constraint, for example, we can employ either the Euclidean or the checkerboard distance, respectively, by applying:

$$W_i^t(p_j^t) = s\!\left(k_1, k_2, \left\|p_i^t - p_j^t\right\|\right)$$

or

$$W_i^t(p_j^t) = s\!\left(k_1, k_2, \max\!\left(\left|x_j - x_i\right|,\, \left|y_j - y_i\right|\right)\right) \qquad (3)$$

where p_i^t ≡ (x_i, y_i), p_j^t ≡ (x_j, y_j), and k_1 < k_2 are weighting constants for the sigmoidal function:

$$s(k_1, k_2, d) = \begin{cases} 1 & d < k_1 \\[4pt] \dfrac{1 - \varepsilon - F(k_1, k_2, d)}{1 - 2\varepsilon} & k_1 \le d \le k_2 \\[4pt] 0 & d > k_2 \end{cases} \qquad (4)$$

where 0 < ε << 1 (we use a value of 0.01 in our implementation), and

$$F(k_1, k_2, d) = \frac{1}{1 + e^{-a\left(d - \frac{k_1 + k_2}{2}\right)}}, \qquad a = \frac{2}{k_2 - k_1}\ln\frac{1 - \varepsilon}{\varepsilon}.$$

Hence the vcm implements a voting scheme by which neighborhood point correlations affect the vector v_i^t at point p_i^t. We can convert this into a 'likelihood-map' for v_i^t by normalizing it, subject to a noise threshold T_vcm:

$$\left|vcm(p_i^t)\right| = \frac{vcm(p_i^t)}{peak'\!\left(vcm(p_i^t)\right)} \qquad (5)$$

where peak'(vcm(p_i^t)) is the peak value of vcm(p_i^t) if it is above the threshold T_vcm, and ∞ otherwise. |vcm(p_i^t)| therefore maps the likelihood of terminal points for vectors originating from p_i^t due to neighborhood point correlations.

The computation of the dominant translation field across the entire image and through a video sequence is important for the segmentation of video streams. VCM can compute such a field by setting p_i^t to be some imaginary point and using a uniform weighting function:

$$W_i^t(p_j^t) = 1 \quad \forall j \qquad (6)$$

Hence, a global vcm is computed corresponding to the dominant translation occurring in the frame (e.g., due to camera pan). Furthermore, a vcm can be computed for ANY point in image I_t, whether or not it is an interest point. A vcm built in this way can be used to estimate optical flow at any point, so a dense optic flow field can be computed.
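The sketch below illustrates the spatial coherence voting of equations 2, 4 and 5 for a single interest point. It is a simplified serial illustration; the correlation responses are assumed to have been converted to similarity maps (higher values meaning better matches), the proximity constants k1 and k2 are assumptions, and the ncms dictionary is a hypothetical data structure mapping interest points to their maps:

    import numpy as np

    def sigmoid_weight(k1, k2, d, eps=0.01):
        # Piecewise sigmoidal weight s(k1, k2, d) of equation 4: 1 below k1,
        # 0 above k2, with a smooth logistic transition in between.
        if d < k1:
            return 1.0
        if d > k2:
            return 0.0
        a = (2.0 / (k2 - k1)) * np.log((1.0 - eps) / eps)
        F = 1.0 / (1.0 + np.exp(-a * (d - (k1 + k2) / 2.0)))
        return (1.0 - eps - F) / (1.0 - 2.0 * eps)

    def vector_coherence_map(p_i, interest_points, ncms, k1=8.0, k2=64.0,
                             T_vcm=1e-3):
        # Weighted sum of the neighbors' normal correlation maps (equation 2),
        # then peak normalization subject to a noise threshold (equation 5).
        # ncms maps each interest point (an (x, y) tuple) to its correlation
        # response, assumed already converted to a similarity.
        vcm = np.zeros_like(next(iter(ncms.values())), dtype=float)
        for p_j in interest_points:
            d = np.hypot(p_i[0] - p_j[0], p_i[1] - p_j[1])
            w = sigmoid_weight(k1, k2, d)
            if w > 0.0:
                vcm += w * ncms[p_j]
        peak = vcm.max()
        return vcm / peak if peak > T_vcm else np.zeros_like(vcm)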
For vector tracking, however, it is undesirable to select a vector for which there is no local evidence (i.e., no appropriate correlation for p_i^t in image I_{t+δt}). This evidence may be found in the ncm N(p_i^t). To achieve this, we first normalize N(p_i^t) using the sigmoidal function from equation 4:

$$\bar{N}(p_i^t) = 1 - s\!\left(T_w - \delta,\, T_w + \delta,\, N(p_i^t)\right) \qquad (7)$$

where T_w is a threshold and δ controls the steepness of the sigmoidal function. We can then apply a fuzzy-AND operation of |vcm(p_i^t)| and $\bar{N}(p_i^t)$ to obtain a "likelihood-map" for v_i^t with both local and neighborhood support. We can realize this as:

$$L_{spatial}(p_i^t) = \left|vcm(p_i^t)\right| \otimes \bar{N}(p_i^t) \qquad (8)$$

where ⊗ denotes pixel-wise multiplication. L_spatial(p_i^t) is the likelihood-map for v_i^t owing to spatial coherence constraints.

Temporal coherence may be considered next. To enforce a temporal coherence constraint, we employ a piecewise linear dynamic model. We introduce the expected velocity vector $\bar{v}_i^t$ for point p_i^t. We can compute $\bar{v}_i^t$ given the previous vector v_i^{t-δt} and the previous acceleration vector a_i^{t-δt}. Assuming constant acceleration, a_i^t ≈ a_i^{t-δt}, we can estimate:

$$\bar{\mathbf{v}}_i^t = \mathbf{v}_i^{t-\delta t} + \mathbf{a}_i^{t-\delta t}\,\delta t \qquad (9)$$

We have v_i^{t-δt} from the tracking history and we can estimate a_i^{t-δt} = v_i^{t-δt} - v_i^{t-2δt}.
Given an expected vector $\bar{v}_i^t$, we introduce the idea of the scatter template. We make the observation that the larger the expected vector, the larger the region of possible destinations of the real velocity vector. Hence, we apply a scatter template T_i^t centered at the predicted position $\bar{p}_i^{t+\delta t}$. We implement the scatter template T_i^t using the sigmoid function from equation 4. For every point x(k, l) belonging to N(p_i^t), the scatter template is calculated as:

$$T_i^t(k, l) = s\!\left(k_{1t},\, k_{2t},\, \left\|x - \bar{p}_i^{t+\delta t}\right\|\right) \qquad (10)$$

where k_{1t} = f_1(|$\bar{v}_i^t$|) and k_{2t} = f_2(|$\bar{v}_i^t$|) control the steepness of the sigmoid function in response to the expected vector length (the function becomes steeper as the expected velocity decreases).
This scatter template is fuzzy-ANDed with N(p_i^t) to obtain a new temporal ncm N_T(p_i^t):

$$N_T(p_i^t) = N(p_i^t) \otimes T_i^t \qquad (11)$$

where ⊗ denotes pixel-wise multiplication. This applies the highest weight to the area of N(p_i^t) close to $\bar{p}_i^{t+\delta t}$ and suppresses the more distant values. We can compute a temporal vector coherence map vcm_T(p_i^t) for every point p_i^t:

$$vcm_T(p_i^t)[m,n] = \sum_{p_j^t \in P_t} W_i^t(p_j^t)\, N_T(p_j^t)[m,n] \qquad (12)$$

vcm_T(p_i^t) can then be normalized and fuzzy-ANDed with $\bar{N}(p_i^t)$ (in the same way as equations 5 and 8) to obtain the likelihood map L_{s-t}(p_i^t) with both local and neighborhood spatial and temporal support:

$$L_{s\text{-}t}(p_i^t) = \left|vcm_T(p_i^t)\right| \otimes \bar{N}(p_i^t) \qquad (13)$$
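A sketch of the temporal weighting of equations 9 through 11 follows, reusing the sigmoid_weight helper from the spatial coherence sketch above. The functions f1 and f2 that map the expected speed to the sigmoid knees are assumptions, and the ncm grid is indexed by displacement as in the earlier sketches:

    import numpy as np

    def scatter_template(shape, expected_disp, Dx, Dy, eps=0.01):
        # Sigmoidal template centered on the expected displacement
        # (equation 10): weight falls off with distance from the predicted
        # terminal point, more sharply for slow expected motion.
        speed = float(np.hypot(expected_disp[0], expected_disp[1]))
        k1t = 2.0 + 0.5 * speed          # assumed f1(|v|)
        k2t = 6.0 + 1.5 * speed          # assumed f2(|v|)
        T = np.zeros(shape)
        for n in range(shape[0]):
            for m in range(shape[1]):
                dx = (m - Dx) - expected_disp[0]
                dy = (n - Dy) - expected_disp[1]
                T[n, m] = sigmoid_weight(k1t, k2t, float(np.hypot(dx, dy)), eps)
        return T

    def temporal_ncm(ncm, v_prev, v_prev2, Dx, Dy):
        # Constant-acceleration prediction (equation 9), then fuzzy-AND
        # (pixel-wise product, equation 11) of the ncm with the template.
        a_prev = np.subtract(v_prev, v_prev2)
        v_expected = np.add(v_prev, a_prev)
        return ncm * scatter_template(ncm.shape, v_expected, Dx, Dy)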
We may now address the question of boundary conditions and the related question of when to begin and end a vector trace. There are three such boundary conditions: (1) initialization at the first frame, when no vectors exist; (2) the motion detector provides strong evidence for a moving object in a region that does not have an existing vector; and (3) equation 13 yields no suitable match for a point being tracked.
Under the first frame condition, the ncm computation cannot be constrained by vector history. In this case, we may compute the velocities v_i^1 and v_i^2 for the first and second frames using the spatial coherence constraint alone (using equation 8). We can then proceed with the temporal constraint equation for the third frame. Alternatively, we can compute v_i^1 and proceed using the estimate $\bar{v}_i^2$ = v_i^1.
We may now consider the case where evidence exists for the presence of a vector in a new region. In this case, the motion detector (a motion-sensitive edge detector) provides strong evidence for a moving object in a region that does not have an existing vector. This is similar to case 1, except that it applies only to the region of interest, and not to the whole image.
In the case when equation 13 yields no suitable match for a vector being tracked, three situations may be the cause: (1) rapid acceleration/deceleration pushed the point beyond the search region; (2) the point has been occluded; or (3) an error occurred in previous tracking.
Situation (1) is the most common cause for a loss of tracking in this case. We typically work with 30 frame-per-second data, and this sampling rate is insufficient when either the motion is too fast to compute the acceleration (or path curvature) accurately, or if there is an abrupt change in motion. To resolve the problem, we relax the temporal constraint and use
L_spatial(p_i^t) (equation 8). If this yields a new vector, we have to decide whether to continue tracking the point. To do this, we posit that there is a maximal allowable acceleration T_a. If the new vector does not violate this constraint (i.e., |v_i^t - v_i^{t-δt}| < T_a), we proceed with the tracking. If the maximal acceleration condition is violated, we flag the point in the tracking sequence as a point of motion change and proceed as though this were a new motion. In the current implementation, we compute both L_spatial(p_i^t) and L_{s-t}(p_i^t) (using equations 8 and 13, respectively). Currently, we do not try to recover a trace through temporary occlusion.

Once the vector field has been computed, we may cluster the vectors using an interactive clustering algorithm. Vectors are clustered by vector location, direction and magnitude. The importance of each feature used during clustering can be adjusted.
Since W in equation 2 and T of equation 10 may both be precomputed, the ncm is obtained by a regular convolution correlation process, and all other operations are pixelwise image multiplications. This algorithm is easily parallelizable .
The Vector Coherence Mapping algorithm implementation consists of three main parts:

1. Finding feature (interest) points (pixels) on the frames.
2. Computing the movement vectors (optic flow) between the frames.
3. Updating the feature point array maintained for the whole analyzed sequence of frames.

We may compute feature points using one of two extraction approaches. In the first method we use a Sobel operator to estimate local gradients and apply nonmaximal suppression to find locally maximal gradient points. In the second approach we use a detector that emphasizes moving edge points. This operator takes the fuzzy-AND of the normalized spatial and temporal gradients in the video to locate edge points that move. This allows us to focus our computation on image regions where a flow may be detected.
To ensure an even distribution of the flow field across the frame, we subdivide it into 16x16 subwindows and pick two points with the highest gradient from each subwindow (their gradients have to be above a certain threshold) , which gives 600 interest points in our implementation. If a given 16x16 subwindow does not contain any pixel with high enough gradient magnitude, no pixels are chosen from that subwindow. Instead, more pixels are chosen from other subwindows, so the number of interest points remains constant.
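The following sketch illustrates this subwindow-based interest point selection. It is a simplified version that picks the two strongest gradient pixels per 16x16 subwindow above a threshold; the gradient threshold value is an assumption, and the redistribution of points from empty subwindows described above is omitted:

    import numpy as np
    from scipy import ndimage

    def select_interest_points(frame, win=16, per_win=2, grad_thresh=50.0):
        # Sobel gradient magnitude, then the strongest pixels per subwindow.
        gx = ndimage.sobel(frame.astype(float), axis=1)
        gy = ndimage.sobel(frame.astype(float), axis=0)
        grad = np.hypot(gx, gy)
        points = []
        h, w = grad.shape
        for y0 in range(0, h - win + 1, win):
            for x0 in range(0, w - win + 1, win):
                block = grad[y0:y0 + win, x0:x0 + win]
                # Indices of the strongest gradients inside this subwindow.
                top = np.argsort(block, axis=None)[::-1][:per_win]
                for idx in top:
                    by, bx = np.unravel_index(idx, block.shape)
                    if block[by, bx] > grad_thresh:
                        points.append((x0 + bx, y0 + by))
        return points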
We make an assumption that if there is a suitable match on the next frame for a feature point from the current frame, it should also be a feature point. This assumption allows us to speed up our sequential implementation of the VCM algorithm, and ensures that only regions with strong features are tracked. To make matching more robust, we detect up to 10 times more feature points on the next frame (by lowering the minimal allowable gradient threshold during feature extraction stage) .
The initial set (feature point array) of 600 feature points is computed ONLY for the first frame of the analyzed sequence of frames, and the set is then updated during vector tracking. For purposes of maintaining field continuity, at least one permanent feature point array may be maintained over the entire sequence of frames. The array is filled with the initial set of feature points found for the first frame of the sequence. After calculating vcm's for all feature points from the permanent array, the array may be updated as follows:
1. If, for a given point in the permanent array, the vcm produced a valid vector, the point coordinates are updated in the permanent array and the point is marked as CONTINUED. If motion change was detected (as described above), the point is flagged accordingly.

2. If there is no valid vector for a given point p_i^t in the permanent array, the point is substituted with a new interest point taken, if possible, from the 16x16 subwindow of the NEXT frame corresponding to the current frame subwindow to which point p_i^t belongs. If there are no interest points in the corresponding subwindow of the next frame, a new interest point from a random subwindow is chosen. The new interest point is chosen so that two or more points with the same coordinates DO NOT appear in the permanent array (all interest points are unique). This allows the algorithm to maintain a uniform distribution of the interest points over the frame during the whole sequence. New interest points are marked as NEW.

3. If two points p_i^t and p_j^t from the permanent array have the same computed new position (both vectors end at the same point on the next frame), the point with the higher maximal vcm value (total correlation value or hot spot value) is chosen to be updated and remain in the permanent array as a CONTINUED point (these maximal values are saved with the interest points before vcm normalization). If the maximal values are identical, a random point is chosen to remain CONTINUED. The second point is substituted by a new feature point taken from the next frame (as described above).
Our vector computation is based on absolute difference correlation, calculated for small 5x5 neighborhoods around each interest point. The size of the neighborhood was established empirically: 3x3 neighborhoods proved to be too susceptible to noise, and 7x7 neighborhoods did not visibly improve vector field quality, but slowed the computation.

The 5x5 region around each point p_i^t in the current frame is correlated against a 65x49 area of the NEXT frame, centered at the coordinates of p_i^t, as shown in FIG. 22. The resulting 65x49 array serves as N(p_i^t). The hot spot found on N(p_i^t) could be a basis for vector computation, but the vector field obtained is usually noisy (see FIG. 23). This is precisely the result of the ADC process alone.
The vcm for a given feature point is created according to equations 2 and 4. For efficiency, our implementation considers only the N(p_j^t)'s of the points p_j^t within a 65x65 neighborhood of p_i^t when computing vcm(p_i^t). Each vcm is then normalized. The vector v_i^t is computed as the vector starting at the center of the vcm and ending at the coordinates of the peak value of the vcm. If the maximal value in the vcm is smaller than a certain threshold, the hot spot is considered too weak, the whole vcm is reset to 0, the vector related to p_i^t is labeled UNDEFINED, and a new interest point is selected as detailed above.
We also address the problem of "feature drift". Feature drift arises because features in real space do not always appear at integer positions of an image pixel grid. While some attempts to solve this problem by subpixel estimation have been described, we explicitly allow feature locations to vary by integral pixel displacements. We want to avoid assigning a vector v_i^t to p_i^t if it does not correspond to a high correlation in N(p_i^t). Hence, we inspect N(p_i^t) (the ADC response) to see if the value corresponding to the vcm(p_i^t) hot spot is above threshold T_w (see equation 7). Secondly, to improve the tracking accuracy for subsequent frames, we want to ensure that p_i^{t+δt} is also a feature point.

Hence, if p_i^{t+δt} is not a feature point in the next frame, we select the feature point (if one exists) from the 3x3 neighborhood of p_i^{t+δt} with the highest corresponding N(p_i^t) value (which has to be larger than T_w) as the new terminal point of v_i^t. If the above steps do not produce a terminal point that is a feature point with a high enough corresponding N(p_i^t) value, the entry in the feature point array is labeled UNDEFINED, and a substitute point is found.

FIG. 24 shows the VCM algorithm's performance. It shows the same frame as FIG. 23. One can easily see that the vector field is much smoother when computed using vcm's.

Dominant translation may be computed according to equation 6. Since we do not want any individual ncm to dominate the computation, they are normalized before summing. Hence, the dominant motion is computed based on the number of vectors pointing in a certain direction, and not on the quality of match which produced these vectors.
In our implementation, only the ncm's of feature points with valid movement vectors are added to create the global vcm.
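A sketch of the dominant translation computation of equation 6 follows. As in the earlier sketches, each ncm is assumed to have been converted to a similarity response; each one is normalized by its own peak before summing, so that the number of agreeing vectors, not the quality of any single match, drives the result:

    import numpy as np

    def dominant_translation(ncms, Dx, Dy, eps=1e-9):
        # Uniformly weighted global vcm (equation 6): sum the peak-normalized
        # responses of all feature points with valid vectors, then read the
        # dominant displacement off the hot spot.
        global_vcm = np.zeros((2 * Dy + 1, 2 * Dx + 1))
        for ncm in ncms:
            peak = ncm.max()
            if peak > eps:
                global_vcm += ncm / peak
        n, m = np.unravel_index(np.argmax(global_vcm), global_vcm.shape)
        return (m - Dx, n - Dy), global_vcm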
FIG. 25 shows an example image of a global vcm computed for the pan sequence shown in FIG. 26. The hot spot is visible in the lower center of FIG. 25. This corresponds to the global translation vector shown as a stronger line in the middle of the image in FIG. 26. The intensity distribution in FIG. 25 is evidence of an aperture problem existing along the arm and dress boundaries (long edges with relatively low texture). These appear as high intensity ridges in the vcm. In the VCM voting scheme, such edge points still contribute to the correct vector value, and the hot spot remains well defined.
Temporal coherence may now be considered. As discussed above, we compute two likelihood maps for each feature point using equations 8 and 13 to compute the spatial and spatial-temporal likelihood maps, respectively. The fact that the scatter template is applied to ncm's and not only vcm's allows the neighboring points' temporal prediction to affect each other. As a result, a given point's movement history affects predicted positions of its neighbors. This way, when there is a new feature point in some area and this point does not have ANY movement history, its neighbors (through temporally weighted ncm's) can affect the predicted position of that point.
An example of how the temporal prediction mechanism improves the correctness of the vector field computation is illustrated in the vector fields computed for a synthetic frame sequence involving two identical objects. The objects move vertically at constant velocities of 15 pixels/frame in opposite directions. FIG. 27 shows the vector field obtained without temporal prediction, and FIG. 28 shows vector field obtained for the same data with temporal prediction. Without temporal prediction, we can see a lot of false vectors between objects as they pass each other. Temporal prediction solves this problem.
The vectors can be clustered according to three features: origin location, direction and magnitude. The importance of each feature used during clustering can be adjusted. It is also possible to cluster vectors with respect to only one or two of these features. We use a one-pass clustering method. An example of vector clustering is shown in FIG. 29.
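A sketch of such a one-pass clustering over origin location, direction and magnitude, with adjustable feature weights, is given below. The distance measure, the weight values and the merge threshold are assumptions rather than the specific choices of our implementation:

    import numpy as np

    def vector_features(origin, vector):
        # Feature triple used for clustering: location, direction, magnitude.
        return (np.asarray(origin, dtype=float),
                float(np.arctan2(vector[1], vector[0])),
                float(np.hypot(vector[0], vector[1])))

    def cluster_vectors(origins, vectors, w_loc=1.0, w_dir=40.0, w_mag=2.0,
                        merge_thresh=60.0):
        # One-pass clustering: each vector joins the closest existing cluster
        # (weighted distance over the three features) or starts a new one.
        labels, centers = [], []
        for o, v in zip(origins, vectors):
            loc, ang, mag = vector_features(o, v)
            best, best_d = -1, merge_thresh
            for k, (c_loc, c_ang, c_mag) in enumerate(centers):
                d_ang = abs(np.angle(np.exp(1j * (ang - c_ang))))  # wraps to [0, pi]
                d = (w_loc * np.linalg.norm(loc - c_loc)
                     + w_dir * d_ang
                     + w_mag * abs(mag - c_mag))
                if d < best_d:
                    best, best_d = k, d
            if best < 0:
                centers.append((loc, ang, mag))
                labels.append(len(centers) - 1)
            else:
                labels.append(best)
        return labels

Setting one of the weights to zero drops the corresponding feature from the clustering, which realizes the one- or two-feature clustering mentioned above.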
In this section we present the results of video sequence analyses with our implementation of the VCM algorithm. All examples, except FIGs. 27 and 28, are real video sequences. FIG. 23 shows the noisy vector field obtained from the ADC response (the ncm's) alone on a hand motion video sequence. The full VCM computation on the same sequence produced the smooth vector field shown in FIG. 24.
FIG. 26 presents the performance of VCM on a video sequence with an up-panning camera, and where the aperture problem is evident. FIG. 25 shows the global vcm computed on a frame in the sequence. The bold line in the center of FIG. 26 shows the correct image vector corresponding to an upward pan.
FIG. 28 shows the results of the VCM algorithm for a synthetic image sequence in which two identical objects move in opposite directions at 15 pixels per frame. The correct field computed in FIG. 28 shows the efficacy of the temporal coherence constraint. Without this constraint, most of the vectors were produced by false correspondences between the two objects (FIG. 27). This experiment also shows that VCM is not prone to boundary oversmoothing.
FIG. 29 shows the vector fields computed for a video sequence with a down-panning camera and a moving hand. The sequence contains significant motion blur.
The VCM and vector clustering algorithms extracted two distinct vector fields with no visible boundary oversmoothing. In the Jimmy Johnson sequence shown in FIG. 30, the subject is gesturing with both hands and nodding his head. Three distinct motion fields were correctly extracted by the VCM algorithm.
FIGs. 31 and 32 show the efficacy of the VCM algorithm on noisy data. The video sequence was corrupted with uniform random additive noise to give a S/N ratio of 21.7 dB . FIG. 31 shows the result of ADC correlation (i.e., using the ncm's alone). FIG. 32 shows the vector field computed by the VCM algorithm. The difference in vector field quality is easily visible .
FIGs. 33, 34 and 35 show analyses of video sequences with various camera motions. The zoom-out sequence resulted in the anticipated convergent field (FIG. 33). FIG. 34 shows the vector field for a combined up-panning and zooming sequence. FIG. 35 shows the rotating field obtained from a camera rotating about its optical axis.

We presented a parallelizable algorithm that computes coherent vector fields by the application of various coherence constraints. The algorithm features a voting scheme in vector parameter space, making it robust to image noise. The spatial and temporal coherence constraints are applied using a fuzzy image processing technique by which the constraints bias the correlation process. Hence, the algorithm does not require the typical iterative second stage of constraint application. The experimental results presented substantiate the promise of the algorithm. VCM is capable of extracting vector fields from image sequences with significant synthetic and real noise (e.g., motion blur). It produced good results on videos with multiple independent or composite (e.g., moving camera with moving object) motion fields. Our method performs well under aperture problems and permits the extraction of vector fields containing sharp discontinuities with no discernible over-smoothing effects.

The technology described in this patent application facilitates a family of applications that involve the organization and access of video data. The commonality of these applications is that they require segmentation of the video into semantically significant units, the ability to access, annotate, refine, and organize these video segments, and the ability to access the video data segments in an integrated fashion. Following are a number of examples of such applications.
One application of the techniques described above may be to video databases of sporting events . Professionals and serious amateurs of organized sports often need to study video of sporting events.
Basketball teams need to see the tendencies of opposing players in particular game situations. Coaches need to analyze the plays of their own teams. Tennis players want to study the likelihood of an opponent to hit certain strokes from different angles.
Apart from the segmentation of the underlying video using scene change and photography events, one needs to design a set of custom domain event detectors for each sport. We shall consider American football as representative of such applications.
In football, a game may be organized into halves, series and plays. Each series (or drive) may be characterized by the roles of the teams (offense or defense), the distance covered, the time consumed, the number of plays, and the result of the drive. Each play may be described by the kind of play (passing, rushing, or kicking), the outcome, the down, and the distance covered. To obtain such a segmentation, one needs to detect the scrimmages and the direction of play. In this case, the segmentation may be obtained by analysis of the image flow fields produced by an algorithm like our VCM to detect the most atomic of these units (the play). The other units may be built up from these units.
VCM facilitates the application of various constraints in the computation of the vector fields. Apart from the usual spatial and temporal coherence constraints, one may use the likelihood that moving pixels belong to the uniform color of each team. Hence vector fields may be computed for the movement of players on each team. In a scrimmage, the team on offense (apart from the man-in-motion) must be set for a whole second before the snap of the ball. In regular video this is at least 30 frames. The defensive team is permitted to move. Hence the predominance of motion by one color for approximately one second or more will constitute a scrimmage event. This detection may also be used to obtain the direction, duration, and distance of the play. The fact that the ground is green with yard markers will also be useful in this segmentation. Specialized event detectors may be used for kickoffs, and to track the path of the ball in pass plays. What is important is that a set of domain event detectors may be fashioned for such a class of video. The outcome of this detection is a shot-subshot hierarchy reflecting the game, half, series, and play hierarchy of football.

Once the footage has been segmented, our interface permits the refinement of the segmentation using the multiply-linked interface representation. Each play may be annotated, labeled, and modified interactively. Furthermore, since the underlying synchronization of all the interface components is time, the system may handle multiple video viewpoints (e.g., endzone view, blimp downward view, press box view). Each of these views may be displayed in a different keyframe representation window. Here, the multiresolution representation is particularly useful because it optimizes the use of screen real estate, and so permits a user to browse shots from different viewpoints simultaneously. In this case the animation of the keyframes in each MAR is synchronized so that when one selects a keyframe from one window to be the centralized current shot, all MAR representations will centralize the concomitant keyframes. Given sufficient computational resources and screen real estate, one may even play the synchronized video of all viewpoints simultaneously. The same set of interfaces may be used to view and study the resulting organized video.

Another application may be in a court of law.
Video may serve as a legal courtroom archive either in conjunction with or in lieu of stenographically generated records. In this case, the domain events to be detected in the video are the transitions between witnesses and the identity of the speaker (judge, lawyer, or witness). For such an application, one needs only to detect the change of witnesses, and not to recognize the particular witness. This may be implemented simply by detecting episodes in which the witness-box is vacant. A witness-box camera may be set up to capture the vacant witness-box before the proceedings and provide a background template from which occupants may be detected by a simple image subtraction change detection. 'Witness sessions' may be defined as time segments between witness-box vacancies. Witness changes must occur in one of these transitions.

Once the witness sessions have been delineated, we have a focus area in which we may establish witness identity by an algorithm that locates the face and compares facial features. Since we are interested only in witness change, almost any face recognizer will be adequate (all we need is to determine whether the current 'witness session' face is the same as the one in the previous session). A standard first order approach that compares the face width, face length, mouth width, nose width, and the distance between the eyes and the nostrils as a ratio to the eye separation comes immediately to mind. Also, the lawyer may be identified by tracking her in the video. Speaker identification may be achieved by detecting the frequency of lip movements and correlating them with the location of sound RMS power amplitude clusters in the audio track.

In such an application, it is reasonable to expect multiple synchronized video streams. Again, since the underlying thread in our technology is the time synchrony of the video components, we can utilize all the same interaction components as in the previous example. The multiple video streams may be represented in the interface as different keyframe windows. This allows us to organize, annotate and access the multiple video streams in the semantic structure of the courtroom proceedings. This may be a hierarchy of the (possibly cross-session) testimonies of particular witnesses, direct and cross examination, witness sessions, question and witness response alternations, and individual utterances by courtroom participants. Hence the inherent structure, hierarchy, and logic of the proceedings may be accessible from the video content.
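As an illustration of the witness-box vacancy test described above, the following sketch flags occupancy by background subtraction and groups occupied frames into witness sessions. The difference threshold and the fraction of changed pixels that counts as 'occupied' are assumptions:

    import numpy as np

    def witness_box_occupied(frame, background, diff_thresh=25, area_frac=0.05):
        # Compare the witness-box region of the current frame against the
        # empty-box background template; the box is considered occupied when
        # enough pixels differ significantly from the template.
        diff = np.abs(frame.astype(int) - background.astype(int))
        changed = (diff > diff_thresh).mean()
        return changed > area_frac

    def witness_sessions(occupancy_flags):
        # Segment the proceedings into 'witness sessions': maximal runs of
        # frames in which the witness-box is occupied.
        sessions, start = [], None
        for k, occ in enumerate(occupancy_flags):
            if occ and start is None:
                start = k
            elif not occ and start is not None:
                sessions.append((start, k - 1))
                start = None
        if start is not None:
            sessions.append((start, len(occupancy_flags) - 1))
        return sessions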
The techniques presented above may also be used for organizing personal video. We anticipate the day when digital video will be as common as current 35mm film photography. We envision the day when individuals will be able either to process their video at home, or to take their media into processing outlets at which they may select various optional domain detectors.
Home video is another application. One of the greatest impediments to the wider use of home video is the difficulty of accessing potentially many hours of video in a convenient way. Apart from the standard scene change and photography events, a particularly useful domain event in home video is the 'new actor detector'. After accounting for camera motion (for example by using image vectors extracted by VCM or the vector fields contained in the standard MPEG encoding), the most significant moving objects in most typical home videos are people. The same head-detector described earlier for witness detection in our courtroom example may be used to determine whether a new vector cluster entering the scene represents a new individual. Home videos can then be browsed to find people by viewing the new-actor keyframes the way one browses a photograph album. One may, in fact, think of a standard package to perform such domain processing as a 'drama' domain where new scenes and the entrances of actors are significant events.
Once the initial segmentation is obtained, one may use our multiply-linked interaction technology to browse and annotate the video, and even to modify the structure of the segmentation. Travel logs may also benefit from the invention.
Personal travel logs share many of the characteristics of home video. The basic drama model works well for this domain. In addition, domain detectors that characterize scenes as 'indoor' or 'outdoor' will be useful. One may do this by detecting the skyline in outdoor scenes and the predominance of vertical edges in indoor scenes. The latter is based on the observation that indoor scenes are typically shot in two-point perspective, in which vertical lines stay parallel and vertical, and other parallel edges typically project to a vanishing point. As before, our multiply-linked interaction technology permits the use of the segmented video.
The techniques described herein may also be applied to special event videos. Certain special event videos are common enough to warrant the development of dedicated domain detectors for them. An example might be that of formal weddings. Wedding videos are taken by either professional or amateur photographers, and one can imagine the application of our technology to produce annotated keepsakes of such video.

The techniques described above also have application in the business environment. A variety of business applications could benefit from our technology. We describe two kinds of video as being representative of these applications. First, business meetings could benefit from video processing. Business meetings (when executed properly) exhibit an organizational structure that may be preserved and accessed by our technology. Depending on the kind of meeting being considered, video segments may be classified as moderator-speaking, member-speaking, voting and presentation. If the moderator speaks from a podium, a camera may be trained on her to locate all moderator utterances. These may be correlated with the RMS power peaks in the audio stream. The same process will detect presentations to the membership. Members who speak or rise to speak may be detected by analyzing the image flow fields or by image change detection, and correlated with the audio RMS peaks as discussed earlier. If the moderator sits with the participants at a table, she will be detected as a speaking member. Since the location of each member is a by-product of speaker detection, it is trivial to label the moderator once the video has been segmented.
This process will provide a speaker-wise decomposition of the meeting that may be presented in our multiply-linked interface technology. A user may enhance the structure in our hierarchical editor and annotator to group sub-shots under agenda items, proposals and discussion, and amendments and their discussion. If multiple cameras are used, these may be accessed as synchronized multi-view databases as in our previous examples. The product will be a set of video meeting minutes that can be reviewed using our interaction technology. Private copies of these minutes may be further organized and annotated by individuals for their own reference.

Certain areas of marketing may also benefit under the embodiments described above. Mirroring the success of desktop publishing in the 1980s, we anticipate immense potential in the production of marketing and business videos. As with every application domain, a specialized set of domain detectors will enhance the power of our technology in a particular application. Consider the example of real-estate marketing videos. One might imagine the use of such video to present homes for sale. While this is currently feasible for multi-million dollar homes using custom hand editing, the market potential is in widespread use of such video-based marketing. Scene change events are important for such applications as photographers highlight various features of the home. Photography events such as slow pans are typical of the sweep of the camera used to obtain video panoramas. Using the indoor-outdoor detector described earlier, one may label shots as being inside or outside shots.
Once the raw video has been segmented into shots, a marketer may use our interaction technology to further organize and annotate the video. These video segments may further be resequenced to produce a marketing video. Home buyers may view a home of interest using our multiply-linked interaction technology to see different aspects of the home. This random-access capability will make home comparison faster and more effective.

A specific embodiment of a method and apparatus for providing content-based video access according to the present invention has been described for the purpose of illustrating the manner in which the invention is made and used. It should be understood that the implementation of other variations and modifications of the invention and its various aspects will be apparent to one skilled in the art, and that the invention is not limited by the specific embodiments described. Therefore, it is contemplated to cover by the present invention any and all modifications, variations, or equivalents that fall within the true spirit and scope of the basic underlying principles disclosed and claimed herein.

Claims

1. A method of accessing a video segment of a plurality of video frames, such method comprising the steps of: segmenting the plurality of video frames into a plurality of video segments based upon semantic content; designating a frame of each video segment of the plurality of segments as a keyframe and as an index to the segment; ordering the keyframes ; placing at least a portion of the ordered keyframes in an ordered display with a predetermined location of the ordered display defining a selected location; designating a keyframe as a selected keyframe; and precessing the ordered keyframes through the ordered display until the selected keyframe occupies the selected location.
2. The method of accessing a video segment as in claim
1 further comprising storing the plurality of video frames in a memory.
3. The method of accessing a video segment as in claim
2 further comprising storing an identifier of each keyframe of the plurality of keyframes, where each location of the shift register corresponds to a location of the ordered display.
4. The method of accessing a video segment as in claim 1 further comprising storing a pointer along with the identifier of the keyframe in a location of the shift register, the pointer identifying a location in the memory of the corresponding video segment of the keyframe .
5. The method of accessing a video segment as in claim 1 further comprising increasing a resolution and size of the designated keyframe in the designated location of the ordered display.
6. The method of accessing a video segment as in claim 1 further comprising designating at least one keyframe as an apex of a hierarchical subgroup.
7. The method of accessing a video segment as in claim 6 further comprising designating at least one frame of the video segment of the at least one keyframe as a second order keyframe in a second order subgroup below the at least one key frame.
8. The method of accessing a video segment as in claim 7 further comprising displaying the hierarchical subgroup of at least one keyframe in an overlapping window of the recirculating display.
9. The method of accessing a video segment as in claim 1 further comprising playing the video segment of the designated keyframe in place of the designated keyframe in the designated location of the recirculating display.
10. The method of accessing a video segment as in claim 9 further comprising advancing the keyframe in the designated location to a next keyframe in the ordered group of keyframes when the last frame of the video segment playing in the designated location is reached.
11. The method of accessing a video segment as in claim 10 further comprising playing the video segment of the next keyframe in the designated location of the ordered display.
12. The method of accessing a video segment as in claim 9 further comprising designating another keyframe for playing in the designated location.
13. The method of accessing a video segment as in claim 12 further comprising precessing the other keyframe through the ordered display into the designated location and playing the segment associated with the other keyframe in the designated location.
14. The method of accessing a video segment as in claim 1 wherein the step of segmenting the video frames based upon semantic content further comprises detecting a vector cluster in a predetermined location of a video frame of the plurality of video frames as a basis of segmentation .
15. The method of accessing a video segment as in claim 14 wherein the step of detecting a vector cluster further comprises determining that the vector cluster is substantially a single distinct vector field.
16. The method of accessing a video segment as in claim 14 wherein the step of detecting a vector cluster further comprises comparing the vector cluster with a threshold value.
17. The method of accessing a video segment as in claim 16 wherein the step of detecting the vector cluster as a basis of segmentation further comprises segmenting a video segment of the plurality of segments into a plurality of hierarchical subgroups by comparing a vector cluster with a subgroup threshold value.

18. The method of accessing a video segment as in claim 16 wherein the step of detecting the vector cluster as a basis of segmentation further comprises segmenting a video segment of the plurality of segments into a plurality of hierarchical subgroups based upon a field direction of the vector cluster.

19. The method of accessing a video segment as in claim 16 wherein the step of detecting the vector cluster as a basis of segmentation further comprises segmenting a video segment of the plurality of segments into a plurality of hierarchical subgroups based upon detecting a field direction of the vector cluster in a predetermined direction.
20. The method of accessing a video segment as in claim 1 wherein the step of segmenting the video frames based upon semantic content further comprises determining a vector coherence map.
21. The method of accessing a video segment as in claim 20 wherein the step of determining a vector coherence map further comprises calculating a dominant flow field.
22. The method of accessing a video segment as in claim 20 wherein the step of determining a vector coherence map further comprises estimating an optical flow at any point of the vector coherence map.
23. The method of accessing a video segment as in claim 20 wherein the step of determining a vector coherence map further comprises normalizing the vector coherence map.
24. The method of accessing a video segment as in claim
23 wherein the step of normalizing the vector coherence map further comprises determining a velocity likelihood map by fuzzy ANDing elements of the vector coherence map with elements of the normalized vector coherence map.
25. The method of accessing a video segment as in claim
24 wherein the step of determining a velocity likelihood map further comprises comparing the velocity likelihood maps of temporally adjacent video frames to identify movement vectors.
26. The method of accessing a video segment as in claim
25 wherein the step of identifying movement vectors further comprises clustering the movement vectors by location.
27. The method of accessing a video segment as in claim 26 wherein the step of clustering the movement vectors by location further comprises finding feature points on the frames .
28. The method of accessing a video segment as in claim 26 wherein the step of clustering the movement vectors by location further comprises computing movement vectors between frames .
29. The method of accessing a video segment as in claim 26 wherein the step of clustering the movement vectors by location further comprises updating a feature point array .
30. The method of accessing a video segment as in claim 1 wherein the step of segmenting the plurality of video frames further comprises detecting motion among elements of the sequence of video frames.
31. A method of detecting motion in a sequence of video frames, such method comprising the steps of: defining a normal correlation map for each video frame of the sequence of video frames using a convolution correlation process; and determining a vector coherence map among the frames of the sequence of frames of the sequence of video frames based upon the normal correlation map of the frame .
32. The method of detecting motion as in claim 31 wherein the step of determining the vector coherence map further comprises finding feature points on each video frame .
33. The method of detecting motion as in claim 32 wherein the step of determining the vector coherence map further comprises computing vector movements between successive video frames.
34. The method of detecting motion as in claim 33 wherein the step of determining the vector coherence map further comprises updating a feature point array of an analyzed sequence of frames .
35. The method of detecting motion as in claim 31 wherein the step of defining a normal correlation map further comprises for each point p of each frame of the sequence of frames determining a correlation response of image elements I for a region around the point p.
36. The method of detecting motion as in claim 35 wherein the step of determining the correlation response further comprises using absolute difference correlation.
37. The method of detecting motion as in claim 31 wherein the step of determining the vector coherence map further comprises imposing spatial coherence constraints on each element of the normal correlation map by applying weighting constants of a sigmoidal function to the element.
38. The method of detecting motion as in claim 37 wherein the step of determining the vector coherence map further comprises normalizing each element by dividing the element by a noise threshold to provide a likelihood vector coherence map.
39. The method of detecting motion as in claim 32 wherein the step of defining feature points further comprises using a Sobel operator to estimate local gradients .
40. The method of detecting motion as in claim 39 wherein the step of using a Sobel operator to estimate local gradients further comprises applying nonmaximal suppression to the local gradients to find locally maximal gradient points as feature points.
41. The method of detecting motion as in claim 32 wherein the step of defining feature points further comprises taking a fuzzy-AND of normalized spatial and temporal gradients to locate edge points that move as feature points.
42. The method of detecting motion as in claim 32 wherein the step of computing vector movements between successive video frames further comprises subdividing each frame into subwindows and excluding any subwindows with pixel values which exceed a threshold.
43. Apparatus for accessing a video segment of a plurality of video frames, such apparatus comprising: means for segmenting the plurality of video frames into a plurality of video segments based upon semantic content; means for designating a frame of each video segment of the plurality of segments as a keyframe and as an index to the segment; means for ordering the keyframes; means for placing at least a portion of the ordered keyframes in an ordered display with a predetermined location of the ordered display defining a selected location; means for designating a keyframe as a selected keyframe; and means for precessing the ordered keyframes through the ordered display until the selected keyframe occupies the selected location.
44. The apparatus for accessing a video segment as in claim 43 further comprising means for storing the plurality of video frames in a memory.
45. The apparatus for accessing a video segment as in claim 44 further comprising means for storing an identifier of each keyframe of the plurality of keyframes in a shift register, where each location of the shift register corresponds to a location of the ordered display.
46. The apparatus for accessing a video segment as in claim 43 further comprising means for storing a pointer along with the identifier of the keyframe in a location of the shift register, the pointer identifying a location in the memory of the corresponding video segment of the keyframe.
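As a data-structure illustration of claims 43 through 46, and not the claimed apparatus, the sketch below uses a deque in the role of the shift register, storing in each cell a keyframe identifier together with a pointer (here a memory offset) to the corresponding video segment, and precesses the cells until a selected keyframe occupies the selected location. All names are assumptions.

```python
# Illustrative data-structure sketch only.
from collections import deque

class KeyframeCarousel:
    def __init__(self, entries, selected_slot=0):
        # entries: list of (keyframe_id, segment_offset) in display order
        self.cells = deque(entries)
        self.selected_slot = selected_slot      # the "selected location" of the display

    def visible(self, n):
        """The first n cells are the portion shown in the ordered display."""
        return list(self.cells)[:n]

    def precess_to(self, keyframe_id):
        """Rotate (precess) cells until keyframe_id occupies the selected slot."""
        ids = [kf for kf, _ in self.cells]
        steps = (ids.index(keyframe_id) - self.selected_slot) % len(self.cells)
        self.cells.rotate(-steps)
        return self.cells[self.selected_slot]   # (keyframe_id, segment_offset)

# Usage sketch:
# carousel = KeyframeCarousel([("kf3", 0), ("kf7", 1024), ("kf9", 4096)])
# kf, offset = carousel.precess_to("kf9")   # offset locates the segment in memory
```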
47. The apparatus for accessing a video segment as in claim 43 further comprising means for increasing a resolution and size of the designated keyframe in the designated location of the ordered display.
48. The apparatus for accessing a video segment as in claim 43 further comprising means for designating at least one keyframe as an apex of a hierarchical subgroup.
49. The apparatus for accessing a video segment as in claim 48 further comprising means for designating at least one frame of the video segment of the at least one keyframe as a second order keyframe in a second order subgroup below the at least one keyframe.
50. The apparatus for accessing a video segment as in claim 49 further comprising means for displaying the hierarchical subgroup of at least one keyframe in an overlapping window of the recirculating display.
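For claims 48 through 50, a small sketch of a hierarchical keyframe subgroup: an apex keyframe with second-order keyframes drawn from its own segment, which a display layer could render in an overlapping window. Field names and the nesting depth are assumptions.

```python
# Illustrative sketch only; field names are assumptions.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class KeyframeNode:
    keyframe_id: str
    segment_range: Tuple[int, int]                 # (first_frame, last_frame) of the segment
    children: List["KeyframeNode"] = field(default_factory=list)

    def add_second_order(self, child: "KeyframeNode"):
        """Attach a second-order keyframe below this apex keyframe."""
        self.children.append(child)

# apex = KeyframeNode("kf7", (300, 540))
# apex.add_second_order(KeyframeNode("kf7a", (300, 410)))
# apex.add_second_order(KeyframeNode("kf7b", (411, 540)))
# apex.children would then be rendered in an overlapping window of the display.
```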
51. The apparatus for accessing a video segment as in claim 43 further comprising means for playing the video segment of the designated keyframe in place of the designated keyframe in the designated location of the recirculating display.
52. The apparatus for accessing a video segment as in claim 51 further comprising means for advancing the keyframe in the designated location to a next keyframe in the ordered group of keyframes when the last frame of the video segment playing in the designated location is reached.
53. The apparatus for accessing a video segment as in claim 52 further comprising means for playing the video segment of the next keyframe in the designated location of the ordered display.
54. The apparatus for accessing a video segment as in claim 51 further comprising means for designating another keyframe for playing in the designated location.
55. The apparatus for accessing a video segment as in claim 54 further comprising means for precessing the other keyframe through the ordered display into the designated location and playing the segment associated with the other keyframe in the designated location.
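For claims 51 through 55, a sketch of playback in the designated location, building on the KeyframeCarousel sketch above: the segment behind the keyframe in the selected slot is played and, when its last frame is reached, the next keyframe is advanced into that slot. The play_segment callback is hypothetical, not an interface from the specification.

```python
# Illustrative sketch only; play_segment is a hypothetical callback that
# returns True when the last frame of the segment has been reached.
def play_in_designated_slot(carousel, play_segment):
    while True:                                      # cycles until playback is interrupted
        keyframe_id, segment_offset = carousel.cells[carousel.selected_slot]
        finished = play_segment(segment_offset)
        if not finished:
            break                                    # user stopped or selected elsewhere
        carousel.cells.rotate(-1)                    # advance the next keyframe into the slot

# Selecting a different keyframe simply precesses it into the slot first:
# carousel.precess_to("kf3"); play_in_designated_slot(carousel, play_segment)
```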
56. The apparatus for accessing a video segment as in claim 43 wherein the means for segmenting the video frames based upon semantic content further comprises means for detecting a vector cluster in a predetermined location of a video frame of the plurality of video frames as a basis of segmentation.
57. The apparatus for accessing a video segment as in claim 56 wherein the means for detecting a vector cluster further comprises means for determining that the vector cluster is substantially a single distinct vector field.
58. The apparatus for accessing a video segment as in claim 56 wherein the means for detecting a vector cluster further comprises means for comparing the vector cluster with a threshold value.
59. The apparatus for accessing a video segment as in claim 58 wherein the means for detecting the vector cluster as a basis of segmentation further comprises means for segmenting a video segment of the plurality of segments into a plurality of hierarchical subgroups by comparing a vector cluster with a subgroup threshold value.
60. The apparatus for accessing a video segment as in claim 58 wherein the means for detecting the vector cluster as a basis of segmentation further comprises means for segmenting a video segment of the plurality of segments into a plurality of hierarchical subgroups based upon a field direction of the vector cluster.
61. The apparatus for accessing a video segment as in claim 58 wherein the means for detecting the vector cluster as a basis of segmentation further comprises means for segmenting a video segment of the plurality of segments into a plurality of hierarchical subgroups based upon detecting a field direction of the vector cluster in a predetermined direction.
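For claims 56 through 61, a sketch of how a cluster of movement vectors might drive segmentation decisions: a spread test for a single distinct vector field, a magnitude test against a threshold, and a direction test against a predetermined direction. The thresholds and the direction are illustrative assumptions.

```python
# Illustrative sketch only; all thresholds and the direction are assumptions.
import numpy as np

def cluster_summary(vectors):
    """vectors: (n, 2) array of (dy, dx) movement vectors in one cluster."""
    mean = vectors.mean(axis=0)
    spread = np.linalg.norm(vectors - mean, axis=1).mean()
    return mean, spread

def is_single_coherent_field(vectors, spread_limit=2.0):
    _, spread = cluster_summary(vectors)
    return spread < spread_limit                 # roughly one distinct vector field

def boundary_by_magnitude(vectors, threshold=5.0):
    mean, _ = cluster_summary(vectors)
    return np.linalg.norm(mean) > threshold      # strong cluster => segment cut here

def boundary_by_direction(vectors, direction=(0.0, 1.0), min_cos=0.9):
    mean, _ = cluster_summary(vectors)
    d = np.asarray(direction, float)
    cos = mean @ d / (np.linalg.norm(mean) * np.linalg.norm(d) + 1e-9)
    return cos > min_cos                         # field points in the predetermined direction
```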
62. The apparatus for accessing a video segment as in claim 43 wherein the means for segmenting the video frames based upon semantic content further comprises means for determining a vector coherence map.
63. The apparatus for accessing a video segment as in claim 62 wherein the means for determining a vector coherence map further comprises means for calculating a dominant flow field.
64. The apparatus for accessing a video segment as in claim 62 wherein the means for determining a vector coherence map further comprises means for estimating an optical flow at any point of the vector coherence map.
65. The apparatus for accessing a video segment as in claim 62 wherein the means for determining a vector coherence map further comprises means for normalizing the vector coherence map.
66. The apparatus for accessing a video segment as in claim 65 wherein the means for normalizing the vector coherence map further comprises means for determining a velocity likelihood map by fuzzy ANDing elements of the vector coherence map with elements of the normalized vector coherence map.
67. The apparatus for accessing a video segment as in claim 66 wherein the means for determining a velocity likelihood map further comprises means for comparing the velocity likelihood maps of temporally adjacent video frames to identify movement vectors.
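For claims 63 through 67, a very loose sketch in which the fuzzy AND of the vector coherence map with its normalized version is taken as an element-wise minimum, and movement vectors are kept where the resulting likelihood persists across temporally adjacent frames; the persistence rule and cutoff are assumptions rather than anything stated in the claims.

```python
# Illustrative sketch only; array shapes and the persistence rule are assumptions.
import numpy as np

def velocity_likelihood(coherence):
    """Fuzzy AND (element-wise minimum) of the coherence map with its
    max-normalized version."""
    norm = coherence / max(coherence.max(), 1e-9)
    return np.minimum(coherence, norm)

def persistent_movement_vectors(like_t, like_t1, vecs_t, cutoff=0.5):
    """Keep vectors at pixels whose likelihood stays high in both adjacent frames.
    vecs_t: (h, w, 2) candidate movement vectors for frame t."""
    support = np.minimum(like_t / max(like_t.max(), 1e-9),
                         like_t1 / max(like_t1.max(), 1e-9))
    rows, cols = np.nonzero(support > cutoff)
    return [((r, c), tuple(vecs_t[r, c])) for r, c in zip(rows, cols)]
```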
68. The apparatus for accessing a video segment as in claim 67 wherein the means for identifying movement vectors further comprises means for clustering the movement vectors by location.
69. The apparatus for accessing a video segment as in claim 68 wherein the means for clustering the movement vectors by location further comprises means for finding feature points on the frames.
70. The apparatus for accessing a video segment as in claim 68 wherein the means for clustering the movement vectors by location further comprises means for computing movement vectors between frames.
71. The apparatus for accessing a video segment as in claim 68 wherein the means for clustering the movement vectors by location further comprises means for updating a feature point array.
72. The apparatus for accessing a video segment as in claim 43 wherein the means for segmenting the plurality of video frames further comprises means for detecting motion among elements of the sequence of video frames.
73. Apparatus for detecting motion in a sequence of video frames, such apparatus comprising: means for defining a normal correlation map for each video frame of the sequence of video frames using a convolution correlation process; and means for determining a vector coherence map among the frames of the sequence of video frames based upon the normal correlation map of the frame.
74. The apparatus for detecting motion as in claim 73 wherein the means for determining the vector coherence map further comprises means for finding feature points on each video frame.
75. The apparatus for detecting motion as in claim 74 wherein the means for determining the vector coherence map further comprises means for computing vector movements between successive video frames.
76. The apparatus for detecting motion as in claim 75 wherein the means for determining the vector coherence map further comprises means for updating a feature point array of an analyzed sequence of frames.
77. The apparatus for detecting motion as in claim 73 wherein the means for defining a normal correlation map further comprises, for each point p of each frame of the sequence of frames, means for determining a correlation response of image elements I for a region around the point p.
78. The apparatus for detecting motion as in claim 77 wherein the means for determining the correlation response further comprises means for using absolute difference correlation.
79. The apparatus for detecting motion as in claim 73 wherein the means for determining the vector coherence map further comprises means for imposing spatial coherence constraints on each element of the normal correlation map by applying weighting constants of a sigmoidal function to the element.
80. The apparatus for detecting motion as in claim 79 wherein the means for determining the vector coherence map further comprises means for normalizing each element by dividing the element by a noise threshold to provide a likelihood vector coherence map.
81. The apparatus for detecting motion as in claim 73 wherein the means for defining feature points further comprises means for using a Sobel operator to estimate local gradients.
82. The apparatus for detecting motion as in claim 81 wherein the means for using a Sobel operator to estimate local gradients further comprises means for applying nonmaximal suppression to the local gradients to find locally maximal gradient points as feature points.
83. The apparatus for detecting motion as in claim 74 wherein the means for defining feature points further comprises means for taking a fuzzy-AND of normalized spatial and temporal gradients to locate edge points that move as feature points.
84. The apparatus for detecting motion as in claim 74 wherein the means for computing vector movements between successive video frames further comprises means for subdividing each frame into subwindows and excluding any subwindows with pixel values which exceed a threshold.
PCT/US1998/015063 1997-07-22 1998-07-22 Content-based video access WO1999005865A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US5335397P 1997-07-22 1997-07-22
US60/053,353 1997-07-22

Publications (1)

Publication Number Publication Date
WO1999005865A1 true WO1999005865A1 (en) 1999-02-04

Family

ID=21983629

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1998/015063 WO1999005865A1 (en) 1997-07-22 1998-07-22 Content-based video access

Country Status (1)

Country Link
WO (1) WO1999005865A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4612569A (en) * 1983-01-24 1986-09-16 Asaka Co., Ltd. Video editing viewer
US4698664A (en) * 1985-03-04 1987-10-06 Apert-Herzog Corporation Audio-visual monitoring system
US5179449A (en) * 1989-01-11 1993-01-12 Kabushiki Kaisha Toshiba Scene boundary detecting apparatus
US5537530A (en) * 1992-08-12 1996-07-16 International Business Machines Corporation Video editing by locating segment boundaries and reordering segment sequences
JPH08163479A (en) * 1994-11-30 1996-06-21 Canon Inc Method and device for video image retrieval
US5778108A (en) * 1996-06-07 1998-07-07 Electronic Data Systems Corporation Method and system for detecting transitional markers such as uniform fields in a video signal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
UEDA H., MIYATAKE T., YOSHIZAWA S.: "IMPACT: An interactive natural-motion-picture dedicated multimedia authoring system", Human Factors in Computing Systems: Reaching Through Technology, CHI '91 Conference Proceedings, 27 April 1991 (1991-04-27), pages 343-350, XP002914568 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6340971B1 (en) * 1997-02-03 2002-01-22 U.S. Philips Corporation Method and device for keyframe-based video displaying using a video cursor frame in a multikeyframe screen
US6516090B1 (en) 1998-05-07 2003-02-04 Canon Kabushiki Kaisha Automated video interpretation system
EP0955599A3 (en) * 1998-05-07 2000-02-23 Canon Kabushiki Kaisha Automated video interpretation system
EP0955599A2 (en) * 1998-05-07 1999-11-10 Canon Kabushiki Kaisha Automated video interpretation system
WO2002086897A1 (en) * 2001-04-19 2002-10-31 Koninklijke Philips Electronics N.V. Keyframe-based playback position selection method and system
EP1251515A1 (en) * 2001-04-19 2002-10-23 Koninklijke Philips Electronics N.V. Method and system for selecting a position in an image sequence
CN100346420C (en) * 2001-04-19 2007-10-31 皇家菲利浦电子有限公司 Keyframe-based playback position selection method and system
US9779774B1 (en) 2016-07-22 2017-10-03 Microsoft Technology Licensing, Llc Generating semantically meaningful video loops in a cinemagraph
US11961044B2 (en) 2019-03-27 2024-04-16 On Time Staffing, Inc. Behavioral data analysis and scoring system
US10963841B2 (en) 2019-03-27 2021-03-30 On Time Staffing Inc. Employment candidate empathy scoring system
US10728443B1 (en) 2019-03-27 2020-07-28 On Time Staffing Inc. Automatic camera angle switching to create combined audiovisual file
US11457140B2 (en) 2019-03-27 2022-09-27 On Time Staffing Inc. Automatic camera angle switching in response to low noise audio to create combined audiovisual file
US11863858B2 (en) 2019-03-27 2024-01-02 On Time Staffing Inc. Automatic camera angle switching in response to low noise audio to create combined audiovisual file
US11127232B2 (en) 2019-11-26 2021-09-21 On Time Staffing Inc. Multi-camera, multi-sensor panel data extraction system and method
US11783645B2 (en) 2019-11-26 2023-10-10 On Time Staffing Inc. Multi-camera, multi-sensor panel data extraction system and method
US11023735B1 (en) 2020-04-02 2021-06-01 On Time Staffing, Inc. Automatic versioning of video presentations
US11184578B2 (en) 2020-04-02 2021-11-23 On Time Staffing, Inc. Audio and video recording and streaming in a three-computer booth
US11636678B2 (en) 2020-04-02 2023-04-25 On Time Staffing Inc. Audio and video recording and streaming in a three-computer booth
US11861904B2 (en) 2020-04-02 2024-01-02 On Time Staffing, Inc. Automatic versioning of video presentations
US11144882B1 (en) 2020-09-18 2021-10-12 On Time Staffing Inc. Systems and methods for evaluating actions over a computer network and establishing live network connections
US11720859B2 (en) 2020-09-18 2023-08-08 On Time Staffing Inc. Systems and methods for evaluating actions over a computer network and establishing live network connections
US11727040B2 (en) 2021-08-06 2023-08-15 On Time Staffing, Inc. Monitoring third-party forum contributions to improve searching through time-to-live data assignments
US11423071B1 (en) 2021-08-31 2022-08-23 On Time Staffing, Inc. Candidate data ranking method using previously selected candidate data
US11966429B2 (en) 2021-10-13 2024-04-23 On Time Staffing Inc. Monitoring third-party forum contributions to improve searching through time-to-live data assignments
US11907652B2 (en) 2022-06-02 2024-02-20 On Time Staffing, Inc. User interface and systems for document creation

Similar Documents

Publication Publication Date Title
Rav-Acha et al. Making a long video short: Dynamic video synopsis
Luo et al. Towards extracting semantically meaningful key frames from personal video clips: from humans to computers
Ju et al. Summarization of videotaped presentations: automatic analysis of motion and gesture
Pritch et al. Nonchronological video synopsis and indexing
Borgo et al. State of the art report on video‐based graphics and video visualization
EP1955205B1 (en) Method and system for producing a video synopsis
US8818038B2 (en) Method and system for video indexing and video synopsis
US7594177B2 (en) System and method for video browsing using a cluster index
Pritch et al. Webcam synopsis: Peeking around the world
Chen et al. Personalized production of basketball videos from multi-sensored data under limited display resolution
CA2761187C (en) Systems and methods for the autonomous production of videos from multi-sensored data
US8301669B2 (en) Concurrent presentation of video segments enabling rapid video file comprehension
Chen et al. Visual storylines: Semantic visualization of movie sequence
US7904815B2 (en) Content-based dynamic photo-to-video methods and apparatuses
JP2009539273A (en) Extract keyframe candidates from video clips
Li et al. Structuring lecture videos by automatic projection screen localization and analysis
Borgo et al. A survey on video-based graphics and video visualization.
WO1999005865A1 (en) Content-based video access
Zhang Content-based video browsing and retrieval
Wang et al. Taxonomy of directing semantics for film shot classification
Rui et al. A unified framework for video browsing and retrieval
Aner-Wolf et al. Video summaries and cross-referencing through mosaic-based representation
Niu et al. Real-time generation of personalized home video summaries on mobile devices
Zhang et al. AI video editing: A survey
Zhang Video content analysis and retrieval

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP SG US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA