WO2006092752A2 - Creating a summarized overview of a video sequence - Google Patents

Creating a summarized overview of a video sequence

Info

Publication number
WO2006092752A2
Authority
WO
WIPO (PCT)
Prior art keywords
single image, images, content, information, image
Application number
PCT/IB2006/050583
Other languages
French (fr)
Other versions
WO2006092752A3 (en)
Inventor
Mauro Barbieri
Gerhardus E. Mekenkamp
Original Assignee
Koninklijke Philips Electronics N.V.
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Publication of WO2006092752A2 publication Critical patent/WO2006092752A2/en
Publication of WO2006092752A3 publication Critical patent/WO2006092752A3/en


Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 - Querying
    • G06F16/738 - Presentation of query results
    • G06F16/739 - Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/102 - Programmed access in sequence to addressed parts of tracks of operating record carriers
    • G11B27/105 - Programmed access in sequence to addressed parts of tracks of operating record carriers of operating discs

Definitions

  • A further aspect of the invention is a computer program and a computer program product for automatically assembling a summarized overview of a plurality of images into one single image, the program comprising instructions operable to cause a processor to perform a content analysis of the plurality of images to provide at least object and content information, select from the object and content information at least one object, and render the at least one selected object with at least one background image into one single image to obtain the summarized overview of the video sequence within the one single image.
  • Another aspect of the invention is the use, within consumer electronics, of a method for automatically assembling a summarized overview of a plurality of images into one single image, with performing a content analysis of the plurality of images providing at least object and content information, selecting from the object and content information at least one object, and rendering the at least one selected object with at least one background image into one single image obtaining the summarized overview of the video sequence within the one single image.
  • Fig. 1a, 1b: a presentation of images within a video manga
  • Fig. 2: a single image illustrating the contents of a video sequence
  • Fig. 3: a flowchart of an inventive method
  • Fig. 4: a system arranged for implementing the inventive method.
  • Video is an information-intensive medium. To get an overview of a video document quickly, users must either view a large portion of the video or consult some sort of summary. Techniques for automatically creating pictorial summaries of videos using automatic content analysis are known. While any collection of video keyframes can be considered a summary, a meaningful and concise representation of the video can be acquired automatically by choosing only the most salient images and efficiently packing them into a pictorial summary.
  • Fig. 1a illustrates different frames 1-5, each having different content.
  • The size of the frames 1-5 can be determined from the importance of the shot shown in the frame.
  • The shots selected to be shown within the frames can be determined using known methods, for instance as described in S. Uchihashi, et al.
  • Fig. 1b shows a video manga 6 comprising the frames 1-5.
  • This video manga 6 is made up of the frames 1-5, with the frames ordered such that the impression of a comic is created.
  • The size and the content of the frames are selected according to the importance of the represented scenes.
  • The video manga still relies on a selection of different frames, providing redundant information to the user. The user needs to scan all frames to get an overview of the summarized video sequence.
  • Figure 2 illustrates a possible result of an inventive method.
  • A video sequence is analyzed and various objects are determined to be shown within the image.
  • The objects 8, 10, and 12 represent faces of actors. These faces have been identified during the content analysis and found to be important. Therefore, the objects 8-12 are placed within the image.
  • An additional object 14, representing an airplane, is also selected and displayed. The ordering of the objects 8-14 is calculated based on the relative size of the objects, the importance of the objects, and the chosen layout. Object 12 is found to be more important than object 10, which in turn is more important than object 8. Therefore, object 12 is in the foreground, while the other objects 8, 10 are smaller and can be partially masked by object 12.
  • Textual information 16 is selected and displayed within the image.
  • The position of the textual information 16 can be determined from the chosen layout.
  • The single image illustrated in Figure 2 represents the most important objects 8-14 of the video sequence to be summarized. Additionally, textual information 16 is used. The ordering and arrangement of the objects are calculated using the object information, the layout information, and additional information. The single image allows the user to see the most important elements of a video sequence at first glance.
  • Figure 3 illustrates a flowchart of an inventive method.
  • The video sequence is analyzed in a content analysis 20.
  • The results of the content analysis 20 are forwarded to an object selection 22.
  • Object information is evaluated and, based on object occurrence rates, object importance, statistical data and further object-related information, at least one object is selected.
  • The at least one selected object is forwarded to rendering 24.
  • The selected objects are graphically manipulated so as to allow arranging them within one image.
  • A layout is selected and the objects are adapted to fit into this layout. Textual information is reformatted to fit into the layout. All objects are placed within the image, an appropriate background is selected and the textual information is arranged.
  • The result is one single image representing a summary of the received video sequence.
  • Fig. 4 illustrates a system arranged for implementing the inventive method.
  • The system comprises a device 30, for example a consumer electronics device, having an input terminal 32 and an output terminal 34.
  • Information, in particular graphical information, is communicated within the device using a communication bus 36, for example a graphics data bus.
  • The device 30 further comprises a main memory 38, a central processor 40, a graphics interface and processor 42, a content analyzer 44, an object selector 46 and a rendering device 48. All components are interconnected via the communication bus 36.
  • The central processor 40 is controlled by a computer program 52, tangibly stored on a computer program product 50.
  • The computer program 52 controls the central processor 40 to implement the inventive method and to further process images and video scenes.
  • Images and video scenes are received at the input terminal 32 and stored in main memory 38.
  • The main memory 38 can also be supported by hard disks or optical storage devices. For displaying the video scenes and the images, these are processed by the graphics interface 42 to the output terminal 34.
  • The video sequence to be summarized is forwarded to the content analyzer 44.
  • The content analyzer 44 is arranged for performing a content analysis of the video sequence, providing at least object and content information.
  • The results are forwarded to the object selector 46, which is arranged for selecting from the object and content information at least one object.
  • The selected objects are forwarded to the rendering device 48, which is arranged for rendering the at least one selected object with at least one background image into one single image, obtaining the summarized overview of the video sequence within the one single image.
  • The rendering device 48 can be supported by the graphics processor within the graphics interface 42.
  • The output of the rendering device 48 can be stored in main memory 38 together with the summarized video sequence.
  • The summary can be forwarded to the output terminal 34 to be displayed on a display device.

Abstract

The disclosure relates in general to a method for automatically assembling a summarized overview (6) of a plurality of images (1-5) into one single image with performing a content analysis (20) of the plurality of images providing at least object and content information (22). To provide a quick overview of a scene within one image, the disclosure provides selecting from the object and content information at least one object (8-14), and rendering (24) the at least one selected object with at least one background image into one single image obtaining the summarized overview of the plurality of images within the one single image.

Description

Creating a summarized overview of a video sequence
The invention relates in general to a method for automatically assembling a summarized overview of a plurality of images into one single image with performing a content analysis of the plurality of images providing at least object and content information.
The invention further relates to a device implementing the method and to a program with instructions for executing the method.
In today's consumer electronics (CE) products, memory size is becoming a less relevant factor. This leads to new applications and devices which utilize the memory. For instance, hard disk and optical disk video recorders are becoming more and more common in consumer households. These devices enable users to store hundreds of hours of video programs, e.g. television programs, movies, etc. In addition, professional users can store videos of conferences and meetings. This allows recapturing the results of a conference even days or months later.
These new applications cause the amount of digital content stored in a household or a company to increase rapidly. With storage capacity increasing, and with the rise of new and more efficient compression algorithms, the number of stored video sequences grows. This requires advanced functionality for browsing and navigating through the video content. Users should be enabled to find specific information and certain video sequences more easily and quickly.
For instance, from G. Mekenkamp, M. Barbieri, et al., "Generating TV Summaries for CE-devices", Multimedia '02, December 1-6, 2002, Juan-les-Pins, France, a method for generating summarized versions of stored video sequences is known. Described are methods for providing trailers, generated by analyzing the content and selecting clips with a focus on entertainment rather than accuracy, and for generating summaries by selecting representative frames and displaying a number of downscaled versions of these frames simultaneously.
Both methods require analyzing the content of the video sequence, for instance by extracting low-level content descriptors from the content. Frames can be selected using a frame-to-frame difference computation and selecting candidate frames based on this difference.

Another possible way of providing a summarized overview of a video scene is described in J. Boreczky, et al., "An Interactive Comic Book Presentation for Exploring Video", CHI 2000 Conference Proceedings, ACM Press, pp. 185-192, 2000. According to this method, the video sequence is summarized such that images are laid out in a compact, visually pleasing display reminiscent of a comic book. This is achieved by segmenting the video sequence into shots demarcated by camera changes. After this segmentation, each segment is assigned a segment importance. This can be done by thresholding the importance score, whereby less important shots are pruned, leaving a concise and visually varied summary. From the selected shots, frames are selected and displayed. The frames are arranged in a logical order. The order of the frames and their size are calculated from the importance of the frames. The frames are laid out in rows, with different frame sizes. Within a row, frames go from left to right, with occasional excursions up and down. Lines can be provided between frames to indicate their order unambiguously. The layout appears like a manga or a comic. A possible implementation is illustrated in Figure 1b.

Both described methods provide pictorial overviews within a set of different images extracted from the original content. Using the pictorial overview, users can quickly see what a program is about, remember whether they have seen it before, and decide whether, for instance, to watch the sequence, to download it to a portable device, or to download it from a network onto their computer. Moreover, pictorial overviews can be useful to rapidly check the content of a time-shift buffer of a video recorder or a television with storage.
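The frame-to-frame difference computation mentioned above can be sketched as follows. This is a minimal illustration, not the cited method: frames are modelled as flat lists of grey values, and the threshold is an arbitrary assumption.

```python
# Hypothetical sketch: select candidate keyframes by frame-to-frame difference.

def frame_difference(a, b):
    """Mean absolute pixel difference between two equal-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_keyframes(frames, threshold):
    """Keep the first frame plus every frame that differs strongly
    from the last selected one (a simple shot-change heuristic)."""
    if not frames:
        return []
    selected = [0]
    for i in range(1, len(frames)):
        if frame_difference(frames[i], frames[selected[-1]]) > threshold:
            selected.append(i)
    return selected  # indices of candidate keyframes
```

A real system would operate on decoded video frames and typically use a more perceptual distance measure than the mean absolute difference used here.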
However, pictorial overviews have the drawback that, by representing entire picture frames from a video scene, they include a redundant and excessive amount of information. Comprehending at a glance the content of a set of frames taken out of their context is very hard for the user. The sequence of frames summarizing the video sequence also requires some amount of storage, or bandwidth for transmission. Additionally, the user needs to view the whole sequence of frames to be able to comprehend the content of the summarized video scene, which users might consider annoying or time-wasting.

Therefore, one object of the invention is to provide a video sequence summarization which provides all information within one single image. Another object of the invention is to provide a video sequence summary using the least amount of memory. Another object of the invention is to provide a more efficient compression for video sequence summaries. A further object of the invention is to provide video sequence summaries comprehensible to users at first glance and without wasting time. A further object of the invention is to provide summarizing of the content of a plurality of images.
These and other objects of the invention are solved by a method for automatically assembling a summarized overview of a plurality of images into one single image with performing a content analysis of the plurality of images providing at least object and content information, selecting from the object and content information at least one object, and rendering the at least one selected object with at least one background image into one single image obtaining the summarized overview of the plurality of images within the one single image. The inventive method enables creating a "video-poster", i.e. one still image comprising essential information about the plurality of images to be summarized. The plurality of images can be any form of more than one single image, for instance a video sequence. At least one object within the plurality of images can be identified as essential. This at least one object can be embedded into any layout using the rendering. Instead of using the entire frame, as proposed by known methods, the invention proposes to select at least one object. The pictorial overview can be created by identifying parts of frames, objects and backgrounds as important, resizing these if necessary, and fusing the selected components together with optional graphical effects.
The inventive method proposes the steps of content analysis, object selection and video poster generation.
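The three steps can be sketched as a minimal pipeline. All function bodies below are placeholder assumptions chosen for illustration, not the patented implementation:

```python
# Sketch of the three-step pipeline: content analysis, object selection,
# poster rendering. Objects are modelled as labels for simplicity.

def analyze_content(frames):
    """Stand-in content analysis: report each distinct label as an
    'object' together with how often it occurs across the frames."""
    counts = {}
    for frame in frames:
        for obj in frame:
            counts[obj] = counts.get(obj, 0) + 1
    return counts

def select_objects(object_info, top_n=3):
    """Pick the most frequently occurring objects as the most important."""
    ranked = sorted(object_info, key=object_info.get, reverse=True)
    return ranked[:top_n]

def render_poster(objects, background="plain"):
    """Stand-in rendering: combine selected objects with a background."""
    return {"background": background, "objects": objects}

def summarize(frames):
    return render_poster(select_objects(analyze_content(frames)))
```

The value of the decomposition is that each stage can be replaced independently, e.g. swapping the frequency heuristic for a learned importance model without touching the renderer.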
According to embodiments, selecting from the plurality of images at least one background object to be rendered with the at least one object into the single image is provided. The plurality of images can be segmented into foreground and background objects. The single image can be rendered using objects from foreground and background. In particular, one object from the background, which is considered highly relevant, can be selected for rendering the single image. It can also be possible to merge more than one background object into the single image.
Embodiments also provide object segmentation, object tracking, background-foreground segmentation, keyframe selection, face detection and/or face recognition during the content analysis of the plurality of images. When a video scene is recorded, for instance from a television program, content analysis algorithms are used to analyze the content of the sequence. The content analysis can comprise object segmentation and tracking. Semantic video objects can be determined from the video sequence. The determined objects can also be analyzed in terms of importance. One possible method for object segmentation and tracking is described in A. Mahindro, et al. "Enhanced Video Representation using Object", India Institute of Engineering, Chapter 1-2, Dec. 2002.
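One simple background-foreground segmentation idea can be sketched under strong simplifying assumptions (static camera, frames as flat lists of grey values): pixels that stay close to a background model are background, the rest foreground. Real content analysis would use far more robust techniques, such as the object segmentation and tracking cited above.

```python
import statistics

def background_model(frames):
    """The per-pixel median over time approximates a static background
    when foreground objects keep moving."""
    return [statistics.median(col) for col in zip(*frames)]

def segment_foreground(frame, background, tolerance=10):
    """Return a boolean mask: True where the pixel differs from the
    background model by more than `tolerance` (a foreground pixel)."""
    return [abs(p - b) > tolerance for p, b in zip(frame, background)]
```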
Obtaining object-based descriptions and/or content descriptions from the video sequence during content analysis of the plurality of images is provided according to embodiments. Content descriptors can be extracted from the plurality of images and associated with the original program for further processing. For instance, in case the video sequence is encoded in MPEG-4 or has an associated MPEG-7 stream, an object-based description of the content is directly available within the sequence as metadata.
This additional data can be used to determine information for selecting the at least one object. This information can be the importance of an object for the sequence.
An automatic summarization algorithm for selecting the at least one object can be provided with the object-based descriptions and/or the content descriptions. Automatic summarization algorithms are known, for instance, from the above referenced documents.
Embodiments also provide determining, from the object and/or content information, statistical data to be used for selecting the at least one object. The statistical information can be used to determine the importance of a particular object. For instance, the more often an object is recognized within the plurality of images, the more important this object can be. A threshold value can be set to allow discriminating irrelevant objects from important objects. For example, the statistical salience of certain objects can be used: the face of the main actor appears very frequently. Also special events with a lot of movement, e.g. car crashes or car-chase scenes, can determine the importance of an object.
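The occurrence-count heuristic can be sketched as follows; the labels and the threshold value are illustrative assumptions:

```python
from collections import Counter

def rank_objects(detections, min_occurrences=2):
    """`detections` is a flat list of object labels, one entry per
    appearance. Objects are ranked by frequency; those below the
    threshold are discarded as irrelevant."""
    counts = Counter(detections)
    return [obj for obj, n in counts.most_common() if n >= min_occurrences]
```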
Embodiments provide choosing an image layout for the single image depending on the number of selected objects, randomly from a set of available layouts, or according to user preferences, once the objects to be used are selected. The layout can depend on the number of objects selected. The layout can also be selected randomly from a set of pre-defined layouts. The layout can additionally be user-selected or chosen according to user preferences.
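A layout chooser covering the three strategies mentioned might look like this; the layout names are invented for the sketch:

```python
import random

# Hypothetical pre-defined layouts, keyed by object count.
LAYOUTS = {1: "centered", 2: "side-by-side", 3: "triptych"}

def choose_layout(objects, preference=None, rng=None):
    """Pick a layout by user preference, by object count, or at random."""
    if preference is not None:          # user preference wins
        return preference
    n = len(objects)
    if n in LAYOUTS:                    # count-driven choice
        return LAYOUTS[n]
    rng = rng or random.Random(0)       # seeded for reproducibility
    return rng.choice(list(LAYOUTS.values()))  # random fallback
```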
The rendering can include various image manipulation steps. The manipulation of the objects can be done according to known image manipulation methods. Embodiments provide manipulating the at least one selected object during rendering of the single image. The manipulation can be, for example, object rotation, object re-scaling, blending, shadowing, overlapping, or other graphical effects. The decision of which graphical effect to use can be content-dependent, random, layout-driven, user-defined, or based on any other source.
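Two of the listed manipulations, rotation and re-scaling, can be illustrated on object corner points; blending and shadowing would operate on pixel data instead:

```python
import math

def rotate(point, degrees):
    """Rotate a 2-D point about the origin by the given angle."""
    x, y = point
    a = math.radians(degrees)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def rescale(point, factor):
    """Scale a 2-D point relative to the origin."""
    x, y = point
    return (x * factor, y * factor)
```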
The relative position of the selected objects within the image also needs to be defined. Embodiments therefore provide positioning the objects within the single image based on the relative size of the objects, an object importance and/or color similarities between selected objects. Because more than one background object can be selected from the video scene, the extracted background objects can be combined and merged into the background of the single image.
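An importance-driven placement can be sketched as follows; the evenly-spaced horizontal placement is an assumption for this sketch, while the stacking order mirrors Figure 2, where the most important object partially masks the others:

```python
def position_objects(objects, width, height):
    """Assign each selected object a position and a stacking order.

    Objects are sorted so the most important ends up in the foreground
    (highest z). Placement along the horizontal axis is a simple
    evenly-spaced heuristic.
    """
    ordered = sorted(objects, key=lambda o: o["importance"])
    placed = []
    for z, obj in enumerate(ordered):
        x = width * (z + 1) // (len(ordered) + 1)   # spread left to right
        y = height // 2                             # vertically centered
        placed.append({**obj, "x": x, "y": y, "z": z})
    return placed
```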
Associated textual metadata within the plurality of images can also be utilized for rendering the single image. Embodiments provide obtaining additional textual information from the plurality of images and including the textual information in the single image. This textual information can be the title of the scene, actor names, director names, dates and others. The textual information can be included in the image using various appropriate fonts, e.g. defined by the selected or determined layout. Embodiments further provide an animated image for summarizing the plurality of images, by obtaining a sequence of animation from the at least one selected object and including the sequence of animation in the single image. The selected object can be tracked within the plurality of images, and a particular motion can be extracted and used to implement an animated single image in which the object moves according to the extracted movement.
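The selection and placement of such textual metadata could be sketched as below; the slot names ('banner', 'footer'), the fonts and the priority order are hypothetical choices for this example, not prescribed by the embodiments:

```python
def add_text_overlay(metadata):
    """Map available textual metadata to text slots of the single image.

    The title, if present, goes into a banner slot; any available credits
    (actors, director, date) are joined into a footer line.
    """
    slots = {"banner": None, "footer": None}
    if metadata.get("title"):
        slots["banner"] = {"text": metadata["title"], "font": "bold-serif"}
    credits = [metadata.get(k) for k in ("actors", "director", "date")]
    credits = [c for c in credits if c]
    if credits:
        slots["footer"] = {"text": " | ".join(credits), "font": "small-sans"}
    # Only filled slots are returned to the renderer.
    return {k: v for k, v in slots.items() if v}
```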
Embodiments also provide furnishing an object with a link to the particular scene within the summarized plurality of images. Clicking such a link enables the user to jump directly to the appropriate position within the plurality of images.
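Such linking can be realized by recording, for each rendered object, the start time of its source scene; the field names below ('name', 'scene_start') are assumed for this sketch:

```python
def build_links(selected_objects):
    """Map each rendered object to the start time of its source scene.

    Each object dict is assumed to carry the 'scene_start' offset (in
    seconds) recorded during content analysis; clicking the object in a
    viewer would then seek the video to that offset.
    """
    return {obj["name"]: obj["scene_start"] for obj in selected_objects}

def on_click(links, name, default=0.0):
    """Return the seek position for a clicked object (default if unknown)."""
    return links.get(name, default)
```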
The inventive method can also be applied to a set of different video sequences, and one single image can be created summarizing all video sequences. The single image can then be used to represent recursively hierarchies of folders containing different video scenes. During browsing, users can select an object from the single image, causing the image to reveal more details of the related sequence or to switch to a new image; objects not related to the selected object can be hidden. Pictorial overviews and video posters can become important features of advanced and intelligent recording devices capable of recording huge amounts of video sequences. Personal Video Recorders and Home Entertainment Systems can utilize the inventive method.

Another aspect of the invention is a device arranged for automatically assembling a summarized overview of a plurality of images into one single image, comprising a content analyzer arranged for performing a content analysis of the plurality of images providing at least object and content information, an object selector arranged for selecting from the object and content information at least one object, and a rendering device arranged for rendering the at least one selected object with at least one background image into one single image obtaining the summarized overview of the video sequence within the one single image.
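The recursive representation of folder hierarchies can be sketched as follows; the nested-dict input format and the rule of taking each video's most important object are assumptions made for this sketch:

```python
def folder_summary_objects(tree):
    """Recursively pick one representative object per video in a folder tree.

    `tree` is a nested dict: a folder maps names to subtrees, while a video
    is represented by the list of its selected object labels, most important
    first. The top object of each video stands in for it, so one single
    image can represent the whole hierarchy.
    """
    objects = []
    for name, node in tree.items():
        if isinstance(node, dict):          # subfolder: recurse into it
            objects.extend(folder_summary_objects(node))
        elif node:                          # video: take its top object
            objects.append((name, node[0]))
    return objects
```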
A further aspect of the invention is a computer program and a computer program product for automatically assembling a summarized overview of a plurality of images into one single image, the program comprising instructions operable to cause a processor to perform a content analysis of the plurality of images to provide at least object and content information, select from the object and content information at least one object, and render the at least one selected object with at least one background image into one single image to obtain the summarized overview of the video sequence within the one single image.
Yet another aspect of the invention is the use of a method for automatically assembling a summarized overview of a plurality of images into one single image within consumer electronics, with performing a content analysis of the plurality of images providing at least object and content information, selecting from the object and content information at least one object, and rendering the at least one selected object with at least one background image into one single image obtaining the summarized overview of the video sequence within the one single image.
These and other aspects of the invention will become apparent from and elucidated with reference to the following figures. In the figures:
Fig. 1a, b a presentation of images within a video manga;
Fig. 2 a single image illustrating the contents of a video sequence;
Fig. 3 a flowchart of an inventive method;
Fig. 4 a system arranged for implementing the inventive method.
Throughout the figures, like reference signs relate to similar elements, where appropriate.

As video is used more and more as the official record of meetings, teleconferences, and other events, the ability to locate relevant passages or even entire meetings becomes important. Users need visual summaries to locate specific video passages quickly. Such systems are useful in settings that require a quick overview of video to identify potentially useful or relevant segments. Examples include recordings of meetings and presentations, home movies, and domain-specific video such as recordings used in surgery or in insurance.
These seemingly different forms of video are related because they consist of multiple shots, e.g. shots taken at different times, perhaps by different cameras or by a hand-held camera, but the segments are often not clearly separable from the user's perspective and thus are not readily accessible through an index or a table of contents.
Video is an information-intensive medium. To get an overview of a video document quickly, users must either view a large portion of the video or consult some sort of summary. Techniques for automatically creating pictorial summaries of videos using automatic content analysis are known. While any collection of video keyframes can be considered a summary, a meaningful and concise representation of the video can be acquired automatically by choosing only the most salient images and efficiently packing them into a pictorial summary.
Many existing summarization techniques rely chiefly on collecting one or more keyframes from each shot. To allow data reduction, a measure of importance is computed and used for summarization. Using the importance measure, keyframes are selected and resized to reflect their importance scores, such that the most important are largest. The differently-sized keyframes are then efficiently packed into a compact summary reminiscent of a comic book or Japanese manga. Figure 1a illustrates different frames 1-5, each of which has a different content. The size of the frames 1-5 can be determined from the importance of the shot shown in the frame. The shots selected to be shown within the frames can be determined using known methods, for instance as described in S. Uchihashi et al., "Video Manga: Generating Semantically Meaningful Video Summaries", FX Palo Alto Laboratory. Figure 1b shows a video manga 6 comprising the frames 1-5. This video manga 6 is made up of the frames 1-5, whereby the frames are ordered such that the impression of a comic is created. The size and the content of the frames are selected according to the importance of the represented scenes. The video manga, however, still relies on a selection of different frames, presenting redundant information to the user. The user needs to scan all frames to get an overview of the summarized video sequence.
Figure 2 illustrates a possible result of an inventive method. A video sequence is analyzed and various objects are determined to be shown within the image. In Figure 2, the objects 8, 10, and 12 represent faces of actors. These faces have been identified by the content analysis and found to be important; therefore, the objects 8-12 are placed within the image. An additional object 14, which represents an airplane, is also selected and displayed. The ordering of the objects 8-14 is calculated based on the relative size of the objects, the importance of the objects and the chosen layout. Object 12 is found to be more important than object 10 and even more important than object 8. Therefore, object 12 is in the foreground, while the other objects 8, 10 are smaller and can be partially masked by object 12.
In addition, textual information 16 is selected and displayed within the image. The position of the textual information 16 can be determined from the chosen layout.
The single image illustrated in Figure 2 represents the most important objects 8-14 of the video sequence to be summarized. Additionally, textual information 16 is used. The ordering and arrangement of the objects is calculated using the object information, the layout information and additional information. The single image allows the user to see the most important things of a video sequence at first glance.
Figure 3 illustrates a flowchart of an inventive method. The video sequence is analyzed in a content analysis 20. The results of the content analysis 20 are forwarded to an object selection 22, in which object information is evaluated and, based on object occurrence rates, object importance, statistical data and further object-related information, at least one object is selected. The at least one selected object is forwarded to rendering 24, in which the selected objects are graphically manipulated so as to allow arranging them within one image. In addition, a layout is selected, the objects are adapted to fit into this layout, and textual information is reformatted to fit into the layout. All objects are placed within the image, an appropriate background is selected, and the textual information is arranged. The result is one single image representing a summary of the received video sequence.
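The flow of Figure 3 can be condensed into a single end-to-end sketch, under the assumptions that frames arrive as lists of pre-labelled objects and that importance is approximated by occurrence frequency (both simplifications made for this illustration):

```python
def summarize(frames, max_objects=3):
    """End-to-end sketch: content analysis (20), object selection (22),
    rendering (24), returning a description of the single summary image."""
    # Content analysis: count how often each labelled object occurs.
    counts = {}
    for frame in frames:
        for obj in frame["objects"]:
            counts[obj] = counts.get(obj, 0) + 1
    # Object selection: keep the most frequent objects.
    selected = sorted(counts, key=counts.get, reverse=True)[:max_objects]
    # Rendering: pick a layout and stack objects so the most important
    # (most frequent) receives the highest z and sits in the foreground.
    layout = "cascade" if len(selected) > 2 else "side_by_side"
    return {"layout": layout,
            "placements": [{"object": o, "z": i}
                           for i, o in enumerate(reversed(selected))]}
```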
The inventive method thus provides a summary of a video sequence using only one single image, allowing the user to obtain all relevant information at first glance.

Figure 4 illustrates a system arranged for implementing the inventive method. The system comprises a device 30, for example a consumer electronic device, having an input terminal 32 and an output terminal 34. Information, in particular graphical information, is communicated within the device using a communication bus 36, for example a graphics data bus. The device 30 further comprises a main memory 38, a central processor 40, a graphics interface and processor 42, a content analyzer 44, an object selector 46 and a rendering device 48. All components are interconnected via the communication bus 36.
During operation, the central processor 40 is controlled by a computer program 52, tangibly stored on a computer program product 50. The computer program 52 controls the central processor 40 to implement the inventive method and to further process images and video scenes.
Images and video scenes are received at the input terminal 32 and stored in main memory 38. The main memory 38 can also be supported by hard disks or optical storage devices. For display, the video scenes and images are processed by the graphics interface 42 and output via output terminal 34.
When a summarization of a video sequence is to be provided, the video sequence to be summarized is passed to content analyzer 44. The content analyzer 44 is arranged for performing a content analysis of the video sequence, providing at least object and content information. After content analysis, the results are passed to object selector 46, which is arranged for selecting from the object and content information at least one object. The selected objects are passed to rendering device 48, which is arranged for rendering the at least one selected object with at least one background image into one single image, obtaining the summarized overview of the video sequence within the one single image. The rendering device 48 can be supported by the graphics processor within the graphics interface 42. The output of the rendering device 48 can be stored in main memory 38 together with the summarized video sequence. The summary can be passed to output terminal 34 to be displayed on a display device.

CLAIMS:
1. A method for automatically assembling a summarized overview (6) of a plurality of images (1-5) into one single image with
- performing a content analysis (20) of the plurality of images providing at least object and content information,
- selecting from the object and content information (22) at least one object (8-14), and
- rendering (24) the at least one selected object (8-14) with at least one background image into one single image obtaining the summarized overview of the video sequence within the one single image.
2. The method of claim 1, with selecting from the plurality of images (1-5) at least one background object to be rendered with the at least one object (8-14) into the single image.
3. The method of claim 1, with providing object segmentation, object tracking, background-foreground segmentation, key-frame selection, face detection and/or face recognition during the content analysis (20) of the plurality of images or the video sequence.
4. The method of claim 1, with obtaining object-based descriptions and/or content descriptions from the plurality of images during content analysis (20) of the plurality of images.
5. The method of claim 4, with determining from the object-based descriptions and/or the content descriptions information for selecting (22) the at least one object (8-14).
6. The method of claim 4, with providing the object-based descriptions and/or the content descriptions to an automatic summarization algorithm selecting the at least one object (8-14).
7. The method of claim 4, with determining, from statistical information from the object and/or content related information, data to be used for selecting the at least one object (8-14).
8. The method of claim 4, with determining from the object-based descriptions and/or the content descriptions and/or the statistical information an object importance for selecting the at least one object (8-14).
9. The method of claim 1, with choosing an image layout for the single image depending on the number of selected objects (8-14), randomly from a set of available layouts or according to user preferences.
10. The method of claim 1, with manipulating the at least one selected object (8-14) during rendering (24) the single image.
11. The method of claim 1, with positioning the object (8-14) within the single image based on a relative size of the objects, an object importance and/or color similarities between selected objects.
12. The method of claim 1, with obtaining additional textual information (16) from the plurality of images and including the textual information (16) into the single image.
13. The method of claim 1, with obtaining a sequence of animation from the at least one selected object (8-14) and including the sequence of animation into the single image.
14. The method of claim 1, with providing an object (8-14) with a link to the particular scene within the plurality of images.
15. A device (30) arranged for automatically assembling a summarized overview of a plurality of images into one single image comprising
- a content analyzer (44) arranged for performing a content analysis of the plurality of images providing at least object and content information,
- an object selector (46) arranged for selecting from the object and content information at least one object, and
- a rendering device (48) arranged for rendering the at least one selected object with at least one background image into one single image obtaining the summarized overview of the video sequence within the one single image.
16. A computer program (52) for automatically assembling a summarized overview (6) of a plurality of images (1-5) into one single image, the program comprising instructions operable to cause a processor to
- perform a content analysis (20) of the plurality of images to provide at least object and content information,
- select from the object and content information (22) at least one object (8-14), and
- render (24) the at least one selected object (8-14) with at least one background image into one single image to obtain the summarized overview of the video sequence within the one single image.
17. A computer program product (50) for automatically assembling a summarized overview (6) of a plurality of images (1-5) into one single image, the product storing a program (52) comprising instructions operable to cause a processor to
- perform a content analysis (20) of the plurality of images to provide at least object and content information,
- select from the object and content information (22) at least one object (8-14), and
- render (24) the at least one selected object (8-14) with at least one background image into one single image to obtain the summarized overview of the video sequence within the one single image.
18. Use of a method for automatically assembling a summarized overview (6) of a plurality of images (1-5) into one single image within consumer electronics (30) with
- performing a content analysis (20) of the plurality of images providing at least object and content information,
- selecting from the object and content information (22) at least one object (8-14), and
- rendering (24) the at least one selected object (8-14) with at least one background image into one single image obtaining the summarized overview of the video sequence within the one single image.
PCT/IB2006/050583 2005-03-03 2006-02-23 Creating a summarized overview of a video sequence WO2006092752A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP04106459.3 2005-03-03
EP04106459 2005-03-03

Publications (2)

Publication Number Publication Date
WO2006092752A2 true WO2006092752A2 (en) 2006-09-08
WO2006092752A3 WO2006092752A3 (en) 2007-04-26

Family

ID=36778072

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/050583 WO2006092752A2 (en) 2005-03-03 2006-02-23 Creating a summarized overview of a video sequence

Country Status (1)

Country Link
WO (1) WO2006092752A2 (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030189588A1 (en) * 2002-04-03 2003-10-09 Andreas Girgensohn Reduced representations of video sequences
US6859608B1 (en) * 1999-12-10 2005-02-22 Sony Corporation Auto title frames generation method and apparatus
US20060064716A1 (en) * 2000-07-24 2006-03-23 Vivcom, Inc. Techniques for navigating multiple video streams


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008035146A1 (en) * 2006-09-18 2008-03-27 Sony Ericsson Mobile Communications Ab Video pattern thumbnails and method
US7899240B2 (en) 2006-09-18 2011-03-01 Sony Ericsson Mobile Communications Ab Video pattern thumbnails and method

Also Published As

Publication number Publication date
WO2006092752A3 (en) 2007-04-26


Legal Events

Date Code Title Description
NENP Non-entry into the national phase in: Ref country code: DE
NENP Non-entry into the national phase in: Ref country code: RU
WWW Wipo information: withdrawn in national office; Country of ref document: RU
122 Ep: pct application non-entry in european phase; Ref document number: 06710960; Country of ref document: EP; Kind code of ref document: A2
WWW Wipo information: withdrawn in national office; Ref document number: 06710960; Country of ref document: EP