WO2012032537A2 - A method and system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device - Google Patents

A method and system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device

Info

Publication number
WO2012032537A2
Authority
WO
WIPO (PCT)
Prior art keywords
textual
frames
newly added
frame
added data
Application number
PCT/IN2011/000597
Other languages
French (fr)
Other versions
WO2012032537A3 (en)
Inventor
Subhasis Chaudhuri
A. Ranjith Ram
Original Assignee
Indian Institute Of Technology
Application filed by Indian Institute Of Technology filed Critical Indian Institute Of Technology
Publication of WO2012032537A2 publication Critical patent/WO2012032537A2/en
Publication of WO2012032537A3 publication Critical patent/WO2012032537A3/en

Classifications

    • G - PHYSICS
    • G09 - EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B - EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 - Electrically-operated educational appliances
    • G09B5/06 - Electrically-operated educational appliances with both visual and audible presentation of the material to be studied

Definitions

  • This invention relates to a method and system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device. More specifically, the invention relates to providing a legible display of a lecture video on the miniature video device.
  • miniature video devices include, but are not limited to mobile terminals, Personal Digital Assistants (PDA), hand-held computers, cell phones, tablet PCs, etc.
  • the video content intended for large screens like television and computer monitors is resized to fit on smaller screens of such miniature devices.
  • some information might be inherently lost either by spatial sub-sampling or by marginal cropping while attempting to fit it on the smaller screen.
  • the original video when played on miniature video devices suffers from overhead in memory usage and poor resolution.
  • An educational media such as a lecture video mainly comprises a lecturer writing content on a board/writing pad or explaining some slide shows.
  • lecture videos are largely useful in distance education, community outreach programs and video-on demand applications.
  • the textual content becomes too small to be legible. Therefore, there is a need to efficiently overcome the degradation of the legibility of the displayed textual content of a lecture video, due to the limitations on the screen size.
  • WO2009042340 discloses processing video data for automatically selecting at least one key-frame based on a set of criteria and then encoding the video data with an identifier.
  • it does not provide legibility retention of the video frames which is crucial in the case of lecture videos.
  • the method and system should facilitate legible display of textual content of a lecture video on a portable device without any loss in instructional value. Further, the legible display of the lecture video should be highly compressed, streamable, and synchronized with respect to the audio channel of the original video.
  • An object of the invention is to provide a content adaptive and legibility retentive display of a lecture video on a miniature video device.
  • Another object of the invention is to provide legible display of textual content of a lecture video on a miniature video device. Yet another object of the invention is to provide a legible display of the lecture video which is highly compressed, streamable, and synchronized with respect to the audio channel of the original video.
  • a method for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device comprising a sequence of textual and non-textual frames along with associated audio.
  • the method comprising the steps of: creating a metadata that indicates location of newly added data points in textual frames temporally spaced apart by a predefined time interval by computing horizontal and vertical projection profiles of ink pixels in said textual frames and detecting x-y positions of newly added data points thereof; and sequentially displaying key-frames extracted from the textual and non-textual frames in accordance with the metadata by panning textual key-frames with a selection window having an aspect ratio and size in accordance with a display screen of the miniature video device and a center point as x-y position of newly added data point in the respective textual frame.
  • detecting the y-position of newly added data point in a current textual frame comprises comparing Horizontal Projection Profile (HPP) of ink pixels of the current textual frame with that of a previous textual frame, and setting the y-position as the point where the amount of differential ink pixels in the HPPs reaches a threshold value.
  • detecting the x-position of the newly added data point in the current textual frame comprises comparing Vertical Projection Profiles (VPPs) of ink pixels contained in regions around the detected y-position of the current and previous textual frames and setting the x-position as the point where the amount of differential ink pixels in the VPPs is maximum.
  • the location of newly added data point in an intermediate frame occurring in time period between two textual frames temporally spaced apart by the predefined time interval is detected by temporally interpolating x-y positions of newly added data points of said two textual frames.
  • the size of selection window is equal to the size of display screen of the miniature video device and the size of selection window is varied based on the difference between locations of newly added data points of temporally adjacent textual frames.
  • when the lecture video has a frame rate of 25 frames per second and a sub-sampling rate of 50, the HPP and VPP of textual frames spaced apart by 2 seconds are computed, leading to detection of newly added data points at a regular interval of 50 frames.
  • a system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device comprising a sequence of textual and non-textual frames along with associated audio.
  • the system comprises a metadata creation module configured to create a metadata that indicates location of newly added data points in textual frames temporally spaced apart by a predefined time interval by computing horizontal and vertical projection profiles of ink pixels in said textual frames and detecting x-y positions of newly added data points thereof.
  • the system further comprises a media re-creation module configured to receive the metadata, associated audio, and key-frames extracted from textual and non-textual frames; and sequentially displaying the key-frames in accordance with the metadata by panning the textual key-frames with a selection window having an aspect ratio and size in accordance with the miniature video device and center point as x-y position of newly added data point of the respective textual frame.
  • the metadata creation module is configured to detect the y-position of a newly added data point in a current textual frame by comparing the Horizontal Projection Profile (HPP) of ink pixels of the current textual frame with that of a previous textual frame, and setting the y-position as the point where the amount of differential ink pixels in the HPPs reaches a threshold value; and detect the x-position of the newly added data point in the current textual frame by comparing Vertical Projection Profiles (VPPs) of ink pixels contained in the regions around the detected y-position of the current and previous textual frames and setting the x-position as the point where the amount of differential ink pixels in the VPPs is maximum.
  • the media recreation module is configured to vary the size of selection window based on the difference between locations of newly added data points of temporally adjacent textual frames.
  • the media recreation module is configured to provide a manual over-riding control to the viewer for selecting their region of interest in the lecture video and displaying said region with full or appropriate resolution.
  • the media recreation module is also configured to receive the metadata, associated audio and key-frames from a Server device as streaming media.
  • Fig 1 is a block diagram illustrating a system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device;
  • Fig. 2 is a functional block diagram representing the steps involved in a method for creating a meta data indicating location of newly added data point in two textual frames temporally spaced apart by a pre-defined time interval;
  • Fig. 3 is a plot of x-y positions of newly added data points in temporally adjacent textual frames
  • Fig. 4 is a block diagram illustrating a lecture video, textual and non-textual key-frames, associated audio, and the recreated lecture video using the metadata;
  • Fig. 5 is a graph illustrating horizontal projection profiles of ink pixels of two temporally adjacent textual frames
  • Fig. 6 is a graph illustrating local sum of differential ink pixels of horizontal projection profiles of two temporally adjacent textual frames
  • Figs.7a and 7b are graphs illustrating vertical projection profiles of ink pixels in cropped regions of two temporally adjacent textual frames
  • Fig. 8 is a graph illustrating local sum of differential ink pixels of vertical projection profiles of cropped regions of two temporally adjacent textual frames.
  • the system 100 includes a media splitter 101, a gray scale conversion module 102, a segmentation and shot recognition module 103, a non-textual key-frame extraction module 104, a textual key-frame extraction module 105, a metadata creation module 106, a metadata interpolation module 107, and a media re-creation module 108.
  • Each module performs a specific function, each function being a contributory step in providing a content adaptive and legibility retentive display of a lecture video on the miniature video devices such as mobile terminals, Personal Digital Assistants (PDA), hand-held computers, tablet PCs.
  • a typical lecture video comprises a lecturer explaining a topic by writing on a board or using slide shows.
  • Such lecture video may include scenes of various instructional activities such as a talking head activity scene, a class room activity scene, a writing hand activity scene or a slide show activity scene.
  • a typical video is made up of unique consecutive images referred to as frames. The frames are displayed to the user at a rate referred to as frames per second.
  • the media splitter 101 is configured to receive an original lecture video and split it into video and audio data.
  • the gray-scale conversion module 102 and the segmentation and shot recognition module 103 execute temporal segmentation of the video data to detect scene changes/breaks therein and then detect the activities in it.
  • a histogram difference is measured between two consecutive frames of the video data. If the sum of absolute difference of the histograms between two consecutive frames of the video data crosses a threshold, the frames are declared as shot boundary frames thereby determining a scene break.
  • the activity detection of scenes is HMM (Hidden Markov Model) based and is carried out in two phases i.e. a training and a testing phase.
  • the HMM parameters are learned, based on which a scene would be classified into one of the above mentioned activities. For example, to classify a scene into a talking head activity scene, writing hand activity scene or a slide show activity scene, motion within the scene is taken into account for classification. Motion in a talking head activity scene is more than that in a writing hand activity scene and least in a slide show activity scene. Therefore, the energy of the temporal derivative in intensity space is used as a relevant feature.
  • the gray-level histogram gives the distribution of the image pixels over different intensity values. It is very sparse for the slide show activity scene, moderately sparse for the writing hand activity scene and dense for the talking head activity scene.
  • Histogram entropy is a direct measure of the variation of pixel intensity in an image. If there is a high variation of intensity, the entropy will be high and vice versa.
  • the content of the video data may be classified into textual and non-textual content.
  • the talking head activity and class room activity scenes are non-textual content.
  • the writing hand activity and slide show activity scenes are textual content, as in some way or other, these scenes display textual content to the user.
  • the frames of the lecture video which include textual content are hereinafter referred to as textual frames, whereas the frames which include non-textual content are hereinafter referred to as non-textual frames.
  • a typical lecture video comprises a sequence of textual and non-textual frames along with associated audio.
  • the textual key-frame extraction module 105 and non-textual key-frame extraction module 104 are configured to extract representative key-frames from the textual and non-textual frames respectively, such that a set of key-frames represents a summarized semantic content of an entire scene for a particular duration.
  • the textual key-frame extraction module 105 performs ink-pixel based extraction of key-frames from textual frames
  • the non-textual key-frame extraction module 104 performs a visual quality based extraction of key-frames from non-textual frames.
  • the meta-data creation module 106 is configured to create a metadata which indicates location in the textual frames where the lecturer is currently writing or the text is appearing in a slide show.
  • the location in a textual frame which is currently being scribbled is represented by an x-y position which varies in accordance with the writing advancement from frame to frame.
  • the region in the textual frame where the lecturer is currently writing or the text is appearing is the region of interest for a viewer as during the display of video data, the viewer will usually focus on that portion of the video where text is appearing or the lecturer is writing.
  • the metadata creation module 106 creates metadata that indicates location of newly added data points in textual frames temporally spaced apart by a pre-defined time interval by computing horizontal and vertical projection profiles of ink pixels in said textual frames and detecting x-y positions of newly added data points thereof.
  • a set of textual frames temporally spaced apart by the pre-defined time interval is obtained by sub-sampling textual frames at a predefined rate, the pre-defined rate being equal to the product of the pre-defined time interval and the frame rate of the lecture video.
  • the frame rate of the lecture video may be represented by f and the temporal sub-sampling rate by k, where f may take values such as 25 fps, 30 fps, etc., and k may take values such as 10, 20, 40, 50, 60 or even higher.
  • the y-position of newly added data point in a current textual frame is detected by comparing the Horizontal Projection Profile (HPP) of ink pixels of the current textual frame with that of a previous textual frame, and setting the y-position as the point where the amount of differential ink pixels in the HPPs reaches a threshold value.
  • HPP Horizontal Projection Profile
  • the corresponding x-position is detected by comparing Vertical Projection Profiles (VPPs) of ink pixels contained in the regions around the detected y-position of the current and previous textual frames and setting the x-position as the point where the amount of differential ink pixels in the VPPs is maximum.
  • VPPs Vertical Projection Profiles
  • the x-y position of newly added data point in a textual frame may also be referred as a track point of the textual frame.
  • the track points of textual frames spaced apart by the predefined time interval constitute the metadata, which may be an array of size 2 * L, where L is the total number of textual frames derived from the video data. L may be higher than the count of non-textual frames since the writing operation is a much slower activity compared to the standard video frame rate.
  • the metadata interpolation module 107 is configured to interpolate the location of a newly added data point of an intermediate frame occurring in time period between two textual frames temporally spaced apart by a predefined time interval using locations of newly added data points in said two textual frames.
  • the media re-creation module 108 is configured to recreate the lecture video on the miniature video device. Essentially, the media re-creation module 108 is configured to receive the metadata, associated audio and key-frames extracted from textual and non-textual frames, and sequentially display the key-frames in accordance with the metadata by panning the textual key-frames with a selection window having an aspect ratio and size in accordance with the display screen of the miniature video device and a center point as the x-y position of the newly added data point in the respective textual frame. In the recreated lecture video, the non-textual key-frames are resized to fit onto the screen of the miniature video device, whereas the textual key-frames are cropped to display the region of interest with maximum resolution.
  • the metadata drives the selection window to scan the entire content in the textual key-frame for the time for which that textual key-frame is intended for display.
  • the media re-creation module 108 displays a cropped region of the textual key-frame to the viewer, the cropped region being the region of interest and having size equal to the display screen of the miniature video device.
  • the cropped region of a textual key-frame may also be referred to as key-hole image of respective key-frame.
  • the cropped region may further be referred as a child frame of the parent key-frame.
  • the system 100 may be deployed using a client-server configuration.
  • the Server device will have the lecture video, the media splitter 101, gray scale conversion module 102, segmentation and shot recognition module 103, textual and non-textual key- frame extraction modules 104 and 105, metadata creation module 106.
  • the client device will include the media re-creation module 108.
  • the media re-creation module 108 is an instructional media player of the client device.
  • the metadata interpolation module 107 may either be present at the client device or the Server device.
  • the client device may be a miniature video device and may request for a lecture video from the Server device.
  • the Server device may send the metadata, key-frames and associated audio of the lecture video to the client device.
  • the client device may either download the metadata, key-frames and associated audio from the Server device or receive them as streaming data.
  • the data received at the media re-creation module 108 in response to a request for a lecture video will include (a) a few key-frames in any image file format, (b) an audio file in a suitable file format and (c) an XML or equivalent text file containing the temporal marking for the placement of key-frames and the metadata.
  • the media re-creation module 108 receives the key- frames, metadata and audio.
  • the temporally long-lasting video shots are replaced by the corresponding static key-frames; therefore, the data received at the client device is highly compressed with respect to the original lecture video.
  • the original lecture video is received in a highly reduced form at the client device but with better means to suit the target display and storage with minimal information loss.
  • the Horizontal Projection Profile (HPP) of ink pixels of a current frame, i.e. the pth frame, is computed by projecting its ink pixels on the y-axis.
  • the HPP of ink pixels of a previous frame, i.e. the (p-k)th frame, is computed.
  • the HPP of the (p-k)th frame is subtracted from the HPP of the pth frame to detect the point where the local amount of differential ink pixels in the HPPs reaches a threshold value. Said point is then set as the y-position of the newly added data point of the pth frame.
  • the Vertical Projection Profile (VPP) of ink pixels of a cropped pth frame is computed.
  • the cropped pth frame is a horizontal strip image of the pth frame including ink pixels contained in the regions around the detected y-position.
  • the VPP of ink pixels of a cropped (p-k)th frame is computed.
  • the cropped (p-k)th frame is a horizontal strip image of the (p-k)th frame including ink pixels contained in the regions around the detected y-position.
  • the VPP of the (p-k)th frame is subtracted from the VPP of the pth frame to detect the point where the local amount of differential ink pixels in the VPPs is maximum. Said point is then set as the x-position of the newly added data point of the pth frame.
  • the detected x-y position of the pth frame is then set as the track point of the pth textual frame.
  • let B_t(m, n), 0 < m ≤ M, 0 < n ≤ N, be the textual frame at time t, where M and N are the number of rows and columns of said frame respectively.
  • the HPP of an image is a vector in which each element is obtained by summing up the pixel values along a row. Therefore, the HPP of the textual frame at time t is an array P_t of length M in which each element stands for the total number of ink pixels in a row of the processed frame. Since pixels in each frame of the content scene are converted into ink and paper pixels corresponding to 0 and 255 values of gray levels respectively, the array P_t is given by equation (1) below.
  • the local absolute summation SD(j) of the difference HPP array D(m) is performed to detect any considerable variation in ink pixels.
  • w_d is the localization parameter for summation and θ is the threshold for differential ink detection. If SD(j) > θ, it may be assumed that there is a scribbling activity or a new data point is introduced in the j-th line, since the local sum of the absolute HPP difference yields a high value as a result of the introduction of extra ink pixels in that line. Therefore, with the help of HPPs of two temporally adjacent frames, the y-position of the newly added data point in a textual frame is detected.
  • the x-position of the newly added data point of the textual frame B t (m, n) is detected using the detected y-position.
  • the horizontal strip region around the detected y-position is cropped and the VPPs of the cropped images are computed.
  • the point at which the local sum of the absolute differences of the VPPs yields maximum value is taken as the x-position of the newly added data point of the textual frame B t (m, n).
  • HPP and VPP of textual frames temporally spaced apart by 2 seconds are computed, leading to detection of x-y positions of newly added data points in the lecture video at regular interval of k, i.e. 50 frames.
  • the time interval of 2 seconds for computation of track points in textual frames is taken with the assumption that there would not be too much variation of textual content in 2 second duration in a lecture video.
  • the writing pace of the lecturer is usually very slow compared to the video frame rate; therefore, the increment in textual content (extra ink pixels introduced) from frame to frame is too small for any noticeable difference to appear in the HPPs of consecutive textual frames.
  • the computation of x-y positions of newly added data points at interval of 50 frames not only increases the efficiency but also reduces unnecessary computations.
  • the media-recreation module 108 displays the textual key-frames by panning them with a selection window in accordance with the metadata.
  • the metadata derived in regular intervals of k frames for driving the selection window may result in a jittered panning of the textual key-frames.
  • the track points of intermediate frames occurring in the time period between the pth and (p+k)th frame positions are detected.
  • let (x(t1), y(t1)) and (x(t2), y(t2)) be the track points of frames occurring at times t1 and t2 respectively.
  • the track point of an intermediate frame occurring at a time t between t1 and t2 is interpolated using the track points of the frames at t1 and t2.
  • the x-position of the track point of such an intermediate frame at time t can be calculated by interpolating between x(t1) and x(t2).
  • the y-position of the track point of the intermediate frame at time t is computed using the y-positions of the track points of frames occurring at t1 and t2.
  • a plot of track points of temporally spaced textual frames is illustrated.
  • the track points are plotted for textual frames occurring in the time period from t1 to t8.
  • the plot of track points essentially provides an idea of the movement of the pen of the lecturer on the writing board. When there is a considerable deviation between two consecutive track points, say the track points at t4 and t5, it implies that during the writing activity, the lecturer suddenly toggles from one point to another.
  • the size of the selection window for panning a key-frame is equal to the size of the display screen of the miniature video device so as to display the textual content with maximum resolution.
  • a considerable deviation between two track points has to be taken into account as it may be beyond both the spatial span and movement of the preferable selection window. Such deviations are taken into account by computing the distance between consecutive track points at t4 and t5 as d = sqrt((x(t5) - x(t4))^2 + (y(t5) - y(t4))^2),
  • where (x(t4), y(t4)) and (x(t5), y(t5)) are the x-y positions of the consecutive track points. If d is greater than a predefined value, say 100, the size of the selection window is increased between t4 and t5 to make it large enough to include the track points at t4 and t5, i.e. (x(t4), y(t4)) and (x(t5), y(t5)), totally inside the selection window.
  • the size of selection window is increased for wider visibility such that the viewer does not miss any region of interest and to provide a sense of spatial context. Under these circumstances, the text does not appear in its full resolution to the viewer since scaling is required to fit the selection window on the screen.
  • the size of the selection window is then decreased back to the preferable size in the time interval after t5, since there is no considerable deviation between the subsequent track points.
  • the feature of varying the size of the selection window based on the difference between the track points of temporally adjacent textual frames generates a feel of automatic zoom in/out to the viewer and may be referred to as a multi-resolution local visual delivery.
  • the display is a content adaptive one.
  • a viewer may be provided with an optional manual over-riding control over the region of interest during display of the recreated lecture video on the miniature video device.
  • the viewer can use the touch screen technology or the selection key in their miniature devices for selecting their region of interest and displaying it with full or appropriate resolution.
  • a user is thus allowed to opt for their own way of viewing the content.
  • the creation of metadata may be slightly different from the manner in which the metadata is created for video containing hand written slides.
  • the position of the current line is detected by using the HPP of the ink pixels of consecutive textual frames.
  • the position of the current line serves as the metadata required on media re-creation module 108 for the vertical positioning of the selection window after which it is linearly swept horizontally with a timeline slow enough to watch the content. This process is repeated until a new line is introduced below that.
  • the vertical step size can be calculated from the ratio of the vertical dimensions of the display size of the server and client devices.
  • the time interval for sweeping along a particular line can be calculated from the total time duration required for displaying that key-frame and the vertical step size.
  • the total time duration of the original lecture video 401 is 16 minutes, which includes a talking head activity scene for 4 minutes, a writing hand activity scene for 8 minutes, again a talking head activity scene for 4 minutes and associated audio 407.
  • the first talking head activity scene includes a total of 6000 non-textual frames
  • the writing hand activity scene comprises a total of 12000 textual frames
  • the second talking head activity scene includes a total of 6000 non-textual frames.
  • a metadata indicating writing advancement in the textual frames is created using HPPs and VPPs of textual frames temporally spaced apart by a predefined time interval, i.e. 2 seconds.
  • the non-textual key-frames 402, 405 and textual key-frames 403, 404 are extracted from non-textual and textual frames respectively using known methods. Based on the metadata, audio 407 and key-frames 402-405, the lecture video is recreated on the miniature video device. In the recreated lecture video 406, the non-textual key-frames 402 and 405 are resized to fit onto the miniature device screen, whereas the textual key-frames 403 and 404 are panned with the selection window to display child frames 408 and 409 so as to provide the region of interest to the viewer with maximum resolution.
  • a lecture video of time duration 55-60 minutes having a frame rate of 25 frames per second is taken, where the content portion of the lecture video comprises handwritten slides.
  • the extraction of metadata is performed on the textual frames of the original video by computing HPPs of every two textual frames temporally spaced apart by two seconds.
  • HPPs of two such textual frames are plotted on the same graph as illustrated in Fig. 5.
  • the plot of local sum of the absolute difference of said HPPs is shown in Fig. 6.
  • the VPP based x-position detection method is illustrated in Figs. 7 and 8.
  • Figs. 7(a) and (b) respectively illustrate the VPPs of cropped horizontal strip regions around the detected y-position of the two frames.
  • the plot of local sum of the absolute difference of these VPPs is shown in Fig. 8.
  • the region where a high overshoot occurs essentially contains the newly written text.
  • the tracked point is (222, 400). This procedure is repeated for all content textual frames in the intervals of 50 frames.

Abstract

A method and system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device where the lecture video comprises a sequence of textual and non-textual frames along with associated audio. The method comprises creating a metadata that indicates location of newly added data points in textual frames temporally spaced apart by a predefined time interval by computing horizontal and vertical projection profiles of ink pixels in said textual frames and detecting x-y positions of newly added data points thereof; and sequentially displaying key-frames extracted from the textual and non-textual frames in accordance with the metadata by panning textual key-frames with a selection window having an aspect ratio and size in accordance with a display screen of the miniature video device and a center point as x-y position of newly added data point in the respective textual frame.

Description

TITLE OF THE INVENTION
A method and system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device
FIELD OF THE INVENTION
This invention relates to a method and system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device. More specifically, the invention relates to providing a legible display of a lecture video on the miniature video device.
BACKGROUND OF THE INVENTION
With the increased use of handheld devices and the development of mobile multimedia technologies, a large number of users prefer to view video content on their miniature video devices. Examples of miniature video devices include, but are not limited to, mobile terminals, Personal Digital Assistants (PDA), hand-held computers, cell phones, tablet PCs, etc. The video content intended for large screens like television and computer monitors is resized to fit on the smaller screens of such miniature devices. During resizing, some information might be inherently lost either by spatial sub-sampling or by marginal cropping while attempting to fit it on the smaller screen. As a result, the original video when played on miniature video devices suffers from overhead in memory usage and poor resolution.
An educational medium such as a lecture video mainly comprises a lecturer writing content on a board/writing pad or explaining some slide shows. Such lecture videos are largely useful in distance education, community outreach programs and video-on-demand applications. When such a video is displayed on a small screen, the textual content becomes too small to be legible. Therefore, there is a need to efficiently overcome the degradation of the legibility of the displayed textual content of a lecture video due to the limitations on the screen size.
Reference is made to a paper titled as "Video retargeting: Automatic Pan and Scan, Proceedings of the 14th annual ACM international conference on Multimedia, pp. 241-250, 2006", which discloses a retargeting method for conventional video with importance of information so as to fit it on the target display. The video frames are cropped according to the local importance of information so as to fit it on the target display. However, it does not provide a legible display of textual contents of a lecture video on a miniature screen.
Reference is made to WO2009042340, which discloses processing video data for automatically selecting at least one key-frame based on a set of criteria and then encoding the video data with an identifier. However, it does not provide legibility retention of the video frames which is crucial in the case of lecture videos.
Reference is made to US 20090251594, which discloses finding salient regions in video frames using a scale-space, spatio-temporal information and displaying a cropped region which is isotropically resized to match the target display. However, this method never preserves the resolution of the video frames and is not suitable for video containing document images.
Reference is made to a paper titled as "Visual input for pen-based computers, Proc. of International Conf. on Pattern Recognition, Vol. 3, 1996, pp. 33-37", which discloses using a combination of a correlation detector, Canny edge detector and Kalman filter for tracking the trajectory of the tip of the pen in a lecture video while a lecturer writes content. The trajectory of the tip of the pen may be useful to select a region of interest in which the pen currently strikes and display the region of interest with maximum resolution on the screen of a portable device. However, such method fails to track the pen when the instructor moves away his/her pen from the paper often for prolonged duration and then starts writing again. The pen and the corresponding hand occlude part of the writing and are not suitable for displaying the content. Further, the method is not amenable to any data reduction technique, which is a must for a lecture video.
Reference may be made to a paper titled "Video based on-line handwriting recognition", Proc. of the sixth International Conf. on Document Analysis and Recognition (ICDAR '01), which discloses handwriting recognition. However, the legible display of textual content of a lecture video does not require any character recognition; instead, recognizing the region of interest which is currently being written and displaying that region of interest with maximum resolution is of prime concern.
Reference is made to Indian Patent Application No. 386/MUM/2009, titled "A device and method for automatically recreating a content preserving and compression efficient lecture video". Said invention primarily relates to extraction of key-frames from the lecture video and then recreating the key-frames at the display device without degrading the audio video quality. However, said application does not disclose legible display of key-frames of the lecture video on a miniature video device.
Therefore, there is a need for a method and system which improve the legibility of a lecture video on a miniature video device. The method and system should facilitate legible display of textual content of a lecture video on a portable device without any loss in instructional value. Further, the legible display of the lecture video should be highly compressed, streamable, and synchronized with respect to the audio channel of the original video.
OBJECTS OF THE INVENTION
An object of the invention is to provide a content adaptive and legibility retentive display of a lecture video on a miniature video device.
Another object of the invention is to provide legible display of textual content of a lecture video on a miniature video device. Yet another object of the invention is to provide a legible display of the lecture video which is highly compressed, streamable, and synchronized with respect to the audio channel of the original video.
DETAILED DESCRIPTION OF THE EMBODIMENTS OF THE INVENTION
According to the invention, there is provided a method for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device, the lecture video comprising a sequence of textual and non-textual frames along with associated audio. The method comprises the steps of: creating a metadata that indicates location of newly added data points in textual frames temporally spaced apart by a predefined time interval by computing horizontal and vertical projection profiles of ink pixels in said textual frames and detecting x-y positions of newly added data points thereof; and sequentially displaying key-frames extracted from the textual and non-textual frames in accordance with the metadata by panning textual key-frames with a selection window having an aspect ratio and size in accordance with a display screen of the miniature video device and a center point as x-y position of newly added data point in the respective textual frame.
Preferably, detecting the y-position of newly added data point in a current textual frame comprises comparing Horizontal Projection Profile (HPP) of ink pixels of the current textual frame with that of a previous textual frame, and setting the y-position as the point where the amount of differential ink pixels in the HPPs reaches a threshold value. Further, detecting the x-position of the newly added data point in the current textual frame comprises comparing Vertical Projection Profiles (VPPs) of ink pixels contained in regions around the detected y-position of the current and previous textual frames and setting the x-position as the point where the amount of differential ink pixels in the VPPs is maximum.
Preferably, the location of newly added data point in an intermediate frame occurring in time period between two textual frames temporally spaced apart by the predefined time interval is detected by temporally interpolating x-y positions of newly added data points of said two textual frames.
Preferably, the size of selection window is equal to the size of display screen of the miniature video device and the size of selection window is varied based on the difference between locations of newly added data points of temporally adjacent textual frames. Preferably, when the lecture video has a frame rate of 25 frames per second and a sub-sampling rate of 50, the HPP and VPP of textual frames spaced apart by 2 seconds are computed, leading to detection of newly added data points at a regular interval of 50 frames.
According to the invention, there is provided a system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device, the lecture video comprising a sequence of textual and non-textual frames along with associated audio. The system comprises a metadata creation module configured to create a metadata that indicates location of newly added data points in textual frames temporally spaced apart by a predefined time interval by computing horizontal and vertical projection profiles of ink pixels in said textual frames and detecting x-y positions of newly added data points thereof. The system further comprises a media re-creation module configured to receive the metadata, associated audio, and key-frames extracted from textual and non-textual frames, and to sequentially display the key-frames in accordance with the metadata by panning the textual key-frames with a selection window having an aspect ratio and size in accordance with the miniature video device and a center point as x-y position of newly added data point of the respective textual frame.
Preferably, the metadata creation module is configured to detect the y-position of a newly added data point in a current textual frame by comparing the Horizontal Projection Profile (HPP) of ink pixels of the current textual frame with that of a previous textual frame, and setting the y-position as the point where the amount of differential ink pixels in the HPPs reaches a threshold value; and detect the x-position of the newly added data point in the current textual frame by comparing Vertical Projection Profiles (VPPs) of ink pixels contained in the regions around the detected y-position of the current and previous textual frames and setting the x-position as the point where the amount of differential ink pixels in the VPPs is maximum. Preferably, the media recreation module is configured to vary the size of selection window based on the difference between locations of newly added data points of temporally adjacent textual frames.
Preferably, the media recreation module is configured to provide a manual over-riding control to the viewer for selecting their region of interest in the lecture video and displaying said region with full or appropriate resolution. The media recreation module is also configured to receive the metadata, associated audio and key-frames from a Server device as streaming media.
These and other aspects, features and advantages of the invention will be better understood with reference to the following detailed description, accompanying drawings and appended claims, in which,
Fig 1 is a block diagram illustrating a system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device;
Fig. 2 is a functional block diagram representing the steps involved in a method for creating a meta data indicating location of newly added data point in two textual frames temporally spaced apart by a pre-defined time interval;
Fig. 3 is a plot of x-y positions of newly added data points in temporally adjacent textual frames;
Fig. 4 is a block diagram illustrating a lecture video, textual and non-textual key-frames, associated audio, and the recreated lecture video using the metadata;
Fig. 5 is a graph illustrating horizontal projection profiles of ink pixels of two temporally adjacent textual frames;
Fig. 6 is a graph illustrating local sum of differential ink pixels of horizontal projection profiles of two temporally adjacent textual frames;
Figs. 7a and 7b are graphs illustrating vertical projection profiles of ink pixels in cropped regions of two temporally adjacent textual frames; and
Fig. 8 is a graph illustrating local sum of differential ink pixels of vertical projection profiles of cropped regions of two temporally adjacent textual frames.
Referring to Fig. 1, a system 100 for providing a content adaptive and legibility retentive display of a lecture video is described. The system 100 includes a media splitter 101, a gray scale conversion module 102, a segmentation and shot recognition module 103, a non-textual key-frame extraction module 104, a textual key-frame extraction module 105, a metadata creation module 106, a metadata interpolation module 107, and a media re-creation module 108. Each module performs a specific function, each function being a contributory step in providing a content adaptive and legibility retentive display of a lecture video on the miniature video devices such as mobile terminals, Personal Digital Assistants (PDA), hand-held computers, tablet PCs.
A typical lecture video comprises a lecturer explaining a topic by writing on a board or using slide shows. Such a lecture video may include scenes of various instructional activities such as a talking head activity scene, a class room activity scene, a writing hand activity scene or a slide show activity scene. A typical video is made up of unique consecutive images referred to as frames. The frames are displayed to the user at a rate referred to as frames per second.
The media splitter 101 is configured to receive an original lecture video and split it into video and audio data. The gray-scale conversion module 102 and the segmentation and shot recognition module 103 execute temporal segmentation of the video data to detect scene changes/breaks therein and then detect the activities in it. To determine a scene break, a histogram difference is measured between two consecutive frames of the video data. If the sum of absolute differences of the histograms between two consecutive frames of the video data crosses a threshold, the frames are declared as shot boundary frames, thereby determining a scene break. The activity detection of scenes is HMM (Hidden Markov Model) based and is carried out in two phases, i.e. a training and a testing phase. During the training phase, the HMM parameters are learned, based on which a scene is classified into one of the above mentioned activities. For example, to classify a scene into a talking head activity scene, writing hand activity scene or a slide show activity scene, motion within the scene is taken into account for classification. Motion in a talking head activity scene is more than that in a writing hand activity scene and least in a slide show activity scene. Therefore, the energy of the temporal derivative in intensity space is used as a relevant feature. The gray-level histogram gives the distribution of the image pixels over different intensity values. It is very sparse for the slide show activity scene, moderately sparse for the writing hand activity scene and dense for the talking head activity scene. Hence the entropy of the histogram can be treated as another good feature for the detection of these activities. Histogram entropy is a direct measure of the variation of pixel intensity in an image. If there is a high variation of intensity, the entropy will be high, and vice versa.
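The shot-boundary test and the two activity features described above (temporal-derivative energy and histogram entropy) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the use of NumPy, the 256-bin histograms and the logarithm base are assumptions.

import numpy as np

def histogram_difference(frame_a, frame_b, bins=256):
    # Sum of absolute differences of the gray-level histograms of two
    # consecutive frames; crossing a threshold marks a shot boundary.
    h_a, _ = np.histogram(frame_a, bins=bins, range=(0, 255))
    h_b, _ = np.histogram(frame_b, bins=bins, range=(0, 255))
    return int(np.abs(h_a - h_b).sum())

def temporal_derivative_energy(frame_a, frame_b):
    # Energy of the temporal derivative in intensity space, used as a
    # motion feature (talking head > writing hand > slide show).
    diff = frame_a.astype(np.float64) - frame_b.astype(np.float64)
    return float((diff ** 2).sum())

def histogram_entropy(frame, bins=256):
    # Entropy of the gray-level histogram: low for the sparse slide-show
    # histogram, higher for writing-hand and talking-head scenes.
    hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())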
According to the type of instructional activity, the content of the video data may be classified into textual and non-textual content. For example, the talking head activity and class room activity scenes are non-textual content, whereas the writing hand activity and slide show activity scenes are textual content, as in some way or other these scenes display textual content to the user. The frames of the lecture video which include textual content are hereinafter referred to as textual frames, whereas the frames which include non-textual content are hereinafter referred to as non-textual frames. Hence, a typical lecture video comprises a sequence of textual and non-textual frames along with associated audio.
The textual key-frame extraction module 105 and the non-textual key-frame extraction module 104 are configured to extract representative key-frames from the textual and non-textual frames respectively, such that a set of key-frames represents a summarized semantic content of an entire scene for a particular duration. The textual key-frame extraction module 105 performs ink-pixel based extraction of key-frames from textual frames, whereas the non-textual key-frame extraction module 104 performs a visual quality based extraction of key-frames from non-textual frames.
The meta-data creation module 106 is configured to create a metadata which indicates location in the textual frames where the lecturer is currently writing or the text is appearing in a slide show. The location in a textual frame which is currently being scribbled is represented by an x-y position which varies in accordance with the writing advancement from frame to frame. The region in the textual frame where the lecturer is currently writing or the text is appearing is the region of interest for a viewer as during the display of video data, the viewer will usually focus on that portion of the video where text is appearing or the lecturer is writing.
In accordance with various embodiments of the present invention, the metadata creation module 106 creates metadata that indicates location of newly added data points in textual frames temporally spaced apart by a pre-defined time interval by computing horizontal and vertical projection profiles of ink pixels in said textual frames and detecting x-y positions of newly added data points thereof. In an embodiment of the present invention, a set of textual frames temporally spaced apart by the pre-defined time interval is obtained by sub-sampling textual frames at a predefined rate, the pre-defined rate being equal to the product of the pre-defined time interval and the frame rate of the lecture video. The frame rate of the lecture video may be represented by f and the temporal sub-sampling rate by k, where f may take values such as 25 fps, 30 fps, etc., and k may take values such as 10, 20, 40, 50, 60 or even higher.
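Written out, the sub-sampling rate is simply the product of the frame rate and the chosen time interval; for the 25 fps, 2-second setting used in the embodiments described later this gives

k = f × Δt = 25 frames/second × 2 seconds = 50 frames.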
The y-position of newly added data point in a current textual frame is detected by comparing the Horizontal Projection Profile (HPP) of ink pixels of the current textual frame with that of a previous textual frame, and setting the y-position as the point where the amount of differential ink pixels in the HPPs reaches a threshold value.
When the y-position of newly added data point of a current frame is detected, the corresponding x-position is detected by comparing Vertical Projection Profiles (VPPs) of ink pixels contained in the regions around the detected y-position of the current and previous textual frames and setting the x-position as the point where the amount of differential ink pixels in the VPPs is maximum. The x-y position of newly added data point in a textual frame may also be referred as a track point of the textual frame.
The track points of textual frames spaced apart by the predefined time interval constitute the metadata, which may be an array of size 2 * L, where L is the total number of textual frames derived from the video data. L may be higher than the count of non-textual frames since the writing operation is a much slower activity compared to the standard video frame rate.
The metadata interpolation module 107 is configured to interpolate the location of a newly added data point of an intermediate frame occurring in time period between two textual frames temporally spaced apart by a predefined time interval using locations of newly added data points in said two textual frames.
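The patent only states that the intermediate track point is obtained by temporally interpolating the two bounding track points; the short sketch below assumes plain linear interpolation, which is the simplest reading of that statement, and the function name is illustrative.

def interpolate_track_point(t, t1, p1, t2, p2):
    # p1 = (x(t1), y(t1)) and p2 = (x(t2), y(t2)) are the track points
    # detected at times t1 and t2; t1 <= t <= t2 is the intermediate time.
    # Linear interpolation is an assumption; the patent says only
    # "temporally interpolating".
    alpha = (t - t1) / float(t2 - t1)
    x = p1[0] + alpha * (p2[0] - p1[0])
    y = p1[1] + alpha * (p2[1] - p1[1])
    return (x, y)

# Example: track point 40% of the way between two detections 2 seconds apart:
# interpolate_track_point(0.8, 0.0, (222, 400), 2.0, (260, 410))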
The media re-creation module 108 is configured to recreate the lecture video on the miniature video device. Essentially, the media re-creation module 108 is configured to receive the metadata, associated audio and key-frames extracted from textual and non-textual frames, and sequentially display the key-frames in accordance with the metadata by panning the textual key-frames with a selection window having an aspect ratio and size in accordance with the display screen of the miniature video device and a center point as the x-y position of the newly added data point in the respective textual frame. In the recreated lecture video, the non-textual key-frames are resized to fit onto the screen of the miniature video device, whereas the textual key-frames are cropped to display the region of interest with maximum resolution.
Essentially, during the recreation of the lecture video on the media re-creation module 108, the metadata drives the selection window to scan the entire content in the textual key-frame for the time for which that textual key-frame is intended for display. In accordance with an embodiment of the present invention, the media re-creation module 108 displays a cropped region of the textual key-frame to the viewer, the cropped region being the region of interest and having size equal to the display screen of the miniature video device. The cropped region of a textual key-frame may also be referred to as key-hole image of respective key-frame. The cropped region may further be referred as a child frame of the parent key-frame.
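A minimal sketch of how a child (key-hole) frame might be cropped from a textual key-frame is given below. The clamping at the frame borders, the enlargement rule when two consecutive track points are farther apart than the distance threshold of 100 used in one embodiment, and the final rescaling with OpenCV are illustrative assumptions, not the patented procedure.

import numpy as np
import cv2  # used only to rescale an enlarged window back to screen size

def crop_child_frame(key_frame, track_point, screen_size,
                     prev_track_point=None, deviation_threshold=100):
    # key_frame: textual key-frame (H x W gray image); track_point: (x, y)
    # position of the newly added data; screen_size: (width, height) of the
    # miniature device display.
    frame_h, frame_w = key_frame.shape[:2]
    win_w, win_h = screen_size
    x, y = track_point

    if prev_track_point is not None:
        px, py = prev_track_point
        if np.hypot(x - px, y - py) > deviation_threshold:
            # Enlarge the window so both track points stay visible
            # (automatic zoom-out); the exact rule here is an assumption.
            win_w = max(win_w, int(abs(x - px)) + win_w)
            win_h = max(win_h, int(abs(y - py)) + win_h)
            x, y = (x + px) / 2.0, (y + py) / 2.0

    win_w, win_h = min(win_w, frame_w), min(win_h, frame_h)
    # Clamp the window so it lies completely inside the key-frame.
    left = int(min(max(x - win_w / 2, 0), frame_w - win_w))
    top = int(min(max(y - win_h / 2, 0), frame_h - win_h))
    child = key_frame[top:top + win_h, left:left + win_w]

    if (win_w, win_h) != tuple(screen_size):
        # Scale the enlarged window down to the screen: full resolution is
        # traded for spatial context, as described for large deviations.
        child = cv2.resize(child, tuple(screen_size))
    return child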
In accordance with various embodiments of the present invention, the system 100 may be deployed using a client-server configuration. In a typical client-server configuration, the Server device will have the lecture video, the media splitter 101, gray scale conversion module 102, segmentation and shot recognition module 103, textual and non-textual key-frame extraction modules 105 and 104, and the metadata creation module 106, whereas the client device will include the media re-creation module 108. In a preferred embodiment, the media re-creation module 108 is an instructional media player of the client device. Further, the metadata interpolation module 107 may either be present at the client device or the Server device.
The client device may be a miniature video device and may request a lecture video from the Server device. In response to the request, the Server device may send the metadata, key-frames and associated audio of the lecture video to the client device. In accordance with an embodiment of the present invention, the client device may either download the metadata, key-frames and associated audio from the Server device or receive them as streaming data. In an embodiment of the present invention, the data received at the media re-creation module 108 in response to a request for a lecture video will include (a) a few key-frames in any image file format, (b) an audio file in a suitable file format and (c) an XML or equivalent text file containing the temporal marking for the placement of key-frames and the metadata. It may be noted that instead of receiving a lecture video, the media re-creation module 108 receives the key-frames, metadata and audio. The temporally long-lasting video shots are replaced by the corresponding static key-frames; therefore, the data received at the client device is highly compressed with respect to the original lecture video. In this way, the original lecture video is received in a highly reduced form at the client device but with better means to suit the target display and storage with minimal information loss.
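For illustration only, the text file received by the player might be laid out as in the snippet below and parsed as shown; every tag and attribute name here is a hypothetical layout, since the patent only requires "an XML or equivalent text file" carrying the key-frame placement times and the track-point metadata. The track point (222, 400) is taken from the worked example elsewhere in this document; the remaining values are made up.

import xml.etree.ElementTree as ET

# Hypothetical layout of the temporal-marking/metadata file; the tag and
# attribute names are illustrative, not defined by the patent.
SAMPLE_XML = """
<lecture audio="lecture01.mp3">
  <keyframe file="kf_0001.png" type="non-textual" start="0.0" end="240.0"/>
  <keyframe file="kf_0002.png" type="textual" start="240.0" end="720.0">
    <trackpoint t="242.0" x="222" y="400"/>
    <trackpoint t="244.0" x="260" y="410"/>
  </keyframe>
</lecture>
"""

def load_metadata(xml_text):
    # Parse key-frame placement times and track points for the media
    # re-creation module (instructional media player).
    root = ET.fromstring(xml_text)
    keyframes = []
    for kf in root.findall("keyframe"):
        points = [(float(tp.get("t")), int(tp.get("x")), int(tp.get("y")))
                  for tp in kf.findall("trackpoint")]
        keyframes.append({"file": kf.get("file"), "type": kf.get("type"),
                          "start": float(kf.get("start")),
                          "end": float(kf.get("end")),
                          "track_points": points})
    return root.get("audio"), keyframes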
The detection of x-y position of newly added data point in two textual frames spaced apart by the predefined time interval/sub-sampled at rate k, is explained with reference to Fig. 2.
At 201, the Horizontal Projection Profile (HPP) of ink pixels of a current frame, i.e. the pth frame, is computed by projecting its ink pixels on the y-axis. In a similar fashion, at 202, the HPP of ink pixels of a previous frame, i.e. the (p-k)th frame, is computed. At 203, the HPP of the (p-k)th frame is subtracted from the HPP of the pth frame to detect the point where the local amount of differential ink pixels in the HPPs reaches a threshold value. Said point is then set as the y-position of the newly added data point of the pth frame.
At 204, the Vertical Projection Profile (VPP) of ink pixels of a cropped pth frame is computed. The cropped pth frame is a horizontal strip image of the pth frame including ink pixels contained in the regions around the detected y-position. At 205, the VPP of ink pixels of a cropped (p-k)th frame is computed. The cropped (p-k)th frame is a horizontal strip image of the (p-k)th frame including ink pixels contained in the regions around the detected y-position. At 206, the VPP of the (p-k)th frame is subtracted from the VPP of the pth frame to detect the point where the local amount of differential ink pixels in the VPPs is maximum. Said point is then set as the x-position of the newly added data point of the pth frame. The detected x-y position of the pth frame is then set as the track point of the pth textual frame.
In an exemplary embodiment of the present invention, let B_t(m, n), 0 < m ≤ M, 0 < n ≤ N, be the textual frame at time t, where M and N are the number of rows and columns of said frame respectively. The HPP of an image is a vector in which each element is obtained by summing up the pixel values along a row. Therefore, the HPP of the textual frame at time t is an array P_t of length M in which each element stands for the total number of ink pixels in a row of the processed frame. Since pixels in each frame of the content scene are converted into ink and paper pixels corresponding to gray levels 0 and 255 respectively, the array P_t is given by
P_t(m) = Σ_{n=1}^{N} |B_t(m, n) - 255|;   m = 1, 2, ..., M    (1)
To detect the y-position of the newly added data point in textual frame B_t(m, n), the local amount of differential ink pixels in the HPPs of the temporally adjacent frames B_t(m, n) and B_{t-k}(m, n) is counted and the point where it grows to a considerable value is located. For this, the HPP of the previous frame B_{t-k}(m, n) is subtracted from that of the current frame B_t(m, n) to obtain a difference HPP array D(m):
D(m) = P_t(m) - P_{t-k}(m)    (2)
The local absolute summation S_D(j) of the difference HPP array D(m) is performed to detect any considerable variation in ink pixels:

S_D(j) = Σ_{m=j-w_d}^{j+w_d} |D(m)|;   w_d < j < M - w_d    (3)
where w_d is the localization parameter for summation and Θ is the threshold for differential ink detection. If S_D(j) > Θ, it may be assumed that there is a scribbling activity, i.e. a new data point has been introduced in the jth line, since the local sum of the absolute HPP difference yields a high value as a result of the introduction of extra ink pixels in that line. Therefore, with the help of the HPPs of two temporally adjacent frames, the y-position of the newly added data point in a textual frame is detected.
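A minimal sketch of this HPP-based y-position detection is given below, assuming grayscale frames with ink pixels at 0 and paper pixels at 255; the function name, window size and threshold are illustrative, not values prescribed by the invention.

import numpy as np

def hpp(frame):
    # Horizontal projection profile of ink pixels, per eq. (1); frame values are in {0, 255}
    return np.abs(frame.astype(np.int32) - 255).sum(axis=1)

def detect_y_position(curr, prev, w_d=10, theta=5000):
    # Difference HPP, eq. (2), and its local absolute sum, eq. (3)
    d = hpp(curr) - hpp(prev)
    M = d.shape[0]
    for j in range(w_d, M - w_d):
        s_d = np.abs(d[j - w_d:j + w_d + 1]).sum()
        if s_d > theta:      # scribbling activity: new ink around row j
            return j
    return None              # no new ink detected between the two frames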
The x-position of the newly added data point of the textual frame B_t(m, n) is detected using the detected y-position. In both the current and previous frames B_t(m, n) and B_{t-k}(m, n), the horizontal strip region around the detected y-position is cropped and the VPPs of the cropped images are computed. The point at which the local sum of the absolute differences of the VPPs yields the maximum value is taken as the x-position of the newly added data point of the textual frame B_t(m, n).
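Continuing the sketch above with equally illustrative names and window sizes, the x-position may be obtained from the VPPs of the cropped horizontal strips as follows:

import numpy as np

def detect_x_position(curr, prev, y, strip=20, w_d=10):
    # Crop a horizontal strip around the detected y-position in both frames
    top, bottom = max(0, y - strip), min(curr.shape[0], y + strip)
    vpp_curr = np.abs(curr[top:bottom].astype(np.int32) - 255).sum(axis=0)
    vpp_prev = np.abs(prev[top:bottom].astype(np.int32) - 255).sum(axis=0)
    d = vpp_curr - vpp_prev
    N = d.shape[0]
    # Column where the local sum of absolute VPP differences is maximum
    local_sums = [np.abs(d[n - w_d:n + w_d + 1]).sum() for n in range(w_d, N - w_d)]
    return w_d + int(np.argmax(local_sums))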
If S_D(j) < Θ for the entire range of j, it may be assumed that no extra ink pixels are introduced in the frame B_t(m, n), which means that no writing activity has occurred and no additional text is displayed in the time period between the frames B_{t-k}(m, n) and B_t(m, n). Such a situation may occur when a lecturer stops introducing new text on the display board of the classroom and only explains the textual content already present on the display board. Such a situation may also occur when the lecturer explains some other topic not present on the display board. In such cases, when no writing activity occurs and no new text is introduced in a textual frame, the corresponding textual key-frame is not panned with a selection window to display a child frame during re-creation of the lecture video; instead, the parent textual key-frame is resized and displayed.
When the lecture video has a frame rate of 25 frames per second and a sub-sampling rate k of 50, the HPP and VPP of textual frames temporally spaced apart by 2 seconds are computed, leading to detection of the x-y positions of newly added data points in the lecture video at regular intervals of k, i.e. 50 frames. It may be noted that the time interval of 2 seconds for computation of track points in textual frames is taken on the assumption that there would not be too much variation of textual content within a 2-second duration in a lecture video. Further, the writing pace of the lecturer is usually very slow compared to the video frame rate; the increment in textual content (extra ink pixels introduced) from frame to frame is therefore so small that no noticeable difference appears in the HPPs of consecutive textual frames. Computing the x-y positions of newly added data points at intervals of 50 frames thus avoids unnecessary computation and improves efficiency.
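For instance, the track points could be gathered with a loop of the following form, where detect_y_position and detect_x_position stand for the procedures sketched earlier; the frame list and sub-sampling rate are illustrative:

def compute_track_points(frames, k=50):
    # frames: list of grayscale textual frames; k = 50 corresponds to 2 s at 25 fps
    track_points = {}
    for p in range(k, len(frames), k):
        y = detect_y_position(frames[p], frames[p - k])
        if y is not None:
            x = detect_x_position(frames[p], frames[p - k], y)
            track_points[p] = (x, y)
    return track_points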
As explained earlier, during re-creation of the lecture video, the media re-creation module 108 displays the textual key-frames by panning them with a selection window in accordance with the metadata. However, it may be noted that metadata derived at regular intervals of k frames for driving the selection window may result in a jittered panning of the textual key-frames. Hence, to make the panning smooth, the track points of intermediate frames occurring in the time period between the pth and (p+k)th frame positions are detected.
In an exemplary embodiment, let (x(t1), y(t1)) and (x(t2), y(t2)) be the track points of frames occurring at times t1 and t2 respectively. In accordance with an embodiment of the present invention, the frames occurring at times t1 and t2 may be temporally adjacent frames sub-sampled at a rate k, where k = 50. The track point of an intermediate frame occurring at a time t between t1 and t2 is interpolated using the track points of the frames at t1 and t2. In an embodiment of the present invention, the x-position of the track point of such an intermediate frame at time t can be calculated as
x(t) = x(t1) + [(t - t1)/(t2 - t1)] [x(t2) - x(t1)]    (4)
Similarly, the y-position of the track point of the intermediate frame at time t is computed using the y-positions of the track points of the frames occurring at t1 and t2.
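A short sketch of this interpolation, assuming, as in equation (4), that it is linear between successive track points, could look as follows; the function and argument names are illustrative:

def interpolate_track_point(p1, p2, t):
    # p1 = (t1, x1, y1) and p2 = (t2, x2, y2) are track points of two sub-sampled frames, t1 <= t <= t2
    t1, x1, y1 = p1
    t2, x2, y2 = p2
    alpha = (t - t1) / (t2 - t1)
    return (x1 + alpha * (x2 - x1), y1 + alpha * (y2 - y1))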
With reference to Fig. 3, a plot of track points of temporally spaced textual frames is illustrated. The track points are plotted for textual frames occurring in the time period from t1 to t8. The plot of track points essentially provides an idea of the movement of the lecturer's pen on the writing board. When there is a considerable deviation between two consecutive track points, say the track points at t4 and t5, it implies that during the writing activity the lecturer suddenly toggles from one point to another.
Preferably, the size of the selection window for panning a key-frame is equal to the size of the display screen of the miniature video device, so as to display the textual content with maximum resolution. However, a considerable deviation between two track points has to be taken into account, as it may be beyond both the spatial span and the movement of the preferable selection window. Such deviations are taken into account by computing the distance between the consecutive track points at t4 and t5 as
d = [(x(t5) - x(t4))^2 + (y(t5) - y(t4))^2]^(1/2)    (5)
where (x(t4), y(t4)) and (x(t5), y(t5)) are the x-y positions of the consecutive track points. If d is greater than a predefined value, say 100, the size of the selection window is increased between t4 and t5 to make it large enough to include the track points at t4 and t5, i.e. (x(t4), y(t4)) and (x(t5), y(t5)), entirely within the selection window. The size of the selection window is increased for wider visibility, such that the viewer does not miss any region of interest, and to provide a sense of spatial context. Under these circumstances, the text does not appear in its full resolution to the viewer since scaling is required to fit the selection window on the screen.
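A simple sketch of this decision is given below, with an illustrative deviation threshold of 100 pixels; the window-growing strategy shown (the smallest window of the screen's aspect ratio, centred between the two track points, that contains both) is one possible choice and is not prescribed by the invention.

import math

def selection_window(prev_pt, curr_pt, screen_w, screen_h, d_max=100):
    # prev_pt, curr_pt: consecutive track points (x, y); screen_w x screen_h: preferred window size
    x1, y1 = prev_pt
    x2, y2 = curr_pt
    d = math.hypot(x2 - x1, y2 - y1)                 # distance between track points, eq. (5)
    if d <= d_max:
        return (x2, y2, screen_w, screen_h)          # preferred size, centred on the current track point
    # Grow the window, keeping the screen aspect ratio, until both track points fit;
    # the content is then scaled down to fit the screen, giving an automatic zoom-out.
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    scale = max(abs(x2 - x1) / screen_w, abs(y2 - y1) / screen_h, 1.0)
    return (cx, cy, screen_w * scale, screen_h * scale)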
The size of the selection window is then decreased to the preferable size in the time interval from t5 to t6, since there is no considerable deviation between the track points at t5 and t6. The feature of varying the size of the selection window based on the difference between the track points of temporally adjacent textual frames gives the viewer a feel of automatic zoom in/out and may be referred to as a multi-resolution local visual delivery. Thus, based on the content developed by the instructor, the parameters for pan and zoom are learned from the data itself and changed accordingly. Therefore, the display is a content-adaptive one.
In accordance with various embodiments of the present invention, a viewer may be provided with an optional manual overriding control over the region of interest during display of the re-created lecture video on the miniature video device. Under manual control, the viewer can use the touch screen or the selection keys of the miniature device for selecting a region of interest and displaying it with full or appropriate resolution. Thus a user is allowed to choose his or her own way of viewing the content.
In the case of slide shows, where the textual content is displayed in the form of incremental lines or with special effects, the creation of metadata may be slightly different from the manner in which the metadata is created for video containing handwritten slides. In slide shows, the position of the current line is detected by using the HPPs of the ink pixels of consecutive textual frames. The position of the current line serves as the metadata required by the media re-creation module 108 for the vertical positioning of the selection window, after which the window is linearly swept horizontally with a timeline slow enough to watch the content. This process is repeated until a new line is introduced below it. Hence the overall mechanism is to slowly pan the selection window horizontally a few times and then step vertically to repeat the same through the next line. The vertical step size can be calculated from the ratio of the vertical dimensions of the display sizes of the server and client devices. The time interval for sweeping along a particular line can then be calculated from the total time duration required for displaying that key-frame and the vertical step size.
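A rough sketch of this sweep-and-step schedule is given below; it simplifies the step-size computation to one selection-window height per step and uses illustrative names, so it should be read as one possible realisation rather than the prescribed procedure.

def slide_pan_schedule(keyframe_h, window_h, keyframe_duration):
    # keyframe_h: vertical size of the textual key-frame; window_h: vertical size of the
    # selection window on the client (derived from the server/client display ratio)
    steps = max(1, int(keyframe_h // window_h))
    time_per_line = keyframe_duration / steps        # sweep time along each horizontal band
    # Each entry: (vertical position of the selection window, time spent sweeping that band)
    return [(i * window_h, time_per_line) for i in range(steps)]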
With reference to Fig. 4, the overall methodology for providing a content adaptive and legibility retentive display of a lecture video 401 on a miniature video device is illustrated. The total time duration of the original lecture video 401 is 16 minutes, which includes a talking head activity scene for 4 minutes, a writing hand activity scene for 8 minutes, again a talking head activity scene for 4 minutes, and associated audio 407. At a frame rate of 25 frames per second, the first talking head activity scene includes a total of 6000 non-textual frames, the writing hand activity scene comprises a total of 12000 textual frames and the second talking head activity scene includes a total of 6000 non-textual frames. A metadata indicating writing advancement in the textual frames is created using HPPs and VPPs of textual frames temporally spaced apart by a predefined time interval, i.e. two seconds. The non-textual key-frames 402, 405 and textual key-frames 403, 404 are extracted from the non-textual and textual frames respectively using known methods. Based on the metadata, audio 407 and key-frames 402-405, the lecture video is re-created on the miniature video device. In the re-created lecture video 406, the non-textual key-frames 402 and 405 are resized to fit onto the miniature device screen, whereas the textual key-frames 403 and 404 are panned with a selection window to display child frames 408 and 409, so as to provide the region of interest to the viewer with maximum resolution.
EXAMPLE:
A lecture video of time duration 55-60 minutes having a frame rate of 25 frames per second is taken, where the content portion of the lecture video comprises handwritten slides. The extraction of metadata is performed on the textual frames of the original video by computing the HPPs of every two textual frames temporally spaced apart by two seconds. The HPPs of two such textual frames are plotted on the same graph as illustrated in Fig. 5. On examining said HPP plots, it can be seen that they coincide locally through some initial rows (nearly up to m = 200), as they originate from those lines of text which are common to both frames. The plot of the local sum of the absolute difference of said HPPs is shown in Fig. 6. The point where considerable deviation starts is noted as the y-position for tracking. From the plot, said point is at m = 222. The VPP based x-position detection method is illustrated in Figs. 7 and 8.
Figs. 7(a) and (b) respectively illustrate the VPPs of the cropped horizontal strip regions around the detected y-position of the two frames. The plot of the local sum of the absolute difference of these VPPs is shown in Fig. 8. The region where a high overshoot occurs essentially contains the newly written text. Hence the point (n = 400) corresponding to the maximum of the curve in Fig. 8 is selected as the x-position of the track point. Hence, for this pair of processed frames, the tracked point is (222, 400). This procedure is repeated for all content textual frames at intervals of 50 frames.
Although the invention has been described with reference to a specific embodiment, this description is not meant to be construed in a limiting sense. Various modifications of the disclosed embodiment, as well as alternate embodiments of the invention, will become apparent to persons skilled in the art upon reference to the description of the invention. It is therefore contemplated that such modifications can be made without departing from the scope of the invention as defined in the appended claims.

Claims:
1. A method for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device, the lecture video comprising a sequence of textual and nontextual frames along with associated audio, the method comprising the steps of:
creating a metadata that indicates location of newly added data points in textual frames temporally spaced apart by a predefined time interval by computing horizontal and vertical projection profiles of ink pixels in said textual frames and detecting x-y positions of newly added data points thereof; and
sequentially displaying key-frames extracted from the textual and non-textual frames in accordance with the metadata by panning textual key-frames with a selection window having an aspect ratio and size in accordance with a display screen of the miniature video device and a center point as x-y position of newly added data point in the respective textual frame.
2. The method as claimed in claim 1, wherein detecting the y-position of newly added data point in a current textual frame comprises comparing Horizontal Projection Profile (HPP) of ink pixels of the current textual frame with that of a previous textual frame, and setting the y-position as the point where the amount of differential ink pixels in the HPPs reaches a threshold value.
3. The method as claimed in claim 2, wherein detecting the x-position of the newly added data point in the current textual frame comprises comparing Vertical Projection Profiles (VPPs) of ink pixels contained in regions around the detected y-position of the current and previous textual frames and setting the x-position as the point where the amount of differential ink pixels in the VPPs is maximum.
4. The method as claimed in claim 1, which comprises detecting the location of newly added data point in an intermediate frame occurring in time period between two textual frames temporally spaced apart by the predefined time interval by interpolating x-y positions of newly added data points of said two textual frames.
5. The method as claimed in claim 1, wherein the size of selection window is equal to the size of display screen of the miniature video device.
6. The method as claimed in claim 5, wherein the size of selection window is varied based on the difference between locations of newly added data points of temporally adjacent textual frames.
7. The method as claimed in claim 1, wherein, when the lecture video has a frame rate of 25 frames per second and a sub-sampling rate as 50, the HPP and VPP of textual frames spaced apart by 2 seconds are computed, leading to detection of newly added data points at regular interval of 50 frames.
8. A system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device, the lecture video comprising a sequence of textual and nontextual frames along with associated audio, the system comprising:
a metadata creation module configured to
create a metadata that indicates location of newly added data points in textual frames temporally spaced apart by a predefined time interval by computing horizontal and vertical projection profiles of ink pixels in said textual frames and detecting x-y positions of newly added data points thereof; and
a media re-creation module configured to
receive the metadata, associated audio, and key-frames extracted from textual and non-textual frames; and
sequentially display the key-frames in accordance with the metadata by panning the textual key-frames with a selection window having an aspect ratio and size in accordance with the miniature video device and a center point as x-y position of newly added data point of the respective textual frame.
9. The system as claimed in claim 8, wherein the metadata creation module is configured to detect the y-position of a newly added data point in a current textual frame by comparing the Horizontal Projection Profile (HPP) of ink pixels of the current textual frame with that of a previous textual frame, and setting the y-position as the point where the amount of differential ink pixels in the HPPs reaches a threshold value; and
detect the x-position of the newly added data point in the current textual frame by comparing Vertical Projection Profiles (VPPs) of ink pixels contained in the regions around the detected y-position of the current and previous textual frames and setting the x-position as the point where the amount of differential ink pixels in the VPPs is maximum.
10. The system as claimed in claim 8, wherein the media re-creation module is configured to vary the size of selection window based on the difference between locations of newly added data points of temporally adjacent textual frames.
11. The system as claimed in claim 8, which comprises a metadata interpolation module configured to interpolate the location of a newly added data point of an intermediate frame occurring in the time period between two textual frames temporally spaced apart by a predefined time interval using the locations of newly added data points in said two textual frames.
12. The system as claimed in claim 8, wherein the media recreation module is configured to provide a manual over-riding control to the viewer for selecting their region of interest in the lecture video and displaying said region with full or appropriate resolution.
13. The system as claimed in claim 8, wherein the media recreation module is configured to receive the metadata, associated audio and key-frames from a Server device as streaming media.
PCT/IN2011/000597 2010-09-06 2011-09-02 A method and system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device WO2012032537A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN2474/MUM/2010 2010-09-06
IN2474MU2010 2010-09-06

Publications (2)

Publication Number Publication Date
WO2012032537A2 true WO2012032537A2 (en) 2012-03-15
WO2012032537A3 WO2012032537A3 (en) 2012-06-21

Family

ID=45811027

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2011/000597 WO2012032537A2 (en) 2010-09-06 2011-09-02 A method and system for providing a content adaptive and legibility retentive display of a lecture video on a miniature video device

Country Status (1)

Country Link
WO (1) WO2012032537A2 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6336124B1 (en) * 1998-10-01 2002-01-01 Bcl Computers, Inc. Conversion data representing a document to other formats for manipulation and display
US20040125877A1 (en) * 2000-07-17 2004-07-01 Shin-Fu Chang Method and system for indexing and content-based adaptive streaming of digital video content
US20040205513A1 (en) * 2002-06-21 2004-10-14 Jinlin Chen Web information presentation structure for web page authoring

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113316012A (en) * 2021-05-26 2021-08-27 深圳市沃特沃德信息有限公司 Audio and video frame synchronization method and device based on ink screen equipment and computer equipment
CN113316012B (en) * 2021-05-26 2022-03-11 深圳市沃特沃德信息有限公司 Audio and video frame synchronization method and device based on ink screen equipment and computer equipment
WO2022247014A1 (en) * 2021-05-26 2022-12-01 深圳市沃特沃德信息有限公司 Audio and video frame synchronization method and apparatus based on ink screen device, and computer device

Also Published As

Publication number Publication date
WO2012032537A3 (en) 2012-06-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 11823161; Country of ref document: EP; Kind code of ref document: A2)
NENP Non-entry into the national phase in: Ref country code: DE
122 Ep: pct application non-entry in european phase (Ref document number: 11823161; Country of ref document: EP; Kind code of ref document: A2)