WO2005117431A1 - Method for synchronising video and audio data

Method for synchronising video and audio data

Info

Publication number
WO2005117431A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
stream
audio
frame
time
Prior art date
Application number
PCT/AU2005/000747
Other languages
French (fr)
Inventor
Martin Samuel Lipka
Original Assignee
Vividas Technologies Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2004902811A external-priority patent/AU2004902811A0/en
Application filed by Vividas Technologies Pty Ltd filed Critical Vividas Technologies Pty Ltd
Publication of WO2005117431A1 publication Critical patent/WO2005117431A1/en

Classifications

    • H04N21/2368 Multiplexing of audio and video streams
    • H04N21/43072 Synchronising the rendering of multiple content streams or additional data on the same device
    • H04N21/4143 Specialised client platforms embedded in a Personal Computer [PC]
    • H04N21/4305 Synchronising client clock from received content stream, e.g. locking decoder clock with encoder clock, extraction of the PCR packets
    • H04N21/4341 Demultiplexing of audio and video streams
    • H04N21/426 Internal components of the client; Characteristics thereof
    • H04N5/04 Synchronising (details of television systems)
    • H04N5/602 Receiver circuitry for the sound signals, for digital sound signals
    • H04N9/8042 Transformation of the colour television signal for recording involving pulse code modulation of the picture signal components, with data reduction


Abstract

The present invention involves a method and computer software product for playing a multimedia digital data stream comprising audio data and video data, the latter displayed to a user in a sequence of frames, in order to provide synchronisation between the streams. The method comprises the steps of calculating the audio time in accordance with the time elapsed since the start of the audio data stream, determining at a certain point in time the offset of the video stream from the audio stream, adjusting, if an offset is detected, the frame delivery rate by a prescribed amount, and repeating the above steps at successive points in time, to constantly adjust the frame delivery rate by no more than a maximum amount for each successive frame, so as to constantly trim the video stream display to enhance synchronisation with the audio stream.

Description

Method for synchronising video and audio data
Field of the invention
The present invention concerns a method for synchronising audio and video media streams, in particular to provide to a user the experience of seamless audio playback and smoothly synchronised video playback.
In broad terms, the invention relates to the field of data processing, for processing a stream of data including audio and video data comprised in a sequence of frames. A new architecture and method of operation is described below.
Background of the invention
In order to preserve synchronisation between audio and video data, it is necessary to adjust the transfer rate of the stream of data, so that a specified video presentation time is synchronised with a reference time, such as the correct moment in time of the associated audio stream. The data stream is organised in frames of data fed through a processing device, and a processing unit within the processing device is provided with means for determining the synchronisation.
The MPEG standard, from the Moving Picture Experts Group (MPEG), is a well-established standard for audio and video compression and decompression algorithms, for use in the digital transmission and receipt of audio and video broadcasts. It provides for the efficient compression of data according to an established psychoacoustic model, to enable real-time transmission, decompression and broadcast of high-quality sound and video images. Other standards have also been established for the encoding and decoding of audio and video data transmitted in digital format, such as data for digital television systems.
Compression standards are based on psycho-acoustics of human perception. Generally, video and audio need to match to an accuracy of not much worse than 1/20 of a second in order to be acceptable for the viewer. Accuracy worse than 1/10 of a second is usually noticeable by the viewer, and accuracy of worse than 1/5 of a second is almost always noticeable.
Maintaining synchronisation between video and audio data is a straightforward matter if the streams are integrated and played using a single video/audio source. This is not the case for digital video, as the audio data and the video data are separated and independently decoded, processed, and played. Furthermore, computer users may need to view digital video while performing some other task or function within the computer, such as sending or receiving information over a computer network. This is quite possible in a multitasking computing environment, but can introduce significant multimedia synchronisation problems between the audio and the video data.
The use of compression techniques such as MPEG requires the multimedia data to be decoded before it can be played, which is often a very computer-intensive task, particularly with respect to the video data. In addition, competing processes may steal away processing cycles of the central processor, which dynamically affects the apparent processing power of the machine. This has the result that the ability to read, decode, process, and play the multimedia data will vary during processing, which can affect the ability to synchronously present the multimedia data to the user. The prior art has developed a number of ways to tackle this problem. One simple solution is to alter the speed of the audio data to match that of the video data. However, audio hardware does not generally support simple alterations in the audio rate, and in any case varying the audio rate produces a result generally unpleasant to the viewer, such as wavering alterations in pitch, deterioration in speech, etc. For this reason, the audio data is generally taken as providing the standard of player time, and the video is made to keep pace with it.
A further approach is simply to increase the performance level of the hardware, to ensure that the intensive computing requirements are met, and synchronisation of the audio and video can therefore be maintained. However, in applications of multimedia streaming to client browsers, the system has no control over the processing power (or over the simultaneous competing needs) of individual machines. It is therefore important that the synchronisation processes are as performance-tolerant as possible.
Other solutions of the prior art have included use of inferior decoding methods, and the dropping of frames of video data to maintain synchronisation with the audio data. However, in terms of viewer experience, these techniques are very much compromises. Using an inferior decoding method typically results in a blurred or blocky image, whilst merely dropping frames produces a result that is typically jerky in appearance.
United States Patent No. 6,310,652 to Li et al. discusses a synchronisation method in which a 'presentation time' of data frames is continuously compared with a 'reference time' calculated by the playing device. Subsequent frames, or portions thereof, are then either dropped or repeated depending on whether the presentation time is earlier or later than the calculated reference time. This solution is less than ideal, in that it not only requires specialised hardware to calculate the reference time, but also involves dropping or repeating both audio and video frames, resulting in an unsatisfactory user experience.
United States Patent No. 6,272,776 to Griffiths discusses playing the video data ahead of the corresponding audio data in order to maintain synchronisation. The 'initial due time' of the video data is first determined, which is typically the time-stamped initial start time for the video and audio data indicating when the video and audio data should be played. An 'offset time' is then applied to the video due time, which adjusts when the video data should be played relative to the corresponding audio data and produces an adjusted video due time earlier than the initial video due time. The particular value of the offset - and hence the amount of time by which the video data is played ahead of the audio data - may be varied depending on how early or late a frame of video data is relative to the corresponding audio data. Variations in the offset may also be made to account for an increase in available processing power, which allows a smaller offset to be applied. The method is said to be advantageous in that it allows video to be played ahead of the audio in order to 'build in a margin' for any future late frames while degrading the video as little as possible.
However, irrespective of whether an offset between the video and audio data is applied, the method attempts to jump to the exact point of synchronisation between the audio and video data upon each detection of an early or late video frame. Typically, this results in a blurred or jerky image, in a similar manner to when video frames are dropped or paused, in order to achieve synchronisation.
It is also important that sufficient processor time is devoted to the audio decode and play process to avoid intrusive and undesirable breaks (pops and silences) in the sound stream.
There therefore remains a need for a system for maintaining or improving the synchronisation between audio and video data which degrades the presented video as little as possible, which avoids breaks in the audio, which minimises the need for dropped video frames, and which adapts to the apparent processing power of the system without producing jerky video.
Summary of the invention
In accordance with the invention, there is provided a method for playing a multimedia digital data stream comprising audio data and video data, the latter displayed to a user in a sequence of frames, in order to provide synchronisation between the streams, comprising the steps of: calculating the audio time in accordance with the time elapsed since the start of the audio data stream; determining at a certain point in time the offset of the video stream from the audio stream; adjusting, if an offset is detected, the frame delivery rate by a prescribed amount; repeating the above steps at successive points in time, to constantly adjust the frame delivery rate by no more than a prescribed amount for each successive frame, so as to constantly trim the video stream display to enhance synchronisation with the audio stream. In summary, then, the audio is synchronised to the system clock, and that synchronisation produces an offset. This offset is relative to the point in time when the media started playing. It should be noted that there is considerable jitter in the audio synchronisation offset, dependent on the hardware, the load on the computer at the time, and many other factors. Video frames are displayed at a certain, adjustable rate, that rate being trimmed (adjusted by a small amount) in accordance with the apparent time difference between the audio and video. This might be termed 'loose synchronisation', as opposed to so-called 'dead beat' control: instead of being brought abruptly into accurate synchronisation, the video frame delivery is slowly slewed in accordance with the detected time difference between the audio and video times. The result is smooth video delivery, with synchronisation achieved in a manner largely undetectable to the viewer.
According to a second aspect of the present invention there is provided a computer software product for playing a multimedia digital data stream comprising audio data and video data, the latter displayed to a user in a sequence of frames, in order to provide synchronisation between the streams, comprising computer program code, which when executed: calculates the audio time in accordance with the time elapsed since the start of the audio data stream; determines at a certain point in time the offset of the video stream from the audio stream; adjusts, if an offset is detected, the frame delivery rate by a prescribed amount; and repeats the above steps at successive points in time, to constantly adjust the frame delivery rate by no more than a maximum amount for each successive frame, so as to constantly trim the video stream display to enhance synchronisation with the audio stream.
Brief Description of the Drawing
A preferred embodiment of the present invention will now be described by reference to the accompanying drawing (Figure 1), a diagrammatic illustration of the method of the present invention.
Detailed Description of the Drawing
The present invention may be practised on any suitable computing device, with the necessary hardware and software resources for decoding and playing digital audio and video data streams. Such devices include personal computers (PCs), hand-held devices, multiprocessor systems, mobile telephone handsets, DVD players and terrestrial, satellite or cable digital television set-top boxes. The data to be played may be provided as streamed data, or may be stored for playback in any suitable form.
The audio playback is synchronised to the system clock of the particular device, and this is the only variable that is considered an absolute reference for the purposes of the technique of the invention. The system clock measures time in milliseconds. When the audio stream is started, the system clock time is recorded. A calculation is then performed to determine how much audio time has elapsed.
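A minimal sketch of that calculation follows, assuming a monotonic millisecond clock stands in for the device's system clock; the class and method names are illustrative and not taken from the patent.

```python
import time

class AudioClock:
    """Illustrative sketch: the system clock reading (here a monotonic
    clock in milliseconds) is recorded when the audio stream starts, and
    the elapsed audio time is derived from it."""

    def __init__(self):
        self.start_ms = None  # system clock reading at audio stream start

    def start(self):
        # Record the system clock time when the audio stream is started.
        self.start_ms = time.monotonic_ns() // 1_000_000

    def elapsed_ms(self):
        # How much audio time has elapsed since the stream started.
        if self.start_ms is None:
            raise RuntimeError("audio stream not started")
        return time.monotonic_ns() // 1_000_000 - self.start_ms
```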
Some media playback devices, such as those implemented on the Mac OS, provide this information directly in the form of an actual time value. Other devices, however, such as those utilising the DirectX Application Programming Interfaces, only provide the position of a playback pointer in an audio buffer, rather than an actual elapsed-time value. Moreover, because 'ring buffers' are often used, it is necessary to keep track of the number of buffers of data that have been used, along with the sample rate of the media, in order to calculate how much audio time has elapsed.
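Where only a playback pointer is exposed, the elapsed audio time reduces to sample counting, as sketched below; the parameter names are assumptions of this example and do not correspond to a real DirectX signature.

```python
def audio_time_elapsed_s(buffers_consumed, buffer_size_samples,
                         play_cursor_samples, sample_rate_hz):
    """Elapsed audio time from ring-buffer bookkeeping: full buffers
    already consumed, plus the play cursor's position within the current
    buffer, converted to seconds via the media sample rate."""
    samples_played = buffers_consumed * buffer_size_samples + play_cursor_samples
    return samples_played / sample_rate_hz

# For example, 3 consumed 4096-sample buffers and a cursor at sample 1024
# of 44.1 kHz audio: (3 * 4096 + 1024) / 44100 ~= 0.302 s of audio played.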
The audio time is considered as the 'lead', and the video attempts to loosely synchronise to this time. A timer event exists which prompts the video to display a frame. This prompting is oversampled, and is often ignored. In the current embodiment, it is set to 100 prompts per second, and therefore 3 in 4 prompts are ignored when displaying 25 frames-per-second media. This setting is arbitrary; the trade-off is the amount of CPU overhead used against smoothness of playback. The video time is calculated from the same base offset as the audio time: the actual video time is the frame number of the last displayed frame multiplied by the interval between frames (the reciprocal of the frame rate).
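The sketch below shows one way such an oversampled prompt loop could look, using the stated 100 prompts per second and 25 frames per second; `display_frame` is a hypothetical stand-in for the real renderer.

```python
import time

PROMPT_RATE_HZ = 100   # prompt frequency stated in the description
FRAME_RATE_FPS = 25    # media frame rate stated in the description

def display_frame(n):
    print(f"frame {n} displayed")  # placeholder for the actual renderer

def run(duration_s=0.5):
    frame_period = 1.0 / FRAME_RATE_FPS
    start = time.monotonic()
    next_due = start
    frames = 0
    # At 100 prompts/s and 25 fps, roughly 3 in 4 prompts are ignored.
    while time.monotonic() - start < duration_s:
        if time.monotonic() >= next_due:
            display_frame(frames)
            next_due += frame_period  # schedule by cadence to avoid drift
            frames += 1
        time.sleep(1.0 / PROMPT_RATE_HZ)  # wait for the next prompt

if __name__ == "__main__":
    run()
```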
When the period between frames has elapsed (this period is the reciprocal of the frames-per-second rate, e.g. 1/25 s) and the next video prompt occurs, the next frame is displayed. As will be recognised, this can introduce up to 10 milliseconds of jitter into the process. However, video refresh is already synchronised to the vertical blank (i.e. the refresh rate of the monitor), which is usually between around 43 and 120 Hz (nominally 72 Hz) and which therefore already gives a jitter of nominally 14 milliseconds, imperceptible to the human eye. The period between frames is adjusted, or 'trimmed', in accordance with the audio time, by a prescribed maximum amount. In the current embodiment and under normal conditions, this trimming is limited to a maximum of 2 milliseconds per frame. If the audio time appears to be 'in front' of the video time, then the period between frames is reduced, and vice versa. Oversampling the timer event that prompts the display of a video frame allows the system to account for the trimming, and in particular for the reduction of the time period between display of successive video frames.
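Expressed as code, the trim is a clamped adjustment of the inter-frame period. A sketch under the stated 2 ms limit follows; the sign convention (a positive offset meaning the audio is in front) is an assumption of this example.

```python
MAX_TRIM_S = 0.002  # at most 2 ms of trim per frame, per the description

def trimmed_period(nominal_period_s, audio_time_s, video_time_s):
    """Shrink the period between frames when the audio time is 'in front'
    of the video time, and stretch it when the audio is behind, but never
    by more than MAX_TRIM_S per frame."""
    offset = audio_time_s - video_time_s  # positive: audio ahead of video
    trim = max(-MAX_TRIM_S, min(MAX_TRIM_S, offset))
    return nominal_period_s - trim

# Audio 50 ms ahead of video at 25 fps: the next period is shortened by
# only 2 ms, so synchronisation is restored gradually over many frames.
assert abs(trimmed_period(1 / 25, 10.050, 10.000) - 0.038) < 1e-9
```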
In one embodiment, the determination of whether the audio time is 'in front' of or 'behind' the video time occurs at a frequency of around once per second. This is sufficiently frequent to afford a smooth and constant effective synchronisation between the audio and video streams.
If the audio and the video become excessively out of synchronism (in accordance with prescribed criteria; currently 200ms is considered excessively out of synchronism), the following considerations come into play. If the audio is excessively ahead of the video, one or more entire frames are omitted ('dropped'), to enable the video to catch up with the audio. As many are discarded as is required to catch up. If the video time is well ahead of the audio time, then the video is stalled until the audio catches up.
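A sketch of this coarse correction, using the 200 ms criterion from the text; the drop-count arithmetic is an assumption of how 'as many as required' might be computed.

```python
RESYNC_THRESHOLD_S = 0.200  # 200 ms counts as excessively out of synchronism

def hard_resync_action(audio_time_s, video_time_s, frame_period_s):
    """Coarse correction once loose trimming is no longer enough: drop
    whole frames when the audio is far ahead, stall the video when the
    video is far ahead of the audio."""
    offset = audio_time_s - video_time_s
    if offset > RESYNC_THRESHOLD_S:
        frames_to_drop = int(offset / frame_period_s)  # as many as needed
        return ("drop", frames_to_drop)
    if offset < -RESYNC_THRESHOLD_S:
        return ("stall", None)  # hold video until the audio catches up
    return ("trim", None)       # within bounds: normal loose trimming

# e.g. audio 300 ms ahead of 25 fps video -> ("drop", 7)
```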
The accompanying Figure 1 diagrammatically illustrates the method of the invention. The horizontal axis represents time t elapsed from commencement of the multimedia data playback, as measured by the system clock. The upper trace shows the audio data stream, the audio played time APT representing the synchronisation point being aimed for. The lower trace shows the video data stream, and the latest frame to be played LFP is shown in the figure as trailing the synchronisation point objective. The next frame is therefore scheduled for display at the 'apparent' time of 1/fps later, but with a +2 ms deviation, to trim it towards synchronisation with the audio data stream. If the video data stream is ahead of the audio data stream, then the next frame is scheduled for display 1/fps later, but with a -2 ms deviation. If the latest frame to be played occurred before a prescribed time interval ta before the synchronisation point (LFP < -ta), then one or more frames are omitted. If the latest frame to be played is timed to display after a prescribed time interval ta from the synchronisation point (LFP > ta), then the video is held for the audio to catch up.
Features of the invention:
1. Smooth video playback, without jerky adjustment to audio.
2. Smooth handling of jitter in the audio play information.
3. Synchronisation of the audio and video.
4. Handling of adverse situations with massive differences in stream position.
The word 'comprising' and forms of that word, as used in this description and in the claims, do not limit the invention claimed to exclude any variants or additions. Modifications and improvements to the invention will be readily apparent to those skilled in the art. Such modifications and improvements are intended to be within the scope of this invention.

Claims

1. A method for playing a multimedia digital data stream comprising audio data and video data, the latter displayed to a user in a sequence of frames, in order to provide synchronisation between the streams, comprising the steps of: calculating the audio time in accordance with the time elapsed since the start of the audio data stream; determining at a certain point in time the offset of the video stream from the audio stream; adjusting, if an offset is detected, the frame delivery rate by a prescribed amount; and repeating the above steps at successive points in time, to constantly adjust the frame delivery rate by no more than a maximum amount for each successive frame, so as to constantly trim the video stream display to enhance synchronisation with the audio stream.
2. A method according to claim 1 wherein the prescribed amount by which the frame delivery rate is adjusted is related to the magnitude of the offset of the video stream.
3. A method according to claim 1 wherein the step of adjusting the frame delivery rate comprises adjusting the interval between display of successive frames.
4. A method according to claim 1 including the further step of periodically prompting for the display of video frames at a higher rate than the intended frame display rate of the multimedia digital stream, and ignoring prompts for frame display occurring in the interval between display of successive frames, so as to account for the trimming of the frame delivery rate.
5. A method according to claim 4 wherein the video frame display prompt rate is around 100 times per second for a multimedia digital stream with an intended video frame display rate of approximately 25 frames per second.
6. A method according to any preceding claim, wherein the step of determining the offset of the video stream from the audio stream is carried out around once per second.
7. A method according to any preceding claim, including the further step of either pausing or dropping frames from the video stream in the event that the offset between the video stream and audio stream exceeds a predetermined maximum value, so as to restore synchronisation between the video and audio streams.
8. A method according to any preceding claim, wherein the predetermined maximum amount is approximately 2 milliseconds per frame.
9. A method according to any preceding claim, wherein the audio stream is synchronised with the system clock of the device upon which the multimedia digital data stream is played.
10. A computer software product for playing a multimedia digital data stream comprising audio data and video data, the latter displayed to a user in a sequence of frames, in order to provide synchronisation between the streams, comprising computer program code, which when executed: calculates the audio time in accordance with the time elapsed since the start of the audio data stream; determines at a certain point in time the offset of the video stream from the audio stream; adjusts, if an offset is detected, the frame delivery rate by a prescribed amount; and repeats the above steps at successive points in time, to constantly adjust the frame delivery rate by no more than a maximum amount for each successive frame, so as to constantly trim the video stream display to enhance synchronisation with the audio stream.
11. A computer software product according to claim 10 wherein the prescribed amount by which the frame delivery rate is adjusted is related to the magnitude of the offset of the video stream.
12. A computer software product according to claim 10 wherein adjusting the frame delivery rate comprises adjusting the interval between display of successive frames.
13. A computer software product according to claim 10, further including computer code, which when executed periodically prompts for the display of video frames at a higher rate than the intended frame display rate of the multimedia digital stream, and ignores prompts for frame display occurring in the interval between display of successive frames, so as to account for the trimming of the frame delivery rate.
14. A computer software product according to claim 13 wherein the video frame display prompt rate is around 100 times per second for a multimedia digital stream with an intended video frame display rate of approximately 25 frames per second.
15. A computer software product according to claim 10, wherein the step of determining the offset of the video stream from the audio stream is carried out around once per second.
16. A computer software product according to claim 10, further including computer program code, which when executed, either pauses or drops frames from the video stream in the event that the offset between the video stream and audio stream exceeds a predetermined maximum value, so as to restore synchronisation between the video and audio streams.
17. A computer software product according to claim 10 wherein the predetermined maximum amount is approximately 2 milliseconds per frame.
18. A computer software product according to claim 10 wherein the audio stream is synchronised with the system clock of the device upon which the multimedia digital data stream is played.
PCT/AU2005/000747 2004-05-26 2005-05-26 Method for synchronising video and audio data WO2005117431A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2004902811 2004-05-26
AU2004902811A AU2004902811A0 (en) 2004-05-26 Method for synchronising video and audio data

Publications (1)

Publication Number Publication Date
WO2005117431A1 true WO2005117431A1 (en) 2005-12-08

Family

ID=35451272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2005/000747 WO2005117431A1 (en) 2004-05-26 2005-05-26 Method for synchronising video and audio data

Country Status (1)

Country Link
WO (1) WO2005117431A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5617502A (en) * 1996-03-22 1997-04-01 Cirrus Logic, Inc. System and method synchronizing audio and video digital data signals during playback
US6452974B1 (en) * 1998-01-02 2002-09-17 Intel Corporation Synchronization of related audio and video streams
US6337883B1 (en) * 1998-06-10 2002-01-08 Nec Corporation Method and apparatus for synchronously reproducing audio data and video data
US20030058224A1 (en) * 2001-09-18 2003-03-27 Chikara Ushimaru Moving image playback apparatus, moving image playback method, and audio playback apparatus

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101119461B (en) * 2006-08-02 2010-05-12 广达电脑股份有限公司 System and method for maintaining video frame and audio frame synchronous broadcasting
US8126309B2 (en) * 2007-02-19 2012-02-28 Kabushiki Kaisha Toshiba Video playback apparatus and method
EP2076052A3 (en) * 2007-12-28 2009-07-29 Intel Corporation Synchronizing audio and video frames
US9571901B2 (en) 2007-12-28 2017-02-14 Intel Corporation Synchronizing audio and video frames
CN112637488A (en) * 2020-12-17 2021-04-09 深圳市普汇智联科技有限公司 Edge fusion method and device for audio and video synchronous playing system
CN112637488B (en) * 2020-12-17 2022-02-22 深圳市普汇智联科技有限公司 Edge fusion method and device for audio and video synchronous playing system
CN112714353A (en) * 2020-12-28 2021-04-27 杭州电子科技大学 Distributed synchronization method for multimedia stream
CN112714353B (en) * 2020-12-28 2022-08-30 杭州电子科技大学 Distributed synchronization method for multimedia stream


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 69(1) EPC - FORM EPO 1205A DATED 30-03-2007

122 Ep: pct application non-entry in european phase

Ref document number: 05742122

Country of ref document: EP

Kind code of ref document: A1