WO2001063935A1

WO2001063935A1 - Video and audio coding

Info

Publication number: WO2001063935A1
Application number: PCT/GB2001/000127
Authority: WO
Inventors: Donald Martin Monro
Original assignee: M-Wave Limited
Priority date: 2000-02-24
Filing date: 2001-01-12
Publication date: 2001-08-30
Also published as: CA2407536A1; AU2535101A; GB0004423D0

Abstract

A method of encoding and compressing a video sequence, for transmission across a video channel, includes converting the original sequence of frames (O) to a compressed sequence (C) which includes intra-frames, predicted frames and interpolated frames. A sequence of residual images (R) is then determined by subtracting the original frames from the compressed frames. Finally, a spatial/temporal transform is applied to the residual image sequence, the output of which is transmitted across the video channel from the video encoder to a video decoder. A similar approach, using a temporal transform on the residuals, may be applied to audio compression.

Description

VIDEO AND AUDIO CODING

The present invention relates to a method of digitally encoding video sequences and audio signals.

The representation of a single image, and particularly of a video sequence, in digital format produces extremely large quantities of data. To reduce storage requirements and to allow the images to be transmitted through limited-bandwidth channels, the data must first be compressed by eliminating inherent redundancies within it.

Video compression is divided into two basic categories. When individual frames are compressed without reference to any other frames, the compression is described as "intra-coded". One of the advantages of intra-coded video is that there is no restriction on the editing which can be carried out on the image sequence. As a result, most digital video in the broadcasting industry is stored in this way. The intra-coding approach can be used in association with any of a large number of still image compression techniques such as, for example, the industry standard JPEG compression scheme. This approach is taken by the moving JPEG standard for video compression: JPEG compression is used for each of the individual frames, with each of the frames being handled independently and without reference to any other frame.

Video sequences are not, however, typically composed of a collection of entirely unrelated images, and greater compression can normally be obtained by taking account of the temporal redundancy in the video sequence. This involves a process known as inter-coded compression. With this approach, individual images in the output sequence may be defined with reference to changes that have occurred between that image and a previous image. Since the compressed data stream (sent across the video channel for reconstruction by the decoder) typically represents information taken from several frames at once, editing on the compressed data stream is not normally carried out because the quality is severely compromised. Inter-coded compression is one of the compression techniques that is incorporated into the MPEG video compression standard.

A typical inter-coded compression scheme is shown schematically in Figure 1. In that Figure, the upper row O represents the original digitised video frames that are to be compressed, the second row C represents the compressed images, and the bottom row R the residuals.

In the scheme shown, selected original frames S are treated as still images, and are compressed by any convenient method to produce intra-frames I. These frames are then used as reference frames to create predicted frames P. The contents of these frames are projected from one or more I frames - either forwards or backwards in the sequence. This is normally achieved by the use of motion vectors, associated with moving blocks within the image. Alternatively, the movement of specific physical objects within the image may be determined and predicted. Finally, the C sequence is completed by generating interpolated frames B between the P and I frames. The original video sequence can then be approximated by the sequential frames of the C sequence, namely the I, B and P frames.

In practice, further corrections normally have to be made if the end result is to appear reasonable. These further corrections are achieved by determining a residual frame R corresponding, in each case, to the difference between the original frame and the corresponding compressed frame. Residual frames may, but need not, be calculated for the intra frames. Accordingly, the residual frames marked X may sometimes be omitted. In a practical embodiment, an encoder calculates the I frames from those original frames labelled S in the diagram and, from that, calculates the motion parameters (vectors) that are needed to define the P frames. The datastream transmitted from the encoder to the decoder thus includes the encoded I frames and the appropriate motion vectors enabling the decoder to construct the P frames. Information on the B frames is not sent, since those can be reconstructed by the decoder alone purely on the basis of the information within the I and P frames.

In order to improve the final result, the datastream also includes the residual images, sent on a frame by frame basis. Since the residual image represents the difference between the original image and the compressed image, the encoder needs to have access to the sequence of compressed images. That is achieved by incorporating an additional decoder within the encoder.

The final datastream, as sent, therefore includes the full I frames, the motion vectors for the P frames and all of the residual frames, possibly excluding those that are labelled X in Figure 1. Each residual image is typically compressed before transmission.
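The frame-by-frame residual loop of this prior-art scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the MPEG algorithm: frames are flat lists of pixel values, and the hypothetical functions `toy_encode` and `toy_decode` stand in for the real compression by simple uniform quantisation. The point shown is that the encoder runs a decoder locally, so that each residual is measured against what the receiver will actually reconstruct.

```python
def toy_encode(frame, step):
    """Stand-in for real compression: map each pixel to a quantiser index."""
    return [round(p / step) for p in frame]


def toy_decode(indices, step):
    """Stand-in for the decoder: rebuild pixel values from the indices."""
    return [i * step for i in indices]


def prior_art_residuals(originals, step):
    """Return one residual frame per original frame (R = O - C)."""
    residuals = []
    for original in originals:
        # The encoder embeds a decoder to obtain the compressed frame C.
        compressed = toy_decode(toy_encode(original, step), step)
        residuals.append([o - c for o, c in zip(original, compressed)])
    return residuals


originals = [[10, 22, 33], [11, 21, 35]]
residuals = prior_art_residuals(originals, step=8)
print(residuals)
```

In this sketch each residual frame would then be compressed and sent individually, which is the behaviour the invention sets out to replace.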

One difficulty with this approach is that it can be difficult to control the bit rate within the datastream. It is a first object of the present invention to provide a method of encoding which allows greater bit rate control. It is a further object to provide a method which allows further improvements in compression ratios without noticeable degradation of the output.

According to the present invention there is provided a method of encoding and compressing a time-varying sequence of digital samples comprising:

(a) deriving from the said samples a corresponding sequence of compressed samples;

(b) deriving from corresponding digital samples and compressed samples a sequence of residual samples; and

(c) taking a group of residual samples and applying a temporal transform to the samples of the group to produce compressed data representative of the group.
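Steps (a) to (c) can be sketched as a small pipeline. This is an illustrative sketch only: the uniform quantiser used for step (a) and the one-level Haar transform used for step (c) are assumptions chosen for brevity, and all function names are hypothetical. Frames are flat lists of values and the group is assumed to contain an even number of frames.

```python
def compress(frame, step):
    """Step (a): toy lossy compression by uniform quantisation (assumed)."""
    return [step * round(p / step) for p in frame]


def residual(original, compressed):
    """Step (b): residual sample as the difference original - compressed."""
    return [o - c for o, c in zip(original, compressed)]


def temporal_haar(group):
    """Step (c): one-level Haar transform along time over a group of frames.

    Each pair of consecutive residual frames becomes an average frame and a
    difference frame. The group length is assumed to be even.
    """
    lows, highs = [], []
    for a, b in zip(group[0::2], group[1::2]):
        lows.append([(x + y) / 2 for x, y in zip(a, b)])
        highs.append([(x - y) / 2 for x, y in zip(a, b)])
    return lows + highs


def inverse_temporal_haar(coeffs):
    """Undo temporal_haar: rebuild the residual frames of the group."""
    half = len(coeffs) // 2
    group = []
    for low, high in zip(coeffs[:half], coeffs[half:]):
        group.append([l + h for l, h in zip(low, high)])
        group.append([l - h for l, h in zip(low, high)])
    return group


originals = [[10, 22, 33], [11, 23, 35], [13, 25, 37], [14, 26, 38]]
compressed = [compress(f, step=8) for f in originals]
residuals = [residual(o, c) for o, c in zip(originals, compressed)]
coeffs = temporal_haar(residuals)  # data representative of the whole group
```

The output `coeffs` describes the group as a whole; no residual frame is represented separately, and `inverse_temporal_haar(coeffs)` recovers the residual frames at the decoder.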

The method preferably includes creating an output datastream which includes, separately, the compressed data representative of the group. Also preferably, the output datastream does not include data which is separately representative of the individual residual samples: all that is required is data representative of the group. The datastream may, and indeed frequently will, include other information such as, for example, data representative of at least some reference compressed samples from within the sequence of compressed samples, and prediction information permitting the reconstruction of predicted samples between the reference samples. In embodiments relating to video coding, these may respectively relate to the intra-frames and the predicted frames.

There may, but need not, be a residual sample which corresponds to each reference sample. In the video coding case, there may, but need not, be a residual image which corresponds to an intra-frame.

The residual sample may be determined by differencing or otherwise comparing (eg by division) the corresponding digital samples and the compressed samples. The bit rate in the output stream may be adjusted by altering one or more parameters of the temporal transform which is applied to residual samples within the group. Greater flexibility for adjusting the bit rate may be provided by using more aggressive compression at the first stage in the procedure, namely the derivation of the compressed samples. High levels of compression at this stage push more information down to the residual samples. As these are dealt with on a grouped basis, this provides improved control over bit rate/distortion in the final output.

The group of residual samples to which the temporal transform is applied may comprise a block, for example a contiguous block. In one embodiment, the temporal transform is repeatedly applied to adjacent blocks. In that embodiment, the block length may either be fixed or variable and preferably is coterminous with the reference compressed samples (the intra-frames, in the video case).
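The idea of adjusting the bit rate through a parameter of the grouped transform can be sketched as follows. The text does not say which parameter is altered, so this example assumes a simple dead-zone threshold applied to the transform coefficients, and uses the number of non-zero coefficients as a crude proxy for bit rate. The function names and the one-level Haar transform are illustrative assumptions.

```python
def haar_pairs(block):
    """One-level temporal Haar over an even-length block of residual values."""
    lows = [(a + b) / 2 for a, b in zip(block[0::2], block[1::2])]
    highs = [(a - b) / 2 for a, b in zip(block[0::2], block[1::2])]
    return lows + highs


def dead_zone(coeffs, threshold):
    """Zero every coefficient whose magnitude is below the threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]


def nonzero_count(coeffs):
    """Crude proxy for bit rate: the number of coefficients left to code."""
    return sum(1 for c in coeffs if c != 0)


# Residual values at one pixel position across a block of eight frames.
block = [3, 3, 2, 4, -1, -1, 0, 2]
coeffs = haar_pairs(block)
counts = {}
for threshold in (0.0, 1.5, 3.5):
    counts[threshold] = nonzero_count(dead_zone(coeffs, threshold))
print(counts)
```

Raising the threshold leaves fewer coefficients to code, at the cost of more distortion in the reconstructed residuals; the frame rate is unaffected.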

Alternatively, the group may be defined by a sliding transform, of limited extent, which is repeatedly applied as it moves along the sequence of residual samples. The sliding transform may operate within a window of fixed or varying size, or the transform basis may be some more complex selection of residual samples (eg weighted averages). The extent of the transform will normally be limited, but could in principle be of indefinite extent.
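A sliding window of limited extent can be sketched as below. The residuals are shown as single numbers for brevity; in the video case each entry would be a residual frame. The function names, the `step` parameter and the particular weights are hypothetical choices made for illustration.

```python
def sliding_windows(residuals, size, step=1):
    """Yield (start, window) pairs as the window moves along the sequence."""
    for start in range(0, len(residuals) - size + 1, step):
        yield start, residuals[start:start + size]


def weighted_average(window, weights):
    """One possible 'more complex selection': a weighted average of a window."""
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, window)) / total


residuals = [2, -2, 1, 3, -3, 3, 0, 1]
windows = list(sliding_windows(residuals, size=4, step=2))
smoothed = [weighted_average(w, [1, 2, 2, 1]) for _, w in windows]
```

With `step` smaller than `size` successive windows overlap, which distinguishes this arrangement from the adjacent-block case; a varying window size could be obtained by calling `sliding_windows` with different `size` values over different parts of the sequence.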

The temporal and the spatial transforms may be any convenient transform such as a discrete cosine transform or a wavelet transform. Other possibilities include the Lapped Orthogonal Transform (LOT), the Generalised Lapped Orthogonal Transform (GENLOT), the Lapped Biorthogonal Transform and the Matching Pursuits Transform. In its application to video compression, the temporal transform will normally be a spatial/temporal (spatio-temporal) transform applied to the residual images of the group. The same transform might be used both for the spatial and temporal parts. Alternatively, the transforms may be different.
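As one concrete instance of the transforms named above, the temporal part could be an orthonormal discrete cosine transform (DCT-II) taken along the time axis. The sketch below applies it to the residual values at a single pixel position across a group of frames; the function names are hypothetical and the direct summation is chosen for clarity rather than speed.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of values."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(
            v * math.cos(math.pi * (i + 0.5) * k / n)
            for i, v in enumerate(x)))
    return out


def idct(coeffs):
    """Inverse of dct (an orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(
            math.sqrt((1 if k == 0 else 2) / n) * c
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)))
    return out


# Residual values at one pixel across four frames, fully correlated in time.
residual_track = [3.0, 3.0, 3.0, 3.0]
coeffs = dct(residual_track)
```

For this fully correlated track all of the energy lands in the first coefficient, which is the property that makes a temporal transform attractive when neighbouring residual frames resemble one another.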

The invention also extends to a method of encoding and compressing a time-varying sequence of audio samples. The invention further extends to a digital coder, for example to a video coder and/or an audio coder.

The present invention, in its various embodiments, provides several advantages. When used with motion-prediction algorithms, the residual frames are likely to have some degree of correlation, so compression ratios can be improved.

More generally, taking a range of residues together, both in the video and in the audio case, will give the opportunity of improved control over the transmitted/saved bit rate/distortion. Further improvements in rate/distortion control could be achieved with the use of an embedded quantiser.

The invention may be carried into practice in a number of ways and one specific embodiment will now be described, by way of example, with reference to the accompanying Figures, in which:

Figure 1 illustrates a method of prior art video coding; and

Figure 2 illustrates the preferred way of dealing with the residual images in accordance with the present invention.

The preferred embodiment of the present invention will be described with reference both to Figure 1 and to Figure 2. The previous description of Figure 1, so far as it relates to the prior art, will be taken as read and emphasis will be given to those features which expand upon or which are different from the features of the prior art.

In the preferred embodiment, the original sequence of frames is compressed and the I, B and P frames computed by any convenient method. MPEG 2 may be used and/or JPEG compression used for computing the I frames. Alternatively, any other static compression scheme could be used for the I frames - for example wavelet transform. The original frames need not necessarily be "straight out of the camera": they could themselves have been compressed.

The residual frames are determined by subtracting or otherwise comparing (eg by division) the original frames from the compressed frames, or vice versa. Rather than dealing with the residual images one by one, however, as in the prior art, they are now dealt with as a group, as shown in Figure 2. A specific group of residual images 10 is selected, and a spatial/temporal transform is carried out in three dimensions (x, y, t) on that data. The result of that transform is a compressed version of the information within all of the frames that have been selected. That compressed data is then passed in the bit stream from the encoder to the decoder. There is now no need to send the compressed residual images, one by one, and the information in the bit stream is therefore the intra-frame data, the motion vectors needed to define the P frames, and the spatial/temporally-coded residuals.
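A three-dimensional (x, y, t) transform over a group of residual images can be sketched as a separable transform: a one-dimensional transform is applied along x, then along y, then along t. The one-level Haar step used here is an assumption chosen for brevity, the function names are hypothetical, and every dimension of the group is assumed to be even. The group is held as nested lists indexed `[t][y][x]`.

```python
def haar_1d(v):
    """One-level Haar: averages in the first half, differences in the second."""
    lows = [(a + b) / 2 for a, b in zip(v[0::2], v[1::2])]
    highs = [(a - b) / 2 for a, b in zip(v[0::2], v[1::2])]
    return lows + highs


def transform_3d(group):
    """Apply haar_1d along x, then y, then t of a group indexed [t][y][x]."""
    nt, ny, nx = len(group), len(group[0]), len(group[0][0])
    # Along x: transform every row of every frame (builds new lists).
    g = [[haar_1d(row) for row in frame] for frame in group]
    # Along y: transform every column of every frame.
    for t in range(nt):
        for x in range(nx):
            col = haar_1d([g[t][y][x] for y in range(ny)])
            for y in range(ny):
                g[t][y][x] = col[y]
    # Along t: transform the track of every pixel position through time.
    for y in range(ny):
        for x in range(nx):
            track = haar_1d([g[t][y][x] for t in range(nt)])
            for t in range(nt):
                g[t][y][x] = track[t]
    return g


# A group of two 2x2 residual frames that are identical in space and time.
group = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
coeffs = transform_3d(group)
```

For this uniform group only a single coefficient, `coeffs[0][0][0]`, is non-zero, so the whole group is described by one value rather than by two separately coded residual frames.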

The residuals may be taken block by block, in other words one group of frames may be selected and encoded, then an adjacent group and so on.

Preferably, the block start and end frames are determined by the intra frames, although that is not essential. Alternatively, and preferably, the transform may be applied within a sliding "window" which moves along the residual image sequence as shown by the arrow 12. The window may, but need not, be of a fixed length. The transform could be any convenient three-dimensional compression transform, for example a wavelet transform or a Discrete Cosine Transform (DCT). Different transforms may be used for the spatial and temporal parts.
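The block-by-block option, with block boundaries set by the intra frames, can be sketched as follows. The sketch assumes that each block starts at an intra frame and runs up to, but not including, the next one, and that the first frame of the sequence is an intra frame; the text does not fix these details, and the function name is hypothetical.

```python
def blocks_between_intra_frames(num_frames, intra_positions):
    """Return (start, end) index pairs, end exclusive, one per block.

    Each block starts at an intra frame and runs up to the next one, so the
    block length varies with the spacing of the intra frames.
    """
    starts = sorted(intra_positions)
    ends = starts[1:] + [num_frames]
    return [(s, e) for s, e in zip(starts, ends) if e > s]


# Ten residual frames with intra frames at positions 0, 4 and 8.
blocks = blocks_between_intra_frames(10, [0, 4, 8])
print(blocks)
```

Each `(start, end)` pair selects the residual frames that would be handed to the spatial/temporal transform as one group.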

In the most preferred embodiment, a wavelet transform is used to obtain the intra-frames from the original sequence, with block based or object based motion prediction being used to calculate the P frames. The B frames are then determined using any standard interpolation technique. Finally, a sliding window spatial/temporal transform is applied to the residual data, with both the spatial and the temporal parts of the transform being a wavelet transform.

One significant advantage of the invention is the improved user control it provides over the bit rate which is being transmitted across the video channel. By deliberately selecting a high level of spatial compression when constructing the intra-frames, more video information will be pushed down into the residuals. The bit rate which relates to the residuals can be controlled according to the level of compression which is applied by the spatial/temporal transform. It should be noted that, in contrast to the prior art, this allows substantial control of the bit rate while keeping the frame rate constant.
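The effect of a more aggressive first stage can be illustrated with a toy example. Here the first-stage compression is assumed to be uniform quantisation with a step size, which is not the wavelet coder of the preferred embodiment, and the function names are hypothetical. A coarser step leaves more energy in the residual, which the grouped transform then has to carry.

```python
def compress(frame, step):
    """Toy first-stage compression: uniform quantisation with the given step."""
    return [step * round(p / step) for p in frame]


def residual_energy(frame, step):
    """Sum of squared differences between the frame and its compressed form."""
    return sum((o - c) ** 2 for o, c in zip(frame, compress(frame, step)))


frame = [10, 22, 33, 47, 58, 61, 75, 90]
energies = {step: residual_energy(frame, step) for step in (2, 8, 32)}
print(energies)
```

In this example the residual energy grows as the step is made coarser. The growth need not be strictly monotonic for every input, but the trend is what shifts the rate/distortion trade-off towards the residual stage, where it can be controlled on a grouped basis.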

It will be appreciated that the same general principles can be applied to audio as well as to video compression. In place of a sequence of video images, one simply has an audio signal which is digitised at regular intervals into a sequence of audio samples or "frames". These samples are compressed using any suitable audio compression algorithm such as a wavelet transform or any of the algorithms which are incorporated into the audio coding portions of the MPEG standard. The residuals are calculated, as before, by subtracting the original sequence from the compressed sequence, or vice versa, and a final compression transform is applied either to sequential blocks of residuals or, preferably, to adjacent residuals which fall within the compass of a sliding window. As described above, any suitable compression algorithm could be used, including wavelet transforms and DCTs.
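The audio case can be sketched in one dimension. As before, this is a sketch under stated assumptions: uniform quantisation stands in for the audio compression algorithm, a full Haar decomposition stands in for the final transform, the window length is assumed to be a power of two, and the names `haar_multilevel`, `encode_audio_residuals`, `window` and `hop` are hypothetical.

```python
def haar_multilevel(v):
    """Full Haar decomposition of a list whose length is a power of two."""
    out = list(v)
    n = len(out)
    while n > 1:
        lows = [(out[i] + out[i + 1]) / 2 for i in range(0, n, 2)]
        highs = [(out[i] - out[i + 1]) / 2 for i in range(0, n, 2)]
        out[:n] = lows + highs
        n //= 2
    return out


def encode_audio_residuals(samples, step, window, hop):
    """Quantise samples, form residuals, then transform each window of them."""
    compressed = [step * round(s / step) for s in samples]
    residuals = [s - c for s, c in zip(samples, compressed)]
    transformed = []
    for start in range(0, len(residuals) - window + 1, hop):
        transformed.append(haar_multilevel(residuals[start:start + window]))
    return compressed, residuals, transformed


samples = [5, 6, 7, 9, 12, 14, 13, 11, 8, 6, 5, 5]
compressed, residuals, transformed = encode_audio_residuals(
    samples, step=4, window=4, hop=4)
```

Setting `hop` equal to `window` gives sequential blocks of residuals, while a smaller `hop` gives overlapping windows that slide along the residual sequence.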

Claims

CLAIMS:

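1. A method of encoding and compressing a time-varying sequence of digital samples comprising: (a) deriving from the said samples a corresponding sequence of compressed samples;

(b) deriving from corresponding digital samples and compressed samples a sequence of residual samples; and

(c) taking a group of residual samples and applying a temporal transform to the samples of the group to produce compressed data representative of the group.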

2. A method as claimed in claim 1 including creating an output datastream, the datastream including the compressed data representative of the group.

3. A method as claimed in claim 2 in which the datastream further includes data representative of at least some reference compressed samples from within the sequence of compressed samples.

4. A method as claimed in claim 3 in which the datastream further includes prediction information permitting the reconstruction of predicted samples between the reference samples.

5. A method as claimed in claims 2 to 4 in which the datastream does not include data representative of individual residual samples.

6. A method as claimed in any one of the preceding claims in which the residual samples are determined by differencing the corresponding digital samples and compressed samples.

7. A method as claimed in any one of claims 2 to 6 including adjusting the bit rate of the output datastream by adjusting a parameter of a quantizer which operates on the compressed data.

8. A method as claimed in any of claims 2 to 6 in which the group of residual samples comprises a block of samples.

9. A method as claimed in claim 8 in which the temporal transform is repeatedly applied to adjacent blocks.

10. A method as claimed in any one of claims 1 to 7 in which the group is defined by a sliding transform, of limited extent, which is repeatedly applied as it moves along the sequence of residual samples.

11. A method as claimed in any one of claims 1 to 7 in which the group is defined by a sliding window which moves across the sequence of residual samples, the window defining the temporal extent of the transform.

12. A method as claimed in any one of claims 8 to 11 in which the temporal transform is a discrete cosine transform.

13. A method as claimed in any one of claims 8 to 11 in which the temporal transform is a wavelet transform.

14. A method as claimed in any one of claims 1 to 11 in which the sequence of digital samples comprises a sequence of video frames.

15. A method as claimed in claim 14 in which the compressed samples comprise compressed images and the residual samples comprise residual images.

16. A method as claimed in claims 14 or 15 in which the compressed images are derived by applying a wavelet transform to the digital images.

17. A method as claimed in claims 14 or 15 in which the compressed images are derived by applying a discrete cosine transform to the digital images.

18. A method as claimed in any one of claims 14 to 17 in which the temporal transform comprises a spatial/temporal transform applied to the residual images of the group.

19. A method as claimed in claim 18 in which the spatial/temporal transform includes a wavelet transform.

20. A method as claimed in claim 18 in which the spatial/temporal transform includes a discrete cosine transform.

21. A method as claimed in claim 18 in which the spatial/temporal transform has a spatial part which is a wavelet transform and a temporal part which is also a wavelet transform.

22. A method as claimed in claim 18 in which the spatial/temporal transform has a spatial part which is a DCT and a temporal part which is also a DCT.

23. A method as claimed in any one of claims 1 to 13 in which the sequence of digital samples comprises a sequence of audio samples.

24. A digital coder for encoding and compressing a time-varying sequence of digital samples, the coder comprising: (a) means for deriving from the said samples a corresponding sequence of compressed samples;

(b) means for deriving from corresponding digital samples and compressed samples a sequence of residual samples; and

(c) means for taking a group of residual samples and applying a temporal transform to the samples of the group to produce compressed data representative of the group.

25. A digital coder as claimed in claim 24 including output means for outputting a datastream, the datastream including the compressed data representative of the group.

26. A digital coder as claimed in claim 24 comprising a video coder.

27. A digital coder as claimed in claim 24 comprising an audio coder.