VIRTUAL BROADBAND TECHNOLOGY
This invention relates to virtual broadband technology (VBT) and in particular, to virtual broadband transmission of images, and specifically, of images carrying both vision and sound.
The long-held desire to be able to transmit broadband material readily into individual homes or businesses will be satisfied when fibre-optic cable is readily available to all users but, on present estimates, this will not occur until well into the 21st Century.
It has long been considered desirable to be able to transmit images and particularly those carrying video signals and sound, by way of Public Switched Telephone Network (PSTN) telephone lines, but there have been great difficulties in doing this because it has not been possible to carry the required bandwidth on such lines, it being possible only through traditional broadband services (ISDN and the like) and as mentioned above, fibre-optic cable.
For example, at the present time the highest reliably maintainable rate of transmission along PSTN telephone lines using analogue modems is considered to be 28.8 kilobits per second, although on some occasions, particularly through digital exchanges and using digital information, one can transmit along these lines at up to 57.6 kilobits per second. The difficulty with transmission speeds higher than those the supplier of the lines is prepared to warrant is that, if there is any difficulty in transmission, the transmission rate will automatically drop to the rate of which the network is capable; in the case of the Australian National AUSTEL system, 28.8 kilobits per second.
Thus, any transmission system for which it is essential that a higher transmission rate be used, will fail if the transmission rate cannot be achieved by the network.
The principal object of this invention is to provide means whereby video and/or audio and full motion video signals can be transmitted satisfactorily along PSTN telephone network lines, within the present operating capabilities of these lines.
The invention includes a method of conversion of video and/or combined video and audio signals to a form which can be readily transmitted along PSTN telephone lines, including the steps of using a combining process to effect a first reduction in the size of the signal and subsequently compressing the reduced signal.
In a first preferred aspect of the invention the reduction and compression provides a signal suitable for transmission on the PSTN network.
If the signal is basically data information, we find that we can use eight-bit transmission and, by a compression algorithm which reduces the information in the ratio of 1:200, the required output can be achieved.
If the signal to be transmitted is a signal such as a television picture or the like, then it is preferred that 16-bit technology be used, so as to provide greater colour information than would otherwise be the case. In such circumstances, we first pass the signal through a stage which effects a 1:50 compression and then effect a second re-compression which permits a change in the dynamics of the picture, obtaining a further reduction of 30% to 40% in the amount of material which needs to be transmitted.
In order that the invention may be more readily understood, we shall describe two specific embodiments of the invention.
For the invention to be applied, it is first necessary to capture and provide an initial conversion of video and/or combined video and audio signals to a digital form.
Normally the material is captured in analogue form by the use of a video camera, camcorder or satellite receiver, or can be obtained from already prepared material, such as from a laserdisc player.
It is known that this material can be digitised and saved and/or compressed as files onto magnetic media such as hard disks. At the present time, the most common digital video and audio file types follow Microsoft's Audio/Video Interleaved (AVI) or Moving Picture Experts Group (MPEG) conventions.
Once analogue video and audio is digitised with the use of well-balanced hardware/software capture techniques, CODECs and algorithms, it can be combined, stored, copied, edited, re-compressed, transmitted and decompressed to suit the bandwidth requirements of the respective medium and application. Moreover, the storage and archival medium can be any digital storage medium - from a mere floppy disk to a rewritable magneto-optical disk or the RAID hard disk subsystem of a Wide Area Network (WAN) server.
Once the video is stored in digital form, the possibilities for manipulation are endless. Digital video and audio sequences can effortlessly be edited, frame by frame, either at the server end or at the node workstation PC, using editing software such as Microsoft's Video for Windows or Adobe Premiere. Digital video and audio files such as AVI or MPEG files can be integrated into fully interactive presentations for greater impact in multimedia business presentations, interactive inquiry and purchasing information, or simply be edited and transmitted as live or repeatable movies and broadcasts on demand over a client/server type of arrangement.
The technology incorporates a high-performance real-time full-motion video and hi-fi 3D stereo audio option, as well as a still video capture facility, that digitises video and audio from any of four different sources and of any international video standard, directly onto a hard or removable disk in a single easy step. Because it uses high-performance video processors such as Intel's i750 (a world standard) or AT&T's MPEG chipset, this facility is optimised for high-quality digital video and audio capture at up to 30 fps and at resolutions up to 320×240 pixels in 16 million colours.
In order that the combination, compression and re-compression may be readily understood we shall describe two specific embodiments of the technology.
In the first instance, where, say, a single 16-bit publication-quality still picture or document is to be transmitted, we pass this combined signal through a compression algorithm which compresses the signal down to one two-hundredth of its original size, and thus provide a DTE output signal which, with the further digital modem compression ratio of 4:1, can be transmitted within seconds, requiring far less capacity than can be conventionally carried on the PSTN network, even at the speed of 28.8 kilobits per second.
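The arithmetic behind this first embodiment can be illustrated with assumed figures; only the 1:200 and 4:1 ratios and the 28.8 kilobit per second line rate come from the description above, while the 640×480 image size is an assumption for illustration.

```python
# Illustrative figures only: the image dimensions are assumed,
# the compression ratios and line rate are from the description.
raw_bits = 640 * 480 * 16          # a 16-bit still picture: 4,915,200 bits
dte_bits = raw_bits / 200          # after the 1:200 compression algorithm
wire_bits = dte_bits / 4           # after the 4:1 digital modem compression
seconds = wire_bits / 28_800       # transmission time at 28.8 kbit/s

print(round(seconds, 2))           # about 0.21 s - "within seconds"
```

Even for a larger source image the total transmission time remains on the order of seconds, which is the point of the first embodiment.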
In the second instance, where, say, the picture to be transmitted is information relating to a full-motion 16-bit
picture (64 thousand colours - high-colour) together with synchronised sound, and where it is necessary to provide at least 24 frames per second, we use a slightly different operation.
We initially use a combining process to reduce the size of the initial "picture". In this specification, the word picture will be used generically to indicate all the video information which is to be transmitted, including sound and sync information. This reduction is effected by combining groups of pixels, say four pixels, to provide a signal which can be deemed to have the characteristics of the combined group of pixels. By doing this, we can immediately reduce the total amount of video information which needs to be transmitted to 25% of its original volume.
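A minimal sketch of the combining step, assuming the combined pixel simply carries the rounded mean of its 2×2 group (the description above does not specify the exact combining function, so the mean is an assumption):

```python
def combine_2x2(frame):
    """Combine each 2x2 group of pixels into one representative pixel,
    reducing the video information to 25% of its original volume."""
    out = []
    for y in range(0, len(frame), 2):
        row = []
        for x in range(0, len(frame[0]), 2):
            group = (frame[y][x] + frame[y][x + 1] +
                     frame[y + 1][x] + frame[y + 1][x + 1])
            row.append((group + 2) // 4)   # rounded mean of the four pixels
        out.append(row)
    return out

# An area of uniform colour survives combination unchanged
print(combine_2x2([[10, 10], [10, 10]]))   # [[10]]
```

As the later discussion of reconstruction notes, areas of the same or similar colour lose almost nothing under this operation; only junctions between differently coloured areas are approximated.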
During the two-stage compression process which follows, and so as to maintain greater colour information (16-bit high-colour) than would otherwise be the case, we first pass the signal through a stage which effects a 1:50 compression. We then effect a selective re-compression, involving variable manipulation of the dynamics of the picture, to obtain a further reduction of at least 30-40% in the amount of material which actually needs to be transmitted, by discarding material which has already been incorporated in previous frames and therefore does not need to be transmitted again.
Finally, if the signal to be transmitted is a signal such as a
television picture or the like, then it is preferred that a digital modem incorporating 16-bit multiplexing and transmission technology be used. In this instance, far greater DTE speeds can be realised, and an additional 4:1 compression with error correction is achieved.
In addition, for the sake of compatibility of VBT-based equipment with existing competitive hardware/software, international standards are religiously observed and utilised as 'shell' standards for all hardware and software associated with VBT.
2.2 The Client-End Workstation
In each case the signal so produced lies within the bandwidth of the network and can readily be transmitted to a client receiver Workstation at which there is an effective total reversal of the process, including first the decompression and the reconstruction of the picture to its original size.
The main factor upon which VBT firstly totally relies, and secondly takes full advantage of, is the processing power and intelligence of the client-end workstation. In order to perform the motion estimation algorithm involved in the selective re-compression and decompression in real time, enormous distributed processing power and intelligence is required. VBT harnesses this power through dedicated hardware/software components employed in its video reconstruction equipment.
Effectively, the mean characteristics of four pixels are combined into one and, say along junctions, the adjacent pixels could be slightly different. Of course, in an area of the same or similar colours, the combined pixel, when reconstituted into four pixels, will vary only very little from the characteristics of the pixels originally compressed. Furthermore, with the continuous movement of the objects and the use of the VBT smoothing algorithm, the effect of pixelisation is far less noticeable, especially when the video CODEC is using the optional secondary video acceleration and smoothing for continuously-scalable playback up to full screen. A TV interface module is also offered, and the frame rate of digital video and audio playback is automatically adjusted to produce the best possible result.
The following describes the elements of the VBT encoding process which are found in the ITU (International Telecommunication Union) H.26x standards, as well as the use of the MPEG-1 and MPEG-2 standards of video coding.
The source coder operates on non-interlaced pictures occurring 30,000/1001 (approximately 29.97) times per second. The tolerance on picture frequency is ±50 ppm. Pictures are coded as luminance and two colour difference (chrominance) components (Y, CB and CR).
These components and the codes representing their sampled values are as defined in CCIR Recommendation 601.
Black = 16
White = 235
Zero colour difference = 128
Peak colour difference = 16 and 240
These values are nominal ones and the coding algorithm functions with input values of 1 through to 254.
There are five standardised picture common intermediate formats: sub-QCIF, QCIF, CIF, 4CIF and 16CIF. For each of these picture formats, the luminance sampling structure is dx pixels per line, dy lines per picture in an orthogonal arrangement. Sampling of each of the two colour difference components is at dx/2 pixels per line, dy/2 lines per picture, orthogonal. The values of dx, dy, dx/2 and dy/2 are given in TABLE 1/H.263 of the ITU for each of the picture formats.
To permit a common system interchange to cover use in and between regions using 625-line and 525-line television standards, the source coder operates on CIF and QCIF. The standards of the input and output television signals, which may be, for example, composite or component, analogue or digital, and the methods of performing any necessary conversion to and from the source coding format, are therefore kept independent of the encoding operation.
The video coder provides a self-contained digital bit stream which may be combined with other multi-facility signals. The video decoder performs the reverse process. Pictures are sampled at an integer multiple of the video line rate. This sampling clock and the digital network clock are asynchronous.
In order to maintain compatibility at all levels, VBT uses 16x16 macroblocks, 8×8 sub-blocks, selective re-compression/motion estimation and compensation, DCT (Discrete Cosine Transform) of prediction errors, run-length coding and variable length code-words. This section defines the coding method for the core elements in VBT including an 8x8 motion vector search.
CIF and QCIF are the core picture interchange formats used. Conversions between the CIF and QCIF formats are made as follows:
/   Integer division with truncation towards zero.
//  Integer division with rounding to the nearest integer (i.e. 3//2=2, -3//2=-2).
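These two operators differ from ordinary floor division for negative operands, so a sketch may be useful (note that Python's own // floors rather than truncates, hence the explicit implementations):

```python
def div_trunc(a, b):
    """The '/' operator: integer division with truncation towards zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def div_round(a, b):
    """The '//' operator: integer division rounded to the nearest
    integer, with halves rounded away from zero."""
    q = (2 * abs(a) + abs(b)) // (2 * abs(b))
    return q if (a >= 0) == (b >= 0) else -q

print(div_round(3, 2), div_round(-3, 2))   # 2 -2, as in the definition above
print(div_trunc(-3, 2))                    # -1 (truncation towards zero)
```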
A 7-tap filter is used, with the taps being (-1,0,9,16,9,0,-1)/32. The taps are related to a CIF raster. During the conversion from CIF to QCIF, a filter is used in both the horizontal and vertical directions for luminance and chrominance. Re-conversion from QCIF to CIF is performed using the same filter. During re-construction, and since CIF is generated from a QCIF raster, zeros are entered at filter taps where pixels are missing. The up-conversion is first performed in one direction (e.g. horizontal) to produce the CIF pixels; the same procedure is then performed in the other direction, and for the luminance and the two chrominance components.
In order to complete this dynamic smoothing algorithm, mirroring around the end pixels in the QCIF window is performed.
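A one-dimensional sketch of the CIF-to-QCIF downconversion with this filter, using mirroring around the end pixels. The exact rounding used is an assumption; note that the tap values sum to 32, so flat areas pass through unchanged:

```python
TAPS = (-1, 0, 9, 16, 9, 0, -1)   # the 7-tap filter, to be divided by 32

def downconvert_row(row):
    """Filter and decimate one CIF luminance row to QCIF width."""
    n, out = len(row), []
    for centre in range(0, n, 2):          # keep every second CIF sample
        acc = 0
        for k, tap in enumerate(TAPS):
            i = centre + k - 3
            if i < 0:                      # mirror around the end pixels
                i = -i
            elif i >= n:
                i = 2 * (n - 1) - i
            acc += tap * row[i]
        out.append((acc + 16) // 32)       # rounded division by 32
    return out

print(downconvert_row([100] * 8))          # [100, 100, 100, 100]
```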
A hybrid of inter-picture prediction to utilise temporal redundancy and transform coding of the remaining signal to reduce spatial redundancy is adopted. The decoder has motion compensation capability, allowing optional incorporation of this technique in the coder. Half-pixel precision is used for the motion compensation for PSTN use, as opposed to ISDN use, where full-pixel precision and a loop filter are used. Variable length coding is used for the symbols to be transmitted.
In addition to the core H.26x coding algorithm, four negotiable coding options are included. All these options can be used together or separately, except the advanced prediction mode which requires the unrestricted motion vector mode to be used at the same time.
In this optional mode, motion vectors are allowed to point
outside the picture. The edge pixels are used as prediction for the "not existing" pixels. With this mode a significant gain is achieved if there is movement across the edges of the picture, especially for the smaller picture formats. The Syntax-based Arithmetic Coding (SAC) mode is used instead of variable length coding; the SNR and reconstructed frames will be the same, but significantly fewer bits will be produced. Furthermore, an optional advanced prediction mode with overlapped block motion compensation (OBMC) is used for the luminance part of P-pictures, in which four 8×8 vectors instead of one 16×16 vector are used for some of the macroblocks in the picture.
The encoder has to decide which type of vectors to use. Four vectors use more bits but give better prediction. The use of this mode generally gives a considerable improvement; in particular, a subjective gain is achieved because OBMC results in fewer blocking artifacts.
PB-frames are also used, consisting of two pictures being coded as one unit. The name PB comes from the names of the picture types in MPEG, where there are P-pictures and B-pictures. Thus, a PB-frame consists of one P-picture, which is predicted from the last decoded P-picture, and one B-picture, which is predicted from both the last decoded P-picture and the P-picture currently being decoded. This last picture is called a B-picture because parts of it may be bidirectionally predicted from the past and future P-pictures. With this coding option, the picture rate can be increased considerably without increasing the bit rate as much.
The transmission clock is provided externally. The video bitrate may be variable. No constraints on the video bitrate are given; constraints will be set by the terminal or the network used.
The encoder controls its output bitstream to comply with the requirements of the decoder. Video data is provided on every valid clock cycle. This can be ensured by MCBPC stuffing or, when forward error correction is used, also by forward error correction stuffing frames.
The coder is used for bi-directional or unidirectional visual communication.
Error handling is provided by external means. The decoder can send a command to encode one or more GOBs of its next picture in INTRA mode with coding parameters such as to avoid buffer overflow. Alternatively, the decoder sends a command to transmit only non-empty GOB headers. The transmission method for these signals is by external means.
The prediction is inter-picture and is augmented by motion compensation (see below). The coding mode in which prediction is applied is called INTER; the coding mode is called INTRA if no prediction is applied. The INTRA coding mode is signalled at the picture level (I-picture for INTRA or P-picture for
INTER) or at the macroblock level in INTER mode. The B-pictures are partly predicted bi-directionally.
Motion compensation is optional but preferably used in the encoder. The decoder accepts one vector per macroblock or, if the advanced prediction mode is used, one or four vectors per macroblock. If the PB-frames mode is used, an additional delta vector can be transmitted per macroblock for adaptation of the forward motion vector for prediction of the B-macroblock.
Both horizontal and vertical components of the motion vectors have integer or half-integer values restricted to the range [-16,15.5] (this also holds for the forward and backward motion vectors for B-pictures).
Normally, a positive value of the horizontal or vertical component of the motion vector signifies that the prediction is formed from pixels in the previous picture which are spatially to the right or below the pixels being predicted. The only exception is for the backward motion vectors for the B-pictures, where a positive value of the horizontal or vertical component of the motion vector signifies that the prediction is formed from pixels in the next picture which are spatially to the left or above the pixels being predicted.
Motion vectors are restricted such that all pixels referenced by them are within the coded picture area, except when the Unrestricted Motion Vector mode is used.
Several parameters are varied to control the rate of generation of coded video data. These include processing prior to the source coder, the quantizer, block significance criterion and temporal sub-sampling. When invoked, temporal sub-sampling is performed by discarding complete pictures.
A decoder can signal its preference for a certain trade-off between spatial and temporal resolution of the video signal. The encoder signals its default trade-off at the beginning of the call, and indicates whether it is capable of responding to decoder requests to change this trade-off. The transmission method for these signals is by external means.
Forced updating is achieved by forcing the use of the INTRA mode of the coding algorithm. The update pattern is not defined. To control the accumulation of inverse transform mismatch error, each macroblock shall be coded in INTRA mode at least once every 132 times that coefficients are transmitted for this macroblock.
Byte alignment of start codes is achieved by inserting fewer than 8 zero-bits before the start code such that the first bit of the start code is the first (most significant) bit of a byte. A start code is therefore byte aligned if the position of its most significant bit is a multiple of 8 bits from the first bit in the video bitstream. All picture start codes shall, and GOB and EOS start codes may, be byte aligned. The number of bits spent for a certain picture is always a
multiple of 8 bits.
The video multiplex is arranged in a hierarchical structure with four layers. From top to bottom the layers are:
- Picture
- Group of Blocks (optional)
- Macroblock
- Block
The most significant bit is transmitted first. All unused or spare bits are set to "1".
Data for each picture consists of a picture header followed by data for Groups of Blocks, eventually followed by an end-of-sequence code and stuffing bits. PLCI is only present if indicated by CPM. TRB and DBQUANT are only present if PTYPE indicates "PB-frame". Combinations of PSPARE and PEI may not be present. EOS may not be present, while STUF may be present only if EOS is present. Picture headers for dropped pictures are not transmitted.
Picture Start Code (PSC) (22 bits + 0-7 stuffing bits). PSC is a word of 22 bits. Its value is 0000 0000 0000 0000 1 00000. All picture start codes are byte aligned. This is achieved by inserting fewer than 8 zero-bits before the start code such that the first bit of the start code is the first (most significant) bit of a byte.
A PICTURE frame consists of 99 MACROBLOCKS.
A MACROBLOCK consists of six BLOCKS: four luminance (Y) blocks and two chrominance blocks (U and V).
The four luminance and two chrominance BLOCKS each contain 64 pixels (8×8).
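These counts correspond to the QCIF luminance raster; the 176×144 dimensions are inferred from the 99-macroblock figure rather than stated explicitly above:

```python
WIDTH, HEIGHT = 176, 144                  # QCIF luminance raster (inferred)
MB = 16                                   # macroblock side in pixels

macroblocks = (WIDTH // MB) * (HEIGHT // MB)   # 11 columns x 9 rows
blocks_per_mb = 4 + 1 + 1                      # four Y, one U, one V block
pixels_per_block = 8 * 8

print(macroblocks, blocks_per_mb, pixels_per_block)   # 99 6 64
```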
Motion estimation is performed on the luminance macroblock. SAD (Sum of Absolute Difference) is used as an error measure as detailed below.
In the basic VBT mode the block size for vectors is 16×16. As an option, 8×8 vectors may be used (Advanced prediction mode). This section applies to both 16x16 and 8x8 block sizes.
If 8x8 vectors are used, both 8×8 and 16×16 vectors may be obtained from the search algorithm. Only a small amount of additional computation is needed to obtain the 8x8 integer vectors in addition to the 16×16 vectors.
The search is made with integer pixel displacement and for the Y component. The comparisons are made between the incoming block and the displaced block in the previous original picture. A full search is used, and the search area is up to ±15 pixels in the horizontal and vertical directions around the original macroblock position.
SAD_N(x, y) = Σ(i=0..N-1) Σ(j=0..N-1) | current(i, j) - previous(i + x, j + y) |, with -15 ≤ x, y ≤ 15 and N = 16 or 8.
For the zero vector, SAD16(0,0) is reduced by 100, to favour the zero vector when there is no significant difference:
SAD 16 (0, 0) = SAD16 (0, 0) - 100
The (x,y) pair resulting in the lowest SAD16 is chosen as the 16×16 integer pixel motion vector, V0. The corresponding SAD is SAD16(x,y).
Likewise, the (x,y) pairs resulting in the lowest SAD8(x,y) are chosen to give the four 8×8 vectors V1, V2, V3 and V4.
If only 16x16 prediction is used: SADinter = SAD16 (x,y)
If advanced prediction is used: SADinter = min (SAD16 (x,y), SAD4x8)
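The full integer-pixel search described above can be sketched as follows; the picture padding and offset handling here are assumptions made only to keep the example self-contained:

```python
def sad(cur, prev, dx, dy, n):
    """Sum of absolute differences between the incoming n x n block and
    the block displaced by (dx, dy) in the previous picture."""
    return sum(abs(cur[j][i] - prev[j + dy][i + dx])
               for j in range(n) for i in range(n))

def integer_search(cur, prev, off, rng, n):
    """Full search over +/- rng pixels; prev is padded and off locates
    the co-located block. The zero-vector SAD is reduced by 100."""
    best_v, best_s = None, None
    for y in range(-rng, rng + 1):
        for x in range(-rng, rng + 1):
            s = sad(cur, prev, off + x, off + y, n)
            if (x, y) == (0, 0):
                s -= 100                   # favour the zero vector
            if best_s is None or s < best_s:
                best_v, best_s = (x, y), s
    return best_v, best_s

# A block copied from one pixel to the right is found at vector (1, 0)
prev = [[7 * i + 13 * j for i in range(8)] for j in range(8)]
cur = [[prev[j + 2][i + 3] for i in range(4)] for j in range(4)]
print(integer_search(cur, prev, 2, 2, 4))   # ((1, 0), 0)
```

In the standard the search range is ±15 pixels and n is 16 or 8; the toy sizes here are only to keep the demonstration small.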
After the integer pixel motion estimation, the coder makes a decision on whether to use INTRA or INTER prediction in the coding. The following parameters are calculated to make the INTRA/INTER decision:
MB_mean = (Σ original) / 256
A = Σ | original - MB_mean |
INTRA mode is chosen if:
A < (SADinter - 500)
Notice that if SAD16(0,0) is used, this is the value that was already reduced by 100 above.
If INTRA mode is chosen, no further operations are necessary for the motion search. If INTER mode is chosen, the motion search continues with a half-pixel search around the V0 position.
Half-pixel search is performed for 16×16 vectors, as well as for 8×8 vectors if that option is chosen. The half-pixel search is done using the previous reconstructed frame. The search is performed on the luminance component of the macroblock, and the search area is ±1 half-pixel around the
target matrix pointed to by V0, V1, V2, V3 or V4. For the 16×16 search, the zero-vector SAD, SAD16(0,0), is reduced by 100 as in the integer search.
The half-pixel values are found using the interpolation described in Figure 1 and which corresponds to bilinear interpolation.
Figure 1. Interpolation scheme for half-pixel search
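Since the figure is not reproduced here, the bilinear scheme can be sketched directly; the layout (A top-left, B top-right, C bottom-left, D bottom-right) and the upward rounding are assumptions in the usual H.263 style:

```python
def half_pixels(a, b, c, d):
    """Bilinear interpolation for the half-pixel positions between four
    neighbouring integer-position pixels (rounding upwards)."""
    horiz = (a + b + 1) // 2            # half-pixel between a and b
    vert = (a + c + 1) // 2             # half-pixel between a and c
    centre = (a + b + c + d + 2) // 4   # central half-pixel
    return horiz, vert, centre

print(half_pixels(0, 2, 4, 6))          # (1, 2, 3)
```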
The vector resulting in the best match during the half-pixel
search is named MV. MV consists of horizontal and vertical components (MVx, MVy), both measured in half-pixel units.
This section applies only if the advanced prediction mode is selected, in which case a decision must be made between the 16×16 and 8×8 prediction modes.
SAD for the best half-pixel 16x16 vector (including subtraction of 100 if vector is (0,0)):
SAD16(x,y)
SAD for the whole macroblock for the best half-pixel 8x8 vectors:
The following rule applies:
If: SAD4×8<SAD16 - 100, choose 8×8 prediction
otherwise choose 16×16 prediction.
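The rule above reduces to a single comparison with a fixed bias of 100 against the four-vector mode:

```python
def prediction_mode(sad16, sad4x8):
    """Choose 8x8 prediction only when its SAD beats the 16x16 SAD by
    more than 100; otherwise keep the single 16x16 vector."""
    return "8x8" if sad4x8 < sad16 - 100 else "16x16"

print(prediction_mode(1000, 850))    # 8x8
print(prediction_mode(1000, 950))    # 16x16
```

The bias reflects that four vectors cost more bits to transmit, so the 8×8 mode must earn a clear prediction gain before it is selected.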
When using INTER mode coding, the motion vector must be transmitted. The motion vector components (horizontal and vertical) are coded differentially by using a spatial neighbourhood of three motion vectors already transmitted
(Figure 2). These three motion vectors are candidate predictors for the differential coding.
In the special cases at the borders of the current GOB or picture the following decision rules are applied in increasing order:
1. The candidate predictor MV1 is set to zero if the corresponding macroblock is outside the picture (at the left side);
2. Then, the candidate predictors MV2 and MV3 are set to MV1 if the corresponding macroblocks are outside the current GOB or picture (at the top);
3. Then, the candidate predictor MV3 is set to zero if the corresponding macroblock is outside the picture (at the right side);
4. When the corresponding macroblock was coded in INTRA mode or was not coded (COD=1), the candidate predictor is set to zero.
The motion vector coding is performed separately on the horizontal and vertical components.
For each component, the median value of the three candidates for the same component is computed:
Px = Median (MV1x, MV2x, MV3x)
Py = Median (MV1y, MV2y, MV3y)
For instance, if MV1 = (-2,3), MV2 = (1,5) and MV3 = (-1,7), then Px = -1 and Py = 5.
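A sketch of the component-wise median prediction; the MV2 value in the worked example below is an assumption, since only Px = -1 and Py = 5 are given above:

```python
def median(a, b, c):
    """Middle value of three."""
    return sorted((a, b, c))[1]

def predict_vector(mv1, mv2, mv3):
    """Component-wise median of the three candidate predictors."""
    return (median(mv1[0], mv2[0], mv3[0]),
            median(mv1[1], mv2[1], mv3[1]))

print(predict_vector((-2, 3), (1, 5), (-1, 7)))   # (-1, 5)
```

The motion vector differences MVDx and MVDy are then the transmitted vector components minus Px and Py respectively.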
The variable length codes for the vector differences MVDx and MVDy are as listed.
3.3 Macroblock Prediction
In the previous step the prediction mode was decided and the motion vector found (if INTER). For the Y component the same filters as described for motion search are used.
Table 1 shows the different modes. Notice that MODE contains information about prediction as well as update of QP.
TRANSFORM
A separable 2-dimensional Discrete Cosine Transform (DCT) is used.
Quantization
The number of quantizers is 1 for the first coefficient of INTRA blocks and 31 for all other coefficients. Within a macroblock, the same quantizer is used for all coefficients except the first one of INTRA blocks. The decision levels are not defined. The first coefficient of INTRA blocks is nominally the transform DC value, uniformly quantised with a stepsize of 8.
Each of the other 31 quantizers uses equally spaced reconstruction levels with a central dead-zone around zero and with a step size of an even value in the range 2 to 62.
The quantization parameter QP may take integer values from 1 to 31. The quantization stepsize is 2×QP.
COF     A transform coefficient to be quantised.
LEVEL   Absolute value of the quantised version of the transform coefficient.
COF'    Reconstructed transform coefficient.
Quantization:
For INTRA: LEVEL = |COF| / (2×QP)
For INTER: LEVEL = (|COF| - QP/2) / (2×QP)
The DC coefficient of an INTRA block is quantised as follows; 8 bits are used for the quantised DC coefficient.
Quantization: LEVEL = COF // 8
Dequantization: COF ' = LEVEL × 8
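The three quantization rules above can be sketched directly; the divisions follow the '/' (truncating) and '//' (rounding) operators defined earlier, and non-negative inputs are assumed for the DC case:

```python
def quantise(cof, qp, intra):
    """LEVEL for a non-DC coefficient; the sign is handled separately.
    INTER quantisation has a dead-zone around zero of width QP."""
    if intra:
        return abs(cof) // (2 * qp)
    # truncation towards zero: a negative numerator would give LEVEL 0
    return max(abs(cof) - qp // 2, 0) // (2 * qp)

def quantise_intra_dc(cof):
    """DC coefficient of an INTRA block: stepsize 8, rounded division."""
    return (cof + 4) // 8

def dequantise_intra_dc(level):
    return level * 8

print(quantise(70, 5, True), quantise(70, 5, False))   # 7 6
print(dequantise_intra_dc(quantise_intra_dc(804)))     # 808
```

Note how the INTER dead-zone makes the same coefficient quantise to a smaller LEVEL than in INTRA mode.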
VLC encoding of quantised transform coefficients
The 8 × 8 blocks of transform coefficients are scanned with "zigzag" scanning as listed below.
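The zigzag order itself is not reproduced above, but it can be generated; this sketch assumes the conventional pattern starting (0,0), (0,1), (1,0), ...:

```python
def zigzag_order(n=8):
    """(row, col) visiting order for n x n zigzag scanning."""
    order = []
    for s in range(2 * n - 1):                    # walk the anti-diagonals
        diag = [(s - c, c) for c in range(max(0, s - n + 1),
                                          min(s, n - 1) + 1)]
        # even diagonals run up-right, odd diagonals run down-left
        order.extend(diag if s % 2 == 0 else list(reversed(diag)))
    return order

print(zigzag_order()[:6])  # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```

Scanning in this order groups the low-frequency coefficients first, so the nonzero values cluster at the start of the list and the run-length coding below becomes effective.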
A three dimensional variable length code is used to code transform coefficients. An EVENT is a combination of three parameters:
LAST 0: There are more nonzero coefficients in the block.
1: This is the last nonzero coefficient in the block.
RUN Number of zero coefficients preceding the current nonzero coefficient.
LEVEL Magnitude of the coefficient.
The most commonly occurring combinations of (LAST, RUN, LEVEL) are coded with variable length codes given. The remaining combinations of (LAST, RUN, LEVEL) are coded with a 23 bit word consisting of:
ESCAPE 7 bit
LAST 1 bit (0: Not last coefficient, 1: Last nonzero coefficient)
RUN 7 bit
LEVEL 8 bit
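Producing the (LAST, RUN, LEVEL) events from a zigzag-ordered coefficient list can be sketched as follows (the coding of the sign is omitted from this sketch):

```python
def events(coeffs):
    """Turn a zigzag-ordered coefficient list into (LAST, RUN, LEVEL)
    events: RUN counts the zeros before each nonzero coefficient and
    LAST marks the final nonzero coefficient of the block."""
    nonzero = [i for i, c in enumerate(coeffs) if c != 0]
    out, prev = [], -1
    for k, i in enumerate(nonzero):
        last = 1 if k == len(nonzero) - 1 else 0
        out.append((last, i - prev - 1, abs(coeffs[i])))
        prev = i
    return out

print(events([5, 0, 0, -3, 0, 1]))   # [(0, 0, 5), (0, 2, 3), (1, 1, 1)]
```

Each event is then looked up in the variable length code table, or sent as the 23-bit ESCAPE word when no code exists for that combination.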
VBT Buffer regulation can be effected in different ways.
Fixed stepsize and frame rate:
One mode of performing simulations uses a fixed stepsize and frame rate. In this mode, simulations shall be performed with a constant stepsize throughout the sequence. The quantizer is "manually" adjusted so that the average bitrate for all pictures in the sequence - minus picture number 1 - is as close as possible to one of the target bit rates (8, 16 or 32 kbit/s).
Regulation of stepsize and frame rate:
For realistic simulations with limited buffer and coding delay, a buffer regulation is needed. Mechanisms for regulating the output bitrate are:
Possibility of changing the stepsize on macroblock level.
The following buffer regulation will be used as a starting point.
The first intra picture is coded with QP=16. After the first picture the buffer content is set to:
For the following pictures the quantizer parameter is updated at the beginning of each new macroblock line. The formula for calculating the new quantizer parameter is:
The first two terms of this formula are constant for all macroblocks within a picture. The third term adjusts the quantizer parameter during the coding of the picture.
The calculated QPnew must be adjusted so that the difference fits in with the definition of DQUANT. The buffer content is updated after each complete frame in the following way:
After the first frame the buffer content is updated as follows:
The variable frame_incr indicates how many times the last coded picture must be displayed. It also indicates which picture from the source is coded next.
To regulate frame rate, ftarget and a new B are calculated at the start of each frame:
At the start of the second frame:
Bt-1 = B
At the start of each of the remaining frames:
For this buffer regulation, it is assumed that the process of encoding is temporarily stopped when the physical transmission buffer is nearly full. This means that buffer overflow, and the forced fixing of blocks, will not occur.
However, this also means that no minimum frame rate and delay can be guaranteed.