WO1998007278A2 - Video encoder/decoder - Google Patents

Video encoder/decoder

Info

Publication number
WO1998007278A2
Authority
WO
WIPO (PCT)
Prior art keywords
image
video
data
processor
encoder
Prior art date
Application number
PCT/US1997/013039
Other languages
French (fr)
Other versions
WO1998007278A3 (en)
Inventor
John J. Proctor
Mark M. Baldassari
Craig H. Richardson
Chris J. Hodges
Kwan K. Truong
David L. Smith
Philip A. Desjardins
Thomas H. Knight
Original Assignee
3Com Corporation
Priority date
Filing date
Publication date
Priority claimed from US08/689,444 external-priority patent/US5805228A/en
Priority claimed from US08/689,519 external-priority patent/US6072830A/en
Priority claimed from US08/694,949 external-priority patent/US5926226A/en
Priority claimed from US08/689,456 external-priority patent/US5864681A/en
Priority claimed from US08/689,401 external-priority patent/US5831678A/en
Application filed by 3Com Corporation filed Critical 3Com Corporation
Priority to AU40455/97A priority Critical patent/AU4045597A/en
Publication of WO1998007278A2 publication Critical patent/WO1998007278A2/en
Publication of WO1998007278A3 publication Critical patent/WO1998007278A3/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/94Vector quantisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/008Vector quantisation

Definitions

  • the present invention generally relates to digital data compression and encoding and decoding of signals. More particularly, the present invention relates to methods and apparatus for encoding and decoding digitized video signals.
  • digital channel capacity is the most important parameter in a digital transmission system because it limits the amount of data to be transmitted in a given time.
  • the transmission process requires a very effective source encoding technique to overcome this limitation.
  • the major issue in video source encoding is usually the tradeoff between encoder cost and the amount of compression that is required for a given channel capacity.
  • the encoder cost usually relates directly to the computational complexity of the encoder. Another significant issue is whether the degradation of the reconstructed signal can be tolerated for a particular application
  • a common objective of all source encoding techniques is to reduce the bit rate of some underlying source signal for more efficient transmission and/or storage.
  • the source signals of interest are usually in digital form. Examples of these are digitized speech samples, image pixels, and a sequence of images.
  • Source encoding techniques can be classified as either lossless or lossy.
  • in lossless encoding techniques, the reconstructed signal is an exact replica of the original signal.
  • in lossy encoding techniques, some distortion is introduced into the reconstructed signal, which distortion can be tolerated in many applications.
  • Almost all video source encoding techniques achieve compression by exploiting both the spatial and temporal redundancies (correlation) inherent in the visual source signals.
  • Numerous source encoding techniques have been developed over the last few decades for encoding both speech waveforms and image sequences.
  • Pulse code modulation (PCM), differential PCM (DPCM), delta modulation, predictive encoding, and various hybrid as well as adaptive versions of these techniques are very cost-effective encoding schemes at bit rates above one bit per sample, which is considered to be a medium-to-high quality data rate.
  • a deficiency of all the foregoing techniques is that the encoding process is performed on only individual samples of the source signal.
  • Scalar quantization involves basically two operations. First, the range of possible input values is partitioned into a finite collection of subranges. Second, for each subrange, a representative value is selected to be output when an input value is within the subrange.
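  • For illustration only (not part of the original disclosure), the following sketch shows those two operations for a uniform scalar quantizer: the input range is split into subranges and the midpoint of the chosen subrange is output as the representative value. The function name quantize_scalar and the level count are hypothetical.
```c
#include <stdio.h>

/* Uniform scalar quantizer: partition [lo, hi) into n_levels subranges and
   return the midpoint of the subrange containing x as its representative. */
double quantize_scalar(double x, double lo, double hi, int n_levels)
{
    double step = (hi - lo) / n_levels;
    int idx = (int)((x - lo) / step);      /* which subrange x falls in */
    if (idx < 0) idx = 0;                  /* clamp out-of-range inputs */
    if (idx >= n_levels) idx = n_levels - 1;
    return lo + (idx + 0.5) * step;        /* representative value      */
}

int main(void)
{
    /* 8 levels over [0, 256): every input maps to one of 8 output values. */
    printf("%.1f -> %.1f\n", 200.0, quantize_scalar(200.0, 0.0, 256.0, 8));
    return 0;
}
```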
  • Vector quantization allows the same two operations to take place in multidimensional vector space. Vector space is partitioned into subranges, each having a corresponding representative value or code vector. Vector quantization was introduced in the late 1970s as a source encoding technique to encode source vectors instead of scalars. VQ is described in A. Gersho, "Asymptotically optimal block quantization," IEEE Trans. Information Theory, vol. 25, pp. 373-380, July 1979; Y. Linde, A. Buzo, and R.M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. 28, pp. 84-95, January 1980; and R.M. Gray, J.C. Kieffer, and Y. Linde, "Locally optimal quantizer design," Information and Control, vol. 45, pp. 178-198, 1980.
  • An advantage of the VQ approach is that it can be combined with many hybrid and adaptive schemes to improve the overall encoding performance. Further, VQ-oriented encoding schemes are simple to implement and generally achieve higher compression than scalar quantization techniques.
  • the receiver structure of VQ consists of a statistically generated codebook containing code vectors.
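  • As an editorial illustration of this structure (not taken from the patent), the sketch below assumes a squared-error distortion: the encoder transmits only the index of the best-matching code vector, and the receiver reconstructs the block by looking that index up in its copy of the codebook. Names such as vq_encode and the block dimension are hypothetical.
```c
#include <stddef.h>

#define DIM 16   /* e.g. a 4x4 image block treated as a 16-dimensional vector */

/* Encoder side: return the index of the code vector with minimum
   squared-error distortion with respect to the input block. */
size_t vq_encode(const double block[DIM],
                 const double codebook[][DIM], size_t codebook_size)
{
    size_t best = 0;
    double best_d = -1.0;
    for (size_t i = 0; i < codebook_size; i++) {
        double d = 0.0;
        for (size_t k = 0; k < DIM; k++) {
            double diff = block[k] - codebook[i][k];
            d += diff * diff;
        }
        if (best_d < 0.0 || d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return best;   /* only this index needs to be transmitted */
}

/* Receiver side: reconstruct the block from the transmitted index. */
void vq_decode(size_t index, const double codebook[][DIM], double out[DIM])
{
    for (size_t k = 0; k < DIM; k++)
        out[k] = codebook[index][k];
}
```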
  • VQ-oriented encoding techniques operate at a fixed rate/distortion tradeoff and thus provide very limited flexibility for practical implementation.
  • Another practical limitation of VQ is that VQ performance depends on the particular image being encoded, especially at low-rate encoding. This quantization mismatch can degrade the performance substantially if the statistics of the image being encoded are not similar to those of the VQ.
  • Two other conventional block encoding techniques are transform encoding (e.g., discrete cosine transform (DCT) encoding) and subband encoding.
  • in transform encoding, the image is decomposed into a set of nonoverlapping contiguous blocks and a linear transformation is evaluated for each block. Transform encoding is described in the following publications: W.K.
  • in transform encoding, transform coefficients are generated for each block, and these coefficients can be encoded by a number of conventional encoding techniques, including vector quantization. See N.M. Nasrabadi and R.A. King, "Image coding using vector quantization: a review," IEEE Trans. Commun., vol.
  • in subband coding, the decomposed subband images are then separately encoded at different bit rates. This approach resembles the human visual system.
  • Subband encoding is a very effective technique for high quality encoding of images and video sequences, such as high definition TV.
  • Subband encoding is also effective for progressive transmission in which different bands are used to decode signals at different rate/distortion operating points.
  • subband encoding is usually not very efficient in allocating bit rates to encode the subband images at low rates.
  • an apparatus for encoding an image signal includes an acquisition module disposed to receive the image signal.
  • a first processor is coupled to the acquisition module.
  • At least one encoder processor is coupled to the first processor, and produces an encoded image signal under control of the first processor.
  • a method for generating a compressed video signal is provided. The method includes the steps of converting an input image signal into a predetermined digital format and transferring the digital format image signal to at least one encoder processor.
  • the method further includes the step of applying, at the encoder processor, a hierarchical vector quantization compression algorithm to the digitized image signal. At the next step, a resultant encoded bit stream generated by the application of the algorithm is collected.
  • a method of compressing an image signal representing a series of image frames includes the step of analyzing a first image frame by computing a mean value for each of a plurality of image blocks within the first image frame, storing the computed mean values in a scalar cache, providing a mean value quantizer comprising a predetermined number of quantization levels arranged between a minimum mean value and a maximum mean value stored in the scalar cache, the mean value quantizer producing a quantized mean value, and identifying each image block from the plurality of image blocks that is a low activity image block.
  • each of the low activity image blocks is encoded with its corresponding quantized mean value.
  • a method is provided for periodically enhancing a video image with at least one processor, in which the video image includes a series of frames, each of the frames including at least one block.
  • the method includes the steps of determining space availability of the processor, determining time availability of the processor, and determining the number of bits used to encode a frame. Then, a value is assigned to a first variable corresponding to the number of bits used to encode the frame. The first variable is compared to a predetermined refresh threshold.
  • the method further includes the steps of scalar refreshing the video image if the first variable exceeds the predetermined refresh threshold, and block refreshing the video image if the first variable is less than or equal to the predetermined refresh threshold.
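  • A hedged sketch of that decision follows; it is not the patent's code. Only the comparison of the bit count against a predetermined refresh threshold is taken from the description above, and the stub refresh routines, parameter names, and the space/time checks are illustrative placeholders.
```c
#include <stdio.h>

static void scalar_refresh(void) { puts("scalar refresh"); }  /* placeholder */
static void block_refresh(void)  { puts("block refresh");  }  /* placeholder */

/* Periodic enhancement: refresh only when the processor has spare space and
   time, choosing the refresh style from the bits spent on the last frame. */
void periodic_enhancement(int has_space, int has_time,
                          int bits_used_for_frame, int refresh_threshold)
{
    if (!has_space || !has_time)
        return;                      /* no spare processor capacity this frame */

    int first_variable = bits_used_for_frame;
    if (first_variable > refresh_threshold)
        scalar_refresh();            /* many bits already spent: cheap refresh */
    else
        block_refresh();             /* bits remain: refresh whole blocks      */
}
```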
  • a method for transmitting data from an encoder to a decoder includes, in combination encoding the data into a plurality of macro rows, transmitting the macro rows from the encoder to the decoder, and decoding the macro rows at the decoder.
  • in a video conferencing system having a camera, an image processor and an image display, an apparatus for electronically panning an image scene is provided.
  • the apparatus includes an image acquisition module disposed to receive an image signal from the camera.
  • the image acquisition module produces a digital representation of the image signal.
  • Means for transmitting a portion of the digital representation of the image signal to the image processor are provided.
  • the transmitting means is coupled between the image acquisition module and the image processor.
  • the apparatus also includes an image pan selection device coupled to the transmitting means.
  • the image pan selection device is operable to change the portion of the digital representation of the image signal that is transmitted to the image processor.
  • An object of the present invention is a reduction of the number of bits required to encode an image. Specifically, an object of the invention is a reduction of the number of bits required to encode an image by using a scalar cache to encode the mean of the blocks in the different levels of the hierarchy. Yet another object of the present invention is a reduction of decoding errors caused by faulty motion cache updates in the presence of transmission errors.
  • a further object of the present invention is a reduction of the overall computational power required to implement the HVQC and possible reduction of the number of bits used to encode an image.
  • Another object of the present invention is an improvement in the perceived image quality. It is further an object of the present invention to control how bits are allocated to encoding various portions of the image to improve the image quality.
  • Another object of the present invention is to reduce the probability of buffer overflow and control how bits are allocated to encoding various portions of the image to improve the image quality.
  • a further object of the present invention is to improve image quality by encoding the chrominance components independently of the luminance component.
  • Still another object of the present invention is an improvement in overall image quality, particularly of the chrominance component.
  • an object of the present invention to keep the computation load down by providing the processing devices with a history of the portion of the image they have just acquired and thus increase the precision of the motion vectors along the partition boundary. Yet another object of the invention is a reduction of the bit rate and the computational complexity. An additional object of the invention is to allow computational needs to be balanced with image quality and bit rate constraints
  • Another object of the invention is bit rate management determined by a single processing device that in turn controls the various encoder devices.
  • An additional object of the invention is to allow for maximum throughput of data to the transmission channel, immediate bit rate feedback control to all encoding processors, and, by the use of small packets of data, to reduce image inaccuracies at the decoder due to channel errors. It is also an object of the invention to allow for efficient retransmission, if necessary.
  • a further object of the invention is to allow for a data channel to transmit various types of data in a multiplexed fashion.
  • Another object of the invention is low cost and low complexity implementation of YUV-to-RGB conversion
  • Yet another object of the invention is a low cost, low complexity mechanism for controlling the data to undergo YUV-to-RGB conversion.
  • An additional object of the invention is improved error detection and concealment
  • Another object of the invention is to reduce the amount of time between the viewer seeing a scene of high motion begin and
  • Another object of the invention is to transmit high quality still image frames. Another object of the invention is to improve the image quality by pre-determimng which areas require the most bits for encoding.
  • It is also an object of the invention to provide a fast, efficient, and low cost method of calculating the distortion measure used in the hierarchical algorithm.
  • Another object of the invention is a low cost means of panning an image scene and a low cost means of zooming in and out of an image scene.
  • An additional object of the invention is an elimination of the need for specialized
  • Another object of the invention is to simplify video conferencing for the user. Moreover, it is an object of the invention to allow a user to have a visual representation of who the caller is and what they may want.
  • Figure 1A is a block diagram of a hierarchical source encoding system in accordance with the present invention.
  • Figure IB is a block diagram of a hierarchical source encoding system in accordance with a preferred embodiment of the present invention.
  • Figure 2 is a block diagram of a background detector of Figure 1 A;
  • Figure 3 is a graphical illustration showing conventional quadtree decomposition of image blocks performed by the hierarchical source encoding system of Figure 1 A;
  • Figure 4 is a block diagram of a motion encoder of Figure 1 A;
  • Figure 5 is a graphical illustration showing the functionality of the compute-and-select mechanism of Figure 4.
  • Figure 6 is a graphical illustration showing the encoding of motion vectors within a cache memory of Figure 4.
  • Figure 7 is a graphical illustration showing updating processes of various cache stack replacement algorithms for the cache memory of Figure 4.
  • Figure 8 is a block diagram for a block matching encoder of Figure 1 A;
  • Figure 9 is a block diagram of a cache vector quantizer for encoding a new frame of a video sequence in the frame buffer of the hierarchical source encoding system of Figure 1 A;
  • Figure 10 is a graphical illustration showing a working set model for the cache memory of Figure 9;
  • Figure 11 is a graphical illustration showing a novel adaptive working set model for the cache memory of Figure 9;
  • Figure 12 is a graphical illustration showing a raster scan technique for scanning the image blocks in one image frame;
  • Figure 13 is a graphical illustration showing a localized scanning technique for scanning the image blocks in one image frame
  • Figure 14A is a schematic diagram of a preferred parallel processing architecture
  • Figure 14B is a high level diagram of the architecture of a video conferencing system
  • Figure 15 is a bit map of the format of data for the YUV-to-RGB converter shown in Figure 14A;
  • Figure 16 shows the time for each phase of pre-filtering assuming the encoders shown in Figures 14A and 14B will be operating on byte packed data;
  • Figure 17 depicts processor utilization (minus overhead) for post processing
  • Figure 18 shows the format improving the efficiency of the YUV-to-RGB converter shown in Figure 14A;
  • Figure 19 is a table of eight gamma elements
  • Figure 20 is a graph of threshold v. buffer fullness
  • Figure 21 depicts the Y data format and U&V data format in the master video input data buffer
  • Figure 22 is a high level example of a video bit stream
  • Figure 23 is a table relating to various values of the picture start code (PSC)
  • Figure 24 is a table of the fields in a macro row header
  • Figure 25 shows the format of extended macro zero information
  • Figure 26 is a table showing various decompositions of luminance of 16x16 blocks and their resulting bit streams
  • Figure 27 is a table of the flags used to describe the video bit stream
  • Figure 28 is a table of the various decompositions of chrominance 8x8 blocks and their resulting bit streams
  • Figure 29 depicts an example of encoding still image 4x4 blocks;
  • Figure 30 shows refresh header information;
  • Figure 31 is a table of the overhead associated with a bit stream structure
  • Figure 32 is a table of the fields of a picture header
  • Figure 33 depicts a packet header
  • Figure 34 depicts encoded data in C31 memory in big endian format
  • Figure 35 depicts encoded data in PC memory for the original bit stream
  • Figure 36 depicts encoded data in PC memory for a bit stream of the system
  • Figure 37 is a block diagram of an interrupt routine
  • Figures 38 - 44 relate to packet definitions for packets sent to the VSA
  • Figure 38 depicts a control packet
  • Figure 39 depicts a status request packet
  • Figure 40 depicts an encoded bit stream request
  • Figure 41 depicts a decoder bits end packet
  • Figure 42 depicts a decoder bits start/continue packet
  • Figure 43 depicts a YUV data for encoder packet
  • Figure 44 depicts a drop current RGB frame packet
  • Figures 45 - 53 relate to packet definitions for packets sent to the host (PC);
  • Figure 45 depicts a status packet
  • Figure 46 depicts a decoder error packet
  • Figure 47 depicts a decoder acknowledgment packet
  • Figure 48 depicts an encoded bits from encoder end packet
  • Figure 49 depicts an encoded bits from encoder start/continue packet
  • Figure 50 depicts an encoded bits frame stamp
  • Figure 51 depicts a top of RGB frame packet
  • Figure 52 depicts a FIFO ready packet
  • Figure 53 depicts a YUV acknowledgment packet
  • Figures 54 - 78 relate to control types and parameters for the control packet (host to VSA); Figure 54 depicts a control encoding packet;
  • Figure 55 depicts a frame rate divisor packet
  • Figure 56 depicts an encoded bit rate packet
  • Figure 57 depicts a post spatial filter packet
  • Figure 58 depicts a pre spatial filter packet
  • Figure 59 depicts a temporal filter packet
  • Figure 60 depicts a still image quality packet
  • Figure 61 depicts a video mode packet
  • Figure 62 depicts a video pan absolute packet
  • Figure 63 depicts a brightness packet
  • Figure 64 depicts a contrast packet
  • Figure 65 depicts a saturation packet
  • Figure 66 depicts a hue packet
  • Figure 67 depicts a super control packet
  • Figure 68 depicts a control decoding packet
  • Figure 69 depicts a motion tracking packet
  • Figure 70 depicts a request control setting packet
  • Figure 71 depicts an example of a request control setting packet
  • Figure 72 depicts a frame rate divisor packet
  • Figure 73 depicts a request special status information packet
  • Figure 74 depicts a buffer fullness information packet
  • Figure 75 depicts a YUV top of frame packet
  • Figure 76 depicts a Y data frame packet
  • Figure 77 depicts a UV data packet
  • Figure 78 depicts a YUV end of frame packet
  • Figure 79 depicts data flow between image planes, a digital signal processor and an accelerator
  • Figure 80 is a table that represents formulas for calculating speed improvements with an accelerator
  • Figure 81 is a table that shows speed improvement for various pipeline delay lengths used with an accelerator
  • Figure 82 depicts a mean absolute error (MAE) accelerator implementation in accordance with the equation shown in the Figure
  • Figure 83 depicts a mean square error (MSE) accelerator implementation in accordance with the equation shown in the Figure;
  • Figure 84 illustrates a mean absolute error (MAE) implementation as in Figure 82 with interpolation to provide ½ pixel resolution
  • Figure 85 illustrates a mean square error (MSE) implementation as in Figure 83 with interpolation to provide ½ pixel resolution;
  • Figure 86 is a Karnaugh map for carry save adders in the YUV to RGB matrix
  • Figure 87 is a block diagram of a circuit that converts a YUV signal to RGB format
  • Figure 88 depicts a logic adder and a related bit table for converting the YUV signal to
  • Figure 89 depicts a logic adder and a related bit table for converting the YUV signal to RGB format
  • Figure 90 depicts a logic adder and a related bit table for converting the YUV signal to RGB format
  • Figure 91 is a block diagram timing model for the YUV to RGB conversion
  • Figure 92 is a state diagram relating to the conversion of the YUV signal to RGB format
  • Figure 93 shows a 16x16 image block
  • Figure 94 shows an interpolated macro block
  • Figure 95 depicts memory space at the end of step 1 of the first iteration of a first motion estimation memory saving technique
  • Figure 96 depicts the memory space at the end of step 4 of the first iteration of the first motion estimation memory saving technique; and Figure 97 depicts memory space addressing for three successive iterations of a second motion estimation memory saving technique.
  • Figure 1A shows a hierarchical source encoding system 10 having multiple successive stages 11 - 19 for encoding different parts of an image block 21.
  • higher bit rates are allocated to those regions of the image block 21 where more motion occurs, while lower bit rates are allocated to those regions where less motion occurs.
  • the hierarchical source encoding system 10 has a low computational complexity, is simple and inexpensive to implement, and is suited for implementation in hardware, software, or combinations thereof.
  • the hierarchical encoding system 10 utilizes cache memories with stack replacement algorithms as a noiseless method to encode motion vectors.
  • This approach has an advantage over entropy encoding because the statistics of the motion vectors are allowed to vary with time.
  • Another major advantage of this approach is that the cache memories can substantially reduce the computation required for matching data blocks.
  • a background detector 11 initially receives an image block 21 of size p x p, for instance, 16 x 16 or 32 x 32 pixels. Generally, the background detector 11 determines whether the image block 21 is either a background block or a moving block.
  • a block diagram of the background detector 11 is shown in Figure 2. With reference to Figure 2, the background detector 11 comprises a compute mechanism 24 in series with a T bg flag mechanism 26, and a frame buffer 28 for receiving data from the flag mechanism 26 and for writing data to the compute mechanism 24.
  • the frame buffer 28 can be any memory which can be randomly accessed.
  • the image block 21 is compared with a previously encoded p x p image block (not shown) which has been stored in the frame buffer 28 in order to generate a difference value, or a "distortion" value.
  • the distortion value is generated by comparing the blocks on a pixel-by-pixel basis.
  • the compute mechanism 24 may utilize any suitable distortion measurement algorithm for determining the distortion value, but preferably, it employs one of the following well known distortion measurement algorithms:
  • Mean Square Error (MSE) or Mean Absolute Error (MAE)
  • the image block 21 is classified as a background block if a distortion value is less than a predetermined threshold T bg .
  • the comparison to the threshold T bg is performed in the T bg flag mechanism 26. If the comparison results in a difference which is less than the threshold T bg , then a flag bit 32 is set to a logic high ("1") to thereby indicate that the present image block 21 is substantially identical to the previously encoded p x p image block and then the system 10 will retrieve another image block 21 for encoding.
  • the image block 21 is encoded with merely a single flag bit 32.
  • the flag bit 32 remains at a logic low ("0") and then the system 10 will pass the p x p image block 21 on to the next hierarchical stage, that is, to the p/2 x p/2 (16 x 16 pixels in this example) background detector 12.
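  • For illustration only (not the patent's code), the sketch below performs this background test using a mean-absolute-error distortion, one of the measures named above, and returns the flag bit. The function name, argument layout, and the use of MAE rather than MSE are assumptions.
```c
#include <stdlib.h>

/* Background test for one p x p block: compare it pixel-by-pixel with the
   previously encoded block in the frame buffer and return 1 (flag bit high)
   when the distortion is below the threshold T_bg. */
int is_background_block(const unsigned char *cur, const unsigned char *prev,
                        int p, int stride, double t_bg)
{
    long sum = 0;
    for (int r = 0; r < p; r++)
        for (int c = 0; c < p; c++)
            sum += labs((long)cur[r * stride + c] - (long)prev[r * stride + c]);

    double mae = (double)sum / (double)(p * p);  /* pixel-by-pixel distortion */
    return mae < t_bg;                           /* 1 = encode with flag bit only */
}
```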
  • the p/2 x p/2 background detector 12 has essentially the same architecture and equivalent functionality as the p x p background detector 11 , as shown in Figure 2 or a similar equivalent thereof, but the p/2 x p/2 background detector 12 decomposes the p x p image block 21 into preferably four p/2 x p/2 image blocks 21 ' via a conventional quadtree technique, prior to image analysis.
  • a conventional quadtree decomposition is illustrated graphically in Figure 3.
  • a p x p block is divided into four p/2 x p/2 blocks, which are individually analyzed, and then each of the p/2 x p/2 blocks are broken down further into p/4 x p/4 blocks, which are individually analyzed, and so on.
  • the p/2 x p/2 background detector 12 retrieves a p/2 x p/2 image block 21 ' from the four possible image blocks 21 ' within the decomposed p x p image block 21 residing in the frame buffer 28 and subsequently analyzes it. Eventually, all of the p/2 x p/2 image blocks 21 ' are individually processed by the background detector 12.
  • the flag bit 32' is set at a logic high to thereby encode the p/2 x p/2 image block 21', and then the background detector 12 will retrieve another p/2 x p/2 image block 21' for analysis, until all four blocks of the p x p image block are exhausted.
  • the p/2 x p/2 image block 21' is forwarded to the next subsequent stage of the hierarchical source encoding system 10 for analysis, that is, to the motion encoder 13.
  • the p/2 x p/2 image block 21 ' is analyzed for motion.
  • the p/2 x p/2 motion encoder 13 comprises a compute-and-select mechanism 34 for initially receiving the p/2 x p/2 image block 21', a cache memory 36 having a modifiable set of motion vectors which are ultimately matched with the incoming image block 21', a T M threshold mechanism 38 for comparing the output of the compute-and-select mechanism 34 with a threshold T M , and a cache update mechanism 42 for updating the motion code vectors contained within the cache memory 36 based upon motion information received from the next subsequent hierarchical stage, or from a block matching encoder 14, as indicated by a reference arrow 43'.
  • the compute-and-select mechanism 34 attempts to match the incoming p/2 x p/2 image block 21 ' with a previously stored image block which is displaced to an extent in the frame buffer 28.
  • the displacement is determined by a motion vector corresponding to the matching previously stored image block.
  • motion vectors are two-dimensional integer indices, having a horizontal displacement dx and a vertical displacement dy, and are expressed herein as coordinate pairs dx, dy.
  • Figure 5 graphically illustrates movement of the p/2 x p/2 image block 33 within the frame buffer 28 from a previous position, indicated by dotted block 46, to a present position, denoted by dotted block 48.
  • the displacement between positions 46 and 48 can be specified by a two-dimensional displacement vector (dx1, dy1).
  • the compute-and-select mechanism 34 compares the current image block 21', which is displaced by (dxi, dyi), with the set of previously stored image blocks having code vectors in the modifiable set {(dx0, dy0), (dx1, dy1), ..., (dxn, dyn)} within the cache memory 36.
  • the code vectors have cache indices 0 through n corresponding with (dx0, dy0), (dx1, dy1), ..., (dxn, dyn). From the comparison between the current image block 21' and the previously stored image blocks, a minimum distortion code vector dmin(dxi, dyi) is generated. The foregoing is accomplished by minimizing the following equation:
  • the minimum distortion code vector dmin(dxi, dyi) is forwarded to the threshold mechanism 38, where the distortion of the minimum distortion motion vector dmin(dxi, dyi) is compared with the threshold. If the distortion of the minimum distortion motion vector is less than the threshold, then the flag bit 52' is set at a logic high and the cache index 53', as seen in Figure 6, associated with the minimum distortion motion vector dmin(dxi, dyi) is output from the compute-and-select mechanism 34. Hence, the image block 21' is encoded by the flag bit 52' and the cache index 53'.
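  • The minimized equation itself is not reproduced in this text; as an editorial illustration only, the sketch below assumes a sum-of-absolute-differences distortion over the candidate displacements held in the cache and returns the index of the best candidate together with its distortion, which the caller would compare against T M. All names are hypothetical, and bounds checking on the displaced positions is omitted.
```c
#include <limits.h>
#include <stdlib.h>

typedef struct { int dx, dy; } motion_vec_t;

/* Search the small cache of candidate motion vectors for the one whose
   displaced block in the previously coded frame best matches the current
   block at (bx, by). Distortion is a sum of absolute differences. */
int cache_motion_search(const unsigned char *cur, const unsigned char *prev,
                        int stride, int bx, int by, int bsize,
                        const motion_vec_t *cache, int cache_len,
                        long *min_dist_out)
{
    int best_index = 0;
    long best_dist = LONG_MAX;

    for (int i = 0; i < cache_len; i++) {
        long dist = 0;
        int px = bx + cache[i].dx;          /* displaced position in the   */
        int py = by + cache[i].dy;          /* previously coded frame      */
        for (int r = 0; r < bsize; r++)
            for (int c = 0; c < bsize; c++)
                dist += labs((long)cur[(by + r) * stride + bx + c] -
                             (long)prev[(py + r) * stride + px + c]);
        if (dist < best_dist) {
            best_dist = dist;
            best_index = i;                 /* cache index to send on a hit */
        }
    }
    *min_dist_out = best_dist;              /* compared against T_M by caller */
    return best_index;
}
```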
  • the flag bit 52' is maintained at a logic low and the p/2 x p/2 image block 21 ' is forwarded to the next stage, that is, to the p/2 x p/2 block matching encoder 14 for further analysis.
  • the cache update mechanism 42 updates the cache memory 36 based on cache hit and miss information, i.e., whether the flag bit 52' is set to a logic high or low.
  • the cache update mechanism 42 may use any of a number of conventionally available cache stack replacement algorithms. In essence, when a hit occurs, the cache memory 36 is reordered, and when a miss occurs, a motion vector which is ultimately determined by the block matching encoder 14 in the next hierarchical stage is added to the cache memory 36. When the motion vector is added to the cache memory 36, one of the existing entries is deleted.
  • LRU least-recently-used
  • LFU least frequently used
  • FIFO first- in-first-out
  • the stack replacement algorithm re-orders the index stack in the event of a cache hit, and in the event of a cache miss, an index is deleted and another index is inserted in its place.
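  • A minimal sketch of one such policy, least-recently-used (LRU) stack replacement, is given below; it is only an illustration of the behaviour described above, and the function and type names are assumed. On a hit the matched entry moves to the top of the stack; on a miss the vector supplied by the block matching encoder is pushed on top and the bottom entry falls out.
```c
#include <string.h>

typedef struct { int dx, dy; } motion_vec_t;

/* LRU stack maintenance for the motion-vector cache (cache_len >= 1). */
void lru_update(motion_vec_t *cache, int cache_len,
                int hit, int hit_index, motion_vec_t new_vec)
{
    if (hit) {
        motion_vec_t v = cache[hit_index];
        memmove(&cache[1], &cache[0], hit_index * sizeof(motion_vec_t));
        cache[0] = v;                              /* re-order on a cache hit */
    } else {
        memmove(&cache[1], &cache[0], (cache_len - 1) * sizeof(motion_vec_t));
        cache[0] = new_vec;   /* vector found by the block matching encoder;  */
    }                         /* the least recently used entry is discarded   */
}
```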
  • the block matching encoder 14 which subsequently receives the p/2 x p/2 image block 21 ' in the event of cache miss, employs any conventional block matching encoding technique.
  • suitable block matching encoding techniques are a full search and a more efficient log (logarithm) search, which are both well known in the art.
  • the block matching encoder 14 comprises a block matching estimation mechanism 56 for comparing the incoming p/2 x p/2 image block 21 ' with image blocks in the previously stored image frame of the frame buffer 28. If a log search is employed, a three level approach is recommended. In other words, a predetermined set of blocks is first searched and analyzed. A best fit block is selected.
  • a select mechanism 8 selects the motion vector with minimum distortion, or dmin(dx,dy), from the distortion vectors generated by equation (4) and forwards the minimum distortion vector dmin(dx,dy) to the cache memory 36 (Figure 5) of the motion encoder 13 (previous stage) for updating the cache memory 36.
  • the minimum distortion vector dmin(dx,dy) is then compared with a predetermined threshold T M in a T M threshold mechanism 64. If the minimum distortion vector dmin(dx,dy) is greater than the predetermined threshold T M , then the flag bit 65' is maintained at a logic low, and the system 10 proceeds to the next hierarchical stage, that is, the p/4 x p/4 (8 x 8 pixels in this example) motion encoder 15 for further analysis.
  • the flag bit 65' is set at a logic high and is output along with the minimum distortion vector dmin(dx,dy), as indicated by reference arrow 66'.
  • the p/2 x p/2 image block 21' is encoded by a flag bit 65' and the minimum distortion vector dmin(dx,dy) 66' and the system 10 proceeds back to the background detector 12, where another p/2 x p/2 image block 21' is retrieved, if available, and processed.
  • the p/4 x p/4 motion encoder 15 decomposes the p/2 x p/2 image block 21 ' into preferably four p/4 x p/4 image blocks 21" through selective scanning via the conventional quadtree technique, as illustrated and previously described relative to Figure 3.
  • the p/4 x p/4 motion encoder 15 could encode the p/4 x p/4 image block 21", as indicated by reference arrow 54", with a flag bit 52 set to a logic high and a cache index 53".
  • the p/4 x p/4 block matching encoder 16 could encode the p/4 x p/4 image block 21", as indicated by reference arrow 67", with a flag bit 65" set to a logic high and a minimum distortion motion vector 66".
  • the p/8 x p/8 motion encoder 17 decomposes the p/4 x p/4 image block 21" into preferably four p/8 x p/8 image blocks 21" through selective scanning via the conventional quadtree technique, as illustrated and previously described relative to Figure 3.
  • the p/8 x p/8 motion encoder 17 could encode the p/8 x p/8 image block 21", as indicated by a reference arrow 54", with a flag bit 52" set to a logic high and a cache index 53".
  • the p/8 x p/8 block matching encoder 18 could encode the p/8 x p/8 image block 21", as indicated by reference arrow 67", with a flag bit 65" set to a logic high and a minimum distortion motion vector 66".
  • the block encoder may be a vector quantizer, a transform encoder, a subband encoder, or any other suitable block encoder. Transform encoding and subband encoding are described in the background section hereinbefore.
  • entropy encoding may be employed in a vector quantizer 19 to further enhance data compression.
  • indices within the cache 36 may be entropy encoded in order to further enhance data compression.
  • the statistics of an occurrence of each cache index are considered, and the number of bits for encoding each index may vary depending upon the probability of an occurrence of each index.
  • the p x p image block is analyzed for motion by a motion encoder as described above for a p/2 x p/2 image block. Similarly, as described above, the p x p image block may be forwarded to a block matching encoder for further analysis.
  • a new image frame must be created in the frame buffer 28 of the present invention. This situation can occur, for example, when the hierarchical source encoding system 10 is first activated or when there is a scene change in the video sequence.
  • a novel vector quantizer 68 shown in Figure 9 has been developed using the novel cache memory principles.
  • the cache vector quantizer 68 comprises a large main VQ codebook 69 which is designed off-line and a small codebook kept in a cache memory 72 whose entries are selected on-line based on the local statistics of the image being encoded. Similar to the cache memory 36 of the motion encoders 13, 15, 17, the cache memory 72 is replenished with preferably the stock algorithms LRU, LFU, FIFO, or a novel adaptive working set model algorithm, which will be discussed in further detail later in this description.
  • the cache vector quantizer 68 includes the large VQ codebook 69, the cache memory 72, a compute-and-select mechanism 74 for receiving the incoming p x p image block 76 and for comparing the image block 76 to code vectors in the cache memory 72, a T c threshold mechanism 78 for determining whether the minimum distortion vector is below a predetermined threshold T c , and a cache update mechanism 82 which utilizes a cache stack replacement algorithm for updating the cache memory 72. More specifically, the compute-and-select mechanism 74 performs the following equations:
  • the T c threshold mechanism 78 determines whether the minimum distortion is below the threshold T c . If so, then a flag bit 84 is set at a logic high indicating a cache hit, and the flag bit 84 is output along with the VQ address 86, as indicated by a reference arrow 88. Alternatively, if the minimum distortion is greater than or equal to the threshold T c , indicating a cache miss, then the main VQ codebook 69 is consulted.
  • the main VQ codebook 69 is set up similar to the small codebook within the cache memory 72, and compares entries to determine a minimum distortion as with the small codebook. It should be noted, however, that other well known VQ methods can be implemented in the main VQ codebook 69. As examples, the following architectures could be implemented: (1) mean removed VQ, (2) residual VQ, and (3) gain/shape VQ.
  • the cache update mechanism 82 implements a stack replacement algorithm for the cache memory 72.
  • the cache update mechanism 82 can perform a novel adaptive working set model algorithm described hereafter.
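  • A hedged sketch of this two-level lookup follows; it is an editorial illustration rather than the patent's code, assumes a squared-error distortion and a simple move-to-front replacement, and uses invented names (cache_vq_encode, DIM, and so on).
```c
#include <string.h>

#define DIM 16  /* block treated as a 16-dimensional vector (4x4 assumed) */

static double sq_err(const double a[DIM], const double b[DIM])
{
    double d = 0.0;
    for (int k = 0; k < DIM; k++) { double t = a[k] - b[k]; d += t * t; }
    return d;
}

/* Try the small cache codebook first; on a miss, fall back to the large main
   codebook and copy the winning code vector onto the top of the cache.
   Returns the index to transmit; *hit_flag is the 1-bit hit/miss flag. */
size_t cache_vq_encode(const double block[DIM],
                       double cache_cb[][DIM], size_t cache_len,
                       const double main_cb[][DIM], size_t main_len,
                       double t_c, int *hit_flag)
{
    size_t best = 0; double best_d = -1.0;
    for (size_t i = 0; i < cache_len; i++) {
        double d = sq_err(block, cache_cb[i]);
        if (best_d < 0.0 || d < best_d) { best_d = d; best = i; }
    }
    if (best_d < t_c) { *hit_flag = 1; return best; }   /* short cache index */

    *hit_flag = 0;                                      /* cache miss        */
    size_t mbest = 0; double md = -1.0;
    for (size_t i = 0; i < main_len; i++) {
        double d = sq_err(block, main_cb[i]);
        if (md < 0.0 || d < md) { md = d; mbest = i; }
    }
    /* Replacement: push the winning main-codebook vector onto the cache top. */
    memmove(&cache_cb[1], &cache_cb[0], (cache_len - 1) * sizeof(cache_cb[0]));
    memcpy(cache_cb[0], main_cb[mbest], sizeof(cache_cb[0]));
    return mbest;                    /* full main-codebook address is sent    */
}
```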
  • entropy encoding, transform encoding (for example, discrete cosine transform (DCT) encoding), and subband encoding may be employed in the cache vector quantizer 68 to further enhance data compression.
  • the cache memory is simply a list of the unique code vectors that occur during the near past [t - T + 1, t], where parameter T is known as the window size.
  • the parameter T can be a function of time.
  • the ultimate cache size corresponds to the number of unique code vectors within the time interval.
  • a two-dimensional causal search window 91 can be defined, as illustrated in Figure 10, which conceptually corresponds to the time interval of the working set model.
  • the memory space in the cache memory 72 ( Figure 9) is defined as all the blocks within the causal search window 91.
  • the causal search window 91 is shown having a size W equal to three rows of blocks 92, each with a block size p x p of 4 x 4.
  • the resulting code vector for encoding each image block x(t) is the previous block in the causal search window 91 that yields the minimum distortion, such as the minimum mean squared error ("MSE") distortion.
  • MSE minimum mean squared error
  • One of the major advantages of the working set model for computer memory design over other replacement algorithms is the ability to adapt the size of the causal search window 91 based on the average cache miss frequency, which is defined as the rate of misses over a short time interval.
  • Different window sizes are used to execute different programs, and the working set model is able to allocate memory usage effectively in a multiprogramming environment.
  • the adaptive working set model of the present invention uses the foregoing ideas to implement variable-rate image encoding. Different regions of an image require different window sizes. For example, an edge region may require a much larger search window 91 than that of shaded regions. Accordingly, a cache-miss distance is defined based on the spatial coordinates of each previously-encoded miss-block in the causal window 91 relative to the present block being encoded.
  • NfW ⁇ r.c the set of indices of the miss-blocks where W f is the window size used to estimte the cache-miss distance.
  • the spatial coordinates of the set N((W- > r,c) are illustrated in Figure 11 as the shaded blocks 93.
  • the average cache-miss frequency is defined as the summation of the reciprocal of each cache-miss distance within the search window:
  • the window size W at a given time t is updated according to the following equation:
  • a and B are two pre-defined constants, and W max is the pre-defined maximum allowed window size.
  • the size of the causal search window 91 is manipulated depending upon the miss frequency. As misses increase or decrease, the window 91 respectively increases or decreases.
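  • The two equations referred to above are not reproduced in this text, so the sketch below only illustrates the described behaviour under assumed forms: the miss frequency is taken as a sum of reciprocal cache-miss distances, and the window grows with that frequency up to the allowed maximum. All names, constants, and the exact formulas are assumptions, not the patent's equations.
```c
#include <math.h>

/* Sum the reciprocal spatial distances from the current block (r, c) to the
   recent miss-blocks: nearby misses raise the estimated miss frequency more. */
double cache_miss_frequency(const int *miss_r, const int *miss_c, int n_miss,
                            int r, int c)
{
    double f = 0.0;
    for (int i = 0; i < n_miss; i++) {
        double dist = hypot((double)(r - miss_r[i]), (double)(c - miss_c[i]));
        if (dist > 0.0)
            f += 1.0 / dist;
    }
    return f;
}

/* Assumed form of the window update: grow linearly with the miss frequency
   using constants a and b, but never exceed the maximum window size w_max. */
int next_window_size(double a, double b, double miss_freq, int w_max)
{
    int w = (int)(a + b * miss_freq);
    return (w > w_max) ? w_max : w;
}
```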
  • redundant blocks within the causal search window 91 are preferably minimized or eliminated. More specifically, if a present block matches one or more previous blocks in the causal search window 91, then only a code vector representative of the offset to the most recent block is encoded for designating the present block. This aspect further enhances data compression.
  • Localized Scanning of Cache Indices
  • the method used to index the image blocks in a single image frame can also affect the average cache bit rate because the cache memory 72 updates its contents based entirely on the source vectors that have been encoded in the near past.
  • the indexing of image blocks may be based on a raster scan, as illustrated in Figure 12.
  • Figure 12 shows an example of a raster scan for scanning 256 image blocks.
  • the major problem with this method is that each time a new row begins, the inter-block locality changes rapidly.
  • because the cache size is very small relative to the number of vectors in a row, many cache misses will occur whenever the cache starts to encode the source vectors in a new row.
  • the image blocks are indexed for higher performance by providing more interblock locality.
  • a localized scanning technique, for example, the conventional Hilbert scanning technique, is utilized for indexing the image blocks of the image frame.
  • Figure 13 illustrates the Hilbert scanning technique as applied to the indexing of the image blocks of the image frame of the present invention for scanning 256 image blocks.
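  • One common way to generate such a Hilbert-curve visiting order for a power-of-two grid (for example the 16 x 16 grid of 256 blocks) is the standard distance-to-coordinates conversion sketched below; it is a well-known construction included only for illustration and is not taken from the patent.
```c
/* Convert a distance d along a Hilbert curve into (x, y) block coordinates
   on an n x n grid, where n is a power of two (n = 16 gives a 256-block
   ordering). Consecutive values of d visit spatially adjacent blocks. */
void hilbert_d2xy(int n, int d, int *x, int *y)
{
    int rx, ry, t = d;
    *x = *y = 0;
    for (int s = 1; s < n; s *= 2) {
        rx = 1 & (t / 2);
        ry = 1 & (t ^ rx);
        if (ry == 0) {                        /* rotate the quadrant          */
            if (rx == 1) {
                *x = s - 1 - *x;
                *y = s - 1 - *y;
            }
            int tmp = *x; *x = *y; *y = tmp;  /* swap x and y                 */
        }
        *x += s * rx;
        *y += s * ry;
        t /= 4;
    }
}
/* Scanning blocks in the order hilbert_d2xy(16, 0), hilbert_d2xy(16, 1), ...
   keeps consecutively indexed blocks spatially close, preserving the
   inter-block locality that the cache relies on. */
```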
  • the number of encoders used during the encoding process may be adaptively changed to improve coding efficiency.
  • the adaptation is based upon the statistical nature of the image sequence or the scene-change mean squared error ("MSE"). Coding efficiency is improved by assuring that every level of the hierarchy has a coding gain of greater than unity. Since there are overhead bits to address the hierarchical structure (the more levels the more bits required), it is possible to have a level that has coding gain less than unity. For example, in many experiments it was seen that during rapid moving scenes, the 4x4 motion coder has a coding gain less than unity and is thus replaced by spatial coders.
  • a scalar cache may be used to encode the mean of the blocks in the different levels of the hierarchy.
  • After the video encoder/decoder, sometimes referred to as a "codec," has performed video compression through the reduction of temporal redundancy, further gains in compression can be made by reducing the spatial redundancy.
  • Two such methods include: a.) A "working set model" which looks for similar areas (matching blocks) of video elsewhere in the same frame. The working set model looks in a small portion of the frame near the current area being encoded for a match since areas spatially close to one another tend to be similar. This relatively small search window can then be easily coded to provide a high compression ratio.
  • the window size can also be varied according to the image statistics to provide additional compression.
  • b.) A scalar cache containing the mean of previously coded areas can be used to encode the current area's mean value. Again, this takes advantage of areas spatially close to one another tending to be similar. If an area can be encoded with low enough distortion (below a threshold) by using its mean value, then a scalar cache could save encoding bits.
  • an overall scene change MSE is performed. This is to determine how much motion may be present in an image.
  • when the coder reaches the level of 4x4 blocks, if the MSE is below a threshold, the image is encoded with a structure that finds the 4x4 motion vectors from a search of the motion cache or through a three step motion search. On the other hand, if the MSE is above the threshold, the 4x4 motion vectors are found through one of the following:
  • Each 4x4 block is coded by its mean and a mean-removed VQ.
  • the structure is adapted depending upon image content.
  • Appendix 2 contains high level pseudo code that describes the encoding and decoding algorithm.
  • Appendix 3 contains pseudo code related to the encoding process and buffer management shown in Appendix 5.
  • Appendix 5 contains pseudo code of the encoding and decoding algorithm in a C programming language style.
  • the HVQC algorithm had the cache-miss flag located in one particular position in the cache.
  • the presently preferred embodiment allows the cache-miss flag position to vary within the cache so that it may be more efficiently encoded through techniques such as entropy encoding.
  • U.S. Patent No. 5,444,489 describes a hierarchical vector quantization compression (“HVQC") algorithm.
  • the HVQC algorithm may be applied to an image signal using a parallel processing technique to take advantage of high bit rate transmission channels.
  • HVQC algorithm One of the basic strengths of the HVQC algorithm is its adaptability to higher bit rate transmission channels. Users of these channels typically require high performance in terms of image size, frame rate, and image quality.
  • the HVQC algorithm may be adapted to these channels because the computational power required to implement the algorithm may be spread out among multiple processing devices.
  • a preferred parallel processing architecture is shown schematically in Figure 14A.
  • a video front end 94 receives a video input signal 95.
  • the video front end 94 may, for example, be a device that converts an analog signal, such as a video signal in NTSC, PAL, or S-VHS format, into a digital signal, such as a digital YUV signal.
  • the video front end 94 may receive from a video source a digital signal, which the video front end may transform into another digital format.
  • the video front end 94 may be implemented using a Philips SAA7110A and a Texas Instruments TPC 1020A.
  • the video front end 94 provides digitized video data to a first processor 96, where the digitized video data is preprocessed (i.e.
  • the first processor 96 transmits the data corresponding to an image frame to a second processor 97 and a third processor 98.
  • the first processor 96 communicates with the second processor 97 and the third processor 98 through a direct memory access (“DMA") interface 99, shown in Figure 14A as a memory and a bus transceiver.
  • DMA direct memory access
  • the first processor 96 uses most of its available MIPs to perform, for example, pre and post filtering, data acquisition and host communication.
  • the first processor 96 uses less than 30% of its available MIPs for decoding functions.
  • the first processor 96, the second processor 97 and the third processor 98 may be off-the-shelf digital signal processors.
  • a commercially available digital signal processor that is suitable for this application is the Texas Instruments TMS320C31 ("C31") floating point digital signal processor.
  • a fixed point digital signal processor may alternatively be used.
  • general purpose processors may be used, but at a higher dollar cost.
  • the first processor 96 operates as a "master” and the second and third processors 97 and 98 operate as “slaves.”
  • the second and third processors 97 and 98 are dedicated to the function of encoding the image data under the control of the first processor 96.
  • the HVQC algorithm may be implemented with multiple processing devices, such as the processor architecture shown in Figure 14A, by spatially dividing an image into several sub-images for coding.
  • unequal partitioning of the image among the various processing devices may be used to share the computational load of encoding an image.
  • the partitioning is preferably accomplished by dividing the image into several spatial segments. This process is known as boundary adaptation and allows the various processing devices to share the computational load instead of each having to be able to handle a seldom seen worst case computational burden.
  • the system may allocate more bits for encoding the image to the processor encoding the portion of the video with more activity. Therefore, this encoder may keep trying to encode at a high quality level since it would have more bits to encode with. This technique would not, however, reduce the computational requirements as does the preferred embodiment.
  • the DMA interface 99 is utilized to write image data to the second and third processors 97 and 98, exchange previously coded image data, and receive encoded bit streams from both the second and third processors 97 and 98. Data passing among the processors 96, 97 and 98 may be coordinated with a communication system residing on the master processor 96.
  • the first processor 96 is also coupled to a host processor (not shown), such as the CPU of a personal computer, via a PC bus 100, an exchange register and a converter 101.
  • the converter 101 preferably includes a field programmable gate array ("FPGA"), which produces, in response to decoded data provided by the first processor 96, 16 bit video data in RGB format for the personal computer to display.
  • the converter 101 may include an FPGA that produces an 8 bit video data signal with an EPROM for color look-up table.
  • a commercially available FPGA that is suitable for this application is available from Xilinx, part no. XC3090A. FPGAs from other manufacturers, such as Texas Instruments, part no. TPC1020A, may alternatively be used.
  • referring to Figure 14B, the operation of the parallel processing architecture shown in Figure 14A will be described for use in an environment in which the incoming signal is an NTSC signal and the output is in RGB format.
  • the architecture of a video encoder/decoder 103 preferably includes three digital signal processors, two for video encoding and one for video decoding, data acquisition, and control.
  • a high level view of this parallel processing architecture is shown in Figure 14B.
  • a master/decoder processor 102 acquires digitized video data from a video interface 104, performs the pre- and post-processing of the video data, provides a host interface 106 (for bit-stream, control, and image information), decodes the incoming video bit-stream, and controls two slave encoder processors 108 and 110.
  • the interaction between the master processor 102 and the slave encoders, 108 and 110, is through a direct memory access interface, such as the DMA interface 99 shown in Figure 14A.
  • the master/decoder processor 102 corresponds to the first processor 96 shown in Figure 14A.
  • the two slave encoder processors 108 and 110 correspond to the second and third processors 97 and 98 shown in Figure 14A.
  • the master processor 102 puts the slave encoder processors 108 and 110 into hold and then reads/writes directly into their memory spaces. This interface is used for writing new video data directly to the slave encoder processors 108 and 110, exchanging previously coded image ("PCI") data, and accepting encoded bit streams from both slave encoder processors 108 and 110.
  • PCI previously coded image
  • the master processor 102 uses the YUV-to-RGB converter FPGA, as shown in Figure 14A, to produce 8-bit (with a separate EPROM for the color look-up table), 15-bit RGB, or 16-bit RGB video data for the PC to display.
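  • The FPGA's exact arithmetic is not given in this text; purely as an illustration of the kind of conversion involved, the sketch below applies standard fixed-point BT.601-style coefficients and packs the result into 16-bit RGB565. The function name and coefficient scaling are assumptions.
```c
#include <stdint.h>

static uint8_t clamp8(int v) { return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v); }

/* Standard fixed-point YUV-to-RGB conversion (BT.601-style coefficients
   scaled by 256), packed into a 16-bit RGB565 pixel for display. */
uint16_t yuv_to_rgb565(uint8_t y, uint8_t u, uint8_t v)
{
    int c = (int)y;
    int d = (int)u - 128;
    int e = (int)v - 128;

    uint8_t r = clamp8(c + ((359 * e) >> 8));            /* + 1.402 * V          */
    uint8_t g = clamp8(c - ((88 * d + 183 * e) >> 8));   /* - 0.344 U - 0.714 V  */
    uint8_t b = clamp8(c + ((454 * d) >> 8));            /* + 1.772 * U          */

    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```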
  • High Level Encoder/Decoder Description
  • This section prioritizes and describes the timing relationships of the various tasks that make up the video encoder/decoder 103 and also describes the communication requirements for mapping the video encoder/decoder 103 into three processors.
  • the tasks will be grouped according to the various software systems of the video encoder/decoder 103. These software systems are:
  • Each system is responsible for one or more distinct tasks and may reside on a single processor or, alternatively, may be spread over multiple processors. Data passing between processors is coordinated with a communication system residing on the master processor 102.
  • the video input, video output, encoder, and decoder systems operate cyclically since they perform their task (or tasks) repeatedly.
  • the video input system has two aspects. First, a video image is collected. The video image is then processed (filtered) to spatially smooth the image and passed to the slave encoder processors 108 and 110. To maximize efficiency, the image may be transferred to the slave encoder processors 108 and 110 as it is filtered.
  • the master processor 102 receives data via the video interface 104.
  • the video interface or video front end 104 includes an NTSC to digital YUV converter, for example.
  • the chrominance components (UV) of the YUV data are preferably sub-sampled by a factor of two, horizontal and vertical, relative to the luminance component (Y), which is a 4:1:1 sampling scheme.
  • NTSC provides video data at a rate of 60 fields per second. Two fields (an odd and an even field) make up a single video frame. Therefore, the frame rate for NTSC is actually 30 frames per second. Two sampling rates seem to be prevalent for digitizing the pixels on a video scan line: 13.5 Mhz and 12.27 Mhz.
  • the former gives 720 active pixels per line while the latter gives 640 active (and square) pixels per line.
  • the target display uses a 640x480 format, therefore the 12.27 Mhz sampling rate is more appropriate and provides more time to collect data. The following assumes a sampling rate of 12.27 MHz and that one must collect 176 pixels of 144 or 146 lines of the image (QCIF format).
  • the hardware and software described herein may be adapted to other input video formats and target display formats in a straightforward manner.
  • inter-field jitter may be avoided by collecting 176 samples on a line, skipping every other sample and every other line (i.e., taking every line in the given field and ignoring the lines of the other field). No noticeable aliasing effects have been detected with such a system, but anti-aliasing filters could be used to reduce aliasing.
  • the front end video interface 104 interrupts the master processor 102 only at the beginning of every line to be collected.
  • the video front end 104 will buffer 8 Ys and 4 Us and 4 Vs of video data at a time. Therefore, the master processor 102 preferably dedicates itself to video collection during the collection of the 176 samples that make up a line of video data. This may be accomplished using a high priority interrupt routine.
  • the line collection interrupt routine is non- interruptable.
  • the time to collect the 176 samples of video data is 28.6 microseconds (1/(6.14 Mhz) x 176 pixels).
  • the video input software system copies a portion of the collected data out of internal memory after the line has been collected. This will add about 8.8 microseconds to the collection of the odd lines of video data.
  • the average total time to collect a line of video data is therefore 33 microseconds plus interrupt overhead. Assuming an average of 3 microseconds for interrupt overhead, the average total time to collect a line of video data is 36 microseconds. Therefore, processor utilization during the image acquisition is 57% (36 microseconds/63.6 microseconds x 100). This constitutes about 8% (@ 15 frames/second) of the processor. Using this approach, the maximum interrupt latency for all other interrupts will be about 40 microseconds.
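  • The figures quoted above can be reproduced with the short calculation below (keeping every other sample at 12.27 MHz gives an effective 6.14 MHz pixel clock, and an NTSC line lasts about 63.6 microseconds). The assumption that the per-frame figure corresponds to 144 collected lines at 15 frames per second is editorial, inferred from the QCIF collection described earlier.
```c
#include <stdio.h>

int main(void)
{
    double collect_us  = 176.0 / 6.14;                  /* ~28.6 us per line     */
    double avg_line_us = collect_us + 8.8 / 2.0 + 3.0;  /* copy-out on odd lines
                                                           + ~3 us IRQ overhead  */
    double line_util   = avg_line_us / 63.6 * 100.0;    /* ~57% during a line    */
    double frame_util  = 144.0 * avg_line_us / 66666.0 * 100.0;  /* ~8% @ 15 fps */

    printf("collect: %.1f us, average line: %.1f us\n", collect_us, avg_line_us);
    printf("line utilization: %.0f%%, frame utilization: %.0f%%\n",
           line_util, frame_util);
    return 0;
}
```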
  • the image is preferably spatially lowpass filtered to reduce image noise.
  • only the Y component is filtered.
  • the input image filtering and transfer task will consume about 10.5 msec or 15.7% of a frame time.
  • U and V transfer will consume 340 microseconds or about 0.4% of the frame time.
  • the slave encoder processors 108 and 110 are in hold for a total of about 4msec or about 6% of a frame time.
  • the present invention provides methods to enhance the motion estimation process by means of pre-filtering where the pre-filtering is a variable filter whose strength is set at the outset of a set of coded images to reduce the number of bits spent on motion estimation.
  • a temporal and/or spatial filter may be used before encoding an image to prevent high frequency components from being interpreted as motion, which would require more bits to encode.
  • two, one-dimensional filters are used - one in the horizontal direction and one in the vertical direction.
  • other filters may be used, such as a two-dimensional filter, a median filter, or a temporal filter (a temporal filter would filter between two image planes - the one to be encoded and the previous one).
  • a combination of these filters could also be used for pre-filtering.
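  • For illustration only, the following is a minimal C sketch of the separable row-column pre-filter idea described above, applied to a QCIF luminance plane. The 1-2-1 kernel, the fixed filter strength, and the function names are assumptions made for the sketch; they are not the coefficients of the preferred embodiment.

    #include <stdint.h>
    #include <string.h>

    #define W 176   /* QCIF luminance width  */
    #define H 144   /* QCIF luminance height */

    /* One-dimensional 1-2-1 low-pass applied along rows.  Edge pixels are
     * copied unfiltered to keep the sketch short. */
    static void lowpass_rows(const uint8_t *src, uint8_t *dst)
    {
        for (int y = 0; y < H; y++) {
            dst[y * W] = src[y * W];
            dst[y * W + W - 1] = src[y * W + W - 1];
            for (int x = 1; x < W - 1; x++) {
                int s = src[y * W + x - 1] + 2 * src[y * W + x] + src[y * W + x + 1];
                dst[y * W + x] = (uint8_t)((s + 2) >> 2);   /* rounded divide by 4 */
            }
        }
    }

    /* The same 1-2-1 kernel applied along columns. */
    static void lowpass_cols(const uint8_t *src, uint8_t *dst)
    {
        memcpy(dst, src, W);                              /* first row unfiltered */
        memcpy(dst + (H - 1) * W, src + (H - 1) * W, W);  /* last row unfiltered  */
        for (int y = 1; y < H - 1; y++)
            for (int x = 0; x < W; x++) {
                int s = src[(y - 1) * W + x] + 2 * src[y * W + x] + src[(y + 1) * W + x];
                dst[y * W + x] = (uint8_t)((s + 2) >> 2);
            }
    }

    /* Separable spatial pre-filter: row pass followed by column pass. */
    void prefilter_y(const uint8_t *in, uint8_t *out, uint8_t *tmp)
    {
        lowpass_rows(in, tmp);
        lowpass_cols(tmp, out);
    }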
  • the filter strength is set at the outset of each image macro block depending upon how much motion is present in a given macroblock. High frequency components in the video image may arise from random camera noise and sharp image transitions such as edges.
  • the master processor 102 may post- filter the encoded image signal.
  • a temporal and/or spatial filter may be used to reduce high frequency artifacts introduced during the encoding process.
  • a variable strength filter based upon the hit-ratio of the predictors in the working set model is used.
  • the filter strength is variable across the whole image.
  • Post-filtering may be implemented using two, one-dimensional filters - one in the horizontal direction and one in the vertical direction. A median, temporal, or combination of filter types could also be used.
  • the image is in a packed format in the slave encoder processors' memory and is unpacked by the slave encoder processor once it receives the image.
  • the encoding system is distributed over the two slave encoder processors 108, 110 with the resulting encoded bit streams being merged on the master processor 102.
  • the encoding system described herein allows interprocessor communication to synchronize and coordinate the encoding of two portions of the same input image. Once both the slave encoder processors 108 and 110 are ready to encode a new image, they will receive the image from the master processor 102. The slave encoder processors 108 and 110 then determine whether a scene change has occurred in the new image or whether the boundary needs to be changed. This information is then communicated between the two slave encoder processors.
  • if the boundary is to be changed, previously coded image (pci or PCI) data must be transferred from one slave encoder processor to the other. Boundary change is limited to one macro row per frame time, where a macro row contains 16 rows of the image (16 rows of Y and 8 rows of U and V). In addition to the boundary change, extra PCI data needs to be exchanged to eliminate edge effects at the boundary line. This involves sending at least four additional rows from each slave encoder processor to the other.
  • the exchange of PCI data for a boundary change will take a maximum of 0.3% of a frame time for packed data (((4 cycles / 4 pixels) x (16 rows Y x 176 pixels + 8 rows U x 88 pixels + 8 rows V x 88 pixels) x 0.05 microseconds/cycle) / (66666 microseconds/frame) x 100%) and 1.3% of a frame time for byte-wide data.
  • the slave encoder processors 108, 110 use the unpacked format.
  • after the exchange of PCI data, the slave encoder processors encode their respective pieces of the video image. This process is accomplished locally on each slave encoder processor, with the master processor 102 removing the encoded bits from the slave encoder processors 108, 110 once per macro row.
  • a ping-pong buffer arrangement allows the slave encoder processors 108, 110 to transfer the previously encoded macro row to the master while they are encoding the next macro row.
  • the master processor 102 receives the bits, stores them in an encoder buffer and ultimately transfers the encoded bit stream data to the PC host processor.
  • the control structure of the encoder is demonstrated in the pseudo code at Appendix 2, Appendix 3 and Appendix 5.
  • a distortion measure for each image to be encoded may be used to determine whether or not an image is to be encoded as a still image (when the video encoder can no longer produce acceptable results) or with respect to the previously coded image (video encoder can still produce an acceptable image).
  • the slave encoder processors 108, 110 calculate an MSE for each 8x8 luminance (i.e. Y component, no chroma calculation) block. The sum of these MSEs is compared to a threshold to determine if a scene change has occurred.
  • the background detector takes the sum of four 8x8 MSE calculations from the scene change test and uses that to determine whether a macroblock is a background macroblock. 8x8 background detection is done by looking up the appropriate block's MSE determined during the scene change test. Since the lowest level of the hierarchy and the still image level of the algorithm produce identical results, the use of a common VQ code and codebook for the still image encoder and the lowest hierarchical level of the encoder can save system size, complexity, and cost.
  • each slave encoder processor calculates a scene change distortion measure, in the form of an MSE calculation, for the portion of the image that it is responsible for encoding.
  • the slave encoder processor 108 then sums up all of the scene change MSEs from each of the slave encoder processors 108 and 110 to generate a total scene change MSE.
  • the slave encoder processor 110 passes its MSE calculation to the slave encoder processor 108 where it is summed with the slave encoder processor 108 MSE calculation to generate the overall scene change MSE.
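  • As a sketch of the scene change test described above, the following C fragment accumulates the 8x8 luminance MSEs for one processor's portion of the image and compares the total against a threshold. The threshold value and the function names are illustrative assumptions, and the inter-processor exchange of the partial sums is omitted.

    #include <stdint.h>

    /* Mean squared error between an 8x8 block of the current image and the
     * co-located block of the previously coded image (luminance only). */
    static double block_mse_8x8(const uint8_t *cur, const uint8_t *pci,
                                int stride, int x0, int y0)
    {
        long acc = 0;
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++) {
                int d = cur[(y0 + y) * stride + x0 + x] -
                        pci[(y0 + y) * stride + x0 + x];
                acc += d * d;
            }
        return acc / 64.0;
    }

    /* Sum the 8x8 MSEs over this processor's portion of the image and compare
     * the total against a scene change threshold.  SCENE_CHANGE_THRESHOLD is
     * an illustrative placeholder, not a value taken from the text. */
    #define SCENE_CHANGE_THRESHOLD 50000.0

    int detect_scene_change(const uint8_t *cur, const uint8_t *pci,
                            int width, int height, double *mse8x8_out)
    {
        double total = 0.0;
        int i = 0;
        for (int y = 0; y < height; y += 8)
            for (int x = 0; x < width; x += 8, i++) {
                mse8x8_out[i] = block_mse_8x8(cur, pci, width, x, y);
                total += mse8x8_out[i];   /* per-block values kept for background tests */
            }
        return total > SCENE_CHANGE_THRESHOLD;   /* 1 => encode as a still image */
    }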
  • the three-step or the motion cache search processes are allowed to break out of the search loop (although the search process is not yet completed) because a "just good enough" match has been found.
  • the distortion of the best-match is compared with the threshold in each level (there are a total of three to four levels depending on the search window size).
  • the search process may be terminated if at any level there is a motion vector whose resulting distortion is less than the threshold.
  • the resulting distortion of the motion vector in each cache entry is compared with the threshold.
  • the search process may be terminated if there is any entry whose resulting distortion is less than the threshold.
  • the computational saving is significant in the motion cache search since the first entry is the most probable motion vector candidate and the method requires one MSE computation instead of L (L is the cache size) during a cache-hit. During a cache-miss, the computation increases slightly because more compares are required to implement the search process.
  • the bit-rate for a given frame may be reduced because the early cached motion vectors require fewer bits to encode, but this is sometimes offset by the bit-rate required by later frames because of the poorer previously coded image quality caused by the use of less precise motion vectors. Accordingly, the computational load of the encoder processors 108, 110 is reduced by thresholding each level of the three-step motion search and by thresholding each entry of the motion cache. This method may be applied to any block size at any level of the hierarchy.
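  • The following is a minimal C sketch of the thresholded motion cache search described above: entries are tested most probable first, and the loop exits as soon as a "just good enough" match is found. The cache size, the use of a SAD distortion measure, and the names are assumptions made for the sketch.

    #include <stdint.h>

    #define CACHE_SIZE 8   /* L, the number of cached motion vectors (assumed) */

    typedef struct { int dx, dy; } MotionVector;

    /* Distortion of a 16x16 block displaced by mv into the previously coded
     * image; the caller must ensure the displaced block lies inside pci. */
    static long block_sad_16x16(const uint8_t *cur, const uint8_t *pci,
                                int stride, int x0, int y0, MotionVector mv)
    {
        long sad = 0;
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++) {
                int d = cur[(y0 + y) * stride + x0 + x] -
                        pci[(y0 + y + mv.dy) * stride + x0 + x + mv.dx];
                sad += d < 0 ? -d : d;
            }
        return sad;
    }

    /* Search the motion vector cache, most probable entry first, and break
     * out of the loop as soon as an entry falls below the threshold.  On a
     * cache hit only one distortion computation is needed instead of L. */
    int cache_search(const uint8_t *cur, const uint8_t *pci, int stride,
                     int x0, int y0, const MotionVector *cache, long threshold,
                     MotionVector *best_mv, long *best_cost)
    {
        *best_cost = -1;
        for (int i = 0; i < CACHE_SIZE; i++) {
            long cost = block_sad_16x16(cur, pci, stride, x0, y0, cache[i]);
            if (*best_cost < 0 || cost < *best_cost) {
                *best_cost = cost;
                *best_mv = cache[i];
            }
            if (cost < threshold)
                return 1;      /* early termination: "just good enough" match */
        }
        return 0;              /* cache miss: fall back to the three-step search */
    }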
  • the threshold to encode a given image block is limited to a pre-defined range, [min_t, max_t].
  • min_t and max_t are therefore further adapted on either a macro row-by-macro row basis or a frame-by-frame basis to have an even finer control over the image quality as a function of the bit-rate.
  • the adaptation can be based on the scene-change MSE or on the image statistics of the previously coded frame.
  • the threshold at which a given image block is encoded may be controlled based upon the buffer fullness.
  • the video encoder/decoder 103 uses piece-wise linear functions whose threshold values increase linearly with the buffer fullness.
  • a distortion measure obtained by encoding an image area at a given level of the hierarchy may be compared against an adaptable threshold to enable a tradeoff between image quality, bit rate, and computational complexity. Increasing the thresholds will, in general, reduce the amount of computation, the bit rate, and the image quality. Decreasing the thresholds will, in general, increase the amount of computation, the bit rate, and the image quality.
  • Different threshold adaptation methods include: updating after each macro row, updating after each of the largest blocks in the hierarchy (such as 16x16 blocks), or updating after every block of any size.
  • Adaptation strategy can be based on a buffer fullness measure, elapsed time during the encoding process, or a combination of these.
  • the buffer fullness measure tells the encoder how many bits it is using to encode the image. The encoder tries to limit itself to a certain number of bits per frame (channel bit rate/frame rate).
  • the buffer fullness measure is used to adapt the thresholds to hold the number of bits used to encode an image to the desired set point. There is a timer on the DSP, and the system checks the time at the beginning and end of the encoding process. As an image is encoded, the number of bits used to encode it must be regulated. This can be done by adapting the hierarchical decision thresholds.
  • an image area may be encoded at a lower level of the hierarchy, but the gain in image quality is not significant enough to justify the additional bits required to encode it.
  • the determination as to which level of the hierarchy to encode an image can be made by comparing the distortion resulting from encoding an image area at different levels of the hierarchy to the number of bits required to encode the image area at each of those levels. The hierarchical level with the better bit rate-to-distortion measure is then the level used to encode that image area.
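  • As an illustration of the level selection just described, the sketch below compares two candidate hierarchy levels for one image area. The specific figure of merit used (distortion reduction per additional bit) is an assumption; the text above only requires that the level with the better bit rate-to-distortion measure be chosen.

    /* Candidate encoding of one image area at a particular level of the
     * hierarchy: the distortion it would leave and the bits it would cost. */
    typedef struct {
        double distortion;   /* e.g. MSE after encoding at this level */
        int    bits;         /* bits required to encode at this level */
    } Candidate;

    /* Pick between a higher (cheaper, coarser) and a lower (costlier, finer)
     * hierarchical level.  Returns 1 when the lower level should be used. */
    int choose_level(Candidate high, Candidate low, double min_gain_per_bit)
    {
        int    extra_bits = low.bits - high.bits;
        double gain       = high.distortion - low.distortion;

        if (extra_bits <= 0)
            return 1;   /* lower level is no more expensive, so take the quality */
        return (gain / extra_bits) >= min_gain_per_bit;
    }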
  • encoding of the chrominance (U and V) components of an image may be improved by encoding them independently of the luminance component.
  • the U and V components may also be encoded independently of one another. This may be achieved, for example, by encoding the chrominance components by using separate VQ caches and by using two-dimensional mean VQ.
  • the system can either share motion information between the luminance and chrominance components to save bits, but possibly at lower image quality, or use separate/independent components at the expense of more bits to achieve higher image quality.
  • the system can even do a partial sharing of motion information by using the luminance motion information as a starting point for performing a chrominance motion search.
  • the decision of whether or not to share is made at the outset by how much bandwidth (i.e., bit rate) is available. In addition, it may be less expensive to make a system that shares motion information since less memory and computation would be required than if the chrominance components had to be independently determined. Since spatially close image areas tend to be similar, it is preferable in a bit rate sense to encode these areas with similar mean values. To do this, the encoder may use various methods to compute the current image minimum and maximum mean quantizer values and then use these values to select a set of quantization levels for encoding the mean values of the variable sized image blocks. Basically this sets how many bits would be used to quantize a block with its mean value instead of a VQ value.
  • if the block is of low activity (i.e., low frequency content), then it may be better in a bit rate sense to encode it with its mean value rather than using a VQ entry.
  • Each block's mean value is then entered into the scalar cache. Pseudo code relating to this process is found at Appendix 2, Appendix 3 and Appendix 5.
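  • The following C sketch illustrates the mean-value encoding path described above: block means are computed, a set of quantization levels is selected from the current image's minimum and maximum mean values, and each block mean is quantized to the nearest level. The uniform level spacing and the 16-level count are assumptions for the sketch.

    #include <stdint.h>
    #include <stdlib.h>

    /* Mean of an NxN image block. */
    static int block_mean(const uint8_t *img, int stride, int x0, int y0, int n)
    {
        int sum = 0;
        for (int y = 0; y < n; y++)
            for (int x = 0; x < n; x++)
                sum += img[(y0 + y) * stride + x0 + x];
        return sum / (n * n);
    }

    /* Build a uniform set of quantization levels spanning the minimum and
     * maximum block means found in the current image. */
    #define MEAN_LEVELS 16

    void select_mean_quantizer(const int *means, int count, int levels[MEAN_LEVELS])
    {
        int lo = means[0], hi = means[0];
        for (int i = 1; i < count; i++) {
            if (means[i] < lo) lo = means[i];
            if (means[i] > hi) hi = means[i];
        }
        for (int i = 0; i < MEAN_LEVELS; i++)
            levels[i] = lo + (hi - lo) * i / (MEAN_LEVELS - 1);
    }

    /* Quantize one block mean to the nearest level; the index is what would
     * be sent (and entered into the scalar cache). */
    int quantize_mean(int mean, const int levels[MEAN_LEVELS])
    {
        int best = 0;
        for (int i = 1; i < MEAN_LEVELS; i++)
            if (abs(levels[i] - mean) < abs(levels[best] - mean))
                best = i;
        return best;
    }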
  • the master processor 102 acquires the encoded bit stream data from the PC host (or from the encoders directly when in a command mode) and decodes the bit stream into frames of YUV image data. Control of the encoder processors' bit rate generation is preferably achieved by sending the encoded bit streams to a processor, such as the processor 108, which feeds back rate control information to the encoder processors 108 and 110.
  • the HVQC algorithm allocates bits on an as-needed basis while it encodes an image. This can present some image quality problems since initially the encoder has many bits to use and therefore its thresholds may be artificially low. As coding progresses and the allocation of bits is used up, areas that require more bits to encode are unduly constrained.
  • the HVQC algorithm preferably examines the entire image data before the encoding of an image is begun to make an estimate of how many bits are to be used to code each area of the image. Thresholds may be determined not only by looking at how much distortion is present in the overall image as compared to the previously coded image, but also on a block by block basis and then adjusting the thresholds for those areas accordingly. Since the decoder processor must track the encoder processors, it is important that the decoder processor track even in the presence of channel errors. This can be done by initializing the VQ and motion caches to a set of most probable values and then re-initializing these caches periodically to prevent the decoder from becoming lost due to transmission errors.
  • the master processor 102 (i.e., the decoder) preferably uses a packed (4 pixels per 32-bit word) format to speed the decoding process. In a preferred embodiment, this process takes about 30% of a frame time (@ 15 frames/second) per decoded image frame.
  • the Y component is filtered and sent to the YUV to RGB converter.
  • the filter and transfer of the Y and the transfer of the U and V components takes 15 msec or 22.5% of the frame.
  • Pseudo code for the decoding operations is found at Appendix 2 and Appendix 5.
  • the most computationally intensive portions of the HVQC algorithm are its block matching computations. These computations are used to determine whether a particular block of the current image frame is similar to a similarly sized image block in the previously coded frame. The speed with which these calculations can be performed dictates the quality with which the encoding can be performed.
  • the preferred architecture can calculate multiple pixel distortion values simultaneously and in a pipelined fashion so that entire areas of the image may be calculated in very few clock cycles. Accelerator algorithms are employed to calculate the distortion measure used in the hierarchical algorithm.
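  • The accelerator hardware itself is not reproduced here, but the following C sketch mirrors its data layout: the image is held four packed 8-bit pixels per 32-bit word, and the squared-error contributions of all four lanes of a word are accumulated per loop iteration. The function name and the use of MSE (rather than another distortion measure) are assumptions.

    #include <stdint.h>

    /* Software analogue of the accelerated distortion computation over a
     * block stored as packed 32-bit words (4 pixels per word).  A real
     * accelerator would evaluate the four lanes in parallel; this sketch
     * only mirrors the data layout. */
    long packed_block_mse(const uint32_t *cur, const uint32_t *pci, int nwords)
    {
        long acc = 0;
        for (int i = 0; i < nwords; i++) {
            uint32_t a = cur[i], b = pci[i];
            for (int lane = 0; lane < 4; lane++) {
                int d = (int)((a >> (8 * lane)) & 0xff) -
                        (int)((b >> (8 * lane)) & 0xff);
                acc += d * d;
            }
        }
        return acc;   /* divide by 4*nwords for the per-pixel MSE */
    }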
  • the overall image quality may be improved by alternating the scanning pattern so that no image blocks receive preferential treatment.
  • the scanning pattern may correspond to scanning from the top of the image to the bottom or from the bottom to top.
  • the encoder can generate far more bits per image frame than the channel can handle for a given image frame rate. When this occurs, there can be a large amount of delay between successive image frames at the decoder due to transmission delays.
  • This delay is distracting. Consider the example of a first person listening to and viewing an image of a second person: the first person may hear the voice of the second person, but the second person's mouth does not move until later due to the delay created by a previous sequence.
  • One way to overcome this is to not encode more image frames until the transmission channel has had a chance to clear; latency is reduced by skipping frames to clear out the transmission channel after sending a frame that required more bits than the average allocated allotment.
  • the system does two types of frame skipping. First, immediately after encoding a still image it waits for the buffer to clear before getting a new image to encode, as shown in the pseudo code below:
  • Still_encode(&mac_hr); /* call subroutine to perform a still image encode */ /* Then wait until the buffer equals buffersize before encoding again */ waitUntilBufferFullness(BufferSize+(capacity*2)); /* Then skip two frames because the one old frame is already packed in the slave memory and the MD has a second old frame in its memory. Dropping two will get us a new frame */
  • NumFramesToSkip = 2; /* set how many frames to skip before acquiring a new one */ Second, when encoding video, a different method is used that is a little more efficient in that the processor does not have to wait for communications, but instead generates an estimate of how many frames to skip (based on how many bits were used to encode the last image) and then starts encoding the next frame after that, on the assumption that the buffer will continue emptying while the image is being collected and initially analyzed for a scene change. For instance, the following pseudo code demonstrates this process: /* now test to see if the previously coded image, pci, takes up too much channel capacity to be immediately followed by another image. The capacity or bits per frame is calculated from the bits/sec and frames/sec. In the capacity calculation there is a divide by 2 because there are two processors, each of which gets half the bits. */
  • /* Skip is an integer. The -1.0 is to downward bias the number of frames to be skipped. */
  • the software tells the video front end acquisition module when to capture an image for encoding by sending it a control signal telling it how many frames to skip (NumFramesToSkip).
  • the average allocated allotment of bits is determined by the channel bit rate divided by the target frame rate. It is calculated before each image is encoded so that it can be updated by the system.
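  • The following C sketch restates the frame-skip estimate from the pseudo code above: the per-processor capacity is the channel bit rate divided by the target frame rate and by two, and the skip count is the downward-biased ratio of the bits actually used to that capacity. The variable and function names are illustrative.

    /* Estimate how many input frames to skip after encoding a frame, so the
     * channel buffer can drain before the next frame is encoded. */
    int frames_to_skip(double channel_bits_per_sec, double target_frame_rate,
                       double bits_used_last_frame)
    {
        /* Per-processor bits available per frame: two slave encoders each
         * get half of the channel. */
        double capacity = channel_bits_per_sec / target_frame_rate / 2.0;

        /* Downward-biased estimate: a frame that used exactly its allotment
         * skips nothing. */
        double skip = bits_used_last_frame / capacity - 1.0;
        if (skip < 0.0)
            skip = 0.0;
        return (int)skip;   /* truncation keeps the bias conservative */
    }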
  • the post decoder spatial filter is responsible for filtering the Y image plane and transferring the Y, U, and V image planes to the host.
  • the filter operation preferably uses no external memory to perform the filter operations. This requires that the filter operation be done in blocks and the filtered data be directly transferred into a YUV to RGB FIFO. For this purpose, the block size is two lines.
  • the filter uses a four line internal memory buffer to filter the data.
  • the format of data for the YUV-to-RGB converter is shown in Figure 15.
  • Task Prioritization and Timing Relationships: Task prioritization and timing are preferably handled by the master processor 102 since it controls data communication. Except where otherwise noted, task prioritization and timing will refer to the master processor 102.
  • the highest priority task in the system is the collection of input image data. Since there is little buffer memory associated with the video front end 104, the master processor 102 must be able to collect the video data as it becomes available. Forcing this function to have the highest priority causes the next lower priority task to have a maximum interrupt latency and degrades system performance of the lower priority task. Slave processor communication is the next highest priority task.
  • the slave processors 108, 110 have only one task to perform: encoding the video data. To accomplish this task, the slaves must communicate a small amount of data between themselves and the video image must be transferred to them by the master processor 102. This communication of data must take place in a timely fashion so as to not slow the slave processors 108, 110 down.
  • next in the priority task list is the PC host communication.
  • the highest bandwidth data communicated between the master processor 102 and the PC host will be the RGB output data.
  • this data is double buffered through a FIFO. This allows the PC host the most flexibility to receive the data.
  • the remaining tasks are background tasks.
  • the background tasks will execute in a well-defined order. Once the input image is acquired, it will be filtered (pre-processed) and sent to the slave processors 108 and 110. The next background task will be to decode the bit stream coming from the PC host. Once this task is complete, the master processor 102 will post-process the image and transfer to the YUV-to-RGB FIFO as a foreground task.
  • the memory system has 128 kwords of 32-bit memory followed by 128 kwords of byte-wide memory that is zero filled from bit 9 to bit 23, with the remaining upper 8 bits floating.
  • YUV-to-RGB Conversion: An FPGA has been designed that implements the RGB conversion using two-bit coefficients. The design requires nearly all of the logic units of a Texas Instruments TPC1020. The functions of this subsystem are to take data from an input FIFO and to convert this data to 15 or 16-bit RGB output. An interface is provided to an 8-bit EPROM to support 256 color VGA modes. Further details of this process are found below in the sections titled "YUV to RGB Conversion" and "Video Teleconferencing," and in Appendix 6.
  • the pre- and post-filters are preferably row-column filters that filter the image in each dimension separately.
  • the filter implementation may use internal memory for buffering.
  • the internal memory is totally dynamic, i.e. it may be re-used by other routines. Less overhead will be incurred if the filter code is stored in internal memory (less than 100 words, statically allocated).
  • the pre-processing filter may be a five-phase process. First, the data must be input from the video front end 104. Second, the row-wise filtering is performed. This filtering assumes a specific data format that contains two pixels per 32-bit word. The third phase re-formats the output of the row-wise filter. The fourth phase performs the column-wise filter, and the fifth phase re-formats the data into whatever format the encoder will use.
  • Figure 16 shows the time for each phase assuming the encoder will be operating on byte packed data. This also assumes zero overhead for calling the function and function setup. It should be much less than one cycle per pixel (0.5 will be used in this discussion).
  • the last phase consumes time in both the master processor 102 and encoder processors 108, 110, which is where the actual data transfer from master to encoder takes place.
  • Total master processor 102 utilization for 7-bit data is then 29.7%.
  • the master processor 102 utilization is 32.5%.
  • the encoder processor utilization is 3.5% (data transmission).
  • the post processing may also be accomplished using a five step process.
  • the data must be reformatted from the packed pixel to a filter format.
  • Step two filters the data columnwise.
  • Step three reformats the data.
  • Step four filters the data row-wise.
  • Step five reformats and transmits the data to the YUV-to-RGB interface.
  • Figure 17 depicts the processor utilization (minus overhead) for these phases.
  • Total master processor 102 utilization for filtering 7-bit Y, U and V data is then 32.1%.
  • For 8-Bit data the master processor 102 utilization is 35.1%.
  • the output system is assumed to take data in the low byte at zero wait states. Master processor 102 utilization may be decreased by more than 7% if the YUV-to-RGB interface could take two Ys at a time and both the U and V at a time in the format shown in Figure 18.
  • master processor 102 utilization may be decreased by at least 3% if the YUV interface could take the 8-bit data from either bits 8-15 or 24-31 of the data bus as specified by the address.
  • YUV-to-RGB Conversion: This section summarizes the YUV-to-RGB conversion of image data for Host display conversion. This section assumes the YUV components are 8-bit linearly quantized values. It should be noted that a Philips data book defines a slightly compressed range of values for the digital YUV representation.
  • Approximate coefficients may be implemented with only hardware adder/subtractors, i.e., no multiplier, if the coefficients contain only two bits that are on (set to one). This effectively breaks the coefficient down into two one-bit coefficients with appropriate shifting. For lack of a better name, the rest of this section will call these coefficients "two-bit" coefficients.
  • the two-bit coefficient method requires no multiplier arrays.
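  • To illustrate the two-bit coefficient idea, the following C sketch converts one YUV pixel to RGB using coefficients that are each the sum of two powers of two, so that every multiply reduces to two shifts and an add. The particular coefficient approximations chosen here are illustrative; they are not the values implemented in the FPGA.

    #include <stdint.h>

    static int clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

    /* Convert one YUV pixel to RGB with "two-bit" coefficients.  The
     * approximations used (1.402 -> 1.5, 0.714 -> 0.75, 0.344 -> 0.375,
     * 1.772 -> 1.5) are placeholders.  Assumes arithmetic right shift of
     * negative values, as on typical compilers. */
    void yuv_to_rgb_2bit(int y, int u, int v, uint8_t *r, uint8_t *g, uint8_t *b)
    {
        int cb = u - 128;
        int cr = v - 128;

        int rr = y + cr + (cr >> 1);                 /* + 1.5   * Cr */
        int gg = y - ((cr >> 1) + (cr >> 2))         /* - 0.75  * Cr */
                   - ((cb >> 2) + (cb >> 3));        /* - 0.375 * Cb */
        int bb = y + cb + (cb >> 1);                 /* + 1.5   * Cb */

        *r = (uint8_t)clamp255(rr);
        *g = (uint8_t)clamp255(gg);
        *b = (uint8_t)clamp255(bb);
    }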
  • the video encoder system compares the mean squared error (MSE) between the current image and the last previously coded image to a threshold value.
  • the video encoder looks at the ratio of buffer fullness to buffer size to determine how to set the g parameter for adjusting the threshold value. For a single encoder system, this is modified to be approximately:
  • the following is a method for adapting thresholds in a way that will better allocate bits and computational power for single and multi-encoder systems.
  • the system does not compare the current block's MSE to the MSE of other blocks in the image, and therefore the system may spend a lot of time and bits trying to encode a given block when that time and those bits may have been better spent (proportionally) on other blocks with higher MSEs. This effect can be compensated for by using a block's MSE value to adjust its threshold level.
  • the MSEs for 8x8 blocks are readily available (stored in memory).
  • the MSEs for 16x16 blocks are available from the 8x8 blocks by adding four localized 8x8 blocks together - this step is already done during the encoding process.
  • the problem of how to classify image blocks can be addressed by having a fixed number of default MSE ranges (possibly derived from a series of test sequences) so that any given block's MSE value can be categorized with a lookup table. This could possibly be done at only the 16x16 block level so as to reduce the number of lookups to that required for the 99 blocks.
  • the 16x16 block level category could then be extended down to the lower levels of the hierarchy. If calculation power permits, then the lower hierarchical levels could do their own categorization on a block-by-block basis. Ideally, this block-by-block categorization would occur at all levels of the hierarchy. Each of these categories would then be used to modify the threshold levels.
  • a fixed number of default MSE ranges could be extended to more ranges, or even turned into a function, such as a piece-wise linear curve.
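  • A minimal sketch of the categorization described above is given below in C: a 16x16 block's MSE is mapped through a small table of default ranges, and the resulting category scales that block's thresholds so that high-MSE areas receive lower thresholds (and therefore more bits). The range boundaries, the scale factors, and the direction of the scaling are placeholders, not values from the preferred embodiment.

    /* Default MSE ranges and per-category threshold scale factors.  The
     * values below are illustrative; the text suggests deriving the ranges
     * from a series of test sequences. */
    #define NUM_MSE_RANGES 4

    static const double mse_range_upper[NUM_MSE_RANGES] = { 50.0, 200.0, 800.0, 1e30 };
    static const double threshold_scale[NUM_MSE_RANGES] = { 1.5, 1.25, 1.0, 0.75 };

    /* Look up which default range a 16x16 block's MSE falls into. */
    int categorize_block_mse(double mse16x16)
    {
        int c = 0;
        while (c < NUM_MSE_RANGES - 1 && mse16x16 > mse_range_upper[c])
            c++;
        return c;
    }

    /* Scale a block's decision threshold by its category: low-MSE blocks get
     * larger thresholds (fewer bits), high-MSE blocks get smaller thresholds
     * (more bits). */
    double scaled_threshold(double base_threshold, double mse16x16)
    {
        return base_threshold * threshold_scale[categorize_block_mse(mse16x16)];
    }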
  • the proposed enhancements should fit in with the buffer fullness constraint used by the preferred embodiment algorithm to adjust the thresholds by realizing that the ratio of bits calculated to bits expected can be re-calculated from the new distribution of expected bits for each block. That is, the bits expected calculation is no longer just a function of time, but instead is based upon both the elapsed time and the expected number of bits to encode each of the previously encoded blocks.
  • the threshold for each block can be scaled by the following factor:
  • the proposed video coder algorithm enhancements distribute the bits spent encoding an image to those areas that need them in a mean squared error sense. Additionally, it helps with the computational processing power spent in encoding an image by also concentrating that power in those areas that need it in a mean squared error sense. This not only better utilizes resources in a multi-encoder environment, but it also helps with the realization of a 15 frames per second, 19.2KHz, single video encoder/decoder system.
  • the quality and the bit rate of the video algorithm are controlled by the levels of the hierarchy that are used to encode the image. While encoding all the blocks with the 4x4 VQ results in the highest quality, it also imposes the largest bit-rate penalty, requiring over 3000 bits per macro row (over 27000 bits per frame).
  • the hierarchical structure of the video coder addresses this problem by providing a collection of video coders, ranging from a very high coding gain (1 bit per 256 pixels) down to more modest coding gains (14 bits per 16 pixels). Adjusting the quality is the same problem as selecting the levels of the hierarchy that are used to encode the image blocks.
  • the selection of the levels of the hierarchy that are used and the particular encoders within each level of the hierarchy are controlled by thresholds.
  • a distortion measure is calculated and this value is compared to a threshold to see if the image block could be encoded at that level. If the distortion is less than the threshold, the image block is encoded with that encoder.
  • the thresholds are preferably adjusted for every image block based upon a measure of the image block activity (defined as the standard deviation).
  • the dependence on the block activity follows a psychovisual model that states that the visual system can tolerate a higher distortion with a high activity block than a low activity block. In other words, the eye is more sensitive to distortion in uniform areas than in non uniform areas.
  • the bias term, bias[N], is used to compensate for the fact that at a given distortion value, a larger image block may appear to have lower quality than a smaller image block with the same distortion. The bias term, therefore, attempts to exploit this phenomenon by biasing the smaller image block size with a larger threshold, i.e. larger bias[N].
  • the bias terms are: 1, 1.2, and 1.4 for 16x16, 8x8, and 4x4 image blocks, respectively.
  • alpha and beta must sum to 1.
  • a large value of alpha biases the encoder to put more emphasis on the variance of the image block so that higher variance blocks (i.e., higher activity blocks) are encoded at earlier levels of the hierarchy and at lower bit rates because there is a higher threshold associated with those blocks.
  • the value of the scale parameter gamma preferably varies as the HVQC algorithm is applied. A larger value of gamma forces more image blocks to be encoded at the higher levels of the hierarchy because the threshold becomes much higher. On the other hand, when gamma is assigned a smaller value, more image blocks may be encoded at the lower levels of the hierarchy.
  • gamma is an array of eight elements, three of which may be used. These elements are shown in Figure 19, in which gamma ranges from a low of 6 to a high of 80.
  • Gamma[BG16], appropriately scaled by the bias term for other image block sizes, is used for all the background and motion tests.
  • the actual value of gamma[BG16] used within the coder is determined by the buffer fullness. Smaller thresholds are used when the buffer is nearly empty and larger thresholds are used when the buffer is almost full.
  • the values of bgth16[0], bgth16[1], bgth16[2], f1, and f2 used for a preferred embodiment are listed below. These values indicate that the system is operating with a set of relatively small thresholds (between bgth16[0] and bgth16[1]) until the buffer fills to 70 percent full. Above f2, the overflow value, bgth_overflow, is used.
  • the following variables correspond to Figure 20: bgth16[0] (MSE per pixel) = 6.0
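  • The following C sketch shows one way the piece-wise linear relationship of Figure 20 could be realized, interpolating gamma[BG16] between the bgth16[] values as a function of the buffer fullness fraction and switching to the overflow value above f2. Only bgth16[0] = 6.0 is taken from the text above; the remaining values, and this reading of Figure 20, are assumptions.

    /* Breakpoints and threshold values; only bgth16[0] comes from the text,
     * the rest are illustrative placeholders. */
    static const double bgth16[3]     = { 6.0, 12.0, 40.0 };   /* MSE per pixel */
    static const double f1            = 0.70;   /* fullness breakpoint 1 */
    static const double f2            = 0.90;   /* fullness breakpoint 2 */
    static const double bgth_overflow = 250.0;  /* used near buffer overflow */

    /* Piece-wise linear mapping from buffer fullness to gamma[BG16]. */
    double gamma_bg16(double buffer_fullness, double buffer_size)
    {
        double x = buffer_fullness / buffer_size;   /* 0.0 (empty) .. 1.0 (full) */

        if (x >= f2)
            return bgth_overflow;
        if (x < f1)   /* small thresholds while the buffer has plenty of room */
            return bgth16[0] + (bgth16[1] - bgth16[0]) * (x / f1);
        /* between f1 and f2 the threshold rises more steeply */
        return bgth16[1] + (bgth16[2] - bgth16[1]) * ((x - f1) / (f2 - f1));
    }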
  • the channel capacity determines the bit rate that is available to the encoder.
  • the bit rate, i.e., the number of bits per second, is used in forming the encoder buffer size that is used in the ratio of buffer fullness to buffer size.
  • the buffer size is computed as 3.75 x bits per frame. When used with the frame rate, the number of bits per frame can be determined. The bits per frame are used to determine if there are bits available for refreshing after the image has been coded.
  • Frame rate specifies what frame rate is used with the channel capacity to compute the number of bits per frame.
  • the motion tracking number is used to scale the ratio of the buffer fullness to buffer size. A small number makes the buffer appear empty and hence more bits are generated. Further details on threshold adjustment are found in Appendix 4, which contains code in the C programming language for computing thresholds.
  • the video encoder/decoder system 103 may be composed of three TMS320C31 DSP processors. Specifically, in the preferred embodiment, there is a master processor 102 and two slave processors 108, 110. Under this architecture, the master processor 102 is responsible for all data input and output, overall system control, coordinating interprocessor communication, decoding images and pre- and post-processing of the images, such as filtering. The slave processors 108, 110 are solely responsible for encoding the image.
  • the software system (micro-code) for the video encoder/decoder 103 is formed from three parts: the system software, the encoding software and the decoding software. The system software provides the real-time interface to the video source data, bit stream data, image transfer to the host and overall system control and coordination. This section is intended to give an overview perspective of the video encoder/decoder system software and its major data paths and buffers.
  • there are two basic data paths in the video encoder/decoder 103. One is the encoder data path while the other is the decoder data path.
  • the encoder data path starts at a video source on the master processor 102; the data is then filtered and transferred by the master to the slaves 108, 110.
  • the slaves 108, 110 encode the data one macro row at a time sending the encoded bits back to the master processor 102.
  • a macro row is a horizontal strip of the image containing 16 full rows of the image.
  • the master processor 102 buffers these bits until the host requests them.
  • the video source can be either the video front end 104 or the host depending on the video mode set in the video encoder/decoder 103 status. By default the video mode is set to the host as the source.
  • the decoder data path starts with encoded data flowing into the decoder buffer from either the host when in "normal" mode or the encoder buffer when in "pass-through" mode.
  • the master processor 102 then decodes the encoded data into an image or frame.
  • the frame is then filtered by the master processor 102 and transferred to the host through the YUV to RGB converter system two lines at a time.
  • This converter system contains a FIFO so that up to five lines of RGB data can be buffered up in the FIFO at any given time.
  • the data is synchronously transferred two lines at a time.
  • the image can be filtered and transferred to the host in one long burst. In practice this is not completely true. For about 9 milliseconds of a frame time, interrupts from the video input line collection system can cause the video encoder/decoder 103 to not keep up with the host if the host is fast enough.
  • the decoded image that is to be filtered and passed to the host becomes the reference frame for the next decoded image.
  • This provides a double buffer mechanism for both the filter/data transfer and the decoding process. No other buffers are required to provide a double buffer scheme.
  • This data path is a result of splitting the encoding process between the two slave processors 108, 110. This data path is handled by the master processor 102.
  • the major data buffers for the video encoder/decoder 103 are the master video input data buffer, the master decoder image buffers, the master decoder encoded bit buffer, the master encoder encoded bit buffer, the slave encoded bit buffer, and the slave packed image buffers and the slave unpacked encoder image buffers.
  • the master video input data buffer consists of three buffers containing the Y, U & V image planes.
  • the Y image is formatted with two pixels per word.
  • the size of the Y buffer is 178x144/2 or 12,672 words. There are two extra lines to make the spatial filtering easier and more efficient.
  • the Y data format and U & V data format are shown in Figure 21.
  • the U & V images are packed four pixels per word in the format received from the video front end 104.
  • the least significant byte of the word contains the left most pixel of the four pixels.
  • the size of each U & V buffer is 89x72/4 or 1602 words.
  • There is an extra line because the system is collecting two extra Y lines.
  • the names of these buffers are inputY, inputU and inputV.
  • the master decoder encoded bit buffer is used to store encoded bit data to be decoded by the decoder.
  • the buffer is a circular buffer.
  • the bits are stored in packets. Each packet has an integer number of macro rows associated with it. The first word in the packet contains the number of bits in the data portion of the packet. Each macro row within the packet starts on a byte boundary. Bits in between macro rows that make up the gap must be zero. These bits are referred to herein as filler bits. The last macro row may be filled out to the next byte boundary with filler bits (value zero).
  • the size of this buffer is set to 48 Kbits or 1536 words in a preferred embodiment.
  • the size of this buffer could be reduced substantially. It is preferably large so that one can use different techniques for temporal synching.
  • the decoder buffer is maintained through a structure known as MDecoder and referenced by GlobalDecoder; both MDecoder and GlobalDecoder are of type ChannelState.
  • the master encoder encoded bit buffer is used to combine and store encoded bit data from the slave processors 108, 110.
  • the buffer is a circular buffer.
  • Each slave processor 108, 110 sends encoded bits to the master processor 102 one macro row at a time.
  • as the master processor 102 receives macro rows from the slave processors 108, 110, it places them in the master encoded bit buffer in packet format.
  • the lower 16 bits of the first word of each packet contain the number of encoded bits in the packet followed by the encoded bits. If the first word of the packet is negative it indicates that this packet contains what is known as a picture start code.
  • the picture start code packet delimits individual frames of encoded bits. Any unused bits in the last words of the packet are zero.
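  • The first word of each packet can be interpreted as sketched below in C: the low 16 bits give the encoded bit count and a negative word marks a picture start code packet. The helper type and names are illustrative.

    #include <stdint.h>

    /* Interpretation of the first word of a packet in the master encoder
     * encoded bit buffer. */
    typedef struct {
        int is_picture_start;   /* 1 if this packet delimits a new frame  */
        int bit_count;          /* encoded bits in the data portion       */
        int word_count;         /* 32-bit words occupied by the bit data  */
    } PacketHeader;

    PacketHeader parse_packet_header(int32_t first_word)
    {
        PacketHeader h;
        h.is_picture_start = (first_word < 0);
        h.bit_count        = (int)(first_word & 0xffff);   /* low 16 bits */
        h.word_count       = (h.bit_count + 31) / 32;      /* round up    */
        return h;
    }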
  • Each slave processor 108 or 110 contains a Ping-Pong buffer that is used to build and transfer encoded bits one macro row at a time. While one macro row is being encoded into one of the Ping-Pong buffers, the other is being transmitted to the master processor 102.
  • Ping and pong may each have a size of 8192 bits. This restricts the maximum macro row size to 8192 bits.
  • the slave packed image buffer is used to receive the filtered video input image from the master processor 102.
  • the Y, U & V image planes are packed 4 pixels per word. The sizes are preferably 176x144/4 (6336) words for Y and 88x72/4 (1584) words for U & V.
  • the encoder will unpack this buffer into an unpacked encoder image buffer at the start of an encoding cycle.
  • there are two slave unpacked encoder image buffers: one holds the current unpacked video image to be coded and one holds an unpacked reference image. At the beginning of a coding cycle the packed image buffer is unpacked into the current video image buffer. The reference image is the last encoded image. For still image encoding the reference image is not used. As the encoder encodes the current video image, the current video image is overwritten with coded image blocks. After the current image has been fully coded, it then becomes the reference image for the next encoding cycle.
  • the master processor 102 is responsible for video input collection and filtering, image decoding and post filtering, bit stream transfer both from the encoder to the host and from the host to the decoder, inter-slave communication and overall system control.
  • the master processor 102 is the only processor of the three that has interrupt routines (other than debugger interrupts).
  • the master has interrupt routines for the video frame clock, video line collection, RGB FIFO half full, exchange register in and out, and the slave/debugger.
  • the highest priority interrupt is the video line collection interrupt. This is the only interrupt that is never turned off except to enter and exit a lower priority interrupt. Interrupt latency for this interrupt is kept to a minimum since the video line collection system has very little memory in it.
  • the next highest priority interrupt is the slave interrupt. This allows the slave processors 108, 110 to communicate between the slave and the master and also between slaves with the help of the master. Timely response to slave requests allows the slaves more time to do the encoding.
  • the next highest priority interrupts are the exchange register interrupts. There are two interrupts for each direction (four total) of the exchange register. Two are used to process CPU interrupts while the other two are used to process DMA interrupts (transfer).
  • the CPU interrupt grants external bus access to the DMA, which is a lower priority device on the bus, thereby guaranteeing a reasonable DMA latency as seen from the host.
  • even when the master processor 102 turns off interrupts, the DMA can still proceed if it can gain access to the external bus. Normally, the only time the CPU interrupts are off is when the master is collecting a video input line or the master is trying to protect code from interrupts. In both cases, the code that is being executed should provide sufficient external bus access holes to allow for reasonable DMA latency times.
  • the next highest priority interrupt is the RGB FIFO half-full interrupt. This interrupt occurs any time the FIFO falls below half full. This allows the master processor 102 to continually keep ahead of the host in transferring RGB data assuming that the master is not busy collecting video line data.
  • the lowest priority interrupt is the frame clock that interrupts 30 times a second. This interrupt provides for a video frame temporal reference, frame rate enforcement and simulated limited channel capacity in "pass through" and "record" modes.
  • the software is structured to allow up to three levels of interrupt processing.
  • the first level can process any of the interrupts. While the video line interrupt is being processed, no other interrupt can be processed. If the slave, frame or exchange register interrupt is being processed, the video line collection can cause a second level interrupt to occur. While the RGB interrupt is being processed, the slave, frame or exchange register can cause a second level interrupt to occur, and while that second level interrupt is being processed, the video collection interrupt could cause a third level interrupt.
  • the host communication interface is a software system that provides a system for communicating between the video encoder/decoder 103 and the host processor. There are two hardware interfaces to support this system. The former was designed for low bandwidth general purpose communication, while the latter was designed to meet the specific need of transferring high bandwidth video data to be displayed by the host. Only the 16 bit exchange register can cause a host interrupt. Interrupts are provided for both the host buffer full and the host buffer empty conditions.
  • as data is received from the host, it is processed by a protocol parsing system. This system is responsible for interpreting the header of the protocol packets and taking further action based on the command found in the header. For protocol packets that contain data, the parser will queue up a DMA request to get the data. The name of the protocol parser is HOST_in() and it can be found in the file host_if.c.
  • Data to be sent to the host takes one of two forms. It is either general protocol data or YUV data to be converted to RGB. In either case the data is sent to the host through a queuing system. This allows for the master processor 102 to continue processing data (decoding) while the data transfer takes place. The system is set up to allow 20 outstanding messages to be sent to the host. In most cases, there will be no more than 2 or 3 outstanding messages to be sent to the host.
  • YUV data is sent through the YUV to RGB converter system, and the associated message informing the host of data availability is sent through the exchange register. This system contains a FIFO that will buffer up to 5 full lines of YUV data.
  • when YUV data is ready to be sent to the host, the master processor 102 places the data into the YUV to RGB FIFO, either two or six lines at a time depending on the size of the image. Each time the two or six lines are placed into the FIFO, the master processor 102 also queues up a protocol message to the host indicating that it has placed data in the FIFO. The master processor 102 will place data in the FIFO any time the FIFO is less than half full (and there is data to be sent). This allows the master processor 102 to keep ahead of the host unless the master processor 102 is collecting a line of video data or moving PCI data from one slave processor to the other (i.e., has something of a higher priority to do).
  • YUV frames are queued up to be sent to the host with the RGB_queue() function.
  • Packets of data to be sent to the host are queued up with the HOST_queue() and HOST_queueBitBuffer() functions.
  • the HOST_queue() function is used to queue up standard protocol packets.
  • the HOST_queueBitBuffer() function is used to queue up a bit stream data buffer that may contain several protocol packets containing bit stream information (macro rows).
  • the HOST_vsaRespond() function is used to send status information to the host. Further details of the video encoder/decoder host software interface are provided below.
  • the video front end interface is responsible for collecting and formatting video data from the video capture board. From the application level view, the video front end system collects one frame at a time. Each frame is collected by calling the VIDEO_start() function.
  • the VIDEO_start() function performs several tasks. First, it looks at the video encoder/decoder status and sets the pan, pre-filter coefficients, frame rate, brightness, contrast, hue and saturation. Then it either arms the video front end 104 to collect the next frame or defers that to the frame interrupt routine depending on the state of the frame sampling system.
  • the frame sampling system is responsible for enforcing a given frame rate. If the application code requests a frame faster than the specified frame rate, the sampling waits an appropriate time before arming the video collection system.
  • This system is implemented through the frame clock interrupt function FRAME_interrupt().
  • the FRAME_interrupt() function is also responsible for coordinating the simulated limited channel capacity in "record" and "pass-through" modes. It may do this by requesting that encoded bits be moved from the encoder buffer to their destination on every frame clock. The amount of data moved is dependent on the specified channel capacity.
  • the video front end system Once the video front end system has been armed, it will collect the next frame under interrupt control.
  • the video lines are collected by two interrupt routines. One interrupt routine collects the even video lines and the other interrupt routine collects the odd lines. This is necessary to save time since it is necessary to decimate the U&V components in the vertical direction.
  • the even line collection interrupt routine collects and places the Y component for that line into an internal memory buffer.
  • the U&V components are stored directly to external memory.
  • the odd line collection interrupt routine discards the U&V components and combines the Y components from the last even line with the Y components from the current line and places these in external memory.
  • the format of the Y image is compatible with the spatial filter routine that will be used to condition the Y image plane.
  • since the video line collection interrupt could interrupt code that has the slave processors 108, 110 in hold (master processor 102 accessing slave memory), the interrupt code releases the hold on the slave processors 108, 110 before collecting the video line. Just before the interrupt routine returns, it restores the previous state of the slave hold.
  • the front end spatial filter is responsible for filtering and transferring the Y image plane to the slave processors 108, 110.
  • the image is transferred in packed format (four pixel per word).
  • the slave processors 108, 110 are put into hold only when the data transfer part of the process is taking place. Two lines of data are filtered and transferred at a time.
  • the function InputFilterY() is used to filter and transfer one Y image plane.
  • the U&V image planes are transferred to the slave using the MoveImageToSlave() function.
  • This function moves the data in blocks from the master to the slave. The block is first copied into internal memory, then the slave is placed into hold and the data is copied from internal memory to the slave memory. The slave is then released and the next block is copied. This repeats until the entire image is moved.
  • the internal memory buffer is used because this is the fastest way to move data around in the C31 processor.
  • the post decoder spatial filter is responsible for filtering the Y image plane and transferring the Y,U&V image planes to the host.
  • the filter operation uses no external memory to perform the filter operations. This requires that the filter operation be done in blocks and the filtered data be directly transferred into the YUV to RGB FIFO.
  • the block size is two lines.
  • the filter uses a four line internal memory buffer to filter the data. To get things started (i.e., put two lines into the internal memory buffer and initialize the filter system) the function OutputFilterYinit() is used. Each time the system wishes to place two more lines of data into the FIFO it calls the OutputFilterY() function. These two functions are used exclusively to filter and transfer an image to the FIFO. They do not coordinate the transfer with the host. Coordination is performed by the RGB back end interface described next.
  • the RGB back end is responsible for coordinating the transfer of selfview and decoded images to the host.
  • the Y plane is also spatially filtered.
  • the video encoder/decoder system is capable of transferring both a selfview and a decoded image each frame time.
  • the smaller selfview image is a decimated version of the larger selfview image.
  • when the application code wants to send an image to the host, it calls the RGB_queue() function, passing it the starting address of the Y, U & V image planes and what type of image is being transferred.
  • the queuing system is then responsible for filtering if it is a decoded image and transferring the image to the host. Selfview images are not filtered.
  • the queuing system can queue up to two images. This allows both a selfview and a decoded image to be simultaneously queued for transfer.
  • the images are transferred to the host in blocks.
  • the transfer block size is two lines.
  • for the small selfview image, the transfer block size is six lines. This makes the most effective use of the FIFO buffer in the YUV to RGB converter while allowing the host to extract data while the C31 is filling the FIFO.
  • the mechanism used to trigger the filtering/transfer or transfer of the image block is the FIFO half full interrupt. When the FIFO goes from half full to less than half full, it generates an interrupt to the C31. This interrupt is processed by the RGB_interrupt() function.
  • the interrupt function filters and transfers a block of data using the OutputFilterY() function in the case of a decoded image.
  • the interrupt function copies the image block using the OutputY() function for a large selfview image or the OutputYsmall() function for the small selfview image.
  • the OutputFilterY() and OutputYsmall() functions also transfer the U&V image planes.
  • the RGB_dropCurrent() function is called whenever the video encoder/decoder 103 wants to stop sending the current image to the host and start processing the next image in the queue. This function is called by the host interface when the host sends the drop frame protocol packet.
  • the slave communication interface is used to coordinate the transfer of video images to the slaves, transfer encoded bit stream data from the slave processors 108, 110 to the master processor 102, transfer data between the slaves and transfer data from the slaves to the host.
  • the latter provides a way to get statistical data out of the slave processors 108, 110 and is used for debugging only.
  • the interrupt is processed by the SLAVE_request() function.
  • the types of slave requests that can be processed are:
  • the request "Request New Frame” is mad right after the encoder unpacks the packed video image at the beginning of an encoding cycle.
  • SLAVE_request() simply records that the slave wants a new frame and returns from the interrupt.
  • the main processing loop is then responsible for filtering and transferring the new image to the slave. Once the main processing loop has transmitted a new image to the slave, the new Frame Available flag in the communication area is set to TRUE.
  • the request "Request Scene Change Information” is used to exchange information about the complexity of the image for each slave. If the value exchanged is above some threshold, both slaves will encode the frame as a still image. The slaves transmit their local scene change information to the master as part of this request. Once both slaves have requested this information, the master exchanges the sceneValue in the respective communication areas of the slave processors 108, 1 10 with the other slave's scene change information.
  • the request "Request Refresh Bits" is similar to the scene change information request.
  • the master responds instantly with the size of the master's encoder encoded bit buffer.
  • the slave wants to wait until the master encoder buffer is below a specified size and for the other slave to request this information.
  • the request "Request PCI” is used to send or receive previously coded image data to or from the other slave processor. This is used when the adaptive boundary is changed. There are two forms of this request. One for sending PCI data and one for receiving PCI. After one slave sends a receive request and the other sends a send request, the master processor 102f then copies the PCI data from the sender to the receiver.
  • the request "Send Encoded Bits” is used to transfer a macro row of data to the master to be placed in the master's encoder encoded bit buffer. There are two forms of this request.
  • One form is used to send a picture start code and one to send a macro row. The picture start code is sent before any macro rows from either processor are sent.
  • the request "Acknowledge Slave Encoder Buffer Cleared” is used to acknowledge the master's request for the slaves to clear their buffers. This allows the master processor 102 to synchronously clear its encoder bits buffer with the slave clearing its buffer.
  • the request "Send Data to Host” is used to transmit a data array to the host processor. This is used to send statistical data to the host about the performance of the encoder.
  • the master processor 102 can send messages to the slave through the SLAVE_inform() function.
  • the slave processors 108, 110 are not interrupted by the message.
  • the slaves occasionally look (once per macro row or once per frame) to see if there are any messages.
  • the types of messages are as follows:
  • the message "Image Ready” informs the slave that a packed image is ready.
  • the "Clear Encoder Buffer” message tells the slave to clear its Ping-Pong bit buffers and restart the encoding process with the next video frame.
  • the slave will respond after resetting its buffers by sending back an acknowledgment (Acknowledge Slave Encoder Buffer Cleared).
  • the "Still Image Request” message is used to tell the slave to encode the next frame as a still image.
  • the "Scene Change Information” message is used to respond to the slaves request to exchange scene change information.
  • the "Refresh Bits" message is used to respond to the slaves request to exchange refresh bit information.
  • the "Buffer Fullness” message is used to respond to the slaves request to wait until the size of the master's encoder encoded bit buffer has fallen below a certain level.
  • the master bit buffer interface is responsible for managing the encoder encoded bit buffer and the decoder encoded bit buffer on the master processor 102.
  • the bit buffers are managed through a structure of type ChannelState.
  • the bits are stored in the buffer in packets.
  • a packet contains a bit count word followed by the encoded bits packed 32 bits per word. Bits are stored most significant to least significant.
  • the important elements of this structure used for this purpose are as follows (a sketch of the structure appears after this list):
  • wordsInBuffer: This contains the current number of words in the buffer. This includes the entire packet, including the storage for the number of bits in the packet (see the description of bit data buffers above).
  • BufferIndex: This is the offset in the buffer where the next word will be taken out.
  • BufferIntPtrIndex: This is the offset in the buffer where the next word will be placed. If BufferIntPtrIndex is equal to BufferIndex there is no data in the buffer.
  • BufferFullness: This measures the number of bits currently in the buffer.
  • bitPos: This points to the current bit in the current word to be decoded. This is not used for the encoder buffer.
  • bitsLeftInPacket: This is used to indicate how many bits are left to be decoded in the decoder buffer. This is not used for the encoder buffer.
  • buffer: This is a pointer to the beginning of the bit buffer.
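  • Taken together, the fields above suggest a structure along the following lines; this is a hedged C reconstruction for illustration only, and the actual ChannelState definition in the microcode may use different types and contain additional members.

        #include <stdint.h>

        typedef struct {
            int       wordsInBuffer;     /* words currently in the buffer, counting whole packets  */
            int       BufferIndex;       /* offset where the next word will be taken out           */
            int       BufferIntPtrIndex; /* offset where the next word will be placed              */
            int       BufferFullness;    /* number of bits currently in the buffer                 */
            int       bitPos;            /* current bit in the current word (decoder buffer only)  */
            int       bitsLeftInPacket;  /* bits left to decode in the packet (decoder buffer only)*/
            uint32_t *buffer;            /* pointer to the beginning of the bit buffer             */
        } ChannelState;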
  • the function putDecoderBits() is used to put bits from the host into the decoder buffer.
  • the encoderToDecoder() function is used in the "pass-through" mode to move bits from the encoder buffer to the decoder buffer.
  • the getBits() function is used to extract or look at a specified number of bits from the current bit position in the buffer.
  • the bitsInPacket() function is used to find how many bits are left in a packet. If there are no bits or only fill bits left in the packet, the size of the next packet is returned.
  • the skipToNextPacket() function is used when the decoder detects an error in the packet and wants to move to the next packet, ignoring the current packet.
  • the moveEncoderBits() function is used to simulate a limited-capacity channel when the system is in the "record" or "pass-through" mode.
  • the getEncoderBits() function is used to move encoder bits from the master processor 102 encoder bit buffer to the host. This is used in the real-time mode when there is a real limited bit channel system (like a modem) requesting bits from the video encoder/decoder system.
  • the encoderToHost() function is used to build a host protocol buffer from the encoder buffer.
  • the host protocol buffer can contain multiple protocol packets. This function is called by the moveEncoderBits() and getEncoderBits() functions.
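  • As an illustration of how a getBits()-style routine could walk this structure, the following hedged sketch extracts n bits most significant first. The real function also handles wrap-around, packet boundaries, and the "look" (peek) behavior; the bit ordering assumed here follows the packing rule described earlier, with bitPos counting down from 31.

        #include <stdint.h>

        /* ChannelState is as sketched above. */
        uint32_t get_bits(ChannelState *cs, int n)
        {
            uint32_t value = 0;
            for (int i = 0; i < n; i++) {
                uint32_t word = cs->buffer[cs->BufferIndex];
                value = (value << 1) | ((word >> cs->bitPos) & 1u);
                if (--cs->bitPos < 0) {          /* advance to the next packed word */
                    cs->bitPos = 31;
                    cs->BufferIndex++;
                    cs->wordsInBuffer--;
                }
                cs->bitsLeftInPacket--;
                cs->BufferFullness--;
            }
            return value;
        }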
  • the main() function and main processing loop, masterLoop(), is responsible for coordinating a number of activities. First it initializes the video encoder/decoder 103 subsystems and sends the initial status message to the host. It then enters a loop that is responsible for coordinating six things: 1. Filtering and transferring the input video image to the slaves.
  • the main processing loop does the following. If the video encoder/decoder 103 status indicates that a selfview image should be sent to the host and the video encoder/decoder 103 is not currently sending a selfview frame, it will queue up the new frame to be sent to the host. It will then filter and transfer the new frame to the slave processors 108, 110. This process does not alter the input image, so it does not affect the selfview image. If the host has requested that the frame be sent in YUV format, the YUV image will then be transmitted to the host in YUV format. Finally, the video input system will be restarted to collect the next frame. If the decoder state is non-zero and the processing loop was entered with a non-zero flag, the DECODE_Image() function is called to coordinate the decoding and transfer to the host of the next video frame.
  • the last part of the main processing loop checks to see if encoded bits can be sent to the host if any are pending to be sent. If the host had requested encoder encoded bits and the encoder buffer was emptied in the host interrupt processing function before the request was satisfied, the interrupt routine must exit leaving some number of bits pending to be transmitted.
  • the DECODE_Image() function is responsible for coordinating the decoding process on a frame by frame basis and the transmission of the decoded frames to the host.
  • a flag is used to coordinate the decoding and transferring of decoded images.
  • the flag is called OKtoDecode. It is not OK to decode if there is a decoded frame currently being transmitted to the host and the next frame has already been decoded. This is because the decoder needs the frame being transmitted to construct the next decoded frame. Once that old frame has been completely transmitted to the host, the recently decoded frame is queued to be transmitted and the decoding process can continue. If the decoder bit buffer starts to back up (become full) the decoder will go on without queuing up the previously decoded image.
  • the decoding process gets or looks at bits in the decoder buffer by calling the getBits() and bitsInPacket() functions. If during the decoding process, the getBits() or the bitsInPacket() function finds the decoder buffer is empty, it waits for the buffer to be non-empty. While it waits it calls the masterLoop() function with a zero parameter value. This allows other tasks to be performed while the decoder is waiting for bits.
  • the slave processors 108, 110 are responsible for encoding video frames of data. This is their only responsibility; therefore there is not much system code associated with their function.
  • the slave processors 108, 110 do not process any interrupts (except for the debugger interrupt for debugging purposes). There is a simple communication system that is used to request service from the master. All communication from the master to the slave is done on a polled basis.
  • the structure is defined in the slave.h file as a typedef SLAVE_Request.
  • the name of this structure for the slave processor is SLAVE_com.
  • the master processor 102 accesses the communication area at offset 0x400 in the slave memory access window.
  • for one slave processor, the communication structure would reside at location 0xa00400 in the master processor 102 address space.
  • for the other slave processor, the communication structure would reside at location 0xc00400.
  • In order for the master to gain access to this structure, it must first place the slave(s) into hold. Every time the slave processor needs to request service from the master processor 102, it first checks to see if the master has processed the last request.
  • there is a hardware flag, SEx_THBE (where "x" indicates which slave encoder processor is to be checked), that indicates that the slave is currently requiring service from the master. If the master has not processed the last request, the slave waits until it has. When there is no pending service, the slave places a request or command in the communication structure element named "cmd". It also places any other data that the master will need to process the command in the communication structure. If the communication requires synchronization with the other slave or a response before the slave can continue, the slave also sets the communicationDone flag to FALSE. Then the slave sets the hardware request flag. If the slave needs a response before going on, it will poll the communicationDone flag waiting for it to go TRUE. Once the flag is TRUE, any requested data will be available in the communication structure.
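  • The request sequence just described can be summarized with the following hedged sketch; SLAVE_Request, SLAVE_com, master_has_pending_request() and set_request_flag() are illustrative names standing in for the actual structure and the hardware flag access.

        #define FALSE 0
        #define TRUE  1

        typedef struct {
            volatile int cmd;                /* request or command for the master        */
            volatile int communicationDone;  /* set TRUE by the master once serviced     */
            volatile int data[8];            /* any data the master needs (size assumed) */
        } SLAVE_Request;

        extern SLAVE_Request SLAVE_com;
        extern int  master_has_pending_request(void);  /* e.g. tests the hardware flag   */
        extern void set_request_flag(void);            /* raises the hardware request    */

        void slave_request(int cmd, int needsResponse)
        {
            while (master_has_pending_request())
                ;                                   /* wait until the last request is processed */
            SLAVE_com.cmd = cmd;                    /* place the command in the structure       */
            if (needsResponse)
                SLAVE_com.communicationDone = FALSE;
            set_request_flag();                     /* signal the master                        */
            if (needsResponse)
                while (SLAVE_com.communicationDone != TRUE)
                    ;                               /* poll until the master responds           */
        }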
  • Video Bit Stream. This section describes the bit stream that is in use in the video encoder/decoder system when compiled with picture start code (PSC) equal to four. This section addresses the header information that precedes each macro row and the format of the bit stream for video, still, and refresh information.
  • PSC picture start code
  • Figure 22 shows a high level example of a video bit stream where the video sequence includes the first frame coded by the still image coder and the second frame coded by the video coder followed by the refresh data for part of the macro row.
  • the decoder can decode each macro row in a random order and independently from other macro rows. This flexibility is achieved by adding a macro row address (macro row number) and macro header information to each encoded macro row.
  • each of the motion caches must be reset before encoding and decoding each macro row. This modification results in a slightly lower PSNR (0.1 dB) for a number of file-io sequences, but it was seen to have no perceptual degradation in both the file-io and the real-time simulations.
  • the motion vectors should be reset to some small and some large values to account for the high-motion frames (the latest source code resets to only small motion vectors).
  • the still image encoder is also macro row addressable, and therefore the still image VQ cache should be reset before encoding and decoding each macro row. There are four compiling options that affect how header information is added to the bit stream.
  • switch PSC which is set in the master and slave make files.
  • This use of fill bits and packetizing of the macro rows has ramifications for detecting bit errors because it indicates (without decoding the data) the boundaries of the macro rows. This information can be used to skip to the next macro row packet in the event the decoder becomes confused.
  • the decoder can determine whether it has left over bits, or too few bits, for decoding the given macro row. Once errors are detected, error concealment strategies of conditionally replenishing the corrupted macro rows can be performed and decoding can continue with the next macro row packet of data. This allows the detection of bit errors through the use of the macro row packetized bit stream.
  • each macro row of data begins with the macro header, defined as follows.
  • Video, still images, and refresh data are encoded on a macro row basis and packetized according to their macro row address.
  • the header information for each macro row includes the fields shown in Figure 24.
  • the amount of overhead is 8 bits per macro row.
  • the information is found in the bit stream in the order given in Figure 24, i.e., the data type is the first two bits, the macro row address the next four bits, etc.
  • the macro row address (MRA) (4 bits) specifies which of the nine (for QCIF) macro rows the encoded data corresponds to.
  • the value in the bit stream is the macro row address plus five.
  • the relative temporal reference (RTR) (1 bit) signifies whether the encoded macro row is part of the current frame or the next frame.
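  • Composing the fields spelled out above (the 2-bit data type, the 4-bit macro row address stored as the address plus five, and the 1-bit relative temporal reference) can be sketched as follows; the function name is illustrative, Figure 24 defines the full 8-bit header, and any remaining header bit is not reproduced in this hedged example.

        /* Returns the header bits with the data type in the most significant position. */
        unsigned macro_row_header(int dataType, int macroRowAddr, int rtr)
        {
            return (((unsigned)dataType & 0x3u) << 5) |            /* DT: video/still/refresh    */
                   (((unsigned)(macroRowAddr + 5) & 0xFu) << 1) |  /* MRA stored as address + 5  */
                   ((unsigned)rtr & 0x1u);                         /* RTR: current or next frame */
        }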
  • macro row zero (for video and still data types, not for refresh) contains the temporal reference information and once every 15 frames, frame rate information, and possibly other information as specified by a picture extended information (PEI) bit.
  • PEI picture extended information
  • the next 6 bits correspond to the frame rate divisor.
  • the decoder uses the frame rate information and the temporal reference information to synchronize the output display of the decoded images.
  • the frame rate divisor is sent once every 15 frames by the encoder.
  • the actual frame rate is calculated by dividing 30 by the frame rate divisor.
  • the value of 0 is used to tell the decoder that it should not try to time synchronize the display of decoded images.
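  • The frame rate arithmetic stated above can be written directly; a divisor of zero is the escape value telling the decoder not to time-synchronize.

        double actual_frame_rate(int frameRateDivisor)
        {
            if (frameRateDivisor == 0)
                return 0.0;                  /* decoder should not time-synchronize   */
            return 30.0 / frameRateDivisor;  /* e.g. a divisor of 2 gives 15 frames/s */
        }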
  • Video images use a quad-tree decomposition for encoding.
  • the format of the output bit stream depends on which levels of the hierarchy are used and consequently follows a tree structure.
  • the bit stream is perhaps best described through examples of how 16x16 blocks of the image are encoded, considering that a QCIF image is composed of 99 16x16 blocks.
  • Figure 26 shows several different decompositions for 16x16 blocks and the resulting bit stream.
  • the notation of Flag c indicates the complement of the bit stream flag as defined in Figure 27, which depicts the flags used to describe the video bit stream.
  • the 16x16 block is encoded as a background block.
  • the 16x16 block is coded as an entry from the motion 16 cache.
  • the 16x16 block is encoded as a motion block along with its motion vector.
  • the 16x16 block is decomposed into 4 8x8 blocks which are encoded (starting with the upper left block) as a background 8x8 block, another background 8x8 block, a hit in the motion 8 cache, and as an 8x8 motion block and the associated motion vector.
  • the 16x16 block is divided into 3 8x8 blocks and 4 4x4 blocks.
  • the upper left 8x8 block is coded as a motion 8 cache hit.
  • the second 8x8 block is decomposed and encoded as a VQ block,
  • a hit in the motion 4 cache, the mean of the block, and finally as a VQ block.
  • the third 8x8 block is encoded as an 8x8 motion block and its motion vector, and the final 8x8 block is encoded as a motion 8 cache hit.
  • the color information is coded on a block by block basis based on the luminance coding.
  • Figure 28 shows some sample decompositions of color (chrominance) 8x8 blocks and descriptions of their resulting bit streams.
  • Still images are encoded in terms of 4x4 blocks.
  • the coding is performed using a cache VQ technique where cache hits require fewer bits and cache misses require more bits to address the VQ codebook.
  • Figure 29 shows an example of coding 4 luminance blocks.
  • Mn1 is the mean of the first block.
  • Mn4 is the mean of the fourth block.
  • VQ1 and VQ4 are the VQ addresses of the first and fourth blocks, respectively.
  • C2 and C3 are the variable length encoded cache locations.
  • C1 and C4 are invalid cache positions that are used as miss flags.
  • the VQ addresses have one added to them before they are put into the bit stream to prevent the all-zero VQ address.
  • the corresponding 4x4 chrominance blocks (first U and then V) are encoded using the same encoding method as the luminance component. Different caches are used for the color and luminance VQ information.
  • Since the coder is lossy, the image must be periodically re-transmitted to retain a high image quality. Sometimes this is accomplished by sending a reference image called a "still." Other times the image can be re-transmitted piece-by-piece as time and bit rate permit.
  • This updating of portions of the encoded image can be done through various techniques including scalar absolute refreshing, scalar differential refreshing, block absolute refreshing, and block differential refreshing.
  • Refresh is used to improve the perceptual quality of the encoded image.
  • the coded image is known to the transmitter along with the current input frame.
  • the transmitter uses the latter to correct the former provided that the following two conditions are true: first, that the channel has extra bits available, and second that there is time available. Whether or not there are bits available is measured by the buffer fullness; refreshing is preferably performed only when the output buffer is less than 1/3 full. Time is available if no new input frame is waiting to be processed.
  • the two independent processors also perform their refreshing independently and it is possible on some frames that one of the processors will perform refreshing while the other does not.
  • the refresh header information includes the fields shown in Figure 30. Each of these levels is further divided into absolute and differential refresh mode with the mode set by a bit in the header.
  • the refresh header information includes the fields of the macro row header.
  • the refresh address field specifies the starting 4x4 block that the refresher is encoding within the specified macro row. For QCIF images, there are 176 4x4 blocks in each macro row.
  • the bit stream interface adds 31 to the refresh block address to prevent the generation of a picture start code.
  • the decoder bit stream interface subtracts this offset.
  • the "number of 4x4 blocks" field enumerates the number of 4x4 blocks ofthe image that are refreshed within this macro row packet. This field removes the need for a refresh end code.
  • the absolute refresher operates on 8x8 blocks by encoding each ofthe four luminance 4x4 blocks and single U and V 4x4 blocks independently. This means that the refresh information will always include an integer multiple of four luminance 4x4 blocks and one U and one V 4x4 block.
  • This section describes the VQ refreshing that occurs on each 4x4 luminance block and its corresponding 2x2 chrominance blocks through the use of 2, 16, or 24 dimensional VQ codebooks.
  • each 4x4 luminance block has its mean and standard deviation calculated.
  • the mean is uniformly quantized to 6 bits with a range of 0 to 255 and the quantized 6-bit mean is put into the bit stream. If the standard deviation of the block is less than a threshold, a flag bit is set and the next 4x4 luminance block is encoded. If the standard deviation is greater than the threshold, the mean is removed from the block and the 255-entry VQ is searched.
  • the 8-bit address of the VQ plus one is placed into the bit stream (zero is not allowed as a valid VQ entry in a preferred embodiment - the decoder subtracts one before using the address).
  • the mean and standard deviation of the associated 4x4 U and V blocks are computed.
  • the U and V means are used to search a two-dimensional mean codebook that has 255 entries.
  • the resulting 8-bit VQ address plus one is put into the bit stream. If the maximum of the U and V standard deviations is greater than the threshold, a 16-dimensional 255-entry VQ is searched for both the U and the V components.
  • the two resulting 8 bit VQ addresses plus one are put into the bit stream.
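  • The luminance portion of the absolute VQ refresh described above can be sketched as follows. block_mean(), block_stddev(), remove_mean() and vq_search_255() are hypothetical helpers, the flag polarity and rounding are assumptions, and put_bits()/BitWriter refer to the writer sketched earlier; the chrominance path (two-dimensional mean codebook, then 16-dimensional VQ) follows the same pattern.

        #include <stdint.h>

        typedef struct BitWriter BitWriter;                  /* writer sketched earlier            */
        extern void   put_bits(BitWriter *w, uint32_t value, int n);
        extern double block_mean(const uint8_t blk[16]);
        extern double block_stddev(const uint8_t blk[16]);
        extern void   remove_mean(const uint8_t blk[16], double mean, int16_t out[16]);
        extern int    vq_search_255(const int16_t residual[16]);   /* returns 0..254              */

        void refresh_luma_4x4_absolute(BitWriter *w, const uint8_t blk[16], double threshold)
        {
            double mean = block_mean(blk);
            put_bits(w, (uint32_t)(mean * 63.0 / 255.0 + 0.5), 6);  /* 6-bit uniform mean          */

            if (block_stddev(blk) < threshold) {
                put_bits(w, 1, 1);                /* flag bit set: flat block, mean only           */
                return;
            }
            put_bits(w, 0, 1);                    /* flag clear: mean-removed VQ address follows   */

            int16_t residual[16];
            remove_mean(blk, mean, residual);
            int addr = vq_search_255(residual);   /* 255-entry codebook                            */
            put_bits(w, (uint32_t)(addr + 1), 8); /* address + 1: the all-zero address never sent  */
        }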
  • VQ refreshing may be absolute or differential.
  • each 4x4 luminance block has its mean calculated and quantized according to the current minimum and maximum mean for the macro row, and the 5-bit quantized value is put into the bit stream.
  • the mean is subtracted from the 4x4 block and the result is used as the input to the VQ search (codebook has 255 entries).
  • the resulting 8-bit VQ address is placed into the bit stream.
  • the means of the 2x2 chrominance blocks are used as a 2-dimensional input to a 63 entry VQ.
  • the resulting 6-bit address is placed into the bit stream.
  • the result is 19 bits per 4x4 block.
  • each 4x4 luminance block and the associated 2x2 chrominance blocks are subtracted from their respective values in the previously coded frame.
  • the difference is encoded as an 8-bit VQ address from a 24- dimensional codebook that has 255 entries.
  • the result is 8 bits per 4x4 block.
  • Scalar refreshing is a mode that is reserved for when the image has had VQ refreshing performed on a large number of consecutive frames (for example, about 20 frames). This mode has been designed for when the camera is pointing away from any motion. The result is that the coded image quality can be raised to near original quality levels.
  • scalar refreshing may be either absolute or differential.
  • absolute scalar refreshing (168 bits per 4x4 block)
  • the 4x4 luminance and 2x2 chrominance blocks are encoded pixel by pixel. These 24 pixels are linearly encoded with 7-bit accuracy resulting in 168 bits per 4x4 block.
  • the 4x4 luminance and 2x2 chrominance blocks are encoded pixel by pixel.
  • the resulting 4x4 block is encoded with 120 bits.
  • the system selects one of the available methods of refreshing.
  • the mode selected is based upon how many bits are available and how many times an image has been refreshed in a given mode. In terms of best image quality, one would always use absolute scalar refreshing, but this uses the most bits.
  • the second choice is differential scalar refreshing, followed by absolute VQ, and finally differential VQ. In pseudo code the decision to refresh using scalar or VQ looks like this:
  • SET_SCALAR_REFRESH = FALSE; where cumulative_refresh_bits is a running tally of how many bits have been used to encode a given frame, TOO_MANY_VQ_REFRESH_BITS is a constant, also referred to as a refresh threshold, and SET_SCALAR_REFRESH is a variable which is set to let other software select between various refresh sub-routines.
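  • The pseudo code above is garbled in this copy; the following is a hedged reconstruction using only the variables it names plus an assumed frame counter, reflecting the earlier statement that scalar refresh is reserved for images that have been VQ-refreshed for roughly 20 consecutive frames. The exact comparison in the source code may differ.

        #define TOO_MANY_VQ_REFRESH_BITS 4096   /* refresh threshold; illustrative value only      */
        #define VQ_REFRESH_FRAME_LIMIT     20   /* ~20 consecutive VQ-refreshed frames (see above) */

        int SET_SCALAR_REFRESH = 0;             /* FALSE */

        void choose_refresh_mode(int cumulative_refresh_bits, int consecutive_vq_frames)
        {
            SET_SCALAR_REFRESH = 0;                                   /* default: VQ refreshing    */
            if (consecutive_vq_frames >= VQ_REFRESH_FRAME_LIMIT &&
                cumulative_refresh_bits < TOO_MANY_VQ_REFRESH_BITS)
                SET_SCALAR_REFRESH = 1;          /* other software then selects the scalar routine */
        }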
  • Figure 31 depicts the overhead associated with the bit stream structure.
  • Picture Header Description. With respect to the picture header, in the default configuration of the encoder/decoder (compiled with PSC 4), the picture header is not part of the bit stream. This description of the picture header is only useful for historical purposes and for the cases where the compile switch PSC is set to values 1 or 3.
  • the picture header precedes the encoded data that corresponds to each frame of image data.
  • the format of the header is a simplified version of the H.261 standard, which can require up to 1000 bits of overhead per frame for the header information.
  • a preferred picture header format only requires 23 bits (when there is no extended picture information) for each video and still image frame and includes the fields shown in Figure 32 and described below.
  • the code works whether or not there is a start code as part of the bit stream.
  • the picture header includes a picture start code (PSC) (16 bits), a picture type (PTYPE) (1 bit) and a temporal reference (TR) (6 bits).
  • PSC picture start code
  • PTYPE picture type
  • TR temporal reference
  • a word of 16 bits with a value of 0000 0000 0000 0111 is a unique bit pattern that indicates the beginning of each encoded image frame. Assuming there have been no transmission bit errors or dropped macro rows the decoder should have finished decoding the previous frame when this bit pattern occurs.
  • the picture type bit specifies the type of the encoded picture data, either as still or video data as follows:
  • PTYPE '0': moving picture (encoded by the video coder)
  • a six bit word ranging from 000010 to 111110 indicates the absolute frame number plus 2 (modulo 60) of the current image based on the information from the video front end 104.
  • the bit stream interface adds two to this number to generate 6-bit numbers between 2 and 61 (inclusive) to avoid generating a bit field that looks like the picture start code.
  • the decoder bit stream interface subtracts two from this number. This field is used to determine if the encoders have missed real-time and how many frames were dropped.
  • Error Concealment. There are three main decoding operations: decoding still image data, decoding video data, and decoding refresh data. This section discusses how bit errors can be detected, and the resulting error concealment strategies.
  • bit error conditions may occur in both the video and still image decoding process.
  • the system ignores the macro row packet and goes to the next macro row packet.
  • for a duplicated macro row address, either there is a bit error in the macro row address or the data is for the next frame.
  • the system uses the RTR (relative temporal reference) to determine if the duplicated macro row data is for the next frame or the current frame. If it is for the next frame, the system finishes the error concealment for this frame and sends the image to the YUV to RGB converter and begins decoding the next frame. If it is for the current frame, then the system assumes there is a bit error in the address and operates the same as if an invalid macro row address were specified.
  • For an error of extra bits left over after the macro row, the system overwrites the macro row with the PCI macro row data as if the macro row had been dropped and never decoded. The system then continues with the next macro row of encoded bits. For an error of too few bits for decoding a macro row, the system overwrites the macro row with the PCI macro row data as if the macro row had been dropped and never decoded. The system then continues with the next macro row of encoded bits. Another possible error is an invalid VQ address (8 zeros). In this case, the system copies the mean from a neighboring 4x4 block.
  • Another error is video data mixed with still data.
  • the system checks the video data RTR. If it is the same as the still RTR, then there is an error condition. If the RTR indicates the next frame, then the system performs concealment on the current frame, sends the image to the YUV to RGB converter, and continues with decoding on the next frame.
  • Another error is refresh data mixed with still data. This is most likely an undetected bit error in the DT field because the system does not refresh after still image data. In this case, the macro row should be ignored. There are also a number of bit errors that are unique to the video image decoder.
  • One type of error is an invalid motion vector (Y,U,V). There are two types of invalid motion vectors: 1.
  • Out of range motion value magnitude (should be between 0 and 224); and 2.
  • Motion vector that requires PCI data that is out of the image plane. In this situation, the system assumes a default motion vector and continues decoding.
  • Another video image decoder error is refresh data mixed with video data. In this case, the system does the required error concealment and then decodes the refresh data.
  • one error is an invalid refresh address.
  • the system ignores the macro row data and skips to the next macro row packet.
  • Packetizing of the Encoded Data. It is desirable to transmit the encoded data as soon as possible.
  • the preferred embodiment allows the encoding of images with small packets of data which can be transmitted as soon as they are generated to maximize system speed, independent of synchronization with the other devices.
  • image areas can be encoded in arbitrary order. That is, no synchronization is required among processors during the encoding of their respective regions because the decoder can interpret the macro rows and decode in arbitrary order.
  • This modular bit stream supports the merging of encoded macro-row packets from an arbitrary number of processors.
  • the macro row packets are not the same size.
  • the packets sent over the modem, which consist of 1 or more macro rows, are not the same size either.
  • The result is variable length packets of video data which may be interleaved at the data channel with other types of data packets.
  • This section describes the software protocol between the host processor (PC) and the video encoder/decoder 103.
  • the first phase is concerned with downloading and starting the video encoder/decoder microcode. This phase is entered after the video encoder/decoder 103 is brought out of reset. This is referred to as the initialization phase.
  • the communication automatically transitions to the second phase, known as the run-time phase. Communication remains in the run-time phase until the video encoder/decoder 103 is reset. After the video encoder/decoder 103 is taken out of reset, it is back in the initialization phase.
  • the video encoder/decoder 103 is placed in the initialization phase by resetting the video encoder/decoder 103. This is accomplished by the following procedure:
  • After the last 16 bit word of the microcode is transferred to the video encoder/decoder 103, the video encoder/decoder 103 will automatically enter the run-time phase.
  • the run-time phase is signaled by the video encoder/decoder 103 when the video encoder/decoder 103 sends a status message to the host. All run-time data transmission between the Host and video encoder/decoder 103 is performed either through the 16 bit exchange register or RGB FIFO. All data sent from the host to the video encoder/decoder 103 is sent through the 16 bit exchange register.
  • data transmitted from the video encoder/decoder 103 to the host can be broken down into two categories.
  • One category is RGB image data and the other contains everything else.
  • the RGB FIFO is used exclusively to convert YUV data to RGB data and transmit that data in RGB format to the host. Since the RGB FIFO has no interrupt capability on the host side, coordination of the RGB data is accomplished by sending a message through the 16 bit exchange register. The host does not need to acknowledge reception of the RGB data since the FIFO half full flag is connected to one of the video encoder/decoder's master processor 102 inputs.
  • All data transmission, other than RGB FIFO data, will take place in the context of a packet protocol through the 16 bit exchange register.
  • the packets will contain a 16 bit header word and optionally some amount of additional data.
  • the packet header contains two fields, a packet type (4 bits) and the size in bits (12 bits) of the data for the packet (excluding the header), as shown in Figure 33.
  • the last bits are transferred in the following format. If there are 8 or fewer bits, the bits are left justified in the lower byte of the 16 bit word. If there are more than 8 bits, the first 8 bits are contained in the lower byte of the 16 bit word and the remaining bits are left justified in the upper byte of the 16 bit word. All unused bits in the 16 bit word are set to zero. This provides the host processor with a byte sequential data stream.
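  • The header layout and the last-word packing rule above can be sketched as follows; placing the packet type in the upper 4 bits of the header word is an assumption (Figure 33 gives the actual layout), and the helper assumes the remaining bits are already held most significant first with the unused bits zero.

        #include <stdint.h>

        uint16_t make_packet_header(unsigned type4, unsigned sizeBits12)
        {
            return (uint16_t)(((type4 & 0xFu) << 12) | (sizeBits12 & 0xFFFu));
        }

        /* Pack the final 1..16 bits of a packet into one 16-bit exchange-register word. */
        uint16_t pack_last_word(uint16_t bits, int count)
        {
            if (count <= 8)
                return (uint16_t)(bits >> 8);                 /* left justified in the lower byte  */
            uint16_t low  = (uint16_t)(bits >> 8);            /* first 8 bits go in the lower byte */
            uint16_t high = (uint16_t)((bits << 8) & 0xFF00); /* rest left justified, upper byte   */
            return (uint16_t)(low | high);
        }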
  • Figure 34 shows encoded data in C32 memory (Big Endian Format)
  • Figure 35 shows encoded data in PC memory for an original bit stream in the system
  • Figure 36 shows encoded data in PC memory for a video encoder/decoder system.
  • packet transfers occur in one or two phases depending on the packet. If the packet contains only the header, there is only one phase for the transfer. If the packet contains a header and data, there are two phases to the transfer. In both cases, the header is transferred first under CPU interrupt control (Phase 1). If there is data associated with the packet, the data is transferred under DMA control (Phase 2). Once the DMA is finished, CPU interrupts again control the transfer of the next header (Phase 1). The video encoder/decoder 103 transfers the data portion of packets under DMA control in one direction at a time.
  • the rules for transferring data are as follows. There are three cases. The first case is when the host is trying to send or receive a packet with no data. In this case there is no conflict. The second case is when the host is trying to send a packet (call it packet A) containing data to the video encoder/decoder 103. In this case, the host sends the header for packet A, then waits to send the data. While the host waits to send the data portion of packet A, it must check to see if the video encoder/decoder 103 is trying to send a packet to the host. If the host finds that the video encoder/decoder 103 is trying to send a packet to the host, the host must attempt to take that packet.
  • the third case is when the host is receiving a packet (call it packet B) from the video encoder/decoder 103.
  • the host first receives the header for packet B, then waits for the data. If the host was previously caught in between sending a header and the data of a packet (call it packet C) going to the video encoder/decoder 103, it must check to see if the video encoder/decoder 103 is ready to receive the data portion of packet C. If so the video encoder/decoder 103 must transfer the data portion of packet C before proceeding with packet B.
  • State one is idle, meaning no packet is currently being transferred.
  • State two is when the header has been transferred and the host is waiting for the first transfer of the data to occur.
  • State three is when the transfer of data is occurring.
  • If the interrupt routine is in state two for the receiver or the transmitter, it must be capable of switching between the receiver and transmitter portions of the interrupt routine based on transmit or receive ready (To Host Buffer Full and From Host Buffer Empty). Once in state three, the host can exclusively transfer data with only a video encoder/decoder 103 timeout loop.
  • Figure 37 is a flow diagram of a host interrupt routine that meets the above criteria. For clarity, it does not contain a video encoder/decoder 103 "died" time out. Time outs shown in the flow diagram should be set to a value that is reasonable based on video encoder/decoder 103 interrupt latency and on the total interrupt overhead of getting into and out of the interrupt routine (this includes system overhead).
  • the video encoder/decoder 103 enters the host interrupt routine at step 112 and proceeds to step 114, where the routine checks for a transmit ready state (To Host Buffer Full). If the system is not in the transmit ready state, the routine proceeds to step 116, where the system checks for a receive ready state (From Host Buffer Empty) and a packet to transfer. If both are not present, the routine proceeds to check for time out at step 118 and returns from the host interrupt routine, at step 120, if time is out. If time is not out at step 118, the routine returns to step 114. If the system is in the transmit ready state at step 114, the host interrupt routine proceeds to step 122, where the routine polls the receive state.
  • a transmit ready state To Host Buffer Full
  • If the receive state is "0" at step 122, then the routine reads a packet header and sets the receive state to "1" at step 124. The routine then proceeds to look for packet data at step 126 by again checking for a transmit ready state (To Host Buffer Full). If the buffer is full, the routine reads the packet data and resets the receive state to "0" at step 130, and then the routine proceeds to step 116. On the other hand, if the buffer is not full, the routine proceeds to step 128, where it checks for time out. If there is no time out, the routine returns to step 126, whereas if there is a time out at step 128 the routine proceeds to step 116. If the receive state is not "0" at step 122, then the routine proceeds directly to step 130, reads data, sets the receive state to "0" and then proceeds to step 116.
  • a transmit ready state To Host Buffer Full
  • At step 131, which is reached from step 116 when a receive ready state and a packet to transfer are present, the routine polls the transmit state. If the transmit state is "0" at step 131, the routine proceeds to step 132, where the routine writes the packet header and sets the transmit state to "1." The routine again checks whether the From Host Buffer is empty at step 133. If it is not, the routine checks for time out at step 134. If time is out at step 134, the routine returns to step 116, otherwise the routine returns to step 133.
  • At step 136, the routine writes the first data and sets the transmit state to "2."
  • At step 137, the routine checks for more data. If more data is present, the routine proceeds to step 139 to again check whether the From Host Buffer is empty. On the other hand, if no more data is present at step 137, the routine sets the transmit state to "0" at step 138 and then returns to step 116. If the From Host Buffer is empty at step 139, the routine proceeds to step 141, where it writes the rest of the data and resets the transmit state to "0," and the routine then returns to step 116.
  • the routine checks for time out at step 140 and returns to step 139 if time is not out. If time is out at step 140, the routine returns to step 116.
  • the routine proceeds to step 135, where the routine again polls the transmit state. If the transmit state is "1" at step 135, then the routine proceeds to write the first data and set the transmit state to "2," at step 136. If the transmit state is not "1" at step 135, then the routine proceeds to write the rest of the data and reset the transmit state to "0" at step 141. As noted above, the routine then returns to step 116 from step 141.
  • a control packet is depicted in Figure 38.
  • the control packet always contains at least 16 bits of data.
  • the data is considered to be composed of two parts, a control type (8 bits) and required control parameter (8 bits) and optional control parameters (always a multiple of 16 bits).
  • the control type and required parameter always follow the Control Packet header word. Any optional control parameters then follow the control type and required parameter.
  • a status request packet is depicted in Figure 39.
  • the status request packet is used to request that the video encoder/decoder 103 send a status packet.
  • the status packet contains the current state of the video encoder/decoder 103.
  • An encoded bit stream request packet is depicted in Figure 40.
  • the encoded bit stream request packet requests a specified number of bits of encoded bit stream data from the local encoder.
  • the video encoder/decoder 103 will respond by sending back N macro rows of encoded bit stream data (through the packet protocol). N is equal to the minimum number of macro rows that would contain a number of bits equal to or greater than the requested number of bits. Only one request may be outstanding at a given time.
  • the video encoder/decoder 103 will send each macro row as a separate packet. Macro rows are sent in response to a request as they are available. The requested size must be less than 4096 bits.
  • decoder bits packets are shown in Figures 41 and 42.
  • Encoded data is sent in blocks. There are two types of packets used to send these blocks. The packets about to be described are used to transmit encoded bit stream data to the local decoder. One type of packet is used to send less than 4096 bits of encoded data or to end the transmission of more than 4095 bits of encoded data. The other is used to start or continue the transmission of more than 4095 bits of encoded data.
  • the video encoder/decoder 103 will respond by sending back RGB data of the decoded bit stream after sufficient encoded bit stream bits have been sent to and decoded by the video encoder/decoder 103.
  • a decoder bits end packet is depicted in Figure 41 and a decoder bits start continue packet is depicted in Figure 42. If less than a multiple of 16 bits is contained in the packet, the last 16 bit word contains the bits left justified in the lower byte first, then left justified in the upper byte. Left over bits in the last 16 bit word are discarded by the decoder.
  • As the decoder decodes macro rows from the packet, it checks to see if there are fewer than eight bits left in the packet for a given macro row. If there are fewer than eight and they are zero, the decoder throws out the bits and starts decoding a macro row from the next packet. If the "left over bits" are not zero, the decoder signals a level 1 error. If a block contains fewer than 4096 bits, it will be transmitted with a single decoder bits end packet. If the block contains more than 4095 bits, it will be sent with multiple packets using the start/continue packet and terminated with the end packet. The start/continue packet must contain a multiple of 32 bits. One way to assure this is to send the bits in a loop: if the bits left to send number fewer than 4096, send the remaining bits with the end packet; otherwise send a multiple of 32 bits with the start/continue packet and loop (a sketch of this loop follows below).
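  • A hedged sketch of that transmit loop is shown below; the packet-sending helpers are hypothetical names, and 4064 is simply one convenient chunk size that is both a multiple of 32 and under the 4096-bit limit of the size field.

        #include <stdint.h>

        extern void send_decoder_bits_start_continue(const uint8_t *data, int nBits);
        extern void send_decoder_bits_end(const uint8_t *data, int nBits);

        void send_block_to_decoder(const uint8_t *data, int totalBits)
        {
            int offset = 0;                                /* offset in bits, kept byte aligned */
            while (totalBits - offset > 4095) {            /* more than one packet is needed    */
                send_decoder_bits_start_continue(data + offset / 8, 4064);
                offset += 4064;                            /* a multiple of 32 bits per rule    */
            }
            send_decoder_bits_end(data + offset / 8, totalBits - offset);  /* final end packet  */
        }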
  • the YUV data for encoder packet is used to send YUV data to the encoder. It is needed for testing the encoder. It would also be useful for encoding YUV data off-line.
  • a YUV frame contains 144 lines of 176 pixel/line Y data and 72 lines of 88 pixels/line of U and V data. Because the data size field in the packet header is only 12 bits, it will take multiple packets to transfer a single YUV image.
  • Y data must be sent 2 lines at a time. U&V data are sent together one line at a time (One line of U followed by one line of V).
  • All of the Y plane can be sent first (or last) or the Y and UV data may be interspersed (i.e., two lines of Y, one line of U&V, two lines of Y, one line of U&V, etc.).
  • a drop current RGB frame packet is depicted in Figure 44.
  • the drop current RGB frame packet is used to tell the video encoder/decoder 103 to drop the current RGB frame being transmitted to the host.
  • the video encoder/decoder 103 may have already put some of the frame into the FIFO, and it is the responsibility of the host to extract and throw away FIFO data for any FIFO ready packets received until the next top of frame packet is received.
  • a status packet is depicted in Figure 45.
  • the status packet contains the current status of the video encoder/decoder system. This packet is sent after the microcode is downloaded and started. It is also sent in response to the status request packet. The packet is formatted so that it can be used directly with the super control packet defined below in the control types and parameters section.
  • the microcode revision is divided into two fields: major revision and minor revision. The major revision is located in the upper nibble and the minor revision is located in the lower nibble.
  • a decoder error packet is depicted in Figure 46. The decoder error packet tells the host that the decoder detected an error in the bit stream.
  • the error level shown in Figure 46 is as follows: 1. Decoder bit stream error; the decoder will continue to decode the image. 2. Decoder bit stream error; the decoder has stopped decoding the input bit stream. The decoder will search the incoming bit stream for a still image picture start code.
  • a decoder acknowledge packet is depicted in Figure 47.
  • the decoder acknowledge packet is only sent when the video encoder/decoder 103 is in store/forward mode. It would be assumed that in this mode the bit stream to be decoded would be incoming from disk. Flow control would be necessary to maintain the specified frame rate and to not overrun the video encoder/decoder 103 computational capabilities.
  • the acknowledge packet would signal the host that it could send another packet of encoded bit stream data.
  • the encoded bits from encoder packets are used to send a macro row of encoded bits from the local encoder to the host. These packets are sent in response to the host sending the encoded bit stream request packet or placing the encoder into record mode.
  • As shown in Figures 48 and 49, there are two types of packets used to send these macro rows.
  • One type of packet is used to send less than 4096 bits of encoded macro row data or to end the transmission of more than 4095 bits of encoded macro row data. The other is used to start or continue the transmission of more than 4095 bits of encoded macro row data.
  • Figure 48 depicts encoded bits from encoder end packet.
  • Figure 49 depicts the encoded bits from encoder start/continue packet. If fewer than a multiple of 16 bits are contained in the packet, the last 16 bit word contains the bits left justified in the lower byte first, then left justified in the upper byte. Unused bits in the last 16 bit word are set to zero.
  • If a macro row contains fewer than 4096 bits, it will be transmitted with a single encoded bits from encoder end packet. If the macro row contains more than 4095 bits, it will be sent with multiple packets using the start/continue packet and terminated with the end packet. The start/continue packets will contain a multiple of 32 bits.
  • the encoded bits frame stamp packet is used to send the video frame reference number from the local encoder to the host. This packet is sent at the beginning of a new frame in response to the host sending the encoded bit stream request packet.
  • the frame reference is sent modulo 30.
  • the frame type is one if this is a still image or zero if a video image.
  • a top of RGB image packet, depicted in Figure 51, is sent to indicate that a new frame of video data will be transmitted through the RGB FIFO.
  • the large size is 176 columns x 144 rows (2 lines sent for every FIFO ready packet).
  • the small size is 44 columns x 36 rows (6 lines sent for every FIFO ready packet).
  • the frame number is the frame number retrieved from the decoded bit stream for the decoded image and the frame number from the video acquisition system for the selfview image. The number is the actual temporal reference modulo 30.
  • a FIFO ready packet is sent when there is a block of RGB data ready in the FIFO.
  • the block size is set by the type of video frame that is being transferred (see RGB top of frame packet).
  • a YUV acknowledge packet is sent to tell the host that it is ok to send another YUV frame.
  • Control types and parameters for the control packet (Host to video encoder/decoder 103). Each control type has a default setting. The default settings are in place after the microcode has been downloaded to the video encoder/decoder 103 and until they are changed by a control packet.
  • a control encoding packet is depicted in Figure 54.
  • This control is used to change the state ofthe encoding process (e.g. start, stop or restart).
  • Re-start is used to force a still image.
  • the default setting is encoding stopped and decoder in normal operation. Referring to Figure 54, for the encoder state parameter:
  • if bit 3 of the encoder state parameter is set to one, the encoder buffer is cleared before the new state is entered.
  • a frame rate divisor packet is depicted in Figure 55. This control is used to set the frame rate ofthe video capture system. The default is 15 frames per second. Referring to Figure 55, for the frame rate divisor parameter:
  • a post spatial filter packet is shown in Figure 57. This control is used to set the post spatial filter.
  • the filter weight parameter specifies the effect of the filter. A parameter value of zero provides no filtering effect. A value of 255 places most of the emphasis of the filter on adjacent pixels. The default value for this filter is 77.
  • the spatial filter is implemented as two 3 tap, one dimensional linear filters. The coefficients for these filters are calculated as follows:
  • a pre spatial filter packet is shown in Figure 58. This control is used to set the pre spatial filter.
  • a temporal filter packet is shown in Figure 59. This control is used to set the temporal filter for the encoder.
  • the filter weight parameter specifies the effect of the filter. A parameter value of zero provides no filtering effect. A value of 255 places most of the emphasis of the filter on the previous image.
  • the temporal filter is implemented as:
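  • The filter formula itself is not reproduced in this copy; the following per-pixel sketch is consistent with the weight semantics above (zero means no filtering, 255 places most of the emphasis on the previous image), but the actual fixed-point arithmetic in the microcode may differ.

        #include <stdint.h>

        uint8_t temporal_filter_pixel(uint8_t current, uint8_t previous, int weight /* 0..255 */)
        {
            /* weight 0 returns the current pixel unchanged; weight 255 leans on the previous image */
            return (uint8_t)(((256 - weight) * current + weight * previous + 128) >> 8);
        }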
  • a still image quality packet is shown in Figure 60. This control is used to set the quality level of the still image coder. A quality parameter of 255 indicates the highest quality still image coder and a quality parameter of zero indicates the lowest quality still image coder. There may not be 256 different quality levels. The nearest implemented quality level is selected if there is no implementation for the specified level. The default is 128 (intermediate quality still image coder).
  • a video mode packet is depicted in Figure 61. This control is used to set video input & output modes. The large and small selfview may not be specified at the same time. It is possible that the PC's or the video encoder/decoder's bandwidth will not support both a large and a small selfview image at the same time.
  • Video source is the host interface through YUV packets;
  • Video source is the video front end.
  • a video pan absolute packet is shown in Figure 62. This control is used to set electronic pan to an absolute value. Because the video encoder/decoder system is not utilizing the complete video input frame, one can pan electronically (select the part of the video image to capture). Panning is specified in 1/2 pixel increments.
  • the pan parameters are 8 bit signed numbers.
  • the Y axis resolution is 1 pixel.
  • the default pan is (0,0), which centers the captured image in the middle of the input video image.
  • the valid range for the x parameter is about -100 to +127.
  • the valid range for the y parameter is -96 to +96.
  • a brightness packet is shown in Figure 63.
  • the brightness packet controls the video front end 104 brightness setting.
  • the brightness value is a signed integer with a default value as zero.
  • a contrast packet is shown in Figure 64.
  • the contrast packet controls the video front end 104 contrast setting.
  • the contrast value is an unsigned integer with a default setting of 128.
  • a saturation packet is shown in Figure 65.
  • the saturation packet controls the video front end 104 saturation setting.
  • the saturation value is an unsigned integer with a default value of 128.
  • a control decoding packet is shown in Figure 68. This control is used to change the state of the decoding process (e.g. look for still image or normal decoding).
  • the default setting for the decoder is normal decoding.
  • In bit stream pass-through mode, the decoder gets bits from the local encoder. If bit 3 of the decoder state parameter is set to one, the decoder buffer is cleared before the new state is entered. The set motion tracking control is used to set the motion tracking state of the encoder.
  • This parameter is used to trade frame rate for better quality video frames.
  • a value of zero codes high quality frames at a slower rate with longer delay.
  • a value of 255 will track the motion best with the least delay, but will suffer poorer quality images when there is a lot of motion.
  • the default setting is 255.
  • a motion tracking packet is shown in Figure 69.
  • the request control setting packet is used to request the value of the specified control setting.
  • the video encoder/decoder 103 will respond by sending back a packet containing the requested value.
  • the packet will be formatted so that it could be used directly to set the control value.
  • a request control setting packet is shown in Figure 70.
  • An example of a request control setting packet is shown in Figure 71. If the PC sent the request control setting packet of Figure 71, the video encoder/decoder 103 would respond by sending back a frame rate divisor packet formatted as shown in Figure 72.
  • a request special status information packet is shown in Figure 73.
  • the request special status setting packet is used to request the value of the specified status setting.
  • the video encoder/decoder 103 will respond by sending back a packet containing the requested value. This will be used mainly for debugging.
  • Requests for the following information types will result in the video encoder/decoder 103 sending back a packet as shown in Figure 74, which depicts buffer fullness (type 0x81).
  • a request YUV frame (type 0x82) causes the video encoder/decoder 103 to send YUV frames back to the host through the packet protocol.
  • the frame is sent as a packet indicating the top of frame, a packet containing two lines of Y data, a packet containing one line of U data followed by one line of V data, and a packet indicating the end of the frame.
  • a YUV top of frame packet is depicted in Figure 75 and a Y data frame is depicted in Figure 76.
  • a UV data frame is depicted in Figure 77 and a YUV end of frame is depicted in Figure 78.
  • Mean Squared Error (MSE) Hardware. As previously noted, the preferred architecture can calculate multiple pixel distortion values simultaneously and in a pipelined fashion so that entire areas of the image may be calculated in very few clock cycles. Two accelerator algorithms may be used, the Mean Squared Error (MSE) and the Mean Absolute Error ("MAE"). In a preferred embodiment, the algorithm is optimized around the MSE calculations because minimizing the MSE yields the highest PSNR. The amount of quantization or other noise is expressed as the ratio of peak-to-peak signal to RMSE expressed in decibels (PSNR).
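  • For reference, the block metrics and the PSNR definition just mentioned can be written out as follows for an N-pixel block of 8-bit pixels; this is the plain reference formulation, not the accelerated DSP/hardware version discussed below.

        #include <math.h>
        #include <stdint.h>
        #include <stdlib.h>

        double mse(const uint8_t *a, const uint8_t *b, int n)
        {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                double d = (double)a[i] - (double)b[i];
                sum += d * d;                      /* squared pixel difference  */
            }
            return sum / n;
        }

        double mae(const uint8_t *a, const uint8_t *b, int n)
        {
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                sum += abs((int)a[i] - (int)b[i]); /* absolute pixel difference */
            return sum / n;
        }

        double psnr_db(double mseValue)
        {
            /* peak signal of 255 for 8-bit pixels; larger PSNR means less distortion */
            return 10.0 * log10((255.0 * 255.0) / mseValue);
        }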
  • PSNR decibels
  • DSP digital signal processor
  • the accelerator connects to the slave encoder memory bus and receives control from the slave DSP as shown in Figure 79.
  • the DSP fetches 32 bit words, 4 pixels, from memory and passes them to the accelerator.
  • the first mode, MSE/MAE calculations, is the most used mode. This mode may be with or without motion.
  • the second mode, scene change MSE/MAE calculation, is done once per field. This mode computes the MSE/MAE on every 8x8 block and stores the result in internal DSP RAM, compares the total error of a new field with the previous field's total error, and then software determines if still frame encoding is necessary.
  • the third mode, standard deviation calculation, calculates the standard deviation for each block to be encoded to measure block activity for setting the threshold level.
  • the fourth mode is a mean calculation. This is rarely computed, only when the algorithm is in the lowest part of the hierarchy.
  • a memory read takes one DSP clock, and therefore transfers four pixels in one clock.
  • MSE/MAE accelerator 142 requires four NCI (newly coded image) pixels from a NCI memory 144 and four PCI (previously coded image) pixels from a PCI memory 146 for processing by a DSP 148.
  • the DSP 148 is a Texas Instruments TMS320C31 floating point DSP. A fixed point DSP or other digital signal processor may also be used.
  • the DSP 148 reads the result from the accelerator 142 along the data bus 150, which takes one clock.
  • Figure 79 also includes a new image memory 152. If a motion search is necessary, then a second set of four PCI pixels must be read from memory. The accelerator stores the first set of PCI data and uses part of the second PCI data. The DSP instructs the accelerator how to process (MUX) the appropriate pixels.
  • the 8x8 blocks are calculated in the same way. Without motion, it takes 33 clocks plus pipeline delay, and with motion it takes 49 clocks plus pipeline delay.
  • For the software to compute an MSE, first it reads a 16x16 block of NCI and PCI data into internal memory. Then it takes 2 clocks per pixel for either the MSE or MAE calculation, or a total of 128 clocks for an 8x8 block, or 32 clocks for a 4x4 block. For comparison purposes, only the actual DSP compute time of 128 or 32 clocks is compared.
  • the hardware is not burdened with the overhead associated with transferring the 16x16 blocks, therefore the actual hardware speed up improvement is greater than the value calculated.
  • Other overhead operations not included in the calculations are: instruction cycles to enter a subroutine, instructions to exit a routine, time to communicate with the hardware, etc.
  • Figure 80 is a table that represents the formulas for calculating speed improvements.
  • the expression represents the ratio of the total time taken by software to the time for implementing the calculations in hardware.
  • Figure 81 shows the speed improvement for various pipeline delay lengths, assuming a motion search is required in a 3:1 ratio and that 8x8 blocks are required 3:1 compared to 4x4 blocks in the overall section.
  • PLDs programmable logic devices
  • a preferred embodiment employs, for example, Altera 8000 Series, Actel ACT 2 FPGAs, or Xilinx XC4000 Series.
  • Figure 82 depicts a mean absolute (MAE) accelerator implementation in accordance with the equation shown in the Figure.
  • the input is 4 newly coded image pixels (NCI), which is 32 bits, and, as shown at 164, 4 previously coded image pixels (PCI).
  • NCI newly coded image pixels
  • PCI previously coded image pixels
  • the system further requires an additional 4 PCI pixels.
  • the PCI pixels are multiplexed through an 8 to 4 bit multiplexer (MUX), which is controlled by a DSP (not shown).
  • MUX 8 to 4 bit multiplexer
  • a group of adders or summers adds the newly encoded pixels and the previously encoded pixels and generates 9 bit output words.
  • the absolute value of these words, generated by removing the sign bit from the 9 bit word resulting in an 8 bit word, is then input into a summer 174.
  • Figure 84 illustrates a mean absolute error (MAE) implementation for calculating the MAE of up to four pixels at a time with pixel interpolation for sub-pixel resolution, corresponding to the equation at the top of Figure 84.
  • the implementation includes a 32 bit input.
  • four non-interpolated pixels are transmitted via a pixel pipe to a series of summers or adders 204.
  • four previously encoded pixels are input into an 8 to 5 multiplexer (MUX) 208.
  • MUX multiplexer
  • the output of the MUX 208 is horizontally interpolated with a series of summers or adders 210 and the output 9 bit words are manipulated as shown at 212 by the DSP (not shown).
  • the resulting output is transmitted to a 4 pixel pipe 214 and is vertically interpolated with a series of summers or adders, as seen at 216.
  • the absolute value of the output of the adders 204 is input to adders 218, the output of which is provided to an accumulating adder.
  • the implementation includes a 32 bit input. As seen at 222, four non-interpolated pixels are transmitted via a pixel pipe to a series of summers or adders 224. At 226, four previously encoded pixels are input into an 8 to 5 multiplexer (MUX) 228. The output of the MUX 228 is horizontally interpolated with a series of summers or adders 230 and the output 9 bit words are manipulated as shown at 232 under the control of the DSP (not shown).
  • MUX multiplexer
  • the resulting output is transmitted to a 4 pixel pipe 234 and is vertically interpolated with a series of summers or adders, as seen at 236.
  • the squared value of the output of the adders 224 is input to adders 238, the output of which is provided to adder 240.
  • the adder 240 accumulates the appropriate number of bits depending on the mode of operation.
  • Video Teleconferencing: Video teleconferencing allows people to share voice, data, and video simultaneously.
  • the video teleconferencing product is composed of three major functions: video compression, audio compression, and a high speed modem.
  • the video compression function uses the color space conversion to transform the video from the native YUV color space to the host RGB display format.
  • a color space is a mathematical representation of a set of colors such as RGB and YUV.
  • the red, green and blue (RGB) color space is widely used throughout computer graphics and imaging.
  • the RGB signals are generated from cameras and are used to drive the guns of a picture tube.
  • the YUV color space is the basic color space used by the NTSC (National Television Standards Committee) composite color video standard.
  • the video intensity (“luminance”) is represented as Y information while the color information (“chrominance”) is represented by two orthogonal vectors, U and V.
  • Compression systems typically work with the YUV system because the data is compressed from three wide bandwidth (RGB) signals down to one wide bandwidth (Y) and two narrow bandwidth signals (UV).
  • RGB wide bandwidth
  • Y wide bandwidth
  • UV narrow bandwidth signals
  • the transformation from YUV to RGB is merely a linear remapping of the original signal in the YUV coordinate system to the RGB coordinate system.
  • the YUV to RGB matrix equations use two-bit coefficients which can be implemented with bit-shifts and adds.
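  • As a hedged illustration of such a shift-and-add mapping, the sketch below uses the coefficients implied by Figures 87-90 (R = Y + 1.5V, B = Y + 1.75U, G = Y - 0.375U - 0.75V, with the chrominance bias of 128 removed first); the exact rounding, bit widths, and limiting used by the actual hardware may differ, and arithmetic right shifts are assumed for the signed chroma terms.

    /* clamp a result into the 8 bit output range */
    static unsigned char clamp8(int x)
    {
        return (unsigned char)(x < 0 ? 0 : (x > 255 ? 255 : x));
    }

    void yuv_to_rgb(int y, int u, int v,
                    unsigned char *r, unsigned char *g, unsigned char *b)
    {
        u -= 128;                                  /* remove chroma bias    */
        v -= 128;

        *r = clamp8(y + v + (v >> 1));             /* Y + 1.5V              */
        *b = clamp8(y + (u << 1) - (u >> 2));      /* Y + 2U - 0.25U        */
        *g = clamp8(y - (u >> 2) - (u >> 3)        /* Y - 0.25U - 0.125U    */
                      - (v >> 1) - (v >> 2));      /*   - 0.5V  - 0.25V     */
    }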
  • High speed adder structures, carry save and carry select, are used to decrease propagation delay as compared to a standard ripple carry architecture. The two types of high speed adders are described below.
  • Carry save adders reduce three input addends down to a sum and carry for each bit.
  • Let Ci be the carry output of the ith bit position and Si be the sum output of the ith position. Ci and Si are defined as shown in Figure 86. A sum and carry term is generated with every bit.
  • the sum and carry terms must be added in another stage such as a ripple adder, carry select adder, or any other full adder structure. Also note that the carry bus must be left shifted before completing the result. For instance, when the LSB bits are added, sum S0 and carry C0 are produced. When completing the result, carry C0 is added to sum S1 in the next stage, thus acting like a left bus shift.
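  • A minimal sketch of one carry-save stage is shown below, assuming the usual per-bit definitions (Si = ai xor bi xor ci, Ci = majority of ai, bi, ci, with the carry word shifted left before the final add); the final full-adder stage is modeled here simply with ordinary addition.

    /* one carry-save stage: reduce three addends to a sum word and a
     * carry word; the carry word is already left shifted by one bit */
    typedef struct { unsigned sum; unsigned carry; } csa_t;

    static csa_t carry_save(unsigned a, unsigned b, unsigned c)
    {
        csa_t out;
        out.sum   = a ^ b ^ c;                          /* Si per bit      */
        out.carry = ((a & b) | (a & c) | (b & c)) << 1; /* Ci, left shift  */
        return out;
    }

    /* the result is then completed with any full adder structure */
    static unsigned csa_add3(unsigned a, unsigned b, unsigned c)
    {
        csa_t t = carry_save(a, b, c);
        return t.sum + t.carry;
    }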
  • Figure 87 is a block diagram of a color space converter that converts a YUV signal into RGB format.
  • the R component is formed by coupling a 1V, a 0.5V and a 1Y signal line to the inputs of a carry save adder 242, as shown in Figure 87.
  • the carry and sum outputs from the carry save adder 242 are then coupled to the inputs of a full adder 243.
  • the sum output of the full adder 243 is then coupled to an overflow/underflow detector 244, which hard limits the R data range.
  • the B component of the RGB formatted signal is formed by coupling a 2U, a -0.25U and a 1Y signal line to the inputs of a carry save adder 245.
  • the carry and sum outputs of the carry save adder 245 are coupled to the inputs of a full adder 246, whose sum output is coupled to an overflow-underflow detector 247.
  • the overflow-underflow detector 247 hard limits the B data range.
  • the G component of the RGB formatted signal is formed by coupling a -0.25U, a -0.125U and a -0.5V signal line to the inputs of a carry save adder 248.
  • the carry and sum outputs of the carry save adder 248 are coupled, along with a -0.25V signal line, to the inputs of a second carry save adder 249.
  • the carry and sum outputs of the carry save adder 249 are coupled, along with a 1Y signal line, to the inputs of a third carry save adder 250.
  • the carry and sum outputs of the third carry save adder 250 are coupled to the inputs of a full adder 251, as shown in Figure 87.
  • the sum output of the full adder 251 is coupled to an overflow/underflow detector 252, which hard limits the data range of the G component.
  • the Y signal has eight data bits and the U and V signals have seven data bits and a trailing zero bit. The U and V signals may therefore be processed by subtracting 128.
  • Each of the adders shown in Figure 87 is preferably, therefore, a 10 bit adder.
  • Figures 88, 89 and 90 depict carry select adder configurations and related bit tables for converting the YUV signal to RGB format.
  • Figure 88 includes a carry select adder 253 coupled to a full adder 254 that generate the R component of the RGB signal.
  • Figure 89 includes a carry select adder 255 coupled to a full adder 256 that generate the B component of the RGB signal.
  • Figure 90 includes carry select adders 257, 258, 259 and a full adder 260 that generate the G component of the RGB signal.
  • Figure 91 is a block diagram timing model for a YUV to RGB color space converter.
  • a FIFO 262 is coupled by a latch 264 to a state machine 266, such as the Xilinx XC3080-7PC84 field programmable gate array.
  • the state machine 266 includes an RGB converter 272 that converts a YUV signal into a RGB formatted signal.
  • the converter 272 is preferably a ten bit carry select adder having three levels of logic (as shown in Figures 87-90).
  • the RGB formatted data is coupled from an output of the state machine 266 into a buffer 270.
  • the state machine 266 is a memoryless state machine that simultaneously computes the YUV-to-RGB conversion while controlling the flow of data to and from it.
  • the carry select adder, as implemented in the Xilinx FPGA, adds two bits together and generates either a carry or a sum.
  • Carry select adders are very fast because the propagation delays and logic levels can be kept very low. For instance, a ripple carry adder of 10 bits requires 10 levels of logic to compute a result. A 10 bit carry select adder can be realized in only three levels of logic.
  • Figure 92 is a state diagram relating to the conversion of the YUV signal to RGB format in accordance with the timing model shown in Figure 91.
  • the output color space converter shown in Figure 91 is in-circuit configurable.
  • the application software queries the platform to determine what color space it accepts and then reconfigures the color space converter appropriately.
  • YUV-to-RGB 565 color space conversion is performed.
  • the FPGA shown in Figure 91 may perform RGB color space conversions to different bit resolutions or even to other color spaces.
  • motion estimation is commonly performed between a current image and a previous image (commonly called the reference image) using an integer pixel grid for both image planes. More advanced techniques use a grid that has been upsampled by a factor of two in both the horizontal and vertical dimensions for the reference image to produce a more accurate motion estimation. This technique is referred to as half-pixel motion estimation.
  • the motion estimation is typically performed on 8x8 image blocks or on 16x16 image blocks (macro blocks), but various image block sizes may be used.
  • Assume that 16x16 image blocks (macro blocks), as shown in Figure 93, are used with a search range of approximately +/-16 pixels in the horizontal and vertical directions, and that the image sizes are QCIF (176 pixels horizontally by 144 pixels vertically), although other image sizes could be used. Further assume that each pixel is represented as one byte of data, and that motion searches are limited to the actual picture area.
  • a 16 pixel wide vertical slice is referred to as a macro column and a 16 pixel wide horizontal slice is referred to as a macro row.
  • An interpolated macro row of reference image data is 176x2 bytes wide and 16x2 bytes high.
  • Figure 94 shows a macro block interpolated by a factor two horizontally and vertically.
  • This interpolated reference image memory space is best thought of as being divided up into thirds, with the middle third corresponding to roughly the same macro row (but interpolated up in size) position as the macro block being motion estimated for the current image.
  • the motion estimation using this memory arrangement can be implemented as follows:
  • (interpolated macro row 2 is copied into the memory holding interpolated macro row 1).
  • Figure 96 depicts the memory space after this step of the first iteration of the first method.
  • This first method requires that additional time and memory bandwidth be spent performing the copies, but saves memory and computational power.
  • the second method, like the first method, involves using a memory space equivalent to three of the macro rows interpolated up by a factor of two in both the horizontal and the vertical directions. Unlike the first method, the second method uses modulo arithmetic and pointers to memory to replace the block copies of the first method. The second method is better than the first method from a memory bandwidth point of view, and offers the memory and computational power savings of the first method.
  • Three macro rows of interpolated reference image data correspond to 96 horizontal rows of pixels. If, as in the first method, the memory space of the interpolated reference image is divided up into thirds, then the middle third of the memory space begins at the 33rd row of horizontal image data. A pointer is used to locate the middle third of the memory space. Assuming that the number system starts at 0, the pointer will indicate 32.
  • the motion estimation of the second method using this memory arrangement can be implemented as follows:
  • Figure 97 depicts memory space addressing for three successive iterations of the second method.
  • the above two methods can be extended to cover extended motion vector data (+/-32 pixel searches instead of just the +/-16 pixel searches described previously) by using a memory space covering five interpolated macro rows of reference image data.
  • This larger memory space requirement will reduce the interpolated reference image memory requirement by approximately 45% instead of the 67% savings that using only three interpolated macro rows of reference image data gives.
  • the second method uses modulo 160 arithmetic.
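  • A hedged sketch of this pointer/modulo addressing is shown below; the row counts follow from the dimensions given above (32 interpolated rows per macro row, so 96 rows for three macro rows or 160 for five), while the variable names and buffer layout are assumptions for illustration.

    /* Hedged sketch of the second method: the interpolated reference rows
     * live in a circular buffer, and a base pointer advances by one
     * interpolated macro row per iteration instead of copying memory. */
    #define ROWS_PER_IMACROW  32            /* 16 rows interpolated up by 2  */
    #define N_ROWS            96            /* three interpolated macro rows */
    #define ROW_BYTES        (176 * 2)      /* QCIF width interpolated up    */

    static unsigned char ref[N_ROWS][ROW_BYTES];
    static int base_row = ROWS_PER_IMACROW; /* middle third starts at row 32 */

    /* map a row offset (relative to the macro row being searched, possibly
     * negative) to its physical storage row */
    static unsigned char *ref_row(int rel_row)
    {
        int phys = (base_row + rel_row) % N_ROWS;
        if (phys < 0)
            phys += N_ROWS;
        return ref[phys];
    }

    /* after each macro row is motion estimated, slide the window */
    static void advance_macro_row(void)
    {
        base_row = (base_row + ROWS_PER_IMACROW) % N_ROWS;
    }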
  • Panning and Zooming: The traditional means of implementing a pan of a video scene is to move the camera and lens with motors.
  • the hardware in a preferred embodiment uses an electronic means to accomplish the same effect.
  • the image scanned by the camera is larger than the image required for processing.
  • An FPGA crops the camera image to remove any unnecessary data.
  • the remaining desired picture area is sent to the processing device for encoding.
  • the position of the desired image can be selected and thus "moved" about the scanned camera image, and therefore has the effect of "panning the camera."
  • the software tells the input acquisition system to change the horizontal and vertical offsets that it uses to acquire the image into memory.
  • the application software provides the users with a wire frame diagram showing the total frame with a smaller wire frame inside of it which the user positions to perform the panning operation.
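  • The routine below is a hedged software model of this electronic pan: it copies a window of the larger camera image, located by the horizontal and vertical offsets, into the buffer handed to the encoder; the names and single-plane layout are illustrative assumptions.

    /* crop a pannable window out of the larger scanned camera image */
    void crop_pan(const unsigned char *camera, int cam_w, int cam_h,
                  unsigned char *coded, int out_w, int out_h,
                  int pan_x, int pan_y)
    {
        int r, c;

        /* keep the window inside the scanned camera image */
        if (pan_x < 0) pan_x = 0;
        if (pan_y < 0) pan_y = 0;
        if (pan_x > cam_w - out_w) pan_x = cam_w - out_w;
        if (pan_y > cam_h - out_h) pan_y = cam_h - out_h;

        for (r = 0; r < out_h; r++)
            for (c = 0; c < out_w; c++)
                coded[r * out_w + c] =
                    camera[(r + pan_y) * cam_w + (c + pan_x)];
    }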
  • Traditionally, a zoom in a video scene is accomplished by moving the camera and lens with motors.
  • electronics are used to accomplish the same effect.
  • the image scanned by the camera is larger than the image required for processing.
  • the video encoder/decoder 103 may upsample or subsample the acquired data to provide zoom in/zoom out effects in a video conferencing environment. For example, every pixel in a given image area may be used for the most zoomed in view. A zoom out effect may then be achieved by decimating or subsampling the image capture memory containing the pixel data.
  • These sampling techniques may be implemented in software or hardware.
  • a hardware implementation is to use sample rate converters (interpolating and decimating filters) in the master digital signal processor or in an external IC or ASIC.
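  • A minimal sketch of a zoom-out by decimation of the capture memory is given below; a practical sample rate converter would also low-pass filter before decimating, and the names, origin offsets, and layout here are assumptions.

    /* zoom out by keeping one pixel in every 'factor' pixels of the
     * capture memory; factor == 1 corresponds to the most zoomed in view.
     * The caller is assumed to keep the sampled window inside the capture. */
    void zoom_out(const unsigned char *capture, int cap_w,
                  unsigned char *out, int out_w, int out_h,
                  int ox, int oy, int factor)
    {
        int r, c;
        for (r = 0; r < out_h; r++)
            for (c = 0; c < out_w; c++)
                out[r * out_w + c] =
                    capture[(oy + r * factor) * cap_w + (ox + c * factor)];
    }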
  • a video conferencing system employing the invention may also be used to send high quality still images for viewing and recording to remote sites.
  • the creation of this still can be initiated through the application software by either the viewing or the sending party.
  • the application then captures the image, compresses it, and transmits it to the viewer over the video channel and data channel for maximum speed.
  • the still image can be compressed by one ofthe processing devices while the main video conferencing application continues to run.
  • the still is encoded by one of two methods, depending upon user preference:
  • i.) an HVQC still is sent, or ii.) a differential pulse code modulation (DPCM) technique using a Lloyd-Max quantizer (statistical quantizer) with Huffman encoding is used.
  • DPCM differential pulse code modulation
  • Lloyd-Max quantizer statistical quantizer
  • Other techniques that could be used include JPEG, wavelets, etc.
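  • As a hedged illustration of the DPCM option, the sketch below predicts each pixel from its reconstructed left neighbour and quantizes the prediction error; a uniform quantizer stands in here for the Lloyd-Max quantizer (which would be designed from the error statistics), and the resulting symbols would then be Huffman coded. The names and the initial predictor value are assumptions.

    /* DPCM encode one image row; 'symbols' receives one quantized
     * prediction error per pixel, ready for entropy (Huffman) coding. */
    void dpcm_encode_row(const unsigned char *row, int width,
                         int step, int *symbols)
    {
        int c, pred = 128;                      /* assumed initial predictor */

        for (c = 0; c < width; c++) {
            int err = (int)row[c] - pred;
            /* uniform quantizer with rounding; a Lloyd-Max quantizer
             * would replace this step in the described system */
            int q   = (err >= 0 ? err + step / 2 : err - step / 2) / step;
            int rec = pred + q * step;          /* decoder reconstruction    */

            if (rec < 0)   rec = 0;
            if (rec > 255) rec = 255;

            symbols[c] = q;                     /* symbol passed to Huffman  */
            pred = rec;                         /* predict from reconstruction */
        }
    }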
  • the MD is used to actually do the stills encoding as a background task and then sends the still data with a special header to tell the receiving codec what the data is. The receiver's decoder would then decode the still on its end.
  • Stills and video conferencing frames are presented to the user in separate application windows.
  • the stills can then be stored to files on the user's system, which can then be called up later by a software or hardware decoder.
  • a video sequence may be transmitted over an ordinary modem data connection, either with another party or parties, or with an on-line service.
  • the modems exchange information to decide whether there is to be a two-way (or more) video conference, a single ended video conference, or no video conference.
  • the modem may then tell the application software to automatically start up the video conference when appropriate.
  • the system described herein provides the users the capability of starting an application at any point during their conversation to launch a data and/or video conference.
  • the video encoder/decoder 103 described herein may be used with a telephone, an audio compression engine and a modem to produce a video answering machine.
  • a DRAM may be used to store several minutes of data (multiple audio/video messages).
  • the answering machine may play an audio/video message to the caller and may then record a like image.
  • the caller may choose to send either a video sequence or a still image, such as by pushing appropriate buttons on the telephone.
  • the caller may be given the option of reviewing the video sequence or still image by preventing the encoding and transmission of the video until the caller presses an appropriate button to signal approval.
  • the input image will be two lines longer than the coded image. This allows us to filter the image with no edge effect on the top and bottom of the image. Side edge effects are taken care of by the filter function.
  • 1656 imgs.uc ⁇ (Utong ⁇ gmat ⁇ te( 0. dim.yc-1. 0, (dim -c»2)-1. eiz ⁇ of (Utong) ); 1656 img ⁇ .vc - (Uicng**t ⁇ iMt ⁇ i ⁇ ( 0, dim.yc-1 , 0. (dh ⁇ u(c>>2)-1 , ⁇ eoffUbng) );
  • MS 10983 words (slave MD communication and packed Y, U, and V planes rec'd from master)
  • This file contains pseudo code/flow charts of how the decoder may handle bit errors.
  • the bit errors may be missing macro-row packets of video data, or they may be undetected bit errors that the supervisor does not find.
  • i code a similar ofthe encoder This is done to make it the
  • Appendix 2:2 if (bit_error_flag == NO_ERROR) send the decoded data to the PC for display;
  • Appendix 2:4 if (there are macro-rows that have not been decoded)
  • Routine that is called when something unexpected appears in the bit-stream that is not recoverable.
  • the routine for refreshing the image.
  • the routine for decoding a macro-row of video encoded data.
  • the routine for decoding a macro-row of still encoded data.
  • This routine decodes a 16x16 block using the different 16x16 coders.
  • decoder16(): hitFlag = read a bit from the bit stream; if (hitFlag)
  • Appendix 2:10 put picture header information into the bit stream; video_frame_encoder(); if (there are extra bits left over) { refresh_encoder();
  • the routine for encoding a frame of video image.
  • Appendix 2:11 still_frame_encoder
  • the routine for encoding one 4x4 block of a still image.
  • the routine for refreshing a fraction of an image frame to avoid buffer underflow.
  • Appendix 2:13 int hierarchical_encoder() { local_hit16_flag = encoder16(one 16x16 block); if (!local_hit16_flag)
  • the routine for encoding one 16x16 block. It returns the status of the encoding process (hit/miss flag).
  • { encoder16 fails to encode this 16x16 block; return (FALSE);
  • the routine for encoding one 8x8 block. It returns the status of the encoding process (hit/miss flag).
  • { /* block_match_test can be full-search or 3-step search */ found a motion vector that passes the threshold test; update the cache content;
  • Appendix 2:16 at this point, it is certain that the luminance component can be coded with the motion vector, but we need to test the color components with the same motion vector; code the 8x8 block with the resulting motion vector (MV);
  • the routine for encoding one 4x4 block. It always returns TRUE.
  • int encoder4() { if ( MOTION_CACHE4 == motion_cache_test(4x4) ) { found a motion vector in the cache that passes the threshold test;
  • /* block_match_test can be full-search or 3-step search */ found a motion vector that passes the threshold test;
  • Appendix 2:18 found a motion vector in the cache that passes the threshold test; update the cache content;
  • encoder16 fails to encode this 16x16 block; return (FALSE);
  • Appendix 2:19 encoder8()
  • the routine for encoding one 8x8 block. It returns the status of the encoding process (hit/miss flag).
  • /* block_match_test can be full-search or 3-step search */ found a motion vector that passes the threshold test; update the cache content;
  • Appendix 2:20 /* COLOR components: both U and V have block size of 4x4 */ if (U and V share the MV of Y)
  • the routine for encoding one 4x4 block. It always returns TRUE.
  • int encoder4() { if ( MOTION_CACHE4 == motion_cache_test(4x4) ) { found a motion vector in the cache that passes the threshold test; update the cache; } else if ( BLOCK_MATCH4 == block_match_test(4x4) ) { /* block_match_test can be full-search or 3-step search */ found a motion vector that passes the threshold test;
  • Appendix 2:22 /* This is pseudo code for the HVQC encoder. It outlines the basic steps involved in the encoding process without going into the actual HVQC subroutines.
  • the capacity, or bits per frame, is calculated from the bits/sec and frames/sec.
  • there is a divide by 2 because there are two processors, each of which gets half the bits. Therefore, each of the capacity multiples in the following section has an additional multiplication factor of 2 (i.e., the ranges are 1.5 to 2.5, 2.5 to 3.5, and greater than 3.5).
  • the -1.0 is to downward bias the number of frames to be skipped.
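  • A minimal sketch of this capacity calculation is given below, under the stated assumptions (two processors sharing the bits, a -1.0 bias on the number of frames to skip); the exact skip formula and the names are illustrative, not the Appendix 3 code.

    #include <math.h>

    /* bits per frame per processor, and a biased estimate of how many
     * camera frames to skip after an expensive frame */
    void compute_capacity(double bits_per_sec, double frames_per_sec,
                          double bits_last_frame,
                          double *capacity, int *frames_to_skip)
    {
        *capacity = bits_per_sec / frames_per_sec / 2.0;   /* per processor */

        /* skip enough frames to pay for the frame just encoded,
         * biased downward by 1.0 as noted above */
        *frames_to_skip = (int)ceil(bits_last_frame / *capacity - 1.0);
        if (*frames_to_skip < 0)
            *frames_to_skip = 0;
    }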
  • Appendix 3:1 /* Compute the capacity or bits per frame from the bits/sec and frames/sec. */
  • Appendix 3:2 /* encode image as a still image */ cumulative_refresh_bits = 0; /* set back to use vq refresh */
  • Reset the adaptable boundary to be an 80/64 split.
  • the parameter passed here should be the same as the one in initEncoderState.
  • Appendix 3:3 /* the two slave encoders, se1 & se2, are set with compile switches during the make process to be a MASTER and a SLAVE. Do not confuse this with the master decoder and the slave encoder terminology */ #if MASTER /* i.e., se1 */
  • the boundary can be slowly INCREASED back to center (i.e., 5 macroRows for the master and 4 for the slave).
  • the boundary can be slowly DECREASED back to center (i.e., 5 macroRows for the master and 4 for the slave). */ /* Otherwise, don't adapt the boundaries */
  • Appendix 3:5 } /* if cumulative_refresh_bits */
  • the individual processor must know the cumulative bits of the other processor.
  • the boundary can be slowly DECREASED back to the center (i.e., 5 macroRows for the master and 4 for the slave).
  • Appendix 3:8 alpha = 0.2; beta = 1.0 - alpha; #endif timvideo = get_time(); /* start timer to see how long encode takes */ video_encode( coder, &mac_hdr ); /* call subroutine to encode the video */
  • The amount we could refresh is determined by the difference between the allowed bits per frame and the number of bits we used to encode the frame.
  • Title: Compute threshold, std, and mean of an image block. float computeMseTx2d( uchar **img, int r, int c, VEC_SIZ *wrd, float gamma, float alpha, float beta, float *mean, float *std ) img is the image from which std and mean will be extracted.
  • r and c are the image indices
  • wrd specifies the image block size; mean and std are the returned values. Returned Value: the threshold.
  • This routine computes the threshold to encode a given image block. This threshold is a function of the standard deviation of the image block and a number of other variables (see the titr.h section later in this file for a detailed description of these variables).
  • This routine is called by the still encoder and by the video encoder for all levels of the hierarchy.
  • computeMseTx2d: compute the threshold for each block being encoded */ float computeMseTx2d( uchar **img, int r, int c, VEC_SIZ *wrd, float gamma, float alpha, float beta, float *mean, float *std )
  • void intra_adapt( int tRate, int blockCount ) tRate is the cumulative bit-rate to encode a fractional frame. This value is reset to 0 after each frame.
  • blockCount is the total number of blocks encoded in a frame. This value is also reset after each frame.
  • This routine adapts the thresholds for encoding a video frame based on the buffer fullness, defined as <buffer_mode> in this program.
  • Appendix 4:2 (1) Compute buffer fullness <buffer_mode>:
  • the buffer fullness is based on three parameters:
  • the EFFECTIVE buffer fullness is computed based on 5 cases: case 1: based on bits only, with a focus switch (motion tracking) for trading quality and frame-rate; case 2: based on max of bits and time; case 3: based on sum of bits and time; case 4: based on time only
  • the default is case 1.
  • Appendix 4:3 bgth16[0,1,2] are used to encode motion and background blocks at different levels of the coder, whereas s_m4[0,1,2] are used for encoding the 4x4 spatial background blocks.
  • s_m4[0,1,2] are used for encoding the 4x4 spatial background blocks.
  • Based on <buffer_mode> there are three cases: case 1: Small threshold: buffer_mode < fullness_cutoff (this value is typically 0.75); find a threshold between bgth16[0] and bgth16[1]. case 2: Large threshold: fullness_cutoff < buffer_mode < overflow; find a threshold between bgth16[1] and bgth16[2].
  • Appendix 4:4 /* Find buffer fullness */ modifiedBufferFullness = BufferFullness; modifiedBufferFullness = BufferFullness + rd.yuvRate;
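  • The function below is a hedged sketch of this threshold adaptation: it interpolates linearly between bgth16[0] and bgth16[1] while the buffer is below fullness_cutoff, and between bgth16[1] and bgth16[2] as it approaches overflow; the linear mapping and names are assumptions, not the actual Appendix 4 code.

    /* choose an encoding threshold from the effective buffer fullness */
    float adapt_threshold(float buffer_mode,       /* effective fullness */
                          float fullness_cutoff,   /* typically 0.75     */
                          float overflow,
                          const float bgth16[3])
    {
        float t;

        if (buffer_mode < fullness_cutoff) {       /* case 1: small threshold */
            t = buffer_mode / fullness_cutoff;
            return bgth16[0] + t * (bgth16[1] - bgth16[0]);
        }
        if (buffer_mode < overflow) {              /* case 2: large threshold */
            t = (buffer_mode - fullness_cutoff) / (overflow - fullness_cutoff);
            return bgth16[1] + t * (bgth16[2] - bgth16[1]);
        }
        return bgth16[2];                          /* at or past overflow     */
    }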

Abstract

A method and an apparatus for encoding an image signal. The apparatus includes an acquisition module disposed to receive the image signal. A first processor is coupled to the acquisition module. At least one encoder processor is coupled to the first processor. The at least one encoder processor produces an encoded image signal under control of the first processor. The method includes the steps of converting an input image signal into a predetermined digital format and transferring the digital format image signal to at least one encoder processor. The method further includes the step of applying, at the at least one encoder processor, a hierarchical vector quantization compression algorithm to the digitized image signal. At the next step, a resultant encoded bit stream generated by the application of the algorithm is collected. The method and apparatus of the present invention may be used in conjunction with an ordinary modem to transmit and/or receive audio, video sequences, or still images.

Description

VIDEO ENCODER/DECODER
BACKGROUND OF THE INVENTION Field of the Invention
The present invention generally relates to digital data compression and encoding and decoding of signals. More particularly, the present invention relates to methods and apparatus for encoding and decoding digitized video signals.
BACKGROUND
The development of digital data compression techniques for compressing visual information is very significant due to the high demand for numerous new visual applications. These new visual applications include, for example, television transmission including high definition television transmission, facsimile transmission, teleconferencing and video conferencing, digital broadcasting, digital storage and recording, multimedia PC, and videophones.
Generally, digital channel capacity is the most important parameter in a digital transmission system because it limits the amount of data to be transmitted in a given time. In many applications, the transmission process requires a very effective source encoding technique to overcome this limitation. Moreover, the major issue in video source encoding is usually the tradeoff between encoder cost and the amount of compression that is required for a given channel capacity. The encoder cost usually relates directly to the computational complexity of the encoder. Another significant issue is whether the degradation of the reconstructed signal can be tolerated for a particular application
As described in U.S. Patent No. 5,444,489, a common objective of all source encoding techniques is to reduce the bit rate of some underlying source signal for more efficient transmission and/or storage. The source signals of interest are usually in digital form. Examples of these are digitized speech samples, image pixels, and a sequence of images.
Source encoding techniques can be classified as either lossless or lossy. In lossless encoding techniques, the reconstructed signal is an exact replica of the original signal, whereas in lossy encoding techniques, some distortion is introduced into the reconstructed signal, which distortion can be tolerated in many applications. Almost all the video source encoding techniques achieve compression by exploiting both the spatial and temporal redundancies (correlation) inherent in the visual source signals. Numerous source encoding techniques have been developed over the last few decades for encoding both speech waveforms and image sequences. Consider, for example, W.K. Pratt, Digital Image Processing, N.Y.: John Wiley & Sons, 1978; N.S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Englewood Cliffs, N.J.: Prentice-Hall, 1984; A.N. Netravali and G.B. Haskell, Digital Pictures: Representation and Compression, N.Y.: Plenum Press, 1988. Pulse code modulation (PCM), differential PCM (DPCM), delta modulation, predictive encoding, and various hybrid as well as adaptive versions of these techniques are very cost-effective encoding schemes at bit rates above one bit per sample, which is considered to be a medium-to-high quality data rate. However, a deficiency of all the foregoing techniques is that the encoding process is performed on only individual samples of the source signal. According to the well known Shannon rate-distortion theory described in T. Berger, Rate Distortion Theory, Englewood Cliffs, N.J.: Prentice Hall, 1971, a better objective performance can always be achieved in principle by encoding vectors rather than scalars.
Scalar quantization involves basically two operations. First, the range of possible input values is partitioned into a finite collection of subranges. Second, for each subrange, a representative value is selected to be output when an input value is within the subrange. Vector quantization (VQ) allows the same two operations to take place in multidimensional vector space. Vector space is partitioned into subranges each having a corresponding representative value or code vector. Vector quantization was introduced in the late 1970s as a source encoding technique to encode source vectors instead of scalars. VQ is described in A. Gersho, "Asymptotically optimal block quantization," IEEE Trans. Information Theory, vol. 25, pp. 373-380, July 1979; Y. Linde, A. Buzo, and R. Gray, "An algorithm for vector quantization design," IEEE Trans. Commun., vol. 28, pp. 84-95, January, 1980; R.M. Gray, J.C. Kieffer, and Y. Linde, "Locally optimal quantizer design," Information and Control, vol. 45, pp. 178-198, 1980. An advantage of the VQ approach is that it can be combined with many hybrid and adaptive schemes to improve the overall encoding performance. Further, VQ-oriented encoding schemes are simple to implement and generally achieve higher compression than scalar quantization techniques. The receiver structure of VQ consists of a statistically generated codebook containing code vectors.
Most VQ-oriented encoding techniques, however, operate at a fixed rate/distortion tradeoff and thus provide very limited flexibility for practical implementation. Another practical limitation of VQ is that VQ performance depends on the particular image being encoded, especially at low-rate encoding. This quantization mismatch can degrade the performance substantially if the statistics of the image being encoded are not similar to those of the VQ. Two other conventional block encoding techniques are transform encoding (e.g., discrete cosine transform (DCT) encoding) and subband encoding. In transform encoding, the image is decomposed into a set of nonoverlapping contiguous blocks and a linear transformation is evaluated for each block. Transform encoding is described in the following publications: W.K. Pratt, Digital Image Processing, N.Y.: John Wiley & Sons, 1978; N.S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Englewood Cliffs, N.J.: Prentice-Hall, 1984; R.C. Gonzalez and P. Wintz, Digital Image Processing, Reading, Mass.: Addison-Wesley, 2nd ed., 1987. In transform encoding, transform coefficients are generated for each block, and these coefficients can be encoded by a number of conventional encoding techniques, including vector quantization. See N.M. Nasrabadi and R.A. King, "Image coding using vector quantization: a review," IEEE Trans. Commun., vol. 36, pp. 957-971, August 1986. The transform coefficients in general are much less correlated than the original image pixels. This feature offers the possibility of modeling their statistics with well defined distribution functions. Furthermore, the image is considered to be more compact in the transform domain because not all coefficients are required to reconstruct the image with very good quality. Transform encoding is also considered to be a robust technique when compared to VQ because the transformation is fixed for all classes of images.
Although meritorious to an extent, the effectiveness of transform encoding is questionable. The effectiveness depends critically on how the bits are allocated in order to encode the individual transform coefficients. This bit rate allocation problem is documented in A. Gersho and R.M. Gray, Vector Quantization and Signal Compression, Mass.: Kluwer Academic, 1992. This bit rate allocation problem often results in a highly complex computational strategy, especially if it is adaptive, as suggested in N.S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Englewood Cliffs, N.J.: Prentice-Hall, 1984. The numerous computations associated with the transformation and the bit rate allocation strategy can lead to a high-cost hardware implementation. Furthermore, most encoders using transform encoding operate on block sizes of at least 8 x 8 pixels in order to achieve reasonable encoding performance. These block sizes are very effective in encoding the low detail regions of the image, but can result in poor quality in the high detail regions, especially at low bit-rates. In this regard, see R. Clarke, Transform Coding of Images, N.Y.: Academic, 1985. Thus, VQ is still known to be a better technique for encoding high detail image blocks. Finally, in subband encoding the image is represented as a number of subband (band pass) images that have been subsampled at their Nyquist rate. In this regard, see M. Vetterli, "Multi-dimensional sub-band coding: some theory and algorithms," Signal Processing, vol. 6, pp. 97-112, April, 1984; J.W. Woods and S.D. O'Neil, "Subband coding of images," IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, pp. 1278-1288, October, 1986. These subband images are then separately encoded at different bit rates. This approach resembles the human visual system. Subband encoding is a very effective technique for high quality encoding of images and video sequences, such as high definition TV. Subband encoding is also effective for progressive transmission in which different bands are used to decode signals at different rate/distortion operating points.
However, a primary disadvantage of subband encoding is that the computational complexity of the bit rate allocations and the subband decomposition problem can lead to a high-cost hardware implementation. Furthermore, subband encoding is usually not very efficient in allocating bit rates to encode the subband images at low rates. Hence, there is a heretofore unaddressed need in the art for a low bit rate source encoding system and method which are much simpler and inexpensive to implement and which exhibit better computational efficiency.
SUMMARY OF THE INVENTION In accordance with a first aspect of the present invention, an apparatus for encoding an image signal is provided. The apparatus includes an acquisition module disposed to receive the image signal. A first processor is coupled to the acquisition module. At least one encoder processor is coupled to the first processor, and produces an encoded image signal under control of the first processor. In accordance with a second aspect of the present invention, a method for generating a compressed video signal is provided. The method includes the steps of converting an input image signal into a predetermined digital format and transferring the digital format image signal to at least one encoder processor. The method further includes the step of applying, at the encoder processor, a hierarchical vector quantization compression algorithm to the digitized image signal. At the next step, a resultant encoded bit stream generated by the application of the algorithm is collected.
In accordance with a third aspect of the present invention, a method of compressing an image signal representing a series of image frames is provided. The method includes the step of analyzing a first image frame by computing a mean value for each of a plurality of image blocks within the first image frame, storing the computed mean values in a scalar cache, providing a mean value quantizer comprising a predetermined number of quantization levels arranged between a minimum mean value and a maximum mean value stored in the scalar cache, the mean value quantizer producing a quantized mean value, and identifying each image block from the plurality of image blocks that is a low activity image block. Next, each of the low activity image blocks is encoded with its corresponding quantized mean value. These steps are then repeated for a second frame of the image signal.
In accordance with a fourth aspect of the present invention, a method for periodically enhancing a video image with at least one processor, in which the video image includes a series of frames, each of the frames including at least one block, is provided. The method includes the steps of determining space availability of the processor, determining time availability of the processor, and determining the number of bits used to encode a frame. Then, a value is assigned to a first variable corresponding to the number of bits used to encode the frame. The first variable is compared to a predetermined refresh threshold. The method further includes the steps of scalar refreshing the video image if the first variable exceeds the predetermined refresh threshold, and block refreshing the video image if the first variable is less than or equal to the predetermined refresh threshold.
In accordance with a fifth aspect of the present invention, a method for transmitting data from an encoder to a decoder is provided. The method includes, in combination, encoding the data into a plurality of macro rows, transmitting the macro rows from the encoder to the decoder, and decoding the macro rows at the decoder.
In accordance with a sixth aspect of the present invention, in a video conferencing system having a camera, an image processor and an image display, an apparatus for electronically panning an image scene is provided. The apparatus includes an image acquisition module disposed to receive an image signal from the camera. The image acquisition module produces a digital representation of the image signal. Means for transmitting a portion of the digital representation of the image signal to the image processor are provided. The transmitting means is coupled between the image acquisition module and the image processor. The apparatus also includes an image pan selection device coupled to the transmitting means. The image pan selection device is operable to change the portion of the digital representation of the image signal that is transmitted to the image processor.
An object of the present invention is a reduction of the number of bits required to encode an image. Specifically, an object of the invention is a reduction of the number of bits required to encode an image by using a scalar cache to encode the mean of the blocks in the different levels of the hierarchy. Yet another object of the present invention is a reduction of decoding errors caused by faulty motion cache updates in the presence of transmission errors.
A further object of the present invention is a reduction of the overall computational power required to implement the HVQC and possible reduction of the number of bits used to encode an image. Another object of the present invention is an improvement in the perceived image quality. It is further an object of the present invention to control how bits are allocated to encoding various portions of the image to improve the image quality.
Another object of the present invention is to reduce the probability of buffer overflow and control how bits are allocated to encoding various portions of the image to improve the image quality.
Furthermore, it is an object of the present invention to improve image quality and reduce artifacts caused by transmission errors. A further object of the present invention is to improve image quality by encoding the chrominance components independently of the luminance component. Still another object of the present invention is an improvement in overall image quality, particularly of the chrominance component.
It is also an object of the present invention to improve image quality by reducing encoder artifacts. Another object of the present invention is to provide a means to trade off encoder bit rate for image quality. In addition, it is an object of the present invention to allow use of multiple processors to share in the computational load of implementing the video encoder/decoder system, sometimes referred to as a codec, which allows for greater flexibility and use of general purpose devices in the implementation of the encoder/decoder system. Another object of the present invention is implementation among several devices operating in parallel. A further object of the present invention is to keep computational requirements of the system down through the use of load balancing techniques. Moreover, it is an object of the present invention to keep the computation load down by providing the processing devices with a history of the portion of the image they have just acquired and thus increase the precision of the motion vectors along the partition boundary. Yet another object of the invention is a reduction of the bit rate and the computational complexity. An additional object of the invention is to allow computational needs to be balanced with image quality and bit rate constraints.
It is further an object of the invention to provide a method of keeping the image quality acceptable and the bit rate down. Additionally, it is an object of the invention to reduce the memory requirements and cost of the system. Another object of the invention is bit rate management determined by a single processing device that in turn controls the various encoder devices.
An additional object of the invention is to allow for maximum throughput of data to the transmission channel, immediate bit rate feedback control to all encoding processors, and, by the use of small packets of data, reduce image inaccuracies at the decoder due to channel errors. It is also an object of the invention to allow for efficient retransmission, if necessary.
A further object of the invention is to allow for a data channel to transmit various types of data in a multiplexed fashion. In addition, it is an object of the invention to reduce bit rate by allowing the caches to have independent initial conditions. Another object of the invention is low cost and low complexity implementation of YUV-to-RGB conversion.
Yet another object of the invention is a low cost, low complexity mechanism for controlling the data to undergo YUV-to-RGB conversion. An additional object of the invention is improved error detection and concealment. Another object of the invention is to reduce the amount of time between the viewer seeing a scene of high motion begin and subsequent scenes.
Furthermore, it is an object of the invention to transmit high quality still image frames. Another object of the invention is to improve the image quality by pre-determining which areas require the most bits for encoding. It is also an object of the invention to provide a fast, efficient, and low cost method of calculating the distortion measure used in the hierarchical algorithm. Another object of the invention is a low cost means of panning an image scene and a low cost means of zooming in and out of an image scene. An additional object of the invention is an elimination of the need for specialized ASICs and specialized board versions. Another object of the invention is to simplify video conferencing for the user. Moreover, it is an object of the invention to allow a user to have a visual representation of who the caller is and what they may want.
These and other objects, features, and advantages of the present invention are discussed or are apparent in the following description of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
In the detailed description that follows, reference will be made to the following figures:
Figure 1A is a block diagram of a hierarchical source encoding system in accordance with the present invention;
Figure IB is a block diagram of a hierarchical source encoding system in accordance with a preferred embodiment of the present invention;
Figure 2 is a block diagram of a background detector of Figure 1 A;
Figure 3 is a graphical illustration showing conventional quadtree decomposition of image blocks performed by the hierarchical source encoding system of Figure 1 A;
Figure 4 is a block diagram of a motion encoder of Figure 1 A;
Figure 5 is a graphical illustration showing the functionality of the compute-and-select mechanism of Figure 4;
Figure 6 is a graphical illustration showing the encoding of motion vectors within a cache memory of Figure 4;
Figure 7 is a graphical illustration showing updating processes of various cache stack replacement algorithms for the cache memory of Figure 4;
Figure 8 is a block diagram for a block matching encoder of Figure 1 A;
Figure 9 is a block diagram of a cache vector quantizer for encoding a new frame of a video sequence in the frame buffer of the hierarchical source encoding system of Figure 1 A;
Figure 10 is a graphical illustration showing a working set model for the cache memory of Figure 9;
Figure 11 is a graphical illustration showing a novel adaptive working set model for the cache memory of Figure 9; Figure 12 is a graphical illustration showing a raster scan technique for scanning the image blocks in one image frame;
Figure 13 is a graphical illustration showing a localized scanning technique for scanning the image blocks in one image frame; Figure 14A is a schematic diagram of a preferred parallel processing architecture;
Figure 14B is a high level diagram of the architecture of a video conferencing system; Figure 15 is a bit map of the format of date for the YUV-to-RGB converter shown in Figure 14A;
Figure 16 shows the time for each phase of pre filtering assuming the encoders shown in Figures 14A and 14B will be operating on byte packed date;
Figure 17 depicts processor utilization (minus overhead) for post processing; Figure 18 shows the format improving the efficiency of the YUV-to-RGB convenor shown in Figure 14A;
Figure 19 is a table of eight gamma elements; Figure 20 is a graph of threshold v. buffer fullness;
Figure 21 depicts the Y data format and U&V data format in the master video input data buffer;
Figure 22 is a high level example of a video bit stream; Figure 23 is a table relating to various values of the picture start code (PSC); Figure 24 is a table of the fields in a macro row header;
Figure 25 shows the format of extended macro zero information; Figure 26 is a table showing various decompositions of luminance of 16x16 blocks and their resulting bit streams; Figure 27 is a table of the flags used to describe the video bit stream;
Figure 28 is a table of the various decompositions of chrominance 8x8 blocks and their resulting bit streams;
Figure 29 depicts an example of encoding still image 4x4 blocks; Figure 30 shows refresh header information;
Figure 31 is a table of the overhead associated with a bit stream structure;
Figure 32 is a table of the fields of a picture header;
Figure 33 depicts a packet header;
Figure 34 depicts encoded data in C31 memory in big endian format; Figure 35 depicts encoded data in PC memory for the original bit stream;
Figure 36 depicts encoded data in PC memory for a bit stream of the system;
Figure 37 is a block diagram of an interrupt routine;
Figures 38 - 44 relate to packet definitions for packets sent to the VSA;
Figure 38 depicts a control packet; Figure 39 depicts a status request packet;
Figure 40 depicts an encoded bit stream request;
Figure 41 depicts a decoder bits end packet;
Figure 42 depicts a decoder bits start/continue packet;
Figure 43 depicts a YUV data for encoder packet; Figure 44 depicts a drop current RGB frame packet;
Figures 45 - 53 relate to packet definitions for packets sent to the host (PC);
Figure 45 depicts a status packet;
Figure 46 depicts a decoder error packet; Figure 47 depicts a decoder acknowledgment packet;
Figure 48 depicts an encoded bits from encoder end packet;
Figure 49 depicts an encoded bits from encoder start/continue packet;
Figure 50 depicts an encoded bits frame stamp; Figure 51 depicts a top of RGB frame packet;
Figure 52 depicts a FIFO ready packet;
Figure 53 depicts a YUV acknowledgment packet;
Figures 54 - 78 relate to control types and parameters for the control packet (host to VSA); Figure 54 depicts a control encoding packet;
Figure 55 depicts a frame rate divisor packet;
Figure 56 depicts an encoded bit rate packet;
Figure 57 depicts a post spatial filter packet;
Figure 58 depicts a pre spatial filter packet; Figure 59 depicts a temporal filter packet;
Figure 60 depicts a still image quality packet;
Figure 61 depicts a video mode packet;
Figure 62 depicts a video pan absolute packet;
Figure 63 depicts a brightness packet; Figure 64 depicts a contrast packet;
Figure 65 depicts a saturation packet;
Figure 66 depicts a hue packet;
Figure 67 depicts a super control packet; Figure 68 depicts a control decoding packet; Figure 69 depicts a motion tracking packet; Figure 70 depicts a request control setting packet; Figure 71 depicts an example of a request control setting packet; Figure 72 depicts a frame rate divisor packet;
Figure 73 depicts a request special status information packet; Figure 74 depicts a buffer fullness information packet; Figure 75 depicts a YUV top of frame packet; Figure 76 depicts a Y data frame packet; Figure 77 depicts a UV data packet;
Figure 78 depicts a YUV end of frame packet;
Figure 79 depicts data flow between image planes, a digital signal processor and an accelerator;
Figure 80 is a table that represents formulas for calculating speed improvements with an accelerator;
Figure 81 is a table that shows speed improvement for various pipeline delay lengths used with an accelerator;
Figure 82 depicts a mean absolute error (MAE) accelerator implementation in accordance with the equation shown in the Figure; Figure 83 depicts a mean square error (MSE) accelerator implementation in accordance with the equation shown in the Figure;
Figure 84 illustrates a mean absolute error (MAE) implementation as in Figure 82 with interpolation to provide half-pixel resolution; Figure 85 illustrates a mean square error (MSE) implementation as in Figure 83 with interpolation to provide half-pixel resolution;
Figure 86 is a Karnaugh map for carry save adders in the YUV to RGB matrix; Figure 87 is a block diagram of a circuit that converts a YUV signal to RGB format; Figure 88 depicts a logic adder and a related bit table for converting the YUV signal to
RGB format;
Figure 89 depicts a logic adder and a related bit table for converting the YUV signal to RGB format;
Figure 90 depicts a logic adder and a related bit table for converting the YUV signal to RGB format;
Figure 91 is a block diagram timing model for the YUV to RGB conversion; Figure 92 is a state diagram relating to the conversion of the YUV signal to RGB format;
Figure 93 shows a 16x16 image block; Figure 94 shows an interpolated macro block;
Figure 95 depicts memory space at the end of step 1 of the first iteration of a first motion estimation memory saving technique;
Figure 96 depicts the memory space at the end of step 4 of the first iteration of the first motion estimation memory saving technique; and Figure 97 depicts memory space addressing for three successive iterations of a second motion estimation memory saving technique. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
With reference now to the figures wherein like numerals designate corresponding parts throughout the several views, Figure 1A shows a hierarchical source encoding system 10 having multiple successive stages 11 - 19 for encoding different parts of an image block 21. In essence, higher bit rates are allocated to those regions of the image block 21 where more motion occurs, while lower bit rates are allocated to those regions where less motion occurs. The hierarchical source encoding system 10 has a low computational complexity, is simple and inexpensive to implement, and is suited for implementation in hardware, software, or combinations thereof.
In accordance with a significant aspect of the present invention, the hierarchical encoding system 10 utilizes cache memories with stack replacement algorithms as a noiseless method to encode motion vectors. This approach has an advantage over entropy encoding because the statistics of the motion vectors are allowed to vary with time. Another major advantage of this approach is that the cache memories can substantially reduce the computation required for matching data blocks.
As shown in Figure 1A, a background detector 11 initially receives an image block 21 of size p x p, for instance, 16 x 16 or 32 x 32 pixels. Generally, the background detector 11 determines whether the image block 21 is either a background block or a moving block. A block diagram of the background detector 11 is shown in Figure 2. With reference to Figure 2, the background detector 11 comprises a compute mechanism 24 in series with a Tbg flag mechanism 26, and a frame buffer 28 for receiving data from the flag mechanism 26 and for writing data to the compute mechanism 24. The frame buffer 28 can be any memory which can be randomly accessed. In the compute mechanism 24, the image block 21 is compared with a previously encoded p x p image block (not shown) which has been stored in the frame buffer 28 in order to generate a difference value, or a "distortion" value. The distortion value is generated by comparing the blocks on a pixel-by-pixel basis. The compute mechanism 24 may utilize any suitable distortion measurement algorithm for determining the distortion value, but preferably, it employs one of the following well known distortion measurement algorithms:
Mean Square Error (MSE):

    MSE = (1/p^2) * SUM(i=1 to p) SUM(j=1 to p) [ x_ij(t) - x_ij(t-1) ]^2        (1)

Mean Absolute Error (MAE):

    MAE = (1/p^2) * SUM(i=1 to p) SUM(j=1 to p) | x_ij(t) - x_ij(t-1) |          (2)

Number of Moving Pixels (NMP):

    NMP = the number of pixels in the block for which | x_ij(t) - x_ij(t-1) | exceeds a pixel threshold        (3)

where x_ij(t) denotes the pixel at row i and column j of the current image block and x_ij(t-1) denotes the corresponding pixel of the previously encoded block.
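The following fragment is a hedged sketch of these three measures computed for a p x p block; the pixel threshold used for the NMP count and the row stride are illustrative parameters, not values taken from the figures.

    #include <stdlib.h>

    /* compute MSE, MAE, and the number of moving pixels for a p x p block */
    void block_distortion(const unsigned char *cur, const unsigned char *prev,
                          int p, int stride, int pixel_thresh,
                          double *mse, double *mae, int *nmp)
    {
        int i, j, diff;
        double se = 0.0, ae = 0.0;
        int moving = 0;

        for (i = 0; i < p; i++) {
            for (j = 0; j < p; j++) {
                diff = (int)cur[i * stride + j] - (int)prev[i * stride + j];
                se += (double)diff * diff;
                ae += abs(diff);
                if (abs(diff) > pixel_thresh)
                    moving++;
            }
        }
        *mse = se / (double)(p * p);
        *mae = ae / (double)(p * p);
        *nmp = moving;
    }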
The image block 21 is classified as a background block if a distortion value is less than a predetermined threshold Tbg. The comparison to the threshold Tbg is performed in the Tbg flag mechanism 26. If the comparison results in a difference which is less than the threshold Tbg, then a flag bit 32 is set to a logic high ("1") to thereby indicate that the present image block 21 is substantially identical to the previously encoded p x p image block and then the system 10 will retrieve another image block 21 for encoding. In the foregoing scenario, the image block 21 is encoded with merely a single flag bit 32. In the alternative, that is, if the distortion is greater than or equal to the threshold Tbg, the flag bit 32 remains at a logic low ("0") and then the system 10 will pass the p x p image block 21 on to the next hierarchical stage, that is, to the p/2 x p/2 (16 x 16 pixels in this example) background detector 12. The p/2 x p/2 background detector 12 has essentially the same architecture and equivalent functionality as the p x p background detector 11, as shown in Figure 2 or a similar equivalent thereof, but the p/2 x p/2 background detector 12 decomposes the p x p image block 21 into preferably four p/2 x p/2 image blocks 21' via a conventional quadtree technique, prior to image analysis. A conventional quadtree decomposition is illustrated graphically in Figure 3. As shown in Figure 3, in conventional quadtree decomposition, a p x p block is divided into four p/2 x p/2 blocks, which are individually analyzed, and then each of the p/2 x p/2 blocks is broken down further into p/4 x p/4 blocks, which are individually analyzed, and so on. Thus, in the present invention, the p/2 x p/2 background detector 12 retrieves a p/2 x p/2 image block 21' from the four possible image blocks 21' within the decomposed p x p image block 21 residing in the frame buffer 28 and subsequently analyzes it. Eventually, all of the p/2 x p/2 image blocks 21' are individually processed by the background detector 12. If the retrieved p/2 x p/2 image block 21' matches with the corresponding previously encoded p/2 x p/2 image block (not shown) within the frame buffer 28, then the flag bit 32' is set at a logic high to thereby encode the p/2 x p/2 image block 21', and then the background detector 12 will retrieve another p/2 x p/2 image block 21' for analysis, until all four blocks of the p x p image block are exhausted. Alternatively, if the particular p/2 x p/2 image block 21' at issue does not match the corresponding previously encoded p/2 x p/2 image block, then the p/2 x p/2 image block 21' is forwarded to the next subsequent stage of the hierarchical source encoding system 10 for analysis, that is, to the motion encoder 13.
As shown in Figure 4, in the motion encoder 13, the p/2 x p/2 image block 21' is analyzed for motion. For this purpose, the p/2 x p/2 motion encoder 13 comprises a compute-and-select mechanism 34 for initially receiving the p/2 x p/2 image block 21', a cache memory 36 having a modifiable set of motion vectors which are ultimately matched with the incoming image block 21', a TM threshold mechanism 38 for comparing the output of the compute-and-select mechanism 34 with a threshold TM, and a cache update mechanism 42 for updating the motion code vectors contained within the cache memory 36 based upon motion information received from the next subsequent hierarchical stage, or from a block matching encoder 14, as indicated by a reference arrow 43'.
The compute-and-select mechanism 34 attempts to match the incoming p/2 x p/2 image block 21' with a previously stored image block which is displaced to an extent in the frame buffer 28. The displacement is determined by a motion vector corresponding to the matching previously stored image block. Generally, motion vectors are two-dimensional integer indices, having a horizontal displacement dx and a vertical displacement dy, and are expressed herein as coordinate pairs (dx, dy). Figure 5 graphically illustrates movement of the p/2 x p/2 image block 33 within the frame buffer 28 from a previous position, indicated by dotted block 46, to a present position, denoted by dotted block 48. The displacement between positions 46 and 48 can be specified by a two-dimensional displacement vector (dx, dy).
The compute-and-select mechanism 34 compares the current image block 21', displaced by each candidate (dx, dy), with the set of previously stored image blocks having code vectors in the modifiable set {(dx0, dy0); (dx1, dy1); . . . ; (dxn, dyn)} within the cache memory 36. The code vectors have cache indices 0 through n corresponding with (dx0, dy0); (dx1, dy1); . . . ; (dxn, dyn). From the comparison between the current image block 21' and the previously stored image blocks, a minimum distortion code vector dmin(dxi, dyi) is generated. The foregoing is accomplished by minimizing the following equation:

d(dx, dy) = Σ [x(t) − x(t−1)]²   (4)

where the summation is over the pixels of the block, where x(t) is the current image block 21' corresponding to displacement vector (dx, dy), where x(t−1) is the previously encoded block displaced by (dx, dy), and where d(dx, dy) is the distortion code vector.
Next, the minimum distortion code vector dmin(dxi, dyi) is forwarded to the threshold mechanism 38, where the distortion of the minimum distortion motion vector dmin(dxi, dyi) is compared with the threshold. If the distortion of the minimum distortion motion vector is less than the threshold, then the flag bit 52' is set at a logic high and the cache index 53', as seen in Figure 6, associated with the minimum distortion motion vector dmin(dxi, dyi) is output from the compute-and-select mechanism 34. Hence, the image block 21' is encoded by the flag bit 52' and the cache index 53'. Alternatively, if the distortion of the minimum distortion motion vector dmin(dxi, dyi) is greater than or equal to the threshold, then the flag bit 52' is maintained at a logic low and the p/2 x p/2 image block 21' is forwarded to the next stage, that is, to the p/2 x p/2 block matching encoder 14 for further analysis.
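A minimal C sketch of the compute-and-select operation follows. It assumes a sum-of-squared-differences distortion per equation (4); the type and function names (MotionVec, displaced_block_ssd, cache_motion_search) are illustrative and do not appear in the appendices, and the caller is assumed to keep displacements within the frame.

#include <limits.h>

typedef struct { int dx, dy; } MotionVec;

/* Distortion (sum of squared differences) between the current block at
   (row, col) and the previous frame block displaced by (dx, dy),
   following equation (4). */
static long displaced_block_ssd(const unsigned char *cur, const unsigned char *prev,
                                int stride, int bsize, int row, int col,
                                int dx, int dy)
{
    long sum = 0;
    for (int r = 0; r < bsize; r++)
        for (int c = 0; c < bsize; c++) {
            int a = cur[(row + r) * stride + (col + c)];
            int b = prev[(row + r + dy) * stride + (col + c + dx)];
            sum += (long)(a - b) * (a - b);
        }
    return sum;
}

/* Search the motion-vector cache for the entry with minimum distortion.
   Returns the cache index of the best entry and writes its distortion to
   *best_d; the caller compares *best_d with the threshold TM to decide
   between a cache hit (flag bit high, emit index) and a miss (flag bit low,
   forward the block to the block matching encoder). */
int cache_motion_search(const unsigned char *cur, const unsigned char *prev,
                        int stride, int bsize, int row, int col,
                        const MotionVec *cache, int cache_size, long *best_d)
{
    int best_idx = 0;
    *best_d = LONG_MAX;
    for (int k = 0; k < cache_size; k++) {
        long d = displaced_block_ssd(cur, prev, stride, bsize, row, col,
                                     cache[k].dx, cache[k].dy);
        if (d < *best_d) { *best_d = d; best_idx = k; }
    }
    return best_idx;
}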
Significantly, the cache update mechanism 42 updates the cache memory 36 based on cache hit and miss information, i.e., whether the flag bit 52' is set to a logic high or low. The cache update mechanism 42 may use any of a number of conventionally available cache stack replacement algorithms. In essence, when a hit occurs, the cache memory 36 is reordered, and when a miss occurs, a motion vector which is ultimately determined by the block matching encoder 14 in the next hierarchical stage is added to the cache memory 36. When the motion vector is added to the cache memory 36, one of the existing entries is deleted.
One of the following cache stack replacement algorithms is preferably utilized: the least-recently-used (LRU) algorithm, the least-frequently-used (LFU) algorithm, or the first-in-first-out (FIFO) algorithm. In the LRU algorithm, the motion vector to be replaced is the one whose last reference is the oldest, or has the largest backward distance. In the LFU algorithm, the motion vector to be replaced is the one whose number of references up to that time is the smallest. Finally, in the FIFO algorithm, the motion vector to be removed is the one which has been in the cache memory 36 for the longest length of time. Figure 7 illustrates and contrasts the foregoing cache update and replacement algorithms during both a cache hit and a cache miss. In accordance with the present invention, the stack replacement algorithm re-orders the index stack in the event of a cache hit, and in the event of a cache miss, an index is deleted and another index is inserted in its place.
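The following C-style sketch illustrates one such policy, an LRU-style stack replacement, assuming the MotionVec type from the previous sketch; the details of the re-ordering are an illustrative choice rather than a description of Figure 7.

/* LRU-style cache update: on a hit, the matched entry is moved to the front
   of the stack; on a miss, the least-recently-used entry (the last one) is
   dropped and the motion vector found by the block matching encoder is
   inserted at the front. */
void cache_update_lru(MotionVec *cache, int cache_size, int hit, int hit_index,
                      MotionVec new_vec)
{
    if (hit) {
        MotionVec v = cache[hit_index];
        for (int k = hit_index; k > 0; k--)       /* re-order the index stack */
            cache[k] = cache[k - 1];
        cache[0] = v;
    } else {
        for (int k = cache_size - 1; k > 0; k--)  /* delete the oldest entry  */
            cache[k] = cache[k - 1];
        cache[0] = new_vec;                       /* insert the new vector    */
    }
}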
As seen in Figure 8, the block matching encoder 14, which subsequently receives the p/2 x p/2 image block 21' in the event of a cache miss, employs any conventional block matching encoding technique. Examples of suitable block matching encoding techniques are a full search and a more efficient log (logarithmic) search, which are both well known in the art. For this purpose, the block matching encoder 14 comprises a block matching estimation mechanism 56 for comparing the incoming p/2 x p/2 image block 21' with image blocks in the previously stored image frame of the frame buffer 28. If a log search is employed, a three level approach is recommended. In other words, a predetermined set of blocks is first searched and analyzed. A best fit block is selected. Then, the neighborhood of the search is reduced, and another search of predetermined blocks ensues. In a three level search, the foregoing procedure is performed three times so that the reduction in neighborhood occurs three times. Block comparison is performed preferably as indicated in equation (4) above, and the distortion vector d(dx,dy) that results in minimum distortion is selected. A select mechanism 58 selects the motion vector with minimum distortion, or dmin(dx,dy), from the distortion vectors generated by equation (4) and forwards the minimum distortion vector dmin(dx,dy) to the cache memory 36 (Figure 5) of the motion encoder 13 (previous stage) for updating the cache memory 36.
The minimum distortion vector dmin(dx,dy) is then compared with a predetermined threshold TM in a TM threshold mechanism 64. If the minimum distortion vector dmin(dx,dy) is greater than the predetermined threshold TM, then the flag bit 65' is maintained at a logic low, and the system 10 proceeds to the next hierarchical stage, that is, the p/4 x p/4 (8 x 8 pixels in this example) motion encoder 15 for further analysis. If, however, the minimum distortion vector dmin(dx,dy) is less than or equal to the predetermined threshold TM, then the flag bit 65' is set at a logic high and is output along with the minimum distortion vector dmin(dx,dy), as indicated by reference arrow 66'. Thus, in this case, the p/2 x p/2 image block 21' is encoded by a flag bit 65' and the minimum distortion vector dmin(dx,dy) 66', and the system 10 proceeds back to the background detector 12, where another p/2 x p/2 image block 21' is retrieved, if available, and processed.
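A hedged C sketch of a three-level logarithmic search follows; the initial step size of 4 and the eight-neighbour test pattern are assumptions made only for illustration, and displaced_block_ssd() is reused from the cache-search sketch above.

/* Three-level logarithmic (log) search sketch. At each level the eight
   neighbours of the current best displacement (plus the centre) are tested
   at the current step size, the best candidate is kept, and the step is
   halved, giving the three successive reductions in search neighborhood. */
MotionVec log_search(const unsigned char *cur, const unsigned char *prev,
                     int stride, int bsize, int row, int col, long *best_d)
{
    MotionVec best = {0, 0};
    *best_d = displaced_block_ssd(cur, prev, stride, bsize, row, col, 0, 0);
    for (int step = 4; step >= 1; step /= 2) {      /* three levels: 4, 2, 1 */
        MotionVec centre = best;
        for (int dy = -step; dy <= step; dy += step)
            for (int dx = -step; dx <= step; dx += step) {
                long d = displaced_block_ssd(cur, prev, stride, bsize, row, col,
                                             centre.dx + dx, centre.dy + dy);
                if (d < *best_d) {
                    *best_d = d;
                    best.dx = centre.dx + dx;
                    best.dy = centre.dy + dy;
                }
            }
    }
    return best;   /* minimum distortion vector forwarded to the cache update */
}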
The image analysis which took place in the p/2 x p/2 motion encoder 13 and then the p/2 x p/2 block matching encoder 14 is repeated, respectively, in the p/4 x p/4 motion encoder 15 and then the p/4 x p/4 block matching encoder 16, except on a smaller image block size of p/4 x p/4 pixels. The p/4 x p/4 motion encoder 15 decomposes the p/2 x p/2 image block 21' into preferably four p/4 x p/4 image blocks 21" through selective scanning via the conventional quadtree technique, as illustrated and previously described relative to Figure 3. To this end, the p/4 x p/4 motion encoder 15 could encode the p/4 x p/4 image block 21", as indicated by reference arrow 54", with a flag bit 52" set to a logic high and a cache index 53". Or, in the next stage, the p/4 x p/4 block matching encoder 16 could encode the p/4 x p/4 image block 21", as indicated by reference arrow 67", with a flag bit 65" set to a logic high and a minimum distortion motion vector 66". The image analysis which took place in the p/4 x p/4 motion encoder 15 and then the p/4 x p/4 block matching encoder 16 is again repeated, respectively, in the p/8 x p/8 (4 x 4 pixels in this example) motion encoder 17 and then the p/8 x p/8 block matching encoder 18, except on a smaller image block size of p/8 x p/8 pixels. The p/8 x p/8 motion encoder 17 decomposes the p/4 x p/4 image block 21" into preferably four p/8 x p/8 image blocks 21"' through selective scanning via the conventional quadtree technique, as illustrated and previously described relative to Figure 3. To this end, the p/8 x p/8 motion encoder 17 could encode the p/8 x p/8 image block 21"', as indicated by a reference arrow 54"', with a flag bit 52"' set to a logic high and a cache index 53"'. Or, in the next stage, the p/8 x p/8 block matching encoder 18 could encode the p/8 x p/8 image block 21"', as indicated by reference arrow 67"', with a flag bit 65"' set to a logic high and a minimum distortion motion vector 66"'.
If the p/8 x p/8 image block 21"' has not yet been encoded, then it is passed on to a block encoder 19, shown in Figure 9. The block encoder may be a vector quantizer, a transform encoder, a subband encoder, or any other suitable block encoder. Transform encoding and subband encoding are described in the background section hereinbefore.
Examples of vector quantizers which are suitable for the present invention are described in A. Gersho, "Asymptotically optimal block quantization," IEEE Trans. Information Theory, vol. 25, pp. 373-380, July 1979; Y. Linde, A. Buzo, and R. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. 28, pp. 84-95, January 1980; and R.M. Gray, J.C. Kieffer, and Y. Linde, "Locally optimal block quantizer design," Information and Control, vol. 45, pp. 178-198, 1980. All of the foregoing disclosures are incorporated herein by reference as if set forth in full hereinbelow. It should be further noted that entropy encoding may be employed in a vector quantizer 19 to further enhance data compression. In regard to the well known entropy encoding, also known as variable length encoding, indices within the cache 36 may be entropy encoded in order to further enhance data compression. In entropy encoding, the statistics of an occurrence of each cache index are considered, and the number of bits for encoding each index may vary depending upon the probability of an occurrence of each index.
Referring to Figure 1B, in a preferred embodiment, prior to processing by the background detector 12, the p x p image block is analyzed for motion by a motion encoder as described above for a p/2 x p/2 image block. Similarly, as described above, the p x p image block may be forwarded to a block matching encoder for further analysis. Cache Vector Quantizer
During initialization of the hierarchical source encoding system 10, a new image frame must be created in the frame buffer 28 of the present invention. This situation can occur, for example, when the hierarchical source encoding system 10 is first activated or when there is a scene change in the video sequence. For this purpose, a novel vector quantizer 68, shown in Figure 9, has been developed using the novel cache memory principles. The cache vector quantizer 68 comprises a large main VQ codebook 69, which is designed off-line, and a small codebook kept in a cache memory 72 whose entries are selected on-line based on the local statistics of the image being encoded. Similar to the cache memory 36 of the motion encoders 13, 15, 17, the cache memory 72 is replenished preferably with the standard LRU, LFU, or FIFO algorithms, or with a novel adaptive working set model algorithm, which will be discussed in further detail later in this document.
In architecture, the cache vector quantizer 68 includes the large VQ codebook 69, the cache memory 72, a compute-and-select mechanism 74 for receiving the incoming p x p image block 76 and for comparing the image block 76 to code vectors in the cache memory 72, a Tc threshold mechanism 78 for determining whether the minimum distortion vector is below a predetermined threshold Tc, and a cache update mechanism 82 which utilizes a cache stack replacement algorithm for updating the cache memory 72. More specifically, the compute-and-select mechanism 74 performs the following equations:
dk(zk, x) = ||x − zk||²   (5)

dk* = min over 1 ≤ k ≤ L of {dk(zk, x)}   (6)

where x is the input block 76, z1, z2, z3, . . . , zL are the code vectors in the cache memory 72, where k* is the selected cache index, and where 1 ≤ k ≤ L.
The Tc threshold mechanism 78 determines whether the minimum distortion dk* is below the threshold Tc. If so, then a flag bit 84 is set at a logic high indicating a cache hit, and the flag bit 84 is output along with the VQ address 86, as indicated by a reference arrow 88. Alternatively, if the minimum distortion dk* is greater than or equal to the threshold Tc, indicating a cache miss, then the main VQ codebook 69 is consulted.
Preferably, the main VQ codebook 69 is set up similarly to the small codebook within the cache memory 72, and entries are compared to determine a minimum distortion as with the small codebook. It should be noted, however, that other well known VQ methods can be implemented in the main VQ codebook 69. As examples, the following architectures could be implemented: (1) mean-removed VQ, (2) residual VQ, and (3) gain/shape VQ.
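The two-stage search of the cache vector quantizer can be sketched in C as follows, per equations (5) and (6); the function names and the flat-array codebook layout are illustrative assumptions.

#include <float.h>

/* Squared-error distortion between an input block x and a code vector z,
   both of dimension dim, as in equation (5). */
static double vq_distortion(const double *x, const double *z, int dim)
{
    double sum = 0.0;
    for (int i = 0; i < dim; i++) {
        double d = x[i] - z[i];
        sum += d * d;
    }
    return sum;
}

/* Cache vector quantizer sketch: search the small cache first; if the best
   cache distortion is below Tc, report a cache hit (flag bit high) and the
   cache index; otherwise search the large main VQ codebook and report the
   main-codebook address (flag bit low). The codebooks are passed as flat
   arrays of 'count' vectors of length dim. */
int cache_vq_encode(const double *x, int dim,
                    const double *cache, int cache_count,
                    const double *codebook, int codebook_count,
                    double Tc, int *hit)
{
    int best = 0;
    double best_d = DBL_MAX;
    for (int k = 0; k < cache_count; k++) {
        double d = vq_distortion(x, cache + k * dim, dim);
        if (d < best_d) { best_d = d; best = k; }
    }
    if (best_d < Tc) { *hit = 1; return best; }      /* cache hit: emit index */

    *hit = 0;                                        /* cache miss: consult   */
    best = 0; best_d = DBL_MAX;                      /* the main VQ codebook  */
    for (int k = 0; k < codebook_count; k++) {
        double d = vq_distortion(x, codebook + k * dim, dim);
        if (d < best_d) { best_d = d; best = k; }
    }
    return best;
}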
The cache update mechanism 82 implements a stack replacement algorithm for the cache memory 72. In addition to the LRU, LFU, and FIFO stack replacement algorithms discussed in detail previously, the cache update mechanism 82 can perform a novel adaptive working set model algorithm described hereafter.
It should be further noted that entropy encoding, transform encoding, for example, discrete cosine transform (DCT) encoding, and subband encoding may be employed in the cache vector quantizer 68 to further enhance data compression.
Adaptive Working Set Model Technique The cache size n discussed thus far relative to cache memories 36, 72 has been fixed throughout the process of encoding a particular image block. For most natural images, the rate of cache misses in regions of low detail is much smaller than in high detail regions. Thus, more bit rate reduction could be achieved if the cache size n were allowed to vary according to the activity of the image blocks. In the adaptive working set model technique, an adaptive cache with a flexible cache size is efficiently implemented and results in a lower bit rate than is achievable using other conventional cache stack replacement algorithms, for example, LRU, LFU, and FIFO as discussed previously. To understand the adaptive working set model technique, a brief discussion of the conventional working set model technique is warranted. In the conventional working set model technique, no particular cache stack replacement technique is utilized, and the cache memory is simply a list of the unique code vectors that occur during the near past [t - T + 1, t], where the parameter T is known as the window size. The parameter T can be a function of time. The ultimate cache size corresponds to the number of unique code vectors within the time interval. For an image source, a two-dimensional causal search window 91 can be defined, as illustrated in Figure 10, which conceptually corresponds to the time interval of the working set model. In other words, as shown in Figure 10, the memory space in the cache memory 72 (Figure 9) is defined as all the blocks within the causal search window 91. As an example, the causal search window 91 is shown having a size W equal to three rows of blocks 92, each with a block size p x p of 4 x 4. The resulting code vector for encoding each image block x(t) is the previous block in the causal search window 91 that yields the minimum distortion, such as the minimum mean squared error ("MSE") distortion. For a given causal search window 91 having size W, the total number M of possible code vectors is given by the following equation:
M = (2W + 1)W + W (7)
One of the major advantages of the working set model for computer memory design over other replacement algorithms is the ability to adapt the size of the causal search window 91 based on the average cache miss frequency, which is defined as the rate of misses over a short time interval. Different window sizes are used to execute different programs, and the working set model is able to allocate memory usage effectively in a multiprogramming environment. The adaptive working set model of the present invention uses the foregoing ideas to implement variable-rate image encoding. Different regions of an image require different window sizes. For example, an edge region may require a much larger search window 91 than that of shaded regions. Accordingly, a cache-miss distance is defined based on the spatial coordinates of each previously-encoded miss-block in the causal window 91 relative to the present block being encoded.
More specifically, let N(Wf; r, c) be the set of indices of the miss-blocks, where Wf is the window size used to estimate the cache-miss distance. The spatial coordinates of the set N(Wf; r, c) are illustrated in Figure 11 as the shaded blocks 93. The average cache-miss frequency is defined as the summation of the reciprocal of each cache-miss distance within the search window:

fm(r, c) = Σ over i in N(Wf; r, c) of (1 / di)   (8)

where r is the row index, c is the column index, and di is the cache-miss distance of miss-block i. This value provides a good enough estimate of the image locality at any given region without requiring a large amount of computation.
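A small C sketch of equation (8) is given below; the use of a Euclidean distance between block coordinates as the cache-miss distance is an illustrative assumption.

#include <math.h>

/* Average cache-miss frequency of equation (8): the sum of reciprocals of
   the cache-miss distances of the miss-blocks in the causal window. The miss
   blocks are given by their (row, column) block coordinates; the Euclidean
   distance in block units is used here as the cache-miss distance, which is
   an illustrative choice. (r, c) is the block currently being encoded. */
double avg_cache_miss_frequency(const int *miss_r, const int *miss_c, int n_miss,
                                int r, int c)
{
    double fm = 0.0;
    for (int i = 0; i < n_miss; i++) {
        double dr = (double)(miss_r[i] - r);
        double dc = (double)(miss_c[i] - c);
        double dist = sqrt(dr * dr + dc * dc);
        if (dist > 0.0)
            fm += 1.0 / dist;
    }
    return fm;
}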
The window size W at a given time t is updated according to the following equation:

W(t) = A · 2^(fm(r, c) · B)   if fm(r, c) < n
W(t) = Wmax                   otherwise          (9)

where A and B are two pre-defined constants, and Wmax is the pre-defined maximum allowed window size.
Hence, in the adaptive working set model technique, the size of the causal search window 91 is manipulated depending upon the miss frequency. As misses increase or decrease, the window 91 respectively increases or decreases. In accordance with another aspect of the present invention, redundant blocks within the causal search window 91 are preferably minimized or eliminated. More specifically, if a present block matches one or more previous blocks in the causal search window 91, then only a code vector representative of the offset to the most recent block is encoded for designating the present block. This aspect further enhances data compression. Localized Scanning of Cache Indices
The method used to index the image blocks in a single image frame can also affect the average cache bit rate because the cache memory 72 updates its contents based entirely on the source vectors that have been encoded in the near past. The indexing of image blocks may be based on a raster scan, as illustrated in Figure 12. Figure 12 shows an example of a raster scan for scanning 256 image blocks. The major problem with this method is that each time a new row begins, the inter-block locality changes rapidly. Moreover, if the cache size is very small relative to the number of vectors in a row, many cache misses will occur whenever the cache starts to encode the source vectors in a new row. In the preferred embodiment, the image blocks are indexed for higher performance by providing more inter-block locality. In this regard, a localized scanning technique, for example, the conventional Hilbert scanning technique, is utilized for indexing the image blocks of the image frame. Figure 13 illustrates the Hilbert scanning technique as applied to the indexing of the image blocks of the image frame of the present invention for scanning 256 image blocks.
It will be obvious to those skilled in the art that many variations and modifications may be made to the above-described embodiments, which were chosen for the purpose of illustrating the present invention, without substantially departing from the spirit and scope of the present invention. For example, the hierarchical source encoding system 10 was specifically described relative to decomposition of square image blocks of particular sizes. Obviously, to one of skill in the art, any block configuration and size may be utilized to practice the present invention.
The number of encoders used during the encoding process may be adaptively changed to improve coding efficiency. The adaptation is based upon the statistical nature of the image sequence or the scene-change mean squared error ("MSE"). Coding efficiency is improved by assuring that every level of the hierarchy has a coding gain of greater than unity. Since there are overhead bits to address the hierarchical structure (the more levels the more bits required), it is possible to have a level that has coding gain less than unity. For example, in many experiments it was seen that during rapid moving scenes, the 4x4 motion coder has a coding gain less than unity and is thus replaced by spatial coders.
A scalar cache may be used to encode the mean of the blocks in the different levels of the hierarchy. After the video encoder/decoder, sometimes referred to as a "codec," has performed video compression through the reduction of temporal redundancy, further gains in compression can be made by reducing the spatial redundancy. Two such methods include: a.) A "working set model" which looks for similar areas (matching blocks) of video elsewhere in the same frame. The working set model looks in a small portion of the frame near the current area being encoded for a match, since areas spatially close to one another tend to be similar. This relatively small search window can then be easily coded to provide a high compression ratio. The window size can also be varied according to the image statistics to provide additional compression. The technique of varying the window size according to the image statistics is referred to as an "adaptive working set model"; and b.) A scalar cache containing the means of previously coded areas can be used to encode the current area's mean value. Again, this takes advantage of areas spatially close to one another tending to be similar. If an area can be encoded with low enough distortion (below a threshold) by using its mean value, then a scalar cache could save encoding bits.
Before an image undergoes encoding, an overall scene change MSE is computed. This is to determine how much motion may be present in an image. In a preferred embodiment, when the coder reaches the level of 4x4 blocks, if the MSE is below a threshold, the image is encoded with a structure that finds the 4x4 motion vectors from a search of the motion cache or through a three step motion search. On the other hand, if the MSE is above the threshold, the 4x4 motion vectors are found through one of the following:
1) Working set model: the motion vector is predicted from the currently-coded image rather than the previously-coded image; 2) Each 4x4 block is coded by its mean only; or
3) Each 4x4 block is coded by its mean and a mean-removed VQ. Thus the structure is adapted depending upon image content.
Pseudo code relating to the encoding and decoding processes is found at Appendix 2, Appendix 3 and Appendix 5. Appendix 2 contains high level pseudo code that describes the encoding and decoding algorithm. Appendix 3 contains pseudo code related to the encoding process and buffer management shown in Appendix 5. Appendix 5 contains pseudo code of the encoding and decoding algorithm in a C programming language style.
As described in U. S. Patent No. 5,444,489, the HVQC algorithm had the cache-miss flag located in one particular position in the cache. The presently preferred embodiment, however, allows the cache-miss flag position to vary within the cache so that it may be more efficiently encoded through techniques such as entropy encoding.
U.S. Patent No. 5,444,489 describes a hierarchical vector quantization compression ("HVQC") algorithm. In accordance with the present invention, the HVQC algorithm may be applied to an image signal using a parallel processing technique to take advantage of high bit rate transmission channels.
One of the basic strengths of the HVQC algorithm is its adaptability to higher bit rate transmission channels. Users of these channels typically require high performance in terms of image size, frame rate, and image quality. The HVQC algorithm may be adapted to these channels because the computational power required to implement the algorithm may be spread out among multiple processing devices.
A preferred parallel processing architecture is shown schematically in Figure 14A. A video front end 94, receives a video input signal 95. The video front end 94 may, for example, be a device that converts an analog signal, such as a video signal in NTSC, PAL, or S-VHS format, into a digital signal, such as a digital YUV signal. Alternatively, the video front end 94 may receive from a video source a digital signal, which the video front end may transform into another digital format. The video front end 94 may be implemented using a Philips SAA7110A and a Texas Instruments TPC 1020A. The video front end 94 provides digitized video data to a first processor 96, where the digitized video data is preprocessed (i.e. pre-filtered). After or during pre-filtering the digitized video data, the first processor 96 transmits the data corresponding to an image frame to a second processor 97 and a third processor 98. The first processor 96 communicates with the second processor 97 and the third processor 98 through a direct memory access ("DMA") interface 99, shown in Figure 14A as a memory and a bus transceiver.
The first processor 96 uses most of its available MIPs to perform, for example, pre- and post-filtering, data acquisition and host communication. The first processor 96 uses less than 30% of its available MIPs for decoding functions. The first processor 96, the second processor 97 and the third processor 98 may be off-the-shelf digital signal processors. A commercially available digital signal processor that is suitable for this application is the Texas Instruments TMS320C31 ("C31") floating point digital signal processor. A fixed point digital signal processor may alternatively be used. As a further alternative, general purpose processors may be used, but at a higher dollar cost.
For the system architecture shown in Figure 14A, the first processor 96 operates as a "master" and the second and third processors 97 and 98 operate as "slaves." Generally, the second and third processors 97 and 98 are dedicated to the function of encoding the image data under the control of the first processor 96.
The HVQC algorithm may be implemented with multiple processing devices, such as the processor architecture shown in Figure 14A, by spatially dividing an image into several sub-images for coding. In addition, because of the multiple processor architecture, unequal partitioning of the image among the various processing devices may be used to share the computational load of encoding an image. The partitioning is preferably accomplished by dividing the image into several spatial segments. This process is known as boundary adaptation and allows the various processing devices to share the computational load instead of each having to be able to handle a seldom seen worst case computational burden. As an alternative to dividing up the image, the system may allocate more bits for encoding the image to the processor encoding the portion of the video with more activity. Therefore, this encoder may keep trying to encode at a high quality level since it would have more bits to encode with. This technique would not, however, reduce the computational requirements as does the preferred embodiment.
The DMA interface 99 is utilized to write image data to the second and third processors 97 and 98, exchange previously coded image data, and receive encoded bit streams from both the second and third processors 97 and 98. Data passing among the processors 96, 97 and 98 may be coordinated with a communication system residing on the master processor 96.
For example, there is an exchange of image information between the processors 97 and 98 when their respective regions of the image to be encoded change due to boundary adaptation. In addition, when a boundary adaptation occurs, the processor 97 or 98 that acquires more image area to encode must receive the previously coded image plane (Y, U, & V) macro row data for the image area it is now to encode, along with the MSE data for those macro rows. There are 144/16 = 9 macro rows per frame. This information provides the necessary data to perform accurate motion estimation. The first processor 96 is also coupled to a host processor (not shown), such as the CPU of a personal computer, via a PC bus 100, an exchange register and a converter 101. The converter 101 preferably includes a field programmable gate array ("FPGA"), which produces, in response to decoded data provided by the first processor 96, 16-bit video data in RGB format for the personal computer to display. Alternatively, the converter 101 may include an FPGA that produces an 8-bit video data signal with an EPROM for a color look-up table. A commercially available FPGA that is suitable for this application is available from Xilinx, part no. XC3090A. FPGAs from other manufacturers, such as Texas Instruments, part no. TPC1020A, may alternatively be used. Referring now to Figure 14B, the operation of the parallel processing architecture, as shown in Figure 14A, will be described for use in an environment in which the incoming signal is an NTSC signal and the output is in RGB format.
Video Conference System Overview Architectural Overview The architecture of a video encoder/decoder 103 preferably includes three digital signal processors, two for video encoding and one for video decoding, data acquisition, and control. A high level view of this parallel processing architecture is shown in Figure 14B. Within this architecture, a master/decoder processor 102 acquires digitized video data from a video interface 104, performs the pre- and post-processing of the video data, provides a host interface 106 (for bit-stream, control, and image information), decodes the incoming video bit-stream, and controls two slave encoder processors 108 and 110. The interaction between the master processor 102 and the slave encoders 108 and 110 is through a direct memory access interface, such as the DMA interface 99 shown in Figure 14A. The master/decoder processor 102 corresponds to the first processor 96 shown in Figure 14A. The two slave encoder processors 108 and 110 correspond to the second and third processors 97 and 98 shown in Figure 14A.
To communicate with the slave encoder processors 108 and 110, the master processor 102 puts the slave encoder processors 108 and 110 into hold and then reads/writes directly into their memory spaces. This interface is used for writing new video data directly to the slave encoder processors 108 and 110, exchanging previously coded image ("PCI") data, and accepting encoded bit streams from both slave encoder processors 108 and 110. When outputting the decoded YUV data, the master processor 102 uses the YUV-to-RGB converter FPGA, as shown in Figure 14A, to produce 8-bit (with a separate EPROM for the color look-up table), 15-bit RGB, or 16-bit RGB video data for the PC to display.
High Level Encoder/Decoder Description This section prioritizes and describes the timing relationships of the various tasks that make up the video encoder/decoder 103 and also describes the communication requirements for mapping the video encoder/decoder 103 into three processors. The tasks will be grouped according to the various software systems of the video encoder/decoder 103. These software systems are:
1. Video Input
2. Encoding
3. Decoding
4. Video Output
5. Communication and Control
Each system is responsible for one or more distinct tasks and may reside on a single processor or, alternatively, may be spread over multiple processors. Data passing between processors is coordinated with a communication system residing on the master processor 102. The video input, video output, encoder, and decoder systems operate cyclically since they perform their task (or tasks) repeatedly.
Video Input System The video input system has two aspects. First, a video image is collected. The video image is then processed (filtered) to spatially smooth the image and passed to the slave encoder processors 108 and 110. To maximize efficiency, the image may be transferred to the slave encoder processors 108 and 110 as it is filtered.
To collect the image, the master processor 102 receives data via the video interface 104. The video interface or video front end 104 includes an NTSC to digital YUV converter, for example. The chrominance components (UV) of the YUV data are preferably sub-sampled by a factor of two, horizontally and vertically, relative to the luminance component (Y), which is a 4:1:1 sampling scheme. NTSC provides video data at a rate of 60 fields per second. Two fields (an odd and an even field) make up a single video frame. Therefore, the frame rate for NTSC is actually 30 frames per second. Two sampling rates seem to be prevalent for digitizing the pixels on a video scan line: 13.5 MHz and 12.27 MHz. The former gives 720 active pixels per line while the latter gives 640 active (and square) pixels per line. The target display (VGA) uses a 640x480 format; therefore the 12.27 MHz sampling rate is more appropriate and provides more time to collect data. The following assumes a sampling rate of 12.27 MHz and that one must collect 176 pixels on 144 or 146 lines of the image (QCIF format). The hardware and software described herein may be adapted to other input video formats and target display formats in a straightforward manner.
With the QCIF format, inter-field jitter may be avoided by collecting 176 samples on a line, skipping every other sample and every other line (i.e., taking every line in the given field and ignoring the lines of the other field). No noticeable aliasing effects have been detected with such a system, but anti-aliasing filters could be used to reduce aliasing.
Acquiring an image will take about 9.2 milliseconds (146 lines x 1/15734 sec/line), using about 57% of the processor during those 9.2 milliseconds. While the image is being acquired, the master processor 102 collects and stores 176 video samples every 63.6 microseconds (1 line x 1/15734 sec/line) at a rate of 6.14 MHz. Preferably, the front end video interface 104 interrupts the master processor 102 only at the beginning of every line to be collected. The video front end 104 will buffer 8 Ys and 4 Us and 4 Vs of video data at a time. Therefore, the master processor 102 preferably dedicates itself to video collection during the collection of the 176 samples that make up a line of video data. This may be accomplished using a high priority interrupt routine. The line collection interrupt routine is non-interruptable.
The time to collect the 176 samples of video data is 28.6 microseconds (1/(6.14 Mhz) x 176 pixels). The video input software system copies a portion of the collected data out of internal memory after the line has been collected. This will add about 8.8 microseconds to the collection of the odd lines of video data. The average total time to collect a line of video data is therefore 33 microseconds plus interrupt overhead. Assuming an average of 3 microseconds for interrupt overhead, the average total time to collect a line of video data is 36 microseconds. Therefore, processor utilization during the image acquisition is 57% (36 microseconds/63.6 microseconds x 100). This constitutes about 8% (@ 15 frames/second) of the processor. Using this approach, the maximum interrupt latency for all other interrupts will be about 40 microseconds.
Once the image has been collected, it is preferably spatially lowpass filtered to reduce image noise. In the preferred embodiment, only the Y component is filtered. Assuming that only the Y component is filtered, the input image filtering and transfer task will consume about 10.5 msec or 15.7% of a frame time. U and V transfer will consume 340 microseconds or about 0.4% of the frame time. During the transfer of the image, the slave encoder processors 108 and 110 are in hold for a total of about 4 msec or about 6% of a frame time. The present invention provides methods to enhance the motion estimation process by means of pre-filtering, where the pre-filtering is a variable filter whose strength is set at the outset of a set of coded images to reduce the number of bits spent on motion estimation. For example, a temporal and/or spatial filter may be used before encoding an image to prevent the high frequency components from being interpreted as motion, which would require more bits to encode. Typically, two one-dimensional filters are used - one in the horizontal direction and one in the vertical direction. Alternatively, other filters may be used, such as a two-dimensional filter, a median filter, or a temporal filter (a temporal filter would filter between two image planes - the one to be encoded and the previous one). As a further alternative, a combination of these filters could also be used for pre-filtering. The filter strength is set at the outset of each image macroblock depending upon how much motion is present in a given macroblock. High frequency components in the video image may arise from random camera noise and sharp image transitions such as edges. By using filters, high frequency components related to random camera noise are reduced so that they do not appear to the motion estimators as motion. Other examples of noise might include subtle amounts of motion which are not normally noticeable. That is, the goal of pre-filtering is to reduce noise artifacts and, to some extent, minor motion artifacts.
In addition to pre-filtering the input image signal, the master processor 102 may post-filter the encoded image signal. A temporal and/or spatial filter may be used to reduce high frequency artifacts introduced during the encoding process. Preferably, a variable strength filter based upon the hit-ratio of the predictors in the working set model is used. Thus, the filter strength is variable across the whole image. Post-filtering may be implemented using two one-dimensional filters - one in the horizontal direction and one in the vertical direction. A median, temporal, or combination of filter types could also be used.
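The separable pre-/post-filtering described above might be sketched in C as follows; the 3-tap [1 2 1]/4 kernel and the handling of edge pixels are assumptions made only for illustration.

#include <stdlib.h>
#include <string.h>

/* Separable spatial low-pass filter applied to the Y plane: a horizontal
   pass followed by a vertical pass with an assumed 3-tap [1 2 1]/4 kernel.
   Edge pixels are left unfiltered for simplicity; a small temporary buffer
   holds the horizontally filtered result. */
void prefilter_y(unsigned char *y, int width, int height)
{
    unsigned char *tmp = malloc((size_t)width * height);
    if (!tmp) return;
    memcpy(tmp, y, (size_t)width * height);

    /* horizontal pass */
    for (int r = 0; r < height; r++)
        for (int c = 1; c < width - 1; c++)
            tmp[r * width + c] = (unsigned char)
                ((y[r * width + c - 1] + 2 * y[r * width + c] +
                  y[r * width + c + 1] + 2) >> 2);

    /* vertical pass */
    for (int r = 1; r < height - 1; r++)
        for (int c = 0; c < width; c++)
            y[r * width + c] = (unsigned char)
                ((tmp[(r - 1) * width + c] + 2 * tmp[r * width + c] +
                  tmp[(r + 1) * width + c] + 2) >> 2);

    free(tmp);
}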
Encoding System The image is in a packed format in the slave encoder processors' memory and is unpacked by the slave encoder processor once it receives the image. The encoding system is distributed over the two slave encoder processors 108, 110 with the resulting encoded bit streams being merged on the master processor 102.
The encoding system described herein allows interprocessor communication to synchronize and coordinate the encoding of two portions of the same input image. Once both the slave encoder processors 108 and 110 are ready to encode a new image, they will receive the image from the master processor 102. The slave encoder processors 108 and 110 then determine whether a scene change has occurred in the new image or whether the boundary needs to be changed. This information is then communicated between the two slave encoder processors.
If the boundary is to be changed, previously coded image (pci or PCI) data must be transferred from one slave encoder processor to the other. Boundary change is limited to one macro row per frame time, where a macro row contains 16 rows of the image (16 rows of Y and 8 rows of U and V). In addition to the boundary change, extra PCI data needs to be exchanged to eliminate edge effects at the boundary line. This involves sending at least four additional rows from each slave encoder processor to the other. Assuming a one-wait state read and zero-wait state write time for interprocessor communication, the exchange of PCI data for a boundary change will take a maximum of 0.3% of a frame time for packed data ((4 cycles/4 pixels) x (16 rows Y x 176 pixels + 8 rows U x 88 pixels + 8 rows V x 88 pixels) x 0.05 microseconds/cycle / 66666 microseconds/frame x 100%) and 1.3% of a frame time for byte-wide data. Preferably, the slave encoder processors 108, 110 use the unpacked format.
After the exchange of PCI data, the slave encoder processors encode their respective pieces of the video image. This process is accomplished locally on each slave encoder processor, with the master processor 102 removing the encoded bits from the slave encoder processors 108, 110 once per macro row. A ping-pong buffer arrangement allows the slave encoder processors 108, 110 to transfer the previously encoded macro row to the master while they are encoding the next macro row. The master processor 102 receives the bits, stores them in an encoder buffer and ultimately transfers the encoded bit stream data to the PC host processor. As previously noted, the control structure of the encoder is demonstrated in the pseudo code at Appendix 2, Appendix 3 and Appendix 5.
Since the HVQC algorithm is a lossy algorithm, image degradation occurs over time. Therefore, the image encoding must be periodically stopped and a high quality reference image (a still image) must be transmitted. A distortion measure for each image to be encoded may be used to determine whether or not an image is to be encoded as a still image (when the video encoder can no longer produce acceptable results) or with respect to the previously coded image (video encoder can still produce an acceptable image).
At the beginning of each image, the slave encoder processors 108, 110 calculate an MSE for each 8x8 luminance (i.e., Y component, no chroma calculation) block. The sum of these MSEs is compared to a threshold to determine if a scene change has occurred. The background detector takes the sum of four 8x8 MSE calculations from the scene change test and uses that to determine whether a macroblock is a background macroblock or not. 8x8 background detection is done by looking up the appropriate block's MSE determined during the scene change test. Since the lowest level of the hierarchy and the still image level of the algorithm produce identical results, the use of a common VQ code and codebook for the still image encoder and the lowest hierarchical level of the encoder can save system size, complexity, and cost.
Preferably, each slave encoder processor, e.g. 108 and 110, calculates a scene change distortion measure, in the form of an MSE calculation, for the portion of the image that it is responsible for encoding. The slave encoder processor 108 then sums all of the scene change MSEs from each of the slave encoder processors 108 and 110 to generate a total scene change MSE. In accordance with the preferred embodiment, the slave encoder processor 110 passes its MSE calculation to the slave encoder processor 108, where it is summed with the slave encoder processor 108 MSE calculation to generate the overall scene change MSE.
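A C-style sketch of the scene-change MSE computation follows; the function name and the layout of the per-block MSE array are illustrative and are not taken from the appendices.

/* Scene-change test sketch: an MSE is computed for every 8x8 luminance block
   against the previously coded image (pci); the per-block values are kept so
   the background detector can later sum the four 8x8 values covering each
   16x16 macroblock. Returns the total scene-change MSE for the region this
   processor is responsible for encoding. */
double scene_change_mse(const unsigned char *cur, const unsigned char *pci,
                        int width, int height, double *block_mse_out)
{
    double total = 0.0;
    int bw = width / 8, bh = height / 8;
    for (int by = 0; by < bh; by++)
        for (int bx = 0; bx < bw; bx++) {
            double sum = 0.0;
            for (int r = 0; r < 8; r++)
                for (int c = 0; c < 8; c++) {
                    int idx = (by * 8 + r) * width + bx * 8 + c;
                    int d = cur[idx] - pci[idx];
                    sum += (double)(d * d);
                }
            block_mse_out[by * bw + bx] = sum / 64.0; /* reused by detectors */
            total += sum / 64.0;
        }
    return total;   /* compared against the scene-change threshold */
}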
In accordance with a preferred embodiment of the present invention, the three-step or the motion cache search processes are allowed to break out of the search loop (although the search process is not yet completed) because a "just good enough" match has been found. In the three-step search the distortion of the best-match is compared with the threshold in each level (there are a total of three to four levels depending on the search window size).
The search process may be terminated if at any level there is a motion vector whose resulting distortion is less than the threshold. Similarly, in the motion cache search, the resulting distortion of the motion vector in each cache entry (starting from the first entry) is compared with the threshold. The search process may be terminated if there is any entry whose resulting distortion is less than the threshold. The computational saving is significant in the motion cache search since the first entry is a most probable motion vector candidate and the method requires one MSE computation instead of L (L is the cache size) during a cache hit. During a cache miss, the computation increases slightly because more compares are required to implement the search process.
The bit-rate for a given frame may be reduced because the early cached motion vectors require fewer bits to encode, but this is sometimes offset by the bit-rate required by later frames because of the poorer previously coded image quality caused by the use of less precise motion vectors. Accordingly, the computational load of the encoder processors 108, 110 is reduced by thresholding each level of the three-step motion search and by thresholding each entry of the motion cache. This method may be applied to any block size at any level of the hierarchy.
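The early-termination idea can be sketched as a variant of the earlier cache search; the names are illustrative and the sketch reuses MotionVec and displaced_block_ssd() from above.

/* Early-termination variant of the motion cache search: entries are tested
   in order starting from the first (most probable) one, and the search stops
   as soon as an entry's distortion falls below the threshold. Returns the
   index of the accepted entry, or -1 on a cache miss. */
int cache_motion_search_early_exit(const unsigned char *cur, const unsigned char *prev,
                                   int stride, int bsize, int row, int col,
                                   const MotionVec *cache, int cache_size,
                                   long threshold, long *dist_out)
{
    for (int k = 0; k < cache_size; k++) {
        long d = displaced_block_ssd(cur, prev, stride, bsize, row, col,
                                     cache[k].dx, cache[k].dy);
        if (d < threshold) {        /* "just good enough" match found        */
            *dist_out = d;
            return k;               /* often only one distortion computation */
        }
    }
    return -1;                      /* miss: fall through to block matching  */
}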
In accordance with a preferred embodiment, the threshold to encode a given image block is limited to a pre-defined range, [min_t, max_t]. Min_t and max_t are therefore further adapted on either a macro row-by-macro row basis or a frame-by-frame basis to have even finer control over the image quality as a function of the bit-rate. The adaptation can be based on the scene-change MSE or on the image statistics of the previously coded frame.
In addition, the threshold at which a given image block is encoded may be controlled based upon the buffer fullness. In a preferred embodiment, the video encoder/decoder 103 uses piece-wise linear functions whose threshold values increase linearly with the buffer fullness. In a preferred embodiment, a distortion measure obtained by encoding an image area at a given level of the hierarchy may be compared against an adaptable threshold to enable a tradeoff between image quality, bit rate, and computational complexity. Increasing the thresholds will, in general, reduce the amount of computation, the bit rate, and the image quality. Decreasing the thresholds will, in general, increase the amount of computation, the bit rate, and the image quality. Different threshold adaptation methods include: updating after each macro row, updating after each of the largest blocks in the hierarchy (such as 16x16 blocks), or updating after every block of any size. The adaptation strategy can be based on a buffer fullness measure, elapsed time during the encoding process, or a combination of these. The buffer fullness measure tells the encoder how many bits it is using to encode the image. The encoder tries to limit itself to a certain number of bits per frame (channel bit rate/frame rate). The buffer fullness measure is used to adapt the thresholds to hold the number of bits used to encode an image to the desired set point. There is a timer on the DSP, and the system checks the time at the beginning and end of the encoding process. As an image is encoded, the number of bits used to encode it must be regulated. This can be done by adapting the hierarchical decision thresholds.
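A piece-wise linear threshold adaptation of the kind described might look like the following C sketch; the breakpoint at 50% buffer fullness and the segment slopes are assumptions for illustration only.

/* Piece-wise linear threshold adaptation sketch: the encoding threshold
   rises with buffer fullness (the fraction of the per-frame bit budget
   already used) and is clamped to the pre-defined range [min_t, max_t].
   Below the breakpoint the threshold rises slowly; above it, it rises
   steeply to curb bit production. */
double adapt_threshold(double fullness, double min_t, double max_t)
{
    double t;
    if (fullness < 0.5)
        t = min_t + (max_t - min_t) * 0.25 * (fullness / 0.5);
    else
        t = min_t + (max_t - min_t) * (0.25 + 0.75 * ((fullness - 0.5) / 0.5));
    if (t < min_t) t = min_t;
    if (t > max_t) t = max_t;
    return t;
}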
Sometimes an image area may be encoded at a lower level of the hierarchy, but the gain in image quality is not significant enough to justify the additional bits required to encode it. In these cases, the determination as to which level of the hierarchy to encode an image can be made by comparing the distortion resulting from encoding an image area at different levels of the hierarchy to the number of bits required to encode the image area at each of those levels. The hierarchical level with the better bit rate-to-distortion measure is then the level used to encode that image area.
Further, encoding of the chrominance (U and V) components of an image may be improved by encoding them independently of the luminance component. The U and V components may also be encoded independently of one another. This may be achieved, for example, by encoding the chrominance components by using separate VQ caches and by using two-dimensional mean VQ. The system can either share motion information between the luminance and chrominance components to save bits, but possibly at lower image quality, or use separate/independent components at the expense of more bits to achieve higher image quality. The system can even do a partial sharing of motion information by using the luminance motion information as a starting point for performing a chrominance motion search. The decision of whether or not to share is made at the outset by how much bandwidth (i.e., bit rate) is available. In addition, it may be less expensive to make a system that shares motion information since less memory and computation would be required than if the chrominance components had to be independently determined. Since spatially close image areas tend to be similar, it is preferable in a bit rate sense to encode these areas with similar mean values. To do this, the encoder may use various methods to compute the current image minimum and maximum mean quantizer values and then use these values to select a set of quantization levels for encoding the mean values of the variable sized image blocks. Basically this sets how many bits would be used to quantize a block with its mean value instead of a VQ value. If the block is of low activity (i.e., low frequency content), then it may be better in a bit rate sense to encode it with its mean value rather than using a VQ entry. Each block's mean value is then entered into the scalar cache. Pseudo code relating to this process is found at Appendix 2, Appendix 3 and Appendix 5.
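The mean/scalar-cache decision described above can be sketched in C as follows; the distortion test, the cache-match tolerance, and the replacement at slot 0 are illustrative assumptions rather than the method of the appendices.

/* Scalar-cache mean coding sketch: compute the block mean, check whether the
   mean-coded block would fall below the distortion threshold, and if so try
   to represent the mean by an entry of a small scalar cache of previously
   coded means. Returns 1 when the block can be coded by its mean; a returned
   cache index of -1 means the mean is coded explicitly and cached. */
int try_mean_coding(const unsigned char *blk, int stride, int bsize,
                    double threshold, double *scalar_cache, int cache_size,
                    int *cache_index)
{
    double mean = 0.0, dist = 0.0;
    for (int r = 0; r < bsize; r++)
        for (int c = 0; c < bsize; c++)
            mean += blk[r * stride + c];
    mean /= (double)(bsize * bsize);

    for (int r = 0; r < bsize; r++)
        for (int c = 0; c < bsize; c++) {
            double d = blk[r * stride + c] - mean;
            dist += d * d;
        }
    dist /= (double)(bsize * bsize);
    if (dist >= threshold)
        return 0;                               /* mean coding not good enough */

    *cache_index = -1;
    for (int k = 0; k < cache_size; k++)        /* nearby areas tend to have   */
        if (scalar_cache[k] >= mean - 2.0 &&    /* similar means, so a cached  */
            scalar_cache[k] <= mean + 2.0) {    /* value often suffices        */
            *cache_index = k;
            return 1;
        }
    scalar_cache[0] = mean;                     /* cache the new mean value    */
    return 1;
}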
Decoding System The master processor 102 acquires the encoded bit stream data from the PC host (or from the encoders directly when in a command mode) and decodes the bit stream into frames of YUV image data. Control of the encoder processors' bit rate generation is preferably achieved by sending the encoded bit streams to a processor, such as the processor 108, which feeds back rate control information to the encoder processors 108 and 110. Generally, the HVQC algorithm allocates bits on an as-needed basis while it encodes an image. This can present some image quality problems since, initially, the encoder has many bits to use and therefore its thresholds may be artificially low. As coding progresses and the allocation of bits is used up, areas that require more bits to encode are unduly constrained. Therefore, the HVQC algorithm preferably examines the entire image data before the encoding of an image is begun to make an estimate of how many bits are to be used to code each area of the image. Thresholds may be determined not only by looking at how much distortion is present in the overall image as compared to the previously coded image, but also on a block-by-block basis and then adjusting the thresholds for those areas accordingly. Since the decoder processor must track the encoder processors, it is important that the decoder processor track even in the presence of channel errors. This can be done by initializing the VQ and motion caches to a set of most probable values and then re-initializing these caches periodically to prevent the decoder from becoming lost due to transmission errors. The master processor 102 (i.e., the decoder) preferably uses a packed (4 pixels per 32-bit word) format to speed the decoding process. In a preferred embodiment, this process takes about 30% of a frame time (@ 15 frames/second) per decoded image frame.
Once the decoded image is formed, the Y component is filtered and sent to the YUV to RGB converter. The filter and transfer of the Y and the transfer of the U and V components takes 15 msec or 22.5% of the frame. Pseudo code for the decoding operations is found at Appendix 2 and Appendix 5.
The most computationally intensive portions of the HVQC algorithm are its block matching computations. These computations are used to determine whether a particular block of the current image frame is similar to a similarly sized image block in the previously coded frame. The speed with which these calculations can be performed dictates the quality with which these computations can be performed. The preferred architecture can calculate multiple pixel distortion values simultaneously and in a pipelined fashion so that entire areas of the image may be calculated in very few clock cycles. Accelerator algorithms are employed to calculate the distortion measure used in the hierarchical algorithm. Since the amount of computation and bits spent encoding and refreshing an area of the picture is dependent upon the amount of time left to encode the image and the number of bits left to encode the image with, the overall image quality may be improved by alternating the scanning pattern so that no image blocks receive preferential treatment. The scanning pattern may correspond to scanning from the top of the image to the bottom or from the bottom to the top.
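One common software counterpart to such distortion acceleration is an early-abort SSD, sketched below under the assumption that the best distortion found so far is known; this is an illustrative optimisation, not a description of the hardware accelerator itself.

/* Distortion accelerator sketch: the block-matching SSD is accumulated row
   by row and aborted as soon as the running sum exceeds the best distortion
   found so far, so poor candidates cost only a few rows of computation. */
long ssd_with_early_abort(const unsigned char *cur, const unsigned char *prev,
                          int stride, int bsize, long best_so_far)
{
    long sum = 0;
    for (int r = 0; r < bsize; r++) {
        for (int c = 0; c < bsize; c++) {
            int d = cur[r * stride + c] - prev[r * stride + c];
            sum += (long)d * d;
        }
        if (sum >= best_so_far)
            return sum;        /* cannot beat the current best: stop early */
    }
    return sum;
}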
During image scenes with large amounts of motion, the encoder can generate far more bits per image frame than the channel can handle for a given image frame rate. When this occurs, there can be a large amount of delay between successive image frames at the decoder due to transmission delays.
This delay is distracting. Take the example of a first person listening to and viewing an image of a second person. The first person may hear the voice of the second person but, in the image of the second person, the second person's mouth does not move until later due to the delay created by a previous sequence. One way to overcome this is to not encode more image frames until the transmission channel has had a chance to clear up, allowing reduction of latency by skipping frames to clear out the transmission channel after sending a frame requiring more bits than the average allocated allotment.
The system does two types of frame skipping. First, immediately after encoding a still image it waits for the buffer to clear before getting a new image to encode, as shown in the pseudo code below:
Still_encode(&mac_hr);  /* call subroutine to perform a still image encode */
/* Then wait until the buffer equals BufferSize before encoding again */
waitUntilBufferFullness(BufferSize + (capacity * 2));
/* Then skip two frames because the one old frame is already packed in the slave
   memory and the MD has a second old frame in its memory. Dropping two will get
   us a new frame */
NumFramesToSkip = 2;  /* set how many frames to skip before acquiring a new one */

Second, when encoding video, a different method is used that is a little more efficient in that the processor does not have to wait for communications, but instead generates an estimation of how many frames to skip (based on how many bits were used to encode the last image) and then starts encoding the next frame after that, on the assumption that the buffer will continue emptying while the image is being collected and initially analyzed for a scene change. For instance, the following pseudo code demonstrates this process:

/* Now test to see if the previously coded image, pci, takes up too much channel
   capacity to be immediately followed by another image. The capacity, or bits per
   frame, is calculated from the bits/sec and frames/sec. In the capacity
   calculation there is a divide by 2 because there are two processors, each of
   which gets half the bits. */
/* Calculate how many bits it took to encode the last image based upon the bits it
   took to encode it plus the number of bits used to refresh it. */
rd.yuv_rate += requestCumulativeRefreshBits(rd.yuv_rate);
/* Skip is an integer. The -1.0 is to downward bias the number of frames to be
   skipped. */
Skip = (rd.yuv_rate / (2 * capacity)) - 1.0;
/* Calculate the number of input frames to skip before encoding a new image frame */
if (Skip > NumFramesToSkip)
    NumFramesToSkip = Skip;
In addition, the software tells the video front end acquisition module when to capture an image for encoding by sending it a control signal telling it how many frames to skip (NumFramesToSkip).
The average allocated allotment of bits is determined by the channel bit rate divided by the target frame rate. It is calculated before each image is encoded so that it can be updated by the system.
Video Output System The post decoder spatial filter is responsible for filtering the Y image plane and transferring the Y, U, and V image planes to the host. The filter operation preferably uses no external memory to perform the filter operations. This requires that the filter operation be done in blocks and the filtered data be directly transferred into a YUV to RGB FIFO. For this purpose, the block size is two lines. The filter uses a four line internal memory buffer to filter the data. The format of data for the YUV-to-RGB converter is shown in Figure 15.
Task Prioritization and Timing Relationships Task prioritization and timing are preferably handled by the master processor 102 since it controls data communication. Except where otherwise noted, task prioritization and timing will refer to the master processor 102. The highest priority task in the system is the collection of input image data. Since there is little buffer memory associated with the video front end 104, the master processor 102 must be able to collect the video data as it becomes available. Forcing this function to have the highest priority causes the next lower priority task to have a maximum interrupt latency and degrades system performance of the lower priority task. Slave processor communication is the next highest priority task. The slave processors 108, 110 have only one task to perform: encoding the video data. To accomplish this task, the slaves must communicate a small amount of data between themselves, and the video image must be transferred to them by the master processor 102. This communication of data must take place in a timely fashion so as not to slow the slave processors 108, 110 down.
Next in the priority task list is the PC host communication. By far, the highest bandwidth data communicated between the master processor 102 and the PC host will be the RGB output data. Preferably, this data is double buffered through a FIFO. This allows the PC host the most flexibility to receive the data. The remaining tasks (input image pre-processing and decoding) are background tasks.
If one assumes that the master processor 102 is not over allocated (this must be true for the system to function), the background tasks will execute in a well-defined order. Once the input image is acquired, it will be filtered (pre-processed) and sent to the slave processors 108 and 110. The next background task will be to decode the bit stream coming from the PC host. Once this task is complete, the master processor 102 will post-process the image and transfer it to the YUV-to-RGB FIFO as a foreground task.
Encoder Memory Organization For the video encoder/decoder 103 described herein, the memory system has 128 kwords of 32-bit memory followed by 128 kwords of byte-wide memory that is zero-filled in bits 9 through 23, with the remaining upper 8 bits floating.
YUV-to-RGB Conversion An FPGA has been designed that implements the RGB conversion using two-bit coefficients. The design requires nearly all of the logic units of a Texas Instruments TPC1020. The functions of this subsystem are to take data from an input FIFO and to convert this data to 15 or 16-bit RGB output. An interface is provided to an 8-bit EPROM to support 256 color VGA modes. Further details of this process are found below in the sections titled "YUV to RGB Conversion" and "Video Teleconferencing," and in Appendix 6.
Memory Requirements This section summarizes the memory requirements of a preferred embodiment video algorithm. All multi-dimensional arrays are allocated as matrices, with an array of pointers pointing to each row of data. This requires an extra "num-rows" of data elements. For QCIF image planes this accounts for the extra 144 data words. This includes all of the large memory structures and most of the smaller structures. The process employed to derive memory requirements is found in Appendix 1.
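For illustration, an image plane allocated in the row-pointer style described above might be set up as in the following sketch; the function names and the use of malloc are assumptions, since the actual system allocates from fixed DSP memory regions.
#include <stdlib.h>
/* Hypothetical sketch: allocate a rows x cols image as one block of pixel data
   plus an array of row pointers, so image[r][c] addresses the plane directly.
   For a 144-row QCIF Y plane this costs an extra 144 pointer words, as noted above. */
static unsigned char **allocImage(int rows, int cols)
{
    unsigned char **image = malloc(rows * sizeof *image);
    unsigned char *data = malloc((size_t)rows * (size_t)cols);
    int r;
    if (image == NULL || data == NULL) { free(image); free(data); return NULL; }
    for (r = 0; r < rows; r++)
        image[r] = data + (size_t)r * (size_t)cols;  /* each pointer marks one row */
    return image;
}
static void freeImage(unsigned char **image)
{
    if (image != NULL) { free(image[0]); free(image); }
}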
Pre- and Post-Filtering The pre- and post-filters are preferably row-column filters that filter the image in each dimension separately. The filter implementation may use internal memory for buffering. The internal memory is totally dynamic, i.e. it may be re-used by other routines. Less overhead will be incurred if the filter code is stored in internal memory (less than 100 words, statically allocated).
The pre-processing filter may be a five phase process. First, the data must be input from the video front end 104. Second, the row-wise filtering is performed. This filtering assumes a specific data format that contains two pixels per 32-bit word. The third phase reformats the output of the row-wise filter. The fourth phase performs the column-wise filter and the fifth phase reformats the data into whatever format the encoder will use.
The fastest way of implementing this method is with 7-bit data. If 8-bit samples are used, the performance will be degraded slightly. Figure 16 shows the time for each phase assuming the encoder will be operating on byte packed data. This also assumes zero overhead for calling the function and function setup. It should be much less than one cycle per pixel (0.5 will be used in this discussion). The last phase consumes time in both the master processor 102 and encoder processors 108, 110, which is where the actual data transfer from master to encoder takes place. Total processor utilization for filtering Y, U and V at 15 frames per second can be calculated as follows (if just Y is filtered, the total time is reduced by 1/3):
TotalTimeInMicroSeconds = (176 x 144 x 1.5) x 0.05 x cyclesPerPixel
PercentUtilization = TotalTimeInMicroSeconds / 666.6666
Total master processor 102 utilization for 7-bit data is then 29.7%. For 8-bit data, the master processor 102 utilization is 32.5%. For both cases the encoder processor utilization is 3.5% (data transmission).
The post processing may also be accomplished using a five step process. The data must be reformatted from the packed pixel to a filter format. Step two filters the data columnwise. Step three reformats the data. Step four filters the data row-wise. Step five reformats and transmits the data to the YUV-to-RGB interface. Figure 17 depicts the processor utilization (minus overhead) for these phases.
Total master processor 102 utilization for filtering 7-bit Y, U and V data is then 32.1%. For 8-bit data the master processor 102 utilization is 35.1%. The output system is assumed to take data in the low byte at zero wait states. Master processor 102 utilization may be decreased by more than 7% if the YUV-to-RGB interface could take two Ys at a time and both the U and V at a time in the format shown in Figure 18. In addition, master processor 102 utilization may be decreased by at least 3% if the YUV interface could take the 8-bit data from either bits 8-15 or 24-31 of the data bus as specified by the address. YUV-to-RGB Conversion This section summarizes the YUV-to-RGB conversion of image data for host display. This section assumes the YUV components are 8-bit linearly quantized values. It should be noted that a Philips data book defines a slightly compressed range of values for the digital YUV representation.
Approximate coefficients may be implemented with only hardware adder/subtractors, i.e., no multiplier, if the coefficients contain only two bits that are on (set to one). This effectively breaks the coefficient down into two one-bit coefficients with appropriate shifting. For lack of a better name, the rest of this section will call these coefficients "two-bit" coefficients. The two-bit coefficient method requires no multiplier arrays. The actual coefficients are:
R = 1.403xV + Y
G = Y - .714xV - .344xU
B = 1.773xU + Y
Testing of the two-bit coefficients revealed no subjective difference when compared directly with the eight bit coefficients. The two-bit coefficient values are:
R = 1.500xV + Y
G = Y - .750xV - .375xU
B = 1.750xU + Y
Coefficient error ranges from 1.2% to 9%. The 9% error occurs in the U coefficient of the G equation; this error is mooted by the other components of the equation. Shift operations do not involve logic. It is important to remember that the hardware must hard limit the RGB to 8 bits. Therefore, an overflow/underflow detector must be built into the hardware that implements this conversion.
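A minimal C sketch of this conversion is given below for illustration; each coefficient is realized with shifts and adds, and the result is hard limited to 8 bits as required above. The function names are illustrative, and the chrominance samples are assumed to be signed values centered on zero (the actual hardware works directly on the FIFO data format).
/* Clamp to 8 bits, mirroring the overflow/underflow detector in the hardware. */
static int clamp8(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }
/* Hypothetical sketch of the "two-bit" coefficient conversion:
     R = Y + 1.500*V
     G = Y - 0.750*V - 0.375*U
     B = Y + 1.750*U
   Right shifts stand in for the hardwired shifts (arithmetic shifts assumed). */
static void yuvToRgb(int y, int u, int v, int *r, int *g, int *b)
{
    *r = clamp8(y + v + (v >> 1));
    *g = clamp8(y - ((v >> 1) + (v >> 2)) - ((u >> 2) + (u >> 3)));
    *b = clamp8(y + u + (u >> 1) + (u >> 2));
}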
Threshold Adaptation In a preferred embodiment, the video encoder system compares the mean squared error (MSE) between the current image and the last previously coded image to a threshold value. The video encoder looks at the ratio of buffer fullness to buffer size to determine how to set the g parameter for adjusting the threshold value. For a single encoder system, this is modified to be approximately:
(buffer fullness/buffer size) + (cumulative number of bits/expected number of bits) - 0.4.
Both systems, single and dual encoder, perform their threshold adaptation strategy on a block-by-block basis, but do not take into account whether or not a particular block has a high MSE with respect to other blocks in the image. The following is a method for adapting thresholds in a way that will better allocate bits and computational power for single and multi-encoder systems.
As previously noted, the system does not compare the current block's MSE to the MSE of other blocks in the image, and therefore the system may spend a lot of time and bits trying to encode a given block when that time and those bits may have been better spent (proportionally) on other blocks with higher MSEs. This effect can be compensated for by using a block's MSE value to adjust its threshold level.
Since a scene change calculation is performed at the beginning of the encoding process for each frame, the MSEs for 8X8 blocks are readily available (stored in memory). In addition, the MSEs for 16X16 blocks are available from the 8X8 blocks by adding four localized 8X8 blocks together - this step is already done during the encoding process. By analyzing the MSE data before encoding, one can come up with any number of methods for partitioning the image blocks into various "bins" for allocating system resources so that the calculation time and encoding bits are efficiently distributed over the image. Each of these "bins" would specify a unique multiplicative constant for the threshold equation. This would allow more time and bits to be spent on the blocks that need it (those with a high MSE) and not waste those resources on blocks that do not. Initially, the problem of how to classify image blocks can be addressed by having a fixed number of default MSE ranges (possibly derived from a series of test sequences) so that any given block's MSE value can be categorized with a lookup table. This could possibly be done at only the 16X16 block level so as to reduce the number of lookups to that required for the 99 blocks. The 16X16 block level category could then be extended down to the lower levels of the hierarchy. If calculation power permits, then the lower hierarchical levels could do their own categorization on a block-by-block basis. Ideally, this block-by-block categorization would occur at all levels of the hierarchy. Each of these categories would then be used to modify the threshold levels.
A fixed number of default MSE ranges could be extended to more ranges, or even turned into a function, such as a piece-wise linear curve. One could even base the ranges/function on the statistics of each individual image in a sequence (if one has the processing power). Any number of techniques could be employed.
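As one possible sketch of the bin classification just described, a block's MSE could be mapped to a threshold multiplier through a small lookup table; the ranges and constants below are placeholders, not values from the preferred embodiment.
/* Hypothetical sketch: map a block's MSE into one of a fixed set of bins, each
   supplying a multiplicative constant for the threshold equation. Higher-MSE
   blocks get a smaller multiplier (lower threshold) so more time and bits are
   spent on them. */
static double thresholdMultiplier(double blockMSE)
{
    static const double binEdge[3]    = { 10.0, 50.0, 200.0 };    /* placeholder MSE ranges */
    static const double multiplier[4] = { 2.0, 1.0, 0.75, 0.5 };  /* placeholder constants  */
    int bin = 0;
    while (bin < 3 && blockMSE > binEdge[bin])
        bin++;
    return multiplier[bin];
}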
The proposed enhancements should fit in with the buffer fullness constraint used by the preferred embodiment algorithm to adjust the thresholds by realizing that the ratio of bits calculated to bits expected can be re-calculated from the new distribution of expected bits for each block. That is, the bits expected calculation is no longer just a function of time, but instead is based upon both the elapsed time and the expected number of bits to encode each of the previously encoded blocks. In addition, to adjust the number of expected bits used to encode a frame, the threshold for each block can be scaled by the following factor:
(allowed number of bits per frame)/(sum of expected bits for all the image blocks). The above ratio could be modified by using (allowed number of bits per frame - refresh bits per frame) in the numerator to allow for some amount of refresh to be performed each frame.
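Written out, the scaling factor described above could look like the following sketch (variable names are illustrative):
/* Hypothetical sketch: scale every block threshold so that the expected bit
   budget for the frame matches the allowed budget, reserving refresh bits per
   frame as described above (pass 0 for refreshBitsPerFrame to omit refresh). */
static double thresholdScale(double allowedBitsPerFrame,
                             double refreshBitsPerFrame,
                             double sumExpectedBlockBits)
{
    if (sumExpectedBlockBits <= 0.0)
        return 1.0;  /* nothing has been estimated yet; leave thresholds alone */
    return (allowedBitsPerFrame - refreshBitsPerFrame) / sumExpectedBlockBits;
}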
The proposed video coder algorithm enhancements distribute the bits spent encoding an image to those areas that need them in a mean squared error sense. Additionally, they help with the computational processing power spent in encoding an image by concentrating that power in those areas that need it in a mean squared error sense. This not only better utilizes resources in a multi-encoder environment, but it also helps with the realization of a 15 frames per second, 19.2KHz, single video encoder/decoder system.
Further details of the process for adjusting the quality of the video coder with adaptive threshold techniques are found in the following section. The quality and the bit rate of the video algorithm are controlled by the levels of the hierarchy that are used to encode the image. While encoding all the blocks with the 4x4 VQ results in the highest quality, it also imposes the largest bit-rate penalty, requiring over 3000 bits per macro row (over 27000 bits per frame). The hierarchical structure of the video coder addresses this problem by providing a collection of video coders, ranging from a very high coding gain (1 bit per 256 pixels) down to more modest coding gains (14 bits per 16 pixels). Adjusting the quality is the same problem as selecting the levels of the hierarchy that are used to encode the image blocks.
The selection of the levels of the hierarchy that are used and the particular encoders within each level of the hierarchy are controlled by thresholds. A distortion measure is calculated and this value is compared to a threshold to see if the image block could be encoded at that level. If the distortion is less than the threshold, the image block is encoded with that encoder. The thresholds are preferably adjusted for every image block based upon a measure of the image block activity (defined as the standard deviation). The dependence on the block activity follows a psychovisual model that states that the visual system can tolerate higher distortion in a high activity block than in a low activity block. In other words, the eye is more sensitive to distortion in uniform areas than in non-uniform areas.
The threshold value, T, that is used for any image block may be written as: T = (N x N) x γ x (ασ + β) x bias[N], where N is the dimension of the image block (typically 16, 8, or 4), σ is the standard deviation of the image block, α and β are weighting parameters, and γ is an overall scale parameter whose value preferably varies based on buffer fullness and the type of encoding. The bias term, bias[N], is used to compensate for the fact that at a given distortion value, a larger image block may appear to have lower quality than a smaller image block with the same distortion. The bias term, therefore, attempts to exploit this phenomenon by biasing the smaller image block size with a larger threshold, i.e. larger bias[N]. In a preferred embodiment, the bias terms are: 1, 1.2, and 1.4 for 16x16, 8x8, and 4x4 image blocks, respectively.
In addition, there is a constraint that α and β must sum to 1. In a preferred embodiment, α = 0.2 and β = 0.8. A large value of α biases the encoder to put more emphasis on the variance of the image block so that higher variance blocks (i.e., higher activity blocks) are encoded at earlier levels of the hierarchy and at lower bit rates because there is a higher threshold associated with those blocks. As noted above, the value of the scale parameter γ preferably varies as the HVQC algorithm is applied. A larger value of γ forces more image blocks to be encoded at the higher levels of the hierarchy because the threshold becomes much higher. On the other hand, when γ is assigned a smaller value, more image blocks may be encoded at the lower levels of the hierarchy.
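Reading the threshold equation above literally, the per-block threshold could be computed as in the following sketch; the bias and α/β values are those given for the preferred embodiment, while the arrangement of the formula as (N x N) x γ x (ασ + β) x bias[N] is an interpretation of the text.
/* Hypothetical sketch of the per-block threshold
     T = (N * N) * gamma * (alpha*sigma + beta) * bias(N)
   where sigma is the block standard deviation and gamma is the scale parameter
   chosen from the buffer fullness. */
static double biasForBlockSize(int n)
{
    if (n == 16) return 1.0;
    if (n == 8)  return 1.2;
    return 1.4;  /* 4x4 blocks */
}
static double blockThreshold(int n, double sigma, double gamma)
{
    const double alpha = 0.2, beta = 0.8;  /* alpha + beta = 1 */
    return (double)(n * n) * gamma * (alpha * sigma + beta) * biasForBlockSize(n);
}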
There is a different value of γ for each of the individual encoders. In a preferred embodiment of the present invention, γ is an array of eight elements, three of which may be used. These elements are shown in Figure 19 in which γ ranges from a low of 6 to a high of 80. Gamma[BG16], appropriately scaled by the bias term for other image block sizes, is used for all the background and motion tests. The actual value of gamma[BG16] used within the coder is determined by the buffer fullness. Smaller thresholds are used when the buffer is nearly empty and larger thresholds are used when the buffer is almost full. As shown in Figure 20, the values for bgth16[0], bgth16[1], bgth16[2], f1, and f2 used for a preferred embodiment are listed below. These values indicate that the system is operating with a set of relatively small thresholds (between bgth16[0] and bgth16[1]) until the buffer fills to 70 percent full. Above f2, the overflow value, bgth16 overflow, is used. The following variables correspond to Figure 20:
bgth16[0] (MSE per pixel) 6.0
bgth16[1] (MSE per pixel) 20.0
bgth16[2] (MSE per pixel) 50.0
f1 (percent of buffer fullness) 0.70
f2 (percent of buffer fullness) 0.95
bgth16 overflow 80
sgth4[0] (MSE per pixel) 3.0
sgth4[1] (MSE per pixel) 10.0
sgth4[2] (MSE per pixel) 15.0
With respect to external controls, the channel capacity determines the bit rate that is available to the encoder. The bit rate, i.e. number of bits per second, is used in forming the encoder buffer size that is used in the ratio of the buffer fullness to buffer size. When used with the frame rate, the bit rate determines the number of bits per frame. The buffer size is computed as 3.75 x bits per frame. The bits per frame are used to determine if there are bits available for refreshing after the image has been coded.
Frame rate specifies the frame rate that is used with the channel capacity to compute the number of bits per frame. The motion tracking number is used to scale the ratio of the buffer fullness to buffer size. A small number makes the buffer appear empty and hence more bits are generated. Further details on threshold adjustment are found in Appendix 4, which contains code in the C programming language for computing thresholds.
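As a simplified sketch of the external-control calculations just described (this is not the Appendix 4 threshold code; the structure and names are illustrative):
/* Hypothetical sketch: derive bits per frame and the encoder buffer size from
   the external controls. The frame rate divisor follows the bit stream
   convention of frame rate = 30 / divisor. */
struct RateControl {
    double bitsPerFrame;
    double bufferSize;   /* 3.75 x bits per frame, as described above */
};
static struct RateControl configureRate(double channelBitsPerSecond, int frameRateDivisor)
{
    struct RateControl rc;
    double frameRate = 30.0 / (double)frameRateDivisor;
    rc.bitsPerFrame = channelBitsPerSecond / frameRate;
    rc.bufferSize   = 3.75 * rc.bitsPerFrame;
    return rc;
}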
System Software The video encoder/decoder system 103 may be composed of three TMS320C31 DSP processors. Specifically, in the preferred embodiment, there is a master processor 102 and two slave processors 108, 110. Under this architecture, the master processor 102 is responsible for all data input and output, overall system control, coordinating interprocessor communication, decoding images and pre- and post-processing of the images, such as filtering. The slave processors 108, 110 are solely responsible for encoding the image. The software system (micro-code) for the video encoder/decoder 103 is formed from three parts: the system software, encoding software and decoding software. The system software provides the real-time interface to the video source data, bit stream data, image transfer to the host and overall system control and coordination. This section is intended to give an overview perspective of the video encoder/decoder system software. The major data
paths and data buffering will be described. Interrupt processing on the master processor is then discussed. A major portion of the section is dedicated to describing the system source code modules from an overview perspective.
There are two basic data paths in the video encoder/decoder 103. One is the encoder data path while the other is the decoder data path. For the encoder, the data path starts at a video source on the master processor 102; the data is then filtered and transferred by the master to the slaves 108, 110. The slaves 108, 110 encode the data one macro row at a time, sending the encoded bits back to the master processor 102. A macro row is a horizontal strip of the image containing 16 full rows of the image. The master processor 102 buffers these bits until the host requests them. The video source can be either the video front end 104 or the host depending on the video mode set in the video encoder/decoder 103 status. By default the video mode is set to the host as the source.
The decoder data path starts with encoded data flowing into the decoder buffer from either the host when in "normal" mode or the encoder buffer when in "pass-through" mode. The master processor 102 then decodes the encoded data into an image or frame. The frame is then filtered by the master processor 102 and transferred to the host through the YUV to RGB converter system two lines at a time. This converter system contains a FIFO so that up to five lines of RGB data can be buffered up in the FIFO at any given time. The data is synchronously transferred two lines at a time. In theory, the image can be filtered and transferred to the host in one long burst. In practice this is not completely true. For about 9 milliseconds of a frame time, interrupts from the video input line collection system can cause the video encoder/decoder 103 to not keep up with the host if the host is fast enough.
The decoded image that is to be filtered and passed to the host becomes the reference frame for the next decoded image. This provides a double buffer mechanism for both the filter/data transfer and the decoding process. No other buffers are required to provide a double buffer scheme.
Besides the major data paths, there is a small amount of data that must flow between the slave processors 108, 110 every frame. This data path is a result of splitting the encoding process between the two slave processors 108, 110. This data path is handled by the master processor 102.
The major data buffers for the video encoder/decoder 103 are the master video input data buffer, the master decoder image buffers, the master decoder encoded bit buffer, the master encoder encoded bit buffer, the slave encoded bit buffers, the slave packed image buffers and the slave unpacked encoder image buffers.
The master video input data buffer consists of three buffers containing the Y, U & V image planes. The Y image is formatted with two pixels per word. The size of the Y buffer is 178x144/2 or 12,672 words. There are two extra lines to make the spatial filtering easier and more efficient. The Y data format and U & V data format are shown in Figure 21. The U & V images are packed four pixels per word in the format received from the video front end 104. The least significant byte of the word contains the left most pixel of the four pixels. The size of each U & V buffer is 89x72/4 or 1602 words. There is an extra line because the system is collecting two extra Y lines. The names of these buffers are inputY, inputU and inputV. There are two master decoder image buffers. One is used to send a decoded image to the host and as a reference to decode the next image, while the other is used to build the next decoded image from the reference image. Because the reference image is undisturbed by the decoding process, it can also be transferred to the host. Once a newly decoded image has been formed, it is then used as the reference image for the next decode. This approach allows for concurrent decoding with image transfer.
The master decoder encoded bit buffer is used to store encoded bit data to be decoded by the decoder. The buffer is a circular buffer. The bits are stored in packets. Each packet has an integer number of macro rows associated with it. The first word in the packet contains the number of bits in the data portion of the packet. Each macro row within the packet starts on a byte boundary. Bits in the gaps between macro rows must be zero. These bits are referred to herein as filler bits. The last macro row may be filled out to the next byte boundary with filler bits (value zero).
The size of this buffer is set to 48 Kbits or 1536 words in a preferred embodiment. The size of this buffer could be reduced substantially. It is preferably large so that one can use different techniques for temporal synching. The decoder buffer is maintained through a structure known as MDecoder and referenced by GlobalDecoder. MDecoder is of type ChannelState and GlobalDecoder is of type ChannelState.
The master encoder encoded bit buffer is used to combine and store encoded bit data from the slave processors 108, 110. The buffer is a circular buffer. Each slave processor 108, 110 sends encoded bits to the master processor 102 one macro row at a time. As the master processor 102 receives macro rows from the slave processors 108, 110, it places them in the master encoded bit buffer in packet format. The lower 16 bits of the first word of each packet contain the number of encoded bits in the packet followed by the encoded bits. If the first word of the packet is negative it indicates that this packet contains what is known as a picture start code. The picture start code packet delimits individual frames of encoded bits. Any unused bits in the last words of the packet are zero.
There are two slave encoded bit buffers, which are preferably "Ping-Pong" buffers. Each slave processor 108 or 110 contains a Ping-Pong buffer that is used to build and transfer encoded bits one macro row at a time. While one macro row is being encoded into one of the Ping-Pong buffers, the other is being transmitted to the master processor 102. Ping and pong may each have a size of 8192 bits. This restricts the maximum macro row size to 8192 bits.
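The Ping-Pong arrangement might be sketched as follows; the structure and function names are illustrative, and the actual buffers live in slave memory.
/* Hypothetical sketch of the slave encoded bit buffers: while one 8192-bit
   buffer is filled with the macro row being encoded, the other is transmitted
   to the master; the roles swap after each macro row. */
#define PINGPONG_BITS  8192
#define PINGPONG_WORDS (PINGPONG_BITS / 32)
struct PingPong {
    unsigned int buf[2][PINGPONG_WORDS];
    int fill;  /* index (0 or 1) of the buffer currently being filled */
};
static unsigned int *fillBuffer(struct PingPong *pp)     { return pp->buf[pp->fill]; }
static unsigned int *transmitBuffer(struct PingPong *pp) { return pp->buf[pp->fill ^ 1]; }
static void swapBuffers(struct PingPong *pp)             { pp->fill ^= 1; }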
The slave packed image buffer is used to receive the filtered video input image from the master processor 102. The Y, U & V image planes are packed 4 pixels per word. The sizes are preferably 176x144/4 (6336) words for Y and 88x72/4 (1584) words for U & V. The encoder will unpack this buffer into an unpacked encoder image buffer at the start of an encoding cycle.
There are two slave unpacked encoder image buffers. One is the current unpacked video image to be coded and one is an unpacked reference image. At the beginning of a coding cycle the packed image buffer is unpacked into the current video image buffer. The reference image is the last encoded image. For still image encoding the reference image is not used. As the encoder encodes the current video image, the current video image is overwritten with coded image blocks. After the current image has been fully coded, it then becomes the reference image for the next encoding cycle.
The master processor 102 is responsible for video input collection and filtering, image decoding and post filtering, bit stream transfer both from the encoder to the host and from the host to the decoder, inter-slave communication and overall system control. The master processor 102 is the only processor of the three that has interrupt routines (other than debugger interrupts). The master has interrupt routines for the video frame clock, video line collection, RGB FIFO half full, exchange register in and out, and the slave/debugger.
The highest priority interrupt is the video line collection interrupt. This is the only interrupt that is never turned off except to enter and exit a lower priority interrupt. Interrupt latency for this interrupt is kept to a minimum since the video line collection system has very little memory in it.
The next highest priority interrupt is the slave interrupt. This allows the slave processors 108, 110 to communicate with the master and also with each other with the help of the master. Timely response to slave requests allows the slaves more time to do the encoding.
The next highest priority interrupts are the exchange register interrupts. There are two interrupts for each direction (four total) of the exchange register. Two are used to process CPU interrupts while the other two are used to process DMA interrupts (transfer). The CPU interrupt grants external bus access to the DMA, which is a lower priority device on the bus, thereby guaranteeing a reasonable DMA latency as seen from the host. When the master processor 102 turns off interrupts, the DMA can still proceed if it can gain access to the external bus. Normally, the only time the CPU interrupts are off is when the master is collecting a video input line or the master is trying to protect code from interrupts. In both cases, the code that is being executed should provide sufficient external bus access holes to allow for reasonable DMA latency times.
The next highest priority interrupt is the RGB FIFO half-full interrupt. This interrupt occurs any time the FIFO falls below half full. This allows the master processor 102 to continually keep ahead of the host in transferring RGB data assuming that the master is not busy collecting video line data. The lowest priority interrupt is the frame clock that interrupts 30 times a second. This interrupt provides for a video frame temporal reference, frame rate enforcement and simulated limited channel capacity in "pass through" and "record" modes.
The software is structured to allow up to three levels of interrupt processing. The first level can process any of the interrupts. While the video line interrupt is being processed, no other interrupt can be processed. If the slave, frame or exchange register interrupt is being processed, the video line collection can cause a second level interrupt to occur. While the RGB interrupt is being processed, the slave, frame or exchange register can cause a second level interrupt to occur, and while that second level interrupt is being processed, the video collection interrupt could cause a third level interrupt.
The host communication interface is a software system that provides a system for communicating between the video encoder/decoder 103 and the host processor. There are two hardware interfaces to support this system. The former was designed for low bandwidth general purpose communication, while the latter was designed to meet the specific need of transferring high bandwidth video data to be displayed by the host. Only the 16 bit exchange register can cause a host interrupt. Interrupts are provided for both the host buffer full and host buffer empty conditions.
As data is received from the host, it is processed by a protocol parsing system. This system is responsible for interpreting the header of the protocol packets and taking further action based on the command found in the header. For protocol packets that contain data, the parser will queue up a DMA request to get the data. The name of the protocol parser is HOST_in() and it can be found in the file host_if.c.
Data to be sent to the host takes one of two forms. It is either general protocol data or YUV data to be converted to RGB. In either case the data is sent to the host through a queuing system. This allows the master processor 102 to continue processing data (decoding) while the data transfer takes place. The system is set up to allow 20 outstanding messages to be sent to the host. In most cases, there will be no more than 2 or 3 outstanding messages to be sent to the host. Although YUV data is sent through the YUV to RGB converter system, the associated message informing the host of data availability is sent through the exchange register. This system contains a FIFO that will buffer up to 5 full lines of YUV data. When YUV data is ready to be sent to the host, the master processor 102 places the data into the YUV to RGB FIFO, either two or six lines at a time depending on the size of the image. Each time the two or six lines are placed into the FIFO, the master processor 102 also queues up a protocol message to the host indicating that it has placed data in the FIFO. The master processor 102 will place data in the FIFO any time the FIFO is less than half full (and there is data to be sent). This allows the master processor 102 to keep ahead of the host unless the master processor 102 is collecting a line of video data or moving PCI data from one slave processor to the other (i.e., has something of a higher priority to do).
YUV frames are queued up to be sent to the host with the RGB_queue() function. Packets of data to be sent to the host are queued up with the HOST_queue() and HOST_queueBitBuffer() functions. The HOST_queue() function is used to queue up standard protocol packets. The HOST_queueBitBuffer() function is used to queue up a bit stream data buffer that may contain several protocol packets containing bit stream information (macro rows). The HOST_vsaRespond() function is used to send status information to the host. Further details of the video encoder/decoder host software interface are provided below.
The video front end interface is responsible for collecting and formatting video data from the video capture board. From the application level view, the video front end system collects one frame at a time. Each frame is collected by calling the VIDEO_start() function. The video start function performs several tasks. First, it looks at the video encoder/decoder status and sets the pan, pre-filter coefficients, frame rate, brightness, contrast, hue and saturation. Then it either arms the video front end 104 to collect the next frame or defers that to the frame interrupt routine depending on the state of the frame sampling system.
The frame sampling system is responsible for enforcing a given frame rate. If the application code requests a frame faster than the specified frame rate, the sampling waits an appropriate time before arming the video collection system. This system is implemented through the frame clock interrupt function FRAME_interrupt(). The FRAME_interrupt() function is also responsible for coordinating the simulated limited channel capacity in "record" and "pass-through" modes. It may do this by requesting that encoded bits be moved from the encoder buffer to their destination on every frame clock. The amount of data moved is dependent on the specified channel capacity.
Once the video front end system has been armed, it will collect the next frame under interrupt control. The video lines are collected by two interrupt routines. One interrupt routine collects the even video lines and the other interrupt routine collects the odd lines. This is necessary to save time since it is necessary to decimate the U & V components in the vertical direction. The even line collection interrupt routine collects and places the Y component for that line into an internal memory buffer. The U & V components are stored directly to external memory. The odd line collection interrupt routine discards the U & V components and combines the Y components from the last even line with the Y components from the current line and places these in external memory. The format of the Y image is compatible with the spatial filter routine that will be used to condition the Y image plane.
Since the video line collection interrupts could interrupt code that has the slave processors 108, 110 in hold (master processor 102 accessing slave memory), the interrupt code releases the hold on the slave processors 108, 110 before collecting the video line. Just before the interrupt routine returns, it restores the previous state of the slave hold.
The front end spatial filter is responsible for filtering and transferring the Y image plane to the slave processors 108, 110. The image is transferred in packed format (four pixels per word). The slave processors 108, 110 are put into hold only when the data transfer part of the process is taking place. Two lines of data are filtered and transferred at a time. The function InputFilterY() is used to filter and transfer one Y image plane.
The U & V image planes are transferred to the slave using the moveImageToSlave() function. This function moves the data in blocks from the master to the slave. The block is first copied into internal memory, then the slave is placed into hold and the data is copied from internal memory to the slave memory. The slave is then released and the next block is copied. This repeats until the entire image is moved. The internal memory buffer is used because this is the fastest way to move data around in the C31 processor.
The post decoder spatial filter is responsible for filtering the Y image plane and transferring the Y, U & V image planes to the host. The filter operation uses no external memory to perform the filter operations. This requires that the filter operation be done in blocks and the filtered data be directly transferred into the YUV to RGB FIFO. For this purpose, the block size is two lines. The filter uses a four line internal memory buffer to filter the data. To get things started (i.e., put two lines into the internal memory buffer and initialize the filter system) the function OutputFilterYinit() is used. Each time the system wishes to place two more lines of data into the FIFO it calls the OutputFilterY() function. These two functions are used exclusively to filter and transfer an image to the FIFO. They do not coordinate the transfer with the host. Coordination is performed by the RGB back end interface described next.
The RGB back end is responsible for coordinating the transfer of selfview and decoded images to the host. In the case of the decoded image, the Y plane is also spatially filtered. There are three types of images that can be transferred. They are a decoded image of size 176x144, a selfview image of size 176x144 and a selfview image of size 44x36. The video encoder/decoder system is capable of transferring both a selfview and a decoded image each frame time. The smaller selfview image is a decimated version of the larger selfview image.
Whenever the application code wants to send an image to the host it calls the RGB_queue() function, passing it the starting address of the Y, U & V image planes and what type of image is being transferred. The queuing system is then responsible for filtering (if it is a decoded image) and transferring the image to the host. Selfview images are not filtered. The queuing system can queue up to two images. This allows both a selfview and a decoded image to be simultaneously queued for transfer.
The images are transferred to the host in blocks. For the large images, the transfer block size is two lines. For the small image, the transfer block size is six lines. This makes the most effective use of the FIFO buffer in the YUV to RGB converter while allowing the host to extract data while the C31 is filling the FIFO. The mechanism used to trigger the filtering/transfer or transfer of the image block is the FIFO half full interrupt. When the FIFO goes from half full to less than half full it generates an interrupt to the C31. This interrupt is processed by the RGB_interrupt() function. The interrupt function filters and transfers a block of data using the OutputFilterY() function in the case of a decoded image. In the case of a selfview image, the interrupt function copies the image block using the OutputY() function for a large selfview image or the OutputYsmall() function for the small selfview image. The OutputFilterY() and OutputYsmall() functions also transfer the U & V image planes.
The RGB_dropCurrent() function is called whenever the video encoder/decoder 103 wants to stop sending the current image to the host and start processing the next image in the queue. This function is called by the host interface when the host sends the drop frame protocol packet.
The slave communication interface is used to coordinate the transfer of video images to the slaves, transfer encoded bit stream data from the slave processors 108, 110 to the master processor 102, transfer data between the slaves and transfer data from the slaves to the host. The latter provides a way to get statistical data out of the slave processors 108, 110 and is used for debugging only. When a slave needs service it interrupts the master processor 102. The interrupt is processed by the SLAVE_request() function. The types of slave requests that can be processed are:
The request "Request New Frame" is mad right after the encoder unpacks the packed video image at the beginning of an encoding cycle. SLAVE_request() simply records that the slave wants a new frame and returns from the interrupt. The main processing loop is then responsible for filtering and transferring the new image to the slave. Once the main processing loop has transmitted a new image to the slave the new Frame Available flag in the communication area is set to TRUE.
The request "Request Scene Change Information" is used to exchange information about the complexity of the image for each slave. If the value exchanged is above some threshold, both slaves will encode the frame as a still image. The slaves transmit their local scene change information to the master as part of this request. Once both slaves have requested this information, the master exchanges the sceneValue in the respective communication areas of the slave processors 108, 1 10 with the other slave's scene change information.
The request "Request Refresh Bits" is similar to the scene change information request.
The difference is that it is requesting a different type of information. However, this information is exchanged just like the scene change information.
There are two types of buffer fullness requests. In one case the master responds instantly with the size of the master's encoder encoded bit buffer. In the other case, the slave wants to wait until the master encoder buffer is below a specified size and for the other slave to request this information.
The request "Request PCI" is used to send or receive previously coded image data to or from the other slave processor. This is used when the adaptive boundary is changed. There are two forms of this request. One for sending PCI data and one for receiving PCI. After one slave sends a receive request and the other sends a send request, the master processor 102f then copies the PCI data from the sender to the receiver. The request "Send Encoded Bits" is used to transfer a macro row of data to the master to be placed in the master's encoder encoded bit buffer. There are two forms of this request.
One to send a picture start code and one to send a macro row. The picture start code is sent before any macro rows from either processor are sent.
The request "Acknowledge Slave Encoder Buffer Cleared" is used to acknowledge the master's request for the slaves to clear their buffers. This allows the master processor 102 to synchronously clear its encoder bits buffer with the slave clearing its buffer.
The request "Send Data to Host" is used to transmit a data array to the host processor. This is used to send statistical data to the host about the performance of the encoder.
The master processor 102 can send messages to the slave through the SLAVE_inform() function. The slave processors 108, 110 are not interrupted by the message. The slaves occasionally look (once a macro row or once a frame) to see if there are any messages. The types of messages are as follows:
The message "Image Ready" informs the slave that a packed image is ready.
The "Clear Encoder Buffer" message tells the slave to clear its Ping-Pong bit buffers and restart the encoding process with the next video frame. The slave will respond after resetting its buffers by sending back an acknowledgment (Acknowledge Slave Encoder Buffer Cleared).
The "Still Image Request" message is used to tell the slave to encode the next frame as a still image. The "Scene Change Information" message is used to respond to the slaves request to exchange scene change information.
The "Refresh Bits" message is used to respond to the slaves request to exchange refresh bit information.
The "Buffer Fullness" message is used to respond to the slaves request to wait until the size of the master's encoder encoded bit buffer has fallen below a certain level.
The master bit buffer interface is responsible for managing the encoder encoded bit buffer and the decoder encoded bit buffer on the master processor 102. The bit buffers are managed through a structure of type ChannelState. The bits are stored in the buffer in packets. A packet contains a bit count word followed by the encoded bits packed 32 bits per word. Bits are stored most significant to least significant. The important elements of this structure used for this purpose are:
• wordsInBuffer This contains the current number of words in the buffer. This includes the entire packet including the storage for the number of bits in the packet (see description of bit data buffers above).
• BufferIndex This is the offset in the buffer where the next word will be taken out.
• BufferIntPtrIndex This is the offset in the buffer where the next word will be placed. If BufferIntPtrIndex is equal to BufferIndex there is no data in the buffer.
• BufferFullness This measures the number of bits currently in the buffer.
This does not include unused bits in the last word of a packet or the bit count word itself. It only includes the actual bits that will be decoded. It does include any fill bits if they exist.
• bitPos This points to the current bit in the current word to be decoded. This is not used for the encoder buffer. • bitsLeftInPacket This is used to indicate how many bits are left to be decoded in the decoder buffer. This is not used for the encoder buffer.
• pastBits This is used to keep track of the past bits when checking for unique picture start codes. This is not used otherwise.
• buffer This is a pointer to the beginning of the bit buffer. There are two functions that can place data into the decoder buffer. They are putDecoderBits() and encoderToDecoder(). The function putDecoderBits() is used to put bits from the host into the decoder buffer. The encoderToDecoder() function is used in the "pass-through" mode to move bits from the encoder buffer to the decoder buffer. The getBits() function is used to extract or look at a specified number of bits from the current bit position in the buffer. The bitsInPacket() function is used to find how many bits are left in a packet. If there are no bits or only fill bits left in the packet, the size of the next packet is returned. The skipToNextPacket() function is used when the decoder detects an error in the packet and wants to move to the next packet, ignoring the current packet.
The moveEncoderBits() function is used to simulate a limited capacity channel when the system is in the "record" or "pass-through" mode.
The getEncoderBits() function is used to move encoder bits from the master processor 102 encoder bit buffer to the host. This is used in the real-time mode when there is a real limited bit rate channel system (like a modem) requesting bits from the video encoder/decoder system.
The encoderToHost() function is used to build a host protocol buffer from the encoder buffer. The host protocol buffer can contain multiple protocol packets. This function is called by the moveEncoderBits() and getEncoderBits() functions.
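Gathered in one place, the bit buffer state described above might be declared roughly as follows; only the field names come from the descriptions above, while the types and comments are assumptions.
/* Hypothetical sketch of the ChannelState bit buffer structure. */
typedef struct {
    int           wordsInBuffer;      /* words in the buffer, including bit count words     */
    int           BufferIndex;        /* offset where the next word will be taken out       */
    int           BufferIntPtrIndex;  /* offset where the next word will be placed          */
    int           BufferFullness;     /* decodable bits currently in the buffer             */
    int           bitPos;             /* current bit within the current word (decoder only) */
    int           bitsLeftInPacket;   /* bits left in the current packet (decoder only)     */
    unsigned int  pastBits;           /* history used when scanning for picture start codes */
    unsigned int *buffer;             /* start of the circular bit buffer                   */
} ChannelState;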
The main() function and main processing loop, masterLoop(), are responsible for coordinating a number of activities. First it initializes the video encoder/decoder 103 subsystems and sends the initial status message to the host. It then enters a loop that is responsible for coordinating six things. 1. Filtering and transferring the input video image to the slaves.
2. Video input capture.
3. Transfer of selfview image to the host.
4. Transfer of YUV image to host.
5. Decoding and transfer of the decoded image to the host. 6. Transfer of encoder encoded bits to the host under certain conditions.
If an input video frame is ready and the slave has requested a new frame, the main processing loop does the following. If the video encoder/decoder 103 status indicates that a selfview image should be sent to the host and the video encoder/decoder 103 is not currently sending a selfview frame, it will queue up the new frame to be sent to the host. It will then filter and transfer the new frame to the slave processors 108, 110. This process does not alter the input image so it does not affect the selfview image. If the host has requested that the frame be sent in YUV format, the YUV image will then be transmitted to the host in YUV format. Finally, the video input system will be restarted to collect the next frame. If the decoder state is non-zero and the processing loop was entered with a non-zero flag, the DECODE_Image() function is called to coordinate the decoding and transfer to the host of the next video frame.
The last part of the main processing loop checks to see if encoded bits can be sent to the host if any are pending to be sent. If the host had requested encoder encoded bits and the encoder buffer was emptied in the host interrupt processing function before the request was satisfied, the interrupt routine must exit leaving some number of bits pending to be transmitted.
The DECODE_Image() function is responsible for coordinating the decoding process on a frame by frame basis and the transmission of the decoded frames to the host. A flag is used to coordinate the decoding and transferring of decoded images. The flag is called OKtoDecode. It is not OK to decode if there is a decoded frame currently being transmitted to the host and the next frame has already been decoded. This is because the decoder needs the frame being transmitted to construct the next decoded frame. Once that old frame has been completely transmitted to the host, the recently decoded frame is queued to be transmitted and the decoding process can continue. If the decoder bit buffer starts to back up (become full) the decoder will go on without queuing up the previously decoded image. This results in the currently transmitted image being overwritten with newly decoded data. This could cause artifacts in the frame being sent to the host. The decoding process gets or looks at bits in the decoder buffer by calling the getBits() and bitsInPacket() functions. If, during the decoding process, the getBits() or the bitsInPacket() function finds the decoder buffer is empty, it waits for the buffer to be non-empty. While it waits it calls the masterLoop() function with a zero parameter value. This allows other tasks to be performed while the decoder is waiting for bits. The slave processors 108, 110 are responsible for encoding video frames of data. This is their only responsibility, therefore there is not much system code associated with their function. The slave processors 108, 110 do not process any interrupts (except for the debugger interrupt for debugging purposes). There is a simple communication system that is used to request service from the master. All communication from the master to the slave is done on a polled basis.
The slave and the master communicate through a common communication structure located at 0x400 in the slave address space. The structure is defined in the slave.h file as a typedef SLAVE_Request. The name of this structure for the slave processor is SLAVE_com. The master processor 102 accesses the communication area at offset 0x400 in the slave memory access window. For slave 1, the communication structure resides at location 0xa00400 in the master processor 102 address space. For slave 2, the communication structure resides at location 0xc00400. In order for the master to gain access to this structure, it must first place the slave(s) into hold. Every time the slave processor needs to request service from the master processor 102, it first checks to see if the master has processed the last request. It does this by checking the hardware bit Sex_THBE (where "x" indicates which slave encoder processor is to be checked) that indicates that the slave is currently requiring service from the master. If the master has not processed the last request the slave waits until it has. When there is no pending service, the slave places a request or command in the communication structure element named "cmd". It also places any other data that the master will need to process the command in the communication structure. If the communication requires synchronization with the other slave or a response before the slave can continue, the slave also sets the communicationDone flag to FALSE. Then the slave sets the hardware request flag. If the slave needs a response before going on it will poll the communicationDone flag, waiting for it to go TRUE. Once the flag is TRUE, any requested data will be available in the communication structure.
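From the slave side, the request sequence just described might be sketched as follows. The field names cmd and communicationDone and the structure name SLAVE_Request come from the description above; the placeholder functions stand in for the Sex_THBE test and the hardware request flag.
#define FALSE 0
#define TRUE  1
/* Placeholder for the SLAVE_Request typedef in slave.h; only the fields
   mentioned above are shown. */
typedef struct {
    volatile int cmd;                /* request or command for the master           */
    volatile int communicationDone;  /* set TRUE by the master when the reply is in */
    /* ... any other data the master needs to process the command ... */
} SLAVE_Request;
/* Placeholders for the hardware handshake. */
static int  masterProcessedLastRequest(void) { return 1; }
static void raiseHardwareRequestFlag(void)   { }
/* Hypothetical sketch of a slave issuing a request that needs a response. */
static void slaveRequest(SLAVE_Request *com, int command, int needsResponse)
{
    while (!masterProcessedLastRequest())
        ;                                /* wait for the previous request to be serviced */
    com->cmd = command;                  /* plus any other data the master needs         */
    if (needsResponse)
        com->communicationDone = FALSE;
    raiseHardwareRequestFlag();          /* interrupt the master                         */
    if (needsResponse)
        while (!com->communicationDone)
            ;                            /* poll until the master fills in the reply     */
}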
Video Bit Stream This section describes the bit stream that is in use in the video encoder/decoder system when compiled with picture start code (PSC) equal to four. This section addresses the header information that precedes each macro row and the format of the bit stream for video, still, and refresh information.
Figure 22 shows a high level example of a video bit stream where the video sequence includes the first frame coded by the still image coder and the second frame coded by the video coder followed by the refresh data for part of the macro row.
To accommodate macro rows that may be dropped by the supervisor instead of being sent to the decoder because of channel errors, the decoder can decode each macro row in a random order and independently from other macro rows. This flexibility is achieved by adding a macro row address (macro row number) and macro header information to each encoded macro row.
To support macro row independence, each of the motion caches must be reset before encoding and decoding each macro row. This modification results in a slightly lower PSNR (0.1 dB) for a number of file-io sequences, but it was seen to have no perceptual degradation in both the file-io and the real-time simulations. The motion vectors should be reset to some small and some large values to account for the high-motion frames (the latest source code resets to only small motion vectors). The still image encoder is also macro row addressable, and therefore the still image VQ cache should be reset before encoding and decoding each macro row. There are four compiling options that affect how header information is added to the bit stream. These options are controlled by the compiler define switch PSC which is set in the master and slave make files. The meaning of the different values of PSC are detailed in Figure 23. Referring to Figure 23, the default configuration of the encoder/decoder is PSC=4. Macro rows of data may have leftover bits that are used to align the macro rows to byte boundaries. There will always be fewer than eight fill bits and their value will be zero. They will be a part of the bit stream, but the decoder knows not to interpret them.
This use of fill bits and packetizing of the macro rows has ramifications for detecting bit errors because it indicates (without decoding the data) the boundaries of the macro rows. This information can be used to skip to the next macro row packet in the event the decoder becomes confused.
By aligning the macro row packets of encoded data on known boundaries such as, for example, byte boundaries, the decoder can determine whether it has leftover bits, or too few bits, for decoding the given macro row. Once errors are detected, error concealment strategies of conditionally replenishing the corrupted macro rows can be performed and decoding can continue with the next macro row packet of data. This allows the detection of bit errors through the use of the macro row packetized bit stream.
The format of the encoded data depends on the type of encoding (video or still) and on which level of the hierarchy the bits are generated. Regardless of the type of data, each macro row of data begins with the macro header, defined as follows.
Video, still images, and refresh data are encoded on a macro row basis and packetized according to their macro row address. The header information for each macro row includes the fields shown in Figure 24. The amount of overhead is 8 bits per macro row. The information is found in the bit stream in the order given in Figure 24, i.e., the data type is the first two bits, the macro row address the next four bits, etc.
Referring to Figure 24, the data type (DT) (2 bits) specifies what type of data is encoded: 00 means video data, 01 means still image data, 10 represents VQ refresh data, and 11 represents scalar refresh data. The macro row address (MRA) (4 bits) specifies which of the nine (for QCIF) macro rows the encoded data corresponds to. The value in the bit stream is the macro row address plus five. The HVQ structure bit (HS) (1 bit) specifies whether the structure of the encoder hierarchy for the macro row has been modified to support the working set model (HS=1) or is using the 4x4 motion cache (HS=0). This bit is ignored by the still image decoder and the refresh decoder. The relative temporal reference (RTR) (1 bit) signifies whether the encoded macro row is part of the current frame or the next frame. The value toggles between 0 and 1 when the frame changes. All macro rows encoded from the same input frame will have the same RTR. It is important to note that the RTR is not the least significant bit of the absolute frame reference. The RTR is an independent flag that differentiates between this frame and the next frame. In addition to the macro row header information, macro row zero (for video and still data types, not for refresh) contains the temporal reference information and, once every 15 frames, frame rate information, and possibly other information as specified by a picture extended information (PEI) bit. The format of the extended macro row zero information is shown in Figure 25. Every macro row zero that contains either video or still encoded data has a 6 bit temporal reference that counts the absolute frame number, modulo 60. If the value of the PEI after the temporal reference is found to be 1, the next 6 bits correspond to the frame rate divisor. The decoder uses the frame rate information and the temporal reference information to synchronize the output display of the decoded images. The frame rate divisor is sent once every 15 frames by the encoder. The actual frame rate is calculated by dividing 30 by the frame rate divisor. The value of 0 is used to tell the decoder that it should not try to time synchronize the display of decoded images.
After the frame rate information, another PEI indicates whether there is other information available in the bit stream. The decoder will continue to read and throw away these 6 bit fields and then check for another PEI. The cycle continues until the value of the PEI is found to be 0.
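The 8-bit macro row header described above might be packed and parsed as in the following sketch; the field widths and the MRA-plus-five convention come from the text, while the placement of the data type in the most significant bits follows the most-significant-first buffer convention and should be treated as an assumption.
/* Hypothetical sketch: pack/unpack the macro row header in bit stream order:
   data type (2 bits), macro row address + 5 (4 bits), HVQ structure bit (1 bit),
   relative temporal reference (1 bit). */
static unsigned int packMacroRowHeader(unsigned int dataType, unsigned int macroRowAddress,
                                       unsigned int hvqStructure, unsigned int rtr)
{
    return ((dataType & 0x3u) << 6) |
           (((macroRowAddress + 5u) & 0xFu) << 2) |
           ((hvqStructure & 0x1u) << 1) |
           (rtr & 0x1u);
}
static void unpackMacroRowHeader(unsigned int header, unsigned int *dataType,
                                 unsigned int *macroRowAddress,
                                 unsigned int *hvqStructure, unsigned int *rtr)
{
    *dataType        = (header >> 6) & 0x3u;
    *macroRowAddress = ((header >> 2) & 0xFu) - 5u;
    *hvqStructure    = (header >> 1) & 0x1u;
    *rtr             = header & 0x1u;
}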
Video images use a quad-tree decomposition for encoding. The format of the output bit stream depends on which levels of the hierarchy are used and consequently follows a tree structure. The bit stream is perhaps best described through examples of how 16x16 blocks of the image are encoded, considering that a QCIF image is composed of 99 16x16 blocks. Figure 26 shows several different decompositions for 16x16 blocks and the resulting bit stream. The notation Flagc indicates the complement of the bit stream flag as defined in Figure 27, which depicts the flags used to describe the video bit stream. There are three types of cache entries: valid entries are data items that have been added to the cache during the encoding process; miss entries indicate that a suitable entry was not found in the cache; and invalid entries are entries that do not contain valid data because they represent uninitialized values.
Referring to Figure 26, part a., the 16x16 block is encoded as a background block. In part b., the 16x16 block is coded as an entry from the motion 16 cache. In part c., the 16x16 block is encoded as a motion block along with its motion vector. In part d., the 16x16 block is decomposed into 4 8x8 blocks which are encoded (starting with the upper left block) as a background 8x8 block, another background 8x8 block, a hit in the motion 8 cache, and as an 8x8 motion block and the associated motion vector. In part e., the 16x16 block is divided into 3 8x8 blocks and 4 4x4 blocks. The upper left 8x8 block is coded as a motion 8 cache hit; the second 8x8 block is decomposed and encoded as a VQ block, a hit in the motion 4 cache, the mean of the block and finally as a VQ block. The third 8x8 block is encoded as an 8x8 motion block and its motion vector and the final 8x8 block is encoded as a motion 8 cache hit.
With respect to color information and flag bits, the color information is coded on a block by block basis based on the luminance coding. Figure 28 shows some sample decompositions of color (chrominance) 8x8 blocks and descriptions of their resulting bit streams.
Still images are encoded in terms of 4x4 blocks. The coding is performed using a cache VQ technique where cache hits require fewer bits and cache misses require more bits to address the VQ codebook. Assuming that each 4x4 block is coded sequentially (there are a total of 1584 4x4 luminance blocks), Figure 29 shows an example of coding 4 luminance blocks. Here, Mn1 is the mean of the first block, Mn4 is the mean of the fourth block, VQ1 and VQ4 are the VQ addresses of the first and fourth block, respectively, and C2 and C3 are the variable length encoded cache locations. C1 and C4 are invalid cache positions that are used as miss flags. The VQ addresses have one added to them before they are put into the bit stream to prevent the all-zero VQ address.
After every four 4x4 luminance blocks have been encoded, the corresponding 4x4 chrominance blocks (first U and then V) are encoded using the same encoding method as the luminance component. Different caches are used for the color and luminance VQ information.
Since the coder is lossy, the image must be periodically re-transmitted to retain a high image quality. Sometimes this is accomplished by sending a reference image called a "still." Other times the image can be re-transmitted piece-by-piece as time and bit rate permit. This updating of portions of the encoded image can be done through various techniques including scalar absolute refreshing, scalar differential refreshing, block absolute refreshing, and block differential refreshing.
Refresh is used to improve the perceptual quality of the encoded image. In the absence of transmission errors, the coded image is known to the transmitter along with the current input frame. The transmitter uses the latter to correct the former provided that the following two conditions are true: first, that the channel has extra bits available, and second, that there is time available. Whether or not there are bits available is measured by the buffer fullness; refreshing is preferably performed only when the output buffer is less than 1/3 full. Time is available if no new input frame is waiting to be processed. The two independent processors also perform their refreshing independently and it is possible on some frames that one of the processors will perform refreshing while the other does not.
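In pseudo code (mirroring the refresh-selection fragment later in this description), the gating decision might look like the following; buffer_fullness, BUFFER_SIZE and new_frame_waiting are illustrative names only:

/* Refresh only when both bits and time are available. */
if ( (buffer_fullness < BUFFER_SIZE / 3) && !new_frame_waiting )
    perform_refresh();   /* each of the two processors decides this independently */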
There are two levels of refreshing - block based or VQ (4x4) and pixel based refreshing. The refresh header information includes the fields shown in Figure 30. Each of these levels is further divided into absolute and differential refresh mode with the mode set by a bit in the header.
Referring to Figure 30, the refresh header information includes the fields of the macro row header. Note that the extended macro row zero information does not apply to the refresher information, i.e., refresh macro rows do not contain extended information about the temporal reference and frame rate. The refresh address field specifies the starting 4x4 block that the refresher is encoding within the specified macro row. For QCIF images, there are 176 4x4 blocks in each macro row. The bit stream interface adds 31 to the refresh block address to prevent the generation of a picture start code. The decoder bit stream interface subtracts this offset. The "number of 4x4 blocks" field enumerates the number of 4x4 blocks of the image that are refreshed within this macro row packet. This field removes the need for a refresh end code.
As discussed, the system uses both VQ (block) and scalar refreshing. The absolute refresher operates on 8x8 blocks by encoding each of the four luminance 4x4 blocks and single U and V 4x4 blocks independently. This means that the refresh information will always include an integer multiple of four luminance 4x4 blocks and one U and one V 4x4 block.
This section describes the VQ refreshing that occurs on each 4x4 luminance block and its corresponding 2x2 chrominance blocks through the use of 2, 16, or 24 dimensional VQ codebooks.
For VQ refreshing, each 4x4 luminance block has its mean and standard deviation calculated. The mean is uniformly quantized to 6-bits with a range of 0 to 255 and the quantized 6-bit mean is put into the bit stream. If the standard deviation of the block is less than a threshold, a flag bit is set and the next 4x4 luminance block is encoded. If the standard deviation is greater than the threshold, the mean is removed from the block and the 255 entry VQ is searched. The 8-bit address of the VQ plus one is placed into the bit stream (zero is not allowed as a valid VQ entry in a preferred embodiment - the decoder subtracts one before using the address). The mean and standard deviation of the associated 4x4 U and V blocks are computed.
If the maximum of the U and V standard deviations is less than a threshold, then the U and V means are used to search a two-dimensional mean codebook that has 255 entries. The resulting 8-bit VQ address plus one is put into the bit stream. If the maximum of the U and V standard deviations is greater than the threshold, a 16-dimensional 255-entry VQ is searched for both the U and the V components. The two resulting 8 bit VQ addresses plus one are put into the bit stream.
VQ refreshing may be absolute or differential. For absolute VQ refreshing (19 bits per 4x4 block), each 4x4 luminance block has its mean calculated and quantized according to the current minimum and maximum mean for the macro row, and the 5-bit quantized value is put into the bit stream. The mean is subtracted from the 4x4 block and the result is used as the input to the VQ search (codebook has 255 entries). The resulting 8-bit VQ address is placed into the bit stream. The means of the 2x2 chrominance blocks are used as a 2-dimensional input to a 63 entry VQ. The resulting 6-bit address is placed into the bit stream. The result is 19 bits per 4x4 block.
For differential VQ refreshing (8 bits per 4x4 block), each 4x4 luminance block and the associated 2x2 chrominance blocks are subtracted from their respective values in the previously coded frame. The difference is encoded as an 8-bit VQ address from a 24- dimensional codebook that has 255 entries. The result is 8 bits per 4x4 block. Scalar refreshing is a mode that is reserved for when the image has had VQ refreshing performed on a large number of consecutive frames (for example, about 20 frames). This mode has been designed for when the camera is pointing away from any motion. The result is that the coded image quality can be raised to near original quality levels.
As with block refreshing, scalar refreshing may be either absolute or differential. For absolute scalar refreshing (168 bits per 4x4 block), the 4x4 luminance and 2x2 chrominance blocks are encoded pixel by pixel. These 24 pixels are linearly encoded with 7-bit accuracy resulting in 168 bits per 4x4 block.
For differential scalar refreshing (120 bits per 4x4 block), the 4x4 luminance and 2x2 chrominance blocks are encoded pixel by pixel. First the difference between the current block and the previously coded block is formed and then the result is coded with 5 bits for each pixel (1 for the sign and 4 for the magnitude). The resulting 4x4 block is encoded with 120 bits.
The system selects one of the available methods of refreshing. The mode selected is based upon how many bits are available and how many times an image has been refreshed in a given mode. In terms of best image quality, one would always use absolute scalar refreshing, but this uses the most bits. The second choice is differential scalar refreshing, followed by absolute VQ, and finally differential VQ. In pseudo code, the decision to refresh using scalar or VQ looks like this:
/* determine either scalar or vq refresh based on the cumulative refresh bits */
if( cumulative_refresh_bits > TOO_MANY_VQ_REFRESH_BITS )
    SET_SCALAR_REFRESH = TRUE;
else
    SET_SCALAR_REFRESH = FALSE;
where cumulative_refresh_bits is a running tally of how many bits have been used to encode a given frame, TOO_MANY_VQ_REFRESH_BITS is a constant, also referred to as a refresh threshold, and SET_SCALAR_REFRESH is a variable which is set to let other software select between various refresh sub-routines.
The system allows different methods and the use of all, or some combination of these methods, to perform the refreshing based on bit rate and image quality factors. Figure 31 depicts the overhead associated with the bit stream structure.
Picture Header Description
With respect to the picture header, in the default configuration of the encoder/decoder (compiled with PSC=4), the picture header is not part of the bit stream. This description of the picture header is only useful for historical purposes and for the cases where the compile switch PSC is set to values 1 or 3.
The picture header precedes the encoded data that corresponds to each frame of image data. The format of the header is a simplified version of the H.261 standard, which can require up to 1000 bits of overhead per frame for the header information. A preferred picture header format only requires 23 bits (when there is no extended picture information) for each video and still image frame and includes the fields shown in Figure 32 and described below.
In an attempt to save bits, the code works whether or not there is a start code as part of the bit stream.
Referring to Figure 32, the picture header includes a picture start code (PSC) (16 bits), a picture type (PTYPE) (1 bit) and a temporal reference (TR) (6 bits). With respect to the picture start code, a word of 16 bits with a value of 0000 0000 0000 0111 is a unique bit pattern that indicates the beginning of each encoded image frame. Assuming there have been no transmission bit errors or dropped macro rows, the decoder should have finished decoding the previous frame when this bit pattern occurs. The picture type bit specifies the type of the encoded picture data, either as still or video data as follows:
PTYPE='0': moving picture (encoded by the video coder)
PTYPE='1': still picture (encoded by the still image coder)
With respect to the temporal reference, a six bit word ranging from 000010 to 111110 indicates the absolute frame number plus 2 (modulo 60) of the current image based on the information from the video front end 104. The bit stream interface adds two to this number to generate 6-bit numbers between 2 and 61 (inclusive) to avoid generating a bit field that looks like the picture start code. The decoder bit stream interface subtracts two from this number. This field is used to determine if the encoders have missed real-time and how many frames were dropped.
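For illustration, the offset handling described above can be written as the following pair of helpers (hypothetical function names):

/* Temporal reference as carried in the bit stream: frame number (modulo 60)
   plus 2, giving values 2..61 and never resembling the picture start code. */
unsigned encode_temporal_reference(unsigned absolute_frame_number)
{
    return (absolute_frame_number % 60) + 2;
}

unsigned decode_temporal_reference(unsigned tr_field)
{
    return tr_field - 2;   /* decoder side removes the offset */
}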
Error Concealment
There are three main decoding operations: decoding still image data, decoding video data, and decoding refresh data. This section discusses how otherwise undetected bit errors can be detected, and the resulting error concealment strategies.
Several bit error conditions may occur in both the video and still image decoding process. For an invalid macro row address error, the system ignores the macro row packet and goes to the next macro row packet. For a duplicated macro row address error, a bit error is either in the macro row address or the next frame. The system uses the RTR (relative temporal reference) to determine if the duplicated macro row data is for the next frame or the current frame. If it is for the next frame, the system finishes the error concealment for this frame, sends the image to the YUV to RGB converter and begins decoding the next frame. If it is for the current frame, then the system assumes there is a bit error in the address and operates the same as if an invalid macro row address were specified. For an error of extra bits left over after the macro row, the system overwrites the macro row with the PCI macro row data as if the macro row had been dropped and never decoded. The system then continues with the next macro row of encoded bits. For an error of too few bits for decoding a macro row, the system overwrites the macro row with the PCI macro row data as if the macro row had been dropped and never decoded. The system then continues with the next macro row of encoded bits. Another possible error is an invalid VQ address (8 zeros). In this case, the system copies the mean from a neighboring 4x4 block.
Another error is video data mixed with still data. In this case, the system checks the video data RTR. If it is the same as the still RTR, then there is an error condition. If the RTR indicates the next frame, then the system performs concealment on the current frame, sends the image to the YUV to RGB converter and continues with decoding on the next frame. Another error is refresh data mixed with still data. This is most likely an undetected bit error in the DT field because the system does not refresh after still image data. In this case, the macro row should be ignored. There are also a number of bit errors that are unique to the video image decoder. One type of error is an invalid motion vector (Y,U,V). There are two types of invalid motion vectors: 1. an out of range motion value magnitude (should be between 0 and 224); and 2. a motion vector that requires PCI data that is out of the image plane. In this situation, the system assumes a default motion vector and continues decoding. Another video image decoder error is refresh data mixed with video data. In this case, the system does the required error concealment and then decodes the refresh data.
With respect to refresh data error conditions, one error is an invalid refresh address. In this case, the system ignores the macro row data and skips to the next macro row packet.
Packetizing of the Encoded Data
It is desirable to transmit the encoded data as soon as possible. The preferred embodiment allows the encoding of images with small packets of data which can be transmitted as soon as they are generated to maximize system speed, independent of synchronization with the other devices.
For example, by macro-row addressing the image data, image areas can be encoded in arbitrary order. That is, no synchronization is required among processors during the encoding of their respective regions because the decoder can interpret the macro rows and decode in arbitrary order. This modular bit stream supports the merging of encoded macro-row packets from an arbitrary number of processors. The macro row packets are not the same size. The packets sent over the modem, which consist of 1 or more macro rows, are not the same size either.
In addition, it is often desirable to use a single transmission channel to transmit various data types (voice, video, data, etc.) simultaneously. The preferred embodiment uses variable length packets of video data which may be interleaved at the data channel with other types of data packets.
Software Interface Between Video Encoder/Decoder and Host
This section describes the software protocol between the host processor (PC) and the video encoder/decoder 103. There are two phases of communication between the PC and the video encoder/decoder 103. The first phase is concerned with downloading and starting the video encoder/decoder microcode. This phase is entered after the video encoder/decoder 103 is brought out of reset. This is referred to as the initialization phase. After the initialization phase is complete, the communication automatically transitions to the second phase, known as the run-time phase. Communication remains in the run-time phase until the video encoder/decoder 103 is reset. After the video encoder/decoder 103 is taken out of reset it is back in the initialization phase.
The video encoder/decoder 103 is placed in the initialization phase by resetting the video encoder/decoder 103. This is accomplished by the following procedure:
1. Set bit 0 in the host control register (Offset 0x8) to one.
2. Read the 16 bit exchange register (Offset 0x0) to make sure it is empty.
3. Read the 8 bit master debug register (Offset 0x404) to make sure it is empty.
4. Set bit 0 in the host control register (Offset 0x8) to zero.
5. Wait (perhaps as much as a few milliseconds) for the video encoder/decoder 103 to set the MD THBF (bit 0) in the debug port status register to one.
6. Send the microcode file 16 bits at a time to the video encoder/decoder 103 through the 16 bit exchange register (Offset 0x0). The data transfer must be coordinated with the FHBE (bit 1) in the status and control register (Offset 0xC) (i.e., data can only be placed in the 16 bit exchange register when the FHBE flag is one).
After the last 16 bit word of the microcode is transferred to the video encoder/decoder 103, the video encoder/decoder 103 will automatically enter the run-time phase.
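A host-side sketch of this download procedure, assuming simple memory mapped register access helpers (write_reg() and read_reg() are hypothetical, and error handling and full timeout logic are omitted):

/* Reset the video encoder/decoder and download the microcode 16 bits at a time,
   pacing each write on the FHBE (From Host Buffer Empty) flag. */
#define EXCHANGE_REG   0x0     /* 16 bit exchange register */
#define HOST_CTRL_REG  0x8     /* host control register */
#define STATUS_REG     0xC     /* status and control register, bit 1 = FHBE */
#define DEBUG_REG      0x404   /* 8 bit master debug register */

void download_microcode(const unsigned short *code, int nwords)
{
    write_reg(HOST_CTRL_REG, read_reg(HOST_CTRL_REG) | 0x1);    /* step 1: assert reset */
    (void)read_reg(EXCHANGE_REG);                               /* step 2: drain exchange register */
    (void)read_reg(DEBUG_REG);                                  /* step 3: drain debug register */
    write_reg(HOST_CTRL_REG, read_reg(HOST_CTRL_REG) & ~0x1);   /* step 4: release reset */
    /* step 5 (waiting for MD THBF in the debug port status register) omitted here */
    for (int i = 0; i < nwords; i++) {                          /* step 6 */
        while ((read_reg(STATUS_REG) & 0x2) == 0)               /* wait for FHBE */
            ;
        write_reg(EXCHANGE_REG, code[i]);
    }
    /* the run-time phase begins automatically after the last word */
}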
The run-time phase is signaled by the video encoder/decoder 103 when the video encoder/decoder 103 sends a status message to the host. All run-time data transmission between the Host and video encoder/decoder 103 is performed either through the 16 bit exchange register or RGB FIFO. All data sent from the host to the video encoder/decoder 103 is sent through the 16 bit exchange register.
Data transmitted from the video encoder/decoder 103 to the host can be broken down into two categories. One category is RGB image data and the other contains everything else. The RGB FIFO is used exclusively to convert YUV data to RGB data and transmit that data in RGB format to the host. Since the RGB FIFO has no interrupt capability on the host side, coordination of the RGB data is accomplished by sending a message through the 16 bit exchange register. The host does not need to acknowledge reception of the RGB data since the FIFO half full flag is connected to one of the video encoder/decoder's master processor 102 interrupts.
All data transmission, other than RGB FIFO data, will take place in the context of a packet protocol through the 16 bit exchange register. The packets will contain a 16 bit header word and optionally some amount of additional data. The packet header contains two fields, a packet type (4 bits) and the size in bits (12 bits) of the data for the packet (excluding the header), as shown in Figure 33.
If the packet only contains the header, the data size will be zero. Data will always be sent in increments of 16 bits even if the header indicates otherwise. The number of optional 16 bit words is calculated from the optional data size field of the packet header as: number of optional 16 bit words = int((optional_data_bits+15)/16)
It should be noted that the only case where the number of bits will not necessarily be a multiple of 16 is in the packets containing encoded bit stream data. In this case, if the header indicates the optional data (bit stream data) is not an increment of 16 bits, the last bits are transferred in the following format. If there are 8 or fewer bits, the bits are left justified in the lower byte of the 16 bit word. If there are more than 8 bits, the first 8 bits are contained in the lower byte of the 16 bit word and the remaining bits are left justified in the upper byte of the 16 bit word. All unused bits in the 16 bit word are set to zero. This provides the host processor with a byte sequential data stream.
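A sketch of this trailing-bit packing rule, under the assumption that the residual bits arrive most significant bit first in the bits argument and that nbits is between 1 and 15:

/* Pack the last (non multiple of 16) bits of a packet into one 16 bit word:
   up to 8 bits are left justified in the lower byte; bits 9-15 spill into the
   upper byte, also left justified.  Unused bits are zero. */
unsigned short pack_trailing_bits(unsigned int bits, int nbits)
{
    unsigned short word;
    if (nbits <= 8) {
        word = (unsigned short)((bits << (8 - nbits)) & 0x00FF);
    } else {
        unsigned int low   = bits >> (nbits - 8);                 /* first 8 bits   */
        unsigned int high  = bits & ((1u << (nbits - 8)) - 1);    /* remaining bits */
        unsigned int upper = (high << (8 - (nbits - 8))) & 0xFF;  /* left justify   */
        word = (unsigned short)((upper << 8) | (low & 0xFF));
    }
    return word;
}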
Figure 34 shows encoded data in C32 memory (Big Endian Format), Figure 35 shows encoded data in PC memory for an original bit stream in the system, and Figure 36 shows encoded data in PC memory for a video encoder/decoder system.
With respect to timing of packet transfers, from the video encoder/decoder 103's point of view, packet transfers occur in one or two phases depending on the packet. If the packet contains only the header, there is only one phase for the transfer. If the packet contains a header and data, there are two phases to the transfer. In both cases, the header is transferred first under CPU interrupt control (Phase 1). If there is data associated with the packet, the data is transferred under DMA control (Phase 2). Once the DMA is finished, CPU interrupts again control the transfer of the next header (Phase 1). The video encoder/decoder 103 transfers the data portion of packets under DMA control in one direction at a time. For example, this means that if the video encoder/decoder 103 is currently transferring the data portion of a packet A to the host, the host will not be able to send the data portion of packet B to the video encoder/decoder 103 until the transfer of packet A from the video encoder/decoder 103 is done. This is not the case for transfers of packet headers. The host must be careful not to enter a deadlock situation in which the host is trying to exclusively read or write data from or to the video encoder/decoder 103 while the video encoder/decoder 103 is trying to DMA data in the other direction.
The rules for transferring data are as follows. There are three cases. The first case is when the host is trying to send or receive a packet with no data. In this case there is no conflict. The second case is when the host is trying to send a packet (call it packet A) containing data to the video encoder/decoder 103. In this case, the host sends the header for packet A, then waits to send the data. While the host waits to send the data portion of packet A, it must check to see if the video encoder/decoder 103 is trying to send a packet to the host. If the host finds that the video encoder/decoder 103 is trying to send a packet to the host, the host must attempt to take that packet. The third case is when the host is receiving a packet (call it packet B) from the video encoder/decoder 103. The host first receives the header for packet B, then waits for the data. If the host was previously caught in between sending a header and the data of a packet (call it packet C) going to the video encoder/decoder 103, it must check to see if the video encoder/decoder 103 is ready to receive the data portion of packet C. If so the video encoder/decoder 103 must transfer the data portion of packet C before proceeding with packet B.
This implies that the host interrupt routine will have three states for both receiving and sending packets (six states total). State one is idle, meaning no packet is currently being transferred. State two is when the header has been transferred and the host is waiting for the first transfer of the data to occur. State three is when the transfer of data is occurring. When the interrupt routine is in state two for the receiver or the transmitter, it must be capable of switching between the receiver and transmitter portions of the interrupt routine based on transmit or receive ready (To Host Buffer Full and From Host Buffer Empty). Once in state three, the host can exclusively transfer data with only a video encoder/decoder 103 timeout loop.
Figure 37 is a flow diagram of a host interrupt routine that meets the above criteria. For clarity, it does not contain a timeout for the case where the video encoder/decoder 103 has died. Timeouts shown in the flow diagram should be set to a value that is reasonable based on video encoder/decoder 103 interrupt latency and on the total interrupt overhead of getting into and out of the interrupt routine (this includes system overhead).
As shown in Figure 37, the video encoder/decoder 103 enters the host interrupt routine at step 112 and proceeds to step 114, where the routine checks for a transmit ready state (To Host Buffer Full). If the system is not in the transmit ready state, the routine proceeds to step 116, where the system checks for a receive ready state (From Host Buffer Empty) and a packet to transfer. If both are not present, the routine proceeds to check for time out at step 118 and returns from the host interrupt routine, at step 120, if time is out. If time is not out at step 118, the routine returns to step 114. If the system is in the transmit ready state at step 114, the host interrupt routine proceeds to step 122, where the routine polls the receive state. If the receive state is "0" at step 122, then the routine reads a packet header and sets the receive state to "1" at step 124. The routine then proceeds to look for packet data at step 126 by again checking for a transmit ready state (To Host Buffer Full). If the buffer is full, the routine reads the packet data and resets the receive state to "0" at step 130, and then the routine proceeds to step 116. On the other hand, if the buffer is not full, the routine proceeds to step 128, where it checks for time out. If there is no time out, the routine returns to step 126, whereas if there is a time out at step 128 the routine proceeds to step 116. If the receive state is not "0" at step 122, then the routine proceeds directly to step 130, reads data, sets the receive state to "0" and then proceeds to step 116.
Referring again to Figure 37, if the system is in the receive ready state and there is a packet to transfer at step 116, then the system proceeds to step 131, where the routine polls the transmit state. If the transmit state is "0" at step 131, the routine proceeds to step 132, where the routine writes the packet header and sets the transmit state to "1." The routine again checks whether the From Host Buffer is empty at step 133. If it is not, the routine checks for time out at step 134. If time is out at step 134, the routine returns to step 116, otherwise the routine returns to step 133.
When the From Host Buffer is empty at step 133, the routine proceeds to step 136, where the routine writes the first data and sets the transmit state to "2." The routine then proceeds to step 137, where it checks for more data. If more data is present, the routine proceeds to step 139 to again check whether the From Host Buffer is empty. On the other hand, if no more data is present at step 137, the routine sets the transmit state to "0" at step 138 and then returns to step 116. If the From Host Buffer is empty at step 139, the routine proceeds to step 141, where it writes the rest of the data and resets the transmit state to "0," and the routine then returns to step 116. On the other hand, if the From Host Buffer is not empty at step 139, the routine checks for time out at step 140 and returns to step 139 if time is not out. If time is out at step 140, the routine returns to step 116. At step 131, if the transmit state is not "0," then the routine proceeds to step 135, where the routine again polls the transmit state. If the transmit state is "1" at step 135, then the routine proceeds to write the first data and set the transmit state to "2," at step 136. If the transmit state is not "1" at step 135, then the routine proceeds to write the rest of the data and reset the transmit state to "0" at step 141. As noted above, the routine then returns to step 116 from step 141.
A control packet is depicted in Figure 38. The control packet always contains at least 16 bits of data. The data is considered to be composed of two parts: a control type (8 bits) and required control parameter (8 bits), and optional control parameters (always a multiple of 16 bits). The control type and required parameter always follow the control packet header word. Any optional control parameters then follow the control type and required parameter.
A status request packet is depicted in Figure 39. The status request packet is used to request that the video encoder/decoder 103 send a status packet. The status packet contains the current state of the video encoder/decoder 103. An encoded bit stream request packet is depicted in Figure 40. The encoded bit stream request packet requests a specified number of bits of encoded bit stream data from the local encoder. The video encoder/decoder 103 will respond by sending back N macro rows of encoded bit stream data (through the packet protocol). N is equal to the minimum number of macro rows that would contain a number of bits equal to or greater than the requested number of bits. Only one request may be outstanding at a given time. The video encoder/decoder 103 will send each macro row as a separate packet. Macro rows are sent in response to a request as they are available. The requested size must be less than 4096 bits.
Decoder bits packets are shown in Figures 41 and 42. Encoded data is sent in blocks. There are two types of packets used to send these blocks. The packets about to be described are used to transmit encoded bit stream data to the local decoder. One type of packet is used to send less than 4096 bits of encoded data or to end the transmission of more than 4095 bits of encoded data. The other is used to start or continue the transmission of more than 4095 bits of encoded data. The video encoder/decoder 103 will respond by sending back RGB data of the decoded bit stream after sufficient encoded bit stream bits have been sent to and decoded by the video encoder/decoder 103.
A decoder bits end packet is depicted in Figure 41 and a decoder bits start continue packet is depicted in Figure 42. If less than a multiple of 16 bits is contained in the packet, the last 16 bit word contains the bits left justified in the lower byte first, then left justified in the upper byte. Left over bits in the last 16 bit word are discarded by the decoder.
As the decoder decodes macro rows from the packet it checks to see if there are less than eight bits left in the packet for a given macro row. If there are less than eight and they are zero, the decoder throws out the bits and starts decoding a macro row from the next packet. If the "left over bits" are not zero, the decoder signals a level 1 error. If a block contains less than 4096 bits it will be transmitted with a single decoder bits end packet. If the block contains more than 4095 bits, it will be sent with multiple packets using the start/continue packet and terminated with the end packet. The start/continue packet must contain a multiple of 32 bits. One way to assure this is to send the bits in a loop: while more than 4095 bits are left to send, send a multiple of 32 bits with the start/continue packet and loop; the remaining bits are then sent with the end packet.
Referring to Figure 43, the YUV data for encoder packet is used to send YUV data to the encoder. It is needed for testing the encoder. It would also be useful for encoding YUV data off-line. For the purposes of the image transfer as well as the encoding process, a YUV frame contains 144 lines of 176 pixels/line of Y data and 72 lines of 88 pixels/line of U and V data. Because the data size field in the packet header is only 12 bits, it will take multiple packets to transfer a single YUV image. Y data must be sent 2 lines at a time. U&V data are sent together one line at a time (one line of U followed by one line of V). All of the Y plane can be sent first (or last) or the Y and UV data may be interspersed (i.e., two lines of Y, one line of U&V, two lines of Y, one line of U&V, etc.). Referring to Figure 43, for the data type:
0 = Start New Frame
1 = Continue Transferring Frame, data is 2 lines of Y data
2 = Continue Transferring Frame, data is one line of U data followed by one line of V data.
3 = End frame
A drop current RGB frame packet is depicted in Figure 44. The drop current RGB frame packet is used to tell the video encoder/decoder 103 to drop the current RGB frame being transmitted to the host. The video encoder/decoder 103 may have already put some of the frame into the FIFO and it is the responsibility of the host to extract and throw away FIFO data for any FIFO ready packets received until the next top of frame packet is received.
A status packet is depicted in Figure 45. The status packet contains the current status of the video encoder/decoder system. This packet is sent after the microcode is downloaded and started. It is also sent in response to the status request packet. The packet is formatted so that it can be used directly with the super control packet defined below in the control types and parameters section. The microcode revision is divided into two fields: major revision and minor revision. The major revision is located in the upper nibble and the minor revision is located in the lower nibble. A decoder error packet is depicted in Figure 46. The decoder error packet tells the host that the decoder detected an error in the bit stream.
The "error level" shown in Figure 46 is as follows: 1. Decoder bit stream error, decoder will continue to decode the image. 2. Decoder bit stream error, decoder has stopped decoding the input bit stream. The decoder will search the incoming bit stream for a still image picture start code.
3. Column mismatch of the YUV data sent from the host.
4. Too many Y pixels sent from host for current YUV frame.
5. Too many UV pixels sent from host for current YUV frame. 6. Not enough YUV data sent from host for last YUV frame.
A decoder acknowledge packet is depicted in Figure 47. The decoder acknowledge packet is only sent when the video encoder/decoder 103 is in store/forward mode. It would be assumed that in this mode the bit stream to be decoded would be incoming from disk. Flow control would be necessary to maintain the specified frame rate and to not overrun the video encoder/decoder 103 computational capabilities. The acknowledge packet would signal the host that it could send another packet of encoded bit stream data.
The encoded bits from encoder packets are used to send a macro row of encoded bits from the local encoder to the host. These packets are sent in response to the host sending the encoded bit stream request packet or placing the encoder into record mode. Referring to Figures 48 and 49, there are two types of packets used to send these macro rows. One type of packet is used to send less than 4096 bits of encoded macro row data or to end the transmission of more than 4095 bits of encoded macro row data. The other is used to start or continue the transmission of more than 4095 bits of encoded macro row data. Figure 48 depicts the encoded bits from encoder end packet. Figure 49 depicts the encoded bits from encoder start/continue packet. If fewer than a multiple of 16 bits are contained in the packet, the last 16 bit word contains the bits left justified in the lower byte first, then left justified in the upper byte. Unused bits in the last 16 bit word are set to zero.
If a macro row contains less than 4096 bits it will be transmitted with a single encoded bits from encoder end packet. If the macro row contains more than 4095 bits, it will be sent with multiple packets using the start/continue packet and terminated with the end packet. The start/continue packets will contain a multiple of 32 bits.
The encoded bits frame stamp packet, as depicted in Figure 50, is used to send the video frame reference number from the local encoder to the host. This packet is sent at the beginning of a new frame in response to the host sending the encoded bit stream request packet. The frame reference is sent modulo 30. The frame type is one if this is a still image or zero if a video image.
A top of RGB image packet, depicted in Figure 51, is sent to indicate that a new frame of video data will be transmitted through the RGB FIFO. The parameter indicates what type of image will be contained in the FIFO. This packet does not indicate that there is any data in the FIFO.
Frame Types:
0x01 = Self view large
0x02 = Self view small
0x04 = Decoded image large
There are two image sizes. The large size is 176 columns x 144 rows (2 lines sent for every FIFO ready packet). The small size is 44 columns x 36 rows (6 lines sent for every FIFO ready packet). The frame number is the frame number retrieved from the decoded bit stream for the decoded image and the frame number from the video acquisition system for the selfview image. The number is actually the temporal reference modulo 30.
A FIFO ready packet, as depicted in Figure 52, is sent when there is a block of RGB data ready in the FIFO. The block size is set by the type of video frame that is being transferred (see RGB top of frame packet).
A YUV acknowledge packet, as depicted in Figure 53, is sent to tell the host that it is ok to send another YUV frame. Control types and parameters for the control packet (host to video encoder/decoder 103) are described below. Each control type has a default setting. The default settings are in place after the microcode has been downloaded to the video encoder/decoder 103 and until they are changed by a control packet.
A control encoding packet is depicted in Figure 54. This control is used to change the state of the encoding process (e.g. start, stop or restart). Re-start is used to force a still image. The default setting is encoding stopped and decoder in normal operation. Referring to Figure 54, for the encoder state parameter:
0 = Stop encoding
1 = Start encoding
2 = Re-start encoding
3 = Start encoding in record mode (bit rate enforced by video encoder/decoder 103)
If bit 3 of the encoder state parameter is set to one, the encoder buffer is cleared before the new state is entered.
A frame rate divisor packet is depicted in Figure 55. This control is used to set the frame rate of the video capture system. The default is 15 frames per second. Referring to Figure 55, for the frame rate divisor parameter:
2 = 15 frames per second
3 = 10 frames per second
4 = 7.5 frames per second
5 = 6 frames per second
6 = 5 frames per second
10 = 3 frames per second
15 = 2 frames per second; and
30 = 1 frame per second.
An encoded bit rate packet is depicted in Figure 56. This control is used to set the encoded bit stream rate for the local encoder. The default is 19.2 kbits per second (parameter value of 63). Referring to Figure 56, with respect to the bit rate parameter:
If bit 7 of the bit rate parameter is 0
    ActualBitRate = (BitRateParameter + 1) * 300 bits/sec
else
    ActualBitRate = ((BitRateParameter & 0x7f) + 1) * 1000 bits/sec
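Expressed as a small function (a sketch; the parameter is the single byte carried in the packet):

/* Convert the 8 bit rate parameter to bits per second.  Bit 7 selects the
   scale: steps of 300 bits/sec or steps of 1000 bits/sec. */
long actual_bit_rate(unsigned char param)
{
    if ((param & 0x80) == 0)
        return ((long)param + 1) * 300;           /* e.g. 63 -> 19200 bits/sec (the default) */
    return ((long)(param & 0x7f) + 1) * 1000;
}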
A post spatial filter packet is shown in Figure 57. This control is used to set the post spatial filter. The filter weight parameter specifies the effect of the filter. A parameter value of zero provides no filtering effect. A value of 255 places most of the emphasis of the filter on adjacent pixels. The default value for this filter is 77. The spatial filter is implemented as two 3 tap, one dimensional linear filters whose coefficients are calculated from p, where p is the value of the parameter divided by 256.
A pre spatial filter packet is shown in Figure 58. This control is used to set the pre spatial filter.
A temporal filter packet is shown in Figure 59. This control is used to set the temporal filter for the encoder. The filter weight parameter specifies the effect of the filter. A parameter value of zero provides no filtering effect. A value of 255 places most of the emphasis of the filter on the previous image. The temporal filter is implemented as:
y[n] = (1 - p) * x[n] + p * y[n-1]
where p is the value of the parameter divided by 256, x[n] is the current image and y[n-1] is the previously filtered image.
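A per-pixel sketch of this recursive filter, using integer arithmetic and hypothetical buffer names (weight is the 0..255 filter weight parameter):

/* One tap recursive temporal filter: blend the current image with the
   previously filtered image, weighted by p = weight/256. */
void temporal_filter(unsigned char *cur, const unsigned char *prev_filtered,
                     int npixels, int weight)
{
    for (int i = 0; i < npixels; i++)
        cur[i] = (unsigned char)(((256 - weight) * cur[i] +
                                   weight * prev_filtered[i]) >> 8);
}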
A still image quality packet is shown in Figure 60. This control is used to set the quality level of the still image coder. A quality parameter of 255 indicates the highest quality still image coder and a quality parameter of zero indicates the lowest quality still image coder. There may not be 256 different quality levels. The nearest implemented quality level is selected if there is no implementation for the specified level. The default is 128 (intermediate quality still image coder).
A video mode packet is depicted in Figure 61. This control is used to set video input & output modes. The large and small selfview may not be specified at the same time. It is possible that the PC's or the video encoder/decoder's bandwidth will not support both a large selfview and a large decoded view at the same time. Large views are 176 x 144 and small views are 44 x 36. The default setting is no views enabled. The reserved bits should be set to zero. Referring to Figure 61, for the video source:
0 = Video source is the host interface through YUV packets; and
1 = Video source is the video front end.
A video pan absolute packet is shown in Figure 62. This control is used to set the electronic pan to an absolute value. Because the video encoder/decoder system is not utilizing the complete video input frame, one can pan electronically (select the part of the video image to capture). Panning is specified in ½ pixel increments. The pan parameters are 8 bit signed numbers. The Y axis resolution is 1 pixel. The default pan is (0,0), which centers the captured image in the middle of the input video image. The valid range for the x parameter is about -100 to +127. The valid range for the y parameter is -96 to +96.
A brightness packet is shown in Figure 63. The brightness packet controls the video front end 104 brightness setting. The brightness value is a signed integer with a default value of zero.
A contrast packet is shown in Figure 64. The contrast packet controls the video front end 104 contrast setting. The contrast value is an unsigned integer with a default setting of 128. A saturation packet is shown in Figure 65. The saturation packet controls the video front end 104 saturation setting. The saturation value is an unsigned integer with a default value of 128.
A hue packet is shown in Figure 66. The hue packet controls the video front end 104 hue setting. The hue value is a signed integer with a default value of zero. A super control packet is shown in Figure 67. The super control packet allows all of the above defined controls to be set at once. See above for the various parameter definitions.
A control decoding packet is shown in Figure 68. This control is used to change the state of the decoding process (e.g. look for still image or normal decoding). The default setting for the decoder is normal decoding. For the decoder state parameter:
0 = Decoder not decoding
1 = Decoder normal operation (decoding, decoder does not acknowledge encoded bit packets)
2 = Decoder in playback mode (decoder acknowledges encoded bit packets); and
3 = Decoder in bit stream pass through mode (decoder gets bits from local encoder).
If bit 3 of the decoder state parameter is set to one, the decoder buffer is cleared before the new state is entered. The set motion tracking control is used to set the motion tracking state of the encoder.
This parameter is used to trade frame rate for better quality video frames. A value of zero codes high quality frames at a slower rate with longer delay. A value of 255 will track the motion best with the least delay, but will suffer poorer quality images when there is a lot of motion. The default setting is 255. A motion tracking packet is shown in Figure 69. The request control setting packet is used to request the value of the specified control setting. The video encoder/decoder 103 will respond by sending back a packet containing the requested value. The packet will be formatted so that it could be used directly to set the control value. A request control setting packet is shown in Figure 70. An example of a request control setting packet is shown in Figure 71. If the PC sent the request control setting packet of Figure 71, the video encoder/decoder 103 would respond by sending back a frame rate divisor packet formatted as shown in Figure 72.
A request special status information packet is shown in Figure 73. The request special status setting packet is used to request the value of the specified status setting. The video encoder/decoder 103 will respond by sending back a packet containing the requested value. This will be used mainly for debugging.
Requests for the following information types will result in the video encoder/decoder 103 sending back a packet as shown in Figure 74, which depicts buffer fullness (type 0x81). A request YUV frame (type 0x82) causes the video encoder/decoder 103 to send YUV frames of data until the request is made a second time (i.e., the request is a toggle). The video encoder/decoder 103 will send back YUV frames through the packet protocol. There are four types of packets that will be sent: a packet indicating the top of frame, a packet containing two lines of Y data, a packet containing one line of U data followed by one line of V data, and a packet indicating the end of the frame. A YUV top of frame is depicted in Figure 75 and a Y data frame is depicted in Figure 76. A UV data frame is depicted in Figure 77 and a YUV end of frame is depicted in Figure 78.
Mean Squared Error (MSE) Hardware
As previously noted, the preferred architecture can calculate multiple pixel distortion values simultaneously and in a pipelined fashion so that entire areas of the image may be calculated in very few clock cycles. Two accelerator algorithms may be used, the Mean Squared Error (MSE) and the Mean Absolute Error ("MAE"). In a preferred embodiment, the algorithm is optimized around the MSE calculations because the MSE results in the lowest PSNR. The amount of quantization or other noise is expressed as the ratio of peak-to-peak signal to RMSE expressed in decibels (PSNR).
Computing the MSE and MAE in digital signal processor (DSP) software takes about the same computation time. In a TMS320C31 digital signal processor manufactured by Texas Instruments, also referred to as a C31, the MSE takes 2 clock cycles and the MAE takes 3 cycles. A TMS320C51 digital signal processor running at twice the clock speed can calculate the MSE and MAE in 5 clocks. The MSE accelerator requires multipliers whereas the MAE can be implemented with only adders, rendering the hardware implementation of the MAE easier.
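For reference, the two distortion measures themselves are straightforward; a scalar (non-accelerated) sketch for an n-pixel block follows, with hypothetical array arguments:

/* Sum of squared errors (basis of the MSE) and sum of absolute errors (basis
   of the MAE) between a newly coded image block and a previously coded image
   block of n pixels. */
long block_squared_error(const unsigned char *nci, const unsigned char *pci, int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++) {
        int d = (int)nci[i] - (int)pci[i];
        acc += d * d;
    }
    return acc;
}

long block_absolute_error(const unsigned char *nci, const unsigned char *pci, int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++) {
        int d = (int)nci[i] - (int)pci[i];
        acc += (d < 0) ? -d : d;
    }
    return acc;
}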
The accelerator connects to the slave encoder memory bus and receives control from the slave DSP as shown in Figure 79. Preferably, the DSP fetches 32 bit words, 4 pixels, from memory and passes them to the accelerator. There are four accelerator modes. The first mode, MSE/MAE calculations, is the most used mode. This mode may be with or without motion. The second mode, scene change MSE/MAE calculation, is done once per field. This mode computes the MSE/MAE on every 8x8 block and stores the result in internal DSP RAM, compares the total error of a new field with the previous field's total error, and then software determines if still frame encoding is necessary. The third mode, standard deviation calculations, calculates the standard deviation for each block to be encoded to measure block activity for setting the threshold level. The fourth mode is a mean calculation. This is rarely computed, only when the algorithm is in the lowest part of the hierarchy.
Referring to Figure 79, a memory read takes one DSP clock, and therefore transfers four pixels in one clock. An MSE/MAE accelerator 142 requires four NCI (newly coded image) pixels from a NCI memory 144 and four PCI (previously coded image) pixels from a PCI memory 146 for processing by a DSP 148. In a preferred embodiment, the DSP 148 is a Texas Instruments TMS320C31 floating point DSP. A fixed point DSP or other digital signal processor may also be used. The DSP 148 reads the result from the accelerator 142 along the data bus 150, which takes one clock. For a 4x4 block without motion, it thus takes nine clocks plus any hardware pipeline delays: four clocks for NCI data, plus four clocks for PCI data, plus one clock to read the result, 4 + 4 + 1 = 9. Figure 79 also includes a new image memory 152. If a motion search is necessary, then a second set of four PCI pixels must be read from memory. The accelerator stores the first set of PCI data and uses part of the second PCI data. The DSP instructs the accelerator how to process (MUX) the appropriate pixels. For a 4x4 block with motion it takes 13 clocks plus pipeline delays, which is four clocks for NCI data, four clocks for PCI data, four clocks for the second set of PCI data, and one clock for reading the result, 4 + 4 + 4 + 1 = 13.
The 8x8 blocks are calculated in the same way. Without motion takes 33 clocks plus pipeline delay, and with motion it takes 49 clocks plus pipeline delay.
For the software to compute an MSE, first it reads a 16x16 block of NCI and PCI data into internal memory. Then it takes 2 clocks per pixel for either the MSE or MAE calculation, or a total of 128 clocks for an 8x8 block, or 32 clocks for a 4x4 block. For comparison purposes, only the actual DSP compute time of 128 or 32 clocks is compared.
The hardware is not burdened with the overhead associated with transferring the 16x16 blocks, therefore the actual hardware speed up improvement is greater than the value calculated. Other overhead operations not included in the calculations are: instruction cycles to enter a subroutine, instructions to exit a routine, time to communicate with the hardware, etc.
The overhead associated with these operations represents only a small percentage of the total and they tend to cancel each other when comparing hardware to software methods, so they are not included in the calculations. Figure 80 is a table that represents the formulas for calculating speed improvements. The expression represents a ratio between the total time taken by software divided by the time for implementing the calculations in hardware. "Pd" represents the hardware pipeline delay. Assuming a pipeline delay of 3 clocks, the speed improvement is found. With respect to the worst case scenario, a 4x4 condition, assuming that motion is present 3 out of 4 times, then the average speed improvement is (32/12*1 + 32/16*3)/4 = 2.167 times over the software only implementation. For the 8x8 case, assuming motion 3 out of 4 times, then the average improvement is (128/36*1 + 128/52*3)/4 = 2.74 times over the software only implementation. Overall, assuming the hierarchy stays in the 8x8 mode 3 out of 4 times, then the total speed improvement would be ((32/12*1 + 32/16*3)/4*1 + (128/36*1 + 128/52*3)/4*3)/4 = 2.593 times over the software only implementation.
Figure 81 shows the speed improvement for various pipeline delay lengths, assuming a motion search is required 3:1 and that the 8x8 block is required 3:1 compared to 4x4 in the overall case. With respect to programmable logic devices (PLDs), a preferred embodiment employs, for example, Altera 8000 Series, Actel ACT 2 FPGAs, or Xilinx XC4000 Series PLDs.
Figure 82 depicts a mean absolute error (MAE) accelerator implementation in accordance with the equation shown in the Figure. As shown at 162, the input is 4 newly coded image pixels (NCI), which is 32 bits, and, as shown at 164, 4 previously coded image pixels (PCI). As shown at 166, if a motion search is necessary, the system further requires an additional 4 PCI pixels. As seen at 168, the PCI pixels are multiplexed through an 8 to 4 bit multiplexer (MUX), which is controlled by a DSP (not shown). As shown at 170, a group of adders or summers adds the newly encoded pixels and the previously encoded pixels and generates 9 bit output words. As seen at 172, the absolute value of these words, generated by removing a sign bit from a 9 bit word resulting in an 8 bit word, is then input into a summer 174, the output of which is a 10 bit word input into a summer 176. The summer 176 then accumulates the appropriate number of bits depending on the operation mode. Figure 83 depicts a mean square error (MSE) accelerator implementation. The configuration of this implementation is similar to that shown in Figure 82, and includes a group of multipliers 202.
Figure 84 illustrates a mean absolute error (MAE) implementation for calculating the MAE of up to four pixels at a time with pixel interpolation for sub-pixel resolution, corresponding to the equation at the top of Figure 84. The implementation includes a 32 bit input. As seen at 202, four non-interpolated pixels are transmitted via a pixel pipe to a series of summers or adders 204. At 206, four previously encoded pixels are input into an 8 to 5 multiplexer (MUX) 208. The output of the MUX 208 is horizontally interpolated with a series of summers or adders 210 and the output 9 bit words are manipulated as shown at 212 by the DSP (not shown). The resulting output is transmitted to a 4 pixel pipe 214 and is vertically interpolated with a series of summers or adders, as seen at 216. As seen at 217, the absolute value of the output of the adders 204 is input to adders 218, the output of which is provided to adder 220. The adder 220 accumulates the appropriate number of bits depending on the mode of operation.
Figure 85 illustrates a mean square error (MSE) implementation for calculating the MSE of up to four pixels at a time with pixel interpolation for sub-pixel resolution, corresponding to the equation at the top of Figure 85. The implementation includes a 32 bit input. As seen at 222, four non-interpolated pixels are transmitted via a pixel pipe to a series of summers or adders 224. At 226, four previously encoded pixels are input into an 8 to 5 multiplexer (MUX) 228. The output of the MUX 228 is horizontally interpolated with a series of summers or adders 230 and the output 9 bit words are manipulated as shown at 232 under the control of the DSP (not shown). The resulting output is transmitted to a 4 pixel pipe 234 and is vertically interpolated with a series of summers or adders, as seen at 236. As seen at 237, the squared value of the output of the adders 224 is input to adders 238, the output of which is provided to adder 240. The adder 240 accumulates the appropriate number of bits depending on the mode of operation.
Video Teleconferencing
Video teleconferencing allows people to share voice, data, and video simultaneously. The video teleconferencing product is composed of three major functions: video compression, audio compression and high speed modem. The video compression function uses the color space conversion to transform the video from the native YUV color space to the host RGB display format.
A color space is a mathematical representation of a set of colors such as RGB and YUV. The red, green and blue (RGB) color space is widely used throughout computer graphics and imaging. The RGB signals are generated from cameras and are used to drive the guns of a picture tube. The YUV color space is the basic color space used by the NTSC (National Television Standards Committee) composite color video standard. The video intensity ("luminance") is represented as Y information while the color information ("chrominance") is represented by two orthogonal vectors, U and V.
Compression systems typically work with the YUV system because the data is compressed from three wide bandwidth (RGB) signals down to one wide bandwidth (Y) and two narrow bandwidth signals (UV). Using the YUV color space allows the compression algorithm to compress the UV data further, because the human eye is more sensitive to luminance changes and can tolerate greater error in the chrominance information.
The transformation from YUV to RGB is merely a linear remapping ofthe original signal in the YUV coordinate system to the RGB coordinate system. The following set of linear equations can be used to transform YUV data to the RGB coordinate system: R = Y + 1.366*V - 0.002*U
G = Y - 0.700*V - 0.334*U B = Y - 0.006*V = 1.732*U This type of transformation requires hardware intensive multipliers. Approximate coefficients can be substituted which simplifies the hardware design. Two-bit coefficients can be implemented in hardware with adders, subtractor and bit-shifts (no multipliers). The two- bit coeffecient method requires no multiplier arrays. A simplified set of equations follows:
R = Y + 1.500*V
G = Y - 0.750*V - 0.375*U
B = Y + 1.750*U
Coefficient errors due to this approximation range from 1.2% to 9%. The 9% error occurs in the U coefficient of the G equation. To implement the two-bit coefficients in hardware, the equations can be viewed so that each multiplication is a simple bit shift operation. Shift operations do not involve logic. It is important to remember that the hardware must hard limit the RGB data range. Therefore, an overflow/underflow detector must be built into the hardware that implements this conversion.
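The following C fragment is a minimal sketch of the simplified equations above, in which each two-bit coefficient is realized with shifts and adds and the final clamp plays the role of the overflow/underflow detector. The function names are illustrative, U and V are assumed to be re-centred by subtracting 128 (as in the preferred embodiment described below), and arithmetic right shifts are assumed for negative values.

    /* Approximate YUV to RGB conversion using the two-bit coefficients:
       1.500 = 1 + 1/2,  0.750 = 1/2 + 1/4,  0.375 = 1/4 + 1/8,  1.750 = 2 - 1/4. */
    static int clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

    static void yuv_to_rgb_approx(int y, int u, int v, int *r, int *g, int *b)
    {
        u -= 128;                                   /* re-centre chrominance */
        v -= 128;
        *r = clamp255(y + v + (v >> 1));            /* R = Y + 1.500*V       */
        *g = clamp255(y - (v >> 1) - (v >> 2)       /* G = Y - 0.750*V       */
                        - (u >> 2) - (u >> 3));     /*       - 0.375*U       */
        *b = clamp255(y + (u << 1) - (u >> 2));     /* B = Y + 1.750*U       */
    }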
The YUV to RGB matrix equations use two-bit coefficients which can be implemented with bit-shifts and adds. High speed adder structures, carry save and carry select, are used to decrease propagation delay as compared to a standard ripple carry architecture. The two types of high speed adders are described below. Carry save adders reduce three input addends down to a sum and carry for each bit. Let A = An-1 ... A1A0, B = Bn-1 ... B1B0, and D = Dn-1 ... D1D0 be the addend inputs to an n-bit carry save adder. Let Ci be the carry output of the ith bit position and Si be the sum output of the ith position. Ci and Si are defined as shown in Figure 86. A sum and carry term is generated for every bit. To complete the result, the sum and carry terms must be added in another stage such as a ripple adder, carry select adder or any other full adder structure. Also note that the carry bus must be left shifted before completing the result. For instance, when the LSB bits are added, sum S0 and carry C0 are produced. When completing the result, carry C0 is added to sum S1 in the next stage, thus acting like a left bus shift.
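A bit-level C model of a carry save adder stage is sketched below using the standard full-adder equations (the exact definitions of Ci and Si appear in Figure 86, which is not reproduced here); the function names are illustrative. Note the left shift of the carry bus before the completing add, as discussed above.

    /* Reduce three n-bit addends A, B, D to a sum word and a carry word. */
    static void carry_save_add(unsigned a, unsigned b, unsigned d,
                               unsigned *sum, unsigned *carry)
    {
        *sum   = a ^ b ^ d;                    /* Si = Ai xor Bi xor Di     */
        *carry = (a & b) | (a & d) | (b & d);  /* Ci = majority(Ai, Bi, Di) */
    }

    /* Complete the result with any full adder structure; the carry bus is
       shifted left by one bit position first. */
    static unsigned csa_sum(unsigned a, unsigned b, unsigned d)
    {
        unsigned s, c;
        carry_save_add(a, b, d, &s, &c);
        return s + (c << 1);
    }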
Figure 87 is a block diagram of a color space converter that converts a YUV signal into RGB format. The R component is formed by coupling a 1V, a 0.5V and a 1Y signal line to the inputs of a carry save adder 242, as shown in Figure 87. The carry and sum outputs from the carry save adder 242 are then coupled to the inputs of a full adder 243. The sum output of the full adder 243 is then coupled to an overflow/underflow detector 244, which hard limits the R data range.
The B component of the RGB formatted signal is formed by coupling a 2U, a -0.25U and a 1Y signal line to the inputs of a carry save adder 245. The carry and sum outputs of the carry save adder 245 are coupled to the inputs of a full adder 246, whose sum output is coupled to an overflow/underflow detector 247. The overflow/underflow detector 247 hard limits the B data range.
The G component of the RGB formatted signal is formed by coupling a -0.25U, a -0.125U and a -0.5V signal line to the inputs of a carry save adder 248. The carry and sum outputs of the carry save adder 248 are coupled, along with a -0.25V signal line, to the inputs of a second carry save adder 249. The carry and sum outputs of the carry save adder 249 are coupled, along with a 1Y signal line, to the inputs of a third carry save adder 250. The carry and sum outputs of the third carry save adder 250 are coupled to the inputs of a full adder 251, as shown in Figure 87. The sum output of the full adder 251 is coupled to an overflow/underflow detector 252, which hard limits the data range of the G component.
In a preferred embodiment, the Y signal has eight data bits and the U and V signals have seven data bits and a trailing zero bit. The U and V signals may therefore be processed by subtracting 128. In addition, the Y, U and V signals are preferably sign extended by two bits, which may be used to determine the overflow or underflow state. For example, 00=OK, 01=overflow, 11=underflow and 10=indeterminate. Each of the adders shown in Figure 87 is preferably, therefore, a 10 bit adder.
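The following fragment is a sketch, under the sign-extension convention just described, of how the hard limiter can decode the two extension bits of a 10 bit result; the function name and the handling of the indeterminate state are assumptions for illustration only.

    /* Hard limit one 8-bit colour component from a 10-bit adder result.
       Bits 9:8 encode the range state: 00 = OK, 01 = overflow,
       11 = underflow, 10 = indeterminate. */
    static int hard_limit(int value10)
    {
        switch ((value10 >> 8) & 0x3) {
        case 0x0: return value10 & 0xFF;   /* in range: pass the 8-bit value */
        case 0x1: return 0xFF;             /* overflow:  clamp to maximum    */
        case 0x3: return 0x00;             /* underflow: clamp to minimum    */
        default:  return 0x00;             /* indeterminate: clamped here
                                              for illustration               */
        }
    }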
Figures 88, 89 and 90 depict carry select adder configurations and related bit tables for converting the YUV signal to RGB format. Figure 88 includes a carry select adder 253 coupled to a full adder 254 that together generate the R component of the RGB signal. Figure 89 includes a carry select adder 255 coupled to a full adder 256 that together generate the B component of the RGB signal. Figure 90 includes carry select adders 257, 258, 259 and a full adder 260 that together generate the G component of the RGB signal.
Figure 91 is a block diagram timing model for a YUV to RGB color space converter. A FIFO 262 is coupled by a latch 264 to a state machine 266, such as the Xilinx XC3080-7PC84 field programmable gate array. The state machine 266 includes an RGB converter 272 that converts a YUV signal into an RGB formatted signal. The converter 272 is preferably a ten bit carry select adder having three levels of logic (as shown in Figures 87-90). The RGB formatted data is coupled from an output of the state machine 266 into a buffer 270. In accordance with the preferred embodiment, the state machine 266 is a memoryless state machine that simultaneously computes the YUV-to-RGB conversion while controlling the flow of data to and from it.
The carry select adder, as implemented in the Xilinx FPGA, adds two bits together and generates either a carry or a sum. Carry select adders are very fast because the propagation delays and logic levels can be kept very low. For instance, a ripple carry adder of 10 bits requires 10 levels of logic to compute a result. A 10 bit carry select adder can be realized in only three levels of logic.
Sums and carries are generated, and in some cases, a carry is generated without completing the carry from lower order bits. For instance, carry four (C4) generates two carry terms, each based on the outcome of carry two (C2). Equation C_40 is based on C2 being zero, and C_41 is based on C2 being one. Once C2 is determined, then the appropriate signal, C_40 or C_41, can be selected in later logic levels.
Generating multiple carries allows for parallel processing, because the carries can be generated without waiting for earlier results. This idea greatly improves the speed of an adder as compared to ripple carry architecture.
Similarly, higher order bits will generate multiple carry types which will be completed in following levels of logic. As the lower order sums and carries are generated, the upper order results can be determined. The entire adder tree is completed in only three levels of logic. Appendix 6 contains code for implementing the carry select adder and the carry save adder.
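For illustration only (this is not the Appendix 6 code), the carry select principle can be modelled in C as follows: the upper group of bits is added twice, once for each possible carry-in, and the correct result is selected once the lower group's carry-out is known. The 5-bit group split and the function name are assumptions.

    /* Behavioural model of a 10-bit carry select add split into two 5-bit groups. */
    static unsigned carry_select_add10(unsigned a, unsigned b)
    {
        unsigned lo  = (a & 0x1F) + (b & 0x1F);               /* bits 4:0             */
        unsigned c5  = (lo >> 5) & 1;                         /* carry out of bit 4   */
        unsigned hi0 = ((a >> 5) & 0x1F) + ((b >> 5) & 0x1F); /* bits 9:5, carry-in 0 */
        unsigned hi1 = hi0 + 1;                               /* bits 9:5, carry-in 1 */
        unsigned hi  = c5 ? hi1 : hi0;                        /* select on real carry */
        return (hi << 5) | (lo & 0x1F);                       /* up to 11-bit result  */
    }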
Figure 92 is a state diagram relating to the conversion of the YUV signal to RGB format in accordance with the timing model shown in Figure 91.
Since different platforms have different color space requirements, the output color space converter shown in Figure 91 is in-circuit configurable. The application software queries the platform to determine what color space it accepts and then reconfigures the color space converter appropriately. In the preferred embodiment, YUV-to-RGB 565 color space conversion is performed. Nonetheless, the FPGA shown in Figure 91 may perform RGB color space conversions to different bit resolutions or even to other color spaces. Half-Pixel Motion Estimation Memory Saving Technique
In video compression algorithms, motion estimation is commonly performed between a current image and a previous image (commonly called the reference image) using an integer pixel grid for both image planes. More advanced techniques use a grid that has been upsampled by a factor of two in both the horizontal and vertical dimensions for the reference image to produce a more accurate motion estimate. This technique is referred to as half-pixel motion estimation. The motion estimation is typically performed on 8x8 image blocks or on 16x16 image blocks (macro blocks), but various image block sizes may be used.
Assume that 16x16 image blocks (macro blocks), as shown in Figure 93, are used with a search range of approximately +/-16 pixels in the horizontal and vertical directions, and that the image size is QCIF (176 pixels horizontally by 144 pixels vertically), although other image sizes could be used. Further assume that each pixel is represented as one byte of data, and that motion searches are limited to the actual picture area. A 16 pixel wide vertical slice is referred to as a macro column and a 16 pixel wide horizontal slice is referred to as a macro row. An interpolated macro row of reference image data is 176x2 bytes wide and 16x2 bytes high. Figure 94 shows a macro block interpolated by a factor of two horizontally and vertically. When designing video compression algorithms, there is a trade-off between computation power and memory usage. On the one hand, if one has more computational power and memory bandwidth, then one can save on memory space by performing the interpolation on a 3x3 macroblock portion of the reference image for each macro block of the current image to undergo motion estimation. On the other hand, if one can afford to use more memory to save computational power, then one can interpolate the full reference image before any motion estimation is performed and therefore interpolate each portion of the image only once. This provides computational savings since one is only interpolating the image once instead of multiple times. This computational savings comes at the cost of a reference image memory size which must be four times the original reference image memory size (176x144x2x2 = 101,376 bytes). In a preferred embodiment, there are two methods of reducing the computation and the memory requirements simultaneously.
The first method involves using a memory space equivalent to three of the macro rows interpolated by a factor of two in both the horizontal and the vertical directions, in place of the whole reference image plane interpolated up in size. That is, the memory size is 3x176x16x2x2 = 33,792 bytes. This reduces the interpolated reference image memory requirement by approximately 67%. This interpolated reference image memory space is best thought of as being divided up into thirds, with the middle third corresponding to roughly the same macro row (but interpolated up in size) position as the macroblock being motion estimated for the current image. The motion estimation using this memory arrangement can be implemented as follows (a pseudo-C sketch of the procedure is given after the listed steps):
1) Copy the first three (top) interpolated macro rows of the reference image data into the previously described memory space (note: this is 3x176x16x2x2 bytes of video data) and calculate the motion estimation for the first two current image macro rows. Figure 95 depicts the memory space after this step during the first iteration of the first method.
2) Copy the second third of the interpolated macro row of the reference image data into the memory space where the first third of the interpolated macro row of the reference image data currently resides. (In the first iteration of the algorithm, interpolated macro row 2 is copied into the memory holding interpolated macro row 1.)
3) Copy the last third of the interpolated macro row of the reference image data into the memory space where the second third of the interpolated macro row of the reference image data currently resides. (In the first iteration of the algorithm, interpolated macro row 3 is copied into the old interpolated macro row 2 memory space.)
4) Copy the next interpolated macro row of the reference image data into the memory space where the last third of the interpolated macro row of the reference image data currently resides. (In the first iteration of the algorithm, interpolated macro row 4 is copied into the old interpolated macro row 3 memory space.) Figure 96 depicts the memory space after this step after the first iteration of the first method.
5) Calculate the motion estimation for the next current image macro row assuming that its reference position is centered in the middle third of the interpolated reference image memory space.
6) Return to step 2.
This first method requires that additional time and memory bandwidth be spent performing the copies, but saves memory and computational power.
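The following pseudo-C sketch illustrates the copy-based sliding window of the first method for QCIF images. The helper routines, which interpolate one reference macro row by a factor of two and perform the motion estimation for one current macro row against the three-macro-row window, are assumptions and are only declared here; the handling of the last macro row is simplified.

    #include <string.h>

    #define MR_W   (176 * 2)            /* interpolated macro-row width in bytes    */
    #define MR_H   (16 * 2)             /* interpolated macro-row height in rows    */
    #define MR_SZ  (MR_W * MR_H)        /* 11,264 bytes per interpolated macro row  */

    /* Assumed helpers (not specified above). */
    void interpolate_macro_row(const unsigned char *ref, int macro_row, unsigned char *dst);
    void motion_estimate_macro_row(const unsigned char *cur, int macro_row,
                                   const unsigned char *window);

    static unsigned char window[3 * MR_SZ];     /* 33,792 bytes */

    void method1(const unsigned char *ref, const unsigned char *cur, int n_macro_rows)
    {
        /* Step 1: interpolate the top three macro rows and estimate the first two. */
        for (int i = 0; i < 3; i++)
            interpolate_macro_row(ref, i, window + i * MR_SZ);
        motion_estimate_macro_row(cur, 0, window);
        motion_estimate_macro_row(cur, 1, window);

        for (int row = 2; row < n_macro_rows; row++) {
            /* Steps 2-4: slide the window up by one macro row and append the next. */
            memmove(window, window + MR_SZ, 2 * MR_SZ);
            if (row + 1 < n_macro_rows)
                interpolate_macro_row(ref, row + 1, window + 2 * MR_SZ);
            /* Step 5: the current macro row is centred on the middle third. */
            motion_estimate_macro_row(cur, row, window);
        }
    }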
The second method, like the first method, involves using a memory space equivalent to three of the macro rows interpolated up by a factor of two in both the horizontal and the vertical directions. Unlike the first method, the second method uses modulo arithmetic and pointers into memory to replace the block copies of the first method. The second method is better than the first method from a memory bandwidth point of view, and offers the same memory and computational power savings as the first method.
Three macro rows of interpolated reference image data correspond to 96 horizontal rows of pixels. If, as in method 1, the memory space of the interpolated reference image is divided up into thirds, then the middle third of the memory space begins at the 33rd row of horizontal image data. A pointer is used to locate the middle third of the memory space. Assuming that the numbering starts at 0, the pointer will indicate 32. The motion estimation of the second method using this memory arrangement can be implemented as follows (a corresponding sketch is given after the listed steps):
1) Calculate and copy the first three (top) interpolated macro rows of the reference image data into the previously described memory space (3x176x16x2x2 bytes of video data) and calculate the motion estimation for the first two current image macro rows (the pointer was initialized to 32).
2) Set the pointer to 64 and copy the next interpolated macro row of reference image data into memory rows 0-31.
3) Calculate the motion estimation for the next current image macro row assuming that its reference position is centered in the middle third of the interpolated reference image memory space.
4) Set the pointer to 0 and copy the next interpolated macro row of reference image data into memory rows 32-63.
5) Calculate the motion estimation for the next current image macro row assuming that its reference position is centered in the middle third of the interpolated reference image memory space.
6) Set the pointer to 32 and copy the next interpolated macro row of reference image data into memory rows 64-95.
7) Calculate the motion estimation for the next current image macro row assuming that its reference position is centered in the middle third of the interpolated reference image memory space.
8) Return to step 2.
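A corresponding sketch of the second method is given below. Instead of copying, a row pointer designating the middle third advances by 32 interpolated rows modulo 96 each iteration, and the next interpolated macro row overwrites the oldest third in place. The helper routines are again assumptions; for the extended-search variant described below, the window would hold five macro rows and the addressing would be modulo 160.

    #define WIN_ROWS 96                     /* 3 interpolated macro rows x 32 rows */

    /* Assumed helpers (not specified above). */
    void interpolate_macro_row_into(const unsigned char *ref, int macro_row,
                                    unsigned char *dst);
    void motion_estimate_macro_row_mod(const unsigned char *cur, int macro_row,
                                       unsigned char win[][176 * 2], int middle_row);

    static unsigned char win[WIN_ROWS][176 * 2];

    void method2(const unsigned char *ref, const unsigned char *cur, int n_macro_rows)
    {
        int middle = 32;                                 /* step 1: pointer = 32 */
        for (int i = 0; i < 3; i++)
            interpolate_macro_row_into(ref, i, &win[i * 32][0]);
        motion_estimate_macro_row_mod(cur, 0, win, middle);
        motion_estimate_macro_row_mod(cur, 1, win, middle);

        for (int row = 2; row < n_macro_rows; row++) {
            middle = (middle + 32) % WIN_ROWS;           /* steps 2, 4, 6        */
            int oldest = (middle + 32) % WIN_ROWS;       /* third to overwrite   */
            if (row + 1 < n_macro_rows)
                interpolate_macro_row_into(ref, row + 1, &win[oldest][0]);
            /* steps 3, 5, 7: the current macro row is centred on 'middle'. */
            motion_estimate_macro_row_mod(cur, row, win, middle);
        }
    }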
Figure 97 depicts memory space addressing for three successive iterations of the second motion estimation memory saving technique. The numbers on the left indicate the horizontal memory space addresses. The "X" indicates the center interpolated macro row and pointer location. Figure 97 gives a conceptual view of how using a pointer and modulo arithmetic makes the memory space appear to "slide" down an image.
The above two methods can be extended to cover extended motion vector data (approximately +/-32 pixel searches instead of just the approximately +/-16 pixel searches described previously) by using a memory space covering five interpolated macro rows of reference image data. This larger memory space requirement will reduce the interpolated reference image memory requirement by approximately 45% instead of the 67% savings obtained by using only three interpolated macro rows of reference image data. Of course, the second method then uses modulo 160 arithmetic.
Panning and Zooming Traditionally, a pan of a video scene is accomplished by moving the camera and lens with motors. The hardware in a preferred embodiment uses electronic means to accomplish the same effect.
The image scanned by the camera is larger than the image required for processing. An FPGA crops the camera image to remove any unnecessary data. The remaining desired picture area is sent to the processing device for encoding. The position of the desired image can be selected and thus "moved" about the scanned camera image, which has the effect of "panning the camera." The software tells the input acquisition system to change the horizontal and vertical offsets that it uses to acquire the image into memory. The application software provides the user with a wire frame diagram showing the total frame with a smaller wire frame inside of it, which the user positions to perform the panning operation.
Traditionally, a zoom in a video scene is accomplished by moving the camera and lens with motors. In a preferred embodiment, electronics are used to accomplish the same effect. For the preferred embodiment, the image scanned by the camera is larger than the image required for processing. Accordingly, the video encoder/decoder 103 may upsample or subsample the acquired data to provide zoom in/zoom out effects in a video conferencing environment. For example, every pixel in a given image area may be used for the most zoomed in view. A zoom out effect may then be achieved by decimating or subsampling the image capture memory containing the pixel data. These sampling techniques may be implemented in software or hardware. A hardware implementation is to use sample rate converters (interpolating and decimating filters) in the master digital signal processor or in an external IC or ASIC.
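As a simple illustration of the subsampling approach, the fragment below produces a zoom-out by keeping every factor-th pixel of the selected capture area. The function name and the integer factor are illustrative, and a practical implementation would apply a decimating (anti-aliasing) filter before subsampling, as noted above.

    /* Zoom out by integer subsampling of one image plane in capture memory. */
    static void zoom_out(const unsigned char *src, int src_w, int src_h,
                         unsigned char *dst, int factor)
    {
        int dst_w = src_w / factor;
        int dst_h = src_h / factor;
        for (int y = 0; y < dst_h; y++)
            for (int x = 0; x < dst_w; x++)
                dst[y * dst_w + x] = src[(y * factor) * src_w + (x * factor)];
    }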
Sending High Quality Still Images A video conferencing system employing the invention may also be used to send high quality still images for viewing and recording to remote sites. The creation of the still can be initiated through the application software by either the viewing or the sending party. The application then captures the image, compresses it, and transmits it to the viewer over the video channel and data channel for maximum speed. In applications using more than one processing device, the still image can be compressed by one of the processing devices while the main video conferencing application continues to run.
In a preferred embodiment, the still is encoded in one of two methods, depending upon user preference: i) an HVQC still is sent, or ii) a differential pulse code modulation (DPCM) technique using a Lloyd-Max quantizer (statistical quantizer) with Huffman encoding is used. Other techniques that could be used include JPEG, wavelets, etc. In applications using more than one processing device, the MD is used to actually do the still encoding as a background task, and then sends the still data with a special header to tell the receiving codec what the data is. The receiver's decoder would then decode the still on its end.
Stills and video conferencing frames are presented to the user in separate application windows. The stills can then be stored to files on the user's system, which can be called up later using a software or hardware decoder.
In accordance with the present invention, a video sequence may be transmitted over an ordinary modem data connection, either with another party or parties, or with an on-line service. Upon connection, the modems exchange information to decide whether there is to be a two-way (or more) video conference, a single ended video conference, or no video conference. The modem may then tell the application software to automatically start up the video conference when appropriate. In addition, the system described herein provides the users the capability of starting an application at any point during their conversation to launch a data and/or video conference.
In addition, the video encoder/decoder 103 described herein may be used with a telephone, an audio compression engine and a modem to produce a video answering machine. For this embodiment, a DRAM may be used to store several minutes of data (multiple audio/video messages). The answering machine may play an audio/video message to the caller and may then record a like message from the caller. The caller may choose to send either a video sequence or a still image, such as by pushing appropriate buttons on the telephone. The caller may be given the option of reviewing the video sequence or still image by preventing the encoding and transmission of the video until the caller presses an appropriate button to signal approval. It is to be understood, however, that additional alternative forms of the various components of the described embodiments are covered by the full scope of equivalents of the claimed invention. Those skilled in the art will recognize that the preferred embodiment described in the specification may be altered and modified without departing from the true spirit and scope of the invention as defined in the following claims, which particularly point out and distinctly claim the subject matter regarded as the invention.
Dynamic memory: bit buf.c
1536 Encoder buffer size
1536 Decoder buffer size

master.c
/* Allocate input image Y, U and V buffers.
   The input image will be two lines longer than the coded image. This allows us to
   filter the image with no edge effect on the top and bottom of the image. Side edge
   effects are taken care of by the filter function. */
25696/2 inputY = (Ulong*)malloc(((IMAGE_data.rowsY+2)*IMAGE_data.colsY)/4);
25520/4 inputU = (Ulong*)malloc(((IMAGE_data.rowsUV+1)*IMAGE_data.colsUV)/4);
25520/4 inputV = (Ulong*)malloc(((IMAGE_data.rowsUV+1)*IMAGE_data.colsUV)/4);
hvqinit.c
6570 imgs.yc = (Ulong**)gmatrix( -1, dim.y, 0, (dim.x>>2)-1, sizeof(Ulong) );  /* current coded frame */
6570 imgs.yp = (Ulong**)gmatrix( -1, dim.y, 0, (dim.x>>2)-1, sizeof(Ulong) );  /* prev. coded frame */
/* COLOR image planes */
1656 imgs.uc = (Ulong**)gmatrix( 0, dim.yc-1, 0, (dim.xc>>2)-1, sizeof(Ulong) );
1656 imgs.vc = (Ulong**)gmatrix( 0, dim.yc-1, 0, (dim.xc>>2)-1, sizeof(Ulong) );
1656 imgs.up = (Ulong**)gmatrix( 0, dim.yc-1, 0, (dim.xc>>2)-1, sizeof(Ulong) );
1656 imgs.vp = (Ulong**)gmatrix( 0, dim.yc-1, 0, (dim.xc>>2)-1, sizeof(Ulong) );
Total dynamically allocated memory:
48444 32-bit words for image and bit buffers (193776 bytes)
Appendix 1:1
Statically allocated memory:
.text 11709 words (program)
.bss 2304 words (statically allocated arrays)
.const 636 words (constants)
.data 3342 words (nearly all are codebooks)
internal memory:
.itext 438 words (internal program)
.idata 52 words (internal data)
.cram1 531 words (input spatial filter, input video buffer)
.ofilter 529 words (post spatial filter)
Total external memory requirements of Decoder
48444 + 17991 = 66435 32-bit words
Slave Encoders
Dynamically allocated memory: hvqinit.c
(byte memory)
25488 imgs.y  = byteMatrix(0, dim.y-1, 0, dim.x-1, byteStart);  /* original image */
25488 imgs.yc = byteMatrix(0, dim.y-1, 0, dim.x-1, byteStart);  /* original image */
25488 imgs.yp = byteMatrix(0, dim.y-1, 0, dim.x-1, byteStart);  /* original image */
/* Allocate the chrominance signals out of byte-wide memory */
6408 imgs.u  = byteMatrix( 0, dim.yc-1, 0, dim.xc-1, byteStart);
6408 imgs.uc = byteMatrix( 0, dim.yc-1, 0, dim.xc-1, byteStart);
6408 imgs.up = byteMatrix( 0, dim.yc-1, 0, dim.xc-1, byteStart);
6408 imgs.v  = byteMatrix( 0, dim.yc-1, 0, dim.xc-1, byteStart);
6408 imgs.vc = byteMatrix( 0, dim.yc-1, 0, dim.xc-1, byteStart);
6408 imgs.vp = byteMatrix( 0, dim.yc-1, 0, dim.xc-1, byteStart);
Appendix 1:2
Total dynamically allocated memory:
114912 8-bit bytes for images on each slave encoder.
pingpong.c
512 = 2x256 32-bit words for ping pong buffers
Statically allocated memory:
.bss 10983 words (slave MD communication and packed Y, U, and V planes rec'd from master)
.const 992 words (constants)
.data 12 words
.text 35766 words (includes program and codebooks: luminance mean removed: 10689, color mean: 3563, color vq: 10689)
.itext 280 words (program in internal memory)
Total memory required by Encoder
48265 32-bit words + 114912 8-bit bytes
Appendix 1:3
This file contains pseudo code/flow charts of how the decoder may handle bit errors. The bit errors may be missing macro-row packets of video data, or they may be undetected bit errors that the supervisor does not find. The pseudo code is organized in a manner similar to that of the encoder; this is done to make it easier to follow.
#define BIT_ERROR 1
#define NO_ERROR  0
Appendix 2: 1 Decoder Code
/**********************************************************************
hvq_decoder()
The main program for decoding video bits
**********************************************************************/

void hvq_decoder()
{
  initialize data structures and read in codebooks;
  /* Look ahead is a non-destructive look at the bit stream,
     i.e. it does not advance buffer pointers, ... */
  look ahead for picture start code (psc);
  while (1)
  {
    bit_error_flag = NO_ERROR;
    /* Check to see if there is a single psc in the bit stream */
    if (single psc present in the bit stream)
    {
      /* ... temporal reference, min and max means of the image, picture type */
      extract picture header information;
      /* Look for consistent mean information, valid temporal reference,
         picture extended information, etc... */
      if (picture header information is consistent)
      {
        bit_error_flag = frame_decoder();
}
      else
      {
        bit_error_flag = undetected_bit_error();
      }

Appendix 2:2
      if (bit_error_flag == NO_ERROR)
      {
        send the decoded data to the PC for display;
      }
      else
      {
        do not update the display;
      }
    }

    /* If there was a problem while decoding the image, or there is no psc waiting in the bit stream */
    if ((bit_error_flag == BIT_ERROR) OR not single psc present in the bit stream)
    {
      /* Wait until a still image is sent to the decoder */
      while (last 16 bits read are not equal to single psc and still image psc)
      {
        read a bit from the bit stream;
      }
    }
  }
}
/**********************************************************************
frame_decoder()

The program for reading the bit stream and decoding a frame of actual
video data, either still or video.
**********************************************************************/

int frame_decoder()
{
  /* Tell the decoder that no macro-rows have been decoded */
  reset which macro-rows have been decoded;
  /* Check for valid macro row address in the bit stream */
  look ahead in the bit stream;
Appendix 2:3 while (there is another macro-row video packet or still image encoded data packet)
{ get macro row address; if (duplicated macro row address)
{
Undetected bit error or lost macro-row packet that had pec info;
I bit eπorO;
BΓT.ER OR;
} se
{ if (still image encoded data) { loca._enor_.lag - stiU_macro_row_decoderO;
10
) else if (video image encoded data)
{ local error flag " video macro row decoderO;
>
if (local error flag) { return BIT ERROR;
15 > else if (interleaving video and still macro row within a frame)
{ * Interleaving of video and still data is not currently allowed, but may be allowed at some time in the future */ return BlT_ERROR;
> else
{ * If there are fill bits, they should be zero and should be less than 8 of them before the next macro-row address * 20 remove fill bits if there are any,
>
} look ahead at next 16 bits in the bit stream;
Appendix 2:4 if (there are macro-rows that have not been decoded)
{ * The supervisor has indicated that there was a bit error */ if (video encoded data)
{ t* Perform error concealment which currently is thought to be copying the macro-rows from the previously coded image, but could be other techniques as well */ copy missing macro-rows from previously coded image;
) else
{
I* still image encoded data can't have missing macro-row data t missing macro-rows or encode another still image? */
*" undetected_bit_errorO; return BIT ERROR;
> } if (look ahead data indicates a new single pec)
{ return NO ERROR;
}
IS else if Ocok ahead data indicates refresh data)
{ * refresh the image */ return, (rrfrrsh_decodetO);
> else
{
I* Unknown field in the bit stream */ unαetected_bit_errorO; return BIT ERROR;
20
Appendix 2:5

/**********************************************************************
undetected_bit_error()
Routine that is called when something unexpected appears in the bit-stream that is not recoverable.
**********************************************************************/

int undetected_bit_error()
{ signal remote encoder to send a new still image; return BIT ERROR; }
/**********************************************************************
refresh_decoder()
The routine for refreshing the image.
**********************************************************************/

int refresh_decoder()
{
  while (!psc in look ahead data)
  {
    /* EXPAND into vq, scalar, diff, or absolute */
    get the macro row address of the data;
    if (not valid encoded data)
    {
      undetected_bit_error();
      return BIT_ERROR;
    }
  }
}
/**********************************************************************
video_macro_row_decoder()
The routine for decoding a macro-row of video encoded data.
Appendix 2:6 int video macro row decoderO
{
/• Determine if you are faτ"""g from top to bottom or bottom to top */ get scan direction; for (each o the 11 16x16 blocks)
{ local_eπor_flag - hiersrcfaical_decoder( one 16x16 block ); ifOocal error flag implies invalid hufiman symbol OR invalid vq address)
{ return (undetected bit errorO);
)
still_macro_row_decoder()
The routine for decoding a macro-row of still encoded data.
int still macro row decoderO { for (each ofthe 1764x4 blocks)
{ kκal_eπw_flag - βtiU_vq_decoder (one 4x4 block); if (local error flag "■» invalid huffinan symlwl OR invalid vq address)
{ return (undetected bit errorO);
} )
}

/**********************************************************************
hierarchical_decoder()
The routine for hierarchically decβduig each 16x16 block, if necessary,
Appendix 2:7 each 16x16 is iteratively decomposed into four sub-blocks until the block sue is 4x4.
int bierarchical.decoderO { local_hitl6_flag - decoderl6 (one 16x16 block); if ((local bit flag)
{ de wnpoor into four 8x8 blocks; for (each txS block)
{ local.bitt flag - decoder* (one 8x8 block); ifdlccal hits flag )
{ decompose into four 4x4 blocks; for (each 4x4 block)
{ decoder4 (one 4x4 block);
}
}
>
)
/**********************************************************************
decoder16()
This routine decodes a 16x16 block using the different 16x16 coders.
decoderl60 ( hitFlag - read a bit from the bit stream; if(httFlag)
{ copy 16x16 Y block from previously coded image; copy SxS U sad V blocks from previously coded color planer, ι (NO_ERROR);
Appendix 2:8 } else
{ hitFlag - read a bit from the bit stream; if(hitFlag) look for a valid hufrman code for the motion vector cache position;
if (cache hit)
{ update the cache; extract the Y motion vector from the cache;
} else
( nriAat* the cache, exαsct the motion vector from the bit stream;
} form the U.V motion vecton from the Y motion vector, copy the blocks from the previously coded data at the motion vector offset;
}
) r* Decoders and decoder4 are not specified here yet */
Encoder Code

hvq_encoder()
Appendix 2:9 void hvq_encoderO { initialize data stracnarcs and read in uidetiu ks, profnun the video A D with frame rates and filter options; i the master and slave vortex encoders; while (1)
{ both master and slave get images; rrrhangr frame πuiwutri. if (master frame number not equal to slave frame number) {
«!—»—■■■-» who needs to drop frames and acquire new frames to catch up; } both the master and slave compute the scene change mean square error. if (master vortex or slave vortex indicates a scene change occurred)
{ tell both encoders to use the still image encoder, reset boundary to 5/4 macro-rows for master/slave, respectively; put picture header information into the bit stream; still frame encoderO; } else
{ exchange master and slave scene change mean square error,
(faewminr if boundary needs to be adapted; if (boundary needs to be adapted)
{ exchange previously coded macro-row data; exchange scene change mse array for use with beckground detector, >
Appendix 2: 10 put picture header information into the bit stream, video_fiSj-ne_encoderO; if (there are extra bits are left over) { refresh_encoderO;
} else
{ reset the scalar refresher state; >
}
vιdeo_frame_encodcrO
The routine for esKoding a frame of video image.
int video frame eacoderO { for (each macro-row)
( reset the caches put macro-row address into the bit stream; for (each 16x16 block)
{ hierarchical encoder (one 16x16 block);
} }
Update threshold information based on buffer fullness;
}
Appendix 2:11
still_frame_encoder()
The routine for encoding a frame of still image.
int still frame encoderO { for (each macro-row) ( reset the cache t; put macro-row address into the bit stream;
• the codebooks used for Y, U and V can be different */ for (each 4x4 Y block) stiD_vo_encoder (one 4x4 block);
for (each 4x4 U block) stiU_v _.eacoder(one 4x4 block); for (each 4x4 V block) stffl_vα_cncoder(ooe 4x4 block);
still_vq_encoder()
The routine for encoding one 4x4 block of a still image.
Appendix 2: 12 still vq_encodeO
{ if (4x4 block is found in the stack cache)
{ encode 4x4 block with the cache entry, return ( TRUE );
) else
( encode this 4x4 block with a given codebook;
) ) •••••••••••••••■••••••••••••••••••■•••••••••••••••« re resh_encodeιO
The routine for refreshing a fraction of an image frame to avoid buffer underflow.
refresh encoderO { if (scalar refresh state active)
{ invoke the slow (but high quality) scalar refresher, > else
{ i the fast vq refresher, i the sane transition threshold for the scalar refresher, ) >
hierarchical_encoder()

The routine for hierarchically encoding each 16x16 block; if necessary, each 16x16 is iteratively decomposed into four sub-blocks until the block size is 4x4.
Appendix 2: 13 int hicrarchical_eacoderO { local_hitl6_flag - encoderl6 (one 16x16 block); if (llocal hitl6 flag)
{ drmmnner into four SxS blocks; for (each SxS block)
( local ύtS_flag - encoders (one SxS block); if (llocal hits flag )
{ decompoar into four 4x4 blockr, for (each 4x4 block)
{ encoder4 (one 4x4 block);
>
>
} >
encoder 160
The routine for encoding one 16x16 block. It retums the status of the encoding process (hit miss flag).
int encoderl60
{ if (beckground test( 16x16 ))
{ i this block as a background block; m color components (U and V) also u background blocks; i (TRUE);
if ( MOTION CACHE16 - motion cache test(16xl6) ) { found a motion vector in the cache that passes tiie threshold test; i-T rtafr tht rtrfw cwwewf.
Appendix 2: 14 ) else if ( BLOCK MATCH16 - bkxk_match test( 16x16) )
( * bkκk_match_test can be full-search ot 3-step search */
found a motion in the cache that passes the threshold test; update the cache content;
} if (MOTION CACHE16 1| BLOCK MATCH16) ( at this point, it is certain that the luminance component can be coded with the motion vector. But we need to tee
10 for the color co poneneα with the same motion vector
•/ the 16x16 block with the resulting motion vector (MV);
• COLCA coniponcnts: both U and V hβve block size of SxS */ if (U and V share the MV of Y)
/• do nothing •/
15 else if(U shares the MV of Y and V does not) find tlιe nκ>tiOT vectOT θfV siartiιιg at the MV ofY; else if (V shares the MV of Y and U does not) find the motion vector of U starting at the MV of Y;
else • both U and V do not share MV with Y •/
20 find the motion vector of U starting at the MV ofY; find the motion vector ofV starting at the MV of Y;
U and V with the resulting motion vectors; (TRUE);
Appendix 2: 15 > else
{ eacoderlθ fails to encode this 16x16 block; return (FALSE);
encodertO
The routine for encoding one SxS block. It returns the status of the encoding process (hit/miss flag). 0
ua encodertO { if (background teat( SxS ))
{
\ this block as a background block; i the color ceeaponentt (U and V) also as background Mockr. i (TRUE);
> else 5 { if ( MOTION CACHES - motion cache tesKSxS) )
{ found a motion vector in the cache that passes the threshold test; ! the cache content;
> dee if ( BLOCK MATCH - block match testfSxS) )
( f* block_match_test can be lull-search ot 3-stcp search */ found a motion in the cache that passes the threshold test; ø npdatf the cache content;
) if (MOTION CACHES f| BLOCK MATCHI) {
Appendix 2: 16 at this point, it is certain that the luminance component can be coded with the motion vector. But we need to test for the color componenets with the same mouoo vector the SxS bkxk with the resulting mouon vector (MV);
/* COLOR components: both U and V have block size of 4x4 •/ if (U and V share the MV of Y)
{
/* do nothing */
> else if (U shares the MV of Y and V does not)
{ find the motion vector of V starting at the MV ofY;
10 ) else if (V shares the MV of Y and U does not)
{ find the motion vector of U starting at the MV of Y;
) else * both U and V do not share MV with Y •/
{ find the motion vector of U starting at the MV of Y; find the motion vector of V starting at the MV of Y;
15 ) encode U and V with the resulting motion vectors; return (TRUE);
> else
{ encoders Mis to encode this 8x8 block; encode each 4x4 U and 4x4 V using a 16-dim VQ;
20
/* So this is the last level the the
U and V components needed to be tested */ return (FALSE);
Appendix 2: 17 encoder40
The routine for encoding one 4x4 block. It always returns TRUE.
utt encoder40 ( if ( MOTION CACHE4 - motion cache_test(4x4) ) { found s motion vector in the cache that passes the threshold test;
: the <
1 else if ( BLOCK MATCH4 - block nιatch_test(4x4) )
(
I* blodc_mstch_test can be full-search ot 3-step search */ found a motion in the cache that passes the threshold test;
update the cache content;
} , . if (MOTION_CACHE4 || BLOCK MATCH4)
15 ( encode the 4x4 block with the resulting motion vector (MV); i (TRUE);
} else
{ each 4x4 bkxk with a mean-removed VQ; (TRUE);
20
)
0
Appendix 2: 18 found a motion in the cache that passes the threshold test; update the cache content;
> if (MOTION CACHE 16 1| BLOCK MATCH 16) { at this point, it is certain that the luminance component can be coded with the motion vector. But we need to test for the color cβinponenets with the same motion vector
encode the 16x16 bkxk with the resulting motion vector (MV);
* COLOR components, both U and V have block size of 8x8 */
10 if (U and V share the MV of Y)
/* do nothing */ else if (U shares the MV of Y and V does not) find the motion vector of V starting at the MV of Y; else if (V shares the MV of Y and U decs not) find the motion vector of U starting at the MV of Y;
15 else • both U sad V do not share MV with Y •/ find the motion vector ofU starting at the MV ofY; find the motion vector of V starting at the MVo Y;
U and V with the resulting motion vectors; (TRUE);
20
encoderlβ fails to encode this 16x16 block; i (FALSE);
Appendix 2: 19 encoderβO
The routine for encoding one SxS block. It returns the status of the encoding process (hit/miss flag).
int encodertO ( if (background test( SxS ))
{ encode this block as a background block; encode die color components (U and V) also as background blocks; return (TRUE);
> else
{ if ( MOTION CACHES - motion cache test(8xS) )
( found a motion vector in the cache that passes the threshold test; update the cache ww*;
} elae if ( BLOCK MATCHS - block match testiSxS) )
{
/• block_match_test can be full-search ot 3-step search */ found a motion in the cache that pastes die threshold test; update the cache content;
} if (MOTION CACHES R BLOCK MATCHS) {
I* at this point it is certain that the luminance component can be coded with the motion vector. But we need to tea for the color componenets with the same motion vector
'/ the SxS block with the resulting motion vector (MV);
Appendix 2:20 * COLOR components: both U and V have block size of 4x4 */ if (U and V share the MV of Y)
{ 5 * do nothing */
) else if (U shares the MV of Y and V does not)
{ find the motion vector of V starting at the MV of Y;
} else if (V shares the MV of Y and U does not)
{ find the m . otion vector of U starting at the MV of Y;
10 > else /* both U and V do not share MV with Y •/
{ find the motion vector of U starting at the MV of Y; find the motion vector ofV starting at the MV of Y; >
> u and V with the resulting motion vectors; return (TRUE);
15 encoders foils to encode this 8x8 block; encode each 4x4 U and 4x4 V using a 16-dim VQ, • So this is the last level the the
U and V components needed to be tested */
(FALSE);
20
Appendix 2:21 /• eacoder40
The routine for encoding one 4x4 bkxk. It always returns TRUE.
L encoder40 { if ( MOTION CACHE4 - motion cache_test(4x4) ) { found a motion vector in the cache that passes the threshold test; i the i > dae if ( BLOCK MATCH4 - block match test(4x4) ) { * blcck_match_test can be full-search ot 3-βtep search */ found a motion in the cache that passes the threshold test;
update the cache content;
> if (MC4TON.CACHE 1| BLOCK MATCH4)
( encode the 4x4 bkxk with the resulting motion vector (MV); ι (TRUE);
! each 4x4 block with a mean-removed VQ; return (TRUE);
}
0
Appendix 2:22 /" This is psuedo code for the HVQC encoder. It outlines the basic steps involved in the encoding process without going into the actual HVQC subroutines.
encodelmage0{ while (IcodiitgDone)
{ /• test to see if the previously coded image (pci) takes up too much channel capacity to be immediately followed by another image.
The opacity or bits per frame is calculated from the bits sec and frames sec. In the capacity calculation there is a divide by 2 is because there are two processors, each of which gets half die bits. Therefore, each of the capacity multiplies in die following section have an addition multiplication factor of 2 (i.e., ranges are 1.5 to 2.5, 2.5 to 3.5, and greater than 3.5
•/
/ Skip is an integer. The -1.0 is to downward bias die number of frames to be skipped.
•/
Skip = (rd.yuv_rate / (2*capacity)) - 1.0;
if ( Skip > NumFramesToSkip ) NumFramesToSkip = Skip;
/• Get the bit me so dut the Capacity can be appropriately computed once per frame per processor "/
Appendix 3: 1 /• Compute die capacity or bits per frame from ώe bits/sec and ftames sec. . .
Recall tiiat die frame rate is given in terms of a divisor into 3U foe The divide by 2 is because there are two processor, each of which gets half die bits.
•/
/ Update die buffer size based on die capacity and frame rate: divided die bit rate by two because mere are two encoders •/
/* Force die buffer size to be determined by the capacity so that it will always be able to hold 4 1/2 frame pictures. At 15 fps and 19200 this becomes 2560 bits. This is a test to see if die buffer should be increased as die frame rate is reduced
BufferSize * 3.75 "capacity;
/ Determine the number of bits per frame •/ /*"*"" reset counters to zero after each video or still frame *••*/ /••*** scene change test based on a MSE d resholds •*••*••/ / If a scene change is being forced by die remote decoder tiien adjust the scenejnse_per_pixel so tiiat die other slave wul also be forced to scene change. This is a precaution just in case the other slave encoder did not receive the force scene change message from die MD processor •/ /* Exchange of scene change mse/pixel values */
/* Force bom master/slave scene change at die same time */
/* Start die timer encoding the frame */ if(scene change) /* if 123 */ {
Appendix 3:2 /* encode image as a still image */ cumulative_refresh_bits « 0; /• set back to use vq refresh •/
Reset die adaptable boundary to be 80/64 split,
The parameter passed here should be the same as die one in initEncoderState.
•/
/* Communication to synchronize the sel and se2 processors, no real information is exchanged. This makes sure die picture start code is sent first before any other macro row data
•/
/* Wait until die buffer fullness equals buffersize before encoding again */
/* Skip two frames because die one old frame is already packed in the slave memory and the MD has a second old frame in its memory. Dropping two will get us a new frame
•/ }/• if_123: if scene_change •/ else
{
/* encode image as a moving image */
/* Compute New Image Boundary based on mse pixel values */
Appendix 3:3 /" die two slave encoders, sel & se2, are set with compile switches during die make process to be a MASTER and a SLAVE. Do not confuse this with die master decoder and die slave encoder terminology */ #if MASTER /* i.e., sel */
/* Exchange of cumulative_refresh_bits */ /* DO NOT adapt boundary when SCALAR refresh is active "/ if( cumulative_refresh bits < TOO MANY VQ_REFRESH BITS )
{
#if AWS / adaptive working model at high motion frame
/* set to use AWS if overall scene nse > 600 •/ /• NOTE: The Slave should have the same status on SET_AWS /
if( ((scene_mse >er >ttd+otherMsePerPixel)*0.5) > 400 ) {
/• use AWS in bit stream */
SET AWS - 1;
} else
{
/* do not use AWS in bit stream */
SET AWS - 0; } #endif /» AWS */
/* Compute boundary adaptation •/ mseRatio * scene_mse_per_pixel (o_ιerMsePerPixd + scene_mse_per_pixel);
Appendix 3:4 By putting in extra test in de OR statement if (mseRatio < « MAX_MSE SiΛ. GEncoderState-
>macroRows < 5), die boundary can be slowly INCREASED back to
center (i.e., 5 macroRows for die master and
4 for die slave for QCIF format).
For QCIF image:
DISP ROWS>>4-9 |
DISP_ROWS>>5 -4 j --> 9-4 =
For 128x96 image:
DISP ROWS >>4 -6 | DISP_ROWS> >5 » 3 | --> 6-3 =
By putting in extra test in die OR statement else (mseRatio > - MIN_MSE &Λ
GEncoderState-> macroRows > 5), the boundary can be slowly DECREASED
center (i.e., 5 macroRows for die master and 4 for de slave). •/ / Otherwise, don't adapt die boundaries */
Appendix 3:5 } /* if cumulative_refresh >its /
#else /* if se2 */
/* exchange scene change flags /
/ SLAVE:
In order for die master and slave not adapting die boundary at the same time when SCALAR refresh is active.
The individual processor must know die cumulative bits of the other processor
•/ /• recieve and send cumulative_refresh >its */
/* DO NOT adapt boundary when SCALAR refresh is active */ if( cumulative refresh bits <
TOO_MANY_VQ_REFRESH_BΓΓS )
#if AWS / adaptive working model at high motion frame
/• set to use AWS if overall scene_mse > 600 */ /* NOTE: The Slave should have die same status on
SEI_AWS •/ if( ((scene nse_p»jpixd+c<herM3ePerPixd)*<).5) > 400 )
{
/• use AWS in bit stream •/
SET AWS - 1;
} dse
{
/* do not use AWS in bit stream */
SET AWS - 0; }
#endif /* AWS */
Appendix 3:6 /* Compute boundary adaptation */
/•
By putting in extra test in die OR statement if (mseRatio < « MAX_MSE SJc. GEncoderState-
> macroRows < 5), the boundary can be slowly INCREASED back to the center (i.e., 5 macroRows for die master and for die slave for QCIF format). For QCIF image:
DISP ROWS >>4 -9 |
DISP_ROWS> >5 -4 ) --> 9-4 - 5
For 128x96 image:
DISP ROWS >>4 -6 | DISP~ROWS> >5 - 3 I --> 6-3
By putting in extra test in die OR statement else (mseRatio > - MIN_MSE Sc& GEncoderState-> macroRows > 5), die boundary can be slowly DECREASED back to die center (i.e., 5 macroRows for die master and 4
Appendix 3:7 for the slave). •/ /* Otherwise, don't adapt die boundaries */
#endif /* #if MASTER at IF 666 */
NOT a scene change — — — > video
/* Communication to synchronize die sel and se2 processors, no real information is exchanged. This makes sure die picture start code is sent first before any other macro row data
•/ requestSceneChangeInfo(0) ;
#if ADAPT.ALPHA /• this is to adapt alpha (see thr.h) to control buffer fullness •/ overau_scmse — (scene_mse__)er_pixd+oti erMsePerPixd)*0.5; if (overall.scmse > 300.0 ) alpha » 0.7; else if (overall_scmse > 200.0 ) alpha « 0.6; else if (overall_scmse > 150.0 ) alpha * 0.5; else if (overall.scmse > 100.0 ) alpha - 0.4; else if (overall_scπιse > 75.0 ) alpha « 0.3; else
Appendix 3:8 alpha - 0.2; beta - 1.0 - alpha; lendif timvideo - get_timeO; /* start timer to see how long encode takes "/ video_encode( coder, J.mac_hdr ); /* call subroutine to encode die video •/
timvideo » get_timeO-timvideo; /• stop timer for encode, this will tell us if we have enough time to refresh */
«*••• pQjj processing, book keeping, update buffer, refresh,
/••• update buffer fullness based on the current bit rate *•"/ /••**••••••••••«»••»*.••«• je r jh •••»••••••••••••**••/
/• Amount we could refresh is determined by die difference of die allowed bits per frame and die number of bits we used to sencode die frame
/* If a new frame is available tiien do not refresh */ /• If there is no time for refresh and tiύs occurs three times in a row, tiien decrement die cum_lative_refresh_bits because we don't want to continue witii scalar refresh in diis case, but would rather continue refreshing with vq refresh
/* If there is no time to refresh more titan three times dian reset cumulative refresh bits to 0 and start from beginning with vq refresh •/ } /• if_123: dse of if scene_change */
Appendix 3:9
/* Determine die number of RT frames to skip based on die bit-rate to avoid buffer overflow /
/* Swap de coded and die previously coded images "/
} / WHILE JcodingDone /
/•••***••*•****,••** g j 0f encod ng } /* encodelmage */
Appendix 3:10 /..........................
Tide: Compute direshold, std, and mean of an image block, float computeMseTx2d( uchar ""img, int r, int c, VEC_SIZ *wrd, float gamma, float alpha, float beta, float "mean, float "std ) img is die image in which std and mean will be extracted. r and c are ie image indices, wrd specifies the image block size mean and std are die returned values Returned Value: die direshold
Description:
This routine computes die direshold to encode a given image block This direshold is s functional of die standard dev. of die image block and a number of other variables (see titr.h section later in diis file for detailed description of tiiese variables).
This routine is called by die still encoder and by die video encoder for all levels of die hierarchy.
/* computeMsetx2d : compute die direshold for ach block being encoded "/ float computeMseTx2d( uchar ""img, int r, int c, VEC_SIZ "wrd, float gamma, float alpha, float beta, float "mean, float "std )
{
/" compute the mean and std of x */ int i, j;
Uint x;
Uint imean-04-td-O; for(i"r; i<r+wrd->p; i++)
{ forfj-c; j <c+wrd->p; j++)
{ x - (Uint) img ilO]; imean + - x; istd +« x"x;
Appendix 4:1 }
*std = istd;
/* Divide by the block size (multiply by 1/blocksize) */
std[0] = std[0]*wrd->l_1;
/* Divide by the block size (multiply by 1/blocksize) */
mean[0] = mean[0]*wrd->l_1;
std[0] = sqrt( (double) (std[0] - (mean[0]*mean[0])) );
/" now calculate direshold for the block and return "/ /* gamma is adjusted each macro block in intra_adapt subroutine */ return( wrd->l " gamma * (std[0]"alpha+beta) );
}
Tide: Inαa-frame direshold adaptation
©Prototype: void intra_adapt( int tRate, int blockCount ) tRate is die cumulative bit-rate to encode a fractional frame. This value is reset to 0 after each frame. blockCount is die total number of blocks encoded in a frame. This value is also reset after each frame.
Description:
This routine adapt die duesholds for encoding a video frame based on the buffer fullness, defined as < buffer_mode > in diis program.
Appendix 4:2 (1) Compute buffer fullness < buffer mode > :
The buffer fullness is based on three parameters:
(1) die real-time buffer (defined as BufferFullness)
(2) die measured frame time up to encode the current block (defined as deltaTime).
(3) die estimated frame time up to current block ( defined as fractioιιal_frame_time) .
The EFFECTIVE buffer fullness is computed based on 5 cases: case 1: based on bits only, with a focus switch (motion tracking) for trading quality and frame-rate. case 2: based on max of bits and time case 3: based on sum of bits and time case 4: based on time only
The default is case 1.
(2) Compute die direshold based on buffer fullness <buffer_mode> : The duesholds bgthl6(0,l,2] (as shown in Figure below) are die ranges of duesholds tiiat are used witii buffer fullness to determine a rneaningful direshold value for encoding the current macro-row or current block depending on wheuier die duesholds are adapted once per macro-row or once per block.
THRESHOLD bgthl6[2]
bgthl6[l] I ./
bghl6[0] 1/
BUFFER FULLNESS fullness overflow cut_off
Appendix 4:3 bgthl6[0,l,2] are used to encode motion and background blocks at different levels of the coder, where as s_m4[0,l,2] are used for encoding die 4x4 spatial background blocks. for a given bufferfullness < bufferjnode > , there are three cases: case 1: Small direshold bUjfer node < fullness_cutoff (this value is typically 0.75) find a~tiιreshold between bgthl6[0] and bgthl6[l], case 2: Large threshold fullness_cufoff < bufferjnode < overflow find a direshold between T>gthl6[l] and bgdιl6[2]. case 3: Overflow threshold bufferjnode > overflow (tiύs value is typically 0.95) set direshold - buffer javerflo h; mmmmmmmm*m»*mm,mmmmmm*mm mmm ,Z mm mm mmm mm m,m.. m* m m mm,mmmmmm / volatile float ext_buffer_mode; /* bufferfullness used by die pingpong buffer (see pingpong.c) "/ void intra adapt( int ϋtate, int blockCount)
{ volatile float tiπiejactor, /* processor time bufferfullness */ volatile float bitjatejactor; /* bit-rate bufferfullness */ volatile float bufferjnode; /* effective bufferfullness */ volatile float ddtaTime; /" cumulative processor time */ volatile float fractional_,capacity; /" estimated capacity at block i "/ volatile float fractional_frame_time; /* estimated processor time at block i "/ static float old jufferjnode«0.5; float rnodifiedBufferFullness; float expectedBits; /* The average number of bits to have encoded after blockCount */ float cumulativeBits; /" Number of bits tiiat have been used */ float bitScale*.50;
Appendix 4:4 /* Find buffer fullness / rnodifiedBufferFullness - BufferFuUness ; mcdiήedBufferFullness — BufferFuUness + rd.yuv ate;
/••*••**** Real-time bufferjnode computation •••••••••/
/* Create a buffer fullness based on bit-rates "/ bitjatejactor - ((float) modifiedBufferFuUness )*inv_BufferSize;
/* Create another buffer fullness based on die timing "/ fractional Jramejime « (1.0/(framejate*GEncoderState- > numBloclα))*blockCount;
/*" measured frame time up to block i (in sec) */ ddtaTime - ((float) (get imeO - sTimeN)) " 0.0000001; timejactor - 0.1 "ddtaTime / fractional_framejime;
/" determine how to find die effective buffer fullness */ switch (bufferMode)
{ case 1:
/" Look at bits only: tmodeScale defaults to 1.0 and is changeable with the focus command line parameter
•/
/ buffer_mode - tπ.todeScale*bitjatejactor; •/
/* bufferjnode - bitScale + tmoc^Scale*bitjate actor,
• /
/* bufferjnode - bitjatejactor, / bufferjnode - bitjatejactor bufferjnode "- tmodeScale; break; case 2:
/* Look at max of bits and time */ bufferjnode - MAX(bufferRateSc e*bitjate_factor, bufferTimeScale*timeJactor); break;
Appendix 4:5 case 3:
/* Look at sum of bits and time "/ bufferjnode - bufferRateScale*bit_rate_factor + bufferTimeScale*time_factor, brea ; case 4:
/* Look at time only "/ bufferjnode - bufferTιmeScale*timejactor; break; default:
/* Look at bits only: tmodeScale defaults to 1.0 and is changeable witii the focus command line parameter */ buffer_mode - tπ ιdeScale*bitjatejactor, break; }
/• Find threshold based on buffer fullness */
/* check over and under limit (FILE-based only) */ if (buffer_mode < 0.0 ) bufferjnode - 0.0;
/* Case 1: small threshold "/ if ( buffer mode < - fullness cutoff ) /* slow adaptation "/ ( float tmp - (1.0 - buffer jnode"fullness nιtoff_inv); gamma(BG16] - (bgthl6[0] - bgthl6{l])tmp + bgthl ]; gamma(SBG4] - (s m4[0] - s m4[l])*tmp + s m4[l];
/•
SBG4, BG 16 are id numbers for each specific predictors (see constant. h section at end of this file).
•/
} /* Case 2: overflow threshold "/ else if (buffer mode > - buffer overflow )
{ buffer_mode - buffer _overflow; gamma(BG16] - bgdιl6 overflow; }
Appendix 4:6 /* Case 3: large direshold "/ dse /* rapid adaptation "/
{ float tmp- 1.0 - buffer jnode/buffer verflow; gamma|BG16] - (bgthΪ6[l] - bgthl6T2])"tmp + bgthl6[2]; gammafSBG4] - (s m4[l] - s m4p])*tmp + s_m4[2]; } ext nifferjnode - buffer_mode; /• Save bufferjnode •/ old Jnifferjnode - bufferjnode; } /• intra_adapt */
* /
CONSTANT.H SECTION
•/
/* The foUowings are CODER16, CODER8, CODER4 are index ID'S used to reference the variables (e.g, caches, vector sizes) that associated time different block sizes: 16x16, 8x8, 4x4. •/
^define CODER160 #define CODER8 1 #define CODER4 2
The foUowings are index ID'S to reference variables associated with different predictors. For example, Thr[STILL4] is die direshold to code each still image 4x4 block.
Appendix 4:7 background coder 16x16 "/ /• motion coder 16x16 / /" background coder 8x8 */
3 / motion coder 8x8 /
4 /• motion coder 4x4 "/
/* vq coder 4x4 for moving images */ /" still image coder 4x4 (both cache and
/• spatial backgroung coder 4x4 (only die mean is used to code each vq block
/* THR.H SECTION
Tide: Pre-defined duesholds for encoding video and images Description:
This file contains aU die duesholds to encode video and images.
Some duesholds are used in conjunction witii intra_adaptO and buffer-fullness to simulate a fixed-rate/variable-quality coder.
SmaUer (larger) direshold results better (poorer) image video quality and higher (lower) bit-rates. Parameter definition: thbgl6(0,l,2]:
The duesholds bgthl6[0,l,2] (as shown in Figure bdow) are the ranges of duesholds dial to be used with buffer fullness to come up witii a meaningful threshold value for encoding die current
Appendix 4:8 macro-row or curren block depend on whether the thresholds are adapted once a macro-row or once a block. See intra_adaptO on how die direshold is adapted.
The resulting direshold computed with thbgl6[0,l,2] are used in
BG16, M16. BG8, M8, M4 (see constant. h for definition of diese terms).
THRESHOLD
/. / .
   s_m4[0,1,2]:
   Same as bgth16[0,1,2], but the resulting threshold is used for the SBG4
   (spatial background) coder only (and SBG8, SBG16 if they exist).

   bgth16_overflow:
   Threshold used when the buffer is full. This value is at least the same as
   bgth16[2].

   gamma[BG16,M16,BG8,M8,M4,VQ4,STILL4,SBG4]:
   This array contains the instantaneous thresholds computed from either
   bgth16[] or s_m4[] by intra_adapt(). gamma[VQ4] is not used at this time.
   gamma[STILL4] is fixed at all times. gamma[BG16,M16,BG8,M8,M4] are based on
   bgth16[]. gamma[SBG4] is based on s_m4[].

   beta, alpha:
   Thresholds used to bias the high/low activity image block (they will be
   defined shortly). The constraint on these values is that alpha + beta = 1.

   blocksize_bias[CODER16,CODER8,CODER4]:
   It is known that a larger block size can be perceptually more annoying than
   a smaller block size at a given mse. Thus these thresholds attempt to
   exploit this by biasing the smaller block size coder (CODER4) with a larger
   threshold.

   r_std4y, r_std4uv:
   These are thresholds used for low activity blocks in the absolute refresher.

   Threshold definition for coders BG16, M16, BG8, M8, M4:
   For a given gamma[i] where i = BG16, M16, BG8, M8, or M4, the threshold is
   defined as T = gamma[i] * blocksize * (std * alpha + beta), where std is the
   standard deviation of the 16x16, 8x8, or 4x4 image block, and blocksize is
   256, 64, or 16. T here is the threshold at each level of the coder used to
   determine whether a given image block passes the threshold test or not.
   For coder SBG4, the threshold gamma[SBG4] is compared directly with the std
   of the 4x4 block.
*/

float alpha = 0.2;
float beta  = 0.8;
float bgth16[3] = {10.0, 15.0, 18.0};
float bgth16_overflow = 20.0;
/* blocksize_bias[CODER16=0], blocksize_bias[CODER8=1], blocksize_bias[CODER4=2] */
float blocksize_bias[3] = {1.0, 1.2, 3.0};
float s_m4[3] = {3.0, 10.0, 15.0};
/* types of coders: BG16=0, M16=1, BG8=2, M8=3, M4=4, VQ4=5, STILL4=6, SBG4=7 */
volatile float gamma[8] = {14.0, 14.0, 16.0, 16.0, 18.0, 80.0, 80.0, 4.0};
/* thresholds for low activity blocks in vq abs. refresher */
float r_std4y  = 3.0;
float r_std4uv = 3.0;
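To make the threshold rule described above concrete, here is a minimal sketch (not the patent's source) of how T = gamma[i] * blocksize * (std * alpha + beta) could be evaluated for one block. The constants gamma, alpha, beta and the coder indices are those declared above; the function names are illustrative.

/* Hedged sketch: evaluating the THR.H threshold test for one image block. */
static float block_threshold(const volatile float *gamma_tbl, int coder_id,
                             float blocksize, float std,
                             float alpha, float beta)
{
    /* blocksize is 256, 64, or 16 for 16x16, 8x8, or 4x4 blocks. */
    return gamma_tbl[coder_id] * blocksize * (std * alpha + beta);
}

/* Example: a 16x16 block passes the BG16 test when its distortion
   (e.g. mse against the previously coded image) falls below T. */
static int passes_bg16(float distortion, float std,
                       const volatile float *gamma_tbl,
                       float alpha, float beta)
{
    return distortion < block_threshold(gamma_tbl, 0 /* BG16 */, 256.0f,
                                        std, alpha, beta);
}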
Encoder/Decoder Flowchart

This file contains a pseudo-code listing of the video encoder and decoder source code. This listing provides a concise summary of the code and defines where the subroutines are located.
Encoder main loop

encodeImage()  /* encode.c */
{
  init_cwsize();  /* hvqinit.c */
  init_mem();  /* hvqinit.c */
  init_vq_vec();  /* hvqinit.c */
  init_mse();  /* hvqinit.c */
  /* [further initialization calls not legible in the source] */
  initEncoderState();  /* hvqinit.c */
  init_pic_hdr();  /* hvqinit.c */
  init_mac_hdr();  /* hvqinit.c */
  /*---------------- encoding loop ----------------*/
  while (1) {
    getNextVideoFrame();  /* slave.c */
    scene_change = scene_change_test();  /* videoutil.c */
    if (force-scene-change-is-needed)
      force-scene-change = TRUE;
    requestSceneChangeInfo();  /* slave.c */
    if (scene_change OR force-scene-change) {
      resetEncoderState();  /* encode.c */
      requestSceneChangeInfo();  /* slave.c */
      still_encode();  /* hvqutils.c */
      waitUntilBufferFullness();  /* slave.c */
    }
    else {
      if (my-scene-mse-is-very-small)
        rcvPCI(16);  /* encode.c */
      else if (my-scene-mse-is-very-large)
        sendPCI(16);  /* encode.c */
      requestSceneChangeInfo();  /* slave.c */
      video_encode();  /* hvqutils.c */
      if (left-over-bits and not-miss-real-time)
        refresh_encode();  /* refresh.c */
    } /* scene_change */
    /* currently coded becomes previously coded by swapping image planes */
  } /* WHILE 1 */
} /* encodeImage */

still_encode()  /* hvqutils.c */
{
  stack();  /* stack.c */
} /* still_encode */
Still Image encoder

stack()  /* stack.c */
{
  for (each macro-row in image) {
    reset_cache();  /* hvqinit.c */
    for (each 4x4 block in macro-row) {
      newmemblock(Y);  /* memblock.s30 */
      computeMseTx2d(Y);  /* videoutil.c */
      icheckLRUCache(Y);  /* cache.c */
      if (cache-hit) {
        pack_word(cache-hit);  /* bitPack() in pingpong.c */
        newmemblock(Y);  /* memblock.s30 */
      }
      else {
        check_scalar_cache(Y);  /* cache.c */
        pack_word(cache-miss);  /* pingpong.c */
        ihlimit(Y-mean);  /* vlib.c */
        pack_word(Y-mean);  /* pingpong.c */
        vsubm(Y);  /* vmse.s30 */
        ibinSearchNode(Y);  /* treevq.c */
        bvmove(Y);  /* vmove.s30 */
        vsubm(Y);  /* vmse.s30 */
        ihlimit(Y);  /* vlib.c */
        iupdateLRUCache(Y);  /* cache.c */
        pack_word(vq-address);  /* pingpong.c */
        newmemblock(Y);  /* memblock.s30 */
      }
      if (every-four-Y-block) {
LOOP1:
        newmemblock(U);  /* memblock.s30 */
        computeMseTx2d(U);  /* videoutil.c */
        icheckLRUCache(U);  /* cache.c */
        if (cache-hit) {
          pack_word(cache-hit);  /* pingpong.c */
          newmemblock(U);  /* memblock.s30 */
        }
        else {
          check_scalar_cache(U);  /* cache.c */
          pack_word(cache-miss);  /* pingpong.c */
          ibinSearchNode(U);  /* treevq.c */
          bvmove(U);  /* vmove.s30 */
          iupdateLRUCache(U);  /* cache.c */
          pack_word(vq-address);  /* pingpong.c */
          newmemblock(U);  /* memblock.s30 */
        }
END LOOP1:
        repeat LOOP1 for V.
      } /* every-four-Y-block */
    } /* for each 4x4 block */
    sendToMD(0);  /* pingpong.c */
  } /* for mac_row */
} /* stack */
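The still-image and video coders above repeatedly consult small least-recently-used caches of code vectors and motion vectors (icheckLRUCache/iupdateLRUCache). A hit lets the encoder send a short cache index instead of a full VQ address or motion vector. The following is a minimal sketch of that idea, not the patent's cache.c; the type, size, and function names are illustrative.

/* Hedged sketch of a small move-to-front LRU cache, assuming integer entries. */
#define CACHE_SIZE 8

typedef struct {
    int entry[CACHE_SIZE];
    int used;                           /* number of valid entries */
} LRUCache;

/* Returns the index of 'value' in the cache, or -1 on a miss. */
static int lru_check(const LRUCache *c, int value)
{
    int i;
    for (i = 0; i < c->used; i++)
        if (c->entry[i] == value)
            return i;
    return -1;
}

/* Moves 'value' to the most recently used slot, inserting it on a miss and
   dropping the least recently used entry when the cache is full. */
static void lru_update(LRUCache *c, int value)
{
    int pos = lru_check(c, value);
    int stop = (pos >= 0) ? pos : (c->used < CACHE_SIZE ? c->used : CACHE_SIZE - 1);
    int i;
    for (i = stop; i > 0; i--)
        c->entry[i] = c->entry[i - 1];
    c->entry[0] = value;
    if (pos < 0 && c->used < CACHE_SIZE)
        c->used++;
}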
Video encoder routines

video_encode()  /* hvqutils.c */
{
  for (each macro-row in an image) {
    reset_cache();  /* hvqinit.c */
    put_mra();  /* hvq_hdr.c */
    for (each 16x16 block in a macro-row) {
      newmemblock();  /* memblock.s30 */
      code_rec();  /* hvqutils.c */
    } /* for each 16x16 block */
    sendToMD();  /* pingpong.c */
    intra_adapt();  /* adapt.c */
  } /* for each macro-row */
} /* video_encode */
/* coder[0] = e_w16(), coder[1] = e_w8(), coder[2] = e_w4() */
code_rec(l, coder[])
{
  coding_status = (*coder[l])();  /* e_w16.c, e_w8.c, e_w4.c */
  if (coding_status == DONE)
    return(0);
  else {
    l++;
    for (four quadrants)
      code_rec(l, coder);
  } /* else */
} /* code_rec */

e_w16()  /* e_w16.c */
{
  computeMseTx2d();  /* videoutil.c */
  Get background-mse from scene-change mse;
  if (background-mse < threshold) {
    pack_word(bg16-hit);  /* pingpong.c */
    newmemblock(Y, U, and V);  /* memblock.s30 */
    return(1);
  }
  pack_word(bg16-miss);  /* pingpong.c */
  bm_cache(Y);  /* videoutil.c */
  if (cache-mse >= threshold) {
    if (current-block-inside-image)
      threestep_inside() OR threestep();  /* threestep.c */
  }
  if ((cache-mse < threshold) OR (threestep-mse < threshold)) {
    pack_word();  /* pingpong.c */
    newmemblock(Y);  /* memblock.s30 */
    compute_mse_blk(U and V);  /* videoutil.c */
    if (u-needs-new-mv AND v-needs-new-mv)
      threestep_color(U and V);  /* threestep.c */
    if (cache-mse < threshold)
      pack_word();  /* pingpong.c */
    else {
      check_scalar_cache();  /* cache.c */
      pack_word();  /* pingpong.c */
      iupdateLRUCache(new-mv);  /* cache.c */
      pack_word(new-mv);  /* pingpong.c */
    }
    pack_word(color-mv-flag);
    pack_word(color-mv);
    newmemblock(U and V);
    return(1);
  }
  pack_word(m16-miss);
  return(0);
} /* e_w16 */
e_w8()  /* e_w8.c */
{
  computeMseTx2d();  /* videoutil.c */
  Get background-mse from scene-change mse;
  if (background-mse < threshold) {
    pack_word(bg8-hit);  /* pingpong.c */
    newmemblock(Y, U, and V);  /* memblock.s30 */
    return(1);
  }
  pack_word(bg8-miss);
  bm_cache(Y);  /* videoutil.c */
  if (cache-mse >= threshold)
    threestep_inside() OR threestep();  /* threestep.c */
  if ((cache-mse < threshold) OR (threestep-mse < threshold)) {
    pack_word(m16-hit);
    newmemblock(Y);
    compute_mse_blk(U and V);  /* videoutil.c */
    if (u-needs-new-mv AND v-needs-new-mv)
      threestep_color(U and V);  /* threestep.c */
    if (cache-mse < threshold)
      pack_word(cache-hit);
    else {
      check_scalar_cache();  /* cache.c */
      pack_word();
      iupdateLRUCache(new-mv);  /* cache.c */
      pack_word(new-mv);  /* pingpong.c */
    }
    pack_word(color-mv-flag and color-mv);
    newmemblock(U and V);
    return(1);
  }
  computeMseTx2d(U and V);  /* videoutil.c */
  /* [condition reconstructed; several lines lost to an illegible figure in the source] */
  if (color-low-activity-block) {
    pack_word(hit-std);
    ibinSearchNode2(2-dim-vq);  /* treevq.c */
    pack_word(vq-address);  /* pingpong.c */
    mcput_vbyte(U and V);  /* mput_vby.s30 */
  }
  else {
    pack_word(miss-std);
    newmemblock(U and V);
    ibinSearchNode(U and V);  /* treevq.c */
    bvmove(U and V);  /* vmove.s30 */
    newmemblock(U and V);  /* memblock.s30 */
    pack_word(U and V);
  }
  return(0);
} /* e_w8 */
e_w4()  /* e_w4.c */
{
  computeMseTx2d();  /* videoutil.c */
  if (!SET_AWS) {
    bm_cache();  /* videoutil.c */
    if (cache-mse >= threshold)
      threestep_inside() OR threestep();  /* threestep.c */
    if ((cache-mse < threshold) OR (threestep-mse < threshold)) {
      pack_word(m4-hit);  /* pingpong.c */
      newmemblock();  /* memblock.s30 */
      if (cache-mse < threshold)
        pack_word(cache-hit);
      else {
        check_scalar_cache();  /* cache.c */
        pack_word();
        iupdateLRUCache(new-mv);  /* cache.c */
        pack_word(new-mv);
      }
      return(1);
    }
  } /* !SET_AWS */
  else {
    bm_current();  /* videoutil.c */
    if (aws-mse < threshold) {
      newmemblock();  /* memblock.s30 */
      pack_word(flag and address);
      return(1);
    }
  } /* else AWS */
  newmemblock();
  pack_word(mean);
  if (y-low-activity) {
    pack_word(std-hit);
    mput_vbyte();  /* mput_vby.s30 */
  }
  else {
    pack_word(std-miss);
    vsubm();
    ibinSearchNode();
    vsubm();  /* vmse.s30 */
    ihlimit();  /* vlib.c */
    newmemblock();
    pack_word(vq-address);
  }
  return(1);
} /* e_w4 */
Encoder refresh routines

refresh_encode()  /* refresh.c */
{
  if (refresh-rate < refresh-overhead)
    return(0);
  if (cumulative_refresh_bits > TOO_MANY_BITS)
    SET_SCALAR_REFRESH = TRUE;
  else
    SET_SCALAR_REFRESH = FALSE;
  while (there are still bits for refresh) {
    if (!SET_SCALAR_REFRESH)
      encode_abs_vq_refresh();  /* refresh.c */
    else
      encode_abs_scalar_refresh();  /* refresh.c */
    Update indices to make sure that all blocks in the image will be
      refreshed sequentially;
    if (new-picture-arrived)
      break;
  } /* while */
  UpdateRefreshHeader();  /* pingpong.c */
  sendToMD();  /* pingpong.c */
} /* refresh_encode */
encode_abs_scalar_refresh()  /* refresh.c */
{
  newmemblock(4x4 Y);  /* memblock.s30 */
  newmemblock(2x2 U);
  newmemblock(2x2 V);
  for (16 Y-pixels, 4 U-pixels, and 4 V-pixels) {
    Quantize each pixel;
    pack_word(each pixel);  /* bit_buf.c */
  }
  newmemblock(Y);
  newmemblock(U and V);
} /* encode_abs_scalar_refresh */

encode_abs_vq_refresh()  /* refresh.c */
{
  for (four 4x4 blocks within an 8x8 block) {
    computeMseTx2d(Y-mean);  /* videoutil.c */
    bitPack(Y-mean and std);  /* bit_buf.c */
    if (low-activity-block) {
      bitPack(std-hit);  /* bit_buf.c */
      mput_vbyte(Y);  /* mput_vby.s30 */
    }
    else {
      newmemblock(Y);  /* memblock.s30 */
      vsubm(Y);  /* vmse.s30 */
      bitPack(std-miss);  /* bit_buf.c */
      ibinSearchNode(Y);  /* treevq.c */
      bitPack(vq-address);  /* bit_buf.c */
      vsubm(Y);  /* vmse.s30 */
      ihlimit(Y);  /* vlib.c */
      newmemblock(Y);  /* memblock.s30 */
    }
  } /* for */
  computeMseTx2d(U);  /* videoutil.c */
  computeMseTx2d(V);  /* videoutil.c */
  if (color-low-activity-block) {
    bitPack(color-std-hit);  /* bit_buf.c */
    ibinSearchNode2(2-dim-vq);  /* treevq.c */
    bitPack(2-dim-vq-address);
    mcput_vbyte(U and V);  /* mput_vby.s30 */
  }
  else {
    bitPack(color-std-miss);
    newmemblock(U and V);  /* memblock.s30 */
    ibinSearchNode(16-dim-U and V);  /* treevq.c */
    bvmove(U and V);  /* vmove.s30 */
    newmemblock(U and V);  /* memblock.s30 */
    bitPack(U-vq-address and V);  /* bit_buf.c */
  }
} /* encode_abs_vq_refresh */
Decoder main loop

int decodeImage()  /* decode.c */
{
  if (bit-error)
    skipToNextPacket();  /* bit_buf.c */
  VSA_error();
  Get relative-temporal-reference;
  if (incorrect-temporal-reference)
    goto concealment;
  while (1) {
    /* [macro-row decoding dispatch lost to an illegible figure in the source] */
AfterDecoding:
    if (unique-psc-found)
      break;
    Get macro-row-header;
    if (valid-macro-row-address)
      longjmp();
    if (incorrect-temporal-reference) {
      goto concealment;
      break;
    }
    get_mra();  /* hvq_hdr.c */
  } /* while(1) */
concealment:
  if (not-all-macro-rows-decoded)
    for (each macro-row)
      if (any-missing-macro-row)
        ConcealMissingMacroRow(missing-macro-row);  /* decodei.c */
  return(0);
} /* decodeImage() */
Video decoder routines

videoImageDecodeMacroRow()  /* d_utils.c */
{
  Set SET_AWS according to macro-row-header;
  for (each 16x16 block in a macro-row)
    decode_rec(level, decoder);  /* d_utils.c */
  return (OK);
}

stillImageDecodeMacroRow()  /* d_utils.c */
{
  d_stack_row();  /* d_stack.c */
  return (OK);
}

/* decoder[0] = d_w16(), decoder[1] = d_w8(), decoder[2] = d_w4() */
decode_rec(l, decoder)
{
  coding_status = (*decoder[l])(r, c);  /* d_utils.c */
  if (coding_status == DONE)
    return (0);
  else {
    l++;
    for (four quadrants)
      decode_rec(l, decoder);
  }
}
refreshImageDecode()  /* decodei.c */
{
  refresh_decode();  /* refresh.c */
  return(NO_ERROR);
}

d_w16()  /* d_utils.c */
{
  getBits(bg-flag);  /* bit_buf.c */
  if (background-hit) {
    image16x16move(Y);  /* imove.s30 */
    image8x8move(U);  /* imove.s30 */
    image8x8move(V);
    return(DONE);
  }
  getBits(motion-flag);  /* bit_buf.c */
  if (motion-hit) {
    dehuff(cache-entry);  /* huff.c */
    Get cache-content;
    iswapTwo(update-content);  /* cache.c */
    if (not-cache-hit) {
      getBits(motion-vector);
      iupdateLRUCache(new-motion-vector);  /* cache.c */
    }
    getBits(color-motion-flag);  /* bit_buf.c */
    if (needed) {
      getBits(U-motion-vector);  /* bit_buf.c */
      getBits(V-motion-vector);
    }
    image8x8move(U);  /* imove.s30 */
    image8x8move(V);  /* imove.s30 */
    return(DONE);
  } /* if motion-hit */
  return(NOT-DONE);
} /* d_w16 */
d_w8()  /* d_utils.c */
{
  getBits(background-flag);  /* bit_buf.c */
  if (background-hit) {
    image8x8move(Y);  /* imove.s30 */
    image4x4move(U and V);  /* imove.s30 */
    return(1);
  }
  hitFlag = getBits(motion-flag);  /* bit_buf.c */
  if (motion-hit) {
    dehuff(cache-entry);  /* huff.c */
    Get cache-content;
    iswapTwo();  /* cache.c */
    if (motion-cache-miss)
      getBits(motion-vector);  /* bit_buf.c */
    getBits(color-motion-flag);
    if (needed) {
      getBits(U-motion-vector);
      getBits(V-motion-vector);
    }
    image4x4move(U and V);  /* imove.s30 */
    image8x8move(Y);  /* imove.s30 */
    return(1);
  } /* if motion-hit */
  getBits(color-std-flag);
  if (color-std-coded) {
    getBits(2-dim-vq-address);
    chrominanceMean4x4(U and V);  /* cmean.s30 */
  }
  else {
    getBits(16-dim-vq-address);  /* bit_buf.c */
    vector16move(U);  /* vmove.s30 */
    getBits(16-dim-vq-address);
    vector16move(V);
  }
  return(0);
} /* d_w8 */
d_w4()  /* d_utils.c */
{
  if (!SET_AWS) {
    getBits(motion-flag);  /* bit_buf.c */
    if (motion-hit) {
      dehuff(cache-entry);  /* huff.c */
      Get cache-content;
      iswapTwo();  /* cache.c */
      if (cache-miss) {
        getBits(motion-vector);
        iupdateLRUCache(new-motion-vector);  /* cache.c */
      }
      image4x4move(Y);  /* imove.s30 */
      return(DONE);
    } /* if motion-hit */
  } /* if !SET_AWS */
  else {
    getBits(aws-flag);  /* bit_buf.c */
    if (aws-hit) {
      getBits(aws-entry);  /* bit_buf.c */
      Convert aws-entry to motion vector;
      image4x4move(Y);  /* imove.s30 */
      return(DONE);
    }
  } /* SET_AWS */
  dehuff(mean-entry);  /* huff.c */
  Reconstruct mean;
  getBits(std-flag);  /* bit_buf.c */
  if (not-std-coded) {
    getBits(vq-address);
    luminanceMeanRemovedVQtoImage();  /* lmrvq.s30 */
  } /* if hitFlag == !HIT_SBG4 */
  else {
    luminanceMean4x4();  /* cmean.s30 */
  }
  return(DONE);
} /* d_w4 */
Still Image decoder

d_stack_row()
{
  for (each macro-row) {
    for (each 4x4 block in a macro-row) {
      dehuff(cache-entry);  /* huff.c */
      Get the cache-content;
      if (cache-hit) {
        vector16move(Y);  /* vmove.s30 */
        iswapTwo(Y);  /* cache.c */
      }
      else {
        iswapTwo(Y);  /* cache.c */
        dehuff(block-mean);  /* huff.c */
        Reconstruct mean;
        getBits(vq-address);  /* bit_buf.c */
        luminanceMeanRemovedVQtoVector(Y);  /* lmrvq.s30 */
        vector16move(Y);  /* vmove.s30 */
        iswapTwo(new-vq-block);
      }
      if (every-four-Y-block) {
        dehuff(U-cache-entry);
        Get the cache-content;
        if (cache-hit) {
          vector16move(U);
          iswapTwo(U);
        }
        else {
          iswapTwo(U);
          getBits(U-vq-address);
          vector16move(U);
          iswapTwo(new-U-vq-block);
        }
        dehuff(V-cache-entry);
        Get the cache-content;
        if (cache-hit) {
          vector16move(V);
          iswapTwo(V);
        }
        else {
          iswapTwo(V);
          getBits(V-vq-address);
          vector16move(V);
          iswapTwo(new-V-vq-block);
        }
      } /* if (every-four-Y-block) */
    } /* each 4x4 block */
  } /* for each macro-row */
} /* d_stack_row */
Decoder refresh routines

refresh_decode()  /* refresh.c */
{
  for (given-number-of-refresh-blocks) {
    if (scalar-refresh)
      decode_abs_scalar_refresh();  /* refresh.c */
    else
      decode_abs_vq_refresh();  /* refresh.c */
    Update indices to make sure that all blocks in the image will be
      refreshed sequentially;
  } /* for loop */
} /* refresh_decode */

decode_abs_vq_refresh()  /* refresh.c */
{
  for (four 4x4 blocks within an 8x8 block) {
    getBits(y-mean);  /* bit_buf.c */
    getBits(std-flag);
    if (std-hit)
      luminanceMean4x4(Y);  /* cmean.s30 */
    else {
      getBits(vq-address);
      luminanceMeanRemovedVQtoImage(Y);  /* lmrvq.s30 */
    }
  }
  getBits(color-std-flag);  /* bit_buf.c */
  if (color-std-hit) {
    getBits(2-dim-vq-address);
    chrominanceMean4x4(U and V);  /* cmean.s30 */
  }
  else {
    getBits(U-vq-address);  /* bit_buf.c */
    vector16move(U);  /* vmove.s30 */
    getBits(V-vq-address);
    vector16move(V);
  }
} /* decode_abs_vq_refresh */

decode_abs_scalar_refresh()  /* refresh.c */
{
  Get 16 absolute coded pixel values into word for Y;
  Copy pixels into Y-coded image;
  Get 4 absolute coded pixel values into word for U;
  Copy pixels into U-coded image;
  Get 4 absolute coded pixel values into word for V;
  Copy pixels into V-coded image;
} /* decode_abs_scalar_refresh */
// ............................................................
// Function: Carry Select Adder. Adds two numbers (A + B) and
//           returns result Sum.
// ............................................................
int CarrySelect( int Addend1, int Addend2, int CarryIn )
{
    // Declare variables and arrays
    int A[10], B[10], S[10], i;
    int C2, C_40, C_41, C_60, C_61, C_80, C_81;
    int S_50, S_51, C6, C_90, C_91;
    int Sum;

    // Create single bit numbers by shifting and masking
    for (i = 0; i <= 9; i++)
    {
        A[i] = ( Addend1 >> i ) & 0x1;
        B[i] = ( Addend2 >> i ) & 0x1;
    }

    // Calculate first level of logic for Carry Select topology
    S[0] = A[0]^B[0]^CarryIn;    // A exor B exor CarryIn
    S[1] = ( ( A[0]&B[0] ) | ( CarryIn&( A[0]|B[0] ) ) )^A[1]^B[1];
    C2   = ( A[1]&B[1] ) | ( ( ( A[0]&B[0] ) | ( CarryIn&( A[0]|B[0] ) ) ) & ( A[1]|B[1] ) );

    C_40 = ( A[3]&B[3] ) | ( ( A[2]&B[2] )&( A[3]|B[3] ) );
    C_41 = ( A[3]&B[3] ) | ( ( A[2]|B[2] )&( A[3]|B[3] ) );

    C_60 = ( A[5]&B[5] ) | ( ( A[4]&B[4] )&( A[5]|B[5] ) );
    C_61 = ( A[5]&B[5] ) | ( ( A[4]|B[4] )&( A[5]|B[5] ) );

    C_80 = ( A[7]&B[7] ) | ( ( A[6]&B[6] )&( A[7]|B[7] ) );
    C_81 = ( A[7]&B[7] ) | ( ( A[6]|B[6] )&( A[7]|B[7] ) );

    // Calculate Second level of logic
    S[2]  = A[2]^B[2]^C2;
    S[3]  = ( ( A[2]&B[2] ) | ( C2 & ( A[2]|B[2] ) ) )^A[3]^B[3];
    S[4]  = ( ~C2 & ( A[4]^B[4]^C_40 ) ) | ( C2 & ( A[4]^B[4]^C_41 ) );
    S_50  = ( ( A[4]&B[4] ) | ( C_40 & ( A[4]|B[4] ) ) )^A[5]^B[5];
    S_51  = ( ( A[4]&B[4] ) | ( C_41 & ( A[4]|B[4] ) ) )^A[5]^B[5];
    C6    = ( C_60 & ~( ( ~C2&C_40 ) | ( C2&C_41 ) ) ) | ( C_61 & ( ( ~C2&C_40 ) | ( C2&C_41 ) ) );
    C_90  = ( A[8]&B[8] ) | ( C_80&( A[8]|B[8] ) );
    C_91  = ( A[8]&B[8] ) | ( C_81&( A[8]|B[8] ) );

    // Calculate Third level of logic
    S[5] = ( ~C2&S_50 ) | ( C2&S_51 );
    S[6] = C6^A[6]^B[6];
    S[7] = ( ( A[6]&B[6] ) | ( C6&( A[6]|B[6] ) ) )^A[7]^B[7];
    S[8] = ( ~C6&( A[8]^B[8]^C_80 ) ) | ( C6&( A[8]^B[8]^C_81 ) );
    S[9] = ( ~C6&( A[9]^B[9]^C_90 ) ) | ( C6&( A[9]^B[9]^C_91 ) );

    // Create Sum, shift bit from lsb to appropriate position and or
    for ( Sum = 0, i = 9; i >= 0; i-- )
        Sum |= S[i] << i;
    return Sum;
}
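As a quick sanity check (not part of the patent listing), the carry-select result can be compared against ordinary integer addition over the 10-bit operand range implied by the arrays above. The harness below is illustrative and assumes the CarrySelect() definition above is available in the same translation unit.

#include <stdio.h>

int CarrySelect(int Addend1, int Addend2, int CarryIn);   /* defined above */

// Hypothetical test harness: spot-checks CarrySelect() against (a + b + cin)
// truncated to 10 bits, which is what the adder produces.
int main(void)
{
    int a, b, cin, errors = 0;
    for (a = 0; a < 1024; a += 37)          // sample the 10-bit range
        for (b = 0; b < 1024; b += 41)
            for (cin = 0; cin <= 1; cin++)
                if (CarrySelect(a, b, cin) != ((a + b + cin) & 0x3FF))
                    errors++;
    printf("mismatches: %d\n", errors);
    return 0;
}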
Appendix 6:2 // ............................................................
// Function: Carry Save Adder. Adds three numbers (A + B + D)
// and returns a sum and carry.
7/ ............................................................
void CarrySave( int Addend 1, int Addend2, int Addend3, int *pSum, int •pCarry )
{
// Declare variables and arrays
int A[10], B[10J, D[10], S, C, i; int Sum=0, Carry -0;
// Create single bit numbers by shifting and masking for (i= •0; i< =»9; i++) {
A[i] = ( Addend 1 > > i ) & Oxl B[i] - ( Addend2 > > i ) & Oxl D[i] = ( Addend3 > > i ) & Oxl
// Calculate sum and carry
S - [i]*BTirD[i]; Sum
C - ( A[i]&B[i] ) I ( D[i]&( A[i] I B[i] ) ); // Carry
// Create Sum and Carry, shift bit from lsb to appropriate position and or Sum t - S < < i;
Carry | « C < < i;
}
*pSum * Sum;
•pCarry ■ Carry;
Appendix 6:3
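A hedged illustration, not from the patent listing, of how the two CarrySave() outputs combine: because the carry produced at bit i is stored in bit i of Carry, a following adder must add the carry word shifted left by one, so that A + B + D equals Sum + (Carry << 1). This is the role the final full adder plays after the carry save adders in the color space conversion datapath described in the claims.

#include <stdio.h>

void CarrySave(int Addend1, int Addend2, int Addend3,
               int *pSum, int *pCarry);               /* defined above */

// Combine the redundant (Sum, Carry) representation into a single result.
int main(void)
{
    int sum, carry;
    CarrySave(300, 211, 77, &sum, &carry);   // arbitrary sample operands
    printf("%d\n", sum + (carry << 1));      // prints 588 = 300 + 211 + 77
    return 0;
}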

Claims

We claim:
1. A method for generating a compressed video signal, comprising the steps of: converting an input signal into a predetermined digital format; transferring said digital format image signal to at least one encoder processor;
applying, at said at least one encoder processor, a hierarchical vector quantization compression algorithm to said digital image signal; and collecting a resultant encoded bit stream generated by said application of said algorithm, wherein said predetermined digital format comprises a digital luminance component and a digital chrominance component.
2. A method as claimed in claim 1 , wherein the hierarchical vector quantization compression algorithm is applied independently to said luminance component and said chrominance component.
3. A method as claimed in claim 1, wherein said step of applying, at said at least one encoder processor, said hierarchical vector quantization compression algorithm comprises said steps of: encoding said luminance component with reference to a first motion vector cache; and encoding said chrominance component with reference to a second motion vector cache.
4. A method as claimed in claim 1, wherein said step of applying, at said at least one encoder processor, said hierarchical vector quantization compression algorithm comprises said steps of:
dividing said chrominance component into a plurality of image blocks; calculating a mean value for at least one of said plurality of image blocks; and storing each calculated mean value in a scalar cache.
5. A method as claimed in claim 1, wherein said step of applying, at said at least one encoder processor, said hierarchical vector quantization compression algorithm comprises said steps of: dividing said luminance component into a plurality of image blocks; calculating a mean value for at least one of said plurality of image blocks; and storing each calculated mean value in a scalar cache.
6. A method as claimed in claim 1, wherein said step of applying, at said at least one encoder processor, said hierarchical vector quantization compression algorithm comprises said steps of: dividing said chrominance component into a plurality of image blocks; calculating a chrominance mean value for at least one of said plurality of image blocks; storing each calculated chrominance mean value in a scalar cache; dividing said luminance component into a plurality of image blocks; calculating a luminance mean value for at least one of said plurality of image blocks; and storing each calculated luminance mean value in a scalar cache.
7. A method as claimed in claim 1, further comprising the step of encoding a mean value of an image block with a mean value in said scalar cache.
8. A method as claimed in claim 1 , further comprising the steps of: providing a first vector quantization cache having a plurality of entries for encoding said luminance component; during said step of applying said algorithm, comparing said luminance component to an entry from said first vector quantization cache.
9. A method as claimed in claim 8, further comprising the steps of: providing a second vector quantization cache having a plurality of entries for encoding said chrominance component; during said step of applying said algorithm, comparing said chrominance component to an entry from said second vector quantization cache.
10. A method as claimed in claim 1, further comprising the steps of: providing a first vector quantization cache having a plurality of entries for encoding said luminance component; and providing a second vector quantization cache having a plurality of entries for encoding said chrominance component; during said step of applying said algorithm, comparing said luminance component to an entry from said first vector quantization cache and comparing said chrominance component to an entry from said second vector quantization cache.
11. A method as claimed in claim 1 , further comprising the step of providing a two-dimensional mean vector quantization cache for encoding said chrominance component of said predetermined digital format.
12. Apparatus for color space conversion of video data from YUV format to RGB format comprising in combination:
a plurality of high speed adders for receiving an input signal corresponding to video data in said YUV format and converting said input signal to an output signal corresponding to said RGB format; and a plurality of decoders for sensing if said output signal exceeds a predetermined RGB data range.
13. Apparatus as in claim 12, wherein said plurality of high speed adders are
carry save adders.
14. Apparatus as in claim 12, wherein said plurality of high speed adders are carry select adders.
15. A state machine comprising the apparatus of claim 12, said state machine comprising a programmable memoryless state machine for concurrently generating a plurality of output signals corresponding to said RGB format in response to a plurality of input signals corresponding to said YUV format.
16. Apparatus as in claim 12 further comprising in combination a programmable memoryless state machine for concurrently generating a plurality of output signals corresponding to said RGB format in response to a plurality of input signals corresponding to said YUV format.
17. Apparatus as in claim 12 further comprising in combination an EPROM containing a programmable look up table for converting an output signal of a first bit resolution corresponding to said RGB format output to a modified output signal of a second bit resolution.
18. Apparatus as in claim 12 further comprising in combination a field programmable gate array containing a programmable look up table for converting an output signal of a first bit resolution corresponding to said RGB format output to a modified output signal of a second bit resolution.
19. Apparatus for color space conversion of video data from a YUV format input to RGB format comprising in combination: a first carry save adder coupled to a first full adder, said first full adder generating an R component of said RGB format output in response to said YUV format input; a second carry save adder coupled to a second full adder, said second full adder generating a B component of said RGB format output in response to said YUV format input; and a third carry save adder coupled to a third full adder, said third full adder generating a G component of said RGB format output in response to said YUV format input.
20. Apparatus as in claim 19 further comprising in combination a fourth carry save adder coupled to a fifth carry save adder, said fifth carry save adder coupled to said third carry save adder.
21. Apparatus as in claim 20 further comprising in combination: a first detector coupled to said first full adder for detecting if said R component is within a first predetermined range; a second detector coupled to said second full adder for detecting if said B component is within a second predetermined range; and a third detector coupled to said third full adder for detecting if said G component is within a third predetermined range.
22. Apparatus as in claim 19 further comprising in combination an
EPROM containing a programmable look up table for converting an output signal of a first bit resolution corresponding to said RGB format output to a modified output signal of a second bit resolution.
23. Apparatus as in claim 19 further comprising in combination a field programmable gate array or application specific integrated circuit containing a programmable look up table for converting an output signal of a first bit resolution corresponding to said RGB format output to a modified output signal of a second bit resolution.
24. A state machine comprising the apparatus of claim 19, said state machine comprising a programmable memoryless state machine for concurrently generating a plurality of output signals corresponding to said RGB format in response to a plurality of input signals corresponding to said YUV format.
25. Apparatus as in claim 19 further comprising a programmable memoryless state machine for concurrently generating a plurality of output signals corresponding to said RGB format in response to a plurality of input signals corresponding to said YUV format.
26. Apparatus for color space conversion of video data from a YUV format input to RGB output comprising in combination:
a first carry select adder coupled to a first full adder, said first full adder generating an R component of said RGB format output in response to said YUV format input; a second carry select adder coupled to a second full adder, said second full adder generating a B component of said RGB format output in response to said YUV format input; and
a third carry select adder coupled to a fourth carry select adder, said fourth carry select adder coupled with a fifth carry select adder, said fifth carry select adder coupled to a second full adder, said second full adder generating a G component of said RGB format output in response to said YUV format input.
27. Apparatus as in claim 26 further comprising in combination an EPROM containing a programmable look up table for converting an output signal of a first bit resolution corresponding to said RGB format output to a modified output signal of a second bit resolution.
28. Apparatus as in claim 26 further comprising in combination a field programmable gate array or application specific integrated circuit containing a programmable look up table for converting an output signal of a first bit resolution corresponding to said RGB format output to a modified output signal of a second bit resolution.
29. A state machine comprising the apparatus of claim 28, said state machine comprising a programmable memoryless state machine for concurrently generating a plurality of output signals corresponding to said RGB format in response to a plurality of input signals corresponding to said YUV format.
30. Apparatus as in claim 28 further comprising a programmable memoryless state machine for concurrently generating a plurality of output signals corresponding to said RGB format in response to a plurality of input signals corresponding to said YUV format.
31. A method for generating a compressed video signal, comprising the steps of: converting an input image signal into a predetermined digital format; transferring said digital format image signal to at least one encoder processor; applying, at said at least one encoder processor, a hierarchical vector quantization compression algorithm to said digitized image signal; collecting a resultant encoded bit stream generated by said application of said algorithm; storing said digital format signal corresponding to said converted input image signal; converting a second input image signal into said predetermined digital format; calculating a difference between said stored input image signal and said second input image signal; and comparing said difference to a threshold; wherein said stored input image signal comprises a first plurality of image blocks, and said second input image signal comprises a second plurality of image blocks.
32. A method as claimed in claim 31, further comprising the steps of: searching a predetermined set of said first plurality of image blocks; selecting a best fit block from said predetermined set; comparing a difference between a current block from said plurality of image blocks and said best fit block to a threshold; encoding said current block as said best fit block if said difference satisfies said threshold; and otherwise, reducing a neighborhood of search and repeating some of said steps for said predetermined set of image blocks in said reduced neighborhood.
33. A method as claimed in claim 31 , wherein said step of applying, at said at least one encoder processor, a hierarchical vector quantization compression algorithm of said digitized image signal comprises said steps of: providing a motion vector cache; initially storing a set of most probable motion vectors of said motion vector cache; and comparing a distortion measure calculated using an entry from said motion vector cache to a threshold.
34. A method as claimed in claim 33, wherein said step of comparing said distortion measure to said threshold is repeated using a different entry when said distortion measure exceeds said threshold.
35. A method for generating a compressed video signal, comprising the steps of: converting an input signal into a predetermined digital format; transferring said digital format image signal to at least one encoder processor;
applying, at said at least one encoder processor, a hierarchical vector quantization compression algorithm to said digitized image signal; and collecting a resultant encoded bit stream generated by said application of said algorithm, wherein said resultant encoded bit stream is collected in a buffer coupled to said at least one encoder processor; transferring said resultant encoded bit stream from said buffer to a control processor when a predetermined portion of said input image signal is encoded; providing feedback to said at least one encoder processor, said feedback indicating fullness of said buffer; adjusting a threshold range for a threshold that is applied by said at least one encoder processor, wherein said threshold range is periodically adjusted in accordance with said feedback.
36. A method as claimed in claim 35, wherein said threshold range is adjusted on a frame by frame basis.
37. A method as claimed in claim 35, wherein said threshold range is adjusted on a macro row by macro row basis.
38. A method as claimed in claim 35, wherein an amount by which said threshold range is adjusted is determined in accordance with a mean squared error calculation or a mean absolute error calculation.
39. A method as claimed in claim 35, wherein an amount by which said threshold range is adjusted is determined in accordance with a bit stream length for a previously encoded image frame.
40. A method as claimed in claim 35, further comprising the step of adjusting a threshold applied by said at least one encoder processor in accordance with said feedback, wherein said step of adjusting said threshold is performed for each of a plurality of image blocks.
41. A method as claimed in claim 35, wherein said hierarchical vector quantization compression algorithm comprises a plurality of image encoding levels, each level having a corresponding image block size.
42. A method as claimed in claim 41, wherein said algorithm further comprises said steps of: determining a distortion measure of an image area resulting from encoding said image area at a level from said plurality of image encoding levels; and determining a length of said resultant encoded bit stream resulting from encoding said image area at said level.
43. A method as claimed in claim 42, further comprising the step of encoding said image area at a level having a sufficient length-to-distortion measure.
44. A method as in claim 35 wherein said threshold range is adjusted on a macro block by macro block basis.
45. An apparatus for encoding an image signal, comprising: an acquisition module disposed to receive said image signal; a first processor coupled to said acquisition module; and at least one encoder processor coupled to said first processor, wherein said at least one encoder processor produces an encoded signal under control of said first processor.
46. An apparatus as claimed in claim 45, wherein said acquisition module comprises means for converting an analog signal into a digital signal.
47. An apparatus as claimed in claim 46, wherein said digital signal is in a digital YUV format.
48. An apparatus as claimed in claim 47, wherein a U component and a V component of said YUV formatted digital signal are subsampled by a factor of four relative to a Y component of said YUV formatted digital signal.
49. An apparatus as claimed in claim 47, further comprising means for converting said YUV formatted digital signal into an RGB formatted signal, said converting means being coupled to said first processor.
50. An apparatus as claimed in claim 47, further comprising a display device coupled to said converting means.
51. An apparatus as claimed in claim 45, wherein said at least one encoder processor comprises two encoder processors coupled in parallel to said first processor.
52. An apparatus as claimed in claim 51, wherein said two encoder processors apply a hierarchical vector quantization compression algorithm to said image signal.
53. An apparatus as claimed in claim 52, wherein said two encoder processors comprise digital signal processors.
54. An apparatus as claimed in claim 45, further comprising a data storage device coupled between said first processor and said at least one encoder processor.
55. An apparatus as claimed in claim 54, wherein said data storage comprises a direct memory access interface.
56. An apparatus as claimed in claim 45, wherein said first processor performs at least one of video encoding, data acquisition and control.
57. An apparatus as claimed in claim 51 , further comprising means for spatially
partitioning said image signal among said two encoder processors.
58. An apparatus as claimed in claim 57, wherein said partitioning means is responsive to said encoded image signal.
59. An apparatus as claimed in claim 57, wherein said two encoder processors receive unequal partitions of said image signal.
60. An apparatus as claimed in claim 57, wherein said partitioning means comprises means for calculating a mean squared error or mean absolute error of a portion of said image signal.
61. An apparatus as claimed in claim 48, wherein said at least one encoder processor comprises means for detecting a scene change.
62. An apparatus as claimed in claim 61, wherein said detecting means calculates a difference between an incoming image signal and a previously coded image signal.
63. An apparatus as claimed in claim 62, wherein said detecting means compares said calculated difference to a threshold.
64. An apparatus as claimed in claim 63, wherein said detecting means calculates a mean squared error between said Y component of said incoming image signal and said Y component of said previously coded image signal.
65. An apparatus as claimed in claim 63, wherein said at least one encoder encodes said incoming signal as a still image when said difference exceeds said threshold.
66. An apparatus as claimed in claims 52 or 65, further comprising a memory structure containing a vector quantization codebook, said memory structure being coupled to said at least one encoder.
67. An apparatus as claimed in claim 45, further comprising a buffer coupled between said at least one encoder processor and said first processor.
68. An apparatus as claimed in claim 27, wherein said encoded image signal produced by said at least one encoder processor accumulates in said buffer for transmission to said first processor.
69. An apparatus as claimed in claim 67, further comprising a mean squared error accelerator coupled to said at least one encoder processor.
70. An apparatus as claimed in claim 69, wherein said mean squared error accelerator comprises a field programmable gate array or an application specific integrated
circuit.
71. An apparatus as claimed in claim 67, further comprising a mean absolute error accelerator coupled to said at least one encoder processor.
72. A method for generating a compressed video signal, comprising the steps of: converting an input image signal into a predetermined digital format; transferring said digital format image signal to at least one encoder processor; applying, at said at least one encoder processor, a hierarchical vector quantization compression algorithm to said digitized image signal; and
collecting a resultant encoded bit stream generated by said application of said algorithm.
73. A method as claimed in claim 72, further comprising the step of converting
said resultant encoded bit stream into a digital signal having a RGB format.
74. A method as claimed in claim 72, wherein said predetermined digital format comprises data for a plurality of macro rows.
75. A method as claimed in claim 74, wherein said step of applying, at said at least one encoder processor, said hierarchical vector quantization compression algorithm to said digitized image signal comprises sequentially applying said algorithm to each macro row of
said plurality of macro rows.
76. A method as claimed in claim 71, wherein said digital image signal is transferred to a memory that is accessible to more than one encoder processor.
77. A method as claimed in claim 72, wherein said step of applying said hierarchical vector quantization compression algorithm further comprises said steps of: applying, at a first encoder processor, said algorithm to a first portion of said digital image signal; and applying, at a second encoder processor, said algorithm to a second portion of said digital image signal.
78. A method as claimed in claim 77, wherein said first portion of said digital image signal is distinct from said second portion of said digital image signal.
79. A method as claimed in claim 78, further comprising the step of varying a size of said first portion and a size of said second portion in response to said resultant encoded bit stream or activity in the respective image parts.
80. A method as claimed in claim 79, further comprising the step of transferring a portion of said digital image signal between said first encoder processor and said second encoder processor.
81. A method as claimed in claim 80, further comprising the step of transferring mean squared error data corresponding to said portion of said digital image signal between said first encoder processor and said second encoder processor.
82. A method as claimed in claim 79, wherein said sizes are varied in accordance with a length of said resultant encoded bit stream or activity in the respective image parts.
83. A method as claimed in claim 79, wherein said sizes are varied in accordance with a flag bit within said resultant encoded bit stream.
84. A method as claimed in claim 72, further comprising the steps of: storing said digital format image signal corresponding to said converted input image signal; converting a second input image signal into said predetermined digital format; calculating a difference between said stored input image signal and said second input image signal; and comparing said difference to a threshold.
85. A method as claimed in claim 84, further comprising the step of encoding said second input image signal as a still image when said difference exceeds said threshold.
86. A method as claimed in claim 84, wherein said stored input image signal comprises a first plurality of image blocks, and said second input image signal comprises a second plurality of image blocks.
87. A method as claimed in claim 86, wherein said step of calculating said difference comprises calculating a mean squared error between an image block from said first plurality of image blocks and a corresponding image block from said second plurality of image blocks.
88. A method as claimed in claim 87, further comprising the step of estimating how many bits are used to encode each of said second plurality of image blocks based upon said calculated mean squared error.
89. A method as claimed in claim 88, further comprising the step of adjusting a threshold applied by said at least one encoder processor in accordance with said estimate.
90. A method as claimed in claim 72, wherein said resultant encoded bit stream is collected in a buffer coupled to said at least one encoder processor.
91. A method as claimed in claim 90, further comprising the step of transferring said resultant encoded bit stream from said buffer to a control processor when a predetermined portion of said input image signal is encoded.
92. A method as claimed in claim 91, further comprising the step of providing feedback to said at least one encoder processor, said feedback indicating fullness of said buffer.
93. A method as claimed in claim 92, further comprising the step of adjusting a threshold applied by said at least one encoder processor in accordance with said feedback.
94. A method as claimed in claim 72, wherein said step of applying, at said at least one encoder processor, said hierarchical vector quantization compression algorithm to
said digitized image signal comprises dividing said digitized image signal into a plurality of image blocks.
95. A method as claimed in claim 94, further comprising the steps of:
calculating a mean value for at least one of said plurality of image blocks; and storing each calculated mean value in a scalar cache.
96. A method as claimed in claim 94, wherein said steps of calculating a mean value and storing said calculated mean value in said scalar cache are repeated at each level of said hierarchical vector quantization compression algorithm.
97. A method as claimed in claim 94, further comprising the step of encoding a mean value of an image block with a mean value stored in said scalar cache.
98. A method as claimed in claim 72, further comprising the steps of calculating an overall scene change mean squared error and comparing said calculated value to a first threshold before said step of applying said hierarchical vector quantization compression algorithm to said digitized image signal.
99. A method as claimed in claim 98, further comprising the step of predicting a motion vector based upon said digitized image signal if said calculated value does not exceed said first threshold.
100. A method as claimed in claim 99, further comprising the step of encoding an image block by its mean value if said calculated value does not exceed a second threshold.
101. A method as claimed in claim 100, further comprising the step of encoding an image block by a mean-removed vector quantizer if said calculated value does not exceed a third threshold .
102. A method as claimed in claim 72, wherein said step of applying, at said at least one encoder processor, a hierarchical vector quantization compression algorithm to said digitized image signal comprises said steps of: providing a motion vector cache and a vector quantization cache; storing initially a set of most probable motion vectors in said motion vector cache and a set of most probable code vectors in said vector quantization cache; encoding said digitized image signal in accordance with said hierarchical vector quantization compression algorithm; and reinitializing said motion vector cache with said set of most probable motion vectors and reinitializing said vector quantization cache with said set of most probable code vectors.
103. A method as claimed in claim 72, wherein said motion vector cache and said vector quantization cache are periodically reinitialized.
104. A method as claimed in claim 72, further comprising the steps of: providing a decoder having a decoder motion vector cache and a decoder vector quantization cache; storing initially said set of most probable motion vectors in said decoder motion
vector cache and said set of most probable code vectors in said decoder vector quantization cache; decoding said resultant encoded bit stream; and reinitializing said decoder motion vector cache with said set of most probable
motion vectors and reinitializing said decoder vector quantization cache with said set of most probable code vectors.
105. A method as claimed in claim 104, wherein said decoder motion vector cache
and said decoder vector quantization cache are reinitialized periodically during decoding.
106. An apparatus as in claim 71 , wherein said mean absolute error accelerator comprises a field programmable gate array or application specific integrated circuit.
107. A method as claimed in claim 72, wherein said step of applying, at said at least one encoder processor, a hierarchical vector quantization compression algorithm to said digitized image signal comprises said steps of: applying said hierarchical vector quantization compression algorithm to a plurality of image blocks corresponding to said digitized image signal, wherein an image block from said plurality of image blocks is selected to be encoded in accordance with a scanning pattern; and varying said scanning pattern when said algorithm is applied to a subsequent digitized image signal.
108. A method for compressing an image signal representing a series of image frames, comprising the steps of: a) analyzing a first image frame by:
i) computing a mean value for each of a plurality of image blocks within said first image frame; ii) storing said computed mean values in a scalar cache; iii) providing a mean value quantizer comprising a predetermined number of quantization levels arranged between a minimum mean value and a maximum mean value stored in said scalar cache, said mean value quantizer producing a quantized mean value; and iv) identifying each image block from said plurality of image blocks that is a low activity image block; b) encoding each of said low activity image blocks with its corresponding quantized mean value; and c) repeating steps a) and b) for a second frame of said image signal
109. A method as claimed in claim 108, further comprising, after step b), said step of encoding remaining image blocks using a hierarchical vector quantization compression algorithm.
110. A method of motion estimation between a current image and a reference image in a video compression system, comprising the steps, in combination, of: generating a first reference macro row, a second reference macro row, and a third reference macro row, said first, second and the third reference macro rows corresponding to interpolated image data of said reference image; storing said first reference macro row in a first memory location, said second reference macro row in a second memory location, and said third reference macro row in a third memory location; generating a first motion estimation value corresponding to a first current and a second current image macro row of said current image; copying said second reference macro row to said first memory location; copying said third reference macro row to said second memory location; generating a fourth reference macro row corresponding to said reference image; copying said fourth reference macro row to said third memory location; and generating a second motion estimation value corresponding to a third current image macro row of said current image.
111. A method as in claim 110, where said steps are repeated in combination.
112. A method as in claim 110, wherein said first, second, third and fourth reference macro rows are interpolated by a factor of two in the horizontal direction.
113. A method as in claim 111, wherein said first, second, third and fourth reference macro rows are interpolated by a factor of two in the horizontal direction.
114. A method as in claim 110, wherein said first, second, third and fourth reference macro rows are interpolated by a factor of two in the vertical direction.
115. A method as in claim 111 , wherein said first, second, third and fourth reference macro rows are interpolated by a factor of two in the vertical direction.
116. A method of motion estimation between a current image and a reference image in a video compression system, comprising the steps, in combination, of: generating a first reference macro row, a second reference macro row, and a third reference macro row, said first, second and third reference macro rows corresponding to interpolated image data of said reference image; storing said first reference macro row in a first memory location, said second reference macro row in a second memory location, and said third reference macro row in a
third memory location; generating a first motion estimation value corresponding to a first current and a second current image macro row of said current image; setting a pointer to said third memory location; copying a fourth reference macro row to said first memory location; generating a motion estimation value corresponding to a third current image macro row of said current image; setting said pointer to said first memory location; copying a fifth reference macro
row to said second memory location; generating a second motion estimation value corresponding to a fourth current image macro row of said current image; setting said pointer to said second memory location; copying a sixth reference macro row to said third memory location; generating a third motion estimation value corresponding to a fifth current image macro row of said current image.
117. A method as in claim 116, wherein said steps are repeated in combination.
118. A method as in claim 116, wherein said first, second, third, fourth, fifth and sixth reference macro rows are interpolated by a factor of two in the horizontal direction.
119. A method as in claim 117, wherein said first, second, third, fourth, fifth and sixth reference macro rows are interpolated by a factor of two in the horizontal direction.
120. A method as in claim 116, wherein said first, second, third, fourth, fifth and sixth reference macro rows are interpolated by a factor of two in the vertical direction.
121. A method as in claim 117, wherein said first, second, third, fourth, fifth and sixth reference macro rows are interpolated by a factor of two in the vertical direction.
122. A method as in claim 115, wherein said reference image comprises 96 horizontal rows of pixels.
123. A method as in claim 120, wherein said first memory location begins at the first row of said 96 horizontal rows, said second memory location begins at the 32nd row of said 96 rows, and said third memory location begins at the 64th row of said 96 rows.
124. A method of periodically enhancing a video image with at least one processor, said video image comprised of a series of frames, each of said frames including at least one block, comprising the steps of:
determining space availability of said processor; determining time availability of said processor; determining said number of bits used to encode a frame; assigning a value to a first variable corresponding to said number of bits used to encode said frame; comparing said first variable to a predetermined refresh threshold; scalar refreshing said video image if said first variable exceeds said predetermined refresh threshold; and block refreshing said video image if said first variable is less than or equal to said predetermined refresh threshold.
125. A method as in claim 124, wherein said step of determining space availability of said processor comprises said steps of: generating a fullness value corresponding to said fullness of an output buffer; and comparing said fullness value to a first predetermined value.
126. A method as in claim 125, wherein said first predetermined value corresponds to said output buffer being at least 30 percent full.
127. A method as in claim 124, wherein said step of determining time availability of said processor comprises said step of determining whether a new input frame is waiting to be processed.
128. A method as in claim 124, wherein said step of scalar refreshing said video image includes said steps of: generating a refresh instruction corresponding to said space availability, said time availability and previous refresh instructions; and
absolute scalar refreshing said video image in response to said refresh instruction.
129. A method as in claim 124, wherein said step of scalar refreshing said video image further includes said steps of: generating a refresh instruction corresponding to said space availability, said time availability and previous refresh instructions; and differential scalar refreshing said video image in response to said refresh instruction.
130. A method as in claim 124, wherein said step of block refreshing said video image further includes said steps of:
generating a refresh instruction corresponding to said space availability, said time availability and previous refresh instructions; and absolute block refreshing said video image in response to said refresh instruction.
131. A method as in claim 124, wherein said step of block refreshing said video image further includes said steps of:
generating a refresh instruction corresponding to said space availability, said time availability and previous refresh instructions; and
differential block refreshing said video image in response to said refresh instruction.
132. A method of refreshing a plurality of video images wherein said method is selected from said group consisting of: block refreshing at least one of said images; and pixel refreshing at least one of said images.
133. A method as in claim 132, wherein said step of block refreshing is selected from said group consisting of: absolute block refreshing at least one of said images; and differential block refreshing at least one of said images.
134. A method as in claim 133, wherein said step of pixel refreshing is selected from said group consisting of: absolute scalar refreshing at least one of said images; and differential scalar refreshing at least one of said images.
135. A method for transmitting data from an encoder to a decoder comprising in combination: encoding said data into a plurality of macro rows; transmitting said macro rows from said encoder to said decoder; and decoding said macro rows at said decoder.
136. A method as in claim 135, wherein said data is selected from said group comprising video data, voice data and audio data.
137. A method as in claim 135, wherein said data is video data and voice data.
138. A method as in claim 135, wherein said data is video data and audio data.
139. A method as in claim 135, wherein said data is video data and a second data type selected from said group consisting of voice and audio data.
140. A method as in claim 135, wherein said step of transmitting comprises said steps of generating a plurality of data packets corresponding to said macro rows, each of said packets containing at least one macro row.
141. A method as in claim 140, further comprising the steps of: transmitting said data packet to said decoder and aligning said data packet along a predetermined location; and determining at said decoder whether said data packet is aligned along said predetermined location and generating a responsive error condition.
142. A method as in claim 141 , wherein said predetermined location is a byte boundary.
143. A method as in claim 141 , wherein said error condition is selected from said group consisting of invalid macro row address, duplicated macro row address, extra bits left over after macro row, too few bits for decoding a macro row, invalid VQ address, video data mixed with still data, refresh data mixed with still data, invalid motion vector, and refresh data with video data.
144. A method as in claim 141, further comprising the step of concealing errors in response to said error condition.
145. A method as in claim 144, wherein said step of concealing errors comprises conditionally replenishing macro rows.
146. A method as in claim 141, wherein a macro row packet is ignored and said next macro row packet is decoded in response to an invalid macro row address error condition.
147. A method as in claim 142, further comprising the steps of, in response to a duplicated macro row address: determining whether said duplicated macro row data is for said next frame or said current frame using said relative temporal reference; ignoring said macro row packet and decoding said next macro row packet if said duplicated macro row is for said current frame; and performing error concealment on said current frame and transmitting said image to a YUV to RGB convertor and decoding said next frame if said duplicate macro row is for said next frame.
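Claim 147 hands the concealed frame to a YUV to RGB converter for display. For reference, an editorial sketch of a conventional BT.601-style fixed-point conversion follows; the integer constants are the usual textbook values, not something specified by the claims.

```c
/*
 * Standard BT.601-style YUV -> RGB conversion, fixed point.
 * Included purely as background for the display path mentioned in claim 147.
 */
#include <stdint.h>

static uint8_t clamp_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                uint8_t *r, uint8_t *g, uint8_t *b)
{
    int c = y - 16, d = u - 128, e = v - 128;
    *r = clamp_u8((298 * c + 409 * e + 128) >> 8);
    *g = clamp_u8((298 * c - 100 * d - 208 * e + 128) >> 8);
    *b = clamp_u8((298 * c + 516 * d + 128) >> 8);
}
```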
148. A method as in claim 141, wherein said determining step comprises analysis with a relative reference.
149. A method as in claim 141, wherein said macro row is overwritten with a PCI macro row in response to an extra bits left over after macro row error condition.
150. A method as in claim 141, wherein said macro row is overwritten with a PCI macro row in response to a too few bits for decoding a macro row error condition.
151. A method as in claim 141, wherein said macro row is ignored in response to a refresh data mixed with still data error condition.
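The decoder-side reactions recited in claims 144-151 can be summarized, for illustration only, as a dispatch over the error conditions of claim 143: skip the offending packet, fall back to the previously decoded macro row, or conceal and advance to the next frame. The enum names below are assumptions; the sketch does not reproduce the patented procedure.

```c
/*
 * Hedged sketch of decoder error handling for damaged macro row packets.
 * Enum names and concealment choices are illustrative assumptions.
 */
typedef enum {
    ERR_NONE,
    ERR_INVALID_MACRO_ROW_ADDRESS,
    ERR_DUPLICATE_MACRO_ROW_ADDRESS,
    ERR_EXTRA_BITS_AFTER_MACRO_ROW,
    ERR_TOO_FEW_BITS_FOR_MACRO_ROW,
    ERR_REFRESH_MIXED_WITH_STILL
} row_error_t;

void handle_macro_row_error(row_error_t err, int row_is_for_next_frame)
{
    switch (err) {
    case ERR_INVALID_MACRO_ROW_ADDRESS:
        /* Ignore this packet; resume with the next macro row packet. */
        break;
    case ERR_DUPLICATE_MACRO_ROW_ADDRESS:
        if (row_is_for_next_frame) {
            /* Conceal the current frame, pass it on for display
             * conversion, then begin decoding the next frame. */
        }
        /* Otherwise drop the duplicate and keep decoding. */
        break;
    case ERR_EXTRA_BITS_AFTER_MACRO_ROW:
    case ERR_TOO_FEW_BITS_FOR_MACRO_ROW:
        /* Overwrite the damaged row with the previously decoded row
         * (conditional replenishment used as concealment). */
        break;
    case ERR_REFRESH_MIXED_WITH_STILL:
        /* Ignore the macro row entirely. */
        break;
    default:
        break;
    }
}
```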
152. A method as in claim 124, wherein said packets are of variable size.
153. An apparatus for transmitting image data comprising in combination: at least one encoder for encoding said image data into a plurality of macro rows and generating an addressing bit corresponding to each of said macro rows, said encoder generating a plurality of packets, each packet containing at least one macro row; and a decoder for receiving said packets and generating a responsive image corresponding to said addressing bit.
154. An apparatus as in claim 153, wherein said packets are of variable size.
155. A video transmission system comprising in combination: a first processor for decoding a plurality of encoded image frames; a second processor for encoding image data into said plurality of encoded image frames; and a transmission channel coupled between said first processor and said second processor for transmitting said encoded image frames to said first processor, said system skipping transmission of at least one of said encoded image frames in response to fullness of said transmission channel.
156. A video transmission system as in claim 155, wherein said image frame data includes video data and audio data.
157. A method for transmitting video data comprising the steps of: encoding said video data into a first image frame; transmitting said first image frame to a decoder along a transmission channel to a buffer; generating a first frame skip value corresponding to said fullness of said buffer; skipping transmission of video data in response to said first frame skip value; encoding said video data into a second image frame; and transmitting said second image frame to said decoder along said transmission channel.
158. A method as in claim 157, wherein said generating step comprises said steps of: generating a first buffer measurement corresponding to said number of bits provided to said buffer; and comparing said first buffer measurement to a first buffer value, said first buffer value corresponding to said size of said buffer.
159. A method as in claim 157, wherein said generating step comprises said steps of: generating a first buffer measurement corresponding to said number of bits provided to said buffer; and comparing said first buffer measurement to a first buffer value, said first buffer value corresponding to an average of allocated bit allotment space for an image frame.
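For orientation, a rough sketch of the frame-skip decision of claims 157-159 follows: a running measurement of bits pushed into the channel buffer is compared against a threshold (the buffer size, or the average per-frame bit budget), and the next frame is skipped when transmitting it would overflow. The thresholding policy and names are assumptions.

```c
/*
 * Illustrative rate-control sketch: skip encoding a frame when the channel
 * buffer is too full to accept another average-sized frame.
 */
typedef struct {
    unsigned long bits_in_buffer;   /* bits queued but not yet transmitted */
    unsigned long buffer_capacity;  /* total buffer size in bits           */
    unsigned long avg_frame_budget; /* average bit allotment per frame     */
} rate_state_t;

/* Returns nonzero if the encoder should skip the next frame. */
int should_skip_frame(const rate_state_t *rs)
{
    return rs->bits_in_buffer + rs->avg_frame_budget > rs->buffer_capacity;
}

/* Update the occupancy after a frame is encoded and the channel drains bits. */
void after_frame_encoded(rate_state_t *rs, unsigned long frame_bits,
                         unsigned long bits_drained_by_channel)
{
    rs->bits_in_buffer += frame_bits;
    if (rs->bits_in_buffer > bits_drained_by_channel)
        rs->bits_in_buffer -= bits_drained_by_channel;
    else
        rs->bits_in_buffer = 0;
}
```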
160. In a video conferencing system having a camera, an image processor and an image display, an apparatus for electronically panning an image scene, said apparatus comprising: an image acquisition module disposed to receive an image signal from said camera, said image acquisition module producing a digital representation of said image signal; means for transmitting a portion of said digital representation of said image signal to said image processor, said transmitting means being coupled between said image acquisition module and said image processor; and an image pan selection device coupled to said transmitting means, wherein said image pan selection device is operable to change said portion of said digital representation of said image signal that is transmitted to said image processor.
161. An apparatus as claimed in claim 160, wherein said transmitting means comprises a field programmable gate array, which is coupled to said image acquisition module.
162. An apparatus as claimed in claim 161, wherein said portion of said digital representation of said image signal corresponds to an N x M pixel region of said image signal, said region having a predetermined horizontal offset and a predetermined vertical offset.
163. An apparatus as claimed in claim 161, wherein said image pan selection device changes said portion of said digital representation of said image signal that is transmitted to said image processor by signaling said field programmable gate array to adjust at least one of said horizontal offset and said vertical offset.
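Claims 160-163 describe electronic panning: only an N x M window of the full camera image, at a selectable horizontal and vertical offset, is forwarded to the image processor. In hardware this gating would typically sit in the field programmable gate array; the copy loop below is an editorial software analogue with all dimensions and names assumed.

```c
/*
 * Software analogue of electronic panning: copy an N x M window at
 * (h_offset, v_offset) out of the full image.  Purely illustrative.
 */
#include <stdint.h>

void pan_window(const uint8_t *full_image, int full_width,
                uint8_t *window, int win_width, int win_height,
                int h_offset, int v_offset)
{
    for (int row = 0; row < win_height; row++) {
        const uint8_t *src = full_image + (v_offset + row) * full_width + h_offset;
        uint8_t *dst = window + row * win_width;
        for (int col = 0; col < win_width; col++)
            dst[col] = src[col];
    }
}
```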
164. In a video conferencing system having a camera, an image processor and an image display, an apparatus for performing an electronic zoom on an image scene, said apparatus comprising:
an image acquisition module disposed to receive an image signal from said camera, said image acquisition module producing a digital representation of said image signal; means for transmitting a portion of said digital representation of said image signal to said image processor, said transmitting means being coupled between said image acquisition module and said image processor; and an image zoom selection device coupled to said transmitting means, wherein said image zoom selection device is operable to change said portion of said digital representation of said image signal that is transmitted to said image processor.
165. An apparatus as claimed in claim 162, wherein said transmitting means comprises a sample rate converter.
166. An apparatus as claimed in claim 165, wherein said sample rate converter comprises at least one of an interpolating filter and a decimating filter.
167. An apparatus as claimed in claim 162, wherein said transmitting means comprises means for subsampling said digital representation of said image signal.
168. An apparatus as claimed in claim 164, wherein said image zoom selection device comprises a user operable control button.
169. An apparatus as claimed in claim 168, wherein said control button is bidirectional.
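Claims 164-169 cover electronic zoom by sample rate conversion or subsampling. As a simple editorial illustration, a 2:1 zoom-out by decimation is sketched below; a real design would low-pass filter before decimating, and all names and sizes here are assumed.

```c
/*
 * Illustrative 2:1 zoom-out by subsampling (decimation without prefiltering).
 */
#include <stdint.h>

void zoom_out_2to1(const uint8_t *src, int src_width, int src_height,
                   uint8_t *dst)
{
    int dst_width = src_width / 2;
    for (int y = 0; y < src_height / 2; y++)
        for (int x = 0; x < dst_width; x++)
            dst[y * dst_width + x] = src[(2 * y) * src_width + (2 * x)];
}
```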
170. A video message system for receiving, storing and playing back a video message comprising, in combination: a first modem for receiving incoming telephone calls, a compressed bit stream of video data and an audio message; a first decoder processor for decoding said compressed bit stream of video data; a first audio compression processor for processing said audio message; a first memory for storing said bit stream and said audio message; a first video display for displaying video data upon a user request; a speaker for playing said audio message; and a first control mechanism to allow a first user to play said audio message on said speaker and display said video message on said video display.
171. A video message system as in claim 170, wherein said compressed bit stream of video data comprises a plurality of image frames.
172. A video message system as in claim 170, wherein said compressed bit stream of video data corresponds to a single image frame.
173. A video message system as in claim 170, further comprising a notification mechanism for notifying said first user of said receipt of a video message.
174. A video message system as in claim 170, further comprising, in combination: a first telephone for placing outgoing telephone calls;
a first video camera for capturing a bit stream of video data corresponding to a video image; a first encoder processor for encoding a compressed bit stream of video data; and a second modem for transmitting said compressed bit stream of video data and an audio message; and a second control mechanism for capturing a video bit stream.
175. A video message system as in claim 174, wherein said video bit stream corresponds to a single image frame.
176. A video message system as in claim 175, wherein said video bit stream corresponds to a plurality of image frames.
177. An apparatus for encoding a first image signal and a second image signal, said second image signal corresponding to a video still, comprising: an acquisition module disposed to receive said first image signal and said second image signal; a first processor coupled to said acquisition module, said first processor performs at least one of video decoding, data acquisition and control, said first processor decoding said first image signal in a foreground mode and decoding said second image signal in a background mode; a display device coupled to said first processor, said display device displaying said first image signal and said second image signal; at least one encoder processor coupled to said first processor, wherein said at least one encoder processor encodes an image signal under control of said first processor; and a data storage device coupled between said first processor and said at least one encoder processor.
178. An apparatus as in claim 177, wherein said first processor is coupled to a first memory structure for processing of said first image signal and a second memory structure for processing of said second image signal.
179. An apparatus as in claim 178, further comprising a user selection device for selection of said second image signal.
180. An apparatus as in claim 179, wherein said selection device is coupled to said display device.
181. An apparatus as in claim 180, wherein said selection device is coupled to said acquisition module.
182. A method for generating a compressed video signal, comprising the steps of: converting an input image signal into a predetermined digital format; transferring said digital format image signal to at least one encoder processor; filtering said input image signal to reduce artifacts; applying, at said at least one encoder processor, a hierarchical vector quantization compression algorithm to said digitized image signal; and collecting a resultant encoded bit stream generated by said application of said algorithm.
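As editorial background for claim 182, the core idea of vector quantization is sketched below: an image block is matched against a codebook and only the index of the best-matching codevector is transmitted. The block size, codebook size and distortion measure are assumptions; the hierarchical search of the patented method is not reproduced here.

```c
/*
 * Plain full-search VQ of one image block (illustrative, not the patented
 * hierarchical search).  Emits the index of the minimum-squared-error match.
 */
#include <stdint.h>
#include <limits.h>

#define BLOCK_PIXELS  16   /* e.g. a 4x4 block */
#define CODEBOOK_SIZE 256

int vq_encode_block(const uint8_t block[BLOCK_PIXELS],
                    const uint8_t codebook[CODEBOOK_SIZE][BLOCK_PIXELS])
{
    int best_index = 0;
    long best_dist = LONG_MAX;
    for (int i = 0; i < CODEBOOK_SIZE; i++) {
        long dist = 0;
        for (int p = 0; p < BLOCK_PIXELS; p++) {
            int diff = (int)block[p] - (int)codebook[i][p];
            dist += (long)diff * diff;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best_index = i;
        }
    }
    return best_index;
}
```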
183. A method as claimed in claim 182, wherein said artifacts comprise at least one of a random camera noise, a sharp image transition and minor motion artifacts.
184. A method as claimed in claim 182, wherein said step of filtering said input image signal comprises passing said input image signal through at least one of a spatial filter, a median filter and a temporal filter.
185. A method as claimed in claim 184, wherein a filter is a variable strength filter.
186. A method as claimed in claim 185, wherein said variable strength filter varies in accordance with a ratio of predictor hits to predictor misses.
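To illustrate claim 186, the sketch below adapts the strength of a temporal filter to the ratio of predictor hits to predictor misses: when prediction is succeeding the scene is stable and heavier smoothing is safe, and when it is failing the filter is weakened to preserve motion. The mapping from the hit ratio to a blend weight is an assumption made for illustration, not the filter strength schedule of the patent.

```c
/*
 * Variable-strength temporal filter: blend weight for the previous frame
 * grows with the predictor hit ratio.  Mapping is an illustrative assumption.
 */
#include <stdint.h>

uint8_t temporal_filter(uint8_t current, uint8_t previous,
                        unsigned hits, unsigned misses)
{
    /* Map hit ratio (0..1) to a previous-frame weight of 0..1/2, in Q8. */
    unsigned total = hits + misses;
    unsigned weight_q8 = total ? (128u * hits) / total : 0;
    unsigned filtered = ((256u - weight_q8) * current + weight_q8 * previous) >> 8;
    return (uint8_t)filtered;
}
```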
PCT/US1997/013039 1996-08-09 1997-08-06 Video encoder/decoder WO1998007278A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU40455/97A AU4045597A (en) 1996-08-09 1997-08-06 Video encoder/decoder

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
US69473496A 1996-08-09 1996-08-09
US08/689,444 US5805228A (en) 1996-08-09 1996-08-09 Video encoder/decoder system
US08/689,519 1996-08-09
US08/689,456 1996-08-09
US08/689,519 US6072830A (en) 1996-08-09 1996-08-09 Method for generating a compressed video signal
US08/694,949 US5926226A (en) 1996-08-09 1996-08-09 Method for adjusting the quality of a video coder
US08/689,444 1996-08-09
US08/694,949 1996-08-09
US08/689,401 1996-08-09
US08/694,734 1996-08-09
US08/689,456 US5864681A (en) 1996-08-09 1996-08-09 Video encoder/decoder system
US08/689,401 US5831678A (en) 1996-08-09 1996-08-09 Video encoder/decoder system

Publications (2)

Publication Number Publication Date
WO1998007278A2 true WO1998007278A2 (en) 1998-02-19
WO1998007278A3 WO1998007278A3 (en) 1998-05-14

Family

ID=27560234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1997/013039 WO1998007278A2 (en) 1996-08-09 1997-08-06 Video encoder/decoder

Country Status (2)

Country Link
AU (1) AU4045597A (en)
WO (1) WO1998007278A2 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4853779A (en) * 1987-04-07 1989-08-01 Siemens Aktiengesellschaft Method for data reduction of digital image sequences
US4849810A (en) * 1987-06-02 1989-07-18 Picturetel Corporation Hierarchial encoding method and apparatus for efficiently communicating image sequences
US5444489A (en) * 1993-02-11 1995-08-22 Georgia Tech Research Corporation Vector quantization video encoder using hierarchical cache memory scheme
WO1994025934A1 (en) * 1993-04-30 1994-11-10 Comsat Corporation Codeword-dependent post-filtering for vector quantization-based image compression

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
USRE48845E1 (en) 2002-04-01 2021-12-07 Broadcom Corporation Video decoding system supporting multiple standards
JP2017513346A (en) * 2014-03-17 2017-05-25 クアルコム,インコーポレイテッド System and method for low complexity coding and background detection
CN116594320A (en) * 2023-07-18 2023-08-15 北京啸为科技有限公司 Image sensor simulation device and controller test system
CN116594320B (en) * 2023-07-18 2023-09-15 北京啸为科技有限公司 Image sensor simulation device and controller test system

Also Published As

Publication number Publication date
WO1998007278A3 (en) 1998-05-14
AU4045597A (en) 1998-03-06

Similar Documents

Publication Publication Date Title
US5926226A (en) Method for adjusting the quality of a video coder
US5805228A (en) Video encoder/decoder system
US6072830A (en) Method for generating a compressed video signal
US5864681A (en) Video encoder/decoder system
US5831678A (en) Video encoder/decoder system
CA2171727C (en) Video compression using an iterative error data coding method
US5650782A (en) Variable length coder using two VLC tables
US5706290A (en) Method and apparatus including system architecture for multimedia communication
US5812788A (en) Encoding/decoding video signals using quantization tables based on explicitly encoded base and scale matrices
US5144423A (en) Hdtv encoder with forward estimation and constant rate motion vectors
US6301392B1 (en) Efficient methodology to select the quantization threshold parameters in a DWT-based image compression scheme in order to score a predefined minimum number of images into a fixed size secondary storage
US6026190A (en) Image signal encoding with variable low-pass filter
EP0739137A2 (en) Processing video signals for scalable video playback
Westwater et al. Real-time video compression: techniques and algorithms
JPH03136595A (en) Processing method of video picture data used for storing or transmitting digital dynamic picture
WO1998059497A1 (en) Method for generating sprites for object-based coding systems using masks and rounding average
WO1994009595A1 (en) Method and apparatus including system architecture for multimedia communications
WO2000000932A1 (en) Method and apparatus for block classification and adaptive bit allocation
JP2002518916A (en) Circuit and method for directly decoding an encoded format image having a first resolution into a decoded format image having a second resolution
US20130064304A1 (en) Method and System for Image Processing in a Microprocessor for Portable Video Communication Devices
US5778190A (en) Encoding video signals using multi-phase motion estimation
US5758092A (en) Interleaved bitrate control for heterogeneous data streams
US5691775A (en) Reduction of motion estimation artifacts
KR100282581B1 (en) Control Scheme for Shared-Use Dual-Port Prediction Error Array
EP0920204A1 (en) MPEG2 decoder with reduced RAM requisite by recompression using adaptive tree search vector quantization

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH KE LS MW SD SZ UG ZW AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ

AK Designated states

Kind code of ref document: A3

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH KE LS MW SD SZ UG ZW AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase in:

Ref country code: JP

Ref document number: 98509744

Format of ref document f/p: F

NENP Non-entry into the national phase in:

Ref country code: CA

122 Ep: pct application non-entry in european phase