LOW-POWER PARALLEL PROCESSOR AND IMAGER INTEGRATED CIRCUIT
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to a low-power, single chip, parallel processor and imager system, and, more specifically, in one embodiment, a low power, large scale MPEG2 encoder and imager system for a single-chip digital CMOS video camera is disclosed.
2. Background of the Related Art
Processing of digital data obtained from an image sensor requires complex calculations.
Processing of video data, which requires motion estimation, is particularly computationally intensive. Accordingly, various techniques have been proposed to meet these processing requirements. Thus, processors capable of performing over one billion operations per second are becoming commonplace.
A conflicting requirement for certain applications, however, is that the overall power be minimized, especially for devices such as camcorders and the like that are required to be battery powered. Thus, although the same complex calculations are required, they must be performed with a system that uses minimal amounts of power, so that the devices can operate for a reasonable period of time before requiring recharging.
Existing video processing engines are designed to optimize processing of video data stored in a secondary storage medium, e.g., random access memory, hard drive, or DVD. This results in a need for an external chipset whose primary task is to provide the necessary bandwidth for data transfer between the video engine and the secondary storage medium. The requirement of such an external data transfer eliminates the possibility for a low-power, single-chip solution.
Another existing solution that uses less power is a single integrated circuit chip for both the image sensor and digital processor. An example of such a single integrated circuit chip is the VLSI Vision Limited VV6405 NTSC Colour CMOS Image Sensor. The digital processor disclosed operates upon consecutive rows of pixel data sequentially to perform simple pixel- level computations. While this solution uses less power than other alternatives, it does not have the ability to perform operations at rates that are desired.
SUMMARY OF THE INVENTION
It is an object of the invention, therefore, to provide an integrated image sensor and processor architecture which satisfies low power requirements.
It is a further object of the invention to provide an integrated image sensor and processor capable of performing complex operations.
In view of the above recited objects, among others, the present invention implements a parallel processing architecture in which a plurality of parallel processors concurrently operate upon a different block, preferably a column, of image data. Implemented on a single monolithic integrated circuit chip, this single chip solution has characteristics that provide the throughput necessary to perform computationally complex operations, such as color correction, RGB to YUV conversion and DCT operations in either still or video applications, and motion estimation in digital video processing applications.
In a specific preferred embodiment, a parallel processor and imager system according to the present invention implements a single-chip digital CMOS video camera with real-time MPEG2 encoding capability. Computationally intensive operations of the video compression algorithms can be performed on-chip, at a location right beside the output of the imager, resulting in low latency and low power consumption. In all embodiments, this architecture takes advantage of parallelism in image processing algorithms, which is exploited to obtain efficient processing.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other objects, features, and advantages of the present invention are better understood by reading the following detailed description of the preferred embodiments, taken in conjunction with the accompanying drawings, in which:
Fig. 1 illustrates a single monolithic integrated circuit containing an image sensor array and parallel processors according to the present invention;
Figs. 2A-2C illustrate alternative manners in which instructions can be fed into each of the plurality of parallel processors according to the present invention;
Fig. 3 illustrates a single integrated circuit containing an image sensor array, parallel processors, and embedded memory capable of encoding sequential images according to the present invention;
Fig. 4 illustrates another layout of a single integrated circuit for the embodiment described in Fig. 3;
Fig. 5 illustrates a more detailed diagram of one of the parallel processors for the embodiment described in Fig. 3 according to the present invention;
Fig. 6 illustrates a more detailed diagram of one embodiment of an arithmetic logic unit for the embodiment described in Fig. 5 according to the present invention;
Figs. 7A and 7B illustrate alternative addressing schemes that can be used with the parallel processors operating upon columns of pixel data according to the present invention;
Fig. 8 provides a table of estimated cycle count per processor per frame needed for each encoding/decoding step.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
This invention, in its most basic form, has the capacity to sense a single image, generate pixel data as a result of the sensed image, and concurrently process that image using a plurality of parallel processors, each of which simultaneously operates on a portion of the pixel data associated with the image. In a preferred embodiment, as described hereinafter, the portion of the pixel data that each processor operates upon is a column of pixel data, although the pixel data that is concurrently operated upon can be divided in various other ways, such as into blocks.
As illustrated in Fig. 1, digital processor and imager system 10 includes a sensor array 12 that detects an image and generates detected signals corresponding thereto. This sensor array 12 is preferably a CMOS photo sensor array, but could also be another type of array, such as a charge coupled device. Also included in the system 10 are a plurality of parallel processors 14, each of which inputs certain predetermined ones of the detected signals by being coupled to and in close proximity with the sensor array 12, and also being coupled to an output buffer 16. The image data, such as from a single image that is sensed in a digital camera, is detected by the sensor array 12, and the detected signals, also called pixel data, are transmitted columnwise into a plurality of parallel processors 14, forty in the embodiment illustrated. Each of the forty processors operates upon the input detected signals to generate encoded signals, which are then output to the output buffer 16, the encoded signals being encoded based upon the algorithm that each of the processors is implementing. In the specific preferred embodiment disclosed hereinafter, the number of parallel processors, the size of each of the parallel processors, the search space within a processor domain, and the size of certain memories, for instance, are based upon an array having a predetermined resolution of 640x480 sensing elements.
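The columnwise partitioning described above can be sketched in software form as follows. This is an illustrative model only, not the on-chip hardware implementation; the names `encode_columns` and `process_frame` are hypothetical.

```python
# Illustrative sketch: a 640x480 frame is divided into 40 groups of 16
# columns, one worker per group, mirroring the columnwise dispatch of
# pixel data to the parallel processors 14.
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT, NUM_PROCESSORS = 640, 480, 40
COLS_PER_PROCESSOR = WIDTH // NUM_PROCESSORS  # 16 columns per processor

def encode_columns(frame, start_col):
    # Stand-in for the per-processor encoding algorithm: each worker
    # sees only its own 16-column slice of the pixel data.
    return [row[start_col:start_col + COLS_PER_PROCESSOR] for row in frame]

def process_frame(frame):
    # Dispatch every column group concurrently, one worker per group.
    with ThreadPoolExecutor(max_workers=NUM_PROCESSORS) as pool:
        slices = pool.map(encode_columns,
                          [frame] * NUM_PROCESSORS,
                          range(0, WIDTH, COLS_PER_PROCESSOR))
    return list(slices)
```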
It should be noted, however, that, for each of the embodiments described, the specific numbers of processors, implementation of each processor, search space, memory requirements, and other specific implementation aspects as recited are not intended to be limiting, but instead to completely describe a presently preferred embodiment. As described, the relationship of specific implementation aspects is not arbitrary, but is based upon considerations in which computationally intensive operations can be simultaneously repeated by multiple processors in order to obtain the fullest throughput. This throughput is dependent in part upon the algorithms that need to be implemented, for example the fact that motion estimation requires knowledge of neighboring pixel data, whereas RGB to YUV conversion and DCT operations do not require such knowledge. Further, the size of the sensing array will assist in determining the proper search space: the larger the sensor array, the larger the search space can be without adverse effects on throughput or increased power usage. Similarly, the larger the number of pixels that each processor operates upon, the greater the resulting clock rate, and the more complex the associated circuitry becomes. Accordingly, specific implementation aspects are dependent upon factors such as these.
Figs. 2A-2C illustrate the manner in which the parallel processors 14 can be loaded with instructions that will then cause them to perform the intended operation. As illustrated in Fig. 2A, each processor 14 can sequentially receive the same instruction, whereas Figs. 2B and 2C illustrate more complex instruction loading sequences. These instruction loading sequences are maintained by a host processor that provides overall control of the parallel processors and uses the equivalent of the interprocessor communication unit to communicate with each of the parallel processors, in a manner that is known with respect to parallel processor implementations generally. The host processor can be implemented on the same monolithic integrated circuit chip, or die, or off-chip. There are also custodial tasks that need to be performed, such as variable length encoding, after the pixel data has been processed. The computation of these tasks can easily be integrated on the same chip, as their computation requirements are much more relaxed compared to those of the pixel level processing.
The descriptions provided hereinafter, which are of a specific preferred embodiment shown in block diagram form in Fig. 3, are also not intended to be interpreted as showing only a single particular embodiment. Rather, the descriptions provided with respect to this embodiment are intended to illustrate that the parallel processors, operating concurrently on various portions of pixel data, can be configured in a variety of ways, since the operations that these parallel processors perform are the most computationally difficult.
Accordingly, many modifications can be made and still be within the intended scope of the invention. With reference to the embodiment illustrated in Fig. 3, the parallel processor and imager system 20 according to this embodiment of the present invention exploits the parallelism inherent in video processing algorithms, the small dynamic range used by existing video compression algorithms, digital CMOS sensor technology, and embedded DRAM technology to realize a low-power, single-chip solution for low-cost video capturing. Thus, the invention enables capture and processing of video data on the same chip. The acquired video data is stored directly in the on-chip embedded DRAM, also termed pixel memory 30, which serves as a high-bandwidth video frame buffer. The bandwidth of embedded DRAM can be as high as 8 Gbyte/s, making it possible to support several (40 in this preferred embodiment described herein) parallel video processors. It should be noted that the preferred embodiment is described with respect to a particular implementation, including a configuration in which each processor is limited to 16 bits. This description is not intended to be limiting, as many alternative configurations are possible, as will be apparent. For low power purposes, these parallel processors are designed to run at relatively low clock rates, described further hereinafter, thereby allowing total computational throughput as high as 1.6 BOPS while consuming less than 40 mW of power.
Fig. 3 also illustrates one layout of the CMOS photo sensors 22, the embedded DRAM 30, and the parallel DSP processors 40-1 to 40-40 on a single integrated circuit chip 20. The CMOS photo sensor array 22 is disposed on a top layer of the integrated circuit chip in a location where it will be able to receive incident light, and includes, for instance, photo diodes, A/D converters, and A/D offset correction circuitry. The embedded DRAM or pixel memory 30 resides under the photo diodes and provides storage for the current and two past frames of captured images, as well as intermediate variables such as motion vectors (MV's) and multi-resolution pixel values. The parallel video processors 40 are located next to the imaging circuitry, and each operates independently on a 16-pixel-wide column of pixels.
The specific embodiment of the processor system 20, as described herein, has the advantage of supporting high computational throughput at low clock rates when executing highly repetitive operations. It is less efficient when operating on more complex algorithms that require access to data outside of the processor domain. The size of the processor domain is, therefore, an important design parameter, which requires careful examination of the types of video processing algorithms, as described hereinafter.
Processor system 20 is described herein with reference to its structure, and then described with reference to how this structure can implement three algorithms commonly used in video coding standards: RGB to YUV conversion, DCT, and motion estimation. RGB to YUV conversion is performed on the pixel level and requires no additional information from neighboring pixels. It is computationally intensive, requiring multiple multiplies and adds per pixel, but can be easily achieved with a parallel architecture. DCT, on the other hand, is performed on a block basis. It operates on a row or a column of pixels in each pass and requires bit reverse or base offset addressing to simplify the instruction set. Implementing DCT with a pixel-level processor domain would be unnecessarily complicated. Similar to DCT, motion estimation works best with a block-level processor domain.
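As a concrete illustration of the pixel-level conversion just described, the following sketch applies one commonly used set of RGB to YUV coefficients (those of ITU-R BT.601); the actual coefficients loaded into each processor may differ.

```python
# Hedged sketch of pixel-level RGB to YUV conversion. The coefficient
# values below are the common BT.601 ones, used here only as an example;
# they are not necessarily the coefficients used on-chip.
def rgb_to_yuv(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b     # luma: 3 multiplies, 2 adds
    u = -0.147 * r - 0.289 * g + 0.436 * b    # blue-difference chroma
    v = 0.615 * r - 0.515 * g - 0.100 * b     # red-difference chroma
    return y, u, v
```

Each pixel needs nine multiplies and six adds, which is computationally intensive per pixel but requires no neighboring-pixel data, so it parallelizes trivially across processor domains.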
Unlike DCT in which processing variables are confined within a block, motion estimation requires access to adjacent blocks regardless of the size of the processor domain. The extent of the locality of interprocessor communication depends on the search space. In this processor design, a search space between processor domains is assumed. No assumption is made with the size of the search space within a processor domain. Furthermore, some motion estimation procedures do not require any multiplication other than simple shifts, as in the example below.
These algorithmic constraints place certain requirements on the design of the parallel processor. In short, the computational throughput (less than 1.6 BOPS based on the algorithm proposed by Junavit Chalidabhongse and C.-C. Jay Kuo, "Fast motion vector estimation using multiresolution-spatio-temporal correlations," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 3, pp. 477-488, June 1997) required for motion estimation results in the most effective size being 16 pixels for each processor with the given technology (preferably less than 0.2 μm) and the clock rate (preferably less than about 40 MHz). Special addressing modes such as bit reversal, base-offset, auto increment, and modulo operations are needed for DCT and motion estimation. Interprocessor communication circuitry is needed to access data between processor domains and to communicate domain-specific information such as MV's and reference blocks for block search.
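As an illustration of one of the special addressing modes listed above, a bit-reversal address calculation for an 8-entry block might look like the following sketch (the function name is hypothetical):

```python
# Illustrative bit-reversal addressing: an N-point DCT/FFT-style pass
# reads its operands at bit-reversed indices, shown here for an 8-entry
# (3-bit address) block.
def bit_reverse(index, bits):
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the low bit
        index >>= 1
    return result

order = [bit_reverse(i, 3) for i in range(8)]  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Providing this as a hardware addressing mode removes the index-shuffling arithmetic from the instruction stream, which is the stated purpose of the special modes.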
In addition to constraints posed by the algorithms, physical and technological limitations are also considered. In the physical layout, each CMOS photo diode has a dimension of 10μ x 10μ. With 16 pixels per processor, each processor is preferably limited to a width of 160μ. This limits the datapath to 36 bits for the arithmetic unit, assuming that the individual ones of the parallel processors are staggered so that certain processing units in the datapath can be made wider. With staggering, the width dimension can at most double at the cost of more
complicated layout and routing. Although the embedded DRAM can sustain high memory throughput via large data buses (64 bits), the access time of the embedded DRAM with a 3.3V supply is twice as long as the cycle time (50 ns). A DMA (direct memory access) unit is introduced to serve as an interface between the DRAM and the local memory units, as described hereinafter. In addition, the DMA unit may communicate with adjacent processors to access pixel data outside of the processor domain.
Finally, an important algorithmic distinction is made with data dependency. As the local program memory space is severely limited, it is desirable to partition the program code such that individual code segments can be stored locally. It is also advantageous to partition the program code based on data dependency. A data independent algorithm enables codes to be executed in a predictable manner. A data dependent algorithm has an unpredictable program flow and, therefore, would require the attention of individual processors. By partitioning the code into data independent and dependent segments, it is possible to store data independent codes outside of the processor and only to store data dependent codes local to the processor. Data independent instructions can be stored on a much larger program space either on-chip or off-chip and instructions would be sequentially pipelined into the individual parallel processors. If instructions are not so pipelined, a large memory bandwidth to the central program store is required. Program flow control such as branching can be performed outside of the parallel processors. This reduces unnecessary energy overhead to perform program decoding in the parallel processors, which, consequently, gets multiplied by the number of parallel processors to account for the total consumed power. Most image transformation and filtering algorithms are data independent. DCT and color conversion are such examples. A portion of the motion estimation algorithm is also data independent. It is, however, data dependent during MV refinement where local searches are required, as will be described hereinafter.
The single chip parallel processor and imager system of Fig. 3 according to the invention achieves the following three goals simultaneously: realize the image/video processing algorithms; minimize DMA accesses to the pixel DRAM; and maximize computational throughput while keeping the power consumption at a minimal level. Minimizing DMA access to the pixel memory is crucial not only to reduce power consumption, but also to reduce instruction overhead incurred with access latencies. Each processor 40, as illustrated in Fig. 5, contains a DMA 50, a 288-byte block visible RAM 52, a 36-byte auxiliary RAM 54, a 32-word register file 56, an ALU 58, an inter-processor communication unit 60, an
external I/O buffer 62, and the processor control unit 64. The processor control unit 64 consists of the program RAM 66, the instruction decoder 68, and the address generation unit 70.
To realize the image/video processing algorithms, the proposed parallel processor and imager system 10 supports certain types of addressing modes and data flow between the memory units mentioned above. For color conversion and DCT, there is no need to access adjacent pixel memories. Transfer of data from the pixel memory 30 to local memories is implemented with a simple DMA. Local memory and addressing mode requirements are implemented as described hereinafter. Two-operand single-cycle instructions can be realized with two data paths 80 and 82 to the ALU 58, a path 80 from local pixel storage (block visible RAM 52) and a second path 82 from coefficient storage (auxiliary RAM 54 or the register file 56). Automatic post increment and offset addressing modes are available.
For motion estimation, data flow involves adjacent pixel memories. Depending on the motion estimation algorithm used, data flow may involve pixel memories that are two processor domains away. The motion estimation algorithm can be partitioned into four main sections: subsampling, hierarchical and multiresolution block matching, MV candidate selection, and MV refinement. The data flow for subsampling and hierarchical resolution reduction is restricted to the current processor domain. Block matching requires access to adjacent pixel memories. And MV candidate selection may require access to data stored two processor domains away. The proposed processor enables these types of data flow by employing special DMA, local memories, and addressing schemes, as will be described hereinafter.
The DMA 50 illustrated in Fig. 5 is the primary interface between the parallel processor's local memories (i.e. auxiliary RAM 54 and block visible RAM 52) and the embedded pixel DRAM 30. It is also the primary mechanism for inter-processor data transfer. The DMA 50 separates the task of pixel memory access from the parallel processors such that DRAM access latencies do not stall program execution. The DMA 50 also supports memory access requests from pixel DRAM's that lie within two processor domains. Access requests that involve two processor domains are not optimal and are meant only for retrieving small amounts of data.
The DMA 50 is implemented in the preferred embodiment described herein with four access registers and memory buffers, as is conventional. Each memory access consists of a 64-bit (8 pixels) packet. Access requests are pipelined along with the instructions into the access registers and are prioritized in a first come, first served fashion. Memory buffers provide the temporary storage needed for the DMA to work with both 64-bit (DRAM) and 8-bit (SRAM)
data packets. An access request contains information such as the source and destination addresses, the relative processor domain "read" ID, the relative processor domain "write" ID, and the read/write block size. A status flag is associated with each DMA access register to indicate access request completion. This flag is used in conjunction with a wait instruction to allow better program flow control. Program flow control is necessary during external pixel DRAM accesses, especially during data dependent processing.
The DMA 50 resolves access contention from the on-chip or off-chip host processors, as previously described, by placing the request in a FIFO queue. External access requests are treated by the DMA 50 with the same priority as internal access requests. However, each DMA 50 has a limited FIFO queue, and if it is full, new DMA access requests will be stalled, as will the processor 40 issuing the request. To keep track of accesses to pixel DRAM's 30 that are two processor domains away, a relative processor ID and a backward relative processor ID are appended to each access request.
Two special addressing schemes are available for the block visible RAM 52. The block visible RAM 52 is used to provide temporary storage for a block of up to 16x16 pixels of 9-bit wide data for motion estimation and 8x8 pixels of 18-bit wide data for IDCT to comply with the IEEE error specifications. These addressing schemes provide additional flexibility to facilitate local memory accesses and to reduce DMA overheads, as described hereinafter.
The first addressing scheme is called block visible addressing and is illustrated in Fig. 7A. It enables the block visible RAM 52 in one processor (such as 40-3) to be readable by adjacent processors (such as 40-2 and 40-4). This is especially useful in operations that involve access to a block of data stored in the block visible RAM 52 of adjacent processors. It is specifically used in data independent mode; otherwise, the data stored in adjacent block visible RAM's cannot be predetermined. Being able to address data from adjacent block visible RAM's 52 has the advantage of providing a second level of inter-processor data communication without the cost of performing external DMA accesses. The cost of utilizing this addressing scheme is an increased number of SRAM reads per cycle to avoid memory access contentions. However, it is justified due to the much larger energy and latency overhead associated with DMA accesses. Also, this addressing scheme reduces chip area, a result of reusing the block visible RAM 52. The second addressing scheme is called modulo offset addressing and is illustrated in Fig.
7B. It involves an automatic modulo offsetting of the addresses issued to the block visible
RAM. This addressing scheme may work in both data dependent and independent modes. The block visible RAM 52 and the auxiliary RAM 54 are addressed by two address pointers, each
pointer representing a coordinate in the Cartesian coordinate system, with the pointer address being generated by the processor 40, the DMA 50, and the address generation unit 70. This data address representation is more suitable for image processing due to the 2-dimensional nature of images. In addition, this representation supports more flexible addressing modes, such as auto increments and modulo operations in both x and y directions.
The modulo offset addressing scheme augments the 2-D address representation by allowing all addresses to be offset by an amount preloaded into the two offset registers (one for each dimension). There are two advantages to using this addressing scheme. First, all address pointers are relative to the offset coordinates (i.e. the offset coordinates are treated as the origin). This allows a program to be reused for processing another set of pixels by simply modifying the offset values. In data dependent mode, this may result in a smaller code size needed to be stored in the local program RAM 66. The second advantage lies with a reduction of DMA accesses to the external pixel DRAM. During block search, blocks of 16x16 pixels belonging to the previous frame need to be read from the pixel memory and stored in the block visible RAM 52. Almost all blocks used in block search require external pixel DRAM access. However, since consecutive blocks that are retrieved from the pixel DRAM 30 are displaced by only a few pixels, it is costly to re-read pixels in the overlapped region. DMA 50 accesses to external pixel memories 30 are inefficient since they contend with adjacent DMAs for memory bandwidth. The modulo offset addressing scheme offers a simple implementation to reuse pixel values in the block visible RAM 52. Offsets may be modified to reposition the origin to point to the coordinates of the new block. Only non-overlapped pixel regions between the previous block and the current block need to be updated with DMA accesses. These DMA updates may be interleaved into the search algorithm (since a 16x16 block search requires a minimum of 256 cycles to calculate the error metric) to reduce DMA access latencies. Note also that the modulo offset addressing modifies not only the address pointers generated by the processor, but also the ones generated by the DMA 50. Therefore, DMA access requests can remain the same in the program code.
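A minimal software sketch of the modulo offset addressing scheme, with hypothetical register and method names, illustrates how repositioning the offset origin reuses pixels already resident in the block visible RAM:

```python
# Hedged sketch of modulo offset addressing: every logical (x, y)
# address pointer is displaced by the two preloaded offset registers and
# wrapped modulo the RAM dimensions. Names are illustrative only.
class ModuloOffsetAddresser:
    def __init__(self, width=16, height=16):
        self.width, self.height = width, height
        self.x_offset, self.y_offset = 0, 0  # the two offset registers

    def physical(self, x, y):
        # Logical coordinates are relative to the offset "origin".
        return ((x + self.x_offset) % self.width,
                (y + self.y_offset) % self.height)

addr = ModuloOffsetAddresser()
addr.x_offset = 3   # block search repositioned 3 pixels to the right
# Logical (0, 0) now maps to physical column 3; the overlapped columns
# stay in place, and only the non-overlapped region needs a DMA update.
```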
The modulo offset addressing is available for both data dependent and independent operations. On the other hand, the block visible addressing is available only during data independent mode. Visibility can be turned off to reduce the power consumption induced by multiple reads issued to the block visible RAM.
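The block visible addressing described above can likewise be sketched in software; the fall-through read to an adjacent processor's RAM is illustrative only, with hypothetical names:

```python
# Hedged sketch of block visible addressing: a local read whose column
# index falls outside the processor's own domain is redirected to an
# adjacent processor's block visible RAM instead of issuing a DMA access.
class BlockVisibleRAM:
    def __init__(self, width=16, height=16):
        self.width = width
        self.data = [[0] * width for _ in range(height)]

    def read(self, x, y, left=None, right=None):
        # In data independent mode, out-of-range x "falls through" to a
        # neighbour; visibility can be turned off by passing no neighbours.
        if x < 0 and left is not None:
            return left.data[y][x + self.width]
        if x >= self.width and right is not None:
            return right.data[y][x - self.width]
        return self.data[y][x]
```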
The auxiliary memory 54 in the preferred Fig. 5 embodiment being described herein is a 4x8 by 9-bit SRAM used to provide a second pixel buffer for operations that involve two
blocks of pixels (i.e. block matching). It provides the second path 82 to the ALU 58 for optimal computational efficiency. It can also be used to store lookup coefficients that are 9-bit wide during non-block matching operations. The auxiliary memory 54 does not support the two addressing schemes available to the block visible RAM 52 since it is used to store pixel values primarily from the current processor domain. Its role in block matching is to buffer the reference block, which remains constant throughout block search. The auxiliary memory 54 and the block visible RAM 52 are the only two local memories accessible by the DMA. The auxiliary memory 54 also serves as a gateway between the processor 40 and the external I/O buffer 62. Data from the processor 40 can be transferred to the external I/O buffer 62 which communicates with the I/O pins (not shown).
To complement the 9-bit local SRAM units that make up the auxiliary memory 54, a 32-word, 18-bit register file 56 is available. The register file 56 provides a fast, higher precision, low power workable memory space. The register file 56 has two data paths 84 and 86 to the ALU 58, allowing most operations to be performed by the ALU 58 and the register file 56. It is large enough that it can also store both lookup coefficients (e.g. DCT coefficients) and system variables.
The ALU 58 illustrated in Fig. 5 has a limited complexity due to the constraints on area and power. The ALU 58 is implemented, as shown in Fig. 6, with a 36-bit carry select adder 90, a 9-bit subtractor 92, a conditional signed negation unit 94 (for calculating absolute values), a 16x17 multiplier 96, a bit manipulation logic unit 98, a shifter 100, a T register 100, and a 36-bit accumulator 102. Operations involving addition, shifting and bit manipulations can be executed in one cycle. The calculation of the absolute error involves the 9-bit subtractor 92, the conditional signed negation unit 94, and the adder 90. Operations are pipelined in 2 stages such that one subtract-absolute-accumulate (SAA) instruction can be executed every cycle. The first stage consists of the 9-bit subtraction and conditional signed negation, and the second stage involves accumulating the absolute differences. The T register 100 is used in conjunction with the SAA instruction, primarily for algorithmic power reduction. The T register 100 can be preloaded with a pixel value from the auxiliary memory 54 and, depending on the algorithm, it can be reused without incurring SRAM memory access energy overheads. Finally, the hardware multiplier 96 is implemented to perform the DCT and IDCT efficiently.
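The subtract-absolute-accumulate operation can be modeled in software as follows; this sketch shows only the arithmetic performed by the two pipeline stages, not the hardware itself (the function name is hypothetical):

```python
# Hedged sketch of the SAA instruction applied over a block: stage one
# subtracts and conditionally negates, stage two accumulates, so the
# pipelined hardware retires one absolute difference per cycle.
def saa_block(reference, candidate):
    accumulator = 0
    for ref_row, cand_row in zip(reference, candidate):
        for ref_px, cand_px in zip(ref_row, cand_row):
            diff = ref_px - cand_px     # stage 1: 9-bit subtract
            if diff < 0:
                diff = -diff            # stage 1: conditional signed negation
            accumulator += diff         # stage 2: accumulate
    return accumulator  # sum of absolute differences for the block
```

For a 16x16 block this is the error metric of block search, which is why such a search requires a minimum of 256 cycles, one SAA per pixel.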
The inter-processor communication unit 60 illustrated in Fig. 5 is responsible for instruction pipelining and processor status signaling. Instructions are pipelined from one processor 40 to the next and they may be executed immediately or stored in the program RAM
66 depending on whether the processor 40 is operating in data independent or data dependent mode, respectively. In data dependent mode, execution of the code stored in the program RAM 66 occurs immediately after the first instruction has been buffered. Execution of the code segment ends when an end-of-program instruction is reached. At this point, a status flag is set to indicate code completion, and the processor 40 halts until a new instruction clears it and forces the processor 40 to operate in data independent mode. The central controller (not shown) reinitializes instruction pipelining when it determines that all processors 40 have completed execution. In data independent mode, the task of address generation may be handled by the central controller in order to reduce power consumption. With the construction described above, the individual parallel processors 40 according to the preferred embodiment of the present invention consume less than 1 mW of power at a clock rate of 40 MHz, amounting to approximately 40 mW of total power consumption. An estimated cycle count per processor per frame needed for each encoding/decoding step is provided in Fig. 8. The number of cycles necessary to perform IBBPBBPBB MPEG-2 encoding at 30 fps is estimated to be 35 MIPS for each processor 40. The utilization of the functional units within the processor 40 is approximately 40% for the adder, 6% for the multiplier, 50% for the subtract-absolute-accumulate unit, and 4% for DRAM memory accesses. The processor area is approximately 160 μm by 1800 μm.
Appendix A outlines the pseudo code for implementing the RGB to YUV conversion. This pseudo code is provided as one exemplary way in which the processors 40 can implement this and other algorithms.
While the invention has been described with reference to preferred embodiments, variations and modifications may be made without departing from the spirit and scope of the invention. For example, while the algorithms noted above are described in terms of visual video, an additional parallel processor can be used to implement an audio channel, which audio is sensed using an analog to digital converter. Also, the photo sensor array, as illustrated in Fig. 4, can be located adjacent to the pixel memory, rather than above it as illustrated in Fig. 3. Accordingly, the present invention is properly defined by the following claims.
APPENDIX A
The RGB-YUV conversion is a pixel level operation. It consists of a matrix multiplication of the color vector to produce the target color vector. This is depicted in the following equation:
[ Y ]   [ a11  a12  a13 ]   [ R ]
[ U ] = [ a21  a22  a23 ] x [ G ]
[ V ]   [ a31  a32  a33 ]   [ B ]
The implications are as follows:
1. The color vectors have to be pre-loaded from pixel DRAM 30
2. The coefficients aij have to be loaded into the local memory of each processor 40
3. The resulting color vector has to be stored back to the pixel DRAM 30
Note that this algorithm is data independent (i.e. regardless of what values R, G, or B takes on, the program flow is not affected). This means that instructions can be pipelined to each processor in a predictable manner. Also, no local buffering of the instructions is necessary. Each processor executes the instruction on a first-come-first-serve basis. In effect, the array processors can be programmed as a single processing entity. Note that the pseudo-code given below does not pay any attention to how the instructions are fed to each processor.
The processor uses a 4 stage pipeline: fetch, decode/address generation, read, and execute.
In data independent mode, the processor takes the pipelined instruction and decodes them directly. As a result, the pipeline looks like a 3 stage pipeline.
A sample pseudo code for implementing this algorithm follows:
Total cycle count for RGB-YUV is 152 cycles / 8 pixels per cycle * 480 V pixels * 16 H pixels = 145,920 cycles.
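The quoted total can be checked arithmetically; the sketch below simply reproduces the calculation for a 16-column by 480-row processor domain:

```python
# Arithmetic check of the cycle count quoted above: 152 cycles per
# 8-pixel packet over the 16 x 480 pixels of one processor domain.
cycles_per_packet = 152
pixels_per_packet = 8
pixels_per_domain = 16 * 480  # 16 H pixels x 480 V pixels
total_cycles = cycles_per_packet * pixels_per_domain // pixels_per_packet
print(total_cycles)  # 145920
```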