WO2006055546A2

WO2006055546A2 - A video processor having a scalar component controlling a vector component to implement video processing

Info

Publication number: WO2006055546A2
Application number: PCT/US2005/041329
Authority: WO
Inventors: Shirish Gadre; Ashish Karandikar; Stephen Lew; Christopher T. Cheng
Original assignee: Nvidia Corporation
Priority date: 2004-11-15
Filing date: 2005-11-14
Publication date: 2006-05-26
Also published as: JP2008521097A; KR100917067B1; KR101084806B1; KR20090020715A; WO2006055546A9; KR100880982B1; KR101030174B1; KR20100093141A; JP4906734B2; KR20110011758A; KR20070063580A; WO2006055546A3; KR20080080419A; EP1812928A4; EP1812928A2; CA2585157A1; KR101002485B1

Abstract

A latency tolerant system for executing video processing operations is described. So too is a stream processing in a video processor, a video processor having scalar and vector components and multidimensional datapath processing in a video processor.

Description

VIDEO PROCESSING

This application claims the benefit under 35 U.S. C. Section 119(e) of U.S. Provisional Application Serial No. 60/628,414, filed on 11/15/2004, to Gadre et al., entitled "A METHOD AND SYSTEM FOR VIDEO PROCESSING" which is incorporated herein in its entirety.

FIELD OF THE INVENTION

[001] The field of the present writing pertains to digital electronic computer systems. More particularly, the present writing relates to a system for efficiently handling video information on a computer system. Described in one aspect is a latency tolerant system for executing video processing operations. Described in another aspect is stream processing in a video processor. Further described is multidimensional data path processing in a video processor. Also described is a video processor having scalar and vector components.

BACKGROUND

[002] The display of images and full-motion video is an area of the electronics industry improving with great progress in recent years. The display and rendering of high-quality video, particularly high-definition digital video, is a primary goal of modern video technology applications and devices. Video technology is used in a wide variety of products ranging from cellular phones, personal video recorders, digital video projectors, high-definition televisions, and the like. The emergence and growing deployment of devices capable of high-definition video generation and display is an area of the electronics industry experiencing a large degree of innovation and advancement.

[003] The video technology deployed in many consumer electronics-type and professional level devices relies upon one or more video processors to format and/or enhance video signals for display. This is especially true for digital video applications. For example, one or more video processors are incorporated into a typical set top box and are used to convert HDTV broadcast signals into video signals usable by the display. Such conversion involves, for example, scaling, where the video signal is converted from a non-16x9 video image for proper display on a true 16 x 9 (e.g., widescreen) display. One or more video processors can be used to perform scan conversion, where a video signal is converted from an interlaced format, in which the odd and even scan lines are displayed separately, into a progressive format, where an entire frame is drawn in a single sweep.

[004] Additional examples of video processor applications include, for example, signal decompression, where video signals are received in a compressed format (e.g., MPEG-2) and are decompressed and formatted for a display. Another example is re- interlacing scan conversion, which involves converting an incoming digital video signal from a DVI (Digital Visual Interface) format to a composite video format compatible with the vast number of older television displays installed in the market.

[005] More sophisticated users require more sophisticated video processor functions, such as, for example, hi-Loop/Out-of-loop deblocking filters, advanced motion adaptive de-interlacing, input noise filtering for encoding operations, polyphase scaling/re-sampling, sub-picture compositing, and processor-amplifier operations such as, color space conversion, adjustments, pixel point operations (e.g., sharpening, histogram adjustment etc.) and various video surface format conversion support operations.

[006] The problem with providing such sophisticated video processor functionality is the fact that a video processor having a sufficiently powerful architecture to implement such functions can be excessively expensive to incorporate into many types of devices. The more sophisticated the video processing functions, the more expensive, in terms of silicon die area, transistor count, memory speed requirements, etc., the integrated circuit device required to implement such functions will be.

[007] Accordingly, prior art system designers were forced to make trade-offs with respect to video processor performance and cost. Prior art video processors that are widely considered as having an acceptable cost/performance ratio have often been barely sufficient in terms of latency constraints (e.g., to avoid stuttering the video or otherwise stalling video processing applications) and compute density (e.g., the number of processor operations per square millimeter of die). Furthermore, prior art video processors are generally not suited to a linear scaling performance requirement, such as in a case where a video device is expected to handle multiple video streams (e.g., the simultaneous handling of multiple incoming streams and outgoing display streams).

[008] Thus what is needed, is a new video processor system that overcomes the limitations on the prior art. The new video processor system should be scalable and have a high compute density to handle the sophisticated video processor functions expected by increasingly sophisticated users.

SUMMARY

[009] Embodiments of the present writing provide a new video processor system that supports sophisticated video processing functions while making efficient use of integrated circuit silicon die area, transistor count, memory speed requirements, and the like. Embodiments of the present writing maintain high compute density and are readily scalable to handle multiple video streams.

[010] In one embodiment, there is implemented, a latency tolerant system for executing video processing operations in a video processor. The system includes a host interface for implementing communication between the video processor and a host CPU, a scalar execution unit coupled to the host interface and configured to execute scalar video processing operations, and a vector execution unit coupled to the host interface and configured to execute vector video processing operations. A command FIFO is included for enabling the vector execution unit to operate on a demand driven basis by accessing the memory command FIFO. A memory interface is included for implementing communication between the video processor and a frame buffer memory. A DMA engine is built into the memory interface for implementing DMA transfers between a plurality of different memory locations and for loading a datastore memory and an instruction cache with data and instructions for the vector execution unit.

[011] In one embodiment, the vector execution unit is configured to operate asynchronously with respect to the scalar execution unit by accessing the command FIFO to operate on the demand driven basis. The demand driven basis can be configured to hide a latency of a data^' transfer from the different memory locations (e.g., frame buffer memory, system memory, cache memory, etc.) to the command FIFO of the vector execution unit. The command FIFO can be a pipelined FIFO to prevent stalls of the vector execution unit.

[012] In one embodiment, the present invention is implemented as a video processor for executing video processing operations. The video processor includes a host interface for implementing communication between the video processor and a host CPU. The video processor includes a memory interface for implementing communication between the video processor and a frame buffer memory. A scalar execution unit is coupled to the host interface and the memory interface and is configured to execute scalar video processing operations. A vector execution unit is coupled to the host interface and the memory interface and is configured to execute vector video processing operations. The video processor can be a standalone video processor integrated circuit or can be a component integrated into a GPU integrated circuit.

[013] In one embodiment, the scalar execution unit functions as a controller of the video processor and controls the operation of the vector execution unit. The scalar execution unit can be configured to execute flow control algorithms of an application and the vector execution unit can be configured to execute pixel processing operations of the application. A vector interface unit can be included in the video processor for interfacing the scalar execution unit with the vector execution unit. In one embodiment, the scalar execution unit and the vector execution unit are configured to operate asynchronously. The scalar execution unit can execute at a first clock frequency and the vector execution unit can execute at a different clock frequency (e.g., faster, slower, etc.). The vector execution unit can operate on a demand driven basis under the control of the scalar execution unit.

[014] In one embodiment, the present invention is implemented as a multidimensional datapath processing system for a video processor for executing video processing operations. The video processor includes a scalar execution unit configured to execute scalar video processing operations and a vector execution unit configured to execute vector video processing operations. A data store memory is included for storing data for the vector execution unit. The data^: store memory includes a plurality of tiles having symmetrical bank data structures arranged in an array. The bank data structures are configured to support accesses to different tiles of each bank.

[015] Depending upon the requirements of the a particular configuration, each of the bank data structures can comprise a plurality of tiles (e.g., a 4 x 4, 8 x 8, 8 x 16, 16 x 24, or the like). In one embodiment, the banks are configured to support accesses to different tiles of each bank. This allows a single access to retrieve a row or column of tiles from two adjacent banks. In one embodiment, a crossbar is used for selecting a configuration for accessing tiles of the plurality of bank data structures (e.g., row, column, block, etc.). A collector can be included for receiving the tiles of the banks accessed by the crossbar and for providing the tiles to a front end of the vector datapath on a per clock basis.

[016] In one embodiment, the present invention is implemented as a stream based memory access system for a video processor. The video processor includes a scalar execution unit configured, to execute scalar video processing operations and a vector execution unit configured to execute vector video processing operations. A frame buffer memory is included for storing data for the scalar execution unit and the vector execution unit. A memory interface is included for implementing communication between the scalar execution unit and the vector execution unit and the frame buffer memory. The frame buffer memory comprises a plurality of tiles. The memory interface implements a first stream of a first sequential access of tiles for the scalar execution unit and implements a second stream of a second sequential access of tiles for the vector execution unit.

[017] In one embodiment, the first stream and the second stream comprise a sequential series of prefetched tiles, prefetched in such manner as to hide the access latency from the originating memory location (e.g., frame buffer memory, system memory, etc.). In one embodiment, the memory interface is configured to manage a plurality of different streams from a plurality of different originating locations and to a plurality of different terminating locations, hi one embodiment a DMA engine built into the memory interface is used to implement a plurality of memory reads and a plurality of memory writes to support the multiple streams.

[018] Broadly, this writing discloses at least the following four methodologies.

A) The method broadly taught in this description is a method for multidimensional datapath processing system in a video processor for executing video processing operations, comprising: executing scalar video processing operations by using a scalar execution unit; executing vector video processing_. operations by using a vector execution unit; storing data for the vector execution unit by using a data store memory, wherein the data store memory comprises a plurality of tiles comprising symmetrical bank data structures arranged in an array, and wherein the bank data structures are configured to support accesses to different tiles of each bank. Further, the method of A above comprises each of the bank data structures including a plurality of tiles arranged in a 4 x 4 pattern. Also, the method of A above comprises each of the bank data structures including a plurality of tiles arranged in a 8 x 8, 8 x 16, or 16 x 24 pattern. Additionally, the method of A above comprises the bank data structures being configured to support accesses to different tiles of each bank data structure, wherein at least one access is to two adjacent bank data structures comprising a row of tiles of the two bank data structures. The method of A above also involves the tiles being configured to support accesses to different tiles of each bank data structure, wherein at least one access is to two adjacent bank data structures comprising a column of tiles of the two adjacent bank data structures. Further, the method of A above comprises selecting a configuration for accessing tiles of the plurality of bank data structures by using a crossbar coupled to the data store. In this selecting step, the crossbar accesses the tiles of the plurality of bank data structures to supply data to a vector datapath on a per clock basis. Also, there involves receiving the tiles of the plurality of bank data structures accessed by the crossbar by using a collector; and providing the tiles to a front end of the vector datapath on a per clock basis.

B) The method broadly taught in this description is as well a method for executing video processing operations, the method implemented using a video processor of a computer system executing computer readable code, comprising: establishing cornmunication between the video processor and a host CPU by using a host interface; establishing communication between the video processor and a frame buffer memory by using a memory interface; executing scalar video processing operations by using a scalar execution unit coupled to the host interface and the memory interface; and executing vector video processing operations by using a vector execution unit coupled to the host interface and the memory interface. The method of B above further comprises the scalar execution unit functioning as a controller of the video processor and controlling the operation of the vector execution unit. The method B above also comprises a vector interface unit for interfacing the scalar execution unit with the vector execution unit. The method of B above also comprises the scalar execution unit and the vector execution unit being configured to operate asynchronously. Also, the scalar execution unit executes at a first clock frequency and the vector execution unit executes at a second clock frequency. The method of B above comprises the scalar execution unit being configured to execute flow control algorithms of an application and the vector execution unit being configured to execute pixel processing operations of the application. Further, the vector execution unit is configured to operate on a demand driven basis under the control of the scalar execution unit. Additionally, the scalar execution unit is configured to send function calls to the vector execution unit using a memory command FIFO, the vector execution unit operating on a demand driven basis by accessing the memory command FIFO. Also, the asynchronous operation of the video processor is configured to support a separate independent update of a vector subroutine or a scalar subroutine of the application. Finally, the method of B above comprises the scalar execution unit being configured to operate using VLIW (very long instruction word) code. C) The method here described also broadly teaches a method for stream based memory access in a video processor for executing video processing operations, comprising: executing scalar video processing operations by using a scalar execution unit; executing vector video processing operations by using a vector execution unit; storing data for the scalar execution unit and the vector execution unit by using a frame buffer memory; and implementing communication between the scalar execution unit and the vector execution unit and the frame buffer memory by using a memory interface, wherein the frame buffer memory comprises a plurality of tiles and wherein the memory interface implements a first stream comprising a first sequential access of tiles and implements a second stream comprising a second sequential access of tiles for the vector execution unit or the scalar execution unit. The method of C above also has the first stream and the second stream including at least one prefetched tile. The method of C above further comprises the first stream originating from a first location in the frame buffer memory, and the second stream originating from a second location in the frame buffer memory. The method of C above comprises too the memory interface being configured to manage a plurality of streams from a plurality of different originating locations and to a plurality of different terminating locations. In this respect, at least one of the originating locations or at least one of the terminating locations is in a system memory. The method of C above also comprises implementing a plurality of memory reads to support the first stream and the second stream; and implement in a plurality of memory writes to support the first stream and the second stream by using a DMA engine built into the memory interface. Further, the method of C comprises the first stream experiencing a higher amount of latency than the second stream, wherein the first stream incorporates a larger number of buffers for storing tiles than the second stream. The method of C as well comprises the memory interface being configured to prefetch an adjustable number of tiles of the first stream or the second stream to compensate for a latency of the first stream or the second stream.

D) The method here described also includes broadly a method for latency tolerant video processing operations, comprising: implementing communication between the video processor and a host CPU by using a host interface; executing scalar video processing operations by using a scalar execution unit coupled to the host interface; executing vector video processing operations by using vector execution unit coupled to the host interface; enabling the vector execution unit to operate on a demand driven basis by accessing a memory command FIFO; implementing communication between the video processor and a frame buffer memory by using a memory interface; and implementing DMA transfers between a plurality of different memory locations by using a DMA engine built into the memory interface and configured for loading a datastore memory and an instruction cache with data and instructions for the vector execution unit. The method of D above further comprises the vector execution unit being configured to operate asynchronously with respect to the scalar execution unit by accessing the command FIFO to operate on the demand driven basis. The method of D above also comprises the demand driven basis being configured to hide a latency of a data transfer from the different memory locations to the command FIFO of the vector execution unit. Further, the method of D above comprises the scalar execution unit being configured to implement algorithm flow control processing and wherein the vector execution unit is configured to implement a majority of a video processing workload, hi this, the scalar execution unit is configured to pre-compute work parameters for the vector execution unit to hide a data transfer latency. The method of D above comprises the vector execution unit being configured to schedule a memory read via the DMA engine to prefetch commands for subsequent execution of a vector subroutine. Here, the memory read is scheduled to prefetch commands for the execution of the vector subroutine prior to calls to the vector subroutine by the scalar execution unit.

BRIEF DESCRIPTION OF THE DRAWINGS

[019] The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

[020] Figure 1 shows an overview diagram showing the basic components of a computer system in accordance with one embodiment of the present invention.

[021] Figure 2 shows a diagram depicting the internal components of the video processor unit in accordance with one embodiment of the present invention.

[022] Figure 3 shows a diagram of an exemplary software program for the video processor in accordance with one embodiment of the present invention.

[023] Figure 4 shows an example for sub-picture blending with video using a video processor and accordance with one embodiment of the present invention.

[024] Figure 5 shows a diagram depicting the internal components of a vector execution in accordance with one embodiment of the present invention.

[025] Figure 6 shows a diagram depicting the layout of a datastore memory having a symmetrical array of tiles, in accordance with one embodiment of the present invention. DETAILED DESCRIPTION OF THE INVENTION

[026] Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the preferred embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which maybe included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of embodiments of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be recognized by one of ordinary skill in the art that the present invention may be practiced without these specific details, hi other instances, well-known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the embodiments of the present invention.

Notation and Nomenclature:

[027] Some portions of the detailed descriptions, which follow, are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer executed step, logic block, process, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

[028] It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as " processing" or "accessing" or " executing" or " storing" or "rendering" or the like, refer to the action and processes of a computer system (e.g., computer system 100 of Figure 1), or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

Computer System Platform:

[029] Figure 1 shows a computer system 100 in accordance with one embodiment of the present invention. Computer system 100 depicts the components of a basic computer system in accordance with embodiments of the present invention providing the execution platform for certain hardware-based and software-based functionality. In general, computer system 100 comprises at least one CPU 101, a system memory 115, and at least one graphics processor unit (GPU) 110 and one video processor unit (VPU) 111. The CPU 101 can be coupled to the system memory 115 via the bridge component 105 or can be directly coupled to the system memory 115 via a memory controller (not shown) internal to the CPU 101. The bridge component 105 (e.g., Northbridge) can support expansion buses that connect various VO devices (e.g., one or more hard disk drives, Ethernet adapter, CD ROM, DVD, etc.). The GPU 110 and the video processor unit 11.1 are coupled to a display 112. One or more additional GPUs can optionally be coupled to system 100 to further increase its computational power. The GPU(s) 110 and the video processor unit 111 are coupled to the CPU 101 and the system memory 115 via the bridge component 105. System 100 can be implemented as, for example, a desktop computer system or server computer system, having a powerful general-purpose CPU 101 coupled to a dedicated graphics rendering GPU 110. In such an embodiment, components can be included that add peripheral buses, specialized graphics memory and system memory, IO devices, and the like. Similarly, system 100 can be implemented as a handheld device (e.g., cellphone, etc.) or a set-top video game console device such as, for example, the Xbox®, available from Microsoft Corporation of Redmond, Washington, or the PlayStation3®, available from Sony Computer Entertainment Corporation of Tokyo, Japan.

[030] It should be appreciated that the GPU 110 can be implemented as a discrete component, a discrete graphics card designed to couple to the computer system 100 via a connector (e.g., AGP slot, PCI-Express slot, etc.), a discrete integrated circuit die (e.g., mounted directly on the motherboard), or as an integrated GPU included within the integrated circuit die of a computer system chipset component (e.g., integrated within the bridge chip 105). Additionally, a local graphics memory can be included for the GPU 110 for high bandwidth graphics data storage. Additionally, it should be appreciated that the GPU 110 and the video processor unit 111 can be integrated onto the same integrated circuit die (e.g., as component 120) or can be separate discrete integrated circuit components otherwise connected to, or mounted on, the motherboard of computer system 100.

Embodiments of the present invention

[031] Figure 2 shows a diagram depicting the internal components of the video processor unit 111 in accordance with one embodiment of the present invention. As illustrated in Figure 2, the video processor unit 111 includes a scalar execution unit 201, a vector execution unit 202, a memory interface 203, and a host interface 204.

[032] hi the Figure 2 embodiment, the video processor unit (hereafter simply video processor) 111 includes functional components for executing video processing operations. The video processor 111 uses the host interface 204 to establish communication between the video processor 111 and the host CPU 101 via the bridge

105. The video processor 111 uses the memory interface 203 to establish communication between the video processor 111 and a frame buffer memory 205 (e.g., for the coupled display 112, not shown). The scalar execution unit 201 is coupled to the host interface 204 and the memory interface 203 and is configured to execute scalar video processing operations. A vector execution unit is coupled to the host interface 204 and the memory interface 203 and is configured to execute vector video processing operations. [033] The Figure 2 embodiment illustrates the manner in which the video processor 111 partitions its execution functionality into scalar operations and vector operations. The scalar operations are implemented by the scalar execution unit 201. The vector operations are implemented by the vector execution unit 202.

[034] In one embodiment, the vector execution unit 202 is configured to function as a slave co-processor to the scalar execution unit 201. In such an embodiment, the scalar execution unit manages the workload of the vector execution unit 202 by feeding control streams to vector execution unit 202 and managing the data input/output for vector execution unit 202. The control streams typically comprise functional parameters, subroutine arguments, and the like, In a typical video processing application, the control flow of the application's processing algorithm will be executed on the scalar execution unit 201, whereas actual pixel/data processing operations will be implemented on the vector execution unit 202.

[035] Referring still to Figure 2, the scalar execution unit 201 can be implemented as a RISC style scalar execution unit incorporating RISC-based execution technologies. The vector execution unit 202 can be implemented as a SIMD machine having, for example, one or more SIMD pipelines. In a 2 SIMD pipeline embodiment, for example, each SIMD pipeline can be implemented with a 16 pixel wide datapath (or wider) and thus provide the vector execution unit 202 with raw computing power to create up to 32 pixels of resulting data output per clock, hi one embodiment, the scalar execution unit 201 includes hardware configured to operate using VLIW (very long instruction word) software code to optimize the parallel execution of scalar operations on a per clock basis.

[036] In the Figure 2 embodiment, the scalar execution unit 201 includes an instruction cache 211 and a data cache 212 coupled to a scalar processor 210. The caches 211-212 interface with the memory interface 203 for access to external memory, such as, for example, the frame buffer 205. The scalar execution unit 201 further includes a vector interface unit 213 to establish communication with the vector execution unit 202. hi one embodiment, the vector interface unit 213 can include one or more synchronous mailboxes 214 configured to enable asynchronous communication between the scalar execution unit 201 and the vector execution unit 202.

[037] In the Figure 2 embodiment, the vector execution unit 202 includes a vector control unit 220 configured to control the operation of a vector execution datapath, vector datapath 221. The vector control unit 220 includes a command FIFO 225 to receive instructions and data from the scalar execution unit 201. An instruction cache 222 is coupled to provide instructions to the vector control unit 220. A datastore memory 223 is coupled to provide input data to the vector datapath 221 and receive resulting data from the vector datapath 221. The datastore 223 functions as an instruction cache and a data RAM for the vector datapath 221. The instruction cache 222 and the datastore 223 are coupled to the memory interface 203 for accessing external memory, such as the frame buffer 205. The Figure 2 embodiment also shows a second vector datapath 231 and a respective second datastore 233 (e.g., dotted outlines). It should be understood the second vector datapath 231 and the second datastore 233 are shown to illustrate the case where the vector execution unit 202 has two vector execution pipelines (e.g., a dual SIMD pipeline configuration). Embodiments of the present invention are suited to vector execution units having a larger number of vector execution pipelines (e.g., four, eight, sixteen, etc.).

[038] The scalar execution unit 201 provides the data and command inputs for the vector execution unit 202. In one embodiment, the scalar execution unit 201 sends function calls to the vector execution unit 202 using a memory mapped command FIFO 225. Vector execution unit 202 commands are queued in this command FIFO 225.

[039] The use of the command FIFO 225 effectively decouples the scalar execution unit 201 from the vector execution. unit 202. The scalar execution unit 201 can function on its own respective clock, operating at its own respective clock frequency that can be distinct from, and separately controlled from, the clock frequency of the vector execution unit 202.

[040] The command FIFO 225 enables the vector execution unit 202 to operate as a demand driven unit. For example, work can be handed off from the scalar execution unit 201 to command FIFO 225, and then accessed by the vector execution unit 202 for processing in a decoupled asynchronous manner. The vector execution unit 202 would thus process its workload as needed, or as demanded, by the scalar execution unit 201. Such functionality would allow the vector execution unit 202 to conserve power (e.g., by reducing/stopping one or more internal clocks) when maximum performance is not required. [041] The partitioning of video processing functions into a scalar portion (e.g., for execution by the scalar execution unit 201) and a vector portion (e.g., for execution by the vector execution unit 202) allow video processing programs built for the video processor 111 to be compiled into separate scalar software code and vector software code. The scalar software code and the vector software code can be compiled separately and subsequently linked together to form a coherent application.

[042] The partitioning allows vector software code functions to be written separately and distinct from the scalar software code functions. For example, the vector functions can be written separately (e.g., at a different time, by different team of engineers, etc.) and can be provided as one or more subroutines or library functions for use by/with the scalar functions (e.g., scalar threads, processes, etc.). This allows a separate independent update of the- scalar software code and/or the vector software code. For example, a vector subroutine can be independently updated (e.g., through an update of the previously distributed program, a new feature added to increase the functionality of the distributed program, etc.) from a scalar subroutine, ύr vice versa. The partitioning is facilitated by the separate respective caches of the scalar processor 210 (e.g., caches 211- 212) and the vector control unit 220 and vector datapath 221 (e.g., caches 222-223). As described above, the scalar execution unit 201 and the vector execution unit 202 communicate via the command FIFO 225.

[043] Figure 3 shows a diagram of an exemplary software program 300 for the video processor 111 in accordance with. one embodiment of the present invention. As depicted in Figure 3, the software program 300 illustrates attributes of a programming model for the video processor 111, whereby a scalar control thread 301 is executed by the video processor 111 in conjunction with a vector data thread 302.

[044] The software program 300 example of the Figure 3 embodiment illustrates a programming model for the video processor 111, whereby a scalar control program (e.g., scalar control thread 301) on the scalar execution unit 201 executes subroutine calls (e.g., vector data thread 302) on the vector execution unit 202. The software program 300 example shows a case where a compiler or software programmer has decomposed a video processing application into a scalar portion (e.g., a first thread) and a vector portion (e.g., a second thread).

[045] As shown in Figure 3, the scalar control thread 301 running on the scalar execution unit 201 is computing work parameters ahead of time and feeding these parameters to the vector execution .unit 202, which performs the majority of the processing work. As described above, the software code for the two threads 301 and 302 can be written and compiled separately.

[046] The scalar thread is responsible for following:

1. Interfacing with host unit 204 and implementing a class interface;

2. Initialization, setup and configuration of the vector execution unit 202; and

3. Execution of the algorithm in work-units, chunks or working sets in a loop, such that with each iteration; a. the parameters for current working set are computed; b. the transfer of the input data into vector execution unit is initiated; and c. the transfer of the output data from vector execution unit is initiated.

[047] The typical execution model of the scalar thread is "fire-and-forget". The term fire-and-forget refers to the attribute whereby, for a typical model for a video baseband processing application, commands and data are sent to the vector execution unit 202 from the scalar execution unit 201 (e.g., via the command FIFO 225) and there is no return data from the vector execution unit 202 until the algorithm completes.

[048] In the program 300 example of Figure 3, the scalar execution unit 201 will keep scheduling work for vector execution unit 202 until there is no longer any space in command FIFO 225 (e.g., !end_of_alg & !cmd_fifo_full). The work scheduled by the scalar execution unit 201 computes parameters and sends these parameters to the vector subroutine, and subsequently calls the vector subroutine to perform the work. The execution of the subroutine (e.g., vector_funcB) by the vector execution unit 202 is delayed hi time, mainly to hide the latency from main memory (e.g., system memory 115). Thus, the architecture of the video processor 111 provides a latency compensation mechanism on the vector execution unit 202 side "for both instruction and data traffic. These latency compensation mechanisms are described in greater detail below.

[049] It should be noted that the software program 300 example would be more complex in those cases where there are two or more vector execution pipelines (e.g., vector datapath 221 and second vector datapath 231 of Figure 2). Similarly, the software program 300 example would be more complex for those situations where the program 300 is written for a computer system having two vector execution pipelines, but yet retains the ability to execute on a system having a single vector execution pipeline.

[050] Thus, as described above in the discussion of Figure 2 and Figure 3, the scalar execution unit 201 is responsible for initiating computation on the vector execution unit 202. In one embodiment, the commands passed from the scalar execution unit 201 to the vector execution unit 202 are of the following main types:

1. Read commands (e.g., memRd) initiated by the scalar execution unit 201 to transfer current working set data from memory to data RAMs of the vector execution unit 202;

2. Parameter passing from the scalar execution unit 201 to the vector execution unit 202; 3. Execute commands in the form of the PC (e.g., program counter) of the vector subroutine to be executed; and

4. Write commands (e.g., memWr) initiated by scalar execution unit 201 to copy the results of the vector computation into memory.

[051] In one embodiment; upon receiving these commands the vector execution unit 202 immediately schedules the memRd commands to memory interface 203 (e.g., to read the requested data from the frame buffer 205). The vector execution unit 202 also examines the execute commands and prefetches the vector subroutine to be executed (if not present in the cache 222).

[052] The objective of the vector execution unit 202 in this situation is to schedule ahead the instruction and data steams of the next few executes while the vector, execution unit 202 is working on current execute. The schedule ahead features effectively hide the latency involved in fetching instructions/data from their memory locations, hi order to make these read requests ahead of time, the vector execution unit 202, the datastore (e.g., datastore 223), and the instruction cache (e.g., cache 222) are implemented by using high speed optimized hardware.

[053] As described above, the datastore (e.g., datastore 223) functions as the working RAM of the vector execution unit 202. The scalar execution unit 201 perceives and interacts with the datastore as if it were a collection of FIFOs. The FIFOs comprise the "streams" with which the video processor 111 operates. In one embodiment, streams are generally input/output FIFOs that the scalar execution unit 201 initiates the transfers (e.g., to the vector execution unit 202) into. As described above, the operation of the scalar execution unit 201 and the vector execution unit 202 are decoupled.

[054] Once the input/output streams are full, a DMA engine within the vector control unit 220 stops processing the command FIFO 225. This soon leads to the command FIFO 225 being full. The scalar execution unit 201 stops issuing additional work to the vector execution unit 202 when the command FIFO 225 is full.

[055] In one embodiment, the vector execution unit 202 may need intermediate streams in addition to the input and output streams. Thus the entire datastore 223 can be seen as a collection of streams with respect to the interaction with the scalar execution unit 201. [056] Figure 4 shows an example for sub-picture blending with video using a video processor in accordance with one embodiment of the present invention. Figure 4 shows an exemplary case where a video surface is blended with a sub-picture and then converted to an ARGB surface. The data comprising the surfaces are resident in frame buffer memory 205 as the Luma parameters 412 and Chroma parameters 413. The sub- picture pixel elements 414 are also resident in the frame buffer memory 205 as shown. The vector subroutine instructions and parameters 411 are instantiated in memory 205 as shown.

[057] In one embodiment, each stream comprises a FIFO of working 2D chunks of data called "tiles". In such an embodiment, the vector execution unit 202 maintains a read tile pointer and a write tile pointer for each stream. For example, for input streams, when a vector subroutine is executed, the vector subroutine can consume, or read, from a current (read) tile. In the background, data is transferred to the current (write) tile by memRd commands. The vector execution unit can also produce output tiles for output streams. These tiles are then moved to memory by memWr() commands that follow the execute commands. This effectively pre-fetches tiles and has them ready to be operated on, effectively hiding the latency.

[058] In the Figure 4 sub-picture blending example, the vector datapath 221 is configured by the instantiated instance of the vector sub routine instructions and parameters 411 (e.g., &v_subp_blend). This is shown by the line 421. The scalar execution unit 201 reads in chunks (e.g., tiles) of the surfaces and loads them into datastore 223 using the DMA engine 401 (e.g., within the memory interface 203). The load operation is shown by line 422, line 423, and line 424.

[059] Referring still to Figure 4, since there are multiple input surfaces, multiple input streams need to be maintained. Each stream has a corresponding FIFO. Each stream can have different number of tiles. The Figure 4 example shows a case where the sub-picture surface is in system memory 115 (e.g., sub-picture pixel elements 414) and hence would have additional buffering (e.g., n, n+1, n+2, n+3, etc.), whereas the video stream (e.g., Luma 412, Chroma 413, etc.) can have a smaller number of tiles. The number of buffers/FIFOs used can be adjusted in accordance with the degree of latency experienced by stream.

[060] As described above, the datastore 223 utilizes a look ahead prefetch method to hide latency. Because of this, a stream can have data in two or more tiles as the data is prefetched for the appropriate vector datapath execution hardware (e.g., depicted as FIFO n, n+1 , n+2, etc.).

[061] Once the datastore is loaded, the FIFOs are accessed by the vector datapath hardware 221 and operated upon by the vector subroutine (e.g., subroutine 430). The results of the vector datapath operation comprises' an output stream 403. This output stream is copied by the scalar execution unit 201 via the DMA engine 401 back into the frame buffer memory 205 (e.g., ARGB_OUT 415). This shown by the line 425. [062] Thus, embodiments of the present invention utilize an important aspect of stream processing, which is the fact that data storage and memory is abstracted as a plurality of memory titles. Hence, a stream can be viewed as a sequentially accessed collection of tiles. Streams are used to prefetch data. This data is in the form of tiles. The tiles are prefetched to hide latency from the particular memory source the data originates from (e.g., system memory, frame buffer memory, or the like). Similarly, the streams can be destined for different locations (e.g., caches for vector execution unit, caches for scalar execution unit, frame buffer memory, system memory, etc.). Another characteristic of streams is that they generally access tiles in a lookahead prefetching mode. As described above, the higher the latency, the deeper the prefetching and the more buffering that is used per stream (e.g., as depicted in Figure 4).

[063] Figure 5 shows a diagram depicting the internal components of a vector execution unit in accordance with one embodiment of the present invention. The diagram of Figure 5 shows an arrangement of the various functional units and register/SRAM resources of the vector execution unit 202 from a programming point of view.

[064] In the Figure 5 embodiment, the vector execution unit 202 comprises a VLIW digital signal processor optimized for the performance of video baseband processing and the execution of various codecs (compression-decompression algorithms). Accordingly, the vector execution unit 202 has a number of attributes directed towards increasing the efficiency of the video processing/codec execution.

[065] In the Figure 5 embodiment, the attributes comprise: 1. Scalable performance by providing the option for the incorporation of multiple vector execution pipelines;

2. The allocation of 2 data address generators (DAGs) per pipe;

3. Memory/Register operands; 4. 2D (x,y) pointers/iterators;

5. Deep pipeline (e.g., 11-12) stages;

6. Scalar (integer)/branch units;

7. Variable instruction widths (Long/Short instructions);

8. Data aligners for operand extraction; 9. 2D datapath (4x4) shape of typical operands and result; and

10. Slave vector execution unit to scalar execution unit, executing remote procedure calls.

[066] Generally, a programmer's view of the vector execution unit 202 is as a SIMD datapath with 2 DAGs 503. Instructions are issued in VLIW manner (e.g., instructions are issued for the vector datapath 504 and address generators 503 simultaneously) and are decoded and dispatched to the appropriate execution unit by the instruction decoder 501. The instructions are of variable length, with the most commonly used instructions encoded in short form. The full instruction set is available in the long form, as VLIW type instructions.

[067] The legend 502 shows three clock cycles having three such VLIW instructions. In accordance with the legend 510, the uppermost of the VLIW instructions 502 comprises two address instructions (e.g., for the 2 DSGs 503) and one instruction for the vector datapath 504. The middle VLIW instruction comprises one integer instruction (e.g., for the integer unit 505), one address instruction, and one vector instruction. The lower most VLIW instruction comprises a branch instruction (e.g., for the branch unit 506), one address instruction, and one vector instruction.

[068] The vector execution unit can be configured to have a single data pipe or multiple data pipes. Each data pipe consists of local RAM (e.g., a datastore 511), a crossbar 516, 2 DAGs 503, and a SIMD execution unit (e.g., the vector datapath 504). Figure 5 shows a basic configuration for explanatory purposes, where only 1 data pipe is instantiated. When 2 data pipes are instantiated, they can run as independent threads or as cooperative threads.

[069] Six different ports (e.g., 4 read.and 2 write) can be accessed via an address register file unit 515. These registers receive parameters from the scalar execution unit or from the results of the integer unit 505 or the address unit 503. The DAGs 503 also function as a collection controller and manages the distribution of the registers to address the contents of the datastore 511 (e.g., RAO, RAl, RA2, RA3, WAO, and WAl). A crossbar 516 is coupled to allocate.the output data ports RO, Rl, R2, R3 in any order/combination into the vector datapath 504 to implement a given instruction. The output of the vector datapath 504 for can be fed back into the datastore 511 as indicated (e.g., WO). A constant RAM 517 is used to provide frequently used operands from the integer unit 505 to the vector datapath 504, and the datastore 511.

[070] Figure 6 shows a diagram depicting a plurality of banks 601-604 of a memory 600 and a layout of a datastore having a symmetrical array of tiles 610 in accordance with one embodiment of the present invention. As depicted in Figure 6, for explanatory purposes, only a portion of the datastore 610 is shown. The datastore 610 logically comprises an array (or arrays) of tiles. Each tile is an array of sub-tiles of 4x4 shape. Physically, as shown by the memory 600, the data store 610 is stored in an array of "N" physical banks of memory (e.g., banks 601-604).

[071] Additionally, the data store 610 visually depicts a logical tile in a stream. In the Figure 6 embodiment, this tile is 16 bytes high and 16 bytes wide. This tile is an array of subtiles (in this example 4x4). Each subtile is stored in a physical bank. This is shown in Figure 6 by the number within each 4x4 subtile, in a case where there are 8 banks of physical memory (e.g., banks 0 through 7). The organization of subtiles in banks is done such that there is no common bank in 2 x 2 arrangement of subtitles. This makes any unaligned access (e.g., in both x and y direction) possible without any bank collision.

[072] The banks 601-604 are configured to support accesses to different tiles of each bank. For example, in one case, the crossbar 516 can access a 2 x 4 set of tiles from bank 601 (e.g., the first two rows of bank 601). hi another case, the crossbar 516 can access a 1 x 8 set of tiles from two adjacent banks. Similarly, in another case, the crossbar 516 can access an 8 x 1 set of tiles from two adjacent banks. In each case, the DAGs/collector 503 can receive the tiles as the banks are accessed by the crossbar 516, and provide those tiles to the front end of the vector datapath 504 on a per clock basis. [073] In this manner, embodiments of the present invention provide a new video processor architecture that supports sophisticated video processing functions while making efficient use of integrated circuit silicon die area, transistor count, memory speed requirements, and the like. Embodiments of the present invention maintain high compute density and are readily scalable to handle multiple video streams. Embodiments of the present invention can provide a number of sophisticated video processing operations such as, for example, MPEG-2/WMV9/H.264 encode assist (e.g., In-loop decoder), MPEG- 2/WMV9/H.264 decode (e.g., post entropy decoding), and In Loop/Out of loop deblocking filters.

[074] Additional video processing operations provided by embodiments of the present invention include, for example, advanced motion adaptive deinterlacing, input noise filtering for encoding, polyphase scaling/resampling, and sub-picture compositing. The video processor architecture of the present invention can also be used for certain video processor-amplifier (procamp) applications such as, for example, color space conversion, color space adjustments, pixel point operations such as sharpening, histogram adjustment, and various video surface format conversions.

[075] Broadly and without limitation, this writing has disclosed the following. A latency tolerant system for executing video processing operations is described. The system includes a host interface for implementing communication between the video processor and a host CPU, a scalar execution unit coupled to the host interface and configured to execute scalar video processing operations, and a vector execution unit coupled to the host interface and configured to execute vector video processing operations. A command FIFO is included for enabling the vector execution unit to operate on a demand driven basis by accessing the memory command FIFO. A memory interface is included for implementing communication between the video processor and a

frame buffer memory. A DMA engine is built into the memory interface for implementing DMA transfers between a plurality of different memory locations and for loading the command FIFO with data, and instructions for the vector execution unit. A video processor for executing video processing operations is described. The video processor includes a host interface for implementing communication between the video processor and a host CPU. A memory interface is included for implementing communication between the video processor and a frame buffer memory. A scalar execution unit is coupled to the host interface and^' the memory interface and is configured to execute scalar video processing operations. = A vector execution unit is coupled to the host interface and the memory interface and is configured to execute vector video processing operations. A multidimensional datapath processing system for a video processor for executing video processing operations is described. The video processor includes a scalar execution unit configured to execute scalar video processing operations and a vector execution unit configured to execute vector video processing operations. A data store memory is included for storing data for the vector execution unit. The data store memory includes a plurality of tiles having symmetrical bank data structures arranged in an array. The bank data structures are configured to support accesses to different tiles of each bank. A stream based memory access system for a video processor for executing video operations is described. The video processor includes a scalar

execution unit configured to execute scalar video processing operations and a vector

execution unit configured to execute vector video processing operations. A frame buffer memory is included for storing data for the scalar execution unit and the vector execution unit. A memory interface is included for establishing communication between the scalar execution unit and the vector execution unit and the frame buffer memory. The frame buffer memory comprises a plurality of tiles. The memory interface implements a first sequential access of tiles and implements a second stream comprising a second sequential access of tiles for the vector execution unit or the scalar execution unit.

[076] The foregoing descriptions of specific embodiments of the present invention have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order .to best explain the principles of the invention and its practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.

1 Overview

VP2 is a VLIW SIMD video DSP coupled to scalar control processor. Its main focus is video codecs and video baseband processing.

1.1 Sml of VP2.0

* Efficiency: VP2.0 will be a compute efficient machine for video applications in terms of perf/mm 2 and perf/mW.

» Programmability: It will be a highly programmable, easily compilable and safer to program machine,

* Scalability: VP2.0 design/architecture should be scalable to match the performance requirements for multiple application areas.

1.2 Design Goals

* Compute density o Offer sigttificaat perf mm2 advairtage over VP LO. o Efficient implementation of new application areas like H.264.

* Latency tolerance in HW Un-burdeaϊng the SW developer o Hiding the data fetch latencies by reordering memory accesses and compute, o Auto prefetch of instruction streams.

* Hiding data-path latencies o Selective forwarding of intermediate results. o Streaming computation model

* Scalability: o Architecturally VP2 vector unit can scale its datapath up by 2x and down by 1/2x. o The frequency improvements can be achieved by selective re-pipelining.

1.3 Application Targets

VP2.0 design and instruction set would be optimized to do the folio wing applications very efficiently.

* mpeg2/wmv9/H.264 encode assist (In loop decoder)

* mpeg2/wrav9/H,264 decode (post entropy decoding)

* In Loop/Out of loop deblocking filters.

* Advanced motion adaptive de-interlacing

* Input noise filtering for encoding

* Polyphase scaling/resampling

* Sub-picture compositing

* prooarnp, color space conversion, adjustments, pixel point operations such as sharpening, histogram adjustment etc.

* various video surface format conversion support

Architecturally VP2.0 can be efficient in following areas 2D primitives, blits, rotates etc.

Refinement-based software motion estimation algorithms.

16/32-bit MAC applications

2 Top level architecture

VP2.0 machine has been partitioned, into scalar and vector processors. Vector processor acts as a slave co-processor to scalar processor. Scalar processor is responsible to feed control streams (parameters, subroutine arguments) to vector processor and also manage tlie data I/O into vector processor, AU the control flow of an algorithm will be executed on scalar machine whereas actual pixel/data processing operations will be done on vector processor.

Scalar processor will be a typical MSC style scalar and vector coprocessor is a SEMD machine, with 1 or 2 SIMD pipes (each SIMD pipe has 16 pixel datapath). Thus, Vector coprocessor could create up to 32 pixels of result as a raw compute power.

Scalar processor sends function calls to vector coprocessor using a memory mapped command FIFO. Coprocessor commands are queued in this FIFO. Scalar processor is completely decoupled from vector processor using this FIFO. Scalar processor can run on its own clock. Vector processor operates as a demand driven unit,

Top level diagram of the VP2.0 is given below.

VP2.0 programs can be compiled into separate scalar code and vector code and then can later be linked together. Separately, vector functions can be written separately and can be provided as a subroutine or library functions to scalar threads. Scalar processor has its own instruction and data cache. Vector tiait also has an instruction cache and a data RAM (referred to as datastore). These two engines, are decoupled and communicate through FIFO's.

3 Simple Programming Model

Simplest programming model for VPLO is a scalar control program executing subroutine calls on vector slave co-processor. There is a inherent assumption here is that the programmer, has decomposed the problem these 2 threads. The thread running that is running on the scalar processor is computing work parameters ahead of rime and feeding them to the vector processor which is the main workhorse. The programs for these two threads are expected to written and compiled separately.

The scalar thread is responsible for following.

1. Interfacing with host unit and implementing the class interface.

2. Initialization, setup and configuration of vector unit

3. Execution of the algorithm in work-units, chunks or Working sets in a loop such that each iteration a. Compute the parameters for current working set. b. Initiate the transfer of the input data into vector processor, c. Initiate the transfer of the output data from vector processor.

The typical execution model of the scalar thread is fire and forget. This is expected to be the typical model for video baseband processing where there is no return data from vector co-processor. The scalar processor will keep scheduling work for vector processor until there is space in command FIFO, The execution of the subroutine by the vector processor is delayed in time, mainly due to latency from main memory. Thus its important to provide latency compensation mechanism on the vector side. In VP2.0 the vector processor provides latency compensation for both instruction and data traffic. The mechanisms for that are outlined in section.

A typical VP program will look like this

More complex programming model is when we have 2 data pipes. Or when we want to write code for 2 data pipes and have it execute on a 1 data pipe machine. The programming model for that is explored in section 6.

4 Streaming model.

As outlined before the scalar engine is responsible for initiating computation on vector processor. The commands passed from scalar engine to vector engine are of the following main, types

1. Read commands (memRd) initiated by scalar to transfer current working set data from memory to data RAMs of the vector engine.

2. Parameter passing from scalar to vector.

3. Execute commaads in the form of the PC of the vector subroutine to be executed. ^"4. Write commands (mem Wr) initiated by scalar to copy the results of the vector computation into memory. Upon receiving these commands the vector processor immediately schedules the memRd commands to frame buffer (FB) interface, ϊt also examines the execute commands and prefetches the vector subroutine to be executed (if not present in the cache). One objective is to schedule ahead the instruction aaddata steams of the next few executes while the vector engine it working on current execute. Ia order to make these read requests ahead of time, the vector engine manages the datastore and instruction cache in hardware.

Datastore is the working RAM of the vector processor. Scalar processor sees this datastore as a collection of FIFO's or streams. Streams are essentially input/output FIFOs that scalar initiates the transfers into. Once the input/output streams are full vector DMA engine stops processing the command FIFO from scalar, soon making it full. Thus scalar stops issuing more work to vector engine. In addition to the input and output streams me vector may need intermediate streams. Thus the entire datastore can be seen as a collection, of streams from scalar side. Bach stream is a FIFO of working 2D chunks called as Tiles. Vector processor maintains a read tile pointer and write tile pointer for each stream. For input streams, when a vector subroutine is executed it can consume or read from current (read) tile. In the background data is transferred to the current (write) tile try memRd commands. Vector processor can also produce output tiles for output streams. These tiles are then moved to memory by memWr() commands what follow the execute commands.

This model is illustrated by the example for sub-picture blending with video. For example, consider a simplistic example where a video surface (e.g., NV 12 format) is blended with, sub-picture and then converted to an ARGB surface. These surfaces are resident in memory. The scalar processor reads in chunks (Tiles) of these surfaces and loads them into datastore. Since there are multiple input surfaces we have to maintain multiple input streams. Bach stream can have different number of tiles (e.g., in this example we could assume that sub-picture surface is in system memory hence we should buffer it more), whereas video stream can have a smaller number of tiles.

5 Vector Co-Processor

Vector Co-processor of the VP2 is a VLIW DSP designed for video baseband processing and codecs. Some important design attributes of this processor comprise:

1. Scaleable performance, I or 2 data pipes.

2. Each pipe lias 2 Data address generators (DAGs)

3. Memory/Register operands

4. 2D (x,y) pointers/iterators

5. Deep pipeline (11-12) stages

6. Scalar (ittteger)/branch units

7. Variable instπxctioa widths (Long/Short instructions)

8. Data aligners for operand extraction

9. 2D datapath (4x4) shape of typical operands and result

10. Slave processor to scalar processor, executing remote procedure calls.

The programmer's view of the vector co-processor in simplest terms is a SMD datapath with 2 DAGs. Instructions are issued in VLIW manner (Le., instructions are issued for the vector datapath and address generators simultaneously). The instructions are of variable length, with the most commonly used instructions encoded in short form. The foil instruction set is available in the long form. For example, from a programmer's point of view the arrangement of various functional units and register/SRAM resources is as shown below.

The vector unit instantiate a single data-pipe or dual data-pipes. Each data pipe consists of local RAM (datastore), 2 DAGs, and a SIMD execution unit, In baste configuration only 1 data-pipe is present. When 2 data-pipes are present they can run as independent threads or as cooperative threads. The complete pipeline diagram of the vector processor is described below. Here a full configuration with 2 data-pipes.

6 Advanced Programming models

In section 3, an RPC model was introduced to illustrate the basic architecture. In this section more advanced concepts are introduced.

6.1 Dual data-pipe configuration

In dual pipe- configuration the following resources of the processor are shared.

• Scalar controller

• Vector control Unit in vector-coprocessor

• DMA engine for instruction/data fetch

• Instruction cache (may be dual ported)

The following resources are duplicated

• Data pipes ( Address/Branch/Veetor execution units)

• Datastore

• Register files

It should be noted that:

1. A program can be written for 2 -pipes on a instance with only 1-pipe. Vector control unit wilt map executes for each pipe on same physical pipe. However since streams for both pipes are present in just one data store, the data store size to be adjusted. A simple way is to cut the Tile size or the number of tiles in stream in half. This would be done by the scalar thread at the configuration time. There are issues like duplication of global registers Mid stream mapping that need iσ be resolved at micro architecture stage,.

2. The program written for 1-pipe cati run on instance with 2-pipes. However this code will run only on one pipe and not use the other. The machine would be half idle.

3. A program can be written for 2~ρipes each running 2 completely different threads. This may not be preferrable since we have only a single scalar that is not multi¬ threaded. Since we support only one scalar execution thread, this may not be preferable, however this model may be supported.

4. A-program can be written for 2-pipes each running same thread. This is a typical model that is expected for paratlclizable algorithms such as most video base-band processing. This allows to use same instruction stream to operate on two strips of a video or two half etc. Each data-pipe has its own execution unit and data-store. Scalar controller has to feed 2 data-pipes, However the parameters, read and write commands are related to each other (offset) hence scalar performance requirement does not exactly double up. An example of this model is shown below,

5. A program can be written with 2-cooperating threads. This is a model expected for codecs where we have a single scalar control thread but multiple functional vector functional blocks may be need to be connected together. This resembles a direct-show pin model of functional blocks. An example of such application is shown below. This model as restricted to only two co-operative threads since we only have 2 data pipes. Another caveat is that threads should be balanced between the two threads. Otherwise there is loss of performance. Within, these constraints this model would work on 2 data pipes and also can be scaled back to a single pipe.

Two data-pipe can be synchronized with each other, The basic approach to synchronization is data driven. The vector functions are executed when the data is available to process. The streams are filled by reads from memory or writes from the other data-pipes. Once data is available the vector control unit will activate the execute and run it. The streams can also be used as counting semaphores. Both scalar controller and vector data-pipes can increment and decrement tine tile pointers and use a stream descriptor as a counting semaphore even when there are no data transfers.

- Supplemental Overview:

In general, embodiments of the invention perform the following:

1. Decomposing a media algorithm into scalar and vector part.

Off the shelf scalar design and it also gives us ability to run scalar and vector part at different clock speeds based on their power and performance requirements,

2. Stream processing.

3.2D datapath processing.

4. Latency hiding (for both data and command fetches)

Application areas: Crypto:

Opcode hiding

Encryption program can just sit on chip. Scalar/controller block Just requests a particular operation to be performed and encryption engine will fetch the instructions etc. Since scalar cannot even see what algorithm is being run, it's very secure, itgiyes a mechanism for hiding encryption algorithm from user,

2D

VP2 instruction set architecture supports instructions for 20 processing. These include ROP3 and ROP4 support used In many GUI/window systems. This allows the media processor to run 2D ops on media processor- Inherent advantage here is power saving.

ISA

Condition code as an instruction slot:

We have separate issue slot (in our multi -issue instruction bundle), for condition code operations. Prior art is people use SIMD instruction that can also affect condition codes/predicate registers. But with the approach taken in VP2 data processing and predicate register processing can be independently scheduled resulting in higher performance.

Memory I/O

Micro coded DMA engine;

DMA engine can be programmed {or can have its own smait microcode) to perform various operations like data prefetching for streams, formatting the data formats, edge padding etc. ϊn genera!, a programmable DMA engine and nσts hard wired functionality. Thus the combination of an memory I/O processor with media processing core Increases overall system level performance. Media processor core is offloaded from having to do data I/O processing.

Memory Hierarchy Architecture:

In VP2 architecture, the memory hierarchy is optimized to minimize memory BW as well as provide latency compensation. Many different schemes are provided such as First level of streaming daiastαre that is visible to vector core as a scratch ram. Managed by HW to took ahead into request stream generated by scalar processor. This datastore is optionally backed by an L2 cache for data reuse, L2 cache can be partitioned into individual sectors on stream basis. L1 cache that is backed by streaming datastore. Daiastøre has prβfetched the next relevant data set.

- Cache using a Stream pointer as data-tags.

- Using scalar generated stream address to prefetch/cache L1 datastore and L2 caches.

Optimized Scalar to vector communication link:

MemRd/Wr format:

Compact commands from scalar for reading and writing system memory into the local memory. Saves on the control flow bandwidth needed to manage tte DMA engine. At the same time not restricting the typea of transactions supported.

Speculation on scalar 2 vector for vector L2:

Parameter compressions with support for parameter modifier and iterators for reducing the bandwidth of communication.

Pipeline cache:

Pipelined instruction cache. Variety of schemes are supported such as

Managing the life oycfe of each cache line by tracking the executions in flight between vector and scaler processor. This ailows the instructions to be ready before the vector processor starts execution. If the instructions are not already in cache they are prefetches.

For small iatency configurations, instruction cache minimized by turning it into a small-RFO. The executions already in FIFO can be reused otherwise they are fetched again.

Overaϋ architecture:

Datastore can be shared between various processing elements. Theses communicate through streams and can feed each other. The architecture envisions a set of heterogeneous functional units like StMD vector cores, DMA engines, fixed function units connected through streams.

Computertioras/DP

Arbitrary / Flexible shapes / half pipe:

Datapath operates on variable shapes. The shape of the datapath can be configured to match the problem set. Typicaily people do 1D datapaths. VP2 can process shapes can be variabte size 4x4, 8x4 or 16x1 etc to match the algorithm.

Scalability:

VP2 datapath architecture uses an Instruction convoying technique (Note: We have 16-way SlMD pipe, where each operand is 1-byte wide. We can have 8-way SIMD pipe (group 2 pipes together) and have wider SIMD datapath with each operand as 2 bytes and similarly we can have 4-way SIMD pipe (group 4 pipes together) and have wider SlMD datapath with each operand as 4 bytes.) to execute wider SIMD instructions on narrower datapath over multiple cycles to save area. e.g., VP2 can scale datapath from 16-way SlMD to 8-way SlMD.

Coupling bytes lanes. Coupling SIMD ways to increase the operand width. Example, currently 16-way SIMD with 8 bit operands. Can increase it to 16-bit operands wfJh S-way SIMD and 32Mf operands with 4-way SIMD.

SIMD address generators

Separate stream address generators for each way of the SfMD pipe. VP2 can use SIMD address generators whose requests are coalesced into Minimal accesses to datastore.

Date expansion using crossbars and collectors

Abifiiy to create more data operands using the crossbars. Reduces the read port pressure and saves power.

X2 Instructions:

Not all instructions can use alt the HW elements (adders/multipliers) in the datapath. So we can for simple instructions like add/sub we can process wider Data shapes than for complex instructions. So instead of limiting the performance to the least common size, VP2 uses flexible instruction set that opportunistically tries to operate on wider shape as long as the read ports can sustain the operation bandwidth.

Multithreaded/Multicore media processing.

VP2 architecture supports various multithreading options such as

Multithreaded scalar processor schedules procedure calls on multiple vector units connected through streams.

Multiple threads running on single vector engine on instruction/by instruction or execute py execute thread switching.

Power management using different vector/scalar

With decoupled scalar and vector part, you can run these 2 blocks at different speeds based on power and performance requirements.

Context-switch;

This media processor has the ability to support very fast context switches due to its register-leas architecture. HW support exists for tracking scalar2vector command queue and saving and replaying it to achieve context switching. Also context switches can be initiated on page faults.

This enables the media processor to maintain the real time processing task like input/output display processing while being able to support non-reai time tasks like 2D acceleration or just in time video enhancement to feed the display pipeline.

This context switch capability along with its instruction set allows VP2 to be a unified pixel/codec processing.

Datastore Organization:

VP2 uses a datastore organization that has following properties

Up to 16 pixels in each direction can be accessed without Bank conflicts. This is done while keeping the stride requirements to minimum. Datastore organization allows efficient transpose of data shapes.

2D addressing is supported inside datastore, elimination SW computation of linear addresses in most media processing applications like video.

51

Claims

CLAIMSWhat is claimed is:

1. A system comprising: a scalar execution unit configured to execute scalar video processing operations; a vector execution unit configured to execute vector video processing operations; a data store memory for storing data for the vector execution unit, wherein the data store memory comprises a plurality of tiles comprising symmetrical bank data structures arranged in an array, and wherein the bank data structures are configured to support accesses to different tiles of each bank.

2. The system of claim 1 wherein said system is a multidimensional datapath processing system for said video processor for executing video processing operations.

3. A system for multidimensional datapath processing to support video processing operations, comprising: a motherboard; a host CPU coupled to the motherboard; a video processor coupled to the motherboard and coupled to the CPU and comprising the system of Claim 1.

4. The system of claim 1, 2 or 3, wherein each of the bank data structures includes a plurality of tiles arranged in a 4 x 4 pattern.

5. ^' The system of claim 1, 2 or 3, wherein each of the bank data structures includes a plurality of tiles arranged in a 8 x 8, 8 x 16, or 16 x 24 pattern.

6. The system of claim 1, 2 or 3, wherein the bank data structures are configured to support accesses to different tiles of each bank data structure, and wherein at least one access to two adjacent bank data structures comprising a row of tiles of the two bank data structures.

7. The system of claim 1, 2 or 3, wherein the tiles are configured to support accesses to different tiles of each bank data structure, and wherein at least one access is to two adjacent bank data structures comprising column of tiles of the two adjacent bank data structures.

8. The system of claim 1, 2 or 3, further comprising: a crossbar coupled to the data store and for selecting a configuration for accessing tiles of the plurality of bank data structures.

9. The system of claim 8, wherein the crossbar accesses the tiles of the plurality of bank data structures to supply data to a vector datapath on a per clock basis.

10. The system of claim 9, further comprising a collector for receiving the tiles of the plurality of bank data structures accessed by the crossbar and providing the tiles to a front end of the vector datapath on a per clock basis.

11. A video processor for executing video processing operations, comprising: a host interface for implementing communication between the video processor and a host CPU; a memory interface for implementing communication between the video processor and a frame buffer memory; a scalar execution unit coupled to the host interface and the memory interface and configured to execute scalar video processing operations; and a vector execution unit coupled to the host interface and the memory interface and configured to execute vector video processing operations.

12. A system for executing video processing operations, comprising: a motherboard; a host CPU coupled to the motherboard; said video processor of claim 11, coupled to the motherboard and coupled to the CPU.

13. The video processor of claim 11, wherein the scalar execution unit functions as a controller of the video processor and controls the operation of the vector execution unit

14. The video processor of claim 11, further comprising a vector interface unit for interfacing the scalar execution unit with the vector ^v execution unit.

15. The video processor of claim 11, wherein the scalar execution unit and the vector execution unit are configured to operate asynchronously.

16. The video processor of claim 15 or system of claim 12, wherein the scalar execution unit.executes at a first clock frequency and the vector execution unit executes at a second clock frequency.

17. The video processor of claim 11 or the system of claim 12, wherein the scalar execμtion unit is configured to execute flow control algorithms of

an application and the vector execution unit is configured to execute pixel processing operations of the application.

18. The video processor of claim 17, wherein the vector execution unit is configured to operate on a demand driven basis under the control of the scalar execution unit.

19. The video processor or system of claim 17, wherein the scalar execution unit is configured to send function calls to the vector execution unit using a command FIFO, and wherein the vector execution unit operates on a demand driven basis by accessing the command FIFO.

20. The video processor or system of claim 17, wherein the asynchronous operation of the video processor is configured to support a separate independent update of a vector subroutine or a scalar subroutine of the application.

21. The video processor of claim 11, wherein the scalar execution unit is configured to operate using VLIW (very long instruction word) code.

22. A stream based memory access system for a video processor for executing video processing operations, comprising: a scalar execution unit configured to execute scalar video processing operations; a vector execution unit configured to execute vector video processing operations; and the vector execution unit; and a memory interface for implementing communication between the scalar execution unit and the vector execution unit and the frame buffer memory, wherein the frame buffer memory comprises a plurality of tiles and wherein the memory interface implements a first stream comprising a first sequential access of tiles and implements a second stream comprising a second sequential access of tiles for the vector execution unit or the scalar execution unit.

23. A system for executing stream based memory accesses to support video processing operations, comprising: a motherboard; a host CPU coupled to the motherboard; a video processor coupled to the motherboard and coupled to the CPU, comprising: a host interface for establishing communication between the video processor and the host CPU; a scalar execution unit coupled to the host interface and configured to execute scalar video processing operations; a vector execution unit coupled to the host interface and configured to execute vector video processing operations; and a memory interface coupled to the scalar execution unit and the vector execution unit and for establishing stream based communication between the scalar execution unit and the vector execution unit and a frame buffer memory, wherein the frame buffer memory comprises a plurality of tiles and wherein the memory interface implements a first stream comprising a first sequential access of tiles and imprements a second stream comprising a second sequential access of tiles for the vector execution unit or the scalar execution unit.

24. The system of claim 22, wherein the first stream and the second stream include at least one prefetched tile.

25. The system of claim 22, wherein the first stream originates from a first location in the frame buffer memory, and the second stream originates from a second location in the frame buffer memory.

26. The system of claim 22 or 23, wherein the memory interface is configured to manage a plurality of streams from a plurality of different originating locations and to a plurality of different terminating locations.

27. The system of claim 26, wherein at least one of the originating locations or at least one of the terminating locations is in a system memory.

28. The system of claim 22 or 23, further comprising: a DMA engine built into the memory interface and configured to implement a plurality of memory reads to support the first stream and the second stream, and to implement a plurality of memory writes to support the first stream and the second stream.

29. The system of claim 22 or 23, wherein the first stream experiences a higher amount of latency than the second stream, and wherein the first second stream.

30. The system of claim 22 or 23, wherein the memory interface is configured to prefetch an adjustable number of tiles of the first stream or the second stream to compensate for a latency of the first stream or the second stream.

31. A system comprising: a host interface for implementing communication between the video processor and a host CPU; a scalar execution unit coupled to the host interface and configured to execute scalar video processing operations; a vector execution unit coupled to the host interface and configured to execute vector video processing operations; a command FIFO for enabling the vector execution unit to operate on a demand driven basis by accessing the memory command FIFO; a memory interface for implementing communication between the video processor and a frame buffer memory; and a DMA engine built into the memory interface for implementing DMA transfers between a plurality of different memory locations and for loading a datastore memory and an instruction cache with data and instructions for the vector execution unit.

32. The system of claim 31 wherein the system is a latency tolerant system for executing video processing operations.

33. The system of claim 32 further comprising: a motherboard; a host CPU coupled to the motherboard; a video processor coupled to the motherboard and coupled to the CPU,

34. The system of claim 31, 32 or 33, wherein the vector execution unit is configured to operate asynchronously with respect to the scalar execution unit by accessing the command FIFO to operate on the demand driven basis.

35. The system of claim 31, 32 or 33, wherein the demand driven basis is configured to hide a latency of a data transfer from the different memory locations to the command FIFO of the vector execution unit.

36. The system of claim 31, 32 or 33, wherein the scalar execution unit is configured to implement algorithm flow control processing and wherein the vector execution unit is configured to implement a majority of a video processing workload.

37. The system of claim 36, wherein the scalar execution unit is configured to pre-compute work parameters for the vector execution unit to hide a data transfer latency.

38. The system of claim 31, wherein the vector execution unit is configured to schedule a memory read via the DMA engine to prefetch commands for subsequent execution of a vector subroutine.

39. The system of claim 38, wherein tne memory read is scheduled to prefetch commands for the execution of the vector subroutine prior to calls to the vector subroutine by the scalar execution unit.

40. The system of claim 33, wherein the vector execution unit is configured to schedule a memory read via the DMA engine to prefetch commands for subsequent execution of a vector subroutine, and wherein the memory read is scheduled to prefetch commands for the execution of the vector subroutine prior to calls to the vector subroutine by the scalar execution unit.