WO2001037220A1 - Rendering image data - Google Patents


Info

Publication number
WO2001037220A1
Authority
WO
WIPO (PCT)
Prior art keywords
color
state
scene
rendering
tiles
Application number
PCT/US2000/031689
Other languages
French (fr)
Inventor
Sushma Triveda
Original Assignee
Info Assets, Inc.
Application filed by Info Assets, Inc. filed Critical Info Assets, Inc.
Priority to AU17760/01A priority Critical patent/AU1776001A/en
Publication of WO2001037220A1 publication Critical patent/WO2001037220A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005General purpose rendering architectures

Definitions

  • the invention relates to rendering image data, e.g., for printing and other high resolution and high quality visual applications such as film production.
  • 3D image data, for example, defines three-dimensionally relevant characteristics (such as shape and appearance) of objects in an image.
  • the shape of each three-dimensional object in the scene is described as a collection of planar triangular patches.
  • sources of 3D data are HTML, Java, digital content creation software for video games and films, and CAD applications.
  • 3D data may be used to render 2D rasterized pixel data for display on a monitor ("display rendering"). Also, when 3D data is to be printed, the printer often is driven using 2D display data that has been rendered from the 3D data ("print rendering"). Printers have much higher resolution than display monitors, and using 2D display data to generate the printed image causes the edges of objects in the scene to appear blurred in the printed image. In display rendering, the objects of the image are rendered serially, in an order that does not necessarily correspond to their positions in the rendered 2D image. Consequently, the entire image must be built up in a frame buffer before the raster data can be delivered in scan-line order from the frame buffer to the display. Because of limitations on frame buffer size, this approach may not be suitable for generating large images.
  • For example, a 1600x1200 image with 4 bytes of color and 4 bytes of depth/stencil data per pixel requires 15.36 MB (megabytes) of buffer memory.
  • the color buffers may need to be double buffered so that while one frame is being displayed (or scanned out), the display processor can be rendering the next frame in the temporal sequence. Some systems use more than two color buffers to even out the per-frame rendering time. Memory is also needed to store the geometric descriptions of the objects in the scene and the textures used in 3D rendering. Rendering of each pixel potentially requires reading and writing of the color, depth and stencil buffer values at the location of that pixel. If the object is textured, the texture elements overlapping the pixel also need to be read.
  • Figures 1A and 1B show configurations of graphics processors in typical personal computer systems.
  • the host computer 200 includes a CPU 300, a memory controller 400, and a system memory 500.
  • the graphics processor 100 connects to the host computer 200 on a PCI bus, which it shares with other peripheral components.
  • Figure 1B shows a more recent configuration, in which the graphics processor 100 connects to the host computer 200 on an AGP bus, which provides higher bandwidth than the PCI bus. Similar configurations are used in Unix workstations. In all cases, the communication between the host computer and the graphics processor is limited by the bandwidth of the bus connecting the graphics processor to the host (host bus).
  • the display renderer is incorporated in the graphics processor.
  • the graphics processor typically also performs other functions, such as scanning out the images to the computer monitor.
  • the frame buffer memory 600 is used to store the color, depth, and stencil buffers. In some systems, frame buffer memory also stores textures. The performance of systems that render 2D data from 3D data is thus limited by the available frame buffer and system memories, host bus bandwidth, frame buffer memory bandwidth, and the CPU processing capability.
  • FIG. 2 shows the rough breakdown of the processing steps for rendering an image.
  • the per-vertex operations 700 involve transformation of the vertices of the object's triangles from their model space to image space and computation of colors at the vertices of these triangles by performing lighting computations based on lights in the scene and material properties of the object.
  • the rasterization stage 800 is responsible for determining which image pixels are touched by each triangle and for interpolation of various attributes (such as color, depth, and texture coordinates) at the location of each pixel.
  • Per-pixel operations 900 involve texturing, hidden surface removal, alpha blending, and other effects at each rasterization pixel.
  • Some of these steps are implemented on the CPU 300 and some in dedicated display rendering hardware. Most systems perform per-vertex operations on the host 200 and rasterization and per-pixel operations in the display renderer, which is a part of the graphics processor 100. Even if the color, depth, stencil, and texture data are stored in the frame buffer memory, the bandwidth requirements for reading and writing every pixel several times for each high resolution frame are beyond the capabilities of current systems.
  • Some rendering architectures divide the image into square tiles and build the entire image by rendering one tile at a time. See, e.g., J. Kajiya and J. Torborg, "Talisman: A Commodity Graphics System", SIGGRAPH 1996, and "Why PowerVR is the Future of 3D Graphics Technology", Technical Backgrounder on PowerVR from NEC Corporation and VideoLogic Inc., 1996.
  • This approach is based on dividing the screen or the display surface into a number of abutting screen-aligned rectangular tiles.
  • the color, depth, and stencil buffer values for a tile are held on the graphics processor chip, which eliminates the need to go to the frame buffer memory to read and write depth, color, and stencil buffer values for each pixel that every triangle touches. When all primitives incident on the tile have been processed, the depth, color and stencil buffers for the tile can be saved off the chip.
  • a sort stage 1000 is incorporated after the per-vertex operations, as shown in Figure 3.
  • In the sorting stage, all triangles contained within the scene are sorted by the tiles they touch. Then, for each tile, all the triangles incident on that tile are rendered in the order they are received from the per-vertex operation stage.
  • the tile traversal stage 1001 reads the data in each tile and sends it to the rasterization stage 800. Tiling helps to reduce the memory bandwidth requirement. However, the triangles must be sorted by tile before the rendering for the tile can begin. Sorting can cause significant additional computing burden on the host.
  • the sort memory 1010 may be a part of the frame buffer memory or may reside on a separate memory interface of its own.
  • the per-vertex processed data is delivered to the graphics processor as it is processed by the host.
  • sorting is done on the host, i.e. the dedicated hardware does not sort the triangles by tile, then the host needs to process all triangles in the scene to create the sorted lists of triangles for each tile.
  • the graphics processor starts processing the tiles only after all the sorting is done. Thus all the data for the frame to be processed by the graphics processor needs to be stored in system memory. As the graphics processor renders tiles for a current frame, the CPU starts processing the next frame in the sequence. This causes the host to accumulate the triangles for up to two frames. This can increase the system memory requirement for large scenes.
  • If the setup stage computation of edge slopes and attribute gradients for rasterization is performed after sorting, the number of triangles to be set up is also increased correspondingly.
  • the color value of a pixel in an image may be represented in many color spaces.
  • RGB for display on computer monitors
  • CMYK for printing
  • YUV for television
  • the 2D image data produced by the display renderer is typically in RGB color space.
  • This pixel data needs to be converted to the CMYK color format for printing using a mapping from one color space to another.
  • Different display devices have different display characteristics and therefore linear mapping from one space to another space alone is not sufficient.
  • the color conversion methodology needs to take into account the color characteristics of the display device and incorporate color correction.
  • the 2D display data is of relatively low resolution. Maximum available resolution on current high-end systems is 2048 pixels per row and 1280 rows.
  • an 8" x 6" image corresponds to 9600x7200 pixels, which is 36 times more pixels (with corresponding increase in frame buffer memory) than the 1600x1200 image considered earlier. This scales up linearly for larger image sizes and higher resolutions, which are the norm for hardcopy devices such as printers and digital film recorders.
  • a 16384x16384 pixel image will require 2 gigabytes of memory to hold a single set of color, depth, and stencil buffers. In the tiled architecture, the depth and stencil buffers may not need to be saved. Even so, 1 GB of memory will be needed for the color image. Furthermore, these image pixels need to be read and processed for color correction, color conversion, undercolor removal, and black generation to create the color separations, before the data can be printed, thus giving rise to additional memory and bandwidth requirements.
  • Application developers use several programming interfaces for content creation. For incorporating 3D scenes in an application (say, for example, an electronic on-line book with an animated illustration of a 3D scene such as an electric motor), developers can use APIs such as OpenGL, Java3D, and VRML. Page descriptions can be generated using PostScript, HTML, SGML, or other interfaces. Printer drivers interpret the description and appropriately process the data to suit the capabilities of the target print engine.
  • 3D image data of a scene to be rendered is typically represented as a directed acyclic graph referred to as a scene graph.
  • a scene is made up of objects. The shape of the rendered object is determined by its geometry and the transformations applied to various parts of the object.
  • the same geometry may be drawn with different transformations to create multiple instances of the same thing in the scene, e.g. wheels on the car.
  • the wheels on the car have the same geometry and appearance but different transformations.
  • the object has other attributes that determine its appearance such as color, material properties, and textures associated with them.
  • the nodes in the scene graph represent objects and the information about them such as their appearance and transformations.
  • the nodes can be of various types, such as transformation, appearance, and geometry. Cameras and lights may be treated as special objects or node types.
  • the branches in the scene graph provide associations and scene traversal order. (Additional information about the description of scenes can be found in the Virtual Reality Modeling Language specification.) Figure 5 shows the description of a simple scene. This scene consists of two cylinders on a table and a pyramid.
  • the root node of the tree labeled "Scene" defines the world co-ordinate system.
  • the camera or the observer is placed at some position in the scene.
  • the "camera transform" node defines the viewpoint relative to the world coordinate system and the viewing projections.
  • the root node has two other branches emanating from it. One branch corresponds to the "table and things on it", and the second branch corresponds to the pyramid.
  • the node at the "table and things on it" branch is the transformation node and defines the transformation needed to place the "table and things on it" in the world coordinate system. Similar logic applies to other objects in the scene.
  • the positioning of the short and squat cylinder is obtained by applying a composite transform (T01 * T00 * T10) to the points describing the geometry of the cylinder.
  • T01 is the viewing transformation associated with the camera.
  • the concatenation of T00 and T10 provides the model-world transformation.
  • T10 transforms the cylinder description in its model coordinate frame into that of the "table and things on it" coordinate frame. T00 transforms the "table and things on it" into the world coordinate system.
  • the tall cylinder is positioned by applying the transformation (T01 * T00 * T11) to the points describing the geometry of the cylinder.
  • the pyramid is obtained by applying the transformation (T01 * T02) to the points describing the pyramid.
  • the table is obtained by applying the transformation (T01 * T00 * T12) to the points describing the geometry of the table.
  • the appearance of an object in the scene depends on the relationship of the object to the viewpoint, viewing projections, rendering attributes such as the material and texture associated with the object, and the environment of the object, e.g., lighting and fog conditions.
  • the appearance node in the graph contains information about the color, material properties, and texture applied to the geometry of the object.
  • the environment node may contain information about other parameters such as fog. All parameters affecting the appearance of an object (such as model-view transformation, viewing projection, material, texture, fog, and lights in the scene) are typically encompassed in what is called a rendering state or context.
  • a scene graph is typically traversed depth first, i.e. a branch of the tree is traversed all the way down to its leaf node before processing the second branch at the same node.
  • the branches are traversed left to right, i.e. the left branch is traversed before the branch at the right.
  • the short and squat cylinder is rendered first, then the tall cylinder, then the table, and finally the pyramid.
  • the geometry associated with the object is rendered using the applicable rendering state.
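  • As an illustration of this traversal order, the following minimal C sketch (not from the patent; the node layout and helper names are assumptions) walks a scene graph depth first, left to right, accumulating the composite transform along each path so that, e.g., the short cylinder of Figure 5 ends up transformed by T01 * T00 * T10:

    typedef struct Node Node;
    struct Node {
        float transform[16];   /* local transformation, row-major 4x4 */
        Node *children[8];     /* branches, in left-to-right order */
        int nchildren;
        /* appearance and geometry omitted for brevity */
    };

    /* out = a * b for row-major 4x4 matrices */
    static void mat4_mul(const float a[16], const float b[16], float out[16]) {
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++) {
                float s = 0.0f;
                for (int k = 0; k < 4; k++) s += a[4*r + k] * b[4*k + c];
                out[4*r + c] = s;
            }
    }

    static void render_geometry(const Node *n, const float world[16]) {
        /* placeholder: draw n's geometry using the accumulated transform
           and the rendering state inherited down this branch */
    }

    /* depth-first traversal: a branch is fully processed before its right sibling */
    void traverse(const Node *n, const float parent[16]) {
        float composite[16];
        mat4_mul(parent, n->transform, composite);
        render_geometry(n, composite);
        for (int i = 0; i < n->nchildren; i++)
            traverse(n->children[i], composite);
    }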
  • VRML is scene-graph based.
  • state changes between branches of the scene graph may be incremental. State parameters that are not changed retain their value.
  • the appearance node may encompass all parameters of the state.
  • the state may be inherited down the branches of the graph but not across the branches in the graph.
  • the geometry of an object is described as a set of topologically connected triangular patches. Triangular representation is chosen because triangles are guaranteed to be planar.
  • the object may be described as a set of triangles, triangle strips, triangle fans, and/or meshes.
  • the geometry of the cylinder in the example above may be described by one triangle strip and one triangle fan.
  • the triangle strip is used to represent the walls of the cylinder and the triangle fan is used to represent the base of the cylinder.
  • Figure 6 shows the triangle fan and triangle strip associated with the cylinder as well as the connectivity of vertices for the triangle fan and strip description.
  • the triangle fans and strips reduce the number of vertices that need to be processed for an object.
  • the higher order representations of objects such as parametric surfaces are reduced to triangle based primitives before rendering.
  • Figure 7 shows an outline of processing in a traditional display renderer.
  • the output of one processing stage is input to the next stage in a pipeline fashion; therefore the traditional processing is also referred to as the 3D pipeline.
  • Processing of each object begins with transformation of the vertices to obtain the coordinates in the eye space.
  • This stage may also include computation of color at each vertex (i.e. per-vertex lighting) by taking into account the material properties and normals at each vertex, and the projection transformation to account for the perspective correction.
  • view volume clipping is performed. This removes the parts of the scene that are not inside the viewing frustum. This may be followed by the viewport/device transformation that generates the projection of the triangle on the viewing plane along with the information about the depth of each vertex from the view point.
  • Triangle setup is a precursor to rasterization. Rasterization determines which pixels in the image are covered by each of the primitives (triangles, lines, and points). The rasterization also computes the interpolated attribute values at each pixel touched by the triangle.
  • Several methods are available for the rasterization of primitives. One commonly used method involves edge walking followed by span walking. Edge walking is used to compute the end points of the span covered by the primitive for each of the scan lines it touches. For the triangles, this is done by first finding the vertices with the maximum and minimum y value. This provides the range for scanlines touched by the triangle. For each scan line, its intersection with the left and right edges of triangle is computed.
  • the intersection points provide the left and right end-points of the span within the projection of the triangle.
  • the edge walking stage also computes the attribute (color, depth, texture coordinates etc.) values for each of the span end-points.
  • the span-walking stage interpolates the values of the attributes for each of the pixels covered by the span.
  • the triangle setup stage computes the parameters such as the range of y values covered, the orientation of the long edge, the slope of each of the edges, co-ordinates of the turning point (i.e. the vertex with the mid-y value), and the slopes and gradients of each of the attributes.
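  • The following compact C sketch (illustrative only; clipping, degenerate and horizontal edges, and exact fill-rule details are omitted) shows edge walking and span walking for one interpolated attribute, with the vertices already sorted by y:

    #include <math.h>

    typedef struct { float x, y, attr; } Vtx;

    extern void shade_pixel(int x, int y, float attr);  /* assumed downstream stage */

    /* edge walking: x and attribute at scanline y along edge a->b */
    static void edge_at(const Vtx *a, const Vtx *b, float y, float *x, float *attr) {
        float t = (y - a->y) / (b->y - a->y);
        *x = a->x + t * (b->x - a->x);
        *attr = a->attr + t * (b->attr - a->attr);
    }

    /* top, mid, bot must satisfy top->y <= mid->y <= bot->y */
    void raster_triangle(const Vtx *top, const Vtx *mid, const Vtx *bot) {
        for (int y = (int)ceilf(top->y); y < (int)ceilf(bot->y); y++) {
            float xa, aa, xb, ab;
            edge_at(top, bot, (float)y, &xa, &aa);           /* long edge */
            if ((float)y < mid->y) edge_at(top, mid, (float)y, &xb, &ab);
            else                   edge_at(mid, bot, (float)y, &xb, &ab);
            if (xa > xb) { float t; t = xa; xa = xb; xb = t; t = aa; aa = ab; ab = t; }
            /* span walking: interpolate the attribute across the span */
            for (int x = (int)ceilf(xa); x < (int)ceilf(xb); x++) {
                float t = (xb > xa) ? ((float)x - xa) / (xb - xa) : 0.0f;
                shade_pixel(x, y, aa + t * (ab - aa));
            }
        }
    }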
  • This color computation may involve fragment lighting, texturing and fog effects.
  • the scissor, stipple, alpha, and color tests are performed.
  • a stencil test and a depth test are applied and the results are stored in the depth and stencil buffers.
  • the pixels are subjected to alpha blending, dithering, and logical operations, and stored in ARGB format in a color buffer, ready for display on a monitor.
  • the invention features rendering a scene by a first stage of sorting primitives of the scene among regions of the scene and a second stage of sorting the primitives of each of the regions among sub-regions of the regions. Each of the sub-regions is processed to render the scene.
  • the invention features maintaining state information in connection with processing primitives of portions of a scene to be rendered, and using the state information in rendering each of the portions of the scene to which the primitives belong, the state information being divided into state partitions.
  • the invention features culling fragments of primitives belonging to a portion of a scene to be rendered.
  • the fragments are rasterized in a sequence.
  • a pixel of the fragment is discarded if it fails a depth test.
  • Rendering operations are performed on the fragments of the primitives using pixels that have not been discarded.
  • the invention features pre-processing fragments of primitives of regions of a scene to be rendered, and, after the pre-processing, determining the colors of pixels in the scene based on computations with respect to separate depth sample and color grids.
  • the invention features receiving from an on-chip color buffer, color data for each one of a set of tiles of the scene, generating color separations of each of the tiles without storing the color data off-chip, and delivering the color separations for each of the tiles to a printer or to an off-chip storage device.
  • the invention features rendering tiles of a scene as a rasterized array of pixels and performing post-rendering steps on a tile-by-tile basis for the tiles of the scene.
  • the invention features color conversion and color correction of the pixels in the tiles using a uniformly or non-uniformly distributed multidimensional look-up table, performed together with other post-rendering steps on a tile-by-tile basis for the tiles of the scene.
  • the invention features color conversion and color correction of the pixels of a primitive using a uniformly or non-uniformly distributed multidimensional look-up table before other per-fragment operations.
  • the invention features application of halftone threshold arrays to pixels in the tiles after color correction, color conversion, black generation, and undercolor removal.
  • the invention is useful, for example, in 3D workstations used in CAD and content creation and high-end graphics boards for Windows NT, Unix or other platforms.
  • Consumer applications that involve 3D technologies are starting to emerge. These include electronic books, space planning and remodeling, and other consumer applications with 3D content such as tools for making invitation cards.
  • the invention addresses these applications either as a low-cost part incorporated as an option in a graphics processor or in a printer, or incorporated in a board-level product that will connect to a personal computer.
  • RGB frame buffers and z buffers are not stored in local memory.
  • the number of pointers needed in the two-stage binning process is relatively small (nBands + nTiles) rather than nBands * nTiles; where nBands is the number of bands in the image and nTiles is number of tiles in each band.
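  • As a worked example (using the 8K x 8K image discussed later, with 32-pixel-high bands and 32x32 tiles, i.e. nBands = 256 and nTiles = 256): two-stage binning needs nBands + nTiles = 256 + 256 = 512 current pointers, which can be kept on-chip, whereas sorting directly into tiles would need nBands * nTiles = 256 * 256 = 65,536 pointers.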
  • the frame-buffer memory needed to store the bin-buckets is much smaller than if the primitives were sorted directly into tiles because of reduced duplication of data.
  • the tiles need not be square; they may be, for example, rectangular.
  • the touched tile computation is exact and simple.
  • The method is scalable because one band at a time is rendered.
  • the rendered bands can be saved in local memory, system memory, or even on disk for later integration into a printed image. The latency in the display of images is reduced.
  • the band sort process can keep track of the maximum number of triangles incident within a band. This number can be used as an estimate of the amount of time required to render the busiest band in the image.
  • the depth culling approach is inexpensive to implement. It reduces tile rendering latency, thereby reducing the required sizes of FIFOs.
  • the depth tile used in the culling process can be relatively small, which is useful for high resolution images.
  • the color grid used for per fragment operations enables smooth scalability from fast preview mode to high quality, high resolution images.
  • a new anti-aliasing method is proposed. Only the color separations for generated images need be saved. The original RGB image does not need to be saved in memory.
  • the multi-dimensional color table look-up technique is applicable to all imaging, not merely to 3D rendering.
  • the incorporation of multi-dimensional table look-up for color conversion and correction before and/or after the per-fragment operations enables integration of images from various sources (e.g., digital cameras, scanners, etc.) into the final rendered image in a collage-like manner.
DESCRIPTION

  • Figures 1A and 1B show a graphics processor on a PCI bus and on an AGP bus.
  • Figure 2 shows a top level breakdown of 3D display operations.
  • Figure 3 shows 3D display operations in tiled architectures.
  • Figure 5 shows a simple scene graph and its display image.
  • Figure 6 shows a topological representation used in graphics.
  • Figure 7 shows a display rendering pipeline.
  • FIG. 8 shows a modified rendering pipeline. Shaded areas are the new stages introduced in this invention.
  • Figure 9 illustrates tile binning.
  • Figure 10 shows a modified binning process.
  • Figure 11 shows the binning data structures.
  • Figure 12 shows the band sort process.
  • Figure 12A shows the tile sort process.
  • Figure 13 shows a band and tile bucket current pointer.
  • Figure 14 shows a cull stream entry.
  • Figure 15 shows a band bucket entry.
  • Figure 16 shows contents of the front-end state block.
  • Figure 17 shows contents of the fragment stream entries.
  • Figure 20 shows the state of data structures after receiving V2.
  • Figure 21 shows structures after binning triangle V0, V1, V2.
  • Figure 22 shows structures after binning first object.
  • Figure 23 shows data structures after binning the first triangle of second object.
  • Figure 24 shows data structures after endscene.
  • Figure 26 shows structures after receiving vertex V2 for preloaded incremental state blocks.
  • Figure 29 shows the location of color computation.
  • Figure 31 shows a halftone brick and tiles.
  • the scene processing 150 and the per-vertex operations 152 of the modified pipeline perform the same functions as the corresponding stages in the traditional pipeline. They include traversal of the scene graph and accumulation of changes to parameters of the rendering state, and per-vertex transformations, lighting, and viewport clipping.
  • the processing stages may be incorporated in a dedicated graphics processor.
  • the host sends the geometry and state information to the graphics processor, and the graphics processor traverses the scene and performs the per-vertex operations and all other pipeline stages downstream.
  • the scene traversal can be performed on the host.
  • Per-vertex and other operations downstream are then carried out in the dedicated hardware.
  • both the scene traversal and per-vertex operations may be carried out on the host, and the operations downstream may be performed in the dedicated hardware.
  • the modified pipeline renders to a small region of the image (tile) at a time.
  • the 3D data for each object that results from the view volume and user clipping of the per-vertex operations is delivered to a binning and state management module 154. This module accepts the incoming object data and sorts (bins) the objects in accordance with bands of the image and tiles within each band in which the objects appear.
  • a tile of size 32x32 contains 1024 pixels.
  • the 1600x1200 display image in our example will require 1900 such tiles (38 rows of 50 tiles across).
  • the 8"x6" image for printing has 9600x7200 pixels, and will require 67,500 tiles.
  • Another approach would be to bin the data by tile.
  • a bin bucket is created for each tile to hold the geometry that overlaps that tile. The method first determines the tiles that are touched by each triangle, and then writes the triangle information to the bin buckets of each of the touched tiles. Relevant state changes are written to each bin bucket before the triangle is written. Using a 32x32 tile and a simple bounding box check for tile assignment will cause a triangle to be written to nearly 3 tiles on average in a typical display image and to 13.5 tiles in the print image.
  • More accurate tile assignment than the bounding box check can be done so that a triangle is assigned only to the tiles that are actually touched by the triangle. Then a triangle will be written to 1.5 tiles on average in a display image. In the print image the same triangle will touch nearly 6.7 tiles. Storage requirements can be reduced by storing the actual vertex data and state information separately only once, and then using only pointers into the state and vertex data lists in each assigned bin bucket.
  • the setup stage does the computation of per-triangle parameters such as edge slopes and attribute gradients that are used for edge walking and scan-conversion during rasterization. These computations can be carried out either during binning (and written to memory) or during the traversal of tile primitive lists.
  • Figure 9 outlines the general data flow for binning the geometry by the tiles. For each new vertex the binning unit 154 determines if that vertex completes a triangle. (The first two vertices of the triangle are simply held in registers.) If the vertex completes the triangle, then the triangle data is saved to memory for state and geometry, and the triangle memory pointer register is updated. This memory may be a portion of the frame buffer memory or a separate memory with its own memory interface.
  • Exact tile coverage can be determined in one of many ways. This can be done by first finding the bounding box of the triangle and then for each tile touched by the bounding box, by computing the Manhattan distance from the center of the tile to each edge. The sign and magnitude of the Manhattan distance to each edge is used to determine if the tile is touched by the triangle. This approach requires that the tiles be square. The exact coverage can also be determined by rasterizing the triangle in tile space. Although these methods are exact, they are expensive to implement. For high resolution images such as those required for printing, the number of tiles touched is significantly larger, and therefore the cost of exact touched tile determination can be fairly high.
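  • One way to realize the Manhattan-distance test is via the triangle's edge functions: for a square tile, the maximum of an edge function over the tile is its value at the tile center plus the Manhattan-weighted half-extent of the tile. The C sketch below (an illustration, not the patent's implementation; it assumes counter-clockwise winding and a prior bounding-box overlap check) rejects the tile if it lies entirely outside any edge:

    #include <math.h>

    typedef struct { float a, b, c; } Edge;   /* E(x, y) = a*x + b*y + c */

    /* edge function that is positive on the interior side for CCW winding */
    static Edge make_edge(float x0, float y0, float x1, float y1) {
        Edge e = { y0 - y1, x1 - x0, x0 * y1 - x1 * y0 };
        return e;
    }

    /* exact test for a square tile centered at (cx, cy) with half-size h */
    int tile_touched(const Edge e[3], float cx, float cy, float h) {
        for (int i = 0; i < 3; i++) {
            float emax = e[i].a * cx + e[i].b * cy + e[i].c
                       + (fabsf(e[i].a) + fabsf(e[i].b)) * h;
            if (emax < 0.0f) return 0;   /* tile entirely outside this edge */
        }
        return 1;   /* no separating edge: the tile is touched */
    }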
  • For each tile touched by the triangle, a current_bucket_pointer is needed in order to update its bin bucket.
  • the bin buckets may be implemented as linked lists or as paged arrays. In either case, for each bin bucket two pointers are required. These are the pointer to the start of the bin bucket and the current pointer for updating the bin bucket. The start pointer is used during tile traversal and rendering. If the current pointer is stored in memory, then it needs to be read from the memory, and written to memory when it is updated. This can result in a large number of memory access requests to off-chip memory. It is possible to keep all current bin bucket pointers on-chip, but this limits the maximum allowable image size.
  • the frame-buffer memory needed to store the bin-buckets is much smaller than that needed in the approach of sorting into tiles straight away. This is due to reduced duplication of data.
  • Two-step binning allows use of rectangular tiles and not just square tiles.
  • the method allows scalability of rendered images, by rendering to one band at a time.
  • the method reduces the latency in the display of images.
  • the binning process creates several data structures in memory. These are outlined in Figure 11.
  • the binning process gets the data from the per-vertex operations stage.
  • the input data may contain the rendering state and geometry information. This information is saved in three different streams in the binning memory: the state stream, the cull stream, and the fragment stream.
  • the state stream has information about state changes
  • the cull stream has information about spatial data used for binning and opaque culling
  • the fragment stream has information needed by the units dealing with per-fragment operations.
  • An embodiment of this invention uses three kinds of state blocks: a front-end state block, a fragment operations state block, and a post-process state block.
  • the front-end state block has information about the format of vertices in the fragment stream, information needed for culling, such as a depth test, depth write enable flag, scissor window extent, and a stipple pattern.
  • the fragment operations state block contains information needed for per-fragment operations. This state block contains information such as the fog parameters, texture filtering and blending modes, stencil test and alpha blending related parameters.
  • the post-process block has information about color tables for color conversion, half-tone threshold arrays, etc.
  • the state blocks are self-contained, and each state block contains all the information needed to set all state registers for the corresponding stage.
  • the state blocks are further subdivided into sub-blocks.
  • the post-process state block may contain two sub-blocks; one corresponding to color-conversion stage and the other to halftoning.
  • the fragment operations state block may contain sub-blocks for texture address and filtering, the fog parameters, and one for post-texture fragment operations such as alpha testing, alpha blending, depth and stencil tests etc.
  • the fragment operations state block may also contain pointers to a color table for color conversion and correction.
  • the state blocks are implemented as a dictionary of key-value pairs. This embodiment allows incremental update of the state.
  • a state loader module in each processing block interprets the dictionary keys and loads the appropriate registers.
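  • A minimal sketch of such a state loader (the key encoding and register file are assumptions, not the patent's format):

    #include <stdint.h>

    typedef struct { uint32_t key; uint32_t value; } StatePair;

    static uint32_t state_regs[256];   /* illustrative register file for one stage */

    /* each key selects a register; pairs absent from the block leave their
       registers untouched, which is what makes updates incremental */
    void load_state_block(const StatePair *block, int npairs) {
        for (int i = 0; i < npairs; i++)
            state_regs[block[i].key & 0xFF] = block[i].value;
    }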
  • the band_Sort process (Figure 12) uses three additional data structures.
  • the geometry and its associated state incident within each band is recorded in the bin bucket of that band. This is referred to as the band bucket.
  • the contents of a band bucket entry are outlined in Figure 15.
  • Each entry has a "type" and "data" associated with it. The "type" determines how the data is interpreted.
  • Each band bucket entry has a fixed number of bits. In one implementation, we use 128 bits per entry.
  • the data is interpreted as containing the state block pointers for the applicable modes as follows: 2 bits of type, 24 bits of front-end state block pointer, 8 bits of front-end state block size, 24 bits of fragment-ops state block pointer, 8 bits of fragment-ops state block size (in 8-byte units, so the size can be 8 to 2048 bytes), 24 bits of post-processing state pointer, 8 bits of post-processing state block size (in 32-byte units, so the size can be 32 to 8192 bytes), and a 24-bit next pointer. 6 bits are unused.
  • a band bucket is a sequence of band bucket entries.
  • the size associated with "mode" information is in variable units.
  • the size for the front-end state block may be in units of 32-bit words, whereas the size of the post-process state block may be in units of 64-bit words.
  • Paged arrays allocate a page full of band bucket entries at a time. A page thus has a number of band bucket entries followed by a next page pointer.
  • Another embodiment of this invention uses a linked list of band bucket entries for band buckets.
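  • For illustration only, the 128-bit "mode" entry described above could be written down as the following C bitfield (the field names are invented here, and actual bitfield packing is compiler-dependent, so a hardware or production implementation would pack the fields explicitly):

    #include <stdint.h>

    /* 2 + 24 + 8 + 24 + 8 + 24 + 8 + 24 + 6 = 128 bits */
    typedef struct {
        uint64_t type          : 2;   /* entry type, e.g. MODE */
        uint64_t fe_ptr        : 24;  /* front-end state block pointer */
        uint64_t fe_size       : 8;   /* front-end state block size */
        uint64_t fragops_ptr   : 24;  /* fragment-ops state block pointer */
        uint64_t fragops_size  : 8;   /* in 8-byte units: 8..2048 bytes */
        uint64_t postproc_ptr  : 24;  /* post-processing state pointer */
        uint64_t postproc_size : 8;   /* in 32-byte units: 32..8192 bytes */
        uint64_t next          : 24;  /* next pointer */
        uint64_t unused        : 6;
    } ModeEntry;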
  • the cull stream contains the spatial part of the vertex data (Xwindow, Ywindow, Zwindow) used by the binning and opaque cull process.
  • the Fragment stream contains the information needed to determine the color of each fragment.
  • the primitive build stage 1541 builds the triangles from the incoming vertices. This stage orders the vertices to determine the bands intersected by the triangle as well as to determine the left, right and bottom edges of the triangle. It also determines the long edge in y.
  • the cull stream consists of the (x, y) coordinates of the three vertices in the screen or window coordinate system, the depth gradients, the edge slopes, and the LeftCorner flag: {Xw0, Yw0, Xw1, Yw1, Xw2, Yw2, Zw0, dZ/dX, dZ/dY, (dX/dY)left, (dX/dY)right, (dX/dY)bottom, LeftCorner}.
  • the contents of this structure are shown in Figure 14. The data is organized as lists of triangles.
  • the fragment stream consists of the data specified by the vertex format field in the front-end state block.
  • the vertex of the triangle in the fragment stream may consist of one or more of the eye coordinates, vertex normals, vertex diffuse color, vertex specular color, vertex tangents and binormals, and texture coordinates for one or more textures.
  • Each texture coordinate set may consist of 1, 2 or 3 values. Two values are used for two-dimensional textures and the third value for projective textures (if in use).
  • Another embodiment uses one and three-dimensional textures as well as the two-dimensional textures. The maximum number of textures that can be used on a primitive simultaneously determines the maximum texture coordinate sets allowed. The value is programmable. The number of bits in the format field is adjusted accordingly.
  • the data in the fragment stream is packaged by vertices, which are then combined to build the triangle in the per-fragment operations unit.
  • In another embodiment, the data in the fragment streams is packaged by triangles.
  • Each triangle has the attribute values specified at the vertices.
  • In yet another embodiment, the data in the fragment streams is packaged by triangles.
  • Each triangle has one reference value for each attribute specified at one of the vertices, along with the x and y gradients of each attribute.
  • the three values in this implementation correspond to one reference value and two gradients.
  • the attribute values at any point in the interior of the triangle can then be determined by the equation
  • A(x, y) = Aref + (x - xref) * Ax + (y - yref) * Ay
  • Ax and Ay are the gradients of A along the x and y axes.
  • Aref is the reference value of the attribute at point (xref, yref).
  • the target point is (x,y).
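  • In C, the plane equation above is a one-liner (names are illustrative):

    /* attribute A at pixel (x, y), given the reference value at (xref, yref)
       and the gradients Ax, Ay along the x and y axes */
    static inline float eval_attr(float Aref, float Ax, float Ay,
                                  float xref, float yref, float x, float y) {
        return Aref + (x - xref) * Ax + (y - yref) * Ay;
    }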
  • the band bucket start pointers are a set of nBands pointers that indicate the start of the bin buckets for the corresponding bands, where nBands is the number of bands the rendered image is divided into.
  • the tile bucket start pointers are a set of nTiles pointers that indicate the start of the bin buckets for the corresponding tiles, where nTiles is the number of tiles within a band.
  • the current pointer for a band bucket contains a pointer to the last entry in the bucket. This pointer also has the time stamp for when the last state changed in this bin bucket. This time stamp is used to determine the state changes accumulated since the last mode change for this band.
  • the tile bucket current pointer works in the same way.
  • the front end state block has data about the window x, y, z coordinates, scissor, and stipple. This information is used by the pre-rasterization, scissor and stipple test, and the opaque culling blocks.
  • the contents of the front-end state block in one implementation are outlined in Figure 16.

Binning Process
  • the binning process is outlined in Figures 10 and 12. As mentioned before, the binning happens in two stages. In the first stage the primitives are binned to bands. The bands can be horizontal or vertical. For the purpose of this discussion only, we shall assume that the bands are horizontal. After the band sorting is completed, the primitives in each band are binned into tiles by the Tile_Sort process. When the binning for a band is completed, the Tile_Sort process signals the Tile Render process to start processing the completed band. The Tile Render process sends the resulting binned data for each tile to the culling and per-fragment operations units.
  • the Tile_Sort process does the sorting by tiles for the next band in the image.
  • the following pseudo-code outlines the binning process. Since processing can and does happen in parallel (in a hardware implementation), we use threads for each processing unit that can operate in parallel.
  • the input vertices are received from the per-vertex operations stage.
  • the vertices can be in flexible vertex format or in indexed mode.
  • the driver is responsible for loading the vertex buffers in the local memory before sending the vertex indices.
  • the vertex format and the buffer pointer are programmable parameters.
  • the input processor in the binning process is responsible for retrieving the vertex from the vertex buffer and for creating the data packets for the cull and Fragment streams.
  • the data sent to the binning process in flexible vertex format is used as is to construct the cull and fragment streams.
  • the band_Sort process first determines the characteristic functions for the triangle. Once the characteristic functions are determined, the band_Sort process determines the bands in the image touched by the triangle.
  • the characteristic functions include identification of the edges as left, right, and bottom edges, edge slopes, depth gradients, LeftCorner, ymin and ymax.
  • the vertices are first ordered in y axis.
  • the slopes of the two edges emanating from the top vertex are determined. Since x is increasing to the right, the edge with the smaller slope is the left edge. If it is the edge that connects the top vertex to the mid-vertex, then LeftCorner is true, i.e. the long edge of the triangle lies to the right and the span walking during rasterization may proceed right to left.
  • the computation of these parameters is outlined in the following pseudo-code.
  • if (Mid_To_Bot_Slope == Top_to_Bot_Slope) Discard_triangle; // zero area degenerate triangle
  • Ytop and Ybot provide the extent of the triangle vertically, and are used to determine the bands intersected by the triangle.
  • the band_Sort unit bins the triangle to the bands touched by the triangle. The following pseudo-code outlines the process of sorting the geometry in the scene into bands.
  • the band_Sort process determines the bands intersected by the triangle. Then, for each band touched by the triangle, the band_Sort process determines the minimum and maximum x values. These xmin and xmax values in each band are stored in the "PRIM" entries of the band bucket and are used to find the tiles touched by the triangle within the band.
  • the following pseudo-code outlines the band_Sort process.
  • band_min = Ytop / band_height;
    band_max = Ybot / band_height;
    for (idx = band_min; idx <= band_max; idx++) {
        // get the band_ymin and band_ymax for this band
        ...
    }
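  • A fuller C sketch of the band_Sort loop (the Triangle fields and both helpers are assumptions, not the patent's code):

    typedef struct {
        float ytop, ybot;   /* vertical extent from the characteristic functions */
        /* vertices, edge slopes, etc. omitted */
    } Triangle;

    extern void x_extent_in_band(const Triangle *t, int band_ymin, int band_ymax,
                                 float *xmin, float *xmax);   /* assumed helper */
    extern void append_prim_entry(int band, const Triangle *t,
                                  float xmin, float xmax);    /* writes a PRIM entry */

    void band_sort(const Triangle *t, int nbands, int band_height) {
        int band_min = (int)t->ytop / band_height;
        int band_max = (int)t->ybot / band_height;
        if (band_min < 0) band_min = 0;
        if (band_max > nbands - 1) band_max = nbands - 1;
        for (int idx = band_min; idx <= band_max; idx++) {
            int band_ymin = idx * band_height;            /* extent of this band */
            int band_ymax = band_ymin + band_height - 1;
            float xmin, xmax;   /* horizontal extent of the triangle in this band */
            x_extent_in_band(t, band_ymin, band_ymax, &xmin, &xmax);
            append_prim_entry(idx, t, xmin, xmax);
        }
    }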
  • the EndScene command indicates the end of the current scene.
  • the band processing unit starts processing the triangles within a band.
  • the band processing unit consists of two parts.
  • the Tile_Sort process and the Tile Render process. The Tile_Sort process sorts the triangles within a band by their incidence on the tiles within the band.
  • the Tile_Sort process is outlined in Figure 12A. Once all triangles within the band are sorted, as indicated by the EOB (end of band) packet, the Tile_Sort process signals the Tile Render process to start processing the tiles within that band.
  • the band_Bucket can be freed at this time, and the Tile Sort process can start working on the next band.
  • the Tile Render process reads the bin bucket for each tile and sends the triangles for the tile downstream for processing.
  • the bin bucket for the tile can be freed.
  • the Tile Render process moves on to the next tile in the band.
  • Once the Tile Render process has received a DONE signal from the Tile_Sort process, it moves on to the next band.
  • the Tile_Sort process uses the xmin and xmax of each triangle in the band to determine its coverage on the tiles. This provides exact coverage determination, as the triangle will be assigned only to the tiles that are indeed touched by it. This method of sorting also allows us to have rectangular tiles and not just square tiles.
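  • Because only a contiguous run of tiles in the band can be touched, the exact tile coverage reduces to two integer divisions (a sketch; the bucket helper is assumed):

    extern void append_to_tile_bucket(int tile, const void *prim);   /* assumed */

    void tile_sort_prim(const void *prim, float band_xmin, float band_xmax,
                        int tile_width, int ntiles) {
        int tmin = (int)band_xmin / tile_width;
        int tmax = (int)band_xmax / tile_width;
        if (tmin < 0) tmin = 0;
        if (tmax > ntiles - 1) tmax = ntiles - 1;
        for (int t = tmin; t <= tmax; t++)
            append_to_tile_bucket(t, prim);   /* every tile in the range is touched */
    }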
  • the Tile Render process reads the bin buckets of the Tiles.
  • the tile bucket entries are just like the band bucket entries except that the band xmin and band_xmax are not needed in the entries of type "PRIM".
  • the Tile Render process retrieves the Front-end state and the cull stream entry associated with this triangle, and sends the associated data to the cull and the per-fragment operations units.
  • the binning process is assigned a chunk of frame buffer memory that it uses for writing bin buckets and cull and fragment streams. Some implementations may use a separate binning memory.
  • each band is 32 scan-lines high and encompasses a string of 32x32 tiles.
  • An 8K x 8K image would have 256 bands and 256 tiles within each band.
  • Each sorted list is a linked list of data blocks that are dynamically allocated.
  • the state refers to the set of rendering parameters that affect the appearance of the rendered object.
  • the set of all parameters may be divided into several partitions, with each partition containing the parameters required by a particular processing stage.
  • the rendering state may consist of partitions such as light state block, texture state block, material state block, blending state block, etc.
  • the state is divided into three partitions, corresponding to (1) front-end processing unit, (2) the per-fragment operations unit, and (3) post-processing operations unit.
  • the invention includes other state partitioning schemes as well. Several embodiments can be devised, each containing a different set of state partitions within the framework of this invention.
  • This invention encompasses state management for tiled architectures under four input scenarios. Each scenario will be illustrated through the use of the example in Figure 19, which contains 3 objects rendered with different sets of states.
  • the geometry of the first object is a triangle strip made up of 6 vertices (V0, V1, V2, ..., V5)
  • the geometry of the second object is a list of two triangles (V6, V7, V8) and (V9, V10, V11)
  • the geometry of the third object is a triangle fan made up of 5 vertices given by (V12, V13, V14, V15, V16).
  • These objects cover four bands, as shown in Figure 19.
  • the first object uses Front-end state partition A1, Fragment-ops state partition B2, and the Post-process state partition C1.
  • the second object uses Front-end state partition A1, Fragment-ops state partition B1, and the Post-process state partition C1.
  • the third object uses Front-end state partition A2, Fragment-ops state partition B1, and the Post-process state partition C1.
  • Draw_Triangle_List(6, obj2); // obj2 has 6 vertices and is drawn with state A1, B1, C1
    Set Front-end state pointer to a2_ptr;
    Draw_Triangle_Fan(5, obj3); // obj3 has 5 vertices and is drawn with state A2, B1, C1
  • the host computer loads the front-end state partitions A1 and A2 at locations a1_ptr and a2_ptr respectively.
  • the fragment operations state partitions B1 and B2 are loaded at locations b1_ptr and b2_ptr respectively.
  • a postprocessing state partition C1 is loaded at location c1_ptr. Note that these state partitions may be used for rendering multiple frames. Once a state partition is no longer needed, the host may free the memory associated with it.
  • the state pointers for the state partitions, and the data associated with the vertices needed to complete a triangle, are retained as a part of a current rendering context.
  • the current rendering context may reside on chip.
  • the pointers for each state partition are initialized to a known (reset) state. If the frame being rendered is not the first frame in the sequence, then the last known state partitions from the previous frame are inherited.
  • the time stamps for the state partitions are initialized to zero.
  • the band buckets are initialized and so are the band bucket current pointers.
  • the band bucket current pointers point to the start of each band bucket and have a time stamp of zero.
  • When the binning process receives the front-end state pointer, it updates the on-chip current front-end state pointer to a1_ptr, increments the time stamp (to 1), and assigns that time stamp to the front-end state pointer. The binning process then receives the fragment operations state pointer b2_ptr. It then assigns that pointer to the fragment operations state pointer, and assigns it the incremented time stamp (which is now 2). Similarly, c1_ptr is assigned to the postprocessing state pointer, which is assigned a time stamp of 3. Next comes the geometry for object1.
  • the bin buckets for both the first and second bands have three entries each: one state packet, one primitive packet and one "end of bucket" packet.
  • bmin00 and bmax00 are the minimum and maximum x values of triangle V0V1V2 in band0, and bmin10 and bmax10 are the minimum and maximum x values of triangle V0V1V2 in band1. They appear in the corresponding primitive packet in the band buckets.
  • the cull and fragment streams contain one entry each, corresponding to triangle V0V1V2.
  • Figure 22 shows the state of various data structures after all four triangles in the first object have been binned.
  • the first two triangles intersect band0 and band1.
  • the next two triangles only intersect band1. Therefore the bin bucket for band0 contains two "primitive" entries and the bin bucket for band1 contains four "primitive" entries.
  • the cull and fragment streams have four entries each.
  • Figure 23 shows the data structures after the second object has been binned to bands.
  • Figure 24 shows the various data structures at the time that EndScene is received.
  • the time stamp is updated every time any of the state pointers change, and when the first primitive is completed after the state change. In this scheme, the last state change happened at time 7.
  • the data structures for the first band after the tile sort is completed are shown in Figure 25.
  • the xmin and xmax computed for each triangle for the band are used to determine the tiles touched by the triangle.
  • the Tile Render process reads the tile buckets and sends the appropriate state pointers to the corresponding units. While Tile Render is processing the first band, Tile_Sort can start working on the second band in the image. This is accomplished by using double buffered tile pointers. One set of tile pointers is used by the Tile_Sort process and the other set is used by the Tile Render process.
  • the processing units downstream from the binning process may implement certain caching schemes to cache the state partitions.
  • the memory pointers are used as tags for the cache entries. The number of cache entries used for each state partition is implementation and application dependent.
  • the second triangle in the tile will cause the state at b1_ptr to be loaded into cache entry 1.
  • the geometry in the third tile uses the state block B1, which is already loaded in cache entry 1.
  • the first tile in the second band will require the Front-end state partition at a2_ptr to be loaded in the Front-end cache, and so on.
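  • A sketch of such a pointer-tagged cache (the entry count, replacement policy, and helper are illustrative, not mandated by the patent):

    #include <stdint.h>

    #define NENTRIES 4

    typedef struct {
        uint32_t tag;   /* memory pointer of the cached state partition */
        int valid;
        /* decoded state registers would live here */
    } StateCacheEntry;

    static StateCacheEntry cache[NENTRIES];
    static int next_victim;   /* trivial round-robin replacement */

    extern void load_state_from_memory(uint32_t ptr, StateCacheEntry *e);  /* assumed */

    int lookup_state(uint32_t ptr) {
        for (int i = 0; i < NENTRIES; i++)
            if (cache[i].valid && cache[i].tag == ptr)
                return i;                          /* hit: state already on chip */
        int slot = next_victim;                    /* miss: fetch and tag */
        next_victim = (next_victim + 1) % NENTRIES;
        load_state_from_memory(ptr, &cache[slot]);
        cache[slot].tag = ptr;
        cache[slot].valid = 1;
        return slot;
    }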
  • the same rendering state may apply to several objects, for example in cases when several instances of the same object are used in the scene.
  • the host can simply load the state partitions in the local memory and use the pointer to that state block in subsequent rendering. This can reduce the memory bandwidth and storage related to state management.
  • One aspect of this invention is that the content of the state partition is interpreted by the processing unit that uses it.
  • the state management is simply responsible for associating correct pointers of various state partitions with each object.

Interleaved Full State Partitions
  • the state partitions are not preloaded, but the data for each state partition is passed down to the binning process.
  • the binning process retains the full state partitions and not just the pointers in its on-chip memory (or on the heap in a software implementation).
  • the host/driver assigns a state memory area to the binning process.
  • the binning process first saves the state partition to the state memory and then uses that pointer in the state packet. The rest of the processing is similar to the case above.
  • the size of a state partition is fixed and depends only on the type of the partition.
  • a flag is used to indicate if the current state in the state partitions needs to be saved to memory.
  • the state changes are incremental. This kind of situation arises when the state changes are implemented using display lists.
  • the state member that is changed is typically identified by a key or an id, followed by a value.
  • a state block in this case is a sequence of these key-value pairs.
  • the number of key-value pairs in a block may be variable.
  • State management of the incremental state blocks is different from that outlined in the section for full state partitions. The reason is that the state at any one time is the accumulated effect of all state changes until that time.
  • the RESET state block sets all parameters in a state partition to a known state. Since all parameters are set, once a RESET block arrives, the history of state changes for that partition may be erased.
  • the three RESET state blocks A0, B0, C0 are loaded at a0_ptr, b0_ptr, and c0_ptr respectively for the three state partitions, namely the Front-end state, the Fragment-Ops state, and the Post-processing state.
  • Their sizes are a0_size, b0_size, and c0_size respectively.
  • the sizes of A1, A2, B1, B2, and C1 are a1_size, a2_size, b1_size, b2_size, and c1_size respectively.
  • the rendering of the first object is affected by the cumulative state changes due to the (A0 + A1), (B0 + B2), and (C0 + C1) blocks to the three state partitions.
  • the order in which state changes are executed is important. For example, the state changes in state block A0 should be followed by the state changes in state block A1.
  • the rendering of the second object is affected by the cumulative state changes due to the (A0 + A1), (B0 + B2 + B1), and (C0 + C1) blocks to the three state partitions.
  • the rendering of the third object is affected by the cumulative state changes due to the (A0 + A1 + A2), (B0 + B2 + B1), and (C0 + C1) blocks to the three state partitions.
  • the state management scheme implements the cumulative state change logic.
  • the state management of preloaded incremental state blocks is carried out by keeping an array of (pointer, size) pairs for each state partition. Every time a state block for that partition is encountered by the binning process, it enters that pointer and size into the array. The current write index for each array is maintained. (This index indicates the number of entries filled in in the corresponding array.) The state management method keeps a set of "last read" indices for each bin bucket as a part of its current bin bucket pointer. Instead of the time stamp used with preloaded full state blocks, the state management uses the three array indices to indicate the previous and current rendering context. We have used the symbol A to indicate the state blocks for the Front-end state partition, B to indicate the state blocks for the Fragment Ops state partition, and C to indicate the state blocks for the post-processing state partition.
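  • The bookkeeping just described might look like this in C (array bounds and field names are invented for illustration):

    #include <stdint.h>

    typedef struct { uint32_t ptr; uint32_t size; } StateBlockRef;

    typedef struct {
        StateBlockRef A[64], B[64], C[64];   /* (pointer, size) per partition */
        int a_count, b_count, c_count;       /* current write index per array */
    } RenderingContext;

    typedef struct {
        uint32_t current_entry;              /* current band bucket pointer */
        int a_last, b_last, c_last;          /* "last read" index per partition */
    } BandBucketCurrentPtr;

    /* when a primitive is binned to a band, the state packet written ahead of
       it must cover array entries [x_last, x_count) for each partition, i.e.
       the incremental blocks accumulated since this band was last updated;
       afterwards x_last is advanced to x_count */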
  • Figure 26 shows the various structures related to the management of preloaded incremental state blocks, at the time that the binning process receives vertex V2.
  • the A array, B array, and C array contain the incremental state block pointers and sizes encountered so far for the three state partitions.
  • the rendering context has the number of these blocks for each partition, and the band bucket current pointers contain the current index of 0 for all three partitions.
  • Figure 27 shows the structures after processing of triangle V0V1V2.
  • the band bucket current pointers and the band buckets for the first and second band have been updated.
  • the current pointers for band0 and band1 indicate that the first two state blocks for each of the three partitions have been loaded for band0 and band1.
  • the state packets in the bin buckets of band0 and band1 incorporate the loading of these blocks by array indices.
  • An array index of -1 indicates that the state partition does not need to be updated.
  • the array indices for each array indicate the starting state index and the ending state index.
  • Figure 28 shows the state of data structures after EndScene is received.
  • Interleaved incremental state blocks comprise the most common programming practice in graphics today. OpenGL and Direct3D follow this model of programming. This invention includes three methods for state management for incremental state changes interleaved in the command stream.
  • the state management of interleaved incremental changes is done as a modification to the management of interleaved full state partitions described above.
  • the binning process interprets the entries in each state block and updates the corresponding state partition appropriately.
  • a dirty flag is used to indicate that the state partition has been modified.
  • the state block is first written to the memory area. Its pointer is added to the state array, and the rest of the processing proceeds as for the preloaded interleaved state blocks.
  • the interleaved incremental state blocks are loaded, as they are encountered, in a contiguous memory area.
  • instead of the time stamp, we use the memory block pointers to indicate the memory state that needs to be read in. Since the memory area is contiguous, the state is encountered serially. Each successive state block adds its size to the current pointer. The state memory between the previous state pointer and the current state pointer is the portion that needs to be traversed for managing the state.
  • the full and incremental state block change schemes are combined. This is done by associating a block type with each state block.
  • the block type can be "FULL" or "INCREMENTAL". If the block type is FULL, then the previous state index is set to (current state-block index - 1).
Culling
  • View volume clipping removes portions of the scene that are not contained within the view volume.
  • the pipeline may do view volume clipping before per-vertex lighting.
  • Some of the primitives may be rejected if backface culling is enabled. If the scene contains closed surfaces, then parts of the object will be facing towards the viewer and parts will be facing away from the viewer. The parts facing away from the viewer are called backfacing. They may be rejected in the transformation stage, if backface culling is enabled.
  • Each pixel assigned to the triangle is processed by the scissor and stipple test.
  • the scissor test determines if a pixel (x, y) lies inside a window defined as the scissor rectangle (ScissorXmin, ScissorXmax, ScissorYmin, ScissorYmax). If (x, y) lies inside this rectangle it is painted; otherwise it is discarded.
  • the pixels inside the scissor rectangle may be subjected to a stipple test (if stipple is enabled).
  • the stipple test, if enabled, determines whether the pixel is obscured by the stipple pattern and therefore need not be painted. If the pixel is visible, then it is sent to the opaque cull stage. A sketch of both tests appears below.
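Read literally, the scissor and stipple tests above reduce to a few comparisons per pixel. The following C sketch assumes an OpenGL-style 32x32 stipple pattern stored as 32 machine words; the names are illustrative:

    #include <stdint.h>

    typedef struct { int xmin, xmax, ymin, ymax; } ScissorRect;

    /* Pixel survives the scissor test if it lies inside the rectangle. */
    static int scissor_test(const ScissorRect *s, int x, int y)
    {
        return x >= s->xmin && x <= s->xmax && y >= s->ymin && y <= s->ymax;
    }

    /* Pixel survives the stipple test if the pattern bit at
       (x mod 32, y mod 32) is set, or if stippling is disabled. */
    static int stipple_test(const uint32_t pattern[32], int x, int y,
                            int stippleEnable)
    {
        if (!stippleEnable)
            return 1;
        return (pattern[y & 31] >> (x & 31)) & 1u;
    }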
  • the information about the scissor rectangle, stippleEnable flag, and stipple pattern is a part of the Front-end state partition.
  • Rendering APIs such as OpenGL and Direct3D send primitives to the dedicated hardware with the expectation that the hardware will process the primitives in the order received. This assumption is made for two reasons. First, the state changes may be incremental. Second, the appearance of the rendered primitive may be modified as a result of what was rendered before. For example, while rendering a combination of opaque and translucent objects, the data may be sorted into opaque and translucent objects and rendered in an order that yields the correct color in a pixel. The data may also be sorted in a front-to-back or back-to-front manner. Stencils are also used for special effects, and a primitive rendered with stencil enabled (StencilEn) may affect the stencil buffer even if the stencil and/or depth tests fail.
  • Occlusion culling reduces computation by not rendering what is not seen.
  • the rasterization stage includes fragment color computation and texturing.
  • a depth test is carried out after the fragment color has been computed.
  • the reason for performing the depth test after rasterization is that the fragment color computation may cause some fragments to be discarded.
  • the alpha value of a fragment is a combination of the texture alpha and the material alpha. This alpha value is obtained after lighting computations are done. If the alpha value is zero, then the fragment may be discarded even if the depth comparison places it in front of whatever is already rendered. If the alpha value is less than one and alpha blending is turned on, then the situation becomes more complex. Even if the alpha value is one, the stencil test may discard a pixel. If occlusion culling is depth-based and done prior to lighting, it will discard the primitives behind fragments that are themselves subject to discard based on alpha, color, and/or stencil testing, resulting in pictures that are visually disturbing and wrong.
  • An improved method of occlusion culling is based on ordered rendering.
  • a primitive is considered opaque if the rendering state applicable to this primitive is such that its visibility is entirely determined based on the depth test. If this primitive fails the depth test, then it does not modify any of the buffers, and if it passes the depth test then it cannot be discarded by other tests such as alpha test or stencil test. In other words, it is rendered with alpha test and alpha blending options disabled and with all stencil operations, e.g. StencilFailOp, StencilPassZPassOp, StencilPassZfailOp, being such that they do not change the values in the stencil buffer.
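One way to express this opacity check is as a predicate over the rendering state. This C sketch is a hedged illustration; the state fields and the KEEP constant mirror OpenGL-style stencil operations rather than any specific structure from this patent:

    enum { STENCIL_OP_KEEP = 0 };   /* "leave stencil buffer unchanged" */

    typedef struct {
        int alphaTestEnable;
        int alphaBlendEnable;
        int stencilFailOp, stencilPassZFailOp, stencilPassZPassOp;
    } RenderState;

    /* A primitive is opaque for culling purposes when its visibility is
       decided by the depth test alone: no alpha test or blending, and no
       stencil operation that could modify the stencil buffer. */
    static int primitive_is_opaque(const RenderState *rs)
    {
        if (rs->alphaTestEnable || rs->alphaBlendEnable)
            return 0;
        if (rs->stencilFailOp      != STENCIL_OP_KEEP ||
            rs->stencilPassZFailOp != STENCIL_OP_KEEP ||
            rs->stencilPassZPassOp != STENCIL_OP_KEEP)
            return 0;
        return 1;
    }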
  • Some forms of occlusion culling have been implemented before. See, for example, "Method and Apparatus for Simultaneous Parallel Query Graphics Rendering Z-Coordinate Buffer", Jerome F. Duluk, U.S. Patent No. 5,596,687, and "Hierarchical Z-Buffer Visibility", Ned Greene, Michael Kass, Gavin Miller, Siggraph Proceedings 1993.
  • multiple passes through the cull stream are required.
  • deep FIFOs are required to hold at least one tile's worth of pixels due to the latency of the culling process.
  • the method proposed in this invention does not provide the maximum possible culling, but it is simpler to implement and improves performance over no culling as illustrated in Appendix A. It does not require deep FIFOs and does not cause latency bubbles in the pipeline.
  • the method incorporates a culling stage before the stages that do per-fragment operations. An additional copy of the depth buffer for the tile is kept for use by the culling process.
  • the Tile Render process sends either the cull stream data or a cull stream pointer to the cull process.
  • the cull process rasterizes the spatial data by edge walking. For each scan-line in the tile that is intersected by the triangle, the cull process determines the line end-points.
  • a hardware implementation may process several scan-lines simultaneously.
  • the depth buffer in the culling process is cleared to some depth value, say the one corresponding to the far plane of the view volume. For each pixel in the scan-line, its depth value is computed and the pixel is subjected to the depth test. If the depth test passes, the pixel is sent to the per-fragment operations stage; otherwise it is discarded. A sketch of this per-span test appears below.
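A minimal sketch of this per-span depth test, assuming a "less than" depth convention and linear depth interpolation across the span (the names and the emit callback are illustrative):

    /* Walk one span of a triangle inside the tile, testing each pixel
       against the culling depth buffer. Pixels that pass update the
       buffer and are forwarded to the per-fragment operations stage. */
    static void cull_span(float *cullDepth, int tileWidth, int y,
                          int xLeft, int xRight, float zLeft, float zdx,
                          void (*emit)(int x, int y, float z))
    {
        float z = zLeft;
        for (int x = xLeft; x <= xRight; x++, z += zdx) {
            float *d = &cullDepth[y * tileWidth + x];
            if (z < *d) {        /* passes the preliminary depth test */
                *d = z;
                emit(x, y, z);   /* potentially visible */
            }                    /* otherwise the pixel is discarded  */
        }
    }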
  • the very first primitive incident on the pixels of the tile will always be painted.
  • the fragments of the 2nd primitive covering the same pixel have a 50% chance of being painted, if we assume that there is an equal chance of their being in front of, or behind the first primitive already painted.
  • the branches of the scene represent different objects. Most of the objects are non-intersecting.
  • An application can determine whether an object is in front of the other by bounding box checks, and traverse the scene in a "front to back" object order manner. The exact ordering by triangles is expensive.
  • the object based ordering can be done quite easily in the transformation stage. This object ordering can further improve the efficacy of culling.
  • the object-ordered traversal and backface culling, in conjunction with the opaque culling, provide significantly improved culling efficacy.
  • the depth value that is computed needs to be the conservative depth value.
  • the smallest or largest depth values within the cull grid will occur at one of the corners.
  • the computed values may lie outside the range of triangle values, which is acceptable for conservative culling.
  • the four corners of an element on the cull grid are represented by BL (bottom left corner), BR (bottom right corner), TL (top left corner) and TR (top right corner).
  • the cull grid is square. (This assumption does not limit the generality of the method. It is used here for the purpose of simplifying the illustration. A scale factor applied to Zx, Zy removes this restriction.)
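Since depth varies linearly over a triangle, its extreme values over a square cell occur at the cell's corners, which yields a cheap conservative bound. A hedged C sketch, assuming depth gradients Zx, Zy and the depth zC evaluated at the cell center:

    #include <math.h>

    /* Conservative depth bounds over a square cull-grid cell of side s.
       The bounds may extend beyond the triangle's true depth range,
       which is acceptable for conservative culling. */
    static void conservative_depth(float zC, float Zx, float Zy, float s,
                                   float *zMin, float *zMax)
    {
        float ex = 0.5f * s * fabsf(Zx);   /* max deviation along x */
        float ey = 0.5f * s * fabsf(Zy);   /* max deviation along y */
        *zMin = zC - ex - ey;              /* nearest-corner value  */
        *zMax = zC + ex + ey;              /* farthest-corner value */
    }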
  • the Culling process retrieves the data from the cull stream. It determines the depth samples inside each triangle using the well-known edge and span walking technique. The top and middle vertex of the triangle, the slopes of its three edges, and the leftCorner flag used for edge and span-walking have already been determined in the binning stage. The scan conversion starts at the top vertex. The intersection of each scan line with the long edge of the triangle is found. Processing proceeds left to right if the LeftCorner flag is not set, otherwise it proceeds right to left, until the triangle edge on the other side is encountered. The pixels on the scan line between the left and right edges of the triangle are assigned to the triangle. The primitive is rasterized on the depth sample grid for as many rows at a time as are covered by the size of the color pixel grid. This is in order to compute the depth sample coverage mask used in the color pixel data structure sent to the per-fragment operations stage.
  • a color pixel may be only partially covered by the fragment of the triangle. It is also possible that the center of the color pixel may be outside the triangle. It is important that the location for color computation be inside the triangle. For each color pixel, the first and last rows with coverage on that pixel are found. The row closest to the middle of the first and last rows is found, and the midpoint of the span on that row is the location at which the per-fragment color is computed. This is illustrated in Figure 29. In this drawing the tile is assumed to be 32x32 and the pixel grid is 8x8. The intersection of the triangle ABC with the color pixel at grid location 2,1 (2nd color pixel in the third row of color pixels) covers depth sample rows 16 to 20.
  • the color sample location is chosen at the midpoint of the pixel span at scan line 18. The location for color computation for each color pixel in the tile covered by this triangle is shown in Figure 29.
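The row-and-span search described above can be sketched as follows; the Span representation (left > right meaning "row not covered") is an assumption for illustration:

    typedef struct { int left, right; } Span;   /* covered samples on a row */

    /* Choose the color-computation location for one color pixel from its
       per-row coverage on the depth-sample grid. Rows between the first
       and last covered row are contiguous for a convex primitive.
       Returns 0 if the pixel has no coverage at all. */
    static int pick_color_location(const Span *rows, int nRows,
                                   int *outRow, int *outCol)
    {
        int first = -1, last = -1;
        for (int i = 0; i < nRows; i++) {
            if (rows[i].left <= rows[i].right) {
                if (first < 0) first = i;
                last = i;
            }
        }
        if (first < 0)
            return 0;
        *outRow = (first + last) / 2;    /* row closest to the middle */
        *outCol = (rows[*outRow].left + rows[*outRow].right) / 2;  /* midpoint */
        return 1;
    }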
  • a row of the color pixel grid covers one or more rows in the cull grid.
  • the "visible" bits for the cull grid elements are examined to determine if they are hidden. If all cull grid elements within the color pixel are hidden, then the color pixel is discarded. Otherwise, the row information is used to determine the location within the color pixel most suitable for color computation.
  • a color fragment data structure is then assembled and sent down to the per-fragment operations stage.
  • the color fragment data structure contains the tile-relative (x, y) location of the color pixel, a coverage mask, and a location within the color pixel for the computation of the per-fragment color.
  • the coverage mask has one bit for each depth sample. All the samples in the coverage mask with value 1 are potentially visible and are assigned the computed color. Their depth values are computed individually in the per-fragment operations stage for exact hidden surface removal.
  • the culling process sends several types of data packets to the per-fragment operations stage.
  • the tile data packet includes the information about the current tile being rendered. It has the tile coordinates.
  • a tile packet will typically be followed by a triangle packet.
  • a triangle packet has pointers to the fragment stream and the per-fragment and post-processing state blocks.
  • a triangle packet may be followed by a sequence of color data structures, corresponding to each color pixel that needs to be processed. If none of the color pixels are visible, then the triangle packet will be followed by another triangle packet, and the old triangle packet can be discarded.
  • This stage takes the data passed down from the culling stage and performs the necessary processing to compute the color at each fragment.
  • the per-fragment color computation may involve texture mapping and per-fragment lighting computations (such as specular highlights, bump mapping etc.).
  • This invention includes a mechanism for reducing the per-fragment cost of color computation by providing a color fragment data structure and a grid for computing the colors.
  • a color fragment may consist of one or more depth fragments.
  • the depth computation may be done on a finer grid of samples than the color computation is done on.
  • the rationale for this is that color variations within a small neighborhood of pixels within the same object are much smaller than the color change at the edges of the object. Having a finer depth fragment grid allows better hidden surface removal at the edges of the objects.
  • the size of the color fragment grid is programmable. Some implementations of anti-aliasing compute the color per-pixel, but depth computation is done on a finer grid. These implementations average the color value over all samples to get average pixel color value.
  • anti-aliasing is implemented as a post-processing step by averaging the pixels in the final image. Effectively, the final image has the same resolution as the depth sample grid.
  • the grid on which the color computation is done can be coarser.
  • the antialiasing grid is used to implement anti-aliasing on the final image.
  • the invention supports three modes for anti-aliasing: (1) It can be turned off, in which case the pixels of the final image correspond to the depth sample grid; (2) It can be implemented as convolution over the anti-alias grid, in this case also the pixels in the final image correspond to the depth sample grid; and (3) It can be implemented as average over the Anti-alias grid, in this case the pixels in the final image correspond to the size of the Anti-alias grid.
  • These options are programmable. Note that there is a significant difference between our approach and multi-sample implementations of anti-aliasing.
  • the computed color can be further modified by incorporating color conversion and color correction on the computed value.
  • the method used for color conversion and color correction is described later.
  • Such an embodiment of the invention will allow integration of images (i.e. textures) from a variety of input sources into the same display or printed image.
  • the computed color is assigned to each of the depth samples.
  • the depth samples with the computed color and depth value are passed down to the alpha, color, and stencil tests. These tests are done on the fine depth sample grid.
  • the depth test is in addition to the depth test carried out earlier in the opaque culling stage. A fragment may be discarded as a result of these tests. Note that the depth test in the culling stage is preliminary. It discards the samples on the cull grid that are definitely not going to be visible.
  • the depth test after color computation is the exact test and refines the depth test done in the culling stage. It eliminates all fragments that are hidden.
  • the depth test is followed by other 3D pipeline stages such as the alpha blending, dithering, and logical operations.
  • the color and depth values of the fragments that survive are written into the on-chip color buffer for the tile.
  • the tile is subjected to postprocessing.
  • the post-processing stage implements several processing steps.
  • the color correction step is responsible for mapping the palette of colors used by the image generator into the palette of colors of the printer. The color is then converted into the color space of the printer. If the color space of the printer is CMYK, then black generation and undercolor removal are done next, for two reasons. The first reason is that equal values of c, m, and y should produce gray, and all three of c, m, and y at one should produce black; in practice, however, the latter combination actually produces a dirty gray.
  • the black component is generated by identifying the smallest of the three values, namely the values of the cyan, magenta, and yellow colors at the pixel, and assigning it to the K component. This is called black generation.
  • the K component is then subtracted from each of the c, m, and y channels.
  • One of the channels will be left with a value of zero (the one whose value was assigned to K).
  • the other two channels will also have their contributions diminished. This is called undercolor removal. This reduces the amount of ink that is placed on the paper; a lot of ink can soak and warp the paper. While the actual mechanism for black generation and undercolor removal is simple, it has a large impact on the quality of printing. A minimal sketch of both steps follows.
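Black generation and undercolor removal reduce, in their simplest form, to a min and three subtractions per pixel. A minimal C sketch (full implementations often scale the generated black by a tunable function, which is omitted here):

    /* K = min(C, M, Y); then remove K from each of C, M, Y.
       At least one channel (the former minimum) drops to zero. */
    static void black_generation_ucr(float *c, float *m, float *y, float *k)
    {
        float kk = *c;
        if (*m < kk) kk = *m;
        if (*y < kk) kk = *y;
        *k = kk;               /* black generation   */
        *c -= kk;              /* undercolor removal */
        *m -= kk;
        *y -= kk;
    }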
  • Halftoning converts the c, m, y, and k values into binary color separations.
  • the information contained in the post-processing state block will change depending on the method used in a particular embodiment.
  • the first method implements the color correction process as three transformation stages.
  • the RGB color is first converted to the CIE space using supplied RGB TO CIE transformations.
  • the CIE color space of the renderer is then transformed to the printer's CIE space via a corrective transformation.
  • the corrected transformed CIE values are then converted back to the RGB value using the inverse CIE TO RGB transformations.
  • Each of these transformations is a 4x3 matrix.
  • the fourth row allows the flexibility to incorporate a bias in each of the color values.
  • the values are clamped to the allowed range after each transformation.
  • the format of the matrix elements depends on the implementation. The most general implementation treats the values as single precision floating point values. Each color component is converted to a floating point value before the transformations begin.
  • the output of the color correction stage is a tile of ARGB values corresponding to the final rendered image.
  • the additive RGB primary values are converted to the subtractive CMY color space using a linear RGB-to-CMY transformation, typically C = 1 - R, M = 1 - G, Y = 1 - B for color components normalized to the range [0, 1].
  • C, M, Y are the computed values of the color in the cyan, magenta and yellow space.
  • the black generation stage creates the black component by finding the smallest of the three c, m, y color components of the pixel.
  • the black color is then removed from the color values to finally yield the (c, m, y, k) value used for halftoning.
  • the post-processing state partition in this case consists of the three transformation matrices used in the color correction stage, and the ranges of the transformed values; in addition to the state parameters needed by the halftoning stage, as described in a later section.
  • We have focused here on the CMYK color space.
  • the color conversion can be implemented to other spaces as well, such as YUV.
  • the look-up table is three dimensional.
  • the R, G, and B components of the color value are used to index into the three-dimensional look-up table.
  • Each entry in the look-up table is a four component color vector, with the components corresponding to the resulting cyan, magenta, yellow, and black color values. (The entries in the look-up table can have fewer or more components.)
  • In one version of the table, there is one table entry for every possible value of the computed RGB (or any other input) color.
  • We refer to the computed RGB color as the source color.
  • the table will contain 16 million entries (256x256x256).
  • each entry has four components. If each of these components is also 8 bits, then 64 megabytes are required to hold the lookup table. The color entries are looked up as needed. This can cause a significant memory and memory bandwidth requirement.
  • These costs can be reduced by using a look-up table with fewer entries.
  • the table entries are uniformly distributed.
  • the RGB color space may be divided into 32 sections along each of the red and blue channels and into 64 sections along the green channel. This provides a lookup table with 64K entries. Indexing into the table is done by considering the 5 most significant bits of the red and blue channels and the 6 most significant bits of the green channel.
  • the resulting color may be obtained in one of two ways. It can be found by indexing the cell closest to the source color and using the value in that cell. The resulting color can also be obtained by linearly interpolating the 8 closest entries in the LUT around the source RGB value. The 3 least significant bits of the red and blue channels and the 2 least significant bits of the green channel are used as blending factors in the interpolation. This is schematically outlined in the following pseudo code:
  • DoLinearUniformColorLookup (unsigned byte r, g, b) {
        int r_idx, g_idx, b_idx;
        float r_fac, g_fac, b_fac;
        CMYK_color c000, c001, c010, c011, c100, c101, c110, c111;
        CMYK_color temp1, temp2, temp3, temp4;
        CMYK_color result;
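The body of this pseudo code did not survive in the source text. The following C sketch is one possible completion under the assumptions stated above (32 sections for red and blue, 64 for green, with the discarded low-order bits acting as blending factors); the lut array, CMYK_color type, and lerp helper are illustrative, not the patent's:

    typedef struct { float c, m, y, k; } CMYK_color;

    extern CMYK_color lut[32][64][32];           /* [r][g][b] sample grid */

    static CMYK_color lerp(CMYK_color a, CMYK_color b, float t)
    {
        CMYK_color r;
        r.c = a.c + t * (b.c - a.c);
        r.m = a.m + t * (b.m - a.m);
        r.y = a.y + t * (b.y - a.y);
        r.k = a.k + t * (b.k - a.k);
        return r;
    }

    CMYK_color DoLinearUniformColorLookup(unsigned char r, unsigned char g,
                                          unsigned char b)
    {
        int r_idx = r >> 3, g_idx = g >> 2, b_idx = b >> 3;  /* 5/6/5 MSBs */
        float r_fac = (r & 7) / 8.0f;    /* 3 LSBs of red   */
        float g_fac = (g & 3) / 4.0f;    /* 2 LSBs of green */
        float b_fac = (b & 7) / 8.0f;    /* 3 LSBs of blue  */
        int r1 = r_idx < 31 ? r_idx + 1 : r_idx;   /* clamp at table edge */
        int g1 = g_idx < 63 ? g_idx + 1 : g_idx;
        int b1 = b_idx < 31 ? b_idx + 1 : b_idx;

        /* blend the 8 surrounding entries: first along b, then g, then r */
        CMYK_color t00 = lerp(lut[r_idx][g_idx][b_idx], lut[r_idx][g_idx][b1], b_fac);
        CMYK_color t01 = lerp(lut[r_idx][g1][b_idx],    lut[r_idx][g1][b1],    b_fac);
        CMYK_color t10 = lerp(lut[r1][g_idx][b_idx],    lut[r1][g_idx][b1],    b_fac);
        CMYK_color t11 = lerp(lut[r1][g1][b_idx],       lut[r1][g1][b1],       b_fac);
        return lerp(lerp(t00, t01, g_fac), lerp(t10, t11, g_fac), r_fac);
    }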
  • the lookup table is characterized by the number of entries along each of the red, green, and blue channels.
  • the other three tables provide the sample color value at which the entry in the LUT is computed.
  • the length of this second set of tables corresponds to the length of the LUT along each of the three dimensions. For a source color (r, g, b) the LUT indices and the corresponding sample color are determined as follows.
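One plausible realization of these tables, offered as a hedged sketch (the exact table layout is not spelled out in the surviving text): per-channel index tables map each 8-bit source value directly to a LUT index, and per-channel sample tables record the source value at which each LUT entry was sampled:

    extern const unsigned char idx_r[256], idx_g[256], idx_b[256];
    extern const unsigned char sample_r[], sample_g[], sample_b[];

    static void lookup_indices(unsigned char r, unsigned char g, unsigned char b,
                               int *r_ix, int *g_ix, int *b_ix,
                               int *r_sample, int *g_sample, int *b_sample)
    {
        *r_ix = idx_r[r];                /* source color -> LUT index */
        *g_ix = idx_g[g];
        *b_ix = idx_b[b];
        *r_sample = sample_r[*r_ix];     /* needed only when the LUT     */
        *g_sample = sample_g[*g_ix];     /* entries will be interpolated */
        *b_sample = sample_b[*b_ix];
    }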
  • r_ix, g_ix, and b_ix are used to index into the LUT. This provides the mapping.
  • the r_sample, g_sample, and b_sample values can be ignored in this case.
  • the sample color (r_sample, g_sample, b_sample) is used if linear interpolation is performed on the LUT entries.
  • the post-processing state in this embodiment consists of the three look-up tables mapping source colors into color indices and the three look-up tables that determine the color sample value corresponding to the samples of the color lookup table.
  • the non-uniform lookup table provides an efficient method for many color related operations.
  • the table can be used to incorporate several processing operations into one lookup, by preparing the table appropriately. For example, color correction, gamma correction, histogram equalization, and several image processing operations that map one color value to another can be incorporated.
  • the invention allows for more sophisticated filtering schemes for interpolation.
  • Each cache line contains the tag and the data.
  • the tag is the block_id. If the cache line is valid and the cache tag at the cache line index is the same as the desired block_id, then we have a cache hit, otherwise it is a miss and the cache line is read in from memory.
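A direct-mapped version of this cache check might look like the following; the line size, line count, and block_id decomposition are assumptions for illustration:

    #define CACHE_LINES  256
    #define LINE_ENTRIES 16               /* LUT entries per cache line */

    typedef struct { float c, m, y, k; } CMYK_color;

    typedef struct {
        int        valid;
        unsigned   tag;                   /* block_id of the cached line */
        CMYK_color data[LINE_ENTRIES];
    } CacheLine;

    extern void fetch_line(unsigned block_id, CMYK_color out[LINE_ENTRIES]);

    static CMYK_color cached_lookup(CacheLine cache[CACHE_LINES],
                                    unsigned block_id, unsigned offset)
    {
        CacheLine *line = &cache[block_id % CACHE_LINES];
        if (!line->valid || line->tag != block_id) {   /* cache miss */
            fetch_line(block_id, line->data);          /* read from memory */
            line->tag = block_id;
            line->valid = 1;
        }
        return line->data[offset];                     /* hit path */
    }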
  • the above illustration of caching uses a three-dimensional table.
  • the method is general and can be applied to an N-dimensional table.
  • the elements of the table do not need to be 4 component vectors and can have any dimension.
  • the method can be used for mapping RGB colors to 6 component colors or HSB colors to any other color space.
  • the printing process uses ink or other media for printing the images. These media are binary in nature.
  • the illusion of dynamic range in color is generated by incorporating halftoning.
  • Halftones are specified by an angle, frequency and a pattern.
  • the angle specifies the orientation of the halftone patterns.
  • the angle, frequency, and pattern are used to create the rectangular patterns that are page aligned.
  • These halftone patterns can be viewed as bricks, with each successive row of bricks displaced horizontally by a certain amount with respect to the previous row. This is schematically shown in Figure 30.
  • Halftone patterns are essentially threshold arrays, in that the pixel color value is compared against that in the corresponding location in the threshold array, and if the pixel color value is larger or equal to the threshold value, then the dot is painted, otherwise not.
  • One threshold array is used for each color separation. Each element of the threshold arrays could be up to eight bits deep.
  • Each threshold array is characterized by a set of {width, height, displacement, x origin, y origin, x offset, y offset} parameters. Given the (x, y) pixel location, the threshold array element is determined as follows.
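The brick geometry suggests an index computation along the following lines; treat this as a hedged sketch, since the precise roles of the origin and offset parameters are not reproduced in the surviving text:

    typedef struct {
        int width, height;           /* threshold array dimensions      */
        int displacement;            /* horizontal shift per brick row  */
        int x_origin, y_origin;      /* page-space origin of the bricks */
        const unsigned char *threshold;  /* width * height entries      */
    } Brick;

    /* Decide whether to paint a dot for one separation at image pixel
       (x, y); assumes x, y at or beyond the brick origin. */
    static int halftone_dot(const Brick *b, int x, int y, unsigned char value)
    {
        int ry = y - b->y_origin;
        int row = ry % b->height;
        int brickRow = ry / b->height;
        /* successive brick rows are shifted by `displacement` */
        int rx = x - b->x_origin - brickRow * b->displacement;
        int col = ((rx % b->width) + b->width) % b->width;
        return value >= b->threshold[row * b->width + col];  /* paint if >= */
    }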
  • Figure 31 shows the correspondence between the brick parameters and the tiles.
  • the tile-relative (x, y) value is first converted to an image-relative (x, y) value before halftones are applied.
  • the post-processing state associated with halftoning contains the brick width, height, displacement, x_offset, y_offset, and the threshold array for each of the four color separations.
  • the invention could be implemented in software, in hardware, or in a combination of the two.
  • portions of the pipeline can be implemented in a VLSI chip.
  • the color separations delivered from the modified pipeline of the invention can be downloaded to a host computer and integrated into a page via appropriate printer drivers.
  • individual processing blocks could be implemented in DSP devices.
  • n_AVG = (SUM_i (i * P_i)) / (SUM_i (P_i)), where i is an integer from 1 to 4.
  • P_n is the probability (prob) that exactly n out of the 4 primitives incident on the pixel are rendered. For example:
  • P_3 = (prob that the 1st prim is rendered * prob that the 2nd prim is in front of the 1st * prob that the 3rd prim is in front of the first two prims * prob that the 4th prim is behind one of the first 3 prims) + (prob that the 1st prim is rendered * prob that the 2nd prim is behind the 1st prim * prob that the 3rd prim is in front of the first two prims * prob that the 4th prim is in front of the first three prims) + (prob that the 1st prim is rendered * prob that the 2nd prim is in front of the 1st prim * prob that the 3rd prim is behind one of the first two prims * prob that the 4th prim is in front of the first three prims).
  • the average number of primitives rendered (n_AVG) per pixel is 2.1, or about half of the original data.
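The 2.1 figure can also be checked directly. Under the equal-chance assumption, the k-th primitive incident on a pixel is painted exactly when it lies in front of all k - 1 primitives painted before it, which happens with probability 1/k; the expected number painted is therefore 1 + 1/2 + 1/3 + 1/4 = 25/12, or about 2.08, consistent with the 2.1 figure above.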
  • the advantage of ordered culling increases on a percentage basis.
  • the approach renders 46% of the scene, as shown below.

Abstract

A scene is rendered by a first stage of sorting (154, 158) primitives (Fig. 19) of the scene among regions of the scene and a second stage of sorting the primitives of each of the regions among sub-regions of the regions. Each of the sub-regions is processed to render the scene. State information (156) is maintained in connection with processing primitives of portions of a scene to be rendered. The state information (156) is used in rendering each of the portions of the scene to which the primitives belong. The state information (156) is divided into state partitions. Fragments of primitives are culled using a depth test (174). The colors of pixels in the scene are determined based on computations (168) with respect to separate depth sample and color grids. Color data for each one of a set of tiles of the scene are drawn from an on-chip buffer (176). Color conversion (188), color correction (186) and other post-processing operations are performed on each pixel in the image. Color separations (194) of each of the tiles are generated without storing the color data off-chip. The color separations (194) for each of the tiles are delivered to a printer or to an off-chip storage device. Post-rendering steps including half-toning (192) are performed on a tile-by-tile basis for the tiles of the scene.

Description

RENDERING IMAGE DATA
BACKGROUND
The invention relates to rendering image data, e.g., for printing and other high resolution and high quality visual applications such as film production.
3D image data, for example, defines three-dimensionally relevant characteristics (such as shape and appearance) of objects in an image. The shape of each three-dimensional object in the scene is described as a collection of planar triangular patches. Among the sources of 3D data are HTML, Java, digital content creation software for video games and films, and CAD applications.
3D data may be used to render 2D rasterized pixel data for display on a monitor ("display rendering"). Also, when 3D data is to be printed, the printer often is driven using 2D display data that has been rendered from the 3D data ("print rendering"). Printers have much higher resolution than display monitors and using 2D display data to generate the printed image causes the edges of objects in the scene to appear blurred in the printed image. In display rendering, the objects of the image are rendered serially. This does not necessarily correspond to their closeness in the rendered 2D image. Consequently, the entire image must be built up in a frame buffer before the raster data may be delivered in scan line order from the frame buffer to the display. Because of limitations on frame buffer size, this approach may not be suitable for generating large images.
A series of the rendered images may represent a sequence of temporal frames. Each frame in the sequence may capture a snapshot of a changing scene that contains moving objects at a moment in time.
The art of display rendering to generate 2D images from 3D data has been documented in several books, including Foley, Van Dam, Feiner and Hughes, "Computer Graphics: Principles and Practice", Second Edition, Addison Wesley, 1993; and Watts, "3D Computer Graphics", Second Edition, Addison Wesley, 1993. Rendering 3D objects requires several distinct processing steps. The process is computationally intensive and requires substantial amounts of memory for storage and of memory bandwidth for reading and writing data. An image with width of 1600 pixels and 1200 rows (1600x1200) contains 1.92 million pixels. If 32 bits per pixel are used for color information, 24 bits per pixel for depth information, and 8 bits per-pixel for stencil information, then 15.36 megabytes (MB) are needed to store a single set of color, depth, and stencil buffers. The color buffers may need to be double buffered so that while one frame is being displayed (or scanned out), the display processor can be rendering the next frame in the temporal sequence. Some systems use more than two color buffers to even out the per-frame rendering time. Memory is also needed to store the geometric descriptions of the objects in the scene and the textures used in 3D rendering. Rendering of each pixel potentially requires reading and writing of the color, depth and stencil buffer values at the location of that pixel. If the object is textured, the texture elements overlapping the pixel also need to be read. Many objects in the scene may overlap a particular pixel and therefore a pixel may need to be colored many times. (The average number of object surfaces overlapping a particular pixel is referred to as depth complexity.) All this can add up to several gigabytes per second of memory bandwidth required for rendering images for display on computer monitors.
To meet these computational, memory, and bandwidth requirements, graphics processors often use frame buffer memory local to the processor. Figure 1A and Figure 1B show configurations of graphics processors in typical personal computer systems. The host computer 200 includes a CPU 300, a memory controller 400, and a system memory 500. In Figure 1A, the graphics processor 100 connects to the host computer 200 on a PCI bus, which it shares with other peripheral components. Figure 1B shows a more recent configuration, in which the graphics processor 100 connects to the host computer 200 on an AGP bus, which provides higher bandwidth than the PCI bus. Similar configurations are used in Unix workstations. In all cases, the communication between the host computer and the graphics processor is limited by the bandwidth of the bus connecting the graphics processor to the host (host bus). The display renderer is incorporated in the graphics processor. The graphics processor typically also performs other functions, such as scanning out the images to the computer monitor. The frame buffer memory 600 is used to store the color, depth, and stencil buffers. In some systems, frame buffer memory also stores textures. The performance of systems that render 2D data from 3D data is thus limited by the available frame buffer and system memories, host bus bandwidth, frame buffer memory bandwidth, and the CPU processing capability.
Figure 2 shows the rough breakdown of the processing steps for rendering an image. The per- vertex operations 700 involve transformation of the vertices of the object's triangles from their model space to image space and computation of colors at the vertices of these triangles by performing lighting computations based on lights in the scene and material properties of the object. The rasterization stage 800 is responsible for determining which image pixels are touched by each triangle and for interpolation of various attributes (such as color, depth, and texture coordinates) at the location of each pixel. Per-pixel operations 900 involve texturing, hidden surface removal, alpha blending, and other effects at each rasterization pixel.
Some of these steps are implemented on the CPU 300 and some in dedicated display rendering hardware. Most systems perform per-vertex operations on the host 200 and rasterization and per-pixel operations in the display renderer, which is a part of the graphics processor 100. Even if the color, depth, stencil, and texture data are stored in the frame buffer memory, the bandwidth requirements for reading and writing every pixel several times for each high resolution frame are beyond the capabilities of current systems.
Some rendering architectures divide the image into square tiles and build the entire image by rendering one tile at a time. See, e.g., J. Kajiya and J Torberg, "Talisman: A Commodity Graphics System", Siggraph 1996 and "Why Power VR is the Future of 3D Graphics Technology" Technical Backgrounder on Power VR from NEC Corporation and VideoLogic Inc., 1996. This approach is based on dividing the screen or the display surface into a number of abutting screen-aligned rectangular tiles. The color, depth, and stencil buffer values for a tile are held on the graphics processor chip, which eliminates the need to go to the frame buffer memory to read and write depth, color, and stencil buffer values for each pixel that every triangle touches. When all primitives incident on the tile have been processed, the depth, color and stencil buffers for the tile can be saved off the chip.
In a tile-based architecture, a sort stage 1000 is incorporated after the per-vertex operations, as shown in Figure 3. In the sorting stage, all triangles contained within the scene are sorted by the tiles they touch. Then, for each tile, all the triangles incident on that tile are rendered in the order they are received from the per vertex operation stage. The tile traversal stage 1001 reads the data in each tile and sends it to the rasterization stage 800. Tiling helps to reduce the memory bandwidth requirement. However, the triangles must be sorted by tile before the rendering for the tile can begin. Sorting can cause significant additional computing burden on the host. The sort memory 1010 may be a part of the frame buffer memory or may reside on a separate memory interface of its own.
Typically, the per-vertex processed data is delivered to the graphics processor as it is processed by the host. However, if sorting is done on the host, i.e. the dedicated hardware does not sort the triangles by tile, then the host needs to process all triangles in the scene to create the sorted lists of triangles for each tile. The graphics processor starts processing the tiles only after all the sorting is done. Thus all the data for the frame to be processed by the graphics processor needs to be stored in system memory. As the graphics processor renders tiles for a current frame, the CPU starts processing the next frame in the sequence. This causes the host to accumulate the triangles for up to two frames. This can increase the system memory requirement for large scenes.
In tile-based architectures, at any moment, there are at least two frames in flight, while a third frame is being displayed. The transformed and lit vertices are sent to the sort stage 1000. The sort stage 1000 bins the triangles into tiles as it receives them from the per-vertex operations stage 700. An EndScene signal indicates the end of the frame. At this time, the tile traversal 1001, rasterization 800, and per-pixel operations 900 stages can start working on the binned or sorted tiles. The image can be displayed when all the tiles have been rendered. Thus, as shown in Figure 4, in a tiled, double-buffered system, while the per-vertex operations and sort stages are working on frame N, the tile traversal, rasterizer, and stages downstream are working on frame N-1, and frame N-2 is being displayed. This is schematically outlined in Figure 4. This gives rise to a latency in display of the images that is intrinsic to tiled rendering, because the scene must be sorted before the tiles can be rendered. In a tiled system more triangles must be processed during rendering than in a non-tiled system because a triangle often touches several tiles, requiring repeated rendering of the triangle. Depending on how the sorted data is organized, more memory may be needed to store the sorted data than the original scene contained. Furthermore, if the setup stage (computation of edge slopes and attribute gradients) for rasterization is performed after sorting, the number of triangles to be set up is also increased correspondingly. The color value of a pixel in an image may be represented in many color spaces.
Commonly used color spaces are RGB for display on computer monitors, CMYK for printing, and YUV for television. The 2D image data produced by the display renderer is typically in RGB color space. This pixel data needs to be converted to the CMYK color format for printing using a mapping from one color space to another. Different display devices have different display characteristics and therefore linear mapping from one space to another space alone is not sufficient. The color conversion methodology needs to take into account the color characteristics of the display device and incorporate color correction. The 2D display data is of relatively low resolution. Maximum available resolution on current high-end systems is 2048 pixels per row and 1280 rows.
The resolution required for high quality printing is much higher. At 1200 dots per inch, an 8" x 6" image corresponds to 9600x7200 pixels, which is 36 times more pixels (with corresponding increase in frame buffer memory) than the 1600x1200 image considered earlier. This scales up linearly for larger image sizes and higher resolutions, which are the norm for hardcopy devices such as printers and digital film recorders. A 16384x16384 pixel image will require 2 gigabytes of memory to hold a single set of color, depth, and stencil buffers. In the tiled architecture, the depth and stencil buffers may not need to be saved. Even so, 1 GB of memory will be needed for the color image. Furthermore, these image pixels need to be read and processed for color correction, color conversion, undercolor removal, and black generation to create the color separations, before the data can be printed, thus giving rise to additional memory and bandwidth requirements.
Application developers use several programming interfaces for content creation. For incorporating 3D scenes in an application (say, for example, an electronic on-line book with an animated illustration of a 3D scene such as an electric motor), developers can use APIs such as the OpenGL, Java3D, and VRML. Page descriptions can be generated using PostScript, HTML, SGML, or other interfaces. Printer drivers interpret the description and appropriately process the data to suit the capabilities of the target print engine. 3D image data of a scene to be rendered is typically represented as a directed acyclical graph referred to as a scene graph. A scene is made up of objects. The shape of the rendered object is determined by its geometry and the transformations applied to various parts of the object. (The same geometry may be drawn with different transformations to create multiple instances of the same thing in the scene, e.g. wheels on the car. The wheels on the car have the same geometry and appearance but different transformations.) The object has other attributes that determine its appearance such as color, material properties, and textures associated with them. The nodes in the scene graph represent objects and the information about them such as their appearance and transformations. The nodes can be of various types, such as transformation, appearance, and geometry. Cameras and lights may be treated as special objects or node types. The branches in the scene graph provide associations and scene traversal order. (Additional information about description of scenes can be found in the Virtual Reality Modelling Language specification.) In Figure 5, we show the description of a simple scene. This scene consists of two cylinders on a table and a pyramid. One of the cylinders is tall and thin and the other squat and wide. The root node of the tree, labeled "Scene", defines the world co-ordinate system. The camera or the observer is placed at some position in the scene. In this example, the "camera transform" node defines the viewpoint relative to the world coordinate systems and the viewing projections. Other than the branch leading to the "camera", the root node has two other branches emanating from it. One branch corresponds to the "table and things on it", and the second branch corresponds to the pyramid. The node at the "table and things on it" branch is the transformation node and defines the transformation needed to place the "table and things on it" in the world coordinate system. Similar logic applies to other objects in the scene. In this example, the positioning of the short and squat cylinder is obtained by applying a composite transform (T01 * T00 * T10) to the points describing the geometry of the cylinder. T01 is the viewing transformation associated with the camera. The concatenation of T00 and T10 provides the model-world transformation. T10 transforms the cylinder description in its model coordinate frame into that of the "table and things on it" coordinate frame. T00 transforms the "table and things on it" into the world coordinate system. Similarly, the tall cylinder is positioned by applying transformation (T01 * T00 * T11) to the points describing the geometry of the cylinder. The pyramid is obtained by applying transformation (T01 * T02) to the points describing the pyramid. The table is obtained by applying transformation (T01 * T00 * T12) to the points describing the geometry of the table.
The appearance of an object in the scene depends on the relationship of the object to the viewpoint, viewing projections, rendering attributes such as the material and texture associated with the object, and the environment of the object, e.g., lighting and fog conditions. The appearance node in the graph contains information about the color, material properties, and texture applied to the geometry of the object. The environment node (not shown) may contain information about other parameters such as fog. All parameters affecting the appearance of an object (such as model-view transformation, viewing projection, material, texture, fog, and lights in the scene) are typically encompassed in what is called a rendering state or context.
During rendering a scene graph is typically traversed depth first, i.e. a branch of the tree is traversed all the way down to its leaf node before processing the second branch at the same node. The branches are traversed left to right, i.e. the left branch is traversed before the branch at the right. Thus in the example of Figure 5, the short and squat cylinder is rendered first, then the tall cylinder, then the table, and finally the pyramid. The geometry associated with the object is rendered using the applicable rendering state.
APIs are available for processing 3D scene graphs. VRML, for example, is scene-graph based. In some traversal schemes, state changes between branches of the scene graph may be incremental. State parameters that are not changed retain their value. In effect, there is one "current rendering state" or "context" that is applied to geometry at the leaf node. In this scheme, by the time the pyramid is encountered in the scene graph, the state changes due to rendering of table-and-things-on-it will be reflected in the current state, and the pyramid appearance node will only contain the needed state changes. In other traversal schemes, the appearance node may encompass all parameters of the state. In yet another scheme, the state may be inherited down the branches of the graph but not across the branches in the graph.
The geometry of an object is described as a set of topologically connected triangular patches. Triangular representation is chosen because triangles are guaranteed to be planar. Thus, the object may be described as a set of triangles, triangle strips, triangle fans, and/or meshes. The geometry of the cylinder in the example above, may be described by one triangle strip and one triangle fan. The triangle strip is used to represent the walls of the cylinder and the triangle fan is used to represent the base of the cylinder. Figure 6 shows the triangle fan and triangle strip associated with the cylinder as well as the connectivity of vertices for the triangle fan and strip description. The triangle fans and strips reduce the number of vertices that need to be processed for an object. The higher order representations of objects such as parametric surfaces are reduced to triangle based primitives before rendering.
Figure 7 shows an outline of processing in traditional display renderer. The output of one processing stage is input to the next stage in a pipeline fashion; therefore the traditional processing is also referred to as 3D pipeline. Processing of each object begins with transformation of the vertices to obtain the coordinates in the eye space. This stage may also include computation of color at each vertex (i.e. per-vertex lighting) by taking into account the material properties and normals at each vertex, and the projection transformation to account for the perspective correction. Next, view volume clipping is performed. This removes the parts of the scene that are not inside the viewing frustum. This may be followed by the viewport/device transformation that generates the projection of the triangle on the viewing plane along with the information about the depth of each vertex from the view point. We now have coordinates of each vertex in the window or device coordinate space. This completes the per-vertex operations 700 in the pipeline.
Next the triangle setup is performed. Triangle setup is a pre-cursor to rasterization. Rasterization determines which pixels in the image are covered by each of the primitives (triangles, lines, and points). The rasterization also computes the interpolated attribute values at each pixel touched by the triangle. Several methods are available for the rasterization of primitives. One commonly used method involves edge walking followed by span walking. Edge walking is used to compute the end points of the span covered by the primitive for each of the scan lines it touches. For the triangles, this is done by first finding the vertices with the maximum and minimum y value. This provides the range for scanlines touched by the triangle. For each scan line, its intersection with the left and right edges of the triangle is computed. The intersection points provide the left and right end-points of the span within the projection of the triangle. The edge walking stage also computes the attribute (color, depth, texture coordinates etc.) values for each of the span end-points. The span-walking stage interpolates the values of the attributes for each of the pixels covered by the span. In order to perform edge and span walking, we need information about the slopes of each of the edges. We also need the slope and gradients for each of the attribute values. The triangle setup stage computes the parameters such as the range of y values covered, the orientation of the long edge, the slope of each of the edges, co-ordinates of the turning point (i.e. the vertex with the mid-y value), and the slopes and gradients of each of the attributes.
After rasterization, for each pixel touched by the triangle, its color is determined.
This color computation may involve fragment lighting, texturing and fog effects. Next, the scissor, stipple, alpha, and color tests are performed. Next, a stencil test and a depth test are applied and the results are stored in the depth and stencil buffers. Next, the pixels are subjected to alpha blending, dithering, and logical operations, and stored in ARGB format in a color buffer, ready for display on a monitor.
SUMMARY
In general, in one aspect, the invention features rendering a scene by a first stage of sorting primitives of the scene among regions of the scene and a second stage of sorting the primitives of each of the regions among sub-regions of the regions. Each of the sub-regions is processed to render the scene.
In general, in another aspect, the invention features maintaining state information in connection with processing primitives of portions of a scene to be rendered, and using the state information in rendering each of the portions of the scene to which the primitives belong, the state information being divided into state partitions.
In general, in another aspect, the invention features culling fragments of primitives belonging to a portion of a scene to be rendered. The fragments are rasterized in a sequence. For each fragment, a pixel of the fragment is discarded if it fails a depth test. Rendering operations are performed on the fragments of the primitives using pixels that have not been discarded.
In general, in another aspect, the invention features pre-processing fragments of primitives of regions of a scene to be rendered, and, after the pre-processing, determining the colors of pixels in the scene based on computations with respect to separate depth sample and color grids.
In general, in another aspect, the invention features receiving, from an on-chip color buffer, color data for each one of a set of tiles of the scene, generating color separations of each of the tiles without storing the color data off-chip, and delivering the color separations for each of the tiles to a printer or to an off-chip storage device.
In general, in another aspect the invention features rendering tiles of a scene as a rasterized array of pixels and performing post-rendering steps on a tile-by-tile basis for the tiles of the scene.
In general, in another aspect, the invention features color conversion and color correction of the pixels in the tiles of a scene, rendered as a rasterized array of pixels, using a uniformly or non-uniformly distributed multidimensional look-up table, and performing this and other post-rendering steps on a tile-by-tile basis for the tiles of the scene. In general, in another aspect, the invention features color conversion and color correction of the pixels of a primitive using a uniformly or non-uniformly distributed multidimensional look-up table before other per-fragment operations.
In general, in another aspect the invention features application of halftone threshold arrays to pixels in the tiles after color correction, color conversion, black generation, and undercolor removal.
Among the advantages of the invention are one or more of the following.
Large print-quality images of 3D scenes can be generated. Large numbers of pixels and large numbers of triangles can be handled. Separate threshold arrays may be specified for each of the C, M, Y, and K channels. Rendering features such as alpha blending, stencils, multi-pass rendering, fog, and multi-texturing are supported. Each of the RGBA, depth, and stencil tiles may be saved to system memory. Texture look-up and fragment color computation happen at a stage in the process that is computationally efficient.
The invention is useful, for example, in 3D workstations used in CAD and content creation and high-end graphics boards for Windows NT, Unix or other platforms.
Consumer applications that involve 3D technologies are starting to emerge. These include electronic books, space planning and remodeling, and other consumer applications with 3D content such as tools for making invitation cards. The invention addresses these applications either as a low-cost part incorporated as an option in a graphics processor or in a printer, or incorporated in a board-level product that will connect to a personal computer.
The independent generation of color separations for 3D rendered scenes eliminates the need for large storage. RGB frame buffers and z buffers are not stored in local memory.
The approach can be extended to unified memory architectures.
The number of pointers needed in the two-stage binning process is relatively small ((nBands + nTiles) rather than nBands * nTiles, where nBands is the number of bands in the image and nTiles is the number of tiles in each band). The frame-buffer memory needed to store the bin buckets is much smaller than if the primitives were sorted directly into tiles, because of reduced duplication of data. The tiles need not be square; they may be, for example, rectangular. The touched-tile computation is exact and simple. The method is scalable because one band at a time is rendered. The rendered bands can be saved in local memory, system memory, or even on disk for later integration into a printed image. The latency in the display of images is reduced. The band sort process can keep track of the maximum number of triangles incident within a band. This number can be used as an estimate of the amount of time required to render the busiest band in the image.
The depth culling approach is inexpensive to implement. It reduces tile rendering latency, thereby reducing the required sizes of FIFOs. The depth tile used in the culling process can be relatively small, which is useful for high resolution images.
The color grid used for per fragment operations enables smooth scalability from fast preview mode to high quality, high resolution images. A new anti-aliasing method is proposed. Only the color separations for generated images need be saved. The original RGB image does not need to be saved in memory.
Seamless integration of 2D and 3D rendered data is enabled.
The multi-dimensional color table look-up technique is applicable to all imaging, not merely to 3D rendering. The incorporation of multi-dimensional table look-up for color conversion and correction before and/or after the per-fragment operations enables integration of images from various sources (e.g. digital cameras, scanners etc.) into the final rendered image in a collage-like manner.
Due to the pipelined nature of processing, some stages can be implemented in software and some in hardware. The invention applies to both hardware and software implementations of the method.
Other advantages and features will become apparent from the following description and from the claims.
DESCRIPTION
Figures 1A and 1B show a graphics processor on an AGP bus and the PCI bus.
Figure 2 shows a top level breakdown of 3D display operations.
Figure 3 shows 3D display operations in tiled architectures.
Figure 4 shows a latency of display in tiled architectures.
Figure 5 shows a simple scene graph and its display image.
Figure 6 shows a topological representation used in graphics.
Figure 7 shows a display rendering pipeline.
Figure 8 shows a modified rendering pipeline. Shaded areas are the new stages introduced in this invention.
Figure 9 illustrates tile binning.
Figure 10 shows a modified binning process.
Figure 11 shows binning data structures.
Figure 12 shows a bandsort process.
Figure 12A shows a tilesort process.
Figure 13 shows a band and tile bucket current pointer.
Figure 14 shows a cull stream entry.
Figure 15 shows a band bucket entry.
Figure 16 shows contents of the front-end state block.
Figure 17 shows contents of the fragment stream entries.
Figure 18 shows a coordinate space.
Figure 19 shows triangles on tiles of a scene.
Figure 20 shows a state of structures after receiving V2.
Figure 21 shows structures after binning triangle V0, V1, V2.
Figure 22 shows structures after binning first object.
Figure 23 shows data structures after binning the first triangle of the second object.
Figure 24 shows data structures after endscene.
Figure 25 shows data structures after tile_sort on 1st band.
Figure 26 shows structures after receiving vertex V2 for preloaded incremental state blocks.
Figure 27 shows structures after binning triangle V0 V1 V2.
Figure 28 shows data structures after endscene for preloaded incremental state blocks.
Figure 29 shows a location of color computation.
Figure 30 shows a conversion of halftone parameters to a brick pattern.
Figure 31 shows a halftone brick and tiles.
We shall illustrate our approach using an example. We shall consider a scene that consists of 200,000 triangles and is being rendered as a 1600x1200 display image. We also assume that we want to render the same scene for printing on paper as an 8"x6" image at 1200 dots per inch. We shall also assume that the rendered image has an average depth complexity of 4. This results in nearly 40 pixels per triangle in the display image and 1380 dots per triangle in the printed image.
In Figure 8, we outline a modified version of the rendering pipeline, which substantially reduces memory, bandwidth, and computational requirements of the traditional pipeline. This modified pipeline is suitable for fast and high-quality rendering of large-format print-quality images. The method introduces sorting, culling, and post-processing stages into the traditional pipeline. We have also reordered some of the stages to achieve better efficiency. These stages are described in detail in the following sections.
Scene Traversal and Per-vertex Operations
The scene processing 150 and the per-vertex operations 152 of the modified pipeline perform the same functions as the corresponding stages in the traditional pipeline. They include traversal of the scene graph and accumulation of changes to parameters of the rendering state, and per-vertex transformations, lighting, and viewport clipping. The processing stages may be incorporated in a dedicated graphics processor. The host sends the geometry and state information to the graphics processor, and the graphics processor traverses the scene and performs the per-vertex operations and all other pipeline stages downstream. Alternatively, the scene traversal can be performed on the host. Per-vertex and other operations downstream are then carried out in the dedicated hardware. In yet another implementation, both the scene traversal and per-vertex operations may be carried out on the host, and the operations downstream may be performed in the dedicated hardware.
Binning, State Management and Setup
These are new stages in the modified pipeline. The modified pipeline renders to a small region of the image (tile) at a time. The 3D data for each object that results from the view volume and user clipping of the per-vertex operations is delivered to a binning and state management module 154. This module accepts the incoming object data and sorts (bins) the objects in accordance with bands of the image and tiles within each band in which the objects appear.
The entire image is built up tile by tile, rendering one tile's worth of data at a time. A tile of size 32x32 contains 1024 pixels. The 1600x1200 display image in our example will require 1900 such tiles (38 rows of 50 tiles across). On the other hand, the 8"x6" image for printing has 9600x7200 pixels, and will require 67,500 tiles.
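These tile counts are plain ceiling division of the image dimensions by the tile dimensions. The following small C check is illustrative only; the function name is ours, not the specification's:

    #include <stdio.h>

    /* Number of tiles needed to cover an image, rounding up at the right and
       bottom edges. */
    static int num_tiles(int image_w, int image_h, int tile_w, int tile_h)
    {
        int cols = (image_w + tile_w - 1) / tile_w;   /* tiles across a band */
        int rows = (image_h + tile_h - 1) / tile_h;   /* rows of tiles       */
        return cols * rows;
    }

    int main(void)
    {
        printf("%d\n", num_tiles(1600, 1200, 32, 32));  /* 50 x 38   = 1900  */
        printf("%d\n", num_tiles(9600, 7200, 32, 32));  /* 300 x 225 = 67500 */
        return 0;
    }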
To render the data, one approach would be to process the entire scene for each tile and only render objects that touch the tile. This is clearly wasteful. Another approach would be to bin the data by tile. In this approach, a bin bucket is created for each tile to hold the geometry that overlaps that tile. The method first determines the tiles that are touched by each triangle, and then writes the triangle information to the bin buckets of each of the touched tiles. Relevant state changes are written to each bin bucket before the triangle is written. Using a 32x32 tile and a simple bounding box check for tile assignment will cause a triangle to be written to nearly 3 tiles on average in a typical display image and to 13.5 tiles in the print image. More accurate tile assignment than the bounding box check can be done so that a triangle is assigned only to the tiles that are actually touched by the triangle. Then a triangle will be written to 1.5 tiles on average in a display image. In the print image the same triangle will touch nearly 6.7 tiles. Storage requirements can be reduced by storing the actual vertex data and state information separately only once, and then using only pointers into the state and vertex data lists in each assigned bin bucket.
In a traditional pipeline the triangles are processed in the order they are received. All state changes made to the context by the time a triangle is encountered are applied to that triangle. Once the triangle is rendered, the state can be modified for the next primitive. In a tiled architecture, a primitive may straddle tile boundaries and therefore may be visited many times. The state management part of this stage is responsible for associating the correct rendering state parameters with each primitive.
As mentioned before, the setup stage does the computation of per-triangle parameters such as edge slopes and attribute gradients that are used for edge walking and scan-conversion during rasterization. These computations can be carried out either during binning (and written to memory) or during the traversal of tile primitive lists.
Binning
Figure 9 outlines the general data flow for binning the geometry by the tiles. For each new vertex the binning unit 154 determines if that vertex completes a triangle. (The first two vertices of the triangle are simply held in registers.) If the vertex completes the triangle, then the triangle data is saved to memory for state and geometry, and the triangle memory pointer register is updated. This memory may be a portion of the frame buffer memory or a separate memory with its own memory interface.
Next, the tiles touched by this triangle are determined. Exact tile coverage can be determined in one of many ways. One way is to first find the bounding box of the triangle and then, for each tile touched by the bounding box, compute the Manhattan distance from the center of the tile to each edge. The sign and magnitude of the Manhattan distance to each edge determine whether the tile is touched by the triangle. This approach requires that the tiles be square. The exact coverage can also be determined by rasterizing the triangle in tile space. Although these methods are exact, they are expensive to implement. For high resolution images such as those required for printing, the number of tiles touched is significantly larger, and therefore the cost of exact touched-tile determination can be fairly high.
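One plausible realization of the tile-center Manhattan distance test is sketched below. The struct, the names, and the convention that edge functions are non-negative inside the triangle are our assumptions, not the specification's. Combined with a bounding box overlap check, which handles separation along the tile's own axes, the three edge tests give exact coverage by a separating-axis argument.

    #include <math.h>

    /* Edge function E(x, y) = a*x + b*y + c, oriented so that E >= 0 inside the
       triangle. For a square tile of half-width h centered at (cx, cy), the
       most-inside corner differs from the center by h*(|a| + |b|) in edge-function
       value; this is the Manhattan (L1) term referred to in the text. */
    typedef struct { float a, b, c; } Edge;

    static int tile_outside_edge(const Edge *e, float cx, float cy, float h)
    {
        float center = e->a * cx + e->b * cy + e->c;      /* edge value at tile center */
        float reach  = h * (fabsf(e->a) + fabsf(e->b));   /* offset to best corner     */
        return center + reach < 0.0f;                     /* all four corners outside  */
    }

    /* A tile overlapping the triangle's bounding box is touched if and only if
       it is outside none of the three edges. */
    static int tile_touched(const Edge edges[3], float cx, float cy, float h)
    {
        return !tile_outside_edge(&edges[0], cx, cy, h)
            && !tile_outside_edge(&edges[1], cx, cy, h)
            && !tile_outside_edge(&edges[2], cx, cy, h);
    }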
For each tile touched by the triangle, a current_bucket_pointer is needed in order to update its bin bucket. The bin buckets may be implemented as linked lists or as paged arrays. In either case, for each bin bucket two pointers are required. These are the pointer to the start of the bin bucket and the current pointer for updating the bin bucket. The start pointer is used during tile traversal and rendering. If the current pointer is stored in memory, then it needs to be read from the memory, and written to memory when it is updated. This can result in a large number of memory access requests to off-chip memory. It is possible to keep all current bin bucket pointers on-chip, but this limits the maximum allowable image size. In the example above, if 1600x1200 is the maximum allowable display image size, on-chip memory would have to store 1900 pointers. For the 8"x6" print image, 67,500 pointers would have to be stored on-chip. Since our objective is to allow images of even larger sizes, this method of binning is clearly not suitable.
A procedure that is suitable for binning and tile processing of large images is outlined in Figure 10. Binning in this approach is a two-step process. Incoming primitives are first binned into successive bands of the image and then the primitives within each band are binned into tiles within that band. This approach has several advantages:
1. The number of pointers needed is reduced to (nBands + nTiles) instead of (nBands * nTiles), where nBands is the number of bands in the image and nTiles is the number of tiles in each band. For the 9600x7200 pixel image in our example, only 525 pointers would be needed instead of 67,500 (see the sketch following this list).
2. The frame-buffer memory needed to store the bin-buckets is much smaller than that needed in the approach of sorting into tiles straight away. This is due to reduced duplication of data.
3. Two-step binning allows use of rectangular tiles and not just square tiles.
4. The touched-tile computation can be exact yet simple.
5. The method allows scalability of rendered image size by rendering one band at a time.
6. The method reduces the latency in the display of images.
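The pointer arithmetic behind item 1 is simple, as this illustrative C fragment shows for the print image of the running example:

    enum {
        nBands = 7200 / 32,                        /* 225 bands of 32 scan lines        */
        nTiles = 9600 / 32,                        /* 300 tiles of 32 pixels per band   */
        twoStagePointers   = nBands + nTiles,      /* 525 current pointers              */
        directTilePointers = nBands * nTiles       /* 67500 if binned directly to tiles */
    };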
In summary, the invention provides a practical method for rendering large 3D images.
Binning Data Structures
The binning process creates several data structures in memory. These are outlined in Figure 11. The binning process gets the data from the per-vertex operations stage. The input data may contain the rendering state and geometry information. This information is saved in three different streams in the binning memory: the state stream, the cull stream, and the fragment stream. The state stream has information about state changes, the cull stream has information about spatial data used for binning and opaque culling, and the fragment stream has information needed by the units dealing with per-fragment operations.
State Stream
An embodiment of this invention uses three kinds of state blocks: a front-end state block, a fragment operations state block, and a post-process state block. The front-end state block has information about the format of vertices in the fragment stream and information needed for culling, such as the depth test, the depth write enable flag, the scissor window extent, and the stipple pattern. The fragment operations state block contains information needed for per-fragment operations, such as the fog parameters, texture filtering and blending modes, and stencil test and alpha blending related parameters. The post-process block has information about color tables for color conversion, half-tone threshold arrays, etc. In another embodiment the state blocks are self-contained, and each state block contains all the information needed to set all state registers for the corresponding stage.
In another embodiment the state blocks are further subdivided into sub-blocks. For example, the post-process state block may contain two sub-blocks: one corresponding to the color-conversion stage and the other to halftoning. The fragment operations state block may contain sub-blocks for texture address and filtering, for the fog parameters, and one for post-texture fragment operations such as alpha testing, alpha blending, and depth and stencil tests. In one embodiment, the fragment operations state block may also contain pointers to a color table for color conversion and correction.
In another embodiment of this invention, the state blocks are implemented as a dictionary of key-value pairs. This embodiment allows incremental update of the state. A state loader module in each processing block interprets the dictionary keys and loads the appropriate registers.
The Band_Sort process (Figure 12) uses three additional data structures. The geometry and its associated state incident within each band is recorded in the bin bucket of that band. This is referred to as the band bucket. The contents of a band bucket entry are outlined in Figure 15. Each entry has a "type" and "data" associated with it. The "type" determines how the data is interpreted. Each band bucket entry has a fixed number of bits. In one implementation, we use 128 bits per entry. If the type is "MODE", then the data is interpreted as containing the state block pointers for the applicable modes as follows: 2 bits of type, 24 bits of front-end state block pointer, 8 bits of front-end state block size, 24 bits of fragment-ops state block pointer, 8 bits of fragment-ops state block size (in 8-byte units, so the size can be 8 to 2048 bytes), 24 bits of post-processing state pointer, 8 bits of post-processing state block size (in 32-byte units, so the size can be 32 to 8192 bytes), and a 24-bit next pointer; 6 bits are unused. If the type is "PRIM", then the data is 16 bits of tile_min, 16 bits of tile_max, 32 bits of cull stream pointer, 32 bits of fragment stream pointer, and 24 bits of next pointer; 6 bits are unused. The type "EOB" indicates the end of band. A band bucket is a sequence of band bucket entries.
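For concreteness, the two 128-bit entry formats can be written as C bitfield structs. This is a sketch only: the field names and type encodings are ours, and because C compilers lay out bitfields in an implementation-defined way, a hardware design would pack the bits explicitly.

    #include <stdint.h>

    enum { TYPE_MODE, TYPE_PRIM, TYPE_EOB };     /* encoding illustrative */

    struct ModeEntry {                /* 2+24+8+24+8+24+8+24+6 = 128 bits  */
        uint64_t type         : 2;
        uint64_t fe_state_ptr : 24;   /* front-end state block pointer     */
        uint64_t fe_size      : 8;
        uint64_t fo_state_ptr : 24;   /* fragment-ops state block pointer  */
        uint64_t fo_size      : 8;    /* in 8-byte units: 8..2048 bytes    */
        uint64_t pp_state_ptr : 24;   /* post-process state block pointer  */
        uint64_t pp_size      : 8;    /* in 32-byte units: 32..8192 bytes  */
        uint64_t next         : 24;
        uint64_t unused       : 6;
    };

    struct PrimEntry {                /* 2+16+16+32+32+24+6 = 128 bits     */
        uint64_t type     : 2;
        uint64_t tile_min : 16;       /* band_xmin of the triangle         */
        uint64_t tile_max : 16;       /* band_xmax of the triangle         */
        uint64_t cull_ptr : 32;       /* into the cull stream              */
        uint64_t frag_ptr : 32;       /* into the fragment stream          */
        uint64_t next     : 24;
        uint64_t unused   : 6;
    };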
In one embodiment of this invention the size associated with "MODE" information is in variable units. Thus the size for the front-end state block may be in units of 32-bit words, whereas the size of the post-process state block may be in units of 64-bit words.
One embodiment of this invention uses a linked list of paged arrays for band buckets. Paged arrays allocate a page full of band bucket entries at a time. A page thus has a number of band bucket entries followed by a next page pointer.
Another embodiment of this invention uses a linked list of band bucket entries for band buckets.
The cull stream contains the spatial part of the vertex data (Xwindow, Ywindow, Zwindow) used by the binning and opaque cull process. The fragment stream contains the information needed to determine the color of each fragment. As outlined in Figure 12, the primitive build stage 1541 builds the triangles from the incoming vertices. This stage orders the vertices to determine the bands intersected by the triangle as well as to determine the left, right, and bottom edges of the triangle. It also determines the long edge in y.
In one embodiment, the cull stream consists of the (x, y) coordinates of the three vertices in the screen or window coordinate system, the depth gradients, the edge slopes, and the LeftCorner flag: {Xw0, Yw0, Xw1, Yw1, Xw2, Yw2, Zw0, δZ/δX, δZ/δY, (dX/dY)left, (dX/dY)right, (dX/dY)bottom, LeftCorner}. The contents of this structure are shown in Figure 14. The data is organized as lists of triangles.
In another embodiment, the fragment stream consists of the data specified by the vertex format field in the front-end state block. As shown in Figure 17, the vertex of the triangle in the fragment stream may consist of one or more of the eye coordinates, vertex normals, vertex diffuse color, vertex specular color, vertex tangents and binormals, and texture coordinates for one or more textures. Each texture coordinate set may consist of 1, 2, or 3 values. Two values are used for two-dimensional textures and the third value for projective textures (if in use). Another embodiment uses one- and three-dimensional textures as well as two-dimensional textures. The maximum number of textures that can be used on a primitive simultaneously determines the maximum number of texture coordinate sets allowed. This value is programmable. The number of bits in the format field is adjusted accordingly.
In another embodiment, the data in the fragment stream is packaged by vertices, which are then combined to build the triangle in the per-fragment operations unit.
In another embodiment, the data in the fragment streams is packaged by triangles. Each triangle has the attribute values specified at the vertices.
In another embodiment, the data in the fragment streams is packaged by triangles. Each triangle has one reference value for each attribute, specified at one of the vertices, along with the x and y gradients of each attribute. Thus instead of storing three values for each attribute corresponding to the three vertices of the triangle, the three values in this implementation correspond to one reference value and two gradients. The attribute value at any point interior to the triangle can then be determined by the equation
A(x, y) = Aref + (x - xref) * Ax + (y - yref) * Ay;
where Ax and Ay are the gradients of A along the x and y axes, Aref is the reference value of the attribute at the point (xref, yref), and the target point is (x, y).
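A minimal C sketch of this reconstruction (struct and function names are illustrative):

    typedef struct {
        float a_ref;            /* attribute value at the reference vertex     */
        float a_dx, a_dy;       /* gradients of the attribute along x and y    */
        float x_ref, y_ref;     /* reference point, one vertex of the triangle */
    } PackedAttr;

    /* Evaluate A(x, y) = Aref + (x - xref) * Ax + (y - yref) * Ay. */
    static float attr_at(const PackedAttr *p, float x, float y)
    {
        return p->a_ref + (x - p->x_ref) * p->a_dx + (y - p->y_ref) * p->a_dy;
    }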
The band bucket start pointers are a set of nBands pointers that indicate the start of the bin buckets for the corresponding bands, where nBands is the number of bands the rendered image is divided into. The tile bucket start pointers are a set of nTiles pointers that indicate the start of the bin buckets for the corresponding tiles, where nTiles is the number of tiles within a band.
As shown in Figure 13, the current pointer for a band bucket contains a pointer to the last entry in the bucket. This pointer also has the time stamp for when the state last changed in this bin bucket. This time stamp is used to determine the state changes accumulated since the last mode change for this band. The tile bucket current pointer works in the same way.
The front-end state block has data about the window x, y, z coordinates, scissor, and stipple. This information is used by the pre-rasterization, scissor and stipple test, and opaque culling blocks. The contents of the front-end state block in one implementation are outlined in Figure 16.
Binning Process
The binning process is outlined in Figures 10 and 12. As mentioned before, binning happens in two stages. In the first stage the primitives are binned to bands. The bands can be horizontal or vertical. For the purpose of this discussion only, we shall assume that the bands are horizontal. After the band sorting is completed, the primitives in each band are binned into tiles by the Tile_Sort process. When the binning for a band is completed, the Tile_Sort process signals the Tile_Render process to start processing the completed band. The Tile_Render process sends the resulting binned data for each tile to the culling and per-fragment operations units. While the Tile_Render process works on tiles from one band, the Tile_Sort process does the sorting by tiles for the next band in the image. The following pseudo-code outlines the binning process. Since processing can and does happen in parallel (in a hardware implementation), we use a thread for each processing unit that can operate in parallel.
Process_Scene() {
    Sort_Scene_By_Bands();
    sort_band = 0; process_band = 0;
    Start_Tile_Sort_Thread();
    Send_Band_Number_to_Tile_Sort(sort_band++);
    Start_Tile_Render_Thread();
    Wait for signal from Tile_Sort that it is done sorting the first band;
    do {
        Send_Band_Number_to_Tile_Render(process_band++);
        Send_Band_Number_to_Tile_Sort(process_band);
    } while (process_band < NUM_BANDS - 1);
    // Render tiles in the last band
    Send_Band_Number_to_Tile_Render(process_band);
}
In this section we describe the different steps of the binning process. The input vertices are received from the per-vertex operations stage. The vertices can be in flexible vertex format or in indexed mode. In the indexed mode, the driver is responsible for loading the vertex buffers in the local memory before sending the vertex indices. The vertex format and the buffer pointer are programmable parameters. The input processor in the binning process is responsible for retrieving the vertex from the vertex buffer and for creating the data packets for the cull and fragment streams. The data sent to the binning process in flexible vertex format is used as-is to construct the cull and fragment streams.
For the purpose of this discussion, we assume that the device coordinate system is as shown in Figure 18. The device x-axis points to the right, the y-axis points down, and the z-axis (depth) points into the page. Thus objects farther from the viewer have larger z-values. After all the vertices for the triangle are obtained, the Band_Sort process first determines the characteristic functions for the triangle. Once the characteristic functions are determined, the Band_Sort process determines the bands in the image touched by the triangle. The characteristic functions include identification of the edges as left, right, and bottom edges, edge slopes, depth gradients, LeftCorner, ymin and ymax. The vertices are first ordered along the y-axis. The slopes of the two edges emanating from the top vertex are determined. Since x increases to the right, the edge with the smaller slope is the left edge. If it is the edge that connects the top vertex to the mid-vertex, then LeftCorner is true, i.e., the long edge of the triangle lies to the right and the span walking during rasterization may proceed right to left. The computation of these parameters is outlined in the following pseudo-code.
// Given the vertex coordinates in device coordinate space, determine the edge slopes
// and the depth gradients
Compute_Characteristic_Functions() {
    Sort_vertices_in_y;
    // Use top, mid, and bot suffixes to indicate the vertex order.
    Top_to_Mid_Slope = (Xmid - Xtop) / (Ymid - Ytop);
    Top_to_Bot_Slope = (Xbot - Xtop) / (Ybot - Ytop);
    Mid_to_Bot_Slope = (Xbot - Xmid) / (Ybot - Ymid);
    if ((Top_to_Mid_Slope == Top_to_Bot_Slope) || (Mid_to_Bot_Slope == Top_to_Bot_Slope))
        Discard_triangle;    // zero-area degenerate triangle
    if (Top_to_Mid_Slope < Top_to_Bot_Slope) {
        // long edge is to the right
        Left_Slope = Top_to_Mid_Slope;
        Right_Slope = Top_to_Bot_Slope;
        LeftCorner = TRUE;
    } else {
        // long edge is to the left
        Right_Slope = Top_to_Mid_Slope;
        Left_Slope = Top_to_Bot_Slope;
        LeftCorner = FALSE;
    }
    Compute_Depth_Gradients_along_X_and_Y_axes;
    Package_triangle_information_and_send_to_Cull_stream;
}
Ytop and Ybot provide the vertical extent of the triangle and are used to determine the bands intersected by the triangle. The Band_Sort unit bins the triangle to the bands touched by the triangle. The following pseudo-code outlines the process of sorting the geometry in the scene into bands.
Sort_Scene_By_Bands() {
    while (!EndScene) {
        if (State_packet) {
            Save_State_Packet;
            Update_time_stamp;
        } else {    // geometry packet
            if (completing_vertex) {
                Write the data for the fragment stream;
                Compute_Characteristic_Functions;
                Write the data for the cull stream;
                Band_Sort the triangle;
            }
        }
    }
    Signal_Tile_Sort_unit_or_thread;
}
The Band_Sort process determines the bands intersected by the triangle. Then, for each band touched by the triangle, the Band_Sort process determines the minimum and maximum x values. These xmin and xmax values in each band are stored in the "PRIM" entries of the band bucket and are used to find the tiles touched by the triangle within the band. The following pseudo-code outlines the Band_Sort process.
// Given the triangle, find the bands touched by the triangle and output the data for band buckets
Band_Sort() {
    band_xmin, band_xmax;
    x1, x2;
    band_min = Ytop / band_height;
    band_max = Ybot / band_height;
    for (idx = band_min; idx <= band_max; idx++) {
        get the band_ymin and band_ymax for this band;
        // Compute the horizontal extent of the long edge within the band
        x1 = (ytop < band_ymin) ? (xtop + (band_ymin - ytop) * Top_to_Bot_Slope) : xtop;
        x2 = (ybot > band_ymax) ? (xtop + (band_ymax - ytop) * Top_to_Bot_Slope) : xbot;
        band_xmin = MIN(x1, x2);
        band_xmax = MAX(x1, x2);
        // Compute the horizontal extent of the top-to-mid edge within the band if and only
        // if the edge intersects the band. Update xmin and xmax.
        if (ymid > band_ymin) {
            x1 = (ytop < band_ymin) ? (xtop + (band_ymin - ytop) * Top_to_Mid_Slope) : xtop;
            x2 = (ymid > band_ymax) ? (xtop + (band_ymax - ytop) * Top_to_Mid_Slope) : xmid;
            band_xmin = MIN(x1, x2, band_xmin);
            band_xmax = MAX(x1, x2, band_xmax);
        }
        // Compute the horizontal extent of the bottom edge within the band if and only
        // if the edge intersects the band. Update xmin and xmax.
        if (ymid < band_ymax) {
            x1 = (ymid < band_ymin) ? (xmid + (band_ymin - ymid) * Mid_to_Bot_Slope) : xmid;
            x2 = (ybot > band_ymax) ? (xmid + (band_ymax - ymid) * Mid_to_Bot_Slope) : xbot;
            band_xmin = MIN(x1, x2, band_xmin);
            band_xmax = MAX(x1, x2, band_xmax);
        }
        Assemble band bucket entry for this triangle and write it to the band bucket;
    }
}
The EndScene command indicates the end of the current scene. At this time the band processing unit starts processing the triangles within a band. The band processing unit consists of two parts: the Tile_Sort process and the Tile_Render process. The Tile_Sort process sorts the triangles within a band by their incidence on the tiles within the band. The Tile_Sort process is outlined in Figure 12A. Once all triangles within the band are sorted, as indicated by the EOB (end of band) packet, the Tile_Sort process signals the Tile_Render process to start processing the tiles within that band. The band bucket can be freed at this time, and the Tile_Sort process can start working on the next band. The Tile_Render process reads the bin bucket for each tile and sends the triangles for the tile downstream for processing. When rendering of a tile is complete, the bin bucket for the tile can be freed. The Tile_Render process moves on to the next tile in the band. When all tiles within the band have been processed, and the Tile_Render process has received a DONE signal from the Tile_Sort process, it moves on to the next band.
The Tile_Sort process uses the xmin and xmax of each triangle in the band to determine its coverage on the tiles. This provides exact coverage determination, as the triangle will be assigned only to the tiles that are indeed touched by it. This method of sorting also allows us to have rectangular tiles and not just square tiles.
The Tile_Render process reads the bin buckets of the tiles. The tile bucket entries are just like the band bucket entries except that the band_xmin and band_xmax are not needed in the entries of type "PRIM". The Tile_Render process retrieves the front-end state and the cull stream entry associated with each triangle, and sends the associated data to the cull and per-fragment operations units.
The binning process is assigned a chunk of frame buffer memory that it uses for writing bin buckets and the cull and fragment streams. Some implementations may use a separate binning memory.
In one implementation, each band is 32 scan-lines high and encompasses a string of 32x32 tiles. An 8K x 8K image would have 256 bands and 256 tiles within each band. Each sorted list is a linked list of data blocks that are dynamically allocated.
State Management
Different traversal schemes in the scene graph (Figure 5) pass down the state to be rendered in different ways. The state refers to the set of rendering parameters that affect the appearance of the rendered object. The set of all parameters may be divided into several partitions, with each partition containing the parameters required by a particular processing stage. Thus the rendering state may consist of partitions such as light state block, texture state block, material state block, blending state block, etc. To illustrate our point, we shall assume that the state is divided into three partitions, corresponding to (1) front-end processing unit, (2) the per-fragment operations unit, and (3) post-processing operations unit. The invention includes other state partitioning schemes as well. Several embodiments can be devised, each containing a different set of state partitions within the framework of this invention.
This invention encompasses state management for tiled architectures under four input scenarios. Each scenario will be illustrated through the use of the example in Figure 19, which contains 3 objects rendered with different sets of states. The geometry of the first object is a triangle strip made up of 6 vertices (V0, V1, V2, ..., V5), the geometry of the second object is a list of two triangles (V6, V7, V8) and (V9, V10, V11), and the geometry of the third object is a triangle fan made up of 5 vertices given by (V12, V13, V14, V15, V16). These objects cover four bands, as shown in Figure 19. We further assume that the first object uses Front-end state partition A1, Fragment-Ops state partition B2, and Post-process state partition C1. The second object uses Front-end state partition A1, Fragment-Ops state partition B1, and Post-process state partition C1. The third object uses Front-end state partition A2, Fragment-Ops state partition B1, and Post-process state partition C1. We describe the input scenarios in the following sections.
Preloaded Full State Partitions
In this embodiment, full state partitions are preloaded into the frame buffer memory and the pointers to them are passed during scene traversal to the binning and state management stage. Various stages retrieve the state data as needed. Such a scenario is applicable when the same rendering state may apply to a number of objects in the scene. The data processing steps for the example in Figure 19 may proceed as follows:
Load Front-end state partition A1 at address a1_ptr;
Load Front-end state partition A2 at address a2_ptr;
Load Fragment-Ops state partition B1 at address b1_ptr;
Load Fragment-Ops state partition B2 at address b2_ptr;
Load Post-process state partition C1 at address c1_ptr;
Setup vertex arrays for obj1, obj2, obj3;
BeginScene
    Set Front-end state pointer to a1_ptr;
    Set Fragment-Ops state pointer to b2_ptr;
    Set Post-process state pointer to c1_ptr;
    Draw_Triangle_Strip(6, obj1);   // obj1 has 6 vertices and is drawn with state A1, B2, C1
    Set Fragment-Ops state pointer to b1_ptr;
    Draw_Triangle_List(6, obj2);    // obj2 has 6 vertices and is drawn with state A1, B1, C1
    Set Front-end state pointer to a2_ptr;
    Draw_Triangle_Fan(5, obj3);     // obj3 has 5 vertices and is drawn with state A2, B1, C1
EndScene
The host computer loads the front-end state partitions A1 and A2 at locations a1_ptr and a2_ptr respectively. The fragment operations state partitions B1 and B2 are loaded at locations b1_ptr and b2_ptr respectively. The post-processing state partition C1 is loaded at location c1_ptr. Note that these state partitions may be used for rendering multiple frames. Once a state partition is no longer needed, the host may free the memory associated with it.
The state pointers for the state partitions, and the data associated with the vertices needed to complete a triangle, are retained as a part of a current rendering context. The current rendering context may reside on chip. When scene processing begins, the pointers for each state partition are initialized to a known (reset) state. If the frame being rendered is not the first frame in the sequence, then the last known state partitions from the previous frame are inherited. The time stamps for the state partitions are initialized to zero. The band buckets are initialized and so are the band bucket current pointers. The band bucket current pointers point to the start of each band bucket and have a time stamp of zero.
Following the example in Figure 19 and the pseudo-code above, when the binning process receives the front-end state pointer, it updates the on-chip current front-end state pointer to a1_ptr, increments the time stamp (to 1) and assigns it to the front-end state pointer. The binning process then receives the fragment operations state pointer b2_ptr. It then assigns that pointer to the fragment operations state pointer, and assigns it the incremented time stamp (which is now 2). Similarly, c1_ptr is assigned to the post-processing state pointer, which is assigned a time stamp of 3. Next comes the geometry for object1.
Figure 20 shows the state of various data structures at the time that the binning process receives vertex V2, the completing vertex of the first triangle in object 1. The band bucket current pointers have a time stamp of 0. Since the state pointers have time stamps greater than 0, the state needs to be incorporated in the band bucket. Vertex V2 completes the triangle formed by vertices V0, V1, and V2. This triangle touches bands 0 and 1. The state of the data structures after processing this first triangle is shown in Figure 21. Note that the time stamps of the current pointers for the first two band buckets have been updated to 4, which indicates the time at which the state packet was updated in these two bands. At this time, the bin buckets for both the first and second bands have three entries each: one state packet, one primitive packet, and one "end of bucket" packet. bmin00 and bmax00 are the minimum and maximum x values of ΔV0V1V2 in band0, and bmin10 and bmax10 are the minimum and maximum x values of ΔV0V1V2 in band1. They appear in the corresponding primitive packets in the band buckets. The cull and fragment streams contain one entry each, corresponding to ΔV0V1V2.
Figure 22 shows the state of various data structures after all four triangles in the first object have been binned. The first two triangles intersect band0 and band1. The next two triangles only intersect band1. Therefore the bin bucket for band0 contains two "primitive" entries and the bin bucket for band1 contains four "primitive" entries. The cull and fragment streams have four entries each. Figure 23 shows the data structures after the second object has been binned to bands. Figure 24 shows the various data structures at the time that EndScene is received. In this implementation, the time stamp is updated every time any of the state pointers change, and when the first primitive is completed after the state change. In this scheme, the last state change happened at time 7. The first triangle after that was received at time 8. The state packets are written to the band buckets touched by the triangle at that time, and therefore that is the time assigned to the band bucket current pointer. Similar time stamp logic is used for sorting the triangles within a band into tiles. The Tile_Sort process starts to process the data in each band in order to bin it to tiles after EndScene is received.
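The time-stamp comparison described above can be sketched as follows; RenderContext, BandCurrentPtr, and the packet-writing helpers are hypothetical names standing in for the on-chip registers and band bucket writes of the text:

    #include <stdint.h>

    typedef struct Triangle Triangle;   /* opaque here; stands for the binned triangle */
    void write_mode_packet(void *band, uint32_t fe, uint32_t fo, uint32_t pp);
    void write_prim_packet(void *band, const Triangle *tri);

    typedef struct {
        uint32_t fe_ptr, fo_ptr, pp_ptr;   /* current state block pointers      */
        uint32_t time_stamp;               /* bumped whenever a pointer changes */
    } RenderContext;

    typedef struct {
        uint32_t time_stamp;               /* time of the last MODE packet here */
    } BandCurrentPtr;

    /* A MODE packet is written to a band bucket only if the band has not yet
       seen the state pointers current when the triangle arrives. */
    static void bin_triangle_to_band(const RenderContext *ctx, BandCurrentPtr *band,
                                     const Triangle *tri)
    {
        if (band->time_stamp < ctx->time_stamp) {
            write_mode_packet(band, ctx->fe_ptr, ctx->fo_ptr, ctx->pp_ptr);
            band->time_stamp = ctx->time_stamp;
        }
        write_prim_packet(band, tri);      /* PRIM entry: xmin/xmax, stream pointers */
    }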
The data structures for the first band after the tile sort is completed are shown in Figure 25. The xmin and xmax computed for each triangle for the band are used to determine the tiles touched by the triangle. The Tile_Render process reads the tile buckets and sends the appropriate state pointers to the corresponding units. While Tile_Render is processing the first band, Tile_Sort can start working on the second band in the image. This is accomplished by using double-buffered tile pointers. One set of tile pointers is used by the Tile_Sort process and the other set is used by the Tile_Render process. The processing units downstream from the binning process may implement certain caching schemes to cache the state partitions. The memory pointers are used as tags for the cache entries. The number of cache entries used for each state partition is implementation and application dependent. The invention in this patent does not limit the number of cache entries or the number of state partitions used in an implementation. For example, the fragment operations stage may have a fully associative cache with four entries, each entry corresponding to one instantiation of the fragment-ops state partition. When processing the first tile of the first band, the fragment operations stage gets the data corresponding to the triangles in the first object. The first object uses the Fragment-Ops state block at b2_ptr. Therefore, the state block at address b2_ptr will be loaded into cache entry 0. The second tile also gets a triangle from the first object. Since the state block B2 used for that object is already in cache, the state block does not need to be retrieved. The second triangle in the tile will cause the state at b1_ptr to be loaded into cache entry 1. The geometry in the third tile uses the state block B1, which is already loaded in cache entry 1. The first tile in the second band will require the Front-end state partition at a2_ptr to be loaded in the Front-end cache, and so on.
This aspect of the invention assumes that all the state partitions are already loaded in memory. The various processing units that use the state partitions are agents on the memory controller that communicates with the memory containing the state partitions.
In several situations, the same rendering state may apply to several objects, for example in cases when several instances of the same object are used in the scene. In such a situation, the host can simply load the state partitions in the local memory and use the pointer to that state block in subsequent rendering. This can reduce the memory bandwidth and storage related to state management.
One aspect of this invention is that the content of the state partition is interpreted by the processing unit that uses it. The state management is simply responsible for associating the correct pointers of the various state partitions with each object.
Interleaved Full State Partitions
In this embodiment, the state partitions are not preloaded, but the data for each state partition is passed down to the binning process. In this embodiment, the binning process retains the full state partitions and not just the pointers in its on-chip memory (or on the heap in a software implementation). The host/driver assigns a state memory area to the binning process. When the state packet needs to be inserted into the band buckets, the binning process first saves the state partition to the state memory and then uses that pointer in the state packet. The rest of the processing is similar to the case above.
In the above two embodiments, the size of a state partition is fixed and depends only on the type of the partition.
In another embodiment, we combine the preloaded full state partitions with interleaved full state partitions. A flag is used to indicate if the current state in the state partitions needs to be saved to memory.
Preloaded Incremental State Blocks
In this embodiment the state changes are incremental. This kind of situation arises when the state changes are implemented using display lists. The state member that is changed is typically identified by a key or an id, followed by a value. A state block in this case is a sequence of these key-value pairs. The number of key-value pairs in a block may be variable.
State management of the incremental state blocks is different from that outlined in the section for full state partitions. The reason is that the state at any one time is the accumulated effect of all state changes until that time. We use the example in Figure 19 to illustrate the state management for incremental state changes. We also introduce the notion of a RESET state block for each partition. The RESET state block sets all parameters in a state partition to a known state. Since all parameters are set, once a RESET block arrives, the history of state changes for that partition may be erased. The three RESET state blocks A0, B0, C0 are loaded at a0_ptr, b0_ptr, and c0_ptr respectively for the three state partitions, namely the Front-end state, the Fragment-Ops state and the Post-processing state. Their sizes are a0_size, b0_size, and c0_size respectively. Furthermore, the sizes of A1, A2, B1, B2, and C1 are a1_size, a2_size, b1_size, b2_size, and c1_size respectively.
Since the state changes are incremental, the rendering of the first object is affected by the cumulative state changes due to the (A0 + A1), (B0 + B2), and (C0 + C1) blocks applied to the three state partitions. The order in which state changes are executed is important. For example, the state changes in state block A0 should be followed by the state changes in state block A1. The rendering of the second object is affected by the cumulative state changes due to the (A0 + A1), (B0 + B2 + B1), and (C0 + C1) blocks. Similarly, the rendering of the third object is affected by the cumulative state changes due to the (A0 + A1 + A2), (B0 + B2 + B1), and (C0 + C1) blocks. The state management scheme implements this cumulative state change logic.
The state management of preloaded incremental state blocks is carried out by keeping an array of (pointer, size) pairs for each state partition. Every time a state block for that partition is encountered by the binning process, it enters that pointer and size into the array. The current write index for each array is maintained. (This index indicates the number of entries filled in in the corresponding array.) The state management method keeps a set of "last read" indices for each bin bucket as a part of its current bin bucket pointer. Instead of the time stamp used with preloaded full state partitions, the state management uses the three array indices to indicate the previous and current rendering contexts. We have used the symbol A to indicate the state blocks for the Front-end state partition, B to indicate the state blocks for the Fragment-Ops state partition, and C to indicate the state blocks for the post-processing state partition.
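A sketch of this bookkeeping in C; the array type, its capacity, and the emit callback are illustrative assumptions, not the specification's:

    #include <stdint.h>

    #define MAX_BLOCKS 256                 /* illustrative capacity */

    typedef struct { uint32_t ptr, size; } StateBlockRef;

    typedef struct {
        StateBlockRef blocks[MAX_BLOCKS];  /* (pointer, size) pairs in arrival order */
        int count;                         /* current write index                    */
    } PartitionArray;

    /* Bring one band's view of one state partition up to date by replaying every
       incremental block recorded since that band's "last read" index. The emit
       callback stands in for writing a state packet into the band bucket. */
    static void update_band_state(const PartitionArray *part, int *last_read,
                                  void (*emit)(uint32_t ptr, uint32_t size))
    {
        int i;
        for (i = *last_read; i < part->count; i++)
            emit(part->blocks[i].ptr, part->blocks[i].size);
        *last_read = part->count;
    }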
Figure 26 shows the various structures related to the management of preloaded incremental state blocks at the time that the binning process receives vertex V2. The A array, B array, and C array contain the incremental state block pointers and sizes encountered so far for the three state partitions. The rendering context has the number of these blocks for each partition, and the band bucket current pointers contain the current index of 0 for all three partitions. Figure 27 shows the structures after processing of ΔV0V1V2. The band bucket current pointers and the band buckets for the first and second bands have been updated. The current pointers for band0 and band1 indicate that the first two state blocks for each of the three partitions have been loaded for band0 and band1. The state packets in the bin buckets of band0 and band1 incorporate the loading of these blocks by array indices. An array index of -1 indicates that the state partition does not need to be updated. The array indices for each array indicate the starting state index and the ending state index. Figure 28 shows the state of the data structures after EndScene is received.
Interleaved Incremental State Blocks
Interleaved incremental state blocks comprise the most common programming practice in graphics today. OpenGL and Direct3D follow this model of programming. This invention includes three methods for state management for incremental state changes interleaved in the command stream.
In one embodiment, the state management of interleaved incremental changes is done as a modification to the management of interleaved full state partitions described above. The binning process interprets the entries in each state block and updates the corresponding state partition appropriately. A dirty flag is used to indicate that the state partition has been modified. When the state partition needs to be updated in the band bucket, it is first written out to memory and then updated.
In another embodiment, the state block is first written to the memory area. Its pointer is added to the state array, and the rest of the processing proceeds as for the preloaded interleaved state blocks.
In another embodiment, the interleaved incremental state blocks are loaded, as they are encountered, into a contiguous memory area. Instead of the time stamp, we use the memory block pointers to indicate the memory state that needs to be read in. Since the memory area is contiguous, the state is encountered serially. Each successive state block adds its size to the current pointer. The state memory between the previous state pointer and the current state pointer is the one that needs to be traversed for managing the state.
In another embodiment of the invention, the full and incremental state block change schemes are combined. This is done by associating a block type with each state block. The block type can be "FULL" or "INCREMENTAL". If the block type is FULL, then the previous state index is set to (current state-block index - 1).
Culling
View volume clipping removes portions of the scene that are not contained within the view volume. The pipeline may do view volume clipping before per-vertex lighting.
Some of the primitives may be rejected if backface culling is enabled. If the scene contains closed surfaces, then parts of the object will be facing towards the viewer and parts will be facing away from the viewer. The parts facing away from the viewer are called backfacing. They may be rejected in the transformation stage, if backface culling is enabled.
Each pixel assigned to the triangle is processed by the scissor and stipple test. The scissor test determines if a pixel (x, y) lies inside a window defined as the scissor rectangle (ScissorXmin, ScissorXmax, ScissorYmin, ScissorYmax). If (x, y) lies inside this rectangle it is painted; otherwise it is discarded. The pixels inside the scissor rectangle may be subjected to a stipple test (if stipple is enabled). The stipple test, if enabled, determines if the pixel is obscured by the stipple pattern and therefore need not be painted. If the pixel is visible, then it is sent to the opaque cull stage. The information about the scissor rectangle, stippleEnable flag, and stipple pattern is a part of the Front-end state partition.
Rendering APIs such as OpenGL and Direct3D send primitives to the dedicated hardware with the expectation that the hardware will process the primitives in the order received. This assumption is made for two reasons. First, the state changes may be incremental. Second, the appearance of the rendered primitive may be modified as a result of what was rendered before. For example, while rendering a combination of opaque and translucent objects, the data may be sorted into opaque and translucent objects and rendered in an order that yields the correct color in a pixel. The data may also be sorted in a front-to-back or back-to-front manner. Stencils are also used for special effects. A primitive rendered with stencil enabled may affect the stencil buffer even if the stencil and/or depth tests fail.
Occlusion culling reduces computation by not rendering what is not seen. The rasterization stage includes fragment color computation and texturing. A depth test is carried out after the fragment color has been computed. The reason for performing the depth test after rasterization is that the fragment color computation may cause some fragments to be discarded. For example, the alpha value of a fragment is a combination of the texture alpha and the material alpha. This alpha value is obtained after lighting computations are done. If the alpha value is zero, then the fragment may be discarded even if the depth comparison places it in front of whatever is already rendered. If the alpha value is less than one and alpha blending is turned on, then the situation becomes more complex. Even if the alpha value is one, the stencil test may discard a pixel. If occlusion culling is depth-based and done prior to lighting, it will discard the primitives behind fragments that may themselves be discarded by the alpha, color, and/or stencil testing, resulting in pictures that are visually disturbing and wrong.
An improved method of occlusion culling is based on ordered rendering. In this method a primitive is considered opaque if the rendering state applicable to this primitive is such that its visibility is entirely determined based on the depth test. If this primitive fails the depth test, then it does not modify any of the buffers, and if it passes the depth test then it cannot be discarded by other tests such as alpha test or stencil test. In other words, it is rendered with alpha test and alpha blending options disabled and with all stencil operations, e.g. StencilFailOp, StencilPassZPassOp, StencilPassZfailOp, being such that they do not change the values in the stencil buffer. For such opaque primitives, we perform the depth test. A pixel of the primitive that fails the depth test is discarded. A primitive that passes the depth test is sent downstream for further processing. The depth buffer is updated if and only if zWriteEnable is TRUE.
Some forms of occlusion culling have been implemented before. See, for example, "Method and Apparatus for Simultaneous Parallel Query Graphics Rendering Z-Coordinate Buffer", Jerome F. Duluk, U.S. Patent No. 5,596,687, and "Hierarchical Z-Buffer Visibility", Ned Greene, Michael Kass, Gavin Miller, SIGGRAPH Proceedings, 1993. In some schemes, multiple passes through the cull stream are required. In all of these schemes deep FIFOs are required to hold at least one tile's worth of pixels due to the latency of the culling process.
The method proposed in this invention does not provide the maximum possible culling, but it is simpler to implement and improves performance over no culling, as illustrated in Appendix A. It does not require deep FIFOs and does not cause latency bubbles in the pipeline. The method incorporates a culling stage before the stages that do per-fragment operations. An additional copy of the depth buffer for the tile is kept for use by the culling process. The Tile_Render process sends either the cull stream data or a cull stream pointer to the cull process. The cull process rasterizes the spatial data by edge walking. For each scan-line in the tile that is intersected by the triangle, the cull process determines the line end-points. A hardware implementation may process several scan-lines simultaneously. The depth buffer in the culling process is cleared to some depth value, say that corresponding to the far plane of the view volume. For each pixel in the scan-line, its depth value is computed, and the pixel is subjected to the depth test. If the depth test passes, then the pixel is sent to the per-fragment operations stage; otherwise it is discarded. Clearly, in this scheme, the very first primitive incident on the pixels of the tile will always be painted. The fragments of the 2nd primitive covering the same pixel have a 50% chance of being painted, if we assume that there is an equal chance of their being in front of or behind the first primitive already painted. The fragments of the 3rd primitive covering the same pixel have a 33.33% chance of being painted, since there is a 1/3 chance of their being in front of the other two primitives already processed. Appendix A shows these computations for the two cases with depth complexity of 4 and 5. For a depth complexity of 4, this simple culling method discards about 50% of the incident fragments on average. For a depth complexity of 5, the method discards about 54% of the fragments. Clearly, this method of culling will still paint pixels that may be hidden.
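The percentages follow from a simple expected-value argument: if arrival order is random with respect to depth, the k-th fragment at a pixel survives the depth test with probability 1/k, so the expected number painted out of n is the harmonic sum 1 + 1/2 + ... + 1/n. A small illustrative C check:

    #include <stdio.h>

    int main(void)
    {
        int n, k;
        for (n = 4; n <= 5; n++) {
            double painted = 0.0;
            for (k = 1; k <= n; k++)
                painted += 1.0 / k;    /* k-th fragment survives with probability 1/k */
            printf("depth %d: discard %.1f%%\n", n, 100.0 * (1.0 - painted / n));
        }
        return 0;                      /* depth 4: 47.9%; depth 5: 54.3% */
    }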
In the scene graph representation, the branches of the scene graph represent different objects. Most of the objects are non-intersecting. An application can determine whether one object is in front of another by bounding box checks, and traverse the scene in a "front to back" object-order manner. Exact ordering by triangles is expensive. On the other hand, object-based ordering can be done quite easily in the transformation stage. This object ordering can further improve the efficacy of culling. Object-ordered traversal and backface culling, in conjunction with opaque culling, provide significantly improved culling efficacy.
We have also moved the scissor and stipple test to occur before the per-fragment operations.
We introduce the notions of the depth sample grid, the color grid, the cull grid and the anti-alias grid. The effective hidden surface removal has the resolution of the depth sample grid. The color grid corresponds to the resolution at which the color computation is carried out. Thus the color grid may consist of an nxm block of depth samples. The size of the color grid is programmable. After all the processing has been done, the samples in the final image (after per-fragment operations) may be averaged over a rectangular grid. This is the anti-alias grid. The cull grid provides an additional optimization. Due to the very fine resolution of hardcopy images, there is considerable coherence in the visibility of primitives. High resolution is needed for sharpness of silhouette edges. Most of the objects in the scene do not contain intersecting surfaces. By using a cull grid that is coarser than the depth sample grid, the size of the depth buffer used in the culling process can be decreased. This can also reduce the amount of work done by the cull process, since fewer depth tests need to be performed.
However, for each element of the cull grid, the depth value that is computed needs to be a conservative depth value. We do this by examining the signs of the depth gradients Zx and Zy and the depth test. The smallest or largest depth value within a cull grid element will occur at one of its corners. (If the cull grid element is not completely covered by the triangle, the computed values may lie outside the range of the triangle's values, which is acceptable for conservative culling.) The four corners of an element on the cull grid are denoted BL (bottom left corner), BR (bottom right corner), TL (top left corner) and TR (top right corner). We further assume that the cull grid is square. (This assumption does not limit the generality of the method; it is used here to simplify the illustration. A scale factor applied to Zx, Zy removes this restriction.) The following table lists the cull grid corner used for conservative depth computation. We use the symbols "<" for "less than", "=" for "equal to", and ">" for "greater than".
[Table not reproduced in the source: it lists, for each combination of the signs of the depth gradients Zx and Zy and the sense of the depth test (<, =, >), the cull grid corner (BL, BR, TL, TR) used for the conservative depth value.]
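Since the table did not survive extraction, the following C sketch reconstructs the likely rule from the surrounding text: pick the corner where the depth plane attains its minimum over the cell for a "less than" depth test, and its maximum for a "greater than" test, with y increasing downward as in the device coordinate system described earlier. The enum and function are hypothetical.

    typedef enum { BL, BR, TL, TR } Corner;

    /* For a "less than" test, the cell's minimum z is the conservative value
       (the most optimistic for the incoming primitive); for "greater than",
       its maximum. A zero gradient may use either corner on that axis. */
    static Corner conservative_corner(float Zx, float Zy, int test_is_less)
    {
        int right  = test_is_less ? (Zx < 0.0f) : (Zx > 0.0f);  /* z falls rightward? */
        int bottom = test_is_less ? (Zy < 0.0f) : (Zy > 0.0f);  /* z falls downward?  */
        return bottom ? (right ? BR : BL) : (right ? TR : TL);
    }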
We assume that a color pixel covers an integer number of cull grid elements in width and height.
The Culling process retrieves the data from the cull stream. It determines the depth samples inside each triangle using the well-known edge and span walking technique. The top and middle vertex of the triangle, the slopes of its three edges, and the leftCorner flag used for edge and span-walking have already been determined in the binning stage. The scan conversion starts at the top vertex. The intersection of each scan line with the long edge of the triangle is found. Processing proceeds left to right if the LeftCorner flag is not set, otherwise it proceeds right to left, until the triangle edge on the other side is encountered. The pixels on the scan line between the left and right edges of the triangle are assigned to the triangle. The primitive is rasterized on the depth sample grid for as many rows at a time as are covered by the size of the color pixel grid. This is in order to compute the depth sample coverage mask used in the color pixel data structure sent to the per-fragment operations stage.
A color pixel may be only partially covered by the fragment of the triangle. It is also possible that the center of the color pixel may be outside the triangle. It is important that the location for color computation be inside the triangle. For each color pixel, the first and last rows with coverage on that pixel are found. The row closest to the middle of the first and last rows is found, and the midpoint of the span on that row is the location at which the per-fragment color is computed. This is illustrated in Figure 29. In this drawing the tile is assumed to be 32x32 and the pixel grid is 8x8. The intersection of the triangle ABC with the color pixel at grid location (2, 1) (the 2nd color pixel in the third row of color pixels) covers depth sample rows 16 to 20. Therefore, the color sample location is chosen at the midpoint of the pixel span at scan line 18. The location for color computation for each color pixel in the tile covered by this triangle is shown in Figure 29. A row of the color pixel grid covers one or more rows in the cull grid. For each row of the cull grid, the conservative depth value is determined for each cull grid element that is partially or fully covered by the triangle. The depth test is carried out against the value in the depth buffer used by the cull process. A "visible" bit is kept for each cull grid element.
For each pixel in the color pixel grid, the "visible" bits for the cull grid elements are examined to determine if they are hidden. If all cull grid elements within the color pixel are hidden, then the color pixel is discarded. Otherwise, the row information is used to determine the location within the color pixel most suitable for color computation. A color fragment data structure is then assembled and sent down to the per-fragment operations stage. The color fragment data structure contains the tile-relative (x, y) location of the color pixel, a coverage mask, and a location within the color pixel for the computation of the per-fragment color. The coverage mask has one bit for each depth sample. All the samples in the coverage mask with value 1 are potentially visible and are assigned the computed color. Their depth values are computed individually in the per-fragment operations stage for exact hidden surface removal.
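As a sketch, the color fragment packet might look like the following C struct; the names and field widths are our assumptions. For the 8x8 color pixel of the example, the coverage mask needs 64 bits, one per depth sample.

    #include <stdint.h>

    typedef struct {
        uint8_t  tile_x, tile_y;       /* tile-relative location of the color pixel        */
        uint64_t coverage;             /* one bit per depth sample; 1 = potentially visible */
        float    sample_x, sample_y;   /* in-triangle location for color computation       */
    } ColorFragment;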
The culling process sends several types of data packets to the per-fragment operations stage. The tile data packet includes the information about the current tile being rendered. It has the tile coordinates. A tile packet will typically be followed by a triangle packet. A triangle packet has pointers to the fragment stream and the per-fragment and post-processing state blocks. A triangle packet may be followed by a sequence of color data structures, corresponding to each color pixel that needs to be processed. If none of the color pixels are visible, then the triangle packet will be followed by another triangle packet, and the old triangle packet can be discarded.
Per-fragment Operations
This stage takes the data passed down from the culling stage and performs the necessary processing to compute the color at each fragment. The per-fragment color computation may involve texture mapping and per-fragment lighting computations (such as specular highlights, bump mapping, etc.).
This invention includes a mechanism for reducing the per-fragment cost of color computation by providing a color fragment data structure and a separate grid for computing colors. A color fragment may consist of one or more depth fragments; thus the depth computation may be done on a finer grid of samples than the color computation. The rationale is that color variation within a small neighborhood of pixels inside the same object is much smaller than the color change at the edges of the object, while a finer depth fragment grid allows better hidden surface removal at those edges. The size of the color fragment grid is programmable. Some implementations of anti-aliasing compute the color per pixel but do the depth computation on a finer grid, then average the color value over all samples to get an average pixel color. Such implementations limit the size of the anti-alias grid to the size of the color grid. In our method, anti-aliasing is implemented as a post-processing step by averaging the pixels in the final image. Effectively, the final image has the same resolution as the depth sample grid, while the grid on which the color computation is done can be coarser. The anti-alias grid is used to implement anti-aliasing on the final image. The invention supports three programmable modes for anti-aliasing: (1) it can be turned off, in which case the pixels of the final image correspond to the depth sample grid; (2) it can be implemented as a convolution over the anti-alias grid, in which case the pixels in the final image also correspond to the depth sample grid; or (3) it can be implemented as an average over the anti-alias grid, in which case the pixels in the final image correspond to the size of the anti-alias grid. Note that there is a significant difference between our approach and multi-sample implementations of anti-aliasing.
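For mode (3), the post-processing step amounts to a box average over each anti-alias cell, shrinking the image by the cell size in each dimension. The sketch below shows one color channel stored as a flat float array; the layout is an assumption.

    /* Average a (w x h) channel over aa x aa cells; dst must hold
       (w/aa) x (h/aa) floats. Assumes w and h are multiples of aa. */
    static void aa_average(const float *src, int w, int h, int aa, float *dst)
    {
        for (int y = 0; y < h / aa; y++)
            for (int x = 0; x < w / aa; x++) {
                float sum = 0.0f;
                for (int j = 0; j < aa; j++)
                    for (int i = 0; i < aa; i++)
                        sum += src[(y * aa + j) * w + (x * aa + i)];
                dst[y * (w / aa) + x] = sum / (float)(aa * aa);
            }
    }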
This approach has several advantages. For display images, it can provide a fast preview mode; this is especially true for applications that apply multiple textures to objects or use other sophisticated effects such as bump mapping and specular highlights. For print applications it is of even greater value. Because print deposits ink on the page, print applications use halftoning to create the illusion of dynamic range in color space. A coarser color computation grid, combined with hidden surface elimination on a finer grid that corresponds to the final image resolution, provides a valuable method for speeding up the rendering process.
The computed color can be further modified by incorporating color conversion and color correction on the computed value. The method used for color conversion and color correction is described later. Such an embodiment of the invention allows integration of images (i.e., textures) from a variety of input sources into the same display or printed image.
The computed color is assigned to each of the depth samples. The depth samples with the computed color and depth value are passed down to the alpha, stencil, and depth tests, which are done on the fine depth sample grid. This depth test is in addition to the depth test carried out earlier in the opaque culling stage. A fragment may be discarded as a result of these tests. Note that the depth test in the culling stage is preliminary: it discards the samples on the cull grid that are definitely not going to be visible. The depth test after color computation is the exact test and refines the depth test done in the culling stage, eliminating all fragments that are hidden. The depth test is followed by other 3D pipeline stages such as alpha blending, dithering, and logical operations.
The color and depth values of the fragments that survive are written into the on-chip color buffer for the tile. When rendering to the tile is complete, the tile is subjected to postprocessing.
Post-Processing

As shown in Figure 8, post-processing is a stage that has been added in the modified pipeline. This stage takes the color buffer for the tile, rendered by the preceding 3D pipeline stages, and creates the color separations. These color separations are saved into frame buffer memory and used to drive the print engine. A color separation is a bitmap created for each color component; thus for CMYK space, four color separations are created, one for each of the cyan, magenta, yellow, and black channels. For a 32x32 pixel tile, each color channel requires 1K bits, or 128 bytes, of storage. By creating the color separation for the tile on chip, we eliminate the need to save the color buffer in memory; we only need to save the color separations. The memory for storing color separations can be further reduced by organizing the color separations into bands and printing the image incrementally by bands, or by saving the bands of color separations in system memory or on disk.
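Setting one dot in such a separation is a single bit operation. The sketch below assumes a row-major, MSB-first bit layout for the 128-byte channel bitmap; the patent does not specify the bit order.

    /* Set the dot at tile-relative (x, y) in one 32x32 binary separation. */
    static void set_dot(unsigned char sep[128], int x, int y)
    {
        int bit = y * 32 + x;                        /* 0 .. 1023 */
        sep[bit >> 3] |= (unsigned char)(0x80 >> (bit & 7));
    }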
Color Correction, Conversion, Black Generation and Undercolor Removal
The post-processing stage implements several processing steps. The color correction step is responsible for mapping the palette of colors used by the image generator into the palette of colors of the printer. The color is then converted into the color space of the printer. If the color space of the printer is CMYK, then black generation and undercolor removal are done next, for two reasons. The first is that equal values of c, m, and y should produce gray, and when all three are one, black; in practice, however, the result is a dirty gray. The black component is extracted from the c, m, and y color channels by identifying the smallest of the three values, namely the values of the cyan, magenta, and yellow colors at the pixel, and assigning it to the K component. This is called black generation.
The K component is then subtracted from each of the c, m, and y channels. One of the channels (the one whose value was assigned to k) is left with a value of zero, and the other two have their contributions diminished. This is called undercolor removal, and it addresses the second reason: it reduces the amount of ink placed on the paper, since too much ink can soak and warp the page. While the actual mechanism for black generation and undercolor removal is simple, it has a large impact on the quality of printing.
This is followed by halftoning and creation of color separations. Halftoning converts the c, m, y, and k values into binary color separations.
In this invention, we describe two methods for the implementation of the color correction, color conversion, black generation, and undercolor removal stages. The invention also describes a method for implementing halftoning for producing high quality print images.
The information contained in the post-processing state block changes depending on the method used in a particular embodiment.
Transform implementation
The first method implements the color correction process as three transformation stages. The RGB color is first converted to the CIE space using the supplied RGB TO CIE transformation. The CIE color space of the renderer is then transformed to the printer's CIE space via a corrective transformation. The corrected CIE values are then converted back to RGB using the inverse CIE TO RGB transformation. Each of these transformations is a 4x3 matrix; the fourth row allows the flexibility to incorporate a bias in each of the color values. The values are clamped to the allowed range after each transformation. The format of the matrix elements depends on the implementation. The most general implementation treats the values as single precision floating point values, and each color component is converted to a floating point value before the transformations begin.
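One such stage might look like the following sketch, in which the first three rows of the matrix hold the 3x3 linear part, the fourth row holds the per-channel bias, and the clamp range is taken to be [0, 1]; the layout and range are assumptions.

    /* Apply one 4x3 color transform stage followed by a clamp. */
    static void apply_4x3(const float m[4][3], const float in[3], float out[3])
    {
        for (int c = 0; c < 3; c++) {
            float v = in[0] * m[0][c] + in[1] * m[1][c] + in[2] * m[2][c]
                    + m[3][c];                          /* fourth row = bias */
            out[c] = v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
        }
    }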
The output of the color correction stage is a tile of ARGB values corresponding to the final rendered image.
Next, the additive RGB primary values are converted to the subtractive CMY color space using a linear RGB to CMY transformation given by
C = 1.0 - R; M = 1.0 - G; Y = 1.0 - B;

where R, G, B are the pixel color values after color correction, and C, M, Y are the computed values of the color in the cyan, magenta, and yellow space.
The black generation stage creates the black component by finding the smallest of the three c, m, y color components of the pixel.
K = MIN(C, M, Y);
The black color is then removed from the color values to finally yield the (c, m, y, k) value used for halftoning.
k = K; c = C - K; m = M - K; y = Y - K.
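Taken together, the conversion, black generation, and undercolor removal equations above reduce to a few lines; the sketch below is a direct transcription, with all colors in [0, 1].

    /* RGB -> CMYK: complement, black generation by MIN, undercolor removal. */
    static void rgb_to_cmyk(float r, float g, float b,
                            float *c, float *m, float *y, float *k)
    {
        float C = 1.0f - r, M = 1.0f - g, Y = 1.0f - b;
        float K = C < M ? (C < Y ? C : Y) : (M < Y ? M : Y);  /* K = MIN(C, M, Y) */
        *c = C - K;  *m = M - K;  *y = Y - K;  *k = K;
    }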
The post-processing state partition in this case consists of the three transformation matrices used in the color correction stage and the ranges of the transformed values, in addition to the state parameters needed by the halftoning stage, as described in a later section.
We have focused here on the CMYK color space, but the color conversion can be implemented for other spaces as well, such as YUV.
Table Lookup Implementation
In this implementation, we use a multi-dimensional look-up table (LUT) to incorporate the color correction, color conversion, black generation, and undercolor removal stages. In the case of RGB rendering, the look-up table is three-dimensional. The R, G, and B components of the color value are used to index into the three-dimensional look-up table. Each entry in the look-up table is a four-component color vector, with the components corresponding to the resulting cyan, magenta, yellow, and black color values. (The entries in the look-up table can have fewer or more components.)
Color Look-up with one-to-one correspondence
In one embodiment of this invention, there is one table entry for every possible value of the computed RGB (or any other input) color. We shall refer to the computed RGB color as the source color. Thus there is a 1-to-1 correspondence between the source colors and the LUT indices, and for 8-bit color channels the table contains about 16.8 million entries (256x256x256). For a CMYK target color, each entry has four components; if each of these components is also 8 bits, then 64 megabytes are required to hold the lookup table. The color entries are looked up as needed. This can impose a significant memory and memory bandwidth requirement.
Color Look-up with uniform many-to-one correspondence
In another embodiment of this method, we use a lookup table with fewer entries. There is a uniform many-to-one mapping between the source colors and the LUT entries, i.e., many source color values may index one entry in the table. The table entries are uniformly distributed. For example, the RGB color space may be divided into 32 sections along each of the red and blue channels and into 64 sections along the green channel. This provides a lookup table with 64K entries. Indexing into the table is done by taking the 5 most significant bits of the red and blue channels and the 6 most significant bits of the green channel.
The resulting color may be obtained in one of two ways. It can be found by indexing the cell closest to the source color and using the value in that cell. Alternatively, it can be obtained by linearly interpolating the 8 closest entries in the LUT around the source RGB value, with the 3 least significant bits of the red and blue channels and the 2 least significant bits of the green channel used as blending factors in the interpolation. This is schematically outlined in the following pseudocode:
DoLinearUniformColorLookup(unsigned byte r, g, b) {
    int r_idx, g_idx, b_idx;
    float r_fac, g_fac, b_fac;
    CMYK_color c000, c001, c010, c011, c100, c101, c110, c111;
    CMYK_color temp1, temp2, temp3, temp4;
    CMYK_color result;

    // index into the look-up table
    r_idx = (r >> 3); g_idx = (g >> 2); b_idx = (b >> 3);
    // blending factors
    r_fac = (r & 0x7)/8.0; g_fac = (g & 0x3)/4.0; b_fac = (b & 0x7)/8.0;

    c000 = table[r_idx, g_idx, b_idx];
    c001 = table[r_idx, g_idx, b_idx + 1];
    c010 = table[r_idx, g_idx + 1, b_idx];
    c011 = table[r_idx, g_idx + 1, b_idx + 1];
    c100 = table[r_idx + 1, g_idx, b_idx];
    c101 = table[r_idx + 1, g_idx, b_idx + 1];
    c110 = table[r_idx + 1, g_idx + 1, b_idx];
    c111 = table[r_idx + 1, g_idx + 1, b_idx + 1];

    temp1.cyan = c000.cyan * (1.0 - b_fac) + c001.cyan * b_fac;
    temp2.cyan = c010.cyan * (1.0 - b_fac) + c011.cyan * b_fac;
    temp3.cyan = temp1.cyan * (1.0 - g_fac) + temp2.cyan * g_fac;

    temp1.cyan = c100.cyan * (1.0 - b_fac) + c101.cyan * b_fac;
    temp2.cyan = c110.cyan * (1.0 - b_fac) + c111.cyan * b_fac;
    temp4.cyan = temp1.cyan * (1.0 - g_fac) + temp2.cyan * g_fac;

    result.cyan = temp3.cyan * (1.0 - r_fac) + temp4.cyan * r_fac;

    // Similar interpolation for the magenta, yellow, and black components.
}
In a subsequent section, we describe the caching schemes that help to minimize the I/O requirements for accessing the lookup table.
Color Look-up with non-uniform many-to-one correspondence
In another embodiment of this method, we use potentially non-uniformly distributed table entries. The lookup table is characterized by the number of entries along each of the red, green, and blue channels. In addition, there are six one-dimensional mapping tables. Three of these tables map the source red, green, and blue values into the r_ix, g_ix, and b_ix used to index into the lookup table; the lengths of these tables correspond to the dynamic range of the source color, so for 8-bit source colors we need 3 tables of 256 entries each. The other three tables provide the sample color value at which each entry in the LUT is computed; the length of this second set of tables corresponds to the length of the LUT along each of the three dimensions. For a source color (r, g, b), the LUT indices and the corresponding sample color are determined as follows.
r_ix = RedMappingTable[r]; r_sample = RedSampleTable[r_ix];
g_ix = GreenMappingTable[g]; g_sample = GreenSampleTable[g_ix];
b_ix = BlueMappingTable[b]; b_sample = BlueSampleTable[b_ix];
If nearest-sample lookup is performed, then r_ix, g_ix, and b_ix are used to index into the LUT directly; this provides the mapping, and r_sample, g_sample, and b_sample can be ignored in this case.
The sample color (r_sample, g_sample, b_sample) is used if linear interpolation is performed on the LUT entries. We first compare the source color with the sample color for each color component. If the source color is less than the sample color, interpolation is performed between the current and previous entries in the LUT; otherwise it is performed between the current and next entries. For example, if r_sample is greater than r, then interpolation is done on the entries with index (r_ix - 1) and r_ix; otherwise it is done on the entries with index r_ix and (r_ix + 1). The difference between r and r_sample is used to compute the interpolation factor. This is schematically outlined in the pseudocode below:
DoLinearNonUniformColorLookup(unsigned byte r, g, b) {
    int r_ix, g_ix, b_ix, r_next, g_next, b_next;
    int r_sample, g_sample, b_sample, r_sample_next;
    float r_fac, g_fac, b_fac;
    CMYK_color c000, c001, c010, c011, c100, c101, c110, c111;
    CMYK_color temp1, temp2, temp3, temp4;
    CMYK_color result;

    r_ix = RedMappingTable[r]; r_sample = RedSampleTable[r_ix];
    g_ix = GreenMappingTable[g]; g_sample = GreenSampleTable[g_ix];
    b_ix = BlueMappingTable[b]; b_sample = BlueSampleTable[b_ix];

    // r_max is the last valid red index in the LUT
    r_next = (r_sample > r) ? (r_ix - 1) : (r_ix + 1);
    r_next = (r_next > r_max) ? r_max : (r_next < 0) ? 0 : r_next;
    r_sample_next = RedSampleTable[r_next];
    r_fac = (r - r_sample)/(r_sample_next - r_sample);

    // similar computations are done for green and blue

    c000 = table[r_ix, g_ix, b_ix];
    c001 = table[r_ix, g_ix, b_next];
    c010 = table[r_ix, g_next, b_ix];
    c011 = table[r_ix, g_next, b_next];
    c100 = table[r_next, g_ix, b_ix];
    c101 = table[r_next, g_ix, b_next];
    c110 = table[r_next, g_next, b_ix];
    c111 = table[r_next, g_next, b_next];

    temp1.cyan = c000.cyan * (1.0 - b_fac) + c001.cyan * b_fac;
    temp2.cyan = c010.cyan * (1.0 - b_fac) + c011.cyan * b_fac;
    temp3.cyan = temp1.cyan * (1.0 - g_fac) + temp2.cyan * g_fac;

    temp1.cyan = c100.cyan * (1.0 - b_fac) + c101.cyan * b_fac;
    temp2.cyan = c110.cyan * (1.0 - b_fac) + c111.cyan * b_fac;
    temp4.cyan = temp1.cyan * (1.0 - g_fac) + temp2.cyan * g_fac;

    result.cyan = temp3.cyan * (1.0 - r_fac) + temp4.cyan * r_fac;
    // Similar interpolation for the magenta, yellow, and black components.
}

The post-processing state in this embodiment consists of the three look-up tables mapping source colors into color indices; the three look-up tables giving the color sample value corresponding to each sample of the color lookup table; the size of the three-dimensional table; the three-dimensional table itself (or a pointer to it); and the lookup mode (i.e., nearest sample or linear interpolation). This is in addition to the parameters needed by the halftoning stage.
The non-uniform lookup table provides an efficient method for many color-related operations. By preparing the table appropriately, several processing operations can be incorporated into one lookup; for example, color correction, gamma correction, histogram equalization, and several image processing operations that map one color value to another can all be folded in.
The invention allows for more sophisticated filtering schemes for interpolation; we have used linear interpolation for purposes of illustration.
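As an example of preparing such a table's supporting structures, the sketch below builds the red-channel mapping and sample tables for a 32-entry axis whose samples are concentrated at the dark end of the channel; the power-curve spacing and the map-to-nearest-sample-at-or-below convention are purely illustrative.

    #include <math.h>

    /* Build a non-uniform mapping for an 8-bit red channel: 32 samples packed
       toward the dark end, plus a 256-entry table mapping each r to the index
       of the nearest sample at or below r. */
    static void build_red_tables(unsigned char RedMappingTable[256],
                                 unsigned char RedSampleTable[32])
    {
        for (int i = 0; i < 32; i++)
            RedSampleTable[i] = (unsigned char)(255.0 * pow(i / 31.0, 2.0) + 0.5);

        for (int r = 0, ix = 0; r < 256; r++) {
            while (ix < 31 && RedSampleTable[ix + 1] <= r) ix++;
            RedMappingTable[r] = (unsigned char)ix;
        }
    }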
Caching schemes for color look-up
In another embodiment of this invention, we use caching to improve the efficiency of the color look-up. The caching scheme allows efficient caching with a single-clock cache look-up on a hit. The method divides the multi-dimensional array volume in the input color space into a set of abutting parallelepiped-shaped blocks of array entries. Each block is further divided into abutting parallelepiped-shaped sub-blocks of array entries. A cache line corresponds to the volume of a sub-block; in other words, a cache line is large enough to hold the number of entries in a sub-block. The number of cache lines is equal to the number of sub-blocks each block is divided into. In one implementation, we use 16 cache lines, each containing 16 LUT entries. The three-dimensional array representing the lookup table is carved up into blocks, each containing 8x8x4 (red x green x blue) LUT entries. Each block is further divided into sub-blocks, such that each sub-block corresponds to a cache line; each sub-block contains 2x4x2 table entries. Thus each block contains 16 sub-blocks, with each sub-block containing 16 table entries. Each cache line can hold one sub-block's worth of data. There are 16 cache lines, and therefore the cache can hold one entire block. The cache is implemented as a direct-mapped cache, with the block index used as the tag for the cache line. The caching procedure is as follows:
1. Generate the table index from the RGB color value. Call these r_ix, g_ix, b_ix.

2. Generate the block index by combining the 2 MSBs from the red channel and 3 MSBs each from the green and blue channels. Thus r_blk = (r_ix >> 3); g_blk = (g_ix >> 3); b_blk = (b_ix >> 2); block_id = (r_blk << 6) | (g_blk << 3) | b_blk.

3. Generate the cache line index as follows: r_ln = (r_ix & 0x7) >> 1; g_ln = (g_ix & 0x7) >> 2; b_ln = (b_ix & 0x3) >> 1; cache_ix = (r_ln << 2) | (g_ln << 1) | b_ln.

4. Each cache line contains a tag and the data; the tag is the block_id. If the cache line is valid and the cache tag at the cache line index matches the desired block_id, we have a cache hit; otherwise it is a miss, and the cache line is read in from memory.
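Steps 1 through 4 translate into the sketch below. The Entry and CacheLine types and the fetch_subblock() memory read are stand-ins for the hardware, not structures from the patent.

    typedef struct { unsigned char c, m, y, k; } Entry;   /* one CMYK LUT entry */
    typedef struct { int valid; unsigned tag; Entry data[16]; } CacheLine;

    static CacheLine cache[16];

    /* Hypothetical refill: reads one 16-entry sub-block of the table from memory. */
    extern void fetch_subblock(unsigned block_id, unsigned cache_ix, Entry data[16]);

    static Entry lut_lookup(unsigned r_ix, unsigned g_ix, unsigned b_ix)
    {
        /* step 2: block index */
        unsigned block_id = ((r_ix >> 3) << 6) | ((g_ix >> 3) << 3) | (b_ix >> 2);
        /* step 3: cache line index */
        unsigned cache_ix = (((r_ix & 0x7) >> 1) << 2)
                          | (((g_ix & 0x7) >> 2) << 1)
                          |  ((b_ix & 0x3) >> 1);
        CacheLine *line = &cache[cache_ix];
        /* step 4: tag check; refill the line on a miss */
        if (!line->valid || line->tag != block_id) {
            fetch_subblock(block_id, cache_ix, line->data);
            line->tag = block_id;
            line->valid = 1;
        }
        /* position within the 2x4x2 (red x green x blue) sub-block */
        unsigned e = ((r_ix & 1) << 3) | ((g_ix & 3) << 1) | (b_ix & 1);
        return line->data[e];
    }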
The host is responsible for formatting the table so that the lookup table is organized in a block by block manner, with sub-block by sub-block organization within each block.
Note that the above illustration of caching uses a three-dimensional table. The method is general and can be applied to an N-dimensional table. Furthermore, the elements of the table do not need to be 4 component vectors and can have any dimension. For example, the method can be used for mapping RGB colors to 6 component colors or HSB colors to any other color space.
Halftoning and color separation
The printing process uses ink or other media for printing the images. These media are binary in nature; the illusion of dynamic range in color is generated by halftoning. In this invention, we incorporate a method for implementing halftones in the tiled 3D rendering architecture.
Halftones are specified by an angle, a frequency, and a pattern. The angle specifies the orientation of the halftone patterns. The angle, frequency, and pattern are used to create rectangular patterns that are page aligned. These halftone patterns can be viewed as bricks, with each successive row of bricks displaced horizontally by a certain amount with respect to the previous row. This is shown schematically in Figure 30. Halftone patterns are essentially threshold arrays: the pixel color value is compared against the value at the corresponding location in the threshold array, and if the pixel color value is larger than or equal to the threshold value, the dot is painted; otherwise it is not. One threshold array is used for each color separation. Each element of the threshold arrays may be up to eight bits deep.
Each threshold array is characterized by a set of {width, height, displacement, x_origin, y_origin, x_offset, y_offset} parameters. Given the (x, y) pixel location, the threshold array element is determined as follows.
brick_row = (int)((y + y_offset) / height);
brick_y = y + y_offset - brick_row * height;
brick_offset = mod((brick_row * displacement), width);
brick_x = mod((x + x_offset - brick_offset), width);
If the current color value is larger than or equal to the corresponding threshold array value at (brick_x, brick_y), the dot is painted on the page.
Figure 31 shows the correspondence between the brick parameters and the tiles. The tile-relative (x, y) value is first converted to an image-relative (x, y) value before halftones are applied.
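Putting the brick arithmetic together, a per-dot decision function might look as follows; mod() is taken to be a non-negative modulo, and the flat threshold[] layout is an assumption.

    static int mod(int a, int n) { int m = a % n; return m < 0 ? m + n : m; }

    /* Decide one dot of one separation at image-relative (x, y), given the
       brick parameters and this separation's width x height threshold array. */
    static int paint_dot(const unsigned char *threshold,
                         int width, int height, int displacement,
                         int x_offset, int y_offset,
                         int x, int y, unsigned char value)
    {
        int brick_row    = (y + y_offset) / height;
        int brick_y      = y + y_offset - brick_row * height;
        int brick_offset = mod(brick_row * displacement, width);
        int brick_x      = mod(x + x_offset - brick_offset, width);
        return value >= threshold[brick_y * width + brick_x];    /* 1 = paint */
    }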
The post-processing state associated with halftoning contains the brick width, height, displacement, x_offset, and y_offset, and the threshold array for each of the four color separations.
Implementation
The invention could be implemented in software, in hardware, or in a combination of the two. In some implementations, portions of the pipeline can be implemented in a VLSI chip. The color separations delivered from the modified pipeline of the invention can be downloaded to a host computer and integrated into a page via appropriate printer drivers. In other implementations, individual processing blocks could be implemented in DSP devices.
Other embodiments are also within the scope of the claims that follow the appendix.

Appendix A: Efficacy of Culling
Assume a scene has a depth complexity of four. With ordered occlusion culling, only about 50% of the scene is rendered. The computation is outlined below.

n_AVG = SUM_i(i * P_i) / SUM_i(P_i), where i is an integer from 1 to 4 and P_n is the probability that n out of 4 primitives (prims) are rendered.

P4 = (prob that the 1st prim is rendered * prob that the 2nd prim is in front of the 1st * prob that the 3rd prim is in front of the first two prims * prob that the 4th prim is in front of the first 3 prims);

P4 = (1 * 1/2 * 1/3 * 1/4) = 1/24;

P3 = (prob that the 1st prim is rendered * prob that the 2nd prim is in front of the 1st * prob that the 3rd prim is in front of the first two prims * prob that the 4th prim is behind one of the first 3 prims) + (prob that the 1st prim is rendered * prob that the 2nd prim is behind the 1st prim * prob that the 3rd prim is in front of the first two prims * prob that the 4th prim is in front of the first three prims) + (prob that the 1st prim is rendered * prob that the 2nd prim is in front of the 1st prim * prob that the 3rd prim is behind one of the first two prims * prob that the 4th prim is in front of the first three prims);

P3 = (1 * 1/2 * 1/3 * 3/4) + (1 * 1/2 * 1/3 * 1/4) + (1 * 1/2 * 2/3 * 1/4) = (1/8) + (1/24) + (1/12) = 1/4;

Similarly, P2 = (1 * 1/2 * 2/3 * 3/4) + (1 * 1/2 * 2/3 * 1/4) + (1 * 1/2 * 1/3 * 3/4) = (1/4) + (1/12) + (1/8) = 11/24;

P1 = (1 * 1/2 * 2/3 * 3/4) = 1/4;

n_AVG = {(4 * 1/24) + (3 * 1/4) + (2 * 11/24) + (1 * 1/4)} / {1/24 + 1/4 + 11/24 + 1/4} = (4 + 18 + 22 + 6)/24 = 25/12, or about 2.1;

Thus, the average number of primitives rendered (n_AVG) per pixel is 2.1, or about half of the original data. As the depth complexity increases, the advantage of ordered culling increases on a percentage basis. For a depth complexity of 5, the approach renders 46% of the scene, as shown below.

P5 = (1 * 1/2 * 1/3 * 1/4 * 1/5) = 1/120;

P4 = (1 * 1/2 * 1/3 * 1/4 * 4/5) + (1 * 1/2 * 1/3 * 1/4 * 1/5) + (1 * 1/2 * 2/3 * 1/4 * 1/5) + (1 * 1/2 * 1/3 * 3/4 * 1/5) = 1/30 + 1/120 + 1/60 + 1/40 = 10/120;

P3 = (1 * 1/2 * 1/3 * 3/4 * 4/5) + (1 * 1/2 * 2/3 * 3/4 * 1/5) + (1 * 1/2 * 2/3 * 1/4 * 1/5) + (1 * 1/2 * 1/3 * 3/4 * 1/5) + (1 * 1/2 * 1/3 * 1/4 * 4/5) + (1 * 1/2 * 2/3 * 1/4 * 4/5) = (1/10) + (1/20) + (1/60) + (1/40) + (1/30) + (1/15) = 35/120;

P2 = (1 * 1/2 * 2/3 * 3/4 * 4/5) + (1 * 1/2 * 1/3 * 3/4 * 4/5) + (1 * 1/2 * 2/3 * 1/4 * 4/5) + (1 * 1/2 * 2/3 * 3/4 * 1/5) = (1/5) + (1/10) + (1/15) + (1/20) = 50/120;

P1 = (1 * 1/2 * 2/3 * 3/4 * 4/5) = 1/5 = 24/120;

n_AVG = (5 * 1/120) + (4 * 10/120) + (3 * 35/120) + (2 * 50/120) + (1 * 24/120) = (5 + 40 + 105 + 100 + 24)/120 = 274/120, or about 2.3;
Thus, discarding pixels that are definitely hidden can reduce per-pixel color computations and texture retrieval by about half. While it is possible to implement more complicated culling schemes to get close to a depth complexity of 1, they may not be worth the hardware cost.
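These averages are easy to check numerically: a primitive survives the ordered cull exactly when its depth is a new running minimum of the arrival sequence, so the expected count is the harmonic number H_n (H_4 = 25/12 and H_5 = 137/60, matching the sums above). The simulation below is ours, not the patent's.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        srand(1);
        for (int n = 4; n <= 5; n++) {
            long trials = 1000000, rendered = 0;
            for (long t = 0; t < trials; t++) {
                double nearest = 2.0;                 /* farther than any sample */
                for (int i = 0; i < n; i++) {
                    double z = rand() / (double)RAND_MAX;
                    if (z < nearest) { rendered++; nearest = z; } /* passes cull */
                }
            }
            printf("depth complexity %d: avg primitives rendered = %.2f\n",
                   n, (double)rendered / trials);
        }
        return 0;
    }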

Claims

WHAT IS CLAIMED:
1. A method comprising: in a first stage, sorting primitives of a scene to be rendered among regions of the scene; in a second stage, sorting the primitives of each of the regions among sub-regions of the regions; and processing each of the sub-regions to render the scene.
2. The method of claim 1 in which the sub-regions comprise tiles of the scene.
3. The method of claim 1 in which the primitives comprise 3D triangles.

4. The method of claim 2 in which the regions comprise bands of the tiles.
5. The method of claim 1 in which a primitive is sorted to a region if the primitive touches the region.
6. The method of claim 5 in which the regions touched by a primitive are determined from the extent of the primitive in a y direction.

7. The method of claim 1 in which a primitive is sorted to a sub-region based on minimum and maximum values of the primitive in an x direction within the region.
9. The method of claim 1 further comprising generating streams of information associated with the primitives in connection with the sorting.
10. The method of claim 9 in which the streams comprise a mode stream, a cull stream, and a fragment stream.
11. The method of claim 1 further comprising maintaining time stamp information concerning the relative timing of the sorting of the primitives to the regions.
12. The method of claim 1 further comprising maintaining rendering state information associated with each of the regions.

13. The method of claim 1 further comprising maintaining rendering state information associated with each of the sub-regions.
14. A method comprising maintaining state information in connection with processing primitives of portions of a scene to be rendered, and using the state information in rendering each of the portions of the scene to which the primitives belong, the state information being divided into state partitions.
15. The method of claim 14 in which the state information is contained in state blocks.
16. The method of claim 15 in which a state block contains a state description for an entire state partition.
17. The method of claim 15 in which a state block contains a state description only for parameters that have changed.
18. The method of claim 15 in which the state blocks are preloaded into memory and pointers to the state blocks are passed from the processing stage to the rendering stage.

19. The method of claim 15 in which incremental state blocks are preloaded into memory and pointers to the state blocks are passed from the processing stage to the rendering stage.
20. The method of claim 15 in which state blocks are interleaved with data about the geometry of primitives in a command stream passed from the processing stage to the rendering stage.
21. The method of claim 15 in which state block types are mixed.
22. A method comprising culling fragments of primitives belonging to a portion of a scene to be rendered by rasterizing the fragments in a sequence, for each fragment, discarding a pixel of the fragment if it fails a depth test, and performing rendering operations on the fragments of the primitives using pixels that have not been discarded.
23. The method of claim 22 in which the fragments are culled using a cull stream of data provided by a binning process that precedes the culling.

24. The method of claim 22 in which a depth sample grid is used for hidden surface removal, a cull grid is used for culling, and the cull grid may be coarser than the depth sample grid.
25. The method of claim 24 in which conservative depth values are estimated for the cull grid based on values in the depth sample grid.

26. The method of claim 22 further comprising processing objects of the scene in front to back order to improve culling efficiency.
27. The method of claim 22 further comprising prior to culling, performing a scissor and stipple test.
28. A method comprising pre-processing fragments of primitives of regions of a scene to be rendered, and after the pre-processing, determining the colors of pixels in the scene based on computations with respect to separate depth sample and color grids.
29. The method of claim 28 in which the depth sample grid is finer than the color grid.
30. The method of claim 28 further comprising performing an anti-aliasing operation after the colors of the pixels in the scene have been determined.
31. The method of claim 30 in which the anti-aliasing is performed by a convolution over an anti-aliasing grid.
32. The method of claim 30 in which the anti-aliasing is performed by an averaging over an anti-aliasing grid.

33. A method comprising receiving, from an on-chip color buffer, color data for each one of a set of tiles of the scene, generating color separations of each of the tiles without storing the color data off-chip, and delivering the color separations for each of the tiles to a printer or to an off-chip storage device.
34. The method of claim 33 in which color separations for bands of the tiles are stored and delivered to the printer in bands.
35. A method comprising rendering tiles of a scene as a rasterized array of pixels, performing post-rendering steps on a tile-by-tile basis for the tiles of the scene.
36. The method of claim 35 in which the post rendering steps comprise at least one of color space conversion, color correction, black generation, under color removal, and halftoning.

37. The method of claim 35 in which the post rendering steps comprise color conversion or color correction transforms using matrices.
38. The method of claim 37 in which the transforms include scaling and biasing of color values.
39. The method of claim 35 in which the post-rendering steps include color table look-up.
40. The method of claim 39 in which the table incorporates at least one of gamma correction, color histogram equalization, color conversion, and color correction.
41. The method of claim 39 in which there is a one to one correspondence of color table values and source color values.
42. The method of claim 39 in which portions of the color look-up table are cached.

43. The method of claim 39 in which entries in the color look-up table are uniformly distributed in the color space with many to one correspondence between the input and the color lookup table space.
44. The method of claim 43 further comprising nearest sample look-up in the uniformly distributed color look-up table.

45. The method of claim 43 in which source color values are mapped by interpolation of nearby sample values in the uniformly distributed color look-up table.
46. The method of claim 43 in which entries in the color look-up table are distributed non-uniformly in the color space.
47. A method comprising color conversion and color correction of post-texture color values of an image using a transform or a color table look-up.