Description of drawings
Fig. 1 is a block diagram of a graphics processing system according to an embodiment of the invention; the graphics processing system serves as an environment in which the packet processing system (and method) is implemented.
Fig. 2A is a functional block diagram showing selected portions of the graphics processing system of Fig. 1 and the packet processing system according to an embodiment of the invention.
Fig. 2B is a functional block diagram of the packet processing system of Fig. 2A according to an embodiment of the invention.
Fig. 3 is a schematic diagram showing the format of the pre-processing packet content and the byte masks applied by the packet processing system of Fig. 2B.
Fig. 4 shows the post-processing packet content produced by performing mask and swap operations on the pre-processing packet of Fig. 3.
Fig. 5 is a flowchart of a packet processing method according to an embodiment of the invention.
Description of main component symbols
10: graphics processing system
100, 100a: packet processing system
102: display device
103: PCIE connection
104, DIU: display interface unit
106: local memory
110, MIU: memory interface unit
114, GPU: graphics processing unit
118, PCIE BIU: PCIE bus interface unit
122: chipset
124: system memory
126, CPU: central processing unit
150: driver software
220, BCI: buffer control initialization unit
222, VS: vertex shader
224, TSU: triangle setup unit
226, STG: span and tile generation unit
228: ZL1 unit
230: ZL1 cache
232: ZL2 unit
234: Z cache
236, 238: P unit
240, PS: pixel shader
242: texture cache
244: ZL3 unit
246: destination unit
248: D cache
260: segregator
262: mask logic unit
264: swap logic unit
266: write logic unit
268: receiver
300, 400, A, B: packet
302, 304: byte mask
303: s-data
305: z-data
402: z-data group
404: s-data group
Embodiment
To make the above and other objects, features and advantages of the invention more apparent, preferred embodiments are set out below and described in detail with reference to the accompanying drawings:
Embodiment:
Various embodiments of packet processing systems and methods are disclosed. Such systems and methods apply a byte mask to the entire packet content (the whole packet) so that selected write and/or read operations can be performed on circuit components. Applying the byte mask to the entire packet content can improve processing speed and efficiency compared with conventional systems. As noted above, a conventional PCI system can apply a byte mask to the tail or the head of a packet, but not to the whole packet content. Such a conventional system must break the packet into manageable fragments and attach a header to each fragment. The additional headers increase processing time and storage requirements and therefore reduce processing efficiency. The disclosed packet processing systems and methods need neither split the packet content and attach a header to each fragment nor perform the read and write operations required in conventional systems. The disclosed systems and methods can therefore perform contiguous write operations to a component.
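To make the cost of fragmentation concrete, the following minimal Python sketch counts how many separately headed fragments a write might need if byte enables could not describe the interior of a packet, compared with a single packet carrying a whole-packet mask. The function name `count_contiguous_runs` and the 64-byte, three-in-four byte layout are illustrative assumptions; they are not taken from any PCI or PCIE specification.

```python
def count_contiguous_runs(mask):
    """Count runs of consecutive enabled (1) bytes in a per-byte mask."""
    runs = 0
    previous = 0
    for bit in mask:
        if bit == 1 and previous == 0:
            runs += 1  # a new run of enabled bytes starts here
        previous = bit
    return runs


# Hypothetical 64-byte packet in which three of every four bytes are enabled.
interleaved_mask = [1, 1, 1, 0] * 16

# Assumption: a scheme that cannot mask the packet interior needs one fragment
# (and one header) per run, whereas a whole-packet mask needs a single packet.
fragments_needed = count_contiguous_runs(interleaved_mask)
```

Under these assumptions the interleaved layout yields one fragment per enabled run, which is the per-fragment header overhead that a whole-packet mask is meant to avoid.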
The packet processing systems and methods are described in the context of a graphics processing environment that includes a PCIE bus and a graphics processing unit, which produces depth (z) data and stencil (s) data when processing triangles (or other primitives). Those skilled in the art will understand, however, that other bus communication protocols and standards also fall within the scope of the invention. Moreover, although only write operations are illustrated, the disclosed principles also apply to read operations. In addition, although only stencil and depth data are described in this specification, the same operations can be applied to other types of data, for example separating or swapping alpha data and color (for example RGB) data.
Fig. 1 is a block diagram of a graphics processing system 10 according to an embodiment of the invention; the graphics processing system 10 serves as an environment in which the packet processing system 100 (and method) is implemented. In some embodiments the graphics processing system 10 may be configured as a computer system. The graphics processing system 10 may include a display device 102 driven by a display interface unit (DIU) 104 and a local memory 106 (including, for example, a display buffer, texture buffer, command buffer and frame buffer). The local memory 106 may also be referred to herein as a frame buffer, storage unit or memory. The local memory 106 is coupled to a graphics processing unit (GPU) 114 through a memory interface unit (MIU) 110. In one embodiment, the MIU 110, GPU 114 and DIU 104 are coupled to a PCIE-compatible bus interface unit (BIU) 118. For example, the PCIE BIU 118 may be implemented using a graphics address remapping table (GART) or another memory mapping mechanism. The BIU 118 and the GPU 114 may be communicatively coupled through a PCIE connection 103, over which data and/or commands may be provided. In one embodiment, the BIU 118 and the MIU 110 are configured to send or receive data according to the PCIE protocol and a double data rate (DDR) memory protocol, respectively.
The BIU 118 is coupled to a chipset 122 (for example a northbridge chipset) or a switch. The chipset 122 includes interface electronics that strengthen signals from the central processing unit (CPU) 126 and separate signals to and from the system memory 124 from signals to and from input/output devices (not shown). Although in this embodiment the connection and/or communication between the host processor and the GPU 114 uses the PCIE bus protocol, other embodiments may use other means (for example PCI or a proprietary high-speed bus). The system memory 124 also contains a graphics application (not shown) and driver software 150, which sends instructions or commands through the CPU 126 to registers in the GPU 114 and the DIU 104. The driver software 150, or a unit with the same function, may be stored in the system memory 124 and executed by the CPU 126. In one embodiment, the driver software 150 provides compiled code (for example shader code) to the GPU 114 for execution in the GPU 114.
In some embodiments, additional graphics processing units are coupled to the components of Fig. 1 through the chipset 122 using the PCIE bus protocol. The graphics processing system 10 may include all the components shown in Fig. 1, or fewer and/or different components. Some embodiments may also use additional components, for example a southbridge chip coupled to the chipset 122.
The packet processing system 100 can be implemented in hardware, software and/or firmware. When implemented in hardware (for example the packet (P) units in Fig. 2A), the hardware can be realized with any one or a combination of the following well-known technologies: discrete logic circuits having logic gates for implementing logic functions on data signals, an application-specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
When the packet processing system 100 is implemented in software or firmware (for example, hardware processing controlled by the driver software 150), the driver software 150, which comprises an ordered listing of executable instructions for implementing logic functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system or device (for example a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system or device and execute them). In this specification, a computer-readable medium can be any means that can contain, store or transmit the program for use by or in connection with the instruction execution system or device. The computer-readable medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, device or propagation medium. More specific examples of a computer-readable medium include an electrical connection having at least one wire (electronic), a portable computer diskette (magnetic), a random access memory (RAM) (electronic), a read-only memory (ROM) (electronic), an erasable programmable read-only memory (EPROM) or flash memory (electronic), an optical fiber (optical) and a read-only compact disc (optical). Note that the computer-readable medium could even be paper or another suitable medium on which the program is printed, since the program can be captured electronically by optically scanning the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and stored in a computer memory.
In addition, the scope of some embodiments of the invention includes embodying the functionality of the preferred embodiments in logic embodied in hardware-configured or software-configured media.
Fig. 2A is a functional block diagram of part of the GPU 114 according to an embodiment of the invention, including an embodiment of the packet processing system 100 denoted 100a. The GPU 114 may include a buffer control initialization (BCI) unit 220, a vertex shader (VS) 222, a triangle setup unit (TSU) 224, a span and tile generation (STG) unit 226, a ZL1 unit 228, a ZL1 cache 230, a ZL2 unit 232, a Z cache 234, P units 236 and 238, a pixel shader (PS) 240, a texture (T) cache 242, a ZL3 unit 244, a destination (D) unit 246 and a D cache 248. The function of at least one of these components may be implemented as a fixed-function unit or by code executing on a programmable processing unit. The BCI unit 220 receives data or commands from a bus interface unit (for example the BIU 118 of Fig. 1) and begins processing vertex data. The P units 236 and 238, the ZL1 cache 230 and the D cache 248 each interface with the memory interface unit and the bus interface unit (for example the MIU 110 and the BIU 118, respectively). Note that in some embodiments the P units 236 and 238 may be contained in the Z cache 234 and the T cache 242, respectively. Although the packet processing system 100 may include fewer or more components in some embodiments, in one embodiment the P units 236 and 238 (referred to herein individually or collectively as packet logic) constitute the packet processing system 100a (shown in dashed lines). For example, the packet processing system 100a may additionally include the driver software 150, which is configured to control the execution of the P units 236 and 238 and/or a core processor (for example an engine), or in some embodiments may encompass the entire graphics processing unit 114 or graphics processing system 10.
Fig. 2B is a functional block diagram of the packet processing system 100 according to an embodiment of the invention. As shown, the packet processing system 100 includes a segregator 260, a receiver 268, a write logic unit 266 and the driver software 150. The segregator 260 in turn includes a mask logic unit 262 and a swap logic unit 264. The segregator 260 is configured to divide an entire packet into two contiguous groups, the first containing data of a first type and the second containing data of a second type. The receiver 268 is configured to receive data over the PCIE connection (for example from the BIU 118). The write logic unit 266 is configured to write data to a cache (for example the Z cache 234 or the T cache 242). The driver software 150 is configured to coordinate and control the functions of the receiver 268 and the segregator 260. Those skilled in the art will understand that at least one logic unit of the packet processing system 100 (for example 260, 268 or 266) may be replicated for each of the P units 236 and 238, or in some embodiments may be shared by the P units 236 and 238.
With reference to Figs. 2A and 2B, the P units 236 and 238 in one embodiment comprise logic, including registers, configured to enable masking (mask logic unit 262) and byte swapping (swap logic unit 264) among other functions (for example edge calculations). The ZL2 unit 232 and the ZL3 unit 244 access the Z cache 234. The D unit 246 is coupled to the PS 240 and the ZL3 unit 244, performs color functions, and accesses the D cache 248. The PS 240 accesses the T cache 242 for texture processing according to well-known mechanisms. Note that in some embodiments two or more of the components shown in Figs. 2A and 2B may be combined into a single component and, conversely, the functions of a single component may be distributed over two or more components.
In operation, the BCI 220 receives a command from the driver software 150 or other software to draw a triangle or other primitive. The BCI 220 also receives vertex information for the triangle to be drawn. The vertex information is passed to the VS 222, which performs vertex transformations. The VS 222 may comprise shader programs or code executing on a programmable unit (for example a core processor or engine in the GPU 114). In some embodiments the VS 222 may be implemented as a fixed-function unit. In particular, objects are transformed from object space to world space and screen space to form triangles. The triangles are passed to the TSU 224, which assembles primitives and performs well-known tasks such as generating bounding boxes, culling, generating edge functions and triangle-level rejection, among other well-known functions. The TSU 224 passes data to the STG unit 226, which provides tile generation, so that data objects can be divided into tiles (for example 8x8 or 16x16) and passed to the ZL1 unit 228. The ZL1 unit 228, like the ZL2 unit 232 and the ZL3 unit 244, performs z-value processing, for example high-level rejection of z-values (high-level rejection consumes fewer bits than low-level rejection). The ZL units 228, 232 and 244 operate in conjunction with the ZL1 cache 230, the Z cache 234 and the Z cache 234, respectively. The PS 240 may comprise a shader executing on a programmable unit (for example a core processor or engine in the GPU 114) that receives texture and pipelined data and provides outputs to the D unit 246 and the ZL3 unit 244. In some embodiments the PS 240 may comprise a fixed-function unit. The D unit 246 and the ZL3 unit 244 are configured to perform alpha testing and stencil testing before values in the Z cache 234 or the D cache 248 are updated.
The P units 236 and 238 process packets (for example, performing the segregation and swap functions described below) corresponding to the z-data and s-data stored in the Z cache 234 and the T cache 242, respectively. For example, a host application may request processing of a surface derived from z-data only (excluding s-data). The host application's request is communicated to the GPU 114 by the driver software 150 via the BIU 118. The driver software 150 programs registers in the GPU 114 and instructs a core processor (for example an engine) in the GPU 114 to enable this z-only format. The core processor generates a mask according to the instructions relayed by the driver software 150 from the host application and stores the mask in at least one register accessible to the P units 236 and 238, so that the P unit 236 and/or 238 performs the segregation or swap functions before outputting the required packet format (that is, the z-only format) via the BIU 118 or the MIU 110. For example, in response to a read request sent to the BIU 118, the P unit 238 receives data from the BIU 118 in the pre-processing packet format (see Fig. 3, where the packet is denoted 300). The packet addresses associated with the read (or write) operation may be generated by a core processing unit (for example an engine) in the GPU 114. With reference to Fig. 3, the packet 300 contains two different types of data: stencil (s) data 303 and depth (z) data 305. In this embodiment, three consecutive bytes of z-data 305 (for example z0, z0, z0) are paired with a single byte of stencil (s) data 303 (for example s0); each block of z-data or s-data in Figs. 3 and 4 represents one byte. The P unit 238 performs a masking operation on the entire packet content using the byte mask 302 and swaps the data to form a pixel packet 400 in the post-processing packet format, which comprises two separate contiguous groups 402 (z-data) and 404 (s-data), as shown in Fig. 4. The P unit 238 writes at least one of the groups 402 and 404 to the T cache 242. Note that the P unit 238 may write both the s-data and the z-data to the T cache 242, but in this embodiment such a write occurs only in the mixed format shown in Fig. 3 (for example packet 300).
As for the P unit 236, the data in the Z cache 234 is in the pre-processing packet format shown by the packet 300 of Fig. 3. For example, in response to a write request to the BIU 118, the P unit 236 performs a masking operation on the packet 300 stored in the Z cache 234 and swaps the packet's data. After the masking and swap operations, the data is in the format of the pixel packet 400, that is, the post-processing packet format shown in Fig. 4. The stages labeled A and B in Fig. 2A are illustrated in Figs. 3 and 4, respectively.
With reference to Fig. 3, the packet 300 represents the packet labeled A in Fig. 2A (the pre-processing packet format). As described above, the repeating pattern in the packet 300 combines at least two different types of data, for example three consecutive bytes of z-data 305 (z0, z0, z0) followed by a single byte of stencil data 303 (s0). In operation, when a write is required for the z-data 305 only (excluding the s-data 303), the P unit 236 (or the P unit 238; the P unit 236 is used for illustration here) can apply byte enables to the entire packet content 300. That is, the P unit 236 applies the byte mask 302 to the whole packet 300, with a bit pattern that disables the s-data 303 and enables the z-data 305. The P unit 236 therefore uses a byte mask 302 with the bit pattern 11101110...1110. In other words, the P unit 236 places a 0 at every fourth bit position. A mask bit of 0 represents the disable function (the masked byte keeps its original value); a mask bit of 1 represents the enable function, which allows the corresponding byte to pass. Those skilled in the art will understand that in some embodiments the mask bit values may have the opposite meaning (1 for disable and 0 for enable).
Note that when a write is desired for the s-data 303 only (excluding the z-data 305), the bit pattern can be inverted (as shown by the byte mask 304), that is, 00010001...0001. Furthermore, when all bytes are to pass, the mask bits can all be 1 (not shown). The P units 236 and 238 can thus use a byte mask to write selectively to any combination of bytes in the packet content 300, including contiguous bytes.
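The mask patterns described above can be sketched in Python as follows. This is a minimal illustration, assuming one mask bit per packet byte, a 64-byte packet and the three-z-bytes-then-one-s-byte layout of Fig. 3; the helper names `build_mask`, `invert_mask` and `masked_write` are hypothetical and do not come from the disclosure.

```python
PACKET_BYTES = 64  # assumed packet size (48 z bytes + 16 s bytes)


def build_mask(length=PACKET_BYTES):
    """Mask 302 style: enable (1) z bytes, disable (0) every fourth byte (s)."""
    return [0 if i % 4 == 3 else 1 for i in range(length)]


def invert_mask(mask):
    """Mask 304 style: flip every bit so that only the s bytes are enabled."""
    return [1 - bit for bit in mask]


def masked_write(dest, src, mask):
    """Copy src[i] into dest[i] only where mask[i] == 1; other bytes keep their value."""
    for i, bit in enumerate(mask):
        if bit:
            dest[i] = src[i]
    return dest


z_mask = build_mask()            # 1110 1110 ... 1110
s_mask = invert_mask(z_mask)     # 0001 0001 ... 0001
all_mask = [1] * PACKET_BYTES    # every byte passes

# Illustrative source packet: z bytes are 0xAA, s bytes are 0x55.
src = bytes([0xAA, 0xAA, 0xAA, 0x55] * 16)
dest = masked_write(bytearray(PACKET_BYTES), src, z_mask)
```

In this sketch the destination bytes at the s positions stay at their initial value, which mirrors the disable behaviour of a 0 mask bit.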
Fig. 4 shows the pixel packet 400, labeled B in Fig. 2A, in the post-processing packet format, which comprises two contiguous groups: the z-data group 402 and the s-data group 404. The pixel packet 400 can be written to the local memory 106 via the MIU 110 or the BIU 118 (or, in response to a write request to the BIU 118, the P unit 238 can write one of the groups 402 and 404 to the T cache 242, or both the s-data and the z-data can be written to the T cache 242 in the mixed format described above). As shown, the z-data group 402 and the s-data group 404 are separated from each other so that contiguous bits or bytes of data can be written selectively. In this embodiment, all the z-data is moved (for example swapped) to the front of the packet as group 402 (for example the first 48 bytes), and all the s-data is moved to the end of the packet as group 404 (for example the last 16 bytes). For example, because their mask bits are 0, the 16 bytes occupied by the s-data group 404 are left untouched. As described with reference to Fig. 2A, the 48 byte enables whose mask bits are 1 cause only the z-data group 402 to be written (excluding the s-data group 404). Likewise, when a write to the s-data group 404 is desired, the mask bit values can be reversed, with the 1s and 0s exchanged.
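The swap step and the resulting contiguous write can be sketched as follows. This Python sketch assumes a 64-byte packet in the interleaved format of Fig. 3 and models the swap as an order-preserving regrouping; the actual swap logic of the P units is not specified at this level of detail, and the names `segregate` and `write_z_group` are hypothetical.

```python
def segregate(packet):
    """Move all z bytes to the front and all s bytes to the end, preserving order."""
    z_group = bytes(b for i, b in enumerate(packet) if i % 4 != 3)
    s_group = bytes(b for i, b in enumerate(packet) if i % 4 == 3)
    return z_group + s_group, len(z_group)


def write_z_group(dest, post_packet, z_len):
    """Write only the leading z group: the same effect as a mask of z_len ones then zeros."""
    dest[:z_len] = post_packet[:z_len]
    return dest


# Illustrative pre-processing packet: pixel k has z bytes (k, k, k) and s byte 0x80 + k.
pre = bytes(v for k in range(16) for v in (k, k, k, 0x80 + k))

post, z_len = segregate(pre)                          # 48 z bytes, then 16 s bytes
dest = write_z_group(bytearray(len(post)), post, z_len)
```

Because the z bytes are contiguous after the swap, the write in this sketch is a single slice copy rather than sixteen three-byte writes.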
Fig. 5 shows a packet processing method 100b according to an embodiment of the invention; the method 100b is carried out under the control of the driver software 150 in combination with the P units 236 and/or 238. The method comprises receiving, over a PCIE connection, a packet having at least data of a first type and data of a second type (502), and segregating the entire packet into two contiguous groups, the first containing the first type of data and the second containing the second type of data (504).
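The two steps of Fig. 5 can be sketched generically in Python. The sketch assumes a caller-supplied predicate that says which byte positions hold the first type of data; `receive_packet`, `segregate_packet` and the in-memory queue standing in for the PCIE connection are hypothetical placeholders for hardware behaviour that the disclosure does not detail.

```python
from collections import deque


def receive_packet(link):
    """Step 502: take the next packet from a queue standing in for the PCIE connection."""
    return link.popleft()


def segregate_packet(packet, is_first_type):
    """Step 504: split the whole packet into two contiguous groups by data type."""
    first = bytes(b for i, b in enumerate(packet) if is_first_type(i))
    second = bytes(b for i, b in enumerate(packet) if not is_first_type(i))
    return first, second


# Illustrative packet in which the letters mark the data type of each byte.
link = deque([b"zzzszzzs"])
packet = receive_packet(link)
z_group, s_group = segregate_packet(packet, lambda i: i % 4 != 3)
```

Passing the layout as a predicate keeps the sketch independent of the z/s pairing, in line with the statement that other data types can be handled in the same way.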
Any step description or block in the flowchart of Fig. 5 should be understood as representing a module, segment or portion of code that includes at least one executable instruction for implementing a specific logical function or step in the process. Those skilled in the art will understand that alternative embodiments in which functions are executed out of the order described (including concurrently or in reverse order) also fall within the scope of the invention.
Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the scope of the invention. Anyone skilled in the art may make changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the appended claims.