CN105183433A

CN105183433A - Instruction combination method and apparatus with multiple data channels

Info

Publication number: CN105183433A
Application number: CN201510521991.2A
Authority: CN
Inventors: 张淮声; 洪洲; 齐恒
Original assignee: Shanghai Zhaoxin Integrated Circuit Co Ltd
Current assignee: Glenfly Tech Co Ltd
Priority date: 2015-08-24
Filing date: 2015-08-24
Publication date: 2015-12-23
Anticipated expiration: 2035-08-24
Also published as: EP3136228B1; EP3136228A1; TWI552081B; EP4258110A3; TW201709060A; US9904550B2; EP4258110A2; US20170060594A1; CN105183433B

Abstract

An embodiment of the invention provides an instruction combination method executed by a compiler. The method comprises: obtaining a plurality of first instructions, wherein each first instruction performs one of a calculation operation, a comparison operation, a logic operation, a selection operation, a conditional branch operation, a load/store operation, a sampling operation and a complex mathematical operation; combining the first instructions according to the data dependencies among them; and sending the combined instruction to a stream processor.

Description

Instruction combination method and apparatus with multiple data channels

Technical field

The present invention relates to graphics processing unit technology, and in particular to an instruction combination method and an apparatus with multiple data channels.

Background art

The architecture of a graphics processing unit usually has hundreds of basic shader processing units (basic shader processing units), also called stream processors (stream processors). In each cycle, each stream processor processes an instruction of one single-instruction-multiple-data (SIMD, Single Instruction Multiple Data) thread, and in the next cycle it processes another SIMD thread. The performance of a graphics processing unit is affected by two key factors: the number of stream processors and the capability of each stream processor. The present invention therefore proposes an instruction combination method and an apparatus with multiple data channels to improve the capability of the stream processor.

Summary of the invention

An embodiment of the invention proposes an instruction combination method performed by a compiler. A plurality of first instructions is obtained, wherein each first instruction performs one of a calculation operation, a compare operation, a logic operation, a selection operation, a conditional branch operation, a load/store operation, a sampling operation and a complex mathematical operation. The first instructions are combined according to the data dependencies among them, and the combined instruction is sent to a stream processor.

An embodiment of the invention further proposes an apparatus with multiple data channels, comprising a data fetch unit, a bypass channel and a main channel. The bypass channel is coupled to a general-purpose register, a constant buffer and the data fetch unit. The main channel is coupled to the data fetch unit and the bypass channel, and comprises an arithmetic unit, a compare/logic unit and a post-PROC unit arranged in sequence. The arithmetic unit, the compare/logic unit and the post-PROC unit are coupled in sequence, and each of them is coupled to the bypass channel.

Brief description of the drawings

Fig. 1 is a hardware architecture diagram of a three-dimensional graphics processing apparatus according to an embodiment of the invention.

Fig. 2 is a flowchart of an instruction combination method according to an embodiment of the invention.

Figs. 3A and 3B are hardware architecture diagrams of a three-dimensional graphics processing apparatus according to an embodiment of the invention.

Fig. 4 is a flowchart of an instruction combination method according to an embodiment of the invention.

Fig. 5 is a hardware architecture diagram of a three-dimensional graphics processing apparatus according to an embodiment of the invention.

Detailed description

The following describes preferred implementations of the invention. Its purpose is to describe the essential spirit of the invention, not to limit it. The actual scope of the invention must be determined with reference to the claims that follow.

It must be understood that words such as "comprise" and "include" used in this specification indicate the presence of specific technical features, values, method steps, operations, elements and/or components, but do not exclude the addition of further technical features, values, method steps, operations, elements, components, or any combination thereof.

Words such as "first", "second" and "third" are used in the claims to modify claim elements. They do not indicate any priority or precedence between the elements, that one element precedes another, or the temporal order in which method steps are performed; they serve only to distinguish elements having the same name.

A conventional stream processor performs simple operations such as calculation (Algorithm), compare (Compare), logic (Logic), selection (Selection) and conditional branch (Branch). However, the data dependencies among the instructions of a shader are high, so the stream processor must read or write the general-purpose registers (CR, Common Registers) frequently. These data dependencies may consume a large amount of general-purpose register bandwidth and create a system bottleneck. In addition, read-after-write (RAW, Read After Write) hazards on the general-purpose registers may degrade instruction execution performance. A shader usually uses move (Move) instructions to initialize general-purpose registers, which requires the stream processor to transfer data or constants from one general-purpose register or constant buffer (CB, Constant Buffer) to another general-purpose register or constant buffer. For post-processing operations, such as those of the load/store unit (LD/ST, Load/Store Unit) for data access, the sampling unit (SMP, Sample Unit) for texture sampling, or the special function unit (SFU, Special Function Unit), the stream processor is responsible for reading one or more source values from the general-purpose registers or constant buffers and passing them to the corresponding post-processing unit. In these cases the data or constants do not need any calculation, which leaves the stream processor inefficient to some degree.

The embodiment of the invention proposes a new architecture that uses two data paths in one stream processor to improve its performance. The first path, called the main channel (Main-pipe), handles operations such as calculation (Algorithm), compare (Compare), logic (Logic), selection (Selection) and conditional branch (Branch). The second path, called the bypass channel (Bypass-pipe), reads data or constants from a general-purpose register or constant buffer and passes them to a general-purpose register, a constant buffer or a post-processing unit.

Fig. 1 is a hardware architecture diagram of a three-dimensional graphics processing apparatus according to an embodiment of the invention. A stream processor uses two channels to execute instructions: the main channel and the bypass channel. The main channel may comprise three ordered stages: a calculation stage (ALG, Algorithm Stage); a post-logic stage (Post-LGC, Logic), comprising operations such as compare and/or logic; and a post-process stage (Post-PROC, Process), comprising operations such as selection, conditional branch and/or result write-back. The result produced by one stage can be carried to the next stage. The final result can be stored in a general-purpose register or output to a post-processing unit. Specifically, the instruction decoding unit 120 decodes an instruction request 110 sent from a compiler and indicates the general-purpose register address 121 and/or the constant buffer address 123 from which data or constants are to be obtained. The instruction decoding unit 120 can obtain the operation code (Opcode, Operation Code) in the instruction request 110. The data fetch unit 130 obtains the data 133 at the general-purpose register address 121 and/or the constant 135 at the constant buffer address 123 and, if needed, indicates the general-purpose register address 131 for write-back. The obtained data 133 and/or constant 135 may also be called operands (Operand). The arithmetic unit 140 performs a calculation operation on the obtained data. Calculation operations include addition, subtraction, multiplication, division, left shift, right shift and so on. The compare/logic unit 150 may perform a compare or logic operation on the result produced by the arithmetic unit 140. Compare operations include maximum, minimum, numeric comparison and so on; logic operations include AND, OR, NOT, NOR, XOR and so on. The post-PROC unit 160 may, according to the operation result of the compare/logic unit 150, write the computed data back to a general-purpose register or pass it to one of the load/store unit 171, the sampling unit 173 and the special function unit 175. The special function unit 175 performs complex mathematical operations such as sine (SIN), cosine (COS) and square root (SQRT). The bypass channel may comprise a bypass unit 180 that transfers data or a constant 181 from one general-purpose register or constant buffer to another general-purpose register or a post-processing unit.
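The flow through the two channels can be illustrated with a short Python sketch. This is a toy model under stated assumptions, not the hardware of Fig. 1: the function names (`main_channel`, `special_function_unit`, `bypass_channel`), the dictionary standing in for the general-purpose registers, the register names and the particular operation chosen for each stage are all hypothetical.

```python
import math


def main_channel(a, b, c):
    """Hypothetical main channel: ALG -> Post-LGC -> Post-PROC, in order."""
    alg_out = a * b                    # ALG stage: calculation on fetched operands
    flag = alg_out > c                 # Post-LGC stage: compare the forwarded result
    return alg_out if flag else c      # Post-PROC stage: selection


def special_function_unit(value):
    """Hypothetical post-processing unit doing a complex math operation."""
    return math.sqrt(value)


def bypass_channel(registers, src, dst):
    """Hypothetical bypass channel: moves a value without any calculation."""
    registers[dst] = registers[src]


# Register names and values are illustrative only.
registers = {"R1": 3.0, "R2": 12.0, "R3": 20.0, "R9": 7.0}

# Main channel result is handed to a post-processing unit, then written back.
stage_result = main_channel(registers["R1"], registers["R2"], registers["R3"])
registers["R0"] = special_function_unit(stage_result)

# The bypass channel moves an unrelated value without using the main channel.
bypass_channel(registers, "R9", "R10")
```

In this sketch the final main-channel result goes to a post-processing unit; it could equally be written straight back to a general-purpose register. Sequential Python cannot show that the two channels work side by side, so the sketch only separates their responsibilities.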

The instructions used by the compiler can be divided into three classes: main-channel instructions, bypass-channel instructions and post-processing unit instructions. Main-channel instructions comprise the instructions used in the calculation stage, the post-logic stage and the post-process stage. The calculation stage may use the following instructions (ALG instructions): FMAD, FADD, FMUL, IMUL24, IMUL16, IADD, SHL, SHR, FRC, FMOV, IMUL32I, IMUL24I, IADDI, IMAXI, IMINI, SHLI and SHRI, etc. The post-logic stage may use the following instructions (CMP/LGC instructions): IMAX, IMIN, FCMP, ICMP, IMAXI, IMINI, NOR, AND, OR, XOR, ICMPI, NORI, ANDI, ORI and XORI, etc. The post-process stage may use the following instructions (SEL/Branch instructions): SEL, B, BL, IFANY and IFALL, etc. Post-processing unit instructions comprise the instructions used in the load/store unit 171, the sampling unit 173 and the special function unit 175. The load/store unit 171 may use the following instructions (LS instructions): LDU, STU, REDU, LDT, STT, REDUT, GLD, GST and GREDU, etc. The sampling unit 173 may use the following instructions (SMP instructions): SAMPLE, SAMPLE_B, SAMPLE_L, SAMPLE_C, SAMPLE_D, SAMPLE_GTH and SAMPLE_FTH, etc. The special function unit 175 may use the following instructions (SFU instructions): RCP, RSQ, LOG, EXP, SIN, COS and SQRT, etc. Bypass-channel instructions may comprise the following instructions: MOV and MOVIMM, etc.
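The instruction classes above lend themselves to a lookup table. The following Python sketch is only one possible way a compiler front end might record them; the names `INSTRUCTION_GROUPS`, `CLASS_OF_GROUP`, `groups_of` and the group key `BYPASS` are assumptions introduced here. The opcode lists are copied from the text, which marks them as non-exhaustive ("etc."), and IMAXI and IMINI are kept in both groups in which the text lists them.

```python
# Opcode groups as enumerated in the description (non-exhaustive).
INSTRUCTION_GROUPS = {
    "ALG": {"FMAD", "FADD", "FMUL", "IMUL24", "IMUL16", "IADD", "SHL", "SHR",
            "FRC", "FMOV", "IMUL32I", "IMUL24I", "IADDI", "IMAXI", "IMINI",
            "SHLI", "SHRI"},
    "CMP/LGC": {"IMAX", "IMIN", "FCMP", "ICMP", "IMAXI", "IMINI", "NOR", "AND",
                "OR", "XOR", "ICMPI", "NORI", "ANDI", "ORI", "XORI"},
    "SEL/Branch": {"SEL", "B", "BL", "IFANY", "IFALL"},
    "LS": {"LDU", "STU", "REDU", "LDT", "STT", "REDUT", "GLD", "GST", "GREDU"},
    "SMP": {"SAMPLE", "SAMPLE_B", "SAMPLE_L", "SAMPLE_C", "SAMPLE_D",
            "SAMPLE_GTH", "SAMPLE_FTH"},
    "SFU": {"RCP", "RSQ", "LOG", "EXP", "SIN", "COS", "SQRT"},
    "BYPASS": {"MOV", "MOVIMM"},   # hypothetical key for bypass-channel opcodes
}

# Which of the three instruction classes each group belongs to.
CLASS_OF_GROUP = {
    "ALG": "main channel",
    "CMP/LGC": "main channel",
    "SEL/Branch": "main channel",
    "LS": "post-processing unit",
    "SMP": "post-processing unit",
    "SFU": "post-processing unit",
    "BYPASS": "bypass channel",
}


def groups_of(opcode):
    """Return every group that lists the opcode, in alphabetical order."""
    return sorted(g for g, ops in INSTRUCTION_GROUPS.items() if opcode in ops)


def classes_of(opcode):
    """Return the instruction classes (channel or unit) an opcode falls into."""
    return sorted({CLASS_OF_GROUP[g] for g in groups_of(opcode)})
```

For instance, `groups_of("IMAXI")` returns both `"ALG"` and `"CMP/LGC"`, reflecting the overlap in the lists above, while `classes_of("IMAXI")` still resolves to the main channel only.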

With the architecture described above, the compiler can combine multiple main-channel instructions and a post-processing unit instruction into one instruction according to data dependencies, which is called static combine (Static Combine). Fig. 2 is a flowchart of an instruction combination method according to an embodiment of the invention. The compiler obtains multiple main-channel instructions and, if needed, one post-processing unit instruction (step S210), combines them according to the data dependencies among the instructions (step S230), and sends the combined instruction to the instruction decoding unit 120 (step S250). In step S230, the compiler may perform static combine according to the following rules:

ALG+CMP+SEL;

ALG+CMP+SEL+SFU/LS/SMP;

ALG+CMP+Branch;

ALG+LGC+SEL;

ALG+LGC+SEL+SFU/LS/SMP; or

ALG+LGC+Branch.
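The six patterns above can be checked mechanically. The sketch below is a hedged illustration rather than the compiler's actual logic: the function name `is_static_combinable` and the representation of an instruction sequence as a list of category tags are assumptions, and the check covers only the ordering of categories, not the data-dependency analysis that the method also requires.

```python
def is_static_combinable(categories):
    """Return True if the category sequence matches one of the six patterns.

    `categories` is a sequence of tags such as ["ALG", "CMP", "SEL"].
    Only the pattern shape is checked; data dependencies are not analysed.
    """
    cats = tuple(categories)
    # Every pattern starts with ALG followed by either CMP or LGC.
    if len(cats) not in (3, 4) or cats[0] != "ALG" or cats[1] not in ("CMP", "LGC"):
        return False
    if len(cats) == 3:
        # Three-part patterns end in a selection or a conditional branch.
        return cats[2] in ("SEL", "Branch")
    # Four-part patterns are SEL followed by one post-processing unit instruction.
    return cats[2] == "SEL" and cats[3] in ("SFU", "LS", "SMP")
```

A sequence such as `["ALG", "CMP", "Branch", "SFU"]` is rejected because none of the listed patterns appends a post-processing unit instruction after a branch.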

In these rules, ALG represents a calculation instruction, CMP a compare instruction, LGC a logic instruction, SEL a selection instruction, Branch a conditional branch instruction, SFU a mathematical operation instruction, LS a load/store instruction, and SMP a sampling instruction. An example of static combine is given below. The pseudo-code of the shader is as follows:

The compiler statically combines the above pseudo-code and translates it into the following machine code:

The first machine code instruction adds the value of general-purpose register R4 to the value of general-purpose register R8 (that is, the values of variables a and b) and sends the result to the next stage, where "SFWD" is the symbol indicating that the result (that is, the value of variable x) is sent to the next stage. The second machine code instruction compares the value of register R5 (that is, the value of variable c) with the value passed down from the previous stage (that is, the value of variable x) and stores the comparison result in flag P1, where "+" indicates that this instruction is combined with the previous instruction and "SFWD" is the symbol for the value passed down from the previous stage. If the value of variable x is greater than the value of variable c, flag P1 is set to "1"; otherwise it is set to "0". The third machine code instruction writes to register R0 either the value passed down from the previous stage (the value of variable x) or the value of register R5 (the value of variable c), according to flag P1, where "+" indicates that this instruction is combined with the previous instruction. If flag P1 is "1", the value of register R5 (the value of variable c) is written to register R0; otherwise the value passed down from the previous stage (the value of variable x) is written.
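The behaviour just described can be traced in a small Python model. Several things here are assumptions: the shader logic is inferred from the description alone (x = a + b, and the result is c when x is greater than c, otherwise x), since neither listing is reproduced in the text; the class `CountingRegisters`, both functions and the scratch register R9 used by the uncombined version are hypothetical; and the access counts belong to this toy model only, not to measured hardware.

```python
class CountingRegisters:
    """Toy general-purpose register file that counts its reads and writes."""

    def __init__(self, values):
        self.values = dict(values)
        self.reads = 0
        self.writes = 0

    def read(self, name):
        self.reads += 1
        return self.values[name]

    def write(self, name, value):
        self.writes += 1
        self.values[name] = value


def run_uncombined(cr):
    # Three separate instructions; x round-trips through scratch register R9
    # (R9 is a hypothetical choice, not taken from the source).
    cr.write("R9", cr.read("R4") + cr.read("R8"))           # x = a + b
    p1 = cr.read("R9") > cr.read("R5")                      # P1 = (x > c)
    cr.write("R0", cr.read("R5") if p1 else cr.read("R9"))  # R0 = c or x


def run_combined(cr):
    # One combined instruction; x travels between stages as SFWD.
    sfwd = cr.read("R4") + cr.read("R8")   # calculation stage: x = a + b
    c = cr.read("R5")                      # c supplied once to the later stages
    p1 = sfwd > c                          # compare stage sets flag P1
    cr.write("R0", c if p1 else sfwd)      # selection stage writes R0


initial = {"R4": 1.5, "R8": 2.5, "R5": 3.0}   # a, b, c (illustrative values)
separate = CountingRegisters(initial)
combined = CountingRegisters(initial)
run_uncombined(separate)
run_combined(combined)
```

Both versions leave the same value in R0. In this model the uncombined version performs five register reads and two writes, while the combined version performs three reads and one write, because x never leaves the pipeline. That is the kind of general-purpose register traffic and read-after-write exposure the forwarding symbol "SFWD" is meant to avoid.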

So that the graphics processing unit can execute combined instructions, the computing units (calculation units) in the architecture can be adjusted. Refer to the static combine example described above. Figs. 3A and 3B are hardware architecture diagrams of a three-dimensional graphics processing apparatus according to an embodiment of the invention. The adder 310 obtains the value of register R4 and the value of general-purpose register R8 (that is, the values of variables a and b) through the pre-arithmetic logic unit (Pre-ALU, Arithmetic Logic Unit), and passes the computed result to the comparator 330 through the normalizer (Normalizer) and the formatter (Formatter). The comparator 330 receives the result produced by the arithmetic unit 140 (the previous stage), obtains the value of register R5 (that is, the value of variable c) through the bypass unit 180, and then compares the two. According to the comparison result 351 (the output of the previous stage), the selection unit 350 writes back to register R0 one of the result 353 produced by the adder 310 (that is, the value of variable x) and the value 355 of general-purpose register R5 obtained from the bypass unit 180. To execute combined instructions, the computing units in the arithmetic unit 140 are coupled to the data fetch unit 130 and the bypass unit 180 to obtain operands, and are coupled to the computing units of the compare/logic unit 150 and the post-PROC unit 160, such as the comparator 330 and the selection unit 350, to output results to subsequent stages. The computing units in the compare/logic unit 150 are coupled to the arithmetic unit 140 and the bypass unit 180 to obtain operands, and are coupled to the computing units of the post-PROC unit 160, such as the selection unit 350, to output results to the subsequent stage. The computing units in the post-PROC unit 160, such as the selection unit 350, are coupled to the arithmetic unit 140, the compare/logic unit 150 and the bypass unit 180 to obtain operands, and are coupled to the general-purpose registers, the load/store unit 171, the sampling unit 173 and the special function unit 175 to write data back to a general-purpose register or output results to a post-processing unit. Besides the selection unit 350, the post-PROC unit 160 may also comprise a conditional branch unit and a determination unit that decides whether to write a result back to a general-purpose register.

With the architecture described above, the compiler can also combine a bypass-channel instruction with main-channel instructions and/or a post-processing unit instruction into one instruction, which is called bypassed combine (Bypassed Combine). Fig. 4 is a flowchart of an instruction combination method according to an embodiment of the invention. The compiler obtains multiple instructions (step S410), combines one bypass-channel instruction with at least one of a main-channel instruction and a post-processing unit instruction (step S430), and sends the combined instruction to the instruction decoding unit 120 (step S450). In step S430, the combination of multiple main-channel instructions must still obey the rules described above. In other words, step S430 can be regarded as further combining the combination result of multiple main-channel instructions with one bypass-channel instruction and/or one post-processing unit instruction. An example of bypassed combine is given below; the machine code is as follows:

The first machine code instruction stores the value of general-purpose register R4 into general-purpose register R0. The second machine code instruction adds the value of general-purpose register R4 to the value of general-purpose register R8 and sends the result to the next stage, where "SFWD" is the symbol indicating that the result is sent to the next stage. The third machine code instruction takes the reciprocal (reciprocal) of the value passed down from the previous stage and writes the result to register R7, where "SFWD" is the symbol for the value passed down from the previous stage.

Fig. 5 is a hardware architecture diagram of a three-dimensional graphics processing apparatus according to an embodiment of the invention. The bypass unit 180 receives an indication from the data fetch unit 130, reads the value of general-purpose register R4 and writes it to general-purpose register R0. Meanwhile, the adder 510 obtains the value of general-purpose register R4 and the value of general-purpose register R8 through the pre-arithmetic logic unit and passes the computed result to the special function unit 175. The special function unit 175 then computes the reciprocal and writes it to general-purpose register R7. Because main-channel instructions and post-processing unit instructions can be processed in parallel while a bypass-channel instruction is being processed, the performance of the stream processor can be further improved.
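The bypassed combine example can be traced with a minimal Python sketch. The function name `run_bypassed_combine`, the dictionary used as a register file and the sample values are assumptions; the machine code listing is not reproduced in the text, so the sketch follows only the prose description. Sequential Python cannot express true parallelism, so all operands are read from a snapshot first to mimic both channels fetching before any write-back.

```python
def run_bypassed_combine(registers):
    """Hypothetical model of a move, an add and a reciprocal issued together."""
    fetched = dict(registers)          # operands fetched before any write-back

    # Bypass channel: copy R4 to R0 without any calculation.
    moved = fetched["R4"]

    # Main channel, calculation stage: R4 + R8, forwarded (SFWD) rather than
    # written to a general-purpose register.
    sfwd = fetched["R4"] + fetched["R8"]

    # Post-processing unit (special function unit): reciprocal of SFWD.
    reciprocal = 1.0 / sfwd

    # Write-back: the bypass result goes to R0, the SFU result to R7.
    registers["R0"] = moved
    registers["R7"] = reciprocal
    return registers


regs = run_bypassed_combine({"R4": 3.0, "R8": 1.0})
```

With these sample values R0 receives 3.0 and R7 receives 0.25. The point of the design is that the move occupies only the bypass channel, leaving the main channel and the special function unit free to work on the add and the reciprocal during the same combined instruction.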

Refer to Fig. 3 and Fig. 5. In general, to execute combined instructions, the arithmetic unit, the compare/logic unit and the post-PROC unit are coupled in sequence, and each of them is coupled to the bypass channel. Specifically, a first computing unit in the arithmetic unit 140 (for example, an adder, a multiplier or a divider) is coupled to the bypass unit 180 (that is, the bypass channel) and the data fetch unit 130 to obtain operands from the bypass unit 180 and/or the data fetch unit 130. A second computing unit in the compare/logic unit 150 (for example, a comparator or any of various logic gates) is coupled to the bypass unit 180 and the output of the first computing unit to obtain operands from the bypass unit 180 and/or the output of the first computing unit. A third computing unit in the post-PROC unit 160 (for example, a selection unit or a conditional branch unit) is coupled to the bypass unit 180, the output of the first computing unit and the output of the second computing unit to obtain operands from the bypass unit 180, the output of the first computing unit and/or the output of the second computing unit. The third computing unit is further coupled to the load/store unit 171, the sampling unit 173 and the special function unit 175 to output operation results to these post-processing units.
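The couplings listed in this paragraph can be written down as a small table. The Python sketch below is only a bookkeeping aid under assumptions: the names `OPERAND_SOURCES`, `POST_PROC_OUTPUTS`, `can_read` and the string labels for the units are hypothetical, and the table records only the couplings this paragraph names.

```python
# For each main-channel unit, the sources it is coupled to for operands.
OPERAND_SOURCES = {
    "arithmetic unit 140": {"bypass unit 180", "data fetch unit 130"},
    "compare/logic unit 150": {"bypass unit 180", "arithmetic unit 140"},
    "post-PROC unit 160": {"bypass unit 180", "arithmetic unit 140",
                           "compare/logic unit 150"},
}

# Post-processing units to which the third computing unit outputs results.
POST_PROC_OUTPUTS = {"load/store unit 171", "sampling unit 173",
                     "special function unit 175"}


def can_read(unit, source):
    """Return True if `unit` is listed as coupled to `source` for operands."""
    return source in OPERAND_SOURCES.get(unit, set())


# Every main-channel unit can take an operand from the bypass unit.
all_reach_bypass = all(can_read(u, "bypass unit 180") for u in OPERAND_SOURCES)
```

The table makes the sequential structure visible: each unit can take operands from every earlier unit in the chain and from the bypass unit, but never from a later unit.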

Although Figs. 1, 3 and 5 contain the components described above, the use of additional components to achieve better technical effects is not excluded, provided the spirit of the invention is not violated. In addition, although the processing steps of Figs. 2 and 4 are performed in a specific order, those skilled in the art may, without departing from the spirit of the invention, change the order of these steps while achieving the same effect. The invention is therefore not limited to the order described above.

Although the invention has been described using the above embodiments, these descriptions are not intended to limit the invention. On the contrary, the invention covers modifications and similar arrangements that are apparent to those skilled in the art. The claims must therefore be interpreted in the broadest manner so as to encompass all such apparent modifications and similar arrangements.

[Description of reference numerals]

110 instruction request; 120 instruction decoding unit;

121 general-purpose register address; 123 constant buffer address;

130 data fetch unit; 131 general-purpose register address;

133 data; 135 constant;

140 arithmetic unit; 150 compare/logic unit;

160 post-PROC unit; 171 load/store unit;

173 sampling unit; 175 special function unit;

180 bypass unit; 181 data or constant;

S210 ~ S250 method steps; 310 adder;

330 comparator; 351 comparison result;

353 result produced by the adder; 355 value of the general-purpose register;

S410 ~ S450 method steps; 510 adder.

Claims

1. An instruction combination method, performed by a compiler, comprising:

obtaining a plurality of first instructions, wherein each of the first instructions performs one of a calculation operation, a compare operation, a logic operation, a selection operation, a conditional branch operation, a load/store operation, a sampling operation and a complex mathematical operation;

combining the first instructions according to the data dependencies among the first instructions; and

sending the combined instruction to a stream processor.

2. The instruction combination method as claimed in claim 1, wherein the first instructions are combined according to the following rules:

ALG+CMP+SEL;

ALG+CMP+SEL+SFU/LS/SMP;

ALG+CMP+Branch;

ALG+LGC+SEL;

ALG+LGC+SEL+SFU/LS/SMP; or

ALG+LGC+Branch,

wherein ALG represents a calculation instruction, CMP represents a compare instruction, LGC represents a logic instruction, SEL represents a selection instruction, Branch represents a conditional branch instruction, SFU represents a mathematical operation instruction, LS represents a load/store instruction, and SMP represents a sampling instruction.

3. The instruction combination method as claimed in claim 1, further comprising:

obtaining a second instruction, wherein the second instruction transfers data from a general-purpose register or a constant buffer to another general-purpose register or a post-processing unit; and

further combining the combination result of the first instructions with the second instruction.

4. The instruction combination method as claimed in claim 1, wherein the stream processor comprises:

a data fetch unit;

a bypass channel, coupled to a general-purpose register, a constant buffer and the data fetch unit; and

a main channel, coupled to the data fetch unit and the bypass channel, comprising an arithmetic unit, a compare/logic unit and a post-PROC unit,

wherein the arithmetic unit, the compare/logic unit and the post-PROC unit are coupled in sequence, and each of the arithmetic unit, the compare/logic unit and the post-PROC unit is coupled to the bypass channel.

5. The instruction combination method as claimed in claim 4, wherein a first computing unit in the arithmetic unit is coupled to the bypass channel and the data fetch unit to obtain operands from the bypass channel and/or the data fetch unit; a second computing unit in the compare/logic unit is coupled to the bypass channel and a first output of the first computing unit to obtain operands from the bypass channel and/or the first output; and a third computing unit in the post-PROC unit is coupled to the bypass channel, the first output of the first computing unit and a second output of the second computing unit to obtain operands from the bypass channel, the first output and/or the second output.

6. An apparatus with multiple data channels, comprising:

a data fetch unit;

a main channel, coupled to the data fetch unit and the bypass channel, comprising an arithmetic unit, a compare/logic unit and a post-PROC unit arranged in sequence,

7. The apparatus with multiple data channels as claimed in claim 6, wherein a first computing unit in the arithmetic unit is coupled to the bypass channel and the data fetch unit to obtain operands from the bypass channel and/or the data fetch unit; a second computing unit in the compare/logic unit is coupled to the bypass channel and a first output of the first computing unit to obtain operands from the bypass channel and/or the first output; and a third computing unit in the post-PROC unit is coupled to the bypass channel, the first output of the first computing unit and a second output of the second computing unit to obtain operands from the bypass channel, the first output and/or the second output.

8. The apparatus with multiple data channels as claimed in claim 7, wherein the third computing unit is coupled to a load/store unit, a sampling unit and a special function unit to output operation results.

9. The apparatus with multiple data channels as claimed in claim 8, wherein the load/store unit executes a load/store instruction, the sampling unit executes a sampling instruction, and the special function unit executes a mathematical operation instruction.

10. The apparatus with multiple data channels as claimed in claim 6, wherein the main channel executes a main-channel instruction and the bypass channel executes a bypass-channel instruction.

11. The apparatus with multiple data channels as claimed in claim 10, wherein the main-channel instruction and the bypass-channel instruction are executed in parallel.

12. The apparatus with multiple data channels as claimed in claim 10, wherein the main-channel instruction comprises a calculation instruction, a compare instruction, a logic instruction, a selection instruction and a conditional branch instruction, and the bypass-channel instruction comprises a move instruction.