CN105183433A - Instruction combination method and apparatus with multiple data channels - Google Patents

Instruction combination method and apparatus with multiple data channels

Publication number
CN105183433A
Authority
CN
China
Prior art keywords
mentioned
instruction
unit
bypass channel
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510521991.2A
Other languages
Chinese (zh)
Other versions
CN105183433B (en
Inventor
张淮声
洪洲
齐恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Glenfly Tech Co Ltd
Original Assignee
Shanghai Zhaoxin Integrated Circuit Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhaoxin Integrated Circuit Co Ltd filed Critical Shanghai Zhaoxin Integrated Circuit Co Ltd
Priority to CN201510521991.2A priority Critical patent/CN105183433B/en
Priority to US14/855,580 priority patent/US9904550B2/en
Priority to TW104132348A priority patent/TWI552081B/en
Priority to EP23181534.1A priority patent/EP4258110A3/en
Priority to EP15189469.8A priority patent/EP3136228B1/en
Publication of CN105183433A publication Critical patent/CN105183433A/en
Application granted granted Critical
Publication of CN105183433B publication Critical patent/CN105183433B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/445 Exploiting fine grain parallelism, i.e. parallelism at instruction level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30021 Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3005 Arrangements for executing specific machine instructions to perform operations for flow control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3005 Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30058 Conditional branch instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3017 Runtime instruction translation, e.g. macros
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181 Instruction operation extension or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802 Instruction prefetching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3853 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution of compound instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3893 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/447 Target code generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/456 Parallelism detection

Abstract

An embodiment of the invention provides an instruction combination method executed by a compiler. The method comprises: obtaining a plurality of first instructions, wherein each first instruction performs one of a calculation operation, a comparison operation, a logic operation, a selection operation, a conditional branch operation, a load/store operation, a sampling operation and a complex mathematical operation; combining the first instructions according to the data dependencies among them; and sending the combined instruction to a stream processor.

Description

Instruction combination method and apparatus with multiple data channels
Technical field
The present invention relates to graphics processing unit (GPU) technology, and in particular to an instruction combination method and an apparatus with multiple data channels.
Background
A graphics processing unit architecture typically has hundreds of basic shader processing units, also called stream processors. Each stream processor processes one instruction of a single-instruction-multiple-data (SIMD, Single Instruction Multiple Data) thread in each cycle, and then processes another SIMD thread in the next cycle. The performance of a graphics processing unit depends on two key factors: the number of stream processors and the capability of each stream processor. The present invention therefore proposes an instruction combination method and an apparatus with multiple data channels to improve the capability of the stream processor.
Summary of the invention
An embodiment of the invention provides an instruction combination method performed by a compiler. A plurality of first instructions is obtained, where each first instruction performs one of a calculation operation, a comparison operation, a logic operation, a selection operation, a conditional branch operation, a load/store operation, a sampling operation and a complex mathematical operation. The first instructions are combined according to the data dependencies between them, and the combined instruction is sent to a stream processor.
An embodiment of the invention further provides an apparatus with multiple data channels, comprising a data extraction unit, a bypass channel and a main channel. The bypass channel is coupled to a general-purpose register, a constant buffer and the data extraction unit. The main channel is coupled to the data extraction unit and the bypass channel, and comprises an arithmetic unit, a comparison/logic unit and a post-process unit. The arithmetic unit, the comparison/logic unit and the post-process unit are coupled in series, and each of them is coupled to the bypass channel.
Brief description of the drawings
Fig. 1 is a hardware architecture diagram of a three-dimensional graphics processing apparatus according to an embodiment of the invention.
Fig. 2 is a flowchart of an instruction combination method according to an embodiment of the invention.
Fig. 3A and 3B are hardware architecture diagrams of a three-dimensional graphics processing apparatus according to an embodiment of the invention.
Fig. 4 is a flowchart of an instruction combination method according to an embodiment of the invention.
Fig. 5 is a hardware architecture diagram of a three-dimensional graphics processing apparatus according to an embodiment of the invention.
Detailed description
The following describes preferred implementations of the invention. The description is intended to illustrate the essential spirit of the invention and not to limit it. The actual scope of the invention is defined by the claims that follow.
It should be understood that words such as "comprise" and "include" used in this specification indicate the presence of specific technical features, values, method steps, operations, elements and/or components, but do not exclude the addition of further technical features, values, method steps, operations, elements, components, or any combination thereof.
Words such as "first", "second" and "third" are used in the claims to modify claim elements. They do not indicate a priority, a precedence relation, that one element comes before another, or a temporal order in which method steps are performed; they serve only to distinguish elements that have the same name.
A conventional stream processor performs simple operations such as calculation (Algorithm), comparison (Compare), logic (Logic), selection (Selection) and conditional branching (Branch). However, the data dependencies between the instructions of a shader are high, so the stream processor must read and write the general-purpose registers (CR, Common Registers) frequently. These data dependencies may consume a large amount of general-purpose register bandwidth and create a system bottleneck. In addition, read-after-write (RAW, Read After Write) hazards on the general-purpose registers may degrade instruction execution performance. A shader typically uses move (Move) instructions to initialize general-purpose registers, which requires the stream processor to transfer data or constants from one general-purpose register or constant buffer (CB, Constant Buffer) to another general-purpose register or constant buffer. For a post-processing operation, such as one handled by the load/store unit (LD/ST, Load/Store Unit) for data access, the sampling unit (SMP, Sample Unit) for texture sampling, or the special function unit (SFU, Special Function Unit), the stream processor is responsible for reading one or more source values from the general-purpose registers or the constant buffer and passing them to the corresponding post-processing unit. In these cases the data or constants require no computation, which leaves the stream processor inefficient to some degree.
The embodiments of the invention propose a new architecture that uses two data paths within one stream processor to improve its performance. The first path, called the main channel (Main-pipe), performs operations such as calculation (Algorithm), comparison (Compare), logic (Logic), selection (Selection) and conditional branching (Branch). The second path, called the bypass channel (Bypass-pipe), reads data or constants from a general-purpose register or constant buffer and passes them to a general-purpose register, a constant buffer or a post-processing unit.
Fig. 1 is a hardware architecture diagram of a three-dimensional graphics processing apparatus according to an embodiment of the invention. A stream processor executes instructions through two channels: the main channel and the bypass channel. The main channel may comprise three sequential stages: the calculation stage (ALG, Algorithm Stage); the post-logic stage (Post-LGC, Logic), which covers comparison and/or logic operations; and the post-process stage (Post-PROC, Process), which covers selection, conditional branching and/or result write-back. The result produced by one stage can be carried into the next stage. The final result may be stored to a general-purpose register or output to a post-processing unit.
Specifically, the instruction decoding unit 120 decodes the instruction request 110 sent from the compiler and signals the general-purpose register address 121 and/or the constant buffer address 123 from which data or constants are to be obtained. The instruction decoding unit 120 may also obtain the operation code (Opcode, Operation Code) in the instruction request 110. The data extraction unit 130 obtains the data 133 at the general-purpose register address 121 and/or the constant 135 at the constant buffer address 123 and, if needed, signals the general-purpose register address 131 for write-back. The obtained data 133 and/or constant 135 are also called operands (Operand). The arithmetic unit 140 performs calculation operations on the obtained data, including addition, subtraction, multiplication, division, left shift and right shift. The comparison/logic unit 150 may perform a comparison or logic operation on the result produced by the arithmetic unit 140. Comparison operations include maximum, minimum and numeric comparison; logic operations include AND, OR, NOT, NOR and XOR. The post-process unit 160 may, according to the result of the comparison/logic unit 150, write the computed data back to a general-purpose register or pass it to one of the load/store unit 171, the sampling unit 173 and the special function unit 175. The special function unit 175 implements complex mathematical operations such as sine (SIN), cosine (COS) and square root (SQRT). The bypass channel may comprise the bypass unit 180, which transfers data or constants 181 from a general-purpose register or constant buffer to another general-purpose register or to a post-processing unit.
The instructions used by the compiler fall into three classes: main channel instructions, bypass channel instructions and post-processing unit instructions. Main channel instructions are those used in the calculation stage, the post-logic stage and the post-process stage. The calculation stage may use the following instructions (ALG instructions): FMAD, FADD, FMUL, IMUL24, IMUL16, IADD, SHL, SHR, FRC, FMOV, IMUL32I, IMUL24I, IADDI, IMAXI, IMINI, SHLI and SHRI, etc. The post-logic stage may use the following instructions (CMP/LGC instructions): IMAX, IMIN, FCMP, ICMP, IMAXI, IMINI, NOR, AND, OR, XOR, ICMPI, NORI, ANDI, ORI and XORI, etc. The post-process stage may use the following instructions (SEL/Branch instructions): SEL, B, BL, IFANY and IFALL, etc. Post-processing unit instructions are those used in the load/store unit 171, the sampling unit 173 and the special function unit 175. The load/store unit 171 may use the following instructions (LS instructions): LDU, STU, REDU, LDT, STT, REDUT, GLD, GST and GREDU, etc. The sampling unit 173 may use the following instructions (SMP instructions): SAMPLE, SAMPLE_B, SAMPLE_L, SAMPLE_C, SAMPLE_D, SAMPLE_GTH and SAMPLE_FTH, etc. The special function unit 175 may use the following instructions (SFU instructions): RCP, RSQ, LOG, EXP, SIN, COS and SQRT, etc. Bypass channel instructions may include MOV and MOVIMM, etc.
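The classification above can be pictured as a lookup table. The following is a minimal Python sketch, not part of the patent: the opcode lists are copied from the text (which ends each list with "etc.", so they are not exhaustive), while the dictionary keys and the helper names `classes_of` and `channel_of` are labels assumed here for illustration.

```python
# Opcode lists copied from the text; the keys are labels chosen for this sketch.
INSTRUCTION_CLASSES = {
    "ALG": {"FMAD", "FADD", "FMUL", "IMUL24", "IMUL16", "IADD", "SHL", "SHR",
            "FRC", "FMOV", "IMUL32I", "IMUL24I", "IADDI", "IMAXI", "IMINI",
            "SHLI", "SHRI"},
    "CMP/LGC": {"IMAX", "IMIN", "FCMP", "ICMP", "IMAXI", "IMINI", "NOR", "AND",
                "OR", "XOR", "ICMPI", "NORI", "ANDI", "ORI", "XORI"},
    "SEL/Branch": {"SEL", "B", "BL", "IFANY", "IFALL"},
    "LS": {"LDU", "STU", "REDU", "LDT", "STT", "REDUT", "GLD", "GST", "GREDU"},
    "SMP": {"SAMPLE", "SAMPLE_B", "SAMPLE_L", "SAMPLE_C", "SAMPLE_D",
            "SAMPLE_GTH", "SAMPLE_FTH"},
    "SFU": {"RCP", "RSQ", "LOG", "EXP", "SIN", "COS", "SQRT"},
    "BYPASS": {"MOV", "MOVIMM"},
}

MAIN_CHANNEL_CLASSES = {"ALG", "CMP/LGC", "SEL/Branch"}
POST_PROCESSING_CLASSES = {"LS", "SMP", "SFU"}


def classes_of(opcode):
    """Return every class whose list contains the opcode, in sorted order."""
    return sorted(name for name, ops in INSTRUCTION_CLASSES.items() if opcode in ops)


def channel_of(opcode):
    """Return 'main', 'bypass' or 'post-processing' for a listed opcode."""
    names = set(classes_of(opcode))
    if not names:
        raise ValueError(f"opcode not in the lists above: {opcode}")
    if names <= MAIN_CHANNEL_CLASSES:
        return "main"
    if names <= POST_PROCESSING_CLASSES:
        return "post-processing"
    return "bypass"
```

Note that IMAXI and IMINI appear in both the ALG list and the CMP/LGC list in the text, which is why `classes_of` returns a list rather than a single class.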
With the architecture described above, the compiler can combine multiple main channel instructions and a post-processing unit instruction into a single instruction according to their data dependencies; this is called static combination (Static Combine). Fig. 2 is a flowchart of an instruction combination method according to an embodiment of the invention. The compiler obtains multiple main channel instructions and, if needed, one post-processing unit instruction (step S210), combines them according to the data dependencies between the instructions (step S230), and sends the combined instruction to the instruction decoding unit 120 (step S250). In step S230, the compiler may perform static combination according to the following rules:
ALG+CMP+SEL;
ALG+CMP+SEL+SFU/LS/SMP;
ALG+CMP+Branch;
ALG+LGC+SEL;
ALG+LGC+SEL+SFU/LS/SMP; or
ALG+LGC+Branch.
Here, ALG denotes a calculation instruction, CMP a comparison instruction, LGC a logic instruction, SEL a selection instruction, Branch a conditional branch instruction, SFU a mathematical operation instruction, LS a load/store instruction, and SMP a sampling instruction. The following example illustrates static combination. The pseudo-code of the shader is as follows:
The compiler statically combines the above pseudo-code and translates it into the following machine code:
The first machine code instruction adds the value of general-purpose register R4 to the value of general-purpose register R8 (that is, the values of variables a and b) and sends the result to the next stage; "SFWD" denotes that the result (that is, the value of variable x) is forwarded to the next stage. The second machine code instruction compares the value forwarded from the previous stage (that is, the value of variable x) with the value of register R5 (that is, the value of variable c) and stores the comparison result in flag P1; "+" denotes that this instruction is combined with the previous instruction, and "SFWD" denotes the value forwarded from the previous stage. If the value of variable x is greater than the value of variable c, flag P1 is set to "1"; otherwise, it is set to "0". The third machine code instruction writes to register R0, according to flag P1, either the value forwarded from the previous stage (that is, the value of variable x) or the value of register R5 (that is, the value of variable c); "+" again denotes that this instruction is combined with the previous instruction. If flag P1 is "1", the value of register R5 (that is, the value of variable c) is written to register R0; otherwise, the value forwarded from the previous stage (that is, the value of variable x) is written.
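The six static combination patterns listed earlier can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the function name `is_listed_combination` and the representation of an instruction sequence as a list of class labels are hypothetical, and the check covers only the six listed patterns, since the text does not say whether that list is exhaustive.

```python
# Three-stage patterns taken from the rule list (ALG, then CMP or LGC, then SEL or Branch).
THREE_STAGE = {
    ("ALG", "CMP", "SEL"),
    ("ALG", "CMP", "Branch"),
    ("ALG", "LGC", "SEL"),
    ("ALG", "LGC", "Branch"),
}
# Only the SEL-terminated patterns are listed with a trailing post-processing instruction.
EXTENDABLE = {("ALG", "CMP", "SEL"), ("ALG", "LGC", "SEL")}
POST_PROCESSING = {"SFU", "LS", "SMP"}


def is_listed_combination(kinds):
    """Return True if the sequence of class labels matches one of the six listed rules."""
    kinds = tuple(kinds)
    if len(kinds) == 4 and kinds[3] in POST_PROCESSING:
        return kinds[:3] in EXTENDABLE
    return kinds in THREE_STAGE
```

The worked example in the text (an addition, then a comparison, then a selection) corresponds to `is_listed_combination(["ALG", "CMP", "SEL"])`, which matches the first rule.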
To allow the graphics processing unit to execute combined instructions, the calculation units in the architecture can be adjusted. Refer to the static combination example above. Fig. 3A and 3B are hardware architecture diagrams of a three-dimensional graphics processing apparatus according to an embodiment of the invention. The adder 310 obtains the value of register R4 and the value of general-purpose register R8 (that is, the values of variables a and b) through the pre-arithmetic logic unit (Pre-ALU, Arithmetic Logic Unit), and passes the result through the normalizer (Normalizer) and the formatter (Formatter) to the comparator 330. The comparator 330 receives the result produced by the arithmetic unit 140 (the previous stage), obtains the value of register R5 (that is, the value of variable c) through the bypass unit 180, and then compares the two. The selection unit 350 writes back to register R0, according to the comparison result 351 (the output of the previous stage), one of the result 353 produced by the adder 310 (that is, the value of variable x) and the value 355 of general-purpose register R5 obtained from the bypass unit 180.
To execute combined instructions, the calculation units in the arithmetic unit 140 are coupled to the data extraction unit 130 and the bypass unit 180 to obtain operands, and are coupled to the calculation units of the comparison/logic unit 150 and the post-process unit 160, such as the comparator 330 and the selection unit 350, to output results to subsequent stages. The calculation units in the comparison/logic unit 150 are coupled to the arithmetic unit 140 and the bypass unit 180 to obtain operands, and are coupled to the calculation units of the post-process unit 160, such as the selection unit 350, to output results to the subsequent stage. The calculation units in the post-process unit 160, such as the selection unit 350, are coupled to the arithmetic unit 140, the comparison/logic unit 150 and the bypass unit 180 to obtain operands, and are coupled to the general-purpose registers, the load/store unit 171, the sampling unit 173 and the special function unit 175 to write data back to a general-purpose register or output results to a post-processing unit. Besides the selection unit 350, the post-process unit 160 may also include a conditional branch unit and a determination unit that decides whether to write a result back to the general-purpose registers.
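The data flow through units 310, 330 and 350 for the static combination example can be modelled in a few lines. The following is a behavioural Python sketch only, assuming registers are held in a dictionary and ignoring the normalizer and formatter; the function name `run_add_cmp_sel` is hypothetical and the code is not the patent's machine code.

```python
def run_add_cmp_sel(regs):
    """Model the combined add, compare, select instruction described in the text.

    regs maps register names to values and must contain R4, R8 and R5.
    Returns the updated register map and the flag P1.
    """
    # Calculation stage (unit 310): add R4 and R8. The sum is forwarded to the
    # next stage instead of being written to a general-purpose register.
    forwarded = regs["R4"] + regs["R8"]

    # Post-logic stage (unit 330): compare the forwarded sum with R5, which
    # arrives through the bypass unit 180. P1 is 1 when the sum is greater.
    p1 = 1 if forwarded > regs["R5"] else 0

    # Post-process stage (unit 350): write R5 to R0 when P1 is 1, otherwise
    # write the forwarded sum.
    updated = dict(regs)
    updated["R0"] = regs["R5"] if p1 == 1 else forwarded
    return updated, p1
```

In this model the intermediate sum never touches the register map: the only write is the final one to R0, which is the saving in register traffic that forwarding between stages is meant to provide. The net effect is that R0 receives the smaller of the sum and R5.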
With the architecture described above, the compiler can also combine a bypass channel instruction with main channel instructions and/or a post-processing unit instruction into a single instruction; this is called bypassed combination (Bypassed Combine). Fig. 4 is a flowchart of an instruction combination method according to an embodiment of the invention. The compiler obtains multiple instructions (step S410), combines one bypass channel instruction with at least one of a main channel instruction and a post-processing unit instruction (step S430), and sends the combined instruction to the instruction decoding unit 120 (step S450). In step S430, the combination of multiple main channel instructions must obey the rules described above. In other words, step S430 can be regarded as further combining the result of combining multiple main channel instructions with one bypass channel instruction and/or one post-processing unit instruction. An example of bypassed combination follows, with machine code as follows:
The first machine code instruction stores the value of general-purpose register R4 to general-purpose register R0. The second machine code instruction adds the value of general-purpose register R4 to the value of general-purpose register R8 and sends the result to the next stage; "SFWD" denotes that the result is forwarded to the next stage. The third machine code instruction takes the reciprocal of the value forwarded from the previous stage and writes the result to register R7; "SFWD" denotes the value forwarded from the previous stage.
Fig. 5 is a hardware architecture diagram of a three-dimensional graphics processing apparatus according to an embodiment of the invention. The bypass unit 180 receives the instruction from the data extraction unit 130, reads the value of general-purpose register R4 and writes it to general-purpose register R0. Meanwhile, the adder 510 obtains the value of general-purpose register R4 and the value of general-purpose register R8 through the pre-arithmetic logic unit, and passes the result to the special function unit 175. The special function unit 175 then computes the reciprocal and writes it to general-purpose register R7. Because the main channel instruction and the post-processing unit instruction can be processed in parallel with the bypass channel instruction, the performance of the stream processor is further improved.
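The bypassed combination example can be modelled the same way. The following is a behavioural Python sketch, not the patent's machine code: the function name `run_bypassed_combination` is hypothetical, and the code assumes that both channels read their sources from the register state as it was before the combined instruction started, which the text implies by describing the two paths as parallel but does not state outright.

```python
def run_bypassed_combination(regs):
    """Model the combined move, add and reciprocal described in the text.

    regs maps register names to values and must contain R4 and R8.
    Raises ZeroDivisionError if R4 + R8 is zero.
    """
    # Bypass channel (unit 180): copy R4 towards R0 without any computation.
    bypass_value = regs["R4"]

    # Main channel (adder 510): add R4 and R8 and forward the sum.
    forwarded = regs["R4"] + regs["R8"]

    # Post-processing unit (special function unit 175): reciprocal of the sum.
    reciprocal = 1.0 / forwarded

    # Both results are written after all sources have been read (assumption).
    updated = dict(regs)
    updated["R0"] = bypass_value
    updated["R7"] = reciprocal
    return updated
```

For example, with R4 = 1.0 and R8 = 3.0 the model writes 1.0 to R0 and 0.25 to R7 in a single combined instruction.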
Refer to Fig. 3 and Fig. 5. In general, to execute combined instructions, the arithmetic unit, the comparison/logic unit and the post-process unit are coupled in series, and each of them is coupled to the bypass channel. Specifically, a first calculation unit in the arithmetic unit 140 (for example, an adder, a multiplier or a divider) is coupled to the bypass unit (that is, the bypass channel) 180 and the data extraction unit 130 to obtain operands from the bypass unit 180 and/or the data extraction unit 130. A second calculation unit in the comparison/logic unit 150 (for example, a comparator or logic gates of various kinds) is coupled to the bypass unit 180 and the output of the first calculation unit to obtain operands from the bypass unit 180 and/or the output of the first calculation unit. A third calculation unit in the post-process unit 160 (for example, a selection unit or a conditional branch unit) is coupled to the bypass unit 180, the output of the first calculation unit and the output of the second calculation unit to obtain operands from the bypass unit 180, the output of the first calculation unit and/or the output of the second calculation unit. The third calculation unit is further coupled to the load/store unit 171, the sampling unit 173 and the special function unit 175 to output operation results to these post-processing units.
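The couplings described above amount to a small table of which unit may supply operands to which. The following Python sketch records that table; the key strings and the helper `can_supply_operand` are names assumed here for illustration, with the reference numerals taken from the text.

```python
# For each main channel unit, the sources it is coupled to for obtaining operands.
OPERAND_SOURCES = {
    "arithmetic_unit_140": {"bypass_unit_180", "data_extraction_unit_130"},
    "comparison_logic_unit_150": {"bypass_unit_180", "arithmetic_unit_140"},
    "post_process_unit_160": {
        "bypass_unit_180",
        "arithmetic_unit_140",
        "comparison_logic_unit_150",
    },
}

# Post-processing units that the third calculation unit can output results to.
POST_PROCESS_OUTPUTS = {
    "load_store_unit_171",
    "sampling_unit_173",
    "special_function_unit_175",
}


def can_supply_operand(producer, consumer):
    """Return True if the text couples producer to consumer as an operand source."""
    return producer in OPERAND_SOURCES.get(consumer, set())
```

The table makes the two structural properties easy to see: the bypass unit 180 is a source for every main channel unit, and results only flow forward from the arithmetic unit to the comparison/logic unit to the post-process unit.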
Although Figs. 1, 3 and 5 contain the components described above, additional components may be used to achieve better technical effects without departing from the spirit of the invention. In addition, although the processing steps of Figs. 2 and 4 are performed in a specific order, those skilled in the art may change the order of these steps while achieving the same effect without departing from the spirit of the invention, so the invention is not limited to the order described above.
Although the invention has been described using the above embodiments, these descriptions are not intended to limit the invention. On the contrary, the invention covers modifications and similar arrangements that are apparent to those skilled in the art. The scope of the claims must therefore be interpreted in the broadest manner to encompass all apparent modifications and similar arrangements.
[Description of reference numerals]
110 instruction request; 120 instruction decoding unit;
121 general-purpose register address; 123 constant buffer address;
130 data extraction unit; 131 general-purpose register address;
133 data; 135 constant;
140 arithmetic unit; 150 comparison/logic unit;
160 post-process unit; 171 load/store unit;
173 sampling unit; 175 special function unit;
180 bypass unit; 181 data or constant;
S210 ~ S250 method steps; 310 adder;
330 comparator; 351 comparison result;
353 result produced by the adder; 355 value of the general-purpose register;
S410 ~ S450 method steps; 510 adder.

Claims (12)

1. An instruction combination method, performed by a compiler, comprising:
obtaining a plurality of first instructions, wherein each of the first instructions performs one of a calculation operation, a comparison operation, a logic operation, a selection operation, a conditional branch operation, a load/store operation, a sampling operation and a complex mathematical operation;
combining the first instructions according to data dependencies between the first instructions; and
sending the combined instruction to a stream processor.
2. The instruction combination method as claimed in claim 1, wherein the first instructions are combined according to one of the following rules:
ALG+CMP+SEL;
ALG+CMP+SEL+SFU/LS/SMP;
ALG+CMP+Branch;
ALG+LGC+SEL;
ALG+LGC+SEL+SFU/LS/SMP; or
ALG+LGC+Branch,
wherein ALG denotes a calculation instruction, CMP denotes a comparison instruction, LGC denotes a logic instruction, SEL denotes a selection instruction, Branch denotes a conditional branch instruction, SFU denotes a mathematical operation instruction, LS denotes a load/store instruction, and SMP denotes a sampling instruction.
3. The instruction combination method as claimed in claim 1, further comprising:
obtaining a second instruction, wherein the second instruction transfers data from a general-purpose register or a constant buffer to another general-purpose register or a post-processing unit; and
further combining the result of combining the first instructions with the second instruction.
4. The instruction combination method as claimed in claim 1, wherein the stream processor comprises:
a data extraction unit;
a bypass channel, coupled to a general-purpose register, a constant buffer and the data extraction unit; and
a main channel, coupled to the data extraction unit and the bypass channel, comprising an arithmetic unit, a comparison/logic unit and a post-process unit,
wherein the arithmetic unit, the comparison/logic unit and the post-process unit are coupled in series, and each of the arithmetic unit, the comparison/logic unit and the post-process unit is coupled to the bypass channel.
5. The instruction combination method as claimed in claim 4, wherein a first computing unit in the arithmetic unit is coupled to the bypass channel and the data extraction unit to obtain operands from the bypass channel and/or the data extraction unit; a second computing unit in the compare/logic unit is coupled to the bypass channel and a first output of the first computing unit to obtain operands from the bypass channel and/or the first output; and a third computing unit in the post-process unit is coupled to the bypass channel, the first output of the first computing unit and a second output of the second computing unit to obtain operands from the bypass channel, the first output and/or the second output.
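The operand routing described in claims 4 and 5 can be pictured with a small Python model of one ALG+CMP+SEL pass through the main channel. The concrete operations (an add, a greater-than test and a select) and the operand names `a`, `b`, `threshold` and `fallback` are hypothetical; only the choice of operand sources for each unit follows the claim wording.

```python
def run_main_channel(fetch, bypass):
    """Model one combined ALG+CMP+SEL pass through the three computing units."""
    # First computing unit: operands from the data extraction unit and/or
    # the bypass channel.
    out1 = fetch["a"] + bypass["b"]
    # Second computing unit: operands from the first output and/or
    # the bypass channel.
    out2 = out1 > bypass["threshold"]
    # Third computing unit: operands from the bypass channel, the first
    # output and/or the second output.
    return out1 if out2 else bypass["fallback"]


result_taken = run_main_channel({"a": 3}, {"b": 4, "threshold": 5, "fallback": 0})
result_fallback = run_main_channel({"a": 3}, {"b": 4, "threshold": 10, "fallback": 0})
```

Because each unit can read the outputs of the units before it, the intermediate values `out1` and `out2` never need to be written back to a register in this model.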
6. An apparatus having multiple data channels, comprising:
A data extraction unit;
A bypass channel, coupled to a general-purpose register, a constant buffer and the data extraction unit; and
A main channel, coupled to the data extraction unit and the bypass channel, and comprising an arithmetic unit, a compare/logic unit and a post-process unit,
Wherein the arithmetic unit, the compare/logic unit and the post-process unit are coupled in sequence, and each of the arithmetic unit, the compare/logic unit and the post-process unit is coupled to the bypass channel.
7. The apparatus having multiple data channels as claimed in claim 6, wherein a first computing unit in the arithmetic unit is coupled to the bypass channel and the data extraction unit to obtain operands from the bypass channel and/or the data extraction unit; a second computing unit in the compare/logic unit is coupled to the bypass channel and a first output of the first computing unit to obtain operands from the bypass channel and/or the first output; and a third computing unit in the post-process unit is coupled to the bypass channel, the first output of the first computing unit and a second output of the second computing unit to obtain operands from the bypass channel, the first output and/or the second output.
8. The apparatus having multiple data channels as claimed in claim 7, wherein the third computing unit is coupled to a load/store unit, a sampling unit and a special function unit to output an operation result.
9. The apparatus having multiple data channels as claimed in claim 8, wherein the load/store unit executes a load/store instruction, the sampling unit executes a sampling instruction, and the special function unit executes a mathematical operation instruction.
10. The apparatus having multiple data channels as claimed in claim 6, wherein the main channel executes a main channel instruction and the bypass channel executes a bypass channel instruction.
11. The apparatus having multiple data channels as claimed in claim 10, wherein the main channel instruction and the bypass channel instruction are executed in parallel.
12. The apparatus having multiple data channels as claimed in claim 10, wherein the main channel instruction comprises a calculation instruction, a comparison instruction, a logic instruction, a selection instruction and a conditional branch instruction, and the bypass channel instruction comprises a move instruction.
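The parallel execution of a main channel instruction and a bypass channel instruction in claims 10 to 12 can be modeled roughly as follows. This is a sketch under stated assumptions: the function `issue_cycle`, the dictionary-based register file and the rule that both channels read the register state as it was before the cycle are illustrative choices, not details taken from the claims.

```python
def issue_cycle(regs, consts, main_fn, main_dst, move_src, move_dst):
    """Execute one main channel operation and one bypass move in the same cycle."""
    # Assumption: both channels observe the pre-cycle register state.
    snapshot = dict(regs)
    main_result = main_fn(snapshot)
    # The move reads either a constant buffer entry or a general-purpose register.
    move_value = consts[move_src] if move_src in consts else snapshot[move_src]
    new_regs = dict(regs)
    new_regs[main_dst] = main_result
    new_regs[move_dst] = move_value
    return new_regs


regs = {"r0": 1, "r1": 2}
consts = {"c0": 9}
after = issue_cycle(regs, consts, lambda r: r["r0"] + r["r1"], "r2", "c0", "r3")
```

The sketch does not handle the case where both channels target the same destination register; a real design would need a rule for that conflict.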
CN201510521991.2A 2015-08-24 2015-08-24 Instruction combination method and apparatus with multiple data channels Active CN105183433B (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN201510521991.2A CN105183433B (en) 2015-08-24 2015-08-24 Instruction combination method and apparatus with multiple data channels
US14/855,580 US9904550B2 (en) 2015-08-24 2015-09-16 Methods for combining instructions and apparatuses having multiple data pipes
TW104132348A TWI552081B (en) 2015-08-24 2015-10-01 Methods for instruction combine and apparatuses having multiple data pipes
EP23181534.1A EP4258110A3 (en) 2015-08-24 2015-10-13 Methods for combining instructions and apparatuses having multiple data pipes
EP15189469.8A EP3136228B1 (en) 2015-08-24 2015-10-13 Methods for combining instructions and apparatuses having multiple data pipes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510521991.2A CN105183433B (en) 2015-08-24 2015-08-24 Instruction combination method and apparatus with multiple data channels

Publications (2)

Publication Number Publication Date
CN105183433A CN105183433A (en) 2015-12-23
CN105183433B CN105183433B (en) 2018-02-06

Family

ID=54324852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510521991.2A Active CN105183433B (en) 2015-08-24 2015-08-24 Instruction combination method and apparatus with multiple data channels

Country Status (4)

Country Link
US (1) US9904550B2 (en)
EP (2) EP4258110A3 (en)
CN (1) CN105183433B (en)
TW (1) TWI552081B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6237086B1 (en) * 1998-04-22 2001-05-22 Sun Microsystems, Inc. 1 Method to prevent pipeline stalls in superscalar stack based computing systems
US20060271738A1 (en) * 2005-05-24 2006-11-30 Texas Instruments Incorporated Configurable cache system depending on instruction type
CN101238454A (en) * 2005-08-11 2008-08-06 Coresonic AB Programmable digital signal processor having a clustered SIMD microarchitecture including a complex short multiplier and an independent vector load unit
CN101377736A (en) * 2008-04-03 2009-03-04 VIA Technologies, Inc. Out-of-order execution microprocessor and macro-instruction processing method
US8055883B2 (en) * 2009-07-01 2011-11-08 Arm Limited Pipe scheduling for pipelines based on destination register number
CN103646009A (en) * 2006-04-12 2014-03-19 Soft Machines, Inc. Apparatus and method for processing an instruction matrix specifying parallel and dependent operations

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5333284A (en) * 1990-09-10 1994-07-26 Honeywell, Inc. Repeated ALU in pipelined processor design
JP3634379B2 (en) * 1996-01-24 2005-03-30 サン・マイクロシステムズ・インコーポレイテッド Method and apparatus for stack caching
US7730284B2 (en) * 2003-03-19 2010-06-01 Koninklijke Philips Electronics N.V. Pipelined instruction processor with data bypassing and disabling circuit
US7020757B2 (en) 2003-03-27 2006-03-28 Hewlett-Packard Development Company, L.P. Providing an arrangement of memory devices to enable high-speed data access
TWI275994B (en) 2004-12-29 2007-03-11 Ind Tech Res Inst Encoding method for very long instruction word (VLIW) DSP processor and decoding method thereof
US20060195631A1 (en) 2005-01-31 2006-08-31 Ramasubramanian Rajamani Memory buffers for merging local data from memory modules
US8443030B1 (en) * 2007-03-09 2013-05-14 Marvell International Ltd. Processing of floating point multiply-accumulate instructions using multiple operand pathways
US9009686B2 (en) * 2011-11-07 2015-04-14 Nvidia Corporation Algorithm for 64-bit address mode optimization
US9569214B2 (en) * 2012-12-27 2017-02-14 Nvidia Corporation Execution pipeline data forwarding
US9003382B2 (en) * 2013-02-18 2015-04-07 Red Hat, Inc. Efficient just-in-time compilation
US9483266B2 (en) 2013-03-15 2016-11-01 Intel Corporation Fusible instructions and logic to provide OR-test and AND-test functionality using multiple test sources
US9342312B2 (en) * 2013-06-14 2016-05-17 Texas Instruments Incorporated Processor with inter-execution unit instruction issue
US9524178B2 (en) * 2013-12-30 2016-12-20 Unisys Corporation Defining an instruction path to be compiled by a just-in-time (JIT) compiler
US9213563B2 (en) * 2013-12-30 2015-12-15 Unisys Corporation Implementing a jump instruction in a dynamic translator that uses instruction code translation and just-in-time compilation
US20150186168A1 (en) * 2013-12-30 2015-07-02 Unisys Corporation Dedicating processing resources to just-in-time compilers and instruction processors in a dynamic translator
US9477477B2 (en) * 2014-01-22 2016-10-25 Nvidia Corporation System, method, and computer program product for executing casting-arithmetic instructions
US9442706B2 (en) * 2014-05-30 2016-09-13 Apple Inc. Combining compute tasks for a graphics processing unit
US9501269B2 (en) * 2014-09-30 2016-11-22 Advanced Micro Devices, Inc. Automatic source code generation for accelerated function calls
US9836283B2 (en) * 2014-11-14 2017-12-05 Cavium, Inc. Compiler architecture for programmable application specific integrated circuit based network devices
US10318292B2 (en) * 2014-11-17 2019-06-11 Intel Corporation Hardware instruction set to replace a plurality of atomic operations with a single atomic operation
US9588746B2 (en) * 2014-12-19 2017-03-07 International Business Machines Corporation Compiler method for generating instructions for vector operations on a multi-endian processor
JP6492943B2 (en) * 2015-05-07 2019-04-03 富士通株式会社 Computer, compiling method, compiling program, and pipeline processing program
US10564943B2 (en) * 2015-06-08 2020-02-18 Oracle International Corporation Special calling sequence for caller-sensitive methods
US10055208B2 (en) * 2015-08-09 2018-08-21 Oracle International Corporation Extending a virtual machine instruction set architecture
US10140119B2 (en) * 2016-03-17 2018-11-27 Oracle International Corporation Modular serialization

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6237086B1 (en) * 1998-04-22 2001-05-22 Sun Microsystems, Inc. 1 Method to prevent pipeline stalls in superscalar stack based computing systems
US20060271738A1 (en) * 2005-05-24 2006-11-30 Texas Instruments Incorporated Configurable cache system depending on instruction type
CN101180611A (en) * 2005-05-24 2008-05-14 Texas Instruments Incorporated Configurable cache system depending on instruction type
CN101238454A (en) * 2005-08-11 2008-08-06 Coresonic AB Programmable digital signal processor having a clustered SIMD microarchitecture including a complex short multiplier and an independent vector load unit
CN103646009A (en) * 2006-04-12 2014-03-19 Soft Machines, Inc. Apparatus and method for processing an instruction matrix specifying parallel and dependent operations
CN101377736A (en) * 2008-04-03 2009-03-04 VIA Technologies, Inc. Out-of-order execution microprocessor and macro-instruction processing method
US8055883B2 (en) * 2009-07-01 2011-11-08 Arm Limited Pipe scheduling for pipelines based on destination register number

Also Published As

Publication number Publication date
EP3136228B1 (en) 2023-08-09
EP3136228A1 (en) 2017-03-01
TWI552081B (en) 2016-10-01
EP4258110A3 (en) 2024-01-03
TW201709060A (en) 2017-03-01
US9904550B2 (en) 2018-02-27
EP4258110A2 (en) 2023-10-11
US20170060594A1 (en) 2017-03-02
CN105183433B (en) 2018-02-06

Similar Documents

Publication Publication Date Title
CN110036368B (en) Apparatus and method for performing arithmetic operations to accumulate floating point numbers
US9542154B2 (en) Fused multiply add operations using bit masks
Yu et al. Vector processing as a soft-core CPU accelerator
TWI514269B (en) Apparatus and method for vector instructions for large integer arithmetic
TWI733798B (en) An apparatus and method for managing address collisions when performing vector operations
KR20100075588A (en) Apparatus and method for performing magnitude detection for arithmetic operations
US11507531B2 (en) Apparatus and method to switch configurable logic units
US10474427B2 (en) Comparison of wide data types
CN114443559A (en) Reconfigurable operator unit, processor, calculation method, device, equipment and medium
US20100115232A1 (en) Large integer support in vector operations
CN105183433A (en) Instruction combination method and apparatus with multiple data channels
Dally Micro-optimization of floating-point operations
US7167889B2 (en) Decimal multiplication for superscaler processors
Yavits et al. Associative Processor
US11768664B2 (en) Processing unit with mixed precision operations
Jezequel et al. Parallelization of Discrete Stochastic Arithmetic on multicore architectures
Mishra et al. A novel signal processing coprocessor for n-dimensional geometric algebra applications
US20140372728A1 (en) Vector execution unit for digital signal processor
Odendahl et al. A next generation digital signal processor for European space missions
EP3655851A1 (en) Register-based complex number processing
CN113064841B (en) Data storage method, processing method, computing device and readable storage medium
Mazouz et al. An efficient real time implementation of a fast IMM for tracking a manoeuvring target
Khalil et al. Design and implementation of dual-core MIPS processor for LU decomposition based on FPGA
US20200089497A1 (en) Instruction set for minimizing control variance overhead in dataflow architectures
KR20230025897A (en) Processing unit with small footprint arithmetic logic unit

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211206

Address after: Room 201, No. 2557, Jinke Road, Pilot Free Trade Zone, Pudong New Area, Shanghai 201203

Patentee after: Glenfly Tech Co., Ltd.

Address before: Room 301, 2537 Jinke Road, Zhangjiang Hi-Tech Park, Shanghai 201203

Patentee before: VIA Alliance Semiconductor Co., Ltd.