WO1999066393A1 - Registers and method for accessing data therein for use in a single instruction multiple data system - Google Patents

Registers and method for accessing data therein for use in a single instruction multiple data system

Info

Publication number
WO1999066393A1
Authority
WO
WIPO (PCT)
Prior art keywords
register
data
storage locations
registers
array
Prior art date
Application number
PCT/JP1999/003256
Other languages
French (fr)
Inventor
Sharif Mohammad Sazzad
Larry Pearlstein
Original Assignee
Hitachi, Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi, Ltd.
Priority to JP2000555150A (JP2002518730A)
Publication of WO1999066393A1

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F9/00 Arrangements for program control, e.g. control units
            • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
              • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
                • G06F9/30003 Arrangements for executing specific machine instructions
                  • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
                    • G06F9/30032 Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
                    • G06F9/30025 Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
                    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
                • G06F9/30098 Register arrangements
                  • G06F9/30105 Register structure
                    • G06F9/30109 Register structure having multiple operands in a single register
                  • G06F9/30141 Implementation provisions of register files, e.g. ports
      • G11 INFORMATION STORAGE
        • G11C STATIC STORES
          • G11C7/00 Arrangements for writing information into, or reading information out from, a digital store
            • G11C7/10 Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
              • G11C7/1006 Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor

Definitions

  • the present invention relates to methods and apparatus, including, e.g., registers and register arrays, for implementing single instruction multiple data (SIMD) signal processing operations.
  • SIMD single instruction multiple data
  • Two-dimensional sets of data are frequently used to represent, e.g., images.
  • a two-dimensional operation as two, one-dimensional operations
  • the one-dimensional operations are applied sequentially in the horizontal and vertical directions of the data being processed. This is illustrated in Fig. 1 where the two-dimensional operation HV is implemented as two sequential processing operations H, V on the data set A 100 to produce the two-dimensional data set HV(A) 104.
  • the intermediate data set H(A) 102 is produced as the result of the application of the horizontal function H to the data set A 100.
  • data words each represented by a separate box, are arranged in a memory in "raster-scan" order as illustrated in Fig. 2.
  • data words beginning at the top left of a two-dimensional data array 200, following to the right and down to the bottom right data element are stored at sequential locations in memory, as illustrated by the row of blocks 202 representing sequential memory locations.
  • the arrangement of the samples in the one-dimensional structure is convenient because each data sample follows the next.
  • access to the data is not as straightforward because there is a jump between the consecutive samples as represented by the arrow 203.
  • One known method of solving the problem of accessing the vertical rows of data for performing the vertical processing operation is to store the results from the horizontal processing operation in transposed order. This is shown in Fig. 3 wherein the shaded blocks representing a vertical column of data are now arranged horizontally.
  • Another method of accessing data to perform sequential horizontal and vertical data processing operations involves addressing the data that is stored in memory using a pointer that jumps to the next desired data sample.
  • This method has the advantage, as compared to the transpose technique discussed above, that it does not require that the data undergo an additional transposition step in order to restore the natural data ordering for use in subsequent operations.
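The two access strategies described above (transposed storage and a jumping pointer) can be sketched in a few lines. This is only an illustration under stated assumptions: memory is modeled as a Python list, and the helper names `raster_index`, `read_row`, `read_column_strided` and `store_transposed` are hypothetical, not taken from the patent.

```python
def raster_index(row, col, width):
    """Offset of element (row, col) when rows are stored one after another."""
    return row * width + col


def read_row(mem, row, width):
    """Horizontal access: the samples of one row are contiguous."""
    start = raster_index(row, 0, width)
    return mem[start:start + width]


def read_column_strided(mem, col, width, height):
    """Vertical access with a pointer that jumps by `width` between samples."""
    return [mem[raster_index(r, col, width)] for r in range(height)]


def store_transposed(mem, width, height):
    """Rewrite the data in transposed order so each column becomes contiguous."""
    return [mem[raster_index(r, c, width)]
            for c in range(width) for r in range(height)]


# A 3-row by 4-column array stored in raster-scan order (assumed sample data).
mem = list(range(12))
row1 = read_row(mem, 1, 4)
col2 = read_column_strided(mem, 2, 4, 3)
transposed = store_transposed(mem, 4, 3)
```

In the transposed copy, column 2 occupies three consecutive locations, whereas the strided read leaves the data in place and pays for the jump on every sample.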
  • the computational unit may be, e.g., a programmable signal processing core or some fixed function hardware. As a result of the "closeness" of the data registers to the computational unit, the computational unit can operate directly on the registers.
  • SIMD Single-Instruction Multiple Data
  • SIMD architecture systems allow multiple data elements to be processed simultaneously in response to a single instruction.
  • the multiple data units may be stored in a single register.
  • Well designed SIMD architectures can provide considerable performance advantages over more traditional Single-Instruction Single Data (SISD) architecture systems because of the simultaneous processing of multiple pieces of data made possible by the SIMD architecture.
  • SISD Single-Instruction Single Data
  • MMX technology from Intel Corporation currently in use in computer CPUs is one example of a SIMD architecture.
  • SIMD architecture that operates on two data samples at the same time.
  • the data samples have to be presented to the processing unit in the arrangement shown in the diagram of Fig. 4A.
  • one word 400 that is n-bits in length contains two sub-words 402, 404, each n/2-bits in length.
  • sub-words b 402, a 404
  • each of these halves is handled separately. This is one of the primary features of the SIMD processing.
  • As an example of a SIMD processing operation, suppose that it is desired to add two sets of numbers, {a, b} and {c, d}, to produce {a+c} and {b+d}.
  • Using the SIMD architecture, it is possible to set up two data elements 406, 408 similar to the one shown in Fig. 4A.
  • One of these 406 would contain the set {a, b} and the other 408 would contain the set {c, d}.
  • the processing unit treats the two halves of the input data words as independent quantities during the computation. An important consequence of this is that if the addition for the lower half overflows, the overflow will not affect the upper half. It can be seen from this example that the SIMD architecture is extremely beneficial for processing multiple pieces of data in parallel.
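The independence of the two halves can be illustrated with a small software model. It is a sketch only: the 16-bit word size, the wrap-around behavior on overflow, the placement of the first operand in the upper half, and the names `pack2` and `simd_add2` are all assumptions made for the example.

```python
def pack2(upper, lower, n=16):
    """Pack two n/2-bit sub-words into one n-bit word (upper half, lower half)."""
    half = n // 2
    mask = (1 << half) - 1
    return ((upper & mask) << half) | (lower & mask)


def simd_add2(x, y, n=16):
    """Add two n-bit words as two independent n/2-bit sub-words."""
    half = n // 2
    mask = (1 << half) - 1
    # The carry out of the lower half is discarded, so it cannot reach the upper half.
    lower = ((x & mask) + (y & mask)) & mask
    upper = (((x >> half) & mask) + ((y >> half) & mask)) & mask
    return (upper << half) | lower


w1 = pack2(0x01, 0xFF)            # upper sub-word 0x01, lower sub-word 0xFF
w2 = pack2(0x02, 0x01)            # upper sub-word 0x02, lower sub-word 0x01
packed_sum = simd_add2(w1, w2)    # lower half overflows and wraps to 0x00
plain_sum = (w1 + w2) & 0xFFFF    # ordinary addition lets the carry ripple upward
```

Here the upper half of `packed_sum` is 0x03, while in `plain_sum` the carry from the lower half turns it into 0x04.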
  • the inventors of the present application have discovered that various problems are encountered when one attempts to implement two-dimensional signal processing algorithms on SIMD architecture using local registers to provide high-performance signal processing implementations.
  • the SIMD architecture poses the following problem when data is to be transposed. Suppose that it is desired to obtain the transpose of the matrix:
  • It is desirable that any new methods and apparatus be capable of being implemented without the need for buffering or other temporary storage of register contents, which can cause performance delays.
  • It is also desirable that new and improved methods and apparatus for manipulating the contents of registers be capable of supporting data processing operations, other than transpose operations, which may require the manipulation of data in data units which are smaller than the full size of a utilized data register.
  • New SIMD instructions capable of taking advantage of the processing capabilities of any new methods and apparatus are also desirable.
  • the present invention is directed to methods and apparatus for implementing single instruction multiple data (SIMD) signal processing operations.
  • An object of the present invention is to provide an efficient register structure that allows the mathematical transposition of two-dimensional data to be performed with very low cost.
  • the present invention utilizes a two-dimensional SIMD register array that is used as the main work space in the high performance digital signal processing of two-dimensional signals.
  • the apparatus of the present invention includes new and useful registers and register arrays suitable for use when implementing a system based on a SIMD architecture.
  • Registers implemented in accordance with the present invention include circuitry that allows an entire n-bit word stored in a register to be accessed and output in word or sub-word units. During standard operation the registers are accessed on a word basis. However, during column data access operations, e.g., when performing a transpose operation, access is performed on a sub-word basis.
  • the ability to access the registers of the present invention on a word or sub-word level makes implementing transpose and various other row/column data manipulation operations possible in a relatively straightforward manner without data buffering.
  • various aspects of the present invention are directed to new and novel SIMD instructions, e.g., SIMD move, add, and copy instructions, which support the specification of data to be processed as a row or column of a register array as
  • a transpose instruction which accepts a register array identifier as an operand is also supported.
  • the present invention is also directed to additional methods for accessing and using the novel
  • the register arrays of the present invention provide a new method of transposing two-dimensional data in a high performance signal processing system.
  • the register arrays of the present invention are able to transpose a variety of matrix shapes - not just square matrices. It is also possible for a single register array to perform the transpose of multiple matrices. It should be noted that the processing of signals with greater than two dimensions can also benefit from the present invention, by considering a two-dimensional subset of the data at a time.
  • the register arrays of the present invention are suitable for high speed storage during the processing of two-dimensional signals. They may also be used with a programmable computational core and/or with some fixed function computational unit.
  • the two-dimensional arrays of the present invention can be used, e.g., in digital image compression applications, in image filtering applications and in digital video processing operations.
  • Figure 1 illustrates the performing of a two-dimensional processing operation on a set of data as two sequential one-dimensional operations.
  • Figure 2 illustrates the storage of a two-dimensional array of data in a one-dimensional series in what is referred to in the art as "raster scan" order.
  • Figure 3 illustrates the storage of a two-dimensional array of data in a one-dimensional series in what is referred to in the art as "transposed" order.
  • Figure 4A illustrates a word comprising two sub-words.
  • Figure 4B shows an operation involving the addition of two words, each of which comprises two sub-words.
  • Figure 5 illustrates how a 2x2 array of data may be stored in the contents of two registers, each register storing a word comprising two sub-words.
  • Figure 6 illustrates the contents of two registers, illustrated in Fig. 5, in transposed order.
  • Figure 7 illustrates a known array of two registers.
  • Figures 8-10A illustrate register arrays implemented in accordance with the present invention.
  • Figure 10B is a table illustrating the values of control signals used to access data stored in the array of Fig. 10A.
  • Figure 11 is a diagram illustrating a 2x2 sub-word atomic register array unit, comprising 2 word registers, implemented in accordance with the present invention.
  • Figure 12 illustrates a 4x4 sub-word register array implemented using four of the atomic register arrays of the present invention illustrated in Fig. 11.
  • Figures 13 and 14 illustrate the storage of non-square data in register arrays implemented in accordance with the present invention.
  • Figures 15-17 illustrate various register arrays implemented in accordance with different embodiments of the present invention.
  • Figure 18 is a table illustrating the values of control signals when used to access data stored in the array of Fig. 17.
  • Figure 19 is a representation of a 4x4 sub-word register array implemented using 4 word registers in accordance with the present invention.
  • Figure 20A is a diagram of a processing system implemented in accordance with the present invention.
  • Figures 20B-20D illustrate the contents of registers RA1 and RA2 of Fig. 20A at different times.
  • the present invention is directed to methods and apparatus for implementing single instruction multiple data (SIMD) signal processing operations.
  • Various embodiments are directed to new and useful registers and register arrays suitable for use when implementing a system according to a SIMD architecture.
  • the register and register arrays of the present invention allow the implementation of direct transpose and various other row/column data manipulation operations in an efficient manner without intermediate data buffering.
  • various aspects of the present invention are directed to new and novel SIMD instructions, e.g., a SIMD transpose instruction, and methods for using the novel registers and register arrays of the present invention.
  • a hardware approach is taken to solving the problem of manipulating row/column data, e.g., to perform a transpose operation on data included in a two-dimensional array.
  • One particular feature of the present invention is directed to circuitry that allows a general purpose register file in a SIMD architecture machine to read and/or write data into registers in a manner that allows two-dimensional data to be processed efficiently along either rows or columns.
  • a conventional register array 700, shown in Fig. 7, will first be discussed.
  • Fig. 7 illustrates a conventional register array 700 with two separate registers 702, 704, each n bits in length.
  • the individual first and second registers 702, 704 may be accessed using the control lines which are supplied with control signals c0 and c1.
  • the n output data lines from each of the two registers 702, 704 are joined together via a system of pass gates 703, 705, which are sometimes referred to as pass gate arrays.
  • the term pass gate is used here to refer to a switching device. Pass gates may be implemented with, e.g., tri-state logic, and take the form of transmission gates, multiplexers, or other similar circuitry. Pass gates may be capable of bus control.
  • Pass gates of the type used in the present invention are commonly used to allow the multiplexing of data from a number of devices while avoiding electrical conflicts.
  • the control signals c1, c0 are supplied to the system of pass gate arrays. The appropriate manipulation of the control signals ensures proper behavior of the register array 700.
  • the first and second registers 702, 704 may be part of a SIMD architecture system and that implicit within each register there are two, n/2-bit sub-words (d, c) and (b, a), respectively.
  • the symbol Z is used to represent an n-bit bus.
  • the bus Z includes n data lines, z1, z2, ..., zn.
  • the control signals, c0 and c1, may be used to select the contents (d, c), (b, a) of either register 702, 704 but it is not possible to obtain the sub-words a, b, c, or d separately.
  • control elements e.g., logic gates, which are not illustrated, are used to manage the generation of control signals used to read and write data from the illustrated register arrays.
  • the control elements that are not illustrated may be conventional control circuits and/or control circuits implemented in accordance with the teachings of the present invention included in this application.
  • Such control logic may be implemented using conventional components such as logic gates and/or multiplexers (MUXes).
  • the known register array illustrated in Fig. 7 does not allow for the sub-word elements stored therein to be directly accessed, making it difficult to use such a register array when trying to individually process sub-word data elements, e.g., to perform a transpose operation.
  • Figure 8 illustrates a register array 800 implemented in accordance with a first embodiment of the present invention, which is designed to make obtaining a transpose of the data stored in the register array 800 relatively easy.
  • the register array 800 comprises first and second registers 802, 804.
  • Each of the first and second registers 802, 804 includes an n-bit word (b, a), (d, c), respectively. Note that each word is comprised of two n/2 bit sub-words as in the Fig. 7 example.
  • In Fig. 8 the symbols 'Z1' and 'Z2' are used to represent lower and upper sets of n/2 bus lines, respectively.
  • the two sub-words of each register 802, 804 are separated from the bus lines by their own set of first and second pass gates (806, 807) and third and fourth pass gates (808, 809), respectively.
  • Pass gates 806 and 808 of the first and second registers 802, 804 are controlled by the control signal c2, which may be supplied by a common control line.
  • Pass gates 807, 809, of the first and second registers 802, 804, are controlled by the control signal c3 which may be supplied to the pass gates 807, 809 via a common control line.
  • the n/2 lines corresponding to each of the two sub-words (b, a) are joined together following the first and second pass gates 806, 807 to form the lower n/2 bits of the full n-bit word.
  • the n/2 lines corresponding to each of the two sub-words are joined together following the pass gates 808, 809 to form the upper n/2 bits of the full n-bit word output via the combination of lines Z1, Z2.
  • When the control signal c3 is enabled and c2 is disabled, the n-bit bus Z is allowed access to sub-words {b, d}.
  • the control signal and pass gate arrangement illustrated in Fig. 8 allows the transpose of the register array contents to be easily obtained.
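A small behavioral model may help make the Fig. 8 access pattern concrete. It is a sketch, not the circuit itself: each register is modeled as an (upper, lower) pair of sub-words, the function name `read_fig8` is hypothetical, and the behavior for c2 (returning sub-words a and c) is inferred by symmetry from the c3 case stated in the text.

```python
def read_fig8(reg_802, reg_804, c2, c3):
    """Model the bus Z of Fig. 8 as an (upper, lower) pair of sub-words.

    reg_802 = (b, a) and reg_804 = (d, c), each given as (upper, lower).
    Register 802 drives the lower bus lines Z1; register 804 drives the
    upper bus lines Z2.
    """
    if c2 and c3:
        # Enabling both gate sets would put two sub-words on the same lines.
        raise ValueError("c2 and c3 must not be enabled together")
    if c3:
        return (reg_804[0], reg_802[0])   # sub-words {b, d}
    if c2:
        return (reg_804[1], reg_802[1])   # sub-words {a, c} (assumed by symmetry)
    return None                           # no gate enabled: nothing is driven


reg_802 = ("b", "a")
reg_804 = ("d", "c")
column_bd = read_fig8(reg_802, reg_804, c2=0, c3=1)
column_ac = read_fig8(reg_802, reg_804, c2=1, c3=0)
```

Reading with c2 and then with c3 yields the two columns of the 2x2 block, which is the transposed view of the two stored words.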
  • register arrays 700 and 800 are combined to form a register array 900 illustrated in Fig. 9.
  • the register array 900 includes first and second registers 902, 904.
  • the output of each one of the registers 902, 904 is controlled using a set of 3 pass gates.
  • an n line pass gate 903 and two n/2 line pass gates 906, 907 are used to control the output of the first register 902.
  • the n output lines of the pass gate 903, which is controlled by control signal c0, are coupled to the corresponding n lines of the n line bus Z.
  • the first and second n/2 line pass gates 906, 907 have their output lines coupled to the corresponding lower n/2 lines of the bus Z.
  • an n line pass gate 905 and two n/2 line pass gates 908, 909 are used to control the output of the second register 904.
  • the n output lines of the pass gate 905, which is controlled by control signal c1, are coupled to the corresponding n lines of the n line bus Z.
  • the third and fourth n/2 line pass gates 908, 909 have their output lines coupled to the corresponding upper n/2 lines of the bus Z.
  • the pass gate arrangements of the previously discussed register array circuits 700, 800 are combined so that the resulting register array 900 includes the functionality of both. That is, it is possible to access the register array 900 in the conventional manner described in regard to Fig. 7, using control signals c1 and c0, and obtain the entire words stored in registers 902, 904, one word at a time. It is also possible to access registers 902, 904 in the manner discussed with regard to Fig. 8 using control signals c2 and c3 to access one sub-word from each of the two registers 902, 904 at a time.
  • control signals c0 and c1 are used to access the first and second registers 902, 904 in the traditional manner while control signals c2 and c3 are used to access the register array in the above discussed manner which facilitates obtaining a "transpose" of the data sub-words stored in registers 902, 904.
  • the register array 900 of the present invention is included in a programmable system where the states of the control signals c0, c1, c2, c3 are a function of a coded operand of a processing instruction being executed. Such a case will be discussed in greater detail below with reference to Figs. 20A-20C.
  • control state of the control signals c0, c1, c2, c3 would depend on the output of a state machine implemented, e.g., using combinational and sequential logic.
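How an instruction operand might be turned into the four control signals of the Fig. 9 array can be sketched as follows. The operand encoding (`"row"`/`"col"` plus an index), the function names `decode_controls` and `read_array_900`, and the register contents are hypothetical; the excerpt does not give the actual encoding. The mapping of c2 to the lower sub-words is likewise an assumption.

```python
def decode_controls(kind, index):
    """Map a hypothetical operand to (c0, c1, c2, c3) with exactly one signal set."""
    table = {
        ("row", 0): (1, 0, 0, 0),   # whole word of register 902
        ("row", 1): (0, 1, 0, 0),   # whole word of register 904
        ("col", 0): (0, 0, 1, 0),   # one sub-word from each register
        ("col", 1): (0, 0, 0, 1),   # the other sub-word from each register
    }
    return table[(kind, index)]


def read_array_900(reg_902, reg_904, controls):
    """Return the (upper, lower) sub-words placed on bus Z."""
    c0, c1, c2, c3 = controls
    if sum(controls) != 1:
        raise ValueError("exactly one control signal may be active")
    if c0:
        return reg_902
    if c1:
        return reg_904
    # Column access: register 902 feeds the lower half, register 904 the upper half.
    pick = 1 if c2 else 0
    return (reg_904[pick], reg_902[pick])


reg_902 = ("b", "a")   # assumed contents, stored as (upper, lower)
reg_904 = ("d", "c")
word0 = read_array_900(reg_902, reg_904, decode_controls("row", 0))
col1 = read_array_900(reg_902, reg_904, decode_controls("col", 1))
```

A state machine, as mentioned above, could replace `decode_controls` by stepping through the same four patterns in sequence.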
  • Figure 10A illustrates another two register array 1000 implemented in accordance with the present invention.
  • three n/2 line pass gates 1006, 1008, 1009 are used with the first register 1002.
  • Another three n/2 line pass gates 1116, 1118, 1119 are used with the second register 1004 of the present invention in the manner illustrated in Fig. 10A.
  • the Fig. 10A embodiment uses a separate control signal, c0, c1, c2, c3, c4, c5, to control each of the pass gates 1006, 1008, 1009, 1116, 1118, 1119, respectively.
  • the Fig. 10A embodiment uses the same number of pass gates as the Fig.
  • Note that the use of an n line pass gate is avoided in the Fig. 10A embodiment while two additional control signals are employed. Because of the elimination of the need for n line pass gates, the Fig. 10A embodiment may offer certain hardware advantages over the Fig. 9 embodiment.
  • The six control signals, c0, c1, c2, c3, c4, c5, illustrated in Fig. 10A are used to manage the way the registers 1002, 1004 are accessed.
  • Fig. 10B is a table showing the states to which the six control signals are set, e.g., by the control logic, to achieve the various data access operations set forth in the left side of the table. For example, in order to access the word {a, b} stored in the first register 1002, control signals c0 and c2 would be set to 1 and the remaining control signals would be set to 0.
  • Figures 9 and 10A show two exemplary circuits of the present invention each of which operates as a basic two-dimensional register array suitable for use with a SIMD architecture that partitions a single word into two sub-words.
  • the register arrays 900 and 1000 may each be treated as an "atomic" structure in that each can serve as a building block that may be used to construct larger register arrays in accordance with the present invention.
  • An important feature of the Fig. 9 and 10A register arrays is their ability to facilitate transposition of 2x2 data blocks.
  • the basic two-dimensional register array 900 or 1000 may be scaled to accommodate larger data blocks.
  • An atomic two-dimensional register array 1100 of the present invention, capable of being implemented, e.g., using either of the register arrays illustrated in Fig. 9 or 10A, is illustrated in Fig. 11.
  • the register array 1100 comprises first and second n-bit registers 1101, 1102. Note how the dashed line 1103 alludes to the partitioned nature of the first and second SIMD registers 1101, 1102 in the array 1100, and the n/2 bit sub-word stored in each half of the SIMD register 1101, 1102.
  • the process of accessing a 2x2 sub-word matrix created by the register array 1100 may be visualized by considering that the data enters the register array 1100 using the word inputs in0 and in1 shown on the left.
  • Data is output from the register array 1100 in either the standard (non-transposed) manner via word outputs os0, os1, or in transposed form via word outputs ot0, ot1.
  • in0 stands for input number 0
  • ot0 stands for transposed output number 0
  • os0 stands for standard output number 0.
  • the two transposed outputs ot0 and ot1 are shown at the top of the register array 1100.
  • the two standard outputs os0 and os1 are shown at the right side of register array 1100.
  • the two-dimensional array 1100 may be considered to be "atomic" because it is the smallest two-dimensional register array that may be constructed in accordance with the two-partition SIMD architecture of the present invention. Using the "atomic" structure illustrated in Fig. 11, larger register arrays may be created by combining multiple arrays 1100.
  • Square MxM sub-word register arrays may be implemented by using M/2 x M/2 word registers of the present invention.
  • the 4x4 sub-word register array 1200 may be constructed as shown in Fig. 12.
  • four register arrays of the type illustrated in Fig. 11 are used to form the register array 1200.
  • the register array 1200 also includes standard (non-transposed) outputs which are not illustrated.
  • the 4x4 sub-word register array 1200 may be used to form the transpose of matrices that are up to 4x4 sub-words in size. Lower order matrices and non-square matrices may also be accommodated by the structure.
  • When entering data to be transposed into a register array implemented in accordance with the present invention the data should be entered in a manner that allows the transpose of the data to be obtained from the square register array 1200.
  • For example, the array of sub-words:
  • each 2x2 sub-block within the array should be entered into the two-dimensional register array so that each 2x2 sub-block within the array is stored in a different one of the four atomic register units comprising the array 1200.
  • the array contents should be stored in such a manner that the content of each 2x2 sub-block will be aligned with a boundary of an atomic register unit.
  • Fig. 12 illustrates a possible way to store the array of sub-words illustrated above with proper register array alignment.
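One way to see why each 2x2 sub-block should sit in its own atomic unit is that a 4x4 transpose can be assembled from 2x2 transposes. The sketch below is a software illustration of that identity, not a description of the Fig. 12 wiring; the function names and sample data are assumptions.

```python
def split_into_blocks(matrix):
    """Split a 4x4 matrix into a 2x2 grid of 2x2 sub-blocks."""
    return [[[row[2 * bc:2 * bc + 2] for row in matrix[2 * br:2 * br + 2]]
             for bc in range(2)]
            for br in range(2)]


def transpose_2x2(block):
    """Transpose one 2x2 sub-block, the job of a single atomic unit."""
    (a, b), (c, d) = block
    return [[a, c], [b, d]]


def transpose_via_blocks(matrix):
    """Transpose a 4x4 matrix using only 2x2 block transposes and block moves."""
    blocks = split_into_blocks(matrix)
    out = [[None] * 4 for _ in range(4)]
    for br in range(2):
        for bc in range(2):
            t = transpose_2x2(blocks[br][bc])
            # The block at grid position (br, bc) lands at position (bc, br).
            for r in range(2):
                for c in range(2):
                    out[2 * bc + r][2 * br + c] = t[r][c]
    return out


matrix = [[4 * r + c for c in range(4)] for r in range(4)]
result = transpose_via_blocks(matrix)
```

If a sub-block straddled two atomic units, no single unit would hold all four sub-words needed for its local transpose, which is why the alignment with unit boundaries matters.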
  • this array should be stored using the upper two register units of the array 1200 as illustrated in Fig. 13.
  • the register array 1200 consider the 3x3 array below.
  • When storing the above array in the register 1200, the data should be arranged in the manner shown in Fig. 14. Note that, due to the SIMD nature of the system, half of the word registers included in the array 1200 are left with at least a portion of the register contents undefined or with "don't care" data as represented by the Xs illustrated in Fig. 14.
  • an H by V array of n/2 bit sub-words can be stored in an X x Y array of n-bit registers, arranged as an array of the atomic register units of the present invention, where X is equal to H/2 if H is even, and equal to int(H/2) plus one if H is odd; and where Y is equal to V/2 if V is even, and equal to int(V/2) plus one if V is odd.
  • each one of the V rows of n/2 bit sub-words to be stored is loaded into a different corresponding one of the Y rows of registers in an X x Y register array implemented in accordance with the present invention.
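The sizing rule and the padding of odd-sized data can be written out directly. This is a sketch under assumptions: H is taken as the number of sub-word columns and V as the number of sub-word rows, X and Y are read as counts in the grid of atomic units, and the names `register_grid_size` and `pad_with_dont_care` are hypothetical.

```python
def register_grid_size(h, v):
    """Return (X, Y) for an H by V array of sub-words, following the rule in the text."""
    x = h // 2 if h % 2 == 0 else int(h / 2) + 1
    y = v // 2 if v % 2 == 0 else int(v / 2) + 1
    return x, y


def pad_with_dont_care(rows, fill="X"):
    """Pad rows and columns up to even counts with "don't care" entries."""
    width = len(rows[0]) + len(rows[0]) % 2
    height = len(rows) + len(rows) % 2
    padded = [list(r) + [fill] * (width - len(r)) for r in rows]
    padded += [[fill] * width for _ in range(height - len(rows))]
    return padded


size_3x3 = register_grid_size(3, 3)
padded_3x3 = pad_with_dont_care([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

For odd H or V the rule is the same as rounding H/2 or V/2 up, so a 3x3 array occupies the same 2x2 grid as a 4x4 array, with the unused positions holding "don't care" data.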
  • a register array 1500 comprising any desired even number, k, of atomic register units 1502, 1504, 1506 may be constructed as shown in Fig. 15.
  • the control signals in Fig. 15 are operated in such a way that only the control signals for one atomic block 1502, 1504, or 1506 are active at a given time.
  • control signals are labeled as, e.g., c10, where the first number (1) identifies the atomic block, i.e., the first atomic block 1502, and the second number (0) identifies the pass gate within the block which is being controlled, i.e., the first gate in the case of the value 0.
  • the active atomic block e.g., atomic block 1502
  • the active atomic block may be specified as an operand of a software command.
  • the pattern of control signaling within the active atomic block, e.g., block 1502, to achieve a desired output, would be as shown in Fig. 10B.
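The rule that only one atomic block is driven at a time can be sketched as a function that expands a per-block pattern into the full set of named signals. The function name `block_control_signals` is hypothetical, and the only pattern used is the one entry of Fig. 10B quoted in the text (c0 and c2 set to 1 to read the word in the first register); the other table rows are not reproduced here.

```python
def block_control_signals(num_blocks, active_block, pattern):
    """Return a dict of signal name -> 0/1 for every pass gate in every block.

    Names follow the c<block><gate> convention, e.g. "c10" is gate 0 of block 1.
    With ten or more blocks these names would become ambiguous; the sketch
    ignores that case.
    """
    signals = {}
    for block in range(1, num_blocks + 1):
        for gate, value in enumerate(pattern):
            # Every block other than the active one keeps all gates closed.
            signals[f"c{block}{gate}"] = value if block == active_block else 0
    return signals


# Pattern for reading the word in the first register of a block (from the text).
read_first_word = (1, 0, 1, 0, 0, 0)
signals = block_control_signals(num_blocks=3, active_block=1, pattern=read_first_word)
```

The active block index plays the role of the operand mentioned above, so selecting a different block only changes which group of six signals receives the pattern.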
  • Figs. 9-15 are based on a SIMD architecture system in which two sub-words are included in a long word. In accordance with the present invention other partitions of a long word are possible.
  • the number of partitions in a word that are to be supported will determine the size of the atomic register array of the present invention that supports such a partition arrangement and the ability to output the data in standard or transposed form.
  • the atomic two-dimensional register array will be of sub-word order 4x4.
  • a circuit 1600 for this atomic register array is shown in Fig. 16.
  • the atomic register array 1600 comprises four n-bit word registers 1602, 1604, 1606, 1608 the contents of which may be accessed on a word or sub-word basis where, in this embodiment, a sub-word is one fourth the size of an n-bit word.
  • a separate n-bit pass gate is used in the Fig. 16 embodiment to control the word output of each register 1602, 1604, 1606, 1608.
  • four n/4 line pass gates are used in conjunction with each of the four registers 1602, 1604, 1606, 1608 to control the sub-word outputs of these registers.
  • the architecture of the Fig. 16 register array 1600 is similar to that of the Fig. 9 register array where two sets of pass gates are used to support both traditional and transposed register access operations.
  • a 4x4 sub-word register array 1700 implemented in accordance with another embodiment of the present invention can be seen in Fig. 17.
  • the register array 1700 is implemented as four segments 1701, 1703, 1705, 1707 with each segment including an n-bit register 1702, 1704, 1706, 1708, respectively, and seven n/4 line pass gates coupled together as illustrated in Fig. 17. Note that in each of the segments 1701, 1703, 1705, 1707 the sub-word outputs of three of the four sub-words stored in the segment's n-bit register are coupled to two different n/4 line pass gates included in the segment while one of the four sub-words stored in the register is coupled to a single n/4 line pass gate.
  • the array 1700 has been simplified by combining the registers and the various pass gate arrays into the rectangular segments 1701, 1703, 1705, 1707. Note that the n/4 bit sub-word outputs of each of the register units 1701, 1703, 1705, 1707, provided via buses Z1, Z2, Z3, Z4, are combined via the bus Z to generate a full n-bit word.
  • the array 1700 is controlled by eight control signals c0 - c7 which manage the pass gates and thus output behavior of the array 1700.
  • the control signals c0 - c7 are operated as shown in the table illustrated in Fig. 18.
  • A representation of the atomic two-dimensional register arrays 1600, 1700 is shown in Fig. 19 as a 4x4 sub-word atomic register array 1900. It is similar to the 4x4 register array in Fig. 13 except that there are fewer input and output lines as a result of each word including four sub-words in the Fig. 19 embodiment, as opposed to two sub-words in the Fig. 11 embodiment. That is, in the Fig. 19 example, the SIMD architecture partitions an n-bit register into four pieces. This means that four data items are stored in one register reducing the required number of access signals as compared to the Fig. 13 embodiment. Like the 2x2 sub-word atomic register array 1100 in Fig.
  • the 4x4 sub-word atomic register array 1900 may also be used to form larger structures that are capable of handling larger matrices.
  • four 4x4 sub-word atomic arrays 1900 can be substituted for the 2x2 sub-word atomic arrays illustrated in Fig. 13 to produce an 8x8 two-dimensional register array.
  • This size is particularly useful because it can be used in the processing of compressed digital video information, e.g., MPEG compliant video data.
  • a write version of this invention can be realized by connecting the collection of pass gates, as arranged above, to the inputs of the registers, and by controlling a write strobe for each register so that a register partition will be enabled for writing if, and only if, one of the pass gates feeding its input is active.
  • the new and novel SIMD instructions of the present invention take advantage of the fact that the contents of the two-dimensional register arrays of the present invention can be accessed on a row or column sub-word basis.
  • MOV is a move instruction and R0 and R1 are operands which specify the source and destination registers of the data to be copied.
  • data used in conventional SIMD instructions involves the entire contents of the register specified as an operand, e.g., R0.
  • a row and/or column of data to be used with a SIMD instruction can be specified as an operand.
  • Such an operand will normally identify both a row or column of register locations, and the particular two-dimensional register array where the specified row or column of register storage locations is located.
  • the present invention allows data to be specified in terms of rows or columns of a two-dimensional register array.
  • data may be copied from a row or column of a register array to another row or column within the register array
  • data may be copied from a row of one register array to a row of another register array, from a column of one register array to a column of another register array, from a row of one register array to a column of another register array and/or from a column of one register array to a row of another register array.
  • Rows and columns as well as the register array to which they correspond may be specified, in accordance with the present invention, as command operands.
  • Figure 20A illustrates a system 2000 implemented in accordance with the present invention.
  • the system includes an integrated circuit 2001, an output device 2006, e.g., a display, and an input device 2008, e.g., a keyboard.
  • the integrated circuit 2001 includes a processor 2002, memory 2007 and two register arrays RA1 and RA2 implemented in accordance with the present invention.
  • Register arrays RA1 and RA2 are coupled by a data bus 2003 and control lines 2005 to an I/O and register control device 2004 included in the processor 2002.
  • the device includes combination logic for controlling register access under direction of the programmable processor 2002.
  • the memory 2007, output device 2006, and input device 2008 are also coupled to the I/O and register control device 2004.
  • instructions, e.g., obtained from memory 2007, involving registers RA1 and RA2, are executed by the processor 2002 via control signals generated by the I/O and register control device 2004.
  • Fig. 20B illustrates the two two-dimensional 2x2 sub-word register arrays RA1 and RA2 in greater detail.
  • the register arrays RA1, RA2 may be implemented using the circuitry of Fig. 10A.
  • a move instruction may be specified as follows:
  • C/R is an operand which identifies a particular column or row of a register array
  • RA is an operand which identifies a particular register array.
  • the first occurrence of the operands (C/R) (RA) specify the source of the data to be moved while the second occurrence of the operands (C/R) (RA) specify the destination of the data being moved.
  • SIMD commands such as copy, add, sub, etc.
  • operands which specify the row or column of a source register array and the row or column of a destination register array.
  • a transpose command is also supported by the processor and register array of the present invention illustrated in Fig. 20A.
  • a transpose command receives as operands a source array identifier and a destination array identifier.
  • the transpose command may be:

Abstract

Methods and apparatus for implementing single instruction multiple data (SIMD) signal processing operations are described. The apparatus of the present invention includes new registers and register arrays which allow data to be accessed at a word as well as sub-word or sub-register level. The registers and register arrays of the present invention may be used when implementing a system based on a SIMD architecture. Registers implemented in accordance with the present invention include a plurality of pass gates that allow an entire n-bit word stored in the register to be accessed and output as a single word or for a sub-word portion of a stored word to be accessed and output. During standard operation the registers are accessed on a word basis. However, during column access operations, e.g., when performing a transpose operation, access is performed on a sub-word basis. The ability to access the registers of the present invention on a word or sub-word level makes implementing transpose and various other row/column data manipulation operations possible in a relatively straightforward manner without data buffering. In addition to the novel registers and register arrays of the present invention, various aspects of the present invention are directed to new and novel SIMD instructions, e.g., SIMD move, add, and copy instructions, which support the specification of data to be processed as operands which identify rows or columns of register arrays as opposed to merely identifying registers as done with conventional commands. A transpose command is also supported.

Description

REGISTERS AND METHOD FOR ACCESSING DATA THEREIN FOR USE IN
A SINGLE INSTRUCTION MULTIPLE DATA SYSTEM
FIELD OF THE INVENTION
The present invention relates to methods and apparatus, including, e.g., registers and register arrays, for implementing single instruction multiple data (SIMD) signal processing operations.
BACKGROUND OF THE INVENTION
The processing of two-dimensional sets of data is growing in importance as the use of computers continues to grow. Two-dimensional sets of data are frequently used to represent, e.g., images.
In the digital processing of two-dimensional signals, e.g., data sets, it is possible, for example when performing some two-dimensional filtering such as a low pass filtering operation or some two-dimensional transformation such as an inverse discrete cosine transform (IDCT) operation, to treat a two-dimensional operation as a series of two, one-dimensional operations. This is possible due to a mathematical property called separability. This separability property allows a complex two-dimensional process to be implemented as a series of two, one-dimensional processes. Sequential one-dimensional processes tend to be far less complicated algorithms to implement, than a corresponding two-dimensional process. For this reason, the property of separability is frequently used to implement two-dimensional data processing operations. In implementing a two-dimensional operation as two, one-dimensional operations, the one-dimensional operations are applied sequentially in the horizontal and vertical directions of the data being processed. This is illustrated in Fig. 1 where the two-dimensional operation HV is implemented as two sequential processing operations H, V on the data set A 100 to produce the two-dimensional data set HV(A) 104. The intermediate data set H(A) 102 is produced as the result of the application of the horizontal function H to the data set A 100.
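The separability property described above can be checked numerically with a short sketch. The 3-tap kernel, the sample data and the helper names (`filter_1d`, `apply_h`, `apply_v`, `filter_2d`) are assumptions chosen for illustration and do not come from the application. The sketch applies a horizontal pass H followed by a vertical pass V and compares the result with a direct two-dimensional filter whose kernel is the outer product of the one-dimensional taps.

```python
def filter_1d(samples, taps):
    """Apply a 1-D filter to a list, treating out-of-range samples as zero."""
    half = len(taps) // 2
    out = []
    for i in range(len(samples)):
        acc = 0
        for k, t in enumerate(taps):
            j = i + k - half
            if 0 <= j < len(samples):
                acc += t * samples[j]
        out.append(acc)
    return out


def transpose(m):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def apply_h(a, taps):
    """Horizontal pass H: filter every row."""
    return [filter_1d(row, taps) for row in a]


def apply_v(a, taps):
    """Vertical pass V: transpose, filter rows, transpose back."""
    return transpose(apply_h(transpose(a), taps))


def filter_2d(a, taps):
    """Direct 2-D filter using the outer-product kernel taps[r] * taps[c]."""
    rows, cols = len(a), len(a[0])
    half = len(taps) // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0
            for kr, tr in enumerate(taps):
                for kc, tc in enumerate(taps):
                    rr, cc = r + kr - half, c + kc - half
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += tr * tc * a[rr][cc]
            out[r][c] = acc
    return out


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # illustrative data set A
TAPS = [1, 2, 1]                        # illustrative low pass taps
H_A = apply_h(A, TAPS)                  # intermediate data set H(A)
HV_A = apply_v(H_A, TAPS)               # final data set HV(A)
```

Note that the vertical pass in this sketch is itself written as transpose, row pass, transpose, which is the access pattern the remainder of the background is concerned with.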
Suppose that data words, each represented by a separate box, are arranged in a memory in "raster-scan" order as illustrated in Fig. 2. In such an arrangement, data words beginning at the top left of a two-dimensional data array 200, following to the right and down to the bottom right data element are stored at sequential locations in memory, as illustrated by the row of blocks 202 representing sequential memory locations. In processing the two-dimensional data in the horizontal direction the arrangement of the samples in the one-dimensional structure is convenient because each data sample follows the next. In order to process the data in the vertical direction it is clear from the first two shaded squares in Fig. 2 that access to the data is not as straightforward because there is a jump between the consecutive samples as represented by the arrow 203.
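The difference between horizontal and vertical access in raster-scan order can be made concrete with a small address calculation. The array dimensions and the function names below are assumptions for illustration only; the point is that a row occupies consecutive addresses while a column is reached by jumping a full row width between samples.

```python
WIDTH, HEIGHT = 4, 3                      # illustrative array dimensions
memory = list(range(WIDTH * HEIGHT))      # memory[i] holds raster-scan sample i


def row_addresses(row):
    """Addresses of one horizontal row: consecutive locations."""
    start = row * WIDTH
    return list(range(start, start + WIDTH))


def column_addresses(col):
    """Addresses of one vertical column: a jump of WIDTH between samples."""
    return [row * WIDTH + col for row in range(HEIGHT)]


row_1 = row_addresses(1)        # [4, 5, 6, 7]
column_0 = column_addresses(0)  # [0, 4, 8]
```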
One known method of solving the problem of accessing the vertical columns of data for performing the vertical processing operation is to store the results from the horizontal processing operation in transposed order. This is shown in Fig. 3 wherein the shaded blocks representing a vertical column of data are now arranged horizontally.
As a result of the mathematical transpose, accessing the vertical information is simple. At the end of the processing for the vertical direction, the transpose of the resulting data must normally be performed to restore the arrangement to the natural order for use in subsequent operations, e.g., the generation of video images for display.
Another method of accessing data to perform sequential horizontal and vertical data processing operations involves addressing the data that is stored in memory using a pointer that jumps to the next desired data sample. This method has the advantage, as compared to the transpose technique discussed above, that it does not require that the data undergo an additional transposition step in order to restore the natural data ordering for use in subsequent operations.
In high-performance implementations of digital signal processing algorithms, which may include various real time image processing applications, it is good practice to keep data that is being processed in hardware registers close to the main computational unit in order to minimize processing delays due to data transfer operations. The computational unit may be, e.g., a programmable signal processing core or some fixed function hardware. As a result of the "closeness" of the data registers to the computational unit, the computational unit can operate directly on the registers.
In cases where the data is not located in registers coupled closely to the computational unit, the data has to be fetched from cache or other memory and this results in reduced system performance. By keeping data which is frequently used in data registers which are directly accessible to a computational unit, a high level of computational speed can be maintained throughout the lifetime of a computation without having the computational unit stall due to data being in lower speed storage such as a cache or main memory.
Single-Instruction Multiple Data (SIMD) architecture systems allow multiple data elements to be processed simultaneously in response to a single instruction. The multiple data units may be stored in a single register. Well designed SIMD architectures can provide considerable performance advantages over more traditional Single-Instruction Single Data (SISD) architecture systems because of the simultaneous processing of multiple pieces of data made possible by the SIMD architecture. MMX technology from Intel Corporation currently in use in computer CPUs is one example of a SIMD architecture.
Unfortunately the above described techniques of performing sequential horizontal and vertical processing operations are not straightforward when the data is stored in registers in a format that is used by SIMD architectures. In such a situation, the manipulations that are required to obtain the desired data arrangement are relatively difficult to implement.
Consider, for example, a SIMD architecture that operates on two data samples at the same time. In such a SIMD architecture the data samples have to be presented to the processing unit in the arrangement shown in the diagram of Fig. 4A. Here, one word 400 that is n-bits in length contains two sub-words 402, 404, each n/2-bits in length. Even though one n-bit word 400 is presented to the processor, there are actually two pieces of data, sub-words b 402 and a 404, that are embedded in that word 400. When presented to the SIMD processing unit, each of these halves is handled separately. This is one of the primary features of SIMD processing.
As an example of a SIMD processing operation, suppose that it is desired to add two sets of numbers, {a, b} and {c, d}, to produce {a+c} and {b+d}. In the SIMD architecture, it is possible to set up two data elements 406, 408 similar to the one shown in Fig. 4A. One of these 406 would contain the set {a, b} and the other 408 would contain the set {c, d}. They may be presented to the SIMD processing unit for the desired addition. The processing unit treats the two halves of the input data words as independent quantities during the computation. An important consequence of this is that if the addition for the lower half overflows, the overflow will not affect the upper half. It can be seen from this example that the SIMD architecture is extremely beneficial for processing multiple pieces of data in parallel.
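The independence of the two halves can be sketched in software. The word width of 16 bits, the wraparound overflow behavior and the helper names (`pack`, `unpack`, `simd_add`) are assumptions made for illustration; the application does not fix a width or an overflow policy. The sketch contrasts a lane-wise add with an ordinary full-word add, in which a carry out of the lower half leaks into the upper half.

```python
N = 16                   # assumed word width in bits
HALF = N // 2            # sub-word width
MASK = (1 << HALF) - 1   # selects one sub-word


def pack(upper, lower):
    """Build an n-bit word from two sub-words (lower sub-word in the low bits)."""
    return ((upper & MASK) << HALF) | (lower & MASK)


def unpack(word):
    """Split an n-bit word into (upper, lower) sub-words."""
    return (word >> HALF) & MASK, word & MASK


def simd_add(x, y):
    """Add two packed words lane by lane; carries never cross the lane boundary."""
    xu, xl = unpack(x)
    yu, yl = unpack(y)
    # pack() masks each sum, so an overflow in one lane is simply discarded.
    return pack(xu + yu, xl + yl)


a, b, c, d = 200, 1, 100, 2            # a + c overflows an 8-bit lane
word_ba = pack(b, a)                   # word holding {a, b}
word_dc = pack(d, c)                   # word holding {c, d}
lanes = unpack(simd_add(word_ba, word_dc))
plain = unpack((word_ba + word_dc) & ((1 << N) - 1))
```

Here `lanes` holds the lane-wise result and `plain` the result of a conventional n-bit addition on the same packed words, whose upper half is off by one because of the carry.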
The inventors of the present application have discovered that various problems are encountered when one attempts to implement two-dimensional signal processing algorithms on SIMD architecture using local registers to provide high-performance signal processing implementations. For example, when processing two-dimensional signals, the SIMD architecture poses the following problem when data is to be transposed. Suppose that it is desired to obtain the transpose of the matrix:
a b
c d
where the data is arranged in registers 0 and 1 as shown in Fig. 5. Note that the little-endian data scheme is used for the examples in this application, however this is simply for purposes of explanation of the invention and in no way limits the present invention to use only with little-endian data schemes. The transposed matrix will have the arrangement shown in Fig. 6.
Unfortunately, when two items of data, e.g., sub-words a and b, are packed into a conventional long register, the individual elements cannot be accessed efficiently. That is, direct data access is limited to the full word (ba) and not one of the sub-words (b) or (a). This register access limitation which exists in conventional registers makes it relatively difficult to transform the data arrangement of Fig. 5 into the transposed arrangement of Fig. 6. This is because it is not possible to access directly the individual data sub-words of a conventional register.
Various known approaches to transposing data stored in registers include the use of software or the use of special transposition hardware. Software has the advantage of being flexible in that minor modifications to the software of a program can allow the program to transpose arrays of different shapes and sizes. Unfortunately, software approaches have the major disadvantage of being relatively slow and time consuming because of the relatively large number of clock cycles required and the need to transfer and store the contents of the registers in, e.g., memory, while the register contents are being processed according to the software instructions. Known special transposition hardware also suffers several disadvantages. These include the need to use sequential logic, e.g., logic which includes buffers or delay elements, or logic which is limited in terms of the size and/or shape of an array which can be transposed. The use of sequential logic introduces undesirable time delays while constraints on the size and shape of arrays which can be transposed limit the utility of special transposition hardware to specific applications.
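To give a sense of what a software approach involves, the following is a minimal sketch of a 2x2 transpose on packed words using shifts and masks. It is not taken from the application: the 16-bit word width and the function name `transpose_2x2_software` are assumptions, and the sketch only illustrates that each sub-word must be extracted into a temporary and re-packed, which is the kind of extra work the passage describes.

```python
N = 16                   # assumed word width in bits
HALF = N // 2
MASK = (1 << HALF) - 1


def transpose_2x2_software(r0, r1):
    """Transpose a 2x2 matrix held as r0 = (b, a) and r1 = (d, c).

    Returns the words (c, a) and (d, b), lower sub-word written last.
    """
    # Extract every sub-word into its own temporary value.
    a = r0 & MASK
    b = (r0 >> HALF) & MASK
    c = r1 & MASK
    d = (r1 >> HALF) & MASK
    # Re-pack the temporaries in transposed order.
    return (c << HALF) | a, (d << HALF) | b


r0 = 0x2211              # b = 0x22, a = 0x11
r1 = 0x4433              # d = 0x44, c = 0x33
t0, t1 = transpose_2x2_software(r0, r1)
```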
In view of the above discussion, it becomes apparent that there is a need for new and improved methods and apparatus for accessing and transposing two-dimensional sets of data stored in hardware registers. It is desirable that such improved methods and apparatus be compatible with SIMD architectures and the data access requirements of such architectures. In particular, it is desirable that any new methods or apparatus allow the contents of a register to be accessed as a single unit or as a plurality of sub-units.
From a performance perspective, it is also desirable that any new methods and apparatus be capable of being implemented without the need for buffering or other temporary storage of register contents which can cause performance delays.
In addition to supporting transpose operations it is desirable that new and improved methods and apparatus for manipulating the contents of registers be capable of supporting data processing operations, other than transpose operations, which may require the manipulation of data in data units which are smaller than the full size of a utilized data register.
New SIMD instructions capable of taking advantage of the processing capabilities of any new methods and apparatus are also desirable.
SUMMARY OF THE PRESENT INVENTION
The present invention is directed to methods and apparatus for implementing single instruction multiple data (SIMD) signal processing operations. An object of the present invention is to provide an efficient register structure that allows the mathematical transposition of two-dimensional data to be performed with very low cost. The present invention utilizes a two-dimensional SIMD register array that is used as the main work space in the high performance digital signal processing of two-dimensional signals. The apparatus of the present invention includes new and useful registers and register arrays suitable for use when implementing a system based on a SIMD architecture.
Registers implemented in accordance with the present invention include circuitry that allows an entire n-bit word stored in a register to be accessed and output in word or sub-word units. During standard operation the registers are accessed on a word basis. However, during column data access operations, e.g., when performing a transpose operation, access is performed on a sub-word basis. The ability to access the registers of the present invention on a word or sub-word level makes implementing transpose and various other row/column data manipulation operations possible in a relatively straightforward manner without data buffering.
In addition to the novel registers and register arrays of the present invention, various aspects of the present invention are directed to new and novel SIMD instructions, e.g., SIMD move, add, and copy instructions, which support the specification of data to be processed as a row or column of a register array as opposed to merely identifying registers as done with conventional commands. A transpose instruction which accepts a register array identifier as an operand is also supported. The present invention is also directed to additional methods for accessing and using the novel registers and register arrays of the present invention.
As discussed above, various embodiments of the present invention are directed to an efficient register, and arrays of such registers, that allow the mathematical transposition of two-dimensional data to be performed with relatively little hardware and at high speeds without the need to use delay elements or buffers. An array of the new and novel registers of the present invention will, on occasion, be referred to herein as a two-dimensional SIMD register array. Such a register array may be used as the main work space in a SIMD processor used for high performance digital signal processing of two-dimensional signals. The register arrays of the present invention provide a new method of transposing two-dimensional data in a high performance signal processing system. The register arrays of the present invention are able to transpose a variety of matrix shapes - not just square matrices. It is also possible for a single register array to perform the transpose of multiple matrices. It should be noted that the processing of signals with greater than two dimensions can also benefit from the present invention, by considering a two-dimensional subset of the data at a time.
The register arrays of the present invention are suitable for high speed storage during the processing of two-dimensional signals. They may also be used with a programmable computational core and/or with some fixed function computational unit.
The two-dimensional arrays of the present invention can be used, e.g., in digital image compression applications, in image filtering applications and in digital video processing operations.
Numerous additional features and embodiments of the present invention are discussed below in the detailed description which follows.
Brief Description of the Drawings
Figure 1 illustrates the performing of a two-dimensional processing operation on a set of data as two sequential one-dimensional operations.
Figure 2 illustrates the storage of a two-dimensional array of data in a one-dimensional series in what is referred to in the art as "raster scan" order.
Figure 3 illustrates the storage of a two-dimensional array of data in a one-dimensional series in what is referred to in the art as "transposed" order.
Figure 4A illustrates a word comprising two sub-words.
Figure 4B shows an operation involving the addition of two words, each of which comprises two sub-words.
Figure 5 illustrates how a 2x2 array of data may be stored in the contents of two registers, each register storing a word comprising two sub-words.
Figure 6 illustrates the contents of two registers, illustrated in Fig. 5, in transposed order.
Figure 7 illustrates a known array of two registers.
Figures 8-10A illustrate register arrays implemented in accordance with the present invention.
Figure 10B is a table illustrating the values of control signals used to access data stored in the array of Fig. 10A.
Figure 11 is a diagram illustrating a 2x2 sub-word atomic register array unit, comprising 2 word registers, implemented in accordance with the present invention.
Figure 12 illustrates a 4x4 sub-word register array implemented using four of the atomic register arrays of the present invention illustrated in Fig. 11.
Figures 13 and 14 illustrate the storage of non-square data in register arrays implemented in accordance with the present invention.
Figures 15-17 illustrate various register arrays implemented in accordance with different embodiments of the present invention.
Figure 18 is a table illustrating the values of control signals when used to access data stored in the array of Fig. 17.
Figure 19 is a representation of a 4x4 sub-word register array implemented using 4 word registers in accordance with the present invention.
Figure 20A is a diagram of a processing system implemented in accordance with the present invention.
Figures 20B-20D illustrate the contents of registers RA1 and RA2 of Fig. 20A at different times.
Detailed Description
As discussed above, the present invention is directed to methods and apparatus for implementing single instruction multiple data (SIMD) signal processing operations. Various embodiments are directed to new and useful registers and register arrays suitable for use when implementing a system according to a SIMD architecture. The register and register arrays of the present invention allow the implementation of direct transpose and various other row/column data manipulation operations in an efficient manner without intermediate data buffering. In addition to the novel registers and register arrays of the present invention, various aspects of the present invention are directed to new and novel SIMD instructions, e.g., a SIMD transpose instruction, and methods for using the novel registers and register arrays of the present invention. In accordance with the present invention, a hardware approach is taken to solving the problem of manipulating row/column data, e.g., to perform a transpose operation on data included in a two-dimensional array. One particular feature of the present invention is directed to circuitry that allows a general purpose register file in a SIMD architecture machine to read and/or write data into registers in a manner that allows two-dimensional data to be processed efficiently along either rows or columns. To facilitate an understanding of the SIMD register array of the present invention, a conventional register array 700, shown in Fig. 7, will first be discussed.
Fig. 7 illustrates a conventional register array 700 with two separate registers 702, 704, each n bits in length. The individual first and second registers 702, 704 may be accessed using the control lines which are supplied with control signals c0 and c1. The n output data lines from each of the two registers 702, 704 are joined together via a system of pass gates 703, 705, which are sometimes referred to as pass gate arrays. The term pass gate is used here to refer to a switching device. Pass gates may be implemented with, e.g., tri-state logic, and take the form of transmission gates, multiplexers, or other similar circuitry. Pass gates may be capable of bus control. Pass gates of the type used in the present invention are commonly used to allow the multiplexing of data from a number of devices while avoiding electrical conflicts. The control signals c1, c0 are supplied to the system of pass gate arrays. The appropriate manipulation of the control signals ensures proper behavior of the register array 700. Note that in the Fig. 7 example, the first and second registers 702, 704 may be part of a SIMD architecture system and that implicit within each register there are two n/2-bit sub-words, (d, c) and (b, a), respectively.
In Fig. 7, the symbol Z is used to represent an n-bit bus. The bus Z includes n data lines, z1, z2, ..., zn. In the known register array 700, the control signals, c0 and c1, may be used to select the contents (d, c), (b, a) of either register 702, 704 but it is not possible to obtain the sub-words a, b, c, or d separately.
In the Fig. 7 example and in various other examples included in the present application, additional control elements, e.g., logic gates, which are not illustrated, are used to manage the generation of control signals used to read and write data from the illustrated register arrays. The control elements that are not illustrated may be conventional control circuits and/or control circuits implemented in accordance with the teachings of the present invention included in this application. Such control logic may be implemented using conventional components such as logic gates and/or multiplexers (MUXes).
In the known system illustrated in Fig. 7, when accessing the first register 702 the control signal c0 is enabled while the control signal c1 is maintained in a disabled state. This causes the pass gates at the first register 702 to be enabled and those at the second register 704 to remain disabled. It is then possible to access the entire contents of the first register 702 without affecting or being affected by the second register 704. When accessing the second register 704 the control signal c1 is enabled, while the control signal c0 is maintained in a disabled state. This causes the pass gates at the second register 704 to be enabled and those at the first register 702 to remain disabled. In such a case, it is possible to access the entire contents of the second register 704 without affecting or being affected by the contents of the first register 702.
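The word-only access of the conventional array can be summarized with a small behavioral model. This is a sketch of the control behavior described above, not of the circuit itself; the function name `read_conventional`, the 16-bit example values and the choice to raise an error on a conflicting selection are assumptions for illustration.

```python
def read_conventional(reg_702, reg_704, c0, c1):
    """Behavioral model of the Fig. 7 array: c0 selects 702, c1 selects 704.

    Returns the full word placed on bus Z, or None when no pass gate is enabled.
    Only whole words can be selected; there is no way to pick out one sub-word.
    """
    if c0 and c1:
        # Both pass gate arrays would drive the bus at once.
        raise ValueError("c0 and c1 must not be enabled together")
    if c0:
        return reg_702
    if c1:
        return reg_704
    return None


reg_702, reg_704 = 0x4433, 0x2211      # illustrative register contents
first = read_conventional(reg_702, reg_704, c0=True, c1=False)
second = read_conventional(reg_702, reg_704, c0=False, c1=True)
idle = read_conventional(reg_702, reg_704, c0=False, c1=False)
```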
Unfortunately, as discussed above, the known register array illustrated in Fig. 7 does not allow for the sub-word elements stored therein to be directly accessed, making it difficult to use such a register array when trying to individually process sub-word data elements, e.g., to perform a transpose operation.
Figure 8 illustrates a register array 800 implemented in accordance with a first embodiment of the present invention which is designed to make obtaining a transpose of the data stored in the register array 800 relatively easy. As illustrated, the register array 800 comprises first and second registers 802, 804. The first and second registers 802, 804 store the n-bit words (b, a) and (d, c), respectively. Note that each word is comprised of two n/2-bit sub-words as in the Fig. 7 example.
In Fig. 8 the symbols 'Z1' and 'Z2' are used to represent lower and upper sets of n/2 bus lines, respectively. In the register array 800, in accordance with the present invention, the two sub-words of each register 802, 804 are separated from the bus lines by their own set of first and second pass gates (806, 807) and third and fourth pass gates (808, 809), respectively. Pass gates 806 and 808 of the first and second registers 802, 804 are controlled by the control signal c2, which may be supplied by a common control line. Pass gates 807, 809 of the first and second registers 802, 804 are controlled by the control signal c3, which may be supplied to the pass gates 807, 809 via a common control line.
At the first register 802, the n/2 lines corresponding to each of the two sub-words (b, a) are joined together following the first and second pass gates 806, 807 to form the lower n/2 bits of the full n-bit word. At the second register 804, the n/2 lines corresponding to each of the two sub-words (d, c) are joined together following the third and fourth pass gates 808, 809 to form the upper n/2 bits of the full n-bit word output via the combination of lines Z1, Z2. When c2 is enabled and c3 is disabled the n-bit bus Z formed by the combination of the lower Z1 and upper Z2 bus lines is allowed access to sub-words {a, c}. When the control signal c3 is enabled and c2 is disabled the n-bit bus Z is allowed access to sub-words {b, d}. Thus, the control signal and pass gate arrangement illustrated in Fig. 8 allows the transpose of the register array contents to be easily obtained.
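The column access described for Fig. 8 can be modeled behaviorally as follows. The 16-bit word width, the example values and the function name `read_fig8` are assumptions for illustration. The mapping of c2 to the lower sub-word of each register and c3 to the upper sub-word is inferred from the statement that c2 gives access to {a, c} and c3 to {b, d}.

```python
N = 16                   # assumed word width in bits
HALF = N // 2
MASK = (1 << HALF) - 1


def read_fig8(reg_802, reg_804, c2, c3):
    """Behavioral model of the Fig. 8 array.

    Register 802 always drives the lower bus lines Z1 and register 804 the
    upper bus lines Z2; c2/c3 choose which sub-word of each register is passed.
    Returns the word on bus Z, or None when neither signal is enabled.
    """
    if c2 and c3:
        raise ValueError("c2 and c3 must not be enabled together")
    if c2:       # gates 806 and 808: lower sub-words a and c
        z1 = reg_802 & MASK
        z2 = reg_804 & MASK
    elif c3:     # gates 807 and 809: upper sub-words b and d
        z1 = (reg_802 >> HALF) & MASK
        z2 = (reg_804 >> HALF) & MASK
    else:
        return None
    return (z2 << HALF) | z1


a, b, c, d = 0x11, 0x22, 0x33, 0x44
reg_802 = (b << HALF) | a              # word (b, a)
reg_804 = (d << HALF) | c              # word (d, c)
col0 = read_fig8(reg_802, reg_804, c2=True, c3=False)   # word (c, a)
col1 = read_fig8(reg_802, reg_804, c2=False, c3=True)   # word (d, b)
```

Reading with c2 and then with c3 yields the two words of the transposed arrangement, each obtained in a single access and without any intermediate storage in the model.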
In accordance with another register array embodiment of the present invention, the pass gate features of register arrays 700 and 800 are combined to form a register array 900 illustrated in Fig. 9.
As illustrated in Fig. 9, the register array 900 includes first and second registers 902, 904. The output of each of the registers 902, 904 is controlled using a set of three pass gates.
In the case of the first register 902, an n line pass gate 903 and two n/2 line pass gates 906, 907 are used to control the output of the first register 902. The n output lines of the pass gate 903, which is controlled by control signal c0, are coupled to the corresponding n lines of the n line bus Z. The first and second n/2 line pass gates 906, 907 have their output lines coupled to the corresponding lower n/2 lines of the bus Z.
In the case of the second register 904, an n line pass gate 905 and two n/2 line pass gates 908, 909 are used to control the output of the second register 904. The n output lines of the pass gate 905, which is controlled by control signal c1, are coupled to the corresponding n lines of the n line bus Z. The third and fourth n/2 line pass gates 908, 909 have their output lines coupled to the corresponding upper n/2 lines of the bus Z.
In the register array 900, the pass gate arrangements of the previously discussed register array circuits 700, 800 are combined so that the resulting register array 900 includes the functionality of both. That is, it is possible to access the register array 900 in the conventional manner described in regard to Fig. 7, using control signals c1 and c0, and obtain the entire words stored in registers 902, 904, one word at a time. It is also possible to access registers 902, 904 in the manner discussed with regard to Fig. 8, using control signals c2 and c3 to access one sub-word from each of the two registers 902, 904 at a time.
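The combined access modes of the register array 900 may be sketched as below. The sketch is a behavioral model under stated assumptions: the function name read_fig9 is hypothetical, a word is modeled as an (upper, lower) pair, and the assignment of c2 to the second sub-word of each register and c3 to the first sub-word is assumed by analogy with Fig. 8 rather than taken from Fig. 9.

```python
# Behavioral sketch of the Fig. 9 register array (illustrative only).
# Assumption: words are (upper, lower) tuples and bus Z is returned the same way.

def read_fig9(reg_902, reg_904, c0=0, c1=0, c2=0, c3=0):
    """Return the contents of bus Z when exactly one control signal is active."""
    if sum(bool(s) for s in (c0, c1, c2, c3)) != 1:
        raise ValueError("exactly one control signal must be active")
    if c0:
        return reg_902  # whole word via n line pass gate 903
    if c1:
        return reg_904  # whole word via n line pass gate 905
    first_902, second_902 = reg_902
    first_904, second_904 = reg_904
    if c2:
        # Register 902 drives the lower half of Z, register 904 the upper half.
        return (second_904, second_902)
    return (first_904, first_902)


if __name__ == "__main__":
    r902, r904 = ("b", "a"), ("d", "c")
    print(read_fig9(r902, r904, c0=1))  # conventional access to register 902
    print(read_fig9(r902, r904, c2=1))  # sub-word access across both registers
```

Signals c0 and c1 return stored words unchanged, while c2 and c3 return one sub-word from each register, so a single structure serves both the conventional and the transposed access.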
Thus, in accordance with the present invention, when using the register array 900 illustrated in Fig. 9, control signals c0 and c1 are used to access the first and second registers 902, 904 in the traditional manner, while control signals c2 and c3 are used to access the register array in the above discussed manner, which facilitates obtaining a "transpose" of the data sub-words stored in registers 902, 904. In one particular embodiment, the register array 900 of the present invention is included in a programmable system where the states of the control signals c0, c1, c2, c3 are a function of a coded operand of a processing instruction being executed. Such a case will be discussed in greater detail below with reference to Figs. 20A-20C.
When embodied in a synchronous fixed function system as opposed to a programmable system, it is contemplated that the state of the control signals c0, c1, c2, c3 would depend on the output of a state machine implemented, e.g., using combinational and sequential logic.
Figure 10A illustrates another two register array 1000 implemented in accordance with the present invention. In the Fig. 10A embodiment, three n/2 line pass gates 1006, 1008, 1009 are used with the first register 1002. Another three n/2 line pass gates 1116, 1118, 1119 are used with the second register 1004 of the present invention in the manner illustrated in Fig. 10A. The Fig. 10A embodiment uses a separate control signal, c0, c1, c2, c3, c4, c5, to control each of the pass gates 1006, 1008, 1009, 1116, 1118, 1119, respectively. While the Fig. 10A embodiment uses the same number of pass gates as the Fig. 9 embodiment, note that the use of an n line pass gate is avoided in the Fig. 10A embodiment while two additional control signals are employed. Because the need for n line pass gates is eliminated, the Fig. 10A embodiment may offer certain hardware advantages over the Fig. 9 embodiment.
The six control signals c0, c1, c2, c3, c4, c5 illustrated in Fig. 10A are used to manage the way the registers 1002, 1004 are accessed. Fig. 10B is a table showing the states to which the six control signals are set, e.g., by the control logic, to achieve the various data access operations set forth in the left side of the table. For example, in order to access the word (a, b) stored in the first register 1002, control signals c0 and c2 would be set to 1 and the remaining control signals would be set to 0.
Figures 9 and 10A show two exemplary circuits of the present invention, each of which operates as a basic two-dimensional register array suitable for use with a SIMD architecture that partitions a single word into two sub-words. Each of the register arrays 900 and 1000 may be treated as an "atomic" structure in that it can serve as a building block that may be used to construct larger register arrays in accordance with the present invention.
An important feature of the Fig. 9 and 10A register arrays is their ability to facilitate transposition of 2x2 data blocks. By arranging the atomic structure, e.g., the Fig. 9 or 10A register arrays 900, 1000 in groups, the basic two-dimensional register array 900 or 1000 may be scaled to accommodate larger data blocks. An atomic two-dimensional register array 1100 of the present invention, capable of being implemented e.g., using either the register arrays illustrated in Fig. 9 or 10A, is illustrated in Fig. 11. The register array 1100 comprises first and second n-bit registers 1101, 1102. Note how the dashed line 1103 alludes to the partitioned nature of the first and second SIMD registers 1101, 1102 in the array 1100, and the n/2 bit sub-word stored in each half of the SIMD register 1101, 1102.
The process of accessing a 2x2 sub-word matrix created by the register array 1100 may be visualized by considering that the data enters the register array 1100 using the word inputs in0 and in1 shown on the left. Data exits the register array 1100 in either the standard (non-transposed) manner via word outputs os0, os1, or in transposed form via word outputs ot0, ot1. In Fig. 11, "in0" stands for input number 0, "ot0" stands for transposed output number 0, and "os0" stands for standard output number 0. The two transposed outputs ot0 and ot1 are shown at the top of the register array 1100. The two standard outputs os0 and os1 are shown at the right side of register array 1100. The two-dimensional array 1100 may be considered to be "atomic" because it is the smallest two-dimensional register array that may be constructed in accordance with the two-partition SIMD architecture of the present invention. Using the "atomic" structure illustrated in Fig. 11, larger register arrays may be created by combining multiple arrays 1100.
Square MxM sub-word register arrays may be implemented by using an M/2 x M/2 arrangement of the atomic register arrays of the present invention. For example, the 4x4 sub-word register array 1200 may be constructed as shown in Fig. 12. As illustrated, four register arrays of the type illustrated in Fig. 11 are used to form the register array 1200.
Note that in Fig. 12, for illustration purposes, only the register inputs and transposed outputs are illustrated. The register array 1200 also includes standard (non-transposed) outputs which are not illustrated.
The 4x4 sub-word register array 1200 may be used to form the transpose of matrices that are up to 4x4 sub-words in size. Lower order matrices and non-square matrices may also be accommodated by the structure.
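The reason a larger array can be assembled from 2x2 atomic units may be illustrated with a short sketch: the transpose of a matrix partitioned into 2x2 blocks is obtained by transposing each block and exchanging the positions of the off-diagonal blocks. The function names below are hypothetical, and the sketch models only the data movement, not the wiring of Fig. 12.

```python
# Sketch: transposing a 2k x 2k sub-word matrix using only 2x2 block transposes.
# Illustrative model of the data movement; not a description of the Fig. 12 circuit.

def transpose_2x2(block):
    """Transpose one 2x2 block, as an atomic register unit would."""
    (a, b), (c, d) = block
    return [[a, c], [b, d]]


def transpose_by_blocks(matrix):
    """Transpose a square matrix whose side length is even, block by block."""
    size = len(matrix)
    if size % 2 or any(len(row) != size for row in matrix):
        raise ValueError("matrix must be square with an even side length")
    out = [[None] * size for _ in range(size)]
    for r in range(0, size, 2):
        for c in range(0, size, 2):
            block = [row[c:c + 2] for row in matrix[r:r + 2]]
            t = transpose_2x2(block)
            # The block taken from position (r, c) lands at position (c, r).
            for i in range(2):
                for j in range(2):
                    out[c + i][r + j] = t[i][j]
    return out


if __name__ == "__main__":
    m = [list("abcd"), list("efgh"), list("ijkl"), list("mnop")]
    for row in transpose_by_blocks(m):
        print(" ".join(row))
```

Each call to transpose_2x2 corresponds to reading one atomic unit through its transposed outputs; the outer loops correspond to choosing which unit supplies which part of the result.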
When entering data to be transposed into a register array implemented in accordance with the present invention, the data should be entered in a manner that allows the transpose of the data to be obtained from the square register array 1200. For example, the array of sub-words:
[Array of sub-words: figure imgf000027_0001]
should be entered into the two-dimensional register array so that each 2x2 sub-block within the array is stored in a different one of the four atomic register units comprising the array 1200. In addition, the array contents should be stored in such a manner that the content of each 2x2 sub-block will be aligned with a boundary of an atomic register unit.
Fig. 12 illustrates a possible way to store the array of sub-words illustrated above with proper register array alignment.
As another example of array storage, consider the 2x4 rectangular sub-word array:
a b c d
e f g h
In accordance with the present invention this array should be stored using the upper two register units of the array 1200 as illustrated in Fig. 13. As a final example of using the register array 1200, consider the 3x3 array below.
a b c
d e f
g h i
When storing the above array in the register array 1200, the data should be arranged in the manner shown in Fig. 14. Note that, due to the SIMD nature of the system, half of the word registers included in the array 1200 are left with at least a portion of the register contents undefined or with "don't care" data, as represented by the Xs illustrated in Fig. 14.
When the transpose outputs are taken in the Fig. 14 embodiment, it will be seen that the registers t10, t11 and t12 are defined only in the lower half because there is no valid data from the transposition operation to be placed in the upper halves.
Generally, an H by V array of n/2 bit sub-words, where H and V are positive integers, can be stored in an X x Y array of n-bit registers, arranged as an array of the atomic register units of the present invention, where:
X is equal to H/2 if H is even, and equal to int(H/2) plus one if H is odd; and
Y is equal to V/2 if V is even, and equal to int(V/2) plus one if V is odd.
In such an implementation, for proper storage, each one of the V rows of n/2 bit sub-words to be stored is loaded into a different corresponding one of the Y rows of registers in an X x Y register array implemented in accordance with the present invention.
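The sizing rule for X and Y given above amounts to rounding H/2 and V/2 up to the next integer. A minimal sketch follows; the function name register_array_shape is hypothetical.

```python
# Sketch of the X, Y sizing rule stated above (rounding H/2 and V/2 upward).

def register_array_shape(h, v):
    """Return (X, Y) for an H by V array of n/2 bit sub-words."""
    if h <= 0 or v <= 0:
        raise ValueError("H and V must be positive integers")
    x = h // 2 if h % 2 == 0 else h // 2 + 1
    y = v // 2 if v % 2 == 0 else v // 2 + 1
    return x, y


if __name__ == "__main__":
    for h, v in [(4, 4), (4, 2), (3, 3)]:
        print((h, v), "->", register_array_shape(h, v))
```

For example, the array with four sub-words per row and two rows discussed above gives X = 2 and Y = 1, and the 3x3 array gives X = 2 and Y = 2.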
Because the register array of the present invention is scalable, a register array 1500 comprising any desired even number, k, of atomic register units 1502, 1504, 1506 may be constructed as shown in Fig. 15. The control signals in Fig. 15 are operated in such a way that only the control signals for one atomic block 1502, 1504, or 1506 are active at a given time. In the Fig. 15 embodiment, control signals are labeled as, e.g., c10, where the first number (1) identifies the atomic block, i.e., the first atomic block 1502, and the second number (0) identifies the pass gate within the block which is being controlled, i.e., the first gate in the case of the value 0. In accordance with the present invention, the active atomic block, e.g., atomic block 1502, may be specified as an operand of a software command. The pattern of control signaling within the active atomic block, e.g., block 1502, to achieve a desired output, would be as shown in Fig. 10B.
The examples illustrated in Figs. 9-15 are based on a SIMD architecture system in which two sub-words are included in a long word. In accordance with the present invention other partitions of a long word are possible. When implementing register arrays in accordance with the present invention, the number of partitions in a word that are to be supported will determine the size of the atomic register array of the present invention that supports such a partition arrangement and the ability to output the data in standard or transposed form.
Consider, for example, a SIMD architecture that uses four partitions of a long word. In such a system, the atomic two-dimensional register array will be of sub-word order 4x4. A circuit 1600 for this atomic register array is shown in Fig. 16. Note that the atomic register array 1600 comprises four n-bit word registers 1602, 1604, 1606, 1608 the contents of which may be accessed on a word or sub-word basis where, in this embodiment, a sub-word is one fourth the size of an n-bit word. A separate n-bit pass gate is used in the Fig. 16 embodiment to control the word output of each register 1602, 1604, 1606, 1608. In addition, four n/2 line pass gates are used in conjunction with each of the four registers 1602, 1604, 1606, 1608 to control the sub-word outputs of these registers.
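The relationship between the number of partitions and sub-word access may be illustrated at the bit level with the sketch below. It treats each register as an integer and gathers one sub-word from every register into a single output word. The function names, the choice of sub-word 0 as the least significant bits, and the placement of the sub-word from register r at sub-word position r of the output are all assumptions made for illustration.

```python
# Bit-level sketch of column access in an atomic array with p partitions per word.
# Assumptions: registers are Python ints, sub-word 0 is the least significant,
# and the atomic array has as many registers as there are partitions.

def sub_word(word, index, n, parts):
    """Extract sub-word `index` from an n-bit word split into `parts` pieces."""
    width = n // parts
    mask = (1 << width) - 1
    return (word >> (index * width)) & mask


def column_word(registers, index, n):
    """Pack sub-word `index` of every register into one n-bit word."""
    parts = len(registers)
    width = n // parts
    out = 0
    for row, word in enumerate(registers):
        out |= sub_word(word, index, n, parts) << (row * width)
    return out


if __name__ == "__main__":
    regs = [0x3210, 0x7654, 0xBA98, 0xFEDC]  # four 16-bit words, 4-bit sub-words
    print([hex(column_word(regs, i, 16)) for i in range(4)])
```

With two registers and n/2 bit sub-words the same functions model the two-partition case, which is why the size of the atomic array follows directly from the number of partitions.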
The architecture of the Fig. 16 register array 1600 is similar to that of the Fig. 9 register array where two sets of pass gates are used to support both traditional and transposed register access operations.
A 4x4 sub-word register array 1700 implemented in accordance with another embodiment of the present invention can be seen in Fig. 17. The register array 1700 is implemented as four segments 1701, 1703, 1705, 1707, with each segment including an n-bit register 1702, 1704, 1706, 1708, respectively, and seven n/4 line pass gates coupled together as illustrated in Fig. 17. Note that in each of the segments 1701, 1703, 1705, 1707 the sub-word outputs of three of the four sub-words stored in the segment's n-bit register are coupled to two different n/4 line pass gates included in the segment, while one of the four sub-words stored in the register is coupled to a single n/4 line pass gate. The array 1700 has been simplified by combining the registers and the various pass gate arrays into the rectangular segments 1701, 1703, 1705, 1707. Note that the n/4 bit sub-word outputs of each of the register units 1701, 1703, 1705, 1707, provided via buses Z1, Z2, Z3, Z4, are combined via the bus Z to generate a full n-bit word. The array 1700 is controlled by eight control signals c0 - c7 which manage the pass gates and thus the output behavior of the array 1700. The control signals c0 - c7 are operated as shown in the table illustrated in Fig. 18.
A representation of the atomic two-dimensional register arrays 1600, 1700 is shown in Fig. 19 as a 4x4 sub-word atomic register array 1900. It is similar to the 4x4 register array in Fig. 13 except that there are fewer input and output lines as a result of each word including four sub-words in the Fig. 19 embodiment, as opposed to two sub-words in the Fig. 11 embodiment. That is, in the Fig. 19 example, the SIMD architecture partitions an n-bit register into four pieces. This means that four data items are stored in one register, reducing the required number of access signals as compared to the Fig. 13 embodiment. Like the 2x2 sub-word atomic register array 1100 in Fig. 11, the 4x4 sub-word atomic register array 1900 may also be used to form larger structures that are capable of handling larger matrices. For example, four 4x4 sub-word atomic arrays 1900 can be substituted for the 2x2 sub-word atomic arrays illustrated in Fig. 13 to produce an 8x8 two-dimensional register array. This size is particularly useful because it can be used in the processing of compressed digital video information, e.g., MPEG compliant video data.
It should be noted that, although the above description concerns providing the ability to read register data in either normal or transposed form, the same concepts can be applied to enable writing register contents in either normal or transposed form. The "write" version of this invention can be realized by connecting the collection of pass gates, as arranged above, to the inputs of the registers, and by controlling a write strobe for each register so that a register partition will be enabled for writing if, and only if, one of the pass gates feeding its input is active.
New and novel processing instructions for use with the two-dimensional register arrays of the present invention will now be discussed. The new and novel SIMD instructions of the present invention take advantage of the fact that the contents of the two-dimensional register arrays of the present invention can be accessed on a row or column sub-word basis.
An example of a conventional SIMD command, also sometimes referred to as an instruction, is:
MOV R0, R1
where MOV is a move instruction and R0 and R1 are operands which specify the source and destination registers of the data to be copied. Note that data used in conventional SIMD instructions involves the entire contents of the register specified as an operand, e.g., R0.
In accordance with the new and novel instructions of the present invention, a row and/or column of data to be used with a SIMD instruction can be specified as an operand. Such an operand will normally identify both a row or column of register locations, and the particular two-dimensional register array where the specified row or column of register storage locations is located. In this manner, the present invention allows data to be specified in terms of rows or columns of a two-dimensional register array.
Because row/column register array access is supported at a sub-word level, a large number of column/row data manipulations are possible using the data from one or more arrays. For example, data may be copied from a row or column of a register array to another row or column within the register array, from a row of one register array to a row of another register array, from a column of one register array to a column of another register array, from a row of one register array to a column of another register array and/or from a column of one register array to a row of another register array. Rows and columns, as well as the register array to which they correspond, may be specified, in accordance with the present invention, as command operands.
Figure 20A illustrates a system 2000 implemented in accordance with the present invention. The system includes an integrated circuit 2001, an output device 2006, e.g., a display, and an input device 2008, e.g., a keyboard. The integrated circuit 2001 includes a processor 2002, memory 2007 and two register arrays RA1 and RA2 implemented in accordance with the present invention. Register arrays RA1 and RA2 are coupled by a data bus 2003 and control lines 2005 to an I/O and register control device 2004 included in the processor 2002. The device 2004 includes combinational logic for controlling register access under direction of the programmable processor 2002. The memory 2007, output device 2006, and input device 2008 are also coupled to the I/O and register control device 2004.
In accordance with the present invention, instructions, e.g., obtained from memory 2007, involving register arrays RA1 and RA2 are executed by the processor 2002 via control signals generated by the I/O and register control device 2004.
Fig. 20B illustrates the two two-dimensional 2x2 sub-word register arrays RA1 and RA2 in greater detail. The register arrays RA1, RA2 may be implemented using the circuitry of Fig. 10A. In accordance with the present invention a move instruction may be specified as follows:
MOV (C/R) (RA) (C/R) (RA)
where MOV stands for the move instruction, (C/R) is an operand which identifies a particular column or row of a register array, and (RA) is an operand which identifies a particular register array. The first occurrence of the operands (C/R) (RA) specifies the source of the data to be moved, while the second occurrence of the operands (C/R) (RA) specifies the destination of the data being moved.
For example, consider the instruction:
MOV (C1) (RA1) (R2) (RA2)
This instruction, when implemented using the registers illustrated in Fig. 20B, results in the register contents being modified to that illustrated in Fig. 20C. Note how sub-words (a, c) found in column 1 of RA1 have been copied to row 2 of RA2.
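The effect of the row/column move instruction may be sketched in software as follows. The sketch is a behavioral model only: the helper names read_line, write_line and mov are hypothetical, rows and columns are numbered from 1 as in the operands above, RA1 is assumed to hold the rows (a, b) and (c, d) so that column 1 contains (a, c), and the initial contents of RA2 are placeholders since Fig. 20B is not reproduced in the text.

```python
# Behavioral sketch of MOV (C/R) (RA) (C/R) (RA); illustrative only.
# A register array is modeled as a list of rows; indices in operands are 1-based.

def read_line(array, kind, index):
    """Return the sub-words of row (kind 'R') or column (kind 'C') `index`."""
    if kind == "R":
        return list(array[index - 1])
    if kind == "C":
        return [row[index - 1] for row in array]
    raise ValueError("kind must be 'R' or 'C'")


def write_line(array, kind, index, values):
    """Overwrite row or column `index` of `array` with `values`."""
    if kind == "R":
        array[index - 1] = list(values)
    elif kind == "C":
        for row, value in zip(array, values):
            row[index - 1] = value
    else:
        raise ValueError("kind must be 'R' or 'C'")


def mov(src_kind, src_index, src, dst_kind, dst_index, dst):
    """Copy a row or column of `src` into a row or column of `dst`."""
    write_line(dst, dst_kind, dst_index, read_line(src, src_kind, src_index))


if __name__ == "__main__":
    ra1 = [["a", "b"], ["c", "d"]]  # assumed layout of RA1
    ra2 = [["e", "f"], ["g", "h"]]  # placeholder contents for RA2
    mov("C", 1, ra1, "R", 2, ra2)   # MOV (C1) (RA1) (R2) (RA2)
    print(ra2)
```

The source array is left unchanged, consistent with the description of the sub-words being copied.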
In addition to the new and novel move command of the present invention discussed above, other SIMD commands such as copy, add, sub, etc., may be implemented in accordance with the present invention using operands which specify the row or column of a source register array and the row or column of a destination register array.
A transpose command is also supported by the processor and register array of the present invention illustrated in Fig. 20A.
In accordance with the present invention, a transpose command receives as operands a source array identifier and a destination array identifier.
For example, the transpose command may be:
TRNS (RA1) (RA2)
Execution of this command, assuming the register contents were as illustrated in Fig. 20B at the time of execution, would result in the register contents being modified to those illustrated in Fig. 20D.
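A minimal sketch of the transpose command follows. It assumes that TRNS writes the transpose of the source array into the destination array and leaves the source unchanged; the function name trns and the example contents are hypothetical, and the actual contents shown in Fig. 20D are not reproduced here.

```python
# Behavioral sketch of TRNS (source array) (destination array); illustrative only.
# Assumption: the destination receives the transpose of the source contents.

def trns(src, dst):
    """Store the transpose of `src` (a list of rows) into `dst` in place."""
    dst[:] = [list(column) for column in zip(*src)]


if __name__ == "__main__":
    ra1 = [["a", "b"], ["c", "d"]]  # assumed contents of RA1
    ra2 = [["e", "f"], ["g", "h"]]  # placeholder contents of RA2
    trns(ra1, ra2)                  # TRNS (RA1) (RA2)
    print(ra2)
```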

Claims

1. An apparatus, comprising: a first register assembly including: i. a first register having n storage locations, where n is an integer; ii. a first pass gate responsive to a first control signal coupled to a first set of said n storage locations; and iii. a second pass gate responsive to a second control signal coupled to a second set of said n storage locations, at least one of the storage locations included in the second set being different from the storage locations included in the first set, the first register assembly outputting the data included in the first set of the n storage locations in response to activation of the first control signal and outputting the data included in the second set of the n storage locations in response to activation of the second control signal.
2. The apparatus of claim 1, wherein the first register assembly further includes: a third pass gate, responsive to a third control signal, coupled to a third set of the n storage locations.
3. The apparatus of claim 1, wherein the second pass gate is an n-line pass gate having n inputs, each one of the n inputs corresponding to a different one of the n storage locations.
4. The apparatus of claim 2, wherein the first and third pass gates are n/2 line pass gates, the first and third pass gates being coupled to different sets of the n/2 storage locations.
5. The apparatus of claim 4, wherein the second pass gate is an n-line pass gate having n inputs, each one of the n inputs corresponding to a different one of the n storage locations.
6. The apparatus of claim 1, further comprising: a second register assembly including: i. a second register having n storage locations; ii. a fourth pass gate responsive to a fourth control signal coupled to a first set of the second register storage locations; and iii. a fifth pass gate, responsive to a fifth control signal coupled to a second set of second register storage locations of the second register, at least one of the storage locations included in the second set of second register storage locations being different from the storage locations included in the first set of second register storage locations.
7. The apparatus of claim 6, wherein the second register assembly further includes: a sixth pass gate, responsive to a sixth control signal, coupled to a third set of second register storage locations.
8. The apparatus of claim 6, wherein the fifth pass gate is an n-line pass gate having n inputs, each one of the n inputs corresponding to a different one of the n storage locations of the second register.
9. The apparatus of claim 6, further comprising: a plurality of said first and said second register assemblies arranged to form a two-dimensional data storage array.
10. The apparatus of claim 9, further comprising: control means for controlling the accessing of n units of data stored in one of the first and second register assemblies at a first time and for controlling the accessing of n/2 units of data stored in each of the first and the second register assemblies at a second time.
11. The apparatus of claim 10, wherein the first and second registers included in the first and second register assemblies are n-bit registers suitable for storing an n-bit word including two n/2 bit sub-words.
12. The apparatus of claim 9, further comprising: a processor responsive to a programming instruction for controlling access to the first and second register arrays.
13. The apparatus of claim 10, wherein the first and second registers included in the first and second register assemblies are n-bit registers suitable for storing an n-bit word including four n/4 bit sub-words.
14. The apparatus of claim 5, wherein said first and said second register assemblies are arranged in an array to form an n x n-bit data storage unit.
15. The apparatus of claim 5, further comprising: additional first and second register arrays; said first and second register arrays and said additional first and second register arrays being combined to form a two-dimensional data storage array.
16. The apparatus of claim 15, further comprising: combinational logic used to control access to data stored in said two-dimensional data storage array.
17. The apparatus of claim 9, further comprising: a programmable processor coupled to the plurality of first and second register assemblies for generating said control signals used to control access to the first and second register assemblies.
18. The apparatus of claim 17, wherein the first and second register assemblies and programmable processor are implemented as a single integrated circuit.
19. A processing system, comprising: a processing unit implemented on a chip; a plurality of register arrays implemented on the chip, each register array including a plurality of n-bit registers; and at least three pass gates connected to each of the n-bit registers for controlling processor access to data stored in the n-bit registers.
20. The system of claim 19, further comprising: control logic for generating pass gate signals in response to programming instructions supplied to the processor which include a register array column as an operand.
21. A device, comprising: an integrated circuit including: i. a first register having n storage locations, where n is an integer; ii. a first switching device responsive to a first control signal coupled to a first set of said n storage locations; and iii. a second switching device responsive to a second control signal coupled to a second set of said n storage locations, at least one of the storage locations included in the second set being different from the storage locations included in the first set, the first register assembly outputting the data included in the first set of the n storage locations in response to activation of the first control signal and outputting the data included in the second set of the n storage locations in response to activation of the second control signal; and iv. a third switching device, responsive to a third control signal, coupled to a third set of the n storage locations.
22. A method of controlling access to data included in a first register array, including multiple n-bit registers, which permits data stored in the first register array to be accessed either on a row or column basis, each entry in a column of data corresponding to a portion of the contents of one of the registers included in the first register array, the method comprising: providing a processor for generating a plurality of register pass gate control signals; supplying a first instruction to the processor which includes, as an operand, information identifying a column of the first register array; and operating the processor to generate a set of pass gate control signals enabling access to the identified column of data stored in the first register array.
23. The method of claim 22, wherein the first register array is one of a plurality of register arrays, the method further comprising: including, as an operand of the first instruction, information identifying the first register array.
24. The method of claim 23, further comprising the steps of: including, as an operand of the first instruction, information identifying a second register array and information identifying a row in the second register array; operating the processor to generate a set of pass gate control signals enabling access to the identified row of the second register array; and storing data in the identified row of the second register array.
25. The method of claim 24, further comprising the step of: including control logic in the processor for generating the pass gate signals in response to program instructions.
26. The method of claim 24, further comprising the step of: implementing the processor and plurality of register arrays on a single chip.
27. The method of claim 26, wherein the contents of registers included in the register arrays are accessed n bits at a time when a row access operation is performed and less than n bits at a time when a column access operation is performed.
PCT/JP1999/003256 1998-06-19 1999-06-18 Registers and method for accessing data therein for use in a single instruction multiple data system WO1999066393A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2000555150A JP2002518730A (en) 1998-06-19 1999-06-18 Register and method for accessing register used in single instruction multiple data system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/099,989 1998-06-19
US09/099,989 US6175892B1 (en) 1998-06-19 1998-06-19 Registers and methods for accessing registers for use in a single instruction multiple data system

Publications (1)

Publication Number Publication Date
WO1999066393A1 true WO1999066393A1 (en) 1999-12-23

Family

ID=22277559

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP1999/003256 WO1999066393A1 (en) 1998-06-19 1999-06-18 Registers and method for accessing data therein for use in a single instruction multiple data system

Country Status (3)

Country Link
US (1) US6175892B1 (en)
JP (1) JP2002518730A (en)
WO (1) WO1999066393A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0945783A2 (en) * 1998-03-23 1999-09-29 Nec Corporation Variable length register device
WO2001008005A1 (en) * 1999-07-26 2001-02-01 Intel Corporation Registers for 2-d matrix processing
USRE46712E1 (en) 1998-03-18 2018-02-13 Koninklijke Philips N.V. Data processing device and method of computing the cosine transform of a matrix

Families Citing this family (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6732253B1 (en) 2000-11-13 2004-05-04 Chipwrights Design, Inc. Loop handling for single instruction multiple datapath processor architectures
US6931518B1 (en) 2000-11-28 2005-08-16 Chipwrights Design, Inc. Branching around conditional processing if states of all single instruction multiple datapaths are disabled and the computer program is non-deterministic
US7653710B2 (en) 2002-06-25 2010-01-26 Qst Holdings, Llc. Hardware task manager
US7752419B1 (en) 2001-03-22 2010-07-06 Qst Holdings, Llc Method and system for managing hardware resources to implement system functions using an adaptive computing architecture
US7489779B2 (en) * 2001-03-22 2009-02-10 Qstholdings, Llc Hardware implementation of the secure hash standard
US7962716B2 (en) 2001-03-22 2011-06-14 Qst Holdings, Inc. Adaptive integrated circuitry with heterogeneous and reconfigurable matrices of diverse and adaptive computational units having fixed, application specific computational elements
US7400668B2 (en) * 2001-03-22 2008-07-15 Qst Holdings, Llc Method and system for implementing a system acquisition function for use with a communication device
US7249242B2 (en) 2002-10-28 2007-07-24 Nvidia Corporation Input pipeline registers for a node in an adaptive computing engine
US6836839B2 (en) 2001-03-22 2004-12-28 Quicksilver Technology, Inc. Adaptive integrated circuitry with heterogeneous and reconfigurable matrices of diverse and adaptive computational units having fixed, application specific computational elements
US6577678B2 (en) * 2001-05-08 2003-06-10 Quicksilver Technology Method and system for reconfigurable channel coding
US6823087B1 (en) * 2001-05-15 2004-11-23 Advanced Micro Devices, Inc. Parallel edge filters in video codec
US20020184291A1 (en) * 2001-05-31 2002-12-05 Hogenauer Eugene B. Method and system for scheduling in an adaptable computing engine
GB2382674B (en) * 2001-10-31 2005-11-16 Alphamosaic Ltd Data access in a processor
GB2382676B (en) * 2001-10-31 2005-09-07 Alphamosaic Ltd Data access in a processor
US7046635B2 (en) 2001-11-28 2006-05-16 Quicksilver Technology, Inc. System for authorizing functionality in adaptable hardware devices
JP3779602B2 (en) * 2001-11-28 2006-05-31 Matsushita Electric Industrial Co., Ltd. SIMD operation method and SIMD operation device
US6986021B2 (en) 2001-11-30 2006-01-10 Quick Silver Technology, Inc. Apparatus, method, system and executable module for configuration and operation of adaptive integrated circuitry having fixed, application specific computational elements
US8412915B2 (en) 2001-11-30 2013-04-02 Altera Corporation Apparatus, system and method for configuration of adaptive integrated circuitry having heterogeneous computational elements
US7602740B2 (en) * 2001-12-10 2009-10-13 Qst Holdings, Inc. System for adapting device standards after manufacture
US7088825B2 (en) * 2001-12-12 2006-08-08 Quicksilver Technology, Inc. Low I/O bandwidth method and system for implementing detection and identification of scrambling codes
US7215701B2 (en) 2001-12-12 2007-05-08 Sharad Sambhwani Low I/O bandwidth method and system for implementing detection and identification of scrambling codes
US7403981B2 (en) 2002-01-04 2008-07-22 Quicksilver Technology, Inc. Apparatus and method for adaptive multimedia reception and transmission in communication environments
US7328414B1 (en) * 2003-05-13 2008-02-05 Qst Holdings, Llc Method and system for creating and programming an adaptive computing engine
US7660984B1 (en) 2003-05-13 2010-02-09 Quicksilver Technology Method and system for achieving individualized protected space in an operating system
US7493607B2 (en) 2002-07-09 2009-02-17 Bluerisc Inc. Statically speculative compilation and execution
US8108656B2 (en) 2002-08-29 2012-01-31 Qst Holdings, Llc Task definition for specifying resource requirements
US7937591B1 (en) 2002-10-25 2011-05-03 Qst Holdings, Llc Method and system for providing a device which can be adapted on an ongoing basis
US8276135B2 (en) 2002-11-07 2012-09-25 Qst Holdings Llc Profiling of software and circuit designs utilizing data operation analyses
US7225301B2 (en) 2002-11-22 2007-05-29 Quicksilver Technologies External memory controller node
US7275147B2 (en) 2003-03-31 2007-09-25 Hitachi, Ltd. Method and apparatus for data alignment and parsing in SIMD computer architecture
US7609297B2 (en) * 2003-06-25 2009-10-27 Qst Holdings, Inc. Configurable hardware based digital imaging apparatus
US7200837B2 (en) * 2003-08-21 2007-04-03 Qst Holdings, Llc System, method and software for static and dynamic programming and configuration of an adaptive computing architecture
US20050114850A1 (en) * 2003-10-29 2005-05-26 Saurabh Chheda Energy-focused re-compilation of executables and hardware mechanisms based on compiler-architecture interaction and compiler-inserted control
US7996671B2 (en) * 2003-11-17 2011-08-09 Bluerisc Inc. Security of program executables and microprocessors based on compiler-architecture interaction
US7386703B2 (en) * 2003-11-18 2008-06-10 International Business Machines Corporation Two dimensional addressing of a matrix-vector register array
US8607209B2 (en) 2004-02-04 2013-12-10 Bluerisc Inc. Energy-focused compiler-assisted branch prediction
US7257695B2 (en) * 2004-12-28 2007-08-14 Intel Corporation Register file regions for a processing system
US7506326B2 (en) * 2005-03-07 2009-03-17 International Business Machines Corporation Method and apparatus for choosing register classes and/or instruction categories
US20070294181A1 (en) * 2006-05-22 2007-12-20 Saurabh Chheda Flexible digital rights management with secure snippets
US20080082798A1 (en) * 2006-09-29 2008-04-03 3Dlabs Inc. Ltd. Flexible Microprocessor Register File
US20080126766A1 (en) 2006-11-03 2008-05-29 Saurabh Chheda Securing microprocessors against information leakage and physical tampering
GB2444744B (en) * 2006-12-12 2011-05-25 Advanced Risc Mach Ltd Apparatus and method for performing re-arrangement operations on data
US20080154379A1 (en) * 2006-12-22 2008-06-26 Musculoskeletal Transplant Foundation Interbody fusion hybrid graft
EP2526494B1 (en) 2010-01-21 2020-01-15 SVIRAL, Inc. A method and apparatus for a general-purpose, multiple-core system for implementing stream-based computations
US9606802B2 (en) 2011-03-25 2017-03-28 Nxp Usa, Inc. Processor system with predicate register, computer system, method for managing predicates and computer program product
US9009447B2 (en) 2011-07-18 2015-04-14 Oracle International Corporation Acceleration of string comparisons using vector instructions
US9280342B2 (en) 2011-07-20 2016-03-08 Oracle International Corporation Vector operations for compressing selected vector elements
US20170192789A1 (en) * 2015-12-30 2017-07-06 Rama Kishnan V. Malladi Systems, Methods, and Apparatuses for Improving Vector Throughput
US10990396B2 (en) * 2018-09-27 2021-04-27 Intel Corporation Systems for performing instructions to quickly convert and use tiles as 1D vectors
GB2597709A (en) * 2020-07-30 2022-02-09 Advanced Risc Mach Ltd Register addressing information for data transfer instruction
US20220100508A1 (en) * 2020-09-26 2022-03-31 Intel Corporation Large-scale matrix restructuring and matrix-scalar operations
US20220413854A1 (en) * 2021-06-25 2022-12-29 Intel Corporation 64-bit two-dimensional block load with transpose

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0381940A1 (en) * 1989-01-13 1990-08-16 Kabushiki Kaisha Toshiba Register bank circuit
US5708618A (en) * 1993-09-29 1998-01-13 Kabushiki Kaisha Toshiba Multiport field memory
GB2317466A (en) * 1996-09-23 1998-03-25 Advanced Risc Mach Ltd Data processing condition code flags

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
JPS6238075A (en) 1985-08-13 1987-02-19 Fuji Xerox Co Ltd Device for processing transposition of matrix data
FR2617621B1 (en) 1987-07-03 1989-12-01 Thomson Semiconducteurs TRANSPOSITION MEMORY FOR DATA PROCESSING CIRCUIT
FR2626693B1 (en) 1987-12-03 1990-08-10 France Etat BUFFER MEMORY DEVICE AND METHOD, PARTICULARLY FOR LINE-COLUMN MATRIX TRANSPOSITION OF DATA SEQUENCES
US5177704A (en) 1990-02-26 1993-01-05 Eastman Kodak Company Matrix transpose memory device
US5042007A (en) 1990-02-26 1991-08-20 Eastman Kodak Company Apparatus for transposing digital data
US5648776A (en) * 1993-04-30 1997-07-15 International Business Machines Corporation Serial-to-parallel converter using alternating latches and interleaving techniques
JP3676411B2 (en) * 1994-01-21 2005-07-27 サン・マイクロシステムズ・インコーポレイテッド Register file device and register file access method
US5481487A (en) 1994-01-28 1996-01-02 Industrial Technology Research Institute Transpose memory for DCT/IDCT circuit
US5570356A (en) * 1995-06-07 1996-10-29 International Business Machines Corporation High bandwidth communications system having multiple serial links
US5926120A (en) * 1996-03-28 1999-07-20 National Semiconductor Corporation Multi-channel parallel to serial and serial to parallel conversion using a RAM array

Non-Patent Citations (1)

Title
LEE R B: "SUBWORD PARALLELISM WITH MAX-2", IEEE MICRO, vol. 16, no. 4, 1 August 1996 (1996-08-01), pages 51 - 59, XP000596513, ISSN: 0272-1732 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
USRE46712E1 (en) 1998-03-18 2018-02-13 Koninklijke Philips N.V. Data processing device and method of computing the cosine transform of a matrix
EP0945783A2 (en) * 1998-03-23 1999-09-29 Nec Corporation Variable length register device
EP0945783A3 (en) * 1998-03-23 2001-09-26 Nec Corporation Variable length register device
WO2001008005A1 (en) * 1999-07-26 2001-02-01 Intel Corporation Registers for 2-d matrix processing
US6625721B1 (en) 1999-07-26 2003-09-23 Intel Corporation Registers for 2-D matrix processing

Also Published As

Publication number Publication date
US6175892B1 (en) 2001-01-16
JP2002518730A (en) 2002-06-25

Similar Documents

Publication Publication Date Title
US6175892B1 (en) Registers and methods for accessing registers for use in a single instruction multiple data system
US5606520A (en) Address generator with controllable modulo power of two addressing capability
US5832290A (en) Apparatus, systems and method for improving memory bandwidth utilization in vector processing systems
US7979672B2 (en) Multi-core processors for 3D array transposition by logically retrieving in-place physically transposed sub-array data
US8375196B2 (en) Vector processor with vector register file configured as matrix of data cells each selecting input from generated vector data or data from other cell via predetermined rearrangement path
US4821224A (en) Method and apparatus for processing multi-dimensional data to obtain a Fourier transform
US9400652B1 (en) Methods and apparatus for address translation functions
EP0408810A1 (en) Multi processor computer system
US20190196831A1 (en) Memory apparatus and method for controlling the same
EP2027539B1 (en) Memory architecture
Lee Subword permutation instructions for two-dimensional multimedia processing in MicroSIMD architectures
JP2022543332A (en) Data processing
US7355917B2 (en) Two-dimensional data memory
US20060155953A1 (en) Method and apparatus for accessing multiple vector elements in parallel
US6085304A (en) Interface for processing element array
US20230176981A1 (en) Data processing method and acceleration unit
JPH04295953A (en) Parallel data processor with built-in two-dimensional array of element processor and sub-array unit of element processor
JPH1074141A (en) Signal processor
US20210142846A1 (en) Memory devices providing in situ computing using sequential transfer of row buffered data and related methods and circuits
US5928350A (en) Wide memory architecture vector processor using nxP bits wide memory bus for transferring P n-bit vector operands in one cycle
US20190114147A1 (en) Memory systems including support for transposition operations and related methods and circuits
WO2004013752A1 (en) Method and apparatus for accessing multiple vector elements in parallel
EP0775973B1 (en) Method and computer program product of transposing data
US7503046B2 (en) Method of obtaining interleave interval for two data values
US20240045922A1 (en) Zero padding for convolutional neural networks

Legal Events

Date Code Title Description
AK Designated states Kind code of ref document: A1; Designated state(s): JP KR
AL Designated countries for regional patents Kind code of ref document: A1; Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
121 EP: the EPO has been informed by WIPO that EP was designated in this application
122 EP: PCT application non-entry in European phase