US5247696A - Method for compiling loops having recursive equations by detecting and correcting recurring data points before storing the result to memory - Google Patents


Info

Publication number
US5247696A
Authority
US
United States
Prior art keywords
vector
code
loop
memory
data points
Prior art date
Legal status
Expired - Lifetime
Application number
US07/642,480
Inventor
Michael W. Booth
Current Assignee
RPX Corp
Morgan Stanley and Co LLC
Original Assignee
Cray Research LLC
Priority date
Filing date
Publication date
Assigned to Cray Research Inc., 608 Second Avenue South, Minneapolis, MN 55402, a Delaware corporation. Assignors: Booth, Michael W.
Application filed by Cray Research LLC
Priority to US07/642,480
Application granted
Publication of US5247696A
Assigned to Silicon Graphics, Inc. Assignors: Cray Research, L.L.C.
Assigned to Foothill Capital Corporation (security agreement). Assignors: Silicon Graphics, Inc.
Assigned to U.S. Bank National Association, as trustee (security interest). Assignors: Silicon Graphics, Inc.
Assigned to General Electric Capital Corporation (security interest). Assignors: Silicon Graphics, Inc.
Assigned to Morgan Stanley & Co., Incorporated. Assignors: General Electric Capital Corporation
Anticipated expiration
Assigned to Graphics Properties Holdings, Inc. (change of name). Assignors: Silicon Graphics, Inc.
Assigned to RPX Corporation. Assignors: Graphics Properties Holdings, Inc.
Status: Expired - Lifetime


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/45Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/451Code distribution
    • G06F8/452Loops
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation

Definitions

  • Paruolo's method, like Abe's, requires additional memory. Memory must be set aside equivalent to the sum of the size of array Y and the size of array X. This additional memory can get large for large n or large M. Like Abe's method, Paruolo's is difficult to implement in a practical compiler due to the difficulty of determining the size of X prior to algorithm execution time.
  • Abe and Paruolo are geared to solving different types of problems. Abe works best in a large data space with little overlapping of elements. Paruolo works best in a data space with a great deal of overlap, as in the case where the number of elements in Y is much greater than the number of data points in X.
  • the Paruolo method is optimized toward problems exhibiting a uniform distribution of elements of Y over the data points of X. However, it would perform no worse than the scalar calculation in the case of total overlap (i.e. all elements in Y affect one point in X).
  • Each of the three algorithms discussed offers a way of structuring loops containing recursive equations so as to take advantage of some level of vectorization. All three share a requirement for large amounts of additional memory that limits the size of problems that can be executed and may force otherwise vectorizable calculations to be performed in scalar mode.
  • the vector update method requires no additional memory for the vectorization of program code. Furthermore, it is designed to vectorize loops without recursive elements with negligible loss in performance.
  • a system implemented according to the present invention demonstrates a noticeable increase in the speed of program execution and can handle larger problems than systems implemented under previous methods.
  • a compiler constructed to include the vector update method of vectorizing code will automatically transform program code containing loops into vectorized code.
  • a compiler according to this invention generates code without allocating additional work space memory. This means that it can operate independent of the data used in the program to be compiled, eliminating the run time determination of memory sizes required by compilers implemented with other methods.
  • FIG. 1 is a block diagram representative of a typical vector processing computer system according to the present invention.
  • FIG. 2 is a flow diagram representative of the steps taken to vectorize program code containing a loop according to the present invention.
  • FIG. 3 is a FORTRAN representation of a program solving an algorithm containing recurring data points using the vector update method of the present invention.
  • FIGS. 4 and 5 are flow diagrams representative of the steps a vector processing computer system goes through in executing a vectorized loop after compiling with the vector update method of the present invention.
  • Computer system 10, which can be used for high speed data processing, is illustrated generally in FIG. 1.
  • Computer system 10 is a typical pipelined vector processing system which includes a main memory 12, a plurality of vector registers 14, a plurality of scalar registers 16, a plurality of address registers 18, a plurality of arithmetic units 20, an instruction buffer 22, an instruction register 24, a program counter 26 and a compiler 30.
  • Computer system 10 also includes control hardware for operating in either scalar or vector mode (not shown).
  • Vector registers 14, scalar registers 16 and address registers 18 are connected to main memory 12 and serve as intermediate high-speed memory between main memory 12 and arithmetic units 20.
  • Vector registers 14.1 through 14.n are of uniform vector register length VL and are connected to main memory 12 and arithmetic units 20.
  • Compiler 30 is connected to main memory 12 and serves to convert program code into machine executable code in preparation for execution in computer 10.
  • Instruction buffer 22 is connected to main memory 12 and serves as intermediate high-speed memory between main memory 12 and instruction register 24.
  • vector registers 14 used as operand registers for a given vector process transmit individual elements to an arithmetic unit 20 at the rate of one element per time period. Once the start-up time, or arithmetic unit time, has passed, the arithmetic unit provides successive result elements on successive time periods, and these are transmitted as elements of a result vector to a vector register 14 acting as a result register for that particular vector process.
  • Vector transfers between vector registers 14 and main memory 12 may also be accomplished at one element per time period.
  • Computer system 10 may include a plurality of arithmetic units 20 (for example, floating point multiply, integer add, and logical operation units) and a plurality of vector registers 14 (for example, eight).
  • Program code to be run on computer system 10 must be vectorized to exploit the performance advantages inherent in vector processing.
  • Program code containing loops in which mathematical or logical operations are performed on arrays of data points can be vectorized using the vector update method of the present invention.
  • this code vectorization is performed by a compiler.
  • the method of this invention can be used by programmers to restructure program code for use on vector processing systems that do not have this compiler.
  • A flow chart of the steps of the present invention is shown in FIG. 2.
  • program code containing a loop is converted into program code containing a nested loop of depth two through the process of strip mining.
  • the resulting interior loop is of length VL, the vector length of the system.
  • functions to be performed within the interior loop are set up as vector operations with the results to be saved to an intermediate vector register.
  • recurring data points are detected by including program code that writes the numbers 0 through VL-1 to the locations in memory of the data points used in the current interior loop. Recurring data points will be written to more than once, with the last number written pointing to the location in the intermediate vector register of the last valid operation performed on that data point.
  • program code is included that repairs those elements in the intermediate vector register associated with the recurring data points and saves the results to memory.
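  • The detection-and-repair steps above can be sketched in code. The following is an illustrative Python rendering, not the patent's own implementation (FIG. 3 gives a FORTRAN version); the names vector_update, idx, tmp and the vector length VL = 4 are assumptions, and len(Y) is assumed to be a multiple of VL (strip mining handles the remainder separately):

```python
VL = 4  # stands in for the machine's vector register length

def vector_update(X, Y, j):
    """Compute X[j[i]] += Y[i] a vector at a time, repairing recurrences."""
    for base in range(0, len(Y), VL):
        idx = [j[base + l] for l in range(VL)]              # gather addresses
        # Forced vector operation, results kept in an intermediate vector.
        tmp = [X[idx[l]] + Y[base + l] for l in range(VL)]
        const = list(range(VL))                             # constants 0..VL-1
        for l in range(VL):          # scatter 0..VL-1 into the data points;
            X[idx[l]] = const[l]     # a recurring point is written to twice
        back = [X[idx[l]] for l in range(VL)]               # read markers back
        for l in range(VL):
            if back[l] != const[l]:  # overwritten: a recurring data point
                # back[l] locates the last valid operation on that point, so
                # fold this element's contribution into that slot of tmp.
                tmp[back[l]] = tmp[back[l]] + Y[base + l]
        for l in range(VL):          # store repaired results; the last write
            X[idx[l]] = tmp[l]       # to a recurring point is the correct one
```

Note that the data point array X itself serves as the scratch pad for the markers, which is why the method needs no additional work memory.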
  • A system implemented according to the present invention has several advantages over a system implemented using the work vector method, Abe's method or Paruolo's method. There are no additional memory requirements. Systems implemented using this method, in effect, oversubscribe memory: data point arrays become scratch pads for marking errors in vectorization due to recurring data points. This leads to greater efficiency in the use of memory, resulting in the capacity to handle larger problems in the same memory space. It also eases demand on memory bandwidth.
  • the vector update method tracks and repairs errors due to recurring data points within the internal loop of the vectorized code. Unlike Abe's or Paruolo's methods, this method operates on a set of elements from the operand arrays, acting on the data point arrays until all elements in the set are consumed. Abe's method must maintain in memory a list of elements that were missed by forced vectorization, then enter a second series of operations to repair the damage. Paruolo's method must maintain in memory two arrays, one for matching elements from a pool of elements to locations representative of the data point array X and the other to track the remaining elements in the pool.
  • a robust general compiler capable of vectorizing recursive equations can be developed using the vector update method.
  • the vector update method consumes elements of the operand arrays as they are handled.
  • Computer 10 remains inside the vector update loop until all recurring data points are corrected. More importantly, since the vector update method requires no additional memory for tracking recurring data points, there is no requirement for determining the size of data point arrays.
  • a compiler implemented under the Abe or Paruolo methods is required to perform a run time determination of the maximum size of the index into the data point array. A compiler using the vector update method has no such requirement: program code can be fully compiled with no additional run time memory allocation steps.
  • FIG. 4 illustrates the result of strip mining the loop.
  • the routine in FIG. 4 is entered at 40 where the index variable k is initialized to zero.
  • an initial value of ii is calculated.
  • a check is made to see if ii is greater than or equal to VL (the vector register length of the computer used). If it is, computer 10 moves to 46 and performs the vector update method's vector operation and correction for recurring data points.
  • a value equivalent to the length of a vector register is added to k. Steps 42, 44, 46 and 48 are repeated until ii, the number of remaining elements, is less than VL.
  • the remaining calculations are performed in a scalar loop.
  • a check is made to see if all elements have been consumed. If so, control returns to the main program. If not, the remaining calculations are performed as a scalar loop of ii iterations. Then control returns to the main program.
  • Vectorized compiling with the vector update method results in code that executes a set of equations of the form: ##EQU6## where the internal loop calculation of the nested loop is performed as a vector operation.
  • FIG. 5 illustrates the preferred embodiment of the steps a vector processing computer goes through in executing a loop vectorized by a compiler 30 using the present invention.
  • This routine will work with all loops including those containing recurring data points.
  • the routine is entered at 60 where the contents of array X selected by j(i) in the current iteration of the loop are transferred from memory 12 to vector register 14.1.
  • vector register 14.2 is loaded with the values of 0 through VL-1 (where VL is the vector length of computer 10).
  • Vector register 14.2 will serve as a constant vector in the determination of recurring data points within the vector loop.
  • vector register 14.3 is loaded with the values of Y selected by i in the current iteration.
  • the vector operation(s) are performed as if there are no recurring data points.
  • the operation is the addition of X(j(i)) and Y(i).
  • the results are stored in vector register 14.4.
  • Steps 68 through 72 search for recurring data points.
  • the elements of vector register 14.2 are written to the locations in X involved in the current loop. Any recurring points will be written to more than once, with the last element written corresponding to the last vector operation performed on that data point.
  • the contents of array X selected by j(i) in the current iteration are transferred from memory 12 to vector register 14.5. If there are no recurring points, vector register 14.5 will be a replica of vector register 14.2. If there are recurring data points, the location of the last vector operation performed on that data point will occur more than once in vector register 14.5.
  • Vector register 14.2 is then subtracted element by element from vector register 14.5 and the result is placed in vector register 14.6, where any non-zero entry marks a recurring data point. If the check turns up one or more non-zero entries, a scalar loop is entered.
  • the location of the first non-zero element in vector register 14.6 is saved to variable l.
  • Variable l points at the element in vector register 14.3 whose operation on X was overwritten in step 66.
  • this operation is then performed.
  • element l in vector register 14.6 is set to zero and control returns to 74. Steps 74, 76, 78, and 80 are repeated until all non-zero elements in vector register 14.6 have been set to zero.
  • When no non-zero elements remain, control moves to 82, the updated values of X from vector register 14.4 are saved to memory 12, and control is returned to the exterior loop of FIG. 4.
  • In vector processing computer systems implemented with vector mask registers, an alternate embodiment of step 74 would be to use the vector mask register to mark the location of non-zero entries in vector register 14.6. Repair would then consist of a scalar loop that goes bit by bit through the vector mask register and performs the required operations on those locations corresponding to bits set to one in the vector mask register.
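  • This alternate embodiment can be sketched in illustrative Python, with a boolean list standing in for the hardware vector mask register; the names repair_with_mask, tmp, back and const are assumptions, not taken from the patent:

```python
def repair_with_mask(tmp, back, const, Y):
    # tmp: intermediate vector of results (register 14.4 in FIG. 1)
    # back: index values read back from the data points (register 14.5)
    # const: the constant vector 0 through VL-1 (register 14.2)
    # Y: the operand elements of the current pass (register 14.3)
    mask = [b != c for b, c in zip(back, const)]  # mask register stand-in
    for l, bit in enumerate(mask):                # scalar loop, bit by bit
        if bit:
            # back[l] points at the last valid operation on the recurring
            # data point; apply the missed contribution there.
            tmp[back[l]] = tmp[back[l]] + Y[l]
    return tmp
```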
  • programmers can improve execution of code containing recursive equations by structuring their code according to the present invention. This would only be necessary in the case of program code that is intended to be executed on a vector processing computer that does not contain a compiler including the vector update method for vectorized compiling.
  • a FORTRAN subroutine for implementing the vector update method on a vector processing computer is shown in FIG. 3. It has been written to be executed on a vector processing computer containing vector registers of length 64 (e.g. a Cray supercomputer).
  • lines 3-5 are variable declarations.
  • the array tmp is the only additional memory required by routines implemented according to the present invention.
  • the size of tmp has been chosen such that the compiler will assign tmp to a vector register rather than external memory.
  • Lines 7-11 are comments depicting the recursive loop that is being solved according to the present invention.
  • Line 12 is the definition of a pointer to the array of data points X.
  • Lines 14 and 42 are the exterior loop created by the strip mining process.
  • Lines 16 through 20 perform the forced vectorization, saving the results into tmp.
  • Lines 22 through 26 write an index value into the array X at the locations read during the forced vectorization performed by lines 16 through 20. In the event of a recurring data point, data point array X will contain the final index value written to that location.
  • Lines 28 through 34 check to see if all index values can be read back and compensate for recurring data points.
  • a check is made to see if there was a recurring data point. In the event of a recurring data point the index value read back at that location will differ from the index value expected. The index value read will, however, point to the location in tmp containing the result of the last valid operation on that data point. This permits the execution at line 31 of an instruction to perform the needed additional operation on that data point.
  • Lines 36 through 40 write the results of the internal loop back to memory.
  • programmers should use their knowledge of the nature of the data in their model to adjust the stride through the elements to decrease the likelihood of recurring data points. This would be appropriate for data structures that have recurring data points as a function of location in the array.

Abstract

A vector update method for vectorizing loops containing recursive equations within a supercomputer. Program code containing a loop is transformed into a nested loop in which the interior loop performs an integer number of iterations of the original loop equal to the vector length of the system. Vector operations are executed on the arrays of data within the interior loop, a check is made for recurring data points and repairs are made to the results of the vector operations. All repairs are completed before exiting the interior loop.

Description

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention pertains generally to the field of high speed digital processing systems, and more particularly to a method of vectorizing loops containing recursive equations in a supercomputer.
2. Background of the Invention
Supercomputers are high performance computing platforms that employ a pipelined vector processing approach to solving numerical problems. Vectors are ordered sets of data. Problems that can be structured as a sequence of operations on vectors can experience one to two orders of magnitude increased throughput when executed on a vector machine (compared to execution on a scalar machine of the same cost). Pipelining further increases throughput by hiding memory latency through the prefetching of instructions and data.
A pipelined vector machine is disclosed in U.S. Pat. No. 4,128,880, issued Dec. 5, 1978, to Cray, the disclosure of which is hereby incorporated herein by reference. In the Cray machine, vectors are processed by loading them into operand vector registers, streaming them through a data processing pipeline having a functional unit, and receiving the output in a result vector register. A vector machine according to U.S. Pat. No. 4,128,880 supports fully parallel operation by allowing multiple pipelines to execute concurrently on independent streams of data.
For vectorizable problems, vector processing is faster and more efficient than scalar processing. Overhead associated with maintenance of the loop-control variable (for example, incrementing and checking the count) is reduced. In addition, central memory conflicts are reduced (fewer but bigger requests) and data processing units are used more efficiently (through data streaming).
Vector processing supercomputers are used for a variety of large-scale numerical problems. Applications typically are highly structured computations that model physical processes. They exhibit a heavy dependence on floating-point arithmetic due to the potentially large dynamic range of values within these computations. Problems requiring modeling of heat or fluid flow, or of the behavior of a plasma, are examples of such applications.
Program code for execution on vector processing supercomputers must be vectorized to exploit the performance advantages of vector processing. Vectorization typically breaks up a loop of the form: ##EQU1## into a nested loop of the form: ##EQU2## where VL is the length of the vector registers of the system. This process is known as "strip mining the loop". In strip mining, the number of iterations in the internal loop is defined by the length of a vector register. The number of iterations of the external loop is defined as an integer number of vector lengths. The remaining iterations are performed as a separate loop placed before the nested loop. Vector length arrays of data from the original data arrays are loaded into the vector registers for each iteration of the internal loop. Data from these vector registers can then be processed at the one element per clock period rate of a vector operation.
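The strip-mining transformation described above can be sketched as follows. This is an illustrative Python rendering under assumed names (plain_loop, strip_mined_loop, VL = 4); the patent's examples are in FORTRAN and the true vector length is machine-dependent:

```python
VL = 4  # stands in for the machine's vector register length

def plain_loop(X, Y):
    # The original loop: one element per iteration, executed in scalar mode.
    for i in range(len(X)):
        X[i] = X[i] + Y[i]

def strip_mined_loop(X, Y):
    n = len(X)
    remainder = n % VL
    # The remaining iterations run as a separate loop before the nest.
    for i in range(remainder):
        X[i] = X[i] + Y[i]
    # External loop: steps an integer number of vector lengths...
    for base in range(remainder, n, VL):
        # ...internal loop: exactly VL iterations, i.e. one vector operation
        # on a vector-length array loaded into the vector registers.
        for i in range(base, base + VL):
            X[i] = X[i] + Y[i]
```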
Compilers exist that will automatically apply strip mining techniques to scalar loops within program code to create vectorized loops. This capability greatly simplifies programming efficient vector processing. The programmer simply enters code of the form: ##EQU3## and it is vectorized.
There are, however, certain types of problems that resist vectorization. Equations of the form: ##EQU4## are difficult to vectorize due to the possibility that the same data point in X may be needed for different Y(i) within a single vector operation.
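The hazard can be made concrete with a small sketch (illustrative Python; scalar_truth, naive_vectorized and VL = 4 are assumed names, and the vector operation is modeled with list comprehensions):

```python
VL = 4  # assumed vector register length

def scalar_truth(X, Y, j):
    # The recursive equation executed in scalar mode: always correct.
    for i in range(len(Y)):
        X[j[i]] = X[j[i]] + Y[i]

def naive_vectorized(X, Y, j):
    # Gather, operate and scatter a whole vector at a time, ignoring the
    # possibility that the same data point recurs within one vector.
    for base in range(0, len(Y), VL):
        operands = [X[j[base + k]] for k in range(VL)]            # gather
        results = [operands[k] + Y[base + k] for k in range(VL)]  # vector add
        for k in range(VL):              # scatter: the later write to a
            X[j[base + k]] = results[k]  # recurring point loses the earlier

X1, X2 = [0.0] * 3, [0.0] * 3
j = [0, 1, 0, 2]  # data point 0 recurs within a single vector of length 4
scalar_truth(X1, [1, 1, 1, 1], j)
naive_vectorized(X2, [1, 1, 1, 1], j)
# X1[0] is 2.0 but X2[0] is only 1.0: one update to X(j(i)) was lost
```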
Present compilers are designed to recognize the possibility of recurring points in a loop and inhibit vectorization so that the equation is solved by a scalar loop. The reasons for this are varied. The range and contents of j(i) may not be known until run time, making it difficult to predict recurring data points. Also, it is difficult to allocate memory for expressions like X(j(i)) in which array X may be of an arbitrary length to be determined at run time. For example, it is a common practice within advanced programming languages to pass an array X(1) implying X is infinite in length. A practical compiler may never know the size of the array and, as long as the data does not overlap, the program will execute properly.
There are programming steps that can be taken to regain some of the performance lost in executing recursive equations in scalar loops. Some compilers offer a compiler directive that can be used to force vectorization of loops that would not vectorize otherwise due to the appearance of recurring data points. Programmers using such a directive take on the responsibility of making sure that there are, in fact, no recurring data points within a vector operation inside the loop.
In addition, there exist a number of algorithms that can be used to structure program code to prevent recurring data or detect if operations have been affected by the vectorization process. One such algorithm creates a temporary area in memory and uses it to separate operations on the same data point. This method is called the work vector method. Since the source of error in vectorizing recursive loops lies in the occurrence of a data point more than once within a vector in the internal loop, a work vector is created to serve as intermediate storage of the result of the calculation. If array X is of size M and the vector length of the machine is VL, work vector WV[k,l] will be a two-dimensional array of size M * VL. The result of each pass through the internal loop is accumulated in the work vector array at the location l defined by its place in the vector and the address k of the original data point. That result is used in future calculations on the same data point in the same location in the operand vector. Two operations on the same data point in a vector operation result in writes to two different locations in the work vector. The last step in performing the original loop is to add elements 1 through VL for each data point in X.
The work vector method is simple. It can lead to impressive gains in performance over the scalar approach. But as the size of array X becomes large, the memory requirements for the work vector array become burdensome. Also, in cases of little overlap or of limited operations within M (smaller n/M), addition of elements 1 through VL for each data point in X leads to a large number of zero additions.
An alternate approach was suggested in a paper entitled "Present Status of Computer Simulation at IPP", by Dr. Yoshihiko Abe, in Supercomputing '88: Volume II, Science and Applications (1989). Abe proposed a method of vectorizing recursive equations that consists of breaking one large loop into a number of smaller loops, performing vector operations on the smaller loops, flagging recurring data points within the vector operation and correcting the result. The Abe method creates two new vectors L and K of size M and 2*VL, respectively. Vector L is a scratch pad vector used to record the identity of the last Y(i) added to a data point. Vector K is a scratch pad vector used to record the location of elements of Y(i) whose contribution to X was missed due to the vector operation.
The Abe method assumes that the amount of overlap is fairly small, so that the cost of tracking overlap and performing calculations on those elements which were missed due to overlap is low. The frequency of correction must be kept small in order to get good performance from this algorithm. In a large class of problems (e.g., Monte Carlo simulations), the number of iterations is small in comparison to the data point space; for these types of problems, the Abe method will give a substantial boost in performance.
There are drawbacks to the Abe technique, however. Memory must be set aside that is equivalent to the size of the array of data points. This memory (vector L above) is needed to record the identity of the last element used to operate on each data point. In addition, memory must be set aside (vector K) as a stack for recording the locations of elements missed due to overlap. This additional memory can become large for large M. The total memory required by the Abe method is, however, smaller than that required by the work vector method.
Most importantly, the Abe method is difficult to implement in a practical compiler. The size M of array X is rarely known at compile time. Although it is possible for a compiler to allocate vector K based on the value of M calculated at run time, the time required to scan j in order to determine the maximum size required for K would severely curtail any performance benefits gained from the execution of Abe's method.
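The flow of the Abe method as described above can be sketched roughly as follows. This is an interpretation of the text, not Abe's published algorithm; the explicit loops model vector operations, and the names L and K follow the description:

```python
def abe_style_update(X, Y, j, VL):
    """Abe-style sketch for X[j[i]] += Y[i]. Scratch vector L records
    the identity of the last element applied to each data point; stack K
    records elements missed due to overlap, replayed in a scalar pass."""
    L = [-1] * len(X)                     # last Y index seen per data point
    K = []                                # locations missed by vectorization
    for base in range(0, len(Y), VL):
        strip = range(base, min(base + VL, len(Y)))
        for i in strip:                   # scatter: last write per point wins
            L[j[i]] = i
        for i in strip:                   # models the forced vector operation
            if L[j[i]] == i:
                X[j[i]] += Y[i]           # this lane's result survives
            else:
                K.append(i)               # overlapped lane: flag for repair
    for i in K:                           # scalar correction of the elements
        X[j[i]] += Y[i]                   # missed by the vector operation
    return X
```

When overlap is rare, K stays short and the scalar correction pass is cheap, which matches the method's stated assumption.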
A different approach was presented by Giuseppe Paruolo in an article entitled "A Vector-Efficient and Memory-Saving Interpolation Algorithm for PIC Codes on a Cray X-MP," published in the Journal of Computational Physics, vol. 89 (1990). Paruolo suggests a method in which two temporary work arrays, IAG and IAP, are used to detect and compensate for recurring data points. IAG is the same size as X; IAP is the same size as Y. IAP serves as a pool of pointers linking elements in Y to their corresponding data points in X. The method consists of writing the location of each element in Y to the location in IAG corresponding to the data point affected in X. Locations in IAG that are affected by more than one element of Y are written into more than once. After one pass through IAP, IAG contains pointers to no more than one element in Y for each data point in X. A new X is calculated using the values of Y pointed to by IAG, and those elements of Y are removed from the pool.
The above process is repeated until all elements of Y are consumed, the pool of elements in Y falls beneath a threshold or the number of elements consumed in a pass falls beneath a threshold. The remaining iterations are then solved with a scalar loop.
Paruolo's method, like Abe's, requires additional memory. Memory must be set aside that is equivalent to the sum of the sizes of arrays Y and X. This additional memory can become large for large n or large M. Like Abe's method, Paruolo's is difficult to implement in a practical compiler because of the difficulty of determining the size of X prior to algorithm execution time.
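A simplified sketch of the Paruolo scheme as described above follows. It omits the threshold tests and scalar tail loop of the full method and is an interpretation of the text, not Paruolo's published code; the explicit loops model vector operations:

```python
def paruolo_style_update(X, Y, j):
    """Paruolo-style sketch for X[j[i]] += Y[i]. Pointer array IAG (one
    slot per data point) keeps at most one surviving element of Y per
    pass; survivors are applied conflict-free and removed from the pool."""
    pool = list(range(len(Y)))            # plays the role of the IAP pool
    IAG = [-1] * len(X)                   # per-data-point pointer into Y
    while pool:
        for k in range(len(X)):
            IAG[k] = -1
        for i in pool:                    # scatter: duplicates overwrite, so
            IAG[j[i]] = i                 # each point keeps one pointer
        survivors = {p for p in IAG if p != -1}
        for i in survivors:               # conflict-free vector-style update
            X[j[i]] += Y[i]
        pool = [i for i in pool if i not in survivors]
    return X
```

Each pass consumes at least one element of the pool, so the loop terminates; with heavy overlap, many passes are needed, which is why the full method falls back to a scalar loop below a threshold.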
The Abe and Paruolo methods are geared to solving different types of problems. The Abe method works best in a large data space with little overlapping of elements. The Paruolo method works best in a data space with a great deal of overlap, as in the case where the number of elements in Y is much greater than the number of data points in X. The Paruolo method is optimized toward problems exhibiting a uniform distribution of elements of Y over the data points of X. However, it would perform no worse than the scalar calculation in the case of total overlap (i.e., all elements in Y affect one point in X).
Each of the three algorithms discussed offers ways of structuring loops containing recursive equations so as to take advantage of some level of vectorization. All three share a requirement for large amounts of additional memory, which limits the size of problems that can be executed and may force otherwise vectorizable calculations to be performed in scalar mode.
In addition, use of these algorithms requires a conscious effort on the part of programmers to structure their programs accordingly. The programmer must convert a problem that involves recurring data points into a different form that can take advantage of these methods. It is difficult to construct a practical compiler that will automatically vectorize loops containing ambiguous references to possibly recurring data points using these methods. The ambiguous nature of references to data point arrays such as X means that compilers implementing these algorithms would be required to allocate memory at run time based on the contents of one or more data arrays. This approach would be complex, and the time required to determine the required memory size would probably outweigh any performance increase gained from compiled execution.
It is clear that there is a need for improved methods of vectorizing scalar loops. A system which can operate to vectorize scalar loops containing recursive equations with no additional memory requirements is desired. In addition, there is a need for a method of vectorizing scalar loops containing recursive equations that can be placed into a compiler and automatically vectorize such loops.
SUMMARY OF THE INVENTION
The present invention provides a vector update method for vectorizing loops containing recursive equations within a supercomputer. Program code containing a loop is transformed into a nested loop in which the interior loop performs an integer number of iterations of the original loop equal to the vector length of the system. Vector operations are executed on the arrays of data within the interior loop, a check is made for recurring data points and repairs are made to the results of the vector operations. All repairs are completed before exiting the interior loop.
The vector update method requires no additional memory for the vectorization of program code. Furthermore, it is designed so that loops containing no recursive elements are vectorized with negligible loss in performance. A system implemented according to the present invention demonstrates a noticeable increase in the speed of program execution and can handle larger problems than systems implemented under previous methods.
According to another aspect of this invention, a compiler constructed to include the vector update method of vectorizing code will automatically transform program code containing loops into vectorized code. A compiler according to this invention generates code without allocating additional work space memory. This means that it can operate independent of the data used in the program to be compiled, eliminating the run time determination of memory sizes required by compilers implemented with other methods.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram representative of a typical vector processing computer system according to the present invention.
FIG. 2 is a flow diagram representative of the steps taken to vectorize program code containing a loop according to the present invention.
FIG. 3 is a FORTRAN representation of a program solving an algorithm containing recurring data points using the vector update method of the present invention.
FIGS. 4 and 5 are flow diagrams representative of the steps a vector processing computer system goes through in executing a vectorized loop after compiling with the vector update method of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In the following Detailed Description of the Preferred Embodiments, reference is made to the accompanying Drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.
A computer system 10 which can be used for high speed data processing is illustrated generally in FIG. 1. Computer system 10 is a typical pipelined vector processing system which includes a main memory 12, a plurality of vector registers 14, a plurality of scalar registers 16, a plurality of address registers 18, a plurality of arithmetic units 20, an instruction buffer 22, an instruction register 24, a program counter 26 and a compiler 30. Computer system 10 also includes control hardware for operating in either scalar or vector mode (not shown). Vector registers 14, scalar registers 16 and address registers 18 are connected to main memory 12 and serve as intermediate high-speed memory between main memory 12 and arithmetic units 20. Vector registers 14.1 through 14.n are of uniform vector register length VL and are connected to main memory 12 and arithmetic units 20. Compiler 30 is connected to main memory 12 and serves to convert program code into machine executable code in preparation for execution in computer 10. Instruction buffer 22 is connected to main memory 12 and serves as intermediate high-speed memory between main memory 12 and instruction register 24.
To operate computer system 10, data is loaded from memory 12 into one or more of vector registers 14. Those vector registers 14 used as operand registers for a given vector process transmit individual elements to an arithmetic unit 20 at the rate of one element per time period. Once the start-up time, or arithmetic unit time, has passed, the arithmetic unit provides successive result elements on successive time periods, and these are transmitted as elements of a result vector to a vector register 14 acting as a result register for that particular vector process. Vector transfers between vector registers 14 and main memory 12 may also be accomplished at one element per time period.
By providing a number of arithmetic units (for example, floating point multiply, integer add, logical operations, etc.) and a number of vector registers 14 (for example, eight), any of which may be associated by program instruction control with any arithmetic unit 20 or memory 12, computer system 10 may have numerous vector processes proceeding simultaneously, thereby achieving extremely high data processing rates.
Program code to be run on computer system 10 must be vectorized to exploit the performance advantages inherent in vector processing. Program code containing loops in which mathematical or logical operations are performed on arrays of data points can be vectorized using the vector update method of the present invention. In the preferred embodiment of this invention this code vectorization is performed by a compiler. In an alternate embodiment, the method of this invention can be used by programmers to restructure program code for use on vector processing systems that do not have this compiler.
A flow chart of the steps of the present invention is shown in FIG. 2. At 100, program code containing a loop is converted into program code containing a nested loop of depth two through the process of strip mining. The resulting interior loop is of length VL, the vector length of the system. At 102, functions to be performed within the interior loop are set up as vector operations with the results to be saved to an intermediate vector register.
The steps of strip mining and vectorizing ignore the possibility of recurring data points. At 104, recurring data points are detected by including program code that writes the numbers 0 through VL-1 to the locations in memory of the data points used in the current interior loop. Recurring data points will be written to more than once, with the last number written pointing to the location in the intermediate vector register of the last valid operation performed on that data point.
At 106, program code is included that repairs those elements in the intermediate vector register associated with the recurring data points and saves the results to memory.
A system implemented according to the present invention has several advantages over a system implemented using the work vector method, Abe's method or Paruolo's method. There are no additional memory requirements. Systems implemented using this method, in effect, oversubscribe memory: data point arrays become scratch pads for marking errors in vectorization due to recurring data points. This leads to greater efficiency in the use of memory, resulting in the capacity to handle larger problems in the same memory space. It also eases demand on memory bandwidth.
The vector update method tracks and repairs errors due to recurring data points within the internal loop of the vectorized code. Unlike the Abe and Paruolo methods, this method operates on a set of elements from the operand arrays until all elements in the set are consumed. The Abe method must maintain in memory a list of elements that were missed by forced vectorization and then enter a second series of operations to repair the damage. The Paruolo method must maintain in memory two arrays: one for matching elements from a pool of elements to locations representative of the data point array X, and the other to track the remaining elements in the pool.
A robust general compiler capable of vectorizing recursive equations can be developed using the vector update method. The vector update method consumes elements of the operand arrays as they are handled. Computer 10 remains inside the vector update loop until all recurring data points are corrected. More importantly, since the vector update method requires no additional memory for tracking recurring data points, there is no requirement for determining the size of data point arrays. A compiler implemented under the Abe or Paruolo methods is required to perform a run time determination of the maximum size of the index into the data point array. This is no longer required. Program code can be compiled at compile time with no additional run time memory allocation steps.
FIGS. 4 and 5 illustrate an embodiment of the steps computer system 10 will execute after compiler 30 has compiled program code of the form: ##EQU5## into vectorized code. It should be apparent to those skilled in the art that other equations including equations of the form X(j(i))=f{X(j(i)), Y(), . . . , Z()} could be compiled using the same technique.
FIG. 4 illustrates the result of strip mining the loop. The routine in FIG. 4 is entered at 40 where the index variable k is initialized to zero. At 42 an initial value of ii is calculated. At 44 a check is made to see if ii is greater than or equal to VL (the vector register length of the computer used). If ii is greater than VL-1, computer 10 moves to 46 and performs the vector update method's vector operation and correction for recurring data points. Upon completion, at 48, a value equivalent to the length of a vector register is added to k. Steps 42, 44, 46 and 48 are repeated until the number of elements in ii is less than VL.
When the number of elements remaining to be processed is less than the vector length of a vector register, the remaining calculations are performed in a scalar loop. At 50, a check is made to see if all elements have been consumed. If so, control returns to the main program. If not, the remaining calculations are performed as a scalar loop of ii iterations. Then control returns to the main program.
Vectorized compiling with the vector update method results in code that executes a set of equations of the form: ##EQU6## where the internal loop calculation of the nested loop is performed as a vector operation.
FIG. 5 illustrates the preferred embodiment of the steps a vector processing computer goes through in executing a loop vectorized by a compiler 30 using the present invention. This routine will work with all loops including those containing recurring data points. The routine is entered at 60 where the contents of array X selected by j(i) in the current iteration of the loop are transferred from memory 12 to vector register 14.1. At 62, vector register 14.2 is loaded with the values of 0 through VL-1 (where VL is the vector length of computer 10). Vector register 14.2 will serve as a constant vector in the determination of recurring data points within the vector loop. At 64, vector register 14.3 is loaded with the values of Y selected by i in the current iteration.
At 66, the vector operation(s) are performed as if there are no recurring data points. In this example, the operation is the addition of X(j(i)) and Y(i). The results are stored in vector register 14.4.
Steps 68 through 72 search for recurring data points. At 68, the elements of vector register 14.2 are written to the locations in X involved in the current loop. Any recurring points will be written to more than once, with the last element written corresponding to the last vector operation performed on that data point. At 70, the contents of array X selected by j(i) in the current iteration are transferred from memory 12 to vector register 14.5. If there are no recurring points, vector register 14.5 will be a replica of vector register 14.2. If there are recurring data points, the location of the last vector operation performed on that data point will occur more than once in vector register 14.5.
At 72, a test is made for recurring data points. Vector register 14.2 is subtracted from vector register 14.5 and the result stored in vector register 14.6. At 74, a check is made for any non-zero entries in vector register 14.6. If there are none, there were no recurring data points and at 82 the updated values of X from vector register 14.4 are saved to memory 12. Control is returned to the strip mining loop of FIG. 4.
If, at 74, the check turns up one or more non-zero entries, a scalar loop is entered. At 76 the location of the first non-zero element in vector register 14.6 is saved to variable l. Variable l points at the element in vector register 14.3 whose operation on X was overwritten in step 66. At 78, this operation is performed. At 80, element l in vector register 14.6 is set to zero and control returns to 74. Steps 74, 76, 78, and 80 are repeated until all non-zero elements in vector register 14.6 have been set to zero.
Then control moves to 82, the updated values of X from vector register 14.4 are saved to memory 12 and control is returned to the exterior loop of FIG. 4.
In vector processing computer systems implemented with vector mask registers, an alternate embodiment of step 74 would be to use the vector mask register to mark the location of non-zero entries in vector register 14.6. Repair would then consist of a scalar loop that goes bit by bit through the vector mask register and performs the required operations on those locations corresponding to bits set to one in the vector mask register.
In an alternate embodiment, programmers can improve execution of code containing recursive equations by structuring their code according to the present invention. This would only be necessary in the case of program code that is intended to be executed on a vector processing computer that does not contain a compiler including the vector update method for vectorized compiling.
A FORTRAN subroutine for implementing the vector update method on a vector processing computer is shown in FIG. 3. It has been written to be executed on a vector processing computer containing vector registers of length 64 (e.g. a Cray supercomputer). The subroutine shown executes an operation of the form X(j(i))=X(j(i))+Y(i). The array j(i) is a pointer array used to match the elements of Y to the data points of X. It should be apparent to those skilled in the art that other equations including equations of the form X(j(i))=f{X(j(i)), Y(), . . . , Z()} could be implemented using the same technique.
In the subroutine of FIG. 3, lines 3-5 are variable declarations. The array tmp is the only additional memory required by routines implemented according to the present invention. The size of tmp has been chosen such that the compiler will assign tmp to a vector register rather than external memory. Lines 7-11 are comments depicting the recursive loop that is being solved according to the present invention. Line 12 is the definition of a pointer to the array of data points X.
Lines 14 and 42 are the exterior loop created by the strip mining process. Lines 16 through 20 perform the forced vectorization, saving the results into tmp. Lines 22 through 26 write an index value into the array X at the locations read during the forced vectorization performed by lines 16 through 20. In the event of a recurring data point, data point array X will contain the final index value written to that location.
Lines 28 through 34 check to see if all index values can be read back and compensate for recurring data points. At line 29 a check is made to see if there was a recurring data point. In the event of a recurring data point the index value read back at that location will differ from the index value expected. The index value read will, however, point to the location in tmp containing the result of the last valid operation on that data point. This permits the execution at line 31 of an instruction to perform the needed additional operation on that data point. Lines 36 through 40 write the results of the internal loop back to memory.
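The logic of the FIG. 3 subroutine can be modeled roughly in Python. The explicit l-loops stand in for single vector operations of length VL, tmp plays the role of the vector-register temporary, and array X itself is the only scratch pad used for detection; this is an illustrative model, not the patent's FORTRAN code:

```python
def vector_update(X, Y, j, VL):
    """Vector update model for X(j(i)) = X(j(i)) + Y(i): force the
    vector operation, then detect and repair recurring data points
    before storing, with no work arrays beyond the register-sized tmp."""
    n = len(Y)
    for base in range(0, n - n % VL, VL):       # exterior strip-mined loop
        idx = [j[base + l] for l in range(VL)]
        tmp = [X[idx[l]] + Y[base + l] for l in range(VL)]  # forced vector add
        for l in range(VL):                     # indirect write of the lane
            X[idx[l]] = l                       # numbers 0..VL-1 into X
        for l in range(VL):
            back = X[idx[l]]                    # indirect read-back
            if back != l:                       # lane l was overwritten:
                tmp[back] += Y[base + l]        # fold its Y into the slot
        for l in range(VL):                     # holding the surviving result
            X[idx[l]] = tmp[l]                  # store: the last write to a
        # recurring point carries the fully repaired value
    for i in range(n - n % VL, n):              # scalar loop for the
        X[j[i]] += Y[i]                         # remaining iterations
    return X
```

Note that repair happens entirely inside the strip: with j = [0, 0, 1, 0] and VL = 4, lanes 0 and 1 read back lane number 3, their Y values are folded into tmp[3], and the final scatter leaves the correct sum in X[0].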
As an enhancement to programs structured according to the present invention, programmers should use their knowledge as to the nature of the data in their model to adjust the stride through the elements to decrease the likelihood of recurring data points. This would be appropriate in data structures that have recurring data points as a function of location in the array.
The replacement of a scalar loop representing a recursive equation with a vector update loop according to the present invention can result in code that will execute the loop approximately three times faster than the former scalar loop. For code with maximum overlap (all elements of Y operate on one element in X) execution will occur at approximately the same rate. For programs with sparse operations over large data spaces, computing efficiency can be expected to more than double.
Although the present invention has been described with reference to the preferred embodiments, those skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

Claims (2)

What I claim is:
1. A computer implemented method of compiling program code including a vector operation on an array of data elements stored in memory of a vector processing computer system with vector registers of length VL, the method comprising the steps of:
a) scanning the program code for an equation which may have recurring data points;
b) replacing the equation with vectorized machine executable code, wherein the machine executable code comprises a nested loop and wherein the nested loop comprises:
an exterior loop which decomposes the equation into a plurality of vector length loops, including a current vector length loop;
a first interior loop for executing vector operations corresponding to the current vector length loop to form a vector register length result vector;
a second interior loop for determining occurrences, within the current vector length loop, of recurring data points, wherein the second interior loop comprises code for performing indirect writes of known values to said array, code for performing indirect reads of said array and code for comparing data read during the indirect reads of the array to the known values written to the array; and
a third interior loop for repairing elements of the result vector associated with the recurring data points and writing said repaired result vector to said memory; and
c) saving the vectorized code to said memory.
2. A computer implemented method of compiling program code including a mathematical operation on an array of data elements stored in memory of a vector processing computer system with vector registers of length VL, the method comprising the steps of:
a) scanning the program code for an equation which may have recurring data points;
b) replacing the equation with vectorized machine executable code, wherein the vectorized machine executable code comprises:
vectorization code for performing the mathematical operation as a vector operation, wherein the vectorization code comprises code for performing the mathematical operation on a vector length subset of the data elements in order to form a VL element result vector and for storing the result vector in a first vector register;
determining code for determining and correcting recurring data points in the result vector while the result vector is in the first vector register, said determining code comprising code for performing indirect writes of known values to said array of data in memory, for performing indirect reads of said array, for comparing data written during the indirect writes to data read during the indirect reads to find recurring data points and for correcting the result vector to compensate for the recurring data points; and
code for saving the corrected result vector to memory; and
c) saving the vectorized code to said memory.
US07/642,480 1991-01-17 1991-01-17 Method for compiling loops having recursive equations by detecting and correcting recurring data points before storing the result to memory Expired - Lifetime US5247696A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US07/642,480 US5247696A (en) 1991-01-17 1991-01-17 Method for compiling loops having recursive equations by detecting and correcting recurring data points before storing the result to memory

Publications (1)

Publication Number Publication Date
US5247696A (en) 1993-09-21

Family

ID=24576742

Family Applications (1)

Application Number Title Priority Date Filing Date
US07/642,480 Expired - Lifetime US5247696A (en) 1991-01-17 1991-01-17 Method for compiling loops having recursive equations by detecting and correcting recurring data points before storing the result to memory

Country Status (1)

Country Link
US (1) US5247696A (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5437035A (en) * 1991-06-14 1995-07-25 Hitachi, Ltd. Method and apparatus for compiling a program including a do-statement
US5522074A (en) * 1992-12-14 1996-05-28 Nec Corporation Vectorization system for vectorizing loop containing condition induction variables
US5671431A (en) * 1992-09-22 1997-09-23 Siemens Aktiengesellschaft Method for processing user program on a parallel computer system by inserting a tag during compiling
US5778241A (en) * 1994-05-05 1998-07-07 Rockwell International Corporation Space vector data path
US5987248A (en) * 1996-03-27 1999-11-16 Fujitsu Limited Debugging information display device
US20030200426A1 (en) * 2002-04-22 2003-10-23 Lea Hwang Lee System for expanded instruction encoding and method thereof
US20040003381A1 (en) * 2002-06-28 2004-01-01 Fujitsu Limited Compiler program and compilation processing method
US6728694B1 (en) * 2000-04-17 2004-04-27 Ncr Corporation Set containment join operation in an object/relational database management system
US6813701B1 (en) * 1999-08-17 2004-11-02 Nec Electronics America, Inc. Method and apparatus for transferring vector data between memory and a register file
US7062633B1 (en) * 1998-12-16 2006-06-13 Matsushita Electric Industrial Co., Ltd. Conditional vector arithmetic method and conditional vector arithmetic unit
US20090113168A1 (en) * 2007-10-24 2009-04-30 Ayal Zaks Software Pipelining Using One or More Vector Registers
US20100122069A1 (en) * 2004-04-23 2010-05-13 Gonion Jeffry E Macroscalar Processor Architecture
US20100175056A1 (en) * 2002-07-03 2010-07-08 Hajime Ogawa Compiler apparatus with flexible optimization
US20100235612A1 (en) * 2004-04-23 2010-09-16 Gonion Jeffry E Macroscalar processor architecture
US20100318764A1 (en) * 2009-06-12 2010-12-16 Cray Inc. System and method for managing processor-in-memory (pim) operations
US20100318773A1 (en) * 2009-06-11 2010-12-16 Cray Inc. Inclusive "or" bit matrix compare resolution of vector update conflict masks
US20100318979A1 (en) * 2009-06-12 2010-12-16 Cray Inc. Vector atomic memory operation vector update system and method
US20100318591A1 (en) * 2009-06-12 2010-12-16 Cray Inc. Inclusive or bit matrix to compare multiple corresponding subfields
US20100318769A1 (en) * 2009-06-12 2010-12-16 Cray Inc. Using vector atomic memory operation to handle data of different lengths
US8402451B1 (en) 2006-03-17 2013-03-19 Epic Games, Inc. Dual mode evaluation for programs containing recursive computations
US20160139897A1 (en) * 2012-09-28 2016-05-19 Intel Corporation Loop vectorization methods and apparatus
US9424039B2 (en) * 2014-07-09 2016-08-23 Intel Corporation Instruction for implementing vector loops of iterations having an iteration dependent condition
US10318261B2 (en) * 2014-11-24 2019-06-11 Mentor Graphics Corporation Execution of complex recursive algorithms
US10402177B2 (en) 2013-03-15 2019-09-03 Intel Corporation Methods and systems to vectorize scalar computer program loops having loop-carried dependences
US20210286601A1 (en) * 2019-12-09 2021-09-16 Horizon Quantum Computing Pte. Ltd. Systems and methods for unified computing on digital and quantum computers

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4710872A (en) * 1985-08-07 1987-12-01 International Business Machines Corporation Method for vectorizing and executing on an SIMD machine outer loops in the presence of recurrent inner loops
JPS63120338A (en) * 1986-11-10 1988-05-24 Matsushita Electric Ind Co Ltd Program converting device
US4821181A (en) * 1986-01-08 1989-04-11 Hitachi, Ltd. Method for converting a source program of high level language statement into an object program for a vector processor
US4833606A (en) * 1986-10-09 1989-05-23 Hitachi, Ltd. Compiling method for vectorizing multiple do-loops in source program
US4858115A (en) * 1985-07-31 1989-08-15 Unisys Corporation Loop control mechanism for scientific processor
US4967350A (en) * 1987-09-03 1990-10-30 Director General Of Agency Of Industrial Science And Technology Pipelined vector processor for executing recursive instructions
US5036454A (en) * 1987-05-01 1991-07-30 Hewlett-Packard Company Horizontal computer having register multiconnect for execution of a loop with overlapped code
US5083267A (en) * 1987-05-01 1992-01-21 Hewlett-Packard Company Horizontal computer having register multiconnect for execution of an instruction loop with recurrance
US5151991A (en) * 1987-10-21 1992-09-29 Hitachi, Ltd. Parallelization compile method and system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4858115A (en) * 1985-07-31 1989-08-15 Unisys Corporation Loop control mechanism for scientific processor
US4710872A (en) * 1985-08-07 1987-12-01 International Business Machines Corporation Method for vectorizing and executing on an SIMD machine outer loops in the presence of recurrent inner loops
US4821181A (en) * 1986-01-08 1989-04-11 Hitachi, Ltd. Method for converting a source program of high level language statement into an object program for a vector processor
US4833606A (en) * 1986-10-09 1989-05-23 Hitachi, Ltd. Compiling method for vectorizing multiple do-loops in source program
JPS63120338A (en) * 1986-11-10 1988-05-24 Matsushita Electric Ind Co Ltd Program converting device
US5036454A (en) * 1987-05-01 1991-07-30 Hewlett-Packard Company Horizontal computer having register multiconnect for execution of a loop with overlapped code
US5083267A (en) * 1987-05-01 1992-01-21 Hewlett-Packard Company Horizontal computer having register multiconnect for execution of an instruction loop with recurrence
US4967350A (en) * 1987-09-03 1990-10-30 Director General Of Agency Of Industrial Science And Technology Pipelined vector processor for executing recursive instructions
US5151991A (en) * 1987-10-21 1992-09-29 Hitachi, Ltd. Parallelization compile method and system

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5437035A (en) * 1991-06-14 1995-07-25 Hitachi, Ltd. Method and apparatus for compiling a program including a do-statement
US5671431A (en) * 1992-09-22 1997-09-23 Siemens Aktiengesellschaft Method for processing user program on a parallel computer system by inserting a tag during compiling
US5522074A (en) * 1992-12-14 1996-05-28 Nec Corporation Vectorization system for vectorizing loop containing condition induction variables
US5778241A (en) * 1994-05-05 1998-07-07 Rockwell International Corporation Space vector data path
US5987248A (en) * 1996-03-27 1999-11-16 Fujitsu Limited Debugging information display device
US7062633B1 (en) * 1998-12-16 2006-06-13 Matsushita Electric Industrial Co., Ltd. Conditional vector arithmetic method and conditional vector arithmetic unit
US20050055536A1 (en) * 1999-08-17 2005-03-10 Ansari Ahmad R. Compiler instructions for vector transfer unit
US6813701B1 (en) * 1999-08-17 2004-11-02 Nec Electronics America, Inc. Method and apparatus for transferring vector data between memory and a register file
US6728694B1 (en) * 2000-04-17 2004-04-27 Ncr Corporation Set containment join operation in an object/relational database management system
US20030200426A1 (en) * 2002-04-22 2003-10-23 Lea Hwang Lee System for expanded instruction encoding and method thereof
US20040003381A1 (en) * 2002-06-28 2004-01-01 Fujitsu Limited Compiler program and compilation processing method
US20100175056A1 (en) * 2002-07-03 2010-07-08 Hajime Ogawa Compiler apparatus with flexible optimization
US8418157B2 (en) * 2002-07-03 2013-04-09 Panasonic Corporation Compiler apparatus with flexible optimization
US7975134B2 (en) 2004-04-23 2011-07-05 Apple Inc. Macroscalar processor architecture
US8412914B2 (en) * 2004-04-23 2013-04-02 Apple Inc. Macroscalar processor architecture
US8578358B2 (en) 2004-04-23 2013-11-05 Apple Inc. Macroscalar processor architecture
US20100235612A1 (en) * 2004-04-23 2010-09-16 Gonion Jeffry E Macroscalar processor architecture
US20100122069A1 (en) * 2004-04-23 2010-05-13 Gonion Jeffry E Macroscalar Processor Architecture
US8065502B2 (en) * 2004-04-23 2011-11-22 Apple Inc. Macroscalar processor architecture
US20120066482A1 (en) * 2004-04-23 2012-03-15 Gonion Jeffry E Macroscalar processor architecture
US8402451B1 (en) 2006-03-17 2013-03-19 Epic Games, Inc. Dual mode evaluation for programs containing recursive computations
US20090113168A1 (en) * 2007-10-24 2009-04-30 Ayal Zaks Software Pipelining Using One or More Vector Registers
US8136107B2 (en) * 2007-10-24 2012-03-13 International Business Machines Corporation Software pipelining using one or more vector registers
US20100318773A1 (en) * 2009-06-11 2010-12-16 Cray Inc. Inclusive "or" bit matrix compare resolution of vector update conflict masks
US8433883B2 (en) 2009-06-11 2013-04-30 Cray Inc. Inclusive “OR” bit matrix compare resolution of vector update conflict masks
US8458685B2 (en) 2009-06-12 2013-06-04 Cray Inc. Vector atomic memory operation vector update system and method
US20100318591A1 (en) * 2009-06-12 2010-12-16 Cray Inc. Inclusive or bit matrix to compare multiple corresponding subfields
US20100318979A1 (en) * 2009-06-12 2010-12-16 Cray Inc. Vector atomic memory operation vector update system and method
US20100318764A1 (en) * 2009-06-12 2010-12-16 Cray Inc. System and method for managing processor-in-memory (pim) operations
US8583898B2 (en) * 2009-06-12 2013-11-12 Cray Inc. System and method for managing processor-in-memory (PIM) operations
US8826252B2 (en) * 2009-06-12 2014-09-02 Cray Inc. Using vector atomic memory operation to handle data of different lengths
US8954484B2 (en) 2009-06-12 2015-02-10 Cray Inc. Inclusive or bit matrix to compare multiple corresponding subfields
US20100318769A1 (en) * 2009-06-12 2010-12-16 Cray Inc. Using vector atomic memory operation to handle data of different lengths
US9547474B2 (en) 2009-06-12 2017-01-17 Cray Inc. Inclusive or bit matrix to compare multiple corresponding subfields
US20160139897A1 (en) * 2012-09-28 2016-05-19 Intel Corporation Loop vectorization methods and apparatus
US9898266B2 (en) * 2012-09-28 2018-02-20 Intel Corporation Loop vectorization methods and apparatus
US10402177B2 (en) 2013-03-15 2019-09-03 Intel Corporation Methods and systems to vectorize scalar computer program loops having loop-carried dependences
US9424039B2 (en) * 2014-07-09 2016-08-23 Intel Corporation Instruction for implementing vector loops of iterations having an iteration dependent condition
US9921837B2 (en) 2014-07-09 2018-03-20 Intel Corporation Instruction for implementing iterations having an iteration dependent condition with a vector loop
US10318261B2 (en) * 2014-11-24 2019-06-11 Mentor Graphics Corporation Execution of complex recursive algorithms
US20210286601A1 (en) * 2019-12-09 2021-09-16 Horizon Quantum Computing Pte. Ltd. Systems and methods for unified computing on digital and quantum computers
US11842177B2 (en) * 2019-12-09 2023-12-12 Horizon Quantum Computing Pte. Ltd. Systems and methods for unified computing on digital and quantum computers

Similar Documents

Publication Publication Date Title
US5247696A (en) Method for compiling loops having recursive equations by detecting and correcting recurring data points before storing the result to memory
Frison et al. BLASFEO: Basic linear algebra subroutines for embedded optimization
US8583898B2 (en) System and method for managing processor-in-memory (PIM) operations
US5361354A (en) Optimization of alternate loop exits
Du et al. From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming
US6113650A (en) Compiler for optimization in generating instruction sequence and compiling method
US5537620A (en) Redundant load elimination on optimizing compilers
Shin et al. Compiler-controlled caching in superword register files for multimedia extension architectures
US8458685B2 (en) Vector atomic memory operation vector update system and method
KR102379894B1 (en) Apparatus and method for managing address conflicts when performing vector operations
TW201737075A (en) Complex multiply instruction
Lemaitre et al. Cholesky factorization on SIMD multi-core architectures
Falch et al. Register caching for stencil computations on GPUs
Bischof et al. ADIFOR: Automatic differentiation in a source translator environment
US20170269931A1 (en) Method and Computing System for Handling Instruction Execution Using Affine Register File on Graphic Processing Unit
US8826252B2 (en) Using vector atomic memory operation to handle data of different lengths
DeVries et al. Parallel implementations of FGMRES for solving large, sparse non-symmetric linear systems
Lee On the Floating Point Performance of the i860™ Microprocessor
US20230095916A1 (en) Techniques for storing sub-alignment data when accelerating smith-waterman sequence alignments
Ando et al. Efficiency of the CYBER 205 for stochastic simulations of a simultaneous, nonlinear, dynamic econometric model
US20230101085A1 (en) Techniques for accelerating smith-waterman sequence alignments
US20230305844A1 (en) Implementing specialized instructions for accelerating dynamic programming algorithms
WO2022130883A1 (en) Compiling device, compiling method, and compiling program recording medium
Hu et al. Optimizing INTBIS on the CRAY Y-MP
Gao et al. Insufficient Vectorization: A New Method to Exploit Superword Level Parallelism

Legal Events

Date Code Title Description
AS Assignment

Owner name: CRAY RESEARCH INC., 608 SECOND AVENUE SOUTH, MINNEAPOLIS, MN 55402, A CORP. OF DE

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST.;ASSIGNOR:BOOTH, MICHAEL W.;REEL/FRAME:005613/0929

Effective date: 19910114

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment

Year of fee payment: 4

AS Assignment

Owner name: SILICON GRAPHICS, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:CRAY RESEARCH, L.L.C.;REEL/FRAME:010927/0853

Effective date: 20000524

FPAY Fee payment

Year of fee payment: 8

AS Assignment

Owner name: FOOTHILL CAPITAL CORPORATION, CALIFORNIA

Free format text: SECURITY AGREEMENT;ASSIGNOR:SILICON GRAPHICS, INC.;REEL/FRAME:012428/0236

Effective date: 20011109

AS Assignment

Owner name: U.S. BANK NATIONAL ASSOCIATION, AS TRUSTEE, CALIFORNIA

Free format text: SECURITY INTEREST;ASSIGNOR:SILICON GRAPHICS, INC.;REEL/FRAME:014805/0855

Effective date: 20031223

FPAY Fee payment

Year of fee payment: 12

AS Assignment

Owner name: GENERAL ELECTRIC CAPITAL CORPORATION, CALIFORNIA

Free format text: SECURITY INTEREST;ASSIGNOR:SILICON GRAPHICS, INC.;REEL/FRAME:018545/0777

Effective date: 20061017

AS Assignment

Owner name: MORGAN STANLEY & CO., INCORPORATED, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:GENERAL ELECTRIC CAPITAL CORPORATION;REEL/FRAME:019995/0895

Effective date: 20070926

AS Assignment

Owner name: GRAPHICS PROPERTIES HOLDINGS, INC., NEW YORK

Free format text: CHANGE OF NAME;ASSIGNOR:SILICON GRAPHICS, INC.;REEL/FRAME:028066/0415

Effective date: 20090604

AS Assignment

Owner name: RPX CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:GRAPHICS PROPERTIES HOLDINGS, INC.;REEL/FRAME:029564/0799

Effective date: 20121224