
COMPUTING PROCESSOR WITH MEMORYLESS FUNCTION UNITS EACH CONNECTED TO DIFFERENT PART OF A MULTIPORTED MEMORY

This application is a continuation-in-part of applicant's copending application, Ser. No. 781,231 filed Sept. 27, 1985, now abandoned, which was a continuation of applicant's prior application, Ser. No. 527,147 now abandoned.

BACKGROUND OF THE INVENTION

The present invention relates generally to the field of computer systems and more particularly to computer processing elements for use in concurrent computing systems.

Although general purpose computing systems of the Von Neumann type have made great advances in both computing speed and cost per computation through improvements in VLSI circuit design and fabrication, these systems are still too slow to perform many real time computational problems. Computer applications in the signal processing area often require more than a billion calculations per second. This is far above the throughput of currently available Von Neumann computers.

The classical Von Neumann computer consists of a memory connected to a central processing unit. Instructions and data are fetched from the memory by the central processing unit which is responsible for essentially all of the computational tasks. The typical central processing unit is capable of executing hundreds of different instructions. However, it cannot execute these simultaneously. Hence, at any given time, most of the circuitry in the central processing unit is idle, since the central processing unit typically executes only one instruction at a time. In addition to reducing the cost effectiveness of the central processing unit, this idle circuitry reduces the speed at which the central processing unit can operate. The need to include this circuitry on the computer chip results in a larger chip with longer connecting paths between the various processing elements. These longer signal paths have significant parasitic capacitances which limit the speed at which they can be driven. Hence, as the size of the central processing unit is increased, the maximum clock rate at which it can run is decreased. This further reduces the cost effectiveness of the central processing unit.

In addition, the memory from which the central processing unit fetches instructions and data is typically located on a separate chip and hence is also limited in speed by the capacitances of the signal paths. This limitation can be reduced somewhat by including a small fast cache memory on the central processing unit chip for holding instructions and/or data which would otherwise be repeatedly transferred between the central processing unit and the large system memory. However, the size of the cache memory needed to relieve the problems introduced by the off chip system memory may be too large to be included on the central processing unit chip.

These limitations of the classical Von Neumann computer design have been overcome to some degree in the prior art by designing special purpose computing hardware which is optimized for a particular computational task. For example, a common problem in signal processing involves the construction of a digital filter which functions in an analogous manner to an analog band pass filter. The computations involve forming a sum of the products of the digital signal multiplied by weighting factors depending on the time at which each digital signal value was measured. This may be accomplished by constructing a linear array of N processors. At any given time, the Kth processor in the linear array contains the value of the digital signal as it existed K clock periods earlier. During each clock period, each processor computes one term in the sum, i.e., the product of the digital signal value it currently is storing and a constant inputted to it on a separate signal line and stored with the processing element. The result of this computation in the Kth processor is then added to the result from the (K-1)th processor and passed on to the (K+1)st processor together with the digital signal value that was stored in the Kth processor. A new digital signal value is inputted to the first processor after the start of each clock period. The value of the sum outputted from the Nth processor is used to calculate the filtered signal value corresponding to the digital signal value.

[illegible] special purpose hardware [illegible] lack of [illegible]. For example, the array of processors described above cannot be easily reconfigured to perform a calculation requiring two multiplications and [illegible] addition in each processor. Similarly, if the particular problem does not require all N stages, the unused stages cannot be used to perform other multiplication and addition steps. As a result, these special purpose processors only have a high efficiency, as measured by computations per second per square micron of integrated circuit area, for a small class of problems.

There have been a number of attempts to construct more general purpose processors which avoid the limitations of the classical Von Neumann computer. The vector processing computer described by Cray (U.S. Pat. No. 4,128,880) is typical of such a computer. This computer is optimized for repetitive calculations involving a small number of operations which are to be performed successively on each element of one or more vectors. For example, the process may involve adding corresponding elements of two vectors and then storing the result in the corresponding element of a third "result" vector. The data making up the vectors is transferred to a set of vector registers in this special purpose computer from a main memory which is usually part of a large computing system in which this special purpose system is incorporated. This architecture provides a substantial improvement over the classical Von Neumann architecture for a number of reasons. First, the vector registers provide a high speed memory system optimized for transferring successive elements of one or more vectors to one of a plurality of function units which performs the desired calculation and then transferring the results back to one of the vector registers. This reduces the time needed to transfer data to and from the slower system memory, since data that is repeatedly used is held in the vector registers until it is no longer needed. Hence, the need to transfer the same data back and forth between the slower system memory and the central processing unit is significantly reduced. Second, the instructions needed to carry out the operations on the vectors need not be repeatedly transferred between the system memory and the vector processor. Third, the function units may be optimized for the specific calculation. This allows smaller chip areas to be used and hence higher clock rates.

This type of vector processor may be reconfigured to a limited degree which makes it applicable to a broader class of problems than the signal processing computer described above. In the "chaining mode" described by Cray, the results from one vector operation which are stored in a result vector register are immediately available as operands to a second function unit which may perform computations concurrently with other function units.

This type of vector processor, however, suffers from three significant problems. First, it is a special purpose system which is only optimized for a specific limited class of computational problems, i.e., those involving applying a small computational program successively to each element in one or more vectors. It is inefficient at carrying out computations not in this class, and there is no way to reconfigure it when a problem for which it is not optimized is encountered. For example, if the vectors in question are too long to fit into the vector registers, there is no simple way of combining two registers to form one long register. Similarly, if the code needed to carry out the computations does not fit in the internal memory allocated for code storage, there is no way to utilize free memory in the vector register area to provide additional code storage space. In these cases, the calculation must be broken into sub-calculations which are run in tandem on the processor.

Second, it is difficult to configure such a system such that all the various function units operate concurrently. If the particular computational program does not utilize all of the function units present in the processor, there is no practical method for applying the idle computational power to another part of the overall program running on the main computer system to which the vector processor has been connected.

Finally, this type of vector processor may not be efficiently combined with other such processors to form a processing array similar to that described above with regard to digital filtering. There are numerous situations in which the optimum processor configuration consists of an array of processors in which each processor performs the same computation, but on different data. The digital filtering example is such a case. Because of the high costs inherent in designing and testing a new VLSI circuit, considerable economies of scale can be realized if an array of processors is used rather than constructing one large special purpose processor having the equivalent number of function units. This is particularly true when the individual processors are of a sufficiently general nature that they may be applied to a wide variety of problems. In such a case the design and initial fabrication costs can be spread over a large number of parts thus allowing significant economies of scale to be obtained. To obtain the maximum economies of scale in this case, the replicated processor unit should contain as little control circuitry as possible, since this control function can be applied at the array level by a single control processor which services all processors in the array, thus eliminating the need to replicate this control hardware in each processor.
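The digital filtering array referred to above, in which every processor performs the same multiply-and-add on different data, can be illustrated with a short simulation. This is only a sketch under stated assumptions: the function name `fir_linear_array` is hypothetical, stages are numbered from zero (so stage 0 holds the newest sample), and the pipeline latency of the partial sums travelling down the array is ignored.

```python
from collections import deque


def fir_linear_array(samples, weights):
    """Simulate a linear array of N multiply-add stages, one stage per weight.

    Assumption: stage k holds the sample that entered k clock periods ago
    (zero-based), and partial sums ripple through all stages in one period.
    """
    n = len(weights)
    stored = deque([0.0] * n, maxlen=n)  # one stored signal value per stage
    outputs = []
    for x in samples:
        # A new sample enters the first stage; older samples shift one stage on.
        stored.appendleft(x)
        partial = 0.0
        for k in range(n):
            # Stage k adds its own product to the sum received from stage k-1.
            partial += weights[k] * stored[k]
        outputs.append(partial)  # the sum leaving the last stage
    return outputs
```

Each stage needs only a stored sample, a stored constant, and one multiply-add, which is why the control function can plausibly live outside the replicated unit.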

The vector processor design described above contains considerable control circuitry which is designed to allow it to run independently of system control for significant periods of time. This includes memory for storing the code of the program to be executed and instruction decoding circuitry which is different for different instructions. At most, an array of such vector processors requires one copy of this circuitry. The unnecessary replication of this circuitry requires larger computer chips which in turn leads to slower clock rates as well as higher design and construction costs. Furthermore, arrays of processors of this type would suffer from input/output bottlenecks, since one bus is used for transferring data and instructions to and from each processor.

Broadly, it is an object of the present invention to provide a reconfigurable computer processor.

It is a further object of the present invention to provide a reconfigurable computer processor which contains a minimum amount of control circuitry.

It is a still further object of the present invention to provide a computer processor that may be efficiently combined with other such processors to form a processing array which may be controlled by a single controller.

These and other objects of the present invention will become apparent from the following detailed description of the present invention and the accompanying drawings.

SUMMARY OF THE INVENTION

The present invention consists of a processing element which may be used either separately or in an array of similar processing elements for performing concurrent data processing calculations. The processing element includes a multiported memory unit for storing data to be processed by any of a plurality of function units which are connected to the multiported memory unit. The multiported memory unit includes a number of data storage slots for storing data words to be processed and the results of said processing. Each function unit performs a calculation having as its inputs one or more data words from the multiported memory unit. The result of this calculation is stored back in the multiported memory unit. The transfer of data to and from the function units is accomplished by use of the ports on said multiported memory unit. The multiported memory unit has a plurality of memory input ports used for receiving data to be stored therein and a plurality of memory output ports used for transmitting data which is stored in said multiported memory unit. Each function unit has one or more function unit input ports and one or more function unit output ports. Each function unit receives its inputs over its function unit input ports and transmits its results over its function unit output ports. Each function unit output port is connected to one memory input port and each function unit input port is connected to one memory output port. The total number of memory input ports is greater than or equal to the total number of function unit output ports. The excess memory input ports, if any, which are not connected to a function unit may be used for inputting data to be stored in the multiported memory unit from adjacent processing elements or from a host computer system in which the processing element is integrated. Similarly, the number of memory output ports is greater than or equal to the total number of function unit input ports.
The excess memory output ports, if any, which are not connected to a function unit may be used for outputting data from the multiported memory unit to adjacent processing elements or a host computer. The data word stored in any data storage slot may be transmitted through any memory output port. Similarly, the data word received on any memory input port can be stored in any data storage slot in the multiported memory unit. The data manipulated by the processing element is controlled by specifying a correspondence between data storage slots, memory input ports and memory output ports. The processing element includes circuitry for receiving a list of data storage slots corresponding to each memory output port over which the data word in said data storage slot is to be sent. The processing element also includes circuitry for receiving a list of data storage slots corresponding to each memory input port from which data is to be stored in said data storage slot. The processing element also includes circuits for receiving a list of operation codes defining the functions to be carried out by each function unit.
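The port-count relationships stated in this summary can be checked with a small wiring sketch. It is illustrative only: `FunctionUnitSpec`, `assign_ports`, and the choice to wire ports in consecutive order are assumptions made for the example and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionUnitSpec:
    name: str
    n_inputs: int   # number of function unit input ports
    n_outputs: int  # number of function unit output ports


def assign_ports(units, n_mem_out, n_mem_in):
    """Wire every function unit port to its own memory port.

    Returns (wiring, excess_out, excess_in). Each wiring entry maps a unit
    name to (memory output ports feeding it, memory input ports it drives).
    """
    need_out = sum(u.n_inputs for u in units)
    need_in = sum(u.n_outputs for u in units)
    # Memory ports must be at least as numerous as the function unit ports.
    if need_out > n_mem_out or need_in > n_mem_in:
        raise ValueError("not enough memory ports for the function units")
    wiring = {}
    next_out = next_in = 0
    for u in units:
        outs = list(range(next_out, next_out + u.n_inputs))
        ins = list(range(next_in, next_in + u.n_outputs))
        wiring[u.name] = (outs, ins)
        next_out += u.n_inputs
        next_in += u.n_outputs
    # Ports left over are available for neighbours or the host system.
    excess_out = list(range(next_out, n_mem_out))
    excess_in = list(range(next_in, n_mem_in))
    return wiring, excess_out, excess_in
```

For instance, a two-input adder and a three-input multiply-add unit on a memory with six output ports and three input ports would leave one excess port in each direction.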

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a processing element according to the present invention.

FIG. 2 is a block diagram of the preferred embodiment of the multiported memory unit shown in FIG. 1.

FIG. 3 is a block diagram of a processing array using processing elements according to the present invention.

FIG. 4 is a block diagram of a portion of one of the processing elements shown in FIG. 3.

DETAILED DESCRIPTION OF THE INVENTION

A block diagram of a processing element according to the present invention for use in a data processing system is shown at 10 in FIG. 1. The processing element includes a multiported memory 12 which is used to store data and control the flow of that data to and from a plurality of function units 14 which perform various operations on the data. The multiported memory 12 also provides the means for communicating data between the processing element and other similar processing elements or an external data processing system. The multiported memory 12 includes four basic elements: a random access memory 16, a plurality of memory output ports 18, a plurality of memory input ports 20, and a memory controller 22. The random access memory 16 is used to store data which is to be processed by the function units 14 or which is to be passed on to other processing elements or an external data processing system. In the preferred embodiment, the random access memory 16 is organized into a plurality of memory slots of fixed length with one such memory slot being used to store each word of data.

The memory output ports 18 are used to transmit data stored in the random access memory 16. The memory input ports 20 are used to receive data to be stored in the random access memory 16. The input ports 19 and output ports 17 which are not connected to a function unit may be used to transfer data between a processing element and an adjacent processing element or host data processing system. The memory output ports 18 which are connected to function units 14 may also be used to transmit data to an adjacent processing element at the same time they are used to transmit data to the function unit if these output port lines are connected off of the chip on which the processing element of the present invention is constructed. Similarly, the memory input ports 20 which are connected to function units 14 may also be used to receive data from an adjacent processing element when said memory input ports are not actually receiving data from a function unit.

These input and output operations are carried out under the control of the memory controller 22 which is responsive to signals on a control bus 24 which define input and output lists of data words. During each major memory cycle, as defined below, one word of data is copied from the random access memory 16 to each of the memory output ports 18, and one word of data is copied back into the random access memory 16 from each of the memory input ports 20. The location in the random access memory 16 from which each said word is to be copied is specified by an ordered list of addresses sent to the memory controller 22 on the control bus 24 by the data processing system in which said processing element is integrated. The order of the addresses in this list specifies the memory output port 18 which is to receive the data stored at the address in question. Similarly, the location in the random access memory 16 at which each data word currently at a memory input port 20 is to be stored is specified by an ordered list of addresses sent to the memory controller 22 on the control bus 24. The order of each address specifies the memory input port 20 whose contents are to be copied to the memory location specified by said address. A "blank" address may be used to specify that no input or output is to be performed with a given memory input or output port. These input and output lists allow the memory controller 22 to function as both an input control means for entering data into the random access memory 16 and an output controller for transmitting data from said memory.
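The role of the ordered address lists can be sketched in a few lines. This is a minimal illustration, not the controller described in the figures: the names `output_phase` and `input_phase` are hypothetical, the random access memory is modelled as a Python list of words, and the "blank" address is encoded as `None` purely as an assumption.

```python
BLANK = None  # assumed encoding of the "blank" address


def output_phase(ram, output_list):
    """Copy one word from RAM to each memory output port.

    The position of an address in output_list selects the output port;
    a BLANK entry leaves that port without data (shown here as None).
    """
    return [None if addr is BLANK else ram[addr] for addr in output_list]


def input_phase(ram, input_list, port_values):
    """Copy the word present on each memory input port into RAM.

    The position of an address in input_list selects the input port;
    a BLANK entry means the word on that port is not stored.
    """
    for addr, value in zip(input_list, port_values):
        if addr is not BLANK:
            ram[addr] = value
```

Because the port is implied by list position, the lists carry no explicit port numbers, and the same two routines serve function units and external neighbours alike.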

Each of the function units 14 performs a calculation using one or more data words stored in the multiported memory 12. The function units 14 may be simple adders, more general arithmetic logical units of the type used in the central processing unit of a typical prior art computing system, or special purpose function units, e.g., a function unit for evaluating an expression such as a*b+c. In the preferred embodiment, the result of any function unit calculation is independent of the previous calculation carried out by the function unit in question; i.e., the function units have no memory. This simplifies the control of the system, since the controller does not have to keep track of the previous history of a particular function unit. Each function unit 14 has one or more function unit input ports 26 which are used for transferring data into the function unit 14 in question. There is one such function unit input port 26 for each data word used as an input to the function unit in question. Each function unit input port 26 is connected to a different one of the memory output ports 18. Similarly, each function unit 14 has one or more function unit output ports 28 for transmitting the results of the operation performed by said function unit. Each of the function unit output ports 28 is connected to a different one of the memory input ports 20.

The exemplary function units 14 shown in FIG. 1 have two function unit input ports 26 and one function unit output port 28; however, it will be apparent to those skilled in the art that different numbers of function unit input ports 26 and function unit output ports 28 may be used on each function unit 14. For example, a function unit 14 for computing the value of the expression a*b+c would have three function unit input ports 26 for receiving the values of a, b, and c. Similarly, a function unit 14 for computing a function of a complex number having a real and imaginary part would have two function unit output ports 28, one for transmitting the real part of the result to the multiported memory 12 and one for transmitting the imaginary part of the result to the multiported memory 12.
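Memoryless function units with differing port counts can be modelled as pure functions that return one value per output port. The sketch below is an assumption-laden illustration: the unit names, the choice of complex squaring as the two-output example, and the convention that units consume consecutive memory output ports are all hypothetical.

```python
def add(a, b):
    # Two input ports, one output port.
    return (a + b,)


def multiply_add(a, b, c):
    # Three input ports (a, b, c), one output port carrying a*b + c.
    return (a * b + c,)


def complex_square(re, im):
    # Two output ports: the real part and the imaginary part of (re + i*im)**2.
    return (re * re - im * im, 2 * re * im)


# (function, number of input ports), listed in the assumed wiring order.
UNITS = [(add, 2), (multiply_add, 3), (complex_square, 2)]


def evaluate_units(units, output_port_words):
    """Feed consecutive memory output port words to each unit.

    Returns the words presented to the memory input ports, in wiring order.
    No unit keeps state between calls, mirroring the memoryless property.
    """
    results = []
    pos = 0
    for func, n_in in units:
        results.extend(func(*output_port_words[pos:pos + n_in]))
        pos += n_in
    return results
```

Since each result depends only on the words currently on the input ports, a controller driving such units needs no record of what any unit computed previously.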

The total number of function unit input ports 26 must be less than or equal to the total number of memory output ports 18. The memory output ports 18 which are not used for transmitting data to the function units 14 are used for transmitting data to other data processing elements of the data processing system in which the processing element of the present invention is integrated. Similarly, the total number of function unit output ports 28 must be less than or equal to the total number of memory input ports 20. The memory input ports 20 which are not used for receiving data from one of the function units 14 are used for receiving data from other data processing elements of the data processing system in which the processing element of the present invention is integrated. These other processing elements may be processing elements according to the present invention or data processing elements of a more traditional nature.

The operation of the processing element 10 shown in FIG. 1 may be most easily understood by considering a simple version of said processing element in which each of the function units 14 is capable of performing precisely one operation. The operation in question may be different for different function units 14. All of the function units 14 perform their respective operations at the same time. At the beginning of each major memory cycle, the list of addresses specifying the input data for each of the function units 14 is transmitted to the memory controller 22 by the host data processing system over bus 24. The memory controller 22 causes the data stored at the first address to be outputted on the first memory output port 18, the data stored at the second address to be outputted on the second memory output port 18, and so on. At the end of these output operations, each function unit 14 will have been loaded with the correct input data for its particular operation. A second list of addresses is then transmitted to the memory controller 22 by the host data processing system over bus 24 which specifies the locations in the random access memory 16 at which the results of the operations specified by the first list are to be stored. After a time which is sufficient to allow the function units 14 to complete their operations has expired, the memory controller 22 causes the data present on the first memory input port 20 to be stored at the first address in this second list, the data present on the second memory input port 20 to be stored at the second such address, and so on. When this storage operation is completed, a new major memory cycle may commence.

Hence, during the major memory cycle described above, one operation is completed by each function unit 14 using data words inputted to it through the memory output ports 18 which are connected to said function unit. In addition, one output data word is transferred from the processing element 10 to an "adjacent" processing element on each of the memory output ports 18 which are not connected to a function unit input port 26, and one input data word is stored in the random access memory 16 from each memory input port 20. If the memory input port 20 in question is connected to a function unit 14, this data word will be the result of the function unit calculation. If the memory input port 20 is not connected to a function unit output port 28, these input data words will have originated in one of the adjacent processing elements or in the host data processing system. It should be noted that, since there is a fixed correspondence between memory input ports, memory output ports, and function units, the input and output lists define a set of "instructions" which are carried out by the processing element during the major memory cycle in which said lists are received. A more detailed implementation of the present invention is described below.

In the preferred embodiment, the processing element of the present invention is fabricated on a single VLSI circuit chip. Fabrication on a single chip results in lower cost and higher speed operation than may, in general, be achieved in multi-chip systems. As pointed out above, the maximum clock speed at which the processing element may be run is determined by the parasitic capacitances of the various signal paths which connect the functional elements of the processing element. Off-chip conducting paths have significantly higher parasitic capacitances and, hence, are to be avoided. The constraint of single chip fabrication places a limit on the total amount of circuitry which may be included in the processing element. The present invention requires considerably less circuitry than prior art processing elements which are capable of the same computational throughput.

One key to this reduction in circuitry is the uniform methodology used for transferring data between the random access memory 16 and the adjacent processing elements and between the random access memory 16 and the various function units 14. These data transfers are carried out by the same control circuitry in the present invention. In fact, the memory controller 22 has no way of knowing whether the data it is transferring to and from the random access memory 16 is being communicated between the random access memory 16 and its function units or between the random access memory 16 and processing elements which are external to the processing element containing said memory controller 22. The prior art systems use circuitry to transfer data between local memories and function units which is different from the circuitry used to move data between the host data processing system and the local memories. This additional circuitry reduces the on-chip space available for other circuitry such as function units.

In addition, the processing element of the present invention supports multiple data paths into and out of the processing element. This allows more data to be transferred between the processing element and the data processing system which is supplying the data to be processed. This is particularly important in processing elements having multiple function units, since such systems can often produce results in less time than it takes to transfer the results and obtain new input data. Such prior art systems often must wait for the relevant data to be transferred over a single bus which has a bandwidth which is significantly less than the bandwidth of the combined function units. The multiple data paths of the present invention together with the overlapping of input/output operations and data processing result in a substantial reduction in the on-chip memory needed to guarantee that all of the function units will operate at their optimum throughput.

The higher complexity input/output designs of the prior art systems significantly reduce the throughput of such systems. For example, consider the vector processing system taught by Cray in the above cited U.S. patent. It too includes a local memory which is used to deliver data words to function units and to store the results from these function units. In fact, it contains several such memories which are used to store data words of different types. These memories are loaded over a single bus from a large system memory which is part of the data processing system in which it is integrated. The data transfer operations used to load and
