output ports 18. The memory output ports 18 which are not used for transmitting data to the function units 14 are used for transmitting data to other data processing elements of the data processing system in which the processing element of the present invention is integrated. Similarly, the total number of function unit output ports 28 must be less than or equal to the total number of memory input ports 20. The memory input ports 20 which are not used for receiving data from one of the function units 14 are used for receiving data from other data processing elements of the data processing system in which the processing element of the present invention is integrated. These other processing elements may be processing elements according to the present invention or data processing elements of a more traditional nature.
The operation of the processing element 10 shown in FIG. 1 may be most easily understood by considering a simple version of said processing element in which each of the function units 14 is capable of performing precisely one operation. The operation in question may be different for different function units 14. All of the function units 14 perform their respective operations at the same time. At the beginning of each major memory cycle, the list of addresses specifying the input data for each of the function units 14 is transmitted to the memory controller 22 by the host data processing system over bus 24. The memory controller 22 causes the data stored at the first address to be outputted on the first memory output port 18, the data stored at the second address to be outputted on the second memory output port 18, and so on. At the end of these output operations, each function unit 14 will have been loaded with the correct input data for its particular operation. A second list of addresses, which specifies the locations in the random access memory 16 at which the results of the operations specified by the first list are to be stored, is then transmitted to the memory controller 22 by the host data processing system over bus 24. After a time sufficient to allow the function units 14 to complete their operations has expired, the memory controller 22 causes the data present on the first memory input port 20 to be stored at the first address in this second list, the data present on the second memory input port 20 to be stored at the second such address, and so on. When this storage operation is completed, a new major memory cycle may commence.
Hence, during the major memory cycle described above, one operation is completed by each function unit 14 using data words inputted to it through the memory output ports 18 which are connected to said function unit. In addition, one output data word is transferred from the processing element 10 to an "adjacent" processing element on each of the memory output ports 18 which are not connected to a function unit input port 26, and one input data word is stored in the random access memory 16 from each memory input port 20. If the memory input port 20 in question is connected to a function unit 14, this data word will be the result of the function unit calculation. If the memory input port 20 is not connected to a function unit output port 28, these input data words will have originated in one of the adjacent processing elements or in the host data processing system. It should be noted that, since there is a fixed correspondence between memory input ports, memory output ports, and function units, the input and output lists define a set of "instructions" which are carried out by the processing element during the major memory cycle in which said lists are received. A more detailed implementation of the present invention is described below.
In the preferred embodiment, the processing element of the present invention is fabricated on a single VLSI circuit chip. Fabrication on a single chip results in lower cost and higher speed operation than may, in general, be achieved in multi-chip systems. As pointed out above, the maximum clock speed at which the processing element may be run is determined by the parasitic capacitances of the various signal paths which connect the functional elements of the processing element. Off-chip conducting paths have significantly higher parasitic capacitances and, hence, are to be avoided. The constraint of single chip fabrication places a limit on the total amount of circuitry which may be included in the processing element. The present invention requires considerably less circuitry than prior art processing elements which are capable of the same computational throughput.
One key to this reduction in circuitry is the uniform methodology used for transferring data between the random access memory 16 and the adjacent processing elements and between the random access memory 16 and the various function units 14. These data transfers are carried out by the same control circuitry in the present invention. In fact, the memory controller 22 has no way of knowing whether the data it is transferring to and from the random access memory 16 is being communicated between the random access memory 16 and its function units or between the random access memory 16 and processing elements which are external to the processing element containing said memory controller 22. The prior art systems use circuitry to transfer data between local memories and function units which is different from the circuitry used to move data between the host data processing system and the local memories. This additional circuitry reduces the on-chip space available for other circuitry such as function units.
In addition, the processing element of the present invention supports multiple data paths into and out of the processing element. This allows more data to be transferred between the processing element and the data processing system which is supplying the data to be processed. This is particularly important in processing elements having multiple function units, since such systems can often produce results in less time than it takes to transfer the results and obtain new input data. Such prior art systems often must wait for the relevant data to be transferred over a single bus which has a bandwidth which is significantly less than the bandwidth of the combined function units. The multiple data paths of the present invention, together with the overlapping of input/output operations and data processing, result in a substantial reduction in the on-chip memory needed to guarantee that all of the function units will operate at their optimum throughput.
The higher complexity input/output designs of the prior art systems significantly reduce the throughput of such systems. For example, consider the vector processing system taught by Cray in the above cited U.S. patent. It too includes a local memory which is used to deliver data words to function units and to store the results from these function units. In fact, it contains several such memories which are used to store data words of different types. These memories are loaded over a single bus from a large system memory which is part of the data processing system in which it is integrated. The data transfer operations used to load and