WO1994003860A1

WO1994003860A1 - Massively parallel computer including auxiliary vector processor

Info

Publication number: WO1994003860A1
Application number: PCT/US1993/007415
Authority: WO
Inventors: Jon P. Wade; Daniel R. Cassiday; Robert D. Lordi; Guy Lewis Steele, Jr.; Margaret A. St. Pierre; Monica C. Wong-Chan; Zahi S. Abuhamden; David C. Douglas; Mahesh N. Ganmukhi; Jeffrey V. Hill; W. Daniel Hillis; Scott J. Smith; Shaw-Wen Yang; Robert C. Zak, Jr.
Original assignee: Thinking Machines Corporation
Priority date: 1992-08-07
Filing date: 1993-08-06
Publication date: 1994-02-17
Also published as: AU4804493A; US5872987A; US6219775B1

Abstract

A computer system including a plurality of processing nodes (11) interconnected by a network (15). Each processing node comprises a network interface (22), a memory module (24), a vector processor (21) and a node processor. The vector processor (21) is connected to the memory module for performing vector data processing operations in connection with data in the memory module in response to vector instructions from the node processor. The node processor (20) is responsive to commands to (i) process data in the memory module, (ii) generate vector instructions for controlling the auxiliary processor, and (iii) control the generation of messages by the network interface. The network transfers messages generated by the network interfaces of the processing nodes among the processing nodes thereby to transfer information thereamong. A control arrangement (12, 14) generates commands to control the processing nodes in parallel.

Description

Massively Parallel Computer Including Auxiliary Vector Processor

Field Of The Invention The invention relates generally to the field of digital computer systems, and more particularly to massively parallel computer systems. Background Of The Invention Computer system have long been classified according to a taxonomy "SISD" (ioτ singje- instruction/single-data), "SIMD" (for single-instruction/multiple-data) and "MIMD" (for "multiple- instruction/multiple-data). In an SISD system, a single processor operates in response to a single instruction stream on a single data stream. However, if a program requires the same program segment to be used to operate on a number of diverse data items to produce a number of calculations, the program causes the processor to loop through that segment for each data item. In some cases, in which the program segment is short or there are only a few data elements, the time required to perform such a calculation may not be unduly long. However, for many types of such programs, SISD processors would require a very long time to perform all of the calculations required. Accordingly, SIMD processors have been developed which incorporate a large number of processing nodes all of which are controlled to operate concurrently on the same instruction stream, but with each processing node processing a separate data stream. On the other hand, if a program requires generally independent program segments to be used on diverse data items, the segments may be processed concurrently but using separate instruction streams. For such cases, MIMD processors have been developed which have a number of processing nodes each controlled separately in response to its own instruction stream. The flexibility of separate control in an MIMD system can be advantageous in some circumstances, but problems can arise when it is necessary to synchronize operations by the processing nodes, which may occur when, for example, transfers of data are required thereamong. Since all operations of an SIMD system are controlled by a global point of control, synchronization is provided by that global point of control. Recently, "SPMD" (for "single-program/multiple-data) systems have been developed which have many of the benefits of both SIMD and MIMD systems. An SPMD processor includes a number of processing nodes, each controlled separately in response to its own instruction stream, but which may be controlled generally concurrently in response to commands which generally control portions of instruction streams to be processed. An SPMD system thus has the possibility of having a global point of control and synchronization, namely, the source of the commands to be processed, which is present in an SIMD system, with the further possibility ol having local control of processing in response to each of the commands by each of the processing nodes, which is present in an MIMD system. Summary Of The Invention The invention provides a new and improved auxiliary processor for use in connection with a massively parallel computer system. In brief summary, a massively-parallel computer system includes a plurality of processing nodes (11) interconnected by a network (15). Each processing node comprises a network interface (22), a memory module (24), a vector processor (21) and a node processor. The vector processor (21) is connected to the memory module for performing vector data processing operations in connection with data in the memory module in response to vector instructions from the node processor. The node processor (20) is responsive to commands to (i) process data in the memory module (ii) generate vector instructions for controlling the auxiliary processor, and (iii) control the generation of messages by the network interface. The network transfers messages generated by the network interfaces of the processing nodes among the processing nodes thereby to transfer information thereamong. A control arrangement (12, 14) generates commands to control the processing nodes in parallel. Brief Description Of The Drawings This invention is pointed out with particularity in the appended claims. The above and further advantages of this invention may be better understood by referring to the following description taken in conjunction with the accompanying drawings, in which: Fig. 1 is a general block diagram depicting a massively parallel computer incorporating an auxiliary processor constructed in accordance with the invention; Figs. 2A and 2B together comprise a general block diagram of the auxiliary processor depicted in Fig. 1; and Fig. 3 is a detailed block diagram of the context logic circuit in the auxiliary processor as shown in Fig.2B. Detailed Description of an Illustrative Embodiment Fig. 1 depicts a general block diagram of a massively parallel digital computer system 10 in which an auxiliary processor according to the invention may be used. With reference to Fig. 1, the computer system 10 includes a plurality of processing nodes 11(0) through 11(N) (generally identified by reference numeral 11) which operate under control of one or more partition managers 12(0) through 12(M) (generally identified by reference numeral 12). Selected ones of the processing nodes ll(x) through ll(y) ("x" and "y" are integers) are assigned to a particular partition manager 12(z) ("z" is an integer), which transmits data processing commands to processing nodes ll(x) through ll(y) defining a particular partition assigned thereto. The processing nodes ll(x) through ll(y) process the data processing commands, generally in parallel, and in response generate status and synchronization information which they transmit among themselves and to the controlling partition manager 12(z). The partition manager 12(z) may use the status and synchronization information in determining the progress of the processing nodes ll(x) through ll(y) in processing the data processing commands, and in determining the timing of transmission of data processing commands to the processing nodes, as well as the selection of particular data processing commands to transmit. One embodiment of processing nodes 11 and partition managers 12 useful in one embodiment of system 10 is described in detail in the aforementioned Douglas, et al., patent applications. The system further includes one or more input/output processors 13(i) through 13(k) (generally identified by reference numeral 13) which store data and programs which may be transmitted to the processing nodes 11 and partition managers 12 under control of input/output commands from the partition managers 12. In addition, the partition managers 12 may enable the processing nodes 11 in particular partitions assigned thereto to transmit processed data to the input/output processors 13 for storage therein. Input/output processors 13 useful in one embodiment of system 10 are described in detail in the aforementioned Wells, et al., patent application. The system 10 further includes a plurality of communications networks, including a control network 14 and a data router 15 which permit the processing nodes 11, partition managers 12 and input/output processors 13 to communicate to transmit data, commands and status and synchronization information thereamong. The control network 14 defines the processing nodes 11 and partition managers 12 assigned to each partition. In addition, the control network 14 is used by the partition managers 12 to transmit processing and input/output commands to the processing nodes 11 of the partition and by the processing nodes 11 of each partition to transmit status and synchronization information among each other and to the partition manager 12. The control network 14 may also be used to facilitate the down-loading of program instructions by or under control of a partition manager 12(z) to the processing nodes ll(x) through ll(y) of its partition, which the processing nodes execute in the processing of the commands. A control network 14 useful in one embodiment of system 10 is described in detail in the aforementioned Douglas, et al., patent applications. The data router 15 facilitates the transfer of data among the processing nodes 11, partition managers 12 and input/output processors 13. In one embodiment, described in the aforementioned Douglas, et al., patent applications, partitioning of the system is defined with respect to the control network 14, but the processing nodes 11, partition managers and input/output processors 13 can use the data router 15 to transmit data to others in any partition. In addition, in that embodiment the partition managers 12 use the data router 15 to transmit input/output commands to the input/output processors 13, and the input/output processors 13 use the data router 15 to carry input/output status information to the partition managers 12. A data router 15 useful in one embodiment of system 10 is described in detail in the aforementioned Douglas, et al., patent applications. One embodiment of system 10 also includes a diagnostic network 16, which facilitates diagnosis of failures, establishes initial operating conditions within the system 10 and conditions the control network 14 to facilitate the establishment of partitions. The diagnostic network 16 operates under control of a diagnostic processor (not shown) which may comprise, for example, one of the partition managers 16. One embodiment of diagnostic network 16 useful in system 10 is also described in connection with the aforementioned Douglas, et al., patent applications. The system 10 operates under control of a common system clock 17, which provides SYS CLK system clocking signals to the components of the system 10. The various components use the SYS CLK signal to synchronize their operations. The processing nodes 11 are similar, and so only one processing node, in particular processing node ll(j) is shown in detail. As shown in Fig. 1, the processing node ll(j) includes a node processor 20, one or more auxiliary processors 21(0) through 21(1) [generally identified by reference numeral 21(i)], and a network interface 22, all of which are interconnected by a processor bus 23. The node processor 20 may comprise a conventional microprocessor, and one embodiment of network interface 22 is described in detail in the aforementioned Douglas, et al., patent applications. Also connected to each auxiliary processor 21(i) are two memory banks 24(0)(A) through 24(I)(B) [generally identified by reference numeral 24(i)(j), where "i" corresponds to the index "i" of the auxiliary processor reference numeral 21(i) and index "j" corresponds to bank identifier "A" or "B"]. The memory banks 24(i)(j) contain data and instructions for use by the node processor 20 in a plurality of addressable storage locations (not shown). The addressable storage locations of the collection of memory banks 24(i)(j) of a processing node ll(j) form an address space defined by a plurality of address bits, the bits having a location identifier portion that is headed by an auxiliary processor identifier portion and memory bank identifier. The node processor 20 may initiate the retrieval of the contents of a particular storage location in a memory bank 24(i) j) by transmitting an address over the bus 23 whose auxiliary processor identifier identifies the particular auxiliary processor 21 (i) connected to the memory bank 24(i)(j) containing the location whose contents are to be retrieved, and location identifier identifies the particular memory bank 24(i)(j) and storage location whose contents are to be retrieved. In response, the auxiliary processor 21(i) connected to the memory bank 24(i)(j) which contains the storage location identified by the address signals retrieves the contents of the storage location and transmits them to the node processor 20 over the bus 23. Similarly, the node processor 20 may enable data or instructions (both generally referred to as "data") to be loaded into a particular storage location by transmitting an address and the data over the bus 23, and the auxiliary processor 21(i) that is connected to the memory bank 24(i)(j) containing the storage location identified by the address signals enables the memory bank 24(i)(j) that is identified by the address signals to store the data in the storage location identified by the address signals. In addition, the auxiliary processors 21(1) can process operands, comprising either data provided by the node processor 20 or the contents of storage locations it retrieves from the memory banks 24(i)(j) connected thereto, in response to auxiliary processing instructions transmitted thereto by the node processor 20. To enable processing by an auxiliary processor 21 (i), the node processor 20 can transmit an auxiliary processing instruction over processor bus 23, which includes the identification of one or more auxiliary processors 21 (i) to execute the instruction, as well as the identification of operands to be processed in response to the auxiliary processing instruction. In response to the auxiliary processing instructions, the identified auxiliary processors 21(i) retrieve operands from the identified locations, perform processing operation(s) and store the resulting operand(s), representing the result of the processing operation(s), in one or more storage location(s) in memory banks 24(i)(j). In one particular embodiment, the auxiliary processors 21(i) are in the form of a "RISC," or "reduced instruction set computer," in which retrievals of operands to be processed thereby from, or storage of operands processed thereby in, a memory bank 24(i)(j), are controlled only by explicit instructions, which are termed "load/store" instructions. Load/store instructions enable operands to be transferred between particular storage locations and registers (described below in connection with Figs. 2A and 2B) in the auxiliary processor 21(i). A "load" instruction enables operands to be transferred from one or more storage locations to the registers, and a "store" instruction enables operands to be transferred from the registers to one or more storage locations. It should be noted that the load/store instructions processed by the auxiliary processors 21 (i) control transfer of operands to be processed by the auxiliary processor 21(i) as well as operands representing the results of processing by the auxiliary processor 21(i). The node processor 20 and auxiliary processors 21(i) do not use the load/store instructions to control transfers directly between memory banks 24(i)(j) and the node processor 20. Other instructions, termed here "auxiliary data processing instructions," control processing in connection with the contents of registers and storage of the results of the processing in such registers. Each auxiliary processing instruction may include both a load/store instruction and an auxiliary data processing instruction. The node processor 20 transmits individual auxiliary processing instructions for processing by individual auxiliary processors 21(i), or by selected groups of auxiliary processors 21(i), or by all auxiliary processors 21(i) on the processing node, generally in parallel. As will be described below in connection with Fig. 2C in greater detail, each load/store auxiliary processing instruction is further accompanied by a value which represents an offset, from the base of the particular memory bank 24(i)(j), of a storage location in memory which is to be used in connection with the load/store operation. As noted above, each auxiliary data processing instruction identifies one or more registers in the auxiliary processor 21(i) whose operands are to be used in execution of the auxiliary data processing instruction. Accordingly, if, for example, operands represent matrix elements which are distributed among the auxiliary processors, the node processor 20 can, with a single auxiliary data processing instruction transmitted for execution by multiple auxiliary processors 21(i), enable the auxiliary processors 21 (i) to process the matrix elements generally in parallel, which may serve to speed up matrix processing. In addition, since such processing may be performed on all processing nodes 11 of a partition generally concurrently and in parallel, the auxiliary processors 21(i) enable operands comprising large matrices to be processed very rapidly. Each auxiliary processing instruction can enable an auxiliary processor 21 (i) to process a series of operands as a vector, performing the same operation in connection with each operand, or element, of the vector. If a operation initiated by a particular auxiliary processing instruction requires one ("monadic") operand, only one vector is required. However, if an operation requires two ("dyadic") or three ("triadic") operands, the auxiliary processor 21(i) processes corresponding elements from the required number of such vectors, performing the same operation in connection with each set of operands. If an auxiliary processing instruction enables an auxiliary processor 21 (i) to so process operands as vectors, the processing of particular sets of operands may be conditioned on the settings of particular flags of a vector mask. An auxiliary processing instruction which does not enable processing of series of operands as a vector is said to initiate a "scalar" operation, and the operands therefor are in the form of "scalar" operands. As will be further described in more detail below, each auxiliary processor 21 (i) may process data retrievals and stores for the node processor 20, as well as auxiliary processing instructions, in an overlapped manner. That is, node processor 20 may, for example, initiate a storage or retrieval operation with an auxiliary processor 21 (i) and transmit an auxiliary processing instruction to the auxiliary processor 21(i) before it has finished the storage or retrieval operation. In that example, the auxiliary processor 21 (i) may also begin processing the auxiliary processing instruction before it has finished the retrieval or storage operation. Similarly, the node processor 20 may transmit an auxiliary processing instruction to the auxiliary processor 21(i), and thereafter initiate one or more storage or retrieval operations. The auxiliary processor 21 (i) may, while executing the auxiliary processing instruction, also perform the storage or retrieval operations. With this background, the structure and operation of an auxiliary processor 21 (i) will be described in connection with Figs. 2A through 3. In one particular embodiment, the structure and operation of the auxiliary processors 21 are all similar. Figs. 2A and 2B depict a general block diagram of one embodiment of auxiliary processor 21(i). With reference to Figs. 2A and 2B, auxiliary processor 21(i) includes a control interface 30 (Fig. 2A), a memory interface 31 (Fig. 2A), and a data processor 32 (Fig. 2B), all interconnected by a bus system 33 (the bus system 33 is depicted on both Figs. 2A and 2B). The control interface 30 receives storage and retrieval requests (which will generally be termed "remote operations") over processor bus 23. For a retrieval operation, the control interface 30 enables the memory interface 31 to retrieve the contents of the storage location identified by an accompanying address for transfer to the processor 20. For a storage operation, the control interface 30 enables the memory interface 31 to store data accompanying the request in a storage location identified by an accompanying address. In addition, the control interface 30 receives auxiliary processing instructions (which will be generally termed "local operations"). If a auxiliary processing instruction received by the auxiliary processor 21(i) contains a load/store instruction, the control interface 30 enables the memory interface 31 and data processor 32 to cooperate to transfer data between one or more storage locations and registers in a register file 34 in the data processor 32. If the auxiliary processing instruction contains an auxiliary data processing instruction, the control interface 30 enables the data processor 32 to perform the data processing operations as required by the instruction in connection with operands in registers in the register file 34. If an auxiliary processing instruction includes both a load/store instruction and an auxiliary data processing instruction, it will enable both a load/stroe and a data processing operation to occur. As noted above, the memory interface 31 controls storage in and retrieval from the memory banks 24(i)(j) connected thereto during either a remote or local operation. In that function, the memory interface 31 receives from the control interface 30 address information, in particular a base address which identifies a storage location at which the storage or retrieval is to begin. In addition, the memory interface 31 receives from the control interface 30 other control information. For example, if the storage or retrieval operation is to be in connection with multiple storage locations, the control interface 30 controls the general timing of each successive storage or retrieval operation, in response to which the memory interface 31 generates control signals for enabling a memory bank 24(i)(j) to actually perform the storage or retrieval operation. In addition, if the storage or retrieval operation is to be in connection with a series of storage locations whose addresses are separated by a fixed "stride" value, the control interface 30 provides a stride value, which the memory interface 31 uses in connection with the base address to generate the series of addresses for transmission to a memory banks 24(i)(j). On the other hand, if the storage or retrieval operation is to be in connection with "indirect" addresses, in which the storage locations are at addresses which are diverse offsets from the base address, the memory interface 31 receives offset values, which are transmitted from registers in the register file 34 of the data processor 32 under control of the control interface 30, which it uses in connection with the base address to generate addresses for transmission to the memory banks 24(i)(j). As further noted above, the data processor 32 operates in connection with local operations, also under control of the control interface 30, to perform data processing operations in connection with operands stored in its register file 34. In that connection the control interface 30 provides register identification information identifying registers containing operands to be processed, as well as control information identifying the particular operation to be performed and the register into which the result is to be loaded. If the local operation is to be in connection with vectors, the control interface 30 also provides information from which the data processor 32 can identify the registers containing operands comprising the vectors, as well as the register in which each result operand is to be loaded. As in memory operations, operands comprising successive vector elements may be provided by registers having fixed strides from particular base registers and the control interface will provide the base identifications and stride values. In addition, at least some operands may come from registers selected using "indirect" register addressing, as described above in connection with the memory interface 31, and the control interface 30 identifies a base register and a register in the register file 34 which is the base of a table containing register offset values. From the base register identification and the register offset vlues in the table, data processor identifies the registers whose values are to be used as the successive operands. With reference to Figs. 2A and 2B, the bus system 33 provides data paths among the control interface 30, memory controller 31 and data processor 32. The bus system 33 includes two buses, identified as an A bus 35 and a B bus 36, as well as two gated drivers 37 and 38 which are controlled by A TO B and B TO A signals from the control interface 30. If both gated drivers 37 and 38 are disabled, which occurs if both A TO B and B TO A signals are negated, the A bus 35 and B bus 36 are isolated from each other. If, however, the control interface 30 asserts the A TO B signal, the gated driver 37 couples signals on the A bus 35 onto the B bus 36. Similarly, if the control interface asserts the B TO A signal, the gated driver 38 couples signals on the B bus 36 onto the A bus 35. With reference to Fig. 2A, the control interface 30 includes an address register 40, a data register 41 and a processor bus control circuit 42, all of which are connected to the processor bus 23. The processor bus control circuit 42 receives P CTRL processor bus control signals from the processor bus 23 controlling transfers over the processor bus 23 and when they indicate that an address is on the processor bus, initiating a transfer over the processor bus, enables the address register 40 to latch P ADRS processor address signals from the bus. The data register 41 is connected to receive P DATA processor data signals. If the control signals received by the processor bus control circuit 42 indicate that the processor bus transfer is accompanied by data, it enables the data register 41 to latch the P DATA signals, which comprise the data for the transfer. The processor bus control circuit 42 further notifies a scheduler and dispatcher circuit 43 that an address and data have been received and latched in the address and data registers 40 and 41, respectively. In response, the scheduler and dispatcher 43 examines the LAT ADRS latched address signals coupled by the address register 40 to determine whether the transfer is for the particular auxiliary processor 21(1), and if so, enables the processor bus control circuit 42 to transmit P CTRL processor bus control signals to acknowledge the bus transaction. If the scheduler and dispatcher circuit 43 determines that the LAT ADRS address signals indicate that the transfer is for this auxiliary processor 21(i), it further examines them to determine the nature of the transfer. In particular, the address signals may indicate a storage location in a memory bank 24(i)(j), and if so the bus transfer serves to indicate the initiation of a remote operation. Similarly, the address signals may indicate one of a plurality of registers, which will be described below in connection with Fig. 2C, which are located on the auxiliary processor 21(i) itself, and if so the address signals also serve to indicate the initiation of a remote operation. In addition, the P ADRS signals may indicate that the accompanying P DATA signals comprise an auxiliary processing instruction to be processed by the auxiliary processor 21(i). If the LAT ADRS latched address signals indicate a remote operation in connection with a storage location in a memory bank 24(i)(j), it also identifies a transaction length, that is, a number of storage locations to be involved in the operation. When the LAT ADRS latched address signals identify a register, the scheduler and dispatcher circuit 43 enables the contents of the data register 41 to be loaded into the indicated register during a write operation, or the contents of the indicated register to be transferred to the data register 41 for transmission over the processor bus 23 during a read operation. However, if the LAT ADRS latched address signals indicate that the accompanying P DATA processor data signals define an auxiliary processing instruction, the data in the data register 41 is an auxiliary processing instruction initiating a local operation. In response, the scheduler and dispatcher circuit 43 uses the contents of the data register 41 to initiate an operation for the data processor 32. In addition, if the local operation includes a load/store operation, the scheduler and dispatcher circuit 43 uses the low- order portion of the address defined by the LAT ADRS latched address signals to identify a storage location in a memory banks 24(i)(j) to be used in connection with the load/store operation. The control interface 30 further includes two token shift registers, identified as a remote strand 44 and a local strand 45, and a local strand control register set 46. The remote strand 44 comprises a shift register including a series of stages, identified by reference numeral 44(i), where "i" is an index from "0" to "I." The successive stages 44(i) of the remote strand 44 control successive ones of a series of specific operations performed by the auxiliary processor 21 (i) in performing a remote operation. Similarly, the local strand 45 comprises a shift register including a series of stages, - identified by reference numeral 45 (k), where "k" is an index from "0" to "K." The successive stages 45(k) of the local strand 45 control successive ones of a series of operations performed by the auxiliary processor 21(i) during a local operation. The local strand control register set 46 includes a plurality of registers 46(0) through 46(K), each associated with a stage 45 (k) of the local strand 45, and each storing operational information used in controlling a particular operation initiated in connection with the associated stage 45 (k) of the local strand 45. To initiate a remote operation involving a storage location in a memory bank 24(i)(j), the scheduler and dispatcher circuit 43 transmits REM TOKEN signals comprising a remote token to the remote strand 44, generally to the first stage 44(0). If the LAT ADRS latched address signals identify a transaction length greater than one word, referencing a transfer with a like number of storage locations, the scheduler and dispatcher circuit 43 will provide successive REM TOKEN remote token signals defining a series of remote tokens. As the remote strand 44 shifts each remote token through the successive stages 44(i), it generates MEM CTRL memory control signals that are transmitted to the memory interface 31, in particular, to an address/refresh and control signal generator circuit 50, which receives the low-order portion of the LAT ADRS latched address signals and the MEM CTRL memory control signals from the successive stages 44(i) of the remote strand 44 and in response generates address and control signals in an appropriate sequence for transmission to the memory banks 24(i)(j) to enable them to use the address signals and to control storage if the remote operation is a storage operation. In particular, the address/refresh and control signal generator circuit 50 generates "j" ADRS address signals ("j" being an index referencing "A" or "B"), which identify a storage location in the corresponding memory bank 24(i)(j), along with "j" RAS row address strobe, "j" CAS column address strobe and "j" WE write enable signals. Each memory bank 24(i)(j) also is connected to receive from a data interface circuit 51, and transmit to the data interface circuit, "i" DATA data signals representing, during the data to be stored in the respective memory bank 24(i)(j) during a write or store operation or the data to be retrieved during a read or load operation. As is conventional, the storage locations in each memory bank are organized as a logical array comprising a plurality of rows and columns, with each row and column being identified by a row identifier and a column identifier, respectively. Accordingly, each storage location will be uniquely identified by its row and column identifiers. In accessing a storage location in a memory bank 24(i)(j), the address/refresh and control signal generator 50 can transmit successive "j" ADRS address signals representing, successively, the row identifier and the column identifier for the storage location, along with successive assertions of the "j" RAS and "j" CAS signals. Each memory bank 24(i)(j) includes, in addition to the storage locations, a data in/out interface register 52(j), which receives and transmits the "j" DATA signals. During a retrieval from a memory bank 24(i)(j), in response to the "j" ADRS signals and the assertion of the "j" RAS signal, the memory bank 24(i)(j) loads the contents of the storage locations in the row identified by the "j" ADRS signals, into the data in/out interface register 52(j) and thereafter uses the "j" ADRS signals present when the "j" CAS signal is asserted to select data from the data in/out interfaceregister 52(j) to transmit as the "j" DATA signals. If subsequent retrievals from the memory bank 24(i)(j) are from storage locations in the same row, which is termed a "page," the address/reference and control signal generator 50 may operate in "fast page mode," enabling a retrieval directly from the data in/out interface register 52(j) by transmitting the column identifier as the "j" DATA signals and asserting the "j" CAS signal, enabling the memory bank 24(i)(j) to transmit the data from that column as the "j" DATA signals. Since the memory bank 24(i)(j) does not have to re-load the data into the data in/out interface register 52(i) while in the fast page mode, the amount of time required by the memory bank 24(i)(j) to provide the data from the requested storage location can be reduced. Otherwise stated, if, to respond to a retrieval, a memory bank 24(i)(j) has to load a row, or "page," into its data in/out interface register 52(j) because the row identifier of the retrieval differs from that of the previous retrieval (which is termed here a "miss page" condition), the retrieval will likely take longer than if the retrieval operation did not result in a miss page condition, because of the extra time required to load the data in/out interface register 52(i). The address/refresh and control signal generator circuit 50 also controls refreshing of the memory banks 24(i)(j). In one embodiment, the memory banks 24(i)(j) will initiate a refresh operation if they receive an asserted "j" CAS signal a selected time period before they receive an asserted "j" RAS signal, in so-called "CAS-before-RAS" refreshing. In that embodiment, the address/refresh and control signal generator 50 controls the "j" RAS and "j" CAS signals as necessary to enable the memory banks 24(i)(j) to perform refreshing. The address/refresh and control signal generator 50 further generates MEM STATUS memory status signals which indicate selected status information in connection with a memory operation. In connection with certain occurrences, such as a miss page condition as described above and others as will be described below, the timings of an operation enabled by a remote token at a particular stage 44(s) ("s" is an integer) of the remote strand 44 will be delayed, which will be indicated by the condition of the MEM STATUS signals. When that occurs, the remote token at that particular stage 44(s) and the upstream stages 44(0) through 44(s-l) are stalled in their respective stages, and will not be advanced until the stall condition is removed. The scheduler and dispatcher circuit 43 also receives the MEM STATUS memory status signals and will also be stalled in issuing additional remote tokens to the remote strand 44. To initiate a local operation, including a load/store operation, the scheduler and dispatcher circuit 43 transmits LOC TOKEN signals comprising a local token to the first stage 45(0) of the local strand 45. If the local operation is for a vector of operands, the scheduler and dispatcher circuit 43 will provide LOC TOKEN local token signals defining a series of local tokens. As the local strand 45 shifts the first local token through the successive stages 45 (k), the operational information, which is provided by the auxiliary processing instruction latched in the data register 41, is latched in the corresponding ones of the registers 46(k) of the local strand contro^ register set 46. The local token in each stage 45(0) of the local strand 45, along with operational information stored in each associated register 46(k), provide LOC CTRL local control signals. Some of the LOC CTRL signals are coupled to the address/refresh and control signal generator 50 and if the local operation includes a load/store operation they control the memory interface 31 in a manner similar to that as described above in connection with remote operation to effect a memory access for a load/store operation. In addition, the LOC CTRL signals wiil enable the data processor 32 to select a register in the register file 34 and enable it to participate in the load/store operation. If, on the other hand, the local operation includes an auxiliary data processing operation, the LOC CTRL local control signals will ■ enable the data processor 32 to select registers in the register file 34 to provide the operands, to perform the operation, and to store the results in a selected register. The MEM STATUS memory status signals from the address/refresh and control signal generator 50 also may stall selected stages 45 (j) of the local strand 45, in particular at least those stages which enable load/store operations and any stages upstream thereof, under the same conditions and for the same purposes as the remote strand 44. If the MEM STATUS signals enable such a stall, they also stall the scheduler and dispatcher circuit 43 from issuing additional local tokens. The memory interface 31, in addition to the address/refresh and control signal generator 51, includes a data interface circuit 51, which includes an error correction code check and generator circuit (not shown). During a store operation of a remote operation or during a load/store operation in which the data to be stored is for an entire storage location in a memory bank 24(i)(j), the data interface 51, under control of the address/refresh and control signal generator 50, receives DATA signals representing the data to be stored from the B bus 36, generates an error correction code in connection therewith, and couples both the data and error correction code as A DATA or B DATA signals, depending on the particular memory bank 24(i)(j) in which the data is to be stored. If the data to be stored is less than an entire storage location in a memory bank 24(i)(j), the data interface 51, under control of the address/refresh and control signal generator 50, receives the A DATA or B DATA signals from the particular storage location in the memory bank 24(i)(j) in which the data is to be stored, and uses the error correction code to check and, if necessary, correct the data. In addition, the data interface receives the DATA signals representing the data to be stored from the B bus 36, merges it into the retrieved data, thereafter generates an error correction code in connection therewith, and couples both the data and error correction code as A DATA or B DATA signals, depending on the particular memory bank 24(i)(j) in which the data is to be stored. In either case, if the store operation is a remote operation, the data is provided by the data register 41. In particular, the data register 41 couples the data onto A bus 35, and the control interface 30 asserted the A TO B signal enabling driver 37 to couple the data signals on A bus 35 onto B bus 36, from which the data interface 51 received them. On the other hand, if the store operation is a local operation, the data is provided by the data processor 32, in particular the register file 34,' which couples the data directly onto the B bus 36. During a retrieval operation of a remote operation or during a load operation of a local operation, the data interface receives the A DATA or B DATA signals, defining the retrieved data and error correction code, from the appropriate memory bank 24(i)(j) and uses the error correction code to verify the correctness of the data. If the data interface 51 determines that the data is correct, it transmits it onto B bus 36. If the operation is a remote operation, the control interface asserts the B TO A signal to enable the gated driver 38 to couple the data on B bus 36 onto A bus 35. The data on A bus 35 is then coupled to the data register 41, which latches it for transmission onto the processor bus 23 as P DATA processor data signals. On the other hand, if the operation is a local operation, the data is transferred from B bus 36 to the register file 34 for storage in an appropriate register. If the data interface 51 determines, during either a retrieval operation of a remote operation or a load operation of a local operation, that the data is incorrect, it uses the error correction code to correct the data before transmitting it onto B bus 36. In addition, if the data interface determines that the data is incorrect, it will also notify the address/refresh and control signal generator 50, which generates MEM STATUS memory status signals enabling a stall of the local and remote strands 45 and 44 and the scheduler and dispatcher circuit 43 while the data interface 51 is performing the error correction operation. With reference to Fig.2B, the data processor 32 includes the aforementioned register file 34, and further includes a set of register identifier generator circuits 61 through 65, an arithmetic and logic unit ("ALU") and multiplier circuit 66, a context logic circuit 67 and a multiplexer 70. The register file 34 includes a plurality of registers for storing data which may be used as operands for auxiliary processing instructions. Each register is identified by a register identifier comprising a plurality of bits encoded to define a register identifier space. The registers in register file 34 are divided into two register banks 34(A) and 34(B) [generally identified by reference numeral 34(j)], with the high-order bit of the register identifier comprising a register bank identifier that divides the registers into the two register banks. Each register bank 34(j) is associated with one memory bank 24(i)(j). The association between a memory bank 24(i)(j) and a register bank is such that the value of the memory bank identifier which identifies a memory bank 24(i)(j) in the address transmitted over the processor bus 23 corresponds to the value of the register bank identifier. In one embodiment, the auxiliary processor 21(i) effectively emulates two auxiliary processors separately processing operands stored in each memory bank 24(i)(j), separately in each register bank 34(j). If an auxiliary processing instruction enables a load/store operation with respect to both register banks, and processing of operands from the two register banks 34(j), the scheduler and dispatcher circuit 43 issues tokens to local strand 45 for alternating register banks 34(j) and the load/store operation and processing proceeds an interleaved fashion with respect to the alternating register banks 34(j). The register file 34 has six ports through which data is transferred to or from a register in response to REG FILE R/W CTRL register file read write control signals from the control interface 30 and the context logic 67. The ports are identified respectively as an L/S DATA load/store data port, an INDIR ADRS DATA indirect address data port, an SRC 1 DATA source (1) data port, a SRC 2 DATA source (2) data port, a SRC 3 DATA source (3) data port and a DEST DATA IN destination data input port. The register identifier circuits 61 through 65 generate register identifier signals for identifying registers whose contents are to be transferred through the respective ports for use as operands, in which processed data is to be stored, or which are to be used in connection with load/store operations or indirect addressing. In addition, the register identifier circuits 61 through 65 identify registers into which immediate operands, that is, operand values supplied in an auxiliary processing instruction, are to be loaded, and registers in register file 34 to be accessed during a remote operation. In particular, a load/store register identification generator circuit 61 generates L/S REG ID load/store register identification signals, which are used to identify registers in the register file 34 into which data received from the B bus 36 through the L/S DATA port is to be loaded during a load operation, or from which data is to be obtained for transfer to the B bus 36 through the L/S DATA port during a store operation. Several register identifier circuits 62 through 64 provide register identifications for use in connection with processing of operands. A source 1 register identifier generator circuit 62, a source 2 register identifier generator circuit 63, and a destination register identification generator circuit 64 generate, respectively, SRC 1 REG ID and SRC 2 REG ID source 1 and 2 register identification signals and DEST REG ID destination register identification signals. These signals are used to identify registers from which operands are transmitted, respectively, as SRC 1 DATA source 1 data signals through the SRC 1 DATA port, SRC 2 DATA source 2 data signals through the SRC 2 DATA port, and SRC 3 DATA source 3 data signals through the SRC 3 DATA port, all to the ALU and multiplier circuit 66. The ALU and multiplier circuit 66 generates result data in the form of ALU/MULT RESULT result signals, which are directed through the destination data input port DEST DATA IN. The destination data is stored in a destination register, which is identified by the DEST REG ID destination register identification signals from destination register identification generator circuit 64. During a load operation, if the load/store register identification generator circuit 61 identifies the same register in register file 34 as one of the source rgister identifier generator circuits 62 through 64, the register file 34, in addition to loading the data in the register identified by the load/store register identification generator circuit 61, will at the same time supply the data as SCR (i) DATA signals through the particular SRC (i) DATA port whose register identifier generator circuit 62, 63 or 64 identifies the register. Finally, an indirect address register identifier generator circuit 65 provides a register identification for use in identifying registers in register file 34 into which data from A bus 35 is to be loaded or from which data is to be coupled onto A bus 34. The data may be used in connection with indirect addressing for the memory banks 24(i)(j) as described above. In addition, the data may comprise immediate operands to be loaded into a register in register file 34 from an auxiliary processing instruction, or data to be loaded into the register or read from the register during a remote operation. In indirect addressing, the circuit 65 provides register identifications for a series of registers in the register file 34, with the series of registers containing the diverse offset values for the series of locations in a memory bank 24(i)(j). The indirect address register identifier generator circuit generates INDIR ADRS REG ID indirect address register identification signals which are coupled through the INDIR ADRS DATA indirect address data port. Each register identifier generator circuit 61 through 65 generates the respective register identification signals using register identification values which they receive from the A bus 35, and operates in response to respective XXX REG ID register identification signals ("xxx" refers to the particular register identification generator circuit). The XXX REG ID signals may enable the respective circuit 61 through 65 to iteratively generate one or a series of register identifications, depending on the particular operation to be performed. The ALU and multiplier circuit 66 receives the SRC 1 DATA source 1 data signals, the SRC 2 DATA source 2 data signals, and SRC 3 DATA source 3 data signals and performs an operation in connection therewith as determined by SEL FUNC selected function signals from the multiplexer 70. The multiplexer 70, in turn, selectively couples one of the ALU/MULT FUNC function signals, forming part of the LOC CTRL local control signals from the control interface 30, or ALU/MULT NOP no-operation signals as the SEL FUNC selected function signals. If the multiplexer 70 couples the ALU/MULT FUNC signals to the ALU and multiplier circuit 66, the circuit 66 performs an operation in connection with the received signals and generates resulting ALU/MULT RESULT signals, which are coupled to the destination data port on the register file, for storage in the register identified by the DEST REG ID destination register identification signals. In addition, the ALU and multiplier circuit 66 generates ALU/MULT STATUS signals which indicate selected status conditions, such as whether the operation resulted in an under- or overflow, a zero result, or a carry. The ALU/MULT STATUS signals are coupled to the context logic 67. On the other hand, if the multiplexer 70 couples ALU/MULT NOP no-operation signals to the ALU and multiplier circuit 66, it performs no operation and generates no ALU/MULT RESULT or ALU/MULT STATUS signals. The multiplexer 70 is controlled by the context logic 67. As noted above, and as will be described further below in connection with Fig. 6, when the auxiliary processor 21 (i) is processing operands as elements of vectors, it may be desirable to selectively disable both load/store and data processing operations with respect to selected vector elements. The context logic 67 determines the elements for which the operations are to be disabled, and controls a FUNC/NOP SEL function/no operation select signal in response. The context logic 65 further controls a DEST WRT COND destination write condition signal, which aids in controlling storage of ALU/MULT RESULT signals in the destination register, and, when it determines that operations for an element are to be disabled, it disables storage for that particular result. As noted above, auxiliary processor 21 may process data retrievals and stores for the node processor 20, as well as auxiliary processing instructions, in an overlapped manner. This is accomplished by the control interface 30, in particular by the scheduler and dispatcher circuit 43, in connection with dispatching tokens. The scheduler and dispatcher circuit 43 handles token dispatch scheduling both between operations, as well as within a local or remote operation (that is, between elemental operations within a local or remote operation. It will be appreciated that, for inter- operational scheduling, there are four general patterns, namely: (1) a local operation followed by a local operation; (2) a local operation followed by a remote operation; (3) a remote operation followed by a local operation; and (4) a remote operation followed by a remote operation. It will be appreciated that one purpose for scheduling is to facilitate overlapping of processing in connection with multiple operations, while at the same time limiting the complexity of the control circuitry required for the overlapping. The complexity of the control circuitry is limited by limiting the number of operations that can be overlapped in connection with the remote strand 44 or the local strand 45. In one particular embodiment, the scheduling limits the number of operations, that is, the number of local operations for which tokens can be in the local strand 45 or the number of remote operations for which tokens can be in the remote strand 44, to two. To accomplish that, the scheduler and dispatcher circuit 43 ensures that there be a predetermined minimum spacing between the first tokens for each of the two successive operations which it dispatches into a strand 44 or 45 corresponding to one-half the number of stages required for a local operation or a remote operation. Thus, for a local operation, the scheduler and dispatcher circuit 43 provides that there be a minimum spacing of eight from the first token of one local operation to the first token of the next local operation. Similarly, the scheduler and dispatcher circuit 43 provides that there be a minimum spacing of four from the first token of one remote operation to the first token of the next remote operation. A further purpose for scheduling is to ensure that no conflict will arise in connection with the use of specific circuits in the auxiliary processor 21 (i), after the dispatch of all of the tokens required for a first operation, from beginning the dispatch of tokens for a subsequent operation. Inter-token, intra-operation scheduling generally has a similar purpose. Conflicts may particularly arise in connection with use of the memory interface 31 in accessing of memory banks 24(i)(j) during a load, store, write or read operation, and also in connection with use of the bus system 33 in connection with transfer of information thereover at various points in a memory access. For example, for a store operation in which data for less than an entire storage location is stored, requiring first a read of the location, followed by a merge of the new data in the data from the location, followed by a write operation, it will be appreciated that certain components of the memory interface 50 will be used for both the read and write operations for each vector element, and so the intra-operation inter- to ken spacing will be such as to accommodate the use of the address generator for the write operation. In addition, for the ALU and multiplier circuit 66 (Fig. 2B) in one particular embodiment, the operations performed during the successive states are such that it will normally be able to begin a new operation for each token in the local strand 45 for tokens successively dispatched for each tick of the aforementioned global clocking signal. However, for some types of complex operations, the ALU and multiplier circuit 66 will require a spacing of several ticks, and the scheduler and dispatcher circuit 43 will schedule the dispatch of the successive tokens within the series required for local operation accordingly. It will be appreciated, therefore, that for local operations which do not include a load or a store operation, and for which the ALU and multiplier circuit 66 can initiate a new operation for tokens dispatched at each clock tick, the scheduler and dispatcher circuit 43 can generate successive tokens at successive ticks of the global clocking signal. In addition, the scheduler and dispatcher circuit 43, after it has finished generating all tokens for such a local operation, can begin generating tokens for a subsequent local operation, subject to the minimum spacing constraint between initial tokens for the operations as described above. On the other hand, if the successive local operations involve load or store operations, ignoring any spacing to accommodate the ALU and multiplier circuit 66, the required inter- operation spacing, will depend (1) on the sequence of load and store operations, and (2) if the first operation is a store operation, whether a store operation is of the entire storage location: (A) If the first local operation involves a store operation of less than an entire storage location, and the second involves either a load operation or a store operation, the second operation will be delayed to accommodate the use of generation of addresses for both the read and write portions of the initial store operation of the first local operation and (2) for the early states of either a load operation or a store operation for the second local operation. (B) If the first local operation involves a store operation of the entire storage location, and the second local operation involves either a load operation or a store operation of less than an entire storage location, it will be appreciated that the address will be generated only at the beginning of operations for each element of the first local operation, and so a small or zero delay thereafter will be required. (C) If a local operation involving a load operation is followed by a local operation involving a store operation, the required spacing will also depend on whether the store operation involves an entire storage location. If the store operation does involve an entire storage location, it should be noted that, while the memory addresses will be generated for the same stages for both the load operation and the store operation, the load/store register identifier generator 61 will be used late in the load operation, but relatively early in the store operation. Accordingly, the scheduler and dispatcher circuit 43 will provide a generally large spacing between the first local operation and the second local operation to ensure that the load/store register identifier generator 61 will not be used for the first vector element of the second local operation until the state after the generator 61 has been used for last vector element for the local operation's load operation. On the other hand, if the second local operation is a store involving data for less than an entire storage location, the load/store register identifier generator 61 will be used in connection with the store operation, which is closer to 1 the stage in which the generator is used in connection with the load operation, and so the spacing

2 provided by the scheduler and dispatcher circuit 43 will substantially less.

3 (D) Finally, if two successive local operations both involve load operations, since the

4 progression of operations through the successive stages will be the same for both local operations,

5 and the various circuits of the auxiliary processor 21(i) are not used in two diverse states, the first

6 token for the second local operation may be dispatched immediately following the last token for the

7 first local operation.

8 It will be appreciated that, if the computation operation required for the local operation is such that

9 the ALU and multiplier circuit 66 will not accept a new operation at each tick of the global clock 0 signal, the actual spacing will be the greater of the above-identified spacing to accommodate load 1 and store operations and the spacing to accommodate the ALU and multiplier circuit 66. 2 The particular spacing enabled for other combinations of local and remote operations are 3 determined in a generally similar manner and will not be described in detail. It will be appreciated, 4 however, that the auxiliary processor 21 (i) may initiate a remote operation, that is, scheduler and 5 dispatcher circuit 43 may begin generating tokens for the remote strand 44, before it has finished 6 generating tokens for a local operation so that the auxiliary processor 21 (i) will begin processing of 7 the remote operation before it begins processing in connection with some of the vector elements of

18 the prior local operation. This can occur, for example, if the local operation has no load or store

19 operation, in which case the memory interface 31 will not be used during processing of the local 0 operation. 1 Fig. 3 depicts the details of context logic 67. With reference to Fig. 3, the context logic 2 includes the vector mask register 104, vector mask mode register, vector mask buffer register 106, 3 and the vector mask direction register 107. In particular, the context logic 67 includes separate 4 vector mask registers 104(A) and 104(B) [generally identified by reference numeral 104(j), with index

25 "j" referencing "A" or "B"] each of which is associated with a separate vector mask buffer register

26 106(A) and 106(B) [generally identified by reference numeral 106(j)]. As described above, the

27 register file 34 is divided into two register banks, each of which loads data from a memory bank

28 24(i)(j), and from which data is stored to a memory bank 24(i)(j), having the same index "j." Each

29 vector register 104 ( j ) and each vector mask register 106(j) is used in connection with auxiliary

30 processing instructions involving operands from the correspondingly-indexed register bank 34(j).

31 Each vector mask register 104(j) is essentially a bi-directional shift register having a number

32 of stages corresponding to a predetermined maximum number "N" of vector elements, for each

33 register bank 34(j), that the auxiliary processor 21(i) can process in response to an auxiliary

34 processing instruction. Each vector mask register 104(j) stores a vector mask that determines, if the 35. auxiliary processing instruction calls for processing series of operands as vectors, whether, for each 36 successive vector element or corresponding ones of the vector elements, the operations to be performed will be performed for particular vector elements. The node processor 21(i), prior to providing an auxiliary processing instruction, enable a vector mask to be loaded into the vector mask register by initiating a remote operation identifying one or more of the vector mask registers 104(j) and providing the vector mask as P DATA processor data signals (Fig. 2A), or by enabling the contents of a register in register file 34 or the vector mask buffer register 106(j) to be copied into the vector mask register 104(j). The control interface 30 will latch the P DATA processor data signals in the data register 41, couple them onto A bus 35, and will assert a LD VM PAR -"j" load vector mask parallel bank "j" signal to enable the vector mask register 104(j) to latch the signals on the A bus 35 representing the vector mask. Each vector mask register 104(j) generates at its low-order stage a VM-j(O) signal and at its high-order stage a VM-j(N-l) signal (index "j" corresponding to "A" or "B"), one of which will be used to condition, for the corresponding vector element, the load/store operation if an L/S mode flag 105(B) is set, and processing by the ALU and multiplier circuit 66 of operands from the register file 34 if the ALU mode flag 105(A) is set. Each vector mask register 104(j) can shift its contents in a direction determined by a ROT DIR rotation direction signal corresponding to the condition of the vector mask direction flag controlled by an auxiliary processing instruction. Each vector mask register 104(j) shifts in response to a ROTATE EN rotate enable signal from the control interface 30, which asserts the signal as each successive vector element is processed so that the VM-A(0) or VM-A(N-l) signal is provided corresponding to the bit of the vector mask appropriate to the vector element being processed. The VM-A(0) and VM-A(N-l) signals are coupled to a multiplexer 320 which selectively couples one of them in response to the ROT DIR signal as a SEL VM-A selected vector mask (bank "A") signal. The SEL VM-A signal is coupled to one input terminal of an exclusive-OR gate 324, which under control of a VM COMP vector mask complement signal of an auxiliary processing instruction, generates a MASKED VE masked vector element signal. It will be appreciated that, if the VM COMP signal is negated, the MASKED VE signal will have the same asserted or negated condition as the SEL VM-A signal, but if the VM COMP signal is asserted the exclusive-OR gate 324 will generate the MASKED VE signal as the complement of the SEL VM-A signal. In either case, the MASKED VE signal will control the conditioning of the FUNC/NOP SEL function/no-operation select signal and the DEST WRT COND destination write condition signal by the context logic 67 (Fig. 2B), as well as the generation of the 'j' WE write enable signal by the memory control circuit 50 to control storage in memory banks 24(i)(j) in connection with the corresponding vector element. During processing of vector elements by the ALU and multiplier circuit 66, the circuit 66 generates conventional ALU/MULT STATUS status signals indicating selected information concerning the results of processing, such as whether an overflow or underflow occurred, whether the result was zero, whether a carry was generated, and the like. The context logic 67 uses such status information to generate a status bit that is stored in the vector mask register 104(j) so that, when the contents of the register 104Q) have been fully rotated, the bit will be in the stage corresponding to the vector element for which the status information was generated. That is, if the status bit was generated during processing of operands comprising a vector element "k," the context logic 67 will enable the status bit to be stored in a stage of the vector mask register 104(j) so that, after all of the vector elements have been processed, the status bit will be in stage "k" of the vector mask 104(j). Accordingly, the status bit can be used to control processing of the "k"-th elements of one or more vectors in response to a subsequent auxiliary processing instruction; this may be useful in, for example, processing of exceptions indicated by the generated status information. To generate the status bit for storage in the vector mask register 104(j), the context logic 67 includes an AND circuit 321 that receives the ALU/MULT STATUS status signals from the ALU and multiplier circuit 66 and STATUS MASK signals generated in response to an auxiliary processing instruction. The AND circuit 321 generates a plurality of MASKED STATUS signals, whose asserted or negated condition corresponds to the logical AND of one of the ALU/MULT STATUS signal and an associated one of the STATUS MASK signals. The MASKED STATUS signals are directed to an OR gate 322, which asserts a SEL STATUS selected status signal if any of the MASKED STATUS signals is asserted. The SEL STATUS signal is coupled to the vector mask register 104(j) and provides the status bit that is loaded into the appropriate stage of the vector mask register 104(j) as described above. The particular stage of the vector mask register 104(j) into which the bit is loaded is determined by a vector mask store position select circuit 323 (j) (index "j" corresponding to "A" or "B") which, under control of VECTOR LENGTH signals indicating the length of a vector, and the ROTATE EN rotate enable and ROT DIR rotate direction signals from the control interface 30, generates -"j" POS ID position identification signals to selectively direct the SEL STATUS signal for storage in a particular stage of the correspondingly-indexed vector mask register 104(j). The vector mask register 104(j) stores the bit in the stage identified by the -"j" POS ID position identification signals in response to the assertion of a LD VM SER -"j" load vector mask serial bank "j" signal by the control interface 30. The control interface 30 asserts the LD VM SER -"j" signal to enable the vector mask register 104(j) to store the status bit for each vector element when the SEL STATUS signal representing the status bit appropriate for the particular vector element has been generated. It will be appreciated that the vector mask store position select circuit will, for a particular vector length and rotation direction, enable the vector mask register 104(j) to latch the SEL STATUS selected status signal in the same stage. The particular stage that is selected will be determined only by the vector length and rotation direction, as indicated by the VECTOR LENGTH and ROT DIR signals, respectively. The vector mask buffer registers 106(A) and 106(B) are used to buffer the vector mask in the correspondingly-indexed vector mask register 104(A) and 104(B). For example, the node processor 20 may load a vector mask into a vector mask register 104(j) of an auxiliary processor 21(i), enable the auxiliary processor 21(i) to buffer the vector mask to the vector mask buffer 106(j), and thereafter issue an auxiliary processing instruction to initiate processing of operands in the form of vectors using the vector mask in the vector mask register 104(j). While executing the auxiliary processing instruction, the ALU and multiplier circuit 66 generates status information which is used to create a vector mask in vector mask register 104(i) as described above. The node processor may then enable the auxiliary processor to use the newly-created vector mask in connection with, for example, processing of exception conditions as indicated by the bits of that vector mask. Thereafter, the node processor 20 may enable the auxiliary processor to restore the original vector mask, currently in the vector mask buffer 106(j) to the vector mask 104(j) for subsequent processing. To accomplish this, each vector mask register 104(j) and the correspondingly-indexed vector mask buffer register 106(j) are interconnected so as to permit the contents of each to be loaded into the other. When enabled by the node processor 20 to buffer a vector mask in a vector mask register 104(j), the control interface 30 asserts a SAVE VMB-"j" vector mask buffer save signal (index "j" corresponding to "A" or "B") which enables the contents of the correspondingly-indexed vector mask register 104(j) to be saved in the vector mask buffer register 106(j). Similarly, when enabled by the node processor 20 to restore a vector mask from a vector mask buffer register 106(j), the control interface 30 asserts a RESTORE VMB-"j" vector mask restore signal (index "j" corresponding to "A" or "B") which enables the contents of the correspondingly-indexed vector mask buffer register 106(j) to be loaded into the vector mask register 104(j). The foregoing description has been limited to a specific embodiment of this invention. It will be apparent, however, that various variations and modifications may be made to the invention, with the attainment of some or all of the advantages of the invention. It is the object of the appended claims to cover these and such other variations and modifications as come within the true spirit and scope of the invention. What is claimed as new and desired to be secured by Letters Patent of the United States is:

Claims

Clainis 1. A massively-parallel computer comprising: A. a plurality of processing nodes (11), each processing node comprising: i. a network interface (22) for generating and receiving messages; ii. at least one memory module (24) for storing data; iii. a vector processor (21) connected to said memory module for performing vector data processing operations in connection with data in said memory module in response to vector instructions; and iv. a node processor (20) being responsive to commands to (i) process data in said memory module (ii) generate vector instructions for controlling said auxiliary processor, and (iii) control the generation of messages by said network interface; B. a network (15) for transferring messages generated by said network interfaces among said processing nodes thereby to transfer information among said processing nodes; and C. a control arrangement (12, 14) for generating commands to control said processing nodes in parallel. 2. A computer as defined in claim 1 in which said control arrangement comprises: A. a control node (12) for generating commands; and B. a network (14) for transferring commands generated by said control node to said processing nodes to control operations thereof in parallel. 3. A massively-parallel computer comprising a plurality of processing nodes (11) and at least one control node (12) interconnected by a network (14, 15) for facilitating the transfer of data among the processing nodes and of commands from the control node to the processing nodes, each processing node comprising: A. a network interface (22) for transmitting data over, and receiving data and commands from, said network; B. at least one memory module (24) for storing data; C. a node processor (20) for receiving commands received by the network interface and for processing data in response thereto, said node processor generating memory access requests for facilitating the retrieval of data from or storage of data in said memory module, said node processor further controlling the transfer of data over said network by said network interface; and D. an auxiliary processor (21) connected to said memory module for: (i) in response to memory access requests from said node processor, performing a memory access operation to store data received from said node processor in said memory module, or to retrieve data from said memory module for transfer to said node processor, and (ii) in response to auxiliary processing instructions from said node processor, performing data processing operations in connection with data in said memory module. 4. A computer as defined in claim 3 in which said auxiliary processor includes: A. a memory interface (31) connected to said memory module for performing memory access operations in connection with said memory module in response to memory access control signals; B. a data processor (32) for performing data processing operations in response to data processing control signals; and C. a control interface (30) for receiving memory access requests from said node processor and for generating memory access control signals in response thereto, and auxiliary processing instructions from said node processor and for generating data processing control signals in response thereto. 5. A computer as defined in claim 4 in which said control interface further selectively generates memory access control signals in response to receipt of auxiliary processing instructions to thereby enable said memory interface to perform a memory access operation to selectively retrieve data from said memory module for transfer to said data processor or to transfer data from said data processor to said memory module for storage. 6. A computer as defined in claim 5 in which: A. said memory module stores data in a plurality of storage locations each identified by an address; and B. said control interface, in connection with an auxiliary processing instruction, receives an address and a data processing operation identifier identifying one of a plurality of data processing operations, said control interface enabling said memory interface to perform a memory access operation to selectively transfer data between the storage location and the data processor, said control interface further enabling said data processor to perform a data processing operation as identified by said data processing operation identifier. 7. A computer as defined in claim 6 in which said control interface, in connection with an auxiliary processing instruction, further receives a load/store identifier identifying a load operation or a store operation, said control interface in response to a load/store identifier identifying a load operation enabling said memory module to retrieve data from a storage location identified by the received address for transfer to said data processor, and in response to a load/store identifier identifying a store operation enabling said memory module to store data received from said data processor in a storage location identified by the received address. 8. A computer as defined in claim 7 in which: A. said data processor includes a register file (34) including a plurality of registers each identified by a register identification and a data processing circuit (66), said load/store identifier further including a register identifier; and B. said control interface enabling said data processor to i. store data retrieved from said memory module in a register identified by said register identifier if said load/store identifier identifies a load operation, and ii. retrieve data from a register identified by said register identifier for transfer to said memory module if said load/store identifier identifies a store operation. 9. A computer as defined in claim 8 in which, in response to data processing control signals from said control circuit, said register file transfers input data representing contents of selected ones of said registers to said data processing circuit, said data processing circuit generating in response processed data representing a selected function as selected by said data processing control signals of the input data, said data processing circuit transferring the processed data to said register file for storage in a selected register. 10. A computer as defined in claim 9 in which, in response to an auxiliary processing instruction, said control circuit generates data processing control signals to enable, for each of a plurality of successive elemental operations, A. said register file to transfer input data items representing the contents of selected registers to said data processing circuit, and receive processed data items from said data processing circuit for storage in selected registers, the input data items provided for each elemental operation and processed data items received for each elemental operation representing vector elements of corresponding vectors; and B. said data processing circuit to, in response to said input data items from said register file, generate processed data items for transfer to the register file for storage. 11. A computer as defined in claim 10 in which said control circuit further includes a conditionalizing circuit (67) for selectively disabling storage of processed data items in said register file for selected elemental operations. 12. A computer as defined in claim 11 in which said conditionalizing circuit includes: A. a vector mask register (104) including a plurality of vector mask bits, each vector mask bit being associated with an elemental operation, and each bit having a selected condition; B. a mask bit selection circuit for selecting a vector mask bit of said vector mask register for an elemental operation; and C. a storage control circuit for controlling storage of processed data items by said register file for an elemental operation in response to the condition of the selected vector mask bit. 13. A computer as defined in claim 12 in which said conditionalizing circuit further includes a processor mode flag (105(A)) having a selected condition, the storage control circuit further operating in response to the condition of said processor mode flag, in response to the processor mode flag having one selected condition the storage control circuit controlling storage of processed data by said register file in response to the condition of the selected vector mask bit, and in response to the processor mode flag having a second selected condition the storage control circuit enabling storage of processed data items by said register file. 14. A computer as defined in claim 13 in which, in response to a load/store instruction, said control circuit generates memory access control signals to enable, for each of a plurality of successive elemental operations, said memory interface to perform a memory access operation and said register file to perform a register access operation to selectively facilitate the transfer of data between a selected storage location of said memory module and a selected register of said register file. 15. A computer as defined in claim 14 in which said conditionalizing circuit further selectively disables transfer of data by said register file and memory interface for selected elemental operations in response to the conditions of the vector mask bits, said conditionalizing circuit including a load/store mode flag (105(B) having selected conditions for selectively controlling use of said vector mask bits to disable such transfers. 16. A computer as defined in claim 10 in which operations for each elemental operation in response to an auxiliary processing instruction proceed through a sequence of processing stages, said control circuit in each stage generating processing stage control signals for enabling said register file and said data processing circuit to perform predetermined operations in said stage, said control circuit including: A. a token generator (43) for, in response to receipt of an auxiliary processing instruction, generating a series of data processing enabling tokens corresponding to the number of elemental operation to be performed; B. a data processing control signal generator (45 and 46) comprising a series of data processing token shift register stages corresponding to the number of processing stages, said data processing token shift register stages iteratively receiving data processing enabling tokens from said token generator and shifting them therethrough, at each stage generating processing stage control signals for enabling said register file and said data processing circuit to perform predetermined operations for the associated processing stage. 17. A computer as defined in claim 16 in which said token generator controls the initial generation of data processing enabling tokens in response to receipt by the auxiliary processor of an auxiliary processing instruction so as to have a selected spacing relationship in said data processing token shift register stages with data processing enabling tokens for a preceding auxiliary processing instruction. 18. A computer as defined in claim 17 in which: A. said token generator further .generates memory access tokens in response to the receipt of memory access requests; and B. said control circuit further comprises a memory access control signal generator (44) comprising a series of memory access token shift register stages each corresponding a stage in a memory access operation, said token shift register iteratively receiving data processing enabling tokens from said token generator and shifting them through said memory access token shift register stages, at each stage generating memory access stage control signals for controlling said memory interface to perform a memory access. 19. A computer as defined in claim 18 in which said token generator controls the initial generation of memory access enabling tokens in response to receipt by the auxiliary processor of a memory access request so as to have a selected spacing relationship in said memory access token shift register stages with memory access tokens for a preceding memory access request. 20. A computer as defined in claim 19 in which said token generator: A. further controls the initial generation of memory access enabling tokens in response to receipt by the auxiliary processor of a memory access request so as to have a selected spacing relationship with a corresponding data processing token shift register stage of said data processing control signal generator, and B. further controls the initial generation of data processing enabling tokens in response to receipt by the auxiliary processor of a data processing enabling token so as to have a selected spacing relationship with a corresponding memory access token shift register stage of said memory access control signal generator.