CA1154168A

CA1154168A - Processing element for parallel array processors

Info

Publication number: CA1154168A
Application number: CA000361873A
Authority: CA
Inventors: Kenneth E. Batcher
Original assignee: Goodyear Aerospace Corp
Current assignee: Goodyear Aerospace Corp
Priority date: 1979-12-31
Filing date: 1980-09-26
Publication date: 1983-09-20
Also published as: DE3049437C2; DE3049437A1; GB8324470D0; GB2140589A; GB2062915A; US4314349A; GB2140589B; IT8027013A0; IT1134924B; GB2062915B; FR2472784A1; JPS56101262A

Abstract

PROCESSING ELEMENT FOR
PARALLEL ARRAY PROCESSORS
ABSTRACT OF THE DISCLOSURE
A processing element constituting the basic building block of a massively-parallel processor.
Fundamentally, the processing element includes an arithmetic sub-unit comprising registers for operands, a sum-bit register, a carry-bit register, a shift register of selectively variable length, and a full adder. A logic network is included with each processing element for performing the basic Boolean logic functions between two bits of data. There is also included a multiplexer for intercommunicating with neighboring processing elements and a register for receiving data from and transferring data to neighboring processing elements. Each such processing element includes its own random access memory which communicates with the arithmetic sub-unit and the logic network of the processing element.

Description

PROCESSING ELEMENT FOR PARALLEL ARRAY PROCESSORS
BACKGROUND OF THE INVENTION
The instant invention resides in the art of data processors and, more particularly, concerns large scale parallel processors capable of handling large volumes of data in a rapid and cost-effective manner.
Presently, the demands on data processors are such that large pluralities of data must be arithmetically and logically processed in short periods of time for purposes of constantly updating previously obtained results or, alternatively, for monitoring large fields from which data may be acquired and in which correlations must be made. For example, this country is presently intending to orbit imaging sensors which can generate data at rates up to $10^{13}$ bits per day.
For such an imaging system, a variety of image processing tasks such as geometric correction, correlation, image registration, feature selection, multispectral classification, and area measurement are required to extract useful information from the mass of data obtained. Indeed, it is expected that the work load for a data processing system utilized in association with such orbiting image sensors would fall somewhere between $10^{9}$ and $10^{10}$ operations per second.
High speed processing systems and sophisticated parallel processors, capable of simultaneously operating on a plurality of data, have been known for a number of years. Indeed, applicant's prior U.S. Patents 3,800,289; 3,812,467; and 3,936,806, all relate to a structure for vastly increasing the data processing capability of digital computers. Similarly, U.S. Patent 3,863,233, assigned to Goodyear Aerospace Corporation, the assignee of the instant application, relates specifically to a data processing element for an associative or parallel processor which also increases data processing speed by including a plurality of arithmetic units, one for each word in the memory array. However, even the great advancements of these prior art teachings do not possess the capability of cost effectively handling the large volume of data previously described.
A system of the required nature includes thousands of processing elements, each including its own arithmetic and logic network operating in conjunction with its own memory, while possessing the capability of communicating with other similar processing elements within the system. With thousands of such processing elements operating simultaneously (massive-parallelism), the requisite speed may be achieved. Further, since typical satellite images include millions of picture elements, or pixels, that can generally be processed at the same time, such a structure lends itself well to the solution of the aforementioned problem.
In a system capable of processing a large volume of data in a massively-parallel manner, it is most desirable that the system be capable of performing bit-serial mathematics for cost effectiveness.
However, in order to increase speed in the bit-serial computation, it is most desirable that a variable length shift register be included such that various word lengths may be accommodated. Further, it is desirable that the massive array of processing elements be capable of intercommunication such that data may be moved between and among at least neighboring processing elements. Further, it is desirable that each processing element be capable of performing all of the Boolean operations possible between two bits of data, and that each such processing element include its own random access memory. Yet further, for such a system to be efficient, it should include means for bypassing inoperative or malfunctioning processing elements without diminishing system integrity.

OBJECTS OF THE INVENTION
In light of the foregoing, it is an object of an aspect of the invention to provide a plurality of processing elements for a parallel array processor wherein each such element includes a variable length shift register for at least assisting in arithmetic computations.
Yet another object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor wherein each such processing element is capable of intercommunicating with at least certain neighboring processing elements.
Still another object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor wherein each such processing element is capable of performing bit-serial mathematical computations.
An additional object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor wherein each such processing element is capable of performing all of the Boolean functions capable of being performed between two bits of binary data.
Yet a further object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor wherein each such processing element includes its own memory and data bus.
Still a further object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor wherein certain of said processing elements may be bypassed should they be found to be inoperative or malfunctioning, such bypassing not diminishing the system integrity.
Yet another object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor which achieves cost-effective processing of a large plurality of data in a time-efficient manner.
SUMMARY OF THE INVENTION
Certain of the foregoing and other objects of the invention are achieved by a matrix of a plurality of processing elements interconnected with each other and wherein each processing element comprises: a memory; an adder; a selectably variable length shift register operatively connected to said adder, said shift register comprising a plurality of individual shift registers having gates interposed therebetween, said gates selectively interconnecting said individual shift registers; and communication means connected to neighboring processing elements within said matrix and further connected to said adder and memory for transferring data between said memory, adder, and neighboring processing elements.

DESCRIPTION OF DRAWINGS
For a complete understanding of the objects, techniques, and structure of the invention, reference should be had to the following detailed description and accompanying drawings wherein:
Fig. 1 is a block diagram of a massively parallel processing system according to the invention, showing the interconnection of the array unit incorporating a plurality of processing elements;

Fig. 2 is a block diagram of a single processing element, comprising the basic building block of the array unit of Fig. 1;
Fig. 3, consisting of Figs. 3A-3C, constitutes a circuit schematic of the control signal generating circuitry of the processing elements maintained upon a chip and including the sum-or and parity trees;
Fig. 4 is a detailed circuit schematic of the fundamental circuitry of a processing element of the invention;
Fig. 5, comprising Figs. 5A and 5B, presents circuit schematics of the switching circuitry utilized in removing an inoperative or malfunctioning processing element from the array unit.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENT
Referring now to the drawings and more particularly Fig. 1, it can be seen that a massively-parallel processor is designated generally by the numeral 10. A key element of the processor 10 is the array unit 12 which, in a preferred embodiment of the invention, includes a matrix of 128 x 128 processing elements, for a total of 16,384 processing elements, to be described in detail hereinafter.
The array unit 12 inputs data on its left side and outputs data on its right side over 128 parallel lines. The maximum transfer rate of 128-bit columns of data is 10 MHz for a maximum bandwidth of 1.28 billion bits per second. Input, output, or both, can occur simultaneously with processing.
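The figures quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the constant and function names are hypothetical and do not appear in the disclosure.

```python
# Illustrative check of the array size and peak I/O bandwidth quoted in the text.
ROWS = 128                    # rows of processing elements (one I/O line per row)
COLS = 128                    # columns of processing elements
COLUMN_RATE_HZ = 10_000_000   # 128-bit columns transferred at up to 10 MHz


def array_size(rows: int = ROWS, cols: int = COLS) -> int:
    """Total number of processing elements in the matrix."""
    return rows * cols


def peak_bandwidth_bits_per_s(lines: int = ROWS, rate_hz: int = COLUMN_RATE_HZ) -> int:
    """One bit per line per transfer, so bandwidth is lines times rate."""
    return lines * rate_hz
```

With the stated values this gives 16,384 processing elements and 1.28 billion bits per second.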
Electronic switches 24 select the input of the array unit 12 from the 128-bit interface of the processor 10, or from the input register 16. Similarly, the array 12 output may be steered to the 128-bit output interface of the processor 10 or to the output register 14 via switches 26. These switches 24,26 are controlled by the program and data management unit 18 under suitable program control. Control signals to the array unit 12 and status bits from the array unit may be connected to the external control interface of the processor 10 or to the array control unit 20. Again, this transfer is achieved by electronic switches 22, which are under program control of the unit 18.
The array control unit 20 broadcasts control signals and memory addresses to all processing elements of the array unit 12 and receives status bits therefrom. It is designed to perform bookkeeping operations such as address calculation, loop control, branching, subroutine calling, and the like. It operates simultaneously with the processing element control such that full processing power of the processing elements of the array unit 12 can be applied to the data to be handled. The control unit 20 includes three separate control units: the processing element control unit executes micro-coded vector processing routines and controls the processing elements and their associated memories; the input-output control unit controls the shifting of data through the array unit 12; and the main control unit executes the application programs, performs the scalar processing internally, and makes calls to the processing element control unit for all vector processing.
The program and data management unit 18 manages data flow between the units of the processor 10, loads programs into the control unit 20, executes system tests and diagnostic routines, and provides program development facilities. The details of such structure are not important for an understanding of the instant invention, but it should be noted that the unit 18 may readily comprise a mini-computer such as the Digital Equipment Corporation (DEC) PDP-11/34 with interfaces to the control unit 20, array unit 12 (registers 14,16), and the external computer interface. As is well known in the art, the unit 18 may also include peripheral equipment such as magnetic tape drive 28, disks 30, a line printer 32, and an alphanumeric terminal 34.
While the structure of Fig. 1 is of some significance for an appreciation of the overall system incorporating the invention, it is to be understood that the details thereof are not necessary for an appreciation of the scope and breadth of applicant's inventive concept. Suffice it to say at this time that the array unit 12 comprises the inventive concept to be described in detail herein and that such array includes a large plurality of interconnected processing elements, each of which has its own local memory, is capable of performing arithmetic computations, is capable of performing a full complement of Boolean functions, and is further capable of communicating with at least the processing elements orthogonally neighboring it on each side, hereinafter referenced as north, south, east, and west.
With specific reference now to Fig. 2, it can be seen that a single processing element is designated generally by the numeral 36. The processing element itself includes a P register 38 which, together with its input logic 40, performs all logic and routing functions for the processing element 36. The A, B, and C registers 42-46, the variable length shift register 48, and the associated logic of the full adder 50 comprise the arithmetic unit of the processing element 36. The G register 52 is provided to control masking of both arithmetic and logical operations, while the S register 54 is used to shift data into and out of the processing element 36 without disturbing operations thereof. Finally, the aforementioned elements of the processing element 36 are connected to a uniquely associated random access memory 56 by means of a bi-directional data bus 58.
As presently designed, the processing element 36 is reduced by large scale integration to such a size that a single chip may include eight such processing elements along with a parity tree, a sum-or circuit, and associated control decode. In the preferred embodiment of the invention, the eight processing elements on a chip are provided in a two row by four column arrangement. Since the size of random access memories presently available through large scale integration is rapidly changing, it is preferred that the memory 56, while comprising a portion of the processing element 36, be maintained separate from the integrated circuitry of the remaining structure of the processing elements such that, when technology allows, larger memories may be incorporated with the processing elements without altering the total system design.
The data bus 58 is the main data path for the processing element 36. During each machine cycle it can transfer one bit of data from any one of six sources to one or more destinations. The sources include a bit read from the addressed location in the random access memory 56, the state of the B, C, P, or S registers, or the state of the equivalence function generated by the element 60 and indicating the state of equivalence existing between the outputs of the P and G registers. The equivalence function is used as a source during a masked-negate operation.
The destinations of a data bit on the data bus 58 are the addressed location of the random access memory 56, the A, G, or S registers, the logic associated with the P register, the input to the sum-or tree, and the input to the parity tree.
Before considering the detailed circuitry of the processing element 36, attention should be given to Fig. 3 wherein the circuitry 62 for generating the control signals for operating the processing elements is shown. The circuitry of Fig. 3 is included in a large scale integrated chip which includes eight processing elements, and is responsible for controlling those associated elements. Fundamentally, the circuitry of Fig. 3 includes decode logic which receives control signals on lines L0-LF under program control and converts those signals into the control signals K1-K27 for application to the processing elements 36, sum-or tree, and parity tree. Additionally, the circuitry of Fig. 3 generates from the main clock of the system all other clock pulses necessary for control of the processing element 36.
One skilled in the art may readily deduce from the circuitry of Fig. 3 the relationship between the programmed input function on the lines L0-LF and the control signals K1-K27. For example, the inverters 64,66 result in K1 = LC. Similarly, inverters 68-72 and NAND gate 74 result in K16 = L0·L1. By the same token, K18 = L2·L3·L4·L6.
Clock pulses for controlling the processing elements 36 are generated in substantially the same manner as the control signals. The same would be readily apparent to those skilled in the art from a review of the circuitry 62 of Fig. 3. For example, the clock S-CLK = S-CLK-ENABLE · MAIN CLK by virtue of inverters 76,78 and NAND gate 80. Similarly, clock G-CLK = L8 · MAIN CLK by virtue of inverters 76,82 and NAND gate 84.
With further respect to the circuitry 62 of Fig. 3, it can be seen that there is provided means for determining parity error and the sum-or of the data on the data bus of all processing elements. The data bit on the data bus may be presented to the sum-or tree, which is a tree of inclusive-or logic elements which forms the inclusive-or of all processing element data bus states and presents the results to the array control unit 20.
In order to detect the presence of processing elements in certain states, groups of eight processing elements are ORed together in an eight-input sum-or tree whose output is then fed to a 2048-input or-tree external to the chip to achieve a sum-or of all 16,384 processing elements.
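The two-level reduction described above can be sketched in software as follows. This is a behavioral illustration only, not the gate-level circuit; the function names are hypothetical.

```python
from typing import List, Sequence

PES_PER_CHIP = 8    # eight processing elements share one on-chip sum-or tree
CHIPS = 2048        # 2048 chip outputs feed the external or-tree


def chip_sum_or(bus_bits: Sequence[int]) -> int:
    """Inclusive-or of the eight data bus bits on one chip."""
    return int(any(bus_bits))


def array_sum_or(bus_bits: Sequence[int]) -> int:
    """Inclusive-or of all data bus bits, formed chip by chip."""
    if len(bus_bits) != PES_PER_CHIP * CHIPS:
        raise ValueError("expected one bus bit per processing element")
    chip_outputs: List[int] = [
        chip_sum_or(bus_bits[i:i + PES_PER_CHIP])
        for i in range(0, len(bus_bits), PES_PER_CHIP)
    ]
    return int(any(chip_outputs))
```

The result is 1 as soon as a single processing element places a 1 on its data bus, which is how a responder to a search operation would be detected.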
Errors in the random access memory 56 may be determined in standard fashion by parity-generation and checking circuitry. With each group of eight processing elements 36 there is a parity-error flip-flop 86 which is set to a logic 1 whenever a parity error is detected in an associated random access memory 56. As shown in the circuitry 62, the sum-or tree comprises the three gates designated by the numeral 88 while the parity error tree consists of the seven exclusive-OR gates designated by the numeral 90.
During read operations, the parity output is latched in the flip-flop 86 at the end of the cycle by the M-clock. During write operations, parity is outputted to a parity memory through the parity-bit pin of the chip. The parity memory comprises a ninth random access memory similar to the elements 56. The parity state stored at the parity bit during write operations is exclusive-ORed with the output of the parity tree 90 during read operations to affect the latch 86.
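The write and read behavior of the parity circuitry can be modeled as follows. This is a minimal behavioral sketch under stated assumptions: the class and method names are hypothetical, the parity memory is modeled as a dictionary keyed by address, and timing (the M-clock) is not modeled.

```python
from functools import reduce
from operator import xor
from typing import Dict, Sequence


def parity_tree(bits: Sequence[int]) -> int:
    """Exclusive-OR of the eight data bits (seven two-input XORs in hardware)."""
    return reduce(xor, bits)


class ParityChecker:
    """Behavioral model of one chip's parity memory and error latch."""

    def __init__(self) -> None:
        self.parity_memory: Dict[int, int] = {}  # stands in for the ninth memory
        self.error_latch = 0                     # stands in for flip-flop 86

    def write(self, address: int, bits: Sequence[int]) -> None:
        # On a write, the parity of the eight data bits is stored.
        self.parity_memory[address] = parity_tree(bits)

    def read(self, address: int, bits: Sequence[int]) -> int:
        # On a read, stored parity is XORed with freshly computed parity;
        # a mismatch sets the latch, which stays set until cleared.
        if self.parity_memory[address] ^ parity_tree(bits):
            self.error_latch = 1
        return self.error_latch

    def clear(self) -> None:
        # Corresponds to clearing the parity-error flip-flop.
        self.error_latch = 0
```

Because the latch holds its state after an error, a later correct read does not hide an earlier fault.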
As shown, control signal K23 determines whether a read or write operation is being performed, while K24 is used for clearing the parity-error flip-flop 86. The sum-or tree 88 ORs all of the data bits D0-D7 on the associated data bus lines of the eight processing elements 36 of the chip. As can be seen, both the parity outputs and the sum-or outputs are transferred via the same gating matrix 92, which is controlled by K27 to determine whether parity or sum-or will be transferred from the chip to the array control unit 20. The outputs of the flip-flops 86 of each of the processing elements are connected to the 2048-input sum-or tree such that the presence of any set flip-flop 86 might be sensed. By using a flip-flop which latches upon an error, the array control unit 20 can sequentially disable columns of processing elements until that column containing the faulty element is found.
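The search for a faulty column can be sketched as a simple loop. This is an illustration of the idea only, assuming a single faulty column and assuming that disabling a column removes its latches from the global sum-or; the function names and the list-of-lists data layout are hypothetical.

```python
from typing import AbstractSet, List, Optional


def any_latch_set(latches_by_column: List[List[int]],
                  disabled: AbstractSet[int] = frozenset()) -> bool:
    """Global sum-or of the error latches of all columns not disabled."""
    return any(
        bit
        for col, latches in enumerate(latches_by_column)
        if col not in disabled
        for bit in latches
    )


def locate_faulty_column(latches_by_column: List[List[int]]) -> Optional[int]:
    """Disable one column at a time until the global error indication clears."""
    if not any_latch_set(latches_by_column):
        return None  # no error latched anywhere
    for col in range(len(latches_by_column)):
        if not any_latch_set(latches_by_column, disabled={col}):
            return col
    return None  # errors in more than one column; not handled by this sketch
```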
Finally, and as will be discussed further hereinafter, control signal K25 is used to disable the parity and sum-or outputs from the chip when the chip is disabled and no longer used in the system.
While the utilization of sum-or and parity functions is known in the art, their utilization in the instant invention is important to assist in locating faulty processing elements such that those elements may be removed from the operative system.
The trees 88,90, mutually exclusively gated via the network 92, provide the capability for columns of processing elements 36 to be checked for parity and further provide the sum-or network to determine the presence of processing elements in particular logic states, such as to determine the responder to a search operation. The number of circuit elements necessary for this technique has been kept to a minimum by utilizing a single output for the two trees, with that output being multiplexed under program control.
With final attention to Fig. 3, it can be seen that the disable signal, utilized for removing an entire column of processing element chips from the array unit 12, generates the signals K25,K26 for this purpose. As mentioned above, the control signal K25 disables the sum-or and parity outputs for associated processing elements. Further functions of the signals K25,K26 with respect to removing selected processing elements will be discussed with respect to Fig. 5 hereinafter.
With reference now to Fig. 4, and correlating the same to Fig. 2, it can be seen that the full adder of the invention comprises logic gates 94-100. This full adder communicates with the B register comprising flip-flop 102 which receives the sum bit, the C register which comprises flip-flop 104 which receives the carry bit, and further communicates with the variable length shift register 48 which comprises 16, 8, and 4 bit shift registers 106-110, flip-flops 112,114, and multiplexers 116-120.
The adder receives an input from the shift register, namely the output of the A register 122, and an input from the logic and routing sub-unit, namely the output of the P register 124. Whenever control line K21 is a logic 1 and BC-CLK is clocked, the adder adds the two input bits from registers A and P to the carry bit stored in the C register 104 to form a two-bit sum. The least significant bit of the sum is clocked into the B register 102 and the most significant bit of the sum is clocked into the C register 104 so that it becomes the carry bit for the next machine cycle. If K21 is at a logic 0, a 0 is substituted for the P bit.
As shown, control line K12 sets the C
register 104 to the logic 1 state while control line K13 resets the C register to the logic 0 state.
Control line K16 passes the state of the B register 102 onto the bi-directional data bus 58, while control line K22 transfers the output of the C register to the data bus.
In operation, the full adder of Fig. 4 incorporates a carry function expressed as follows:
$C \leftarrow AP \lor PC \lor AC$.
The new state of the carry register C, flip-flop 104, is equivalent to the states of the A and P registers ANDed together, or the states of the P and C registers ANDed together, or the states of the A and C registers ANDed together. This carry function is achieved, notwithstanding the fact that there is no feedback of C register outputs to C register inputs, because the JK flip-flop 104 follows the rule:
$C \leftarrow J\overline{C} \lor \overline{K}C$.
The new state of the C register is the complement of the present state of the C register ANDed with the J input or the complement of the K input ANDed with the present state of the C register. Accordingly, in the circuit of Fig. 4, the flip-flop 104 follows the rule:
$C \leftarrow AP\overline{C} \lor (A \lor P)C$.
The expression immediately above is equivalent to the carry function first given.
With respect to the sum expression, the B register, flip-flop 102, receives a sum bit which is an exclusive OR function of the states of the A, P, and C registers according to the expression:
$S \leftarrow A \oplus P \oplus C$.
The gate 98 generates $A \oplus P$ from gates 94 and 96, while gate 100 exclusive-ORs that result with C to achieve the sum expression.
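The equivalence between the JK form of the carry and the majority form stated above can be checked exhaustively, and the same functions can drive a bit-serial addition. The sketch below assumes, consistent with the rule given for flip-flop 104, that J = AP and that K is the complement of (A or P); all function names are hypothetical.

```python
from itertools import product
from typing import Tuple


def majority_carry(a: int, p: int, c: int) -> int:
    """Carry function as first given: AP or PC or AC."""
    return (a & p) | (p & c) | (a & c)


def jk_next(q: int, j: int, k: int) -> int:
    """JK flip-flop rule: new state = J and not Q, or not K and Q."""
    return (j & (1 - q)) | ((1 - k) & q)


def jk_carry(a: int, p: int, c: int) -> int:
    """Carry via the JK rule, assuming J = AP and K = not (A or P)."""
    return jk_next(c, a & p, 1 - (a | p))


def sum_bit(a: int, p: int, c: int) -> int:
    """Sum bit: A xor P xor C."""
    return a ^ p ^ c


def bit_serial_add(x: int, y: int, width: int) -> Tuple[int, int]:
    """Add two unsigned words one bit per cycle, least significant bit first."""
    carry, result = 0, 0
    for i in range(width):
        a, p = (x >> i) & 1, (y >> i) & 1
        result |= sum_bit(a, p, carry) << i
        carry = jk_carry(a, p, carry)
    return result, carry


# The two carry forms agree for all eight input combinations.
assert all(majority_carry(a, p, c) == jk_carry(a, p, c)
           for a, p, c in product((0, 1), repeat=3))
```

For example, `bit_serial_add(5, 3, 4)` returns `(8, 0)`, and `bit_serial_add(15, 1, 4)` returns `(0, 1)` with the final carry set.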
The shift register of the arithmetic unit of the processing element 36 has 30 stages. These stages allow for the shift register to have varying lengths so as to accommodate various word sizes, substantially reducing the time for arithmetic operations in serial-by-bit calculations, such as occur in multiplication. Control lines K1-K4 control multiplexers 116-120 so that certain parts of the shift register may be bypassed, causing the length of the shift register to be selectively set at either 2, 6, 10, 14, 18, 22, 26, or 30 stages. Data bits are entered into the shift register through the B register 102, these being the sum bits from the adder. The data bits leave the shift register through the A register 122 and recirculate back through the adder. The A and B registers add two stages of delay to the round-trip path. Accordingly, the round-trip length of an arithmetic process is either 4, 8, 12, 16, 20, 24, 28, or 32 stages, depending upon the states of the control lines K1-K4 as they regulate the multiplexers 112-120.
The shift register outputs data to the A register 122 which has two other inputs selectable via control lines K1,K2, and multiplexer 120. One input is a logic 0. This is used to clear the shift register to an all-zero state. The other input is the bi-directional data bus 58. This may be used to enter data directly into the adder.
The A register 122 is clocked by A-CLK, and the other thirty stages of the shift register are clocked by SR-CLK. Since the last stage of the shift register has a separate clock, data from the bi-directional data bus 58 or logic 0 may be entered into the adder without disturbing data in the shift register.
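The selectable lengths listed above follow from the section sizes. The sketch below assumes that the 16, 8, and 4 bit sections are each independently bypassable and that the two flip-flop stages are always in the path; that decomposition is an inference from the stated lengths, and the mapping of control lines K1-K4 onto the sections is not modeled. All names are hypothetical.

```python
from itertools import product
from typing import List, Sequence

BYPASSABLE_SECTIONS = (16, 8, 4)  # shift registers 106-110
FIXED_STAGES = 2                  # flip-flops 112,114, assumed always in the path
ADDER_PATH_STAGES = 2             # the A and B registers on the round trip


def shift_register_length(enabled: Sequence[bool]) -> int:
    """Length with the given sections enabled (one flag per bypassable section)."""
    return FIXED_STAGES + sum(
        size for size, on in zip(BYPASSABLE_SECTIONS, enabled) if on
    )


def round_trip_length(enabled: Sequence[bool]) -> int:
    """Shift register length plus the two stages added by the A and B registers."""
    return shift_register_length(enabled) + ADDER_PATH_STAGES


def selectable_lengths() -> List[int]:
    """All shift register lengths reachable by bypassing sections."""
    return sorted(
        shift_register_length(flags) for flags in product((False, True), repeat=3)
    )
```

Under these assumptions `selectable_lengths()` yields 2, 6, 10, 14, 18, 22, 26, and 30, and the round-trip lengths run from 4 to 32 in steps of 4, matching the text.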
As discussed above, the P register 124 provides an input to the adder 50 with such input being supplied from one of the orthogonally contiguous processing elements 36, or from the data bus 58. Data is received by the P register 124 from the P register of neighboring processing elements 36 by means of the multiplexer 126 under control of control signals K5,K6. In transferring data to the P register 124 from the multiplexer 126, transfer is made via inverter 128 and gates 130,132. The transfer is effectuated under control of the control signal K7 to apply the true and complement of the data to the J and K inputs respectively of the flip-flop 124. The data is latched under control of the clock P-CLK. As noted, the true and complement outputs of the P flip-flop 124 are also adapted to be passed to the P flip-flops of neighboring processing elements 36. The complement is passed off of the chip containing the immediate processing element, but is inverted by a driver at the destination to supply the true state of the P flip-flop. The true state is not inverted and is applied to neighboring processing elements on the same chip.
The logic circuitry 40 is shown in more detail in Fig. 4 to be under control of control lines K8-Kll.
This logic receives data from the data bus 58 either in the true state or complementary through the inverter 130. The logic network 40, under control of the control signals K8-K11, is then capable of performing all sixteen Boolean logic functions which may be performed between the data from the data bus and that maintained in the P
register 124. The result is then stored in the P register 124.
It will be understood that with K7=0, gates 130,132 are disabled. Control lines K8 and K9 then allow either 0, 1, D, or $\overline{D}$ to be gated to the J input of the P register, flip-flop 124. D is the state of the data bus 58. Independently, control lines K10 and K11 allow 0, 1, D, or $\overline{D}$ to be sent to the K input.
Following the rule of J-K flip-flop operation, the new state of the P register is defined as follows:
$P \leftarrow J\overline{P} \lor \overline{K}P$.
As can be seen, in selecting all four states of J and all four states of K, all sixteen logic functions of P and D can be obtained.
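That the four choices for J combined with the four choices for K produce sixteen distinct functions of P and D can be confirmed by enumeration. The sketch below is illustrative; the selection labels and function names are hypothetical, and the encoding of K8-K11 is not modeled.

```python
from itertools import product
from typing import Callable, Dict, Set, Tuple

# The four values that may be gated to the J input or to the K input.
SELECTIONS: Dict[str, Callable[[int], int]] = {
    "0": lambda d: 0,
    "1": lambda d: 1,
    "D": lambda d: d,
    "notD": lambda d: 1 - d,
}


def jk_next(p: int, j: int, k: int) -> int:
    """JK flip-flop rule: new P = J and not P, or not K and P."""
    return (j & (1 - p)) | ((1 - k) & p)


def truth_table(j_sel: str, k_sel: str) -> Tuple[int, ...]:
    """New P for (P, D) = (0,0), (0,1), (1,0), (1,1) under one J/K selection."""
    return tuple(
        jk_next(p, SELECTIONS[j_sel](d), SELECTIONS[k_sel](d))
        for p, d in product((0, 1), repeat=2)
    )


def all_functions() -> Set[Tuple[int, ...]]:
    """Distinct truth tables over all sixteen J/K selections."""
    return {truth_table(j, k) for j, k in product(SELECTIONS, repeat=2)}
```

When P is 0 the new state equals J, and when P is 1 it equals the complement of K, so each half of the truth table is chosen independently. For instance, J = D with K = $\overline{D}$ loads D into P, while J = 0 with K = 0 holds P unchanged.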
As discussed above, the output of the P register may be used in the arithmetic calculations of the processing elements 36, or may be passed to the data bus 58. If K21 is at a logic 1, the current state of the P register is enabled to the adder logic. If control line [illegible] is a logic 0, the output of the P register is enabled to the data bus. If control line [illegible] is at a logic 0, the output of the P register is exclusively ORed with the complement of the G register 132, and the result is enabled to the data bus. It will be noted that certain transfers to the data bus are achieved via bi-directional transmission gates 134,136, respectively enabled by control signals [illegible] and [illegible]. These types of gates are well known to those skilled in the art.
The mask register G, designated by the numeral 132, comprises a simple D-type flip-flop. The G register reads the state of the bi-directional data bus on the positive transition of G-CLK. Control line K19 controls the masking of the arithmetic sub-unit clocks (A-CLK, SR-CLK, and BC-CLK). When K19 equals 1, these clocks will only be sent to the arithmetic sub-units of those processing elements where G=1. The arithmetic sub-units of those processing elements where G=0 will not be clocked and no register and no sub-units will change state. When K19 = 0, the arithmetic sub-units of all processing elements will participate in the operation.
Control line K20 controls the masking of the logic and routing sub-unit. When K20 = 1, the clock P-CLK is only sent to the logic and routing sub-units of those processing elements where G=1. The logic and routing sub-units of those processing elements where G=0 will not be clocked and their P registers will not change state.
Translation operations are masked when control line K20 = 1. In those processing elements where G=1, the P register is clocked by P-CLK and receives the state of its neighbor. In those where G=0, the P register is not clocked and does not change state. Regardless of whether G=0 or G=1, each processing element sends the state of its P register to its neighbors.
Brief attention is now given to the equivalence function provided for by the inclusive OR gate 138, which provides a logic 1 output when the inputs thereof from the P and G registers are of common logic states. In other words, the gate 138 provides the output function $\overline{P \oplus G}$. This result is then supplied to the data bus.
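The masked translation and the equivalence function described above can be sketched for a single row of processing elements. This is a behavioral illustration only: routing is shown from the west neighbor, the value entering at the west edge is an assumed parameter, and the function names are hypothetical.

```python
from typing import List


def masked_route_from_west(p: List[int], g: List[int], k20: int,
                           west_edge: int = 0) -> List[int]:
    """One routing step along a row, with west-to-east data movement.

    Every element presents its P state to its east neighbor. An element
    latches the incoming bit only if masking is off (k20 == 0) or its own
    G bit is 1; otherwise its P register keeps its state.
    """
    incoming = [west_edge] + p[:-1]
    return [
        incoming[i] if (k20 == 0 or g[i] == 1) else p[i]
        for i in range(len(p))
    ]


def equivalence(p: int, g: int) -> int:
    """Logic 1 when P and G hold the same state."""
    return 1 - (p ^ g)
```

With P = [1, 0, 1, 1] and G = [1, 0, 1, 0], an unmasked step gives [0, 1, 0, 1], while a masked step gives [0, 0, 0, 1] because the second and fourth elements are not clocked.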
The S register comprises a D-type flip-flop 140 with the input thereto under control of the multiplexer 142. The output from the S register is transmitted to the data bus 58 by means of the bi-directional transmission gate 144. The flip-flop 140 reads the state of its input on the transition of the clock pulse S-CLK-IN. When control line K17 is at a logic 0, the multiplexer 142 receives the state of the S register of the processing element immediately to the west. In such case, each S-CLK-IN pulse will shift the data in the S registers one place to the east. To store the state of the S register 140 in local memory, control line K18 is set to a logic 0 to enable the bi-directional transmission gate 144 to pass the complementary output of the S register 140 through the inverter 146 and to the data bus 58. The S register 140 may be loaded with a data bit from the local memory 56 by setting K17 to a logic 1, thus enabling the data bus 58 to the input of the flip-flop 140.
As mentioned hereinabove, a particular attribute of the massively-parallel processor 10 is that the array unit 12 is capable of bypassing a set of columns of processing elements 36 should an error or fault appear in that set. As discussed earlier herein, each chip has two processing elements 36 in each of four columns of the array unit matrix. The instant invention disables columns of chips and, accordingly, sets of columns of processing elements. Fundamentally, the columns are dropped out of operation by merely jumping the set of columns by interconnecting the inputs and outputs of the east-most and west-most processing elements on the chips establishing the set of columns. The method of inhibiting the outputs of the sum-or tree and the parity tree of the chips has previously been described. However, it is also necessary to bypass the outputs of the P and S registers which intercommunicate between the east and west neighboring chips.
As shown in Fig. 5A, a chip includes eight processing elements, PE0-PE7, arranged as earlier described. The S register of each processing element may receive data from the S register of the processing element immediately to the west and may transfer data to the S register of the processing element immediately to the east. When enabled, the chip allows data to flow from S-IN0, through the S registers of PE0-PE3 and then out S-OUT3 to the neighboring chip. Similar data flow occurs from S-IN7 to S-OUT4. When it is desired to disable a column of chips, the output gates of the column of chips which pass the S register data to the neighboring east chip are disabled.
That is, control signal K25 may inhibit output gates 148,150 while concurrently enabling the bypass gates 152,154. This interconnects S-IN0 with S-OUT3 and S-IN7 with S-OUT4, for all chips in the column.
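The effect of the S register bypass can be sketched for a single west-to-east path (S-IN0 through PE0-PE3 to S-OUT3). The sketch below is an illustrative model only: the function name `shift_row`, the list-of-lists layout, and the boolean `bypass` flags standing in for the state of control signal K25 on each chip are assumptions, not part of the disclosure.

```python
def shift_row(chips, bypass, west_in):
    """Apply one S-CLK-IN pulse to a west-to-east row of chips (hypothetical model).

    chips:   list of 4-element lists, the S registers of PE0..PE3 on each chip
    bypass:  list of booleans, True where the chip's column is disabled
    west_in: bit presented at S-IN0 of the west-most chip
    Returns (new_chips, east_out), where east_out is the bit leaving the row.
    """
    new_chips = []
    s_in = west_in                              # value seen at this chip's S-IN0
    for regs, skipped in zip(chips, bypass):
        # Every chip keeps shifting internally, disabled or not.
        new_chips.append([s_in] + regs[:-1])
        if not skipped:
            s_in = regs[-1]                     # output gate passes PE3 to S-OUT3
        # If skipped, the bypass gate ties S-IN0 to S-OUT3, so s_in is unchanged.
    return new_chips, s_in


chips = [[0, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 0]]
new_chips, east_out = shift_row(chips, bypass=[False, True, False], west_in=0)
print(new_chips)   # [[0, 0, 0, 0], [1, 1, 1, 1], [1, 0, 0, 0]]
print(east_out)    # 0
```

In the example, the bit held in PE3 of the first chip arrives in PE0 of the third chip after one pulse, as though the disabled middle chip were absent from the row.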
In Fig. 5B it can be seen that communications between the P registers of east-west neighboring chips may also be bypassed. P register data is received from the chip to the west via inverters 156,158 and is transmitted thereto by gates 160,162. Similarly, P register data is received from the chip to the east via inverters 164,166 and is transmitted thereto via gates 168,170. If the chip is enabled and P register data is to be routed to the west, then control line K6 is set to a logic 1 and K26 to a logic 0 so gates 160,162 are enabled and gates 168,170 are disabled. When routing to the east, K6 is set to zero and K26 to one. To disable the chip, K6 and K26 are both set to a logic 0 to disable all P register east-west outputs from the chip and K25 is set to allow the bi-directional bypass gates 172,174 to interconnect WEST-0 with EAST-3 and WEST-7 with EAST-4. This connects the P registers of PE3 of the west chip with PE0 of the east chip and PE4 of the west chip with PE7 of the east chip.
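The three control combinations named above can be summarized as a small lookup. This is a sketch under stated assumptions: the function name `p_east_west_mode` is hypothetical, the polarity of K25 is not given in the text and is therefore modeled as a boolean `bypass` flag, and any combination the text does not describe is reported as such rather than guessed.

```python
def p_east_west_mode(k6: int, k26: int, bypass: bool) -> str:
    """Decode the P register east-west routing state (hypothetical helper).

    bypass stands in for the K25 setting that enables bypass gates 172,174;
    its actual logic level is not stated in the description.
    """
    table = {
        (1, 0, False): "route west",      # gates 160,162 enabled
        (0, 1, False): "route east",      # gates 168,170 enabled
        (0, 0, True): "chip bypassed",    # gates 172,174 join west and east pins
    }
    return table.get((k6, k26, bypass), "not described")


print(p_east_west_mode(1, 0, False))   # route west
print(p_east_west_mode(0, 0, True))    # chip bypassed
```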
By disabling the parity and sum-or trees and by jumping the inputs and outputs of bordering P and S registers of the chips in a column, an entire column of chips may be removed from service if a fault is detected. It will be understood that while the processing elements of the disabled chips do not cease functioning when disabled, the outputs thereof are simply removed from affecting the system as a whole.
Further, it will be appreciated that, by removing columns, no action need be taken with respect to intercommunication between north and south neighbors.
Finally, by removing entire chips rather than columns of processing elements, the amount of bypass gating is greatly reduced.
In the preferred embodiment of the invention, the array unit 12 has 128 rows and 132 columns of processing elements 36. In other words, there are 64 rows and 33 columns of chips. Accordingly, there is an extra column of chips beyond those necessary for achieving the desired square array. This allows for the maintenance of a square array even when a faulty chip is found and a column of chips is to be removed from service.
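The array dimensions stated above follow from the chip geometry given earlier (two processing elements in each of four columns per chip). The short check below is illustrative only; the constant names and the helper `in_service_columns` are hypothetical.

```python
PE_ROWS, PE_COLS = 128, 132            # processing elements in array unit 12
CHIP_PE_ROWS, CHIP_PE_COLS = 2, 4      # elements per chip: 2 rows by 4 columns

chip_rows = PE_ROWS // CHIP_PE_ROWS    # 64 rows of chips
chip_cols = PE_COLS // CHIP_PE_COLS    # 33 columns of chips

# With one column of chips bypassed, the element columns left in service:
usable_pe_cols = (chip_cols - 1) * CHIP_PE_COLS   # 128, matching the 128 rows


def in_service_columns(bypassed_col: int) -> list:
    """Chip-column indices still in service when one column is bypassed."""
    return [c for c in range(chip_cols) if c != bypassed_col]


print(chip_rows, chip_cols, usable_pe_cols)   # 64 33 128
print(len(in_service_columns(5)))             # 32
```

Whichever of the 33 chip columns is bypassed, 32 remain, giving the 128 by 128 square array of processing elements.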
Thus it can be seen that the objects of the invention have been satisfied by the structure presented hereinabove. A massively-parallel processor, having a unique array unit of a large plurality of interconnected and intercommunicating processing elements, achieves rapid parallel processing. A variable length shift register allows serial-by-bit arithmetic computations in a rapid fashion, while reducing system cost. Each processing element is capable of performing all requisite mathematical computations and logic functions and is further capable of intercommunicating not only with neighboring processing elements, but also with its own uniquely associated random access memory. Provisions are made for removing an entire column of processing chips wherein at least one processing element has been found to be faulty. All of this structure leads to a highly reliable data processor which is capable of handling large magnitudes of data in rapid fashion.
While in accordance with the patent statutes, only the best mode and preferred embodiment of the invention has been presented and described in detail, it is to be understood that the invention is not limited thereto or thereby. Consequently, for an appreciation of the true scope and breadth of the invention, reference should be had to the following claims.


Claims

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:

1. A matrix of a plurality of processing elements interconnected with each other and wherein each processing element comprises:
a memory;
an adder;
a selectably variable length shift register operatively connected to said adder, said shift register comprising a plurality of individual shift registers having gates interposed therebetween, said gates selectively interconnecting said individual shift registers; and
communication means connected to neighboring processing elements within said matrix and further connected to said adder and memory for transferring data between said memory, adder, and neighboring processing elements.

2. The matrix according to claim 1 wherein each said processing element further includes a sum register and a carry register operatively connected to said adder.

3. The matrix according to claim 2 wherein said carry register comprises a J-K flip-flop.

4. The matrix according to claim 1 wherein each said processing element includes a logic network capable of performing the sixteen logic functions of two bits of data, said logic network including a single J-K flip-flop.


5. The matrix according to claim 5 wherein said processing elements are interconnected in groups, the processing elements of each group communicating with each other, each group being operatively connected to neighboring groups for communication therewith, and wherein each group includes means for removing the processing elements thereof from communication with neighboring groups.

6. The matrix according to claim 5 wherein said means for removing comprises bi-directional switches interconnecting inputs and outputs of said group.

7. The matrix according to claim 5 wherein each said group includes a sum-or tree of a plurality of OR gates receiving data bits from a data bus of each processing element within said group and a parity tree of a plurality of exclusive OR gates receiving data bits from each such data bus within said group, the outputs of said OR gates and said exclusive OR gates being mutually exclusively connected to a single output.

8. The matrix according to claim 7 wherein said exclusive OR gates of said parity tree are connected to a flip-flop.

9. The matrix according to claim 7 wherein each said group further includes disable means connected to said sum-or tree and parity tree for selectively enabling and inhibiting outputs therefrom.