CA1268554A - Adaptive instruction processing by array processor having processor identification and data dependent status registers in each processing element - Google Patents
Adaptive instruction processing by array processor having processor identification and data dependent status registers in each processing elementInfo
- Publication number
- CA1268554A CA1268554A CA000529484A CA529484A CA1268554A CA 1268554 A CA1268554 A CA 1268554A CA 000529484 A CA000529484 A CA 000529484A CA 529484 A CA529484 A CA 529484A CA 1268554 A CA1268554 A CA 1268554A
- Authority
- CA
- Canada
- Prior art keywords
- instruction
- processing
- data
- adaptive
- processor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
- 238000012545 processing Methods 0.000 title claims abstract description 131
- 230000003044 adaptive effect Effects 0.000 title claims abstract description 55
- 230000001419 dependent effect Effects 0.000 title claims abstract description 31
- 239000002131 composite material Substances 0.000 claims abstract description 10
- 230000004048 modification Effects 0.000 claims description 5
- 238000012986 modification Methods 0.000 claims description 5
- 230000000694 effects Effects 0.000 claims description 2
- 238000009795 derivation Methods 0.000 abstract description 2
- 230000004044 response Effects 0.000 abstract description 2
- 230000009471 action Effects 0.000 description 15
- 230000006978 adaptation Effects 0.000 description 11
- 230000000295 complement effect Effects 0.000 description 8
- 238000013461 design Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 6
- 230000006870 function Effects 0.000 description 6
- 238000004891 communication Methods 0.000 description 5
- 230000006854 communication Effects 0.000 description 5
- 150000002500 ions Chemical class 0.000 description 5
- 230000008901 benefit Effects 0.000 description 4
- 229910052729 chemical element Inorganic materials 0.000 description 3
- 230000007246 mechanism Effects 0.000 description 3
- 238000000034 method Methods 0.000 description 3
- 238000012360 testing method Methods 0.000 description 3
- 230000001934 delay Effects 0.000 description 2
- 230000008569 process Effects 0.000 description 2
- 241001269524 Dura Species 0.000 description 1
- KWYUFKZDYYNOTN-UHFFFAOYSA-M Potassium hydroxide Chemical compound [OH-].[K+] KWYUFKZDYYNOTN-UHFFFAOYSA-M 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000003780 insertion Methods 0.000 description 1
- 230000037431 insertion Effects 0.000 description 1
- 230000000873 masking effect Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 229940086255 perform Drugs 0.000 description 1
- 229940072033 potash Drugs 0.000 description 1
- 235000015320 potassium carbonate Nutrition 0.000 description 1
- BWHMMNNQKKPAPP-UHFFFAOYSA-L potassium carbonate Substances [K+].[K+].[O-]C([O-])=O BWHMMNNQKKPAPP-UHFFFAOYSA-L 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 239000002699 waste material Substances 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8007—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
- G06F15/8023—Two dimensional arrays, e.g. mesh, torus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/448—Execution paradigms, e.g. implementations of programming paradigms
- G06F9/4494—Execution paradigms, e.g. implementations of programming paradigms data driven
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F02—COMBUSTION ENGINES; HOT-GAS OR COMBUSTION-PRODUCT ENGINE PLANTS
- F02B—INTERNAL-COMBUSTION PISTON ENGINES; COMBUSTION ENGINES IN GENERAL
- F02B75/00—Other engines
- F02B75/02—Engines characterised by their cycles, e.g. six-stroke
- F02B2075/022—Engines characterised by their cycles, e.g. six-stroke having less than six strokes per cycle
- F02B2075/027—Engines characterised by their cycles, e.g. six-stroke having less than six strokes per cycle four
Abstract
Abstract of the Disclosure ADAPTIVE INSTRUCTION PROCESSING BY ARRAY PROCESSOR
HAVING PROCESSOR IDENTIFICATION
AND DATA DEPENDENT STATUS REGISTERS
IN EACH PROCESSING ELEMENT
Equipping individual processing elements with instruction derivation means provides an array processor with adaptive spatial-dependent and data-dependent processing capability. The instruction becomes variable, at the processing element level, in response to spatial and data parameters of the data stream. An array processor can be optimized, for example, to carry out very different instructions on spatial-dependent data such as blank margin surrounding the black lines of a sketch. Similarly, the array processor can be optimized for data-dependent values, for example to execute different in-structions for positive data values than for negative data values. Providing each processing element with a processor identification register permits an easy setup by flowing the setup values to the individual processing elements, together with setup of condition control values. Each individual adaptive processing element responds to the composite values of original setup and of the data stream to derive the instruction for execution during the cycle. In the usual operation, each adaptive processing element is individually addressed to set up a base instruction; it also is conditionally set up to execute a derived instruction instead of the base instruction. An array processor made up of adaptive processing elements can adapt dynamically to changes in its input data stream, and thus can be dynamically optimized, resulting in greatly en-hanced performance at very low incremental cost.
HAVING PROCESSOR IDENTIFICATION
AND DATA DEPENDENT STATUS REGISTERS
IN EACH PROCESSING ELEMENT
Equipping individual processing elements with instruction derivation means provides an array processor with adaptive spatial-dependent and data-dependent processing capability. The instruction becomes variable, at the processing element level, in response to spatial and data parameters of the data stream. An array processor can be optimized, for example, to carry out very different instructions on spatial-dependent data such as blank margin surrounding the black lines of a sketch. Similarly, the array processor can be optimized for data-dependent values, for example to execute different in-structions for positive data values than for negative data values. Providing each processing element with a processor identification register permits an easy setup by flowing the setup values to the individual processing elements, together with setup of condition control values. Each individual adaptive processing element responds to the composite values of original setup and of the data stream to derive the instruction for execution during the cycle. In the usual operation, each adaptive processing element is individually addressed to set up a base instruction; it also is conditionally set up to execute a derived instruction instead of the base instruction. An array processor made up of adaptive processing elements can adapt dynamically to changes in its input data stream, and thus can be dynamically optimized, resulting in greatly en-hanced performance at very low incremental cost.
Description
-~Z61~
ADA.PTIVE INSTRUClION PROCESSING BY ARRAY PROCESSOR
HAV~NC; PROCESSOR IDENTIFICATION
AND DATA DEPENDENT STATUS REGISTERS
IN EACH PROCESSING ELEMENT
BACKGROUND OF THE INVENTION
1. Field of the Invention This invention relates to image processors which process data streams passing through arrays of processing elements, each processing elemellt exe-cuting an assigned instruction, and more particularly relates to an architecture for adaptively mallipulating the instruction assignment of each processing el-ement in response to spatial and data values in the data stream, using an in-struction adapter, individual to each processing element, to derive a new instruction as a composite function of processor identification and status and of the data stream.
~2~i;S~
ADA.PTIVE INSTRUClION PROCESSING BY ARRAY PROCESSOR
HAV~NC; PROCESSOR IDENTIFICATION
AND DATA DEPENDENT STATUS REGISTERS
IN EACH PROCESSING ELEMENT
BACKGROUND OF THE INVENTION
1. Field of the Invention This invention relates to image processors which process data streams passing through arrays of processing elements, each processing elemellt exe-cuting an assigned instruction, and more particularly relates to an architecture for adaptively mallipulating the instruction assignment of each processing el-ement in response to spatial and data values in the data stream, using an in-struction adapter, individual to each processing element, to derive a new instruction as a composite function of processor identification and status and of the data stream.
~2~i;S~
2.Description of the Prior Art The following publications are representative of the prior art:
United States Patent 3,287,702, Borck, Jr., et al, COMPVTER CON-TROL, Nov. 22, 1966, shows an array processor with computer con-trol of an array of conventional processing elements.
United States Patent 3,287,703, D.L. Slotnick, COMPUlER, Nov. 22, 1966 shows a similar array processor.
United States Patent 3,970,993, C.A. Firmila, COOPERATIVE-WORD LINEAR ARRAY PARALLEL
PROCESSOR, July 20, 1976 shows an array processor in which each processing element includes a nag register which can modify the oper-ations on the common control lines.
United States Patent 4,187,539, J.R. Eaton, PlPELlNED DATA
PROCESSIMG SYSTEM Wll~I CENTRALIZED MICROPRO-GRAM CONTROL, Feb. 5, 1980 shows a ph~ral dataflow pipeLined processor in which each dataflow includes a shift register which pro-vides a sequence of microinstructions, and a cornmon rnicroprogram control unit includes a flag register which helps keep the instruction S~ -size small by providing instruction information which does not cbange often.
United States Patent 4,287,566, GJ. Culler, ARRA' PROCESSOE~
WlTH PARALLEL OPERATIONS PER I~lSTRUCllON, Sept. 01, 1981, shows an array processor having subarrays used to calculate ~eclor addresses.
United States Patent 4,344,134, G.H. Barnes, PAR~IO~ABLE
PARALLEL PROCESSOR, August 10, 1982, shows a partitionable asray processor in which each processor in a node tree issues a ready signal when the dataflow has passed it, thus invoking the next instruc-tion.
IJnited States Paten~ 4,380,046, L-W. Fung, MASSIVELY PAR~L-LEL PROCESSOR COMPUTER, April 12, 1983, shows aa array processor with each processing element equipped with a mask bit reg-ister, identlfied as G-register, to disable the processing element and thus distinguish between e%ecuting the current instruction or no-operadon; that is, each processing element has a G-register with an OP/NOP flag ~85~
Unitad States Patent 4,467,40g, Potash et al, FLEXIBLE COMPUTER ARCHITECTURE USING ARRAYS OF
STANDARDIZED MICROPROCESSORS CUSTOMIZED FOR PIPELINE
AND PARALLEL OPERATIONS, August 21, 1984, shows a flexible architecture for a sequential processor, using standardized units with "soft functional structures" which customize a unit for a command.
The units thus can be manufactured as standard units and customized by means of a mask which sets contacts in the soft functional structure.
United States Patent 4,558,411, Farber et al, POLY~ORPHIC PROGRA~MABLE UNITS EMPLOYING PLURAL
LEVELS OF SUB-INSTRUCTION SETS, December 10, 1985, shows a multiplelevel programmable unit to provide a hierarchy of sub-instruction sets of microprogramming, to change, for example, from input output mode to processing mode or to execute programs written in differing languages.
United States Patent 4,739,474, Holsztynski, DATA
PROCESSING CELLS AMD PARALLEL DATA PROCESSORS
INCORPORATING SUCH CELI.S, shows an array processor in which each processing element includes a full adder and storage devices for N-S (north-south), E-W
8~
(east-west), and C (carry), so that tbe processing elernent can carry out both arithmetic and logic functions.
U.S.S.R. Author's Certificate Number 83-721416/30, ASSOCIA-- TlVE PROCESSORS MICROPROGRAM CONTROL APPARA-TUS, Tbilisi Elva Combine, September 15, 1982, shows first and second control instruction registers in instruction memory to aliow the same rnicroinstruction to be used for different instructiorls, reducing the overall volume of memory.
Davis et al, SYSTOLIC ARRAY CHlP MATCHES THE PACE OF
HICH-SPEED PROCESSING, Electronic Design, October 31, 1984, pp 207-218, shows a represer~tative array processor.
NCR GEO~ETRIC ARI~IMETIC PARALLEL PROCESSOR, product spccification NCR45CG72, NCR Corp., Dayton, OH, 1984, pp. 1-12, shows physical characteristics of a representative array processor.
Cloud et al, HIGHER EFFICIENCY FOR PARALLEL PROCESS-ORS, IEEE Southcon, reprint published by NCR Corporation Micro-electror,'cs Div., Fort Collins, CO, pp. 1-7, shows details of operation of NCR's geometric arithmetic parallel processor (GAPP).
The prior art shows a variety of array processors, with individual processing elements controllable externally in a variety of ma~ners, aod with the possi-bility of OP/NOP according to a nag in the individual processing element--but the prior art does not teach the use of instruction adaptation within each individual adaptive processing element to make an array processor dynam-ically optimizable to spatial and data dependencies through derived instruc-tion within the adaptive processing elernent.
Current computer systems are categorized, according to instruction stream and data stream, into four classes. They are:
o SISD (Single Instmction stream Single Data stream).
SIMD (Single Instruction stream Multiple Data stream).
o MISD (Multiple Instruction stream Single Data stream).
M~vlD (Multiple Instructioa stream Multiple Data stream).
E~cept for SISD, these architectures are paraUel processing systems. How-ever, none of them can perform parallel operations which are adaptive to the spatial condition of a processing element (spatial adaptation, e.g. data are at Y09~5057 - 6-~26~
the border of an image or the processing element is at tbe first column of an array). Neither can they perforrn parallel operations adaptive to the nature of tbe data (data adaptation, e.g. data positive / data negative; flag true / nag false).
Supercomputers are commercially available now and exemplified by the Cyber series from CDC, the CRAY series from CRAY Research and the ~EC AP series. All these machines are of MISD architecture and require a long setup time for setting up the instruction pipe to process a vector. The overhead is large if the frequency of the pipe setup is high or the vector is short; the performance is consequently low in such cases.
Data dependence in a loop degrades the perforrnance of these supercomput-ers. The machines are either prevented from presetting the pipe until the data dependence is resolved (e.g. status is known exactly) or will set up the pipe for one path (e.g., status is true) with higher probability. The former case delays the execution while the latter case involves the resetting of the pipe (i.e. increase the pipe setup frequency) if the "guess" is wrong. Both cases degrade the performance.
The lack of spatial and/or data adaptation leads to the following drawbacks:
6~3SS~
l. Data-dependent operations are processed sequentiaUy, which leads to a waste of the paraUel hardware, hence to lower performance;
2. Data with spatial significance are treated as exception, which prevent the parallel opportunity;
United States Patent 3,287,702, Borck, Jr., et al, COMPVTER CON-TROL, Nov. 22, 1966, shows an array processor with computer con-trol of an array of conventional processing elements.
United States Patent 3,287,703, D.L. Slotnick, COMPUlER, Nov. 22, 1966 shows a similar array processor.
United States Patent 3,970,993, C.A. Firmila, COOPERATIVE-WORD LINEAR ARRAY PARALLEL
PROCESSOR, July 20, 1976 shows an array processor in which each processing element includes a nag register which can modify the oper-ations on the common control lines.
United States Patent 4,187,539, J.R. Eaton, PlPELlNED DATA
PROCESSIMG SYSTEM Wll~I CENTRALIZED MICROPRO-GRAM CONTROL, Feb. 5, 1980 shows a ph~ral dataflow pipeLined processor in which each dataflow includes a shift register which pro-vides a sequence of microinstructions, and a cornmon rnicroprogram control unit includes a flag register which helps keep the instruction S~ -size small by providing instruction information which does not cbange often.
United States Patent 4,287,566, GJ. Culler, ARRA' PROCESSOE~
WlTH PARALLEL OPERATIONS PER I~lSTRUCllON, Sept. 01, 1981, shows an array processor having subarrays used to calculate ~eclor addresses.
United States Patent 4,344,134, G.H. Barnes, PAR~IO~ABLE
PARALLEL PROCESSOR, August 10, 1982, shows a partitionable asray processor in which each processor in a node tree issues a ready signal when the dataflow has passed it, thus invoking the next instruc-tion.
IJnited States Paten~ 4,380,046, L-W. Fung, MASSIVELY PAR~L-LEL PROCESSOR COMPUTER, April 12, 1983, shows aa array processor with each processing element equipped with a mask bit reg-ister, identlfied as G-register, to disable the processing element and thus distinguish between e%ecuting the current instruction or no-operadon; that is, each processing element has a G-register with an OP/NOP flag ~85~
Unitad States Patent 4,467,40g, Potash et al, FLEXIBLE COMPUTER ARCHITECTURE USING ARRAYS OF
STANDARDIZED MICROPROCESSORS CUSTOMIZED FOR PIPELINE
AND PARALLEL OPERATIONS, August 21, 1984, shows a flexible architecture for a sequential processor, using standardized units with "soft functional structures" which customize a unit for a command.
The units thus can be manufactured as standard units and customized by means of a mask which sets contacts in the soft functional structure.
United States Patent 4,558,411, Farber et al, POLY~ORPHIC PROGRA~MABLE UNITS EMPLOYING PLURAL
LEVELS OF SUB-INSTRUCTION SETS, December 10, 1985, shows a multiplelevel programmable unit to provide a hierarchy of sub-instruction sets of microprogramming, to change, for example, from input output mode to processing mode or to execute programs written in differing languages.
United States Patent 4,739,474, Holsztynski, DATA
PROCESSING CELLS AMD PARALLEL DATA PROCESSORS
INCORPORATING SUCH CELI.S, shows an array processor in which each processing element includes a full adder and storage devices for N-S (north-south), E-W
8~
(east-west), and C (carry), so that tbe processing elernent can carry out both arithmetic and logic functions.
U.S.S.R. Author's Certificate Number 83-721416/30, ASSOCIA-- TlVE PROCESSORS MICROPROGRAM CONTROL APPARA-TUS, Tbilisi Elva Combine, September 15, 1982, shows first and second control instruction registers in instruction memory to aliow the same rnicroinstruction to be used for different instructiorls, reducing the overall volume of memory.
Davis et al, SYSTOLIC ARRAY CHlP MATCHES THE PACE OF
HICH-SPEED PROCESSING, Electronic Design, October 31, 1984, pp 207-218, shows a represer~tative array processor.
NCR GEO~ETRIC ARI~IMETIC PARALLEL PROCESSOR, product spccification NCR45CG72, NCR Corp., Dayton, OH, 1984, pp. 1-12, shows physical characteristics of a representative array processor.
Cloud et al, HIGHER EFFICIENCY FOR PARALLEL PROCESS-ORS, IEEE Southcon, reprint published by NCR Corporation Micro-electror,'cs Div., Fort Collins, CO, pp. 1-7, shows details of operation of NCR's geometric arithmetic parallel processor (GAPP).
The prior art shows a variety of array processors, with individual processing elements controllable externally in a variety of ma~ners, aod with the possi-bility of OP/NOP according to a nag in the individual processing element--but the prior art does not teach the use of instruction adaptation within each individual adaptive processing element to make an array processor dynam-ically optimizable to spatial and data dependencies through derived instruc-tion within the adaptive processing elernent.
Current computer systems are categorized, according to instruction stream and data stream, into four classes. They are:
o SISD (Single Instmction stream Single Data stream).
SIMD (Single Instruction stream Multiple Data stream).
o MISD (Multiple Instruction stream Single Data stream).
M~vlD (Multiple Instructioa stream Multiple Data stream).
E~cept for SISD, these architectures are paraUel processing systems. How-ever, none of them can perform parallel operations which are adaptive to the spatial condition of a processing element (spatial adaptation, e.g. data are at Y09~5057 - 6-~26~
the border of an image or the processing element is at tbe first column of an array). Neither can they perforrn parallel operations adaptive to the nature of tbe data (data adaptation, e.g. data positive / data negative; flag true / nag false).
Supercomputers are commercially available now and exemplified by the Cyber series from CDC, the CRAY series from CRAY Research and the ~EC AP series. All these machines are of MISD architecture and require a long setup time for setting up the instruction pipe to process a vector. The overhead is large if the frequency of the pipe setup is high or the vector is short; the performance is consequently low in such cases.
Data dependence in a loop degrades the perforrnance of these supercomput-ers. The machines are either prevented from presetting the pipe until the data dependence is resolved (e.g. status is known exactly) or will set up the pipe for one path (e.g., status is true) with higher probability. The former case delays the execution while the latter case involves the resetting of the pipe (i.e. increase the pipe setup frequency) if the "guess" is wrong. Both cases degrade the performance.
The lack of spatial and/or data adaptation leads to the following drawbacks:
6~3SS~
l. Data-dependent operations are processed sequentiaUy, which leads to a waste of the paraUel hardware, hence to lower performance;
2. Data with spatial significance are treated as exception, which prevent the parallel opportunity;
3. Interconnections of parallel computers are fixed, which restricts the al-gorithrn versatility;
4. Complementary operations (e.g. SEND/RECEIVE pair) caused by data or spatial dependence are performed sequeotially, which implies longer execution time;
5. Commuoication bandwidth is accordingly wasted;
6. Different copies ot the program n~ust be generated for processing ele-ments (PEs) with different spatial conditions, wbich leads to larger software effort.
The prior art does not teach nor suggest the invenLion, which provides for instruction adaptation at ~he processing element level for spatial and data dependencies, by providing each of a finite number of processing elements with conditional ins[ruction modification means.
To facilitate a quick understanding of the invention, it is helpful to describe the situations where data-dependent parallel processing and spatial-85~
dependent paraUel processing are involved, and where improved solutions, such as by means of the invention, are most desirable.
With adaptive instruction processing, the above problem c~uld be handled in a paraUel fashion as foUows:
An instruction is defined as +/- (add or subtract) while, using the "status" as the "agreement bit," the derived iustruction is defined as + (add) if the "status" is true, or is defirled as - (subtract) if the "sta-tus" is false. The loop with data dependence can then be rewritten as for (i=0; i<300; i++) for (j=0; j~500; j++) c[i,j]= a[i,j]+/-b[i,j];
and paraUel processing can be applied efficiently.
This example demonstrates one instaace of how data dependence can be re-solved, and how the data depeadent loops that were processed sequentiaUy can now be paralleli-~ed. The opportunity of exploiting tbe parallelism that involves data dependence is noi limited to the above example and is much wider in appUcation.
8~
~1) Data-Dependent Parallel Processing Current parallel computers are efficient in processing a loop with a long run-r~ing index, but do not support efficiently the 1Oop witb data dependence.
Specifically, one type of data independent loop is shown as follows:
for (i=0; i<300; i++) for (j=0; j<S; i+~) c[i,j~=a[i,;]-~b[i,j];
The processing can be very well supported by most known SlhID, MISD and MIMD machines.
But when the data dependence is added to the program as sho~vn in the fol-lowing, no existing patallel macbines can handle it efficiently.
for (i,0; i<300; i++) for (j=0; j~500; j-~+) if (status) c[i,j]=~a[i,j]+b[i,j];
else c[i,j]= a[i,j]-b[i,j];
Yosssos7 - lo-~Z6~
(2) Spat~ dependent P(lral/el ~rocess~
In image processing and other applications where data are associated with spatial conditions, data are not handled homogeneously, but rather are sub-ject to their spatial depeodence. For example, the data on the boundary of an image are treated differently from the other non-boundary data; tbis is one type of spatial dependence reflected by data. Io this situation, there are two major drawbacks when applying parallel processing:
1. The degree of parallelism can only be extended to the homogeneous part of the data; consequently, the non-homogeneous data (e.g. boundary) are forced to be processed sequentially. This degrades the performance of a parallel system;
2. The program (or coding) for the non-homogeneous data differs from the program for ~heir homogeneous counterpart; therefore, more than one copy of the coding must be prepared. This leads to larger software ef-fort.
U the instmction could be adapted, processing with the spatial dependence could solve problems such as the above problem as follows:
for (i=0; i<300; i++) YO985057 - Il-a2~;5 îor (i=O; i<s; j++) if (spatial-condition) action l;
then action 2.
With adaptive instruction processing, both drawbacks could be removed. llle spatial condition for the non-homogeneity, for example the boundary, could be expressed as conditions such as ~<B or x>N-B or y<B or y>N-B, where (x,y) is the coordinate of a pixel of an NxN image and B is the width of the boundary.
SUMMAl?.Y OF THE INVENTION
, The object of the invention is to provide an image processor architecture which speeds the operation of the image processing system by eliminating delays related to spatial or data dependence.
A feature of the invention is the provision, in each of a finite number of processing elements, of an instruction adapter with addressable instmction derivation means, responsive to the composite of original instruction and bit values in the da~a stream.
An advantage of the invention is that the image processing system is self-adaptive to the parameters of the data stream, in tbat an individual processing element may derive different instructions adapted to the spatial or data pa-rameters of different items as the items arrive for processing at the individual processing element.
Another advantage of the invention is that it does not require processing de-lay time for data dependencies.
The foregoing and other objects, features and advantages of the invention will be apparent from the more particular description of the preferred em-bodiments of the invention, as illustrated in the accompanying drawings.
Y09~5057 ~3 55~
.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of an array processor with one representative adaptive processing element presented in block diagram according to the in-vention.
FIG. 2 is a detailed block diagram of the instruction adapter L~ of FIG. 1.
FIG. 3 is a detailed block diagram of an instruction adapter according to a sirnplified embodiment.
FIG. 4 is a detailed block diagram of an instnuction adapter for a multidi-mensional network array processor.
FIG. 5 is a detailed block diagram of a further embodiment of the instmction adapter.
YO985057 - 1~-6E~
DESCRIPTION OF A PREFERRED EMBODlMENr OF THE IN-VENTION
The adaptive instruction processor assigns a processor identification (PID) to each adaptive processing element in the parallel computer. It collects a set of data-dependent-status ~DDS) bits from the arithmetic & logic unit (ALU) of the adaptive processing element. It then uses an Instnuction Adapter (IA) to derive the instruction subject to the spatial dependence and data depend-ence reflected by the PID, DDS and the original instruction.
FIG. I shows the array processor, which is similar in concept and operation to array processors known in the art, except insofar as handling of data de-pendencies and spatial dependencies is concerned. The image processor comprises adaptive processing element array 1 and array controller 2, both of which are shown schemntically. Array controller 2 provides a set of in-stmctions on instruction line 3, and provides a communication path 4 for in-put and output data. Processing element nrray 1 contains a great nmnber of individuai adaptive processing elements 5, each of which is equipped for ad-aptation accordin~ to this invention. Processing element array I might con-tain conventional processing elements 6, so long as there is awareness of which type of processing element is in each location, and housekeeping is S5~
done accordingly, but this is Dot preferred. It is preferred to make all the processing elements identical adaptive processing elements, and where con-ventional performance is desired, to adapt those elements, for example, those identified as elements 6, to perform conventionally Image processing proceeds in conventioaal image processor fashion so far as dataflow is concerned; data enter the processing element array and flow from processing element to processing element in accordance with initial data val-ues as modified by processing elements through which the data pass, without intervening access to system memory. The individual processing elements are set up prior to dataflow cornmencement. In a conventional image processor, there is uttle chance to alter the setup during execution because for all prac-tical purposes the exact position of data during execution is unknown. In es-sence, an image processor, once set up, is for the dura~ion of the execution a speciali~ed, fixed operation computer. This invention provides for dynamic changes of setup during execution, by equipping each of the multiplicity of adaptable adaptive processing elements with its own processor identification register, its own data dependent status register, and its own instruction ad-aptation mechanism which is responsive to the composite of processor iden-tification data, status data, and the applied instructioa to provide internal selection of operation for the adaptive processing element. On a system ba-~2~~SS~
sis, this provides convenient adaptation to spatial and data dependencies, which permits system optimization for the type of data being processed.
A representative one of the many adaptive processing elements 5, adaptive processing element 7, is shown in greater detail. Communication path 70, local memory 71, and arithmetic and logic unit (ALU) 7~ are conventional, similar in scope and effect to the analogous items in a conventional processing element. In general, these items can function to accept an assignraent (in-struction) and execute the assignment on each item of data as it is presented.
The cornputation cycle, shown simplified in time line 8, uses an original in-struction, showa simplified in original instruction view 9, to derive an in-struction for execution. In operation, the original inStruction, spatial dependent status and data dependent status values are available early in a cycle, as shown by values X,Y,Z on time line 8. The derived instruction then becomes available for control of computation C.
Adaptability is provided by instruction adapter (lA) 73, which accepts from original ~struction line 74 an instruction invoking adaptability, per~orrns ap-propriate tests on data as presented, detennines from test results whether to adapt by deriving a substitute instruction, and provides the substitute derived in~struction to ALU 7~ on derived instruction line 75. The ALU includes or Y09~5057 - 17 -~2~
serves as a data dependent status register to provide fresbly processed data depeDdent status bits back on DDS test data line 76 to L~ 73. Tbe adapt-ability invoking instruction, presented on adaptability control line 74, is made available to spatial dependent status bloc~ (SDS) 77, to dependence select and verify block (DSV) 78, and to instruction modify and e~tend block (~hIE) 79. Tbe instruction adapter (IA) block 73 accepts an original in-stnuction (of the type which invokes its own modification under specified circurnstances) and generates, as output, a derived instruction on line 75 to ALU block 72. This controls the operations of the adaptive processing ele-ment.
A typical instruction set, such as that shown in NCR45CG72, 1984, at page
The prior art does not teach nor suggest the invenLion, which provides for instruction adaptation at ~he processing element level for spatial and data dependencies, by providing each of a finite number of processing elements with conditional ins[ruction modification means.
To facilitate a quick understanding of the invention, it is helpful to describe the situations where data-dependent parallel processing and spatial-85~
dependent paraUel processing are involved, and where improved solutions, such as by means of the invention, are most desirable.
With adaptive instruction processing, the above problem c~uld be handled in a paraUel fashion as foUows:
An instruction is defined as +/- (add or subtract) while, using the "status" as the "agreement bit," the derived iustruction is defined as + (add) if the "status" is true, or is defirled as - (subtract) if the "sta-tus" is false. The loop with data dependence can then be rewritten as for (i=0; i<300; i++) for (j=0; j~500; j++) c[i,j]= a[i,j]+/-b[i,j];
and paraUel processing can be applied efficiently.
This example demonstrates one instaace of how data dependence can be re-solved, and how the data depeadent loops that were processed sequentiaUy can now be paralleli-~ed. The opportunity of exploiting tbe parallelism that involves data dependence is noi limited to the above example and is much wider in appUcation.
8~
~1) Data-Dependent Parallel Processing Current parallel computers are efficient in processing a loop with a long run-r~ing index, but do not support efficiently the 1Oop witb data dependence.
Specifically, one type of data independent loop is shown as follows:
for (i=0; i<300; i++) for (j=0; j<S; i+~) c[i,j~=a[i,;]-~b[i,j];
The processing can be very well supported by most known SlhID, MISD and MIMD machines.
But when the data dependence is added to the program as sho~vn in the fol-lowing, no existing patallel macbines can handle it efficiently.
for (i,0; i<300; i++) for (j=0; j~500; j-~+) if (status) c[i,j]=~a[i,j]+b[i,j];
else c[i,j]= a[i,j]-b[i,j];
Yosssos7 - lo-~Z6~
(2) Spat~ dependent P(lral/el ~rocess~
In image processing and other applications where data are associated with spatial conditions, data are not handled homogeneously, but rather are sub-ject to their spatial depeodence. For example, the data on the boundary of an image are treated differently from the other non-boundary data; tbis is one type of spatial dependence reflected by data. Io this situation, there are two major drawbacks when applying parallel processing:
1. The degree of parallelism can only be extended to the homogeneous part of the data; consequently, the non-homogeneous data (e.g. boundary) are forced to be processed sequentially. This degrades the performance of a parallel system;
2. The program (or coding) for the non-homogeneous data differs from the program for ~heir homogeneous counterpart; therefore, more than one copy of the coding must be prepared. This leads to larger software ef-fort.
U the instmction could be adapted, processing with the spatial dependence could solve problems such as the above problem as follows:
for (i=0; i<300; i++) YO985057 - Il-a2~;5 îor (i=O; i<s; j++) if (spatial-condition) action l;
then action 2.
With adaptive instruction processing, both drawbacks could be removed. llle spatial condition for the non-homogeneity, for example the boundary, could be expressed as conditions such as ~<B or x>N-B or y<B or y>N-B, where (x,y) is the coordinate of a pixel of an NxN image and B is the width of the boundary.
SUMMAl?.Y OF THE INVENTION
, The object of the invention is to provide an image processor architecture which speeds the operation of the image processing system by eliminating delays related to spatial or data dependence.
A feature of the invention is the provision, in each of a finite number of processing elements, of an instruction adapter with addressable instmction derivation means, responsive to the composite of original instruction and bit values in the da~a stream.
An advantage of the invention is that the image processing system is self-adaptive to the parameters of the data stream, in tbat an individual processing element may derive different instructions adapted to the spatial or data pa-rameters of different items as the items arrive for processing at the individual processing element.
Another advantage of the invention is that it does not require processing de-lay time for data dependencies.
The foregoing and other objects, features and advantages of the invention will be apparent from the more particular description of the preferred em-bodiments of the invention, as illustrated in the accompanying drawings.
Y09~5057 ~3 55~
.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of an array processor with one representative adaptive processing element presented in block diagram according to the in-vention.
FIG. 2 is a detailed block diagram of the instruction adapter L~ of FIG. 1.
FIG. 3 is a detailed block diagram of an instruction adapter according to a sirnplified embodiment.
FIG. 4 is a detailed block diagram of an instnuction adapter for a multidi-mensional network array processor.
FIG. 5 is a detailed block diagram of a further embodiment of the instmction adapter.
YO985057 - 1~-6E~
DESCRIPTION OF A PREFERRED EMBODlMENr OF THE IN-VENTION
The adaptive instruction processor assigns a processor identification (PID) to each adaptive processing element in the parallel computer. It collects a set of data-dependent-status ~DDS) bits from the arithmetic & logic unit (ALU) of the adaptive processing element. It then uses an Instnuction Adapter (IA) to derive the instruction subject to the spatial dependence and data depend-ence reflected by the PID, DDS and the original instruction.
FIG. I shows the array processor, which is similar in concept and operation to array processors known in the art, except insofar as handling of data de-pendencies and spatial dependencies is concerned. The image processor comprises adaptive processing element array 1 and array controller 2, both of which are shown schemntically. Array controller 2 provides a set of in-stmctions on instruction line 3, and provides a communication path 4 for in-put and output data. Processing element nrray 1 contains a great nmnber of individuai adaptive processing elements 5, each of which is equipped for ad-aptation accordin~ to this invention. Processing element array I might con-tain conventional processing elements 6, so long as there is awareness of which type of processing element is in each location, and housekeeping is S5~
done accordingly, but this is Dot preferred. It is preferred to make all the processing elements identical adaptive processing elements, and where con-ventional performance is desired, to adapt those elements, for example, those identified as elements 6, to perform conventionally Image processing proceeds in conventioaal image processor fashion so far as dataflow is concerned; data enter the processing element array and flow from processing element to processing element in accordance with initial data val-ues as modified by processing elements through which the data pass, without intervening access to system memory. The individual processing elements are set up prior to dataflow cornmencement. In a conventional image processor, there is uttle chance to alter the setup during execution because for all prac-tical purposes the exact position of data during execution is unknown. In es-sence, an image processor, once set up, is for the dura~ion of the execution a speciali~ed, fixed operation computer. This invention provides for dynamic changes of setup during execution, by equipping each of the multiplicity of adaptable adaptive processing elements with its own processor identification register, its own data dependent status register, and its own instruction ad-aptation mechanism which is responsive to the composite of processor iden-tification data, status data, and the applied instructioa to provide internal selection of operation for the adaptive processing element. On a system ba-~2~~SS~
sis, this provides convenient adaptation to spatial and data dependencies, which permits system optimization for the type of data being processed.
A representative one of the many adaptive processing elements 5, adaptive processing element 7, is shown in greater detail. Communication path 70, local memory 71, and arithmetic and logic unit (ALU) 7~ are conventional, similar in scope and effect to the analogous items in a conventional processing element. In general, these items can function to accept an assignraent (in-struction) and execute the assignment on each item of data as it is presented.
The cornputation cycle, shown simplified in time line 8, uses an original in-struction, showa simplified in original instruction view 9, to derive an in-struction for execution. In operation, the original inStruction, spatial dependent status and data dependent status values are available early in a cycle, as shown by values X,Y,Z on time line 8. The derived instruction then becomes available for control of computation C.
Adaptability is provided by instruction adapter (lA) 73, which accepts from original ~struction line 74 an instruction invoking adaptability, per~orrns ap-propriate tests on data as presented, detennines from test results whether to adapt by deriving a substitute instruction, and provides the substitute derived in~struction to ALU 7~ on derived instruction line 75. The ALU includes or Y09~5057 - 17 -~2~
serves as a data dependent status register to provide fresbly processed data depeDdent status bits back on DDS test data line 76 to L~ 73. Tbe adapt-ability invoking instruction, presented on adaptability control line 74, is made available to spatial dependent status bloc~ (SDS) 77, to dependence select and verify block (DSV) 78, and to instruction modify and e~tend block (~hIE) 79. Tbe instruction adapter (IA) block 73 accepts an original in-stnuction (of the type which invokes its own modification under specified circurnstances) and generates, as output, a derived instruction on line 75 to ALU block 72. This controls the operations of the adaptive processing ele-ment.
A typical instruction set, such as that shown in NCR45CG72, 1984, at page
7, includes a micro-NOP instmction, several load/store instnuctions for inter-processing-element communication, and arithmetic/logic operations.
FIG. 1 iUustrates the instmction forn1at of the derived instnuction at inset derived instruction view 80. The derived iastmction has an "agreement bit"
inserted to the input instmction in one or more prescribed bit positions. The "agreement bit" is a function of tbe PlID, DDS and the input instruction while the "prescribed bit position" can be predetermined from the format of the input instnuction and the derived instnuction. Tbe "agreement bit" can also overwrite the bits of the input instmction at the prescribed positions.
:a~26~35~
The individual adaptive processing element remains quite simple. l'hree ma-jor building blocks are registers, shift registers and multiplexers, all of which are cornmon parts available from a number of suppliers and farniliar to those skilled in the art. Such parts are described in T~ l'rL DATA BOOK FOR
DESIGN ENGINEERS, Second Edition, Texas Instruments Corporation, LCC4112 74062-116-AI, pp. 7-471;7-316; and 7-181. The following ex-amples are typical of building blocks appropriate for selection:
Part Number Buildine Block 74374 8-bit regis~er 74194 4-bit bidirectionaJ shift register 74157 multiplexer FIG.2 shows the structure of the instruction adapter (IA) 73, whicb contains three functional blocks as follows:
1. Spatial dependent status block (SDS) 77;
2. Dependence select and verify block (DSV) 78; and 3. Instruction modify and extend block (B) 79. - -The Spatial Dependence Status (SDS) block 77 accepts part of the input in-struction as control and produces SDS bits as output to indicate the spatial dependence. This block contains a P~D register 81 whose content is the PID
of the PE. The PrD register can be preloaded by the input instruction. The ~85S~
content of the PID register 81 must be correlated to spatial location, for e~-ample x-y coordinates. It also contains a shift register 82 of the same size as PID register 81. The shift register 82 can perform a logic-sbift operation, one bit at a tirne, in either direction. This mechanism aUows for any bit group of the PID register to be available on line 88 as the input to the DSV block 78.
The second functional block is the "Dependence Select and Verify (DSV)"
block 78. The DSV block 78 contains a multiple~er 83 to select some of the SDS bits or DDS bits. A template register 84 is included in this block for matching, masking and comparing purpose. The selected dependent bits and the template are passed to the "agreement verifier" 85 to generate an "agreement bit." Template register 84 contains decision threshold informa-tion, preset at initialization. In a typical operation, all template registers are set to the same value to mask a certain subgroup of bits. On a particular cy-cle, aU shift registers are operated similarly, according to instruction, to ac-complish sampling of a group of bits in the related PID register and align those bits appropriately.
The template register imd shift register together function so as to select the interested bit~s from the PID.
~;26~5~
The usual operation is as a movable window, iacluding two consecutive bits to assign the bardware differently for different items of iaterest or different image subsets.
The agreement verifier can perform COMPARE, AND, OR and XOR oper-ations. In summary, the DSV block accepts SDS bits and DDS bits as input, and generates the "agreement bit" as output The DSV block accepts part oS
the input instruction as control.
The DDS bits indicate the nature of the data and are used for data adaptation.
Comu~on DDS bits iDclude positive/negative, zero/non-zero, positive/aon-positive, true/false, greater/equal/less, even/odd. Any other status that can be derived from tl~e Arithmetic Logic IJnit (ALU~ 72 of the adaptable proc-essing element 7 can be identified by status bits.
The third block is the "instruction modify and e~tend (IME)" block 79. The ~E block 79 accepts the agreement bit and the input instructioa as inputs, and generates the derived iastruction as Olltpllt. The l~fE block 79 has a "bit overwriter" to replace some bits of the input instruction by the agreement bit at the prescribed positions. The block also has a "bit inserter" which inserts the agreement bit into the input instruction at the prescribed positions. The ~L~6~
selection of overwriting or insertion or both, and the prescribing of the posi-tions, are controlled by part of the input instruction.
With these thtee functional blocks, the IA can perform the following "de-pendent operations" to facilitate the spatial and/or data adaptation:
( I) detecting any one bit of DDS being " I";
(2) detecting the i-th bit of the PII) register being "I"; and (3) detecting any M contiguous bits of the PID register matching to the template where M is smaller or equal to the total number of bits of the PID.
The above-described generic embodiment encompasses the invention. Spe-cific implementations can be a partial collection of each functiooal block.
Implementalion of Simplified A~lapti~le Processing E/ement A very efficient simplified implementatioD of the Instruction Adaptor (IA) 73 is illustrated in Figure 3. The content of the PID register 81 is copied to a logical shift register (LSR) 91 which can perform logic-shift operations hl both directions. The bits shifted out fronl them and the rightmost bit of the ~61~5~i~
logical shift-register 91 are the SDS bits. A multiple~er then selects one of the SDS or DDS bit as the agreement bit. Sucb an implementation can per-form one dimensional and two dimensional operations. This implementation is suitable for an adaptive processing element with less complexity, because of the eliminatioD of the template register 84 (FIG. 2) and the agreement verifier 85.
Implementation for Multidimensional Ne~work FIG. 4 shows the spatial dependent status block generaiized to detect spatial dependence of a multidimensional interconnection network of a parallel computer, such as aD array in 2D or a pyranlid in 3D. The PID register may be considered as comprising K sections l-K. The shift register 92, which is analogous to the sbift registers 82 and 91 in FlGs. 1-2, in SDS block 77, is partitioned into K seCtiODS, where K is the dimension of the network. Each section manipulates the spatial dependence for one dimension of the network.
The rightmost and the leftmost bits of each section are SDS bits which are passed to the multiplexer 83 in the DSV block 78 for spatial dependence de-tection.
APPLICATIONS AND BENEElI'S
~85S~
(IJ Data-Dependent and Spatial-Dependent Parallel Yrocessing With adaptive iastruction processing according to the invention, one can re-structure as follows:
for (i=0; i<300; i++) for (j=0; j~S00; j++) if (condition~
action l;
else action 2; (if action I and action 2 are complementary) into for(i=0; i<300; i++) for(j=~0; j<500; j++) action 3.
Complementary operations are possible. Aa example of the complementary operation is the ~/- (add or subtract) pair. The "agreement" bit is derived iS4 from the "condition," which can be data-dependent or spatial-dependent.
As the result of the restructuring, the sequential e~ecution of action 1/action 2 can be totally parallelLzed into one unified action 3.
The problem can also be restructured in the following way:
An "agreement" bit is derived froln the "condition" and is used to generate an "address offset" by either overwriting or inserting the ap-propriate fie]d of tbe input Lnstmction. The code for action 1 and action 2 are then structured D distaoce apart where D is equal to the "address offset." During the runtiune, each PE wiU "jump" to the right entry of the code according to the spatial condition of the data. Note that only one version of the coding is necessary. A non-homogeneous problem clue to the spatial dependence can be converted to a homoge-neous one by this invention, so that the degree of parallelLsm can be more extensive and software effort can be reduced.
(~) Universal l~/etwork ~:mulation This inveatioa resolves the spatial dependence due to the relative or absolute position (or coordLnate) of PEs in a parallel computer which has one single fLYed baseline network to connect the PEs in the system.
85S~
It is advantageous from an algorithnl development point of view to have more than one network (or intercoMection) embedded in a parallel processing system, because no single network optimally matches to various algorithms.
Io a parallel system with such network emulation capability, a Detwork can be emulated from the baseline intercormectioo via using the adaptive in-struction processor.
Eor example, the pyralrtid network can be emulated from the baseline array interconnection of size NxN by the adaptive instruction processor. The PID
register of each adaptive processing element in the array is loaded with the appropriate Cartesian coordinate value (x, y). At time t, 1, all PEs are active.
At t32, only adaptive processing elements with x or y equaLs to the multiple of 2 are active. At t,3, only a~laptive processing elements with x or y equaLs to the multiple of 4 are active. In summary, at t=i only adaptive processing elements with x or y equaLs to the multiple of 2x(i-1) are active. Tbe above-described procedure emulates a pyramid necwork of shrinkage 2 (i.e. for ev-ery flutber time step, only 1/4 of the total PEs are connected wilh distance one) from the baseline array network. The control of the emula~ion can be done by examining the conCent of the two-dimeosional PID register and ac-tivates the PE if the "agreement" bit is true.
~26E~S~
The same adaptive instruction processor can also be applied to the emula~ion of networks such as tree, ring and perfect shuffle etc. from a baseline inter-connection.
The network emulation is a very powerful mechanism in producing multiple parallel architectures from a single PE type. Along with the economic and logistic benefits, the algorithm can have a better matcb to the arcbitecture, so that the performance can be increased. Because only one element type is required, such emulation scheme is especially suitable for ~LSI implementa-tion.
(3) Complementary O,oera~ions A complementary operation is a pair of operations that can be distributed between two adaptive processing elements and executed simultaneously.
Examples of complementary operations include the SEND/RECEIVE pair for interprocessor communica~ion and the Butterfly computation of FFT
(Fast Fourier Transform). Another example is the +/- instruction described ~n application (1). When applied to the complementary operations, the adaptive instruction scheme can speed up the execution aod save communi-cation bandwidth.
~Z~8S5~
Consider the SEND/RECEIVE example for the SIMD array architecture.
One adaptive processing element (based on the knowledge of its position in the array, i.e., spatial knowledge) sends data through an interconnection to its neighbor PE while its neighbor PE (based on this spatial knowledge) re-eeives these data through the same interconnection. The operation can be accomplished in one cycle using one unit of conununication bandwidth for one datum. For the SIMD architecture without topology-dependent adapta-tion, the same operation needs two execution cycles and/or two units of communication bandwidth.
In the case of Butterfly computation, sequential execution in a conventional image processor may require 4 eycles. With the adaptive instruction processor, it can be reduced to 2 cycles, (with 2 adaptive processing ele-ments) or even to I cycle (with ~ adaptive processing elements). This is achieved by first sending the instruction ADD/SUB to all adaptive processing elements. Each adaptive processing element then maps the instmction into either ADD or SliB according to its spatial condition.
Application of eomplementary operations are not limited to the above exam-ples. In fact, many other applications can be fo~und in image processing, computer vision, digital signal processing, matbematical transformation and general scientific computation.
~6~S~
f4) Adaptive Supercomputer Adaptive instruction processiag can resolve the data dependent problem and increase the performance of a supercomputer. An instruction called OPA/OPB (operation A or operation B) can be defined and one agreement bit selected to adapt the instruction. The derived instructioa will e~ecute OPA
if the agreement bit is "I" or OPB if the agreement bit is "0." The pipe within the supercomputers can be set up jD advance for instruction OPA/OPB and the vector execution can be started once the agree~nent bit is available. The probabiUty of "guessing right" is always "I " because of the data adaptation.
IPI,E ILLUST~TING THE II~VENTION
This example illnstrates the adaptive instruction processor operating in a spatial adaptation case. With the aid of Figure 5, this example shows how an original instruction is modified as a function of the Processor IDentification (PID). As a result, the following computation for (i=0; i<300; i++) for (j=0; j<500; j+~) if (PID<0> =~ 1) c[i,j] = a[i,j] + b[i,j]
ss~
else C[i,j] 3 a[i,j] - b~i,j]
can be executed efficiently.
A simplified original instruction and derived instructions are depicted first as a background; then a design realizing the invention is detailed in Figure 5 with companion description.
Instruction Forrnat Anlong the M bits of the original instruction. the K-th bit controls the action of either SEND/REC or +/- as described below:
If bit~K>=l, then PE performs SEND when the PE's local status S is true;
PE performs REC when the PE's local status S is false;
If bit<K>= 0, then PE performs "+" when PE's local status S is true;
PE performs "-" when the PE's local status S is false. `~
In contrast to the original instruction, the derive(l instruction has two bits, the r-th and tbe r+ I -th bit, pertinent to this example. The action of the adaptive processing element is prescribed as foUows:
~21~;5,~
r-th r+ 1 -th action O O
O I +
SEND
For instance, the PE perfosms "+" when the r-th bit of the modified in-struction is "0" and the r+l-th bit is "1."
A Sample Design for the Invention Figure 5 shows another embodiment using multiple multiplexers. A portion of the original instmction commands the shift register 91, which cor~tairJs the processor identification PID, to SHIFT RIGHT one bit position; as a result, the least significant bit in the processor ideatifiction register (LSBPID) is placed in one of the inputs to the Dependence Selection and Verify (DSV) Block 78.
Figure 5 illustrates a sample design to translate the K-th bit (I<K>) of the original instruction into the r-th and the r-tl-th bits (IM<r> and IM<r+l>) ~68~
s)f the rnodified instruction accordiDg to the Least Significant Bit of P~D
(LSBPID, i.e. PID<0>~.
Anotber portion of the original instruction on adaptability control lines 74' then commands the DSV block 78 to SELECT the LSBPID as the output of the multiplexer in DSV Block. Consequently, LSBPII:) is placed on the line of "agreement bit."
As the inputs of the Instruction Modify and Extend (IME) Block 79, the "agreement bit" (now carrying the LSBPID) and the original instruction (carrying l<K> and otber bits) are routed into a set of multiplexers 84-86 to produce the derived instruction. J mllltiplexers are required for a J-bit de-rived instruction, one multiplexer to produce one bit of the derived instruc-tion. To produce [M<r>7 the original instmction command the "multiplexer r" to SELECT l<K> as the output. Sitmilarly, the original instruction com-mands "multiplexer r+ I" to SELECr the "agreement bit" as IM<r~ 1>.
A table below shows the relationship among the original ins~ruction (I<K>), the L,SBP~D and the modified instruction (IM~r> and IM<r+l>), wllicb demonstrates the reali~ation of the invention via the design illustrated in Figure 5.
;54 I<K> LSE~PID IM<r> IM~r+l>
" " o O O O
"+" O I 0 REC I O I
SEND
Thus, while the iovention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various cbanges in form and details may be made without departing from the scope of the invention.
Y09~5057 - 33 -
FIG. 1 iUustrates the instmction forn1at of the derived instnuction at inset derived instruction view 80. The derived iastmction has an "agreement bit"
inserted to the input instmction in one or more prescribed bit positions. The "agreement bit" is a function of tbe PlID, DDS and the input instruction while the "prescribed bit position" can be predetermined from the format of the input instnuction and the derived instnuction. Tbe "agreement bit" can also overwrite the bits of the input instmction at the prescribed positions.
:a~26~35~
The individual adaptive processing element remains quite simple. l'hree ma-jor building blocks are registers, shift registers and multiplexers, all of which are cornmon parts available from a number of suppliers and farniliar to those skilled in the art. Such parts are described in T~ l'rL DATA BOOK FOR
DESIGN ENGINEERS, Second Edition, Texas Instruments Corporation, LCC4112 74062-116-AI, pp. 7-471;7-316; and 7-181. The following ex-amples are typical of building blocks appropriate for selection:
Part Number Buildine Block 74374 8-bit regis~er 74194 4-bit bidirectionaJ shift register 74157 multiplexer FIG.2 shows the structure of the instruction adapter (IA) 73, whicb contains three functional blocks as follows:
1. Spatial dependent status block (SDS) 77;
2. Dependence select and verify block (DSV) 78; and 3. Instruction modify and extend block (B) 79. - -The Spatial Dependence Status (SDS) block 77 accepts part of the input in-struction as control and produces SDS bits as output to indicate the spatial dependence. This block contains a P~D register 81 whose content is the PID
of the PE. The PrD register can be preloaded by the input instruction. The ~85S~
content of the PID register 81 must be correlated to spatial location, for e~-ample x-y coordinates. It also contains a shift register 82 of the same size as PID register 81. The shift register 82 can perform a logic-sbift operation, one bit at a tirne, in either direction. This mechanism aUows for any bit group of the PID register to be available on line 88 as the input to the DSV block 78.
The second functional block is the "Dependence Select and Verify (DSV)"
block 78. The DSV block 78 contains a multiple~er 83 to select some of the SDS bits or DDS bits. A template register 84 is included in this block for matching, masking and comparing purpose. The selected dependent bits and the template are passed to the "agreement verifier" 85 to generate an "agreement bit." Template register 84 contains decision threshold informa-tion, preset at initialization. In a typical operation, all template registers are set to the same value to mask a certain subgroup of bits. On a particular cy-cle, aU shift registers are operated similarly, according to instruction, to ac-complish sampling of a group of bits in the related PID register and align those bits appropriately.
The template register imd shift register together function so as to select the interested bit~s from the PID.
~;26~5~
The usual operation is as a movable window, iacluding two consecutive bits to assign the bardware differently for different items of iaterest or different image subsets.
The agreement verifier can perform COMPARE, AND, OR and XOR oper-ations. In summary, the DSV block accepts SDS bits and DDS bits as input, and generates the "agreement bit" as output The DSV block accepts part oS
the input instruction as control.
The DDS bits indicate the nature of the data and are used for data adaptation.
Comu~on DDS bits iDclude positive/negative, zero/non-zero, positive/aon-positive, true/false, greater/equal/less, even/odd. Any other status that can be derived from tl~e Arithmetic Logic IJnit (ALU~ 72 of the adaptable proc-essing element 7 can be identified by status bits.
The third block is the "instruction modify and e~tend (IME)" block 79. The ~E block 79 accepts the agreement bit and the input instructioa as inputs, and generates the derived iastruction as Olltpllt. The l~fE block 79 has a "bit overwriter" to replace some bits of the input instruction by the agreement bit at the prescribed positions. The block also has a "bit inserter" which inserts the agreement bit into the input instruction at the prescribed positions. The ~L~6~
selection of overwriting or insertion or both, and the prescribing of the posi-tions, are controlled by part of the input instruction.
With these thtee functional blocks, the IA can perform the following "de-pendent operations" to facilitate the spatial and/or data adaptation:
( I) detecting any one bit of DDS being " I";
(2) detecting the i-th bit of the PII) register being "I"; and (3) detecting any M contiguous bits of the PID register matching to the template where M is smaller or equal to the total number of bits of the PID.
The above-described generic embodiment encompasses the invention. Spe-cific implementations can be a partial collection of each functiooal block.
Implementalion of Simplified A~lapti~le Processing E/ement A very efficient simplified implementatioD of the Instruction Adaptor (IA) 73 is illustrated in Figure 3. The content of the PID register 81 is copied to a logical shift register (LSR) 91 which can perform logic-shift operations hl both directions. The bits shifted out fronl them and the rightmost bit of the ~61~5~i~
logical shift-register 91 are the SDS bits. A multiple~er then selects one of the SDS or DDS bit as the agreement bit. Sucb an implementation can per-form one dimensional and two dimensional operations. This implementation is suitable for an adaptive processing element with less complexity, because of the eliminatioD of the template register 84 (FIG. 2) and the agreement verifier 85.
Implementation for Multidimensional Ne~work FIG. 4 shows the spatial dependent status block generaiized to detect spatial dependence of a multidimensional interconnection network of a parallel computer, such as aD array in 2D or a pyranlid in 3D. The PID register may be considered as comprising K sections l-K. The shift register 92, which is analogous to the sbift registers 82 and 91 in FlGs. 1-2, in SDS block 77, is partitioned into K seCtiODS, where K is the dimension of the network. Each section manipulates the spatial dependence for one dimension of the network.
The rightmost and the leftmost bits of each section are SDS bits which are passed to the multiplexer 83 in the DSV block 78 for spatial dependence de-tection.
APPLICATIONS AND BENEElI'S
~85S~
(IJ Data-Dependent and Spatial-Dependent Parallel Yrocessing With adaptive iastruction processing according to the invention, one can re-structure as follows:
for (i=0; i<300; i++) for (j=0; j~S00; j++) if (condition~
action l;
else action 2; (if action I and action 2 are complementary) into for(i=0; i<300; i++) for(j=~0; j<500; j++) action 3.
Complementary operations are possible. Aa example of the complementary operation is the ~/- (add or subtract) pair. The "agreement" bit is derived iS4 from the "condition," which can be data-dependent or spatial-dependent.
As the result of the restructuring, the sequential e~ecution of action 1/action 2 can be totally parallelLzed into one unified action 3.
The problem can also be restructured in the following way:
An "agreement" bit is derived froln the "condition" and is used to generate an "address offset" by either overwriting or inserting the ap-propriate fie]d of tbe input Lnstmction. The code for action 1 and action 2 are then structured D distaoce apart where D is equal to the "address offset." During the runtiune, each PE wiU "jump" to the right entry of the code according to the spatial condition of the data. Note that only one version of the coding is necessary. A non-homogeneous problem clue to the spatial dependence can be converted to a homoge-neous one by this invention, so that the degree of parallelLsm can be more extensive and software effort can be reduced.
(~) Universal l~/etwork ~:mulation This inveatioa resolves the spatial dependence due to the relative or absolute position (or coordLnate) of PEs in a parallel computer which has one single fLYed baseline network to connect the PEs in the system.
85S~
It is advantageous from an algorithnl development point of view to have more than one network (or intercoMection) embedded in a parallel processing system, because no single network optimally matches to various algorithms.
Io a parallel system with such network emulation capability, a Detwork can be emulated from the baseline intercormectioo via using the adaptive in-struction processor.
Eor example, the pyralrtid network can be emulated from the baseline array interconnection of size NxN by the adaptive instruction processor. The PID
register of each adaptive processing element in the array is loaded with the appropriate Cartesian coordinate value (x, y). At time t, 1, all PEs are active.
At t32, only adaptive processing elements with x or y equaLs to the multiple of 2 are active. At t,3, only a~laptive processing elements with x or y equaLs to the multiple of 4 are active. In summary, at t=i only adaptive processing elements with x or y equaLs to the multiple of 2x(i-1) are active. Tbe above-described procedure emulates a pyramid necwork of shrinkage 2 (i.e. for ev-ery flutber time step, only 1/4 of the total PEs are connected wilh distance one) from the baseline array network. The control of the emula~ion can be done by examining the conCent of the two-dimeosional PID register and ac-tivates the PE if the "agreement" bit is true.
~26E~S~
The same adaptive instruction processor can also be applied to the emula~ion of networks such as tree, ring and perfect shuffle etc. from a baseline inter-connection.
The network emulation is a very powerful mechanism in producing multiple parallel architectures from a single PE type. Along with the economic and logistic benefits, the algorithm can have a better matcb to the arcbitecture, so that the performance can be increased. Because only one element type is required, such emulation scheme is especially suitable for ~LSI implementa-tion.
(3) Complementary O,oera~ions A complementary operation is a pair of operations that can be distributed between two adaptive processing elements and executed simultaneously.
Examples of complementary operations include the SEND/RECEIVE pair for interprocessor communica~ion and the Butterfly computation of FFT
(Fast Fourier Transform). Another example is the +/- instruction described ~n application (1). When applied to the complementary operations, the adaptive instruction scheme can speed up the execution aod save communi-cation bandwidth.
~Z~8S5~
Consider the SEND/RECEIVE example for the SIMD array architecture.
One adaptive processing element (based on the knowledge of its position in the array, i.e., spatial knowledge) sends data through an interconnection to its neighbor PE while its neighbor PE (based on this spatial knowledge) re-eeives these data through the same interconnection. The operation can be accomplished in one cycle using one unit of conununication bandwidth for one datum. For the SIMD architecture without topology-dependent adapta-tion, the same operation needs two execution cycles and/or two units of communication bandwidth.
In the case of Butterfly computation, sequential execution in a conventional image processor may require 4 eycles. With the adaptive instruction processor, it can be reduced to 2 cycles, (with 2 adaptive processing ele-ments) or even to I cycle (with ~ adaptive processing elements). This is achieved by first sending the instruction ADD/SUB to all adaptive processing elements. Each adaptive processing element then maps the instmction into either ADD or SliB according to its spatial condition.
Application of eomplementary operations are not limited to the above exam-ples. In fact, many other applications can be fo~und in image processing, computer vision, digital signal processing, matbematical transformation and general scientific computation.
~6~S~
f4) Adaptive Supercomputer Adaptive instruction processiag can resolve the data dependent problem and increase the performance of a supercomputer. An instruction called OPA/OPB (operation A or operation B) can be defined and one agreement bit selected to adapt the instruction. The derived instructioa will e~ecute OPA
if the agreement bit is "I" or OPB if the agreement bit is "0." The pipe within the supercomputers can be set up jD advance for instruction OPA/OPB and the vector execution can be started once the agree~nent bit is available. The probabiUty of "guessing right" is always "I " because of the data adaptation.
IPI,E ILLUST~TING THE II~VENTION
This example illnstrates the adaptive instruction processor operating in a spatial adaptation case. With the aid of Figure 5, this example shows how an original instruction is modified as a function of the Processor IDentification (PID). As a result, the following computation for (i=0; i<300; i++) for (j=0; j<500; j+~) if (PID<0> =~ 1) c[i,j] = a[i,j] + b[i,j]
ss~
else C[i,j] 3 a[i,j] - b~i,j]
can be executed efficiently.
A simplified original instruction and derived instructions are depicted first as a background; then a design realizing the invention is detailed in Figure 5 with companion description.
Instruction Forrnat Anlong the M bits of the original instruction. the K-th bit controls the action of either SEND/REC or +/- as described below:
If bit~K>=l, then PE performs SEND when the PE's local status S is true;
PE performs REC when the PE's local status S is false;
If bit<K>= 0, then PE performs "+" when PE's local status S is true;
PE performs "-" when the PE's local status S is false. `~
In contrast to the original instruction, the derive(l instruction has two bits, the r-th and tbe r+ I -th bit, pertinent to this example. The action of the adaptive processing element is prescribed as foUows:
~21~;5,~
r-th r+ 1 -th action O O
O I +
SEND
For instance, the PE perfosms "+" when the r-th bit of the modified in-struction is "0" and the r+l-th bit is "1."
A Sample Design for the Invention Figure 5 shows another embodiment using multiple multiplexers. A portion of the original instmction commands the shift register 91, which cor~tairJs the processor identification PID, to SHIFT RIGHT one bit position; as a result, the least significant bit in the processor ideatifiction register (LSBPID) is placed in one of the inputs to the Dependence Selection and Verify (DSV) Block 78.
Figure 5 illustrates a sample design to translate the K-th bit (I<K>) of the original instruction into the r-th and the r-tl-th bits (IM<r> and IM<r+l>) ~68~
s)f the rnodified instruction accordiDg to the Least Significant Bit of P~D
(LSBPID, i.e. PID<0>~.
Anotber portion of the original instruction on adaptability control lines 74' then commands the DSV block 78 to SELECT the LSBPID as the output of the multiplexer in DSV Block. Consequently, LSBPII:) is placed on the line of "agreement bit."
As the inputs of the Instruction Modify and Extend (IME) Block 79, the "agreement bit" (now carrying the LSBPID) and the original instruction (carrying l<K> and otber bits) are routed into a set of multiplexers 84-86 to produce the derived instruction. J mllltiplexers are required for a J-bit de-rived instruction, one multiplexer to produce one bit of the derived instruc-tion. To produce [M<r>7 the original instmction command the "multiplexer r" to SELECT l<K> as the output. Sitmilarly, the original instruction com-mands "multiplexer r+ I" to SELECr the "agreement bit" as IM<r~ 1>.
A table below shows the relationship among the original ins~ruction (I<K>), the L,SBP~D and the modified instruction (IM~r> and IM<r+l>), wllicb demonstrates the reali~ation of the invention via the design illustrated in Figure 5.
;54 I<K> LSE~PID IM<r> IM~r+l>
" " o O O O
"+" O I 0 REC I O I
SEND
Thus, while the iovention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various cbanges in form and details may be made without departing from the scope of the invention.
Y09~5057 - 33 -
Claims (7)
1. An array processor having a control unit with instruction issuing capability and having an array of individual processing elements for processing data according to instructions applied simultaneously to processing elements by the control unit, each processing element having input means, processing means, and output means connected to other processing elements characterized in that each of the processing elements is adaptive, in that each adaptive processing element receives an original instruction, generates a derived instruction and executes the derived instruction, each adaptive processing element comprising:
means for generating a derived instruction as a composite function of the original instruction and dynamic operational parameters by replacing bits in the original instruction; and means to execute said derived instruction.
means for generating a derived instruction as a composite function of the original instruction and dynamic operational parameters by replacing bits in the original instruction; and means to execute said derived instruction.
2. An array processor having a control unit with instruction issuing capability and having an array of individual. processing elements for processing data according to instructions applied simultaneously to processing elements by the control unit, each processing element having input means, processing means, and output means connected to other processing elements characterized in that each of the processing elements is adaptive, in that each adaptive processing element receives an original instruction, generates a derived instruction and executes the derived instruction, each adaptive processing element comprising:
means for generating a derived instruction as a composite function of the original instruction and dynamic operational parameters by inserting a bit into the original instruction; and means to execute said derived instruction.
means for generating a derived instruction as a composite function of the original instruction and dynamic operational parameters by inserting a bit into the original instruction; and means to execute said derived instruction.
3. An array processor having a control unit with instruction issuing capability and having an array of individual processing elements for processing data according to instructions applied simultaneously to a significant number of processing elements by the control unit, each processing element having input means processing means, and output means connected to other processing elements characterized in that each of the processing elements is adaptive, in that each adaptive processing element receives an original instruction, generates a derived instruction and executes the derived instruction, each adaptive processing element comprising:
spatial dependent status means coupled for receiving said original instruction for providing a first signal indicative of a spatial dependent status operational parameter at a first node;
execution means, including the processing means, to execute the derived instruction and to provide a result to the output means and also to provide a second signal indicative of a data dependent status operational parameter at a second node;
dependence select and verify means, coupled for receiving said original instruction and coupled to said first node and said second node for receiving said first signal and said second signal for deriving at a third node an instruction as a composite function of said original instruction, said first signal and said second signal; and instruction modification and extension means, coupled for receiving said original instruction and coupled to said third node for receiving the composite instruction for providing said derived instruction to the processing means.
spatial dependent status means coupled for receiving said original instruction for providing a first signal indicative of a spatial dependent status operational parameter at a first node;
execution means, including the processing means, to execute the derived instruction and to provide a result to the output means and also to provide a second signal indicative of a data dependent status operational parameter at a second node;
dependence select and verify means, coupled for receiving said original instruction and coupled to said first node and said second node for receiving said first signal and said second signal for deriving at a third node an instruction as a composite function of said original instruction, said first signal and said second signal; and instruction modification and extension means, coupled for receiving said original instruction and coupled to said third node for receiving the composite instruction for providing said derived instruction to the processing means.
4. An array processor according to claim 3, in which said spatial dependent status means comprises a processor identification register;
in which said processing means comprises data dependent status register means; and in which said instruction modification and extension means is coupled to said processing means to provide processing activity which is a composite function of said original instruction and the data content of said processor identification register and the data content of said data dependent status register means.
in which said processing means comprises data dependent status register means; and in which said instruction modification and extension means is coupled to said processing means to provide processing activity which is a composite function of said original instruction and the data content of said processor identification register and the data content of said data dependent status register means.
5. An array processor according to claim 3, wherein said spatial dependent status means comprises:
a processor identification register and a shift register;
and wherein said dependence select and verify means comprises a template register and an agreement verifier, whereby said processor identification register, said shift register and said agreement verifier serve as a movable window for providing bits for controlling generation of the derived instruction by the adaptive processing element.
a processor identification register and a shift register;
and wherein said dependence select and verify means comprises a template register and an agreement verifier, whereby said processor identification register, said shift register and said agreement verifier serve as a movable window for providing bits for controlling generation of the derived instruction by the adaptive processing element.
6. An array processor according to claim 3, wherein said spatial dependent status means comprises:
a processor identification register divided into sections 1... K-1, K and a shift register means similarly divided into sections 1... K-1, K and interconnected section by section with said processor identification register.
a processor identification register divided into sections 1... K-1, K and a shift register means similarly divided into sections 1... K-1, K and interconnected section by section with said processor identification register.
7. An array processor according to claim 3, wherein said dependence select and verify means comprises a plurality of multiplexers arrayed to provide as an output agreement bit values as a composite function of said original instruction and data values to said instruction modification and extension means.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US06/839,311 US4783738A (en) | 1986-03-13 | 1986-03-13 | Adaptive instruction processing by array processor having processor identification and data dependent status registers in each processing element |
US06/839,311 | 1986-03-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
CA1268554A true CA1268554A (en) | 1990-05-01 |
Family
ID=25279392
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA000529484A Expired - Fee Related CA1268554A (en) | 1986-03-13 | 1987-02-11 | Adaptive instruction processing by array processor having processor identification and data dependent status registers in each processing element |
Country Status (5)
Country | Link |
---|---|
US (1) | US4783738A (en) |
EP (1) | EP0237013B1 (en) |
JP (1) | JPH0719244B2 (en) |
CA (1) | CA1268554A (en) |
DE (1) | DE3784082T2 (en) |
Families Citing this family (84)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE3506749A1 (en) * | 1984-02-27 | 1985-09-26 | Nippon Telegraph & Telephone Public Corp., Tokio/Tokyo | Matrix processor and control method therefor |
GB2211638A (en) * | 1987-10-27 | 1989-07-05 | Ibm | Simd array processor |
US4839851A (en) * | 1987-07-13 | 1989-06-13 | Idaho Research Foundation, Inc. | Programmable data path device |
US4943912A (en) * | 1987-10-13 | 1990-07-24 | Hitachi, Ltd. | Parallel processor system having control processor and array control apparatus for selectively activating different processors |
US4901360A (en) * | 1987-10-23 | 1990-02-13 | Hughes Aircraft Company | Gated architecture for computer vision machine |
NL8800071A (en) * | 1988-01-13 | 1989-08-01 | Philips Nv | DATA PROCESSOR SYSTEM AND VIDEO PROCESSOR SYSTEM, PROVIDED WITH SUCH A DATA PROCESSOR SYSTEM. |
US5257395A (en) * | 1988-05-13 | 1993-10-26 | International Business Machines Corporation | Methods and circuit for implementing and arbitrary graph on a polymorphic mesh |
US5136717A (en) * | 1988-11-23 | 1992-08-04 | Flavors Technology Inc. | Realtime systolic, multiple-instruction, single-data parallel computer system |
US5280620A (en) * | 1988-12-16 | 1994-01-18 | U.S. Philips Corporation | Coupling network for a data processor, including a series connection of a cross-bar switch and an array of silos |
US5067069A (en) * | 1989-02-03 | 1991-11-19 | Digital Equipment Corporation | Control of multiple functional units with parallel operation in a microcoded execution unit |
CA2012938A1 (en) * | 1989-04-19 | 1990-10-19 | Patrick F. Castelaz | Clustering and association processor |
EP0424618A3 (en) * | 1989-10-24 | 1992-11-19 | International Business Machines Corporation | Input/output system |
US5471593A (en) * | 1989-12-11 | 1995-11-28 | Branigin; Michael H. | Computer processor with an efficient means of executing many instructions simultaneously |
CA2073185A1 (en) * | 1990-01-05 | 1991-07-06 | Won S. Kim | Parallel processor memory system |
US5617577A (en) * | 1990-11-13 | 1997-04-01 | International Business Machines Corporation | Advanced parallel array processor I/O connection |
US5828894A (en) * | 1990-11-13 | 1998-10-27 | International Business Machines Corporation | Array processor having grouping of SIMD pickets |
US5590345A (en) * | 1990-11-13 | 1996-12-31 | International Business Machines Corporation | Advanced parallel array processor(APAP) |
US5765015A (en) * | 1990-11-13 | 1998-06-09 | International Business Machines Corporation | Slide network for an array processor |
US5630162A (en) * | 1990-11-13 | 1997-05-13 | International Business Machines Corporation | Array processor dotted communication network based on H-DOTs |
US5815723A (en) * | 1990-11-13 | 1998-09-29 | International Business Machines Corporation | Picket autonomy on a SIMD machine |
US5752067A (en) * | 1990-11-13 | 1998-05-12 | International Business Machines Corporation | Fully scalable parallel processing system having asynchronous SIMD processing |
US5765012A (en) * | 1990-11-13 | 1998-06-09 | International Business Machines Corporation | Controller for a SIMD/MIMD array having an instruction sequencer utilizing a canned routine library |
US5966528A (en) * | 1990-11-13 | 1999-10-12 | International Business Machines Corporation | SIMD/MIMD array processor with vector processing |
US5588152A (en) * | 1990-11-13 | 1996-12-24 | International Business Machines Corporation | Advanced parallel processor including advanced support hardware |
US5734921A (en) * | 1990-11-13 | 1998-03-31 | International Business Machines Corporation | Advanced parallel array processor computer package |
US5625836A (en) * | 1990-11-13 | 1997-04-29 | International Business Machines Corporation | SIMD/MIMD processing memory element (PME) |
DE69131272T2 (en) * | 1990-11-13 | 1999-12-09 | Ibm | Parallel associative processor system |
US5809292A (en) * | 1990-11-13 | 1998-09-15 | International Business Machines Corporation | Floating point for simid array machine |
US5963745A (en) * | 1990-11-13 | 1999-10-05 | International Business Machines Corporation | APAP I/O programmable router |
US5765011A (en) * | 1990-11-13 | 1998-06-09 | International Business Machines Corporation | Parallel processing system having a synchronous SIMD processing with processing elements emulating SIMD operation using individual instruction streams |
US5963746A (en) * | 1990-11-13 | 1999-10-05 | International Business Machines Corporation | Fully distributed processing memory element |
US5794059A (en) * | 1990-11-13 | 1998-08-11 | International Business Machines Corporation | N-dimensional modified hypercube |
US5594918A (en) * | 1991-05-13 | 1997-01-14 | International Business Machines Corporation | Parallel computer system providing multi-ported intelligent memory |
US5237626A (en) * | 1991-09-12 | 1993-08-17 | International Business Machines Corporation | Universal image processing module |
JP2571655B2 (en) * | 1991-11-27 | 1997-01-16 | インターナショナル・ビジネス・マシーンズ・コーポレイション | Protocol conversion mechanism, switching network and computer system |
JPH06131312A (en) * | 1992-01-23 | 1994-05-13 | Hitachi Ltd | Method and system for parallel processing |
JP2642039B2 (en) * | 1992-05-22 | 1997-08-20 | インターナショナル・ビジネス・マシーンズ・コーポレイション | Array processor |
US5581778A (en) * | 1992-08-05 | 1996-12-03 | David Sarnoff Researach Center | Advanced massively parallel computer using a field of the instruction to selectively enable the profiling counter to increase its value in response to the system clock |
US5579527A (en) * | 1992-08-05 | 1996-11-26 | David Sarnoff Research Center | Apparatus for alternately activating a multiplier and a match unit |
US6298162B1 (en) | 1992-12-23 | 2001-10-02 | Lockheed Martin Corporation | Image compression/expansion using parallel decomposition/recomposition |
US5493651A (en) * | 1993-02-16 | 1996-02-20 | International Business Machines Corporation | Method and system for dequeuing connection requests in a simplex switch |
JP3287901B2 (en) * | 1993-03-12 | 2002-06-04 | シャープ株式会社 | Identification Data Confirmation Method in Data Driven Information Processing System |
US5765014A (en) * | 1993-10-12 | 1998-06-09 | Seki; Hajime | Electronic computer system and processor element for processing in a data driven manner using reverse polish notation |
US5535291A (en) * | 1994-02-18 | 1996-07-09 | Martin Marietta Corporation | Superresolution image enhancement for a SIMD array processor |
US5659780A (en) * | 1994-02-24 | 1997-08-19 | Wu; Chen-Mie | Pipelined SIMD-systolic array processor and methods thereof |
US5748950A (en) * | 1994-09-20 | 1998-05-05 | Intel Corporation | Method and apparatus for providing an optimized compare-and-branch instruction |
US5758176A (en) * | 1994-09-28 | 1998-05-26 | International Business Machines Corporation | Method and system for providing a single-instruction, multiple-data execution unit for performing single-instruction, multiple-data operations within a superscalar data processing system |
US5682491A (en) * | 1994-12-29 | 1997-10-28 | International Business Machines Corporation | Selective processing and routing of results among processors controlled by decoding instructions using mask value derived from instruction tag and processor identifier |
US6128720A (en) * | 1994-12-29 | 2000-10-03 | International Business Machines Corporation | Distributed processing array with component processors performing customized interpretation of instructions |
US5680597A (en) * | 1995-01-26 | 1997-10-21 | International Business Machines Corporation | System with flexible local control for modifying same instruction partially in different processor of a SIMD computer system to execute dissimilar sequences of instructions |
US5898850A (en) * | 1997-03-31 | 1999-04-27 | International Business Machines Corporation | Method and system for executing a non-native mode-sensitive instruction within a computer system |
US6076156A (en) * | 1997-07-17 | 2000-06-13 | Advanced Micro Devices, Inc. | Instruction redefinition using model specific registers |
US6366999B1 (en) * | 1998-01-28 | 2002-04-02 | Bops, Inc. | Methods and apparatus to support conditional execution in a VLIW-based array processor with subword execution |
US6219776B1 (en) * | 1998-03-10 | 2001-04-17 | Billions Of Operations Per Second | Merged array controller and processing element |
US7225436B1 (en) | 1998-12-08 | 2007-05-29 | Nazomi Communications Inc. | Java hardware accelerator using microcode engine |
US20050149694A1 (en) * | 1998-12-08 | 2005-07-07 | Mukesh Patel | Java hardware accelerator using microcode engine |
US6332215B1 (en) | 1998-12-08 | 2001-12-18 | Nazomi Communications, Inc. | Java virtual machine hardware for RISC and CISC processors |
US6826749B2 (en) | 1998-12-08 | 2004-11-30 | Nazomi Communications, Inc. | Java hardware accelerator using thread manager |
US7191310B2 (en) * | 2000-01-19 | 2007-03-13 | Ricoh Company, Ltd. | Parallel processor and image processing apparatus adapted for nonlinear processing through selection via processor element numbers |
EP1197847A3 (en) * | 2000-10-10 | 2003-05-21 | Nazomi Communications Inc. | Java hardware accelerator using microcode engine |
US7346217B1 (en) * | 2001-04-25 | 2008-03-18 | Lockheed Martin Corporation | Digital image enhancement using successive zoom images |
US7127593B2 (en) * | 2001-06-11 | 2006-10-24 | Broadcom Corporation | Conditional execution with multiple destination stores |
US7383421B2 (en) * | 2002-12-05 | 2008-06-03 | Brightscale, Inc. | Cellular engine for a data processing system |
US8769508B2 (en) | 2001-08-24 | 2014-07-01 | Nazomi Communications Inc. | Virtual machine hardware for RISC and CISC processors |
US7251594B2 (en) * | 2001-12-21 | 2007-07-31 | Hitachi, Ltd. | Execution time modification of instruction emulation parameters |
US7613900B2 (en) * | 2003-03-31 | 2009-11-03 | Stretch, Inc. | Systems and methods for selecting input/output configuration in an integrated circuit |
US8001266B1 (en) | 2003-03-31 | 2011-08-16 | Stretch, Inc. | Configuring a multi-processor system |
US7581081B2 (en) | 2003-03-31 | 2009-08-25 | Stretch, Inc. | Systems and methods for software extensible multi-processing |
US7590829B2 (en) * | 2003-03-31 | 2009-09-15 | Stretch, Inc. | Extension adapter |
US7609297B2 (en) * | 2003-06-25 | 2009-10-27 | Qst Holdings, Inc. | Configurable hardware based digital imaging apparatus |
US7418575B2 (en) * | 2003-07-29 | 2008-08-26 | Stretch, Inc. | Long instruction word processing with instruction extensions |
US7373642B2 (en) * | 2003-07-29 | 2008-05-13 | Stretch, Inc. | Defining instruction extensions in a standard programming language |
FR2865290A1 (en) * | 2004-01-21 | 2005-07-22 | Thomson Licensing Sa | METHOD FOR MANAGING DATA IN A MATRIX PROCESSOR AND MATRIX PROCESSOR EMPLOYING THE METHOD |
US7725691B2 (en) * | 2005-01-28 | 2010-05-25 | Analog Devices, Inc. | Method and apparatus for accelerating processing of a non-sequential instruction stream on a processor with multiple compute units |
US7516301B1 (en) * | 2005-12-16 | 2009-04-07 | Nvidia Corporation | Multiprocessor computing systems with heterogeneous processors |
US7451293B2 (en) * | 2005-10-21 | 2008-11-11 | Brightscale Inc. | Array of Boolean logic controlled processing elements with concurrent I/O processing and instruction sequencing |
JP2009523292A (en) * | 2006-01-10 | 2009-06-18 | ブライトスケール インコーポレイテッド | Method and apparatus for scheduling multimedia data processing in parallel processing systems |
WO2008027567A2 (en) * | 2006-09-01 | 2008-03-06 | Brightscale, Inc. | Integral parallel machine |
US20080244238A1 (en) * | 2006-09-01 | 2008-10-02 | Bogdan Mitu | Stream processing accelerator |
US20080059467A1 (en) * | 2006-09-05 | 2008-03-06 | Lazar Bivolarski | Near full motion search algorithm |
US20080212895A1 (en) * | 2007-01-09 | 2008-09-04 | Lockheed Martin Corporation | Image data processing techniques for highly undersampled images |
US7920935B2 (en) * | 2008-08-19 | 2011-04-05 | International Business Machines Corporation | Activity based real-time production instruction adaptation |
US8755515B1 (en) | 2008-09-29 | 2014-06-17 | Wai Wu | Parallel signal processing system and method |
US11940945B2 (en) * | 2021-12-31 | 2024-03-26 | Ceremorphic, Inc. | Reconfigurable SIMD engine |
Family Cites Families (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3287702A (en) * | 1962-12-04 | 1966-11-22 | Westinghouse Electric Corp | Computer control |
US3287703A (en) * | 1962-12-04 | 1966-11-22 | Westinghouse Electric Corp | Computer |
US3544973A (en) * | 1968-03-13 | 1970-12-01 | Westinghouse Electric Corp | Variable structure computer |
US4558411A (en) * | 1969-05-19 | 1985-12-10 | Burroughs Corp. | Polymorphic programmable units employing plural levels of sub-instruction sets |
US3970993A (en) * | 1974-01-02 | 1976-07-20 | Hughes Aircraft Company | Cooperative-word linear array parallel processor |
GB1527289A (en) * | 1976-08-17 | 1978-10-04 | Int Computers Ltd | Data processing systems |
US4380046A (en) * | 1979-05-21 | 1983-04-12 | Nasa | Massively parallel processor computer |
US4301443A (en) * | 1979-09-10 | 1981-11-17 | Environmental Research Institute Of Michigan | Bit enable circuitry for an image analyzer system |
US4287566A (en) * | 1979-09-28 | 1981-09-01 | Culler-Harrison Inc. | Array processor with parallel operations per instruction |
JPS6042516B2 (en) * | 1980-03-04 | 1985-09-24 | 日本電信電話株式会社 | data processing equipment |
US4435758A (en) * | 1980-03-10 | 1984-03-06 | International Business Machines Corporation | Method for conditional branch execution in SIMD vector processors |
US4344134A (en) * | 1980-06-30 | 1982-08-10 | Burroughs Corporation | Partitionable parallel processor |
US4467409A (en) * | 1980-08-05 | 1984-08-21 | Burroughs Corporation | Flexible computer architecture using arrays of standardized microprocessors customized for pipeline and parallel operations |
US4484346A (en) * | 1980-08-15 | 1984-11-20 | Sternberg Stanley R | Neighborhood transformation logic circuitry for an image analyzer system |
US4398176A (en) * | 1980-08-15 | 1983-08-09 | Environmental Research Institute Of Michigan | Image analyzer with common data/instruction bus |
US4574394A (en) * | 1981-06-01 | 1986-03-04 | Environmental Research Institute Of Mi | Pipeline processor |
US4464689A (en) * | 1981-06-04 | 1984-08-07 | Education & Informations Systems, Inc. | Random access read/write unit |
NZ207326A (en) * | 1983-03-08 | 1988-03-30 | Stc Plc | Associative data processing array |
US4739474A (en) * | 1983-03-10 | 1988-04-19 | Martin Marietta Corporation | Geometric-arithmetic parallel processor |
US4541116A (en) * | 1984-02-27 | 1985-09-10 | Environmental Research Institute Of Mi | Neighborhood image processing stage for implementing filtering operations |
JPS61264470A (en) * | 1985-05-03 | 1986-11-22 | アドバンスト・マイクロ・デイバイシズ・インコ−ポレ−テツド | Monolithic integrated circuit device |
DE3579924D1 (en) * | 1985-05-20 | 1990-10-31 | Howard D Shekels | SUPER COMPUTER SYSTEM ARCHITECTURE. |
-
1986
- 1986-03-13 US US06/839,311 patent/US4783738A/en not_active Expired - Fee Related
-
1987
- 1987-01-09 JP JP62002006A patent/JPH0719244B2/en not_active Expired - Lifetime
- 1987-02-11 CA CA000529484A patent/CA1268554A/en not_active Expired - Fee Related
- 1987-03-10 DE DE8787103374T patent/DE3784082T2/en not_active Expired - Fee Related
- 1987-03-10 EP EP87103374A patent/EP0237013B1/en not_active Expired - Lifetime
Also Published As
Publication number | Publication date |
---|---|
EP0237013A3 (en) | 1989-02-08 |
EP0237013B1 (en) | 1993-02-10 |
DE3784082D1 (en) | 1993-03-25 |
JPS62221063A (en) | 1987-09-29 |
DE3784082T2 (en) | 1993-08-12 |
EP0237013A2 (en) | 1987-09-16 |
JPH0719244B2 (en) | 1995-03-06 |
US4783738A (en) | 1988-11-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA1268554A (en) | Adaptive instruction processing by array processor having processor identification and data dependent status registers in each processing element | |
US5828894A (en) | Array processor having grouping of SIMD pickets | |
US5815723A (en) | Picket autonomy on a SIMD machine | |
Annaratone et al. | Warp architecture and implementation | |
US5805915A (en) | SIMIMD array processing system | |
US5809292A (en) | Floating point for simid array machine | |
EP0485690B1 (en) | Parallel associative processor system | |
Hartenstein et al. | A new FPGA architecture for word-oriented datapaths | |
US5148547A (en) | Method and apparatus for interfacing bit-serial parallel processors to a coprocessor | |
CA1119731A (en) | Multibus processor for increasing execution speed using a pipeline effect | |
US5761077A (en) | Graph partitioning engine based on programmable gate arrays | |
Bernhard | Computers: Computing at the speed limit: Computers 1000 times faster than today's supercomputers would benefit vital scientific applications | |
Treleaven et al. | A recursive computer architecture for VLSI | |
EP0570952A2 (en) | Slide network for an array processor | |
Ligon et al. | An empirical methodology for exploring reconfigurable architectures | |
Sérot et al. | A functional data-ow architecture dedicated to real-time image processing | |
US5473774A (en) | Method for conflict detection in parallel processing system | |
Vlontzos et al. | A wavefront array processor using dataflow processing elements | |
Lee et al. | The implementation of a PC-based list processor for symbolic computation | |
Schaefer | The characterization and representation of massively parallel computing structures | |
Hawver et al. | Processor autonomy and its effect on parallel program execution | |
Yoshinaga et al. | Node processor for a parallel object‐oriented total architecture A‐NET | |
CA1293063C (en) | Binary tree multiprocessor | |
Vasquez | Concurrent use of two programming tools for heterogeneous supercomputers | |
Lee | Data structures and processor allocation for dataflow systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MKLA | Lapsed |