CA1212477A

CA1212477A - Data processing apparatus and method employing instruction pipelining

Info

Publication number: CA1212477A
Application number: CA000457773A
Authority: CA
Inventors: Paul R. Jones, Jr.; Walter A. Jones; Joseph L. Ardini, Jr.
Original assignee: Prime Computer Inc
Current assignee: Prime Computer Inc
Priority date: 1983-07-11
Filing date: 1984-06-28
Publication date: 1986-10-07
Also published as: US4750112A; JPS6074035A; ATE64664T1; DE3484720D1; EP0150177A1; EP0134620A2; WO1985000453A1; EP0134620B1; EP0134620A3; US4777594A; CA1212476A; US4760519A

Abstract

ABSTRACT OF THE DISCLOSURE

A data processing system for processing a sequence of program instructions has two independent pipelines, an instruction pipeline and an execution pipeline. Each pipeline has a plurality of serially operating stages. The instruction stages read instructions from storage and form therefrom address data to be employed by the execution pipeline. The execution pipeline receives the address data and uses it for referencing stored data to be employed for execution of the program instructions. Both pipelines operate synchronously under the control of a pipeline control unit which initiates operation of at least one stage of the execution pipeline prior to completion of the instruction pipeline for a particular instruction.
Thereby operation of at least one instruction stage and one execution stage of the respective pipelines overlap for each program instruction. The instruction and execution pipelines share high speed memory. The pipeline control unit can independently control the flow of instructions through the two pipelines. This is important for operation in conjunction with a microcode storage element which allows conditional branching and subroutine operation. Circuitry also detects pipeline collisions and exception conditions and delays or inhibits operation of one or more of the pipeline stages in response thereto. Under control of the pipeline control unit, one of the independent pipelines can operate while the other is halted.

Description

DATA PROCESSING APPARATUS AND METHOD EMPLOYING
INSTRUCTION PIPELINING

BACKGROUND OF THE INVENTION

The present invention relates to the field of digital computers and, in particular, to apparatus and methods for processing instructions in high speed data processing systems.

Data processing systems generally include a central processor, an associated storage system (or main memory), and peripheral devices and associated interfaces. Typically, the main memory consists of relatively low cost, high-capacity digital storage devices. The peripheral devices may be, for example, non-volatile semi-permanent storage media, such as magnetic disks and magnetic tape drives. In order to carry out tasks, the central processor of such systems executes a succession of instructions which operate on data. The succession of instructions and the data those instructions reference are referred to as a program.

In operation of such systems, programs are initially brought to an intermediate storage area, usually in the main memory. The central processor may then interface directly to the main memory to execute the stored program. However, this procedure places limitations on performance due principally to the relatively long times required in accessing that main memory. To overcome these limitations a high speed (i.e. relatively fast access) storage system, in some cases called a cache, is used for holding currently used portions of programs within the central processor itself. The cache interfaces with main memory through memory control hardware which handles program transfers between the central processor, main memory and the peripheral device interfaces.

One form of computer, typically a mainframe computer, has been developed in the prior art to concurrently process a succession of instructions in a so-called "pipeline" processor. In such pipeline processors each instruction is executed in part at each of a succession of stages. After the instruction has been processed at each of the stages, the execution is complete. With this configuration, as an instruction is passed from one stage to the next, that instruction is replaced by the next instruction in the program.
Thus, the stages together form a "pipeline" which, at any given time, is executing, in part, a succession of instructions. Such instruction pipelines for processing a plurality of instructions in parallel are found in several mainframe computers. These processors consist of single pipelines of varying length and employ hard-wired logic for all data manipulation. The large quantity of control logic in such machines makes them extremely fast, but also very expensive.

Another form of computer system, typically a "minicomputer", incorporates microcode control of instruction execution. Generally, under microcode control, each instruction is fully executed before execution of the next instruction begins. Microcode-controlled execution does not provide as high performance (principally in terms of speed) as hard-wired control, but the microcode control does permit significant cost advantages compared to hard-wired systems. As a result, microcode control of instruction execution has been employed in many cost-sensitive machines.
Microcode reduces the total quantity of hardware in the processor and also allows much more flexibility in terms of adapting to changes which may be required during system operation. Unfortunately, the conventional pipeline techniques for instruction execution are not compatible with the multiple steps which must be performed to execute some instructions in a microcode-controlled environment.

Accordingly, it is an object of the present invention to provide an improved computer system.

Another object is to provide performance characteristics heretofore associated only with mainframes while maintaining a cost profile consistent with the minicomputers.

It is yet another object to provide a computer system incorporating pipeline instruction processing and microcode-controlled instruction execution.

SUMMARY OF THE INVENTION

The invention relates to a data processing system and pipeline control method for processing a sequence of program instructions in a computer. The data processing system has an instruction pipeline having a plurality of serially operating instruction stages for reading instructions from storage and for forming therefrom plural address data to be employed during execution of the program instructions. The data processing system further has an execution pipeline having a plurality of serially operating execution stages for receiving the address data and for employing that data, formed by the instruction pipeline, for referencing stored data to be employed for executing the program instructions.

The data processing system features a pipeline control unit for synchronously operating the instruction pipeline and the execution pipeline. The pipeline control unit has circuitry for initiating operation of at least one stage of the execution pipeline using at least one of the address data formed by the instruction pipeline for a program instruction prior to the completion of address data formation by the instruction pipeline for that program instruction.
Thereby, operation of at least one instruction stage and one execution stage of the respective pipelines overlaps for each program instruction.

The data processing system further features sharing a memory between the instruction pipeline and the execution pipeline. A pipeline master clock for timing the pipeline stages has at least two clocked periods allotted for each stage of the pipeline to complete its operation. During one of these two clocked periods the instruction pipeline has access to the high speed memory and during another one of the clocked periods the execution pipeline has access to the high speed memory.

The pipeline control unit further has circuitry responsive to exception conditions on the execution and instruction pipelines for independently controlling, for each pipeline, the flow of instruction operations through the execution and instruction pipelines. Flow control of the instructions can include halting one or the other, or both, of the execution and instruction pipelines; running the execution pipeline using artificial "NOP" (no operation) instructions while a previously empty instruction pipeline is being filled; extending the time for all pipeline stages to complete an operation for allowing one of the stages to complete its operations; providing extended time for a plurality of microinstructions to be used in the execution stages of the pipeline; maintaining the instruction pipeline in a halted state; and similarly related operations.

In another aspect of the invention, a pipeline control method for use with an instruction and an execution pipeline having a plurality of serially operating instruction stages, features the steps of synchronously operating the instruction and execution pipelines and initiating operation of at least one stage of the execution pipeline using address data formed by the instruction pipeline at a time prior to completing, by the instruction pipeline, generation of all of the address data formed for a particular program instruction.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects of this invention, the various features thereof, as well as the invention itself, may be more fully understood from the following description, when read together with the accompanying drawings in which:

Fig. 1 shows, in block diagram form, an exemplary computer system embodying the present invention;

Fig. 1A depicts, in block diagram form, the instruction processor, including the two three-stage pipelines, showing overlap and flow between stages, and the pipeline control unit, of the central processor of the system of Fig. 1;

Fig. 2 depicts the five hardware units that form the instruction processor of Fig. 2, showing major data paths for the processing of instructions;

Fig. 3 shows, in block diagram form, the pipeline control unit of Figure 2;

Fig. 3A shows, in block diagram form, the decode logic for the pipeline control unit of Fig. 4;

Fig. 4 shows, in detailed block diagram form, the pipelines of Fig. 1;

Fig. 5 depicts the flow of instructions through the two pipelines, with examples of alteration to normal processing flow;

Fig. 6 illustrates the clock generation of the ID stage of the IP pipelines of Fig. 1A;

Fig. 7 depicts a block diagram of the Shared Program Cache of Fig. 1A;

Fig. 8 depicts a block diagram of the Instruction Preprocessor of Fig. 1A;

Fig. 9 depicts a block diagram of the Micro-Control Store of Fig. 1A;

Fig. 10 depicts a combined block diagram of the two Execution units of Fig. 1A;

Fig. 11 shows, in block diagram form, the branch cache of the system of Fig. 4; and

Fig. 12 shows, in block diagram form, the register bypass network of the Instruction Preprocessor of Fig. 1A.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Fig. 1 shows a computer system embodying the present invention. The system includes a central processor, main memory, peripheral interface and exemplary peripheral devices.

This system of Fig. 1 processes computer data instructions in the central processor, which includes instruction preprocessing hardware, local program storage, micro-control store, and execution hardware. The central processor includes two independent pipelines: the Instruction Pipeline (IP) and the Execution Pipeline (EP). In the preferred form, each pipeline is three stages in length (where the processing time associated with each stage is nominally the same), with the last stage of the IP being overlapped with the first stage of the EP. With this configuration, an instruction requires a minimum of five stage times for completion. All control for advancing instructions through all required stages originates from a Pipeline Control Unit (PCU) in the central processor. The PCU controls the stages to be clocked dynamically, based on pipeline status information gathered from all stages.

This form of the invention processes instructions defined in the System Architecture Reference Guide, Ed Ed. (PRICK 182) Revision 1392, published by Prime Computer, Inc., Natick, Massachusetts, and supports the machine architecture, which includes a plurality of addressing modes, defined in the Reference Guide. In keeping with this architecture, words are 16 bits in length, and double words are 32 bits in length.
This form of the invention is optimized to perform address formations including BR + X + D, BR + GRH + D and RP + X + D, where BR (Base Register) is a 32-bit starting address pointer, X (index) is a 16-bit register, GRH (high side of General Register) is a 16-bit quantity, D (the displacement) is contained explicitly in the instruction and may be either 9 or 16 bits, and RP is the current value of the program counter.

PRINCIPLES OF PIPELINE OPERATION

Pipeline Stages

Fig. 1A shows, in functional block diagram form, two three-stage pipelines, an Instruction Pipeline (IP) and an Execution Pipeline (EP), together with the pipeline control unit (PCU) 1 in the central processor. The Instruction Pipeline includes an Instruction Fetch (IF) stage 2, an Instruction Decode (ID) stage 3, and an Address Generation (AG) stage 4.
The Execution Pipeline (EP) includes a Control Formation (CF) stage 5, an Operand Execute (OE) stage 6, and an Execute Store (ES) stage 7. The PCU 1 is depicted in detailed block diagram form in Figs. 3 and 3A and the IF, ID, AG, CF, OE and ES stages are depicted in detailed block diagram form in Fig. 4.

Fig. 2 shows an embodiment of the IP, EP and PCU of Fig. 1A in terms of five hardware units:
Instruction Preprocessor (IPP) 8, Shared Program Cache (SPC) 9, Execution-1 board (EX1) 10, Execution-2 board (EX2) 11, and Micro-Control Store (MCS) 12. The hardware units of Fig. 2 are representative of groupings of the various elements of the IP and EP of Fig. 4. The respective hardware units are shown in detailed form in Figs. 7-10. In alternative embodiments, other groupings of the various elements of the IP and EP may be used.

Briefly, in the illustrated grouping of Fig. 2, the Shared Program Cache 9 contains local storage and provides instructions by way of bus 13 to the Instruction Preprocessor 8, and provides memory operands by way of bus 14 to the Execution-1 board 10.
The IPP 8 supplies memory operand addresses by way of bus 15 to the SPC 9, register operands and immediate data by way of bus 17 to EX1 10, and control decode addresses by way of bus 19 to the Micro-Control Store 12. EX1 10 operates on memory operands received by way of bus 14 from the SPC 9 and register file operands received by way of bus 16 from the Execution-2 board 11, and transfers partial results by way of bus I to EX2 11 for postprocessing and storage. EX2 11 also performs multiplication operations. The MCS 12 provides microprogrammed algorithmic control for the four blocks 8-11, while the PCU 1 provides pipeline stage manipulation for all blocks 8-12.

The pipeline stage operations are completed within the various hardware units 8-12 as follows:

IF (Instruction Fetch): A look-ahead program counter on SPC 9 is loaded into a local general address register; instructions are accessed from a high speed local memory (cache).

ID (Instruction Decode): Instruction data is transferred from SPC 9 to IPP 8; IPP 8 decodes instructions, forming micro-control store entry point information for MCS 12, and accessing registers for address generation in IPP 8.

AG (Address Generation): IPP 8 forms instruction operand address and transfers value to SPC 9 address register.

CF (Control Formation): MCS 12 accesses local control store word and distributes control information to all boards.

OE (Operand Execute): SPC 9 accesses memory data operands in cache; EX1 10 receives memory data operands from SPC 9, register operands from IPP 8, and begins arithmetic operations.

ES (Execute Store): EX1 10 and EX2 11 complete arithmetic operation and store results.
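
The stage sequence above can be illustrated with a short Python sketch. It assumes the ideal case in which every stage takes one stage time and nothing stalls, with the last instruction-pipeline stage and the first execution-pipeline stage sharing one time slot. The names `STAGE_SLOT`, `schedule` and `stage_times_to_complete` are illustrative only and do not come from the patent.

```python
# Ideal-case occupancy of the two three-stage pipelines.
# Assumptions (not from the patent): one stage time per stage, no stalls,
# and instructions numbered from 0 entering the fetch stage one per stage time.
STAGE_SLOT = {"IF": 0, "ID": 1, "AG": 2, "CF": 2, "OE": 3, "ES": 4}


def schedule(n_instructions):
    """Return {stage time: {stage name: instruction index}}."""
    table = {}
    for i in range(n_instructions):
        for stage, offset in STAGE_SLOT.items():
            table.setdefault(i + offset, {})[stage] = i
    return table


def stage_times_to_complete():
    """Stage times one instruction needs from fetch through store."""
    return max(STAGE_SLOT.values()) + 1


if __name__ == "__main__":
    for t, row in sorted(schedule(6).items()):
        print(t, row)
```

In this idealized model, once the pipelines are full, each stage time holds five different instructions in flight, and a single instruction needs five stage times even though it passes through six stages, because address generation and control formation overlap.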

The Address Generation and Control Formation stages are overlapped in time within the data system. The IP and EP operate synchronously under the supervision of the pipeline control unit (PCU) 1, which interfaces to each stage with two enable lines (ENCxx1 and ENCxx2) that provide two distinct clock phases within each stage, as indicated in Fig. 1A. The notation "xx" refers to a respective one of the reference designations IF, ID, AG, CF, OE and ES. The six ENCxx2 lines denote that the respective stage operations are complete and the data (or control) processed in those stages are ready for passing to the next stage.

Clocking of Pipeline Stages

Timing and clocking in the dual pipelines (IP and EP) are synchronized by two signals - the master clock MCLK and the enable-end-of-phase signal ENEOP.
ENEOP is produced by the Pipeline Control Unit 1 and notifies all boards of the proper time to examine the stage clock enable signal lines (ENCxx1 and ENCxx2) in order to produce phase 1 and phase 2 stage clocks from the master clock MCLK (see Fig. 6). Pipeline stages always consist of two phases. Phase 1 lasts for exactly two MCLK pulses while phase 2 can last for an arbitrary number of MCLK pulses, as described below, depending on the conditions present in both the IP and the EP.

An example of how MCLK and ENEOP and the stage clock enables interact on each board to form the clocks which define the stage boundaries is shown in Fig. 6 for the Instruction Decode stage 3. Register 22 generates clock signals when enabled by ENEOP. When ENCID1 is present, the clock CID1 is generated; when ENCID2 is present, the clock CID2 is generated.
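
The two-phase stage timing described in this section can be sketched in Python as follows. This is a simplified model, not the clock circuit of Fig. 6: it only counts master clock pulses per phase. The function names are hypothetical, and the sketch assumes that phase 2 lasts at least one pulse, which the text does not state.

```python
# Simplified model of stage timing in master clock pulses.
PHASE1_PULSES = 2  # from the text: phase 1 lasts exactly two master clock pulses


def stage_phases(phase2_pulses):
    """List the phase (1 or 2) active on each master clock pulse of one stage."""
    if phase2_pulses < 1:
        # Assumption of this sketch only; the text says "an arbitrary number".
        raise ValueError("this sketch assumes phase 2 lasts at least one pulse")
    return [1] * PHASE1_PULSES + [2] * phase2_pulses


def timeline(phase2_lengths):
    """Concatenate consecutive stage times as (stage time index, phase) pairs."""
    out = []
    for stage_index, n in enumerate(phase2_lengths):
        out.extend((stage_index, phase) for phase in stage_phases(n))
    return out


if __name__ == "__main__":
    # Three stage times; the middle one has a stretched phase 2.
    for pulse, entry in enumerate(timeline([2, 5, 2])):
        print(pulse, entry)
```

Stretching only phase 2 keeps the fixed-length phase 1 aligned across all stages, which is one way to read the statement that the control unit can extend every stage when a single stage needs more time.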

PIPELINE CONTROL UNIT

The Pipeline Control Unit 1 shown in Figs. 3 and 3A controls the flow of instructions through the dual pipelines (IP and EP) by generating the enable signals for all clocks which define stage boundaries and relative overlap of the IP and EP. The PCU 1 includes stage clock enable decode logic 23 and the Pipeline State Register PSR 24. PCU 1 receives as inputs:

1. Instruction information and exception and register conditions from the IPP

2. Exception and cache conditions from the SPC 9

3. Microcode specified timing conditions related to the length of stage OE and the overlap of stages OE and CF from the MCS 12

4. Exception conditions from EX1 10 and EX2 11.

The PCU 1 has complete control of all stage boundaries. With that control:

1. The PCU 1 can hold the IP while cycling multi-microcode through the EP.

2. The PCU 1 can alter the flow of instructions based on control information provided by microcode.

3. The PCU 1 can extend all stages if extra time is required for a particular stage to finish its operation.

4. The PCU 1 can alter the relative overlap of stages OE and CF of the EP in order to allow different types of microcode sequencing (as described below).

5. The PCU 1 can flush out instructions in the IP and recycle the IP to load new instructions upon detecting incorrect flow (such as an incorrect flow prediction provided by Branch Cache 34).

6. The PCU 1 can idle the EP with no-operation (NOP) cycles, while cycling the IP, for example, when the look-ahead program counter 27,33 in the SPC 9 is reloaded after an incorrect program flow sequence.

7. The PCU 1 can suspend all pipeline operations during non-overlappable operations such as "cache miss" access to main memory.

8. The PCU 1 can introduce separation between sequential instructions in the IP under certain conditions, such as "collisions" between instructions.

9. The PCU 1 can keep an instruction held in the IF stage upon detecting an instruction-related exception, and then allow the other instructions currently in the pipeline to complete processing so that the exception can be processed in the correct order.

The Pipeline Control Unit (PCU) 1 which controls the clocking of the stages in the IP and EP is shown in detail in Fig. 3A. Condition signals received from the IPP 8, SPC 9, MCS 12, EX1 10, and EX2 11 hardware units are utilized to produce enable signals for clocks in the IF 2, ID 3, AG 4, CF 5, OE 6, and ES 7 stages of the dual pipelines (IP and EP). There are two major elements in PCU 1 which produce the clock enable signals ENCxx1,2: the pipeline state register PSR 24 (including state registers 180,182,184,186,188,190) and the stage clock enable decode logic 23 (including combinatorial logic blocks 181,183,185,187,189,191). The state registers 180,182,184,186,188,190 indicate that the respective pipeline stages are ready to be enabled if there are no conditions received by the PCU 1 which should inhibit the stage from proceeding. When the stages are in operation, the state registers 180,182,184,186,188,190 provide a timing reference to distinguish between the two phases of each stage. The combinatorial logic blocks 181,183,185,187,189,191 decode the conditions received from the various hardware units 8-11 to determine whether or not the stage operation should proceed.

The values of the state registers are controlled by the various ENCxx1 and ENCxx2 signals as follows:

The IF state register IFSR 180 is set ready by ENCIF2 which indicates that an instruction fetch is complete and another can begin.
ENCIF1 sets state register IFSR 180 to indicate that phase 1 of the IF stage has been performed.

The ID state register IDSR 182 is set ready by ENCIF2 which indicates that the IP prefetched an instruction which is ready to be decoded. ENCID1 sets state register IDSR 182 to indicate that phase 1 of the ID stage has been performed.

The AG state register AGSR 184 is set ready by ENCID2 which indicates that the IP has decoded an instruction which now requires an operand address generation. ENCAG1 sets state register AGSR 184 to indicate that phase 1 of the AG stage has been performed.

The CF state register CFSR 186 is set ready by ENCCF2 which indicates that the EP has completed formation of the control word associated with the microinstruction ready to enter the OE stage. ENCCF1 sets state register CFSR 186 to indicate that phase 1 of the CF stage is complete.

The OE state register OESR 188 is set ready by ENCCF2 which indicates that control and addressing information is ready to be passed to the OE stage. ENCOE1 sets state register OESR 188 to indicate that phase 1 of the OE stage is complete.

The ES state register ESSR 190 is set ready by ENCOE2 which indicates that operands are ready to enter the final execution stage and be stored. ENCES1 sets state register ESSR 190 to indicate that phase 1 of the ES stage is complete.

Combinatorial logic networks ENIF 181, ENID 183, ENAG 185, ENCF 187, ENOE 189, and ENES 191 monitor condition signals received from the hardware units 8-11, and when those conditions indicate, block the ENCxx1 and ENCxx2 enables for the respective stages.
In Fig. 3A, each signal entering the combinatorial logic blocks may inhibit the respective enables for that stage. The condition signals applied to the PCU 1 are described below.
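
The interaction between the state registers and the enable decode logic can be sketched in Python as a simple gating rule: a stage clock enable is produced only when the stage's state register reports ready and no inhibiting condition is active for that stage. This is a behavioural sketch under that reading of the text. The function `decode_enables` and the condition names used in the example are hypothetical, and the real mapping of condition signals to stages is not modelled.

```python
# Behavioural sketch of stage clock enable decoding.
STAGES = ("IF", "ID", "AG", "CF", "OE", "ES")


def decode_enables(ready, inhibiting):
    """Compute one enable per stage.

    ready:      {stage: bool}, modelling the per-stage state registers.
    inhibiting: {stage: set of active condition names}, modelling the inputs
                of the per-stage combinatorial logic (names are hypothetical).
    """
    return {
        stage: bool(ready.get(stage, False)) and not inhibiting.get(stage)
        for stage in STAGES
    }


if __name__ == "__main__":
    ready = {stage: True for stage in STAGES}
    # Hypothetical condition that holds only the fetch stage.
    print(decode_enables(ready, {"IF": {"hold_fetch"}}))
```

Keeping readiness (sequencing state) separate from inhibition (conditions reported by the hardware units) mirrors the split in the text between the pipeline state register and the decode logic.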

The IPP 8 provides two condition signals to the PCU 1: COLLPRED and COLLDET. COLLPRED (collision predicted) indicates that separation may have to be introduced between two instructions in the IP to allow determination of whether or not a register collision exists. COLLPRED holds the IF, ID, and AG stages of the IP to permit determination of whether or not a register collision exists between the instruction in the ID stage and the instruction that has just entered the EP.
Logic ENID 183 generates FORCENOP (force a no operation instruction in the CF stage), when no new instruction is available to enter the EP. This signal disables the LEA signal on bus 91 by setting LEA register 84 to zero. COLLDET indicates that a collision does exist.
In response, the generation of the clock enable signal for stages IF, ID, AG, CF, and OE is delayed until the updated register is available from the completion of the ES stage. This process is illustrated in Fig. 5 during time periods T24, T25, and T26.
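
The register collision check described above can be illustrated with a minimal Python sketch. It assumes that a collision means the instruction awaiting address generation reads a register that the instruction ahead of it in the execution pipeline will write. The function names and the register names in the example are hypothetical, and the hardware comparison itself is not described at this level in the text.

```python
# Minimal sketch of a register collision check between two instructions.
def colliding_registers(address_source_regs, pending_dest_regs):
    """Registers needed for address generation that are still to be written."""
    return sorted(set(address_source_regs) & set(pending_dest_regs))


def must_hold_instruction_pipeline(address_source_regs, pending_dest_regs):
    """True when address generation has to wait for the pending result."""
    return bool(colliding_registers(address_source_regs, pending_dest_regs))


if __name__ == "__main__":
    # The later instruction indexes through X; the earlier one writes X.
    print(must_hold_instruction_pipeline({"BR", "X"}, {"X"}))
```

When the check is positive, the earlier stages wait until the store stage of the older instruction has produced the updated register value, as the text describes for time periods T24 through T26 of Fig. 5.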

SPC 9 provides three condition signals to PCU 1: CACHEMISS, IMEMEXCPTN, OPMEMEXCPTN. CACHEMISS indicates that a cache miss has occurred in the SPC 9.
In response to the CACHEMISS signal, the generation of the clock enable signals for the stages IF, ID, AG, CF, and OE is delayed until the memory subsystem has updated the cache. The signal IMEMEXCPTN from the SPC 9 indicates that an exception (such as an access violation, STLB miss) has occurred during an instruction fetch. The IMEMEXCPTN signal similarly effectively holds the IF stage from further prefetching and prevents the instruction in the IF stage from proceeding to the ID stage. All other stages are allowed to proceed, so that the pipeline may be emptied of all instructions before proceeding to handle the exception condition. The OPMEMEXCPTN signal indicates that an exception has occurred during the operand fetch in stage OE. This OPMEMEXCPTN signal blocks stages IF, ID, and AG of the IP and provides sufficient delay for the CF stage so as to allow the EP to branch to a microcode routine capable of handling the exception condition.
Stage OE, in which the exception occurred, is effectively canceled.

The MCS 12 provides information decoded from microcode related to the number of microcode-driven execution cycles required to complete an instruction and the timing required for completing data manipulation and formation of micro-control store addresses within such cycles. Three signals within this category are produced. EXCMPL is only asserted on final microsteps of instructions. During all other microsteps of instructions, the PCU 1 holds the IP consisting of stages IF, ID and AG until the multi-microcode has completed. XTNDEX indicates that additional time is required in the OE stage, while XTNDCTRL controls the relative overlap of stages OE and CF, allowing microcode jump conditions to be used in the present microstep to select the following microstep.
The MCS 12 also produces FLUSH in cases where incorrect instruction flow has occurred, such as when wrong branch cache predictions are made. In response to the FLUSH signal, all IP stages are cleared and a new IF stage is started.

The EX1,2 pair 10,11 produces the signals EXECEXCPN, which is generated under certain execution-related conditions, and CEXCMPL, which indicates whether or not a microinstruction is a final one based on testing bits within EX1,2 10,11. In response to EXECEXCPN, the PCU 1 functions in a similar manner as in response to OPMEMEXCPTN, differing only in the microcode routine which is executed. The CEXCMPL causes the same result as EXCMPL, differing only in that the generation of CEXCMPL is conditioned on certain test bits within EX1,2 10,11.

INSTRUCTION FLOW IN PIPELINES

Fig. 5 shows the flow of instructions through the six stages of the dual pipelines (IP and EP), and shows the clocking associated with those stages. In Fig. 5, To - T27 are time reference markers; I1 - I25 represent machine instructions; clue - I represent additional microcode execution cycles required to complete the execution of a machine instruction; and N represents a NOP (or "no-operation") instruction cycling through the Execution Pipeline.

Time periods To and To show the dual pipelines concurrently processing five machine instructions. Instruction 4 requires an additional microcode cycle My; during time period To, the PCU 1 idles the IF, ID, and AG stages of the Instruction Pipeline.
During To, the IP again begins to advance instructions.
It also requires an extra execution cycle (rl21, so that during time periods To and To, the PCU 1 again idles the three stages of the IP. The second microcode step of It (i.e. I is conditional, based on the results of the execution of It; the PCU 1 therefore stretches the CF stage for My relative to the end of the OE stage for It. Both pipelines are operative again during time periods To and T8. It is an example of a machine instruction requiring four extra microcode execution cycles ( My, I I " and My ). The PCU 1 begins and continues to idle stages IF, ID, and AG beginning in time period To. Microcode execution cycle My requires additional time in the OE stage, so the PCU 1 extends both the CF and OE stage from T10 to T11.

In the exemplary sequence of Fig. 5, It is a conditional instruction. During the multiple cycles of execution associated with It (i.e. My - My), the system determines that the IP has prefetched incorrectly. The EP then flushes the pipeline by notifying the PCU and reloading the look-ahead program counter used for prefetching. The IF, ID, and AG stages of the Instruction Pipeline are shown refilling during time periods T14, T15, and To. While the IP is refilling, the EP completes the last microcode step associated with It. During time periods T14 and To, NOP steps are forced into the Execution Pipeline, as no machine instruction is yet available for execution.

I18 is an example of a machine instruction requiring extra time in the OE stage. The PCU also delays the IF, ID, AG and CF stages of the instructions behind I18 (i.e. I19, I20, and I21), keeping all stages in synchrony.

Time periods T23, T24, T25, and T26 show an example where the IP requests special action in the PCU prior to advancing I22 from the ID stage to the AG stage. In particular, the IP has determined that I21 will modify a register required by I22 to generate the operand address associated with I22. In response, the PCU 1 suspends the IP during time period T24, and delays the IF, ID, and AG stages in the IP and the CF stage in the EP during time periods T25 and T26, so that the results stored for I21 in the ES stage can be used by the AG stage for I22. Because no machine instruction is available at time period T24, a NOP cycle is introduced into the CF stage of the EP.

The phased stage clocks (Cxx1, Cxx2) described in the Pipeline Control Unit section are shown beneath the instruction flow diagram in Fig. 5.

PUP ELM _

As described above, Fig. 4 shows the principal hardware elements contained in each of the six stages of the instruction and execution pipelines. In the embodiment of Fig. 2, several of the stages include elements which are time-multiplexed resources within the pipelines. These elements are shown with identical reference designations in the various stages of the Fig. 4 configuration.

For a single machine instruction passing through the pipeline stages, the processing occurring within the IF stage is confined to hardware on the SPC 9. During the first phase of the IF stage, the contents of the look-ahead program counter 27,33 are gated through the SPC's address selector 28,39 and loaded into the address registers 44,40 with clock pulse CIF1.
During the second phase, 32 bits of instruction data are retrieved from cache 41 and loaded into the cache data register 42 with clock pulse CIF2, which terminates the IF stage. The STLB 45 is also accessed during the second phase, loading a mapped physical memory address into register BPMA 46 for possible use in the event data is not contained in cache 41. The branch cache 34 is also checked during the IF stage.
As described below in conjunction with Fig. 11, based on the information contained, the look-ahead program counter 27,33 is either loaded with a new target address or incremented.

During the first phase of the ID stage, the instruction data held in the cache data register 42 is passed through selectors 47,43 on the SPC 9, ensuring that the opcode for the instruction at the current program counter value is presented on bus 63. The thirty-two bits of instruction data are passed on buses 62,63 to the opcode latches and selectors 80,81 on the IPP 8; this data is retained on the IPP 8 by clock pulse CID1. During the later phase of the ID stage, opcode information is used to access the microcode entry point for the instruction from the decode net 82, which is loaded into register LEA 84 with clock pulse CID2. Also during the second phase, registers required for memory address generation are accessed from register file AGRF 72 and stored in register BXR 73 with clock pulse CID2. Finally, the displacement required for address generation is transferred from the instruction latches and selectors 80,81,207 and loaded into the pipeline displacement register DISP 83 through selector 209 with clock pulse CID2.

Summarizing, at the end of the ID stage, information for the CF stage and AG stage has been stored in pipeline registers; the machine instruction processing then simultaneously moves into the last (AG) stage of the Instruction Pipeline and the first (CF) stage of the Execution Pipeline.

During the AG stage, the IPP 8 computes the effective address of the memory operand (assuming the instruction being processed requires a memory reference) and loads that address into the address registers on the SPC 9. The operation commences with a selector 74 choosing either the output of register BXR 73, which contains the contents of the appropriate registers accessed during the ID stage, or BDR 71, which contains an updated value of a register (as described in detail below with respect to register bypassing in the IPP section). The first ALU 75 then adds the base register and index register as specified by the instruction and feeds the value into the second ALU 76, where it is combined with the displacement offset from register DISP 83. The resulting operand address is passed through selectors 86,78 and sent to the SPC 9 on buses 49,57. Selectors 28,39 on the SPC 9 gate the address to the cache 41 and STLB 45 through address registers 44,40, which are loaded with clock pulse CAG2. A copy of this address is also stored in the IPP 8 in registers EAS 85,77 for later use if the particular machine instruction requires multiple microcode execution cycles.
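
The address-generation data path described above (selector 74, first ALU 75, second ALU 76) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the circuit itself: the function name, the argument names, and the 16-bit wraparound of the low address word are hypothetical choices made for the example, and only the base register is shown being bypassed.

```python
WORD_MASK = 0xFFFF  # assumption: the low word of an address is 16 bits wide


def effective_address(base, index, displacement, bypass_value=None, use_index=True):
    """Hypothetical model of the two-ALU address path.

    base          32-bit base or general register value read during ID
    index         index register value
    displacement  displacement taken from the instruction
    bypass_value  freshly written register value; when given it replaces
                  `base`, modelling the choice made by selector 74
    use_index     False when the instruction does not specify indexing
    """
    source = base if bypass_value is None else bypass_value  # selector 74
    high = (source >> 16) & WORD_MASK
    low = source & WORD_MASK
    # First ALU 75: low word plus index (assumed to wrap at 16 bits).
    indexed = (low + (index if use_index else 0)) & WORD_MASK
    # Second ALU 76: add the displacement to the indexed value.
    offset = (indexed + displacement) & WORD_MASK
    return (high << 16) | offset
```

For instance, `effective_address(0x00012000, 0x0010, 0x0004)` combines the three inputs into a single operand address, and passing `bypass_value` substitutes an updated register value for the stale copy.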

The CF stage performs the access and distribution of the micro-control store word used for algorithmic control to all hardware units. In the case of a machine-level instruction, the entry point from the ID stage is chosen by the selector 103 and presented to the micro-store 104. The output of the microstore is driven to all required hardware units through buffer 105 and loaded into a plurality of control word registers 215,65,216,145 with clock pulse CCF2, which marks the end of the CF stage. Also at the end of the stage, the current microstore address is loaded into the holding register RICH 106 with clock pulse CCF2.

At the end of the AG and CF stage operations, which have occurred in parallel for a machine instruction about to begin execution, all addressing and control information has been stored in registers clocked by CCF2 and CAG2. The OF stage 6 operation, which follows the AG and CF stage operations, has two well-marked phases. During the first phase, cache 41 and STLB 45 on the SPC 9 are accessed for the operand fetch. (Note that the system cache 41 is accessed by the OF stage 6 during the first phase of operation and, as noted above, by the IF stage 2 during the second phase of operation. This sharing of system cache is a significant advantage.) Thirty-two bits of operand data are loaded into the cache data register 42, which is clocked with COF1. The STLB 45 is also accessed during the first clock phase, and loads a mapped physical memory address into register BPMA 46 with the occurrence of clock pulse COF1. The memory address stored in BPMA 46 is for possible use in the event data is not contained in cache 41. Still during the first phase, the register file 130, if the micro-control store word so specifies, is also accessed. The register file operand output is loaded into register RI 129, also clocked at COF1.

During the second phase of operation in the OF stage, memory data from cache is passed through selectors 47,43 on the SPC 9 to EX1 10 over buses 62,63, passed through selector 117, and finally is gated to the B leg of the 48-bit ALU 118. This data is latched with clock pulse COF2 to maintain the pipelining in registers OP 116,123. Also during the second phase, register file data from RI 129 is gated through selector 125 and presented to the A leg of the ALU 118.
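
The phase-by-phase sharing of cache 41 noted in the parenthetical above can be pictured as a fixed two-phase schedule. The following is a minimal sketch, assuming one stage time consists of exactly two phases; the function names and the string labels for the two users are hypothetical.

```python
def cache_user(phase):
    """Return which pipeline stage addresses cache 41 in a given clock phase."""
    if phase == 1:
        return "OF"  # operand fetch uses the first phase
    if phase == 2:
        return "IF"  # instruction fetch uses the second phase
    raise ValueError("each stage time is assumed to have exactly two phases")


def schedule(stage_times):
    """List (stage_time, phase, user) tuples for a run of stage times."""
    return [
        (t, phase, cache_user(phase))
        for t in range(stage_times)
        for phase in (1, 2)
    ]
```

Because the two users never claim the same phase, a single cache serves both pipelines without arbitration in this simplified picture.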

The ALU 118 operation completes during the first phase of the EX stage; ALU data is passed through selectors 119,121 for post-processing, including shifting, and loaded into registers RD 122 and RS 126 with clock pulse CEX1. Finally, during the last phase of the pipeline, results of the calculation stored in register RS 126 are written into register file 130, if so specified by the micro-store control word, and into register BDR 71, clocked at CEX2. Register BDR 71 makes an updated location available to hardware in the ID stage for updating register file AGRF 72 and for bypassing AGRF 72 in calculating an operand address in the AG stage through selector 74.

In certain cases a particular machine instruction will require more than one cycle in the execution pipeline (EP). In such a case, the PCU 1 will stop providing clock enables to the instruction pipeline (IP), but continue to cycle the three stages in the EP. The micro-store 104 permits any general purpose algorithm to execute within the EP. Results computed in the OF and EX stages and loaded into registers RD 122 and RS 126 with clock pulse CEX1 can be fed back into the ALU 118 via the ALU selectors 117,125, thus enabling data manipulation in successive execution cycles to also be pipelined. In the event that an execution cycle references a register written in the previous cycle, the value in register RS 126, which will be written into the register file 130 during the last phase of the EX stage, can bypass register RI 129, normally used to read register file data, and be presented directly to selector 125 and thence to the ALU 118.

Shared Program Cache

The Shared Program Cache 9 in Fig. 7 includes the high speed cache memory 41 for instructions and operands, the segment table look-aside buffer (STLB) 45 for retrieving recently used mapped physical memory addresses, and the branch cache 34 used to predict the flow of conditional machine instructions as they are fetched from cache. Also shown are pipeline address and data registers used in conjunction with the storage elements.

In operation, the SPC 9 operates under the general control of enables from PCU 1 and, during the OF stage, also under the general control of microcode stored in MCS 12, which has been transferred by way of RCC bus 64 to RUM register 65. Selectors 28,39 determine the source for main SPC address busses 53,59, which load address registers 40,44, which in turn directly address the cache 41 and STLB 45. Also loaded from the main address buses 53,59 are backup address registers ERMAH, ERMAL 30,37 for operand addresses and PROWL 36 for the low side of the program counter. Backup address registers 30,37 provide backup storage of the cache and STLB addresses for use when the contents of the registers 40,44 (which directly access cache 41 and STLB 45) are overwritten with new addresses prior to detection of a cache miss or memory exception.

There are four sources of addresses for accessing the cache and STLB storage elements: (i) registers IRPH 27 and IRPL 33, which contain the look-ahead program counter used for prefetching instructions, (ii) buses BEMAH 49 and BEMAL 57, which transfer effective addresses generated in the IPP 8, (iii) buses BDH 50 and BDL 54 through buffers 26,31, which transfer addresses from EX2 11 during multiple microcode sequences, and (iv) buses 51 and 56, which are used to restore addresses from the program counter backup registers 27,36 or operand address backup registers 30,37 previously used in the event of cache misses or memory exception conditions. Thirty-two bits of information from cache 41 are stored in a data register 42 and gated to selectors 43,47, from which data is driven to EX1 10 and instructions are sent to the IPP 8 over buses BBH and BBL 63,62.

In the event of cache misses or explicit main memory requests, virtually mapped physical addresses from the STLB 45 or absolute addresses from the backup registers 27,30 and 36,37 are gated to selector 46 and stored in the PI register 48. The physical memory address is then fed through selector 47, gated onto BBH, BBL 63,62 and transferred to the main memory subsystem. The backup registers 27,36 and 30,37 are also selectively transferred to EX1 10 over buses BBH, BBL 63,62 for fault processing through the appropriate selectors 29,38,47,43.

The branch cache 34 permits non-sequential instruction prefetching based on past occurrences of branching. Briefly, the branch cache 34 is addressed by the low side of the look-ahead program counter IRPL 33; the output from that operation consists of control information indicating whether or not to reload IRPL 33 with a new target address on bus 55 through selector 32. As described in detail below, the information in the branch cache 34 is maintained by the execution hardware and is updated along with IRPL 33 by way of bus BDL 54 whenever it is determined (in IPP 8) that incorrect prefetching has occurred. In the event the branch cache 34 does not indicate that the prefetch flow should be altered, program counter IRPL 33 is then incremented. When the branch cache 34 does alter program flow, the new contents of IRPL 33 are gated onto bus BEMAL 57 by way of buffer 35 and sent to the IPP 8 for variable branch target validation.

Instruction Preprocessor

The Instruction Pre-Processor (IPP) 8 shown in Fig. 8 includes instruction alignment logic, decoding hardware, arithmetic units for address generation, and registers for preserving addresses transferred to the SPC 9. The input logic of the IPP 8 is adapted to process one- and two-word instruction formats and to accommodate the instruction fetching in the SPC 9, which is always aligned on an even two-word boundary. In either instruction format, the first word always contains the opcode and addressing information; for one-word instructions the displacement for address offset is also contained in the same word; for two-word instructions the displacement is contained in the second word.

In instruction prefetching operation the IPP 8 operates under the control of the enables received from PCU 1; during processing of multiple execution cycles, registers are updated and manipulated under the general control of microcode stored in MCS 12, which has been transferred by way of RCC bus 64 to RUM register 215. The SPC 9 transfers two words of instruction information to the IPP 8 over buses BBH 63 and BBL 62. The two words of instruction data presented to the IPP 8 can be various combinations, such as two one-word instructions, an aligned (even boundary) two-word instruction, or the second word of a two-word instruction and the next one-word instruction. The SPC 9 gates the opcode of the instruction associated with the current value of the program counter IRPL 33 onto BBH 63, where it passes through the OPAL 80 selector latch for immediate processing.

The contents of BBL 62 are stored in register IRE 81; depending on whether this second word contains an opcode or a displacement, the contents of IRE 81 are gated by way of bus 94 to the OPAL 80 latch or to the selector 209. The output of the OPAL 80 latch is transferred by way of bus 93 to the decode net 82, the opcode register OPCR 207, the address inputs of register file AGRF 72 and the register bypass blocks (including collision prediction logic 208 and collision detection logic 211). The decode net 82 provides control information for continuing the pre-processing of the instruction and also provides a micro-control store entry point, which is stored in the LEA register 84 and subsequently driven to the MCS 12 over the bus LEA 91. The register bypass blocks are described in detail below.

Information decoded from the instruction governs if and how the operand address should be formed. Depending on whether an instruction contains one or two words, the selector 209 chooses either OPCR 207 on bus 203 or the IRE 81 on bus 94. If the instruction in stage IF is two words and unaligned, its displacement does not arrive from the SPC 9 until it has proceeded to stage ID. In this case, the DISP selector latch 83 selects a displacement value directly from bus BBL 62. Otherwise, latch 83 selects a displacement value from selector 209. The displacement value from latch 83 is coupled by way of bus 92 to the B-leg of ALU 76.

The IPP includes the register file AGRF 72, which contains copies of all registers used in address calculation. The AGRF 72 can simultaneously access 32-bit base or general registers and 16-bit index registers, transferring them into base and index pipeline register 73. The true contents of these registers are maintained by the EX2 11 board in the execution unit, and any changes to the registers do not occur until the EX stage of the execution pipeline. At the completion of stage EX, updated register contents are sent over BDH 50 and BDL 54 and through buffer 210 and are loaded into the bus D register BDR 71. The output bus 87 from BDR 71 distributes the contents of that register to the AGRF 72 (for updating register copies) and to the selector 74 (for register bypassing, as described in detail below in conjunction with Fig. 12).

The collision detection logic 211 compares the AGRF 72 address (as decoded from the instruction in stage ID) to the address used by EX2 11 (as received by the IPP over bus BIT 204) to write its register file. If the collision detection logic 211 determines that EX2 11 has updated a base, index or general register which matches the one just loaded from AGRF 72 into BXR 73, logic 211 selects the new register value held in BDR 71 in place of the output of BXR 73 by controlling selector 74.
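
The comparison made by collision detection logic 211 amounts to an address match followed by a two-way select. A minimal sketch follows, assuming register addresses are small integers and that "no write this cycle" is represented by `None`; the function and parameter names are hypothetical.

```python
def select_register_value(read_addr, file_copy_value, write_addr, updated_value):
    """Model the choice made at selector 74.

    read_addr        register address decoded from the instruction in stage ID
    file_copy_value  value just read from the register-file copy (AGRF 72)
    write_addr       address the execution unit wrote, or None if it wrote nothing
    updated_value    newly written value held in the bus D register 71
    """
    collision = write_addr is not None and write_addr == read_addr
    # On a collision the stale copy is bypassed in favour of the new value.
    return updated_value if collision else file_copy_value
```

When the addresses differ, or when nothing was written, the value read from the register-file copy is used unchanged.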

Collision prediction logic 208 predicts possible collisions between instructions which are one stage apart in the instruction pipeline (IP) by comparing the address being read from the AGRF 72 with a "guess" of a written address derived from bus 203. If a possible collision is discovered, the PCU 1 is notified to separate the two instructions by one additional stage time so that the collision detection logic 211 can determine whether a problem actually exists. This technique of register bypassing is described more fully below.

As described fully below, selector 74 selectively gates the high word of the base or general register (as fetched from the AGRF 72) over bus 89 to selectors 212 and 86. The low word of the base or general register and the index register value (on bus 96) are added together in the indexing ALU 75 if this operation is specified by the instruction. The displacement ALU 76 adds the result from the indexing ALU 75 to the displacement transferred from DISP 83 on bus 92. The result from ALU 76 is transferred on bus 90 to selectors 78 and 213 and to the branch cache validation logic 214.

The branch cache validation logic 214 compares the computed branch address on bus 90 to the predicted address from the branch cache 34 sent from the SPC 9 over bus BEMAL 57.

The effective address source registers (EASH 85 and EASL 77) and effective address destination registers (EADH 205 and EADL 206) function as two 32-bit memory address pointers, the low words of which (i.e. EASL 77 and EADL 206) are counters. EADH 205 and EADL 206 are loaded from bus 200. EASH 85 and EASL 77 are loaded from selector 212 over bus 201 and selector 213 over bus 202, respectively. Busses BBH 63 and BBL 62 are coupled to the outputs of selectors 86 and 78, respectively, and provide general register and immediate operands to EX1 10. Busses BEMAH 49 and BEMAL 57 are similarly coupled to the outputs of selectors 86 and 78, respectively, and provide memory addresses to the SPC 9 for referencing cache 41 and STLB 45. Data on busses 89 and 90 are transferred over busses BEMAH, BEMAL 49,57 during stage AG of the IP by selectors 86 and 78. During microcode-controlled memory accesses, either EAS 85,77 or EAD 205,206 can be selected. Either EAS 85,77 or EAD 205,206 can also be selected onto busses 63,62 by selectors 86 and 78.

Micro-Control Store

The micro-control store unit 12 of Fig. 9 includes microcode storage 104, the next microcode address selector 103, the RBPA register 102, the present micro-address register RICH 106, the microcode stack 107, and the buffers 105 for driving new control bits (RCC's) by way of bus 64 to all boards. The microstore 104 can be selectively loaded to contain 80-bit microcode words as provided over bus 108 from the BDH bus 50 by way of buffer 101. Of the 80 bits in each microcode word, 8 bits are directed to parity checking network 66, and the remaining 72 bits are transferred to the IPP 8, SPC 9, EX1 10 and EX2 11 for algorithmic control during execution cycles.

The microstore 104 and RICH 106 are addressed by way of bus 109. Bus 109 is driven by selector 103, which selects among the various sources for generating next addresses. These sources include the RBPA register 102 (which is used during microcode loads), the LEA bus 91 (which provides decode addresses from the IPP 8), the jump address signals from PA bus 111 (which provide conditional sequencing information from EX1 10), the output bus 112 from RICH 106 (which contains the present micro-address), and bus 113 from the output of the microcode stack 107. This stack 107 holds addresses which are used to return from a microcode subroutine or from a microcode fault or exception sequence. The stack 107 can contain up to 16 addresses at once in order to handle cases such as subroutine calls within subroutines. The 72-bit control output bus 110 of the microstore 104 is driven by way of buffers 105 over the RCC bus 64 to units 8-11 to provide microcode control of those units.
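
The next-address selection performed by selector 103 and the bounded return stack 107 can be sketched as follows. This is an illustrative model only: the source labels ("load", "decode", "jump", "present", "stack"), the class and function names, and the overflow behaviour are assumptions made for the example, and the text does not say which address is pushed on a subroutine call.

```python
class MicroStack:
    """Return-address stack limited to 16 entries, as stated for stack 107."""

    DEPTH = 16

    def __init__(self):
        self._items = []

    def push(self, address):
        if len(self._items) >= self.DEPTH:
            # Assumption: exceeding the depth is treated as an error here.
            raise OverflowError("microcode stack holds at most 16 addresses")
        self._items.append(address)

    def pop(self):
        return self._items.pop()

    def __len__(self):
        return len(self._items)


def next_micro_address(select, sources, stack):
    """Pick the next microstore address from one of the five sources.

    sources maps the hypothetical labels "load", "decode", "jump" and
    "present" to candidate addresses; "stack" pops a return address.
    """
    if select == "stack":
        return stack.pop()
    return sources[select]
```

A nested call then looks like two pushes followed by two "stack" selections, which return the saved addresses in last-in, first-out order.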

Execution 1 and Execution 2

The execution unit of the present embodiment performs the data manipulation and write-storage portions of all instructions which proceed through the dual pipeline. Among the data types supported by this execution unit are:

1. 16- and 32-bit fixed point binary
2. 24-bit fraction/8-bit exponent floating point (single precision)
3. 48-bit fraction/16-bit exponent floating point (double precision)
4. 96-bit fraction/16-bit exponent floating point (quad precision)
5. Varying length 8-bit character strings
6. Varying length 4- or 8-bit decimal digit strings

In the present embodiment the execution unit is located on two boards: EX1 10 and EX2 11. The execution unit operates under the control of microcode stored on the MCS 12. The microcode control bits are loaded into the RUM register 145 from bus 64. The execution portion of a machine instruction may require one or many micro-instructions to complete. A new micro-instruction is fetched from the MCS 12 for each new data manipulation performed by EX1 10 and EX2 11.

The execution unit includes the general purpose 48-bit ALU 118 with an A-leg input and a B-leg input, selectors 117,125 for choosing among a plurality of operands for input to either the A- or B-leg, a selector 121 for supporting operations on various data types, decimal and character string processing support networks 119,120,131, registers RS 126 and RD 122 for temporary data storage, a register file 130 and multiply hardware 133,146,147.

In the present embodiment the ALU 118 is adapted to operate on data types up to 48 bits wide and provides a plurality of arithmetic and logical modes. Arithmetic modes include both binary and binary coded decimal types. The ALU 118 operates in concert with shift/rotate network 119 and decimal network 120 to adaptively reconfigure in a manner permitting processing of the various data types which must be processed.

The register file 130 supports separate read (source) and write (destination) addresses for the instruction. The file 130 is 256 locations deep and generally operates as a 32-bit wide file. In floating point arithmetic, field address register manipulation and certain other special cases, it supports a full 48-bit data path. An RF source decode 303 generates addresses for reading the register file 130 during the first phase of the OF stage, while the RF destination decode 304 generates addresses for writing to the file 130 during the second phase of the EX stage. The RF destination decode 304 also transfers register update information to the collision detection logic 211 on the IPP 8 via bus BIT 204. Selector 307 chooses between read and write addresses and sends those addresses to the register file 130.

The multiply hardware 133 consists of a 48-bit combination carry propagate/carry save adder. This adder 133 is combined with the sum register 146 and the carry register 147 to perform multiplications by a shift and add technique. Each iteration of the multiply hardware 133 processes two bits of operand and generates two bits of sum and one bit of carry. The carry bit between the two sum bits is allowed to propagate.
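
A shift and add multiply that retires two operand bits per iteration can be sketched arithmetically as below. This models only the arithmetic, not the carry save structure of adder 133 or the sum and carry registers 146,147; the function name and the default 16-bit operand width are assumptions made for the example.

```python
def multiply_two_bits_per_step(multiplicand, multiplier, width=16):
    """Unsigned shift-and-add multiply consuming two multiplier bits per step.

    Returns (product, steps), where steps is the number of iterations used.
    """
    if multiplicand < 0 or multiplier < 0:
        raise ValueError("this sketch handles unsigned operands only")
    if multiplier >> width:
        raise ValueError("multiplier does not fit in the given width")
    product = 0
    steps = 0
    for shift in range(0, width, 2):
        digit = (multiplier >> shift) & 0b11        # next two operand bits
        product += (multiplicand * digit) << shift  # add the partial product
        steps += 1
    return product, steps
```

Handling two bits at a time halves the iteration count relative to a one-bit-per-step loop, which is the motivation for processing the operand in pairs of bits.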

Busses BBH 63 and BBL 62 supply to the execution unit either a memory operand from the SPC 9 or a register or immediate operand from the IPP 8. This operand is latched in OPH 116 and OPL 123, which in turn feed the ALU selector 117 by way of busses 134 and 144, respectively. When the operand supplied over BBH 63 and BBL 62 is an unpacked 8-bit decimal digit data type, the decimal support logic 131 converts it to the corresponding packed (4-bit) decimal data type. The selector 117 selects from the destination register RD 122, OPH 116 and OPL 123 to drive the bus 135, which in turn feeds the B-leg of the main ALU 118. The A-leg selector 125 selects from among the input register RI 124 (which contains operands read from the register file 130), the shifter-register RS 126, the sum bits bus 140 and carry bits bus 141 output from the multiply hardware 133, the bus 132 (from the low word of the program counter RP 128), and the timer output to drive the 48-bit A-leg ALU bus 143. The timer has two general purpose counting registers used for operating system and performance evaluation support.

Program counter RP 128 is a 16-bit counter which can increment either by one or two depending on the length of the instruction currently in the execution pipeline. If a jump or branch type of instruction is being processed, RP 128 may be loaded. This load occurs conditionally depending on whether the program is actually switching to a new non-sequential address and whether this change of flow was successfully predicted by the branch cache 34 in the SPC 9. As described below, status about the branch cache's prediction associated with the instruction currently in the execution unit is passed to EX1 10 by the IPP 8.

In operation, the ALU 118 processes the data on busses 135 and 143 and the result is placed on bus 136. Bus 136 is coupled to the jump condition generation logic 300, which supplies microcode branching bits for loading into the JC REG 301. The contents of the JC REG 301 can affect the formation of the next microcode address either in the micro-instruction which loads it or in the one which immediately follows it. The control is effected by microcode control of the overlap of the OF stage of one instruction with the CF stage of the next one. Selector 302 chooses among a plurality of jump conditions to produce jump address signals which are transferred by way of JA bus 315 to the MCS 12.

Character byte rotation and floating point shifting are performed by the shift/rotate hardware of shift/rotate network 119. Additional decimal digit processing, including unpack (convert 4-bit to 8-bit) and nibble rotate, is performed by network 120. The selector 121 chooses among its various sources depending on the data manipulation being performed. Selector 121 drives bus 137, which in turn loads RD 122, RS 126 and RP 128. This bus can also be coupled to busses BDH 50 and BDL 54 by the selector 127. The output bus 138 of RS 126 is selected onto BDH bus 50 and BDL bus 54 by the selector 127 in order to provide update information to the IPP 8 when an instruction completes execution which has modified a register which has a copy in the IPP 8. The output of RS 126 is also used to provide write data for the register file 130, to provide one of the operands to the multiply hardware 133 and as an input to the selector 125.

As described fully below, the use of RS 126 as an input to selector 125 is primarily for register bypassing. The register bypass logic 305 compares the register file source address (from source decode 303) for the instruction in stage OF to the register file destination address (from destination decode 304) for the instruction in stage EX of the execution pipeline. If a match is detected, the contents of RS 126 on bus 138, which contains the data to be written into the register file 130, are selected by 125 (in place of the data read into RI 124 from the register file 130).

BRANCH CACHE

The branch cache network is shown in Fig. 11.
In the present embodiment, as shown in Fig. 11, portions of this network are located on units 8-11. The branch cache network is adapted to permit predictions of non-sequential program flow following a given instruction prior to a determination that the instruction is capable of modifying instruction flow.
Moreover, the branch cache network does not require computation of the branch address before the instruction prefetching can continue. Generally the branch cache network makes predictions based solely on the previous instruction locations, thereby avoiding the wait for decode of the current instruction before proceeding with prefetch of the next instruction. Thus the branch address need not be calculated before prefetching can proceed, since target addresses are stored along with predictions.

In particular, the design of the flow prediction hardware accommodates alterations to the flow of instructions (i.e. branches) without requiring any more time than the simple sequential flow of instructions (i.e. incrementation of the look-ahead program counter). Thus, extra cycles are not required when a discontinuity is encountered in the flow of instructions. This continuation of normal operation results because the branch prediction logic bases its decisions solely on the current look-ahead program counter value (IRPL 33). The logic does not wait for the instruction to be decoded by the ID and AG stages. This structure permits decisions to be made in one pipeline cycle and thus effect changes to the instruction flow very rapidly. Thus the flow-redirecting instruction need not be decoded as a branch before instructions are fetched from the branch target.

Referring to Figure 11, the look-ahead program counter IRPL 33 holds the low order 16 bits of the virtual address of the next instruction to be read from the system cache 41. At the same time as this instruction is being transferred over BBL and BBH 62,63 to be decoded by the instruction decode (ID) stage, the branch cache 34 predicts whether the instruction flow should be diverted. If there is no predicted diversion, IRPL 33 simply increments by two. If a diversion is predicted, the output of the branch cache is loaded into IRPL via the selector 32. It is key that the branch prediction is made by the IF stage only, and without any knowledge of the nature of the instruction just fetched (e.g. whether it is a jump or conditional branch instruction). This is especially valuable in a complex instruction architecture where instruction decode is a complex task. The branch decision is made at the same time that the transfer of the instruction to the ID stage completes, and before the ID stage has even begun to decode the instruction.
The look-ahead program counter IRPL loads the redirected value at the same time as it would have done the next increment. This shows that the redirection (jump) takes no longer than a simple increment. The IF stage need not wait for feedback from the ID stage, informing it that a branch or jump has been fetched and that it should begin to act. (This is too late to avoid extra delays in the IF stage while it reloads the look-ahead program counter, and refills the pipeline with instructions, overwriting the erroneously fetched instructions which sequentially followed the branch.)

Detailed Explanation of Branch Cache Operation

In operation, the network shown in Fig. 3 begins on the SPC 9 with IRPL 33 accessing the branch cache 34 with the same value that is being used to access thirty-two bits of instruction data in the program cache hardware 40,41,42,43. The output of the branch cache 34 includes a prediction bit (TAKEN) (associated with the last word of a particular branch instruction, and which asserts that a branch should be taken), an index (which ensures the entry belongs to the current value of IRPL 33), a 16-bit target address (which will be loaded into IRPL 33 if the control indicates that non-sequential program flow should be followed), and a control line (ODD SIDE) (which indicates which of the two words of instruction data being fetched from the cache 41 a branch directive is associated with). The signal ODD SIDE identifies each entry in the branch cache as being associated with either an odd or even word aligned instruction. In cases where a prediction is made for a two-word branch instruction, the prediction entry is always associated with the second word of the instruction in order to ensure that the second word (which is required for calculating the address specified by the branch instruction) is properly fetched into the pipeline. This is described in greater detail below.
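
The branch cache lookup just described (prediction bit, index check, target address) can be sketched as a small table lookup keyed by the look-ahead program counter. This is a simplified illustration under assumptions: the split of the 16-bit counter into a 7-bit index and a 9-bit row address, the dictionary representation, and all names are hypothetical, and the ODD SIDE field is carried in the entry but not acted on.

```python
from dataclasses import dataclass

INDEX_BITS = 7               # assumption: upper counter bits stored as the index
ROW_BITS = 16 - INDEX_BITS   # assumption: remaining low bits address the cache


@dataclass
class BranchCacheEntry:
    taken: bool      # prediction bit
    index: int       # must match the upper bits of the current counter
    target: int      # 16-bit target address
    odd_side: bool   # which of the two fetched words the entry belongs to


def next_prefetch_address(irpl, branch_cache):
    """Return the next look-ahead counter value for a 16-bit counter `irpl`."""
    row = irpl & ((1 << ROW_BITS) - 1)
    upper = irpl >> ROW_BITS
    entry = branch_cache.get(row)
    if entry is not None and entry.taken and entry.index == upper:
        return entry.target & 0xFFFF   # redirect prefetch to the stored target
    return (irpl + 2) & 0xFFFF         # otherwise step to the next two words
```

Both outcomes are produced by the same single lookup, which mirrors the point that redirection costs no more than an ordinary increment.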

Associating the prediction entry with the second word of the instruction ensures that all words of an instruction have been fetched by the IF stage and have been sent to the ID stage before a branch prediction is made. Thus by associating the flow predictions with the final word of the instruction, the IF stage does not redirect itself before the ID stage has obtained all of the information necessary for correct execution of the instruction.

Referring to Figure 7, "unaligned" two-word branch instructions, rather than being completely contained in one entry, are split across two successive thirty-two bit entries in the system cache 41. Such instructions are sent to the ID stage as portions of two successive transfers over BBL, BBH 62,63 on two successive pipeline cycles. The ID stage employs its bypass paths to bring the two words together and apply them both to the single instruction they represent. Two successive branch cache 34 locations are referenced in the process of obtaining the two words of this type of instruction. If the redirection were associated with the first word of the two-word instruction, the flow of words from the system cache 41 to the ID stage would never include the second word of the instruction, since IRPL would be redirected around it as soon as the branch cache hit was detected on the first word. This would result in incorrect operation, since it is necessary to obtain the second word to compute the address of the target of the branch. In the case of an "aligned" two-word branch (completely contained in one system cache entry), it does not matter which of the two words has associated with it the redirection command, since they both correspond to a single branch cache location and the actions which need to be taken are identical. The association of the redirection with the second word is therefore tailored to the more difficult "unaligned" case.

Other embodiments of the invention which account for unaligned instructions can be implemented. Thus, if the branch cache were to improperly predict a branch on the first word of an unaligned two-word instruction, due to self-modifying code or a variety of other possible special considerations, the situation can be detected by the IF stage, with the help of the special bit used to determine the ODD SIDE signal. Erroneous operation could then be avoided through the use of the erroneous branching avoidance mechanism described below. However, this mechanism is expensive in terms of pipeline cycles, and avoiding the need for it on unaligned branches is advantageous and efficient.

When a branch is predicted, the index and upper bits 1-7 are checked for equality in a comparator. If these values match and the signal TAKEN indicates that the branch should be taken, the signal CHIT is generated, causing the 16-bit target address (BTARG1-16) to be loaded into IRPL 33 via selector 32 rather than the normal operation of incrementing IRPL 33. The SPC 9 always sends the contents of the low side of the look-ahead program counter to the IPP through buffer 35, where it is saved in register 217 for later use in validating the prediction.

Many conditional instructions in the Prime Instruction Set have branch addresses that are capable of being variable.
For example, a conditional instruction could specify a branch to RIP + X, where RIP = the contents of the program counter and X = the value of the index wrester Between the time the branch cache was loaned with a target for a branch instruction and the time the instruction is actually executed, the value of the X
register could change. In view of this possibility, the IMP 8 compares branch targets used for prefetching in the SUP g against the actual calculation of the location that the instruction will branch to if the f~7'7 specified conditions are satisfied. The calculation of the address to which a branch instruction will vector is performed in the same manner as the generation ox an address for a data operand. Therefore the calculation 5 performed in the A stage ox the IT produces the address to which the branch instruction should vector if the specified conditions are jet. This address is eventually passed Jo the EN for use in loading eke program counter RIP on EX2 lo and for use in reloading IRK 33 on the IMP 8 if prefetching has not occurred properly, i.e. the branch cache makes an incorrect pro-diction. The calculated argue it available on bus 90 from the last ALUM 76 used in the A stage. The cowlick-fated target is compared to the value of the program counter (saved in RUG 217), which contains the target prediction from the branch cache that was used to retch the instruction following the branch instruction.
Comparator 219 performs the equality check and India gates Whether or not eke computed target address ox the next instruction matches the target retrieved fry the branch cache OWE If the equality is jet, the signal GOODBRTARG is generated. Control logic 220 receives instruction classification infor~atiorl from decode net 82 and the CHIT signal from the SPY 9 and determines whether or not a branch has occurred on a non-branch instruction. If such a branch has occurred, logic 220 generates the signal BREXCPTN~ Otherwise logic 220 synchronizes the CHIT signal from the SPY 9, passing it along with its associated instruction as BRETT.
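
By way of illustration only, the validation performed by comparator 219 and control logic 220 might be modelled in software as follows. This is a minimal sketch, assuming each signal can be treated as a boolean; the function name `validate_prediction`, the class `BranchSignals`, and all parameter names are hypothetical and do not appear in the disclosure.

```python
from typing import NamedTuple


class BranchSignals(NamedTuple):
    goodbrtarg: bool  # comparator 219: calculated target equals the saved prediction
    brexcptn: bool    # logic 220: a redirection occurred on a non-branch instruction
    brtaken: bool     # logic 220: the cache hit signal passed along with its instruction


def validate_prediction(chit: bool, is_branch_type: bool,
                        saved_target: int, calculated_target: int) -> BranchSignals:
    """Hypothetical model of comparator 219 and control logic 220."""
    # Equality check between the target saved in register 217 and the
    # target calculated in the A stage (bus 90).
    goodbrtarg = saved_target == calculated_target
    # A cache hit on an instruction that is not a branch type is an exception.
    brexcptn = chit and not is_branch_type
    # Otherwise the hit signal simply travels with its instruction.
    brtaken = chit and not brexcptn
    return BranchSignals(goodbrtarg, brexcptn, brtaken)
```

In this sketch a hit on a non-branch instruction raises the exception signal and suppresses the taken-branch signal, which mirrors the either/or behaviour described for logic 220.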

The signals GOODBRTARG, BRTAKEN, and BREXCPTN are transferred to the branch processing hardware 221 in EX1 10 as the branch instruction enters the OE stage. As the branch instruction is executed, a determination of whether or not the branch should occur is loaded into register JAR 301. The output of register JAR 301 together with GOODBRTARG and BRTAKEN are used to generate ~LDRP, which is used to force a load of RIP 128 in EX2 11 in the event the branch cache mechanism correctly predicted that a branch should be taken.

If the instruction flow has been correctly predicted, regardless of the outcome of the branch instruction, the signal CEXCMPL, indicating that no further execution cycles are required in the ES, is available to the PCU 1, which allows the IP to proceed.

As noted above, a branch instruction can be associated with either the first or second word of a stored thirty-two bit instruction. The IF stage and its associated flow prediction hardware deal with thirty-two bit double words exclusively while the ID stage deals with instructions which may be 1, 2 or 3 words in length. The interaction of these stages and their varying requirements affects branch cache operation.

Referring to the ODD SIDE signal generation noted above, discontinuities in instruction flow are associated with specific jump or branch instructions and not directly with a specific thirty-two bit double word location in the branch cache. These instructions can be one or two words in length and may start at either word within a double word cache cell. The control bit stored in the random access memory is used to record which word in a double word cache pair should be considered to be the branch instruction. The ID stage uses this information to assist it in the determination of whether a valid change in instruction flow has occurred, to control the IF stage, and to appropriately redirect its own instruction buffering and alignment functions as follows.

Referring to Figure 7, the IF stage obtains thirty-two bit values from the system cache 41 and delivers them to the ID stage over BBH 63 and BBL 62. The IF stage has no knowledge of the nature of the instructions being supplied; it simply sequences through thirty-two bit values (i.e. double words), either sequentially or as directed by the branch prediction hardware.

Referring to Figure 8, the ID stage receives thirty-two bit data from the IF stage and implements buffering, alignment, and bypassing to handle the various cases of one and two word instructions starting at even and odd word boundaries. These functions are performed using the opcode selector/latch 80, instruction storage register 81, and displacement selector latch 83.

The ID stage buffering function operates, if redirection by the branch prediction logic does not occur, as follows. If a one word instruction arrives on BBH 63 and passes through the opcode selector/latch 80 to be operated on, the word on BBL 62 is stored in IRE 81 while the first instruction is passing through the ID stage. The IF stage is directed to stop fetching double words for one cycle, since it has fetched more instructions than are presently being consumed by the ID and subsequent stages.

Now suppose the word on BBH is a branch or jump instruction with an associated branch prediction. In this case, the ID stage should not perform buffering at IRE 81 and the associated IF stage holdup functions, but should process the branch instruction and then immediately accept the next pair of words placed on BBH and BBL by the IF stage. Further, the word in IRE is discarded, since it represents an instruction which has been bypassed by the program flow redirection.

Another possibility is that the word on BBH represents a one word non-branch instruction and the word on BBL is a one word branch instruction. At the time these words are supplied to the ID stage, this case looks exactly like the case discussed in the second preceding paragraph. In this case, the buffering at IRE 81 and holdup functions should be performed to allow the instruction preceding the branch to finish, and then the branch stored in IRE 81 should be processed.
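
The three situations just described (no prediction, a prediction belonging to the first word, and a prediction belonging to the second word) can be summarised in a small decision function. The sketch below is illustrative only. It assumes that the ODD SIDE bit is true when the prediction belongs to the second word of the double word, and the names `IdAction` and `id_stage_action` are hypothetical.

```python
from enum import Enum


class IdAction(Enum):
    BUFFER_AND_HOLD = "store the second word in the instruction register, hold IF one cycle"
    REDIRECT_NOW = "process the branch, discard the buffered word, accept the next double word"


def id_stage_action(prediction_present: bool, odd_side: bool) -> IdAction:
    """Choose the ID stage behaviour for a one word instruction on the first bus.

    Assumption: odd_side is True when the branch prediction is associated
    with the second word of the double word.
    """
    if prediction_present and not odd_side:
        # The first word is itself the predicted branch: the second word
        # has been bypassed by the redirection and must not be buffered.
        return IdAction.REDIRECT_NOW
    # Either no redirection at all, or the predicted branch is the second
    # word, so the first instruction must finish before the branch is processed.
    return IdAction.BUFFER_AND_HOLD
```

The point of the sketch is that the presence of a prediction alone is ambiguous; only the extra bit separates the second and third cases.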

In the event that the branch cache mechanism has not correctly predicted program flow, further execution cycles in the ES are necessary. Bus JAY 315 transfers the address of the next microstep (from JAR 301), thereby specifying which type of branch cache modification is to be performed.

Modifications may be one of two categories for branch-type instructions, depending on the probability of correct prediction of branches. For both predictable and non-predictable instructions, if the instruction is incorrectly predicted to branch, the branch cache 34 is updated by removing the prediction while permitting the "bad" target address to remain.

If a branch occurs which has not been predicted on an instruction type which is classified as "predictable" (such as a Jump or Branch instruction), the branch cache 34 is updated during the ensuing execution cycles by inserting a prediction and associated target address. The newly inserted target address, which is the calculated address of the branch instruction, is transferred from selector 127 by way of BDL bus 54 to branch cache 34.

Referring now to Figure 11, the operation of adding an instruction redirection to the branch cache works as follows. When the address of the non-predicted branch instruction is loaded into IRPL 33 by the microcode, after detection of a non-predicted branch, bits 8-15 are used to address the appropriate branch cache 34 location, the "TAKE BRANCH" bit is set, the target of the branch is stored, and the index is set to the value of bits 1-7 of the IRPL. In addition, bit 16 of the IRPL is stored in the branch cache (the ODD SIDE signal) to indicate with which of the two possible words the branch prediction is associated. This bit is provided to the ID stage on subsequent transfers of the normally read double word length data (corresponding to this branch cache location) from the IF stage and serves to differentiate between the two cases described above. In this manner the ID stage can decide between the two possible courses of action.
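
The field assignments described above can be made concrete with a short sketch. It is a software analogy rather than a description of the hardware, and it rests on two stated assumptions: that bit 1 denotes the most significant bit of the 16 bit IRPL, and that the eight address bits 8-15 imply 256 branch cache locations. The names `field`, `Entry`, and `BranchCache` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


def field(value: int, first: int, last: int, width: int = 16) -> int:
    """Extract bits first..last, counting bit 1 as the most significant bit (assumed)."""
    shift = width - last
    mask = (1 << (last - first + 1)) - 1
    return (value >> shift) & mask


@dataclass
class Entry:
    index: int = 0             # IRPL bits 1-7, compared on lookup
    take_branch: bool = False  # the "TAKE BRANCH" bit
    odd_side: bool = False     # IRPL bit 16
    target: int = 0            # 16 bit target address


class BranchCache:
    def __init__(self) -> None:
        # Bits 8-15 form an 8 bit address, hence 256 locations (inferred).
        self.entries: List[Entry] = [Entry() for _ in range(256)]

    def add_redirection(self, irpl: int, target: int) -> None:
        """Record a prediction for the branch whose address is in IRPL."""
        self.entries[field(irpl, 8, 15)] = Entry(
            index=field(irpl, 1, 7),
            take_branch=True,
            odd_side=bool(field(irpl, 16, 16)),
            target=target & 0xFFFF,
        )

    def lookup(self, irpl: int) -> Optional[Entry]:
        """Return the entry on a hit, or None when no redirection applies."""
        entry = self.entries[field(irpl, 8, 15)]
        if entry.take_branch and entry.index == field(irpl, 1, 7):
            return entry
        return None
```

Because the lookup ignores bit 16, both words of a double word select the same location, and the stored ODD SIDE bit is what tells the ID stage which of the two words carries the prediction.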

When the branch is correctly predicted, but the target address does not match the calculated target address, the prediction remains in the branch cache 34 but a new target address (corresponding to the calculated address) is inserted.

If a branch occurs which has not been predicted for instruction types which are not classified as "predictable" (such as Skip), no updating is made in the branch cache 34.

When a branch is incorrectly predicted for an instruction which is not a branch-type instruction, the signal BREXCPTN forces execution of a microcode routine, not associated with any particular instruction, which removes the incorrect prediction. In all cases of an incorrect prediction, the look-ahead program counter IRPL 33 is reloaded and the PCU 1 is notified to flush the pipeline.
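
The branch cache update rules set out in the preceding paragraphs can be collected into a single decision function. The following sketch is illustrative only; the function name, parameter names, and returned strings are hypothetical. It also assumes that a taken branch for which no prediction existed counts as incorrectly predicted flow and therefore requires a flush, which the text implies but does not state in those words.

```python
from typing import Tuple


def branch_cache_update(is_branch_type: bool, predictable: bool,
                        predicted: bool, taken: bool,
                        target_matches: bool) -> Tuple[str, bool]:
    """Return (branch cache action, whether the pipeline is flushed)."""
    if predicted and not is_branch_type:
        # Redirection on a non-branch instruction: exception routine.
        return "remove prediction (BREXCPTN routine)", True
    if predicted and not taken:
        # Incorrectly predicted to branch: the "bad" target is left in place.
        return "remove prediction, leave target", True
    if predicted and taken:
        if target_matches:
            return "none", False
        return "keep prediction, insert new target", True
    if taken:
        # A branch occurred without any prediction (flush is an assumption).
        if predictable:
            return "insert prediction and target", True
        return "none", True
    # Not predicted and not taken: flow was correct.
    return "none", False
```

For example, an unpredicted Skip that branches leaves the cache untouched, whereas an unpredicted Jump inserts a new prediction.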

An incorrect branch can occur because the branch prediction device supplies only a prediction and does not wait for instruction decode to make its determinations. A redirection cannot be detected as incorrect until such time as the instruction has been completely decoded by the ID and A stages, and has actually commenced execution in the OE stage. At this point, the pipeline control hardware traps the microcode to a special routine which locates and removes the erroneous entry as described above, and re-initializes the pipeline so that the undesired redirection is eliminated.

In particular, referring again to Fig. 11 and Fig. 1A, the IF stage 2 makes its branch decisions autonomously. The IF stage then informs the ID and A stages 3, 4 of its determination simultaneous to the delivery of instructions from the IF stage to the ID stage. The ID stage decodes the instruction and also records the branch determination. During the time that the A stage prepares the effective address, the A stage decides whether it is acceptable to allow the instruction to proceed through microcode execution in the OE and ES stages 6, 7. The microcode for non-branch instructions is not prepared to handle the possibility of an instruction redirection. If the A stage determines that this situation has occurred, it prevents the instruction from proceeding to the ES stage, and instead directs the CF stage 5 to transfer control to a special microcode routine which corrects the problem. This operation is carried out as follows.

The microcode obtains the true program counter (maintained by the ES stage) and transfers it over BDL 54 through buffer 31 and selector 32 to IRPL 33. (The current value of IRPL is useless, because it reflects the redirection erroneously taken.) The contents of the appropriate location in the branch cache 34 addressed by the IRPL, now reflecting the original count when the erroneous decision was made, is invalidated (by the microcode writing a zero into the "TAKE BRANCH" bit stored with the data). This ensures that the branch cache will no longer make the erroneous prediction. The microcode then directs the pipeline control unit to refill the pipeline with correctly fetched instructions.
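
The recovery sequence just described (reload the look-ahead counter, clear the prediction, refill) may be sketched as follows. This is an illustrative model only: the function name and the dictionary keys are hypothetical, the "TAKE BRANCH" bits are represented as a plain list of 256 booleans, and bit 1 of the 16 bit IRPL is assumed to be the most significant bit, so that bits 8-15 select the location.

```python
from typing import Dict, List


def recover_from_bad_redirection(take_branch_bits: List[bool],
                                 true_pc: int) -> Dict[str, object]:
    """Hypothetical model of the special microcode recovery routine."""
    # Reload the look-ahead program counter with the true program counter.
    irpl = true_pc & 0xFFFF
    # IRPL bits 8-15 (bit 1 taken as the MSB) address the branch cache.
    location = (irpl >> 1) & 0xFF
    # Write a zero into the "TAKE BRANCH" bit; the stored target is untouched.
    take_branch_bits[location] = False
    # Finally the pipeline control unit is asked to refill the pipeline.
    return {"irpl": irpl, "invalidated": location, "refill_pipeline": True}
```

Only the prediction bit is cleared in this sketch, consistent with the statement that invalidation is performed by writing a zero into that bit.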

REGISTER BYPASS

The register bypass network is shown in detailed form in Fig. 12. In the present embodiment, the register bypass network is located principally on IPU 8. In the present pipelined system, simultaneous access to certain registers is often required by two or more different stages of the pipelines. For example, many instructions require prefetching of certain registers early in the pipeline sequence so that they may be used in the generation of data (operand) addresses for accessing the program storage. Other instructions require prefetching of a register value which is used directly as an operand. Register values used for generating addresses, or directly as operands, are typically modified by execution stages placed late in the pipeline.

In this type of processor, instruction "collisions" may occur when two instructions, one prefetching a register and one writing it, are too close to each other in the instruction flow. In this situation, the write which happens in a late stage may not actually be done until later in time than the prefetch read, even though the writing instruction comes before the reading one in the program.

The register bypass network accommodates hardware which handles collisions between an instruction reading a register in an operand prefetch stage of a pipeline and another instruction modifying the same register in an execution stage which may be employed to modify many registers during one instruction through repeated execution cycles. The register bypass network further accommodates different types of collision using variations of bypassing techniques. If a collision occurs on instructions which are well separated, a bypass selector and associated storage for saving the bypass value are sufficient, together with address comparison hardware. As the two instructions move closer together and the prefetched register is being used to form an operand address, the pipeline control unit PCU 1 forces separation of the instructions; however, this separation only occurs if a collision is either detected or at least predicted. The register bypass further provides for routing bypass data back to different stages of the pipeline, depending on the relative separation, in cases where register prefetching is occurring on behalf of register operands rather than register-related operand address formation.

In the register bypass network of Fig. 12, a pair of registers are fetched for each memory referencing instruction. These registers are termed "base register" and "index register" and are shown as AGO 72 in Fig. 8. The base and index registers are added together by ALU 75 in the A stage of the instruction fetch pipeline, thence added to a displacement resulting in an operand address.
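
As a minimal arithmetic sketch of the two step address formation described above: the function name `operand_address` is hypothetical, and the `width` parameter is an assumption, since the text does not state the width of the operand address.

```python
def operand_address(base: int, index: int, displacement: int,
                    width: int = 32) -> int:
    """Form an operand address in two additions (width is an assumed parameter)."""
    mask = (1 << width) - 1
    # First addition: base register plus index register.
    partial = (base + index) & mask
    # Second addition: the displacement is added to the partial sum.
    return (partial + displacement) & mask
```

For instance, a base of 0x1000, an index of 0x20, and a displacement of 4 give 0x1024. Because both the base and the index feed the first addition, a stale value in either register corrupts the address, which is why collisions on these registers are treated specially in what follows.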

Another instruction form requires that the value of a "general register" be supplied directly as an operand. This operand is fetched from the same register file as is used for the base registers described above, and is transported without modification through the A stage and supplied to the OE stage.

Current values for base, index, and general registers are supplied by the ES stage as it executes microcode instructions which modify them. The ES stage can modify all 32 bits of a register, or either of its 16 bit halves. Since the ES stage completes its operations three stage times later than completion of the corresponding ID stage, there are three different collisions possible:

1) Modification and use separated by three or more cycle times.

In this case, an instruction has completed the ID phase and waits for completion of the terminal microcode step of the preceding instruction before continuing through the A phase. An index and base register have been fetched from the A register file 72, transferred through pipeline register BY 73 and stored in selector/latch 74. The register file destination address specified by each microcode step and supplied by BIT 204 is continuously compared (by comparators 226 and 22~) with the base register and index register addresses used on behalf of the instruction waiting in the A stage and stored in latch 225. The outputs of these comparators, together with write enables supplied by BIT, are passed through bypass control logic 228 for determination of the needed action.

If a match occurs, the data in selector/latch 74 is stale, and correct data must be substituted. The appropriate portions of selector/latch 74 are re-clocked, selecting the updated value coming from the ES stage via BD 50, 54, buffers 210 and pipeline register BAR 71. Sufficient time exists in this case for the updated values to re-traverse the A stage, so no additional delay is necessary. This same mechanism is employed for equivalent cases involving general registers used as operands.

2) Modification and use separated by two cycle times.

In this case the A phase is attempting to proceed (the final microcode step of the preceding instruction is beginning) and the previous microcode step modified an index or base register used by the instruction active in the A phase. The same monitoring hardware used for 1) remains effective due to latch 225, which holds the index and base register addresses long enough for this final determination. In the event of collision detection, the proper bypass is again selected at selector/latch 74, but in this case extra time must be added for the A phase to properly employ the new value. The Collision Detect signal, produced by control logic 228, directs the PCU to allow the ES stage to complete while stopping all other pipeline stages. In this fashion the new value is obtained and a one cycle time delay provided for the A phase to make use of it.

It is undesirable to incur this time delay where registers are used directly as operands. Since this type of operand need not be manipulated by ALUs 75 and 76, it is possible to skip over these pipeline stages and send the data directly where it is needed. This is accomplished via selectors 212 and 213, which select the modified portion of the value presently on busses BD 50, 54 for insertion into the data stream in place of the stale value being produced on busses 39 and 90. In this manner, no extra time is required.

3) Modification and use separated by one cycle time.

When two successive machine instructions result in this situation, the method used in 1) and 2) is not effective, because the instruction with the stale data must exit the A stage before the register file destination address of the modifying instruction is available. The destination predictor logic, consisting of a portion of the decode net 82, certain saved opcode bits 207 and control logic 229, is used to determine which register, if any, might be modified in the final microcode step of an instruction. This requires some care in the selection of microcode algorithms, but the flexibility resulting from storage of control bits in the decode net makes this task straightforward.

The output of the destination predictor logic is compared with the index and base register addresses used by the next instruction by comparators 230 and 231. The outputs of the comparators travel through control logic 233, which generates the Collision Predict signal. When asserted, this signal instructs the PCU to allow the instruction doing the modification to proceed, while holding the next instruction's A stage (and all subsequent instructions). This separates the two instructions by two cycles instead of one cycle, and the hardware of case 2) above can then take over. This logic may or may not insert its one cycle delay, depending on whether the collision actually occurs.

The need for a register bypass, however, cannot be determined directly from the ES stage in the case of immediately adjacent instructions. It is possible to make a reasonably accurate determination of what register (if any) will be modified by an instruction by examining the opcode bits and the destination register tag bit of the instruction. Ready access to microcode algorithm related information can be obtained by storing opcode related information in the instruction decode net. Once the microcode for an assembly language instruction has been written, a determination is made of the register most likely to be modified by a final microcode step. This information is then stored in a storage element which makes up part of the decode net, and all paths through the microcode are checked to ensure that they place a copy of this register in ROD 122 for bypassing (should bypassing be needed for the next instruction). The A stage then checks the next instruction (presently in the ID stage) to see if a collision condition exists. In the event of a collision on an index or base register, the IF and ID stages of the pipeline are held up one cycle, allowing time for the normal collision detection and resolution hardware of case 2) to take over. (If the collision involves a general register, then the pipeline is not held up and the automatic OE stage bypassing is invoked as described below.)

In particular, referring to Figure 12, instructions are transported through the opcode latch 80 and are decoded by decode network 82. Instruction specific information is passed to the destination register prediction control logic 229, which either produces a prediction of the likely destination register or states that no register will be modified. The prediction is compared by comparators 230 and 231 with the addresses of the index and base registers fetched on behalf of the next instruction. This result passes through additional logic 233 which determines whether a collision has actually occurred (the pipeline may be refilling or the next instruction may not actually use the index register fetched for it).

Referring to Figure 1A, when the IPU 8 (from logic 233) produces the collision predict signal COLPRED as described above, the pipeline control unit (PCU) receives the signal, stops the IF, ID, and A stages, and allows the CF, OE, and ES stages to cycle. The PCU also supplies the signal FORCENOP, which operates on the LEA generator (Fig. 8) and modifies the microcode address on the LEA bus 91 to the address of a special "stall" step, which acts as a place holder for the CF stage while the necessary one cycle separation between the two instructions is being inserted. This one cycle separation, as noted above, is sufficient to allow the balance of the logic illustrated in Fig. 12 to take over and perform bypassing, if needed, or supply any additional delay(s) that may be required.

The "prediction" aspect is based solely on the use of instruction opcodes. In a complex instruction set architecture, there are many instructions which can write more than one register, or which might not modify the predicted destination register in all cases. (Divide by zero is an example.) By stipulating one "likely" register in a microcode algorithm (and then not modifying any different register in the final microinstruction of the algorithm), and then recording this "likely" destination in the decode network, the IPU is able to make a determination which will result in the necessary delay in all cases where it is definitely necessary, never adds delays which are known to be unnecessary for an instruction, and adds a minimum delay in certain unlikely cases. Once the hardware performs its function, any necessary separation will have been introduced to allow the microcode specified register destinations to be monitored by the logic in Fig. 12 as described in case (2) above.

It is again undesirable to apply time penalties when registers are used as operands. When a match is detected by comparator 230 and a general register is being fetched, this condition is remembered in register 232. This is in turn pipelined in register 234 and sent over to the OE stage hardware as the signal USER, where it acts as a form of extended control over the operand source select microcode field. When such a collision occurs, this extended control forces selection of the needed operand from an alternative source in the instruction execution pipeline. This extra copy is kept valid by microcode convention, and again no time penalty is required.

As noted above, for certain classes of instructions, a register is used directly as an operand, instead of as an input towards the generation of the effective address of an operand. In this case, the (register) operand does not need to be manipulated by the A stage but rather is supplied unmodified to the OE stage. For maximum efficiency, it is important to make instructions of this type as fast as possible. When a register modification occurs in the microinstruction which immediately precedes the initiate step for the next assembly-language-level instruction, it is not possible for the A stage to provide the operand without an undesirable extra pipeline delay.

The mechanism by which the OE stage can transparently provide its own operand, however, through a hardware override of its data path control logic and using a microcode convention which ensures that the required data is available within the OE stage, in a form that can be substituted directly for what would have been provided by the A stage, is as follows.

Any microcode algorithm which modifies a general register on what could be the last microcode step prior to commencement of the next assembly language instruction must ensure that a thirty-two bit copy of the resultant data is placed in the microcode scratch register ROD 122 (Figure 10) during or before the final step. This data can then be substituted for the (stale) data provided by the A stage, should the next instruction reference the same register.

In operation, the ID stage ordinarily fetches the desired register operand from the A register file 72 and stores it through the BAR pipeline register 73. The A stage transports it through the selector/latch 74, through ALUs 75 and 76, selectors 212 and 213, and stores it in registers EASH 85 and EASL 77. The OE stage can then obtain the operand by using the microcode field to direct that EAS be transported through selectors 86 and 73 and placed on BBH 63 and BBL 62.

If the immediately succeeding instruction modifies the desired register operand, neither of the selectors can obtain the data in time to effect the needed bypass. The value in EASH 85 and EASL 77 is "stale" and does not reflect the update. Rather than waiting for the new value to arrive (and thus undesirably holding up the OE stage), the A stage detects this condition and records its occurrence along with storing the "stale" data. This function is performed by logic depicted in Fig. 12, specifically the decode net 82, opcode register 207, comparator 230, pipeline register 232, and collision record register 234 as noted above. The signal "USER" is sent to the OE stage to inform it of this situation.

Referring now to Figure 10, the USER signal acts as a control input (not shown) to the selector 117, and forces it to substitute the contents of register ROD 122 for the stale data present on BBH 63 and BBL 62. The contents of ROD 122 are guaranteed to be an appropriate substitute by the microcode restriction stated above.
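
The three collision cases and their operand variants can be summarised as a lookup from separation and register use to bypass mechanism and added delay. The sketch below is an illustrative summary only; the function name, parameter names, and returned strings are hypothetical, and the cycle counts simply restate the delays given in the text.

```python
from typing import Tuple


def resolve_collision(separation: int, used_for_address: bool,
                      collision_occurs: bool = True) -> Tuple[str, int]:
    """Return (bypass mechanism, stall cycles) for a register collision.

    separation is the number of cycle times between modification and use;
    used_for_address is False when the register is used directly as an operand.
    """
    if separation >= 3:
        # Case 1: enough time for the new value to re-traverse the A stage.
        return "re-clock selector/latch 74", 0
    if separation == 2:
        if used_for_address:
            # Case 2: Collision Detect holds the other stages for one cycle.
            return "selector/latch 74 with Collision Detect", 1
        # Operand use skips the address arithmetic entirely.
        return "selectors 212 and 213", 0
    if separation == 1:
        if used_for_address:
            # Case 3: Collision Predict inserts one cycle; case 2 then adds
            # its own cycle only if the predicted collision actually occurs.
            return "Collision Predict, then case 2", 1 + int(collision_occurs)
        # Operand use: the execution pipeline substitutes its scratch copy.
        return "USER signal selects ROD 122", 0
    raise ValueError("separation must be at least one cycle")
```

As the sketch shows, delays are incurred only when the register feeds address formation; every operand case is resolved with no stall cycles.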

The invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The described embodiment is therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

What is claimed is:

1. A data processing system for processing a sequence of program instructions comprising an instruction pipeline having a plurality of serially operating instruction stages for reading instructions from storage and for forming therefrom data to be employed during execution of said instructions, an execution pipeline having a plurality of serially operating execution stages for receiving said data and for employing said data formed by said instruction pipeline for executing said instructions, a pipeline control unit for synchronously operating said instruction pipeline and said execution pipeline, said pipeline control unit including means for initiating operation of at least one stage of said execution pipeline using data formed by said instruction pipeline for a program instruction prior to the completion of said data formation by said instruction pipeline for said program instruction, whereby operation of at least one instruction stage and one execution stage of said respective pipelines overlaps for each program instruction.

2. The data processing system of claim 1 further comprising a high speed random access memory, a pipeline master clock for timing said pipeline stages, said pipeline control unit providing at least two clocked periods for each said pipeline stage to complete its operation, and for each said at least two clocked periods, said instruction pipeline having access to said high speed memory during one of said clocked periods and said execution pipeline having access to said high speed memory during another one of said clocked periods.

3. The data processing system of claim 1 wherein said pipeline control unit further comprises means responsive to exception conditions on said execution and said instruction pipelines for independently controlling, for each pipeline, the flow of instruction operations through said execution pipeline and said instruction pipeline.

4. The data processing system of claim 3 wherein said flow control means includes means for halting operation of one only of said execution and instruction pipelines.

5. The data processing system of claim 1 wherein said instruction pipeline comprises an instruction fetch stage for accessing from memory program instructions to be performed, an instruction decode stage for generating, from said accessed instructions, (a) starting addresses in a microcode storage element and (b) operand address data and an address generation stage for generating operand addresses from said operand address data.

6. The data processing system of claim 5 further comprising a register file and wherein said execution pipeline comprises a control formation stage for accessing microinstructions from said microcode storage element, using said starting addresses, and for buffering said microinstructions, an operand execute stage for accessing, using said operand addresses and register file controls, operand data to be operated upon and, using said operands and a said microinstruction, initiating execution of said instruction, and an execution and store stage for completing said execution of said microinstruction and making results of said execution available to said system.

7. The data processing system of claim 6 wherein said operation initiating means begins opera-tion of said control formation stage for a program instruction at a time prior to completion of operation of said address generation stage for said program instruction.

8. The data processing system of claim 6 wherein the pipeline control unit further comprises means responsive to said microinstruc-tions for altering the flow of instructions in at least one of said instruction pipeline and said execution pipeline.

9. The data processing system of claim 8 wherein said altering means is responsive to a said microinstruction for extending the operating time dura-tion of all stages, except the execute and store stage, for allowing the operand execute stage to complete a process operation.

10. The data processing system of claim 8 wherein said altering means is responsive to a said microinstruction for inhibiting operation of said instruction pipeline for allowing said execution pipeline to cycle through a plurality of microinstructions.

11. The data processing system of claim wherein said pipeline control unit further comprises means for inserting no-operation cycles into the execution pipeline in the event that at least one of (a) there being no instruction from the instruction decode stage of the instruction pipeline for the control formation stage of the execution pipeline and (b) there being no instruction from the address generation stage of the instruction pipeline for the operand execution stage of the execution pipeline, occurs.

12. The data processing system of claim 8 wherein said altering means is responsive to a conditional branch microinstruction entering said operand execute stage and said altering means further comprises means for operating said operand execute and execution and store stages, and means for inhibiting operation of said instruction pipeline and said control formation stage, until data required by said conditional branch microinstruction is available from said operand execute stage.

13. The data processing system of claim 6 wherein said instruction pipeline comprises a look-ahead program counter, and further wherein said pipeline control unit comprises means responsive to a said microinstruction for redirecting instruction flow in said instruction pipeline by effecting reloading of said instruction pipeline look-ahead program counter.

14. The data processing system of claim 13 wherein said execution pipeline includes a microcode storage element, and further wherein said pipeline control unit, in response to a request by the execution pipeline, causes all current instructions in the instruction pipeline to be discarded.

15. The data processing system of claim 1 wherein said execution pipeline includes a microcode storage element and further wherein said pipeline control unit, in response to a request by the execution pipeline, will discard all current instructions in the instruction pipeline.

16. The data processing system of claim 15 wherein said pipeline control unit further comprises means for inserting no-operation cycles into the execution pipeline during the time duration that the instruction pipeline is refilling, and for continuing operation of said execution pipeline while said instruction pipeline is refilling.

17. The data processing system of claim 1 comprising means for detecting pipeline collisions in said instruction and execution pipelines, and further wherein the pipeline control unit comprises means responsive to said collision detecting means for delaying operation of at least one of said stages for introducing a separation between said colliding instructions.

18. The data processing system of claim 5 comprising means for detecting an exception condition during operation of said instruction fetch stage and wherein said pipeline control unit comprises means for holding an instruction in said instruction fetch stage until all other stages of said instruction and execution pipelines have completed processing instructions therein.

19. The data processing system of claim 5 further comprising a high speed instruction storage element, means for reading from said element two instruction words at a time, said two words being aligned with an even word boundary of said memory, and access means for reading from said element a two word instruction aligned with an odd word boundary, said access means comprising means for reading a first word of said two word instruction during the instruction fetch stage of said instruction, said first word including all instruction decode data, and means for reading a second word of said instruction during the instruction fetch stage of a next following instruction.

20. The data processing system of claim 1 further comprising a microcode storage element for storing microinstructions, and said execution pipeline effects data manipulation in response to selected ones of the microinstructions.

21. In a data processing system for processing a sequence of program instructions comprising an instruction pipeline having a plurality of serially operating instruction stages for reading instructions from storage and for forming therefrom address data to be employed during execution of said instructions, an execution pipeline having a plurality of serially operating execution stages for receiving said address data and for employing said address data formed by said instruction pipeline for referencing stored data to be employed for executing said instructions, the pipeline control method comprising steps of synchronously operating said instruction pipeline and said execution pipeline, and initiating operation of at least one stage of said execution pipeline using at least one said address data formed by said instruction pipeline for a program instruction prior to the completion of said address data formation by said instruction pipeline for said instruction.

22. The pipeline control method of claim 21 further comprising the steps of providing at least two clocked periods for each pipeline stage to complete its operation, and sharing a high speed memory between said instruction pipeline and said execution pipeline, said instruction pipeline having access to said high speed memory during one of said clocked periods and said execution pipeline having access to said high speed memory during another one of said clocked periods.

23. The pipeline control method of claim 21 further comprising the step of independently controlling, for each pipeline, the flow of instruction operations through said respective execution and instruction pipelines.

24. The pipeline control method of claim 23 wherein said controlling step further comprises the step of halting operation of one only of said execution and instruction pipelines in response to pipeline control conditions.

25. The pipeline control method of claim 21 further comprising the steps of detecting pipeline collisions in said instruction and execution pipelines, and delaying operation of at least a portion of said instruction pipeline for introducing a separation between said colliding instructions.

26. A data processing system for processing a sequence of program instructions comprising an instruction pipeline having a plurality of serially operating instruction stages for reading instructions from storage and for forming therefrom plural address data to be employed during execution of said instructions, an execution pipeline having a plurality of serially operating execution stages for receiving said address data and for employing said address data formed by said instruction pipeline for referencing stored data to be employed for executing said instructions, a pipeline control unit for operating said instruction pipeline and said execution pipeline, said pipeline control unit including means responsive to exception conditions on said execution and said instruction pipelines for independently controlling, for each pipeline, the flow of instruction operations through said execution pipeline and said instruction pipeline.

27. A data processing system for processing a sequence of program instructions comprising an instruction pipeline having a plurality of serially operating instruction stages for reading instructions from storage and for forming therefrom plural address data to be employed during execution of said instructions, an execution pipeline having a plurality of serially operating execution stages for receiving said address data and for employing said address data formed by said instruction pipeline for referencing stored data to be employed for executing said instructions, a pipeline control unit for synchronously operating said instruction pipeline and said execution pipeline, said pipeline control unit including a plurality of state registers, a plurality of combinatorial logic circuits, one each of said state registers and said logic circuits being associated with each stage of said pipelines, each said logic circuit having a first signal output and a second signal output, each pipeline stage having a first phase of operation associated with said first signal output of said associated logic circuit and a second phase of operation associated with said second signal output of said associated logic circuit, each said logic circuit and associated state register, associated with the same pipeline, being connected in series, and at least one of said logic circuits being connected to receive condition signals from said pipelines for controlling the flow of instructions through said pipeline.

28. The data processing system of claim 27 further comprising means for connecting said first signal output of a logic circuit to the associated state register for determining when the associated pipeline stage has completed a first phase of operation.

29. A data processing system for processing a sequence of program instructions comprising an instruction pipeline having a plurality of serially operating instruction stages for reading instructions from storage and for forming therefrom plural address data to be employed during execution of said instructions, an execution pipeline having a plurality of serially operating execution stages for receiving said address data and for employing said address data formed by said instruction pipeline for referencing stored data to be employed for executing said instructions, a pipeline control unit for synchronously operating said instruction pipeline and said execution pipeline, said pipeline control unit including means for initiating operation of at least one stage of said execution pipeline using one said address data formed by said instruction pipeline for a program instruction prior to the completion of said address data formation by said instruction pipeline for said program instruction, whereby operation of at least one instruction stage and one execution stage of said respective pipelines overlaps for each program instruction.

30. The data processing system of claim 6 further comprising means for detecting collisions between read data from a register associated with the instruction pipeline phase of operation in response to a first instruction and write data written in registers associated with the execution pipeline phase of operation in response to an earlier instruction, wherein said execution phase of operation can include a plurality of execution cycles during each of which a register can be modified and wherein said first instruction requires one of said modified values to continue valid operation, said detecting means comprising means for storing said modified values generated during the execution phase and the write register address associated therewith, means for comparing the associated write register address of each modified value with the read register address used by the instruction pipeline, and means for directing, when said addresses match, the modified value to be written at said register address to replace the data previously designated to be used during said instruction phase of operation.

31. The data processing system of claim 30 wherein said storing means receives data from the execution and store stage, said read register address is a read address generated by the instruction decode stage, said write register address is available during operation of the execute and store stage, and said directing means comprises a selector means connected between the instruction decode and address generation stages, said selector having the read data and the modified data as inputs thereto.

32. The data processing system of claim 30 wherein said directing means further comprises a second selector means connected between the address generation stage and the operand execute stage for altering at least a portion of the flow of address data to said operand execute stage in response to a collision detection signal.
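
The register collision handling recited in claims 30 and 31 can be illustrated with a minimal sketch in Python. It is not taken from the specification: the class name `BypassStore`, its method names, and the register numbers and values are hypothetical, and the hardware comparators and selector are modeled as a simple dictionary lookup. The sketch assumes one stored modified value per write register address.

```python
class BypassStore:
    """Hypothetical model of the storing, comparing, and directing means."""

    def __init__(self):
        # Storing means: modified values keyed by their write register address.
        self._pending = {}

    def record_write(self, write_addr, value):
        """Keep a value modified during an execution cycle, with its address."""
        self._pending[write_addr] = value

    def retire(self, write_addr):
        """Drop an entry once the register file itself holds the value."""
        self._pending.pop(write_addr, None)

    def select(self, read_addr, register_file_value):
        """Comparing and directing means.

        Returns (value, collided). On an address match the stored modified
        value replaces the data read from the register file.
        """
        if read_addr in self._pending:
            return self._pending[read_addr], True
        return register_file_value, False


# Assumed example state: register 2 still holds a stale value in the file.
register_file = {1: 100, 2: 200}
bypass = BypassStore()

# An earlier instruction modifies register 2 in the execution pipeline.
bypass.record_write(2, 250)

# A later instruction reads registers 2 and 1 in the instruction pipeline.
value, collided = bypass.select(2, register_file[2])
other, other_collided = bypass.select(1, register_file[1])

# Once the write reaches the register file, the stored entry is retired.
register_file[2] = 250
bypass.retire(2)
late_value, late_collided = bypass.select(2, register_file[2])
```

In this sketch the read of register 2 is flagged as a collision and receives the modified value, while the read of register 1 passes through unchanged, which corresponds to the selector of claim 31 choosing between the read data and the modified data.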