CA1228170A

CA1228170A - Architecture for small instruction caches

Info

Publication number: CA1228170A
Application number: CA000482001A
Authority: CA
Inventors: Eric P. Kronstadt; Tushar R. Gheewala; Sharad P. Gandhi
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1984-10-24
Filing date: 1985-05-21
Publication date: 1987-10-13
Also published as: EP0179245B1; DE3582777D1; US4691277A; JPS6323586B2; JPS61100837A; EP0179245A3; EP0179245A2

Abstract

ABSTRACT
ARCHITECTURE FOR SMALL INSTRUCTION CACHES
A branch target table (10) is used as an instruction memory which is referenced by the addresses of instructions which are targets of branches. The branch target table consists of a target address table (12), a next fetch address table (14), a valid entries table (16) and an instruction table (18). whenever a branch is taken, some of the bits in the untranslated part of the address of the target instruction , i.e. the instruction being branched to, are used to address a line of the branch target table (10).
In parallel with address translation, all entries of the branch target table line are accessed, and the translated address is compared to the target address table (12) entry on that line. If the target address table entry matches the target address, the instruction prefetch unit (32) fetches the instruction addressed by the next fetch address table (14) entry for the given line and the line of instructions associated with the branch address table entry is read into an instruction queue (38) having a length set by the valid entry table (16) entry which indicates how many of these instructions are valid. Otherwise, the instruction prefetch unit (32) fetches the target and subsequent instructions as it would if there were no branch target table, and the target address table entry is set to the real address of the target instruction. The next fetch address table (14) is updated so that it always contains the address of the instruction which follows the last valid instruction in the line, and the valid entries table (16) is updated so that it always counts the number of valid instructions in the line.

Description

Yo-yo lZ28~7~

- ARCHITECTURE FOR SMALL INSTRUCTION Khakis Background of the Invention The invention describe herein generally relates to micro-processors having an on-chip instruction cane and, 5 more specifically, to a unique architecture for such an instruction memory which is particularly suited for enhancing the performance of instruction prefetch.
Current advances in very large scale integrated (VLSI) circuit technology permits design of high performance lo micro-processors with cycle times well under 100 nanoseconds Simultaneously, the performance of dynamic memories are improving to the point where random access memory (RAM) access times can very nearly match processor cycle times. However, the time it takes to drive addresses 15 and data off chip, generate appropriate memory chip selects, do the memory access, perform error detection, and drive back to the CPU can add several (CPU) cycles to the "system"
access time of a memory. As long as the CPU is fetching data sequentially, as in sequential instruction fetches, it 20 can prefetch far enough in advance so that it sees a constant stream of data arriving at intervals equivalent to the RAM cycle time, which, as noted above, is comparable to the CPU cycle time. However, as soon as a branch instruction occurs, the "prefetch pipeline" is broken, and 25 the CPU must wait for several cycles for the next instruction. With current VSLI chip densities, it is possible to add a fair amount of circuitry to the "CPU"
chip, some of which may be devoted to decreasing this idle time. A standard approach is to put a small instruction 30 memory, usually an instruction cache (I-cache), on the CPU
chip.
An example of a single-chip micro-processor having an instruction register with associated control decode or micro-control generator circuitry is disclosed in Use Yo~3-13s ~22~7~ ' Patent lo. 4,4~,042 issued to Karl M. Gut tag In this patent, the micro-processor communicates with external millionaire by a bidirectional multiplexed address/data bus.
Each instruction produces a sequence of microcode which are 5 generated by selecting an entry point for the first address of the control read only memory (ROM) then executing a series of jumps depending upon the instruction. Operating speed is increased by fetching the next instruction and starting to generate operand addresses before the current 10 result has been calculated and stored.
US. Patent Jo. 4,39~,946 to Thomas A. Lane discloses a pipeline processor wherein micro-instructions are held in a Control store that is partitioned into two microcode memory banks. This system can support three modes of sequencing:
15 single micro-instruction, sequential multiple micro-instructions, and multiple micro-instructions with conditional branching. When a conditional branch is performed, the branch not taken path is assumed and if true, the micro-instruction following the branch is executed with 20 no delay. If the branch is taken, the guess is purged and following a one clock delay, the branched to microinstruction is executed. The Lane system supports these sequencing modes at the maximum pipeline rate.
US. Patent No. 4,384,342 to Tokyo Immure et at 25 discloses a Lockwood prefetching technique wherein a first memory address register stores the column address and module designation portions of the current effective address, a second memory address register stores the row address portion of the current effective address, and a third memory 30 address register stores the module designation portion of the prior effective address. Since the same module is frequently accessed many times in succession the average access time is reduced by starting an access based on the contexts of the second and third memory address registers 35 without waiting until the column address and module YO-YO ~2~7~

designation portions of the current effective address are ; available from storage in the first Emory address register.
the access is completed, after the column address and module designation portions of the current effective address are S determined, if a comparator which is connected to the first and third memory address registers confirms that tune save memory module is being successively accessed. If not, the modules are accessed again based upon the contents of the first and second memory address registers.

10 Summary of the Invention It is an object of the present invention to provide a new architecture for an on-chip instruction cache which is more effective than the prior art.
It is another object of this invention to rove an 15 instruction memory architecture which uses an associative addressing scheme to provide better performance of instruction prefetch.
The objects of the invention are accomplished by providing a branch target table (BUT) as an instruction 20 memory which is referenced by the addresses of instructions which are targets of branches. The BUT consists of a target address table (TAT), a next fetch address table FIAT a valid entries table (VET) and an instruction table (IT).
whenever a branch is taken, some of the bits in the 25 untranslated part of the address of the target instruction, i.e. the instruction being branched to, are used to address a line of the BUT. In parallel with address translation, all entries of the BUT line are accessed, and the translated address is compared to the TAT entry on that line. If the 30 TAT entry matches the target address, the instruction prefetch unit fetches the instruction addressed by the NEAT
entry for the given line and the line of instructions associated with the BUT entry is read into an instruction queue having a pointer length set to the TV entry Nash -I 2~7~) , YO-YO

indicates how many of these instructions are valid.
Otherwise, the instruction prewash unit fetches tune target and subsequent instruction as it would if there were no BUT, and the TAT entry is set to the real address of the target 5 instruction. The FAT is updated so that it always contains the address of the instruction which follows the last valid instruction in tune line, and the VET is updated so that it always counts the number of valid instructions in the line.

Brief Description of the Drawing _ The foregoing and other objects, aspects and advantages of the invention will be better understood from the following detailed description of the invention with reference to the drawings, in which:
Figure 1 is a block diagram of the basic instruction 15 memory architecture according to the invention; and Figure 2 is a bloc diagram of the additional logic required for the example of a two-way set associative OTT
configuration.

Detailed Description of the Invention . _ _ . _ The branch target table (BUT) is an instruction memory which is referenced by the addresses of instructions which are targets of branches. The following discussion describes a one-way associative addressing scheme; as indicated later, greater associativity provides better performance.
With reference to the drawing, the BUT 10 consists of a target address table (TAT) 12, a next feign address table (FAT) 14, a valid entries table (VET) 16 and an instruction table (IT) 18. The CPU generated branch target instruction address is supplied to a first address register I This 30 address includes a virtual part and a fixed part. The virtual Hart of the address is supplied to an address translation unit 22 which generates the real address for storing in a second address register 24. The fixed part of 28~t~3 YO-YO

the address stored in register I is directly read into corresponding locations of register I An address comparator 26 compares entries in TAT 12 with the ranch target real address generated by the address translation 5 unit 22 and store in register 24. next feign address register 28 communicates with the FAT 14 and supplies a branch address to address multiplexer I Tune other input to multiplexer I is supplied from register 24 and, depending on the output of comparator 26, tune contents of id register 24 or of register 28 are swilled to the Cups instruction prefetch unit 32. The instruction prefetch unit 32 also communicates directly with register 28 which in turn provides addresses for reading into tune NUT 14.
A portion of the fixed address store in register I is 15 also supplied to a decoder 34 winch addresses a line on the ; BUT I In addition, there is provided an incremented 36 for incrementing entries in the VET 16. An instruction queue 38 is fed by the instruction table (IT) 18 of the BUT
I and is addressed by queue pointers in register I
20 supplied by the VET 16. Instructions to be delivered to the CPU are selected by an instruction multiplexer 42.
Whenever a branch is taken, some of the bits in the untranslated part of the address of the target instruction (the instruction being branched to) are used to address a 25 line of the BUT 10. In parallel with the address translation, all entries of the BUT line are accessed. The translated address is compared to the TAT entry on that line. If the TAT entry matches the target address, the instruction prefetch unit 32 fetches the instruction 30 addressed by the NAT 14 entry for the given line, and the line of instructions associate wit the branch address table entry is read into the instruction queue 38 having a length set by the VET 16 entry which indicates how zany of these instructions are valid. Otherwise, the instruction 35 raffish unit 32 fetches the target and subsequent I
YO-YO

instructions as it would if there were no BUT and tune TAT
entry is set to the real address of the target instruction.
The NEAT 14 is updated so that it always contains the address of the instruction which hollows the last valid 5 instruction in the line, and Tao VET 16 is updated so that it always counts the number of valid instructions in tune line.
The operation will be better understood from a consideration of the following examples.
Case 1: The TAT entry does not match the target address. This would happen, for example, the first time the branch is taken. In this case, the instruction raffish unit 32 fetches the target and subsequent instruction us it wound if there were no BUT, and the TAT entry is set to the 15 real address of the target instruction. As the target and subsequent instructions arrive on the CPU chip, they are ; delivered to the CPU for processing and simultaneously entered into the instruction table part of the OTT in the line associated with the target address. he target ; 20 instruction occupies tune first instruction location on this line, and subsequent instructions are placed in subsequent instruction locations on the line until either the line is filled, or another branch is taken. The NEAT entry for the line is updated so that it always contains the address of 25 the instruction which follows the last valid instruction in tune line. The VET 16 is updated so that it always counts the number of valid instructions in the line.
Case 2: The TAT entry matches the target address. In this case, the instruction prefetch unit fetches the 30 instruction addressed by the NEAT entry for the given line.
Simultaneously, the line of instructions associated with tune OTT entry is read into instruction queue 38, with queue length set to the VET entry which indicates how many or these instructions are valid. The instructions in tune queue 35 are immediately available to the CPU so that it is not Yule ~2~7~

sitting idle during the time required to refill the prefetch pipeline. Note that the prefetch unit 32 will fetch the instruction following the last valid instruction in the queue. As that and subsequent instructions arrive on the 5 CPU chip, they are placed at the end of the queue or delivered to the CPU for processing, and if there is room in the BUT line, simultaneously entered into the instruction table part of the BUT. The NEAT 14 is updated so that it always contains the address of the instruction weaken follows 10 the last valid instruction in the BUT line. The VET 16 is updated so that it always counts the number of valid instructions in the OTT line.
As described above, the scheme for indexing into the BUT is one-way associative; in owner words, there is only 15 one BUT line for each different value that can be taken on by that portion of the target address that is used to index into tune BUT. This is the simplest scheme. One can easily construct an n-way associative BUT. In this case, n lines of the BUT would be simultaneously addressed by a portion of Thea target instruction address, and the TAT entries for each of these lines would be compared to determine which line of the BUT contained the information for that target ; instruction. -If no match were found, then one of the entries in the "associativity class" would have to be 25 discarded (using a standard least recently used LO
algorithm) to make room for the information on the current branch. This technique is very effective in improving performance at the expense of some additional logic.
An example of the logic required for a two-way set 30 associative BUT configuration is shown in Figure 2. Two To lull and 1~2 wit corresponding Tats 12 and 12 , fats 141 and 142, Vets 161 and 162, and Its 181 and 1~2 are provided. The fixed address bits from register MU are supplied to associated decoders 34 and 342 to address lines 35 on By 1~1 and 1~2, respectively. 'he translated address ~L~2~3~7~
YO-YO

bits from register 24 are compared in comparators 261 and 262 with target addresses from Tats 121 and 122. The outputs of these comparators are provided as control inputs to multiplexes I I and I The output of multiplexer 3 5 is supplied to the instruction prefetch unit 32 as before.
multiplexer 44 receives as inputs the outputs of Vets 161 and 16 and provides an output to the queue pointer resister 40. Multiplexer 46 receives as inputs the outputs o.
instruction tables 18 and 18 and provides outputs to tune lo instruction queue 38.
In the above discussion, it was also assumed that when a branch was taken to a target instruction corresponding to an entry in the BTl, prefetching of tune instruction addressed by the FAT entry begins immediately.
15 This may not be necessary if the number of valid instructions in the BUT line is greater than the number of CUP cycles required to obtain that instruction. One could have some additional logic associated with the instruction queue, so that if the length of the queue is greater than a 20 certain constant (equal, for example, to the number of CPU
cycles required to complete a memory access, including all system access overhead), then no prefetching is done. this causes less traffic on the memory bus and reduces contention for instruction and data accesses.
There is no prefetching in a standard I-cache. when a cache miss occurs, an entire cache line is read from Cain memory. the penalty for a cache miss is at least as large as the full memory system access time plus the CPU cycle time multiplied by tune number of words in tune cache. In 30 some cache implementations, instructions are made available to the CPU as they are loaded into the cache (as in the present invention) so this last component of the miss penalty may be reduced. Ivory, it is frequently the case that in filling a cache line, one fetches and caches 35 instructions that are never used.

YO-YO ~22~

On a standard I-cache, a miss can occur on any instruction access. BUT misses can only occur on "taken"
branches, i.e. the execution of conditional branches for which the condition does not hold does not involve toe BUT.
5 Thus, if roughly 16% of all instructions are taken ranches, a 40% hit ratio in the BUT would be roughly equivalent to a 93g~ hit ratio on a standard I-cache. It is apparent that in Situations where only a small amount of memory is available for on chip instructions storage, a BUT according to the 10 present invention has advantages over a standard I-cache.
The table below contains comparative performance figures for a micro-processor system using a BUT and an I-cache of differing sizes.
¦ Relative Hit ¦ Performance Ratio _____________________________________________ 1/2-K I-Cache ¦ 1.~0 79%
(4 way associative) j I

20l-K I-Cache I 1.06 83%
(4 way associative) I
I
1/2-K BUT ¦ 1.15 40%
(1 way associative) l-K Bit ¦ 1.17 guy I way associative) I
The system consists of a CPU and a BTT/I-cache management unit on one chip and a separate memory system.
30 The memory system is composed of ripple mode memories which are configured so that over 90% of all CPU accesses to memory instruction and data) involve a DOW cycle time which is less than or equal to the CUP cycle time, while system access time adds three additional CPU cycles. The 35 system contains one address bus and one bus for instruction 28~
Yo-yo and data. The results are based on a single set of trace taxes, representing a high level language compilation. rlhe I-cache was 4-way set associative with either or 16 associativity classes, each containing four sixteen byte 5 lines. (Eight classes corresponds to a 1/2-~ cache size, 16 classes to a lo cache.) The BUT was only one-way associative and had either 16 or 32 lines (corresponding to 1/2-K or I BUT sizes) with room for seven instructions on each line. System modeling includes loads and stores as 10 well as instruction fetches, so that contention for the single data bus is included in these figures. The performance numbers in the table are normalized to tune performance of the l/2-K I-cache configuration. Note that the difference between I-cache performance and BUT
15 performance would be more pronounced if the associativity of the two were the same.

Claims

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:

1. An instruction memory architecture for a microprocessor comprising:
a branch target table including a target address table, a next fetch address table, a valid entries table and an instruction table;
instruction address register means for storing a target address generated by said microprocessor;
decoder means responsive to said instruction address register means for addressing all entries of a line of said branch target table whenever a branch is taken;
comparator means connected to the outputs of said instruction register means and said target address table for providing an output indicative of a match between the target address and the accessed target address table entry on said addressed line; and instruction prefetch means connected to said instruction address register means and said next fetch address table and responsive to the output of said comparator means for fetching the instruction in the accessed next fetch address table if said target address table entry matches the target address, otherwise for fetching the target and subsequent instructions from main memory, said next fetch address table entry for the line being updated so that it always contains the address of the instruction which follows the last valid instruction in the line and said valid entries table being updated so that it always counts the number of valid instructions in the addressed branch target table line.

2. The instruction memory architecture as recited in claim 1 wherein said instruction address register means comprises:
a first register for storing an address having a virtual part and a fixed part, said address being the target address generated by said microprocessor, at least some of the bits in the fixed part of said address being supplied to said decoder means;
address translation means connected to said first register and responsive to the virtual part of said target address for generating a corresponding real address; and a second register connected to receive the real address from said address translation means and the fixed part of said target address stored in said first register, said comparator means receiving said real address stored in said second register for comparison with the accessed target address table entry.

3. The instruction memory architecture as recited in claim 1 further comprising:
instruction queue means connected to said instruction table for receiving instructions to be supplied to said microporcessor; and queue pointer register means connected to said valid entry table for storing pointers which address said instruction queue means.

4. The instruction memory architecture as recited in claim 1, said architecture being n-way set associative further comprising:
at least a second branch target table having a target address table, a next fetch address table, a valid address table and an instruction table;
at least a second decoder means responsive to said instruction register means for addressing all entries of a line of said branch target table whenever a branch is taken;
at least a second comparator means connected to the outputs of said instruction address register means and said target address table of said second branch target table for providing an output indicative of a match between the target address and the accessed target address table entry on said accessed line; and selection means controlled by the outputs of said comparators for providing the outputs of one of said instruction tables to said microprocessor.

5. A method of operating an instruction memory architecture for a microprocessor, said instruction memory architecture comprising a branch target table having a target address table, a next fetch address table, a valid entries table and an instruction table, said method comprising the steps of referencing said branch target table by the addresses of instructions which are targets of branches, all entries of an addressed line of said branch target table being accessed whenever a branch is taken, comparing the target address with the target address table entry on said addressed line and, if said target address table entry matches the target address, fetching the instruction addressed by said next fetch address table entry, otherwise fetching the target and subsequent instructions from main memory.

6. The method according to claim 5 further comprising the additional steps performed when an instruction is fetched of updating the next fetch address table entry for the addressed branch target table line so that it always contains the address of the instruction which follows the last valid instruction in the line, and updating the valid entries table so that it always counts the number of valid instructions in the addressed branch target table line.