HARDWARE IMPLEMENTATION OF COMPLEX DATA TRANSFER INSTRUCTIONS
This application is a continuation of application Ser. No. 07/291,510, filed Dec. 29, 1988, now abandoned.
TECHNICAL FIELD
The subject matter of this invention pertains to computing systems, and more particularly, to an improved, high performance, multiprocessor and uniprocessor computer system.
In the design and development of computer systems, increasing emphasis is being placed on performance of such systems. The performance is very often a function of the technology used in manufacturing the integrated circuit chips which comprise the computer system. One such technology, new in the development of computer systems, is Complementary Metal Oxide Semiconductor (CMOS) technology. CMOS technology provides a greater degree of reliability, serviceability, and availability than seen before in prior computer systems, due mostly to a reduction in the physical number of chips which comprise the computer system. Since a scarcity of input/output pins on chips has been a problem with prior computer systems, a reduction in the number of chips, as a result of the use of CMOS technology, reduces the number of interconnections (input/output pins) between chips. In addition, performance may also be a function of the number of processors which comprise the computer system.
BACKGROUND ART
A major problem with such advanced technology computer systems is the need to provide data transfer instructions which permit the loading from memory to machine registers, and the storing from machine registers to memory, of data fields of multiple lengths. U.S. Pat. No. 4,745,547, issued on May 17, 1988 to Buchholz et al. and assigned to the same assignee as this patent application, teaches the need for this function in vector processing.
One way of accomplishing this is through the use of block transfers as in Input/Output type operations. U.S. Pat. No. 4,370,712 issued to Johnson et al. on Jan. 25, 1983, and U.S. Pat. No. 4,438,493 issued to Cushing et al. on Mar. 20, 1984, teach block transfer techniques. The overhead of such operations, however, makes them not as useful for intraprocessor transfers as for I/O transfers of large blocks of data.
A simple, special purpose method of data transfer is taught by Whipple, et al. in U.S. Pat. No. 4,716,545 issued on Dec. 29, 1987. This technique is special purpose in that it is restricted to a doubleword read or write. A brute-force method of implementing the Whipple, et al. doubleword transfer is taught by Johnson et al. in U.S. Pat. No. 4,361,869 issued on Nov. 30, 1982. This approach doubles the bus width, thereby speeding the transfer between memory and processor. Unfortunately, this approach does not provide great flexibility in the use of variable width fields.
A much more flexible approach is taught in U.S. Pat. No. 4,491,908 issued to Woods et al. on Jan. 1, 1985. This technique transfers up to four (4) words by specifying the operand length in one field of the transfer instruction. A microprogram is used to decode this field and make the actual transfer. However, the implementation in firmware, though flexible, is inherently slow.
Another means of decoding the variable operand length field is by table look-up as taught in IBM Technical Disclosure Bulletin, Volume 19, No. 1, dated June, 1976, by Plant et al. IBM Technical Disclosure Bulletin, Volume 25, No. 4, dated September, 1982, by Nair et al. describes the problems associated with cache management when doing multiple word transfers.
DISCLOSURE OF THE INVENTION
The present invention overcomes the problems found in the prior art by adding hardware to reduce run time penalties associated with the overuse of firmware in implementation. In the preferred embodiment, the ease of design is added by permitting the hardware to execute software specified transfers from one to sixty-four, eight-bit bytes for both load and store instructions.
The hardware complexity is reduced by using essentially the same hardware as now exists. The special case conditions are removed, thereby potentially simplifying the hardware. Some additional circuitry is necessary, however, to create the "mini-instruction" program which emulates the actual software instruction. The penalty associated with microprogram implementation is not experienced because the mini-instructions are executed with the existing hardware architecture.
The execution of a complex multiple load or store instruction begins with decoding as in the normal case. When the specific instruction is identified, the fields are examined to determine whether a special condition exists which requires the use of one or more "mini-instructions". Such special cases include beginning or ending a transfer at byte boundaries which are inconsistent with normal word boundaries; beginning or ending a transfer at byte boundaries which are inconsistent with normal doubleword boundaries; and greater than eight byte data transfers. If no such special case exists, the instruction is simply executed as in the prior art machine. If a special case is found, newly added control logic ensures that the current instruction is executed a number of times with a series of variables set into the control fields to effect the desired complex load or store in the most efficient manner. Each execution is termed a "mini-instruction" because it is executed with the existing software architecture, but has limitations on allowable variables to ensure most efficient runtime execution.
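The idea of re-executing one instruction several times with different control-field values can be illustrated with a minimal sketch. It assumes, purely for illustration, that each mini-instruction moves at most eight bytes and never crosses a doubleword (8-byte) boundary. The function name `split_transfer` and this particular splitting rule are hypothetical; the text above leaves the actual choice of control-field values to the added control logic.

```python
DOUBLEWORD = 8      # bytes per doubleword
MAX_TRANSFER = 64   # the text allows transfers of one to sixty-four bytes


def split_transfer(address, length):
    """Split a byte transfer into (address, length) pieces.

    Assumption (not from the source): each piece is at most one
    doubleword long and does not cross a doubleword boundary.
    """
    if not 1 <= length <= MAX_TRANSFER:
        raise ValueError("length must be between 1 and 64 bytes")
    pieces = []
    remaining = length
    while remaining > 0:
        # Bytes left before the next doubleword boundary.
        room = DOUBLEWORD - (address % DOUBLEWORD)
        size = min(room, remaining)
        pieces.append((address, size))
        address += size
        remaining -= size
    return pieces


if __name__ == "__main__":
    # An aligned doubleword needs a single piece.
    print(split_transfer(0, 8))
    # An unaligned 10-byte transfer is split at the boundary at address 8.
    print(split_transfer(5, 10))
```

Under these assumptions an aligned eight-byte transfer yields one piece, corresponding to the case that is "simply executed as in the prior art machine", while an unaligned or longer transfer yields several pieces, one per hypothetical mini-instruction.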
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a single processor computer system which employs the present invention.
FIG. 2 is a block diagram of a multiple processor computer system which employs the present invention.
FIG. 3 is a detailed block diagram of the instruction execution unit.
FIG. 4 is a block diagram of Fixed Point Processor 20-4 including the present invention.
FIG. 5 is a schematic diagram of a Table Lookaside Buffer Miss.
DESCRIPTION OF THE INVENTION
System Architecture
The present invention is preferably embodied in a modern computer system with an IBM System 370 architecture having single or multiple processors. Such systems are herein described.
Referring to FIG. 1, a uniprocessor computer system employing the present invention is illustrated. In FIG. 1, the uniprocessor system comprises an L3 memory 10 connected to a storage controller (SCL) 12. On one end, the storage controller 12 is connected to integrated I/O subsystem controls 14, the controls 14 being connected to integrated adapters and single card channels 16. On the other end, the storage controller 12 is connected to I/D caches (L1) 18, which comprise an instruction cache and a data cache, collectively termed the "L1" cache. The I/D caches 18 are connected to an Instruction unit (I-unit), an Execution unit (E-unit), a control store 20 and a vector processor (VP) 22. The vector processor 22 is described in patent application Ser. No. 530,842, filed Sep. 9, 1983, entitled "High Performance Parallel Vector Processor", now U.S. Pat. No. 4,967,343, the disclosure of which is incorporated by reference into the specification of this application. The uniprocessor system of FIG. 1 also comprises the multisystem channel communication unit 24.
The L3 memory 10 comprises 2 "intelligent" memory cards. The cards are "intelligent" due to the existence of certain specific features: error checking and correction, extended error checking and correction (ECC) refresh address registers and counters, and bit spare capability. The interface to the L3 memory 10 is 8 bytes wide. Memory sizes are 8, 16, 32 and 64 megabytes. The L3 memory is connected to a storage controller (SCL) 12.
The storage controller 12 comprises three bus arbiters arbitrating for access to the L3 memory 10, to the I/O subsystem controls 14, and to the I/D caches 18. The storage controller further includes a directory which is responsible for searching the instruction and data caches 18, otherwise termed the L1 cache, for data. If the data is located in the L1 caches 18, but the data is obsolete, the storage controller 12 invalidates the obsolete data in the L1 caches 18, thereby allowing the I/O subsystem controls 14 to update the data in the L3 memory 10. Thereafter, instruction/execution units 20 must obtain the updated data from the L3 memory 10. The storage controller 12 further includes a plurality of buffers for buffering data being input to L3 memory 10 from the I/O subsystem controls 14 and for buffering data being input to L3 memory 10 from instruction/execution units 20. The buffer associated with the instruction/execution units 20 is a 256 byte line buffer which allows the building of entries 8 bytes at a time for certain types of instructions, such as sequential operations. This line buffer, when full, will cause a block transfer of data to L3 memory to occur. Therefore, memory operations are reduced from a number of individual store operations to a much smaller number of line transfers.
The instruction/data caches 18 are each 16K byte caches. The interface to the storage controller 12 is 8 bytes wide; thus, an inpage operation from the storage controller 12 takes 8 data transfer cycles. The data cache 18 is a "store through" cache, which means that data from the instruction/execution units 20 are stored in L3 memory and, if the corresponding obsolete data is not present in the L1 caches 18, the data is not brought into and stored in the L1 caches. To assist this operation, a "store buffer" is present with the L1 data cache 18 which is capable of buffering up to 8 store operations.
The vector processor 22 is connected to the data cache 18. It shares the dataflow of the instruction/execution unit 20 into the storage controller 12, but the vector processor 22 will not, while it is operating, permit the instruction/execution unit 20 to make accesses into the storage controller 12 for the fetching of data.
The integrated I/O subsystem 14 is connected to the storage controller 12 via an 8-byte bus. The subsystem 14 comprises three 64-byte buffers used to synchronize data coming from the integrated I/O subsystem 14 with the storage controller 12. That is, the instruction/execution unit 20 and the I/O subsystem 14 operate on different clocks, the synchronization of the two clocks being achieved by the three 64-byte buffer structure.
The multisystem channel communication unit 24 is a 4-port channel-to-channel adapter, packaged externally to the system.
Referring to FIG. 2, a triadic (multiprocessor) system employing the present invention is illustrated. In FIG. 2, a pair of L3 memories 10a/10b are connected to a bus switching unit (BSU) 26, the BSU including an L2 cache 26a. The BSU 26 is connected to the integrated I/O subsystem 14, to shared channel processor 28, and to three processors: a first processor including instruction/data caches 18a and instruction/execution unit/control store 20a, a second processor including instruction/data caches 18b and instruction/execution units/control store 20b, and a third processor including instruction/data caches 18c and instruction/execution units/control store 20c. Each of the instruction/data caches 18a, 18b, and 18c are termed "L1" caches. The cache in the BSU 26 is termed the L2 cache 26a, and the main memory 10a/10b is termed the L3 memory.
The BSU 26 connects the three processors 18a/20a, 18b/20b, and 18c/20c, two L3 memory ports 10a/10b, two shared channel processors 28, and an integrated I/O subsystem 14. The BSU 26 comprises circuits which decide the priority for requests to be handled, such as requests from each of the three processors to L3 memory, or requests from the I/O subsystem 14 or shared channel processors, circuits which operate the interfaces, and circuits to access the L2 cache 26a. The L2 cache 26a is a "store in" cache, meaning that operations which access the L2 cache, to modify data, must also modify data resident in the L2 cache (the only exception to this rule is that, if the operation originates from the I/O subsystem 14, and if the data is resident only in L3 memory 10a/10b and not in L2 cache 26a, the data is modified only in L3 memory, not in L2 cache). The system also contains vector processors 22a, 22b, and 22c associated with instruction/execution units 20a, 20b and 20c, respectively.
The interface between the BSU 26 and L3 memories 10a/10b comprises two 16-byte lines/ports in lieu of the single 8-byte port in FIG. 1. However, the memory 10 of FIG. 1 is identical to the memory cards 10a/10b of FIG. 2. The two memory cards 10a/10b of FIG. 2 are accessed in parallel.
The shared channel processor 28 is connected to the BSU 26 via two ports, each port being an 8-byte interface. The shared channel processor 28 is operated at a frequency which is independent of the BSU 26, the clocks within the BSU being synchronized with the clocks in the shared channel processor 28 in a manner which is similar to the clock synchronization between the storage controller 12 and the integrated I/O subsystem 14 of FIG. 1.
A functional description of the operation of the uniprocessor computer system of FIG. 1 will be set forth in the following paragraphs with reference to FIG. 1.
Normally, instructions are resident in the instruction cache (L1 cache) 18, waiting to be executed. The instruction/execution unit 20 searches a directory disposed within the L1 cache 18 to determine if the typical instruction is stored therein. If the instruction is not stored in the L1 cache 18, the instruction/execution unit 20 will generate a storage request to the storage controller 12. The address of the instruction, or the cache line containing the instruction, will be provided to the storage controller 12. The storage controller 12 will arbitrate for access to the bus connected to the L3 memory 10. Eventually, the request from the instruction/execution unit 20 will be passed to the L3 memory 10, the request comprising a command indicating a line in L3 memory is to be fetched for transfer to the instruction/execution unit 20. The L3 memory will latch the request, decode it, select the location in the memory card wherein the instruction is stored, and, after a few cycles of delay, the instruction will be delivered to the storage controller 12 from the L3 memory in 8-byte increments. The instruction is then transmitted from the storage controller 12 to the instruction cache (L1 cache) 18, wherein it is temporarily stored. The instruction is retransmitted from the instruction cache 18 to the instruction buffer within the instruction/execution unit 20. The instruction is decoded via a decoder within the instruction unit 20. Quite often, an operand is needed in order to execute the instruction, the operand being resident in memory 10. The instruction/execution unit 20 searches the directory in the data cache 18; if the operand is not found in the directory of the data cache 18, another storage access is issued by the instruction/execution unit 20 to access the L3 memory 10, exactly in the manner described above with respect to the instruction cache miss. The operand is stored in the data cache, the instruction/execution unit 20 searching the data cache 18 for the operand. If the instruction requires the use of microcode, the instruction/execution unit 20 makes use of the microcode resident on the instruction execution unit 20 card. If an input/output (I/O) operation need be performed, the instruction/execution unit 20 decodes an I/O instruction, resident in the instruction cache 18. Information is stored in an auxiliary portion of L3 memory 10, which is sectioned off from instruction/execution. At that point, the instruction/execution unit 20 informs the integrated I/O subsystem 14 that such information is stored in L3 memory, the subsystem 14 processor accessing the L3 memory 10 to fetch the information.
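The fetch and store paths just described can be pictured with a small toy model. This is only a sketch under stated assumptions: the 64-byte line size is inferred from the 8-byte interface and the 8 transfer cycles per inpage, the class name `ToyUniprocessor` and its methods are hypothetical, and arbitration, the store buffer, and the 256 byte line buffer are deliberately left out.

```python
LINE_BYTES = 64   # assumption: 8 transfer cycles x 8-byte interface
BUS_BYTES = 8     # interface width stated in the text


class ToyUniprocessor:
    """Toy model of the L1 / L3 fetch and store paths (illustrative only)."""

    def __init__(self):
        self.l3 = {}         # line address -> bytearray (backing store)
        self.l1 = {}         # line address -> bytearray (cached copies)
        self.transfers = 0   # count of 8-byte bus transfers

    def _line(self, address):
        # Address of the first byte of the line containing `address`.
        return address - (address % LINE_BYTES)

    def fetch(self, address):
        """Return one byte, inpaging the whole line on an L1 miss."""
        line = self._line(address)
        if line not in self.l1:
            data = self.l3.setdefault(line, bytearray(LINE_BYTES))
            # The line arrives from L3 in 8-byte increments.
            self.transfers += LINE_BYTES // BUS_BYTES
            self.l1[line] = bytearray(data)
        return self.l1[line][address - line]

    def store(self, address, value):
        """Store-through: always update L3; update L1 only on a hit."""
        line = self._line(address)
        self.l3.setdefault(line, bytearray(LINE_BYTES))[address - line] = value
        if line in self.l1:
            self.l1[line][address - line] = value
```

In this model a store to a line that is not cached leaves the L1 untouched, matching the "store through" description, and the first fetch of a line costs eight bus transfers while later fetches from the same line cost none.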
A functional description of the operation of the multiprocessor computer system of FIG. 2 will be set forth in the following paragraphs with reference to FIG. 2. In FIG. 2, assume that a particular instruction/execution unit, one of 20a, 20b or 20c, requires an instruction and searches its own L1 cache, one of 18a, 18b or 18c, for the desired instruction. Assume further that the desired instruction is not resident in the L1 cache. The particular instruction execution unit will then request access to the BSU 26 in order to search the L2 cache disposed therein. The BSU 26 contains an arbiter which receives requests from each of the instruction/execution units 20a, 20b, 20c and from the shared channel processor 28 and from the integrated I/O subsystem 14, the arbiter granting access to one of these units at a time. When the particular instruction/execution unit (one of 20a-20c) is granted access to the BSU to search the L2 cache 26a, the particular instruction/execution unit searches the directory of the L2 cache 26a disposed within the BSU 26 for the desired instruction. Assume that the desired instruction is found in the L2 cache. In that case, the desired instruction is returned to the particular instruction/execution unit. If the desired instruction is not located within the L2 cache, as indicated by its directory, a request is made to the L3 memory, one of 10a or 10b, for the desired instruction. If the desired instruction is located in the L3 memory, it is immediately transmitted to the BSU 26, 16 bytes at a time, and is bypassed to the particular instruction/execution unit (one of 20a-20c) while simultaneously being stored in the L2 cache 26a in the BSU 26. Additional functions resident within the BSU relate to rules for storage consistency in a multiprocessor system. For example, when a particular instruction/execution unit 20c (otherwise termed "processor" 20c) modifies data, that data must be made visible to all other instruction/execution units, or "processors", 20a, 20b in the complex. If processor 20c modifies data presently stored in its L1 cache 18c, a search for that particular data is made in the L2 cache directory 26a of the BSU 26. If found, the particular data is modified to reflect the modification in the L1 cache 18c. Furthermore, the other processors 20a and 20b are permitted to see the modified, correct data now resident in the L2 cache 26a in order to permit such other processors to modify their corresponding data resident in their L1 caches 18a and 18b. The subject processor 20c cannot reaccess the particular data until the other processors 20a and 20b have had a chance to modify their corresponding data accordingly.
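The search order and the consistency rule described above can be sketched as a toy model. Everything named here is hypothetical (`ToyTriadicSystem`, `fetch`, `store`), and several simplifications are assumed: data is tracked per address rather than in 16-byte transfers, the BSU arbiter is omitted, the L2 copy is always updated on a store, and the other L1 copies are updated immediately rather than after a delay.

```python
class ToyTriadicSystem:
    """Toy model of three L1 caches, a shared L2, and L3 (illustrative only)."""

    def __init__(self, processors=("20a", "20b", "20c")):
        self.l1 = {p: {} for p in processors}  # per-processor L1: addr -> value
        self.l2 = {}                           # shared L2 in the BSU
        self.l3 = {}                           # main (L3) memory

    def fetch(self, proc, addr):
        """Search L1, then L2, then L3; return (value, level that hit)."""
        if addr in self.l1[proc]:
            return self.l1[proc][addr], "L1"
        if addr in self.l2:
            level = "L2"
        else:
            # L2 miss: data comes from L3 and is stored in L2 on the way.
            self.l2[addr] = self.l3.get(addr, 0)
            level = "L3"
        self.l1[proc][addr] = self.l2[addr]
        return self.l1[proc][addr], level

    def store(self, proc, addr, value):
        """Modify data and make the change visible to the other processors."""
        self.l1[proc][addr] = value
        self.l2[addr] = value  # simplification: L2 copy is always updated
        for other, cache in self.l1.items():
            if other != proc and addr in cache:
                cache[addr] = value  # other L1 copies are brought up to date
```

A first fetch by one processor is satisfied from L3, a later fetch of the same address by another processor is satisfied from L2, and after a store by any processor every cached copy holds the new value.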
Referring to FIG. 3, a detailed construction of each instruction/execution unit (20 in FIG. 1 or one of 20a-20c in FIG. 2) and its corresponding L1 cache (18 in FIG. 1 or one of 18a-18c in FIG. 2) is illustrated. In FIG. 1 and in FIG. 2, the instruction/execution units 20, 20a, 20b and 20c are disposed in a block labelled "I-unit E-unit C/S (92KB)". This block may be termed the "processor", the "instruction processing unit", or, as indicated above, the "instruction/execution unit". For the sake of simplicity in the description provided below, each of the blocks 20, 20a-20c will be called the "processor". In addition, the "I/D caches (L1)" will be called the "L1 cache". FIG. 3 provides a detailed construction for the processor (20, 20a, 20b or 20c) and for the L1 cache (18, 18a, 18b or 18c).
In FIG. 3, the processor (one of 20, 20a-20c) comprises the following elements. The control store subsystem 20-1 comprises a high speed fixed control store 20-1a of 84k bytes, a pageable area (8k byte, 2k word, 4-way associative pageable area) 20-1b, a directory 20-1c for the pageable control store 20-1b, a control store address register (CSAR) 20-1d, and an 8-element branch and link (BAL STK) facility 20-1e. Machine state controls 20-2 include the global controls 20-2a for the processor and an op branch table 20-2b connected to the CSAR via the control store origin address bus, which is used to generate the initial address for microcoded instructions. An address generation unit 20-3 comprises 3 chips, a first being an instruction cache DLAT and directory 20-3a, a second being a data cache DLAT and directory 20-3b, and a third being an address generation chip 20-3c connected to the L1 cache 18, 18a-18c via the address bus. The instruction DLAT and directory 20-3a is connected to the instruction cache portion of the L1 cache via four "hit" lines which indicate that the requested instruction will be found in the instruction cache portion 18-1e of the L1 cache. Likewise, four "hit" lines connect the data DLAT and directory 20-3b, indicating that the requested data will be found in the data cache 18-2b portion of the L1 cache. The address generation unit 20-3 contains copies of the 16 general