WO1991020042A1

WO1991020042A1 - Fast interrupt mechanism for a multiprocessor system

Info

Publication number: WO1991020042A1
Application number: PCT/US1991/004074
Authority: WO
Inventors: Robert E. Ii Strout; George A. Spix; Edward C. Miller; Anthony R. Schooler; Alexander A. Silbey; Andrew E. Phelps; Brian D. Vanderwarn; Gregory G. Gaertner
Original assignee: Supercomputer Systems Limited Partnership
Priority date: 1990-06-11
Filing date: 1991-06-10
Publication date: 1991-12-26
Also published as: JPH05508046A

Abstract

A fast interrupt mechanism (33) is capable of simultaneously interrupting a community of associated processors (10) in a multiprocessor system. The fast interrupt mechanism (33) enables the more effective debugging of software executing on a multiprocessor system by allowing all of the processors (10) in a community associated with a parallel process to be halted within a limited number of clock cycles following a hardware exception or processor breakpoint. The fast interrupt mechanism (33) consists of a set of registers that are used to identify associations among multiple processors, a comparison matrix (450) that is used to select processors to be interrupted, a network of interconnections that transmit interrupt events to and from the processors, and elements in the processors that create and respond to fast interrupt events.

Description

FAST INTERRUPT MECHANISM FOR A

MULTIPROCESSOR SYSTEM

TECHNICAL FIELD

This invention relates generally to the field of signaling and interrupt mechanisms for computer and electronic logic systems. More particularly, the present invention relates to a method and apparatus for simultaneously performing a fast interrupt on a community of associated processors in a multiprocessor system.

BACKGROUND ART

The previously filed parent application entitled CLUSTER ARCHITECTURE FOR A HIGHLY PARALLEL SCALAR/VECTOR MULTIPROCESSOR SYSTEM, PCT Serial No.: PCT/US90/07655, describes a new cluster architecture for high-speed computer processing systems, referred to as supercomputers. For most supercomputer applications, the objective is to provide a computer processing system with the fastest processing speed and the greatest processing flexibility, i.e., the ability to process a large variety of traditional application programs. In an effort to increase the processing speed and flexibility of supercomputers, the cluster architecture for highly parallel multiprocessors described in the parent application provides an architecture wherein multiple processors and external interface means can make multiple, simultaneous requests to a common set of shared hardware resources, such as main memory, secondary memory, global registers, interrupt mechanisms, or other shared resources present in the system. One of the important problems in designing such shared-resource multiprocessor systems is providing an effective mechanism for quickly interrupting processors in the event of a hardware exception or processor breakpoint. Parallel processing software running on present multiprocessor systems is sometimes very difficult to debug in the event of a failure. Because present supercomputers lack an effective fast interrupt mechanism, it is difficult to stop all of the processors in the community associated with the parallel process. As a result, one or more of the processors that have not stopped may destroy the information needed to identify the failure, because they continue to operate for hundreds or even thousands of clock cycles after the failing processor detects the failure.

For example, most prior art massively parallel computer processing systems utilize a wavefront technique for interrupting the processors. The processor that encounters the exception or that generates the interrupt is treated as the center of the wave and the interrupt is propagated out from that processor according to the particular interconnection architecture for the system. In this model, the processors on the outermost edge of the system will not be interrupted until the interrupt wave reaches them, a time period that may be quite variable.

Another problem with many of the present interrupt mechanisms for multiprocessor systems is that all of the processors in the multiprocessor system are unconditionally interrupted in the event of a fault, not just the processors associated with a process group. The disadvantage to this technique is that all programs executing on the multiprocessor system are halted, not just the processes associated with the program experiencing the fault.

Two associated problems arise out of the inability to adequately direct signal requests to one or more processors. First, with many of the present interrupt mechanisms for supercomputers, for example, those of the Cray-1 and Cray X-MP supercomputers available from Cray Research, Inc., the interrupt mechanisms do not allow an intelligent peripheral to target a service request to the particular processor that could best field the service request. Second, if a peripheral or processor requires the attention of more than one other processor, multiple sequential signals or interrupts must be sent, one for each of the processors being requested. Although the prior art interrupt mechanisms for multiprocessor systems are acceptable under certain conditions, it would be desirable to provide a more effective interrupt mechanism for a multiprocessor system that is able to interrupt the execution of all processors in a community of associated processors within a bounded number of clock cycles from an interrupt. In addition, it would be desirable to provide an interrupt mechanism for the cluster architecture for the multiprocessor system described in the parent application that aids in providing a fully distributed, multithreaded software environment capable of implementing parallelism by default.

SUMMARY OF THE INVENTION

The present invention is a method and apparatus for simultaneously interrupting a community of associated processors in a multiprocessor system. The present invention enables more effective debugging of software executing on a multiprocessor system by allowing all of the processors in a community associated with a parallel process to be halted within a limited number of clock cycles following a hardware exception or processor breakpoint. The present invention consists of a set of registers that are used to identify associations among multiple processors, a comparison matrix that is used to select processors to be interrupted, a network of interconnections that transmit interrupt events to and from the processors, and elements in the processors that create and respond to Fast Interrupt events. The multiprocessor system of the preferred embodiment includes a number of interrupt-type actions that will be referred to as events. Events include signals, traps, exceptions and interrupts. Because these events differ from one another, different software, and in some cases different hardware, is required to process each event most efficiently. Unlike prior art multiprocessor systems, the preferred multiprocessor is capable of processing events both in the processors and in the peripheral controllers that are allowed to distributively access the common data structures associated with the multithreaded operating system. Also unlike prior art multiprocessor systems, the present invention halts only those processors executing processes associated with the process group for a given program, instead of halting all of the processors in the multiprocessor system. An objective of the present invention is to provide a method and apparatus for a fast interrupt mechanism for a multiprocessor system that can quickly and effectively interrupt all of the processors in a community associated with a parallel process operating on a multiprocessor system.
Another objective of the present invention is to provide a fast interrupt mechanism that can resolve simultaneous conflicting interrupts issued by a plurality of processors.

A further objective of the present invention is to provide a fast interrupt mechanism that halts only those processors executing processes associated with the process group for a given program, instead of halting all of the processors in a multiprocessor system.

These and other objectives of the present invention will become apparent with reference to the drawings, the detailed description of the preferred embodiment and the appended claims.

DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram of a single multiprocessor cluster of the preferred embodiment of the present invention.

Fig. 2 is a block diagram of a four-cluster implementation of the preferred embodiment of the present invention.

Fig. 3 is a block diagram of the interconnections for the Fast Interrupt Mechanism of the present invention.

Figs. 4a and 4b represent the addresses for the SETI and SETN registers in the four cluster implementation of the preferred embodiment of the present invention.

Fig. 5 is a block diagram showing the flow of fast interrupt events in a cluster in the preferred embodiment.

Fig. 6a is a diagram of the System Mode register.

Fig. 6b is a diagram of the Pending Interrupts register.

Fig. 6c is a diagram of the User Mode register.

Fig. 6d is a diagram of the Exception Status register.

Fig. 7 is a block diagram showing the implementation of the Fast Interrupt mechanism as part of the NRCA means of the preferred embodiment of the multiprocessor system.

Fig. 8 is a schematic representation of the logical and physical address maps for the global registers.

Fig. 9 is a block diagram of the Fast Interrupt mechanism SETN and SETI register data paths.

Fig. 10 is a schematic diagram of the Fast Interrupt dispatch logic.

Fig. 11 is a detailed implementation of one embodiment of the Fast Interrupt dispatch logic shown in Fig. 10.

Fig. 12 is a block diagram of the inter-cluster Fast Interrupt interface for the four cluster implementation of the preferred embodiment.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring to Fig. 1, the architecture of a single multiprocessor cluster of the preferred embodiment of the multiprocessor system for use with the present invention will be described. The preferred cluster architecture for a highly parallel scalar/vector multiprocessor system is capable of supporting a plurality of high-speed processors 10 sharing a large set of shared resources 12 (e.g., main memory 14, global registers 16, and interrupt mechanisms 18). The processors 10 are capable of both vector and scalar parallel processing and are connected to the shared resources 12 through an arbitration node means 20. Also connected through the arbitration node means 20 are a plurality of external interface ports 22 and input/output concentrators (IOC) 24, which are further connected to a variety of external data sources 26. The external data sources 26 may include a secondary memory system (SMS) 28 linked to the input/output concentrator 24 via a high speed channel 30. The external data sources 26 may also include a variety of other peripheral devices and interfaces 32 linked to the input/output concentrator 24 via one or more standard channels 34. The peripheral devices and interfaces 32 may include disk storage systems, tape storage systems, printers, external processors, and communication networks. Together, the processors 10, shared resources 12, arbitration node 20 and external interface ports 22 comprise a single multiprocessor cluster 40 for a highly parallel multiprocessor system in accordance with the preferred embodiment of the present invention.

The preferred embodiment of the multiprocessor clusters 40 overcomes the direct-connection interface problems of present shared-memory supercomputers by physically organizing the processors 10, shared resources 12, arbitration node 20 and external interface ports 22 into one or more clusters 40. In the preferred embodiment shown in Figs. 2a and 2b, there are four clusters: 40a, 40b, 40c and 40d. Each of the clusters 40a, 40b, 40c and 40d physically has its own set of processors 10a, 10b, 10c and 10d, shared resources 12a, 12b, 12c and 12d, and external interface ports 22a, 22b, 22c and 22d that are associated with that cluster. The clusters 40a, 40b, 40c and 40d are interconnected through a remote cluster adapter 42 that is a logical part of each arbitration node means 20a, 20b, 20c and 20d. Although the clusters 40a, 40b, 40c and 40d are physically separated, the logical organization of the clusters and the physical interconnection through the remote cluster adapter 42 enable the desired symmetrical access to all of the shared resources 12a, 12b, 12c and 12d across all of the clusters 40a, 40b, 40c and 40d.

Referring to Fig. 3, the Fast Interrupt mechanism 100 will be described. The Fast Interrupt mechanism 100 allows a processor 10 to simultaneously send an interrupt to all other processors 10 associated with the same process. The Fast Interrupt mechanism 100 can process all of the interrupt signals received at the interrupt mechanism 18 simultaneously, in a single cycle. However, signal delays may cause both the issuance and the receipt of Fast Interrupts to lag by several cycles before and after the interrupt logic. Processors 10 are mapped into logical sets for the purpose of operating system control by the contents of a group of Set Number (SETN) registers that are part of each cluster 40. In the preferred embodiment, there are 32 SETN registers in the global register system for a single cluster 40, one for each processor 10 in that cluster. A processor 10 is assigned to a set (process group) by loading that processor's SETN register with the value identifying the desired set. The 7-bit set value permits identification of 128 unique sets, one for each processor in a full system. Both processors and peripheral controllers can read and write the SETN registers.
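As an illustration of the SETN register file described above, the following minimal Python sketch models one cluster's 32 SETN registers holding 7-bit set numbers. The class name, the zero reset value, and the choice to discard bits above bit 6 on a write are assumptions made for illustration; the description only states that there is one register per processor and that set values are 7 bits wide.

```python
SETN_PER_CLUSTER = 32  # one SETN register per processor in a cluster
SET_MASK = 0x7F  # set numbers are 7 bits wide, giving 128 possible sets


class SetNumberRegisters:
    """Hypothetical model of one cluster's SETN registers."""

    def __init__(self) -> None:
        # Reset contents are not given in the description; zero is assumed here.
        self._setn = [0] * SETN_PER_CLUSTER

    def write(self, processor: int, set_number: int) -> None:
        """Assign a processor to a set (process group); only bits 6..0 are kept."""
        if not 0 <= processor < SETN_PER_CLUSTER:
            raise IndexError("processor index out of range for one cluster")
        self._setn[processor] = set_number & SET_MASK

    def read(self, processor: int) -> int:
        """Return the set number currently assigned to a processor."""
        return self._setn[processor]
```

For example, writing set number 5 for processor 3 and then reading it back returns 5; processors that were never written keep the assumed reset value.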

Another register, called the SET Interrupt (SETI) register, can be used to generate a Fast Interrupt to an explicit set of processors. When a 7-bit value is written to SETI, it is compared with the contents of all of the SETN registers, and a Fast Interrupt is sent to every processor whose SETN contents match the written value. Both processors and peripheral controllers can write SETI, but SETI cannot be read.
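The selection performed by a SETI write can be sketched as a simple match of the written value against every SETN value. This is an illustrative Python model, not the hardware: the function name is hypothetical, and the SETN contents are represented as a plain list indexed by processor number.

```python
from typing import List

SET_MASK = 0x7F  # 7-bit set numbers


def seti_targets(setn: List[int], seti_value: int) -> List[int]:
    """Return the processors interrupted by writing seti_value to SETI.

    Every processor whose SETN contents match the written value is selected.
    The writer may be a processor or a peripheral controller, so no processor
    is excluded from the comparison.
    """
    value = seti_value & SET_MASK
    return [p for p, s in enumerate(setn) if s == value]
```

With SETN contents [1, 2, 1, 3], writing 1 to SETI selects processors 0 and 2.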

SETN and SETI registers are part of the set of global registers 16 and are accessed in the same manner. From a processor, the SETN registers are read and written, and the SETI register is written, via the Group 1 MOVE instruction. Attempting the other Group 1 instructions on the SETN registers causes no change to the registers, and any data returned is unpredictable. The SETN and SETI registers occupy bits 6 through 0 of the highest global register addresses in a cluster. They are addressed by setting a binary one in bit 15 of the GOFFSET register. Fig. 4a shows the SETN and SETI address map. Input/output devices have a memory map that is different from the map used by the processors. The SETN addresses as seen from a peripheral controller for an input/output device are shown in Fig. 4b.

Referring now to Fig. 5, the method for generating and receiving Fast Interrupt events in the processors will be described. The event request means 200 creates Fast Interrupt events in two basic ways. The first method is through assertion of the Fast Interrupt Request Line, which can be asserted by: (1) an exception condition (captured in the Exception Status register); or (2) issuing a Fast Associate Interrupt Request (FAIR) instruction to request an interrupt in the set of associated processors. The second method is through writing a set number to the SETI register. The various processor control registers that the processor uses in handling events are shown and described in Figs. 6a, 6b, 6c and 6d. The Pending Interrupt (PI) register contains an indicator showing that a Fast Interrupt has been received (FI, bit 11). A System Mode register bit, Disable Fast Interrupt (DFI, bit 11), disables incoming Fast Interrupt requests. A processor cannot be interrupted by a Fast Interrupt request while DFI is set to one.
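The receiving side of the register bits just described can be sketched in Python as follows. The bit positions (FI at bit 11 of PI, DFI at bit 11 of the System Mode register) come from the description; the rest is an assumption. In particular, the description does not say whether a Fast Interrupt arriving while DFI is one is latched for later or discarded, and this sketch simply ignores it.

```python
PI_FI_BIT = 11  # Pending Interrupt register: Fast Interrupt received
SM_DFI_BIT = 11  # System Mode register: Disable Fast Interrupt


def receive_fast_interrupt(pi: int, sm: int) -> int:
    """Return the updated PI register after a Fast Interrupt arrives.

    Assumption: while DFI is one the incoming request is ignored (not latched);
    the source states only that the processor cannot be interrupted.
    """
    if (sm >> SM_DFI_BIT) & 1:
        return pi
    return pi | (1 << PI_FI_BIT)
```

Other PI bits are left untouched, so an already pending interrupt of another type is preserved.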

The algorithm for generating a Fast Interrupt request by a processor exception is:

(ENBx and EXCEPTIONx) and (not DEX) and (not FIRM)

where:

- ENBx is one of the exception enables in the User Mode (UM) register;

- EXCEPTIONx is the associated exception condition, as shown in the Exception Status (ES) register;

- DEX is the Disable Exception bit in the System Mode (SM) register (bit 6);

- FIRM is the Fast Interrupt Request Mask in the System Mode (SM) register (bit 4).

FIRM disables generation of a Fast Interrupt request when any exception is encountered; setting FIRM to one disables Fast Interrupt requests. If an individual exception is disabled, a Fast Interrupt cannot occur for that type of exception. Note that setting DEX to one disables Fast Interrupt requests regardless of the state of FIRM.
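The request condition above can be expressed directly in code. In this Python sketch the DEX and FIRM bit positions are taken from the description, but the packing of the UM exception enables and the ES exception conditions into integers with matching bit positions is a hypothetical layout (the real layouts are in Figs. 6c and 6d). It also assumes the request line is asserted when the condition holds for at least one exception type x.

```python
SM_FIRM_BIT = 4  # Fast Interrupt Request Mask, System Mode register bit 4
SM_DEX_BIT = 6  # Disable Exception, System Mode register bit 6


def exception_fast_interrupt_request(um_enables: int, es_exceptions: int, sm: int) -> bool:
    """Evaluate (ENBx and EXCEPTIONx) and (not DEX) and (not FIRM).

    Assumption: bit x of um_enables is ENBx and bit x of es_exceptions is
    EXCEPTIONx, and the request is raised if any x satisfies the condition.
    """
    dex = (sm >> SM_DEX_BIT) & 1
    firm = (sm >> SM_FIRM_BIT) & 1
    enabled_and_raised = um_enables & es_exceptions  # ENBx and EXCEPTIONx for every x
    return enabled_and_raised != 0 and not dex and not firm
```

Because DEX and FIRM are checked independently, setting DEX suppresses the request even when FIRM is zero, matching the note above.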

A processor 10 can also generate a Fast Interrupt request through the FAIR instruction. Issuing a FAIR instruction causes all processors 10 whose SETN registers contain the same value as the SETN register of the issuing processor to receive a Fast Interrupt. It should be noted that the processor 10 that executed the FAIR instruction does NOT receive a Fast Interrupt itself. In addition, Fast Interrupt requests made by the FAIR instruction are masked by FIRM: setting FIRM to one disables these requests.
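The FAIR target selection can be sketched as follows. This illustrative Python function (its name is hypothetical) applies the three rules stated for FAIR: match on the issuer's SETN value, exclude the issuer, and suppress the request when the issuer's FIRM bit is one. Whether each receiver accepts the interrupt also depends on its own DFI bit, which this sketch does not model.

```python
from typing import List

SM_FIRM_BIT = 4  # Fast Interrupt Request Mask, System Mode register bit 4


def fair_targets(setn: List[int], issuer: int, issuer_sm: int) -> List[int]:
    """Return the processors that receive a Fast Interrupt when `issuer` executes FAIR.

    All processors whose SETN equals the issuer's SETN are selected except the
    issuer itself; the request is suppressed when the issuer's FIRM is one.
    """
    if (issuer_sm >> SM_FIRM_BIT) & 1:
        return []
    own_set = setn[issuer]
    return [p for p, s in enumerate(setn) if s == own_set and p != issuer]
```

With SETN contents [7, 7, 3, 7], processor 0 executing FAIR interrupts processors 1 and 3 but not itself.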

Although both peripheral devices 32 and the SMS 28 may initiate Fast Interrupts, only processors 10 can be interrupted by Fast Interrupt operations. The input/output subsystem allows a peripheral device 32 to directly write the number of the process set to be interrupted to the Fast Interrupt logic. As previously described, this occurs by writing into the SETI register. All processors 10 whose SETN registers contain the value written are then interrupted. Writing to SETI is the only mechanism that allows Fast Interrupts to be initiated from peripheral devices 32.

In operation, the operating system software (including a user-side scheduler in the preferred embodiment) assigns the processors 10 to a process group by writing the process number in the SETN register for that processor 10. Two steps are necessary to include a processor 10 in a set (process group). First, the SETN register for that processor 10 must be written with the number of the set it will be associated with. Second, the DFI bit in that processor's System Mode register must be set to zero. Peripheral devices 32 cannot receive Fast Interrupts.

Referring now to Fig. 7, the physical organization of the Fast Interrupt Mechanism in the four-cluster preferred embodiment of the present invention will be described. The preferred embodiment provides addressing for up to 128 SETN registers located among the four clusters 40, with 32 SETN registers per cluster 40. In this embodiment, the SETN registers 16 for each cluster 40 are physically located within the NRCA means 46 of that cluster. For the purposes of addressing, the SETI register is treated as one of the SETN registers and is accessed in the same manner, except that the SETI register cannot be read.
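The two-step procedure for including a processor in a set, described earlier, can be sketched as a small Python model. The `ProcessorState` type, its field names, and the assumed reset values (SETN zero, DFI one) are hypothetical; only the two steps and the DFI bit position (System Mode bit 11) come from the description.

```python
from dataclasses import dataclass

SET_MASK = 0x7F  # 7-bit set numbers
SM_DFI_BIT = 11  # Disable Fast Interrupt, System Mode register bit 11


@dataclass
class ProcessorState:
    setn: int = 0  # this processor's SETN contents (hypothetical reset value)
    sm: int = 1 << SM_DFI_BIT  # System Mode register; DFI assumed to start at one


def join_set(proc: ProcessorState, set_number: int) -> None:
    # Step 1: write the set number into the processor's SETN register.
    proc.setn = set_number & SET_MASK
    # Step 2: clear DFI so that incoming Fast Interrupts are accepted.
    proc.sm &= ~(1 << SM_DFI_BIT)


def accepts_fast_interrupt(proc: ProcessorState, set_number: int) -> bool:
    """True if a Fast Interrupt aimed at set_number would interrupt this processor."""
    return proc.setn == (set_number & SET_MASK) and not ((proc.sm >> SM_DFI_BIT) & 1)
```

Performing only the first step leaves DFI at one under this sketch's assumed reset state, so the processor would still ignore Fast Interrupts for its set; both steps are required.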

There are sixteen ports 47 to the global registers 16, signal logic 31, and fast interrupt logic 33 from the thirty-two processors 10 and thirty-two external interface ports 22 in a cluster 40. Each port 47 is shared by two processors 10 and two external interface ports 22 and is accessed over the path 52. A similar port 49 services inter-cluster requests for the global registers 16, signal logic 31, and fast interrupt logic 33 in this cluster as received by the MRCA means 48 and accessed over the path 56. As each request is received at the NRCA means 46, a cross bar and arbitration means 51 directs the request to the appropriate destination. If simultaneous requests arrive for access to the SETN registers in the fast interrupt logic 33, for example, these requests are arbitrated in a pipelined manner by the cross bar and arbitration means 51, which uses the Multiple Requestor Toggling (MRT) algorithm. It receives input from sixteen arbitration nodes 44 and one MRCA means 48. An arbitration decision requires address information to select the target register and control information to determine the operation to be performed. This information is transmitted to the NRCA means 46 along with the data. The address and control can be for data to be sent to the global registers 16, the signal logic 31, or the fast interrupt logic 33.

The Multiple Requestor Toggling (MRT) priority system of the preferred embodiment of the present invention allows fair and efficient arbitration of simultaneous multiple requests to common shared resources by using a simple boolean algorithm to control a variety of switching mechanisms. All requestors are arbitrated in a distributed and democratic fashion: priority is assigned to multiple requests on a first-come, first-served basis, and the priority of multiple simultaneous requests is resolved by a toggling system. The MRT priority scheme is applicable to any system where multiple requestors communicate with a commonly shared resource requiring an arbitration network to resolve simultaneous conflicting requests to that resource. In this case, resolution of conflicts refers to the determination of the order in which requests for access to the common resource are serviced. The MRT priority system is also useful for determining access to multiple shared resources. In this case, one part of the MRT priority system, an inhibit matrix, is associated with each of the multiple shared resources, and each of these inhibit matrices is connected to each of the requestors. The inhibit matrices for all of the shared resources are connected to a common part, the relative priority state storage means, which maintains the priority of each requestor relative to the others for all shared resources. The relative priority state storage means stores the relative priority state of every requestor relative to every other requestor. Each cell or bit in the relative priority state storage means represents the relative priority of two requestors and indicates which of the two will be granted access in the case of simultaneous resource requests. Each of the cells of the relative priority state storage means is connected to the inhibit matrix.
Each cell in the relative priority state storage means drives two gates in the inhibit matrix for that destination. One gate represents requestor x inhibiting requestor y if x is higher priority, while the other gate represents requestor y inhibiting requestor x if y is higher priority. In this manner, the MRT priority system can be used to control a wide range of switching applications.
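The relative priority state and inhibit matrix described above can be sketched for a single shared resource as follows. This Python model is an assumption-laden illustration, not the patented circuit. It assumes an initial order in which a lower index has higher priority, and it assumes the "toggling" step makes the granted requestor lower priority than every other requestor. The relation is stored as a full matrix for readability, although hardware needs only one bit per pair, and the first-come, first-served aspect is not modeled.

```python
from typing import List, Optional


class MRTArbiter:
    """Sketch of Multiple Requestor Toggling arbitration for one shared resource.

    higher[x][y] is True when requestor x wins over requestor y on a
    simultaneous request. Requestor x is inhibited when some other active
    requestor y has priority over it (the inhibit-matrix gate for that pair).
    """

    def __init__(self, n: int) -> None:
        self.n = n
        # Assumed initial state: a lower index has higher priority.
        self.higher = [[x < y for y in range(n)] for x in range(n)]

    def grant(self, requests: List[bool]) -> Optional[int]:
        for x in range(self.n):
            if not requests[x]:
                continue
            inhibited = any(
                requests[y] and self.higher[y][x] for y in range(self.n) if y != x
            )
            if not inhibited:
                self._toggle(x)
                return x
        return None  # no requests this cycle

    def _toggle(self, winner: int) -> None:
        # Assumed toggling rule: the winner drops below every other requestor.
        for y in range(self.n):
            if y != winner:
                self.higher[winner][y] = False
                self.higher[y][winner] = True
```

Under these assumptions, three requestors that request continuously are granted in the rotating order 0, 1, 2, 0, so no requestor can be starved.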

Referring now to Fig. 8, the method for addressing the global registers 16 is illustrated. Two methods are shown. The logical address map 710 is used by the processor 10. The physical address map 720 is used by the IOC 24. The methods for accessing the global registers 16 are also used to address the SETN and SETI registers in the preferred embodiment.

Referring to Fig. 9, the preferred implementation of the Fast Interrupt mechanism 100 will be described. Fig. 9 shows the data path to the SETN registers 310 and the SETI register 320. Data to be written into any of the registers 310 or 320 is sent via the input bus 330. The input bus 330 is driven from the output of the cross bar and arbitration means 51. Information on the address and command lines 340 that accompany the data is decoded and used to select the register 310 or 320 to be written. Only one register 310 or 320 can be written in a single clock cycle. Data to be read is returned to the requesting device through the output mux and delay pipeline 350. The address information that accompanied the read command is decoded to create the multiplexer select controls 352. In the preferred embodiment, the delay pipeline 350 is used to delay the read data so that the read latency for the SETN registers matches that of the global registers 16. The delay must be the same for all registers that are read through the NRCA means 46 because the cross bar and arbitration means 51 return data to the output paths 52 and 54. Fig. 3 shows the interconnection of Fast Interrupt lines between the processors and the Fast Interrupt Mechanism. Fast Interrupt events are transmitted to the Fast Interrupt Mechanism through the event request lines. Fast Interrupts are transmitted to the processors 10 over the Fast Interrupt lines. There is a separate pair of event/interrupt lines for each processor 10. This implementation eliminates any contention among processors 10 to send requests, as well as any queuing to return Fast Interrupts to the processors 10 that would be incurred if shared, multiplexed transmission methods were used. As a result, the latency between a request and the resulting Fast Interrupt is kept to a minimum.

When one processor 10 in a set (the process group indicated by the SETN register) generates a Fast Interrupt request, the interrupt dispatch logic 450 sends interrupts to all of the processors 10 in the same set as the one that initiated the request by performing a 36-way simultaneous comparison of all SETN values (the 32 local SETN values, the SETI value, and the three values received from the other clusters), as shown in Fig. 10. The notations SETN0, SETN1, SETN2, and so on in Fig. 10 represent the values contained in SETN registers 0, 1, 2, etc. The notations C0FI, C1FI, and C2FI represent Fast Interrupt values sent from the other clusters; each value identifies the processor set that is to be interrupted. The method for creating and transmitting this value is described in more detail hereinafter. The notation SETI represents the contents of the SETI register. Comparisons among all values are made continuously. When a Fast Interrupt event is received from a processor 10 or an external interface port 22, the occurrence of the event is used in the qualification logic to initiate the actual interrupt. The combination of a comparison and the event signal from the initiating processor 10 or external interface port 22 causes the processor 10 associated with the comparison to be interrupted.
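The qualification performed by the dispatch logic can be sketched behaviorally in Python. For each local processor, the function ORs three kinds of terms: another local processor in the same set raised an event, the value written to SETI matches, or another cluster broadcast a matching set number with its valid signal. The function name and argument shapes are hypothetical, and receiver-side masking by DFI is not modeled.

```python
from typing import List, Optional, Sequence, Tuple


def cluster_fast_interrupts(
    setn: Sequence[int],
    fir: Sequence[bool],
    seti_write: Optional[int] = None,
    remote: Sequence[Tuple[int, bool]] = (),
) -> List[bool]:
    """Return the Fast Interrupt output for every processor in one cluster.

    setn: SETN contents of the local processors (32 in the preferred embodiment)
    fir: Fast Interrupt event lines from those processors
    seti_write: value written to SETI in this cycle, or None if none was written
    remote: (set number, valid) pairs from the other clusters, i.e. the
            (C0FI, C0FIR), (C1FI, C1FIR), (C2FI, C2FIR) inputs
    """
    n = len(setn)
    out = []
    for i in range(n):
        # Local terms: another processor in the same set raised an event.
        hit = any(fir[j] and setn[j] == setn[i] for j in range(n) if j != i)
        # SETI term: the value written to SETI matches this processor's set.
        hit = hit or (seti_write is not None and seti_write == setn[i])
        # Remote terms: another cluster broadcast this processor's set number.
        hit = hit or any(valid and value == setn[i] for value, valid in remote)
        out.append(hit)
    return out
```

For SETN contents [1, 1, 2, 2] and an event from processor 0 only, processor 1 is interrupted and processor 0 is not, because a processor's own event never qualifies its own interrupt.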

A detailed circuit diagram of the preferred implementation of the interrupt logic for the simultaneous comparison and verification circuit for an example four-interrupt system is shown in Fig. 11. Four SETN registers are shown. The outputs of the SETN registers are connected to a matrix of comparators in the illustrated manner such that all registers are simultaneously compared with each other. The results of the comparisons are logically ANDed with the Fast Interrupt event signals received from the processors, shown in the figure as FIR0, FIR1, FIR2, and FIR3, to produce the fast interrupt signals that are sent to each processor. Each comparison is used in the logical terms that produce interrupts for each processor associated with the two SETN registers being compared. For example, the output of the comparison between SETN0 and SETN1 is used in the terms that produce fast interrupts for processors 0 and 1. In the term that produces the fast interrupt for processor 0, the comparison result is ANDed with the fast interrupt event from processor 1; similarly, in the term for processor 1, the event from processor 0 is used. Since processors do not send Fast Interrupts to themselves through the comparison process, the event from any processor is not used in its own fast interrupt term. This method is repeated for all comparisons so that all possible combinations of associations can be properly handled by the qualification logic. All outputs from the AND operations for any processor are logically ORed together to form the actual processor Fast Interrupt. The example extends directly to the comparisons among the 32 SETN registers in a single cluster 40 and the generation of their fast interrupts.
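The comparator-matrix structure of the four-processor example can be mirrored in Python. The sketch below follows the structure described for Fig. 11 rather than simply searching for matches: each pairwise comparison SETNi == SETNj is computed once and feeds two AND terms, one qualified by FIRj for processor i and one qualified by FIRi for processor j, and each processor's terms are ORed. The function name is hypothetical; it works for any number of registers, including the 32 of a full cluster.

```python
from itertools import combinations
from typing import List, Sequence


def comparator_matrix_interrupts(setn: Sequence[int], fir: Sequence[bool]) -> List[bool]:
    """Gate-level style model of the SETN comparator matrix.

    Each comparison between SETNi and SETNj is evaluated once and used in
    two AND terms. A processor's own event never appears in its own term,
    and all AND terms for a processor are ORed into its Fast Interrupt.
    """
    n = len(setn)
    fi = [False] * n
    for i, j in combinations(range(n), 2):
        equal = setn[i] == setn[j]
        fi[i] = fi[i] or (equal and fir[j])  # term for processor i uses FIRj
        fi[j] = fi[j] or (equal and fir[i])  # term for processor j uses FIRi
    return fi
```

With SETN0 through SETN3 holding 5, 5, 9, 5 and only FIR1 asserted, processors 0 and 3 are interrupted, processor 2 is in a different set, and processor 1 does not interrupt itself.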

Because the SETN registers in other clusters are not available for comparison, the contents of the SETN register associated with a processor that has created a Fast Interrupt event are sent to all other clusters when the event is received at the Fast Interrupt mechanism. The means for doing this is shown in Fig. 12. Each of the Fast Interrupt events within the cluster (FIR 0-31) and the event of writing a value to the SETI register, shown as SETIR in the figure, is sent to the Transmission Multiplexer control logic 500. The assertion of any of these signals causes the Transmission Multiplexer control logic 500 to select the contents of the SETN register associated with the processor 10 that created the fast interrupt event (or the SETI register contents) to be sent through the transmission multiplexer 510 and on to the transmission interface logic 520. The transmission interface logic 520 sends the selected value to all other clusters 40, along with a signal indicating that this cluster 40 has had a fast interrupt event. Set number values from other clusters 40 are received at the Fast Interrupt reception interface logic 530. This logic 530 receives the values sent from other clusters 40 along with the valid signals and sends them to the comparison matrix. The set number values sent from clusters 1, 2, and 3 are referred to in the diagrams as C0FI, C1FI, and C2FI, respectively. The valid event signals from clusters 1, 2, and 3 are referred to as C0FIR, C1FIR, and C2FIR, respectively. It should be noted that if multiple fast interrupt events occur simultaneously in a cluster 40, the set number values associated with each event must be sent between clusters serially. The Transmission Multiplexer control logic 500 detects this condition and holds all events that have occurred until the associated set number value has been transmitted.
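The serial transmission of simultaneous events between clusters can be sketched as an ordering function. This Python model assumes the priority order the description gives for simultaneous events (SETI highest, then SETN registers in descending numerical order, with SETN0 lowest) and returns one (source, set number) pair per transfer. The function name and return format are hypothetical, and the cycle-by-cycle holding of events is represented only by the list order.

```python
from typing import List, Optional, Tuple


def transmission_order(
    pending_fir: List[int], setn: List[int], seti_value: Optional[int]
) -> List[Tuple[str, int]]:
    """Return the order in which held local events send set numbers to other clusters.

    pending_fir: indices of local processors whose Fast Interrupt events are held
    setn: SETN contents of the local processors
    seti_value: value written to SETI (SETIR asserted), or None
    One set number is transmitted per transfer; higher-priority events go first.
    """
    order = []
    if seti_value is not None:
        order.append(("SETI", seti_value))  # SETI events have the highest priority
    for p in sorted(pending_fir, reverse=True):  # higher-numbered SETN first
        order.append((f"SETN{p}", setn[p]))
    return order
```

For example, with events held for processors 0, 2 and 5 and a simultaneous SETI write, the SETI value is sent first, followed by the SETN5, SETN2 and SETN0 values.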
The order in which simultaneous events are resolved is determined by a simple priority scheme in which SETN0 has the lowest priority, ascending in numerical order to SETN31, which has the second-highest priority, with SETI events having the highest priority.

Although the description of the preferred embodiment has been presented, it is contemplated that various changes could be made without deviating from the spirit of the present invention. Accordingly, it is intended that the scope of the present invention be dictated by the appended claims rather than by the description of the preferred embodiment.

We claim:

Claims

1. A multiprocessor system comprising: a plurality of processors; and a fast interrupt means for simultaneously interrupting all of the processors associated with a selected set of parallel processors within a bounded number of clock cycles from the issue of an interrupt or exception.

2. The multiprocessor system of claim 1 wherein all of the processors associated with a selected set of parallel processors are automatically interrupted in response to an exception occurring in any one of the processors in the selected set of parallel processors.

3. A fast interrupt mechanism for simultaneously interrupting all of the processors in a multiprocessor system that are associated with a selected set of parallel processors within a bounded number of clock cycles from the issue of an interrupt or exception, the fast interrupt mechanism comprising: user mode, system mode, and pending interrupt registers for masking the external interrupt signal, as well as an event request means, an event mask means, and a context switch means, all in the processors; the fast interrupt mechanism further including SETI registers, SETN registers, and comparison matrix means.