WO1992004677A1 - Fault tolerant networking architecture - Google Patents

Fault tolerant networking architecture

Info

Publication number
WO1992004677A1
WO1992004677A1 PCT/US1991/006060 US9106060W WO9204677A1 WO 1992004677 A1 WO1992004677 A1 WO 1992004677A1 US 9106060 W US9106060 W US 9106060W WO 9204677 A1 WO9204677 A1 WO 9204677A1
Authority
WO
WIPO (PCT)
Prior art keywords
semaphore
heartbeat
semaphores
reservation
computer
Prior art date
Application number
PCT/US1991/006060
Other languages
French (fr)
Inventor
James Wallace Sundet
Roger Gene Brown
Original Assignee
Cray Research, Inc.
Priority date
Filing date
Publication date
Application filed by Cray Research, Inc.
Publication of WO1992004677A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/18 Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
    • G06F11/183 Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/167 Interprocessor communication using a common memory, e.g. mailbox

Definitions

  • This invention relates generally to computer networks, and in particular to a fault tolerant network having a semaphore box for controlling access to shared peripherals by a plurality of computers.
  • Multiprocessor systems typically include some method of providing interprocessor communication.
  • interprocessor communication through a shared main memory is typically referred to as a "loosely coupled" computer system.
  • Interprocessor communication through shared registers is typically referred to as a "tightly coupled" computer system.
  • Prior patents of the Assignee of the present invention, Cray Research, Inc. disclose various forms of interprocessor communication.
  • the present invention discloses a fault tolerant network for computers that includes a semaphore box for controlling access to shared peripherals.
  • the semaphore box is comprised of two major sections: an I/O section and a semaphore section.
  • the semaphore section contains two sets of semaphores: a first set comprising reservation semaphores for the shared peripherals and a second set comprising heartbeat semaphores for the sharing computers.
  • the first set is used to reserve a particular peripheral for a particular computer; the second set provides a "heartbeat" to prevent reservation semaphores from being set indefinitely in the event communication with a particular computer is lost.
  • the heartbeat semaphores are arranged in an array. Each time a semaphore command is received, a row of heartbeat semaphores is set. When the command completes, a column of heartbeat semaphores is returned to the requesting computer and then cleared. If the heartbeat semaphore in a particular position of the column is set, then the associated computer has accessed the semaphore box sometime between the current access of the requesting computer and its prior access. Conversely, if the particular heartbeat semaphore is cleared, then the associated computer has not accessed the semaphore box during the period. If the particular heartbeat semaphore for the associated computer remains cleared for some number of consecutive accesses, the requesting computer should conclude that the associated computer has lost communication with the semaphore box.
  • Figure 1 is a block diagram of a system configuration according to the preferred embodiment
  • Figure 2 is a block diagram of the semaphore box in the preferred embodiment
  • Figure 3 illustrates a channel command word in the preferred embodiment
  • Figure 4 illustrates the format of the channel status word in the preferred embodiment.
  • the term “semaphore” is often used. In the preferred embodiment, this term refers to a memory cell that is shared by a plurality of processors to provide a form of communication between the processors by indicating when significant events have taken place.
  • Figure 1 is a block diagram of a system configuration according to the preferred embodiment of the present invention.
  • the present invention discloses a fault tolerant network for computers 10 that includes a semaphore box 16 for controlling access to shared peripherals 12.
  • the computers 10 may be YMP/16 computers of the type manufactured by Cray Research, Inc., the Assignee of the present invention.
  • the peripherals 12 may be Solid-state Storage Devices (SSDs) of the type described in either U.S. Pat. No. 4,630,230, issued December 16, 1986, to Sundet (which patent is incorporated herein by reference), or in U.S. Pat. No. 4,951,246, issued August 21, 1990, to Fromm et al. (which patent is incorporated herein by reference).
  • SSDs Solid-state Storage Devices
  • Each peripheral 12 connects to the computers 10 via a high speed channel 15.
  • Each computer 10 connects to a semaphore box 16 via a low speed channel 14. Access to the shared peripherals 12 is coordinated among the computers 10 by means of communication through the semaphore box 16.
  • a block diagram of the semaphore box 16 is shown in Figure 2.
  • the semaphore box 16 is comprised of two major sections: a semaphore section and an I/O section.
  • the semaphore section contains two sets of semaphores: a first set comprising reservation semaphores for the shared peripherals 12; and a second set comprising heartbeat semaphores for the sharing computers 10.
  • the first set is used to reserve a particular peripheral 12 for a particular computer 10; the second set provides a "heartbeat" to prevent reservation semaphores from being set indefinitely in the event communication with a particular computer 10 is lost.
  • each reservation semaphore can be individually set, cleared, or tested by commands received by the semaphore box 16 from the computers 10.
  • each reservation semaphore is four bits long. One bit contains reservation information associated with one of the peripherals 12; the other three bits are a port identifier field containing a number of the input port 24A-24H which most recently set the reservation semaphore. The port identifier field thus identifies a particular computer 10 connected to the associated input port 24A-24H. The port identifier field is updated each time the reservation semaphore is set.
  • Table 1 is an illustration of an array of heartbeat semaphores, wherein each heartbeat semaphore is identified by a concatenated row-column value.
  • the array as shown in Table 1 measures eight by eight, but those skilled in the art will recognize that any size array may be used.
  • the array of heartbeat semaphores as shown in Table 1 is used to prevent the reservation semaphores from remaining in a "set" condition after communication with a computer 10 has been lost. A reservation semaphore that remains "set" prevents access to the associated peripheral 12 by the other computers 10.
  • Each time a semaphore command is received by the semaphore box 16, a row of heartbeat semaphores is set.
  • Preferably, the row number is the same as the number of the input port 24A-24H which received the command (i.e., row N is set by any command from Port N), although other means of associating rows with computers 10 could be used.
  • When the command completes, a column of heartbeat semaphores is returned to the requesting computer 10 by the output ports 26A-26H and then cleared to all zeros.
  • the column becomes part of the channel status word. Like the row number, the column number is preferably the same as the number of the input port 24A-24H which received the command (i.e., column N is returned in the status word for Port N), although other means of associating columns with computers 10 could be used.
  • the heartbeat semaphores provide a means of determining whether the computers 10 connected to the semaphore box 16 are still active. This is best illustrated by the following example. Assume computer A wants to determine the running status of computer B. Each time computer B sends a command to the semaphore box 16, all heartbeat semaphores in row B are set. Each time computer A sends a command to the semaphore box 16, the heartbeat semaphores in column A are returned in the channel status word and then cleared. Since the rows and columns intersect, the Bth heartbeat semaphore in column A indicates whether computer B has accessed the semaphore box 16 in the period since the previous computer A access.
  • any computer 10 connected to the semaphore box 16 must periodically access the semaphore box 16 to keep its "heartbeat" alive.
  • the I/O section of the semaphore box 16 consists of input ports, voting circuits, and output ports.
  • Each attached computer 10 communicates with the semaphore box 16 across a low speed channel 14 which attaches to the semaphore box 16 at an input port 24A-24H and an output port 26A-26H.
  • the computer 10 is attached to output ports 26A-26H identically numbered or otherwise associated with input ports 24A-24H.
  • Each input port 24A-24H and output port 26A-26H of the semaphore box 16 is logically independent from the others.
  • Operations on the semaphore box 16 are accomplished by transmitting commands to the input ports 24A-24H.
  • the input ports 24A-24H then transmit the commands to all three of the semaphore groups 18, 20, and 22. (Note that in Figure 2 only a portion of the connections between the input ports 24A-24H and the semaphore groups 18, 20, and 22 are illustrated.)
  • the semaphore groups 18, 20, and 22 transmit the results of the operation to voting circuits 28A-28H.
  • the three copies of the results are inspected by voting circuits 28A-28H.
  • the voting circuits 28A-28H detect an error if there is a difference in the execution results.
  • a port conflict occurs when more than one command is received by the semaphore section from the I/O section in the same clock period. Conflicts are resolved on a priority basis in the preferred embodiment, although other methods of resolving conflicts could be substituted therefor. In the preferred embodiment, lower numbered input ports 24A-24H have priority over higher numbered input ports 24A-24H.
  • requests from higher numbered input ports 24A-24H are held until the semaphore response for the lower numbered input ports 24A-24H is returned to the I/O section.
  • the requests are not queued, although alternative embodiments could easily implement such a scheme. Instead, in the preferred embodiment, if a lower numbered request is received while a higher numbered request is being held, the lower numbered request will again be honored first.
  • Figure 3 illustrates a channel command word 32 in the preferred embodiment.
  • the codes used in Figure 3 are defined as shown in Table 2.
  • the channel command word 32 consists of a single 64-bit word. It is transmitted over the low speed channel 14 as four 16-bit parcels followed by a channel disconnect pulse.
  • the parcels of the channel command word 32 are identified as: parcel 0 (32A), parcel 1 (32B), parcel 2 (32C), and parcel 3 (32D).
  • When an input port 24A-24H receives the channel command word 32, it performs error checks to detect possible parity errors, command compare errors, or channel protocol errors. Detection of a channel error aborts the operation.
  • Each parcel of the channel command word 32 contains the same information, but the individual bit assignments differ in each parcel. Rearranging the bit assignments prevents a defective wire, connector pin, or associated circuits from causing a catastrophic error. Eight bits of each parcel are used to hold a command. The remaining eight bits are unused, except for parity, and may contain any value. Of the eight command bits, two bits contain a semaphore function code and six bits contain a semaphore select code. The four possible function codes are described as shown in Table 3.
  • the semaphore select code determines which of the reservation semaphores are affected by the semaphore function code.
  • Each parcel is sent to a different semaphore group for execution: parcel 0 (32A) is sent to semaphore group 18; parcel 1 (32B) is sent to semaphore group 20; parcel 2 (32C) is sent to semaphore group 22; parcel 3 (32D) is discarded.
  • the appropriate output port 26A-26H receives results from each of the three semaphore groups 18, 20, and 22. Both reservation semaphore and heartbeat semaphore information is contained in the results.
  • the three copies of the results are inspected by voting circuits 28A-28H and an appropriate channel status word is transmitted by the output ports 26A-26H to the originating computer 10. If the command received from the input ports 24A-24H had been in error, the voting circuits 28A-28H detect the error and prevent corruption of any data on the shared peripherals 12.
  • Figure 4 illustrates the format of the channel status word 36, which consists of two 16-bit parcels followed by a channel disconnect pulse.
  • the parcels of the channel status word 36 are identified as: parcel 0 (36A), and parcel 1 (36B).
  • the codes of the channel status word 36 in Figure 4 are defined as shown in Table 4, and the format is further described below.
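  • Table 4, as far as it can be recovered (the code column was lost in extraction; the codes shown are taken from the bit descriptions elsewhere in this list):
      CODE     MEANING
      T0       test condition
      S0       semaphore state after function
      I0-I2    port number of last reservation
      Hn       heartbeat column bit 2^n
      A0       any error
      C2       group C compare error
      C1       group B compare error
      C0       group A compare error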
  • the uppermost bit (T0) of the channel status word 36 is used only for a test and set operation. It reflects the state of the selected reservation semaphore at the time the function code is received by the semaphore section. Bit T0 is set to the value 1 if the reservation semaphore is initially set. Otherwise, a value of 0 is returned in this position if the reservation semaphore is initially clear, and the set operation has been performed.
  • Bit S0 reflects the condition of the selected reservation semaphore at the start of the current command.
  • Bits I0-I2 indicate the number of the input port 24A-24H (having a value of 0-7) which most recently changed the state of the selected reservation semaphore.
  • the heartbeat status byte provides each computer 10 with information about the availability of the other computers 10 connected to the semaphore box 16.
  • the bit position (H0-H7) of the heartbeat status corresponds to the input port 24A-24H (0-7) which controls the setting of the particular heartbeat semaphore.
  • Bit A0 is 1 if an error is detected by the input ports 24A-24H or output ports 26A-26H.
  • Bit A0 is a summation of bits C2-C0, P5, P4, p3-p0, and P3-P0 further described below.
  • Bits C0-C2 are set if there is an error in the execution results. The comparison is performed on 13 bits: the selected 4-bit semaphore, a semaphore test flag, and the 8-bit heartbeat semaphore column.
  • a semaphore compare error could be caused by any of several conditions: a hardware malfunction; a command error; residue from a previous command error; or a semaphore group 18, 20, or 22 whose contents have not been fully restored following a power loss or maintenance action. If a single semaphore group 18, 20, or 22 is in error, its results are ignored by the voting circuits 28A-28H. If two semaphore groups 18, 20, or 22 fail, all comparisons will fail and the results of group 20 are used (although there is no assurance that the results of group 20 are correct).
  • Bit C0 is set if information from semaphore group 18 fails to compare with either group 20 or 22.
  • Bit C1 is set if information from semaphore group 20 fails to compare with either group 18 or 22.
  • Bit C2 is set if information from semaphore group 22 fails to compare with either group 18 or 20.
  • Bits P5-P0 indicate an error was detected during receipt of the command by the input ports 24A-24H. If P5-P0 are set, the command is aborted and parcel 0 (36A) of the channel status word 36 is invalid.
  • Bit P5 is set if the function codes do not match in all four parcels of the channel command word 32. This could be the result of a programming error or a data error in the channel command word 32. The requested operation is aborted and the command is not sent to the semaphore groups 18, 20, and 22. In the channel status word 36, the P5 bit (command compare error) is then set and parcel 0 (36A) of the channel status word 36 is invalid.
  • Bit P4 indicates that a channel protocol error was detected by the input ports 24A-24H or output ports 26A-26H.
  • Normal protocol is defined as four ready pulses each with its accompanying parcel of data followed by a disconnect pulse. A resume pulse must be returned before a subsequent ready pulse is received. If more than four ready pulses are received before a disconnect pulse, then a channel protocol error has occurred. If a channel protocol error is detected, then all command data is assumed to have been corrupted and is ignored. The state of the heartbeat semaphores is not changed and information in parcel 0 (36A) of the channel status word 36 is invalid. Bit P4 in the channel status word 36 is set. However, no channel status word 36 is transmitted until a disconnect pulse is received. This ensures that any extra command parcels are flushed.
  • parity bits p3-p0 and P3-P0 are set in the event of a data error.
  • Four parity bits accompany each parcel of the channel command word 32. Bits P3-P0 perform a parity check using true logic levels, while bits p3-p0 perform a parity check using false logic levels. Thus, independent redundant checks are performed. Additionally, the P5 bit (command error) is set if a data bit rather than a parity bit caused the parity error.
  • Bit TM is set to 1 when the output ports 26A-26H have been placed in test mode.
  • the semaphore box 16 was designed for fault tolerance. For example, the semaphore box 16 may remain online and operational during repairs without impacting the integrity of the semaphores. Further, even if two of the three semaphore groups 18, 20, and 22 fail, the semaphore box 16 continues operating using the remaining group. As mentioned above, if no group compares correctly with any other, group 20 is used.
  • All logic modules are supplied with a common clock from a master clock module and operate synchronously with one another.
  • the master clock module is duplicated on a second module for backup purposes, but only one clock module may be powered on at any given time.
  • Each logic module is provided with two clock inputs, one for each clock module. Selection of the active master clock module is by means of a manual switch which controls power to the clock module and provides the logical clock enable signals. During the process of switching from one clock module to another, however, the system clock is not valid and the system requires re-initialization.
  • the semaphore box 16 resides in a standalone cabinet with its own cooling and internal power. Cooling is accomplished with forced room air. Multiple fans provide redundancy so that if a single fan malfunctions the equipment still remains operational.
  • Power is provided by four sets of identical power supplies to enhance fault tolerance. Each set of supplies is sized so that it is able to supply all the necessary power needs independently of the other. Power load shifting from the loss of one supply is automatic.
  • Each input port 24A-24H, output port 26A-26H, semaphore group 18, 20, and 22, and the master clock are tied to a common power bus, so that they may be individually disconnected from the power bus. Thus, no more than one module is affected by the loss of a single power supply breaker. Further, if power is removed from a single module, all other modules remain operational. Except for the master clock module, the process of applying or removing power to a single module does not affect the operation of the other modules. In addition, if a single module is removed from the system, all other modules remain operational. Thus, the process of inserting or removing modules does not affect the operation of the other modules.
  • a fault tolerant network which includes a semaphore box 16 for controlling access to shared peripherals 12.
  • the semaphore box 16 is comprised of two major sections: an I/O section and a semaphore section.
  • the semaphore section contains reservation semaphores and heartbeat semaphores.
  • the reservation semaphores are used to reserve a particular peripheral 12 for a particular computer 10; the heartbeat semaphores prevent reservation semaphores from being set indefinitely in the event communication with a particular computer 10 is lost.
  • the heartbeat semaphores are arranged in an array, wherein each time a command is received, a row of heartbeat semaphores is set.
  • When the command completes, a column of heartbeat semaphores is returned to the requesting computer 10 and then cleared.
  • If the heartbeat semaphore in a particular position of the column is set, then the associated computer 10 has accessed the semaphore box 16 sometime between the current access of the requesting computer 10 and its prior access.
  • Conversely, if the particular heartbeat semaphore is cleared, then the associated computer 10 has not accessed the semaphore box 16 during the period.
  • the requesting computer 10 should conclude that the associated computer 10 has lost communication with the semaphore box 16 if the particular heartbeat semaphore for the associated computer 10 remains cleared for some number of consecutive accesses or some predetermined interval.
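The port-conflict rule described in the list above (lower-numbered input ports have priority, and held requests are not queued) can be sketched as follows. This is a minimal Python illustration and not part of the patent text; the function name `arbitrate` and the use of a set of pending port numbers are assumptions made for the example.

```python
def arbitrate(pending):
    """Pick the next request to serve from a set of pending port numbers.

    Hypothetical helper: `pending` is a mutable set of input port numbers
    (0-7) whose commands are waiting for the semaphore section.
    """
    if not pending:
        return None
    # No queue is kept: the lowest-numbered pending port always wins,
    # even if a higher-numbered request has been held for longer.
    winner = min(pending)
    pending.discard(winner)
    return winner


pending = {5, 2}            # ports 2 and 5 conflict in the same clock period
first = arbitrate(pending)  # port 2 is served, port 5 is held
pending.add(1)              # port 1 arrives while port 5 is still held
second = arbitrate(pending) # port 1 is honored before port 5
third = arbitrate(pending)  # port 5 is finally served
```

One consequence of this rule, under the assumptions above, is that a steady stream of requests from low-numbered ports could delay a high-numbered port, which is presumably why the text notes that a queued scheme could be substituted.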

Abstract

A fault tolerant network for a plurality of computers includes a system for controlling access to shared peripherals. Access to the shared peripherals is coordinated among the computers by means of communication through a semaphore box. Each computer connects to the semaphore box via a channel. The semaphore box is comprised of two major sections: a semaphore section and an I/O section. The semaphore section contains two sets of semaphores: a first set comprising reservation semaphores for the shared peripherals; and a second set comprising heartbeat semaphores for the sharing computers. The first set is used to reserve a particular peripheral for a particular computer and indicate the source of the reservation; the second set provides a "heartbeat" to prevent reservation semaphores from being set indefinitely in the event communication with a particular computer is lost.

Description

FAULT TOLERANT NETWORKING ARCHITECTURE
BACKGROUND OF THE INVENTION
1. Field Of The Invention.
This invention relates generally to computer networks, and in particular to a fault tolerant network having a semaphore box for controlling access to shared peripherals by a plurality of computers.
2. Description of Related Art.
Multiprocessor systems typically include some method of providing interprocessor communication. For example, interprocessor communication through a shared main memory is typically referred to as a "loosely coupled" computer system. Interprocessor communication through shared registers is typically referred to as a "tightly coupled" computer system. Prior patents of the Assignee of the present invention, Cray Research, Inc., disclose various forms of interprocessor communication.
One such prior patent is U.S. Pat. No. 4,636,942, issued January 13, 1987, to Chen et al., which patent is incorporated herein by reference. This patent discloses a computer vector multiprocessing control wherein a pair of processors are provided and each are connected to a central memory through a plurality of memory reference ports. Processors are further connected to a plurality of shared registers, including registers for holding scalar and address information, and registers for holding information to be used in coordinating the transfer of information through the shared registers.
Another prior patent is U.S. Pat. No. 4,661,900, issued April 28, 1987, to Chen et al., which patent is incorporated herein by reference. This patent discloses a flexible chaining method and apparatus wherein a pair of processors are connected to a central memory through a plurality of memory reference ports. The processors are further connected to a plurality of shared registers that may be directly addressed by either processor, and which hold scalar and address information in registers for holding information to be used in coordinating the transfer of information through the shared registers.
Still another prior patent is U.S. Pat. No. 4,754,398, issued June 28, 1988, to Pribnow, which patent is incorporated herein by reference. This patent discloses an interprocessor communication system for a multiprocessor system that includes a plurality of clusters having a plurality of semaphore registers and information registers. Whatever the merits of these prior patents for controlling interprocessor communication, they do not achieve the benefits of the present invention.
SUMMARY OF THE INVENTION
The present invention discloses a fault tolerant network for computers that includes a semaphore box for controlling access to shared peripherals. The semaphore box is comprised of two major sections: an I/O section and a semaphore section. The semaphore section contains two sets of semaphores: a first set comprising reservation semaphores for the shared peripherals and a second set comprising heartbeat semaphores for the sharing computers. The first set is used to reserve a particular peripheral for a particular computer; the second set provides a "heartbeat" to prevent reservation semaphores from being set indefinitely in the event communication with a particular computer is lost.
The heartbeat semaphores are arranged in an array. Each time a semaphore command is received, a row of heartbeat semaphores is set. When the command completes, a column of heartbeat semaphores is returned to the requesting computer and then cleared. If the heartbeat semaphore in a particular position of the column is set, then the associated computer has accessed the semaphore box sometime between the current access of the requesting computer and its prior access. Conversely, if the particular heartbeat semaphore is cleared, then the associated computer has not accessed the semaphore box during the period. If the particular heartbeat semaphore for the associated computer remains cleared for some number of consecutive accesses, the requesting computer should conclude that the associated computer has lost communication with the semaphore box.
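The heartbeat mechanism summarized above can be illustrated with a short Python sketch. It is not part of the patent text: the class names `HeartbeatArray` and `LivenessMonitor`, the method names, and the `max_misses` threshold are all assumptions chosen for the example, and the hardware's triple redundancy is ignored.

```python
class HeartbeatArray:
    """N x N heartbeat semaphores; cells[row][col] is True when set."""

    def __init__(self, ports=8):
        self.ports = ports
        self.cells = [[False] * ports for _ in range(ports)]

    def command(self, port):
        """Handle one semaphore command from `port` and return its column."""
        # Receiving a command sets every semaphore in the sender's row.
        self.cells[port] = [True] * self.ports
        # When the command completes, the sender's column is returned
        # to the requesting computer and then cleared.
        column = [self.cells[row][port] for row in range(self.ports)]
        for row in range(self.ports):
            self.cells[row][port] = False
        return column


class LivenessMonitor:
    """Counts consecutive accesses in which a peer's heartbeat bit was clear."""

    def __init__(self, ports=8, max_misses=3):
        self.misses = [0] * ports
        self.max_misses = max_misses  # assumed, programmable threshold

    def update(self, column):
        """Record one returned column; return peers presumed disconnected."""
        for peer, bit in enumerate(column):
            self.misses[peer] = 0 if bit else self.misses[peer] + 1
        return [p for p, n in enumerate(self.misses) if n >= self.max_misses]


box = HeartbeatArray(ports=4)
monitor = LivenessMonitor(ports=4, max_misses=2)
box.command(1)                            # computer 1 touches the box
lost = monitor.update(box.command(0))     # computer 0 sees 1 alive
lost = monitor.update(box.command(0))     # computers 2 and 3 now missed twice
```

In this sketch a computer always sees its own bit set, because its row is set before its column is read; only the bits of the other computers carry liveness information.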
BRIEF DESCRIPTION OF THE DRAWINGS
Referring now to the drawings, in which like reference numbers represent like elements throughout the several views:
Figure 1 is a block diagram of a system configuration according to the preferred embodiment;
Figure 2 is a block diagram of the semaphore box in the preferred embodiment;
Figure 3 illustrates a channel command word in the preferred embodiment; and
Figure 4 illustrates the format of the channel status word in the preferred embodiment.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In the following description of the preferred embodiment, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration a specific embodiment in which the invention may be practiced. It is understood that other embodiments may be used and structural changes may be made without departing from the scope of the present invention.
Glossary
In the following description, the term "semaphore" is often used. In the preferred embodiment, this term refers to a memory cell that is shared by a plurality of processors to provide a form of communication between the processors by indicating when significant events have taken place.
System Configuration
Figure 1 is a block diagram of a system configuration according to the preferred embodiment of the present invention. The present invention discloses a fault tolerant network for computers 10 that includes a semaphore box 16 for controlling access to shared peripherals 12. The computers 10 may be YMP/16 computers of the type manufactured by Cray Research, Inc., the Assignee of the present invention. The peripherals 12 may be Solid-state Storage Devices (SSDs) of the type described in either U.S. Pat. No. 4,630,230, issued December 16, 1986, to Sundet (which patent is incorporated herein by reference), or in U.S. Pat. No. 4,951,246, issued August 21, 1990, to Fromm et al. (which patent is incorporated herein by reference). Those skilled in the art, however, will recognize that other computers 10 and other peripherals 12 could be substituted therefor.
Each peripheral 12 connects to the computers 10 via a high speed channel 15. Each computer 10 connects to a semaphore box 16 via a low speed channel 14. Access to the shared peripherals 12 is coordinated among the computers 10 by means of communication through the semaphore box 16.
Semaphore Box
A block diagram of the semaphore box 16 is shown in Figure 2. The semaphore box 16 is comprised of two major sections: a semaphore section and an I/O section.
The semaphore section contains two sets of semaphores: a first set comprising reservation semaphores for the shared peripherals 12; and a second set comprising heartbeat semaphores for the sharing computers 10. The first set is used to reserve a particular peripheral 12 for a particular computer 10; the second set provides a "heartbeat" to prevent reservation semaphores from being set indefinitely in the event communication with a particular computer 10 is lost. In the preferred embodiment, there are three identical copies, or groups 18, 20, and 22, of the two sets arranged in a triple module redundant configuration. The three groups are identified as: semaphore group A (18), semaphore group B (20), and semaphore group C (22).
The reservation semaphores can be individually set, cleared, or tested by commands received by the semaphore box 16 from the computers 10. In the preferred embodiment, each reservation semaphore is four bits long. One bit contains reservation information associated with one of the peripherals 12; the other three bits are a port identifier field containing a number of the input port 24A-24H which most recently set the reservation semaphore. The port identifier field thus identifies a particular computer 10 connected to the associated input port 24A-24H. The port identifier field is updated each time the reservation semaphore is set.
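A reservation semaphore of this kind can be modelled in a few lines of Python. The sketch below is illustrative only and not part of the patent text: the class and method names are hypothetical, the packed bit layout is an assumption, and so is the choice to leave the port identifier unchanged when a test-and-set finds the semaphore already set.

```python
class ReservationSemaphore:
    """One 4-bit reservation semaphore: a reserved bit plus a 3-bit port id."""

    def __init__(self):
        self.reserved = False
        self.port_id = 0  # number (0-7) of the input port that last set it

    def set(self, port):
        self.reserved = True
        self.port_id = port & 0b111  # updated each time the semaphore is set

    def clear(self):
        self.reserved = False  # the port id still names the last setter

    def test(self):
        return self.reserved, self.port_id

    def test_and_set(self, port):
        """Return True if already reserved; otherwise reserve for `port`."""
        was_set = self.reserved
        if not was_set:
            self.set(port)
        return was_set

    def pack(self):
        # Assumed layout: reserved flag in bit 3, port id in bits 2-0.
        return (int(self.reserved) << 3) | self.port_id


sem = ReservationSemaphore()
got_it = not sem.test_and_set(3)       # port 3 reserves the peripheral
blocked = sem.test_and_set(5)          # port 5 finds it already reserved
owner = sem.test()[1]                  # still port 3
```

The port identifier is what lets another computer work out which peer holds a reservation, and therefore whose heartbeat to check before treating the reservation as stale.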
Table 1 is an illustration of an array of heartbeat semaphores, wherein each heartbeat semaphore is identified by a concatenated row-column value.
[Table 1: array of heartbeat semaphores, each identified by its concatenated row-column value; image not reproduced]
The array as shown in Table 1 measures eight by eight, but those skilled in the art will recognize that any size array may be used. The array of heartbeat semaphores as shown in Table 1 is used to prevent the reservation semaphores from remaining in a "set" condition after communication with a computer 10 has been lost. A reservation semaphore that remains "set" prevents access to the associated peripheral 12 by the other computers 10.
Each time a semaphore command is received by the semaphore box 16, a row of heartbeat semaphores is set. Preferably, the row number is the same as the number of the input port 24A-24H which received the command (i.e., row N is set by any command from Port N), although other means of associating rows with computers 10 could be used. When the command completes, a column of heartbeat semaphores is returned to the requesting computer 10 by the output ports 26A-26H and then cleared to all zeros. The column becomes part of the channel status word. Like the row number, the column number is preferably the same as the number of the input port 24A-24H which received the command (i.e., column N is returned in the status word for Port N), although other means of associating columns with computers 10 could be used.
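The row-set and column-return behavior described above can be sketched in Python as follows. The class and method names are hypothetical, and the sketch assumes the preferred mapping in which row N and column N both belong to Port N; it models only the heartbeat array, not the reservation semaphores or the redundant groups.

```python
from typing import List


class HeartbeatArray:
    """Hypothetical model of the heartbeat array, indexed as bits[row][column]."""

    def __init__(self, ports: int = 8) -> None:
        self.ports = ports
        self.bits = [[0] * ports for _ in range(ports)]

    def command(self, port: int) -> List[int]:
        """Model one semaphore command arriving on input port `port`."""
        # Receiving the command sets every heartbeat semaphore in row `port`.
        for col in range(self.ports):
            self.bits[port][col] = 1
        # On completion, column `port` is returned and then cleared to zeros.
        column = [self.bits[row][port] for row in range(self.ports)]
        for row in range(self.ports):
            self.bits[row][port] = 0
        return column
```

In this model, entry B of the column returned to Port A is 1 only if Port B issued a command since Port A's previous command. Entry A itself is always 1, because the row is set before the column is read.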
The heartbeat semaphores provide a means of determining whether the computers 10 connected to the semaphore box 16 are still active. This is best illustrated by the following example. Assume computer A wants to determine the running status of computer B. Each time computer B sends a command to the semaphore box 16, all heartbeat semaphores in row B are set. Each time computer A sends a command to the semaphore box 16, the heartbeat semaphores in column A are returned in the channel status word and then cleared. Since the rows and columns intersect, the Bth heartbeat semaphore in column A indicates whether computer B has accessed the semaphore box 16 in the period since the previous computer A access. If the heartbeat semaphore is set, computer B has accessed the semaphore box 16 sometime between the current access of computer A and the prior access of computer A. Conversely, if the heartbeat semaphore is clear, computer B has not accessed the semaphore box 16 during the period. If the heartbeat semaphore for computer B is clear for some number of consecutive accesses or some predetermined interval, then computer A concludes that computer B has lost communication with the semaphore box 16 and takes appropriate action. Those skilled in the art will recognize that the number of consecutive accesses or the predetermined interval is programmable. Thus, any computer 10 connected to the semaphore box 16 must periodically access the semaphore box 16 to keep its "heartbeat" alive.
Referring again to Figure 2, the I/O section of the semaphore box 16 consists of input ports, voting circuits, and output ports. Each attached computer 10 communicates with the semaphore box 16 across a low speed channel 14 which attaches to the semaphore box 16 at an input port 24A-24H and an output port 26A-26H. Preferably, the computer 10 is attached to output ports 26A-26H identically numbered or otherwise associated with input ports 24A-24H. Each input port 24A-24H and output port 26A-26H of the semaphore box 16 is logically independent from the others. Operations on the semaphore box 16 are accomplished by transmitting commands to the input ports 24A-24H. The input ports 24A-24H then transmit the commands to all three of the semaphore groups 18, 20, and 22. (Note that in Figure 2 only a portion of the connections between the input ports 24A-24H and the semaphore groups 18, 20, and 22 are illustrated.) The semaphore groups 18, 20, and 22 transmit the results of the operation to voting circuits 28A-28H. The three copies of the results are inspected by voting circuits 28A-28H.
The voting circuits 28A-28H detect an error if there is a difference in the execution results. If a single semaphore group 18, 20, or 22 is in error, its results are ignored by the voting circuits 28A-28H. If two semaphore groups 18, 20, or 22 fail, all comparisons will fail and the results of group 20 are used (although there is no assurance that the results of group 20 are correct). The output ports 26A-26H then transmit the results to the computer 10.
A port conflict occurs when more than one command is received by the semaphore section from the I/O section in the same clock period. Conflicts are resolved on a priority basis in the preferred embodiment, although other methods of resolving conflicts could be substituted therefor. In the preferred embodiment, lower numbered input ports 24A-24H have priority over higher numbered input ports 24A-24H. Thus, requests from higher numbered input ports 24A-24H are held until the semaphore response for the lower numbered input ports 24A-24H is returned to the I/O section. The requests are not queued, although alternative embodiments could easily implement such a scheme. Instead, in the preferred embodiment, if a lower numbered request is received while a higher numbered request is being held, the lower numbered request will again be honored first.
Operation
Figure 3 illustrates a channel command word 32 in the preferred embodiment. The codes used in Figure 3 are defined as shown in Table 2.
Table 2
CODE    MEANING
Fn      function code bit 2^n
Sn      semaphore select code bit 2^n
xx      unused bit
The channel command word 32 consists of a single 64-bit word. It is transmitted over the low speed channel 14 as four 16-bit parcels followed by a channel disconnect pulse. The parcels of the channel command word 32 are identified as: parcel 0 (32A), parcel 1 (32B), parcel 2 (32C), and parcel 3 (32D). After an input port 24A-24H receives the channel command word 32, it performs error checks to detect possible parity errors, command compare errors, or channel protocol errors. Detection of a channel error aborts the operation.
Each parcel of the channel command word 32 contains the same information, but the individual bit assignments differ in each parcel. Rearranging the bit assignments prevents a defective wire, connector pin, or associated circuits from causing a catastrophic error. Eight bits of each parcel are used to hold a command. The remaining eight bits are unused, except for parity, and may contain any value. Of the eight command bits, two bits contain a semaphore function code and six bits contain a semaphore select code. The four possible function codes are described as shown in Table 3.
Table 3 (function code values appear only in the original figure and are not reproduced here)
OPERATION
Test semaphore.
Set semaphore unconditionally.
Clear semaphore unconditionally.
Test and set semaphore if clear. No operation if semaphore is set.
The semaphore select code determines which of the reservation semaphores are affected by the semaphore function code.
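The idea of sending the same command in four parcels with different bit assignments can be illustrated with a short Python sketch. The actual bit assignments are defined in Figure 3 and are not reproduced in the text, so the arrangement used here (the 8-bit command rotated by two bit positions per parcel, with the function code in the upper two bits) is purely a hypothetical stand-in, as are all function names. Parity bits and unused bits are omitted.

```python
from typing import List, Optional, Tuple


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bit positions."""
    n %= 8
    return ((x << n) | (x >> (8 - n))) & 0xFF


def encode_command(function: int, select: int) -> List[int]:
    """Build four command fields that hold the same command in different bit orders."""
    cmd = ((function & 0b11) << 6) | (select & 0b111111)
    return [rotl8(cmd, 2 * k) for k in range(4)]  # hypothetical arrangement


def decode_parcel(parcel: int, k: int) -> Tuple[int, int]:
    """Undo the rearrangement of parcel k and split out (function, select)."""
    cmd = rotl8(parcel, 8 - 2 * k)  # rotating left by 8 - n undoes a left rotate by n
    return cmd >> 6, cmd & 0b111111


def check_command(parcels: List[int]) -> Optional[Tuple[int, int]]:
    """Return the command if all four parcels agree, else None (compare error)."""
    decoded = [decode_parcel(p, k) for k, p in enumerate(parcels)]
    return decoded[0] if all(d == decoded[0] for d in decoded) else None
```

Because each wire carries a different command bit in each parcel, a wire stuck at one level usually corrupts different command bits in different parcels, so the decoded copies disagree and the command is rejected rather than executed incorrectly. For instance, forcing the lowest wire high on every parcel of `encode_command(1, 0)` makes `check_command` return `None`.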
Each parcel is sent to a different semaphore group for execution: parcel 0 (32A) is sent to semaphore group 18; parcel 1 (32B) is sent to semaphore group 20; parcel 2 (32C) is sent to semaphore group 22; parcel 3 (32D) is discarded. Upon completion, the appropriate output port 26A-26H receives results from each of the three semaphore groups 18, 20, and 22. Both reservation semaphore and heartbeat semaphore information is contained in the results. The three copies of the results are inspected by voting circuits 28A-28H and an appropriate channel status word is transmitted by the output ports 26A-26H to the originating computer 10. If the command received from the input ports 24A-24H had been in error, the voting circuits 28A-28H detect the error and prevent corruption of any data on the shared peripherals 12.
Figure 4 illustrates the format of the channel status word 36, which consists of two 16-bit parcels followed by a channel disconnect pulse. The parcels of the channel status word 36 are identified as: parcel 0 (36A), and parcel 1 (36B). The codes of the channel status word 36 in Figure 4 are defined as shown in Table 4, and the format is further described below.
Table 4
CODE    MEANING
T0      test condition.
S0      semaphore state after function.
I0-I2   port number of last reservation.
Hn      heartbeat column bit 2^n.
A0      any error.
C2      group C compare error.
C1      group B compare error.
C0      group A compare error.
P5      command compare error.
P4      channel error.
pn      channel parity error, group n.
Pn      channel parity error, group n.
TM      test mode status bit.
Reservation Semaphore Status (T-, S-, I-)
The uppermost bit (T0) of the channel status word 36 is used only for a test and set operation. It reflects the state of the selected reservation semaphore at the time the function code is received by the semaphore section. Bit T0 is set to 1 if the reservation semaphore is initially set. Otherwise, bit T0 is 0, indicating that the reservation semaphore was initially clear and the set operation has been performed.
Bit S0 reflects the condition of the selected reservation semaphore at the start of the current command. Bits I0-I2 indicate the number of the input port 24A-24H (having a value of 0-7) which most recently changed the state of the selected reservation semaphore.
Heartbeat Status (H-)
The heartbeat status byte provides each computer 10 with information about the availability of the other computers 10 connected to the semaphore box 16. The bit position (H0-H7) of the heartbeat status corresponds to the input port 24A-24H (0-7) which controls the setting of the particular heartbeat semaphore.
Port Status (A-, C-, P-, TM)
Bit A0 is 1 if an error is detected by the input ports 24A-24H or output ports 26A-26H. Bit A0 is a summation of bits C2-C0, P5, P4, p3-p0, and P3-P0 further described below.
Bits C0-C2 (semaphore compare error) are set if there is an error in the execution results. The comparison is performed on 13 bits: the selected 4-bit semaphore, a semaphore test flag, and the 8-bit heartbeat semaphore column. A semaphore compare error could be caused by any of several conditions: a hardware malfunction; a command error; residue from a previous command error; or a semaphore group 18, 20, or 22 whose contents have not been fully restored following a power loss or maintenance action. If a single semaphore group 18, 20, or 22 is in error, its results are ignored by the voting circuits 28A-28H. If two semaphore groups 18, 20, or 22 fail, all comparisons will fail and the results of group 20 are used (although there is no assurance that the results of group 20 are correct).
Bit C0 is set if information from semaphore group 18 fails to compare with either group 20 or 22. Similarly, bit C1 is set if information from semaphore group 20 fails to compare with either group 18 or 22. Bit C2 is set if information from semaphore group 22 fails to compare with either group 18 or 20.
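The voting behavior and the compare error bits can be sketched in Python as shown below. This is an illustrative model, not the circuit itself: each group's 13-bit result is represented as a single integer, the function name is hypothetical, and the sketch assumes that a compare error bit is raised only when a group matches neither of the other two groups (one reading of "fails to compare with either").

```python
from typing import Tuple


def vote(a: int, b: int, c: int) -> Tuple[int, int, int, int]:
    """Return (result, C0, C1, C2) for the results of groups A (18), B (20), C (22)."""
    ab, ac, bc = a == b, a == c, b == c
    # Assumed reading: a group is flagged when it agrees with neither other group.
    c0 = int(not ab and not ac)
    c1 = int(not ab and not bc)
    c2 = int(not ac and not bc)
    if ab or bc:
        result = b  # group B agrees with at least one other group
    elif ac:
        result = a  # groups A and C agree; group B is the odd one out
    else:
        result = b  # no two groups agree: fall back to group B (20)
    return result, c0, c1, c2
```

With a single faulty group, only that group's bit is raised and the agreeing pair supplies the result. When all three results differ, all three bits are raised and the result of group B is returned with no assurance that it is correct.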
Bits P5-P0 indicate an error was detected during receipt of the command by the input ports 24A-24H. If any of bits P5-P0 is set, the command is aborted and parcel 0 (36A) of the channel status word 36 is invalid.
Bit P5 is set if the function codes do not match in all four parcels of the channel command word 32. This could be the result of a programming error or a data error in the channel command word 32. The requested operation is aborted and the command is not sent to the semaphore groups 18, 20, and 22. In the channel status word 36, the P5 bit (command compare error) is then set and parcel 0 (36A) of the channel status word 36 is invalid.
Bit P4 indicates that a channel protocol error was detected by the input ports 24A-24H or output ports 26A-26H. Normal protocol is defined as four ready pulses each with its accompanying parcel of data followed by a disconnect pulse. A resume pulse must be returned before a subsequent ready pulse is received. If more than four ready pulses are received before a disconnect pulse, then a channel protocol error has occurred. If a channel protocol error is detected, then all command data is assumed to have been corrupted and is ignored. The state of the heartbeat semaphores is not changed and information in parcel 0 (36A) of the channel status word 36 is invalid. Bit P4 in the channel status word 36 is set. However, no channel status word 36 is transmitted until a disconnect pulse is received. This ensures that any extra command parcels are flushed.
Additionally, the appropriate parity bits p3-p0 and P3-P0 are set in the event of a data error. Four parity bits accompany each parcel of the channel command word 32. Bits P3-P0 perform a parity check using true logic levels, while bits p3-p0 perform a parity check using false logic levels. Thus, independent redundant checks are performed. Additionally, the P5 bit (command error) is set if a data bit rather than a parity bit caused the parity error.
Bit TM is set to 1 when the output ports 26A-26H have been placed in test mode.
Other Features
The semaphore box 16 was designed for fault tolerance. For example, the semaphore box 16 may remain online and operational during repairs without impacting the integrity of the semaphores. Further, even if two of the three semaphore groups 18, 20, and 22 fail, the semaphore box 16 continues operating using the remaining group. As mentioned above, if no group compares correctly with any other, group 20 is used.
All logic modules are supplied with a common clock from a master clock module and operate synchronously with one another. The master clock module is duplicated on a second module for backup purposes, but only one clock module may be powered on at any given time. Each logic module is provided with two clock inputs, one for each clock module. Selection of the active master clock module is by means of a manual switch which controls power to the clock module and provides the logical clock enable signals. During the process of switching from one clock module to another, however, the system clock is not valid and the system requires re-initialization.
The semaphore box 16 resides in a standalone cabinet with its own cooling and internal power. Cooling is accomplished with forced room air. Multiple fans provide redundancy so that if a single fan malfunctions the equipment still remains operational.
Power is provided by four sets of identical power supplies to enhance fault tolerance. Each set of supplies is sized so that it is able to supply all the necessary power needs independently of the other. Power load shifting from the loss of one supply is automatic. Each input port 24A-24H, output port 26A-26H, semaphore group 18, 20, and 22, and the master clock are tied to a common power bus, so that they may be individually disconnected from the power bus. Thus, no more than one module is affected by the loss of a single power supply breaker. Further, if power is removed from a single module, all other modules remain operational. Except for the master clock module, the process of applying or removing power to a single module does not affect the operation of the other modules. In addition, if a single module is removed from the system, all other modules remain operational. Thus, the process of inserting or removing modules does not affect the operation of the other modules.
Summary
In summary, a fault tolerant network has been described which includes a semaphore box 16 for controlling access to shared peripherals 12. The semaphore box 16 is comprised of two major sections: an I/O section and a semaphore section. The semaphore section contains reservation semaphores and heartbeat semaphores. The reservation semaphores are used to reserve a particular peripheral 12 for a particular computer 10; the heartbeat semaphores prevent reservation semaphores from being set indefinitely in the event communication with a particular computer 10 is lost. The heartbeat semaphores are arranged in an array, wherein each time a command is received, a row of heartbeat semaphores is set. Further, when the command completes, a column of heartbeat semaphores is returned to the requesting computer 10 and then cleared. Thus, if the heartbeat semaphore in a particular position of the column is set, then the associated computer 10 has accessed the semaphore box 16 sometime between the current access of the requesting computer 10 and its prior access. Conversely, if the particular heartbeat semaphore is cleared, then the associated computer 10 has not accessed the semaphore box 16 during the period. The requesting computer 10 should conclude that the associated computer 10 has lost communication with the semaphore box 16 if the particular heartbeat semaphore for the associated computer 10 remains cleared for some number of consecutive accesses or some predetermined interval.
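The conclusion drawn by a requesting computer 10 from successive heartbeat columns can be sketched in Python as follows. The class name, the method name, and the default threshold of three consecutive missed accesses are assumptions chosen for illustration; the description states only that the number of consecutive accesses or the interval is programmable.

```python
from typing import List


class HeartbeatMonitor:
    """Hypothetical per-computer tracker of missed heartbeats."""

    def __init__(self, ports: int = 8, max_missed: int = 3) -> None:
        self.max_missed = max_missed  # programmable threshold (assumed default)
        self.missed = [0] * ports     # consecutive clear heartbeats per port

    def update(self, column: List[int]) -> List[int]:
        """Feed one returned heartbeat column; return the ports presumed lost."""
        for port, bit in enumerate(column):
            # A set bit resets the count; a clear bit extends the run of misses.
            self.missed[port] = 0 if bit else self.missed[port] + 1
        return [p for p, n in enumerate(self.missed) if n >= self.max_missed]
```

A computer would call `update` with the heartbeat column taken from each channel status word it receives. Ports with no computer attached never set a heartbeat, so a practical monitor would also need to know which ports are populated.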
Conclusion
The foregoing description of the preferred embodiment of the present invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

Claims

WHAT IS CLAIMED IS:
1. An apparatus for controlling access to at least one shared peripheral by a plurality of computers, comprising: at least one reservation semaphore for reserving the peripheral for an accessing computer; and at least one heartbeat semaphore operatively connected to the reservation semaphore for indicating whether the accessing computer has communicated with the apparatus during some time period, thereby preventing the reservation semaphore from being reserved indefinitely when communications with the accessing computer are lost.
2. The apparatus of claim 1, further comprising means for arranging the reservation semaphores and the heartbeat semaphores in a redundant configuration to enhance fault tolerance.
3. The apparatus of claim 2, further comprising voting circuit means for inspecting an outcome of an operation performed against the redundant configuration and detecting whether an error has occurred.
4. The apparatus of claim 1, wherein the reservation semaphore comprises: first means for storing a first indicator of whether the peripheral is reserved; and second means for storing a second indicator of which computer reserved the peripheral.
5. The apparatus of claim 4, wherein the second means further comprises means for updating the second indicator each time the reservation semaphore changes state.
6. The apparatus of claim 1, further comprising: means for transmitting a first plurality of the heartbeat semaphores to the accessing computer; and means for examining the first plurality of the heartbeat semaphores to determine whether all of the computers connected to the apparatus are still active.
7. The apparatus of claim 1, further comprising: means for arranging the heartbeat semaphores in a two-dimensional array comprising a plurality of intersecting rows and columns; means for setting the heartbeat semaphores in a row of the array associated with the accessing computer whenever the accessing computer communicates with the apparatus; means for transmitting to the accessing computer and then clearing the heartbeat semaphores in a column of the array associated with the accessing computer whenever the accessing computer communicates with the apparatus; and means for determining when a disconnected computer has lost communications with the apparatus by examining the heartbeat semaphore at an intersection of the column associated with the accessing computer and a row associated with the disconnected computer, wherein the heartbeat semaphore at the intersection remains cleared for a predetermined period.
8. A method for controlling access to a shared peripheral by a plurality of computers, comprising: setting a reservation semaphore to reserve the peripheral for an accessing computer; clearing the reservation semaphore to indicate the peripheral is available; manipulating a heartbeat semaphore to indicate whether communication with the accessing computer has occurred, thereby preventing the reservation semaphore from being reserved indefinitely when communications have been lost with the accessing computer.
9. The method of claim 8, further comprising arranging the reservation semaphores and the heartbeat semaphores in a redundant configuration.
10. The method of claim 9, further comprising inspecting an outcome of an operation performed against the redundant configuration to detect whether an error has occurred.
11. The method of claim 8, wherein the setting step comprises: storing a first indicator in the reservation semaphore of whether the peripheral is reserved; and storing a second indicator in the reservation semaphore of whether the accessing computer has reserved the peripheral.
12. The method of claim 11, wherein the second indicator storing step further comprises updating the second indicator each time the reservation semaphore changes state.
13. The method of claim 8, wherein the clearing step comprises clearing the first indicator in the reservation semaphore when the peripheral is not reserved.
14. The method of claim 8, further comprising: transmitting a first plurality of the heartbeat semaphores to the accessing computer; and examining the first plurality of the heartbeat semaphores to determine whether the computers connected to the apparatus are still active.
15. The method of claim 8, further comprising: arranging the heartbeat semaphores in a two-dimensional array comprising a plurality of intersecting rows and columns; setting the heartbeat semaphores in a row of the array associated with the accessing computer; transmitting and then clearing the heartbeat semaphores in a column of the array associated with the accessing computer whenever the accessing computer communicates with the apparatus; and determining when a disconnected computer has lost communication with the apparatus by examining the heartbeat semaphore at an intersection of the column associated with the accessing computer and a row associated with the disconnected computer.
PCT/US1991/006060 1990-09-12 1991-08-23 Fault tolerant networking architecture WO1992004677A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US07/582,507 US5206952A (en) 1990-09-12 1990-09-12 Fault tolerant networking architecture
US582,507 1990-09-12

Publications (1)

Publication Number Publication Date
WO1992004677A1 true WO1992004677A1 (en) 1992-03-19

Family

ID=24329421

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1991/006060 WO1992004677A1 (en) 1990-09-12 1991-08-23 Fault tolerant networking architecture

Country Status (2)

Country Link
US (1) US5206952A (en)
WO (1) WO1992004677A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE2057030A1 (en) * 1970-02-04 1971-08-12 Robotron Veb K Method and circuit arrangement for checking errors in control information honing
EP0035778A2 (en) * 1980-03-10 1981-09-16 The Boeing Company Modular system controller for a transition machine
JPS59113600A (en) * 1982-12-21 1984-06-30 Nec Corp Highly reliable storage circuit device
EP0230029A2 (en) * 1985-12-27 1987-07-29 AT&T Corp. Method and apparatus for fault recovery in a distributed processing system
US4754398A (en) * 1985-06-28 1988-06-28 Cray Research, Inc. System for multiprocessor communication using local and common semaphore and information registers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
I & CS - INDUSTRIAL AND PROCESS CONTROL MAGAZINE. vol. 60, no. 10, October 1987, RADNOR, PENNSYLVANIA US pages 73 - 76; J.A. HUMPHRY: 'Applying fault tolerant system architectures' see the whole document *
PATENT ABSTRACTS OF JAPAN vol. 8, no. 237 (P-310)(1674) 30 October 1984 & JP,A,59 113 600 ( NIPPON DENKI KK ) 30 June 1984 see the whole document *

Also Published As

Publication number Publication date
US5206952A (en) 1993-04-27

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP KR

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA