CA2177850A1

CA2177850A1 - Fault resilient/fault tolerant computing

Info

Publication number: CA2177850A1
Application number: CA002177850A
Authority: CA
Inventors: Thomas Dale Bissett; Richard D. Fiorentino; Robert M. Glorioso; Diane T. Mccauley; James D. Mccollum; Glenn A. Tremblay; Mario Troiani
Original assignee: Individual
Current assignee: Marathon Technologies Corp
Priority date: 1993-12-01
Filing date: 1994-11-15
Publication date: 1995-06-08
Also published as: WO1995015529A1; DE69424565T2; EP0986008B1; EP0731945B1; EP0986008A3; EP0986008A2; EP0986007A2; EP0986007A3; JP3679412B2; DE69435090D1; DE69435090T2; AU1182095A; US5956474A; EP0731945A1; EP0731945A4; AU4286497A; AU4286697A; AU4286597A; US5600784A; AU711419B2

Abstract

A method of synchronizing at least two computing elements (CE1, CE2) that each have clocks that operate asynchronously of the clocks of the other computing elements includes selecting one or more signals, designated as meta time signals, from a set of signals produced by the computing elements (CE1, CE2), monitoring the computing elements (CE1, CE2) to detect the production of a selected signal by one of the computing elements (CE1), waiting for the other computing elements (CE2) to produce a selected signal, transmitting equally valued time updates to each of the computing elements, and updating the clocks of the computing elements (CE1, CE2) based on the time updates. In a second aspect of the invention, fault resilient, or tolerant, computers (200) are produced by designating a first processor as a computing element (204), designating a second processor (202) as a controller, connecting the computing element (204) and the controller (202) to produce a modular pair, and connecting at least two modular pairs to produce a fault resilient or fault tolerant computer (200). Each computing element (202, 204) of the computer (200) performs all instructions in the same number of cycles as the other computing elements (202, 204). The computer systems include one or more controllers (202) and at least two computing elements (204).

Description

~ WO 95/15529 ` 2 1 7 7 8 5 0 PCT/US94~13350 FAULT ~T~TTTT'~TrF~TJT~ Tl~TT~ T ~:u..~u~
Ba~ kuLu~ ~1 of the Invention The invention relate6 to fault resilient and fault tolerant computing methods and al ~alc-Lus.
Fault rP~iliPnt computer systems can rnntinllP to function in the ~Le:S~ e of hardware failures. These systems operate in either an avAilAhility mode or an 10 integrity mode, but not both. A system is "available"
when a hardware failure does not cause unacceptable delays in user access, and a system operating in an avAilAhility mode is configured to remain online, if pn~:~ihlp, when faced with a hal-lwaL~: error. A system has 15 data integrity when a hardware failure causes no data loss or corruption, and a system operating in an lntegrity mode is conf igured to avoid data loss or corruption, even if it must go offline to do so.
Fault tolerant systems stress both avAilAhility 20 and integrity. A fault tolerant system remains available and retains data integrity when faced with a single h~LurlaL~: failure, and, under some CiL~;uh_ La~lCeS~ with multiple hardware failures.
Disaster tolerant systems go one step beyond fault 25 tolerant systems and reguire that loss of a ~ ng site due to a natural or man-made disaster will not il-L~LLu~L system avAilAhility or corrupt or lose data.
Prior approaches to fault tolerance include software checkpoint/restart, triple modular ~ . y, 30 and pair and spare.
chprkrointtrestart systems employ two or more computing Pl~ ~: that operate a~y-,-.llLu--uusly and may execute different applications. Each application perio~ rA 1 1 y stores an image of the state of the 35 computing element on which it is running (a rhPrkroint).

WO 95/15529 ~ 2 ~ 7 7 ~ 5 o pcrlus94ll33so When a fAult in a, _ ;n~ element i6 ~C-t~cte~, the rh~rl~ro; n~ is used to restart the application on another computing element (or on the same computing element once the fault is corrected). To i 1 a 5 rh~ k~oint/restart system, each of the applications and/or the operating system to be run on the system must be modified to perio~;rAl ly store the image of the system. In addition, the system must be capable of "backtracking" (that is, undoing the effects of any 10 operations that OCUULL~::d ~ C~ to a checkpoint that i8 being restarted).
With triple modular r~al ~ 1Al y, three r~ ;n~
~l C run the same application and are operated in cycle ~y ._yule lockstep. All of the computing ~1 15 are crnnPrted to a block of voting logic that -r~=
the outputs (that is, the memory interfaces) of the three ~ ;ng ~1 ' and, if all of the outputs are the same, continues with normal operation. If one of the outputs is different, the voting logic shuts down the 2 0 ; ng element that has ~L uduced the di~rering output .
The voting logic, which is located between the computing c-l Ls and memory, has a si~n;~;r~nt impact on system speed .
Pair and spare systems include two or more pairs 25 of ; n~ ' ~ that run the same application and are u~=L~Led in cycle-by-cycle lockstep. A controller monitors the outputs (that is, the memory interfaces) of each computing element in a pair. If the outputs differ, both computing ~1 Ls in the pair are shut down.
~ - of the Invention ~ccording to the invention, a f ault r~ nt and/or fault tolerant system is O~1-A; n~d through use of at least two ;n~ CEs"~ that operate ~oy.l~llLullUU_ly in real time (that is, from cycle to 35 cycle) and ~yllu1l~ulluu~ly in so-c~lled "meta time. " The .

WO 95/1~529 - ~ 2 1 7 7 8 5 0 rCTiUS94/1335D

CEs are ,y~ Lullized at meta times that occur often enough so that the applications running on the CEs do not diverge, but are allowed to run a yl~ LOlluu51y between the meta times. For example, the CEs could be 5 ~iy-.~llLu..ized once each second and otherwise run a~yll~ u~uu:~ly. Because the CEs are ~e~y~ Lu~ized at each meta time, the CEs are said to be operating in meta time lockstep.
In particular: ' ~ '; - ; , meta times are def ined 10 as the times at which the CEs request I/O operations. In these ~-mhorlir L6, the CEs are y..~ l.Lu..ized after each I/O operation and run a:.y~ Lu~-uu:.ly between I/O
operations . This approach is ;~rPl; c;~hl e to systems in which at least two asy..- l.Lu..uus computing ~l~
15 running kl~nl~ 1 applications always generate I/O
L6uue~s in the same order. This approach can be further limited to L ~ay--~_l.Lu..ization arter only those I/O
requests that modify the proc~cc;n~ environment (that is, write LC~ue~
Meta time iy.. ~l.Lu-.ization according to the invention is achieved through use of a pairl3d modular L~ architecture that is L~~lauaL~ to applications and operating system f-uLLw~ . According to this architecture, each CE is paired with a controller, 25 otherwise known as an I/O ~Lu_-:ssuI ("IOP"). The IOPs perform any I/O operations requested by or directed to the CEs, detect hardware faults, and -y..~ l.Lu--i2e the CEs with each other after each I/O operation. In systems in which I/O L~ue~,~s are not issued with suf~icient 30 rrequency, the IOPs perin~;r~l ly Dy~ul~Lu~ize the CEs in ..se to so-call~d "~U~I---UIII interrupts" u~ Lc.~ed ~y ill~L--~LUCe850L int~:LUU~ eL.~ (IPI) modules c ~ 1 to the CEs.
In another particular ~mho~ of the invention, 35 rather than ~y..- l.Lu..izing the CEs based on each WO95/15529 ` 2 ~ 7785o ~ u ~
p~rticular I/O operation, the CEs are ~y~ r u..ized based on a window of I/O operations. In this approach, a list of I/O operations is maintained for each CE and the CEs ar~ y~ Lvllized Wllel.~ veI a common entry appears in all 5 of the lists. This a~Lu~l_1. allows flPY;h;lity as to the order in which I/O reuue_Ls are generated.
In yet another eYemplary ` i of the invention, the CEs are ~y..~ l.Lù~ized based either on signals that are perio~l i cA l l y UC~ L e.ted by the operating lO system or on hardware generated interrupts. For example, in the hardware i11Lc:LLu~ approach, a ~LUCeSS~L of each CE is modified to generate an interrupt every N cycles and the CEs are Ly-,c1.Lu..ized in l~D~u~.se to those ill LC L L U~ LS .
Primary, La of a paired modular L~du11d~l11L
system include surLw.lLe, off-the-shelf IOPs, off-the-shelf CEs, and pairs of uu~ 1 IPI modules that plug into PYr~n- ion slots of the IOP and the CE and are intt:Luu....e.;Led by a cable. R ~ L I/O devices can be 2 O connected to one or more of the CEs or IOPs to provide Ltdu1l~l11L I/O and offer features such as volume shadowing of key mass storage devices. A paired modular L~
system can A~ '-te any I/O device that is compatible with a ~Lv~ ~a..uL used in i 1~ inq an IOP of the 25 system.
The paired modular Ltdu11.la~L architecture uses minimal custom SGrLWc.L~ and hardware to enable at least two of f-the-shelf computing Pl ~ to be , i nP~ into a f ault rP~ nt or tolerant system that runs industry 30 ~ IId~L~1 operating systems, such as Windows NT, DOS, OS/2, or ~NIX, and unmodified applications. Thus, the architecture can avoid both the high costs and 1nflPYihil;ty of the proprietary operating systems, applications, and processor designs used in the prior 35 art.

WO 95/15S29 2 ~ 7 7 ~3 3 ~3 rCr/US94/13350 Another adva-,L~ge of the paired modular ~
architecture of the present invention is that it offers a certain degree of surLw~L~ fault tolerance. The majority of su~Lw~.~ errors are not algorithmic. Instead, most 5 errors are caused by a~y-..;1.. v~y between the computing element and I/û devices that results in I/O race conditions. By lP~ourl ;nq I/O r~ueaLs from the ; nq Pl , the paired modular ~
architecture should substAnt; ;1 l 1 y reduce the number of 10 so-called "T~P; cPnhtl7" software errors that result from such asynchrony.
In one aspect, generally, the invention features forming a fault tolerant or fault resilient computer by using at least one controller to :.y1.- 1..u..ize at least two 15 computing Pl ~ that each have clocks operating a~y-._l.. u-.uu~ly of the clocks of the other computing Pl~ . One or more signals, designated as meta time signals, are CPlPctP~ from a set of sigrtals y~uduced by the computing Pl . Thereafter, the computing 20 Pl ~ are monitored to detect the pro~ t; n of SPlPCtP~ signals by one of the computing Pl~ L~. Once a scOPrted signal is detected, the system waits for the pro~lrtinn of sPlPcted signals by the other _ ;nq P1~ LS~ and, upon receiving the splpct-pd signals, 25 transmits equal time updates to each of the computing Pl ~ :. The clocks of the computing Pl ~ are then updated based on the time updates.
Preferred ' '; of the invention include the features listed below. First, I/O le~ue~ are the 30 sc~lPcted signals. The I/O requests are ~-.acessed to produce I/O rpcpnncpc that are transmitted witLt the time updates. In addition to, or instead of, I/O ~ u~L~"
quantum interr~tpts can be the sPlPcfP~ sigrtals. The ; nq elements count either executed instructions or 35 the cycles of a clock such as the system c~ock, bus .

W0 95/1~529 ; 2 1 7 7 8 ~ O PCr/US94/13350 clock, or I/0 clock, and generate ~uantum interrupts ~1~e~lev-cI a pre~of i nod number of instructions or cycles occurs. When both I/0 ~e~ue~Ls and quantum i are u6ed as the Eolec~od signals, the computing ol~
5 count the number of instructions or cycles that occur without an I/o request. For example, a computing element could be ~JL UyL ' to generate a quantum interrupt .:.c,.evcr it processes for one hundred cycles without generating an I/0 request.
In one -';- L, instructions are counted by loading a counter with a predetormino~ value, ~nAhl ;n~
the counter with an I/0 request, de,L~ Ling the value of the counter, and ~i~nAl l in~ a quantum interrupt when the value of the counter reaches zero. In another 15 ~yLU c11, debugging features of the ~LuueDsoI are used to UelleL C.te the quantum interrupts .
For fault detection, the selected signals and ~ _ -nying dat2, if any, from each of the computing o~ ~ ~ are . ~ ed. If they do not match, a signal i8 20 gelleL<lted to indicate that a fault has OCuuLLed~
In some omho~l i L_, the computing o l Ls wait for time updates by pausing operation after producing the selected signals. The _ _ ing ol~ Ls resume operation upon receipt of the time updates. In other 25 ~ , the computing el~ continue operation after producing the 801 octed signals.
To avoid problems that can be caused by a.7y~ 1..u..uu5 activities of the computing ol , the aOy~lellLullvus activities are ii cAhlecl. The functions of 30 the ~i~yll~_11LUIlous activities are then performed when a selected signal is ~Lvduced. For example, normal memory refresh functions are disabled and, in their place, burst memory Ler~ez~1le8 are performed each time that a solected signal, such as an I/0 request or a quantum il-LeLLuuL, is 35 ,~)~ u~lu~ed .

Wo 95/15529 - Pcr/uS94/l3~O
~ 2177850 , The invention also features a method of producing fault resilient or fault tolerant __L~L~ by designating a first ~LoC~D5uL as a computing element, designating a second ~LUC~r~uL as a controller, and 5 connecting the computing element and the controller to produce a modular pair. Therea~ter, at least two modular pairs are connected to produce a fault r~Ril ;~n~. or fault tolerant _Lel. The processors used for the; in~
elements need not be ;~lcntic; l to each other, but 10 preferably they all perform each instruction of their instruction sets in the same number of cycles as are taken by the other processors. Typically, industry DLa~ d ~lùc~ssul~, are used in implementing the computing ~1~ and the controllers. For disaster 15 tolerance, at least one of the modular pairs can be located remotely from the other modular pairs. The controllers and computing el~ L:, are each able to run unmodif ied industry standard operating systems and applications. In addition, the controllers are able to 20 run a first operating system while the computing simult~norllcly run a second operating system.
I/û fault rpcil ;~nre is obtained by r~nn~y~t;n~
redundant I/0 devices to at least two modular pairs and transmitting at least identical I/û write requests and 25 dzlta to the redundant I/0 devices. While I/0 read requests need only ~e transmitted to one of the I/0 devices, identical I/0 read ~ ueaLs may be transmitted to more than one of the I/0 devices to verify data integrity. When redundant I/0 devices are rnnn~cto~ to 30 three or more modular pairs, tr~nc~;cs;~n of ~ nt;rS~l I/0 ,~yueDL~ allows identification of a faulty I/0 device .
In another aspect, generally, the invention features isolating I/0 requests from computing operations 35 in a _L~:r through use of I/0 redirection. Typically, WO 95115529 2 ~ 7 7 8 5 ~ PCr~s941l3350 ~
... . .

I/0 devices are a~c~cc~fl either through low level I/0 ré~ueDLs or by directly addressing the I/0 devices. Low level I/0 reuue~L~ include requests to the system's basic input output system (e.~., BIOS), boot fi 'e Leuue~L~
5 boot ~urLw7Le leuue~La, and LeyueaLs to the system's physical device driver s~rLw~l~e. When a computing element issues a low level I/0 request, the invention features using 80rLw~lLe to redirect the I/0 Leuue~-L3 to an I/0 ~LucessuL. When the computing element directly 10 addLèsses the physical I/0 devices, the invention features providing virtual I/0 devices that simulate the interfaces of physical I/0 devices. Directly ad~eDDed I/0 ~eyue-Ls are intercepted and provided to the virtual I/0 devices. PerioAic~lly~ the _ullLellLs of the virtual 15 I/0 devices are transmitted to the I/0 ~LuceDDu~ (s) as I/0 requests. At the I/0 LJLu~e,i5u~ (s), the transmitted contents of the virtual I/0 devices are provided to the physical I/0 devices. After the r euue .Led I/0 operations are performed, the results of the operations, if any, are 20 reLu,..ed to the computing ~1~ as ~ c to the I/0 ~ éuueD~D. Typically, the virtual I/0 devices include a virtual keyboard and a virtual display.
The invention also features detecting and fli:~nosin~ faults in a computer system that ;nrlllAoc at 25 least two controllers that are connected to each other and to at lea~t two computing ~ , and at leact two computing -1 L,i that are each rnnnQctecl to at least two of the controllers. Each computing element ~Luluce~
data and generates a value, such as an error rhork~n~
30 code, tbat relates to the data. Each computing element then transmits the data, along with its CULL ~ 1;
value, to the at least two controllers to which it is L~fl. When the controllers receive the data and associated values, they transmit the values to the other 35 controllers. Each controller then performs . Lions WO 95/15529 - - 2 1 7 7 ~ 5 0 PCT/IJS94/13350 ~-_ g _ on the values ~ ULL~ rlinq to each inq element and the values cULLw~ i nq to each controller. If the results of the c _LILions on the values ~:uLL-~l 1;nq to each controller are equal, and the results of the _LaLions on the values ~:ULL-~IJ ''~;nq to each ;nq element are equal, then no fault exists.
Otherwise, a f ault exists . In some in~ L~i~ces, the _Li. Lion may be a simple bit by bit comparison.
When a fault exists, fault ~l;Aqnnsi~ is attempted 10 by comparing, for each one of the computing element, all of the values CULL- I', ~inq to the one computing element.
If the values ~ULL--IJ ~linq to each computing element match for each computing element, but mismatch for dii~ferent computing Pl c, then one of the _ inq 15 ~l~ L,. is faulty. I~ the values cuLL--u l~l;nq to only one of the computing Pl~ ~ mismatch, then a path to that computing element is f aulty . If the values cuLL-~ ;n~ to multiple computing ~l mismatch, then the controller that is connected to the mismatching 20 computing Pl' L5 is faulty. Once identified, the ~aulty element is ~ hl P~ .
A system according to the invention can restore itself to full c;~r~hi 1 ity after a faulty element (that is, a CE, an IOP, a storage device, etc. ) is repaired.
25 The system does so by transferring the state of an active element to the repaired element and, thereafter, r~Pn~hl; nq the repaired element. ~Inactive or repaired :-LU~. are activated by transferring the operational state of an active ~u~ Ol or ~LUCe5f~UL~ to the 30 inactive ~LUCeS~UL through a controller. When the inactive ~L UUt:55UL is a i "q element, the operational state of an active ; nq element tor Pl ) is transferred through a controller. When the inactive ~Luces~uL is a controller, the operating state 35 of an active controller is directly transferred. The WO 95/15529 - lO - Pcr~S~4113350 tr~nsfer ean oce~1r either when 6ystem operation is paused or as a L~J~LUUI~d process.
~ his 1-~C~ LY r~r~hility can also be used to provide on-line upgrades of hardware, ,,c r~w~e, or both 5 by causing a ~r uc.t 8bUL of the system to fail by, for example, turning it off. The upgrade is then performed by either rPrl~C-;n~ or Dodifying the disabled ~LU- C ~SUL .
The uuyLc.ded ~uLUcessoI is then turned on and renctivat~d as A i cr~1csed above .
The invention also features a single controller, dual ccmputing element system in whieh a eontroller is c~ nnnPc ted to two eomputing Pl~ Ls . In this eomputer system, I/û operations by the _ ing Pl~ '~ are intereepted and redirected to the controller. Typically, 15 the controller and the two computing Pl ~ Ls each include an industry standard moth~L Lud~ 1, and are each ~ble to run unmodif ied industry standard operating systems and applieations. In addition, the controller is able to run a f irst operating sy6tem while the ; ng 20 Pl~ L~ simult~nPoucly run a second operating system.
The single eontroller system ean be ~ d to include a second controller rAnnPrtP'l both to the f irst eontroller and to the two computing Pl ~ . For purposes of providing limited disaster rPCi 1 ;Pnre, the 25 f irst eontroller and one of the computing Pl ~ L~ can be placed in a location remote from the second controller and the other computing element, and can be c~ .P~ L-~A to the seeond eontroller and the other computing element by a - i c ations link.
For; uved aV~ hil ity and performance, the dual controller, dual eomputing element system ean be connected to an identical second system. The two systems then run a distributed computing environDent in which one of the systeDs runs a f irst portion of a f irst ~ W095/15529 ~ 2177~50 PCrlUS94113350 application and the other system runs either a second application or a second portion of the first Arpl ;~a~;^n In another: ;- L, the invention fe~uL~s a computer system that ; nrl-l~oc three controllers ~ ed 5 to each other and three computing ~ that are each connected to different pairs of the three controllers.
This system, like the other systems, also features intercepting I/O operations by the computing Pl and redirecting them to the controllers for pro~pcc;n~. For lO disaster rPc;l;Pn~P, the first controller and one of the computing Pl~ LY are placed in a location remote from the I ~ ;n;n~ controllers and computing Pl , or each controller/ ; n~ element pair i5 placed in a different location.
A disaster tolerant system is created by connecting at least two of the three controller systems described above. The three controller systems are placed in remote locations and connected by a ; ~ Ations link.
Rnie~ DescriPtion of the Drawincrs Fig. l is a block diagram of a partially fault resilient system.
Fig. 2 is a block diagram of system 5ur~w~le of the system of Fig. l.
Fig. 3 is a flowchart of a procedure used by an IOP ~onitor of the system s~rLwcl. ~ of Fig. 2 .
Fig. 4 is a block diagram of an IPI module of the system of Fig. l.
Fig. 5 is a state transition table for the system of Fig. l.
Fig . 6 is a block diagram of a f ault rPs:; l; Pnt system.
Fig. 7 is a block diagram of a distributed fault resilient system.

WO 95115529 ; i ~ ~ 2 ~ 7 7 8 5 0 Pcr~S94113350 Fig. 8 is a block diagram of a fault tolerant ~ystem.
Fig. g is f 1 t _llt~L L of a fault t9~; trnci ~t ~Lvuls-luLe used by IOP6 of the system of Fig. 8.
Fig. 10 i6 a block diagram of a disaster tolerant system.
Descril~tion of the Prefer~ed ~
As illustrated in Fig. 1, a fault rP~ilit~nt system 10 ;n~ Pq an I/O ULUCe85UL ("IOP") 12 and two computing 10 Pl c (~CEs~) 14a, 14b (collectivêly referred to as CEs 14). Because system 10 inclllt9pc only a single IOP 12 and theref ore cannot recover rrom a f ailure in IOP 12, system 10 is not entirely fault resilient.
IOP 12 i nt lllt90c two inte:L--~L ùueSSUL il~ LeL uu....euL
15 ("IPI") modules 16a, 16b that are ct~nnPctt~d~
respectively, to UULL~ 7inq IPI modules 18a, 18b of CEs 14 by cables 20a, 20b. IOP 12 also inrl~-Aac a ces~u1 22, a mêmory system 24, two hard disk drives 26, 28, ~nd a power supply 30. Similarly, each CE 14 20 includes a yLucessol 32, a mêmory system 34, and a power supply 36. Separatê power supplies 36 are used to ensurê
fault rPci 1 i Pn~-e in thê event of a power supply failure.
P ucessuL- 32a, 32b are "identical" to each other in that, f or evêry instruction, the numbêr of cycles 2~ rêsluired for ~LUCeS5UL 32a to perform an instruction is identical to thê number of cycles requirêd for y~u~.essuL
32b to perform the same instruction. In thê illu~LLtltéd ~i L, systPm 10 has been implemêntêd using l.Lt~.d,~
Intêl 486 basêd motheL~ùaL~s for ~lucessuL-~ 22, 32 and 30 four mêgabytes of mêmory for each of mêmory system 24, 34 .
IûP 12 and CEs 14 of system 10 run unmodifiêd operating system and applicationS software, with hard drive 26 being usêd as the boot disk for thê IOP and hard 35 driv~ 28 bêing usêd as thê boot ~isk for CEs 14. In Wo 95115529 pcrNs94ll335o truly fault resilient or fault tolerant systems that include at least two IOPs, each hard drive would also ~e duplicated .
In the illustrated ';~ L, the operating 5 system for IOP 12 and CEs 14 is DOS. However, other operating systems can also be used. IIJL~::UV~:L, IOP 12 can run a different operating system from the one run by CEs 14. For example, IOP 12 could run Unix while CEs 14 run DOS. This approach is a-lv~l-Lc-.3euus because it allows CEs 10 14 to access peripherals from operating systems that dû
not support the peripherals. For example, if CEs 14 were running an operating system that did not support CD-ROM
drives, and IOP 12 were running one that did, CEs 14 could access the CD-RûM drive by issuing I/O L~:~ue-L:-15 ~ ntic~l to those used to, say, access a hard drive.IOP 12 would then handle the translation of the I/O
request to one suitable for ~C~eccin~ the CD-ROM d~ive.
Referring also to Fig. 2, system 10 1 nrl ll~lOC
spe~ ed system sorLwale 40 that controls the booting 20 and ~iy~ Lu~lization of CEs 14, disables local time in CEs 14, redirects all I/O Le~ue- Ls from CEs 14 to IOP 12 for execution, and returns the results of the I/O requests, if any, from IOP 12 to CEs 14.
System sorLwc.le 40 inrl~ oc two sets of IPI BIOS
25 42 that are RO~-based and are each located in the IPI
module 18 of a CE 14. IPI BIOS 42 are used in bootup and liy~ lLullization activities. Nhen a CE 14 is booted, IPI
BIOS 42 replaces the I/O interrupt adlLasses in the system BIOS illL~LLU~L table with ad~Le~es that are 30 controlled by CE Drivers 44. The interrupt ~.idL~___8 that are replaced include those CULLI ~ ;n-J to video services, fixed disk services, serial i~ ations service6, keyboard services, and time of day services.
IPI BIOS 42 also ~i c Ihl~c normal memory refreshing 35 to ensure that memory refreshing; which affects the WO 95115529 2 ~ 7 7 8 ~ O PCTnlS94/13350 ~

number of cycles during which a CE 14 is actually pror~csi"~, is controlled by system surLwc.lc: 40. Msmory reEreshing is required to maintain memory integrity. In known refreshing methods, memory is ~erLe~ ad 5 perio~;cAl ly, with one block of memory being r~rLe~l-ed at the end of each refresh period. The duration of the refresh period is CP~ ~t~rl so that the entire memory i5 ~ :Dlled within the memory's refresh limit. Thu~, for sxample, if a memory has 256 blocks and an 8 ms refresh 10 limit, then the refresh period is 31.25 ,U,5 (8 ms / 256).
In the described ~ oA;- ~, IPI BIOS 42 A;~Ahl~-c memory refreshing by placing a counter used in the Intel 486 LheL~oaLd to control memory refreshing in a mode that requires a gate input to the counter to change in 15 order to inl L- L. Because the gate input is typically connected to the power supply, the gate input never changes and the counter is effectively A;cAhl~A.
Two CE Drivers 44 of system sur~wc~rê 40 handle memory refreshing by burst refreshing multiple blocks o~
20 memory each time that an I/O request or quantum illLëLLu~L
is generated. CE Drivers 44 are stored on CE boot disk 28 and are run by CEs 14. In addition to performing burst memory r~rL~Dl,es, CE Drivers 44 intercept I/O
r~u,ue~L:~ to the system BIOS and redirects them through IPI modules 18 to IOP 12 f or ~y~rut; on . CE Drivers 44 also respond to interrupt requests from IPI modules 18, disable the _ystem clock, and, based on information supplied by IOP Monitor 48, control the time of day of CEs 14.
An IOP Driver 46 that is located on IOP boot disk 26 and is run by IOP 12 handles I/O r~yue~Ls from CE8 14 by redirecting them to an IOP Monitor 48 for procpc~g;n~
and transmitting the results ~rom IOP Monitor 48 to CEs 14 . IOP Driver 46 ~ ; c~tes with CE drivsrs 44 using 35 a packet protocol.

~ W095115529 ~ - ` ` 217785~ rcrllls94ll335o IOP Nonitor 48 is located on IOP boot disk 26 and is run by IOP 12. IOP Monitor 48 controls system 10 and performs the actual I/O Lcyue-Ls to produce the results that are transmitted by IOP Driver 46 to CEs 14.
System software 40 also 1 nnl~ console surLw~Lé
49 that runs on IOP 12 and provides f or user control of system 10. using console SGrLwcLLe 49, a user can reset, boot, or ~-yil~ llLUlliZe a CE 14. me user can also set one or both of CEs 14 to automatically boot (autoboot) and/or 10 automatically riyl.~l.Lu..ize (autosync) after being reset or upon startup. The ability to control each CE 14 is useful both during normal operation and for test purposes. Using console software 49, the user can also place system 10 into either an integrity mode in which 15 IOP Nonitor 4 8 shuts down both CEs 14 when f aced with a m;~_ ~ e error, a first avR;l~h;l;ty mode in which IOP
Monitor 48 rli~:~h~ CE 14a when faced with a m;~ e:
error, or a second avA;lAh;lity mode in which IOP Nonitor 48 disables CE 14b when ~aced with a mi . , ~è error.
20 Finally, console software 49 allows the user to request the status of system 10. In an alternative: ~ ~i , console ~orLwclLe 49 could be implemented using a separate ;~Luuea~uL that i~ ~tes with IOP 12.
Each CE 14 runs a copy of the same application and 25 the same operating system as that run by the other CE 14.
Moreover, the contents of memory systems 34a and 34b are the same, and the operating context of CEs 14 are the same at each ~yll llL u..ization time . Thus, IOP Monitor 48 should receive identical seu~ueilces of I/O requests from 3 o CEs 14 .
As shown in Fig. 3, IOP Monitor 48 j~Lu~ésses and monitors I/O requests according to a ~luoelluLe 100.
Initially, IOP Monitor 48 waits for an I/O request from one of CEs 14 (step 102). Upon receiving an I/O request 35 packet from, for example, CE 14b, IOP Nonitor 48 waits wo 95,l5~2g ; ~ 2 t 7 7 8 5 0 pCI/US94/13350 for either an I/O recluest from CE 14a or for the expiration of a timeout period (step 104). Because system 10 uses the DOS operating system, which halts execution of an application while an I/O recluest is being 5 processed, IOP Monitor 48 is guaranteed not to receive an I/O recIuest from CE 14b while waiting (step 104) for the I/O recluest from the CE 14a.
Next, IOP Monitor 48 checks to ~t~rmin-~ whether the timeout period has expired (step 106). If not (that 0 i5, an I/O request packet from CE 14a has arrived), IOP
Monitor 48 ~:s the ~ of the packets (step 108), and, i~ the rhorl-QllmQ are ecIual, ~lu~:eDDes the I/O
recluest (step 110). After proc~cs;ng the I/O reguest, IOP Monitor 48 issues a request to the system BIOS of IOP
15 12 for the current time of day (step 112).
After receiving the time of day, IOP Monitor 48 a~ Q an IPI packet that i n~ C the time of c~ay and the results, if any, of the I/O recluest (step 114) and sends the IPI packet to IOP Driver 46 (step 116) for 20 trS~n~ sicn to CEs 14. When CEs 14 receive the IPI
packet, they use the transmitted time of day to update their local clocks which, as already noted, are otherwise ~1 i Q :~ h l ~
As required by DOS, execution in CEs 14 is 25 5~lCr~n~ l until IOP Monitor 48 returns the results of the I/O recluest through IOP Driver 46. Because, before ~Y~nllti nn is re8umed, the times of day of both CEs 14 are updated to a common value (the transmitted time of day from the IPI packet), the CEs 14 are kept in time 30 Dy~ ...ization with the transmitted time oi day being designated the meta time. If a multitasking operating system were employed, execution in CEs 14 would not be ~'-L''~l~''CI while IOP ~onitor 48 performed the I/O recluest.
Instead, proc~sin~ in CEs 14 would be D ~ d only 35 until receipt of an acknowle, L indicating that IOP

~ Wo 95/15529 - 2 1 7 7 8 5 0 PCTIITS94/13350 Monitor 48 has begun proc~cc; n~ the I/O request (step 110). The acknowle L would include the time of day and would be used by CEs 14 to update the local clocks.
After sending the IPI packet to IOP Driver 46, IOP
5 Nonitor 48 verifies that both of CEs 14 are online (step 118), and, if so, waits for another I/O request from one of CEs 14 (step 102).
If the timeout period has expired (step 106), IOP
Monitor 48 lliCAhl~-C the CE 14 that failed to respond 10 (step 119) and processes the I/O reque6t (step 110).
If there is a mis. _ between the, ~ of the packets from CEs 14 (step 108), IOP Monitor 48 checks to see if system 10 i5 operating in an avA;lAhil;ty mode or an integrity mode (step 120).
If system 10 is operating in an avA;lAh;l;ty mode, IOP Monitor 48 ~l;cAhl~c the appropriate CE 14 based on the 5PlectQ~i avA;lAh;l;ty mode (step 122), and EJLu~ Des the I/O request (step 110). Thereafter, when IOP Monitor 48 checks whether both CEs 14 are online (step 118), and 20 AccllminlJ that the disabled CE 14 has not been repaired and reactivated, IOP Monitor 48 then waits for an I/O
request from the online CE 14 (step 124). Because system 10 is no longer fault resilient, when an I/O request is received, IOP Monitor 48 immediately u- uce:, es the I/O
25 reS~uest (step llO).
If system 10 is operating in an integrity mode when a m; ~_ ~ e is detected, IOP Monitor 48 ~i; CAhlP~
both CEs 14 (step 126) and stops processing (step 128).
Referring again to Figs. l and 2, when the 30 application or the operating system of, for example, CE
14a makes a non-I/O call to the system BIOS, the system BIOS oY~'C~lt"C the request and returns the results to the application without invoking system software 40.
However, if the application or the operating system makes 35 ar, I/O BIOS call, CE Driver 44a intercepts the I/O

WO 9~/15529 2 ~ 7 7 8 5 0 Pcr/usg4ll3350 ;. .~,` '; ~

request. After intercepting the I/O request, CE Driver 44a pArkA~Oc the I/O request into an IPI packet and transmits the IPI packet to IOP 12.
When IPI module 16a of IOP 32 detects trAn-~;cc~nn 5 of an IPI packet from CE 14a, IPI module 16a ~elleLIltes an illLeLLU~I_ to IOP Driver 16. IOP Driver 46 then reads the IPI packet.
As rl~ccl~ccp~l above, IOP Nonitor 48 Le~ull-l~ to the IPI packet from CE 14a according to ~JLU-,edULe 100. As 10 also Aiccllcsed~ Acsllmin~ that there are no hardware ~aults, IOP Driver 46 eventually transmits an IPI packet that contains the results of the I/O request and the time of day to CEs 14.
IPI modules 18 of CEs 14 receive the IPI packet 15 from IOP 12. CE Drivers 44 unpack the IPI packet, update the time of day of CEs 14, and return control of CEs 14 to the application or the operating system running on CEs 14 .
If no I/O requests are issued within a given time 20 interval, the IPI module 18 of a CE 14 generates a 80-called quantum interrupt that invokes the CE Driver 44 of the CE 14. In re~ul~ae, the CE Driver 44 creates a quantum interrupt IPI packet and transmits it to IOP 12.
IOP Monitor 4 8 treats the quantum interrupt IPI packet as 25 an IPI packet without an I/O request. Thus, IOP Monitor 48 detects the i - i n~ quantum interrupt IPI packet (step 102 of Fig. 3) and, if a matching quantum interrupt IPI packet is received from the other CE 14 (steps 104, 106 , and 108 of Fig. 3 ), issues a request to the system 30 BIOS of IOP 12 for the current time of day (step 112 of Fig. 3). IOP Monitor 48 then r~rkA~Oc the current time of day into a quantum ré~uu--~e IPI packet (step 114 of Fig. 3) that IOP Driver 46 then sends to CEs 14 (step 116 o~ Fig . 3 ) . CE Drivers 44 respond to the quantum 35 r~uu..se IPI packet by updating the time of day and WO 95/15529 - PCr~S94113350 ~ 2 1 77850 returning control of OEs 14 to the application or the operating system running on CEs 14.
I~ IOP Nonitor 48 does not receive a quantum interrupt IPI package from the other CE 14 within a - 5 predefined timeout period (step 106 of Fig. 3), IOP
Monitor 48 responds by AiR~hlin~ the non-r~crnnAin~ CE
14 .
As 6hown in Fig. 1, IPI modules 16, 18 and cables 20 provide all of the hardware ni c~c~ to produce a 10 fault resilient system from the ~Lalldt Ld Intel 486 based moth~rholrds used to implement ~LucessuLa 22, 32. An IPI
module 16 and an IPI module 18, which are implemented using identical boards, each perform similar fllnr~ ;nnc.
As illustrated in Fig. 4, an IPI module 18 15 i nrl llA-~c a control logic 50 that _ ; rates I/O
e~Ls and L~ c-~c between the system bus of a ~JLU~;eaaUL 32 of a OE 14 and a parallel interface 52 of IPI module 18. Parallel interface 52, in turn, i rate5 with the parallel interface of an IPI module 20 16 through a cable 20. Parallel interface 52 inrli~A~c a sixteen bit data output port 54, a sixteen bit data input port 56, and a control port 58. Cable 20 is configured so that data output port 54 is rnnn-~rt~d to the data input port of the IPI module 16, data input port 56 is 25 connected to the data output port of the IPI module 16, and control port 58 is rnnn~cte~A to the control port of the IPI module 16. Control port 58 i l a hAnAch~king protocol between IPI module 18 and the IPI
module 16.
Control logic 50 is also connected to an IPI BIOS
RON 60. At startup, control logic 50 transfers IPI BIOS
- 42 (Fig. 2), the contents of IPI BIOS ROM 60, to yLUCeSSUL 32 through the system bus of ~LUU_aaU~ 32.
A QI counter 62, also located on IPI module 18, 35 generates quantum illLt:LLUyLS as ~l;crllqc-~A~ above. QI

Wo 95/15529 2 1 7 7 8 5 o Pcrlus94ll33~

counter 62 l nrl~ R a clock input 64 that i5 C ~ Led to the system clock of processor 32 and a gate input 66 that iB connected to control logic 50. Gate input 66 is used to activate and reset the ccunter value of QI counter 62.
5 When activated, QI counter 62 deuL Ls the counter value by one during each cycle of the system clock of processor 32. When the counter value reaches zero, QI
counter 62 y~ Lc.Les a quantum interrupt that, as ll;ccllc~d above, activates CE Driver 44 (Fig. 2).
CE Driver 44 deactivates QI counter 62 at the b~innin~ of each I/0 tr~nc~r~inn~ CE Driver 44 deactivates QI counter 62 by requesting an I/0 write at a f irst address, known as the QI deactivation address .
Control logic 50 detects the ItO write request and 15 deactivates QI counter 62 through gate input 66. Because this particular I/0 write is for control ~uL~oses only, control logic 50 does not pass the I/0 wrlte to parallel interrace 52. At the rnnrlllcinl~ of each I/0 transaction, CE Driver 44 resets and activates QI counter 62 by 20 recluesting an I/0 write to a second address, known as the QI activation address. Control logic 50 lec,~u.,ds by r~cett;n~ and activating QI counter 62.
In an alternative approach, quantum interrupts are generated through use of debugging or other fer.LuL.:s 25 available in y~uCeisuL 32. Some commonly availa~le ~LU~dS.' UL~ include ,~ lrJqi ng or trap instructions that trap errors by transferring control of the ~Lucessul to a designated program arter the completion of a c~l rrt~7 number of instructions following the trap instruction.
30 In this approach, each time that CE Driver 44 returns control Or E LU~ e55UL 32 to the application or operating system, CE Driver 44 issues a trap instruction to indicate that control of ~Lu~ e55UL 32 should be given to CE Driver 44 upon completion of, for example, 300 35 instructions. ~fter ~LU~ eSSUL 32 ~ let~C the indicated .

WO 95/15529 ~ - PCrlUS94113350 ~ 2 1 77~50 300 instructions, the trap instruction causes control of ~Lu- ea~luL 32 to be leLuL.Ied to CE Driver 44. In the event that an I/o request activates CE Driver 44 prior to completion of the indicated number of instructions, CE
5 Driver 44 issues an instruction that cancels the trap instruction .
IPI Module 18 is also used in activating an offline CE 14. As tlicc--cc~d below, before an offline CE
14 i5 activated, the contents of the memory syfitem 34 of 10 the active CE 14 are copied into the memory system 34 of the offline CE 14. To Tn;n;n~i~e the effects of this copying on the active CE 14, the ~lucessuL 32 of the active CE 14 is permitted to cnn~in~ procDccinq and the memory is copied only during cycles in which the system 15 bus of the ~LucessuL 32 of the active CE 14 is not in use .
To enable ~Loc:e~uL 32 to continue prO. ~cc;nq while the memory is being copied, IPI module 18 accounts for memory writes by the ~LU-_e55UL 32 to ad-lL.~ses that 20 have already been copied to the offline CE 14. To do ~;o, control logic 50 monitors the system bus and, when the yLùcessû~ 32 writes to a memory address that has already been copied, stores the address in a FIFO 68. When the memory transfer is complete, or when FIFO 68 is full, the 25 contents of memory locations associated with the memory addLèssQs stored in FIFO 68 are copied to the offline CE
14 and FIFO 68 is emptied . In other ~L ucl~l.es, FIFO 68 is '; f i ~d to store both memory addL ~ses and the contents of memory locations associated with the 30 adlLasses, or to store the block addresse6 of memory blocks to which memory addLc ~ses being written belong.
- IPI module 18 also handles non-BIOS I/O requests.
In some computer systems, the BIOS is too slow to effectively perform I/O operations such as video display.
35 As a result, some less _LLu~LuLc:d or less ~ic~;rl in~d WO95/15a29 2 1 7 7 8 5 0 r~ 7~au operating systems, such as DOS or 'JNIX, allow applications to ci~ uu.,.~_..L the BIOS and make non-BIOS I/O
leuuc:aLs by directly reading from or writing to the a.llLesseD associated with I/O devices. These non-BIOS
5 I/O reU~UeS LD, which cannot be intercepted by rh~7ng;n~ the system interrupt table, as is done in cr~nn~c~ n with, for example, I/O disk reads and writes, are problematic ror a system in which 7yl--.llL u.,ization reSIuires tight control of the I/O interface.
To remedy this problem, and to assure that even non-BIOS I/O ~euue~Ls can be isolated and managed by IOP
12, IPI module 18 7nrl~ C virtual I/O devices that mimic the hardware interfaces of physical I/O devices. These virtual I/O devic:es include a virtual display 70 and a 15 virtual keyLGal.l 72. As needed, other virtual I/O
devices such as a virtual mouse or virtual serial and parallel ports could also be used.
In practice, control logic 50 monitors the system bus for read or write operations directed to addresses 20 associated with non-BIOS I/O ~ e~ueDLs to system I/O
devices. ~hen control logic 50 detects such an operation, control logic 50 stores the information n~C~c4 7- y to Lecu,.~,LLu- L the operation in the appropriate virtual device. Thus, for example, when control logic 50 25 detects a write operation directed to an address associated with the display, control logic 50 stores the information ~ C~4~ y to leovllaLLu.;L the operation in virtual display 70. Each time that a BIOS I/O request or a quantum interrupt occurs, CE Driver 4 4 scans the 30 virtual I/O devices and, if the virtual devices are not empty, r-- lrc the information stored in the virtual devices into an IPI packet and transmits the IPI packet to IOP 12. IOP 12 treats the packet like a BIOS I/O
request using ~LUUédULe 100 r~CC17CS~r7 above. When 35 control logic 50 detects a read addLèsDed to a virtual WO 9~15529 - ~ ~ 2 1 7 ~ 8 5 0 PCrn~S94113350 I/O device, control logic 50 5~ccQmhl~c the read request into an IPI packet for h:~n~ll ;n~ by IOP 12. IOP 12 treats the IPI packet like a ~-~al-dar 1 BIOS I/O request.
Referring to Fig. 5, each CE 14 always uueL~It.es in 5 one of eight states and, because there are only a limited number of p~rm;~c;hl~ state combinations, system 10 always opêrates in one of fourteen states. The major CE
operating states are OFFLINE, RTB (ready to boot), BOOTING, ACTIVE, RTS (ready to sync), WAITING, rt_SYNC, 10 (~-y-.ul.Lu.lizing as master), and S_SYNC (ay~ ru-.izing as slave). IOP Monitor 48 changes the operating states of CEs 14 based on the state of system 10 and user ~
from console software 49. T_rough console 2~Ur~W~lLt: 49, a user can reset a CE 14 at any time. Whenever the user 15 resets a CE 14, or a f ault occurs in the CE 14, IOP
Monitor 48 changes the state of the CE 14 to OFFLINE.
At startup, system 10 is operating with both CEs 14 OFFLINE (state 150). system 10 operates in the upper states of Fig. 5 (states 152-162) when CE 14a becomes 20 operational before CE 14b and in the lower states (states 166-176) when CE 14b is the first to become operational.
If CEs 14 become operational simultaneously, the first operational CE 14 to be r~ro~n; 7-~ri by IOP Monitor 48 is treated as the f irst to become operational .
When a CE 14 indicates that it is ready to boot by issuing a boot request, the state of the CE 14 changes to RTB if the CE 14 is not set to autoboot or to BOOTING if the CE 14 is set to a~LobouL. For example, if CE 14a issues a boot request when both CEs 14 are OFFLINE, and 30 CE 14a is not set to autoboot, then the state of CE 14a changes to RTB (state 152). Thereafter, IOP Monitor 48 - waits for the user, through console sur~wc.~e 49, to boot CE 14a. When the user boots CE 14a, the state of CE 14a changes to BOOTING (state 154). If the user resets CE
35 14a, the state of CE 14a changes to OFFLINE (state 150).

WO95/l5s29 '~ i - 2 1 7 7 ~ 5 ~ PCr~S94~l3350 Il~ both CEs 14 are OFFLINE when CE 14a issues a boot request, and CE 14a iB 6et to auLcLou~, the state of CE 14a changes to BOOTING ~state 154). If CE 14a boots rully~ the state of CE 14a changes to ACTIVE
5 (state 156 ) .
When CE 14a is ACTIVE, and CE 14b issues a boot request, or if CE 14b had issued a boot request while the state of CE 14a was transitioning from OFFLINE to ACTIVE
(state6 152-156), the state of CE 14b changes to RTS
10 (state 158) if CE 14b is set to autosync and otherwise to WAITING (state 160). If the state of CE 14b changes to RTS (state 158), IOP Monitor waits for the user to issue a "yll~llL u--ize command to CE 14b . When the user issues such a command, the state of CE 14b changes to WAITING
15 ( state 160 ) .
Once CE 14b is WAITING, IOP Monitor 48 copies the contents of memory system 34a of CE 14a into memory system 34b of CE 14b. Once the memory transfer is complete, IOP Monitor 48 waits for CE 14a to transmit a 20 quantum interrupt or I/O request IPI packet. Upon receipt of such a packet, IOP Monitor 48 changes the state of CE 14a to M_SYNC and the state of CE 14b to S_SYNC (state 162), and ~y"~ l..u--izes the CEs 14. This ay--~ llLul.ization inn~ wc r-~c~nn~1n~ to any memory changes 25 that o_- uLLed while IOP Monitor 48 was waiting for CE 14a to transmit a quantum interrupt or I/O request IPI
packet. lJpon completion of the ~y..- l.Lu..ization, the states of the CEs 14 both change to ACTIVE (state 164) and system 10 is deemed to be fully operational.
~ In an alternative impl~ L~t.ion, IOP Monitor 48 does not wait for memory transfer to complete before ~hAngl n~ the state of CE 14a to M SYNC and the state of CE 14b to S_SYNC (state 162). Instead, IOP Monitor 48 makes this state change upon receipt of an IPI packet WO 95/15529 ~CrlUS941133!iO
-; 21 77850 from CE 14a and performs the memory transfer as part of the ~y~ Lu.-ization process.
Similar state transitions occur when CE 14b is the first CE 14 to issue a boot request. Thus, ~Ff-lmin7 that 5 CE 14b is not set to auLuLouL, CE 14b transiticns from OFFLINE (state 150) to RTC (state 166) to BOOTING (state 168) to ACTIVE (state 170). Similarly, once CE 14b is ACTIVE, and ~ccllm;n~ that CE 14a is not set to auLuayllc~
CE 14a transition6 from OFFLINE (state 170) to RTS (state 10 172) to WAITING (state 174) to S_SYNC (state 176) to ACTIVE (state 164).
In other PmhoSi ' 5 of the invention, for example, referring to Fig. 6, a fault r~c; 1 ;~nt system 200 ; nr~ Pc two IOPs 202 and two CEs 204 . Each CE 204 15 is connected, through an IPI card 206 and a cable 208, to an IPI card 210 of each IOP 202. IOPs 202 are l~dulldallLly connected to each other through IPI cards 210 and cables 212. Because every _ L of system 200 has a r~du..d-~.L backup L, system 200 is entirely 20 fault resilient. In an alternative approach, cables 208 and 210 could be replaced by a pair of local area nc i ~L}.S to which each IOP 202 and CE 204 would be P~l. Indeed, local area n~L _/L~S~ can always be substituted for cable ~-r~nnect;ons.
System 200 is operating 6ystem and appli-~tir~n sc,r LwaL e ; nr~ L in that it does not require modifications of the operating system or the 'lrPl ;~At;nn sGrLwaLe to operate. Any single piece of haL, _ ~ can be u~yLaded or repaired in system 200 with no service 30 interruption. Therefore, by sequentially rPrl~;n~ each piece of hardware and allowing system 200 to Le--y...l.lu..ize after each repl~ L, the hardware of system 200 can be replaced in its entirety without service interruption. Similarly, software on system 200 35 can be upgraded with minimal service interruption (that WO 95/1~529 ~ ~u is, during the sor~waLe upgrade, the application will become unavailable for an acceptable period of time such as two seconds). Also, disaster tolerance for 1JuL,uO8eS
of av~ h; 11 ty can be obtained by placing each IOP/CE
5 pair in a separate location and nnnPct;n~ the pairs through a ~ i rntions link.
Re~erring to Fig. 7, a distributed, high performance, fault rec~ Pn~ system 220 inrlll~pc two systems 200, the IOPs 202 of which are ~u....e- Led to each 10 other, through IPI modules, by cables 222. System 220 uses distributed computing environment software to achieve high perf ormance by running separate portions of an application on each system 200. System 220 is fault tolerant and offers the ability to perform both h~ è
15 and SurLw.lLe upgrades without service interruption.
Referring to Fig. 8, a fault tolerant system 230 inrlll~Pq three IOPs (232, 234, and 236) and three CEs (238, 240, and 242). Through IPI modules 244 and cables 246, each IOP is cnnnPctP~ to an IPI module 244 of each 20 of the other IOPs. Through IPI modules 248 and cables 250, each CE is connected to an IPI module 244 of two of the IOPs, with CE 238 being cnnnPr~P~l to IOPs 232 and 234, CE 240 being rnnnPCtP~ to IOPs 232 and 236, and CE
242 being rnnnPctPd to IOPs 234 and 236. Like system z5 200, system 230 allows for hardware u~yLcldes without service i--Lé--u~Lion and surLw~le u~-tdes with only minimal service i..LeL, u~Lion.
As can be seen from a comparison of Figs. 7 and 8, the CEs and IOPs of systems 200 and 230 are itlPn~ir~lly 30 configured. As a result, upgrading a fault rP~ Pnt system 200 to a fault tolerant system 230 does not require any r~rl~ L of existing ha~ e and entails the simple ~Lu :eduLe of adding an additional CE/IOP pair, connecting the cables, and making appropriate changes to 35 the system sorLw~e. This modularity is an i ~ L_l,L

~ WO 95/15529 2 ~ 7 7 8 5 0 ~CrlUS941133~0 feature of the paired modular redundant architecture of the invention.
Because the ~ _ Ls of system 230 are triply rPdl-n~nt, system 230 is more capable of identifying the 5 source of a hardware fault than is system 10. Thus, while system lo simply ~i C~hlp~ one or both of CEs 14 when an error i5 ~lpta~tp~l~ system 230 offers a higher degree of fault ~ nnSia.
Referring to Fig. 9, each IOP (232, 234, 236) of 10 system 230 performs fault diagnosis according to a pLU~edUL~ 300. Initially, each IOP (232, 234, 236) checks for major faults such as power loss, broken cables, and r.u..~u.-uLional CEs or IOPs using well known te~-hn;quPc such as power sensing, cable sensing, and 15 protocol timeouts (step 302). When such a fault is detected, each IOP ~l;c~hlP~ the faulty device or, if nPCPc~- y, the entire system.
After ~hP~ k;n~ for major faults, each IOP waits to receive IPI packets (that is, guantum il~Lt:LLU~L:~ or I/O
20 reguests) from the two CEs to which the IOP is ,, "-~(step 304). Thus, for example, IOP 232 waits to receive IPI packets from CEs 238 and 240. After receiving IPI
packets from both cnnnPcted CEs, each IOP transmits the .1,P. 1~ CRCs~) of those IPI packets to the other two 25 IOPs and waits for receipt of CRCs from the other two IOPs ( step 3 0 6 ) .
After receiving the CRCs from the other two IOPs, each IOP generates a three by three matrix in which each column ULLt=lyUlldS to a CE, each row CULLt:S,,UU~ldS to an 30 IOP, ~nd ~h entry is th= CIIC reoeiv-d fron the ~o~u=='~

Wo 95/15529 - PCr~S941133~0 ` 21 7785~

CE by the row's IOP (step 308). Thus, for example, IOP
232 generates the following matrix:

IOP 2 3 2 ' CRC CRC X

IOP 2 3 6 ¦ X CRC CRC
After generating the matrix, IOP 232 sums the entries in each row and each column of the matrix. I~ the three row sum6 are egual and the three column sums are egual (step 10 310), then there is no fault and IOP 232 checks again for maj or faults (step 3 02 ) .
I~ either the three rows ' sums or the three columns' sums are unegual (step 310), then IOP 232 c:5 the CRC entries in each of the columns of the 15 matrix. If the two CRC entries in each column match (step 312), then IOP 232 ~ qnnS"C that a CE failure has OCUULL~:d and ~iCAhl~C the CE ULL~ ;nlJ to the column for which the sum does not egual the sums of the other columns (step 314).
I~ the CRC entries in one or more of the matrix columns do not match (step 312), then IOP 232 rl~rmin~-c how many of the columns include mi ~at- l-ed entries. If the matrix inr~ c only one column with mi~c.L~ lled entries (step 315), then IOP 232 ~ nns~c that the path 25 between the IOP 1uLL-~ lin~ to the matrix row sum that i5 unegual to the other matrix row sums and the CE
CUL L -~lJ~ l i ng to the column having mi_.uatul,ed entries has failed and ~l~c~hl~c that path (step 316). For ,uuLuoges of the rliaqnnsic~ the path inrl~ c the IPI module 244 in 30 the IOP, the IPI module 248 in the CE, and the cable 250.
If the matrix inrl~ Pc more than one column with mismatched entries (step 314), then IOP 232 confirms that one matriX row sum is unegual to the other matrix row sums, ~ nns~-c an IOP fallure, and r~c;~hl~c the IOP
35 UULL ~ l1nq to the matrix row sum that is unegual to the other matrix row sums (step 318).

WO 95/15529 , 2 ~ 7 7 8 ~ ~ PCrlUS94113350 If, after ~liA~nncin~ and accounting for a CE
failure (step 314), path failure (step 316), or IOP
failure (step 318), IOP 232 detpnminpc that system 300 still inrlllApc sufficient non-faulty hardware to remain 5 operational, IOP 232 checks again for major faults (step 302). Because system 230 is triply lG~ l, system 230 can continue to operate even after several _ have failed. For example, to remain operating in an avA11;1hility mode, system 230 only needs to have a single 10 fllnrti~n~l CE, a single functional IOP, and a fllnrti~n path between the two.
Using ~Loce.luLæ 300, each IOP (232, 234, 236) can correctly ~liA~nnse any single failure in a fully operational system 230 or in a system 230 in which one 15 element (that is, a CE, an IOP, or a path) has previously been tli C~hlPrl. In a system 230 in which an element has been IlicAhlpd~ each IOP accounts for CRCs that are not received because of the ~li CAhled element by using values that appear to be correct in comparison to actually 2 0 received CRCs .
PL~eduLe 300 is not dPrPn~lPnt on the particular aLLall, of intëL~ Li t~nc between the CEs and IOPs.
To operate properly, ~Læce-luLæ 300 only requires that the output of each CE be directly monitored by at least two 25 IOPs. Thus, ~Lo~lluLe 300 could be i l~ P~l in a system using any intèlevlllleeL ~ ni ''~ and does not require point to point c~nnPrt; nnC between the CEs and IOPs. For example, the CEs and IOPs could be rrnnPrtPd to at least two local area - ~ Lh:.. In an alternative 3 0 ~ L ua~ instead of summing the CRC values in the rows and columns of the matrix, these values can be - ed and those rows or columns in which the entries do not match can be marked with a match/mismatch indicator.
A simplified version of ~L-,ce.luL~ 300 c~n be 35 i l~ P'3 for use in a system 200. In this ~L~_edULe, wo 95/15529 PcrluS94l13350 : - `" 2 ~ 7~50 each IOP 202 of system 200 generates a two by two matriY
in which each column ~ LL~ S to a CE 204 and each row .:IJLLe:~JlJlldC to a IOP 202:

IOP 202 ' CRC CRC

After generating the matrix, each IOP 202 attache~ a mismatch indicator to each row or column in which the two entries are mi ,uat~ ed.
I~ there are no mismatch indicators, then system 200 is operating correctly.
I~ neither row and both columns have mismatch indicators, then an IOP 202 has faulted. n~rPn~in~ on the operating mode of system 200, an IOP 202 either 15 ~ hl~c another IOP 202 or shuts down system 200. The IOP 202 to be disabled is selected based on user ~rr~
parameters similar to the two av~ hi 1; ty modes used in system lO.
Ir both rows and neither column have mismatch 20 indicators, then a CE 204 has faulted. In this case, IOPs 202 respond by ~ hl in~ a CE 204 if system 200 is operating in an av~ h;lity mode or, if system 200 is operating in an integrity mode, shutting down system 200.
If both rows and one column have mismatch indicators, 25 then one of the paths between the IOPs 202 and the CE 204 CuLL~ inlJ to the mismatched column has failed.
DPrF-n~lin~ on the operating mode of system 200, IOPs 202 either disable the CE 204 having the failed path or shut down system 200. If both rows and both column have 30 mismatch indicators, then multiple faults eYist and IOPs 202 shut down system 200.
I~ one row and both columns have mismatch indicators, then the IOP 202 c ~JLL-~ 1;n~ to the mln..,c.~:1.ed row has faulted. Dpr~n~;ng on the operating 35 mode of system 200, the other IOP 202 either ~ hlP~ the fAulty IOP 202 or shuts down system 200. If one row and WO 9~115529 : '~'Crll~S941133~0 ~ 21 7785~

one column have mismatch indicators, then the path between the IOP 202 ULLP~ in~ to the mismatched row and the CE 204 ~uLLe~ linr~ to the mismatched column has failed. DPr~n~ain~ on the operating mode of system 200, 5 IOPs 202 either account for the failed path in future pro~csi ntJ or shut down system 200 .
Referring to Fig. lO, one ~ of a disaster tolerant system 260 inr~ a7pc two fault tolerant systems 230 located in remote locations and connected by ic~tions link 262, such as Ethernet or fiber, and operating in meta time lockstep with each other. To obtain meta time lockstep, all IPI packets are transmitted between fault tolerant systems 230. Like system 220, system 260 allows for hardware and Sur~war-~
15 upgrades without service interruption.
As shown, the paired modular Ledull~d~lL
architecture of the invention allows for varying levels of fault r~il 7~nre and fault tolerance through use of CEs that operate a~yl.~llLvlluusly in real time and are 20 controlled by IOPs to operate ayl~ llLvllvu~ly in meta time.
This architecture is simple and cost-effective, and can be ~Yr-~n~ed or upgraded with minimal difficulty.
~ b~t i: cl~imed i~:

Claims

1. A method of synchronizing at least two computing elements in a computer system including the at least two computing elements and at least one controller, wherein each of the computing elements have clocks that operate asynchronously of the clocks of the other computing elements, said method comprising the steps of:
selecting one or more signals from a set of signals produced by the computing elements;
monitoring the computing elements to detect the production of a selected signal by one of the computing elements;
waiting for the production of a selected signal by each other computing element after detection of a selected signal by one of the computing elements;
transmitting equal time updates from the at least one controller to each of the computing elements after receipt of selected signals from all of the computing elements; and updating the clocks of the computing elements based on the time updates.

2. The method of claim 1, further comprising the step of forming a fault resilient computer from the at least two computing elements and the at least one controller.

3. The method of claim 1, wherein said selecting step comprises the step of selecting I/O requests as the selected signals.

4. The method of claim 3, further comprising the steps of:
processing the I/O requests at the at least one controller to produce I/O responses; and transmitting the time updates with the I/O
responses from the at least one controller to the at least two computing elements.

5. The method of claim 1, wherein said selecting step comprises the step of selecting quantum interrupts and I/O requests as the selected signals.

6. The method of claim 1, wherein said selecting step comprises the step of selecting quantum interrupts as the selected signals.

7. The method of claim 6, further comprising the step of generating quantum interrupts in each computing element by counting clock cycles in the computing elements.

8. The method of claim 7, wherein the step of counting clock cycles includes counting the cycles of a selected one of a system clock, an I/O clock, and a bus clock.

9. The method of claim 7, further comprising the steps of:
loading a counter in each of the computing elements with a predetermined value;
enabling the counter in each of the computing elements with an I/O request;
decrementing the value of the counter during a clock cycle in each of the computing elements; and signalling a quantum interrupt from a computing element when the value of the counter of the computing element reaches zero.

10. The method of claim 6, further comprising the step of generating quantum interrupts by counting executed instructions in each computing element.

11. The method of claim 6, further comprising the step of using debugging features of each computing element to generate quantum interrupts.

12. The method of claim 1, further comprising the step of maintaining, for each computing element, a list of the selected signals produced by the computing element, wherein the equal time updates are transmitted when the lists for each computing element include a common entry.

13. The method of claim 1, further comprising the steps of:
comparing the selected signals generated by the computing elements and data, if any, accompanying the selected signals, and signalling that a fault has occurred if the selected signals or the accompanying data do not match.

14. The method of claim 1, further comprising the steps of:
stopping operation of each computing element after that computing element produces a selected signal, and resuming operation of a computing element upon receipt by the computing element of a time update.

15. The method of claim 1, further comprising the step of continuing operation of a computing element after producing the selected signal.

16. The method of claim 1, further comprising the steps of:
disabling asynchronous activities of the computing elements; and performing the functions of the asynchronous activities at a computing element when the computing element produces a selected signal.

17. The method of claim 16, wherein said disabling step comprises the step of disabling normal memory refresh functions, and said performing step comprises the step of performing burst memory refreshes when said selected signal is produced.

18. The method of claim 17, wherein said disabling step further comprises the steps of:
placing a counter used in the normal memory refresh functions in a mode that requires an input value to a gate to change, and connecting the gate to a fixed voltage.

19. A method of producing a fault resilient or fault tolerant computer, comprising the steps of:
designating a first processor as a computing element;
designating a second processor as a controller;
connecting the computing element and the controller to produce a modular pair;
connecting at least two modular pairs to produce a fault resilient or fault tolerant computer, wherein each computing element performs all instructions in the same number of cycles as the other computing elements.

20. The method of claim 19, wherein the first and second processors are industry standard processors.

21. The method of claim 19, further including the step of running industry a standard operating systems and applications on the at least two controllers and the at least two computing elements.

22. The method of claim 19, further including the steps of:
running a first operating system on the at least two controllers; and running a second operating system on the at least two computing elements.

23. The method of claim 19, further comprising the step of locating a modular pair remotely from the one or more other modular pairs to provide disaster tolerance.

24. The method of claim 19, further comprising the steps of:
connecting a first I/O device to a first modular pair;
connecting a second I/O device to a second modular pair, said second I/O device being redundant of the first I/O device; and transmitting at least identical I/O write requests and data to the first and second I/O devices.

25. The method of claim 24, further comprising the steps of:
connecting a third I/O device to a third modular pair, said third I/O device being redundant of the first and second I/O devices; and transmitting at least identical I/O write requests and data to the first, second, and third I/O devices.

26. The method of claim 19, further comprising the step of activating an inactive processor by transferring the operational state of an active processor to the inactive processor through a controller.

27. The method of claim 26, further comprising the step of pausing processing by said computing during said transferring step.

28. The method of claim 26, further comprising the step of performing said transferring step as a background process without pausing processing by said computing elements.

29. The method of claim 19, further comprising the step of upgrading a processor while said computing elements are processing by:
disabling a processor to be upgraded;
upgrading the disabled processor; and reactivating the upgraded processor by transferring the operational state of an active processor to the upgraded processor through a controller.

30. The method of claim 19, further comprising the step of repairing a processor while said computing elements are processing by:
disabling a processor to be repaired;
repairing the disabled processor; and reactivating the repaired processor by transferring the operational state of an active processor to the repaired processor through a controller.

31. A method of isolating I/O requests from computing operations in a computer, comprising the steps of:
providing a virtual I/O unit that simulates the interface of a physical I/O device;
intercepting an I/O request by a computing element that is addressed to the physical I/O device;
providing the intercepted I/O request to the virtual I/O unit;
transmitting the contents of the virtual I/O unit to an I/O processor; and at the I/O processor, providing the transmitted contents of the virtual I/O device to the physical I/O
device.

32. The method of claim 31, wherein said providing step includes providing a virtual keyboard.

33. The method of claim 31, wherein said providing step includes providing a virtual display.

34. The method of claim 31, further comprising the step of using the virtual I/O device to expose software errors caused by a software asynchrony that results in I/O race conditions.

35. The method of claim 31, further comprising the steps of:
intercepting a low level I/O request by a computing element;
redirecting the intercepted low level I/O request to the I/O processor;
at the I/O processor, performing the requested I/O
operation to produce I/O results; and returning the I/O results to the computing element.

36. A method of detecting and diagnosing faults in a computer system that includes at least two elements and at least two controllers, wherein each of the computing elements is connected to at least two of the controllers, and each controller is connected to at least two computing elements and to the other controllers, said method comprising the steps of:
producing data at each of the computing elements;
generating a value at each of the computing elements that relates to the produced data;
transmitting the data, along with the corresponding values, from each computing element to the at least two connected controllers;
transmitting the values received by each controller to the other controllers; and performing computations on the values corresponding to each computing element and the values corresponding to each controller;
wherein, when the results of the computations performed on the values corresponding to each controller are equal, and the results of the computations performed on the values corresponding to each computing element are equal, no faults exist.

37. The method of claim 36, further comprising, when the results of the computations performed on the values corresponding to each computing element and the results of the computations performed on the values corresponding to each controller are not equal, the steps of:

comparing, for each one of the computing elements, all of the values corresponding to the one computing element, and designating one of the computing elements as faulty when the values corresponding to each computing element match for each computing element, but mismatch for different computing elements.

38. The method of claim 36, further comprising, when the results of the computations performed on the values corresponding to each computing element and the results of the computations performed on the values corresponding to each controller are not equal, the steps of:
comparing, for each one of the computing elements, all of the values corresponding to the one computing element, and designating a connection to one of the computing elements as faulty when the values corresponding only to the one computing element mismatch.

39. The method of claim 36, further comprising, when the results of the computations performed on the values corresponding to each computing element and the results of the computations performed on the values;
corresponding to each controller are not equal, the steps of:
comparing, for each one of the computing elements, all of the values corresponding to the one computing element, and when the values corresponding to two or more of the computing elements mismatch, designating the controller connected to the two or more computing elements as faulty.

40. A computer system, comprising:
a controller, a first computing element connected to the controller, a second computing element connected to the controller, means for intercepting I/O operations by the first and second computing elements, and means for transmitting the intercepted I/O
operations to the controller, wherein the first computing element performs each instruction of its instruction set in the same number of cycles as the second computing element takes to perform said instruction.

41. The computer system of claim 40, wherein the controller and the first and second computing elements each include an industry standard motherboard.

42. The computer system of claim 40, further comprising a second controller connected to the first controller and to the first and second computing elements.

43. The computer system of claim 42, wherein the first controller and the first computing element are located in a first location and the second controller and the second computing element are located in a second location, and further comprising a communications link connecting the first controller to the second controller, the first controller to the second computing element, and the second controller to the first computing element.

44. The computer system of claim 42, further comprising:

a third controller;
a fourth controller connected to the third controller;
a third computing element connected to the third controller and the fourth controller;
a fourth computing element connected to the third controller and the fourth controller;
means for connecting the third and fourth controllers to the first and second controller; and means for distributing computing tasks between the computing elements, wherein the first and second computing elements perform a first set of computing tasks and the third and fourth computing elements perform a second set of computing tasks wherein the third and fourth computing elements perform each instruction of their instruction sets in the same number of cycles as the first and second computing elements take to perform said instruction.

45. The computer system of claim 42, wherein the first controller and the first computing element are remotely located from the second controller and the second computing element to provide disaster tolerance.

46. The computer system of claim 40, wherein each of said first and second computing elements further comprises means for generating a quantum interrupt.

47. A computer system, comprising:
a first controller;
a second controller connected to the first controller;
a third controller connected to the first and second controllers;

a first computing element connected to the first and second controllers;
a second computing element connected to the second and third controllers; and a third computing element connected to the first and third controllers.

48. The computer system of claim 47, wherein the first controller and the first computing element are remotely located from the other controllers and computing elements to provide disaster tolerance.

49. The computer system of claim 47, further comprising:
means for intercepting I/O operations by the first computing element;
means for transmitting the intercepted I/O from the first computing element to the first and second controllers;
means for intercepting I/O operations by the second computing element;
means for transmitting the intercepted I/O from the second computing element to the second and third controllers;
means for intercepting I/O operations by the third computing element; and means for transmitting the intercepted I/O from the third computing element to the first and third controllers.

50. The computer system of claim 47, further comprising:
a fourth controller;
a fifth controller connected to the fourth controller;

a sixth controller connected to the fourth and fifth controllers;
a fourth computing element connected to the fourth and fifth controllers;
a fifth computing element connected to the fifth and sixth controllers;
a sixth computing element connected to the fourth and sixth controllers; and a communications link for connecting the first, second, and third controllers to the fourth, fifth, and sixth controllers, wherein the first, second, and third controllers, and the first, second, and third computing elements are in a first location and the fourth, fifth, and sixth controllers, and the fourth, fifth, and sixth computing elements are in a second location.