CA2106280A1 - Apparatus and methods for fault-tolerant computing employing a daemon monitoring process and fault-tolerant library to provide varying degrees of fault tolerance

Info

Publication number
CA2106280A1
Authority
CA
Canada
Prior art keywords
fault
tolerant
node
monitor
processes
Prior art date
Legal status
Granted
Application number
CA2106280A
Other languages
French (fr)
Other versions
CA2106280C (en)
Inventor
Yennun Huang
Current Assignee
AT&T Corp
Original Assignee
Yennun Huang
American Telephone And Telegraph Company
Priority date
Filing date
Publication date
Application filed by Yennun Huang, American Telephone And Telegraph Company
Publication of CA2106280A1
Application granted
Publication of CA2106280C
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0715 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1479 Generic software techniques for error detection or fault masking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/203 Failover techniques using migration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2097 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements maintaining the standby controller/processing unit updated

Abstract

Techniques for fault-tolerant computing without fault-tolerant hardware or operating systems. The techniques employ a monitor daemon which is implemented as one or more user processes and a fault-tolerant library which can be bound into application programs. A user process is made fault tolerant by registering it with the monitor daemon. The degree of fault tolerance can be controlled by means of the fault-tolerant library. Included in the fault-tolerant library are functions which define portions of a user process's memory as critical memory, which copy the critical memory to persistent storage, and which restore the critical memory from persistent storage. The monitor daemon monitors fault-tolerant processes, and when such a process hangs or crashes, the daemon restarts it. When the techniques are employed in a multi-node system, the monitor on each node monitors one other node in addition to the processes in its own node. In addition, the monitor may maintain copies of the state of fault-tolerant processes running at least on the monitored node.
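The critical-memory functions described in the abstract (define, copy to persistent storage, restore) can be illustrated with a minimal Python sketch. This is not the patent's library: the class name `CriticalMemory`, its methods `declare`, `checkpoint`, and `restore`, and the use of a JSON file as persistent storage are all assumptions made for illustration.

```python
import json
import os
import tempfile


class CriticalMemory:
    """Hypothetical stand-in for critical-memory calls in a fault-tolerant library."""

    def __init__(self, path):
        self.path = path   # persistent storage location (a plain file in this sketch)
        self.state = {}    # the values that have been declared critical

    def declare(self, name, value):
        """Mark a named value as critical so later checkpoints include it."""
        self.state[name] = value

    def checkpoint(self):
        """Copy critical memory to persistent storage via a temp file and rename."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename over the old checkpoint so a crash never leaves a partial file.
        os.replace(tmp, self.path)

    def restore(self):
        """Reload critical memory from storage; return False if no checkpoint exists."""
        if not os.path.exists(self.path):
            return False
        with open(self.path) as f:
            self.state = json.load(f)
        return True
```

A restarted process would call `restore()` first and fall back to a fresh start when it returns `False`. How often `checkpoint()` is called is one way an application could trade overhead against the amount of work lost on a crash.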
When the monitored node fails, the monitor starts the processes from the monitored node for which the monitor has state on its own node. When a node leaves or rejoins the system, the relationship between monitored and monitoring nodes is automatically reconfigured.
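The node-monitoring relationship and its reconfiguration can be sketched as follows. The abstract says only that each monitor watches one other node and that the relationship is reconfigured when nodes leave or rejoin. Arranging the nodes in a ring by sorted name is an assumption of this sketch, and the function names `build_ring`, `monitor_of`, and `handle_failure` are hypothetical.

```python
def build_ring(nodes):
    """Map each node to the single other node it monitors (its successor in sorted order)."""
    ordered = sorted(nodes)
    if len(ordered) < 2:
        return {}  # a lone node has no other node to monitor
    return {node: ordered[(i + 1) % len(ordered)] for i, node in enumerate(ordered)}


def monitor_of(ring, node):
    """Return the node that monitors `node`, or None if nobody does."""
    for watcher, watched in ring.items():
        if watched == node:
            return watcher
    return None


def handle_failure(nodes, failed):
    """Return (takeover_node, new_ring) after `failed` leaves the system.

    The takeover node is the one that was monitoring the failed node; in the
    scheme described above it would restart the failed node's processes for
    which it holds state.
    """
    ring = build_ring(nodes)
    takeover = monitor_of(ring, failed)
    remaining = [n for n in nodes if n != failed]
    return takeover, build_ring(remaining)
```

For example, with nodes `a`, `b`, and `c`, node `a` monitors `b`; if `b` fails, `a` takes over and the remaining two nodes monitor each other. A rejoining node is handled by calling `build_ring` again with the enlarged node list.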

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US95454992A 1992-09-30 1992-09-30
US954,549 1992-09-30

Publications (2)

Publication Number Publication Date
CA2106280A1 (en) 1994-03-31
CA2106280C (en) 2000-01-18

Family

ID=25495593

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002106280A Expired - Fee Related CA2106280C (en) 1992-09-30 1993-09-15 Apparatus and methods for fault-tolerant computing employing a daemon monitoring process and fault-tolerant library to provide varying degrees of fault tolerance

Country Status (5)

Country Link
US (1) US5748882A (en)
EP (1) EP0590866B1 (en)
JP (1) JP3145236B2 (en)
CA (1) CA2106280C (en)
DE (1) DE69330239T2 (en)

Families Citing this family (131)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2152329C (en) 1994-09-08 1999-02-09 N Dudley Fulton Iii Apparatus and methods for software rejuvenation
IE74214B1 (en) * 1995-02-01 1997-07-16 Darragh Fanning Error prevention in computer systems
JP3448126B2 (en) * 1995-03-13 2003-09-16 株式会社東芝 Information processing apparatus, computer network, and information processing method
JPH08286779A (en) * 1995-04-18 1996-11-01 Fuji Xerox Co Ltd Application automatic restarting device
FI102220B (en) * 1995-10-30 1998-10-30 Nokia Telecommunications Oy Collection of error information when restarting a computer unit
JP2730534B2 (en) * 1995-12-18 1998-03-25 日本電気株式会社 Data backup method and apparatus for data communication network terminal
FR2745100B1 (en) * 1996-02-19 1998-04-17 France Telecom FAULT TRANSPARENCY COMPUTER SYSTEM FOR USER APPLICATIONS
US5974414A (en) * 1996-07-03 1999-10-26 Open Port Technology, Inc. System and method for automated received message handling and distribution
US5958062A (en) * 1997-03-19 1999-09-28 Fujitsu Limited Client/server system and computer system
US5875291A (en) * 1997-04-11 1999-02-23 Tandem Computers Incorporated Method and apparatus for checking transactions in a computer system
US5987250A (en) * 1997-08-21 1999-11-16 Hewlett-Packard Company Transparent instrumentation for computer program behavior analysis
US5911060A (en) * 1997-09-26 1999-06-08 Symantec Corporation Computer method and apparatus for unfreezing an apparently frozen application program being executed under control of an operating system
US6009258A (en) * 1997-09-26 1999-12-28 Symantec Corporation Methods and devices for unwinding stack of frozen program and for restarting the program from unwound state
US5974566A (en) * 1997-10-07 1999-10-26 International Business Machines Corporation Method and apparatus for providing persistent fault-tolerant proxy login to a web-based distributed file service
US5926777A (en) * 1997-10-14 1999-07-20 Nematron Corporation Method and apparatus for monitoring computer system service life parameters
US6035416A (en) * 1997-10-15 2000-03-07 International Business Machines Corp. Method and apparatus for interface dual modular redundancy
JP3450177B2 (en) * 1998-03-20 2003-09-22 富士通株式会社 Network monitoring system and monitored control device
US6304982B1 (en) * 1998-07-14 2001-10-16 Autodesk, Inc. Network distributed automated testing system
US6266781B1 (en) * 1998-07-20 2001-07-24 Academia Sinica Method and apparatus for providing failure detection and recovery with predetermined replication style for distributed applications in a network
US6195760B1 (en) * 1998-07-20 2001-02-27 Lucent Technologies Inc Method and apparatus for providing failure detection and recovery with predetermined degree of replication for distributed applications in a network
US8631066B2 (en) 1998-09-10 2014-01-14 Vmware, Inc. Mechanism for providing virtual machines for use by multiple users
WO2000022527A1 (en) * 1998-10-09 2000-04-20 Sun Microsystems, Inc. Process monitoring in a computer system
US7516453B1 (en) * 1998-10-26 2009-04-07 Vmware, Inc. Binary translator with precise exception synchronization mechanism
US6401216B1 (en) * 1998-10-29 2002-06-04 International Business Machines Corporation System of performing checkpoint/restart of a parallel program
US6393583B1 (en) 1998-10-29 2002-05-21 International Business Machines Corporation Method of performing checkpoint/restart of a parallel program
FR2786285B1 (en) * 1998-11-24 2001-02-02 Secap DEVICE AND METHOD FOR PROTECTING AGAINST STACK OVERFLOWS IN A MEMORY AND FRANKING MACHINE IMPLEMENTING THEM
US7370102B1 (en) * 1998-12-15 2008-05-06 Cisco Technology, Inc. Managing recovery of service components and notification of service errors and failures
US6718376B1 (en) * 1998-12-15 2004-04-06 Cisco Technology, Inc. Managing recovery of service components and notification of service errors and failures
US6871224B1 (en) 1999-01-04 2005-03-22 Cisco Technology, Inc. Facility to transmit network management data to an umbrella management system
US6654801B2 (en) 1999-01-04 2003-11-25 Cisco Technology, Inc. Remote system administration and seamless service integration of a data communication network management system
US6708224B1 (en) * 1999-01-19 2004-03-16 Netiq Corporation Methods, systems and computer program products for coordination of operations for interrelated tasks
US6453430B1 (en) 1999-05-06 2002-09-17 Cisco Technology, Inc. Apparatus and methods for controlling restart conditions of a faulted process
US6401215B1 (en) * 1999-06-03 2002-06-04 International Business Machines Corporation Resynchronization of mirrored logical data volumes subsequent to a failure in data processor storage systems with access to physical volume from multi-initiators at a plurality of nodes
US6718486B1 (en) * 2000-01-26 2004-04-06 David E. Lovejoy Fault monitor for restarting failed instances of the fault monitor
US6594774B1 (en) * 1999-09-07 2003-07-15 Microsoft Corporation Method and apparatus for monitoring computer system objects to improve system reliability
US6389370B1 (en) * 1999-09-14 2002-05-14 Hewlett-Packard Company System and method for determining which objects in a set of objects should be processed
US7010590B1 (en) * 1999-09-15 2006-03-07 Datawire Communications Networks, Inc. System and method for secure transactions over a network
US6480809B1 (en) * 1999-09-23 2002-11-12 Intel Corporation Computer system monitoring
US6735716B1 (en) * 1999-09-27 2004-05-11 Cisco Technology, Inc. Computerized diagnostics and failure recovery
KR100644572B1 (en) * 1999-10-02 2006-11-13 삼성전자주식회사 Device operation detecting apparatus and method in directory serve
AU1074801A (en) * 1999-10-05 2001-05-10 Ejasent Inc. Virtual endpoint
US6990668B1 (en) * 1999-10-20 2006-01-24 International Business Machines Corporation Apparatus and method for passively monitoring liveness of jobs in a clustered computing environment
US7039680B2 (en) 1999-10-20 2006-05-02 International Business Machines Corporation Apparatus and method for timeout-free waiting for an ordered message in a clustered computing environment
US6421688B1 (en) 1999-10-20 2002-07-16 Parallel Computers Technology, Inc. Method and apparatus for database fault tolerance with instant transaction replication using off-the-shelf database servers and low bandwidth networks
US6505298B1 (en) * 1999-10-25 2003-01-07 International Business Machines Corporation System using an OS inaccessible interrupt handler to reset the OS when a device driver failed to set a register bit indicating OS hang condition
US6662310B2 (en) 1999-11-10 2003-12-09 Symantec Corporation Methods for automatically locating url-containing or other data-containing windows in frozen browser or other application program, saving contents, and relaunching application program with link to saved data
US6631480B2 (en) 1999-11-10 2003-10-07 Symantec Corporation Methods and systems for protecting data from potential corruption by a crashed computer program
US6630946B2 (en) 1999-11-10 2003-10-07 Symantec Corporation Methods for automatically locating data-containing windows in frozen applications program and saving contents
US6567937B1 (en) * 1999-11-17 2003-05-20 Isengard Corporation Technique for remote state notification and software fault recovery
US6594784B1 (en) * 1999-11-17 2003-07-15 International Business Machines Corporation Method and system for transparent time-based selective software rejuvenation
US6629266B1 (en) 1999-11-17 2003-09-30 International Business Machines Corporation Method and system for transparent symptom-based selective software rejuvenation
GB2359386B (en) * 2000-02-16 2004-08-04 Data Connection Ltd Replicated control block handles for fault-tolerant computer systems
GB2359384B (en) 2000-02-16 2004-06-16 Data Connection Ltd Automatic reconnection of partner software processes in a fault-tolerant computer system
US7814309B1 (en) * 2000-02-29 2010-10-12 Cisco Technology, Inc. Method for checkpointing and reconstructing separated but interrelated data
US7225244B2 (en) * 2000-05-20 2007-05-29 Ciena Corporation Common command interface
US6654903B1 (en) 2000-05-20 2003-11-25 Equipe Communications Corporation Vertical fault isolation in a computer system
US6983362B1 (en) 2000-05-20 2006-01-03 Ciena Corporation Configurable fault recovery policy for a computer system
US6715097B1 (en) 2000-05-20 2004-03-30 Equipe Communications Corporation Hierarchical fault management in computer systems
US6691250B1 (en) 2000-06-29 2004-02-10 Cisco Technology, Inc. Fault handling process for enabling recovery, diagnosis, and self-testing of computer systems
US6687849B1 (en) 2000-06-30 2004-02-03 Cisco Technology, Inc. Method and apparatus for implementing fault-tolerant processing without duplicating working process
US6718538B1 (en) * 2000-08-31 2004-04-06 Sun Microsystems, Inc. Method and apparatus for hybrid checkpointing
US7587499B1 (en) * 2000-09-14 2009-09-08 Joshua Haghpassand Web-based security and filtering system with proxy chaining
US8972590B2 (en) 2000-09-14 2015-03-03 Kirsten Aldrich Highly accurate security and filtering software
DE10101754C2 (en) 2001-01-16 2003-02-20 Siemens Ag Procedure for the automatic restoration of data in a database
US7225361B2 (en) * 2001-02-28 2007-05-29 Wily Technology, Inc. Detecting a stalled routine
US6952766B2 (en) * 2001-03-15 2005-10-04 International Business Machines Corporation Automated node restart in clustered computer system
US6810495B2 (en) 2001-03-30 2004-10-26 International Business Machines Corporation Method and system for software rejuvenation via flexible resource exhaustion prediction
US6918051B2 (en) * 2001-04-06 2005-07-12 International Business Machines Corporation Node shutdown in clustered computer system
US6922796B1 (en) * 2001-04-11 2005-07-26 Sun Microsystems, Inc. Method and apparatus for performing failure recovery in a Java platform
US6928585B2 (en) * 2001-05-24 2005-08-09 International Business Machines Corporation Method for mutual computer process monitoring and restart
US7000100B2 (en) * 2001-05-31 2006-02-14 Hewlett-Packard Development Company, L.P. Application-level software watchdog timer
US8423674B2 (en) * 2001-06-02 2013-04-16 Ericsson Ab Method and apparatus for process sync restart
US7409420B2 (en) * 2001-07-16 2008-08-05 Bea Systems, Inc. Method and apparatus for session replication and failover
US7702791B2 (en) * 2001-07-16 2010-04-20 Bea Systems, Inc. Hardware load-balancing apparatus for session replication
US6925582B2 (en) * 2001-08-01 2005-08-02 International Business Machines Corporation Forwarding of diagnostic messages in a group
US7003775B2 (en) * 2001-08-17 2006-02-21 Hewlett-Packard Development Company, L.P. Hardware implementation of an application-level watchdog timer
US20030065970A1 (en) * 2001-09-28 2003-04-03 Kadam Akshay R. System and method for creating fault tolerant applications
US7051331B2 (en) * 2002-01-02 2006-05-23 International Business Machines Corporation Methods and apparatus for monitoring a lower priority process by a higher priority process
AU2003205209A1 (en) * 2002-01-18 2003-09-02 Idetic, Inc. Method and computer system for protecting software components of an application against faults
US7168008B2 (en) * 2002-01-18 2007-01-23 Mobitv, Inc. Method and system for isolating and protecting software components
US7392302B2 (en) * 2002-02-21 2008-06-24 Bea Systems, Inc. Systems and methods for automated service migration
US7124320B1 (en) * 2002-08-06 2006-10-17 Novell, Inc. Cluster failover via distributed configuration repository
JP3749208B2 (en) * 2002-08-14 2006-02-22 株式会社東芝 Process migration method, computer
US7096383B2 (en) 2002-08-29 2006-08-22 Cosine Communications, Inc. System and method for virtual router failover in a network routing system
US7089450B2 (en) * 2003-04-24 2006-08-08 International Business Machines Corporation Apparatus and method for process recovery in an embedded processor system
US7287179B2 (en) * 2003-05-15 2007-10-23 International Business Machines Corporation Autonomic failover of grid-based services
KR100435985B1 (en) * 2004-02-25 2004-06-12 엔에이치엔(주) Nonstop service system using voting and, information updating and providing method in the same
US8190714B2 (en) * 2004-04-15 2012-05-29 Raytheon Company System and method for computer cluster virtualization using dynamic boot images and virtual disk
US9178784B2 (en) 2004-04-15 2015-11-03 Raytheon Company System and method for cluster management based on HPC architecture
US8336040B2 (en) * 2004-04-15 2012-12-18 Raytheon Company System and method for topology-aware job scheduling and backfilling in an HPC environment
US8335909B2 (en) 2004-04-15 2012-12-18 Raytheon Company Coupling processors to each other for high performance computing (HPC)
US20050235055A1 (en) * 2004-04-15 2005-10-20 Raytheon Company Graphical user interface for managing HPC clusters
US7711977B2 (en) * 2004-04-15 2010-05-04 Raytheon Company System and method for detecting and managing HPC node failure
US7386761B2 (en) * 2004-04-16 2008-06-10 International Business Machines Corporation Diagnostic repair system and method for computing systems
US8140653B1 (en) * 2004-06-25 2012-03-20 Avaya Inc. Management of a multi-process system
US7360123B1 (en) * 2004-06-30 2008-04-15 Symantec Operating Corporation Conveying causal relationships between at least three dimensions of recovery management
FR2872605B1 (en) * 2004-06-30 2006-10-06 Meiosys Sa METHOD FOR MANAGING SOFTWARE PROCESS, METHOD AND SYSTEM FOR REDISTRIBUTION OR CONTINUITY OF OPERATION IN MULTI-COMPUTER ARCHITECTURE
US8122280B2 (en) 2004-08-26 2012-02-21 Open Invention Network, Llc Method and system for providing high availability to computer applications
US7925932B1 (en) * 2004-09-20 2011-04-12 Symantec Operating Corporation Method and appartus for detecting an application process failure
DE102004050350B4 (en) * 2004-10-15 2006-11-23 Siemens Ag Method and device for redundancy control of electrical devices
US7433931B2 (en) * 2004-11-17 2008-10-07 Raytheon Company Scheduling in a high-performance computing (HPC) system
US8244882B2 (en) * 2004-11-17 2012-08-14 Raytheon Company On-demand instantiation in a high-performance computing (HPC) system
US7475274B2 (en) * 2004-11-17 2009-01-06 Raytheon Company Fault tolerance and recovery in a high-performance computing (HPC) system
US7549087B2 (en) * 2005-03-29 2009-06-16 Microsoft Corporation User interface panel for hung applications
US7613957B2 (en) * 2005-04-06 2009-11-03 Microsoft Corporation Visual indication for hung applications
US8078919B2 (en) * 2005-06-14 2011-12-13 Hitachi Global Storage Technologies Netherlands B.V. Method, apparatus and program storage device for managing multiple step processes triggered by a signal
EP1960930A4 (en) * 2005-11-30 2009-12-16 Kelsey Hayes Co Microprocessor memory management
US7574591B2 (en) * 2006-01-12 2009-08-11 Microsoft Corporation Capturing and restoring application state after unexpected application shutdown
US7716461B2 (en) * 2006-01-12 2010-05-11 Microsoft Corporation Capturing and restoring application state after unexpected application shutdown
JP2007265137A (en) * 2006-03-29 2007-10-11 Oki Electric Ind Co Ltd Multi-task processing method and multi-task processing apparatus
US7657787B2 (en) * 2006-04-11 2010-02-02 Hewlett-Packard Development Company, L.P. Method of restoring communication state of process
WO2008018969A1 (en) * 2006-08-04 2008-02-14 Parallel Computers Technology, Inc. Apparatus and method of optimizing database clustering with zero transaction loss
US9384159B2 (en) * 2007-05-24 2016-07-05 International Business Machines Corporation Creating a checkpoint for a software partition in an asynchronous input/output environment
US7934129B2 (en) * 2008-09-05 2011-04-26 Microsoft Corporation Network hang recovery
US7996722B2 (en) * 2009-01-02 2011-08-09 International Business Machines Corporation Method for debugging a hang condition in a process without affecting the process state
US8352788B2 (en) * 2009-07-20 2013-01-08 International Business Machines Corporation Predictive monitoring with wavelet analysis
US8566634B2 (en) * 2009-12-18 2013-10-22 Fujitsu Limited Method and system for masking defects within a network
US10191796B1 (en) * 2011-01-31 2019-01-29 Open Invention Network, Llc System and method for statistical application-agnostic fault detection in environments with data trend
WO2012127599A1 (en) * 2011-03-22 2012-09-27 富士通株式会社 Input/output control device, information processing system, and log sampling program
US20130246363A1 (en) * 2012-03-15 2013-09-19 Ellen L. Sorenson Idle point auditing for databases
US8843930B2 (en) * 2012-07-10 2014-09-23 Sap Ag Thread scheduling and control framework
US9378180B1 (en) 2013-06-27 2016-06-28 Emc Corporation Unified SCSI target management for starting and configuring a service daemon in a deduplication appliance
US9378160B1 (en) 2013-06-27 2016-06-28 Emc Corporation Unified SCSI target management for shutting down and de-configuring a service daemon in a deduplication appliance
US9390034B1 (en) 2013-06-27 2016-07-12 Emc Corporation Unified SCSI target management for performing a delayed shutdown of a service daemon in a deduplication appliance
US9384151B1 (en) * 2013-06-27 2016-07-05 Emc Corporation Unified SCSI target management for managing a crashed service daemon in a deduplication appliance
US10430240B2 (en) 2015-10-13 2019-10-01 Palantir Technologies Inc. Fault-tolerant and highly-available configuration of distributed services
CN106445721B (en) * 2016-10-11 2019-07-12 Oppo广东移动通信有限公司 Method and mobile terminal for watchdog fault-tolerant processing
DE202018006861U1 (en) * 2017-10-30 2023-11-22 Dexcom, Inc. Diabetes management partner interface for wireless communication of analyte data
CN109634769B (en) * 2018-12-13 2021-11-09 郑州云海信息技术有限公司 Fault-tolerant processing method, device, equipment and storage medium in data storage
CN111124678A (en) * 2019-12-18 2020-05-08 青岛海尔科技有限公司 Memory scheduling processing method and device
JP2022131846A (en) * 2021-02-26 2022-09-07 ミネベアミツミ株式会社 motor

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB1434186A (en) * 1972-04-26 1976-05-05 Gen Electric Co Ltd Multiprocessor computer systems
US4539655A (en) * 1982-03-16 1985-09-03 Phoenix Digital Corporation Microcomputer based distributed control network
US4635258A (en) * 1984-10-22 1987-01-06 Westinghouse Electric Corp. System for detecting a program execution fault
US4989133A (en) * 1984-11-30 1991-01-29 Inmos Limited System for executing, scheduling, and selectively linking time dependent processes based upon scheduling time thereof
FR2602891B1 (en) * 1986-08-18 1990-12-07 Nec Corp ERROR CORRECTION SYSTEM OF A MULTIPROCESSOR SYSTEM FOR CORRECTING AN ERROR IN A PROCESSOR BY PUTTING THE PROCESSOR INTO CONTROL CONDITION AFTER COMPLETION OF THE MICROPROGRAM RESTART FROM A RESUMPTION POINT
US4819159A (en) * 1986-08-29 1989-04-04 Tolerant Systems, Inc. Distributed multiprocess transaction processing system and method
US5109329A (en) * 1987-02-06 1992-04-28 At&T Bell Laboratories Multiprocessing method and arrangement
US5003466A (en) * 1987-02-06 1991-03-26 At&T Bell Laboratories Multiprocessing method and arrangement
US4805107A (en) * 1987-04-15 1989-02-14 Allied-Signal Inc. Task scheduler for a fault tolerant multiple node processing system
US4868818A (en) * 1987-10-29 1989-09-19 The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration Fault tolerant hypercube computer system architecture
JPH0642215B2 (en) * 1988-01-28 1994-06-01 インターナシヨナル・ビジネス・マシーンズ・コーポレーシヨン Distributed monitoring subsystem
US5050070A (en) * 1988-02-29 1991-09-17 Convex Computer Corporation Multi-processor computer system having self-allocating processors
US4979105A (en) * 1988-07-19 1990-12-18 International Business Machines Method and apparatus for automatic recovery from excessive spin loops in an N-way multiprocessing system
DE69031233T2 (en) * 1989-02-24 1997-12-04 At & T Corp Adaptive work sequence planning for multiple processing systems
US5257369A (en) * 1990-10-22 1993-10-26 Skeen Marion D Apparatus and method for providing decoupling of data exchange details for providing high performance communication between software processes
US5295258A (en) * 1989-12-22 1994-03-15 Tandem Computers Incorporated Fault-tolerant computer system with online recovery and reintegration of redundant components
JP2752764B2 (en) * 1990-03-08 1998-05-18 日本電気株式会社 Failure handling method
US5363502A (en) * 1990-06-08 1994-11-08 Hitachi, Ltd. Hot stand-by method and computer system for implementing hot stand-by method
EP0470322B1 (en) * 1990-08-07 1996-04-03 BULL HN INFORMATION SYSTEMS ITALIA S.p.A. Message-based debugging method
US5157663A (en) * 1990-09-24 1992-10-20 Novell, Inc. Fault tolerant computer system
JP3189903B2 (en) * 1991-06-03 2001-07-16 富士通株式会社 Device with capability saving / restoring mechanism
JPH05128080A (en) * 1991-10-14 1993-05-25 Mitsubishi Electric Corp Information processor
US5363503A (en) * 1992-01-22 1994-11-08 Unisys Corporation Fault tolerant computer system with provision for handling external events

Also Published As

Publication number Publication date
DE69330239T2 (en) 2001-11-29
JP3145236B2 (en) 2001-03-12
EP0590866B1 (en) 2001-05-23
EP0590866A2 (en) 1994-04-06
DE69330239D1 (en) 2001-06-28
JPH06202893A (en) 1994-07-22
EP0590866A3 (en) 1997-01-22
US5748882A (en) 1998-05-05
CA2106280C (en) 2000-01-18

Similar Documents

Publication Publication Date Title
CA2106280A1 (en) Apparatus and methods for fault-tolerant computing employing a daemon monitoring process and fault-tolerant library to provide varying degrees of fault tolerance
DE69907818T2 (en) Method and device for error detection and recovery with predetermined type of replication for distributed applications in a network
DE69907824T2 (en) Method and device for error detection and recovery with a predetermined degree of replication for distributed applications in a network
CA2288016A1 (en) Method and system for recovery in a partitioned shared nothing database system using virtual shared disks
Bernstein Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing
AU2002231167B2 (en) Method of "split-brain" prevention in computer cluster systems
CA2323106A1 (en) File server storage arrangement
EP0990986A3 (en) Failure recovery of partitioned computer systems including a database schema
CA2150059A1 (en) Progressive Retry Method and Apparatus Having Reusable Software Modules for Software Failure Recovery in Multi-Process Message-Passing Applications
AU2045600A (en) Method and system for object oriented software recovery
DE69614623D1 (en) Fault-tolerant multiple network server
IE820752L (en) Computer control system
CA2053344A1 (en) Method and system increasing the operational availability of a system of computer programs operating in a distributed system of computers
DE69311797T2 (en) FAULT-TOLERANT COMPUTER SYSTEM WITH DEVICE FOR PROCESSING EXTERNAL EVENTS
WO2001095315A3 (en) Data storage system and process
Huang et al. NT-SwiFT: Software implemented fault tolerance on Windows NT
Huang et al. Components for software fault tolerance and rejuvenation
Liao et al. Partial replication of metadata to achieve high metadata availability in parallel file systems
EP0632381A3 (en) Fault-tolerant computer systems.
JPS5640901A (en) Backup method of process control
Bhide et al. Implicit replication in a network file server
Crane et al. Failure and its Recovery in an Object-Oriented Distributed System
Apodaca Enabling Fast Recovery For Autonomous Vehicle Systems With Linux Container Checkpointing
HINKEY et al. A fault recovery mechanism using logical bus addressing
Mishra et al. Protocol modularity in systems for managing replicated data

Legal Events

Date Code Title Description
EEER Examination request
MKLA Lapsed