CA2319214A1 - Method for improving system availability after the failure of processors in a processor platform - Google Patents
- Publication number
- CA2319214A1
- Authority
- CA
- Canada
- Prior art keywords
- processor
- processors
- chain
- data
- logical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1405—Saving, restoring, recovering or retrying at machine instruction level
- G06F11/1407—Checkpointing the instruction stream
Abstract
When tasks are processed in a processor platform, the processing procedure is distributed over a logical chain of several processors of said processor platform. When one of the processors fails, the data is lost and the entire chain is left blocked for a considerable period. According to the inventive method, this problem is solved by forming another chain in which significant status-related data is transferred to the following processor. When restarted, the processor which failed can then reload said data and resume the status that it had before the failure.
Description
Method for improving system availability after the failure of processors in a processor platform.
The invention relates to a method in accordance with the preamble to patent claim 1.
Contemporary communication systems have a plurality of processors which interact with one another to process particular tasks or subtasks. Such a plurality of processors is also called a processor platform. The platform is administratively defined before the communication system is put into operation.
During operation of the communications system, one of the processors in the processor platform accepts the task to be processed, with data required for this purpose, and carries out a first processing operation.
According to the result, a further processor is then driven, to which the result of the first processing operation is then supplied. For its part, this processor then carries out further processing operations and transfers the ascertained result possibly to a further processor. The processing steps of a subsequent processor thus depend directly on the result of the predecessor. This forms a logical chain generally including a plurality of processors in the processor platform. These processors form a subset of all the processors in the processor platform.
The problem with such an arrangement is that, if only one of the processors in this logical chain fails, the task can no longer be processed. In this case, under some circumstances, processing of the task cannot even be terminated, because the task is not recognized as being such if data which is essential for this purpose has been lost during the failure. It does mean, however, that this logical chain of processors remains blocked for the processing of further tasks.
In the prior art, these failures are handled in a cyclical time frame by starting monitoring programs or audits which examine the processors in a processor platform for hardware and software errors. As a rule, these monitoring and checking operations are carried out at a time when there is little traffic. The interval between two checks can therefore be very long, and the incorrect response remains unnoticed for the duration of this interval.
The publication Krishna Kumar R. et al.: "A Fault Tolerant Multi-Transputer Architecture", Microprocessors and Microsystems, vol. 17, No. 2, January 1, 1993, pages 75-81, XP000355542, describes a method for improving system availability. The configuration mentioned therein has a central control device. This central control device checks and controls a chain formed by a plurality of processors. If one of the processors fails, said central control device takes said processor out of operation using a switching network. The processor adjacent to the failed processor then takes on the tasks of the failed processor. This is possible because the applications being discussed there contain processor-neutral data which can be processed by each of the processors. To this extent, what is involved is a rigid configuration which cannot be changed at any time to suit the requirements of the tasks to be computed.
The invention is based on the object of indicating a way in which the failure of one or more processors in a processor platform can be handled efficiently in order to increase the dynamics of the system.
On the basis of the preamble to patent claim 1, this object is achieved by the characterizing features thereof.
The particular advantage of the invention is the formation of a further logical chain of processors superimposed on the first logical chain. In this arrangement, significant data from a processor arranged in this chain is transferred to the next processor in this chain. This occurs irrespective of which of the processors in the first logical chain is having the result of the processing transferred to it. This has the associated advantage that, when restarted, a failed processor can load back this significant data directly from the next processor in this chain again, and it thus has a portrayal of the data as before the failure.
Advantageous developments of the invention are specified in the subclaims.
The invention is explained in more detail below with the aid of an illustrative embodiment.
FIGURE 1 shows a processor platform having a total of 30 processors, and FIGURE 2 shows a linear chain of processors.
Figure 1 shows, by way of example, 30 processors P1...P30 in a processor platform. For reliability reasons, all the processors are duplicated so that, in the event of one processor failing, a switch can be made to the processor arranged as its redundancy processor, all of said processors being intermeshed via connecting lines. The processors P1, P10, P15, P28 are then intended to process a waiting task, and they thus form a first logical chain in the relevant processor platform. The waiting task is to be the setting-up of a connection.
As Figure 2 shows, a provision of the invention is that the processors P1...P30 are arranged in a second logical chain. According to the present illustrative embodiment, the start of this chain is thus formed by the processor P1. As the further element of this chain, said processor P1 is followed by the processor P2, and so on. The end of the chain is formed by the processor P30.
According to the present illustrative embodiment, the processor platform is thus intended to have the task of setting up a connection. To this end, this task and data which is necessary for this are supplied to one of the processors in the first logical chain of processors. By way of example, this is to be processor P1.
The task is split up into subtasks, with each subtask running on one of the processors P10, P15, P28 integrated in the processing procedure. In this arrangement, the subsequent processor in the chain is dependent on the other processors' preprocessing.
The processor P1 now processes the first subtask. According to the result of the processing procedure, the data defining this result is then supplied to the processor P10, which carries out a further processing operation before the data is supplied to the processors P15 and P28 and leaves the chain again.
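The hand-over along the first logical chain can be pictured as a simple pipeline. The following is only a minimal sketch: the handler functions, the `trace` field, and the task dictionary are hypothetical and stand in for subtask logic that the description does not specify.

```python
from typing import Callable, Dict, List, Tuple

# A stage pairs a processor name with a (hypothetical) subtask handler.
Stage = Tuple[str, Callable[[Dict], Dict]]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records which processor ran."""
    def handler(data: Dict) -> Dict:
        # Each processor works on the result of its predecessor.
        return {**data, "trace": data.get("trace", []) + [name]}
    return name, handler


# First logical chain from the illustrative embodiment.
FIRST_CHAIN: List[Stage] = [make_stage(n) for n in ("P1", "P10", "P15", "P28")]


def run_first_chain(stages: List[Stage], task: Dict) -> Dict:
    """Pass the task through every stage in order and return the final result."""
    data = dict(task)
    for _name, handler in stages:
        data = handler(data)
    return data


result = run_first_chain(FIRST_CHAIN, {"task": "set up connection"})
```

Because every stage consumes the output of the one before it, losing the state of any single stage stalls the whole chain, which is the problem the second logical chain addresses.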
A provision of the invention is that significant data from the processor P1 is now transmitted to the processor P2 connected downstream in the second logical chain. The significant data is intended to be data which represents a representative portrayal of the physical and logical states assumed by the processor P1. Furthermore, the significant data describes the current state of the relevant task currently being processed in the processor P1.
Similarly, the next processors in the second logical chain are supplied with significant data from the processor connected upstream. The processor P11 thus stores significant data from the processor P10, the processor P23 stores significant data from the processor P22 and so on. The significant data can be supplied at the same time as the result is transmitted to the processor connected next in the first logical chain.
This mode of procedure is not obligatory, however. A cyclical time interval between the processing procedures is also conceivable in this case. The significant data is deleted again when processing of the task in the subsequent processor has ended.
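The transfer of significant data along the second logical chain might be sketched as follows. This is an illustration under assumptions, not the patented implementation: the `Processor` class, the `held` field, and the function names are invented for the example, and returning no successor for P30 is an assumption because the description does not say what follows the end of the chain.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Processor:
    name: str
    state: dict = field(default_factory=dict)  # this processor's own working state
    held: Optional[dict] = None                # significant data kept for the upstream neighbour


def build_second_chain(count: int = 30) -> List[Processor]:
    """Arrange P1...P30 in a fixed linear order (the second logical chain)."""
    return [Processor(f"P{i}") for i in range(1, count + 1)]


def successor(chain: List[Processor], index: int) -> Optional[Processor]:
    """Next processor in the second chain; None at the end (assumption)."""
    return chain[index + 1] if index + 1 < len(chain) else None


def checkpoint(chain: List[Processor], index: int) -> None:
    """Copy the significant data of chain[index] to its successor."""
    nxt = successor(chain, index)
    if nxt is not None:
        nxt.held = copy.deepcopy(chain[index].state)


def release(chain: List[Processor], index: int) -> None:
    """Delete the stored copy once the task has moved on."""
    nxt = successor(chain, index)
    if nxt is not None:
        nxt.held = None


chain = build_second_chain()
chain[9].state = {"task": "connection setup", "step": 2}  # P10 sits at index 9
checkpoint(chain, 9)
stored_on = chain[10].name   # the processor now holding P10's data
stored = chain[10].held
release(chain, 9)
```

The copy is made with `copy.deepcopy` so that later changes to the working state of P10 do not silently alter the snapshot held by P11.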
According to the present illustrative embodiment, it is now assumed that one of the processors fails together with the processor which is arranged as a redundant processor.
By way of example, this is to be processor P15. In this case, the data which was just being processed is lost and can no longer be supplied to the processor P28 for further processing.
The processor P15 is now started up again directly after the failure. For this purpose, the significant data supplied to the processor P16 is stored back in the processor P15 again. This means that the knowledge before the failure is then present in the processor P15 again, and processing of the task can be continued. The result obtained is then supplied to the processor P28. Consequently, the gap produced by the failure in the first logical chain is closed again.
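The failure and restart of P15 can be walked through in a few lines. Again this is a hedged sketch: the dictionary layout, the `state` and `held` keys, and the helper names are hypothetical, and the case of P30 (which has no successor in the example) is deliberately not handled.

```python
import copy

# Hypothetical platform model: each processor has its own state and a slot
# holding the significant data of its upstream neighbour in the second chain.
platform = {f"P{i}": {"state": None, "held": None} for i in range(1, 31)}


def successor_name(name: str) -> str:
    """Name of the next processor in the second logical chain (P15 -> P16)."""
    return f"P{int(name[1:]) + 1}"


def checkpoint(name: str) -> None:
    """Store a copy of the processor's significant data on its successor."""
    platform[successor_name(name)]["held"] = copy.deepcopy(platform[name]["state"])


def fail(name: str) -> None:
    """Simulate a failure: the processor loses its working state."""
    platform[name]["state"] = None


def restart(name: str) -> dict:
    """On restart, load the significant data back from the successor."""
    platform[name]["state"] = copy.deepcopy(platform[successor_name(name)]["held"])
    return platform[name]["state"]


platform["P15"]["state"] = {"subtask": "third", "progress": "in work"}
checkpoint("P15")            # P16 now holds P15's significant data
fail("P15")
state_after_failure = platform["P15"]["state"]
restored = restart("P15")    # P15 resumes with its pre-failure state
```

After `restart`, P15 can continue its subtask and pass the result on to P28, which is how the gap in the first chain is closed in the description.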
Claims (3)
1. A method for improving system availability after the failure of processors in a processor platform, having at least one processor platform formed by a plurality of processors (P1...P30), where a prescribed task is processed by some of these processors (P1, P10, P15, P28) by the task being split into subtasks which are each processed on one of the processors (P1, P10, P15, P28), thus forming a first logical chain (K1) for the duration of the task's processing, characterized in that a second logical chain (K2) comprising all the processors (P1...P30) in the processor platform is permanently formed, in which physical and logical processor data and data describing the task's current processing state from a processor arranged in this chain (K2) are transferred to the next processor in this chain (K2), and in that, when a failed processor is restarted, the significant data is loaded back from the next processor in the second logical chain (K2) again.
2. The method as claimed in claim 1, characterized in that, when a failed processor is restarted, the physical and logical processor data and the data describing the task's current processing state are loaded back from the next processor in the second logical chain (K2) again.
3. The method as claimed in claim 1 or 2, characterized in that the physical and logical processor data and the data describing the task's current processing state are deleted when processing of the task in the subsequent processor has ended.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE19801992A DE19801992C2 (en) | 1998-01-20 | 1998-01-20 | Process for improving system availability after processor processor failure |
DE19801992.0 | 1998-01-20 | ||
PCT/DE1999/000125 WO1999038077A1 (en) | 1998-01-20 | 1999-01-19 | Method for improving system availability following the failure of the processors of a processor platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CA2319214A1 (en) | 1999-07-29 |
Family
ID=7855150
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA002319214A Abandoned CA2319214A1 (en) | 1998-01-20 | 1999-01-19 | Method for improving system availability after the failure of processors in a processor platform |
Country Status (6)
Country | Link |
---|---|
US (1) | US6625752B1 (en) |
EP (1) | EP1049978B1 (en) |
CA (1) | CA2319214A1 (en) |
DE (2) | DE19801992C2 (en) |
ES (1) | ES2198925T3 (en) |
WO (1) | WO1999038077A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE19801992C2 (en) | 1998-01-20 | 2000-07-06 | Siemens Ag | Process for improving system availability after processor processor failure |
US6999994B1 (en) * | 1999-07-01 | 2006-02-14 | International Business Machines Corporation | Hardware device for processing the tasks of an algorithm in parallel |
JP5948933B2 (en) * | 2012-02-17 | 2016-07-06 | 日本電気株式会社 | Job continuation management apparatus, job continuation management method, and job continuation management program |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4521847A (en) | 1982-09-21 | 1985-06-04 | Xerox Corporation | Control system job recovery after a malfunction |
US5271013A (en) * | 1990-05-09 | 1993-12-14 | Unisys Corporation | Fault tolerant computer system |
US5214652A (en) * | 1991-03-26 | 1993-05-25 | International Business Machines Corporation | Alternate processor continuation of task of failed processor |
US5815651A (en) * | 1991-10-17 | 1998-09-29 | Digital Equipment Corporation | Method and apparatus for CPU failure recovery in symmetric multi-processing systems |
US5513354A (en) * | 1992-12-18 | 1996-04-30 | International Business Machines Corporation | Fault tolerant load management system and method |
JP2846837B2 (en) * | 1994-05-11 | 1999-01-13 | インターナショナル・ビジネス・マシーンズ・コーポレイション | Software-controlled data processing method for early detection of faults |
JPH0887341A (en) * | 1994-09-16 | 1996-04-02 | Fujitsu Ltd | Computer system with automatic degeneratively starting-up function |
US5649088A (en) * | 1994-12-27 | 1997-07-15 | Lucent Technologies Inc. | System and method for recording sufficient data from parallel execution stages in a central processing unit for complete fault recovery |
JP3196004B2 (en) * | 1995-03-23 | 2001-08-06 | 株式会社日立製作所 | Failure recovery processing method |
JP3247043B2 (en) * | 1996-01-12 | 2002-01-15 | 株式会社日立製作所 | Information processing system and logic LSI for detecting failures using internal signals |
US5758051A (en) * | 1996-07-30 | 1998-05-26 | International Business Machines Corporation | Method and apparatus for reordering memory operations in a processor |
DE19801992C2 (en) | 1998-01-20 | 2000-07-06 | Siemens Ag | Process for improving system availability after processor processor failure |
1998
- 1998-01-20 DE DE19801992A patent/DE19801992C2/en not_active Expired - Lifetime
1999
- 1999-01-19 EP EP99932437A patent/EP1049978B1/en not_active Expired - Lifetime
- 1999-01-19 ES ES99932437T patent/ES2198925T3/en not_active Expired - Lifetime
- 1999-01-19 WO PCT/DE1999/000125 patent/WO1999038077A1/en active IP Right Grant
- 1999-01-19 DE DE59905317T patent/DE59905317D1/en not_active Expired - Fee Related
- 1999-01-19 CA CA002319214A patent/CA2319214A1/en not_active Abandoned
- 1999-01-19 US US09/600,715 patent/US6625752B1/en not_active Expired - Lifetime
Also Published As
Publication number | Publication date |
---|---|
EP1049978A1 (en) | 2000-11-08 |
DE59905317D1 (en) | 2003-06-05 |
EP1049978B1 (en) | 2003-05-02 |
ES2198925T3 (en) | 2004-02-01 |
DE19801992A1 (en) | 1999-08-05 |
DE19801992C2 (en) | 2000-07-06 |
WO1999038077A1 (en) | 1999-07-29 |
US6625752B1 (en) | 2003-09-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6438707B1 (en) | Fault tolerant computer system | |
US6058490A (en) | Method and apparatus for providing scaleable levels of application availability | |
US4628508A (en) | Computer of processor control systems | |
US5721817A (en) | Control method and apparatus for dynamically switching a logical session | |
JP4315016B2 (en) | System switching method for computer system | |
AU718427B2 (en) | Method and apparatus for handling processing errors in telecommunications exchanges | |
CN107480014A (en) | A kind of High Availabitity equipment switching method and device | |
CN1322422C (en) | Automatic startup of cluster system after occurrence of recoverable error | |
US20020073409A1 (en) | Telecommunications platform with processor cluster and method of operation thereof | |
CA2319214A1 (en) | Method for improving system availability after the failure of processors in a processor platform | |
CN113568707A (en) | Computer control method and system of ocean platform based on container technology | |
CN1812341A (en) | Controlling service failover in clustered storage apparatus networks and opration method thereof | |
JP3447347B2 (en) | Failure detection method | |
JP3394189B2 (en) | Uninterrupted update system for program / data of any processor | |
US5455940A (en) | Method for abnormal restart of a multiprocessor computer of a telecommunication switching system | |
JPH0683657A (en) | Service processor switching system | |
US6173249B1 (en) | Method of determining termination of a process under a simulated operating system | |
WO2019216210A1 (en) | Service continuation system and service continuation method | |
JPH0736721A (en) | Control system for multiplex computer system | |
US7213167B1 (en) | Redundant state machines in network elements | |
US11954509B2 (en) | Service continuation system and service continuation method between active and standby virtual servers | |
KR100309678B1 (en) | Process Monitoring and Failure Recovery | |
JPH0630069B2 (en) | Multiplexing system | |
JP2977705B2 (en) | Control system of networked multiplexed computer system | |
CN117579465A (en) | Fault processing method, device, equipment and storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FZDE | Dead |