DE4039013A1

DE4039013A1 - Error function data detector in multiprocessor system - reduces load on processors with no errors by interrupting only affected units for data gathering

Info

Publication number: DE4039013A1
Application number: DE19904039013
Authority: DE
Inventors: Hiroyuki Hidaka; Masayuki Sugioka; Hiroshi Kakita; Masaya Watanabe; Akio Yamamoto
Original assignee: Hitachi Ltd; Hitachi Computer Engineering Co Ltd
Current assignee: Hitachi Ltd; Hitachi Computer Engineering Co Ltd
Priority date: 1989-12-08
Filing date: 1990-12-06
Publication date: 1991-06-13
Also published as: DE4039013C2; JP2956849B2; JPH03179538A

Abstract

The device for detecting malfunction information in a multiprocessor system contains a service processor. Each system processor determines whether an indicated fault is present in all processors or only in the unit in which it was detected. A further device halts only the unit with the indicated fault when the fault is indicated only in that unit. The service processor acquires the information only for the unit with the fault, without interrupting the other units. USE/ADVANTAGE - Malfunction detection arrangement reduces the load on processors in which no error is detected.

Description

The present invention relates to a device for collecting information about malfunctions in a multiprocessor, which stops operation of the multiprocessor when a malfunction occurs and collects the pending information.

In a conventional device, a processor such as a CPU of a data processing system freezes itself immediately when a malfunction is detected in the hardware of the data processing system and reports the detection of the malfunction to a service processor (SVP), whereupon the SVP collects the pending information of the CPU and obtains software information by way of program interrupts.

The theories of data processing systems are complex nowadays, so it is essential to collect information about the hardware when a malfunction occurs. Particularly in a multiprocessor system comprising a plurality of processors such as CPUs and IOPs and a plurality of service processors, it is important, in order to simplify fault analysis, to freeze immediately not only the processor in which the malfunction occurred but also the other processors, so as to prevent the system from operating in an abnormal state, and to collect the pending information of a plurality of processors.

The prior art concerning the collection of pending information when a malfunction occurs in a multiprocessor data processing system is known, for example, from JP 61-2 73 643-A, JP 63-2 62 729-A, JP 63-2 51 840-A and JP 63-2 51 841-A.

In JP 61-2 73 643-A, the pending information is collected in all units when a malfunction occurs; the entire system is forcibly stopped for this purpose, without regard to the cause of the malfunction.

In JP 63-2 62 729-A, the occurrence of a malfunction during communication between CPUs is reported to a partner CPU over a control signal line of a common bus, whereupon the partner CPU instructs other CPUs over a control signal line in the bus to collect data, after which the instructed CPUs collect data from the main memories by program interrupt.

In a device according to this prior art, a time delay arises between the detection of the malfunction by one CPU and the program interrupt in the other CPUs, because the detection of the malfunction is communicated to the other CPUs and the data are collected by program interrupt. During this time, the other CPUs continue to operate under the malfunction condition and could therefore operate erroneously. Moreover, no attention is paid to collecting the pending information, which is of the greatest importance for fault analysis.

In JP 63-2 51 840-A and JP 63-2 51 841-A, a CPU that wants to access a shared memory area must wait; if a memory error detector then detects an error, the CPU remains in the wait state and the error is reported to the other CPUs, so that processing in accordance with the content of the error can be carried out quickly.

These prior-art devices, however, make the CPUs wait only when an error occurs in the shared memory area; furthermore, the handling of an error within the CPU and the collection of pending information are not considered.

In analyzing the prior art known from JP 61-273 643-A, the inventors of the present invention found that the entire system is forcibly stopped as soon as an error occurs, without regard to the cause of the malfunction, so that the input/output devices are heavily burdened, for example by an overrun.

As described above, the other prior-art disclosures take into account neither the operation of CPUs other than the one in which the error was detected nor the fact that collecting the pending information is important for fault analysis.

The object of the present invention is therefore to provide a data processing system which solves the problem arising in the conventional design, which reduces the load on units not affected by the malfunction when a malfunction occurs in a multiprocessor data processing system, and which lets some of the processors continue operating in order to prevent a breakdown of the entire system.

A further object of the invention is to provide a data processing system which is able, in hardware, to freeze all processors without error when a malfunction occurs that would lead to a halt of the entire system, and which includes a malfunction detection system for obtaining accurate hardware information from all processors without adding complex logic.

A further object of the invention is to provide a data processing system which can easily be adapted to a restructuring of the system brought about by adding processors and service processors, in that the processors of the multiprocessor that were frozen when the malfunction occurred are put back into operation one after another by the service processors.

According to the invention, the above objects are achieved by providing means which, on the basis of the type of malfunction, determine the units whose pending information is to be collected when the malfunction occurs, and means which halt only the operation of the units so determined, so that the pending information can be collected without halting the units that have no connection with the malfunction.

Further objects of the invention are achieved by providing means which, when a malfunction occurs that makes it necessary to halt the whole system and collect the pending information from all units, and which thus affects all units, halt the unit concerned and report the malfunction to all other units. In addition, means are provided which halt the operation of a unit in accordance with the malfunction information from another unit, as well as means which, in a recovery mode, suppress a stop request (freeze request) arising from the malfunction information of a unit that has not yet been put back into operation, so that the collection of the pending information is carried out by the service processors.

According to the present invention, when a malfunction occurs that affects only one unit, only that unit is halted, without stopping the other units. This minimizes system crashes and increases processing efficiency.

If the malfunction affects all units, the entire system can be stopped immediately, so that no error occurs in any unit and the pending information, which is important for fault analysis, can be collected from all units.

Since means are provided for suppressing, in recovery mode, the freeze request based on the malfunction information of another unit, the recovery process can be carried out in the respective unit without having to take other units into account. The recovery process carried out by the service processors and the adaptation to an expansion with additional processors and service processors are thereby simplified.

The invention is explained in more detail below on the basis of preferred embodiments with reference to the drawings, in which:

Fig. 1 is a block diagram of the system configuration of a first embodiment of the present invention;

Fig. 2 is a detailed circuit diagram of a suppression state generator;

Fig. 3 is a block diagram of the system configuration of a second embodiment of the present invention;

Fig. 4 is a logic diagram of the structure of a malfunction monitoring control unit;

Fig. 5 is a flowchart explaining an operating sequence in response to a malfunction; and

Fig. 6 is a flowchart explaining an MCW recovery operation.

Fig. 1 shows a block diagram of the system configuration of a first embodiment of the present invention. In Fig. 1, reference numerals 1 and 2 denote instruction processors (BPs), reference numeral 3 an input/output processor (EAP), reference numeral 4 a system controller (AS), reference numeral 5 a main memory, reference numeral 6 a service processor (SVP), reference numeral 7 a memory for the SVP, reference numerals 8 to 10 error detection circuits, reference numeral 11 a suppression state generator, and reference numeral 15 a request selector.

In the first embodiment of the present invention shown in Fig. 1, the plurality of instruction processors 1 and 2 and the input/output processor 3 are connected to the system controller 4, so that they can access the main memory 5 through it. These processors are connected to the service processor 6.

In the first embodiment of the present invention configured in this way, when a malfunction occurs in instruction processor 1 (BP0), the error detection circuit 8 in instruction processor 1 detects the malfunction and sends error report signals 16 and 17 to the suppression state generator 11 in the system controller 4.

The error report signals comprise error report signal 16, which reports a malfunction that affects instruction processor 1 itself and not the whole system, and error report signal 17, which reports a malfunction that could affect the whole system. The choice between these signals is made by the error detection circuit 8.

When a malfunction occurs in instruction processor 2 (BP1), the error detection circuit 9 in instruction processor 2 detects the malfunction and sends error report signals 19 and 20 to the suppression state generator 11.

When the suppression state generator 11 receives error report signal 16, it generates a suppression state that suppresses only instruction processor 1. When it receives error report signal 17, it generates a suppression state that suppresses all instruction processors and the input/output processor 3. The suppression state generator 11 also generates the suppression state that suppresses all instruction processors and the input/output processor 3 when a malfunction occurs in the system controller 4 and the error detection circuit 10 outputs an error report signal 23.

As shown in Fig. 2, the suppression state generator 11 comprises three OR gates 33 to 35, to which the error report signals described above are applied, and flip-flops 30 to 32, which output suppression signals to the instruction processors and the input/output processor.

When the suppression state generator 11 receives error report signal 16 or 19 from instruction processor 1 or 2, indicating a malfunction within that BP, it generates suppression signal 27 or 28, via the OR gates 33 and 34 and the flip-flops 30 and 31, to suppress the instruction processor corresponding to the detection signal. When it receives at least one of the error report signals 17, 20 or 23, which concern the whole system, it sets all flip-flops 30 to 32 via the OR gate 35 and the OR gates 31 and 32, in order to generate the suppression signals 27 to 29 and thus suppress all processors.

The suppression signals are reset by a control signal 36 from the service processor 6 once the service processor 6 has finished collecting the pending signals.
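The behaviour of the suppression state generator described above can be sketched in Python as follows. This is a minimal behavioural model under stated assumptions, not the circuit of Fig. 2: the class name `SuppressionStateGenerator` and its method names are hypothetical, and the wiring (system-wide signals 17, 20 and 23 combined first and then fed to every flip-flop) is inferred from the text rather than taken from the drawing.

```python
class SuppressionStateGenerator:
    """Behavioural model of generator 11 (all names are illustrative)."""

    def __init__(self):
        # Flip-flops 30, 31 and 32 drive suppression signals 27, 28 and 29.
        self.ff = {30: False, 31: False, 32: False}

    def report(self, signals):
        """Latch the suppression state for a set of active error report signals."""
        # Assumption: one OR stage collects the system-wide reports 17, 20, 23.
        system_wide = bool(set(signals) & {17, 20, 23})
        # Assumption: each processor's flip-flop is set by its own local
        # report (16 or 19) or by any system-wide report.
        if 16 in signals or system_wide:
            self.ff[30] = True
        if 19 in signals or system_wide:
            self.ff[31] = True
        if system_wide:
            self.ff[32] = True

    def suppression_signals(self):
        """Return signals 27 (BP0), 28 (BP1) and 29 (input/output processor)."""
        return {27: self.ff[30], 28: self.ff[31], 29: self.ff[32]}

    def reset(self):
        """Model control signal 36 from the service processor."""
        for key in self.ff:
            self.ff[key] = False


gen = SuppressionStateGenerator()
gen.report({16})                    # local fault in BP0 only
print(gen.suppression_signals())
gen.reset()                         # service processor has finished collecting
gen.report({23})                    # fault in the system controller
print(gen.suppression_signals())
```

The flip-flops are modelled as latched state so that a suppression, once set, persists until the reset, which matches the description that only control signal 36 clears it.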

These suppression signals 27 to 29 are each sent, together with the request signal 18, 21 or 22 generated by the respective processor, to the gates 12 to 14 and, via these gates, to the request selector 15.

The request selector 15 selects one of the requests from processors 1 to 3 and instructs the system controller 4 to process the request. When the suppression signal from the suppression state generator 11 is applied to the gates 12 to 14, the gates block the requests of the corresponding processors, so that these requests do not reach the request selector.

Therefore, when the suppression signal for suppressing a processor is output by the suppression state generator 11, the corresponding request signals 18, 21 and 22 are not passed on to the request selector 15, so that the system controller 4 does not process the request and the processor that issued the request suspends its operation.
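The gating of requests described above can be illustrated with a short Python sketch. It is only a model under assumptions: the function names `gate_requests` and `select_request` are hypothetical, processors are identified by their reference numerals 1 to 3, and the arbitration policy (lowest processor number first) is assumed, since the text does not state how the request selector chooses among competing requests.

```python
def gate_requests(requests, suppressed):
    """Model gates 12 to 14: keep only requests from unsuppressed processors.

    requests   -- dict mapping processor number to True if it is requesting
    suppressed -- dict mapping processor number to True if it is suppressed
    """
    return [
        proc
        for proc, requesting in requests.items()
        if requesting and not suppressed.get(proc, False)
    ]


def select_request(requests, suppressed):
    """Model request selector 15; returns a processor number or None.

    Assumption: the lowest-numbered pending request wins.
    """
    passed = gate_requests(requests, suppressed)
    return min(passed) if passed else None


# Processors 1 and 2 both request; processor 1 is suppressed after a fault.
requests = {1: True, 2: True, 3: False}
suppressed = {1: True, 2: False, 3: False}
print(select_request(requests, suppressed))
```

Because the suppressed processor's request never reaches the selector, it is never serviced, which is how the processor comes to a halt without any explicit stop command.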

The service processor 6 collects the pending information only from the processor that has stopped operating and, after the fault has been remedied, sends the control signal 36 to the system controller 4 in order to return the whole system to the operating state.

In the first embodiment of the present invention described above, when a malfunction occurs in the multiprocessor-type data processing system and the malfunction does not also affect the other units, only the unit in which the malfunction occurred is halted, while the other units continue operating. Moreover, only the pending information of the stopped unit is collected. This avoids a crash of the whole system, improves the processing performance of the data processing system and avoids an overrun of the input/output elements.

In the first embodiment of the present invention described above, the data processing system comprises two instruction processors and one input/output processor, although a larger number of processors would also be possible. Furthermore, in the first embodiment the request signals of the processors are suppressed in order to halt the processor with the malfunction, although suppression of the clock signal could be used instead to halt the processor.

In the first embodiment of the present invention described above, when the malfunction may affect the whole system, the operation not only of the unit in which the malfunction occurred but also of the other units is halted. However, a short time elapses before the units in which no malfunction occurred stop operating; during this time a malfunction can occur in those units.

A second embodiment of the present invention prevents the malfunction even in the case mentioned above. The second embodiment of the present invention will now be described in detail with reference to Figs. 3, 4, 5 and 6.

Fig. 3 shows a block diagram of the configuration of the second embodiment of the present invention, Fig. 4 shows a logic diagram of an arrangement of a malfunction monitoring control unit, Fig. 5 shows a flowchart explaining the operation in the event of a malfunction, and Fig. 6 shows a flowchart explaining an MCW reset operation. In Figs. 3 and 4, reference numeral 41 denotes a clock control unit, reference numeral 42 a service processor (SVP), reference numerals 50 and 70 CPUs, reference numerals 51 and 71 independent logic units, reference numerals 52 and 72 common logic units, and reference numerals 53 and 73 malfunction monitoring control units (MCUs).

In the second embodiment of the present invention, shown in Fig. 3, the multiprocessor-type data processing system comprises two CPUs 50 and 70 and a service processor 42.

In Fig. 3, the CPUs 50 and 70 comprise independent logic units (EUs) 51 and 71, which operate independently within the respective CPU to execute instructions and operations; common logic units (SCUs) 52 and 72, which control and process areas used jointly by both CPUs 50 and 70, e.g. the main memory (not shown); and malfunction monitoring control units (MCUs) 53 and 73, which monitor and control the malfunctions detected by parity checking in the independent logic units (EUs) 51 and 71 and the common logic units (SCUs) 52 and 72.

A malfunction occurring in the independent logic units (EUs) 51 and 71 does not affect the other CPUs, whereas a malfunction occurring in the common logic units (SCUs) 52 and 72 also affects the other CPUs.

The service processor 42 queries the independent logic units (EUs) 51 and 71, the common logic units (SCUs) 52 and 72 and the malfunction monitoring control units (MCUs) 53 and 73 via control lines 55 and 75, and resets them via control lines 56 and 76. The service processor 42 can detect the occurrence of a malfunction in the CPUs 50 and 70 because it is notified by the malfunction monitoring control units (MCUs) 53 and 73 via control lines 57 and 77.

The execution clock for the independent logic units 51 and 71 and the common logic units 52 and 72 is supplied by the clock control unit 41 via AND gates 54 and 74 and control lines 58 and 78.

The logic elements of the malfunction monitoring control units (MCUs) 53 and 73 are shown in Fig. 4.

In Fig. 4, an MCW register 64 is normally set to "1" ("ON") by the service processor 42 via control line 55 during initial program load. The error signals of the common logic unit 52 and the independent logic unit 51 are output on control lines 59 and 60. In addition, to report the malfunction of the CPU 50, the error signals are sent to the CPU 70 via an OR gate 67 and control line 63, and to the service processor 42 via an OR gate 68 and control line 57.

The error signal of the CPU 50, which is sent to the CPU 70 via control line 63, sets a flip-flop (FF) 86 through an AND gate 85 in the CPU 70.

Although only the malfunction monitoring control unit 53 has been described in detail, the same applies correspondingly to the malfunction monitoring control unit 73.
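The gate-level behaviour described for the malfunction monitoring control unit 53 can be illustrated with a small Python sketch. Fig. 4 is not reproduced here, so the wiring below is an assumption inferred from the text: OR gate 67 is taken to combine the common logic unit error (control line 59) with flip-flop 66, and OR gate 68 to combine that result with the independent logic unit error (control line 60). The names `McuOutputs`, `mcu_53` and `next_ff66` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class McuOutputs:
    report_to_svp: bool     # control line 57 to service processor 42
    clock_enabled: bool     # inverter 69 / control line 62 into AND gate 54
    stop_request_out: bool  # control line 63 to the other CPU


def mcu_53(scu_error: bool, eu_error: bool, ff66: bool) -> McuOutputs:
    """Combinational part of MCU 53 (wiring inferred, not taken from Fig. 4)."""
    shared_fault = scu_error or ff66      # assumed inputs of OR gate 67
    any_fault = shared_fault or eu_error  # assumed inputs of OR gate 68
    return McuOutputs(
        report_to_svp=any_fault,
        clock_enabled=not any_fault,      # clock is frozen on any fault
        stop_request_out=shared_fault,    # only shared faults stop the peer
    )


def next_ff66(ff66: bool, stop_request_in: bool, mcw64: bool) -> bool:
    """Flip-flop 66 is set through AND gate 65 only while MCW register 64 is ON."""
    return ff66 or (stop_request_in and mcw64)
```

Under these assumptions, an error in the independent logic unit freezes only the CPU's own clock, while an error in the common logic unit additionally raises the stop request to the other CPU.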

The operation of the second embodiment of the present invention will now be explained with reference to the flowchart of Fig. 5, assuming that a malfunction has been detected in the common logic unit 52.

1) When a malfunction is detected in the common logic unit 52, it is reported to the malfunction monitoring control unit 53 via control line 59. A malfunction in the common logic unit 52 also affects the other CPUs (in this embodiment, the CPU 70). On receiving the report, the malfunction monitoring control unit 53 therefore notifies the service processor 42 via OR gates 67 and 68 and control line 57. It also applies the output of OR gate 68 to AND gate 54 via an inverter 69 and control line 62, which freezes the clock signal generated by the clock control unit 41 and supplied to the common logic unit 52 and the independent logic unit 51, so that the CPU 50 is stopped.
In addition, the output of OR gate 67 is sent to the CPU 70 via control line 63 as a stop request caused by the detected malfunction and is applied to an AND gate 85 (step 501).
2) When the malfunction monitoring control unit 73 in the CPU 70 receives the stop request caused by the malfunction detected in the CPU 50, it sets flip-flop 86 to "1" via AND gate 85, provided the MCW register 84 is "ON". Like the malfunction monitoring control unit 53 in step 501, it then reports the detection of a malfunction to the service processor 42 via control line 77 and freezes the clock signal to the independent logic unit 71 and the common logic unit 72 via control line 82. It also sends the output of OR gate 87 to the malfunction monitoring control unit 53 of the CPU 50, which sets flip-flop 66 to "1" (step 511).
3) When the service processor 42 receives the malfunction report, it carries out the error recovery process for the CPU 50. The service processor 42 collects the hardware information (pending information) of the CPU 50 and determines from it whether the malfunction is a malfunction in the common logic unit (SCUCK) (steps 503, 504).
4) If step 504 determines that the malfunction occurred in the common logic unit, the MCW register 64 in the malfunction monitoring control unit 53 is reset to "0" via control line 55, so that the stop request sent to the CPU 50 from the malfunction monitoring control unit 73 of the CPU 70 via control line 83 is suppressed by AND gate 65 (step 505).
5) The service processor 42 then sends a reset command to the CPU 50 via control line 56, which initializes the independent logic unit 51 and the common logic unit 52 and resets to "0" the flip-flop 66 that holds the stop request from the CPU 70. It then issues a start command (step 506).
6) As a result of step 506, the CPU 50 again receives the clock signal from the clock control unit 41 and carries out the software error recovery process (step 502).
7) The service processor 42 then performs the same operations as steps 503 to 506 on the CPU 70 to carry out its error recovery process (steps 507 to 510, 512).
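The sequence of steps 1) to 7) can be traced with a short Python simulation. This is a minimal sketch under stated assumptions, not the circuit of the patent: faults in the independent logic units are left out, a CPU counts as frozen whenever its own common logic unit error or its stop flip-flop is set, and the names `Cpu`, `settle`, `svp_recover` and `run_scenario` are hypothetical.

```python
class Cpu:
    def __init__(self, name):
        self.name = name
        self.mcw = True         # MCW register (64 / 84), ON after initial program load
        self.stop_ff = False    # flip-flop (66 / 86) holding the peer's stop request
        self.scu_error = False  # error state of the common logic unit (52 / 72)
        self.peer = None

    def stop_request_out(self):
        # Assumed role of OR gate 67 / 87: own shared fault or latched peer request.
        return self.scu_error or self.stop_ff

    def frozen(self):
        return self.stop_request_out()

    def latch_peer_request(self):
        # AND gate 65 / 85: the peer's request is accepted only while MCW is ON.
        if self.peer.stop_request_out() and self.mcw:
            self.stop_ff = True


def settle(a, b):
    """Propagate stop requests in both directions until they are stable."""
    for _ in range(2):
        a.latch_peer_request()
        b.latch_peer_request()


def svp_recover(cpu):
    """Service processor actions of steps 503 to 506 (507 to 510 for the peer)."""
    log = {"cpu": cpu.name, "scu_error": cpu.scu_error, "stop_ff": cpu.stop_ff}
    cpu.mcw = False        # step 505: mask the peer's stop request
    cpu.scu_error = False  # step 506: reset initializes the logic units
    cpu.stop_ff = False    #           and clears the flip-flop, then restart
    return log


def run_scenario():
    cpu50, cpu70 = Cpu("CPU 50"), Cpu("CPU 70")
    cpu50.peer, cpu70.peer = cpu70, cpu50
    trace = []
    cpu50.scu_error = True                       # fault in common logic unit 52
    settle(cpu50, cpu70)
    trace.append((cpu50.frozen(), cpu70.frozen()))
    log50 = svp_recover(cpu50)                   # CPU 50 is handled first
    settle(cpu50, cpu70)
    trace.append((cpu50.frozen(), cpu70.frozen()))
    log70 = svp_recover(cpu70)                   # then CPU 70
    settle(cpu50, cpu70)
    trace.append((cpu50.frozen(), cpu70.frozen()))
    return trace, log50, log70
```

In this model, both CPUs are frozen after the fault, the CPU 50 runs again after its recovery while the CPU 70 stays frozen, and both run after the second recovery. Clearing the MCW register before the reset is what keeps the still-asserted stop request of the CPU 70 from freezing the CPU 50 again.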

In the software error recovery process for the above-described malfunction in the common logic unit, the status of the other CPU is communicated via the control bus 43. Furthermore, once the malfunction in the other CPU has been cleared, the MCW register is set to "1" again via control line 61 or 81. Therefore, if a malfunction occurs again in the common logic unit, all CPUs can be frozen immediately and the pending information can be collected.

Fig. 6 is a flowchart of the reset process for the MCW register, which is described below.

1) During the error recovery process of the CPU 50, the CPU 50 checks via the control bus 43 whether the other CPU (in this embodiment, the CPU 70) is in the malfunction state (steps 601 and 602).
2) If step 602 determines that the other CPU is not in the malfunction state, the MCW register 64 is set to "1" and the error recovery in its own CPU is reported to the other CPU (steps 603 and 604).
3) During the error recovery process of the CPU 70, the MCW register 84 is likewise set to "1" and the error recovery in its own CPU is reported to the CPU 50, as in steps 601 to 604 (steps 605 to 608).
4) If its own CPU is in the normal state and only the other CPU is in the malfunction state, the CPU sets its MCW register to "1" by means of a program interrupt when the other CPU reports its error recovery (step 609).
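The MCW reset procedure of Fig. 6 can be sketched in Python as follows. This is an illustrative model only: the class `CpuSoftware` and its methods are hypothetical, the status exchange over control bus 43 is reduced to reading an attribute of the peer, and a CPU is assumed to count as "normal" as soon as its own software recovery has run.

```python
class CpuSoftware:
    def __init__(self, name):
        self.name = name
        self.in_malfunction = True  # frozen, recovery not yet done
        self.mcw = False            # cleared by the service processor in step 505
        self.peer = None

    def finish_recovery(self):
        """Steps 601 to 604 (605 to 608 when run on the other CPU)."""
        self.in_malfunction = False
        # Steps 601/602: read the peer's status (stands in for control bus 43).
        if not self.peer.in_malfunction:
            self.mcw = True                # step 603: re-arm the MCW register
            self.peer.on_peer_recovered()  # step 604: report own recovery
        # Otherwise the MCW register stays "0" until the peer reports recovery.

    def on_peer_recovered(self):
        """Step 609: program interrupt raised by the peer's recovery report."""
        if not self.in_malfunction and not self.mcw:
            self.mcw = True


def run_reset_flow():
    cpu50, cpu70 = CpuSoftware("CPU 50"), CpuSoftware("CPU 70")
    cpu50.peer, cpu70.peer = cpu70, cpu50
    cpu50.finish_recovery()  # CPU 70 still in malfunction: MCW 64 stays "0"
    after_first = (cpu50.mcw, cpu70.mcw)
    cpu70.finish_recovery()  # sets MCW 84 and interrupts CPU 50
    after_second = (cpu50.mcw, cpu70.mcw)
    return after_first, after_second
```

The point of deferring the re-arm is visible in the first call: while the other CPU is still asserting its stop request, setting the MCW register back to "1" would freeze the recovered CPU again.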

The above description of the operation of the second embodiment assumes that the malfunction occurred in the common logic unit, that is, a malfunction that also affects other units such as the CPUs. If a malfunction occurs in the independent logic unit 51 or 71, it does not affect the other CPUs. The MCW register is therefore used in such a way that the other CPUs are not frozen and the error recovery process is carried out only on the CPU that was frozen because of the malfunction.

In the second embodiment described above, the system comprises two CPUs and one service processor, but the present invention can also be applied to a system containing more CPUs or more service processors.

According to the second embodiment of the present invention, as described above, all CPUs are frozen immediately when a malfunction common to a plurality of CPUs is detected. This prevents malfunctions in CPUs other than the one that detected the malfunction, and the pending information important for the analysis can be collected from all CPUs.

During the error recovery process, the freeze request from another CPU can be suppressed. The service processor can thus carry out the error recovery processes on the respective processors even when it receives malfunction detection reports from a plurality of CPUs. The CPUs can therefore be put back into operation by a simple error recovery process that does not have to take specific account of the CPU configuration.

Moreover, since the error recovery process can be carried out sequentially without regard to the other CPUs, the number of CPUs in the system need not be taken into account. The process can therefore be carried out without delay even when the system has to be reconfigured because the number of CPUs or service processors has been increased or reduced.

According to the present invention, units not related to the malfunction continue to operate when a malfunction occurs, so that the load on a unit involving mechanical operation, such as an input/output device, can be reduced and a system crash, which would cause considerable malfunctions, can be avoided. Even when the malfunction affects the whole system, malfunctions in units other than the one in which it occurred are avoided.

Claims

1. Data processing system, with
a plurality of processors (1, 2); and
a service processor (6) which can collect pending information in the event of a malfunction; characterized in
that each of these processors (1, 2) comprises means (8, 9) for determining whether a malfunction that has occurred affects all other units or only the unit in which it occurred, and means (11) for stopping the operation of only the unit in which the malfunction occurred when the determining means (8, 9) have determined that the malfunction affects only that unit; and
that the service processor (6) collects the pending information only from the stopped units, without stopping the operation of the other units not related to the malfunction.

2. Data processing system, with
a plurality of processors (50, 70); and
a service processor (42) capable of collecting information in the event of a malfunction; characterized in
that each of these processors (50, 70) comprises determining means (51, 52, 71, 72) which determine whether a malfunction that has occurred affects all other units or only the unit in which it occurred, means (53, 73) which stop their own unit and report the malfunction to the other units when the determining means (52, 72) have determined that the malfunction also affects all other units, and means (53, 73) which stop their own unit after a malfunction has been reported by another unit; and
that the service processor (42) collects the pending information of the processors (50, 70) one after the other.

3. Data processing system according to claim 2, characterized in that each of the processors (50, 70) additionally comprises means (53, 73) which, during an error recovery process, suppress a stop request resulting from the malfunction report of a unit which has not yet been restored.

4. Data processing system according to claim 3, characterized in that the malfunction report is transmitted to all other units via a control line (43) connecting the units directly.

5. Data processing system, with
a plurality of processors (50, 70); and
a service processor (42) capable of collecting information in the event of a malfunction; characterized in
that each of these processors (50, 70) comprises determining means (51, 52, 71, 72) which determine whether a malfunction that has occurred affects all other units or only the unit in which it occurred, means (53, 73) which stop the operation of only the unit in which the malfunction occurred when the determining means (51, 71) have determined that the malfunction affects only that unit, means (53, 57) which stop their own unit and report the malfunction to the other units when the determining means (52, 72) have determined that the malfunction also affects all other units, and means (53, 73) which stop their own unit after a malfunction has been reported by another unit; and
that the service processor (42) collects the pending information only from the stopped units, without stopping the operation of the other units not related to the malfunction, and sequentially collects the pending information from the processors (50, 70) when, according to the determining means (51, 71), the malfunction affects only one unit.

6. Data processing system according to claim 5, characterized in that each of the processors (50, 70) further comprises means (53, 73) for suppressing, during the error recovery process, a stop request transmitted by a unit which has not yet been restored.

7. Data processing system according to claim 5, characterized in that malfunction reports are transmitted to all other units via a control line (43) which connects the units directly.