US20040078681A1 - Architecture for high availability using system management mode driven monitoring and communications - Google Patents

Publication number
US20040078681A1
Authority
US
United States
Prior art keywords
system management
code
level
computer system
gateway
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/056,949
Inventor
Nick Ramirez
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Application filed by Intel Corp
Priority to US10/056,949
Assigned to INTEL CORPORATION (assignment of assignors interest; see document for details). Assignors: RAMIREZ, NICK
Publication of US20040078681A1
Priority to US11/240,237 (patent US7434085B2)
Legal status: Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0748: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a remote unit communicating with a single-box computer node experiencing an error/fault
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766: Error or fault reporting or storing
    • G06F11/0778: Dumping, i.e. gathering error/state information after a fault for later diagnosis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079: Root cause analysis, i.e. error or fault diagnosis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793: Remedial or corrective actions

Abstract

A computer system and method for providing high availability. The computer system includes an application level, an operating system level supporting the application level, and a firmware level supporting the operating system level. The firmware level includes a microprocessor having a system management mode that functions independently from the operating system level. The system management mode is configurable to execute system management code to monitor each of the levels of the computer system and to correct malfunctions in the levels in response to a system management interrupt.

Description

    FIELD OF THE INVENTION
  • The present invention relates to microprocessor systems and more particularly, but without limitation, relates to using System Management Mode (SMM) and System Management Interrupt (SMI) capabilities of a microprocessor to detect errors, execute instructions and provide fault tolerance and high availability for computer systems. [0001]
  • BACKGROUND INFORMATION
  • Computer systems based upon the Intel x86 family of microprocessors can employ several different microprocessor modes. Each mode has a defined boundary in terms of memory addresses for program code and data that is configured by the firmware of the microprocessor. A number of the microprocessor modes are “protected” modes in the sense that they operate independently from and are generally unaffected by the operating system that functions over the microprocessor modes. The microprocessor can execute instructions in several protected modes simultaneously without risk of violating the independent operations of each mode. [0002]
  • In Intel microprocessors, the System Management Mode (SMM) is a protected mode that is designed to have complete authority over the microprocessor. For example, the SMM provides full access to certain I/O functions of the BIOS (basic input/output system), and system management code can be used to control hardware and firmware features independently from the operating system and application software. Additionally, the SMM can be used to store information about the system configuration of a frozen or powered-down device because its operation does not depend upon the correct functioning of the higher-level operating system, application code or device drivers. [0003]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a schematic block diagram of architecture levels of a computer system including a SMM according to an embodiment of the present invention. [0004]
  • FIG. 2 is a flow chart of a method for promoting a faster recovery after a system malfunction according to an embodiment of the present invention. [0005]
  • FIG. 3 is a schematic block diagram of a hardware rack assembly including SMM code according to an embodiment of the present invention. [0006]
  • FIG. 4 is a schematic block diagram of a high availability telecommunication system employing SMM policy and procedure code according to the present invention. [0007]
  • DETAILED DESCRIPTION
  • In accordance with the present invention, the SMM of one or more microprocessors is used to provide a fault-tolerant, high availability system. When errors are detected, a SMM code executes prescribed plans of action from policy handling code. Such prescribed plans of action may include saving critical state information upon detection, which information can be used to speed up resetting of a computer system after a malfunction, and to determine the cause of the error or malfunction. The SMM functions operate independently from the operating system. [0008]
  • FIG. 1 is a schematic representation of the architecture of a computer system according to the present invention. According to the figure, application programs 10 such as, for example, a word processing program, function over the operating system 20 in that the program code of the applications interacts with functions and procedures of the operating system rather than directly with the firmware and hardware of the computer system. The operating system 20, in turn, operates over the firmware and BIOS 30 that is embodied in the microprocessor and various other application specific integrated circuits (ASICs) in the computer system 5. The firmware layer 30 includes firmware code and memory space 40 allocated for functioning of the SMM. The SMM code stored at 40 may contain various monitoring, data loading, and resetting procedures, among other prescribed operations. The various procedures of the code do not run continuously, but rather, they are triggered by a System Management Interrupt (SMI) which is delivered to the SMM code after the expiration of a timer 45 or after the occurrence of an external event or an occurrence in the operating system 20 or application 10 levels. [0009]
  • The SMM can be used to promote a faster recovery after operating system malfunctions according to the following method illustrated as a flow chart in FIG. 2. In step S1, a SMI-triggering timer elapses. In step S2, after elapse of the timer, SMM code on a resident or networked microprocessor resets the timer. In step S3, the SMM code then checkpoints and registers (identifies) all application software, drivers, the operating system and I/O devices as a background task. Since the background checks are based upon a timer, any changes in this information over time are also registered. It is noted that the timer can be set to a very small time period, such as 1 millisecond, so that the monitoring occurs approximately continuously for all practical purposes. The timer can also be set to longer intervals so as not to affect system application software performance. In step S4, the SMM independently monitors whether the operating system has crashed due to hardware or driver malfunction. If no crash has occurred, the method performs an interrupt return in step S5, but if a crash has occurred, in S6, SMM code releases the state information it has stored to other system management controls which control rebooting of the computer system. This is followed by an interrupt return in S7. If a two-stage timer is used for SMI/reset, the timer is then allowed to run out and cause a microprocessor reset. [0010]
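The flow of steps S2 through S7 can be sketched as a small simulation. This is only an illustrative model written in Python, not firmware and not the implementation described in the patent; the names `SystemSnapshot`, `SmmMonitor`, `on_smi`, and the `os_crashed` flag are hypothetical and chosen for this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SystemSnapshot:
    """Hypothetical record of what step S3 registers."""
    applications: tuple
    drivers: tuple
    io_devices: tuple


@dataclass
class SmmMonitor:
    """Toy model of the timer-driven SMI handler of FIG. 2 (assumed structure)."""
    interval_ms: int = 1      # timer period; the text mentions values as small as 1 ms
    remaining_ms: int = 0     # time left until the next SMI
    history: list = field(default_factory=list)   # checkpoints taken in S3
    released: list = field(default_factory=list)  # state handed over in S6

    def on_smi(self, snapshot, os_crashed):
        # S2: reset the SMI-triggering timer.
        self.remaining_ms = self.interval_ms
        # S3: checkpoint and register the current system state.
        self.history.append(snapshot)
        # S4: check whether the operating system has crashed.
        if not os_crashed:
            return "S5: interrupt return"
        # S6: release the stored state to the controls that manage rebooting.
        self.released = list(self.history)
        return "S7: interrupt return"
```

A caller would invoke `on_smi` once per timer expiry. In this sketch, crash detection is passed in as a flag because the text does not specify how the SMM code detects a crash.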
  • Having state information about the drivers, I/O, memory, and interrupts just prior to the crash enables a faster and more “stateful” rebooting and recovery because the system controls can more directly replicate the last known functional state of the computer system prior to the malfunction. Additionally, the state information can be used to determine any sources of operating system malfunction after a crash. This determination can also enable faster recovery by attributing the cause of malfunction among hardware, software and external causes. System management code can then determine whether a reboot or a hardware exchange is more appropriate for system recovery. Furthermore, SMIs can initiate SMM policy handling software in response to errors as they occur. For example, SMM software can be used to close files or send messages to alert both applications and the operating system to perform emergency shut down procedures before disconnection or power loss. [0011]
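As a rough illustration of how saved state might inform the choice between a reboot and a hardware exchange, the Python sketch below compares the last known good snapshot with the state at the crash. The rule used here (a change in I/O or memory state suggests hardware, anything else suggests software) is purely an assumption for this example; the text does not specify a diagnostic rule, and the names `changed_fields`, `choose_recovery`, and `HARDWARE_FIELDS` are hypothetical.

```python
# Assumption for this sketch: changes in these fields point to a hardware cause.
HARDWARE_FIELDS = {"io", "memory"}


def changed_fields(last_good, at_crash):
    """Return the sorted names of state fields that differ between two snapshots."""
    keys = last_good.keys() | at_crash.keys()
    return sorted(k for k in keys if last_good.get(k) != at_crash.get(k))


def choose_recovery(last_good, at_crash):
    """Pick a recovery action from the difference between two snapshots."""
    changed = changed_fields(last_good, at_crash)
    if any(name in HARDWARE_FIELDS for name in changed):
        return "hardware exchange", changed
    return "reboot", changed
```

A real diagnostic routine would need far richer evidence than a field-by-field comparison; the point is only that state saved before the crash gives the recovery logic something to compare against.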
  • FIG. 3 schematically illustrates a rack computer system 100 in which SMI and SMM policy handling code are used to detect “hot swaps” and to direct emergency shut down operations. Hot swaps denote the removal and replacement of a component in a computer system while the system is running. As shown in FIG. 3, computer system 100 comprises a rack of cards 101-106 connected by a bus 110. Each of the cards provides functionality to the rack computer system as a whole and may include redundant capability. Although only six cards are shown, the number is merely exemplary, and any number of cards can be coupled to the rack computer system. Cards 101, 105 are I/O cards, card 102 is a motherboard including a microprocessor (mP), card 103 is a memory card and card 106 is a network interface card. Card 104 includes a microprocessor that stores SMM policy handling code according to the present invention. Alternatively, each card can include a microprocessor having SMM capability, or the SMM code may be stored separately from the cards in a microprocessor 120 coupled to the rack computer system 100. [0012]
  • Each of the cards 101-106 is releasable from the rack computer system via respective switches 111-116. If, for example, switch 111 is activated to release I/O card 101, a SMI is automatically generated by the I/O card and passed on to the SMM code at card 104 or at microprocessor 120. The SMM code then executes instructions to save configured states of the card 101 being removed in SMRAM (system management random access memory) allocated for such purposes, and shuts off power delivery to the card. When a replacement board is inserted into the same slot, or if there is a separate card in another slot that is on stand-by, the SMM code detects the insertion or the stand-by condition, and then downloads the state information from memory to the card, thus providing a smooth transition for swapped or exchanged cards to maintain the availability of the cards for continued use. In this manner, the SMM monitors the computer system 5 so that there is a mechanism to transfer functionality of various components so that the system as a whole remains active at all times. [0013]
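The hot-swap sequence just described (save state, cut power, restore state into a replacement or stand-by card) can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions: a dictionary stands in for SMRAM, slots are identified by the card reference numerals, and the class and method names (`HotSwapManager`, `on_release_switch`, `on_insert`) are hypothetical.

```python
class HotSwapManager:
    """Toy model of SMM policy code handling card removal and replacement."""

    def __init__(self):
        self.smram = {}    # slot -> saved card state; stands in for SMRAM
        self.powered = {}  # slot -> whether power is delivered to the slot

    def on_release_switch(self, slot, card_state):
        """Handle the SMI raised when a card's release switch is activated."""
        self.smram[slot] = dict(card_state)  # save the configured state
        self.powered[slot] = False           # shut off power to the card

    def on_insert(self, removed_slot, target_slot=None):
        """Deliver saved state to a replacement in the same slot or a stand-by slot."""
        target = removed_slot if target_slot is None else target_slot
        state = self.smram.pop(removed_slot)
        self.powered[target] = True
        return target, state
```

Passing `target_slot` models the stand-by case, where the saved state is delivered to a card already installed in another slot.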
  • An architecture applying SMM driven monitoring and communications can be readily applied to a telecommunications infrastructure to provide fault tolerance and high availability. A high-level schematic illustration of a telecommunications system employing SMM monitoring and communications according to the present invention is shown in FIG. 4. FIG. 4 shows a voice-over Internet Protocol (VoIP) infrastructure 200 that includes a media gateway 210, a signaling gateway 220 and a gateway controller 230. All of these components are coupled to one another, and in addition, each is coupled to a high availability system controller 240. The media gateway 210 performs the function of translating between continuous PCM voice data traffic to or from a regular POTS (plain old telephone system) network and packetized data to or from an IP-based network. Similarly, the signaling gateway 220 converts between SS7 signaling messages of a POTS network and IP or H.323 based signaling messages of an IP network. The gateway controller 230 receives the converted signaling messages from the signaling gateway 220 and translates telephone number information into an IP address and then arranges routing of the media traffic from the media gateway 210 according to the IP address. Each of the media gateway 210, signaling gateway 220, and gateway controller 230 includes an Intel x86 microprocessor having a SMM. [0014]
  • In this VoIP telephone system, it is vital to have each of the gateway and controller components 210, 220, 230 running at all times to prevent loss of communication, in part because this system does not necessarily have the redundant capacity of the POTS networks. Therefore, high availability of these components is a necessity to provide quality of service comparable to the regular voice networks. The embedded SMM of the telecom components 210, 220, 230 can be used to perform near-continuous monitoring for faults, and to save state information. When a malfunction is detected at any of the telecom components 210, 220, 230, an SMI is generated and the SMM code sends a message including state information to a high availability controller 240 which stores policy and procedure code. Upon receiving the SMI, the high availability controller transmits instructions to one or more of the processors at the components. While the high availability controller 240 is depicted as a separate component, the control code to implement high availability may be co-located in one or more of the processors within the telecom components 210, 220, 230 in addition to, or instead of, the separate controller 240. [0015]
  • As described above, the high availability controller 240 may execute code to determine the cause of the malfunction; power down various devices; activate replacement devices (not shown) to cover for any malfunctioning devices; and reroute traffic to maintain the quality of service of the network. The policy and procedures initiated by the high availability system controller 240 can also include more detailed and complex instructions, such as specific instructions to the signaling gateway regarding sending certain types of requests for re-sending or momentary delaying of messages during a malfunction. Since the monitoring and correction functions of the SMM and high availability controller operate in a protected mode independently from higher level software operating systems such as gateway translation software, H.323 messaging software, and any base operating system, such as UNIX, Linux, or Microsoft Windows™, hardware fault tolerance can be safeguarded independently regardless of the function of the higher level software components. [0016]
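The controller behavior described above can be illustrated with a minimal Python sketch in which a component's fault message is turned into an ordered list of policy actions. The message format, the stand-by table, and the names `FaultMessage`, `HighAvailabilityController`, and `handle` are assumptions made for this example and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class FaultMessage:
    """Hypothetical message sent by a component's SMM code after an SMI."""
    component: str  # e.g. "media gateway 210"
    state: dict     # state information saved by the SMM code


class HighAvailabilityController:
    """Toy model of the policy and procedure code at controller 240."""

    def __init__(self, standby):
        self.standby = dict(standby)  # component -> stand-by replacement, if any
        self.log = []                 # received (component, state) pairs

    def handle(self, msg):
        """Return the ordered actions taken for one reported malfunction."""
        self.log.append((msg.component, msg.state))
        actions = [f"diagnose {msg.component}", f"power down {msg.component}"]
        replacement = self.standby.get(msg.component)
        if replacement is not None:
            actions.append(f"activate {replacement}")
            actions.append(f"reroute traffic from {msg.component} to {replacement}")
        return actions
```

Returning the actions as data rather than executing them keeps the policy easy to inspect, which suits a sketch; an actual controller would transmit instructions to the processors at the components.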
  • In the foregoing description, the system and method of the invention have been described with reference to a number of examples that are not to be considered limiting. Rather, it is to be understood and expected that variations in the principles of the system and method herein disclosed may be made by one skilled in the art, and it is intended that such modifications, changes, and/or substitutions are to be included within the scope of the present invention as set forth in the appended claims. [0017]

Claims (17)

What is claimed is:
1. A computer system for providing high availability comprising:
an application level;
an operating system level supporting the application level; and
a firmware level supporting the operating system level, the firmware level including a microprocessor having a system management mode that functions independently from the operating system level, the system management mode being configurable to execute system management code to monitor each of the levels of the computer system and to correct malfunctions in the levels.
2. The computer system of claim 1, wherein the system management mode is configured to execute code in response to receipt of a system management interrupt.
3. The computer system of claim 1, further comprising:
a timer coupled to the firmware level and set for a duration,
wherein the timer triggers transmission of a system management interrupt after the duration has elapsed.
4. The computer system of claim 3, wherein the system management code stores state information concerning the application, the operating system and the firmware levels upon receipt of the system level interrupt.
5. The computer system of claim 4, wherein the system management interrupt is independently triggered by external events occurring in at least one of the application level, the operating system level, and the firmware level.
6. The computer system of claim 5, wherein the system management interrupt is triggered by a malfunction in at least one of the application level, operating system level, and firmware level.
7. The computer system of claim 6, wherein the system management code initiates a stateful rebooting of the computer using the saved information concerning the application, the operating system and the firmware levels.
8. The computer system of claim 6, wherein the system management code includes a diagnostic routine to determine the source of the malfunction.
9. A method for providing a stateful resetting of a computer system including a microprocessor that has a system management mode that functions independently of an operating system, the method comprising:
storing state information concerning the computer system according to code executed in the system management mode on a regular basis;
detecting a malfunction in the computer system; and
triggering the system management mode to execute code to deliver the stored state information during resetting of the computer system to restore a state of the computer system prior to malfunction.
10. The method of claim 9 further comprising:
setting a timer;
sending a system management interrupt after elapsing of the timer;
resetting the timer; and
storing the state information upon receipt of the system management interrupt.
11. The method of claim 10, wherein the timer is set for a short duration.
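The method of claims 9 to 11 can be pictured with a small simulation. This is only an illustrative sketch, not the claimed implementation: real system management mode code runs in firmware, whereas here a Python class stands in for it. The class name `SmmCheckpointSim`, the tick-based timer, and the dictionary used as "state information" are all assumptions made for the example.

```python
import copy


class SmmCheckpointSim:
    """Toy model: timer-driven SMI stores state; a reset restores it."""

    def __init__(self, timer_ticks):
        self.timer_ticks = timer_ticks   # hypothetical timer duration, in ticks
        self.remaining = timer_ticks     # "setting a timer"
        self.live_state = {"app": 0, "os": 0, "firmware": 0}
        self.saved_state = None          # stands in for storage owned by SMM code

    def smi_handler(self):
        # Store the state information upon receipt of the (simulated) SMI.
        self.saved_state = copy.deepcopy(self.live_state)

    def tick(self):
        # Ordinary work changes the live state on every tick.
        self.live_state["app"] += 1
        self.remaining -= 1
        if self.remaining == 0:
            # Timer elapsed: send the SMI, then reset the timer.
            self.smi_handler()
            self.remaining = self.timer_ticks

    def malfunction_and_reset(self):
        # A malfunction corrupts the live state; the reset then delivers
        # the stored state so the pre-malfunction state is restored.
        self.live_state = {"app": None, "os": None, "firmware": None}
        if self.saved_state is not None:
            self.live_state = copy.deepcopy(self.saved_state)
        else:
            self.live_state = {"app": 0, "os": 0, "firmware": 0}  # cold start


sim = SmmCheckpointSim(timer_ticks=3)
for _ in range(7):
    sim.tick()
sim.malfunction_and_reset()
print(sim.live_state)  # {'app': 6, 'os': 0, 'firmware': 0}
```

In this run the last checkpoint was taken at tick 6, so only the work of tick 7 is lost. A shorter timer duration, as in claim 11, narrows that window at the cost of more frequent interrupts.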
12. A method of providing high availability in a rack computer system including a microprocessor having a system management mode and including manually replaceable component cards, the method comprising:
detecting initiation of manual removal of a component card;
storing state information concerning the component card according to code executed in the system management mode;
shutting off power to the component card being removed; and
delivering the stored state information to a replacement card to ensure availability of the functions provided by the component card being removed.
13. The method of claim 12, further comprising:
before delivering the state information to the replacement card, determining whether one of:
i) a replacement card has been inserted to replace the removed card, and
ii) a pre-installed stand-by component can be used as a substitute for the removed card.
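The sequence in claims 12 and 13 can be sketched as follows. This is a hedged illustration only: the `Card` type, the `handle_removal` function, and the `state` dictionary are hypothetical names introduced here, and the code models the ordering of the steps rather than any real hot-swap hardware interface.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Card:
    slot: int
    powered: bool = True
    state: dict = field(default_factory=dict)


def handle_removal(card: Card,
                   replacement: Optional[Card] = None,
                   standby: Optional[Card] = None) -> Optional[Card]:
    """Run the removal sequence; return the card that took over, or None."""
    # Store state information for the card being removed
    # (done by system management mode code in the claim).
    saved = dict(card.state)
    # Shut off power to the card being removed.
    card.powered = False
    # Claim 13: prefer an inserted replacement, else a pre-installed stand-by.
    target = replacement if replacement is not None else standby
    if target is None:
        return None
    # Deliver the stored state so the card's functions stay available.
    target.state = saved
    target.powered = True
    return target


old = Card(slot=3, state={"sessions": 42})
spare = Card(slot=9, powered=False)
active = handle_removal(old, replacement=None, standby=spare)
```

Here no replacement card has been inserted, so the stand-by card in the assumed slot 9 receives the stored state and takes over.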
14. A telecommunication system having high availability comprising:
a media gateway for converting between POTS voice traffic and IP voice traffic, the media gateway including a processor having a system management mode that executes code to monitor a state of the media gateway;
a signaling gateway for converting between POTS signaling traffic and IP signaling traffic, the signaling gateway including a processor having a system management mode that executes code to monitor a state of the signaling gateway;
a gateway controller coupled to the signaling gateway for receiving signaling messages therefrom and determining IP routing addresses corresponding to telephone numbers, the gateway controller including a processor having a system management mode that executes code to monitor a state of the gateway controller; and
a high availability system controller coupled to all of the media gateway, the signaling gateway and the gateway controller, the high availability system controller having policy and procedure code configured to execute when triggered by at least one of the media gateway, the signaling gateway and the gateway controller in response to at least one event.
15. The telecommunication system of claim 14, wherein, if a malfunction occurs, the high availability system controller is alerted from the state information provided by one or more components, and the policy and procedure code executes a diagnostic routine to determine a cause of the malfunction and initiates a power-down procedure for all malfunctioning components.
16. The telecommunication system of claim 15, wherein the policy and procedure code includes routines to activate replacement components to cover for malfunctioning components.
17. The telecommunication system of claim 16, wherein the policy and procedure code includes routines for rerouting voice and signaling traffic to maintain quality of service.
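The controller behaviour described in claims 14 to 17 can be outlined in a short sketch. It is an assumption-laden illustration rather than the claimed system: the class `HighAvailabilityController`, the `healthy` and `error` keys in the reported state, and the component names are all hypothetical, and actions are recorded in a log instead of being carried out.

```python
class HighAvailabilityController:
    """Toy policy-and-procedure code triggered by component state reports."""

    def __init__(self, spares):
        # Maps a component name to its stand-by component (hypothetical).
        self.spares = dict(spares)
        self.log = []

    @staticmethod
    def diagnose(state):
        # Diagnostic routine: here it just reads a reported error code.
        return state.get("error", "unknown")

    def on_event(self, component, state):
        # Components report state gathered by their own monitoring code.
        if state.get("healthy", True):
            return
        cause = self.diagnose(state)
        self.log.append(("diagnosed", component, cause))
        # Claim 15: power down the malfunctioning component.
        self.log.append(("power_down", component))
        spare = self.spares.get(component)
        if spare is not None:
            # Claim 16: activate a replacement component.
            self.log.append(("activate", spare))
            # Claim 17: reroute voice and signaling traffic.
            self.log.append(("reroute", component, spare))


hac = HighAvailabilityController({"media_gateway": "media_gateway_standby"})
hac.on_event("signaling_gateway", {"healthy": True})
hac.on_event("media_gateway", {"healthy": False, "error": "dsp_fault"})
for entry in hac.log:
    print(entry)
```

The healthy report from the signaling gateway produces no action, while the fault report from the media gateway walks through diagnosis, power-down, activation of the stand-by, and rerouting, in that order.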
US10/056,949 2002-01-24 2002-01-24 Architecture for high availability using system management mode driven monitoring and communications Abandoned US20040078681A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US10/056,949 US20040078681A1 (en) 2002-01-24 2002-01-24 Architecture for high availability using system management mode driven monitoring and communications
US11/240,237 US7434085B2 (en) 2002-01-24 2005-09-29 Architecture for high availability using system management mode driven monitoring and communications

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/056,949 US20040078681A1 (en) 2002-01-24 2002-01-24 Architecture for high availability using system management mode driven monitoring and communications

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US11/240,237 Division US7434085B2 (en) 2002-01-24 2005-09-29 Architecture for high availability using system management mode driven monitoring and communications

Publications (1)

Publication Number Publication Date
US20040078681A1 (en) 2004-04-22

Family

ID=32092172

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/056,949 Abandoned US20040078681A1 (en) 2002-01-24 2002-01-24 Architecture for high availability using system management mode driven monitoring and communications
US11/240,237 Expired - Fee Related US7434085B2 (en) 2002-01-24 2005-09-29 Architecture for high availability using system management mode driven monitoring and communications

Family Applications After (1)

Application Number Title Priority Date Filing Date
US11/240,237 Expired - Fee Related US7434085B2 (en) 2002-01-24 2005-09-29 Architecture for high availability using system management mode driven monitoring and communications

Country Status (1)

Country Link
US (2) US20040078681A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060168377A1 (en) * 2005-01-21 2006-07-27 Dell Products L.P. Reallocation of PCI express links using hot plug event
US20070220332A1 (en) * 2006-02-13 2007-09-20 Suresh Marisetty Configurable error handling apparatus and methods to operate the same
JP2012008672A (en) * 2010-06-23 2012-01-12 Lenovo Singapore Pte Ltd Backup method of main memory and data protection system
CN106445784A (en) * 2016-09-27 2017-02-22 北京搜狐新动力信息技术有限公司 Information monitoring method and information monitoring device
WO2020000950A1 (en) * 2018-06-27 2020-01-02 郑州云海信息技术有限公司 Method and device for testing robustness and stability of smm, and storage medium
US20220043915A1 (en) * 2019-04-30 2022-02-10 Hewlett-Packard Development Company, L.P. Storage of network credentials

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5212357B2 (en) * 2007-03-12 2013-06-19 富士通株式会社 Multi-CPU abnormality detection and recovery system, method and program
US20090119748A1 (en) * 2007-08-30 2009-05-07 Jiewen Yao System management mode isolation in firmware
US8260830B2 (en) * 2009-11-09 2012-09-04 Middlecamp William J Adapting a timer bounded arbitration protocol
US8621118B1 (en) * 2010-10-20 2013-12-31 Netapp, Inc. Use of service processor to retrieve hardware information
EP3413532A1 (en) 2017-06-07 2018-12-12 Hewlett-Packard Development Company, L.P. Monitoring control-flow integrity

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5539901A (en) * 1993-09-30 1996-07-23 Intel Corporation Method and apparatus for system management mode support for in-circuit emulators
US5909696A (en) * 1996-06-04 1999-06-01 Intel Corporation Method and apparatus for caching system management mode information with other information
US5910984A (en) * 1993-06-14 1999-06-08 Low; Colin Fault tolerant service-providing apparatus for use in a telecommunications network
US5987604A (en) * 1997-10-07 1999-11-16 Phoenix Technologies, Ltd. Method and apparatus for providing execution of system management mode services in virtual mode
US6122732A (en) * 1998-10-23 2000-09-19 Compaq Computer Corporation System management interrupt for a desktop management interface/system management basic input output system interface function
US6553515B1 (en) * 1999-09-10 2003-04-22 Comdial Corporation System, method and computer program product for diagnostic supervision of internet connections

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2986299B2 (en) * 1992-04-15 1999-12-06 インターナショナル・ビジネス・マシーンズ・コーポレイション Peripheral device connection detection system
US5745770A (en) * 1993-12-27 1998-04-28 Intel Corporation Method and apparatus for servicing simultaneous I/O trap and debug traps in a microprocessor
US6205560B1 (en) * 1996-02-27 2001-03-20 Via-Cyrix, Inc. Debug system allowing programmable selection of alternate debug mechanisms such as debug handler, SMI, or JTAG
US6049672A (en) * 1996-03-08 2000-04-11 Texas Instruments Incorporated Microprocessor with circuits, systems, and methods for operating with patch micro-operation codes and patch microinstruction codes stored in multi-purpose memory structure
US6314532B1 (en) * 1998-12-04 2001-11-06 Lucent Technologies Inc. Method and system for recovering from a software failure
US6697960B1 (en) * 1999-04-29 2004-02-24 Citibank, N.A. Method and system for recovering data to maintain business continuity
WO2001071501A1 (en) * 2000-03-22 2001-09-27 Interwoven Inc. Method and apparatus for storing changes to file attributes without having to store an additional copy of the file contents
US6658590B1 (en) * 2000-03-30 2003-12-02 Hewlett-Packard Development Company, L.P. Controller-based transaction logging system for data recovery in a storage area network
US7111189B1 (en) * 2000-03-30 2006-09-19 Hewlett-Packard Development Company, L.P. Method for transaction log failover merging during asynchronous operations in a data storage network
US6728897B1 (en) * 2000-07-25 2004-04-27 Network Appliance, Inc. Negotiating takeover in high availability cluster
GB0020488D0 (en) * 2000-08-18 2000-10-11 Hewlett Packard Co Trusted status rollback
US6820216B2 (en) * 2001-03-30 2004-11-16 Transmeta Corporation Method and apparatus for accelerating fault handling

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060168377A1 (en) * 2005-01-21 2006-07-27 Dell Products L.P. Reallocation of PCI express links using hot plug event
US20070220332A1 (en) * 2006-02-13 2007-09-20 Suresh Marisetty Configurable error handling apparatus and methods to operate the same
US7533300B2 (en) * 2006-02-13 2009-05-12 Intel Corporation Configurable error handling apparatus and methods to operate the same
JP2012008672A (en) * 2010-06-23 2012-01-12 Lenovo Singapore Pte Ltd Backup method of main memory and data protection system
CN106445784A (en) * 2016-09-27 2017-02-22 北京搜狐新动力信息技术有限公司 Information monitoring method and information monitoring device
WO2020000950A1 (en) * 2018-06-27 2020-01-02 郑州云海信息技术有限公司 Method and device for testing robustness and stability of smm, and storage medium
US11307973B2 (en) 2018-06-27 2022-04-19 Zhengzhou Yunhai Information Technology Co., Ltd. Method and device for testing robustness and stability of SMM, and storage medium
US20220043915A1 (en) * 2019-04-30 2022-02-10 Hewlett-Packard Development Company, L.P. Storage of network credentials

Also Published As

Publication number Publication date
US20060031706A1 (en) 2006-02-09
US7434085B2 (en) 2008-10-07

Similar Documents

Publication Publication Date Title
US7434085B2 (en) Architecture for high availability using system management mode driven monitoring and communications
US6505298B1 (en) System using an OS inaccessible interrupt handler to reset the OS when a device driver failed to set a register bit indicating OS hang condition
US7003775B2 (en) Hardware implementation of an application-level watchdog timer
US7756048B2 (en) Method and apparatus for customizable surveillance of network interfaces
US6622261B1 (en) Process pair protection for complex applications
KR100539202B1 (en) Fault notification system and process using local area network
US8495415B2 (en) Method and system for maintaining backup copies of firmware
US7076689B2 (en) Use of unique XID range among multiple control processors
US7000100B2 (en) Application-level software watchdog timer
US7219254B2 (en) Method and apparatus for high availability distributed processing across independent networked computer fault groups
US20040034816A1 (en) Computer failure recovery and notification system
US6425093B1 (en) Methods and apparatuses for controlling the execution of software on a digital processing system
US7788520B2 (en) Administering a system dump on a redundant node controller in a computer system
US20060259815A1 (en) Systems and methods for ensuring high availability
EP2127215A2 (en) Method and apparatus for hardware assisted takeover
JP2001188684A (en) System and method for selective rejuvenation on transparent time base
US7089413B2 (en) Dynamic computer system reset architecture
US7318171B2 (en) Policy-based response to system errors occurring during OS runtime
US7134046B2 (en) Method and apparatus for high availability distributed processing across independent networked computer fault groups
CA2124772C (en) Processor shelf controller
US7194614B2 (en) Boot swap method for multiple processor computer systems
US20030221141A1 (en) Software-based watchdog method and apparatus
US20090252047A1 (en) Detection of an unresponsive application in a high availability system
US20030065861A1 (en) Dual system masters
US8533528B2 (en) Fault tolerant power sequencer

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTEL CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:RAMIREZ, NICK;REEL/FRAME:012550/0983

Effective date: 20020123

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION