US20080080384A1 - System and method for implementing an infiniband error log analysis model to facilitate faster problem isolation and repair - Google Patents
- Publication number
- US20080080384A1 (application number US 11/537,823)
- Authority
- US
- United States
- Prior art keywords
- network
- event
- interconnect type
- subnet manager
- manager
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0659—Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
Definitions
- the present invention relates in general to the field of data processing systems. More specifically, the present invention relates to the field of diagnosing problems within data processing systems.
- processors implemented in server computers have substantially improved, to the point where processor speeds and bandwidth greatly exceed the capacity of input/output interfaces such as industry standard architecture (ISA), peripheral component interconnect (PCI), Ethernet, etc.
- This capacity inequality limits both server throughput and the speed at which data can be transferred between servers on a network.
- Different server standards have been proposed to improve network performance. The differing server standard proposals led to the development of the InfiniBand Architecture Specification, which was adopted by the InfiniBand Trade Association in October 2000.
- IBA is a clustering fabric
- an entity is needed to initialize, configure, and manage the fabric.
- IBA defines this entity as a “Subnet Manager” (SM), which is tasked with the role of subnet administration.
- the SM performs its tasks in-band (i.e., over IB links) and discovers and initializes devices (e.g., switches, host adapters, etc.) that are coupled to the IB fabric.
- any failures that result in loss of in-band communications are difficult to diagnose and time intensive to remedy.
- Some IB vendors have attempted to address this shortcoming in a variety of methods, such as “problem isolation” documents or applications that communicate out-of-band with the SM. These applications provide the user a view of the fabric and, in case of in-band failures, log events that may be useful in determining the cause of the failure. While the latter approach can yield additional failure information, the scope is limited to only the observations of the SM. As cluster sizes increase, a one-sided view of fabric failures makes problem isolation difficult and may require a “process of elimination” technique of determining the cause of failures.
- a “process of elimination” method is cost-prohibitive, since problem determination entails replacement of non-defective parts. Therefore, there is a need for a system and method for addressing the aforementioned limitations of the prior art in detecting the cause of failure in IB networks.
- the present invention includes a system, method, and computer-readable medium for detecting errors on a network.
- a network error manager retrieves a network topology from a master subnet manager, wherein the network includes a collection of devices coupled by a first interconnect type.
- the network error manager receives from the master subnet manager at least one event notification via a second interconnect type.
- An error log analysis component identifies at least one device among the collection of devices as a possible cause of the connectivity failure in the first interconnect type.
- the network error manager retrieves events from at least one device among the collection of devices that can influence a state of the first interconnect type.
- FIG. 1 is a block diagram illustrating an exemplary InfiniBand network in which a preferred embodiment of the present invention may be implemented
- FIG. 2 is a block diagram depicting an exemplary data processing system according to a preferred embodiment of the present invention
- FIG. 3 is a high-level logical flowchart illustrating an exemplary method for implementing InfiniBand error log analysis model to facilitate faster problem isolation and repair according to a preferred embodiment of the present invention
- FIG. 4 is a block diagram depicting exemplary contents of a system memory in accordance with a preferred embodiment of the present invention.
- FIGS. 5A-5B are high-level logical flowcharts illustrating more detailed steps within an exemplary method for implementing InfiniBand error log analysis model to facilitate faster problem isolation and repair according to a preferred embodiment of the present invention
- an InfiniBand (IB) network includes a Subnet Manager that maintains an accurate topological representation of the network and otherwise oversees network administration.
- a network error manager periodically interrogates the subnet manager for a topological representation of the network and listens for failure notifications, hereinafter referred to as “events”, sent by IB devices that detect an IB communication failure.
- An “IB device” is any device that either implements the network, is attached to the network by means of utilizing an IB device, or a device that can influence the state of IB devices and the state of the IB network. This includes, but is not limited to: switches, adapters, servers/systems, and power supplies.
- the events are forwarded to the network error manager by the subnet manager. Once the network error manager determines that more analysis of a particular event is required, the error manager forwards the event to an error log analysis component.
- the error log analysis component categorizes each received event into at least one of a collection of event pools. After a predetermined time limit each event pool expires. The error log analysis component analyzes each event in the expired event pool for any correlations and/or relations between the events to enable a user to more accurately and efficiently diagnose failing IB devices within an IB network.
- network 100 includes servers 102 a - b, management central server 104 , and InfiniBand (IB) switches 106 a - b.
- Servers 102 a - b are coupled to each other via IB switches 106 a - b and IB adapters 110 a - b.
- servers 102 a - b are coupled to management central server 104 via Ethernet adapters 112 a - c.
- IB switches 106 a - b include master subnet manager 108 a and standby subnet manager 108 b for performing subnet administration.
- master subnet manager 108 a becomes inoperable
- standby subnet manager 108 b takes over the responsibilities of administering the IB connections.
- FIG. 2 is a block diagram depicting an exemplary data processing system 200 that may be utilized to implement servers 102 a - b and management central server 104 as illustrated in FIG. 1 .
- exemplary data processing system 200 includes processor(s) 202 a - n, which are coupled to system memory 204 via system bus 206 .
- system memory 204 may be implemented as a collection of dynamic random access memory (DRAM) modules.
- Mezzanine bus 208 acts as an intermediary between system bus 206 and peripheral bus 214 .
- peripheral bus 214 may be implemented as a peripheral component interconnect (PCI), accelerated graphics port (AGP), or any other peripheral bus.
- Coupled to peripheral bus 214 is hard disk drive 210 , which is utilized by data processing system 200 as a mass storage device. Also coupled to peripheral bus 214 are network adapter 216 and a collection of peripherals 212 a - n. As discussed herein in more detail, network adapter 216 may be implemented using any type of network protocol including, but not limited to, Ethernet, IEEE 802.11x, etc.
- data processing system 200 can include many additional components not specifically illustrated in FIG. 2 . Because such additional components are not necessary for an understanding of the present invention, they are not illustrated in FIG. 2 or discussed further herein. It should also be understood, however, that the enhancements to data processing system 200 to facilitate faster problem isolation and repair provided by the present invention are applicable to data processing systems of any system architecture.
- the present invention is in no way limited to the generalized multi-processor architecture or symmetric multi-processing (SMP) architecture illustrated in FIG. 2 .
- FIG. 4 is a block diagram illustrating exemplary contents of system memory 204 of management central server 104 according to a preferred embodiment of the present invention.
- system memory 204 includes operating system 402 , which further includes shell 406 for providing transparent user access to resources such as application programs 408 .
- shell 406 (as it is called in UNIX®), also called a command processor in Windows®, is a program that provides an interpreter and an interface between the user and the operating system. More specifically, shell 406 executes commands that are entered into a command-line user interface or file.
- shell 406 is generally the highest level of the operating system hierarchy and serves as a command interpreter.
- Shell 406 provides a system prompt, interprets commands entered by keyboard, mouse, or other user input media, and sends the interpreted command(s) to the appropriate lower levels of the operating system (e.g., kernel 404 ) for processing.
- shell 406 is a text-based, line-oriented user interface
- the present invention will support other user-interface modes, such as graphical, voice, gestural, etc. equally well.
- operating system 402 also includes kernel 404 , which includes lower levels of functionality for operating system 402 , including providing essential services required by other parts of operating system 402 and applications 408 , including memory management, process and task management, disk management, and mouse and keyboard management.
- Applications 408 can include a browser, utilized for access to the Internet, word processors, spreadsheets, and other applications.
- system memory 204 includes network error manager 114 and error log analysis (ELA) component 115 , both of which are discussed herein in more detail.
- Network error manager 114 stored within system memory 204 of management central server 104 communicates with master subnet manager 108 a to obtain views of the IB topology as the topology is discovered by master subnet manager 108 a. Also, since management central server 104 is coupled to servers 102 a - b via Ethernet connections, network error manager 114 also collects information from each server 102 a - b pertaining to their respective IB adapters 110 a - b. During operation, an IB failure may result in loss of in-band IB connectivity and multiple IB devices may observe a failure and report to the active subnet manager (e.g., master subnet manager 108 a or standby subnet manager 108 b ).
- the active subnet manager forwards the events to network error manager 114 via the Ethernet connection.
- the ability of Network error manager 114 to obtain events from all affected IB devices via the active subnet manager enables the network error manager 114 and error log analysis component 115 to more accurately and efficiently determine the root cause of the failure.
- An accurate diagnosis of the root cause of the failure allows a user or repair personnel to order replacement parts for only the failing devices. Also, repair time is greatly reduced since typical “process of elimination” diagnosis is not necessary utilizing the present invention.
- some devices within network 100 may be field replaceable units (FRUs), which may be replaced by either a user or a technician on-site, without requiring the server to be returned to the vendor for the repair.
- Network error manager 114 further receives events from the servers. These events describe state changes in the server that can, in turn, result in state changes in the IB network. While network error manager 114 and error log analysis component 115 are not responsible for the callout of such events, these events may be utilized to modify analysis of IB network events. As such, network error manager 114 and error log analysis component 115 can be considered to be alerted to such events, which will be subsequently described as “alerts”.
- Network error manager 114 works in conjunction with error log analysis (ELA) component 115 to gather network-wide asynchronous failure notifications (“events”), perform a first level of analysis per event, and pass important events to ELA component 115 for a final analysis of the event relative to how the particular event correlates to other detected events that may affect network operation. While this embodiment does not include events from software or firmware that is critical to InfiniBand network operation, network error manager 114 may be configured to include such events to notify users of software or firmware errors.
- network error manager 114 interrogates the received events and determines if more data is required to classify the event. Such data may include, but is not limited to, further information regarding potential field replaceable units (FRUs), a time out value (when the event is set to expire), or location information that clarifies the location of the failure.
- network error manager 114 may apply a threshold to an event to throttle reporting to ELA component 115 by network error manager 114 because certain events are more important based on their frequency of occurrence rather than each individual occurrence.
- a threshold may include a minimum number of events of a certain type that must occur before network error manager 114 reports that type of event to ELA component 115 .
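- The minimum-count threshold described above can be sketched as a small counter. This is only an illustration, not the patent's implementation: the class name `EventThrottle`, the event type `link_symbol_error`, and the threshold of 3 are all hypothetical.

```python
from collections import Counter


class EventThrottle:
    """Report an event type only after it has occurred a minimum number of times."""

    def __init__(self, minimums, default_minimum=1):
        # minimums: hypothetical mapping of event type -> required occurrences
        self.minimums = dict(minimums)
        self.default_minimum = default_minimum
        self.counts = Counter()

    def should_report(self, event_type):
        """Count one occurrence and say whether the threshold is now met."""
        self.counts[event_type] += 1
        needed = self.minimums.get(event_type, self.default_minimum)
        return self.counts[event_type] >= needed


throttle = EventThrottle({"link_symbol_error": 3})
decisions = [throttle.should_report("link_symbol_error") for _ in range(4)]
print(decisions)  # [False, False, True, True]
```

- Event types without an explicit threshold fall back to a minimum of one occurrence in this sketch, so they are reported immediately.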
- Network error manager 114 reports the type of event, the detector's location, and location information to ELA component 115 .
- the location information includes all required information to identify all the potential FRUs related to the event.
- Such FRU location information may include, but is not limited to: (1) logical FRU location; (2) physical FRU location; (3) machine type, model, and serial number of the enclosure that contains the device; (4) machine type, model, and serial number of power enclosure that is critical to providing power and servicing the device; (5) part number; and (6) part serial number.
- the location information given must be detailed enough to define a useful hierarchy of device and/or component containers.
- a device can be contained within a frame that has power that influences the device, as well as a chassis that affects the logic function and power for the device, and it may further be considered a part of a particular network of devices.
- the logical FRU location includes fields that enumerate the network, frame, chassis, board, and port associated with the reporting device and event on the device.
- the classes of FRUs are based on their location relative to the device that detected the particular event.
- the main division point between classes is the connection between two ports in network 100 .
- one embodiment could include the possibility of an event from the interface of any connection method between two distinct FRUs.
- There is a local FRU location list that lists all locations on the same side of a connection with respect to the device that detected the event.
- There is a remote FRU list for all locations on the opposite side of a cable/connection with respect to the device that detected the event.
- there is a repeater FRU list that lists all locations between the two ends of a cable/connection with respect to the device that detected the event.
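- One possible way to represent the logical FRU location and the three FRU lists is shown below. The field names follow the network, frame, chassis, board, and port hierarchy named in the text; the class names and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class LogicalLocation:
    """Logical FRU location, ordered from widest to narrowest container."""
    network: str
    frame: Optional[str] = None
    chassis: Optional[str] = None
    board: Optional[str] = None
    port: Optional[str] = None


@dataclass
class Event:
    """An event plus its FRU lists, classed relative to the detecting device."""
    event_type: str
    detector: LogicalLocation
    local_frus: List[LogicalLocation] = field(default_factory=list)
    remote_frus: List[LogicalLocation] = field(default_factory=list)
    repeater_frus: List[LogicalLocation] = field(default_factory=list)


# Hypothetical link between a switch port and an adapter port.
switch_port = LogicalLocation("net0", "frame1", "chassis2", "board3", "port4")
adapter_port = LogicalLocation("net0", "frame5", "chassis1", "board1", "port1")
event = Event("link_down", detector=switch_port,
              local_frus=[switch_port], remote_frus=[adapter_port])
```

- The repeater list stays empty here because this hypothetical link has no repeaters between its two ends.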
- event pools 410 include, but are not limited to, the following pools: switch link 412 a, switch device 412 b, adapter device 412 c, switch device and link 412 d, and alert 412 e.
- Switch link events, categorized in switch link 412 a, all occur on a switch link, which can be either between two switches or between an adapter and a switch. These events involve a connection of some sort between two device ports.
- Network error manager 114 must supply at least the local and remote FRU list information. If there are repeaters between both ports, information regarding these repeaters must be supplied to ELA component 115 .
- Switch device and adapter device events, categorized in switch device 412 b and adapter device 412 c, are similar in that they involve events that are related only to the device that is reporting the event.
- Network error manager 114 must supply the local FRU list information associated with the device.
- Switch device and link events, categorized in switch device and link 412 d, indicate that the detecting FRU may be defective, but the detecting FRU affects the state of one or more links and may cause events to be reported by the other side of the link.
- Alert events are those for which ELA component 115 is not responsible for reporting as serviceable, but are important in that the alert event may induce network events. Alert events are utilized to suppress the reporting of network events as serviceable.
- a “serviceable event” is an event that may be addressed via replacing FRUs by a user or an on-site technician.
- the main purpose of the event pools is to keep similar events together so that they may be properly correlated.
- the pools may be considered a first-level analysis of correlation.
- the one exception to this rule is the alert event, whose events can be correlated across all of the pools.
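- The first-level grouping into pools can be sketched as a lookup from event type to pool. The five pool names come from the text; the event type names and the mapping itself are hypothetical, since the source does not enumerate event types.

```python
from collections import defaultdict
from enum import Enum


class Pool(Enum):
    SWITCH_LINK = "switch link"
    SWITCH_DEVICE = "switch device"
    ADAPTER_DEVICE = "adapter device"
    SWITCH_DEVICE_AND_LINK = "switch device and link"
    ALERT = "alert"


# Hypothetical event types, mapped to the pool each one belongs in.
POOL_FOR_EVENT_TYPE = {
    "link_down": Pool.SWITCH_LINK,
    "switch_fan_failure": Pool.SWITCH_DEVICE,
    "adapter_internal_error": Pool.ADAPTER_DEVICE,
    "switch_board_failure": Pool.SWITCH_DEVICE_AND_LINK,
    "server_power_off": Pool.ALERT,
}


def categorize(events):
    """Group (event_type, detector) pairs by pool so similar events stay together."""
    pools = defaultdict(list)
    for event_type, detector in events:
        pools[POOL_FOR_EVENT_TYPE[event_type]].append((event_type, detector))
    return pools


pools = categorize([
    ("link_down", "sw1"),
    ("server_power_off", "srv1"),
    ("link_down", "sw2"),
])
```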
- a pool “times out” (expires), that is, it no longer accepts new events, so that ELA component 115 can make correlations between the events collected within the pool.
- the “fast” mechanism is defined such that the timeout for the pool is based on a timeout value for the first event in the pool.
- the “slow” mechanism is defined such that the time out for the pool is based on a timeout value for the latest event to arrive in the pool.
- the fast mechanism can suffice for many event relationships.
- the slow mechanism is utilized when there may be a large variance in the time influence of a particular event.
- This defined maximum time is utilized to circumvent the possibility of a pool remaining open indefinitely.
- the maximum time value is chosen based on the event characteristics of network 100 . If the maximum time value is too short, correlation between events may be lost. This would result in events being reported as serviceable when they should not be considered serviceable. In turn, this would result in replacement of non-faulty FRUs. If the chosen maximum value is too long, it may take an inordinate amount of time to report a serviceable event, which can compromise the performance of network 100 .
- the alert pool operates slightly differently in that each event times out individually rather than as a group in the entire alert pool, which takes into account the special influence that alert events have on other events.
- Non-alert pools remain open based on the timeout value and trigger characteristics of the events that are placed within the pool. Once a pool times out, all of the events within the particular pool are compared with one another to determine if and how they relate to each other. The timeout value must take into account latencies for event reporting and event influence. Events may take varying amounts of time to be transferred to ELA component 115 . Furthermore, the influence of one event to another event may not be immediate, so any delayed reactions must be taken into account in the chosen timeout value for a particular pool.
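- The "fast" and "slow" timeout mechanisms and the maximum open time for a non-alert pool might look like the following sketch. Times are plain numbers of seconds; the class name, method names, and the sample values are assumptions rather than anything specified in the source.

```python
class EventPool:
    """A non-alert pool that closes by the "fast" or "slow" rule, capped by a maximum."""

    def __init__(self, mode, max_open_seconds):
        if mode not in ("fast", "slow"):
            raise ValueError("mode must be 'fast' or 'slow'")
        self.mode = mode
        self.max_open_seconds = max_open_seconds
        self.opened_at = None
        self.deadline = None
        self.events = []

    def expired(self, now):
        return self.deadline is not None and now >= self.deadline

    def add(self, event, now, timeout_seconds):
        """Add an event; return False if the pool has already timed out."""
        if self.expired(now):
            return False
        if self.opened_at is None:
            self.opened_at = now
            self.deadline = now + timeout_seconds   # first event sets the deadline
        elif self.mode == "slow":
            self.deadline = now + timeout_seconds   # latest event resets the deadline
        # The maximum keeps a steadily fed "slow" pool from staying open forever.
        self.deadline = min(self.deadline, self.opened_at + self.max_open_seconds)
        self.events.append(event)
        return True


fast = EventPool("fast", max_open_seconds=60)
fast.add("e1", now=0, timeout_seconds=10)
fast.add("e2", now=8, timeout_seconds=10)

slow = EventPool("slow", max_open_seconds=20)
for t in (0, 8, 16):
    slow.add(f"e{t}", now=t, timeout_seconds=10)
```

- In this example the fast pool keeps the deadline set by its first event (10 seconds), while the slow pool would drift to 26 seconds but is capped at its 20-second maximum.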
- There are several characteristics that describe to ELA component 115 how a particular event relates to other events in network 100 :
- Timeout value of a particular event which influences how long a pool can stay open before being analyzed.
- Timeout trigger of an event which influences how long a pool can stay open before being analyzed.
- Priority which, in absence of other correlation techniques, is utilized as a final arbiter to decide which of a group of events reported from the same device has priority to be reported. This minimizes the possibility of multiple events with the same suggested service action.
- Correlation by location is performed based on locality of devices relative to the reporting device. Local correlation is performed relative to devices on the same side of a cable or other connection mechanism as the reporting device. Remote correlation is performed relative to devices on the opposite side of a cable or other connection mechanism as the reporting device. Each characteristic is simply a list of events that are to be tested for correlation.
- the correlation by location is tightly tied to the scope of influence.
- the scope of influence characteristic indicates at what level within a location's scope an event has influence. For example, a board failure may affect multiple ports on that board. Thus, the event associated with such a board failure must be characterized as having a scope of influence that includes the entire board.
- the local FRU list supplied by network error manager 114 is tested with respect to scope of influence to see if two events correlate. For example, assume that a first event includes the following features:
- the first event lists a second event in its local correlation characteristic
- the first event has a scope of influence at the port level in a computer system
- Both the first event and the second event are categorized in the same event pool.
- both the first event and the second event correlate to the same location from the highest level in the location hierarchy down to the port level, then the first event will suppress the reporting of the second event as a serviceable event. However, the second event still has the opportunity to suppress the reporting of any events of which it has correlation by location and scope of influence. Thus, the ability to analyze a chain reaction is maintained.
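- The local correlation test described above can be sketched as a comparison of locations from the top of the hierarchy down to the first event's scope of influence. The dictionary layout, the event type names, and the helper `loc` are hypothetical; only the three conditions (listed in the local correlation characteristic, same pool, same location down to the scope level) come from the text.

```python
LEVELS = ("network", "frame", "chassis", "board", "port")


def same_location(loc_a, loc_b, scope):
    """True if both locations agree from the top level down to `scope`."""
    depth = LEVELS.index(scope) + 1
    return all(loc_a[level] == loc_b[level] for level in LEVELS[:depth])


def locally_suppresses(first, second):
    """True if `first` suppresses `second` by local correlation."""
    if second["type"] not in first["local_correlation"]:
        return False
    if first["pool"] != second["pool"]:
        return False
    return any(
        same_location(a, b, first["scope"])
        for a in first["local_frus"]
        for b in second["local_frus"]
    )


def loc(port):
    # Hypothetical location: everything identical except the port.
    return {"network": "net0", "frame": "f1", "chassis": "c1",
            "board": "b1", "port": port}


board_failure = {"type": "board_failure", "pool": "switch device and link",
                 "scope": "board", "local_correlation": {"port_error"},
                 "local_frus": [loc("p1")]}
port_error = {"type": "port_error", "pool": "switch device and link",
              "scope": "port", "local_correlation": set(),
              "local_frus": [loc("p2")]}

print(locally_suppresses(board_failure, port_error))  # True
```

- With a board-level scope the two events match even though they name different ports; if the first event's scope were narrowed to the port level, the same pair would no longer correlate.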
- Remote correlation is similar to local correlation. However, instead of comparing the local FRU lists for both events, remote correlation compares the remote FRU list for the first event with the local FRU list for the second event, and the local FRU list for the first event with the remote FRU list for the second event. This comparison of locations is also done under the scope of influence characteristic defined in the first event.
- the first event lists the second event in its local correlation characteristic
- the first event has a scope of influence down to the board level
- Both the first event and the second event are categorized in the same event pool.
- both the first event and the second event correlate to the same location from the highest level in the location hierarchy down to the board level, then the first event will suppress the reporting of the second event. If, after all correlations are made, multiple events reported by the same device remain, a priority comparison is made. The event with the higher priority is reported and the other is suppressed.
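- The cross comparison used by remote correlation and the final priority arbitration can be sketched as follows. Two assumptions are made here that the source does not state: a match in either cross comparison counts as a correlation, and a larger number means a higher priority. The sketch covers only the location comparison, not the pool or characteristic-list checks.

```python
LEVELS = ("network", "frame", "chassis", "board", "port")


def same_location(loc_a, loc_b, scope):
    """True if both locations agree from the top level down to `scope`."""
    depth = LEVELS.index(scope) + 1
    return all(loc_a[level] == loc_b[level] for level in LEVELS[:depth])


def remotely_correlated(first, second):
    """Compare first.remote with second.local and first.local with second.remote."""
    scope = first["scope"]

    def any_match(locs_a, locs_b):
        return any(same_location(a, b, scope) for a in locs_a for b in locs_b)

    return (any_match(first["remote_frus"], second["local_frus"])
            or any_match(first["local_frus"], second["remote_frus"]))


def arbitrate_by_priority(events):
    """Keep only the highest-priority event per reporting device."""
    best = {}
    for event in events:
        current = best.get(event["device"])
        if current is None or event["priority"] > current["priority"]:
            best[event["device"]] = event
    return list(best.values())


def loc(frame):
    # Hypothetical location that differs only by frame.
    return {"network": "net0", "frame": frame, "chassis": "c1",
            "board": "b1", "port": "p1"}


first = {"scope": "board", "local_frus": [loc("f1")], "remote_frus": [loc("f2")]}
second = {"local_frus": [loc("f2")], "remote_frus": [loc("f1")]}
unrelated = {"local_frus": [loc("f9")], "remote_frus": [loc("f8")]}

survivors = arbitrate_by_priority([
    {"device": "sw1", "type": "a", "priority": 1},
    {"device": "sw1", "type": "b", "priority": 5},
    {"device": "ad1", "type": "c", "priority": 2},
])
```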
- events are correlated not only based on the relation of types of events and their locality, but also based on when they occurred in time. Two events that occur hours apart are not likely to be related. However, two events that occur within seconds are much more likely to be related. To that end, each event is assigned a timeout value that indicates how long it should be kept in the pool before being reported. During the time that the event is in the pool, it can be related to other events based on correlation and priority characteristics. If it is not suppressed during the timeout period, then it will be reported as a serviceable event.
- Once ELA component 115 has a serviceable event to open, ELA component 115 calls another method to open the event into a tracking database that presents the serviceable events to users. This tracking database allows users to see currently open and closed events, and to indicate what types of actions the users have taken with respect to resolving a serviceable event. Finally, when the user is satisfied, the user may close the particular event.
- FIG. 3 is a high-level logical flowchart illustrating an exemplary method for implementing InfiniBand (IB) error log analysis model to facilitate faster problem isolation and repair according to a preferred embodiment of the present invention.
- the process begins at step 300 and proceeds to step 302 , which illustrates network error manager 114 receiving a topology of network 100 from master subnet manager 108 a.
- standby subnet manager 108 b takes over the responsibilities of master subnet manager 108 a.
- step 304 illustrates a determination made by at least one IB device (e.g., IB adapter 110 a - b, IB switch 106 a - b, etc.) if there is a loss of IB connectivity. If there is no loss of IB connectivity, network error manager 114 continues monitoring network 100 , as depicted in step 306 . The process returns to step 304 and continues in an iterative fashion.
- At least one IB device detects a loss of IB connectivity
- at least one connectivity event is sent by each IB device that detects loss of IB connectivity via Ethernet adapters 112 a - b.
- the at least one connectivity event is received by network error manager 114 via Ethernet adapter 112 c, as illustrated in step 308 .
- Network error manager 114 identifies possible causes of the IB connectivity failure, as illustrated in step 310 .
- the process returns to step 306 and proceeds in an iterative fashion.
- FIG. 5A is a high-level logical flowchart that depicts step 308 of FIG. 3 in more detail in accordance with a preferred embodiment of the present invention.
- the process begins at step 500 , and proceeds to step 502 , which illustrates network error manager 114 monitoring the active subnet manager for asynchronous events sent from devices within network 100 that have detected communication failures within the network.
- step 504 illustrates network error manager 114 determining if an event has been received. If an event has not been received, the process returns to step 502 and continues in an iterative fashion. If an event has been received, the process proceeds to step 506 , which shows network error manager 114 determining whether the event should be forwarded to ELA component 115 .
- network error manager 114 makes this determination by requesting more information regarding the event, if needed, and analyzing the frequency of the event. If network error manager 114 decides not to forward the event to ELA component 115 , the event is discarded and the process returns to step 502 and proceeds in an iterative fashion. If network error manager 114 decides to forward the event to ELA component 115 , the event is forwarded and ELA component 115 categorizes the received event into an event pool. The process returns to step 502 and proceeds in an iterative fashion.
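- The decision flow of steps 502 through 506 can be sketched as a single pass over received events. All four callables are hypothetical hooks standing in for behavior the source leaves unspecified: how missing data is detected, how it is fetched, how frequency is judged, and how an event reaches the ELA component.

```python
def intake(events, needs_details, fetch_details, frequent_enough, forward):
    """Enrich each received event if needed, then forward it or discard it."""
    forwarded = 0
    for event in events:                      # one iteration per received event
        if needs_details(event):              # request more information if required
            event = {**event, **fetch_details(event)}
        if frequent_enough(event):            # frequency check before forwarding
            forward(event)
            forwarded += 1
        # Otherwise the event is discarded and monitoring continues.
    return forwarded


forwarded_log = []
count = intake(
    events=[{"type": "link_down"}, {"type": "noise", "location": "sw1"}],
    needs_details=lambda e: "location" not in e,
    fetch_details=lambda e: {"location": "unknown-until-queried"},
    frequent_enough=lambda e: e["type"] != "noise",
    forward=forwarded_log.append,
)
```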
- FIG. 5B is a high-level logical flowchart that shows step 310 of FIG. 3 in more detail in accordance with a preferred embodiment of the present invention.
- Step 310 illustrates the identification of the possible cause of IB failure within network 100 .
- the process begins at step 510 and proceeds to step 512 , which illustrates ELA component 115 determining if an event pool has expired (due to the previously discussed predetermined timeout values). If no event pools have expired, the process continues to step 514 , which shows ELA component 115 continuing to categorize received events from network error manager 114 into event pools 410 .
- step 516 which illustrates ELA component 115 determining if any correlations (location, scope of influence, timeout values, etc.) exist between the events in the expired event pool.
- step 518 depicts ELA component 115 presenting at least one serviceable event (e.g., an event that may be remedied by the user or an on-site technician through the replacement of at least one FRU) to assist in diagnosis of the cause of communication failure.
- step 520 which illustrates ELA component 115 waiting for the next event pool to expire. The process returns to step 512 and proceeds in an iterative fashion.
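- A minimal sketch of the expired-pool analysis in steps 512 through 518 follows. It assumes a caller-supplied `suppresses(a, b)` predicate that stands in for the correlation tests, and a plain dictionary per pool; both are illustrative choices. An event is reported as serviceable only if no other event in the same expired pool suppresses it, and a suppressed event can still suppress others, which preserves the chain-reaction behavior the text describes.

```python
def serviceable_events(pool_events, suppresses):
    """Return the events that no other event in the pool suppresses."""
    result = []
    for i, event in enumerate(pool_events):
        suppressed = any(
            suppresses(other, event)
            for j, other in enumerate(pool_events)
            if j != i
        )
        if not suppressed:
            result.append(event)
    return result


def sweep(pools, now, suppresses):
    """Analyze every pool whose deadline has passed; leave open pools alone."""
    reports = {}
    for name, pool in pools.items():
        if now >= pool["deadline"]:
            reports[name] = serviceable_events(pool["events"], suppresses)
    return reports


# Hypothetical chain: A suppresses B, and B suppresses C.
suppression = {"A": {"B"}, "B": {"C"}}
pools = {
    "switch link": {"deadline": 10, "events": ["A", "B", "C"]},
    "adapter device": {"deadline": 50, "events": ["D"]},
}
reports = sweep(pools, now=12,
                suppresses=lambda a, b: b in suppression.get(a, set()))
print(reports)  # {'switch link': ['A']}
```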
- Program code defining functions in the present invention can be delivered to a data storage system or a computer system via a variety of signal-bearing media, which include, without limitation, non-writable storage media (e.g., CD-ROM), writable storage media (e.g., hard disk drive, read/write CD-ROM, optical media), system memory such as, but not limited to Random Access Memory (RAM), and communication media, such as computer and telephone networks including Ethernet, the Internet, wireless networks, and like network systems.
- signal-bearing media when carrying or encoding computer-readable instructions that direct method functions in the present invention represent alternative embodiments of the present invention.
- the present invention may be implemented by a system having means in the form of hardware, software, or a combination of software and hardware as described herein or their equivalent.
Abstract
Description
- 1. Technical Field
- The present invention relates in general to the field of data processing systems. More specifically, the present invention relates to the field of diagnosing problems within data processing systems.
- 2. Description of the Related Art
- In recent years, hardware and software developers have improved server architectures and designs with the goal of more robust and reliable servers for mission critical networking applications. For example, some server applications require that servers respond to client requests in a highly reliable manner.
- Additionally, processors implemented in server computers have substantially improved, to the point where processor speeds and bandwidth greatly exceed the capacity of input/output interfaces such as industry standard architecture (ISA), peripheral component interconnect (PCI), Ethernet, etc. This capacity inequality limits both server throughput and the speed at which data can be transferred between servers on a network. Different server standards have been proposed to improve network performance. The differing server standard proposals led to the development of the InfiniBand Architecture Specification, which was adopted by the InfiniBand Trade Association in October 2000.
- The InfiniBand Architecture (IBA) specifications define InfiniBand operation but limit the scope of the architecture to functions that can be performed only over the InfiniBand wires. Given that IBA is a clustering fabric, an entity is needed to initialize, configure, and manage the fabric. IBA defines this entity as a “Subnet Manager” (SM), which is tasked with the role of subnet administration. The SM performs its tasks in-band (i.e., over IB links) and discovers and initializes devices (e.g., switches, host adapters, etc.) that are coupled to the IB fabric.
- With the IBA's scope limited to in-band functionality only, any failures that result in loss of in-band communications are difficult to diagnose and time intensive to remedy. Some IB vendors have attempted to address this shortcoming in a variety of ways, such as “problem isolation” documents or applications that communicate out-of-band with the SM. These applications provide the user a view of the fabric and, in case of in-band failures, log events that may be useful in determining the cause of the failure. While the latter approach can yield additional failure information, the scope is limited to only the observations of the SM. As cluster sizes increase, a one-sided view of fabric failures makes problem isolation difficult and may require a “process of elimination” technique for determining the cause of failures. A “process of elimination” method is cost-prohibitive, since problem determination entails replacement of non-defective parts. Therefore, there is a need for a system and method for addressing the aforementioned limitations of the prior art in detecting the cause of failure in IB networks.
- The present invention includes a system, method, and computer-readable medium for detecting errors on a network. According to a preferred embodiment of the present invention, a network error manager retrieves a network topology from a master subnet manager, wherein the network includes a collection of devices coupled by a first interconnect type. When a connectivity failure is detected in the first interconnect type, the network error manager receives from the master subnet manager at least one event notification via a second interconnect type. An error log analysis component identifies at least one device among the collection of devices as a possible cause of the connectivity failure in the first interconnect type. The network error manager retrieves events from at least one device among the collection of devices that can influence a state of the first interconnect type.
- The above, as well as additional purposes, features, and advantages of the present invention will become apparent in the following detailed written description.
- The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further purposes and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying figures, wherein:
- FIG. 1 is a block diagram illustrating an exemplary InfiniBand network in which a preferred embodiment of the present invention may be implemented;
- FIG. 2 is a block diagram depicting an exemplary data processing system according to a preferred embodiment of the present invention;
- FIG. 3 is a high-level logical flowchart illustrating an exemplary method for implementing InfiniBand error log analysis model to facilitate faster problem isolation and repair according to a preferred embodiment of the present invention;
- FIG. 4 is a block diagram depicting exemplary contents of a system memory in accordance with a preferred embodiment of the present invention; and
- FIGS. 5A-5B are high-level logical flowcharts illustrating more detailed steps within an exemplary method for implementing InfiniBand error log analysis model to facilitate faster problem isolation and repair according to a preferred embodiment of the present invention.
- The present invention includes a system and method for implementing InfiniBand error log analysis model to facilitate faster problem isolation and repair. According to one embodiment of the present invention, an InfiniBand (IB) network includes a Subnet Manager that maintains an accurate topological representation of the network and otherwise oversees network administration. A network error manager periodically interrogates the subnet manager for a topological representation of the network and listens for failure notifications, hereinafter referred to as “events”, sent by IB devices that detect an IB communication failure. An “IB device” is any device that either implements the network, is attached to the network by means of utilizing an IB device, or a device that can influence the state of IB devices and the state of the IB network. This includes, but is not limited to: switches, adapters, servers/systems, and power supplies.
- The events are forwarded to the network error manager by the subnet manager. Once the network error manager determines that more analysis of a particular event is required, the error manager forwards the event to an error log analysis component. The error log analysis component categorizes each received event into at least one of a collection of event pools. After a predetermined time limit each event pool expires. The error log analysis component analyzes each event in the expired event pool for any correlations and/or relations between the events to enable a user to more accurately and efficiently diagnose failing IB devices within an IB network.
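- The description later names a frequency threshold as one criterion for this forwarding decision: a minimum number of events of a given type must occur before that type is reported to the error log analysis component. The following is a minimal sketch of that idea only. The class name, the event type strings, and the threshold values are hypothetical, and forwarding every occurrence once the threshold is reached is an assumption rather than something the description states.

```python
from collections import Counter


class EventThrottle:
    """Forward an event type only after it has occurred a minimum number of times."""

    def __init__(self, thresholds, default=1):
        self.thresholds = thresholds  # event type -> minimum count (assumed values)
        self.default = default        # types with no threshold are forwarded at once
        self.counts = Counter()

    def should_forward(self, event_type):
        # Count this occurrence, then compare against the type's threshold.
        self.counts[event_type] += 1
        return self.counts[event_type] >= self.thresholds.get(event_type, self.default)


# Hypothetical event type with an assumed threshold of three occurrences.
throttle = EventThrottle({"link_symbol_error": 3})
decisions = [throttle.should_forward("link_symbol_error") for _ in range(4)]
```

Here the first two occurrences are held back and the third and fourth are forwarded, while an event type with no configured threshold would be forwarded on its first occurrence.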
- Referring now to the figures, and in particular to FIG. 1, there is illustrated a block diagram depicting an exemplary network 100 in which a preferred embodiment of the present invention may be implemented. As illustrated, network 100 includes servers 102a-b, management central server 104, and InfiniBand (IB) switches 106a-b. Servers 102a-b are coupled to each other via IB switches 106a-b and IB adapters 110a-b. However, servers 102a-b are coupled to management central server 104 via Ethernet adapters 112a-c. As previously discussed, a subnet manager administers the IB connections by discovering and initializing devices (e.g., switches, host adapters, etc.) that are connected to the IB fabric. Therefore, IB switches 106a-b include master subnet manager 108a and standby subnet manager 108b for performing subnet administration. In the event that master subnet manager 108a becomes inoperable, standby subnet manager 108b takes over the responsibilities of administering the IB connections. Those with skill in the art will appreciate that the present invention is not limited to two servers, but may accommodate any number of servers in network 100.
- FIG. 2 is a block diagram depicting an exemplary data processing system 200 that may be utilized to implement servers 102a-b and management central server 104 as illustrated in FIG. 1. As depicted, exemplary data processing system 200 includes processor(s) 202a-n, which are coupled to system memory 204 via system bus 206. Preferably, system memory 204 may be implemented as a collection of dynamic random access memory (DRAM) modules. Mezzanine bus 208 acts as an intermediary between system bus 206 and peripheral bus 214. Those with skill in this art will appreciate that peripheral bus 214 may be implemented as a peripheral component interconnect (PCI), accelerated graphics port (AGP), or any other peripheral bus. Coupled to peripheral bus 214 is hard disk drive 210, which is utilized by data processing system 200 as a mass storage device. Also coupled to peripheral bus 214 are network adapter 216 and a collection of peripherals 212a-n. As discussed herein in more detail, network adapter 216 may be implemented by any type of network protocol including, but not limited to, Ethernet, IEEE 802.11x, etc.
- Those skilled in the art will appreciate that data processing system 200 can include many additional components not specifically illustrated in FIG. 2. Because such additional components are not necessary for an understanding of the present invention, they are not illustrated in FIG. 2 or discussed further herein. It should also be understood, however, that the enhancements to data processing system 200 to facilitate faster problem isolation and repair provided by the present invention are applicable to data processing systems of any system architecture. The present invention is in no way limited to the generalized multi-processor architecture or symmetric multi-processing (SMP) architecture illustrated in FIG. 2.
- FIG. 4 is a block diagram illustrating exemplary contents of system memory 204 of management central server 104 according to a preferred embodiment of the present invention. As illustrated, system memory 204 includes operating system 402, which further includes shell 406 for providing transparent user access to resources such as application programs 408. Generally, shell 406 (as it is called in UNIX®), also called a command processor in Windows®, is a program that provides an interpreter and an interface between the user and the operating system. More specifically, shell 406 executes commands that are entered into a command-line user interface or file. Thus, shell 406 is generally the highest level of the operating system hierarchy and serves as a command interpreter. Shell 406 provides a system prompt, interprets commands entered by keyboard, mouse, or other user input media, and sends the interpreted command(s) to the appropriate lower levels of the operating system (e.g., kernel 404) for processing. Note that while shell 406 is a text-based, line-oriented user interface, the present invention will support other user-interface modes, such as graphical, voice, gestural, etc. equally well.
- As depicted, operating system 402 also includes kernel 404, which includes lower levels of functionality for operating system 402, including providing essential services required by other parts of operating system 402 and applications 408, including memory management, process and task management, disk management, and mouse and keyboard management. Applications 408 can include a browser, utilized for access to the Internet, word processors, spreadsheets, and other applications. Also, as illustrated in FIG. 4, system memory 204 includes network error manager 114 and error log analysis (ELA) component 115, both of which are discussed herein in more detail.
- Network error manager 114, stored within system memory 204 of management central server 104, communicates with master subnet manager 108a to obtain views of the IB topology as the topology is discovered by master subnet manager 108a. Also, since management central server 104 is coupled to servers 102a-b via Ethernet connections, network error manager 114 also collects information from each server 102a-b pertaining to their respective IB adapters 110a-b. During operation, an IB failure may result in loss of in-band IB connectivity, and multiple IB devices may observe a failure and report to the active subnet manager (e.g., master subnet manager 108a or standby subnet manager 108b). The active subnet manager forwards the events to network error manager 114 via the Ethernet connection. The ability of network error manager 114 to obtain events from all affected IB devices via the active subnet manager enables network error manager 114 and error log analysis component 115 to more accurately and efficiently determine the root cause of the failure. An accurate diagnosis of the root cause of the failure allows a user or repair personnel to order replacement parts for only the failing devices. Also, repair time is greatly reduced since the typical “process of elimination” diagnosis is not necessary when utilizing the present invention. In one embodiment of the present invention, some devices (e.g., IB adapters 110a, etc.) within network 100 may be field replaceable units (FRUs), which may be replaced by either a user or a technician on-site, without requiring the server to be returned to the vendor for the repair.
- Network error manager 114 further receives events from the servers. These events describe state changes in the server that can, in turn, result in state changes in the IB network. While network error manager 114 and error log analysis component 115 are not responsible for the callout of such events, these events may be utilized to modify analysis of IB network events. As such, network error manager 114 and error log analysis component 115 can be considered to be alerted to such events, which will be subsequently described as “alerts”.
- Network error manager 114 works in conjunction with error log analysis (ELA) component 115 to gather network-wide asynchronous failure notifications (“events”), perform a first level of analysis per event, and pass important events to ELA component 115 for a final analysis of how the particular event correlates to other detected events that may affect network operation. While this embodiment does not include events from software or firmware that is critical to InfiniBand network operation, network error manager 114 may be configured to include such events to notify users of software or firmware errors.
- To perform the first level of analysis, network error manager 114 interrogates the received events and determines if more data is required to classify the event. Such data may include, but is not limited to, further information regarding potential field replaceable units (FRUs), a timeout value (when the event is set to expire), or location information that clarifies the location of the failure. In an embodiment of the present invention, network error manager 114 may apply a threshold to an event to throttle reporting to ELA component 115, because certain events are more important based on their frequency of occurrence rather than each individual occurrence. Such a threshold may include a minimum number of events of a certain type that must occur before network error manager 114 reports that type of event to ELA component 115.
- Network error manager 114 reports the type of event, the detector's location, and location information to ELA component 115. The location information includes all required information to identify all the potential FRUs related to the event. Such FRU location information may include, but is not limited to: (1) logical FRU location; (2) physical FRU location; (3) machine type, model, and serial number of the enclosure that contains the device; (4) machine type, model, and serial number of the power enclosure that is critical to providing power and servicing the device; (5) part number; and (6) part serial number. The location information given must be detailed enough to define a useful hierarchy of device and/or component containers. For example, a device can be contained within a frame that has power that influences the device, as well as a chassis that affects the logic function and power for the device, and it may further be considered a part of a particular network of devices. In one embodiment, the logical FRU location includes fields that enumerate the network, frame, chassis, board, and port associated with the reporting device and the event on the device.
- In one embodiment, there are three classes of FRUs that may be reported. Because ELA component 115 is concerned with analyzing events, the classes of FRUs are based on their location relative to the device that detected the particular event. The main division point between classes is the connection between two ports in network 100. However, one embodiment could include the possibility of an event from the interface of any connection method between two distinct FRUs. There is a local FRU location list that lists all locations on the same side of a connection with respect to the device that detected the event. There is a remote FRU list for all locations on the opposite side of a cable/connection with respect to the device that detected the event. Also, there is a repeater FRU list that lists all locations between the two ends of a cable/connection with respect to the device that detected the event.
- When an event is reported to ELA component 115, each event is categorized into one of several event pools 410 that are utilized to relate events by location and type. As shown in FIG. 4, event pools 410 include, but are not limited to, the following pools: switch link 412a, switch device 412b, adapter device 412c, switch device and link 412d, and alert 412e.
- Switch link events, categorized in switch link 412a, all occur on a switch link, which can be either between two switches or between an adapter and a switch. These events involve a connection of some sort between two device ports. Network error manager 114 must supply at least the local and remote FRU list information. If there are repeaters between both ports, information regarding these repeaters must be supplied to ELA component 115.
- Switch device and adapter device events, categorized in switch device 412b and adapter device 412c, are similar in that they involve events that are related only to the device that is reporting the event. Network error manager 114 must supply the local FRU list information associated with the device.
- Switch device and link events, categorized in switch device and link 412d, indicate that the detecting FRU may be defective, but the detecting FRU affects the state of one or more links and may cause events to be reported by the other side of the link.
- Alert events, categorized in alert 412e, are those that ELA component 115 is not responsible for reporting as serviceable, but that are important in that they may induce network events. Alert events are utilized to suppress the reporting of network events as serviceable. A “serviceable event” is an event that may be addressed via replacing FRUs by a user or an on-site technician.
- The main purpose of the event pools is to keep similar events together so that they may be properly correlated. The pools may be considered a first-level analysis of correlation. The one exception to this rule is the alert event, whose events can be correlated across all of the pools.
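- The five event pools and the FRU lists required for each can be sketched as a small data model. This is an illustrative sketch, not the patented implementation: the `Event` fields, the FRU strings, and the event type names are hypothetical, and treating the two pools for which the description states no FRU list requirement as always valid is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Pool(Enum):
    SWITCH_LINK = "switch link"
    SWITCH_DEVICE = "switch device"
    ADAPTER_DEVICE = "adapter device"
    SWITCH_DEVICE_AND_LINK = "switch device and link"
    ALERT = "alert"


@dataclass
class Event:
    event_type: str
    pool: Pool
    local_frus: list = field(default_factory=list)     # same side as the detector
    remote_frus: list = field(default_factory=list)    # opposite side of the connection
    repeater_frus: list = field(default_factory=list)  # between the two ends


def validate(event):
    """Check that the FRU lists required for the event's pool are present."""
    if event.pool is Pool.SWITCH_LINK:
        return bool(event.local_frus) and bool(event.remote_frus)
    if event.pool in (Pool.SWITCH_DEVICE, Pool.ADAPTER_DEVICE):
        return bool(event.local_frus)
    return True  # assumption: no stated requirement for the remaining pools


def categorize(events):
    """Group events by pool, rejecting events that lack required FRU lists."""
    pools = {pool: [] for pool in Pool}
    for event in events:
        if not validate(event):
            raise ValueError(f"missing FRU list for {event.event_type}")
        pools[event.pool].append(event)
    return pools


# Hypothetical events and FRU location strings.
events = [
    Event("link_down", Pool.SWITCH_LINK,
          local_frus=["switch-1/port-3"], remote_frus=["adapter-2/port-1"]),
    Event("board_fault", Pool.SWITCH_DEVICE, local_frus=["switch-1/board-2"]),
    Event("server_power_off", Pool.ALERT),
]
pools = categorize(events)
```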
- After a predetermined period of time, a pool “times out” (expires); that is, it no longer accepts new events, so that ELA component 115 can make correlations between the collected events within the pool. There are two trigger mechanisms utilized to control when each pool expires. The “fast” mechanism is defined such that the timeout for the pool is based on a timeout value for the first event in the pool. The “slow” mechanism is defined such that the timeout for the pool is based on a timeout value for the latest event to arrive in the pool.
- The fast mechanism can suffice for many event relationships. However, the slow mechanism is utilized when there may be a large variance in the time influence of a particular event. Along with the slow mechanism, there is a defined maximum time for a pool to remain open. This defined maximum time is utilized to circumvent the possibility of a pool remaining open indefinitely. The maximum time value is chosen based on the event characteristics of network 100. If the maximum time value is too short, correlation between events may be lost. This would result in events being reported as serviceable when they should not be considered serviceable. In turn, this would result in replacement of non-faulty FRUs. If the chosen maximum value is too long, it may take an inordinate amount of time to report a serviceable event, which can compromise the performance of network 100.
- Finally, the alert pool operates slightly differently in that each event times out individually rather than as a group in the entire alert pool, which takes into account the special influence that alert events have on other events. Non-alert pools remain open based on the timeout value and trigger characteristics of the events that are placed within the pool. Once a pool times out, all of the events within the particular pool are compared with one another to determine if and how they relate to each other. The timeout value must take into account latencies for event reporting and event influence. Events may take varying amounts of time to be transferred to ELA component 115. Furthermore, the influence of one event on another event may not be immediate, so any delayed reactions must be taken into account in the chosen timeout value for a particular pool.
- There are several characteristics that describe to ELA component 115 how a particular event relates to other events in network 100:
- (1) Correlation by location, which can be either local or remote locations.
- (2) Scope of influence, which is utilized to describe how many locations a specific event may affect.
- (3) Timeout value of a particular event, which influences how long a pool can stay open before being analyzed.
- (4) Timeout trigger of an event, which influences how long a pool can stay open before being analyzed.
- (5) Priority, which, in absence of other correlation techniques, is utilized as a final arbiter to decide which of a group of events reported from the same device has priority to be reported. This minimizes the possibility of multiple events with the same suggested service action.
- (6) Time of reporting, in which the earliest reported event takes precedence over any failure notification of equal priority at the same location.
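- The timeout value and timeout trigger characteristics (items 3 and 4), together with the “fast” and “slow” mechanisms described earlier, can be sketched as follows. This is a minimal sketch under stated assumptions: times are plain numbers in seconds, events are added in arrival order, and the maximum open time is measured from the arrival of the first event. None of these details, nor the names used, come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class EventPool:
    trigger: str      # "fast" or "slow"
    max_open: float   # assumed cap, measured from the first event's arrival
    arrivals: list = field(default_factory=list)  # (arrival_time, timeout) pairs

    def add(self, arrival_time, timeout):
        self.arrivals.append((arrival_time, timeout))

    def expires_at(self):
        """Return the time at which the pool stops accepting events."""
        if not self.arrivals:
            return None
        first_time, first_timeout = self.arrivals[0]
        if self.trigger == "fast":
            # Fast: expiry is fixed by the first event in the pool.
            return first_time + first_timeout
        # Slow: each new event pushes expiry out, bounded by the maximum open time.
        last_time, last_timeout = self.arrivals[-1]
        return min(last_time + last_timeout, first_time + self.max_open)


fast = EventPool("fast", max_open=60)
slow = EventPool("slow", max_open=60)
capped = EventPool("slow", max_open=15)
for pool in (fast, slow, capped):
    pool.add(0, 10)   # first event at t=0 with a 10 s timeout
    pool.add(8, 10)   # second event at t=8 with a 10 s timeout
```

With these inputs the fast pool expires at t=10, the slow pool at t=18, and the capped slow pool at t=15, which shows how the maximum keeps a slow pool from staying open indefinitely.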
- Correlation by location is performed based on locality of devices relative to the reporting device. Local correlation is performed relative to devices on the same side of a cable or other connection mechanism as the reporting device. Remote correlation is performed relative to devices on the opposite side of a cable or other connection mechanism as the reporting device. Each characteristic is simply a list of events that are to be tested for correlation.
- The correlation by location is tightly tied to the scope of influence. The scope of influence characteristic indicates at what level within a location's scope an event has influence. For example, a board failure may affect multiple ports on that board. Thus, the event associated with such a board failure must be characterized as having a scope of influence that includes the entire board.
- For local correlation, the local FRU list supplied by network error manager 114 is tested with respect to scope of influence to see if two events correlate. For example, assume that a first event includes the following features:
- (1) The first event lists a second event in its local correlation characteristic;
- (2) The first event has a scope of influence at the port level in a computer system; and
- (3) Both the first event and the second event are categorized in the same event pool.
- If both the first event and the second event correlate to the same location from the highest level in the location hierarchy down to the port level, then the first event will suppress the reporting of the second event as a serviceable event. However, the second event still has the opportunity to suppress the reporting of any events of which it has correlation by location and scope of influence. Thus, the ability to analyze a chain reaction is maintained.
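- The local correlation test and the resulting chain of suppression can be sketched as follows. The sketch assumes a logical location is a tuple ordered as (network, frame, chassis, board, port), that two events correlate when any pair of their local FRU locations matches down to the first event's scope of influence, and that events are plain dictionaries. The field names and event types are hypothetical.

```python
LEVELS = ("network", "frame", "chassis", "board", "port")


def same_location(loc_a, loc_b, scope):
    """Compare two locations from the top of the hierarchy down to the scope level."""
    depth = LEVELS.index(scope) + 1
    return loc_a[:depth] == loc_b[:depth]


def locally_correlated(first, second):
    """True if the first event suppresses the second by local correlation."""
    if second["type"] not in first["local_correlation"]:
        return False
    if first["pool"] != second["pool"]:
        return False
    return any(same_location(a, b, first["scope"])
               for a in first["local_frus"] for b in second["local_frus"])


def serviceable(events):
    """Return events not suppressed by any other event in the expired pool.

    A suppressed event is still allowed to suppress others, which preserves
    the chain reaction described above.
    """
    suppressed = set()
    for i, first in enumerate(events):
        for j, second in enumerate(events):
            if i != j and locally_correlated(first, second):
                suppressed.add(j)
    return [e for k, e in enumerate(events) if k not in suppressed]


port3 = ("net1", "frame1", "chassis1", "board2", "port3")
board_failure = {"type": "board_failure", "pool": "switch device", "scope": "board",
                 "local_correlation": {"port_down"},
                 "local_frus": [("net1", "frame1", "chassis1", "board2", None)]}
port_down = {"type": "port_down", "pool": "switch device", "scope": "port",
             "local_correlation": {"port_retrain"}, "local_frus": [port3]}
port_retrain = {"type": "port_retrain", "pool": "switch device", "scope": "port",
                "local_correlation": set(), "local_frus": [port3]}

reported = serviceable([board_failure, port_down, port_retrain])
```

In this example the board failure suppresses the port-down event at board scope, and the port-down event in turn suppresses the port-retrain event at port scope, so only the board failure remains to be reported.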
- Remote correlation is similar to local correlation. However, instead of comparing the local FRU lists for both events, remote correlation compares the remote FRU list for the first event with the local FRU list for the second event, and the local FRU list for the first event with the remote FRU list for the second event. This comparison of locations is also done under the scope of influence characteristic defined in the first event.
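- The cross comparison of FRU lists for remote correlation can be sketched in the same style. The description does not say whether both cross comparisons must match or whether one suffices; this sketch assumes that either match is enough. The `remote_correlation` field, the tuple layout of a location, and the event types are likewise hypothetical.

```python
LEVELS = ("network", "frame", "chassis", "board", "port")


def same_location(loc_a, loc_b, scope):
    """Compare two locations from the top of the hierarchy down to the scope level."""
    depth = LEVELS.index(scope) + 1
    return loc_a[:depth] == loc_b[:depth]


def remotely_correlated(first, second):
    """True if the first event suppresses the second by remote correlation."""
    if second["type"] not in first["remote_correlation"]:
        return False
    if first["pool"] != second["pool"]:
        return False
    scope = first["scope"]  # the first event's scope governs both comparisons
    remote_vs_local = any(same_location(r, l, scope)
                          for r in first["remote_frus"] for l in second["local_frus"])
    local_vs_remote = any(same_location(l, r, scope)
                          for l in first["local_frus"] for r in second["remote_frus"])
    return remote_vs_local or local_vs_remote  # assumption: either match suffices


switch_port = ("net1", "frame1", "chassis1", "board2", "port3")
adapter_port = ("net1", "frame4", "chassis2", "board1", "port1")

switch_event = {"type": "switch_link_down", "pool": "switch link", "scope": "port",
                "remote_correlation": {"adapter_link_down"},
                "local_frus": [switch_port], "remote_frus": [adapter_port]}
adapter_event = {"type": "adapter_link_down", "pool": "switch link", "scope": "port",
                 "remote_correlation": set(),
                 "local_frus": [adapter_port], "remote_frus": [switch_port]}

forward = remotely_correlated(switch_event, adapter_event)
reverse = remotely_correlated(adapter_event, switch_event)
```

Here the switch-side event suppresses the adapter-side event because each event's remote FRU is the other's local FRU, while the reverse test fails because the adapter event does not list the switch event in its correlation characteristic.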
- For example, assume the first event has the following features:
- (1) The first event lists the second event in its local correlation characteristic;
- (2) The first event has a scope of influence down to the board level; and
- (3) Both the first event and the second event are categorized in the same event pool.
- If both the first event and the second event correlate to the same location from the highest level in the location hierarchy down to the board level, then the first event will suppress the reporting of the second event. If after all correlations are made and there remain multiple events reported by the same device, a priority comparison is made. The event with the higher priority is reported and the other is suppressed.
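- The final priority comparison, combined with the time-of-reporting characteristic listed earlier, can be sketched as a per-device arbitration. The sketch assumes that a larger number means a higher priority and that report times are plain numbers; the field names and event types are hypothetical.

```python
def arbitrate(events):
    """Keep one event per reporting device.

    The highest priority wins; among events of equal priority, the earliest
    reported event wins.
    """
    winners = {}
    for event in events:
        best = winners.get(event["device"])
        # Negating the report time makes an earlier report compare as greater.
        key = (event["priority"], -event["reported_at"])
        if best is None or key > (best["priority"], -best["reported_at"]):
            winners[event["device"]] = event
    return list(winners.values())


remaining = [
    {"device": "switch-1", "type": "link_degraded", "priority": 2, "reported_at": 5},
    {"device": "switch-1", "type": "link_down", "priority": 4, "reported_at": 9},
    {"device": "adapter-2", "type": "port_error", "priority": 3, "reported_at": 7},
    {"device": "adapter-2", "type": "port_reset", "priority": 3, "reported_at": 6},
]
reported = arbitrate(remaining)
```

For switch-1 the higher-priority link-down event is reported; for adapter-2 the two events tie on priority, so the earlier port-reset event is reported.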
- Finally, it is important to remember that events are correlated not only based on the relation of types of events and their locality, but also based on when they occurred in time. Two events that occur hours apart are not likely to be related. However, two events that occur within seconds are much more likely to be related. To that end, each event is assigned a timeout value that indicates how long it should be kept in the pool before being reported. During the time that the event is in the pool, it can be related to other events based on correlation and priority characteristics. If it is not suppressed during the timeout period, then it will be reported as a serviceable event.
- Once
ELA component 115 has a serviceable event to open, theELA component 115 calls another method to open the event into a tracking database that presents the serviceable events to users. This tracking database allows users to see currently open and closed events, and to indicate what types of actions the users have taken with respect to resolving a serviceable event. Finally, when the user is satisfied, the user may close the particular event. -
FIG. 3 is a high-level logical flowchart illustrating an exemplary method for implementing InfiniBand (IB) error log analysis model to facilitate faster problem isolation and repair according to a preferred embodiment of the present invention. The process begins atstep 300 and proceeds to step 302, which illustratesnetwork error manager 114 receiving a topology ofnetwork 100 frommaster subnet manager 108 a. In the event thatmaster subnet manager 108 a becomes unavailable,standby subnet manager 108 b takes over the responsibilities ofmaster subnet manager 108 a. - The process continues to step 304, which illustrates a determination made by at least one IB device (e.g., IB adapter 110 a-b, IB switch 106 a-b, etc.) if there is a loss of IB connectivity. If there is no loss of IB connectivity,
network manager 114 continuesmonitoring network 100, as depicted instep 306. The process returns to step 304 and continues in an iterative fashion. - Returning to step 304, if at least one IB device detects a loss of IB connectivity, at least one connectivity event is sent by each IB device that detects loss of IB connectivity via Ethernet adapters 112 a-b. The at least one connectivity event is received by
network error manager 114 viaEthernet adapter 112 c, as illustrated instep 308.Network manager 114 identifies possible causes of the IB connectivity failure, as illustrated instep 310. The process returns to step 306 and proceeds in an iterative fashion. -
FIG. 5A is a high-level logical flowchart that depicts step 308 ofFIG. 3 in more detail in accordance with a preferred embodiment of the present invention. The process begins atstep 500, and proceeds to step 502, which illustratesnetwork error manager 114 monitoring the active subnet manager for asynchronous events sent from devices withinnetwork 100 that have detected communication failures within the network. The process continues to step 504, which illustratesnetwork error manager 114 determining if an event has been received. If an event has not been received, the process returns to step 502 and continues in an iterative fashion. If an event has been received, the process proceeds to step 506, which showsnetwork error manager 114 determining whether the event should be forwarded toELA component 115. As previously discussed,network error manager 114 makes this determination by requesting more information regarding the event, if needed and analyzing the frequency of the event. Ifnetwork error manager 114 decides not to forward the event toELA component 115, the event is discarded, the process returns to step 502 and proceeds in an iterative fashion. Ifnetwork error manager 114 decides to forward the event toELA component 115, the event is forwarded andELA component 115 categorizes the received event into an event pool. The process returns to step 502 and proceeds in an iterative fashion. -
FIG. 5B is a high-level logical flowchart that shows step 310 of FIG. 3 in more detail in accordance with a preferred embodiment of the present invention. Step 310 illustrates the identification of the possible cause of IB failure within network 100. The process begins at step 510 and proceeds to step 512, which illustrates ELA component 115 determining if an event pool has expired (due to the previously discussed predetermined timeout values). If no event pools have expired, the process continues to step 514, which shows ELA component 115 continuing to categorize received events from network error manager 114 into event pools 410. If an event pool has expired, the process continues to step 516, which illustrates ELA component 115 determining if any correlations (location, scope of influence, timeout values, etc.) exist between the events in the expired event pool. The process proceeds to step 518, which depicts ELA component 115 presenting at least one serviceable event (e.g., an event that may be remedied by the user or an on-site technician through the replacement of at least one FRU) to assist in diagnosis of the cause of communication failure. The process continues to step 520, which illustrates ELA component 115 waiting for the next event pool to expire. The process returns to step 512 and proceeds in an iterative fashion.
As discussed, the present invention includes a system, method, and computer-readable medium for detecting errors on a network. According to a preferred embodiment of the present invention, a network error manager retrieves a network topology from a master subnet manager, wherein the network includes a collection of devices coupled by a first interconnect type. When a connectivity failure is detected in the first interconnect type, the network error manager receives from the master subnet manager at least one event notification via a second interconnect type. An error log analysis component identifies at least one device among the collection of devices as a possible cause of the connectivity failure in the first interconnect type. The network error manager retrieves events from at least one device among the collection of devices that can influence a state of the first interconnect type.
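The event pool handling of FIG. 5B (steps 512 through 518) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the `EventPool` class, the `location` field, and the rule of reporting the most frequently named location as the suspect FRU are hypothetical, and the sketch correlates on location only, whereas the description also mentions scope of influence and timeout values.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EventPool:
    """Hypothetical event pool with a predetermined timeout."""
    opened_at: float
    timeout_s: float
    events: list = field(default_factory=list)

    def expired(self, now):
        # Step 512: a pool expires once its timeout has elapsed.
        return now - self.opened_at >= self.timeout_s


def correlate(pool):
    """Step 516 (simplified): correlate the pool's events by reported location."""
    counts = Counter(e["location"] for e in pool.events)
    if not counts:
        return None
    location, hits = counts.most_common(1)[0]
    return {"suspect_fru": location, "supporting_events": hits}


def sweep(pools, now):
    """Steps 512-518: analyze expired pools and keep the others open."""
    serviceable, still_open = [], []
    for pool in pools:
        if pool.expired(now):
            result = correlate(pool)
            if result is not None:
                serviceable.append(result)   # step 518: present a serviceable event
        else:
            still_open.append(pool)          # step 514: keep categorizing into it
    return serviceable, still_open


# Usage: at time 35 only the first pool (opened at 0, timeout 30) has expired.
pools = [
    EventPool(0.0, 30.0, [
        {"location": "switch-1/port-4"},
        {"location": "switch-1/port-4"},
        {"location": "hca-2"},
    ]),
    EventPool(20.0, 30.0, [{"location": "hca-7"}]),
]
serviceable, still_open = sweep(pools, now=35.0)
```

Calling `sweep` periodically corresponds to waiting for the next pool to expire (step 520) and returning to step 512.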
It should be understood that at least some aspects of the present invention may alternatively be implemented as a program product. Program code defining functions in the present invention can be delivered to a data storage system or a computer system via a variety of signal-bearing media, which include, without limitation, non-writable storage media (e.g., CD-ROM), writable storage media (e.g., hard disk drive, read/write CD-ROM, optical media), system memory such as, but not limited to, Random Access Memory (RAM), and communication media, such as computer and telephone networks including Ethernet, the Internet, wireless networks, and like network systems. It should be understood, therefore, that such signal-bearing media, when carrying or encoding computer-readable instructions that direct method functions in the present invention, represent alternative embodiments of the present invention. Further, it is understood that the present invention may be implemented by a system having means in the form of hardware, software, or a combination of software and hardware as described herein or their equivalent.
While the present invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
Claims (15)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/537,823 US7872982B2 (en) | 2006-10-02 | 2006-10-02 | Implementing an error log analysis model to facilitate faster problem isolation and repair |
Publications (2)
Publication Number | Publication Date |
---|---|
US20080080384A1 (en) | 2008-04-03 |
US7872982B2 (en) | 2011-01-18 |
Family
ID=39281780
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/537,823 Expired - Fee Related US7872982B2 (en) | Implementing an error log analysis model to facilitate faster problem isolation and repair | 2006-10-02 | 2006-10-02 |
Country Status (1)
Country | Link |
---|---|
US (1) | US7872982B2 (en) |
Cited By (76)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100138532A1 (en) * | 2008-11-28 | 2010-06-03 | Thomson Licensing | Method of operating a network subnet manager |
US20130097619A1 (en) * | 2011-10-18 | 2013-04-18 | International Business Machines Corporation | Administering Incident Pools For Event And Alert Analysis |
US20130219225A1 (en) * | 2009-07-16 | 2013-08-22 | Hitachi, Ltd. | Management system for outputting information denoting recovery method corresponding to root cause of failure |
US8621277B2 (en) | 2010-12-06 | 2013-12-31 | International Business Machines Corporation | Dynamic administration of component event reporting in a distributed processing system |
US8639980B2 (en) | 2011-05-26 | 2014-01-28 | International Business Machines Corporation | Administering incident pools for event and alert analysis |
US20140043999A1 (en) * | 2006-08-22 | 2014-02-13 | Centurylink Intellectual Property Llc | System and method for modifying connectivity fault management packets |
US8660995B2 (en) | 2011-06-22 | 2014-02-25 | International Business Machines Corporation | Flexible event data content management for relevant event and alert analysis within a distributed processing system |
US8676883B2 (en) | 2011-05-27 | 2014-03-18 | International Business Machines Corporation | Event management in a distributed processing system |
US8688769B2 (en) | 2011-10-18 | 2014-04-01 | International Business Machines Corporation | Selected alert delivery in a distributed processing system |
US8689050B2 (en) | 2011-06-22 | 2014-04-01 | International Business Machines Corporation | Restarting event and alert analysis after a shutdown in a distributed processing system |
US20140098702A1 (en) * | 2012-10-05 | 2014-04-10 | Jean-Philippe Fricker | Position discovery by detecting irregularities in a network topology |
US8713581B2 (en) | 2011-10-27 | 2014-04-29 | International Business Machines Corporation | Selected alert delivery in a distributed processing system |
US8730816B2 (en) | 2010-12-07 | 2014-05-20 | International Business Machines Corporation | Dynamic administration of event pools for relevant event and alert analysis during event storms |
US8756462B2 (en) | 2011-05-24 | 2014-06-17 | International Business Machines Corporation | Configurable alert delivery for reducing the amount of alerts transmitted in a distributed processing system |
US8769096B2 (en) | 2010-11-02 | 2014-07-01 | International Business Machines Corporation | Relevant alert delivery in a distributed processing system |
US20140198654A1 (en) * | 2013-01-16 | 2014-07-17 | Fujitsu Limited | Communication monitor, prediction method, and recording medium |
US20140219107A1 (en) * | 2011-03-03 | 2014-08-07 | Telefonaktiebolaget L M Ericsson (Publ) | Technique for Determining Correlated Events in a Communication System |
US8805999B2 (en) | 2010-12-07 | 2014-08-12 | International Business Machines Corporation | Administering event reporting rules in a distributed processing system |
US8868986B2 (en) | 2010-12-07 | 2014-10-21 | International Business Machines Corporation | Relevant alert delivery in a distributed processing system with event listeners and alert listeners |
US8880944B2 (en) | 2011-06-22 | 2014-11-04 | International Business Machines Corporation | Restarting event and alert analysis after a shutdown in a distributed processing system |
US8898299B2 (en) | 2010-11-02 | 2014-11-25 | International Business Machines Corporation | Administering incident pools for event and alert analysis |
US8943366B2 (en) | 2012-08-09 | 2015-01-27 | International Business Machines Corporation | Administering checkpoints for incident analysis |
US8954811B2 (en) | 2012-08-06 | 2015-02-10 | International Business Machines Corporation | Administering incident pools for incident analysis |
US9086968B2 (en) | 2013-09-11 | 2015-07-21 | International Business Machines Corporation | Checkpointing for delayed alert creation |
US9170860B2 (en) | 2013-07-26 | 2015-10-27 | International Business Machines Corporation | Parallel incident processing |
US9178936B2 (en) | 2011-10-18 | 2015-11-03 | International Business Machines Corporation | Selected alert delivery in a distributed processing system |
US9201756B2 (en) | 2011-05-27 | 2015-12-01 | International Business Machines Corporation | Administering event pools for relevant event analysis in a distributed processing system |
US9246865B2 (en) | 2011-10-18 | 2016-01-26 | International Business Machines Corporation | Prioritized alert delivery in a distributed processing system |
US9253029B2 (en) | 2013-01-16 | 2016-02-02 | Fujitsu Limited | Communication monitor, occurrence prediction method, and recording medium |
US9256482B2 (en) | 2013-08-23 | 2016-02-09 | International Business Machines Corporation | Determining whether to send an alert in a distributed processing system |
US9286143B2 (en) | 2011-06-22 | 2016-03-15 | International Business Machines Corporation | Flexible event data content management for relevant event and alert analysis within a distributed processing system |
US9348687B2 (en) | 2014-01-07 | 2016-05-24 | International Business Machines Corporation | Determining a number of unique incidents in a plurality of incidents for incident processing in a distributed processing system |
US9361184B2 (en) | 2013-05-09 | 2016-06-07 | International Business Machines Corporation | Selecting during a system shutdown procedure, a restart incident checkpoint of an incident analyzer in a distributed processing system |
US9479341B2 (en) | 2006-08-22 | 2016-10-25 | Centurylink Intellectual Property Llc | System and method for initiating diagnostics on a packet network node |
WO2017014905A1 (en) * | 2015-07-20 | 2017-01-26 | Schweitzer Engineering Laboratories, Inc. | Communication link failure detection in a software defined network |
US9602337B2 (en) | 2013-09-11 | 2017-03-21 | International Business Machines Corporation | Event and alert analysis in a distributed processing system |
US9658902B2 (en) | 2013-08-22 | 2017-05-23 | Globalfoundries Inc. | Adaptive clock throttling for event processing |
US9866483B2 (en) | 2015-07-20 | 2018-01-09 | Schweitzer Engineering Laboratories, Inc. | Routing of traffic in network through automatically generated and physically distinct communication paths |
US9900206B2 (en) | 2015-07-20 | 2018-02-20 | Schweitzer Engineering Laboratories, Inc. | Communication device with persistent configuration and verification |
US9923779B2 (en) | 2015-07-20 | 2018-03-20 | Schweitzer Engineering Laboratories, Inc. | Configuration of a software defined network |
US20180176238A1 (en) | 2016-12-15 | 2018-06-21 | Sap Se | Using frequency analysis in enterprise threat detection to detect intrusions in a computer system |
US10341311B2 (en) | 2015-07-20 | 2019-07-02 | Schweitzer Engineering Laboratories, Inc. | Communication device for implementing selective encryption in a software defined network |
US10375038B2 (en) * | 2016-11-30 | 2019-08-06 | International Business Machines Corporation | Symmetric multiprocessing management |
US10482241B2 (en) | 2016-08-24 | 2019-11-19 | Sap Se | Visualization of data distributed in multiple dimensions |
US10530794B2 (en) | 2017-06-30 | 2020-01-07 | Sap Se | Pattern creation in enterprise threat detection |
US10534907B2 (en) | 2016-12-15 | 2020-01-14 | Sap Se | Providing semantic connectivity between a java application server and enterprise threat detection system using a J2EE data |
US10534908B2 (en) | 2016-12-06 | 2020-01-14 | Sap Se | Alerts based on entities in security information and event management products |
US10536476B2 (en) | 2016-07-21 | 2020-01-14 | Sap Se | Realtime triggering framework |
US10542016B2 (en) * | 2016-08-31 | 2020-01-21 | Sap Se | Location enrichment in enterprise threat detection |
US10552605B2 (en) | 2016-12-16 | 2020-02-04 | Sap Se | Anomaly detection in enterprise threat detection |
US10630705B2 (en) | 2016-09-23 | 2020-04-21 | Sap Se | Real-time push API for log events in enterprise threat detection |
US10659314B2 (en) | 2015-07-20 | 2020-05-19 | Schweitzer Engineering Laboratories, Inc. | Communication host profiles |
US10673879B2 (en) | 2016-09-23 | 2020-06-02 | Sap Se | Snapshot of a forensic investigation for enterprise threat detection |
US10681064B2 (en) | 2017-12-19 | 2020-06-09 | Sap Se | Analysis of complex relationships among information technology security-relevant entities using a network graph |
US10764306B2 (en) | 2016-12-19 | 2020-09-01 | Sap Se | Distributing cloud-computing platform content to enterprise threat detection systems |
US10785189B2 (en) | 2018-03-01 | 2020-09-22 | Schweitzer Engineering Laboratories, Inc. | Selective port mirroring and in-band transport of network communications for inspection |
US10863558B2 (en) | 2016-03-30 | 2020-12-08 | Schweitzer Engineering Laboratories, Inc. | Communication device for implementing trusted relationships in a software defined network |
CN112367196A (en) * | 2020-10-30 | 2021-02-12 | 锐捷网络股份有限公司 | Method and device for detecting network communication fault and electronic equipment |
US10979309B2 (en) | 2019-08-07 | 2021-04-13 | Schweitzer Engineering Laboratories, Inc. | Automated convergence of physical design and configuration of software defined network |
US10986111B2 (en) | 2017-12-19 | 2021-04-20 | Sap Se | Displaying a series of events along a time axis in enterprise threat detection |
US11075908B2 (en) | 2019-05-17 | 2021-07-27 | Schweitzer Engineering Laboratories, Inc. | Authentication in a software defined network |
US11165685B2 (en) | 2019-12-20 | 2021-11-02 | Schweitzer Engineering Laboratories, Inc. | Multipoint redundant network device path planning for programmable networks |
US11228521B2 (en) | 2019-11-04 | 2022-01-18 | Schweitzer Engineering Laboratories, Inc. | Systems and method for detecting failover capability of a network device |
US11281808B2 (en) * | 2020-01-28 | 2022-03-22 | International Business Machines Corporation | Detection and repair of failed hardware components |
US11336564B1 (en) | 2021-09-01 | 2022-05-17 | Schweitzer Engineering Laboratories, Inc. | Detection of active hosts using parallel redundancy protocol in software defined networks |
US11418432B1 (en) | 2021-04-22 | 2022-08-16 | Schweitzer Engineering Laboratories, Inc. | Automated communication flow discovery and configuration in a software defined network |
US11470094B2 (en) | 2016-12-16 | 2022-10-11 | Sap Se | Bi-directional content replication logic for enterprise threat detection |
US11563756B2 (en) | 2020-04-15 | 2023-01-24 | Crowdstrike, Inc. | Distributed digital security system |
US11616790B2 (en) | 2020-04-15 | 2023-03-28 | Crowdstrike, Inc. | Distributed digital security system |
US11645397B2 (en) | 2020-04-15 | 2023-05-09 | Crowd Strike, Inc. | Distributed digital security system |
US11711379B2 (en) * | 2020-04-15 | 2023-07-25 | Crowdstrike, Inc. | Distributed digital security system |
US11750502B2 (en) | 2021-09-01 | 2023-09-05 | Schweitzer Engineering Laboratories, Inc. | Detection of in-band software defined network controllers using parallel redundancy protocol |
US11836137B2 (en) | 2021-05-19 | 2023-12-05 | Crowdstrike, Inc. | Real-time streaming graph queries |
US11838174B2 (en) | 2022-02-24 | 2023-12-05 | Schweitzer Engineering Laboratories, Inc. | Multicast fast failover handling |
US11848860B2 (en) | 2022-02-24 | 2023-12-19 | Schweitzer Engineering Laboratories, Inc. | Multicast fast failover turnaround overlap handling |
US11861019B2 (en) | 2020-04-15 | 2024-01-02 | Crowdstrike, Inc. | Distributed digital security system |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8108724B2 (en) * | 2009-12-17 | 2012-01-31 | Hewlett-Packard Development Company, L.P. | Field replaceable unit failure determination |
US9135097B2 (en) | 2012-03-27 | 2015-09-15 | Oracle International Corporation | Node death detection by querying |
US9582350B2 (en) * | 2014-10-07 | 2017-02-28 | International Business Machines Corporation | Device driver error isolation on devices wired via FSI chained interface |
US10530640B2 (en) | 2016-09-29 | 2020-01-07 | Micro Focus Llc | Determining topology using log messages |
US11222287B2 (en) | 2019-07-25 | 2022-01-11 | International Business Machines Corporation | Machine learning for failure event identification and prediction |
US11226879B2 (en) | 2020-05-08 | 2022-01-18 | International Business Machines Corporation | Fencing non-responding ports in a network fabric |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5483637A (en) * | 1994-06-27 | 1996-01-09 | International Business Machines Corporation | Expert based system and method for managing error events in a local area network |
US5960381A (en) * | 1998-07-07 | 1999-09-28 | Johnson Controls Technology Company | Starfield display of control system diagnostic information |
US6058116A (en) * | 1998-04-15 | 2000-05-02 | 3Com Corporation | Interconnected trunk cluster arrangement |
US6061723A (en) * | 1997-10-08 | 2000-05-09 | Hewlett-Packard Company | Network management event correlation in environments containing inoperative network elements |
US6078979A (en) * | 1998-06-19 | 2000-06-20 | Dell Usa, L.P. | Selective isolation of a storage subsystem bus utilzing a subsystem controller |
US20020116485A1 (en) * | 2001-02-21 | 2002-08-22 | Equipe Communications Corporation | Out-of-band network management channels |
US20020159451A1 (en) * | 2001-04-27 | 2002-10-31 | Foster Michael S. | Method and system for path building in a communications network |
US20030063560A1 (en) * | 2001-10-02 | 2003-04-03 | Fujitsu Network Communications, Inc. | Protection switching in a communications network employing label switching |
US20040015744A1 (en) * | 2002-07-22 | 2004-01-22 | Finisar Corporation | Scalable multithreaded network testing tool |
US6694361B1 (en) * | 2000-06-30 | 2004-02-17 | Intel Corporation | Assigning multiple LIDs to ports in a cluster |
US6810418B1 (en) * | 2000-06-29 | 2004-10-26 | Intel Corporation | Method and device for accessing service agents on non-subnet manager hosts in an infiniband subnet |
US6836750B2 (en) * | 2001-04-23 | 2004-12-28 | Hewlett-Packard Development Company, L.P. | Systems and methods for providing an automated diagnostic audit for cluster computer systems |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5483637A (en) * | 1994-06-27 | 1996-01-09 | International Business Machines Corporation | Expert based system and method for managing error events in a local area network |
US6061723A (en) * | 1997-10-08 | 2000-05-09 | Hewlett-Packard Company | Network management event correlation in environments containing inoperative network elements |
US6058116A (en) * | 1998-04-15 | 2000-05-02 | 3Com Corporation | Interconnected trunk cluster arrangement |
US6078979A (en) * | 1998-06-19 | 2000-06-20 | Dell Usa, L.P. | Selective isolation of a storage subsystem bus utilzing a subsystem controller |
US5960381A (en) * | 1998-07-07 | 1999-09-28 | Johnson Controls Technology Company | Starfield display of control system diagnostic information |
US6810418B1 (en) * | 2000-06-29 | 2004-10-26 | Intel Corporation | Method and device for accessing service agents on non-subnet manager hosts in an infiniband subnet |
US6694361B1 (en) * | 2000-06-30 | 2004-02-17 | Intel Corporation | Assigning multiple LIDs to ports in a cluster |
US20020116485A1 (en) * | 2001-02-21 | 2002-08-22 | Equipe Communications Corporation | Out-of-band network management channels |
US6836750B2 (en) * | 2001-04-23 | 2004-12-28 | Hewlett-Packard Development Company, L.P. | Systems and methods for providing an automated diagnostic audit for cluster computer systems |
US20020159451A1 (en) * | 2001-04-27 | 2002-10-31 | Foster Michael S. | Method and system for path building in a communications network |
US20030063560A1 (en) * | 2001-10-02 | 2003-04-03 | Fujitsu Network Communications, Inc. | Protection switching in a communications network employing label switching |
US20040015744A1 (en) * | 2002-07-22 | 2004-01-22 | Finisar Corporation | Scalable multithreaded network testing tool |
Cited By (105)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9253661B2 (en) * | 2006-08-22 | 2016-02-02 | Centurylink Intellectual Property Llc | System and method for modifying connectivity fault management packets |
US10523554B2 (en) | 2006-08-22 | 2019-12-31 | Centurylink Intellectual Property Llc | System and method of routing calls on a packet network |
US9479341B2 (en) | 2006-08-22 | 2016-10-25 | Centurylink Intellectual Property Llc | System and method for initiating diagnostics on a packet network node |
US20140043999A1 (en) * | 2006-08-22 | 2014-02-13 | Centurylink Intellectual Property Llc | System and method for modifying connectivity fault management packets |
US8127003B2 (en) * | 2008-11-28 | 2012-02-28 | Thomson Licensing | Method of operating a network subnet manager |
US20100138532A1 (en) * | 2008-11-28 | 2010-06-03 | Thomson Licensing | Method of operating a network subnet manager |
US20130219225A1 (en) * | 2009-07-16 | 2013-08-22 | Hitachi, Ltd. | Management system for outputting information denoting recovery method corresponding to root cause of failure |
US9189319B2 (en) * | 2009-07-16 | 2015-11-17 | Hitachi, Ltd. | Management system for outputting information denoting recovery method corresponding to root cause of failure |
US8825852B2 (en) | 2010-11-02 | 2014-09-02 | International Business Machines Corporation | Relevant alert delivery in a distributed processing system |
US8898299B2 (en) | 2010-11-02 | 2014-11-25 | International Business Machines Corporation | Administering incident pools for event and alert analysis |
US8769096B2 (en) | 2010-11-02 | 2014-07-01 | International Business Machines Corporation | Relevant alert delivery in a distributed processing system |
US8627154B2 (en) | 2010-12-06 | 2014-01-07 | International Business Machines Corporation | Dynamic administration of component event reporting in a distributed processing system |
US8621277B2 (en) | 2010-12-06 | 2013-12-31 | International Business Machines Corporation | Dynamic administration of component event reporting in a distributed processing system |
US8737231B2 (en) | 2010-12-07 | 2014-05-27 | International Business Machines Corporation | Dynamic administration of event pools for relevant event and alert analysis during event storms |
US8805999B2 (en) | 2010-12-07 | 2014-08-12 | International Business Machines Corporation | Administering event reporting rules in a distributed processing system |
US8868984B2 (en) | 2010-12-07 | 2014-10-21 | International Business Machines Corporation | Relevant alert delivery in a distributed processing system with event listeners and alert listeners |
US8868986B2 (en) | 2010-12-07 | 2014-10-21 | International Business Machines Corporation | Relevant alert delivery in a distributed processing system with event listeners and alert listeners |
US8730816B2 (en) | 2010-12-07 | 2014-05-20 | International Business Machines Corporation | Dynamic administration of event pools for relevant event and alert analysis during event storms |
US20140219107A1 (en) * | 2011-03-03 | 2014-08-07 | Telefonaktiebolaget L M Ericsson (Publ) | Technique for Determining Correlated Events in a Communication System |
US9325568B2 (en) * | 2011-03-03 | 2016-04-26 | Telefonaktiebolaget Lm Ericsson (Publ) | Technique for determining correlated events in a communication system |
US8756462B2 (en) | 2011-05-24 | 2014-06-17 | International Business Machines Corporation | Configurable alert delivery for reducing the amount of alerts transmitted in a distributed processing system |
US8639980B2 (en) | 2011-05-26 | 2014-01-28 | International Business Machines Corporation | Administering incident pools for event and alert analysis |
US8645757B2 (en) | 2011-05-26 | 2014-02-04 | International Business Machines Corporation | Administering incident pools for event and alert analysis |
US8676883B2 (en) | 2011-05-27 | 2014-03-18 | International Business Machines Corporation | Event management in a distributed processing system |
US9213621B2 (en) | 2011-05-27 | 2015-12-15 | International Business Machines Corporation | Administering event pools for relevant event analysis in a distributed processing system |
US9201756B2 (en) | 2011-05-27 | 2015-12-01 | International Business Machines Corporation | Administering event pools for relevant event analysis in a distributed processing system |
US9344381B2 (en) | 2011-05-27 | 2016-05-17 | International Business Machines Corporation | Event management in a distributed processing system |
US8689050B2 (en) | 2011-06-22 | 2014-04-01 | International Business Machines Corporation | Restarting event and alert analysis after a shutdown in a distributed processing system |
US8660995B2 (en) | 2011-06-22 | 2014-02-25 | International Business Machines Corporation | Flexible event data content management for relevant event and alert analysis within a distributed processing system |
US9419650B2 (en) | 2011-06-22 | 2016-08-16 | International Business Machines Corporation | Flexible event data content management for relevant event and alert analysis within a distributed processing system |
US9286143B2 (en) | 2011-06-22 | 2016-03-15 | International Business Machines Corporation | Flexible event data content management for relevant event and alert analysis within a distributed processing system |
US8880944B2 (en) | 2011-06-22 | 2014-11-04 | International Business Machines Corporation | Restarting event and alert analysis after a shutdown in a distributed processing system |
US8713366B2 (en) | 2011-06-22 | 2014-04-29 | International Business Machines Corporation | Restarting event and alert analysis after a shutdown in a distributed processing system |
US8880943B2 (en) | 2011-06-22 | 2014-11-04 | International Business Machines Corporation | Restarting event and alert analysis after a shutdown in a distributed processing system |
US20130097620A1 (en) * | 2011-10-18 | 2013-04-18 | International Business Machines Corporation | Administering incident pools for event and alert analysis |
US8688769B2 (en) | 2011-10-18 | 2014-04-01 | International Business Machines Corporation | Selected alert delivery in a distributed processing system |
US20130097619A1 (en) * | 2011-10-18 | 2013-04-18 | International Business Machines Corporation | Administering Incident Pools For Event And Alert Analysis |
US9178936B2 (en) | 2011-10-18 | 2015-11-03 | International Business Machines Corporation | Selected alert delivery in a distributed processing system |
US9178937B2 (en) | 2011-10-18 | 2015-11-03 | International Business Machines Corporation | Selected alert delivery in a distributed processing system |
US8893157B2 (en) * | 2011-10-18 | 2014-11-18 | International Business Machines Corporation | Administering incident pools for event and alert analysis |
US8887175B2 (en) * | 2011-10-18 | 2014-11-11 | International Business Machines Corporation | Administering incident pools for event and alert analysis |
US9246865B2 (en) | 2011-10-18 | 2016-01-26 | International Business Machines Corporation | Prioritized alert delivery in a distributed processing system |
US8713581B2 (en) | 2011-10-27 | 2014-04-29 | International Business Machines Corporation | Selected alert delivery in a distributed processing system |
US8954811B2 (en) | 2012-08-06 | 2015-02-10 | International Business Machines Corporation | Administering incident pools for incident analysis |
US8943366B2 (en) | 2012-08-09 | 2015-01-27 | International Business Machines Corporation | Administering checkpoints for incident analysis |
US20140098702A1 (en) * | 2012-10-05 | 2014-04-10 | Jean-Philippe Fricker | Position discovery by detecting irregularities in a network topology |
US9143338B2 (en) * | 2012-10-05 | 2015-09-22 | Advanced Micro Devices, Inc. | Position discovery by detecting irregularities in a network topology |
US9253029B2 (en) | 2013-01-16 | 2016-02-02 | Fujitsu Limited | Communication monitor, occurrence prediction method, and recording medium |
US9350602B2 (en) * | 2013-01-16 | 2016-05-24 | Fujitsu Limited | Communication monitor, prediction method, and recording medium |
US20140198654A1 (en) * | 2013-01-16 | 2014-07-17 | Fujitsu Limited | Communication monitor, prediction method, and recording medium |
US9361184B2 (en) | 2013-05-09 | 2016-06-07 | International Business Machines Corporation | Selecting during a system shutdown procedure, a restart incident checkpoint of an incident analyzer in a distributed processing system |
US9170860B2 (en) | 2013-07-26 | 2015-10-27 | International Business Machines Corporation | Parallel incident processing |
US9658902B2 (en) | 2013-08-22 | 2017-05-23 | Globalfoundries Inc. | Adaptive clock throttling for event processing |
US9256482B2 (en) | 2013-08-23 | 2016-02-09 | International Business Machines Corporation | Determining whether to send an alert in a distributed processing system |
US9086968B2 (en) | 2013-09-11 | 2015-07-21 | International Business Machines Corporation | Checkpointing for delayed alert creation |
US10171289B2 (en) | 2013-09-11 | 2019-01-01 | International Business Machines Corporation | Event and alert analysis in a distributed processing system |
US9602337B2 (en) | 2013-09-11 | 2017-03-21 | International Business Machines Corporation | Event and alert analysis in a distributed processing system |
US9389943B2 (en) | 2014-01-07 | 2016-07-12 | International Business Machines Corporation | Determining a number of unique incidents in a plurality of incidents for incident processing in a distributed processing system |
US9348687B2 (en) | 2014-01-07 | 2016-05-24 | International Business Machines Corporation | Determining a number of unique incidents in a plurality of incidents for incident processing in a distributed processing system |
US10341311B2 (en) | 2015-07-20 | 2019-07-02 | Schweitzer Engineering Laboratories, Inc. | Communication device for implementing selective encryption in a software defined network |
US9923779B2 (en) | 2015-07-20 | 2018-03-20 | Schweitzer Engineering Laboratories, Inc. | Configuration of a software defined network |
US9900206B2 (en) | 2015-07-20 | 2018-02-20 | Schweitzer Engineering Laboratories, Inc. | Communication device with persistent configuration and verification |
WO2017014905A1 (en) * | 2015-07-20 | 2017-01-26 | Schweitzer Engineering Laboratories, Inc. | Communication link failure detection in a software defined network |
US10721218B2 (en) | 2015-07-20 | 2020-07-21 | Schweitzer Engineering Laboratories, Inc. | Communication device for implementing selective encryption in a software defined network |
US9866483B2 (en) | 2015-07-20 | 2018-01-09 | Schweitzer Engineering Laboratories, Inc. | Routing of traffic in network through automatically generated and physically distinct communication paths |
US10659314B2 (en) | 2015-07-20 | 2020-05-19 | Schweitzer Engineering Laboratories, Inc. | Communication host profiles |
US10863558B2 (en) | 2016-03-30 | 2020-12-08 | Schweitzer Engineering Laboratories, Inc. | Communication device for implementing trusted relationships in a software defined network |
US11012465B2 (en) | 2016-07-21 | 2021-05-18 | Sap Se | Realtime triggering framework |
US10536476B2 (en) | 2016-07-21 | 2020-01-14 | Sap Se | Realtime triggering framework |
US10482241B2 (en) | 2016-08-24 | 2019-11-19 | Sap Se | Visualization of data distributed in multiple dimensions |
US10542016B2 (en) * | 2016-08-31 | 2020-01-21 | Sap Se | Location enrichment in enterprise threat detection |
US10673879B2 (en) | 2016-09-23 | 2020-06-02 | Sap Se | Snapshot of a forensic investigation for enterprise threat detection |
US10630705B2 (en) | 2016-09-23 | 2020-04-21 | Sap Se | Real-time push API for log events in enterprise threat detection |
US10375038B2 (en) * | 2016-11-30 | 2019-08-06 | International Business Machines Corporation | Symmetric multiprocessing management |
US10623383B2 (en) | 2016-11-30 | 2020-04-14 | International Business Machines Corporation | Symmetric multiprocessing management |
US10534908B2 (en) | 2016-12-06 | 2020-01-14 | Sap Se | Alerts based on entities in security information and event management products |
US10534907B2 (en) | 2016-12-15 | 2020-01-14 | Sap Se | Providing semantic connectivity between a java application server and enterprise threat detection system using a J2EE data |
US10530792B2 (en) | 2016-12-15 | 2020-01-07 | Sap Se | Using frequency analysis in enterprise threat detection to detect intrusions in a computer system |
US20180176238A1 (en) | 2016-12-15 | 2018-06-21 | Sap Se | Using frequency analysis in enterprise threat detection to detect intrusions in a computer system |
US10552605B2 (en) | 2016-12-16 | 2020-02-04 | Sap Se | Anomaly detection in enterprise threat detection |
US11470094B2 (en) | 2016-12-16 | 2022-10-11 | Sap Se | Bi-directional content replication logic for enterprise threat detection |
US11093608B2 (en) | 2016-12-16 | 2021-08-17 | Sap Se | Anomaly detection in enterprise threat detection |
US10764306B2 (en) | 2016-12-19 | 2020-09-01 | Sap Se | Distributing cloud-computing platform content to enterprise threat detection systems |
US10530794B2 (en) | 2017-06-30 | 2020-01-07 | Sap Se | Pattern creation in enterprise threat detection |
US11128651B2 (en) | 2017-06-30 | 2021-09-21 | Sap Se | Pattern creation in enterprise threat detection |
US10681064B2 (en) | 2017-12-19 | 2020-06-09 | Sap Se | Analysis of complex relationships among information technology security-relevant entities using a network graph |
US10986111B2 (en) | 2017-12-19 | 2021-04-20 | Sap Se | Displaying a series of events along a time axis in enterprise threat detection |
US10785189B2 (en) | 2018-03-01 | 2020-09-22 | Schweitzer Engineering Laboratories, Inc. | Selective port mirroring and in-band transport of network communications for inspection |
US11075908B2 (en) | 2019-05-17 | 2021-07-27 | Schweitzer Engineering Laboratories, Inc. | Authentication in a software defined network |
US10979309B2 (en) | 2019-08-07 | 2021-04-13 | Schweitzer Engineering Laboratories, Inc. | Automated convergence of physical design and configuration of software defined network |
US11228521B2 (en) | 2019-11-04 | 2022-01-18 | Schweitzer Engineering Laboratories, Inc. | Systems and method for detecting failover capability of a network device |
US11165685B2 (en) | 2019-12-20 | 2021-11-02 | Schweitzer Engineering Laboratories, Inc. | Multipoint redundant network device path planning for programmable networks |
US11281808B2 (en) * | 2020-01-28 | 2022-03-22 | International Business Machines Corporation | Detection and repair of failed hardware components |
US11645397B2 (en) | 2020-04-15 | 2023-05-09 | Crowdstrike, Inc. | Distributed digital security system |
US11563756B2 (en) | 2020-04-15 | 2023-01-24 | Crowdstrike, Inc. | Distributed digital security system |
US11616790B2 (en) | 2020-04-15 | 2023-03-28 | Crowdstrike, Inc. | Distributed digital security system |
US11711379B2 (en) * | 2020-04-15 | 2023-07-25 | Crowdstrike, Inc. | Distributed digital security system |
US11861019B2 (en) | 2020-04-15 | 2024-01-02 | Crowdstrike, Inc. | Distributed digital security system |
CN112367196A (en) * | 2020-10-30 | 2021-02-12 | 锐捷网络股份有限公司 | Method and device for detecting network communication fault and electronic equipment |
US11418432B1 (en) | 2021-04-22 | 2022-08-16 | Schweitzer Engineering Laboratories, Inc. | Automated communication flow discovery and configuration in a software defined network |
US11836137B2 (en) | 2021-05-19 | 2023-12-05 | Crowdstrike, Inc. | Real-time streaming graph queries |
US11336564B1 (en) | 2021-09-01 | 2022-05-17 | Schweitzer Engineering Laboratories, Inc. | Detection of active hosts using parallel redundancy protocol in software defined networks |
US11750502B2 (en) | 2021-09-01 | 2023-09-05 | Schweitzer Engineering Laboratories, Inc. | Detection of in-band software defined network controllers using parallel redundancy protocol |
US11838174B2 (en) | 2022-02-24 | 2023-12-05 | Schweitzer Engineering Laboratories, Inc. | Multicast fast failover handling |
US11848860B2 (en) | 2022-02-24 | 2023-12-19 | Schweitzer Engineering Laboratories, Inc. | Multicast fast failover turnaround overlap handling |
Also Published As
Publication number | Publication date |
---|---|
US7872982B2 (en) | 2011-01-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7872982B2 (en) | Implementing an error log analysis model to facilitate faster problem isolation and repair | |
US11500757B2 (en) | Method and system for automatic real-time causality analysis of end user impacting system anomalies using causality rules and topological understanding of the system to effectively filter relevant monitoring data | |
Chen et al. | Towards intelligent incident management: why we need it and how we make it | |
US9413597B2 (en) | Method and system for providing aggregated network alarms | |
US8812649B2 (en) | Method and system for processing fault alarms and trouble tickets in a managed network services system | |
US11513935B2 (en) | System and method for detecting anomalies by discovering sequences in log entries | |
US8676945B2 (en) | Method and system for processing fault alarms and maintenance events in a managed network services system | |
US7664986B2 (en) | System and method for determining fault isolation in an enterprise computing system | |
US6792456B1 (en) | Systems and methods for authoring and executing operational policies that use event rates | |
US7580994B1 (en) | Method and apparatus for enabling dynamic self-healing of multi-media services | |
US8738760B2 (en) | Method and system for providing automated data retrieval in support of fault isolation in a managed services network | |
US7607043B2 (en) | Analysis of mutually exclusive conflicts among redundant devices | |
US7426654B2 (en) | Method and system for providing customer controlled notifications in a managed network services system | |
US8924533B2 (en) | Method and system for providing automated fault isolation in a managed services network | |
US7730364B2 (en) | Systems and methods for predictive failure management | |
US6636981B1 (en) | Method and system for end-to-end problem determination and fault isolation for storage area networks | |
US10698605B2 (en) | Multipath storage device based on multi-dimensional health diagnosis | |
US20140082423A1 (en) | Method and apparatus for cause analysis involving configuration changes | |
US20080082661A1 (en) | Method and Apparatus for Network Monitoring of Communications Networks | |
EP1405187A4 (en) | Method and system for correlating and determining root causes of system and enterprise events | |
JPH04230538A (en) | Method and apparatus for detecting faulty software component | |
US7469287B1 (en) | Apparatus and method for monitoring objects in a network and automatically validating events relating to the objects | |
US20100280855A1 (en) | Management of a first stand-alone system used as a subsystem within a second system | |
CN115664939A (en) | Comprehensive operation and maintenance method and device based on automation technology and storage medium | |
CN113472577B (en) | Cluster inspection method, device and system | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: ATKINS, MARK G; COHEN, MICHAL B; DOXTADER, JOHN W; AND OTHERS; SIGNING DATES FROM 20060928 TO 20060929; REEL/FRAME: 018368/0207 | |
REMI | Maintenance fee reminder mailed | | |
LAPS | Lapse for failure to pay maintenance fees | | |
STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 | |
FP | Lapsed due to failure to pay maintenance fee | Effective date: 20150118 | |