WO2015072078A1 - Service resumption sequence generating device, service resumption sequence generating method, and service resumption sequence generating program - Google Patents

Info

Publication number
WO2015072078A1
Authority
WO
WIPO (PCT)
Prior art keywords
procedure
sub
reconstruction
failure
component
Prior art date
Application number
PCT/JP2014/005217
Other languages
French (fr)
Japanese (ja)
Inventor
紅美子 但野
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to JP2015547615A
Publication of WO2015072078A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/142 Reconfiguring to eliminate the error

Definitions

  • The present invention relates to a service resumption procedure generation device, a service resumption procedure generation method, and a service resumption procedure generation program for generating a procedure for resuming a service of an information system stopped due to the occurrence of a failure.
  • In order to restart the information system service from such a situation, an operation procedure (a so-called failure recovery procedure) is required that restores the entire information system from a failure state to a serviceable state in response to simultaneous failures. In order to recover the entire information system, many failure recovery procedures first check the system status and identify the cause of the failure, and then correct the problem.
  • the information system failure recovery procedure includes sub-procedures (for example, command input, graphical user interface operation, etc.) for recovering a failure occurring in a component. Since the sub-procedures required for each failure occurring in the component are different, the failure recovery procedure differs depending on the combination of the failures that have occurred. Since the number of combinations of failures that can occur simultaneously in a large number of components is enormous, it is impractical for a user to manually generate a failure recovery procedure for all combinations. Therefore, it is reasonable to automatically generate a failure recovery procedure.
  • RTO: recovery time objective (target recovery time).
  • Software rejuvenation, a countermeasure against software aging, is known as a failure handling method that requires neither identification of the cause of a failure nor correction of the problem part.
  • Software aging is a general term for deterioration phenomena (memory leak, file fragmentation, etc.) that occur in the operating environment due to continuous operation for a long time. As the operating environment deteriorates, the information system may be damaged.
  • Software rejuvenation is a technique for preventing a failure caused by software aging by initializing at least a part of the internal state of the information system.
  • Patent Document 1 describes a method of simultaneously rejuvenating a host machine and a virtual machine that need to be rejuvenated while continuously operating a host machine and a virtual machine that do not need to be rejuvenated.
  • Non-Patent Document 1 describes a software life extension method.
  • In the software life extension method described in Non-Patent Document 1, when software aging occurs in a virtual machine that operates in a virtual environment, additional resources are allocated to the virtual machine to lengthen the operating time of the software.
  • The methods described in Patent Document 1 and Non-Patent Document 1 have the problem that they cannot cope with information system failures other than those caused by software aging.
  • One failure handling method that can deal with failures other than those caused by software aging (for example, file corruption, setting mistakes, rewriting due to unauthorized access, etc.) and that does not require identification of the cause of the failure or correction of the problem location is rebuilding at least part of the information system.
  • An object of the present invention is to provide a service resumption procedure generation apparatus, a service resumption procedure generation method, and a service resumption procedure generation program that automatically generate a service resumption procedure.
  • The service resumption procedure generating apparatus includes: failure combination accepting means that accepts a combination of failures occurring in components included in the information system; sub-procedure storage means that stores information on sub-procedures, which are procedures for recovering from failures occurring in components, in association with component identifiers; reconstruction procedure storage means that stores information on reconstruction procedures, which are procedures for reconstructing components, in association with component identifiers; sub-procedure specifying means that specifies, based on the combination of failures accepted by the failure combination accepting means, the sub-procedures necessary for recovering the failed components from the failure state; failure recovery procedure generation means that generates a failure recovery procedure by connecting the specified sub-procedures based on the information on those sub-procedures; reconstruction procedure identification means that identifies, based on the accepted combination of failures, the reconstruction procedures necessary to reconstruct the failed components; and sub-procedure replacement means that, if the generated failure recovery procedure does not meet the prescribed requirements, replaces at least part of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in that procedure and the information on the identified reconstruction procedures.
  • In the service restart procedure generation method, information on sub-procedures, which are procedures for recovering from failures occurring in components included in an information system, is stored in predetermined sub-procedure storage means in association with component identifiers, and information on reconstruction procedures, which are procedures for reconstructing components, is likewise stored in association with component identifiers. An information processing apparatus accepts a combination of failures that have occurred in components of the information system and identifies, based on the accepted combination of failures, the sub-procedures necessary to recover the failed components from the failure state.
  • The information processing apparatus generates a failure recovery procedure by connecting the identified sub-procedures based on the sub-procedure information and, based on the accepted combination of failures, identifies the reconstruction procedures required to rebuild the failed components. If the generated failure recovery procedure does not meet the prescribed requirements, the information processing apparatus replaces at least part of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in that procedure and the information on the identified reconstruction procedures. The information processing apparatus then outputs, as a service resumption procedure, the failure recovery procedure in which at least part of the sub-procedures has been replaced with reconstruction procedures.
  • The service resumption procedure generation program is intended for a computer comprising sub-procedure storage means for storing information on sub-procedures, which are procedures for recovering from failures occurring in components included in an information system, in association with component identifiers. The program causes the computer to execute: a failure combination receiving process that receives a combination of failures; a sub-procedure specifying process that identifies, based on the received combination of failures, the sub-procedures necessary for recovering the failed components from the failure state; a reconstruction procedure identification process that identifies the reconstruction procedures required to rebuild the failed components; and a sub-procedure replacement process that, when the generated failure recovery procedure does not meet the prescribed requirements, replaces at least part of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in that procedure.
  • an optimal service restart procedure can be automatically generated according to a combination of failures that have occurred.
  • the failure recovery procedure is a procedure for causing the information system to restart the service by recovering the component group in which the failure has occurred in the information system from the failure state.
  • the failure recovery procedure includes sub-procedures that are a group of procedures for recovering from a failure occurring in a component included in the information system.
  • Each sub-procedure is not particularly limited as long as it is a procedure for recovering a failure occurring in a component.
  • each sub-procedure may include system management operations such as restart, data recovery, and setting change.
  • Each sub-procedure may be described in advance in a document or a manual.
  • Each sub-procedure may be provided as an automated script or program using an existing system configuration management tool such as JP1.
  • the system operator restores the failed component from the failed state according to the failure recovery procedure.
  • the sub-procedures required to restore the information system differ depending on the combination of components in which a failure has occurred and the combination of failures that have occurred. Therefore, the operator first specifies the sub procedure necessary for restoring the information system, and then executes the sub procedure to be executed for restoring the information system.
  • Component failures include not only the component being down, but also being unable to use the component normally such as “some of the required commands cannot be executed” and “some of the data necessary for the system has been lost”. included. Therefore, in the present invention, the sub-procedures included in the failure recovery procedure are specified according to the combination of components in which such a failure has occurred and the combination of the failures that have occurred.
  • the reconstruction procedure is a procedure for reconstructing a component group in which a failure has occurred.
  • the rebuilding procedure is not particularly limited as long as it is a procedure for rebuilding a component. It should be noted that the reconstruction procedure is not always prepared for all components from the viewpoint of restrictions on the implementation of the information system and the cost of preparing the reconstruction procedure itself.
  • Each reconstruction procedure may be described in advance in a document or a manual.
  • Each reconstruction procedure may be provided as an automated script or program using an existing system configuration management tool such as Chef.
  • The present invention generates a service resumption procedure, which is a procedure for causing the information system to resume a service, by appropriately combining the above-described failure recovery procedure (more specifically, the sub-procedures included in the failure recovery procedure) and reconstruction procedures.
  • A service resumption procedure refers to a procedure that restores the information system from the failure state to a serviceable state, regardless of whether it uses sub-procedures for recovering from failures or reconstruction procedures.
  • The information system components include everything in the information system that can be a target of failure recovery or reconstruction. Examples include an application, a task, a thread, a VM (Virtual Machine), a central processing unit (CPU), a peripheral device, a storage, a server device, and a personal computer.
  • the component may be software or hardware.
  • a certain component may include a plurality of components.
  • the term “component” may mean a component group including a plurality of components.
  • Likewise, the term “sub-procedure” may mean a sub-procedure group including a plurality of sub-procedures.
  • reconstruction procedure may mean a reconstruction procedure group including a plurality of reconstruction procedures.
  • a plurality of subprocedures may be included in a subprocedure to which a certain ID is allocated.
  • a plurality of reconstruction procedures may be included in the reconstruction procedure to which a certain ID is allocated. This is because, depending on the combination of simultaneous failures, it may be better to handle a plurality of failures and components together.
  • FIG. 1 is a block diagram illustrating a configuration example of a service restart procedure generation device according to the first embodiment of this invention.
  • The service resumption procedure generation apparatus 1 includes a failure combination reception unit 101, a sub-procedure identification unit 102, a sub-procedure storage unit 103, a failure recovery procedure generation unit 104, a required time estimation unit 105, a procedure output unit 106, a reconstruction procedure specifying unit 107, a reconstruction procedure storage unit 108, a sub-procedure replacement unit 109, and a time requirement receiving unit 110.
  • the service restart procedure generating device 1 is realized by a general information processing device (computer).
  • the service restart procedure generation device 1 is, for example, a server device or a personal computer.
  • the service restart procedure generation device 1 includes a CPU, a storage device, an input device, and an output device (not shown).
  • the storage device is, for example, a memory and a hard disk drive (HDD; Hard Disk Drive).
  • the input device is, for example, a keyboard, a mouse, various network interfaces, or the like.
  • the output device is, for example, a display or various network interfaces.
  • the service resuming procedure generation device 1 is configured to realize each unit illustrated in FIG. 1 by a CPU executing a program stored in a storage device.
  • the failure combination receiving unit 101 receives a combination of failures occurring in the information system components.
  • The information indicating the combination of failures may be a set of identifiers of the components in which failures have occurred. For example, it may be specified using component names such as {“application A”, “database B”} or numbers assigned to the components in advance such as {1, 2, 3}.
  • the information indicating the combination of failures further includes the type of failure that has occurred in each component. That is, the information indicating the combination of failures may be a set of failure information including the identifier of the component in which the failure has occurred and the failure type.
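  • As an illustration only, the two representations of a failure combination described above could be modeled as follows. This is a minimal sketch; the class name `Failure` and the failure type labels such as `"process_down"` are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """One element of a failure combination (names are illustrative assumptions)."""
    component_id: str   # identifier of the component in which the failure occurred
    failure_type: str   # hypothetical failure type label


# Richer form: a set of (component identifier, failure type) pairs.
failure_combination = frozenset({
    Failure("application A", "process_down"),
    Failure("database B", "data_loss"),
})

# Simpler form: only the identifiers of the failed components.
failed_components = {f.component_id for f in failure_combination}
```

Using a frozen dataclass makes each failure hashable, so a combination can be stored as a set and compared independently of the order in which failures were reported.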
  • the sub-procedure storage means 103 stores information on various sub-procedures for recovering a failure of a component included in the information system.
  • The sub-procedure storage means 103 stores, in association with a sub-procedure ID identifying each sub-procedure, the corresponding failure type and the sub-procedure itself (information indicating the specific processing contents of the sub-procedure, or a script or a program for actually executing those processing contents).
  • FIG. 2 is an activity diagram showing an example of a sub-procedure. In the present invention, the concrete form of the sub-procedure is not particularly limited. That is, the sub-procedure provided to the user may be information indicating the specific processing contents of the sub-procedure, as in the activity diagram of FIG. 2.
  • A sub-procedure, which is a procedure for recovery from a failure, includes work to identify the cause of the failure, so it is difficult to automate all of it. However, the recovery operation corresponding to the cause (for example, restart, data recovery, setting change, etc.) can be automated by a script or the like.
  • FIG. 3 is an explanatory diagram showing a part of an example of information stored in the sub procedure storage means 103.
  • the sub-procedure storage means 103 of the present embodiment stores the identifier of the component to be recovered by the sub-procedure (sub-procedure identified by the sub-procedure ID) in association with the sub-procedure ID.
  • the sub-procedure storage means 103 of the present embodiment stores, for each sub-procedure, information including at least the sub-procedure itself and the identifier of the component that the sub-procedure recovers.
  • the sub-procedure is defined, for example, for each component failure included in the information system. Note that it is not necessarily defined for each failure, and for example, one sub-procedure may be defined for a specific combination of failures.
  • The sub-procedure storage means 103 may also store, in association with the sub-procedure ID, additional information such as a precondition indicating a condition necessary for executing the sub-procedure, the name of the sub-procedure, the state of the component realized by executing the sub-procedure (the component transition destination state), and the time required for each operation (setting change, reboot, shutdown, etc.) in the sub-procedure.
  • The preconditions are, for example, information on sub-procedures that need to be executed in advance, information on prerequisite conditions, and sub-procedures that cannot be executed simultaneously.
  • the name of the sub procedure is used to help the user understand.
  • the state of the transition destination of the component is used, for example, as a material for determining whether or not the preconditions of the subsequent subprocedures are satisfied when a failure recovery procedure is generated by connecting a plurality of subprocedures.
  • the required time for each operation is used to estimate the execution time of the sub procedure.
  • the information stored in the sub procedure storage means 103 may be referred to as sub procedure information.
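  • The sub-procedure information described above can be pictured as one record per sub-procedure ID. The following is a minimal sketch under stated assumptions: the field names, the example values, and the dictionary used as the store are all hypothetical, not the layout shown in FIG. 3.

```python
from dataclasses import dataclass, field


@dataclass
class SubProcedureInfo:
    """One entry of the sub-procedure store (field names are assumptions)."""
    sub_procedure_id: str
    component_id: str                 # component that this sub-procedure recovers
    failure_type: str                 # failure type that it handles
    name: str = ""                    # human-readable name for the operator
    preconditions: list[str] = field(default_factory=list)  # IDs that must run first
    resulting_state: str = "running"  # component transition destination state
    operation_minutes: dict[str, float] = field(default_factory=dict)
    script: str = ""                  # path to an automated script, if one exists

    def estimated_minutes(self) -> float:
        # Execution time estimated as the sum of the per-operation times.
        return sum(self.operation_minutes.values())


store = {
    "SP-2": SubProcedureInfo(
        sub_procedure_id="SP-2",
        component_id="database B",
        failure_type="data_loss",
        name="Restore database B from backup",
        operation_minutes={"restore data": 40.0, "restart": 5.0},
    ),
}
```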
  • the reconstruction procedure storage means 108 stores information on the reconstruction procedure, which is a procedure for reconstructing the component in which the failure has occurred.
  • The reconstruction procedure storage unit 108 stores, in association with a reconstruction procedure ID identifying each reconstruction procedure, the reconstruction procedure itself (information indicating the specific processing content of the reconstruction procedure, or a script or a program that actually executes such processing).
  • The concrete form of the reconstruction procedure is not particularly limited. That is, the reconstruction procedure provided to the user may be information for causing the user to execute the reconstruction procedure, or may be a script or a program that actually executes the procedure on the information system. Alternatively, a combination of these may be used.
  • Rebuilding is an operation that re-creates a component from scratch, unlike a simple restart that merely clears the internal state.
  • For a VM, for example, rebuilding includes operations such as setting the resources to be allocated to the VM, configuring the OS IP address and the OS firewall on the VM, and transferring a VM image.
  • The reconstruction procedure does not require an understanding of the current state or work to identify the cause of the failure, and in many cases an appropriate procedure is already known, for example because the component operated normally after it was first constructed. If an appropriate procedure is known in advance, the reconstruction procedure can be automated by a script or the like.
  • FIG. 4 is an explanatory diagram showing a part of the information stored in the reconstruction procedure storage means 108.
  • The reconstruction procedure storage means 108 of the present embodiment further stores, in association with the reconstruction procedure ID, the identifier of the component to be reconstructed by the reconstruction procedure (the reconstruction procedure identified by that reconstruction procedure ID).
  • the reconstruction procedure storage unit 108 of the present embodiment stores, for each reconstruction procedure, information including at least the reconstruction procedure itself and the identifiers of components to be reconstructed by the reconstruction procedure.
  • the reconstruction procedure is defined for each component included in the information system, for example. Note that it is not necessarily defined for each component. For example, one reconstruction procedure may be defined for a specific combination of components.
  • the reconstruction procedure storage means 108 associates the reconstruction procedure ID with the execution time of the reconstruction procedure and the component state (component transition destination state) realized by the execution of the reconstruction procedure. It may be stored.
  • the sub-procedure specifying unit 102 specifies a sub-procedure necessary for recovering the component in which the fault has occurred from the fault state based on the fault combination received by the fault combination receiving unit 101.
  • For example, based on the combination of failures received by the failure combination receiving unit 101, the sub-procedure specifying unit 102 refers to the information stored in the sub-procedure storage means 103 and identifies, for each component in which a failure has occurred, the sub-procedure ID of the sub-procedure for recovering from that failure.
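  • The lookup described above can be sketched as a table keyed by component identifier and failure type. This is only an illustration: the table `SUB_PROCEDURE_INDEX`, its contents, and the function name are hypothetical.

```python
# Hypothetical index: (component identifier, failure type) -> sub-procedure ID.
SUB_PROCEDURE_INDEX = {
    ("application A", "process_down"): "SP-1",
    ("database B", "data_loss"): "SP-2",
    ("database B", "process_down"): "SP-3",
}


def specify_sub_procedures(failures):
    """Return the sub-procedure IDs needed for the given failure combination."""
    found, missing = [], []
    for failure in sorted(failures):
        sp_id = SUB_PROCEDURE_INDEX.get(failure)
        if sp_id is None:
            missing.append(failure)
        else:
            found.append(sp_id)
    if missing:
        # No stored sub-procedure covers these failures.
        raise LookupError(f"no sub-procedure stored for: {missing}")
    return found


needed = specify_sub_procedures(
    {("application A", "process_down"), ("database B", "data_loss")}
)
```

Raising an error for an uncovered failure is a design choice of this sketch; the source does not say how a missing sub-procedure is handled.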
  • The failure recovery procedure generation means 104 connects the sub-procedures specified by the sub-procedure specification means 102 (combining them in an appropriate order) to generate a failure recovery procedure. Most simply, the failure recovery procedure generation unit 104 may generate the failure recovery procedure by connecting the identified sub-procedures in series. When there is a restriction, such as an ordering constraint, between the specified sub-procedures, the failure recovery procedure generating unit 104 connects them so as to satisfy the restriction. When there are several candidate ways to connect them, the failure recovery procedure generation unit 104 may connect them so as to shorten the time required for failure recovery, for example by parallelizing as much as possible. Note that the failure recovery procedure generated by the failure recovery procedure generator 104 is the first candidate for the service restart procedure that the present invention aims to generate.
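  • One possible way to connect sub-procedures under ordering constraints while parallelizing as much as possible is a staged topological sort. The sketch below uses Python's standard `graphlib` module; it is an assumed technique for illustration, not the method the patent prescribes, and the constraint data is hypothetical.

```python
from graphlib import TopologicalSorter


def connect_sub_procedures(must_run_after):
    """Group sub-procedures into stages.

    `must_run_after` maps each sub-procedure ID to the set of IDs that must
    finish before it starts. Sub-procedures in the same stage have no mutual
    ordering constraint, so they could be executed in parallel.
    """
    sorter = TopologicalSorter(must_run_after)
    sorter.prepare()  # raises graphlib.CycleError if the constraints contradict
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical constraint: SP-1 may only run after SP-2 and SP-3.
constraints = {"SP-1": {"SP-2", "SP-3"}, "SP-2": set(), "SP-3": set()}
stages = connect_sub_procedures(constraints)
```

Flattening the stages in order gives the simple serial connection; keeping them as stages gives the parallelized variant.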
  • the time requirement accepting unit 110 accepts the time requirement for the service resumption procedure.
  • the time requirement is, for example, RTO, and more specifically, 1 day, 3 hours, 5 minutes, and the like.
  • the time requirement is determined according to the contract with the customer.
  • The required time estimation means 105 estimates the required time, which is the time needed to execute a generated candidate for the service restart procedure (hereinafter referred to as a service restart procedure candidate).
  • the required time estimating means 105 first estimates the required time for the failure recovery procedure generated by the failure recovery procedure generating means 104 as the first candidate for the service restart procedure.
  • The required time estimating means 105 may estimate the required time by sequentially adding the time required for the operations included in each of the sub-procedures in the failure recovery procedure. Alternatively, the required time estimating means 105 may use a calculation formula that increases the required time in proportion to the number of operations included in the failure recovery procedure. For a more accurate estimate, the required time estimation means 105 may generate a probabilistic model such as a stochastic reward net from, for example, an activity diagram representing the failure recovery procedure, and estimate the required time using an analysis tool such as Stochastic Petri Nets Package.
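  • The two simpler estimation methods described above (summing per-operation times, and a formula proportional to the number of operations) can be sketched as follows. The function names, the example times, and the rate of 5 minutes per operation are assumptions for illustration; the stochastic-model approach is not covered here.

```python
def estimate_by_sum(procedure, operation_minutes):
    """Add up the time of every operation in every sub-procedure, in sequence."""
    return sum(
        minutes
        for sp_id in procedure
        for minutes in operation_minutes[sp_id].values()
    )


def estimate_by_count(procedure, operation_minutes, minutes_per_operation=5.0):
    """Estimate proportional to the number of operations (rate is an assumption)."""
    operation_count = sum(len(operation_minutes[sp_id]) for sp_id in procedure)
    return operation_count * minutes_per_operation


# Hypothetical per-operation times in minutes for each sub-procedure.
operation_minutes = {
    "SP-1": {"restart": 3.0},
    "SP-2": {"restore data": 40.0, "restart": 5.0},
    "SP-3": {"change setting": 10.0},
}
procedure = ["SP-2", "SP-3", "SP-1"]

total_by_sum = estimate_by_sum(procedure, operation_minutes)
total_by_count = estimate_by_count(procedure, operation_minutes)
```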
  • When the estimated required time does not satisfy the time requirement, the required time estimation unit 105 causes the sub-procedure replacement unit 109, described later, to replace at least part of the service resumption procedure candidate (more specifically, at least part of the sub-procedures included in the failure recovery procedure) with a reconstruction procedure.
  • the sub-procedure replacement unit 109 replaces the sub-procedure with the reconstruction procedure until the time required for the updated service restart procedure candidate satisfies the time requirement.
  • To estimate the required time of an updated service resumption procedure candidate, the required time estimation means 105 may, as described above, sequentially add the time required for the operations included in each of the sub-procedures and reconstruction procedures in the updated candidate. Alternatively, it may use a predetermined calculation formula that increases the required time in proportion to the number of operations, or it may substitute the execution time of the replacing reconstruction procedure for the time required to execute the replaced sub-procedure.
  • Based on the combination of failures received by the failure combination receiving unit 101 and the information stored in the reconstruction procedure storage unit 108, the reconstruction procedure specifying unit 107 identifies the reconstruction procedures necessary for reconstructing the components in which failures have occurred.
  • For example, based on the combination of failures received by the failure combination receiving unit 101, the reconstruction procedure specifying unit 107 refers to the information stored in the reconstruction procedure storage means 108 and identifies the reconstruction procedure ID of the reconstruction procedure for each component in which a failure has occurred.
  • When the required time estimated by the required time estimation means 105 shows that a service restart procedure candidate does not satisfy the time requirement, the sub-procedure replacement means 109 replaces at least part of the failure recovery sub-procedures included in that candidate with at least one of the reconstruction procedures identified by the reconstruction procedure identification means 107.
  • sub-procedures may be replaced one by one, or a plurality of sub-procedures may be replaced simultaneously.
  • The sub-procedure replacement means 109 may change the number of replacements according to the excess time, that is, the amount by which the required time of the service restart procedure candidate exceeds the time requirement. The sub-procedure replacement unit 109 may also preferentially use reconstruction procedures with short execution times as replacements.
  • To keep the total number of replacements from sub-procedures to reconstruction procedures small, the sub-procedure replacement means 109 may try replacement counts in ascending order, such as {1, 2, 3, ...}, and may stop the replacement process when the time requirement is satisfied.
  • The sub-procedure replacement means 109 may also perform replacements in descending order of the difference between the execution time of the replaced sub-procedure and the execution time of the replacing reconstruction procedure, that is, starting from the replacement with the largest time-shortening effect.
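  • The replacement strategy described above (replace in order of largest time saving, and stop as soon as the time requirement is met) can be sketched as a greedy loop. This is an illustrative sketch only: it assumes steps run sequentially, that each reconstruction procedure replaces the sub-procedure targeting the same component, and all names and times are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    procedure_id: str
    component_id: str
    minutes: float


def replace_until_requirement_met(steps, rebuilds, limit_minutes):
    """Greedily swap sub-procedures for rebuild steps, largest saving first.

    `rebuilds` maps a component identifier to its reconstruction step.
    Returns (steps, total_minutes, requirement_met).
    """
    steps = list(steps)
    # Candidate replacements, keeping only those that actually save time.
    candidates = sorted(
        (
            (step.minutes - rebuilds[step.component_id].minutes, index)
            for index, step in enumerate(steps)
            if step.component_id in rebuilds
            and rebuilds[step.component_id].minutes < step.minutes
        ),
        reverse=True,
    )
    for _saving, index in candidates:
        if sum(step.minutes for step in steps) <= limit_minutes:
            break  # stop replacing once the time requirement is satisfied
        steps[index] = rebuilds[steps[index].component_id]
    total = sum(step.minutes for step in steps)
    return steps, total, total <= limit_minutes


candidate = [
    Step("SP-1", "application A", 3.0),
    Step("SP-2", "database B", 45.0),
    Step("SP-3", "vm C", 30.0),
]
rebuilds = {
    "database B": Step("RB-B", "database B", 20.0),
    "vm C": Step("RB-C", "vm C", 25.0),
}
new_steps, total, ok = replace_until_requirement_met(candidate, rebuilds, 60.0)
```

In this example the candidate initially takes 78 minutes against a 60-minute requirement, and a single replacement of SP-2 is enough, so SP-3 is left untouched.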
  • For example, the sub-procedure replacement means 109 replaces at least part of the sub-procedures included in the generated service restart procedure candidate with reconstruction procedures, based on the information on each sub-procedure included in that candidate and the information on the identified reconstruction procedures.
  • The sub-procedure replacement unit 109 may determine which sub-procedure can be replaced with which reconstruction procedure based on the identifier of the component associated with each sub-procedure and the identifier of the component associated with each reconstruction procedure.
  • A reconstruction procedure for one component may be allowed to replace the sub-procedures for recovering from failures occurring in any of the components included in that component.
  • information indicating the component inclusion relationship may be stored separately.
  • Since sub-procedures are often prepared for each component failure, the correspondence between component failures and reconstruction procedures may be stored in advance, and which sub-procedure can be replaced with which reconstruction procedure may be determined based on that correspondence.
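  • The replaceability check based on component identifiers and an inclusion relationship might look like the following minimal sketch. The data layout (a `parent_of` map for the inclusion relationship and name-to-component dictionaries) and all names are assumptions made for illustration. The inclusion relationship is assumed to contain no cycles.

```python
def contains(parent_of, ancestor, component):
    """True if `component` is `ancestor` itself or is nested inside it."""
    while component is not None:
        if component == ancestor:
            return True
        component = parent_of.get(component)  # climb to the containing component
    return False


def replaceable_pairs(sub_procs, rebuilds, parent_of):
    """List (sub-procedure, reconstruction procedure) pairs that may be swapped.

    sub_procs / rebuilds: {procedure name: identifier of the associated component}
    parent_of:            {component identifier: identifier of its container}
    """
    return [(sub, rebuild)
            for sub, sub_component in sub_procs.items()
            for rebuild, rebuild_component in rebuilds.items()
            if contains(parent_of, rebuild_component, sub_component)]


# Hypothetical system: an OS and a database run inside virtual machine vm1.
parent_of = {"os": "vm1", "db": "vm1", "vm1": "host1"}
sub_procs = {"fix_db_config": "db", "restart_os": "os", "repair_lb": "lb"}
rebuilds = {"rebuild_vm1": "vm1", "rebuild_db": "db"}
pairs = replaceable_pairs(sub_procs, rebuilds, parent_of)
print(pairs)
```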
  • preconditions regarding the replacement to the reconstruction procedure may be stored together, and the possibility of replacement may be determined according to the situation.
  • Preconditions include, for example, a specific system state (the OS is operating normally, the database is not being backed up, etc.) and an execution order (designation of sub-procedures or reconstruction procedures to be executed beforehand, etc.).
  • To satisfy the preconditions of the replacement-destination reconstruction procedure, the sub-procedure replacement means 109 may, for example, add a new procedure (sub-procedure or reconstruction procedure) or change the execution order of the procedures in the service restart procedure candidate.
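  • One possible way to honor such preconditions is sketched below. The structure of the `preconditions` dictionary (a set of required state facts plus a list of procedures that must run earlier) is a hypothetical encoding chosen for this example, and the function assumes that the sub-procedure being replaced is present in the candidate.

```python
def apply_replacement(candidate, old, new, preconditions, system_state):
    """Replace `old` with `new` in the ordered list `candidate`.

    preconditions[new] may hold:
      "state":  set of facts that must hold in `system_state`
      "before": procedures that must be executed earlier than `new`
    Returns the updated list, or None if the required state does not hold.
    """
    cond = preconditions.get(new, {})
    if not cond.get("state", set()) <= system_state:
        return None  # replacement is not permitted in the current situation
    result = [new if step == old else step for step in candidate]
    for required in cond.get("before", []):
        if required in result[:result.index(new)]:
            continue  # already scheduled ahead of the reconstruction procedure
        if required in result:
            result.remove(required)  # reorder: pull it ahead of `new`
        result.insert(result.index(new), required)  # or add it as a new step
    return result


# Hypothetical precondition: the OS must be running and the application stopped.
preconditions = {"rebuild_db": {"state": {"os_running"}, "before": ["stop_app"]}}
plan = apply_replacement(["fix_db", "fix_app"], "fix_db", "rebuild_db",
                         preconditions, {"os_running"})
print(plan)
```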
  • the sub procedure replacement means 109 outputs a message to that effect if the required time of the service restart procedure candidate after replacement does not satisfy the time requirement.
  • The procedure output unit 106 outputs the service restart procedure candidate as the service restart procedure (or as a candidate for it). That is, the procedure output means 106 provides the user only with service restart procedure candidates that satisfy the accepted time requirement.
  • The procedure output means 106 may provide such a service restart procedure candidate to the user by, for example, outputting its specific processing contents in the form of an activity diagram. Note that if no service restart procedure candidate satisfying the accepted time requirement is generated, the procedure output means 106 may output a message "no corresponding procedure", or may output the service resuming procedure candidate with the shortest required time as reference information for the operator's judgment.
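  • The output rule described above can be illustrated with a small sketch. The tuple layout `(procedure, required_time)` and the returned status strings are assumptions for this example only.

```python
def select_output(candidates, time_limit):
    """Choose what to present to the user.

    candidates: list of (procedure, required_time) pairs.
    Returns (status, list of candidates to show).
    """
    satisfying = [c for c in candidates if c[1] <= time_limit]
    if satisfying:
        return ("ok", satisfying)
    if not candidates:
        return ("no corresponding procedure", [])
    # Nothing meets the requirement: show the fastest candidate as reference.
    fastest = min(candidates, key=lambda c: c[1])
    return ("no corresponding procedure", [fastest])


# Hypothetical candidates with required times in minutes.
candidates = [("plan_a", 110), ("plan_b", 75), ("plan_c", 55)]
print(select_output(candidates, 80))
print(select_output(candidates, 30))
```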
  • the sub procedure storage unit 103 and the reconstruction procedure storage unit 108 are realized by a storage device, for example.
  • the failure combination receiving unit 101 and the time requirement receiving unit 110 are realized by, for example, a CPU that operates according to a program and an input device.
  • the procedure output means 106 is realized by, for example, a CPU that operates according to a program and an output device.
  • the sub procedure specifying unit 102, the failure recovery procedure generating unit 104, the required time estimating unit 105, the reconstruction procedure specifying unit 107, and the sub procedure replacing unit 109 are realized by a CPU that operates according to a program, for example.
  • FIG. 5 is a flowchart showing an example of the operation of the service restart procedure generating device 1 of the present embodiment.
  • the failure combination receiving unit 101 receives a combination of failures that have occurred in the components of the information system (step S101).
  • a combination of faults occurring in the components of the information system may be input by the user or may be acquired directly from the information system.
  • the sub-procedure specifying unit 102 specifies a sub-procedure necessary for setting the state of the component group in which the failure has occurred to the recovery state based on the combination of failures received in step S101 (step S102).
  • the failure recovery procedure generation means 104 generates a failure recovery procedure that is the first candidate for the service restart procedure by connecting the sub-procedures identified in step S102 (step S103).
  • the reconstruction procedure specifying unit 107 specifies a reconstruction procedure necessary for reconstructing the component group in which the failure has occurred based on the combination of failures received in step S101 (step S104). At this time, a component for which a reconstruction procedure is not prepared is skipped ("No reconstruction procedure"). Note that this step may be executed at another timing between step S101 and step S108.
  • The time requirement accepting unit 110 accepts the time requirement for the service resumption procedure (step S105). Note that this step may be executed at another timing between step S101 and step S107.
  • The required time estimating means 105 estimates the required time of the generated service restart procedure candidate (step S106). The first time this step is executed, the required time estimation unit 105 estimates the required time of the failure recovery procedure generated in step S103. In the second and subsequent times, the required time estimating means 105 estimates the required time of the new service resuming procedure candidate generated in step S108 by replacing a part of the sub-procedures with reconstruction procedures.
  • the required time estimating means 105 determines whether or not the required time estimated in step S106 satisfies the time requirement received in step S105 (step S107).
  • the sub-procedure replacement unit 109 may perform this step.
  • If the time requirement is satisfied, the procedure output means 106 outputs the finally obtained service restart procedure candidate as the service restart procedure to a display or the like (step S109).
  • If the time requirement is not satisfied, the sub procedure replacement unit 109 updates the service restart procedure candidate by replacing one of the sub-procedures included in the candidate with one of the identified reconstruction procedures (step S108). Then, the above-described process is repeated for the new service restart procedure candidate generated in step S108 (return to step S106).
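  • The loop of steps S101 to S109 can be summarized in a short sketch. It assumes, purely for illustration, that each failure key maps to a single sub-procedure, that the required time is the sum of the execution times, and that step S108 picks the replacement with the largest time saving. The function and variable names are hypothetical.

```python
def generate_restart_procedure(failures, sub_store, rebuild_store, time_limit):
    """Return (procedure, required_time), or (None, required_time) on failure.

    failures:      list of failure keys accepted in step S101
    sub_store:     {failure key: (sub-procedure name, execution time)}
    rebuild_store: {failure key: (reconstruction procedure name, execution time)}
    """
    # S102 and S103: identify the sub-procedures and connect them.
    candidate = [sub_store[f] for f in failures]
    # S104: identify reconstruction procedures; skip keys that have none.
    rebuilds = {f: rebuild_store[f] for f in failures if f in rebuild_store}
    replaced = set()
    while True:
        required = sum(time for _, time in candidate)      # S106
        if required <= time_limit:                         # S107
            return candidate, required                     # S109
        # S108: choose the unreplaced step with the largest time saving.
        options = [(candidate[i][1] - rebuilds[f][1], i, f)
                   for i, f in enumerate(failures)
                   if f in rebuilds and f not in replaced]
        options = [option for option in options if option[0] > 0]
        if not options:
            return None, required  # every useful replacement has been used
        _, index, key = max(options)
        candidate[index] = rebuilds[key]
        replaced.add(key)


# Hypothetical stores with execution times in minutes.
sub_store = {"db": ("repair_db", 60), "web": ("repair_web", 30),
             "cache": ("flush_cache", 20)}
rebuild_store = {"db": ("rebuild_db", 25), "web": ("rebuild_web", 10)}
print(generate_restart_procedure(["db", "web", "cache"], sub_store,
                                 rebuild_store, 80))
```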
  • As described above, in the present embodiment, a failure recovery procedure corresponding to the combination of failures that occurred is generated, and then a part of the sub-procedures included in the failure recovery procedure is replaced with reconstruction procedures until the specified time requirement is satisfied. Therefore, according to the present embodiment, a service resumption procedure that satisfies the time requirement can be provided to the user even if the time requirement cannot be satisfied by the normal failure recovery procedure.
  • In addition, the range of reconstruction can be limited to the minimum necessary to satisfy the specified time requirement, so that the service can be resumed with the causes of failure removed as far as possible while the specified time requirement is satisfied.
  • Embodiment 2. A second embodiment of the present invention will be described.
  • When an RTO is defined for a service provided by an information system, a fee that must be paid as a penalty for the time by which recovery exceeds the RTO when a failure occurs is often defined as well. These costs are commonly referred to as downtime costs.
  • the cost that can be invested in information system failure recovery is generally limited.
  • a service restart procedure that satisfies such cost requirements is generated.
  • The service restart procedure generating apparatus of the present embodiment differs from the service restart procedure generating apparatus 1 of the first embodiment in that it generates a service restart procedure that satisfies a specified cost requirement, based on the downtime cost for the excess time of a service restart procedure candidate. Hereinafter, differences from the service restart procedure generating apparatus 1 of the first embodiment will be mainly described.
  • FIG. 6 is a block diagram illustrating a configuration example of the service restart procedure generation device according to the present embodiment.
  • the service resumption procedure generation device 2 illustrated in FIG. 6 includes a required cost estimation unit 111 and a cost requirement reception unit 112 in addition to the service resumption procedure generation device 1 of the first embodiment illustrated in FIG.
  • the required cost estimation unit 111 estimates a required cost, which is a cost required to execute a service resuming procedure candidate, based on the required time estimated by the required time estimation unit 105 and the time requirement received by the time requirement reception unit 110.
  • the required cost may be, for example, a downtime cost.
  • the calculation method of the downtime cost may be simply proportional to the excess time of the service restart procedure candidate, or may use a calculation formula determined in advance by a contract with a customer who is a service user.
  • The required cost estimation unit 111 may estimate, as the required cost, not only the downtime cost but also a cost that includes expenses needed to execute the service restart procedure candidate, such as the labor cost and the equipment cost required for the failure recovery.
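  • As a worked illustration of the simplest case mentioned above, a downtime cost proportional to the excess time plus optional labor and equipment costs could be computed as follows. The parameter names and the hourly penalty rate are assumptions. A contract-specific formula would replace the proportional term.

```python
def required_cost(required_time, rto, penalty_per_hour,
                  labor_cost=0.0, equipment_cost=0.0):
    """Estimate the cost of executing a service restart procedure candidate.

    required_time and rto are in hours. The downtime cost is assumed to be
    simply proportional to the time by which required_time exceeds the RTO.
    """
    excess = max(0.0, required_time - rto)  # no penalty when within the RTO
    downtime_cost = excess * penalty_per_hour
    return downtime_cost + labor_cost + equipment_cost


# Hypothetical figures: a 4-hour RTO and a penalty of 1000 per excess hour.
print(required_cost(6.0, 4.0, 1000.0))
print(required_cost(3.0, 4.0, 1000.0, labor_cost=500.0))
```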
  • The required cost estimating unit 111 instructs the sub procedure replacing unit 109 to replace a part of the service resuming procedure candidate (more specifically, the failure recovery sub-procedures in it) with reconstruction procedures.
  • The sub procedure replacement unit 109 then replaces sub-procedures with reconstruction procedures until the required cost of the updated service restart procedure candidate satisfies the cost requirement.
  • the cost requirement receiving means 112 receives the cost requirement for the service resumption procedure.
  • Unlike the required time estimation unit 105 of the first embodiment, the required time estimation unit 105 of the present embodiment performs only the process of estimating the required time of the service restart procedure candidate. That is, the required time estimation unit 105 does not issue an instruction to replace sub-procedures with reconstruction procedures even when the estimated required time does not satisfy the time requirement received by the time requirement reception unit 110.
  • The instruction to replace sub-procedures with reconstruction procedures is issued by the required cost estimation unit 111, as described above.
  • When the service resumption procedure candidate does not satisfy the cost requirement, based on the required cost of the candidate estimated by the required cost estimation unit 111, the sub-procedure replacement unit 109 replaces at least a part of the failure recovery sub-procedures included in the candidate with at least one of the reconstruction procedures specified by the reconstruction procedure specifying means 107. Note that the sub-procedure replacement unit 109 outputs a message to that effect if the required cost of the candidate still does not satisfy the cost requirement even after all replaceable sub-procedures have been replaced with reconstruction procedures.
  • the required cost estimation unit 111 and the cost requirement reception unit 112 are realized by a CPU that operates according to a program, for example.
  • FIG. 7 is a flowchart showing an example of the operation of the service restart procedure generating device 2 of the present embodiment. As shown in FIG. 7, first, similarly to the case of the first embodiment, the processing from step S101 to step S105 is performed.
  • the cost requirement receiving unit 112 receives the cost requirement for the service resumption procedure (step S1051). Note that this step may be executed at another timing between steps S101 and S1062.
  • the required time estimating means 105 estimates the required time of the generated service restart procedure candidate (step S106). This step is the same as in the first embodiment.
  • The required cost estimation unit 111 estimates the required cost of the generated service restart procedure candidate based on the required time estimated in step S106 and the time requirement received in step S105 (step S1061). The first time this step is executed, the required cost estimation unit 111 estimates the required cost of the failure recovery procedure generated in step S103. In the second and subsequent times, the required cost estimation unit 111 estimates the required cost of the new service restart procedure candidate generated in step S108 by replacing a part of the sub-procedures with reconstruction procedures.
  • the required cost estimation unit 111 determines whether or not the required cost estimated in step S1061 satisfies the cost requirement received in step S1051 (step S1062). Note that, similarly to the first embodiment, the sub-procedure replacement unit 109 may perform this step.
  • If the cost requirement is satisfied, the procedure output means 106 outputs the finally obtained service restart procedure candidate as the service restart procedure to a display or the like (step S109).
  • If the cost requirement is not satisfied, the sub procedure replacement unit 109 updates the service restart procedure candidate by replacing a part of the sub-procedures included in the candidate with reconstruction procedures (step S108). Then, the above-described process is repeated for the new service restart procedure candidate generated in step S108 (return to step S106).
  • As described above, in the present embodiment, a failure recovery procedure is generated according to the combination of failures that have occurred, and then a part of the sub-procedures included in the failure recovery procedure is replaced with reconstruction procedures until the specified cost requirement is satisfied. Therefore, according to the present embodiment, even if the time requirement cannot be satisfied by the normal failure recovery procedure, a service resumption procedure that satisfies the cost requirement can be generated. That is, a service resumption procedure that can be executed even when the available cost is limited can be provided to the user.
  • In addition, the range of reconstruction can be limited to the minimum necessary to satisfy the specified cost requirement, so that the service can be resumed with the causes of failure removed as far as possible while the specified cost requirement is satisfied.
  • FIG. 8 is a block diagram showing an outline of the service restart procedure generating device of the present invention.
  • The service restart procedure generating device shown in FIG. 8 includes a failure combination receiving unit 501, a sub procedure storage unit 502, a reconstruction procedure storage unit 503, a sub procedure identification unit 504, a failure recovery procedure generation unit 505, a reconstruction procedure specifying unit 506, a sub procedure replacement unit 507, and a procedure output unit 508.
  • the failure combination accepting unit 501 (for example, the failure combination accepting unit 101) accepts a combination of failures occurring in the components included in the information system.
  • the sub-procedure storage unit 502 (for example, the sub-procedure storage unit 103) stores sub-procedure information, which is a procedure for recovering a failure occurring in a component included in the information system, in association with a component identifier.
  • the reconstruction procedure storage unit 503 (for example, the reconstruction procedure storage unit 108) stores information on a reconstruction procedure, which is a procedure for reconstructing a component included in the information system, in association with the component identifier.
  • The sub-procedure specifying unit 504 (for example, the sub-procedure specifying unit 102) specifies the sub-procedures necessary for recovering the failed components from the failure state, based on the combination of failures received by the failure combination receiving unit 501.
  • The failure recovery procedure generation unit 505 (for example, the failure recovery procedure generation unit 104) generates a failure recovery procedure by connecting the specified sub-procedures, based on the sub-procedure information specified by the sub-procedure specifying unit 504.
  • The rebuilding procedure specifying unit 506 (for example, the rebuilding procedure specifying unit 107) specifies the rebuilding procedures necessary for rebuilding the failed components, based on the combination of failures received by the failure combination receiving unit 501.
  • When the generated failure recovery procedure does not satisfy a predetermined requirement, the sub-procedure replacement unit 507 (for example, the sub-procedure replacement unit 109) replaces at least a part of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the specified reconstruction procedures.
  • the procedure output unit 508 (for example, the procedure output unit 106) outputs a failure recovery procedure in which at least a part of the sub procedure is replaced by the reconstruction procedure by the sub procedure replacement unit 507 as a service restart procedure.
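  • To make the division of roles among units 501 to 508 concrete, the following sketch models the two stores as dictionaries keyed by component identifier and chains the specifying, generating, and replacing steps. All class, function, and procedure names are hypothetical, and the replacement step here swaps every replaceable sub-procedure at once for brevity, whereas the embodiments replace only as many as needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Procedure:
    name: str
    component: str  # identifier of the component the procedure is tied to


# Sub procedure storage (502): keyed by (component identifier, failure).
SUB_PROCEDURES = {
    ("db", "corrupt_file"): Procedure("restore_db_file", "db"),
    ("web", "bad_config"): Procedure("fix_web_config", "web"),
}
# Reconstruction procedure storage (503): keyed by component identifier.
REBUILD_PROCEDURES = {"db": Procedure("rebuild_db", "db")}


def specify_sub_procedures(failures):                # unit 504
    return [SUB_PROCEDURES[failure] for failure in failures]


def specify_rebuild_procedures(failures):            # unit 506
    return {component: REBUILD_PROCEDURES[component]
            for component, _ in failures if component in REBUILD_PROCEDURES}


def replace_sub_procedures(recovery, rebuilds, meets_requirement):  # unit 507
    if meets_requirement(recovery):
        return recovery
    return [rebuilds.get(step.component, step) for step in recovery]


failures = [("db", "corrupt_file"), ("web", "bad_config")]   # unit 501
recovery = specify_sub_procedures(failures)                  # units 504 and 505
restart = replace_sub_procedures(recovery,
                                 specify_rebuild_procedures(failures),
                                 meets_requirement=lambda procedure: False)
print([step.name for step in restart])                       # unit 508
```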
  • an optimal service restart procedure can be automatically generated according to a combination of failures that have occurred.
  • The service restart procedure generating apparatus of the present invention may include a time requirement receiving unit (for example, the time requirement receiving unit 110) that receives a time requirement imposed on the service restart procedure, and required time estimation means (for example, required time estimation means 105) for estimating a required time, which is the time required to perform a specified procedure.
  • In that case, when the required time of the failure recovery procedure estimated by the required time estimation means does not satisfy the time requirement, the sub procedure replacement unit 507 may replace at least some of the sub-procedures included in the failure recovery procedure with reconstruction procedures so as to satisfy the time requirement.
  • The service resumption procedure generating apparatus of the present invention may further include cost requirement acceptance means (for example, cost requirement acceptance means 112) for accepting a cost requirement imposed on the service resumption procedure, and required cost estimation means (for example, required cost estimation means 111) for estimating a required cost that is incurred in carrying out a designated procedure and that includes a downtime cost for exceeding the failure recovery time.
  • In that case, when the required cost of the failure recovery procedure estimated by the required cost estimation means does not satisfy the cost requirement, the sub procedure replacement unit 507 may replace at least some of the sub-procedures included in the failure recovery procedure with reconstruction procedures so as to satisfy the cost requirement.
  • The sub-procedure replacement means 507 may set the generated failure recovery procedure as the first candidate for the service resumption procedure and repeatedly perform a replacement process that replaces at least a part of the sub-procedures included in the candidate with reconstruction procedures until the candidate satisfies a predetermined requirement, and the procedure output unit 508 may output, as the service restart procedure, the candidate that the sub procedure replacement unit 507 has determined to satisfy the predetermined requirement.
  • sub-procedure replacement means 507 may perform replacement in descending order of execution time difference before and after replacement.
  • the reconstruction procedure may be provided in a script or program.
  • Examples in which the required time or the required cost is used as an evaluation index for the service restart procedure have been shown above, but other evaluation indexes related to system requirements, such as the success rate of the service restart procedure, may be used.
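  • Swapping in a different evaluation index can be sketched by passing the requirement check as a function. The example below uses a success-rate index and assumes, for illustration, that the success rate of a whole procedure is the product of the per-step success rates. The step dictionaries and the 0.9 threshold are hypothetical.

```python
import math


def refine(candidate, replacements, satisfied):
    """Apply replacements one at a time until `satisfied(candidate)` holds.

    candidate:    ordered list of steps
    replacements: list of (index in candidate, replacement step)
    satisfied:    function that evaluates the chosen index against its requirement
    Returns the refined candidate, or None if the requirement is never met.
    """
    candidate = list(candidate)
    for index, new_step in replacements:
        if satisfied(candidate):
            break
        candidate[index] = new_step
    return candidate if satisfied(candidate) else None


def success_rate(steps):
    # Assumed model: steps succeed or fail independently of one another.
    return math.prod(step["success"] for step in steps)


candidate = [{"name": "repair_db", "success": 0.80},
             {"name": "repair_web", "success": 0.99}]
replacements = [(0, {"name": "rebuild_db", "success": 0.95})]
result = refine(candidate, replacements, lambda c: success_rate(c) >= 0.9)
print([step["name"] for step in result])
```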
  • Each function of the service restart procedure generating device is realized by software, more specifically, by a CPU executing a program, but may instead be realized by hardware such as a circuit.
  • the program is stored in a storage device, but may be stored in a computer-readable recording medium.
  • the recording medium is a portable medium such as a flexible disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
  • the present invention is applicable not only to restarting a service, but also to an apparatus, system, method, and program used for recovering, for example, an information system in which a failure has occurred to a state where there is no failure.

Abstract

Provided is a service resumption sequence generating device comprising: a fault combination acceptance means for accepting a combination of faults occurring in components of an information system; a sub-sequence specification means for specifying sub-sequences required for the components in which the faults are occurring to recover from the fault state; a fault recovery sequence generating means for connecting the specified sub-sequences and generating a fault recovery sequence; a rebuild sequence specification means for specifying a rebuild sequence necessary for rebuilding the components in which the faults are occurring; a sub-sequence substitution means for substituting at least a portion of the sub-sequences included in the fault recovery sequence with the rebuild sequence if the fault recovery sequence does not satisfy a prescribed requirement; and a sequence output means for outputting the post-substitution fault recovery sequence as a service resumption sequence.

Description

Service resumption procedure generation device, service resumption procedure generation method, and service resumption procedure generation program
 The present invention relates to a service resumption procedure generation device, a service resumption procedure generation method, and a service resumption procedure generation program that generate a procedure for resuming a service of an information system that has stopped due to the occurrence of a failure.
 In the event of a large-scale disaster or the like, many components in an information system may fail simultaneously. To resume the service of the information system from such a situation, an operational procedure (a so-called failure recovery procedure) is required that restores the entire information system from the failure state to a serviceable state in response to the simultaneous failures. To recover the entire information system, many failure recovery procedures first check the system state and identify the cause of the failure, and then correct the problem locations.
 A failure recovery procedure for an information system includes sub-procedures (for example, entering commands or operating a graphical user interface) for recovering from the failures occurring in components. Because the required sub-procedures differ for each failure occurring in a component, the failure recovery procedure differs according to the combination of failures that have occurred. The number of combinations of failures that can occur simultaneously in many components is enormous, so it is unrealistic for a user to manually create failure recovery procedures for all combinations. It is therefore reasonable to generate failure recovery procedures automatically.
 One of the general customer requirements defined for failure recovery of an information system is an index called the RTO (recovery time objective), which represents the time required for recovery. If the failure recovery procedure of the information system cannot meet the RTO, the provider of the information system may have to pay a penalty cost to the customer. In such a case, the administrator of the information system needs to generate the failure recovery procedure so that it satisfies the RTO.
 However, in an information system that guarantees a certain RTO based on an SLA (Service Level Agreement), merely following the normal failure recovery procedure, which identifies the cause of the failure and then corrects the problem locations, may not allow the service to be resumed within the RTO. This is because the cause of the failure may be complicated and take time to identify, or there may be so many problem locations that the correction takes a long time to complete.
 Software rejuvenation, which counters software aging, is known as a failure handling method that requires neither identification of the cause of a failure nor correction of problem locations. Software aging is a general term for degradation phenomena (memory leaks, file fragmentation, etc.) that arise in an operating environment through long periods of continuous operation. As the degradation of the operating environment progresses, it can cause failures in the information system. Software rejuvenation is a technique for preventing failures caused by software aging by initializing the internal state of at least a part of the information system.
 Patent Document 1 describes a method of simultaneously rejuvenating the host machines and virtual machines that need rejuvenation while keeping the host machines and virtual machines that do not need rejuvenation running continuously.
 As another example of a failure handling method that requires neither identification of the cause of a failure nor correction of problem locations, Non-Patent Document 1 describes a software life extension method. In the software life extension method described in Non-Patent Document 1, when software aging occurs in a virtual machine running in a virtualized environment, additional resources are newly allocated to the virtual machine to lengthen the operating time of the software.
International Publication No. 2010/122710 pamphlet
 However, the methods described in Patent Document 1 and Non-Patent Document 1 have the problem that they cannot deal with failures of an information system other than those caused by software aging.
 Reconstruction of at least a part of the information system is one failure handling method that can deal with failures other than those caused by software aging (for example, file corruption, configuration mistakes, or rewriting through unauthorized access) and that requires neither identification of the cause of the failure nor correction of problem locations.
 Because a reconstruction procedure does not depend on the current system state, it is easy to automate using a system configuration management tool such as Chef. On the other hand, reconstruction does not remove the cause of the failure, so a failure due to the same cause may occur again. Reconstruction may also destroy information that would be needed later to identify the cause of the failure or to correct the problem locations. For this reason, even when reconstruction is needed because of time or cost requirements, the range of reconstruction is preferably kept to a minimum.
 The present invention has been made in view of the above points, and an object thereof is to provide a service resumption procedure generation device, a service resumption procedure generation method, and a service resumption procedure generation program that automatically generate an optimal service resumption procedure according to the combination of failures that have occurred, even when a normal failure recovery procedure cannot satisfy a predetermined requirement.
 A service resumption procedure generation device according to the present invention comprises: failure combination acceptance means for accepting a combination of failures occurring in components of an information system; sub-procedure storage means for storing information on sub-procedures, which are procedures for recovering from failures occurring in the components, in association with component identifiers; reconstruction procedure storage means for storing information on reconstruction procedures, which are procedures for reconstructing the components, in association with component identifiers; sub-procedure specifying means for specifying, based on the combination of failures accepted by the failure combination acceptance means, the sub-procedures necessary to recover the failed components from the failure state; failure recovery procedure generation means for generating a failure recovery procedure by connecting the specified sub-procedures, based on the information on the sub-procedures specified by the sub-procedure specifying means; reconstruction procedure specifying means for specifying, based on the combination of failures accepted by the failure combination acceptance means, the reconstruction procedures necessary to reconstruct the failed components; sub-procedure replacement means for replacing, when the generated failure recovery procedure does not satisfy a predetermined requirement, at least a part of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the specified reconstruction procedures; and procedure output means for outputting, as a service resumption procedure, the failure recovery procedure in which at least a part of the sub-procedures has been replaced with reconstruction procedures by the sub-procedure replacement means.
A service resumption procedure generation method according to the present invention comprises: storing, in a predetermined sub-procedure storage unit, information on sub-procedures, each being a procedure for recovering from a failure occurring in a component of an information system, in association with component identifiers; storing, in a predetermined reconstruction procedure storage unit, information on reconstruction procedures, each being a procedure for reconstructing a component, in association with component identifiers; receiving, by an information processing device, a combination of failures occurring in the components of the information system; identifying, by the information processing device, based on the received combination of failures, the sub-procedures needed to recover the failed components from the failure state; generating, by the information processing device, a failure recovery procedure by connecting the identified sub-procedures based on the information on the identified sub-procedures; identifying, by the information processing device, based on the received combination of failures, the reconstruction procedures needed to reconstruct the failed components; replacing, by the information processing device, when the generated failure recovery procedure does not satisfy a predetermined requirement, at least some of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the identified reconstruction procedures; and outputting, by the information processing device, as a service resumption procedure, the failure recovery procedure in which at least some of the sub-procedures have been replaced with reconstruction procedures.
A service resumption procedure generation program according to the present invention is applied to a computer that includes a sub-procedure storage unit that stores information on sub-procedures, each being a procedure for recovering from a failure occurring in a component of an information system, in association with component identifiers, and a reconstruction procedure storage unit that stores information on reconstruction procedures, each being a procedure for reconstructing a component, in association with component identifiers. The program causes the computer to execute: failure combination reception processing for receiving a combination of failures occurring in the components of the information system; sub-procedure identification processing for identifying, based on the received combination of failures, the sub-procedures needed to recover the failed components from the failure state; reconstruction procedure identification processing for identifying, based on the received combination of failures, the reconstruction procedures needed to reconstruct the failed components; sub-procedure replacement processing for replacing, when the generated failure recovery procedure does not satisfy a predetermined requirement, at least some of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the identified reconstruction procedures; and procedure output processing for outputting, as a service resumption procedure, the failure recovery procedure in which at least some of the sub-procedures have been replaced with reconstruction procedures.
According to the present invention, an optimal service resumption procedure can be generated automatically according to the combination of failures that have occurred, even when a normal failure recovery procedure cannot satisfy a predetermined requirement.
A block diagram showing a configuration example of the service resumption procedure generation device of the first embodiment.
An activity diagram showing an example of a sub-procedure.
An explanatory diagram showing an example of information stored in the sub-procedure storage unit 103.
An explanatory diagram showing an example of information stored in the reconstruction procedure storage unit 108.
A flowchart showing an example of the operation of the service resumption procedure generation device of the first embodiment.
A block diagram showing a configuration example of the service resumption procedure generation device of the second embodiment.
A flowchart showing an example of the operation of the service resumption procedure generation device of the second embodiment.
A block diagram showing an outline of the present invention.
Embodiments of the present invention are described below with reference to the drawings. First, the failure recovery procedure in the present invention is described. A failure recovery procedure is a procedure that makes an information system resume its service by recovering the failed components of the information system from the failure state. A failure recovery procedure includes sub-procedures, each of which is a self-contained set of steps for recovering from a failure occurring in a component of the information system.
A sub-procedure is not particularly limited as long as it is a procedure for recovering from a failure occurring in a component. For example, a sub-procedure may include system management operations such as a restart, data recovery, or a settings change. Each sub-procedure may be described in advance in a document or manual. Each sub-procedure may also be provided as a script or program automated with an existing system configuration management tool such as JP1.
When a disaster or similar event causes failures in components of an information system and service provision stops, the system operator (hereinafter simply "operator") is responsible for restoring the information system to a serviceable state, for example by recovering the failed components from the failure state according to a failure recovery procedure. The sub-procedures needed to restore the information system differ depending on the combination of failed components and the combination of failures that have occurred. The operator therefore first identifies the sub-procedures needed to restore the information system and then executes them. Component failures include not only a component being down but also states in which the component cannot be used normally, such as "some required commands cannot be executed" or "part of the data the system needs has been lost". Accordingly, the present invention identifies the sub-procedures to include in the failure recovery procedure according to the combination of failed components and the combination of failures that have occurred.
Next, the "reconstruction procedure" is described. A reconstruction procedure is a procedure for reconstructing failed components. It is not particularly limited as long as it is a procedure for reconstructing a component. Because of constraints in how the information system is implemented and the cost of preparing reconstruction procedures, a reconstruction procedure is not necessarily available for every component. Each reconstruction procedure may be described in advance in a document or manual. Each reconstruction procedure may also be provided as a script or program automated with an existing system configuration management tool such as Chef.
The present invention generates a service resumption procedure, which is a procedure for making the information system resume its service, by appropriately combining the failure recovery procedure described above (more specifically, the sub-procedures it contains) with reconstruction procedures. Hereinafter, "service resumption procedure" refers to any procedure for restoring the information system from a failure state to a serviceable state, regardless of whether it uses sub-procedures for failure recovery or reconstruction procedures.
The components of an information system include everything in the information system that can be a target of failure recovery or reconstruction. Examples include applications, tasks, threads, VMs (virtual machines), central processing units (CPUs), peripheral devices, storage, server devices, and personal computers. A component may be software or hardware. A component may itself contain multiple components. Hereinafter, "component" may refer to a component group containing multiple components. Likewise, "sub-procedure" may refer to a sub-procedure group containing multiple sub-procedures, and "reconstruction procedure" may refer to a reconstruction procedure group containing multiple reconstruction procedures. More specifically, a sub-procedure assigned a single ID may contain multiple sub-procedures, and a reconstruction procedure assigned a single ID may contain multiple reconstruction procedures. This is because, depending on the combination of simultaneous failures, it can be better to handle multiple failures or components together.
Embodiment 1.
FIG. 1 is a block diagram showing a configuration example of the service resumption procedure generation device according to the first embodiment of the present invention. As shown in FIG. 1, the service resumption procedure generation device 1 of this embodiment includes a failure combination reception unit 101, a sub-procedure identification unit 102, a sub-procedure storage unit 103, a failure recovery procedure generation unit 104, a required time estimation unit 105, a procedure output unit 106, a reconstruction procedure identification unit 107, a reconstruction procedure storage unit 108, a sub-procedure replacement unit 109, and a time requirement reception unit 110.
The service resumption procedure generation device 1 is realized by a general information processing device (computer), for example a server device or a personal computer. The service resumption procedure generation device 1 includes a CPU, a storage device, an input device, and an output device (none of which are shown). The storage device is, for example, a memory and a hard disk drive (HDD). The input device is, for example, a keyboard, a mouse, or one of various network interfaces. The output device is, for example, a display or one of various network interfaces. The service resumption procedure generation device 1 is configured so that each unit shown in FIG. 1 is realized by the CPU executing a program stored in the storage device.
The failure combination reception unit 101 receives a combination of failures occurring in the components of the information system. The information indicating the combination of failures may be a set of identifiers of the failed components. For example, it may be specified using component names, such as {"application A", "database B"}, or numbers assigned to the components in advance, such as {1, 2, 3}. The information indicating the combination of failures may further include the type of failure that has occurred in each component. That is, it may be a set of failure information items, each containing the identifier of a failed component, the failure type, and so on.
The sub-procedure storage unit 103 stores information on the various sub-procedures for recovering from failures of the components of the information system. For example, the sub-procedure storage unit 103 stores, in association with a sub-procedure ID that identifies a sub-procedure, the corresponding failure type and the sub-procedure itself (information describing the specific processing of the sub-procedure, or a script or program that actually executes that processing). FIG. 2 is an activity diagram showing an example of a sub-procedure. In the present invention, the actual form of a sub-procedure does not matter. That is, the sub-procedure provided to the user may be information describing its specific processing, as in the activity diagram of FIG. 2, may be a script or program that actually executes the procedure on the information system, or may be a combination of these. In general, a sub-procedure for failure recovery includes work to identify the cause of the failure, so it is difficult to automate all of it. In such a case, one possibility is to have the user identify the cause of the failure and then perform the recovery operation appropriate to that cause (for example, a restart, data recovery, or a settings change) automatically by running a script or the like.
FIG. 3 is an explanatory diagram showing an excerpt of an example of the information stored in the sub-procedure storage unit 103. As shown in FIG. 3, the sub-procedure storage unit 103 of this embodiment also stores, in association with the sub-procedure ID, the identifier of the component that the sub-procedure (the one identified by the sub-procedure ID) recovers. That is, for each sub-procedure, the sub-procedure storage unit 103 of this embodiment stores information that includes at least the sub-procedure itself and the identifier of the component it recovers. A sub-procedure is defined, for example, for each failure of a component of the information system. Sub-procedures need not be defined per failure; for example, a single sub-procedure may be defined for a specific combination of failures.
The sub-procedure storage unit 103 may also store, in association with the sub-procedure ID, accompanying information such as preconditions that must hold when the sub-procedure is executed, the name of the sub-procedure, the component state that executing the sub-procedure brings about (the component's destination state), and the time required for each operation in the sub-procedure (settings change, reboot, shutdown, and so on). Preconditions are, for example, information on sub-procedures that must be executed beforehand, assumed states, and sub-procedures that cannot be executed at the same time. The name of the sub-procedure is used to help the user understand it. The component's destination state is used, for example, to judge whether the preconditions of a subsequent sub-procedure are satisfied when multiple sub-procedures are connected to generate a failure recovery procedure. The time required for each operation is used to estimate the execution time of the sub-procedure. Hereinafter, the information stored in the sub-procedure storage unit 103 may be referred to as sub-procedure information.
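As an illustration only, the sub-procedure information described above could be held in a record like the following. This is a minimal sketch: the class name, field names, IDs, and time values are all hypothetical and are not taken from FIG. 3.

```python
from dataclasses import dataclass, field


@dataclass
class SubProcedure:
    """One hypothetical entry of sub-procedure information."""
    sub_procedure_id: str
    component_ids: frozenset[str]          # components this sub-procedure recovers
    failure_type: str
    name: str = ""                         # shown to the user for readability
    requires: frozenset[str] = frozenset() # IDs of sub-procedures that must run first
    resulting_state: str = "recovered"     # component state after execution
    # time required per operation (assumed to be in minutes)
    operation_minutes: dict[str, float] = field(default_factory=dict)

    def execution_minutes(self) -> float:
        """Execution time estimated as the sum of the operation times."""
        return sum(self.operation_minutes.values())


# Illustrative values only.
restore_db = SubProcedure(
    sub_procedure_id="SP-2",
    component_ids=frozenset({"database B"}),
    failure_type="data loss",
    name="Restore database B from backup",
    requires=frozenset({"SP-1"}),
    operation_minutes={"data recovery": 40.0, "restart": 5.0},
)
print(restore_db.execution_minutes())
```

Keeping the per-operation times alongside the preconditions means the same record can serve both the ordering step and the time estimation step.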
The reconstruction procedure storage unit 108 stores information on reconstruction procedures, which are procedures for reconstructing failed components. For example, the reconstruction procedure storage unit 108 stores, in association with a reconstruction procedure ID that identifies a reconstruction procedure, the reconstruction procedure itself (information describing the specific processing of the reconstruction procedure, or a script or program that actually executes that processing). Examples of reconstruction include rebuilding an application and redeploying a VM (virtual machine). In the present invention, the actual form of a reconstruction procedure does not matter. That is, the reconstruction procedure provided to the user may be information that lets the user carry out the reconstruction procedure, may be a script or program that actually executes the procedure on the information system, or may be a combination of these. Unlike a simple restart, which merely clears internal state, reconstruction builds the component again from scratch. For example, redeploying a VM includes operations such as configuring the resources allocated to the VM, setting the IP address and firewall of the OS on the VM, and transferring the VM image. On the other hand, a reconstruction procedure does not require assessing the current state or identifying the failure, and an appropriate procedure is often already known, for example because the component worked correctly when it was first built. If an appropriate procedure is known in advance, the reconstruction procedure can also be automated with a script or the like.
FIG. 4 is an explanatory diagram showing an excerpt of the information stored in the reconstruction procedure storage unit 108. As shown in FIG. 4, the reconstruction procedure storage unit 108 of this embodiment also stores, in association with the reconstruction procedure ID, the identifier of the component that the reconstruction procedure (the one identified by the reconstruction procedure ID) reconstructs. That is, for each reconstruction procedure, the reconstruction procedure storage unit 108 of this embodiment stores information that includes at least the reconstruction procedure itself and the identifier of the component it reconstructs. A reconstruction procedure is defined, for example, for each component of the information system. Reconstruction procedures need not be defined per component; for example, a single reconstruction procedure may be defined for a specific combination of components.
The reconstruction procedure storage unit 108 may also store, in association with the reconstruction procedure ID, the execution time of the reconstruction procedure and the component state that executing it brings about (the component's destination state).
The sub-procedure identification unit 102 identifies, based on the combination of failures received by the failure combination reception unit 101, the sub-procedures needed to recover the failed components from the failure state. For example, for each failed component in the received combination of failures, the sub-procedure identification unit 102 refers to the information stored in the sub-procedure storage unit 103 and identifies the sub-procedure ID of the sub-procedure that recovers the failure occurring in that component.
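The lookup described above can be sketched as a simple table search. The sketch below assumes that each failure is given as a (component identifier, failure type) pair and that the stored information can be indexed by that pair; the table contents and the function name are hypothetical.

```python
# (component identifier, failure type) -> sub-procedure ID.
# All entries are hypothetical examples, not data from the source.
SUB_PROCEDURE_TABLE = {
    ("application A", "process down"): "SP-1",
    ("database B", "data loss"): "SP-2",
    ("VM C", "command failure"): "SP-3",
}


def identify_sub_procedures(failure_combination):
    """Return the sub-procedure ID for each reported failure, in input order."""
    identified, missing = [], []
    for failure in failure_combination:
        sub_procedure_id = SUB_PROCEDURE_TABLE.get(failure)
        if sub_procedure_id is None:
            missing.append(failure)
        elif sub_procedure_id not in identified:
            identified.append(sub_procedure_id)
    if missing:
        # Assumed behaviour: report failures that have no stored sub-procedure.
        raise LookupError(f"no sub-procedure stored for: {missing}")
    return identified


failures = [("application A", "process down"), ("database B", "data loss")]
print(identify_sub_procedures(failures))
```

Raising an error for an unknown failure is a design choice of this sketch; the source does not say how an unmatched failure is handled.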
The failure recovery procedure generation unit 104 connects the sub-procedures identified by the sub-procedure identification unit 102 (combining them in a suitable order) to generate a failure recovery procedure. In the simplest case, the failure recovery procedure generation unit 104 may generate the failure recovery procedure by connecting the identified sub-procedures in series. When there are constraints among the identified sub-procedures, such as ordering constraints, the failure recovery procedure generation unit 104 connects the sub-procedures so that those constraints are satisfied. When there are multiple candidate ways to connect them, the failure recovery procedure generation unit 104 may connect them so as to shorten the time needed for failure recovery, for example by parallelizing as much as possible. The failure recovery procedure generated by the failure recovery procedure generation unit 104 is treated as the first candidate for the service resumption procedure that the present invention generates.
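One way to connect sub-procedures under ordering constraints while parallelizing where possible is a topological sort that emits stages of mutually independent sub-procedures. The sketch below uses Python's standard `graphlib` module for this; it is a swapped-in illustration under the assumption that the preconditions are plain "must run before" relations, not the method the source prescribes, and the IDs are hypothetical.

```python
from graphlib import TopologicalSorter


def connect_sub_procedures(prerequisites):
    """Group sub-procedures into stages.

    prerequisites maps each sub-procedure ID to the set of IDs that must
    finish before it starts. Sub-procedures in the same stage have no
    ordering constraint between them and could run in parallel.
    """
    sorter = TopologicalSorter(prerequisites)
    sorter.prepare()  # raises graphlib.CycleError on circular constraints
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical constraints: SP-2 needs SP-1; SP-4 needs SP-2 and SP-3.
prerequisites = {
    "SP-1": set(),
    "SP-2": {"SP-1"},
    "SP-3": set(),
    "SP-4": {"SP-2", "SP-3"},
}
stages = connect_sub_procedures(prerequisites)
print(stages)  # [['SP-1', 'SP-3'], ['SP-2'], ['SP-4']]
```

Flattening the stages in order gives the simple serial connection; keeping them as stages gives the parallelized variant.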
The time requirement reception unit 110 receives the time requirement for the service resumption procedure. The time requirement is, for example, a recovery time objective (RTO); concrete examples are 1 day, 3 hours, or 5 minutes. The time requirement is determined according to, for example, the contract with the customer.
The required time estimation unit 105 estimates the required time, which is the time needed to execute a generated candidate for the service resumption procedure (hereinafter, "service resumption procedure candidate"). The required time estimation unit 105 first estimates the required time of the failure recovery procedure that the failure recovery procedure generation unit 104 generated as the first candidate for the service resumption procedure.
In the simplest case, the required time estimation unit 105 may estimate the required time by sequentially adding up the times required for the operations contained in each sub-procedure of the failure recovery procedure. The required time estimation unit 105 may also use a formula in which the required time increases in proportion to the number of operations contained in the failure recovery procedure. For a more accurate estimate, the required time estimation unit 105 may, for example, generate a stochastic model such as a stochastic reward net from the activity diagram representing the failure recovery procedure and estimate the required time with an analysis tool such as the Stochastic Petri Nets Package.
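The two simple estimates mentioned above can be sketched as follows. The representation of a procedure as a list of sub-procedures, each a list of (operation, minutes) pairs, and all numbers are assumptions made for illustration.

```python
def sequential_estimate(procedure):
    """Add up the time of every operation of every sub-procedure, in sequence."""
    return sum(minutes for sub_procedure in procedure for _, minutes in sub_procedure)


def proportional_estimate(procedure, minutes_per_operation):
    """Estimate that grows in proportion to the number of operations."""
    operation_count = sum(len(sub_procedure) for sub_procedure in procedure)
    return operation_count * minutes_per_operation


# Hypothetical failure recovery procedure with two sub-procedures.
procedure = [
    [("restart", 5.0), ("settings change", 10.0)],
    [("data recovery", 40.0), ("restart", 5.0)],
]
print(sequential_estimate(procedure))          # 60.0
print(proportional_estimate(procedure, 12.0))  # 48.0
```

Both estimates ignore parallel execution and variance in operation times, which is where a stochastic model would be expected to give a more accurate figure.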
When the estimated required time of the failure recovery procedure exceeds the time given as the time requirement, the required time estimation unit 105 causes the sub-procedure replacement unit 109, described later, to replace at least some of the sub-procedures in the service resumption procedure candidate (more specifically, in the failure recovery procedure it contains) with reconstruction procedures. The sub-procedure replacement unit 109 keeps replacing sub-procedures with reconstruction procedures until the required time of the updated service resumption procedure candidate satisfies the time requirement.
To estimate the required time of an updated service resumption procedure candidate, the required time estimation unit 105 may, for example, sequentially add up the times required for the operations contained in each sub-procedure and reconstruction procedure of the updated candidate, as described above. Alternatively, it may use a predetermined formula in which the required time increases in proportion to the number of operations, or it may substitute the execution time of each replacing reconstruction procedure for the execution time of the sub-procedure it replaced.
The reconstruction procedure identification unit 107 identifies, based on the combination of failures received by the failure combination reception unit 101 and the information stored in the reconstruction procedure storage unit 108, the reconstruction procedures needed to reconstruct the failed components. For example, for each failed component in the received combination of failures, the reconstruction procedure identification unit 107 refers to the information stored in the reconstruction procedure storage unit 108 and identifies the reconstruction procedure ID of the reconstruction procedure that reconstructs that component.
When the required time estimated by the required time estimation means 105 shows that the service resumption procedure candidate does not satisfy the time requirement, the sub-procedure replacement means 109 replaces at least some of the failure recovery sub-procedures included in the candidate with at least one of the reconstruction procedures identified by the reconstruction procedure specifying means 107. Sub-procedures may be replaced one at a time, or several may be replaced at once. The sub-procedure replacement means 109 may also vary the number of replacements according to the excess time, that is, the amount by which the required time of the candidate exceeds the time requirement. It may also give priority to reconstruction procedures with short execution times when choosing replacements.
To keep the scope of reconstruction to a minimum, the sub-procedure replacement means 109 may increase the total number of sub-procedures replaced with reconstruction procedures step by step from a small number, as in {1, 2, 3, ...}, and stop the replacement processing as soon as the time requirement is satisfied. The sub-procedure replacement means 109 may also, for example, replace in descending order of the difference between the execution time of the sub-procedure being replaced and the execution time of the reconstruction procedure replacing it, that is, starting from the replacement with the largest time saving.
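A minimal sketch of this ordering strategy follows, assuming one sub-procedure and at most one reconstruction procedure per component. The function name, the dictionary inputs, and the rule of skipping replacements that save no time are assumptions made for illustration.

```python
def replace_until_met(sub_times, rebuild_times, limit):
    """Replace sub-procedures one at a time, largest time saving first.

    sub_times:     component id -> sub-procedure time (assumed unit: minutes)
    rebuild_times: component id -> reconstruction procedure time
    limit:         time requirement
    Returns (replaced component ids, resulting total time, requirement met?).
    """
    total = sum(sub_times.values())
    # Only replacements that actually shorten the procedure are considered (assumption).
    candidates = sorted(
        (c for c in sub_times
         if c in rebuild_times and sub_times[c] > rebuild_times[c]),
        key=lambda c: sub_times[c] - rebuild_times[c],
        reverse=True,
    )
    replaced = []
    for component in candidates:
        if total <= limit:
            break  # stop as soon as the time requirement is satisfied
        total += rebuild_times[component] - sub_times[component]
        replaced.append(component)
    return replaced, total, total <= limit


sub_times = {"web": 30, "db": 60, "cache": 10}
rebuild_times = {"web": 20, "db": 25, "cache": 8}
result = replace_until_met(sub_times, rebuild_times, limit=70)
```

With a limit of 70 only the `db` sub-procedure is replaced, which keeps the reconstructed scope as small as this greedy ordering allows.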
In general, reconstructing a component makes every sub-procedure for recovering from failures in that component unnecessary. The sub-procedure replacement means 109 replaces at least some of the sub-procedures included in the generated service resumption procedure candidate with reconstruction procedures based on, for example, the information on each sub-procedure included in the candidate and the information on the identified reconstruction procedures. For example, it may determine which sub-procedure can be replaced by which reconstruction procedure from the component identifier associated with each sub-procedure and the component identifier associated with each reconstruction procedure. When a component contains other components, the reconstruction procedure for that one component may be able to replace the sub-procedures for recovering from the failures in all of the components it contains. For such cases, information indicating the containment relationships among components may be stored separately. In addition, because sub-procedures are often organized and prepared per component failure, the correspondence between component failures and reconstruction procedures may be stored in advance and used to determine which sub-procedures can be replaced by which reconstruction procedures. Preconditions for replacement with a reconstruction procedure may be stored together with this correspondence, so that replaceability can also be judged according to the situation. Examples of preconditions include a specific system state (the OS is running normally, the database is not being backed up, and so on) and an ordering (a designation of sub-procedures or reconstruction procedures that must be executed beforehand, and so on). When replacing a sub-procedure with the corresponding reconstruction procedure, the sub-procedure replacement means 109 may add new procedures (sub-procedures or reconstruction procedures) or change the execution order of the procedures in the service resumption procedure candidate, for example in order to satisfy the preconditions of the replacing reconstruction procedure.
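The identifier-based replaceability check, including the case where one component contains others, might look like the sketch below. The data layout (a list of `(sub_procedure_id, component_id)` pairs and a parent-to-children containment map) is a hypothetical choice for illustration; preconditions are not modeled here.

```python
def replaceable_sub_procedures(rebuild_component, sub_procedures, contains):
    """Return ids of sub-procedures that a reconstruction of one component can replace.

    rebuild_component: id of the component the reconstruction procedure targets
    sub_procedures:    list of (sub_procedure_id, component_id) pairs
    contains:          component id -> set of directly contained component ids
    """
    # Collect the component itself and everything it contains, transitively.
    covered = {rebuild_component}
    stack = [rebuild_component]
    while stack:
        for child in contains.get(stack.pop(), ()):
            if child not in covered:
                covered.add(child)
                stack.append(child)
    # A sub-procedure is replaceable when its component is covered.
    return [sid for sid, component in sub_procedures if component in covered]


contains = {"server1": {"os1", "app1"}, "app1": {"db1"}}
subs = [("s1", "os1"), ("s2", "db1"), ("s3", "net1"), ("s4", "server1")]
covered_by_server = replaceable_sub_procedures("server1", subs, contains)
```

Here reconstructing `server1` covers the sub-procedures for the OS, the application's database, and the server itself, but not the unrelated network component.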
If the required time of the service resumption procedure candidate still does not satisfy the time requirement even after every possible replacement with a reconstruction procedure has been made, the sub-procedure replacement means 109 outputs a notification to that effect.
When the replacement processing described above yields a service resumption procedure candidate whose required time satisfies the time requirement, the procedure output means 106 outputs that candidate as the service resumption procedure (or as a candidate for it). That is, the procedure output means 106 provides the user only with service resumption procedure candidates that satisfy the received time requirement. The procedure output means 106 may present the concrete processing content of such a candidate to the user by, for example, outputting it to a display in the form of an activity diagram. When no candidate satisfying the received time requirement has been generated, the procedure output means 106 may output a message to the effect that there is no applicable procedure, or may output the candidate with the shortest required time as reference information for the operator's judgment.
In this embodiment, the sub-procedure storage means 103 and the reconstruction procedure storage means 108 are realized by, for example, a storage device. The failure combination receiving means 101 and the time requirement receiving means 110 are realized by, for example, a CPU operating according to a program and an input device. The procedure output means 106 is realized by, for example, a CPU operating according to a program and an output device. The sub-procedure specifying means 102, the failure recovery procedure generation means 104, the required time estimation means 105, the reconstruction procedure specifying means 107, and the sub-procedure replacement means 109 are realized by, for example, a CPU operating according to a program.
Next, the operation of the service resumption procedure generation device 1 of this embodiment will be described. FIG. 5 is a flowchart showing an example of the operation of the service resumption procedure generation device 1 of this embodiment. As shown in FIG. 5, the failure combination receiving means 101 first receives the combination of failures that have occurred in the components of the information system (step S101). The combination of failures may be entered by a user or acquired directly from the information system.
Next, based on the combination of failures received in step S101, the sub-procedure specifying means 102 identifies the sub-procedures needed to bring the failed components to a recovered state (step S102).
Next, the failure recovery procedure generation means 104 connects the sub-procedures identified in step S102 to generate a failure recovery procedure, which becomes the first candidate for the service resumption procedure (step S103).
Next, based on the combination of failures received in step S101, the reconstruction procedure specifying means 107 identifies the reconstruction procedures needed to reconstruct the failed components (step S104). Components for which no reconstruction procedure has been prepared are skipped (treated as "no reconstruction procedure"). This step may be executed at any other point between step S101 and step S108.
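Step S104 can be illustrated with a small lookup that skips components lacking a reconstruction procedure. The store layout (component id mapped to a reconstruction procedure ID) and all identifiers below are hypothetical.

```python
def identify_rebuild_procedures(failures, rebuild_store):
    """Map each failed component to its reconstruction procedure ID, if one exists.

    failures:      iterable of (component_id, failure_type) pairs
    rebuild_store: component_id -> reconstruction procedure ID (assumed layout)
    """
    found = {}
    for component_id, _failure_type in failures:
        procedure_id = rebuild_store.get(component_id)
        if procedure_id is not None:
            found[component_id] = procedure_id
        # Components without an entry are skipped ("no reconstruction procedure").
    return found


failures = [("vm1", "os_hang"), ("db1", "corrupt_index"), ("switch1", "link_down")]
rebuild_store = {"vm1": "R001", "db1": "R002"}
identified = identify_rebuild_procedures(failures, rebuild_store)
```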
Next, the time requirement receiving means 110 receives the time requirement for the service resumption procedure (step S105). This step may be executed at any other point between step S101 and step S107.
Next, the required time estimation means 105 estimates the required time of the generated service resumption procedure candidate (step S106). In the first pass through this step, the required time estimation means 105 estimates the required time of the failure recovery procedure generated in step S103. In the second and subsequent passes, it estimates the required time of the new service resumption procedure candidate generated in step S108 by replacing some of the sub-procedures with reconstruction procedures.
Next, the required time estimation means 105 determines whether the required time estimated in step S106 satisfies the time requirement received in step S105 (step S107). The sub-procedure replacement means 109 may perform this step instead.
When the required time satisfies the time requirement (Yes in step S107), the procedure output means 106 outputs the finally obtained service resumption procedure candidate to a display or the like as the service resumption procedure (step S109).
When the required time does not satisfy the time requirement (No in step S107), the sub-procedure replacement means 109 updates the service resumption procedure candidate by replacing some of the sub-procedures it contains with one of the identified reconstruction procedures (step S108). The processing described above is then repeated for the new service resumption procedure candidate generated in step S108 (the flow returns to step S106).
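The loop formed by steps S102 through S109 can be sketched as below. It is a simplified illustration, not the claimed implementation: each procedure is reduced to a `(kind, component, minutes)` tuple, one sub-procedure is replaced per pass in list order, and returning `None` stands in for the "no applicable procedure" output. All names are assumptions.

```python
def generate_resumption_procedure(failed, sub_minutes, rebuild_minutes, time_limit):
    """Return (candidate, required_time); candidate is None if the limit cannot be met."""
    # S102-S103: connect the sub-procedures into the first candidate.
    candidate = [("sub", c, sub_minutes[c]) for c in failed]
    # S104: reconstruction procedures, skipping components that have none.
    rebuilds = {c: rebuild_minutes[c] for c in failed if c in rebuild_minutes}
    while True:
        required = sum(minutes for _kind, _c, minutes in candidate)  # S106
        if required <= time_limit:                                   # S107: Yes
            return candidate, required                               # S109
        # S108: replace one remaining sub-procedure that has a reconstruction procedure.
        index = next(
            (i for i, (kind, c, _m) in enumerate(candidate)
             if kind == "sub" and c in rebuilds),
            None,
        )
        if index is None:
            return None, required  # nothing left to replace
        component = candidate[index][1]
        candidate[index] = ("rebuild", component, rebuilds[component])


sub_minutes = {"web": 30, "db": 60}
rebuild_minutes = {"db": 25}
plan, required = generate_resumption_procedure(
    ["web", "db"], sub_minutes, rebuild_minutes, time_limit=60)
```

The loop always terminates because every pass through S108 turns one sub-procedure into a reconstruction procedure and there are finitely many of them.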
As described above, this embodiment first generates a failure recovery procedure corresponding to the combination of failures that have occurred, and then replaces some of the sub-procedures in that procedure with reconstruction procedures until the specified time requirement is satisfied. Therefore, according to this embodiment, a service resumption procedure that satisfies the time requirement can be provided to the user even when the normal failure recovery procedure cannot satisfy it.
Furthermore, according to this embodiment, the scope of reconstruction can be limited to the minimum needed to satisfy the specified time requirement, so the service can be resumed with the causes of failure removed as far as possible while still satisfying the specified time requirement.
Embodiment 2.
Next, a second embodiment of the present invention will be described. When an RTO is defined for a service provided by an information system, it is also common for a penalty to be defined that must be paid for the amount of time by which the RTO is exceeded when a failure occurs. Such costs are generally called downtime costs. At the same time, the budget that can be spent on recovering an information system from failure is generally limited. This embodiment generates a service resumption procedure that satisfies such cost requirements.
The service resumption procedure generation device of this embodiment differs from the service resumption procedure generation device 1 of the first embodiment in that it generates a service resumption procedure satisfying a specified cost requirement based on the downtime cost for the excess time of a service resumption procedure candidate. The following description focuses on the differences from the service resumption procedure generation device 1 of the first embodiment.
FIG. 6 is a block diagram showing a configuration example of the service resumption procedure generation device of this embodiment. The service resumption procedure generation device 2 shown in FIG. 6 includes required cost estimation means 111 and cost requirement receiving means 112 in addition to the configuration of the service resumption procedure generation device 1 of the first embodiment shown in FIG. 1.
The required cost estimation means 111 estimates the required cost, that is, the cost of executing a service resumption procedure candidate, based on the required time estimated by the required time estimation means 105 and the time requirement received by the time requirement receiving means 110. The required cost may be, for example, the downtime cost. The downtime cost may simply be taken as proportional to the excess time of the service resumption procedure candidate, or it may be calculated with a formula determined in advance, for example in a contract with the customers who use the service. The required cost estimation means 111 may also estimate, as the required cost, not only the downtime cost but a cost that also includes expenses needed to execute the service resumption procedure candidate, such as the labor and equipment costs of failure recovery.
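The simplest case described above, a downtime cost proportional to the excess time plus optional execution expenses, can be written as the following sketch. The parameter names, the per-minute penalty rate, and the single lump-sum `execution_cost` term are assumptions; a contract-defined formula would replace the proportional term.

```python
def required_cost(required_minutes, rto_minutes, penalty_per_minute, execution_cost=0.0):
    """Estimate the required cost of a service resumption procedure candidate.

    Downtime cost is assumed proportional to the time by which the RTO is exceeded;
    execution_cost stands in for labor, equipment, and similar expenses.
    """
    excess_minutes = max(0.0, required_minutes - rto_minutes)
    downtime_cost = excess_minutes * penalty_per_minute
    return downtime_cost + execution_cost


over_rto = required_cost(90, 60, 100)
within_rto = required_cost(50, 60, 100)
with_expenses = required_cost(90, 60, 100, execution_cost=500)
```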
When the estimated required cost exceeds the cost requirement received by the cost requirement receiving means 112, the required cost estimation means 111 causes the sub-procedure replacement means 109 to replace at least one of the sub-procedures included in the service resumption procedure candidate (more specifically, in its failure recovery procedure) with a reconstruction procedure. In this embodiment, the sub-procedure replacement means 109 keeps replacing sub-procedures with reconstruction procedures until the required cost of the updated service resumption procedure candidate satisfies the cost requirement.
The cost requirement receiving means 112 receives the cost requirement for the service resumption procedure.
Unlike the required time estimation means 105 of the first embodiment, the required time estimation means 105 of this embodiment only estimates the required time of a service resumption procedure candidate. That is, even when the estimated required time does not satisfy the time requirement received by the time requirement receiving means 110, the required time estimation means 105 does not issue instructions to replace sub-procedures with reconstruction procedures. As already described, such instructions are issued by the required cost estimation means 111.
When the required cost estimated by the required cost estimation means 111 shows that the service resumption procedure candidate does not satisfy the cost requirement, the sub-procedure replacement means 109 of this embodiment replaces at least some of the failure recovery sub-procedures included in the candidate with at least one of the reconstruction procedures identified by the reconstruction procedure specifying means 107. If the required cost of the candidate still does not satisfy the cost requirement even after every possible replacement with a reconstruction procedure has been made, the sub-procedure replacement means 109 outputs a notification to that effect.
In this embodiment, the required cost estimation means 111 and the cost requirement receiving means 112 are realized by, for example, a CPU operating according to a program.
Next, the operation of the service resumption procedure generation device 2 of this embodiment will be described. FIG. 7 is a flowchart showing an example of the operation of the service resumption procedure generation device 2 of this embodiment. As shown in FIG. 7, steps S101 to S105 are first performed in the same way as in the first embodiment.
In this embodiment, the cost requirement receiving means 112 then receives the cost requirement for the service resumption procedure (step S1051). This step may be executed at any other point between step S101 and step S1062.
Next, the required time estimation means 105 estimates the required time of the generated service resumption procedure candidate (step S106). This step is the same as in the first embodiment.
Next, the required cost estimation means 111 estimates the required cost of the generated service resumption procedure candidate based on the required time estimated in step S106 and the time requirement received in step S105 (step S1061). In the first pass through this step, the required cost estimation means 111 estimates the required cost of the failure recovery procedure generated in step S103. In the second and subsequent passes, it estimates the required cost of the new service resumption procedure candidate generated in step S108 by replacing some of the sub-procedures with reconstruction procedures.
Next, the required cost estimation means 111 determines whether the required cost estimated in step S1061 satisfies the cost requirement received in step S1051 (step S1062). As in the first embodiment, the sub-procedure replacement means 109 may perform this step instead.
When the required cost satisfies the cost requirement (Yes in step S1062), the procedure output means 106 outputs the finally obtained service resumption procedure candidate to a display or the like as the service resumption procedure (step S109).
When the required cost does not satisfy the cost requirement (No in step S1062), the sub-procedure replacement means 109 updates the service resumption procedure candidate by replacing some of the sub-procedures it contains with reconstruction procedures (step S108). The processing described above is then repeated for the new service resumption procedure candidate generated in step S108 (the flow returns to step S106).
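The cost-driven loop of steps S106, S1061, S1062, and S108 can be sketched as follows. This is an illustration under assumptions: the required cost is taken to be the downtime cost alone, proportional to the excess over the RTO; one sub-procedure is replaced per pass in dictionary order; and the function and parameter names are hypothetical.

```python
def generate_within_budget(sub_minutes, rebuild_minutes, rto, penalty_per_minute, budget):
    """Return (plan, minutes, cost); plan is None if the budget cannot be met.

    plan maps each failed component to "sub" or "rebuild".
    """
    plan = {component: "sub" for component in sub_minutes}
    while True:
        # S106: required time of the current candidate.
        minutes = sum(
            sub_minutes[c] if kind == "sub" else rebuild_minutes[c]
            for c, kind in plan.items()
        )
        # S1061: required cost, here the downtime cost only (assumption).
        cost = max(0, minutes - rto) * penalty_per_minute
        if cost <= budget:                      # S1062: Yes
            return plan, minutes, cost          # S109
        pending = [c for c, kind in plan.items()
                   if kind == "sub" and c in rebuild_minutes]
        if not pending:
            return None, minutes, cost          # every possible replacement was made
        plan[pending[0]] = "rebuild"            # S108


sub_minutes = {"web": 30, "db": 60}
rebuild_minutes = {"web": 20, "db": 25}
outcome = generate_within_budget(sub_minutes, rebuild_minutes,
                                 rto=60, penalty_per_minute=100, budget=1500)
```

Note that the time requirement (the RTO) only enters through the cost calculation; the stopping condition is the budget, which is what distinguishes this flow from the first embodiment.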
As described above, this embodiment first generates a failure recovery procedure corresponding to the combination of failures that have occurred, and then replaces some of the sub-procedures in that procedure with reconstruction procedures until the specified cost requirement is satisfied. Therefore, according to this embodiment, a service resumption procedure that satisfies the cost requirement can be generated even when the normal failure recovery procedure cannot satisfy the time requirement, so an executable service resumption procedure can be provided to the user even when the budget available for recovering the information system from failure is limited.
Furthermore, according to this embodiment, the scope of reconstruction can be limited to the minimum needed to satisfy the specified cost requirement, so the service can be resumed with the causes of failure removed as far as possible while still satisfying the specified cost requirement.
Next, an outline of the present invention will be described. FIG. 8 is a block diagram outlining the service resumption procedure generation device of the present invention. The service resumption procedure generation device 500 shown in FIG. 8 includes failure combination receiving means 501, sub-procedure storage means 502, reconstruction procedure storage means 503, sub-procedure specifying means 504, failure recovery procedure generation means 505, reconstruction procedure specifying means 506, sub-procedure replacement means 507, and procedure output means 508.
The failure combination receiving means 501 (for example, the failure combination receiving means 101) receives the combination of failures currently occurring in the components of the information system.
The sub-procedure storage means 502 (for example, the sub-procedure storage means 103) stores information on sub-procedures, which are procedures for recovering from failures occurring in the components of the information system, in association with component identifiers.
The reconstruction procedure storage means 503 (for example, the reconstruction procedure storage means 108) stores information on reconstruction procedures, which are procedures for reconstructing the components of the information system, in association with component identifiers.
The sub-procedure specifying means 504 (for example, the sub-procedure specifying means 102) identifies, based on the combination of failures received by the failure combination receiving means 501, the sub-procedures needed to recover the failed components from their failure states.
The failure recovery procedure generation means 505 (for example, the failure recovery procedure generation means 104) generates a failure recovery procedure by connecting the identified sub-procedures, based on the information on the sub-procedures identified by the sub-procedure specifying means 504.
The reconstruction procedure specifying means 506 (for example, the reconstruction procedure specifying means 107) identifies, based on the combination of failures received by the failure combination receiving means 501, the reconstruction procedures needed to reconstruct the failed components.
When the generated failure recovery procedure does not satisfy a predetermined requirement, the sub-procedure replacement means 507 (for example, the sub-procedure replacement means 109) replaces at least some of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the identified reconstruction procedures.
The procedure output means 508 (for example, the procedure output means 106) outputs, as the service resumption procedure, the failure recovery procedure in which at least some of the sub-procedures have been replaced with reconstruction procedures by the sub-procedure replacement means 507.
With this configuration, an optimal service resumption procedure can be generated automatically according to the combination of failures that have occurred, even when the normal failure recovery procedure cannot satisfy the predetermined requirement.
The service resumption procedure generation device of the present invention may also include time requirement receiving means (for example, the time requirement receiving means 110) that receives a time requirement imposed on the service resumption procedure, and required time estimation means (for example, the required time estimation means 105) that estimates the required time, that is, the time needed to carry out a specified procedure. In that case, when the required time of the failure recovery procedure estimated by the required time estimation means does not satisfy the time requirement, the sub-procedure replacement means 507 may replace at least some of the sub-procedures included in the failure recovery procedure with reconstruction procedures so that the time requirement is satisfied.
The service resumption procedure generation device of the present invention may further include cost requirement receiving means (for example, the cost requirement receiving means 112) that receives a cost requirement imposed on the service resumption procedure, and required cost estimation means (for example, the required cost estimation means 111) that estimates the required cost, that is, the cost of carrying out a specified procedure, including the downtime cost for exceeding the failure recovery time. In that case, when the required cost of the failure recovery procedure estimated by the required cost estimation means does not satisfy the cost requirement, the sub-procedure replacement means 507 may replace at least some of the sub-procedures included in the failure recovery procedure with reconstruction procedures so that the cost requirement is satisfied.
The sub-procedure replacement means 507 may also treat the generated failure recovery procedure as the first candidate for the service resumption procedure and repeat the replacement processing, in which at least some of the sub-procedures included in the candidate are replaced with reconstruction procedures, until the candidate satisfies the predetermined requirement. The procedure output means 508 may then output, as the service resumption procedure, the candidate that the sub-procedure replacement means 507 has found to satisfy the predetermined requirement.
 また、サブ手順置換手段507は、置き換え前後の実行時間の差が大きいものから順に置き換えてもよい。 Further, the sub-procedure replacement means 507 may perform replacement in descending order of execution time difference before and after replacement.
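The repeated replacement in descending order of time saving can be sketched as a simple greedy loop. The sketch assumes one sub-procedure and one reconstruction procedure per component, sequential execution, and a time limit as the predetermined requirement. The `Option` record and `resumption_plan` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """Hypothetical per-component timing data."""
    component: str
    repair_minutes: float   # assumed time of the stored sub-procedure
    rebuild_minutes: float  # assumed time of the stored reconstruction procedure


def resumption_plan(options: list[Option], time_limit: float) -> dict[str, str]:
    # First candidate: the failure recovery procedure, i.e. repair everything.
    plan = {o.component: "repair" for o in options}
    total = sum(o.repair_minutes for o in options)

    # Consider replacements with the largest time saving first.
    by_saving = sorted(options,
                       key=lambda o: o.repair_minutes - o.rebuild_minutes,
                       reverse=True)
    for o in by_saving:
        if total <= time_limit:
            break  # the current candidate already satisfies the requirement
        saving = o.repair_minutes - o.rebuild_minutes
        if saving <= 0:
            break  # remaining replacements would not shorten the procedure
        plan[o.component] = "rebuild"
        total -= saving
    return plan


options = [
    Option("db", repair_minutes=90, rebuild_minutes=30),
    Option("web", repair_minutes=20, rebuild_minutes=15),
    Option("cache", repair_minutes=5, rebuild_minutes=10),
]
plan = resumption_plan(options, time_limit=60)
```

Here only the database is rebuilt: that single replacement lowers the total from 115 to 55 minutes, which meets the 60-minute limit. If no combination of replacements can meet the limit, this sketch returns the shortest candidate it found rather than signalling failure, which is a design choice of the example and not something the source specifies.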
The reconstruction procedure may also be provided as a script or a program.
Although the above embodiments use the required time or the required cost as the evaluation index of the service resumption procedure, other evaluation indices related to system requirements, such as the success rate of carrying out the service resumption procedure, may be used.
In each of the above embodiments, each function of the service resumption procedure generating device is realized by software, more specifically by a CPU executing a program, but the functions may instead be realized by hardware such as circuits.
In each of the above embodiments the program is stored in a storage device, but it may instead be stored on a computer-readable recording medium. The recording medium is, for example, a portable medium such as a flexible disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
Although the present invention has been described with reference to the embodiments and examples, the present invention is not limited to them. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.
This application claims priority based on Japanese Patent Application No. 2013-234751 filed on November 13, 2013, the entire disclosure of which is incorporated herein.
The present invention is applicable not only to resuming a service but also to devices, systems, methods, and programs used, for example, to restore an information system in which a failure has occurred to a failure-free state.
1, 2, 500 Service resumption procedure generating device
101, 501 Failure combination accepting means
102, 504 Sub-procedure specifying means
103, 502 Sub-procedure storage means
104, 505 Failure recovery procedure generating means
105 Required time estimation means
106, 508 Procedure output means
107, 506 Reconstruction procedure specifying means
108, 503 Reconstruction procedure storage means
109, 507 Sub-procedure replacement means
110 Time requirement accepting means
111 Required cost estimation means
112 Cost requirement accepting means

Claims (8)

  1.  A service resumption procedure generating device comprising:
     failure combination accepting means for accepting a combination of failures occurring in components of an information system;
     sub-procedure storage means for storing information on sub-procedures, each being a procedure for recovering from a failure occurring in a component, in association with an identifier of the component;
     reconstruction procedure storage means for storing information on reconstruction procedures, each being a procedure for reconstructing a component, in association with an identifier of the component;
     sub-procedure specifying means for specifying, based on the combination of failures accepted by the failure combination accepting means, the sub-procedures needed to recover the failed components from the failure state;
     failure recovery procedure generating means for generating a failure recovery procedure by connecting the specified sub-procedures, based on the information on the sub-procedures specified by the sub-procedure specifying means;
     reconstruction procedure specifying means for specifying, based on the combination of failures accepted by the failure combination accepting means, the reconstruction procedures needed to reconstruct the failed components;
     sub-procedure replacement means for replacing, when the generated failure recovery procedure does not satisfy a predetermined requirement, at least some of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the specified reconstruction procedures; and
     procedure output means for outputting, as a service resumption procedure, the failure recovery procedure in which at least some of the sub-procedures have been replaced with reconstruction procedures by the sub-procedure replacement means.
  2.  The service resumption procedure generating device according to claim 1, further comprising:
     time requirement accepting means for accepting a time requirement imposed on the service resumption procedure; and
     required time estimation means for estimating a required time, which is the time needed to carry out a specified procedure,
     wherein, when the required time of the failure recovery procedure estimated by the required time estimation means does not satisfy the time requirement, the sub-procedure replacement means replaces at least some of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures so that the time requirement is satisfied.
  3.  The service resumption procedure generating device according to claim 2, further comprising:
     cost requirement accepting means for accepting a cost requirement imposed on the service resumption procedure; and
     required cost estimation means for estimating a required cost, which is the cost of carrying out a specified procedure and includes a downtime cost imposed for exceeding the failure recovery time,
     wherein, when the required cost of the failure recovery procedure estimated by the required time estimation means does not satisfy the cost requirement, the sub-procedure replacement means replaces at least some of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures so that the cost requirement is satisfied.
  4.  The service resumption procedure generating device according to any one of claims 1 to 3,
     wherein the sub-procedure replacement means takes the generated failure recovery procedure as a first candidate for the service resumption procedure and repeats a replacement process of replacing at least some of the sub-procedures included in the candidate with reconstruction procedures until the generated candidate for the service resumption procedure satisfies a predetermined requirement, and
     the procedure output means outputs, as the service resumption procedure, the candidate determined by the sub-procedure replacement means to satisfy the predetermined requirement.
  5.  The service resumption procedure generating device according to any one of claims 1 to 4,
     wherein the sub-procedure replacement means performs replacements in descending order of the difference in execution time before and after replacement.
  6.  The service resumption procedure generating device according to any one of claims 1 to 5,
     wherein the reconstruction procedure is provided as a script or a program.
  7.  A service resumption procedure generating method comprising:
     storing, in predetermined sub-procedure storage means, information on sub-procedures, each being a procedure for recovering from a failure occurring in a component of an information system, in association with an identifier of the component;
     storing, in predetermined reconstruction procedure storage means, information on reconstruction procedures, each being a procedure for reconstructing a component, in association with an identifier of the component;
     accepting, by an information processing device, a combination of failures occurring in the components of the information system;
     specifying, by the information processing device, based on the accepted combination of failures, the sub-procedures needed to recover the failed components from the failure state;
     generating, by the information processing device, a failure recovery procedure by connecting the specified sub-procedures, based on the information on the specified sub-procedures;
     specifying, by the information processing device, based on the accepted combination of failures, the reconstruction procedures needed to reconstruct the failed components;
     replacing, by the information processing device, when the generated failure recovery procedure does not satisfy a predetermined requirement, at least some of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the specified reconstruction procedures; and
     outputting, by the information processing device, as a service resumption procedure, the failure recovery procedure in which at least some of the sub-procedures have been replaced with reconstruction procedures.
  8.  A service resumption procedure generating program for causing a computer, which includes sub-procedure storage means for storing information on sub-procedures, each being a procedure for recovering from a failure occurring in a component of an information system, in association with an identifier of the component, and reconstruction procedure storage means for storing information on reconstruction procedures, each being a procedure for reconstructing a component, in association with an identifier of the component, to execute:
     a failure combination accepting process of accepting a combination of failures occurring in the components of the information system;
     a sub-procedure specifying process of specifying, based on the accepted combination of failures, the sub-procedures needed to recover the failed components from the failure state;
     a reconstruction procedure specifying process of specifying, based on the accepted combination of failures, the reconstruction procedures needed to reconstruct the failed components;
     a sub-procedure replacement process of replacing, when the generated failure recovery procedure does not satisfy a predetermined requirement, at least some of the sub-procedures included in the generated failure recovery procedure with reconstruction procedures, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the specified reconstruction procedures; and
     a procedure output process of outputting, as a service resumption procedure, the failure recovery procedure in which at least some of the sub-procedures have been replaced with reconstruction procedures.
PCT/JP2014/005217 2013-11-13 2014-10-15 Service resumption sequence generating device, service resumption sequence generating method, and service resumption sequence generating program WO2015072078A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2015547615A JPWO2015072078A1 (en) 2013-11-13 2014-10-15 Service resumption procedure generation device, service resumption procedure generation method, and service resumption procedure generation program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2013234751 2013-11-13
JP2013-234751 2013-11-13

Publications (1)

Publication Number Publication Date
WO2015072078A1 (en) 2015-05-21

Family

ID=53057031

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/005217 WO2015072078A1 (en) 2013-11-13 2014-10-15 Service resumption sequence generating device, service resumption sequence generating method, and service resumption sequence generating program

Country Status (2)

Country Link
JP (1) JPWO2015072078A1 (en)
WO (1) WO2015072078A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11010238B2 (en) 2017-08-01 2021-05-18 Hitachi, Ltd. Management system of storage system
JP7320415B2 (en) 2019-09-13 2023-08-03 東芝テック株式会社 Processing device and start-up method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6567937B1 (en) * 1999-11-17 2003-05-20 Isengard Corporation Technique for remote state notification and software fault recovery
JP2008210148A (en) * 2007-02-26 2008-09-11 Hitachi Information Systems Ltd Failure handling system and failure handling method
JP2011076161A (en) * 2009-09-29 2011-04-14 Nomura Research Institute Ltd Incident management system
JP2011159218A (en) * 2010-02-03 2011-08-18 Mitsubishi Heavy Ind Ltd Trouble handling support system, trouble handling support method and program

Also Published As

Publication number Publication date
JPWO2015072078A1 (en) 2017-03-16

Similar Documents

Publication Publication Date Title
US8140907B2 (en) Accelerated virtual environments deployment troubleshooting based on two level file system signature
US9712418B2 (en) Automated network control
JP7110415B2 (en) Fault injection method, device, electronic equipment, storage medium, and program
WO2014033945A1 (en) Management system which manages computer system having plurality of devices to be monitored
US9442791B2 (en) Building an intelligent, scalable system dump facility
JP6249016B2 (en) Fault recovery procedure generation device, fault recovery procedure generation method, and fault recovery procedure generation program
Grottke et al. Recovery from software failures caused by mandelbugs
JP6604218B2 (en) Test apparatus, network system, and test method
JP6009089B2 (en) Management system for managing computer system and management method thereof
KR20180072860A (en) Self-diagnosis and automatic diagnostic data collection of device driver detection errors
CN107943617B (en) Data restoration method and device and server cluster
CN110865907B (en) Method and system for providing service redundancy between master server and slave server
WO2015072078A1 (en) Service resumption sequence generating device, service resumption sequence generating method, and service resumption sequence generating program
US8689048B1 (en) Non-logging resumable distributed cluster
JPWO2014061199A1 (en) System design method, system design apparatus, and system design program
WO2012131868A1 (en) Management method and management device for computer system
Mouallem et al. A fault-tolerance architecture for kepler-based distributed scientific workflows
JP2007226287A (en) System environment reproducing method and system environment correcting method
US20230088318A1 (en) Remotely healing crashed processes
Kumar et al. A stochastic process of software fault detection and correction for business operations
Rezaei et al. Sustained resilience via live process cloning
JP6528769B2 (en) INFORMATION PROCESSING APPARATUS, PROCESSING METHOD, AND PROGRAM
Zhou et al. Delta execution for software reliability
Rahme et al. Preventive maintenance for cloud-based software systems subject to non-constant failure rates
WO2022239060A1 (en) System verification device, system verification method, and system verification program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14861939

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2015547615

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14861939

Country of ref document: EP

Kind code of ref document: A1