WO2015072078A1

WO2015072078A1 - Service resumption sequence generating device, service resumption sequence generating method, and service resumption sequence generating program

Info

Publication number: WO2015072078A1
Application number: PCT/JP2014/005217
Authority: WO
Inventors: 紅美子但野
Original assignee: 日本電気株式会社
Priority date: 2013-11-13
Filing date: 2014-10-15
Publication date: 2015-05-21
Also published as: JPWO2015072078A1

Abstract

Provided is a service resumption sequence generating device comprising: a fault combination acceptance means for accepting a combination of faults occurring in components of an information system; a sub-sequence specification means for specifying sub-sequences required for the components in which the faults are occurring to recover from the fault state; a fault recovery sequence generating means for connecting the specified sub-sequences and generating a fault recovery sequence; a rebuild sequence specification means for specifying a rebuild sequence necessary for rebuilding the components in which the faults are occurring; a sub-sequence substitution means for substituting at least a portion of the sub-sequences included in the fault recovery sequence with the rebuild sequence if the fault recovery sequence does not satisfy a prescribed requirement; and a sequence output means for outputting the post-substitution fault recovery sequence as a service resumption sequence.

Description

Service resumption procedure generation device, service resumption procedure generation method, and service resumption procedure generation program

The present invention relates to a service resumption procedure generation device, a service resumption procedure generation method, and a service resumption procedure generation program for generating a procedure for resuming a service of an information system stopped due to the occurrence of a failure.

In the event of a large-scale disaster, many components in the information system may fail simultaneously. To resume the information system's service from such a situation, an operation procedure (a so-called failure recovery procedure) is required that handles the simultaneous failures and restores the entire information system from the failure state to a serviceable state. To recover the entire information system, many failure recovery procedures first check the system status and identify the cause of the failure, and then correct the problem.

The information system failure recovery procedure includes sub-procedures (for example, command input, graphical user interface operation, etc.) for recovering a failure occurring in a component. Since the sub-procedures required for each failure occurring in the component are different, the failure recovery procedure differs depending on the combination of the failures that have occurred. Since the number of combinations of failures that can occur simultaneously in a large number of components is enormous, it is impractical for a user to manually generate a failure recovery procedure for all combinations. Therefore, it is reasonable to automatically generate a failure recovery procedure.

One of the general customer requirements defined for information system failure recovery is an index called RTO (Recovery time objective, target recovery time) that represents the time required for recovery. If the information system failure recovery procedure fails to meet the RTO, the information system provider may have to pay a penalty cost to the customer. In such a case, the administrator of the information system needs to generate a failure recovery procedure so as to satisfy the RTO.

However, in an information system that guarantees a certain RTO under an SLA (Service Level Agreement), following the normal failure recovery procedure, which corrects the problem only after identifying the cause of the failure, may not resume the service within the RTO. This is because identifying the cause can take a long time when the cause is complicated, and completing the correction can take a long time when there are many problem locations.

Software rejuvenation for software aging is known as a failure handling method that requires neither identification of the cause of the failure nor correction of the problem location. Software aging is a general term for deterioration phenomena (memory leaks, file fragmentation, etc.) that arise in the operating environment through continuous operation over a long time. As the operating environment deteriorates, the information system may fail. Software rejuvenation is a technique for preventing failures caused by software aging by initializing at least a part of the internal state of the information system.

Patent Document 1 describes a method of simultaneously rejuvenating a host machine and a virtual machine that need to be rejuvenated while continuously operating a host machine and a virtual machine that do not need to be rejuvenated.

Further, Non-Patent Document 1 describes a software life extension method as another example of a failure handling method that requires neither identification of the cause of the failure nor correction of the problem location. In the software life extension method described in Non-Patent Document 1, when software aging occurs in a virtual machine operating in a virtual environment, additional resources are allocated to the virtual machine to lengthen the operating time of the software.

Patent Document 1: International Publication No. 2010/122710 Pamphlet

However, the methods described in Patent Document 1 and Non-Patent Document 1 have a problem that they cannot cope with failures other than information system failures caused by software aging.

One failure handling method that can deal with failures other than those caused by software aging (for example, file corruption, setting mistakes, and rewriting due to unauthorized access) and that requires neither identification of the cause of the failure nor correction of the problem location is to rebuild at least part of the information system.

Since the rebuilding procedure does not depend on the current system state, it is easy to automate using a system configuration management tool such as Chef. On the other hand, rebuilding does not remove the cause of the failure, so a failure due to the same cause may occur again. In addition, rebuilding may destroy information that would later be needed to identify the cause of the failure or to correct the problem location. For this reason, even when rebuilding is needed because of time or cost requirements, the range to be rebuilt should be kept to a minimum.

The present invention has been made in view of the above points. An object of the present invention is to provide a service resumption procedure generation device, a service resumption procedure generation method, and a service resumption procedure generation program that automatically generate an optimal service resumption procedure according to the combination of failures that have occurred, even when a normal failure recovery procedure cannot satisfy a predetermined requirement.

The service resumption procedure generation device according to the present invention includes: failure combination accepting means for accepting a combination of failures occurring in components included in an information system; sub-procedure storage means for storing information on sub-procedures, which are procedures for recovering from failures occurring in components, in association with component identifiers; reconstruction procedure storage means for storing information on reconstruction procedures, which are procedures for reconstructing components, in association with component identifiers; sub-procedure specifying means for specifying, based on the combination of failures accepted by the failure combination accepting means, the sub-procedures necessary for recovering the components in which failures have occurred from the failure state; failure recovery procedure generation means for generating a failure recovery procedure by connecting the specified sub-procedures based on the information on the sub-procedures specified by the sub-procedure specifying means; reconstruction procedure specifying means for specifying, based on the combination of failures accepted by the failure combination accepting means, the reconstruction procedures necessary for reconstructing the components in which failures have occurred; sub-procedure replacement means for replacing, when the generated failure recovery procedure does not satisfy a prescribed requirement, at least a part of the sub-procedures included in the generated failure recovery procedure with a reconstruction procedure, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the specified reconstruction procedures; and procedure output means for outputting, as a service resumption procedure, the failure recovery procedure in which at least a part of the sub-procedures has been replaced with the reconstruction procedure by the sub-procedure replacement means.

The service resumption procedure generation method according to the present invention includes: storing, in a predetermined sub-procedure storage unit, information on sub-procedures, which are procedures for recovering from failures occurring in components included in an information system, in association with component identifiers; storing information on reconstruction procedures, which are procedures for reconstructing components, in association with component identifiers; accepting, by an information processing device, a combination of failures occurring in components of the information system; specifying, by the information processing device, based on the accepted combination of failures, the sub-procedures necessary for recovering the components in which failures have occurred from the failure state; generating, by the information processing device, a failure recovery procedure by connecting the specified sub-procedures based on the information on those sub-procedures; specifying, by the information processing device, based on the accepted combination of failures, the reconstruction procedures necessary for reconstructing the components in which failures have occurred; replacing, by the information processing device, when the generated failure recovery procedure does not satisfy a prescribed requirement, at least a part of the sub-procedures included in the generated failure recovery procedure with a reconstruction procedure, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the specified reconstruction procedures; and outputting, by the information processing device, the failure recovery procedure in which at least a part of the sub-procedures has been replaced with the reconstruction procedure as a service resumption procedure.

Further, the service resumption procedure generation program according to the present invention causes a computer, which has sub-procedure storage means for storing information on sub-procedures, which are procedures for recovering from failures occurring in components included in an information system, in association with component identifiers, and reconstruction procedure storage means for storing information on reconstruction procedures, which are procedures for reconstructing components, in association with component identifiers, to execute: a failure combination accepting process of accepting a combination of failures occurring in components of the information system; a sub-procedure specifying process of specifying, based on the accepted combination of failures, the sub-procedures necessary for recovering the components in which failures have occurred from the failure state; a failure recovery procedure generation process of generating a failure recovery procedure by connecting the specified sub-procedures based on the information on those sub-procedures; a reconstruction procedure specifying process of specifying, based on the accepted combination of failures, the reconstruction procedures necessary for reconstructing the components in which failures have occurred; a sub-procedure replacement process of replacing, when the generated failure recovery procedure does not satisfy a prescribed requirement, at least a part of the sub-procedures included in the generated failure recovery procedure with a reconstruction procedure, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the specified reconstruction procedures; and a procedure output process of outputting, as a service resumption procedure, the failure recovery procedure in which at least a part of the sub-procedures has been replaced with the reconstruction procedure.

According to the present invention, even when a normal failure recovery procedure cannot satisfy a predetermined requirement, an optimal service restart procedure can be automatically generated according to a combination of failures that have occurred.

The drawings are, in order: a block diagram showing a configuration example of the service resumption procedure generation device of the first embodiment; an activity diagram showing an example of a sub-procedure; an explanatory diagram showing an example of the information stored in the sub-procedure storage means 103; an explanatory diagram showing an example of the information stored in the reconstruction procedure storage means; a flowchart showing an example of the operation of the service resumption procedure generation device of the first embodiment; a block diagram showing a configuration example of the service resumption procedure generation device of the second embodiment; a flowchart showing an example of the operation of the service resumption procedure generation device of the second embodiment; and a block diagram showing an outline of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. First, the failure recovery procedure in the present invention will be described. The failure recovery procedure is a procedure for causing the information system to restart the service by recovering the component group in which the failure has occurred in the information system from the failure state. The failure recovery procedure includes sub-procedures that are a group of procedures for recovering from a failure occurring in a component included in the information system.

Each sub-procedure is not particularly limited as long as it is a procedure for recovering a failure occurring in a component. For example, each sub-procedure may include system management operations such as restart, data recovery, and setting change. Each sub-procedure may be described in advance in a document or a manual. Each sub-procedure may be provided as an automated script or program using an existing system configuration management tool such as JP1.

When a failure occurs in components of an information system due to a disaster or the like and the provision of services stops, the system operator (hereinafter simply referred to as the operator) is responsible for recovering the failed components from the failure state according to the failure recovery procedure and restoring the information system to a serviceable state. The sub-procedures required to restore the information system differ depending on the combination of components in which failures have occurred and the combination of failures that have occurred. Therefore, the operator first specifies the sub-procedures necessary for restoring the information system and then executes them. Component failures include not only a component being down but also a component being unusable in the normal way, for example, “some of the required commands cannot be executed” or “some of the data necessary for the system has been lost”. Therefore, in the present invention, the sub-procedures included in the failure recovery procedure are specified according to the combination of components in which such failures have occurred and the combination of the failures that have occurred.

Next, the “reconstruction procedure” will be explained. The reconstruction procedure is a procedure for reconstructing a component group in which a failure has occurred. The reconstruction procedure is not particularly limited as long as it is a procedure for reconstructing a component. Note that, because of restrictions on the implementation of the information system and the cost of preparing reconstruction procedures, a reconstruction procedure is not necessarily prepared for every component. Each reconstruction procedure may be described in advance in a document or a manual. Each reconstruction procedure may be provided as an automated script or program using an existing system configuration management tool such as Chef.

The present invention generates a service resumption procedure, which is a procedure for causing the information system to resume its service, by appropriately combining the above-described failure recovery procedure (more specifically, the sub-procedures included in the failure recovery procedure) with reconstruction procedures. Hereinafter, the term “service resumption procedure” refers to a procedure for restoring the information system from the failure state to a serviceable state, regardless of whether it uses sub-procedures for failure recovery or reconstruction procedures.

The components of the information system include everything in the information system that can be a target of failure recovery or reconstruction. Examples include an application, a task, a thread, a VM (Virtual Machine), a central processing unit (CPU), a peripheral device, a storage, a server device, and a personal computer. A component may be software or hardware, and a component may itself include a plurality of components. Hereinafter, the term “component” may mean a component group including a plurality of components. Similarly, “sub-procedure” may mean a sub-procedure group including a plurality of sub-procedures, and “reconstruction procedure” may mean a reconstruction procedure group including a plurality of reconstruction procedures. More specifically, a sub-procedure to which a single ID is allocated may include a plurality of sub-procedures, and a reconstruction procedure to which a single ID is allocated may include a plurality of reconstruction procedures. This is because, depending on the combination of simultaneous failures, it may be better to handle a plurality of failures and components together.

Embodiment 1.
FIG. 1 is a block diagram illustrating a configuration example of a service resumption procedure generation device according to the first embodiment of this invention. As shown in FIG. 1, the service resumption procedure generation device 1 according to the present embodiment includes a failure combination receiving unit 101, a sub-procedure specifying unit 102, a sub-procedure storage unit 103, a failure recovery procedure generation unit 104, a required time estimation unit 105, a procedure output unit 106, a reconstruction procedure specifying unit 107, a reconstruction procedure storage unit 108, a sub-procedure replacement unit 109, and a time requirement receiving unit 110.

The service restart procedure generating device 1 is realized by a general information processing device (computer). The service restart procedure generation device 1 is, for example, a server device or a personal computer. In addition, the service restart procedure generation device 1 includes a CPU, a storage device, an input device, and an output device (not shown). The storage device is, for example, a memory and a hard disk drive (HDD; Hard Disk Drive). The input device is, for example, a keyboard, a mouse, various network interfaces, or the like. The output device is, for example, a display or various network interfaces. The service resuming procedure generation device 1 is configured to realize each unit illustrated in FIG. 1 by a CPU executing a program stored in a storage device.

The failure combination receiving unit 101 receives a combination of failures occurring in the information system components. The information indicating the combination of faults may be a set of identifiers of components in which faults have occurred. For example, it may be specified using a component name such as {“application A” and “database B”} or a number assigned to the component in advance such as {1, 2, 3}. The information indicating the combination of failures further includes the type of failure that has occurred in each component. That is, the information indicating the combination of failures may be a set of failure information including the identifier of the component in which the failure has occurred and the failure type.
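As an illustration only, a combination of failures can be held as a set of failure information items, each pairing a component identifier with a failure type. The following Python sketch assumes hypothetical names (FailureInfo, accept_failure_combination) and hypothetical failure-type labels, none of which appear in this publication.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class FailureInfo:
    component_id: str  # identifier of the component in which the failure occurred
    failure_type: str  # failure type; the labels used below are hypothetical


def accept_failure_combination(pairs):
    """Turn (component_id, failure_type) pairs into a duplicate-free set."""
    return frozenset(FailureInfo(c, t) for c, t in pairs)


combination = accept_failure_combination([
    ("application A", "process down"),
    ("database B", "data lost"),
    ("application A", "process down"),  # reported twice, kept once
])
failed_components = sorted({f.component_id for f in combination})
print(failed_components)
```

Using a set reflects that the input is a combination: the order in which failures are reported does not matter, and a failure reported twice counts once.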

The sub-procedure storage means 103 stores information on various sub-procedures for recovering from failures of components included in the information system. For example, the sub-procedure storage means 103 stores, in association with a sub-procedure ID identifying each sub-procedure, the corresponding failure type and the sub-procedure itself (information indicating the specific processing contents of the sub-procedure, or a script or program for actually executing those processing contents). FIG. 2 is an activity diagram showing an example of a sub-procedure. In the present invention, the concrete form of a sub-procedure is not particularly limited. That is, the sub-procedure provided to the user may be information indicating the specific processing contents of the sub-procedure, as in the activity diagram of FIG. 2, a script or program that executes those contents, or a combination of these. In general, a sub-procedure, being a procedure for recovering from a failure, includes the work of identifying the cause of the failure, so it is difficult to automate all of it. In such a case, after the user has identified the cause of the failure, the recovery operation corresponding to that cause (for example, restart, data recovery, or setting change) can be executed automatically by a script or the like.

FIG. 3 is an explanatory diagram showing a part of an example of the information stored in the sub-procedure storage means 103. As shown in FIG. 3, the sub-procedure storage means 103 of the present embodiment stores, in association with the sub-procedure ID, the identifier of the component to be recovered by the sub-procedure identified by that sub-procedure ID. That is, the sub-procedure storage means 103 of the present embodiment stores, for each sub-procedure, information including at least the sub-procedure itself and the identifier of the component that the sub-procedure recovers. A sub-procedure is defined, for example, for each failure of each component included in the information system. Note that sub-procedures need not be defined per failure; for example, one sub-procedure may be defined for a specific combination of failures.

Further, the sub-procedure storage means 103 may store, in association with the sub-procedure ID, additional information such as a precondition indicating a condition necessary for executing the sub-procedure, the name of the sub-procedure, the state of the component realized by executing the sub-procedure (the component's transition destination state), and the time required for each operation (setting change, reboot, shutdown, etc.) in the sub-procedure. The preconditions are, for example, information on sub-procedures that need to be executed in advance, information on the state required beforehand, and sub-procedures that cannot be executed simultaneously. The name of the sub-procedure is used to help the user understand it. The transition destination state of the component is used, for example, to determine whether the preconditions of subsequent sub-procedures are satisfied when a failure recovery procedure is generated by connecting a plurality of sub-procedures. The required time for each operation is used to estimate the execution time of the sub-procedure. Hereinafter, the information stored in the sub-procedure storage means 103 may be referred to as sub-procedure information.
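The sub-procedure information described above can be pictured as one record per sub-procedure ID. The following Python sketch is only one possible layout; the class name, field names, IDs, and all times (in minutes) are assumptions made for illustration and are not taken from FIG. 3.

```python
from dataclasses import dataclass, field


@dataclass
class SubProcedureInfo:
    sub_procedure_id: str
    name: str                      # human-readable name shown to the user
    component_id: str              # component recovered by this sub-procedure
    failure_type: str              # failure type handled (hypothetical label)
    operation_minutes: dict        # operation name -> required time in minutes
    run_after: list = field(default_factory=list)  # precondition: IDs to run first
    resulting_state: str = "in service"            # transition destination state

    def execution_minutes(self):
        """Estimate execution time as the sum of the per-operation times."""
        return sum(self.operation_minutes.values())


store = {
    "S1": SubProcedureInfo("S1", "Restart application A", "application A",
                           "process down", {"setting change": 10, "reboot": 5},
                           run_after=["S2"]),
    "S2": SubProcedureInfo("S2", "Restore database B", "database B",
                           "data lost", {"data recovery": 90, "reboot": 5}),
}
print(store["S1"].execution_minutes(), store["S2"].execution_minutes())
```

In this sketch only one kind of precondition (sub-procedures that must run first) is modeled; required prior states and mutually exclusive sub-procedures would need additional fields.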

The reconstruction procedure storage means 108 stores information on reconstruction procedures, which are procedures for reconstructing components in which failures have occurred. For example, the reconstruction procedure storage means 108 stores, in association with a reconstruction procedure ID identifying each reconstruction procedure, the reconstruction procedure itself (information indicating the specific processing contents of the reconstruction procedure, or a script or program for actually executing those processing contents). Examples of reconstruction include rebuilding an application and redeploying a virtual machine (VM). In the present invention, the concrete form of a reconstruction procedure is not particularly limited. That is, the reconstruction procedure provided to the user may be information for having the user execute the reconstruction procedure, a script or program that actually executes the procedure on the information system, or a combination of these. Unlike a simple restart that merely clears the internal state, reconstruction re-creates the component from scratch. For example, redeploying a VM includes operations such as setting the resources to be allocated to the VM, setting the IP address and firewall of the OS on the VM, and transferring the VM image. However, a reconstruction procedure requires neither an understanding of the current state nor the work of identifying the failure, and in many cases an appropriate procedure is already known, for example because the component was once built by that procedure and has a track record of normal operation. If an appropriate procedure is known in advance, the reconstruction procedure can be automated by a script or the like.

FIG. 4 is an explanatory diagram showing a part of the information stored in the reconstruction procedure storage means 108. As shown in FIG. 4, the reconstruction procedure storage means 108 of the present embodiment further stores, in association with the reconstruction procedure ID, the identifiers of the components to be reconstructed by the reconstruction procedure identified by that reconstruction procedure ID. In other words, the reconstruction procedure storage means 108 of the present embodiment stores, for each reconstruction procedure, information including at least the reconstruction procedure itself and the identifiers of the components to be reconstructed by it. A reconstruction procedure is defined, for example, for each component included in the information system. Note that reconstruction procedures need not be defined per component; for example, one reconstruction procedure may be defined for a specific combination of components.

Further, the reconstruction procedure storage means 108 associates the reconstruction procedure ID with the execution time of the reconstruction procedure and the component state (component transition destination state) realized by the execution of the reconstruction procedure. It may be stored.

The sub-procedure specifying unit 102 specifies the sub-procedures necessary for recovering the components in which failures have occurred from the failure state, based on the combination of failures received by the failure combination receiving unit 101. For example, for each component in which a failure has occurred according to the received combination of failures, the sub-procedure specifying unit 102 refers to the information stored in the sub-procedure storage means 103 and specifies the sub-procedure ID of the sub-procedure for recovering from the failure occurring in that component.
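A minimal sketch of this lookup, assuming the stored information is indexed by (component identifier, failure type), is shown below in Python. The index contents and the function name specify_sub_procedures are hypothetical; the publication does not prescribe a key structure.

```python
# (component identifier, failure type) -> sub-procedure ID; all entries are made up
SUB_PROCEDURE_INDEX = {
    ("application A", "process down"): "S1",
    ("database B", "data lost"): "S2",
    ("vm C", "wrong setting"): "S3",
}


def specify_sub_procedures(failure_combination):
    """Return (sub-procedure IDs, failures with no registered sub-procedure)."""
    specified, unresolved = [], []
    for failure in sorted(failure_combination):
        sub_procedure_id = SUB_PROCEDURE_INDEX.get(failure)
        if sub_procedure_id is None:
            unresolved.append(failure)
        else:
            specified.append(sub_procedure_id)
    return specified, unresolved


specified, unresolved = specify_sub_procedures({
    ("database B", "data lost"),
    ("application A", "process down"),
    ("storage D", "disk error"),
})
print(specified, unresolved)
```

Returning the unresolved failures separately is a design choice of this sketch, so that a missing sub-procedure is reported rather than silently ignored.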

The failure recovery procedure generation means 104 connects the sub-procedures specified by the sub-procedure specifying means 102 (combining them in an appropriate order) to generate a failure recovery procedure. Most simply, the failure recovery procedure generation means 104 may generate the failure recovery procedure by connecting the specified sub-procedures in series. When there are constraints, such as an execution order, among the specified sub-procedures, the failure recovery procedure generation means 104 connects the specified sub-procedures so as to satisfy those constraints. When there are a plurality of candidate ways of connecting them, the failure recovery procedure generation means 104 may connect them so as to shorten the time required for failure recovery, for example, by parallelizing as much as possible. Note that the failure recovery procedure generated by the failure recovery procedure generation means 104 is the first candidate for the service resumption procedure that the present invention is to generate.
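One way to connect sub-procedures under ordering constraints while parallelizing where possible is a topological sort that emits stages of mutually independent sub-procedures. The sketch below uses Python's standard graphlib module (Python 3.9 or later); the function name connect_sub_procedures, the IDs, and the constraint that S1 must follow S2 are assumptions for illustration, not the method fixed by this publication.

```python
from graphlib import TopologicalSorter


def connect_sub_procedures(selected, run_after):
    """Group sub-procedures into stages; each stage can run in parallel.

    run_after maps a sub-procedure ID to the IDs that must finish first.
    Constraints that mention sub-procedures outside `selected` are ignored.
    """
    selected = set(selected)
    graph = {sp: {p for p in run_after.get(sp, ()) if p in selected}
             for sp in selected}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the constraints are cyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = connect_sub_procedures(["S1", "S2", "S3"], {"S1": ["S2"]})
print(stages)
```

Here S2 and S3 have no prerequisites and form the first stage, and S1 follows once S2 is done. Connecting the stages in series, and the members of each stage in parallel, yields one candidate failure recovery procedure.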

The time requirement accepting unit 110 accepts the time requirement for the service resumption procedure. The time requirement is, for example, RTO, and more specifically, 1 day, 3 hours, 5 minutes, and the like. The time requirement is determined according to the contract with the customer.

The required time estimation means 105 estimates the required time, which is the time needed to execute a generated candidate for the service resumption procedure (hereinafter referred to as a service resumption procedure candidate). The required time estimation means 105 first estimates the required time for the failure recovery procedure generated by the failure recovery procedure generation means 104, which is the first candidate for the service resumption procedure.

The required time estimation means 105 may estimate the required time by sequentially adding the time required for each operation included in each of the sub-procedures in the failure recovery procedure. Alternatively, the required time estimation means 105 may use a calculation formula in which the required time increases in proportion to the number of operations included in the failure recovery procedure. For a more accurate estimate, the required time estimation means 105 may generate a probabilistic model such as a stochastic reward net from, for example, an activity diagram representing the failure recovery procedure, and estimate the required time using an analysis tool such as the Stochastic Petri Net Package.
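The two simpler estimation methods can be sketched as follows in Python, assuming a serial procedure and per-operation times in minutes. The function names, IDs, times, and the per-operation coefficient of 20 minutes are all hypothetical values chosen for the example.

```python
def estimate_by_sum(procedure, operation_minutes):
    """Add up the time of every operation in every step, in order."""
    return sum(sum(operation_minutes[step].values()) for step in procedure)


def estimate_by_count(procedure, operation_minutes, minutes_per_operation):
    """Assume the required time grows in proportion to the operation count."""
    operations = sum(len(operation_minutes[step]) for step in procedure)
    return minutes_per_operation * operations


operation_minutes = {
    "S1": {"setting change": 10, "reboot": 5},
    "S2": {"data recovery": 90, "reboot": 5},
}
procedure = ["S2", "S1"]
print(estimate_by_sum(procedure, operation_minutes))        # 110
print(estimate_by_count(procedure, operation_minutes, 20))  # 80
```

The sum-based estimate needs a time for every operation, whereas the count-based estimate needs only a single coefficient and is correspondingly coarser.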

Further, when the estimated required time of the failure recovery procedure exceeds the time indicated by the time requirement, the required time estimation unit 105 causes the sub-procedure replacement unit 109, described later, to replace at least part of the service resumption procedure candidate (more specifically, at least part of the sub-procedures included in the failure recovery procedure) with reconstruction procedures. The sub-procedure replacement unit 109 replaces sub-procedures with reconstruction procedures until the required time of the updated service resumption procedure candidate satisfies the time requirement.

To estimate the required time of an updated service resumption procedure candidate, the required time estimation means 105 may, for example, sequentially add the time required for the operations included in each of the sub-procedures and reconstruction procedures in the updated candidate, as described above. Alternatively, it may use a predetermined calculation formula in which the required time increases in proportion to the number of operations, or it may replace the time required to execute each replaced sub-procedure with the execution time of the reconstruction procedure that replaces it.
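For a serial procedure, re-adding all times and swapping only the replaced time give the same result, as the following Python sketch illustrates. The IDs (S1, S2 for sub-procedures, R2 for a reconstruction procedure) and the times in minutes are hypothetical.

```python
def total_minutes(candidate, minutes):
    """Full re-estimation: add the time of every step in the candidate."""
    return sum(minutes[step] for step in candidate)


def swap_estimate(previous_total, replaced, replacement, minutes):
    """Incremental re-estimation: swap one step's time for another's."""
    return previous_total - minutes[replaced] + minutes[replacement]


minutes = {"S1": 15, "S2": 95, "R2": 40}
before = ["S2", "S1"]  # original failure recovery procedure
after = ["R2", "S1"]   # S2 replaced with reconstruction procedure R2

previous_total = total_minutes(before, minutes)
print(previous_total)                                   # 110
print(swap_estimate(previous_total, "S2", "R2", minutes))  # 55
print(total_minutes(after, minutes))                    # 55
```

The equivalence holds only under the serial assumption; when steps run in parallel, replacing one step may not change the overall required time by the same amount.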

The reconstruction procedure specifying unit 107 identifies, based on the combination of failures received by the failure combination receiving unit 101 and the information stored in the reconstruction procedure storage unit 108, the reconstruction procedure necessary for reconstructing each component in which a failure has occurred. For example, the reconstruction procedure specifying unit 107 refers to the information stored in the reconstruction procedure storage unit 108 and identifies the reconstruction procedure ID of the reconstruction procedure for each component in which a failure has occurred.

When the required time estimated by the required time estimation means 105 shows that the service restart procedure candidate does not satisfy the time requirement, the sub-procedure replacement means 109 replaces at least a part of the failure recovery sub-procedures included in the candidate with at least one of the reconstruction procedures identified by the reconstruction procedure identification means 107. Sub-procedures may be replaced one at a time, or several may be replaced simultaneously. The sub-procedure replacement means 109 may also change the number of replacements according to the excess time, that is, the amount by which the required time of the service restart procedure candidate exceeds the time requirement. Further, the sub-procedure replacement unit 109 may preferentially use reconstruction procedures with short execution times as replacements.

The sub-procedure replacement means 109 may try total numbers of replacements from sub-procedures to reconstruction procedures starting from small values, such as 1, 2, 3, and so on, and stop the replacement process once the time requirement is satisfied. Further, for example, the sub-procedure replacement means 109 may perform replacements in descending order of the difference between the execution time of the replacement-source sub-procedure and the execution time of the replacement-destination reconstruction procedure, that is, starting from the replacement with the largest time-saving effect.
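
One way to picture replacement in descending order of time-saving effect is the greedy sketch below. It is an assumption-laden illustration, not the embodiment's implementation: the function name, the per-component time tables, and the choice to ignore replacements that save no time are all hypothetical.

```python
from typing import Dict, List, Tuple


def replace_by_largest_saving(sub_times: Dict[str, float],
                              rebuild_times: Dict[str, float],
                              time_limit: float) -> Tuple[List[str], float]:
    """Replace sub-procedures with reconstruction procedures, largest saving
    first, and stop as soon as the total meets the time requirement.

    sub_times:     component id -> recovery time of its sub-procedures
    rebuild_times: component id -> time of its reconstruction procedure
    Returns the replaced components (in order) and the resulting total time.
    """
    total = sum(sub_times.values())
    # Assumption: only replacements that actually shorten the procedure count.
    savings = {c: sub_times[c] - rebuild_times[c]
               for c in sub_times
               if c in rebuild_times and sub_times[c] > rebuild_times[c]}
    replaced: List[str] = []
    for component in sorted(savings, key=lambda c: savings[c], reverse=True):
        if total <= time_limit:
            break  # requirement met, keep the remaining sub-procedures
        total -= savings[component]
        replaced.append(component)
    return replaced, total


# Hypothetical times in minutes.
sub = {"web01": 15, "db01": 80, "app01": 40}
rebuild = {"web01": 10, "db01": 40, "app01": 25}
print(replace_by_largest_saving(sub, rebuild, 100))  # (['db01'], 95)
```

If the returned total still exceeds the limit, every useful replacement has been used and the requirement cannot be met by this strategy.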

Generally, reconstructing a component makes all sub-procedures for recovering from failures in that component unnecessary. The sub-procedure replacement means 109 replaces at least a part of the sub-procedures included in the generated service restart procedure candidate with the reconstruction procedure, based on, for example, the information on each sub-procedure included in the candidate and the information on the identified reconstruction procedures. For example, the sub-procedure replacement unit 109 may determine which sub-procedure can be replaced with which reconstruction procedure based on the component identifier associated with each sub-procedure and the component identifier associated with each reconstruction procedure. When a component contains multiple other components, the reconstruction procedure for the containing component may be able to replace the sub-procedures for recovering from failures in all of the components it contains. In such a case, information indicating the inclusion relationship between components may be stored separately. Also, since sub-procedures are often organized and prepared per component failure, the correspondence between component failures and reconstruction procedures may be stored in advance, and which sub-procedure can be replaced with which reconstruction procedure may be determined from that correspondence. In addition to this correspondence, preconditions for replacement with a reconstruction procedure may be stored together, so that whether replacement is possible is determined according to the situation.
Preconditions include, for example, a specific system state (the OS is operating normally, the database is not being backed up, and so on) and ordering constraints (such as sub-procedures or reconstruction procedures that must be executed beforehand). When replacing a sub-procedure with the corresponding reconstruction procedure, the sub-procedure replacement means 109 may, for example, add a new procedure (a sub-procedure or a reconstruction procedure) or change the execution order of the procedures in the service restart procedure candidate so that the preconditions of the replacement-destination reconstruction procedure are satisfied.
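
The replaceability judgment based on component identifiers, the inclusion relationship, and system-state preconditions could look roughly like the sketch below. The table `CONTAINS`, the function names, the identifiers, and the state keys are hypothetical and serve only to illustrate the idea; ordering preconditions are not modeled.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical inclusion relationship: containing component -> contained components.
CONTAINS: Dict[str, Set[str]] = {"server01": {"os01", "db01"}}


def covered_components(rebuild_component: str) -> Set[str]:
    """Components whose recovery sub-procedures become unnecessary
    once rebuild_component is reconstructed."""
    covered = {rebuild_component}
    for child in CONTAINS.get(rebuild_component, set()):
        covered |= covered_components(child)
    return covered


def replaceable_sub_procedures(sub_procedures: List[Tuple[str, str]],
                               rebuild_component: str,
                               system_state: Dict[str, bool],
                               preconditions: Dict[str, bool]) -> List[str]:
    """Return IDs of sub-procedures the reconstruction procedure can replace.

    sub_procedures: (sub-procedure id, component id) pairs.
    Replacement is refused when any system-state precondition is not met.
    """
    if any(system_state.get(key) != value for key, value in preconditions.items()):
        return []
    covered = covered_components(rebuild_component)
    return [sp_id for sp_id, component in sub_procedures if component in covered]


subs = [("sp1", "os01"), ("sp2", "db01"), ("sp3", "web01")]
needs = {"db_backup_running": False}
print(replaceable_sub_procedures(subs, "server01", {"db_backup_running": False}, needs))
# ['sp1', 'sp2']
print(replaceable_sub_procedures(subs, "server01", {"db_backup_running": True}, needs))
# []
```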

In addition, if the required time of the service restart procedure candidate still does not satisfy the time requirement even after every replaceable sub-procedure has been replaced with a reconstruction procedure, the sub-procedure replacement means 109 outputs a message to that effect.

If a service restart procedure candidate that satisfies the time requirement is generated as a result of the sub-procedure replacement process described above, the procedure output unit 106 outputs that candidate as the service restart procedure (or as a candidate for it). That is, the procedure output means 106 provides the user only with service restart procedure candidates that satisfy the accepted time requirement. The procedure output means 106 may provide such a candidate to the user by, for example, outputting its specific processing contents in the form of an activity diagram. If no service restart procedure candidate satisfying the accepted time requirement is generated, the procedure output means 106 may output a message such as “no corresponding procedure”, or may output the candidate with the shortest required time as reference information for the operator's judgment.
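
The output behavior described above can be illustrated with the following sketch. The function name, the dictionary layout of the result, and the candidate names are assumptions for illustration; the embodiment itself outputs, for example, an activity diagram.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, float]  # (candidate name, required time in minutes)


def choose_output(candidates: List[Candidate],
                  time_limit: float,
                  show_shortest_as_reference: bool = False) -> dict:
    """Return only candidates that meet the time requirement; otherwise report
    that none exists and, optionally, the shortest candidate as reference."""
    satisfying = [c for c in candidates if c[1] <= time_limit]
    if satisfying:
        return {"message": None, "procedures": satisfying, "reference": None}
    reference: Optional[Candidate] = None
    if show_shortest_as_reference and candidates:
        reference = min(candidates, key=lambda c: c[1])
    return {"message": "no corresponding procedure",
            "procedures": [],
            "reference": reference}


candidates = [("recover all", 135), ("rebuild db01", 95), ("rebuild db01 and app01", 80)]
print(choose_output(candidates, 100)["procedures"])
print(choose_output(candidates, 60, show_shortest_as_reference=True))
```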

In this embodiment, the sub procedure storage unit 103 and the reconstruction procedure storage unit 108 are realized by a storage device, for example. Moreover, the failure combination receiving unit 101 and the time requirement receiving unit 110 are realized by, for example, a CPU that operates according to a program and an input device. The procedure output means 106 is realized by, for example, a CPU that operates according to a program and an output device. The sub procedure specifying unit 102, the failure recovery procedure generating unit 104, the required time estimating unit 105, the reconstruction procedure specifying unit 107, and the sub procedure replacing unit 109 are realized by a CPU that operates according to a program, for example.

Next, the operation of the service restart procedure generating device 1 according to this embodiment will be described. FIG. 5 is a flowchart showing an example of the operation of the service restart procedure generating device 1 of the present embodiment. As shown in FIG. 5, first, the failure combination receiving unit 101 receives a combination of failures that have occurred in the components of the information system (step S101). A combination of faults occurring in the components of the information system may be input by the user or may be acquired directly from the information system.

Next, the sub-procedure specifying unit 102 specifies a sub-procedure necessary for setting the state of the component group in which the failure has occurred to the recovery state based on the combination of failures received in step S101 (step S102).

Next, the failure recovery procedure generation means 104 generates a failure recovery procedure that is the first candidate for the service restart procedure by connecting the sub-procedures identified in step S102 (step S103).

Next, the reconstruction procedure specifying unit 107 specifies a reconstruction procedure necessary for reconstructing the component group in which the failure has occurred based on the combination of failures received in step S101 (step S104). At this time, a component for which a reconstruction procedure is not prepared is skipped ("No reconstruction procedure"). Note that this step may be executed at another timing between step S101 and step S108.
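
Step S104, including the skipping of components that have no reconstruction procedure, might be sketched as a simple lookup keyed by component identifier. The store contents, the procedure IDs, and the function name below are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical contents of the reconstruction procedure storage:
# component id -> reconstruction procedure ID.
REBUILD_STORE: Dict[str, str] = {"db01": "RB-001", "app01": "RB-002"}


def identify_rebuild_procedures(failures: List[Tuple[str, str]],
                                store: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Map each failed component to its reconstruction procedure ID.

    failures: (component id, failure description) pairs.
    Components without a prepared reconstruction procedure are skipped
    and reported separately.
    """
    identified: Dict[str, str] = {}
    skipped: List[str] = []
    for component_id, _failure in failures:
        if component_id in store:
            identified[component_id] = store[component_id]
        elif component_id not in skipped:
            skipped.append(component_id)
    return identified, skipped


failures = [("web01", "process down"), ("db01", "data corruption"), ("db01", "disk full")]
print(identify_rebuild_procedures(failures, REBUILD_STORE))
# ({'db01': 'RB-001'}, ['web01'])
```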

Next, the time requirement accepting unit 110 accepts the time requirement for the service resumption procedure (step S105). Note that this step may be executed at another timing between step S101 and step S107.

Next, the required time estimating means 105 estimates the required time of the generated service restart procedure candidate (step S106). The first time this step is executed, the required time estimation unit 105 estimates the required time of the failure recovery procedure generated in step S103. From the second time onward, the required time estimating means 105 estimates the required time of the new service restart procedure candidate generated in step S108 by replacing a part of the sub-procedures with the reconstruction procedure.

Next, the required time estimating means 105 determines whether or not the required time estimated in step S106 satisfies the time requirement received in step S105 (step S107). The sub-procedure replacement unit 109 may perform this step.

If the required time satisfies the time requirement (Yes in step S107), the procedure output means 106 outputs the finally obtained service restart procedure candidate as a service restart procedure to the display or the like (step S109).

On the other hand, if the required time does not satisfy the time requirement (No in step S107), the sub procedure replacement unit 109 updates the service restart procedure candidate by replacing one of the sub-procedures included in the candidate with one of the identified reconstruction procedures (step S108). The process then returns to step S106 and is repeated for the new service restart procedure candidate generated in step S108.
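
The loop formed by steps S102 to S109 can be summarized in the sketch below. It is a simplified illustration under several assumptions: each component has a single recovery time, sub-procedures are replaced one at a time in list order, and the function and variable names are invented for the example.

```python
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, str, float]  # (component id, "recover" or "rebuild", minutes)


def generate_service_restart_procedure(failed: List[str],
                                       sub_store: Dict[str, float],
                                       rebuild_store: Dict[str, float],
                                       time_limit: float
                                       ) -> Tuple[Optional[List[Step]], float]:
    # S102-S103: identify the sub-procedures and connect them.
    candidate: List[Step] = [(c, "recover", sub_store[c]) for c in failed]
    # S104: identify reconstruction procedures; components without one are skipped.
    rebuilds = {c: rebuild_store[c] for c in failed if c in rebuild_store}
    while True:
        required = sum(minutes for _, _, minutes in candidate)  # S106
        if required <= time_limit:                               # S107
            return candidate, required                           # S109
        # S108: replace one remaining sub-procedure that has a reconstruction procedure.
        index = next((i for i, (c, kind, _) in enumerate(candidate)
                      if kind == "recover" and c in rebuilds), None)
        if index is None:
            return None, required  # every replaceable sub-procedure is already replaced
        component = candidate[index][0]
        candidate[index] = (component, "rebuild", rebuilds[component])


sub_store = {"web01": 15, "db01": 80, "app01": 40}
rebuild_store = {"db01": 40, "app01": 25}  # web01 has no reconstruction procedure
print(generate_service_restart_procedure(["web01", "db01", "app01"],
                                         sub_store, rebuild_store, 100))
```

With a limit of 100 minutes, only db01 is rebuilt and the other components are still recovered normally, which matches the idea of keeping reconstruction to the minimum needed.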

As described above, in the present embodiment, a failure recovery procedure corresponding to the combination of failures that occurred is generated first, and then a part of the sub-procedures included in the failure recovery procedure is replaced with reconstruction procedures until the specified time requirement is satisfied. Therefore, according to the present embodiment, a service resumption procedure that satisfies the time requirement can be provided to the user even if the time requirement cannot be satisfied by the normal failure recovery procedure.

In addition, according to the present embodiment, the range of reconstruction can be limited to the minimum necessary to satisfy the specified time requirement, so the service can be resumed with the causes of the failures removed as far as possible while still satisfying the specified time requirement.

Embodiment 2.
Next, a second embodiment of the present invention will be described. When an RTO is defined for a service provided by an information system, a fee that must be paid as a penalty for the time by which recovery exceeds the RTO when a failure occurs is often defined as well. These costs are commonly referred to as downtime costs. On the other hand, the cost that can be invested in information system failure recovery is generally limited. In the present embodiment, a service restart procedure that satisfies such cost requirements is generated.

The service restart procedure generation device of the present embodiment differs from the service restart procedure generation device 1 of the first embodiment in that it generates a service restart procedure that satisfies a specified cost requirement, based on the downtime cost for the excess time of a service restart procedure candidate. Hereinafter, the differences from the service restart procedure generation device 1 of the first embodiment will be mainly described.

FIG. 6 is a block diagram illustrating a configuration example of the service restart procedure generation device according to the present embodiment. The service resumption procedure generation device 2 illustrated in FIG. 6 includes a required cost estimation unit 111 and a cost requirement reception unit 112 in addition to the service resumption procedure generation device 1 of the first embodiment illustrated in FIG.

The required cost estimation unit 111 estimates the required cost, which is the cost required to execute a service restart procedure candidate, based on the required time estimated by the required time estimation unit 105 and the time requirement received by the time requirement reception unit 110. The required cost may be, for example, a downtime cost. The downtime cost may simply be calculated in proportion to the excess time of the service restart procedure candidate, or with a calculation formula determined in advance by a contract with the customer who uses the service. Further, the required cost estimation unit 111 may estimate as the required cost not only the downtime cost but also other costs of executing the service restart procedure candidate, such as the labor and equipment costs required for failure recovery.
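
The simplest cost model mentioned above, a downtime cost proportional to the excess time plus optional labor and equipment costs, can be sketched as follows. The function names, the penalty rate, and all amounts are hypothetical; a contractual formula would replace `downtime_cost`.

```python
def downtime_cost(required_minutes: float, rto_minutes: float,
                  penalty_per_minute: float) -> float:
    """Downtime cost proportional to the time by which the RTO is exceeded."""
    excess = max(0.0, required_minutes - rto_minutes)
    return excess * penalty_per_minute


def required_cost(required_minutes: float, rto_minutes: float,
                  penalty_per_minute: float,
                  labor_cost: float = 0.0, equipment_cost: float = 0.0) -> float:
    """Downtime cost plus other costs of carrying out the procedure."""
    return (downtime_cost(required_minutes, rto_minutes, penalty_per_minute)
            + labor_cost + equipment_cost)


# Hypothetical figures: 135-minute candidate, 100-minute RTO, 1000 per excess minute.
print(downtime_cost(135, 100, 1000))  # 35000.0
print(required_cost(135, 100, 1000, labor_cost=5000, equipment_cost=2000))  # 42000.0
```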

When the estimated required cost exceeds the cost requirement accepted by the cost requirement accepting unit 112, the required cost estimating unit 111 causes the sub procedure replacing unit 109 to replace at least a part of the service restart procedure candidate (more specifically, at least one of the sub-procedures included in the failure recovery procedure) with a reconstruction procedure. In the present embodiment, the sub procedure replacement unit 109 replaces sub-procedures with reconstruction procedures until the required cost of the updated service restart procedure candidate satisfies the cost requirement.

The cost requirement receiving means 112 receives the cost requirement for the service resumption procedure.

Further, unlike the required time estimation unit 105 of the first embodiment, the required time estimation unit 105 of the present embodiment only estimates the required time of the service restart procedure candidate. That is, it does not issue an instruction to replace sub-procedures with the reconstruction procedure even when the estimated required time does not satisfy the time requirement received by the time requirement reception unit 110. As described above, the instruction to replace sub-procedures with the reconstruction procedure is issued by the required cost estimation unit 111.

Further, when the required cost estimated by the required cost estimation unit 111 shows that the service restart procedure candidate does not satisfy the cost requirement, the sub-procedure replacement unit 109 of the present embodiment replaces at least a part of the failure recovery sub-procedures included in the candidate with at least one of the reconstruction procedures specified by the reconstruction procedure specifying means 107. If the required cost of the service restart procedure candidate still does not satisfy the cost requirement even after every replaceable sub-procedure has been replaced with a reconstruction procedure, the sub-procedure replacement unit 109 outputs a message to that effect.

In the present embodiment, the required cost estimation unit 111 and the cost requirement reception unit 112 are realized by a CPU that operates according to a program, for example.

Next, the operation of the service restart procedure generation device 2 according to this embodiment will be described. FIG. 7 is a flowchart showing an example of the operation of the service restart procedure generating device 2 of the present embodiment. As shown in FIG. 7, first, similarly to the case of the first embodiment, the processing from step S101 to step S105 is performed.

In this embodiment, next, the cost requirement receiving unit 112 receives the cost requirement for the service resumption procedure (step S1051). Note that this step may be executed at another timing between steps S101 and S1062.

Next, the required time estimating means 105 estimates the required time of the generated service restart procedure candidate (step S106). This step is the same as in the first embodiment.

Next, the required cost estimation unit 111 estimates the required cost of the generated service restart procedure candidate based on the required time estimated in step S106 and the time requirement received in step S105 (step S1061). The first time this step is executed, the required cost estimation unit 111 estimates the required cost of the failure recovery procedure generated in step S103. From the second time onward, the required cost estimation unit 111 estimates the required cost of the new service restart procedure candidate generated in step S108 by replacing a part of the sub-procedures with the reconstruction procedure.

Next, the required cost estimation unit 111 determines whether or not the required cost estimated in step S1061 satisfies the cost requirement received in step S1051 (step S1062). Note that, similarly to the first embodiment, the sub-procedure replacement unit 109 may perform this step.

If the required cost satisfies the cost requirement (Yes in step S1062), the procedure output means 106 outputs the finally obtained service restart procedure candidate as a service restart procedure to the display or the like (step S109).

On the other hand, when the required cost does not satisfy the cost requirement (No in step S1062), the sub procedure replacement unit 109 updates the service restart procedure candidate by replacing a part of the sub-procedures included in the candidate with the reconstruction procedure (step S108). The process then returns to step S106 and is repeated for the new service restart procedure candidate generated in step S108.
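
The cost-driven loop of steps S106, S1061, S1062, and S108 can be sketched as below. The sketch assumes a downtime cost proportional to the excess over the RTO and one replacement per iteration; the function name and all figures are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def generate_with_cost_requirement(sub_times: Dict[str, float],
                                   rebuild_times: Dict[str, float],
                                   rto: float,
                                   penalty_per_minute: float,
                                   cost_limit: float
                                   ) -> Tuple[Optional[List[str]], float, float]:
    """Return (replaced components or None, required time, required cost)."""
    times = dict(sub_times)       # current time per component in the candidate
    replaced: List[str] = []
    while True:
        required_time = sum(times.values())                        # S106
        cost = max(0, required_time - rto) * penalty_per_minute    # S1061
        if cost <= cost_limit:                                     # S1062
            return replaced, required_time, cost                   # S109
        remaining = [c for c in times if c in rebuild_times and c not in replaced]
        if not remaining:
            return None, required_time, cost  # nothing left to replace
        component = remaining[0]
        times[component] = rebuild_times[component]                # S108
        replaced.append(component)


sub = {"web01": 15, "db01": 80, "app01": 40}
rebuild = {"db01": 40, "app01": 25}
print(generate_with_cost_requirement(sub, rebuild, rto=90,
                                     penalty_per_minute=1000, cost_limit=10000))
# (['db01'], 95, 5000)
```

In this example the 90-minute RTO is still exceeded by five minutes, but the resulting downtime cost fits within the cost limit, so the candidate is accepted.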

As described above, in the present embodiment, after a failure recovery procedure is generated according to the combination of failures that occurred, a part of the sub-procedures included in the failure recovery procedure is replaced with reconstruction procedures until the specified cost requirement is satisfied. Therefore, according to the present embodiment, even if the time requirement cannot be satisfied by the normal failure recovery procedure, a service resumption procedure that satisfies the cost requirement can be generated, so the user can be provided with a service resumption procedure that can be executed even when the available cost is limited.

In addition, according to the present embodiment, the range of reconstruction can be limited to the minimum necessary to satisfy the specified cost requirement, so the service can be resumed with the causes of the failures removed as far as possible while still satisfying the specified cost requirement.

Next, the outline of the present invention will be described. FIG. 8 is a block diagram showing an outline of the service restart procedure generation device of the present invention. The service restart procedure generation device 500 shown in FIG. 8 includes a failure combination receiving unit 501, a sub procedure storage unit 502, a reconstruction procedure storage unit 503, a sub procedure identification unit 504, a failure recovery procedure generation unit 505, a reconstruction procedure specifying unit 506, a sub procedure replacement unit 507, and a procedure output unit 508.

The failure combination accepting unit 501 (for example, the failure combination accepting unit 101) accepts a combination of failures occurring in the components included in the information system.

The sub-procedure storage unit 502 (for example, the sub-procedure storage unit 103) stores sub-procedure information, which is a procedure for recovering a failure occurring in a component included in the information system, in association with a component identifier.

The reconstruction procedure storage unit 503 (for example, the reconstruction procedure storage unit 108) stores information on a reconstruction procedure, which is a procedure for reconstructing a component included in the information system, in association with the component identifier.

The sub-procedure specifying unit 504 (for example, the sub-procedure specifying unit 102) specifies, based on the combination of failures received by the failure combination receiving unit 501, the sub-procedures necessary for recovering the components in which failures have occurred from the failure state.

The failure recovery procedure generation unit 505 (for example, the failure recovery procedure generation unit 104) generates a failure recovery procedure by connecting the specified sub-procedures, based on the information on the sub-procedures specified by the sub-procedure specifying unit 504.

The rebuilding procedure specifying unit 506 (for example, the rebuilding procedure specifying unit 107) specifies, based on the combination of failures received by the failure combination receiving unit 501, the rebuilding procedure necessary for rebuilding the components in which failures have occurred.

When the generated failure recovery procedure does not satisfy a predetermined requirement, the sub-procedure replacement unit 507 (for example, the sub-procedure replacement unit 109) replaces at least a part of the sub-procedures included in the generated failure recovery procedure with the reconstruction procedure, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the specified reconstruction procedure.

The procedure output unit 508 (for example, the procedure output unit 106) outputs a failure recovery procedure in which at least a part of the sub procedure is replaced by the reconstruction procedure by the sub procedure replacement unit 507 as a service restart procedure.

With the configuration as described above, even when a normal failure recovery procedure cannot satisfy a predetermined requirement, an optimal service restart procedure can be automatically generated according to a combination of failures that have occurred.

In addition, the service restart procedure generating apparatus of the present invention may include a time requirement receiving unit (for example, the time requirement receiving unit 110) that receives a time requirement imposed on the service restart procedure, and required time estimation means (for example, required time estimation means 105) that estimates the required time, which is the time needed to carry out a specified procedure. In such a case, when the required time of the failure recovery procedure estimated by the required time estimation means does not satisfy the time requirement, the sub procedure replacement unit 507 may replace at least a part of the sub-procedures included in the failure recovery procedure with reconstruction procedures so that the time requirement is satisfied.

In addition, the service resumption procedure generating apparatus of the present invention may further include cost requirement acceptance means (for example, cost requirement acceptance means 112) that accepts a cost requirement imposed on the service resumption procedure, and required cost estimation means (for example, required cost estimation means 111) that estimates the required cost, which is the cost of carrying out a specified procedure and includes the downtime cost imposed for exceeding the failure recovery time. In such a case, when the required cost of the failure recovery procedure estimated by the required cost estimation means does not satisfy the cost requirement, the sub procedure replacement unit 507 may replace at least a part of the sub-procedures included in the failure recovery procedure with reconstruction procedures so that the cost requirement is satisfied.

Further, the sub-procedure replacement means 507 may treat the generated failure recovery procedure as the first candidate for the service resumption procedure and repeat the replacement process, in which at least a part of the sub-procedures included in the candidate is replaced with the reconstruction procedure, until the generated service resumption procedure candidate satisfies a predetermined requirement. The procedure output unit 508 may then output, as the service restart procedure, the candidate that the sub procedure replacement unit 507 determines to satisfy the predetermined requirement.

Further, the sub-procedure replacement means 507 may perform replacement in descending order of execution time difference before and after replacement.

Also, the reconstruction procedure may be provided in a script or program.

In the above embodiments, the required time or the required cost is used as the evaluation index for the service restart procedure. However, other evaluation indexes related to system requirements, such as the success rate of the service restart procedure, may be used.

Further, in each of the above embodiments, each function of the service restart procedure generating device is realized by software, more specifically by a CPU executing a program, but may instead be realized by hardware such as a circuit.

In each of the above embodiments, the program is stored in a storage device, but may be stored in a computer-readable recording medium. For example, the recording medium is a portable medium such as a flexible disk, an optical disk, a magneto-optical disk, or a semiconductor memory.

Although the present invention has been described with reference to the embodiments and examples, the present invention is not limited to the above embodiments and examples. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

This application claims priority based on Japanese Patent Application No. 2013-234751 filed on November 13, 2013, the entire disclosure of which is incorporated herein.

The present invention is applicable not only to restarting a service, but also to an apparatus, system, method, and program used for recovering, for example, an information system in which a failure has occurred to a state where there is no failure.

1, 2, 500 Service resumption procedure generation device
101, 501 Failure combination acceptance means
102, 504 Sub procedure identification means
103, 502 Sub procedure storage means
104, 505 Failure recovery procedure generation means
105 Required time estimation means
106, 508 Procedure output means
107, 506 Reconstruction procedure specifying means
108, 503 Reconstruction procedure storage means
109, 507 Sub procedure replacement means
110 Time requirement accepting means
111 Required cost estimating means
112 Cost requirement accepting means

Claims

A service restart procedure generation device comprising:
failure combination receiving means for receiving a combination of failures occurring in components included in an information system;
sub-procedure storage means for storing information on a sub-procedure, which is a procedure for recovering from a failure occurring in a component, in association with an identifier of the component;
reconstruction procedure storage means for storing information on a reconstruction procedure, which is a procedure for reconstructing the component, in association with an identifier of the component;
sub-procedure specifying means for specifying, based on the combination of failures received by the failure combination receiving means, the sub-procedures necessary for recovering the components in which failures have occurred from the failure state;
failure recovery procedure generating means for generating a failure recovery procedure by connecting the specified sub-procedures, based on the information on the sub-procedures specified by the sub-procedure specifying means;
reconstruction procedure specifying means for specifying, based on the combination of failures received by the failure combination receiving means, the reconstruction procedure necessary for reconstructing the components in which failures have occurred;
sub-procedure replacement means for replacing, when the generated failure recovery procedure does not satisfy a predetermined requirement, at least a part of the sub-procedures included in the generated failure recovery procedure with the reconstruction procedure, based on the information on each sub-procedure included in the generated failure recovery procedure and the information on the specified reconstruction procedure; and
procedure output means for outputting, as a service restart procedure, the failure recovery procedure in which at least a part of the sub-procedures has been replaced with the reconstruction procedure by the sub-procedure replacement means.
The service restart procedure generation device according to claim 1, further comprising:
time requirement accepting means for accepting a time requirement imposed on the service restart procedure; and
required time estimation means for estimating the required time, which is the time needed to carry out a specified procedure,
wherein, when the required time of the failure recovery procedure estimated by the required time estimation means does not satisfy the time requirement, the sub-procedure replacement means replaces at least a part of the sub-procedures included in the generated failure recovery procedure with the reconstruction procedure so that the time requirement is satisfied.
The service restart procedure generation device according to claim 2, further comprising:
cost requirement accepting means for accepting a cost requirement imposed on the service restart procedure; and
required cost estimating means for estimating the required cost, which is the cost of carrying out a specified procedure and includes a downtime cost imposed for exceeding the failure recovery time,
wherein, when the required cost of the failure recovery procedure estimated by the required cost estimating means does not satisfy the cost requirement, the sub-procedure replacement means replaces at least a part of the sub-procedures included in the generated failure recovery procedure with the reconstruction procedure so that the cost requirement is satisfied.
The service restart procedure generation device according to claim 1, wherein the sub-procedure replacement means sets the generated failure recovery procedure as the first candidate for the service restart procedure and repeats the replacement process of replacing at least a part of the sub-procedures included in the candidate with the reconstruction procedure until the generated service restart procedure candidate satisfies a predetermined requirement, and
the procedure output means outputs, as the service restart procedure, the service restart procedure candidate determined by the sub-procedure replacement means to satisfy the predetermined requirement.
The service resumption procedure generation device according to any one of claims 1 to 4, wherein the sub-procedure replacement means performs replacement in descending order of the difference in execution time before and after replacement.
The service resumption procedure generation device according to any one of claims 1 to 5, wherein the reconstruction procedure is provided as a script or a program.
In a predetermined sub-procedure storage means, information on a sub-procedure, which is a procedure for recovering a failure occurring in a component included in the information system, is stored in association with a component identifier,
In a predetermined reconstruction procedure storage means, information on the reconstruction procedure, which is a procedure for reconstructing the component, is stored in association with the identifier of the component,
The information processing apparatus accepts a combination of failures occurring in the components of the information system,
Based on the received failure combination, the information processing apparatus identifies the sub-procedures necessary for recovering the components in which a failure has occurred from the failure state,
The information processing apparatus generates a failure recovery procedure by connecting the identified sub-procedures based on information on the identified sub-procedures,
The information processing apparatus identifies, based on the received failure combination, the reconstruction procedures necessary for reconstructing the components in which a failure has occurred,
When the generated failure recovery procedure does not satisfy a predetermined requirement, the information processing apparatus replaces at least a part of the sub-procedures included in the generated failure recovery procedure with a reconstruction procedure, based on information on each sub-procedure included in the generated failure recovery procedure and information on the identified reconstruction procedures,
The information processing apparatus outputs, as a service restart procedure, the failure recovery procedure in which at least a part of the sub-procedures has been replaced with a reconstruction procedure.
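The method steps above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the `Procedure` type and the function name are hypothetical, both storage means are modeled as dictionaries keyed by component identifier, and the "predetermined requirement" is assumed to be an upper limit on total execution time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Procedure:
    """A sub-procedure or reconstruction procedure (hypothetical type)."""
    component: str
    steps: tuple
    minutes: float  # estimated execution time


def generate_service_restart_procedure(failed, sub_store, rebuild_store,
                                       time_limit):
    """Build a failure recovery procedure, then swap in reconstruction
    procedures until the assumed time requirement is met."""
    # Identify and connect the sub-procedures of the failed components.
    plan = [sub_store[c] for c in failed]
    # Identify the reconstruction procedures of the failed components.
    rebuilds = {c: rebuild_store[c] for c in failed if c in rebuild_store}

    def total_minutes(procedures):
        return sum(p.minutes for p in procedures)

    # Replace sub-procedures one at a time while the requirement fails.
    for i, current in enumerate(plan):
        if total_minutes(plan) <= time_limit:
            break
        rebuild = rebuilds.get(current.component)
        if rebuild is not None and rebuild.minutes < current.minutes:
            plan[i] = rebuild
    return plan
```

The sketch walks the plan in its original order for brevity; a different replacement order could be substituted without changing the overall flow. If the requirement is already met, the failure recovery procedure is returned unchanged.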
A service restart procedure generation program for causing a computer, which has sub-procedure storage means for storing information on sub-procedures, each being a procedure for recovering from a failure occurring in a component included in an information system, in association with a component identifier, and reconstruction procedure storage means for storing information on reconstruction procedures, each being a procedure for reconstructing the component, in association with a component identifier, to execute:
A failure combination acceptance process for accepting a combination of failures occurring in the components of the information system,
A sub-procedure identification process for identifying, based on the received failure combination, the sub-procedures necessary for recovering the components in which a failure has occurred from the failure state,
A reconstruction procedure identification process for identifying, based on the received failure combination, the reconstruction procedures necessary for reconstructing the components in which a failure has occurred,
A sub-procedure replacement process for replacing, when the generated failure recovery procedure does not satisfy a predetermined requirement, at least a part of the sub-procedures included in the generated failure recovery procedure with a reconstruction procedure, based on information on each sub-procedure included in the generated failure recovery procedure and information on the identified reconstruction procedures, and a procedure output process for outputting, as a service restart procedure, the failure recovery procedure in which at least a part of the sub-procedures has been replaced with a reconstruction procedure.