
  
[FIG. 5 (drawing): successive broadcast/vote sequences, including Phase 3, Phase 4, and Phase 5]

APPARATUS AND METHOD FOR
MAINTAINING CONSISTENCY OF SHARED
DATA RESOURCES IN A CLUSTER
ENVIRONMENT

CROSS REFERENCE TO RELATED
APPLICATIONS

The present application is related to the following concurrently filed patent applications:

U.S. patent application Ser. No. 09/282,225, entitled "Apparatus and Method for Tracking Access to Data Resources in a Cluster Environment"; and

U.S. patent application Ser. No. 09/282,907, entitled "Error Detection Protocol."

TECHNICAL FIELD

The present invention relates generally to distributed networks, and in particular to core cluster functions for maintaining consistency of shared data resources in a cluster environment.

BACKGROUND INFORMATION

As computer systems and networks become increasingly complex, the need for high availability of these systems is becoming correspondingly important. Data networks, and especially the Internet, are uniting the world into a single global marketplace that never closes. Employees, sales representatives, and suppliers in far-flung regions need access to an enterprise's network systems every hour of the day. Furthermore, increasingly sophisticated customers expect twenty-four hour sales and service from a Web site.

As a result, tremendous competitive pressure is placed on companies to keep their systems running continuously, and to be continuously available. With inordinate amounts of downtime, customers would likely take their business elsewhere, costing a company its goodwill and revenue. Furthermore, there are costs associated with lost employee productivity, diverted, canceled, and deferred customer orders, and lost market share. In sum, network server outages can be very costly.

In the past, companies ran on a handful of computers executing relatively simple software. This made it easier to manage the systems and isolate problems.

But in the present networked computing environment, information systems can contain hundreds of interdependent servers and applications. Any failure in one of these components can cause a cascade of failures that could bring down a server and leave a user susceptible to monetary losses.

Generally, there are several levels of availability. The particular use of a software application typically dictates the level of availability needed. There are four general levels of systems availability: base-availability systems, high-availability systems, continuous-operations environments, and continuous-availability environments.

Base-availability systems are ready for immediate use, but will experience both planned and unplanned outages. Such systems are used for application development.

Second, high-availability systems include technologies that sharply reduce the number and duration of unplanned outages. Planned outages still occur, but the servers also include facilities that reduce their impact. High-availability systems are used by stock trading applications.

Third, continuous-operations environments use special technologies to ensure that there are no planned outages for upgrades, backups, or other maintenance activities. Frequently, companies also use high-availability servers in these environments to reduce unplanned outages. Continuous-operations environments are used for Internet applications, such as Internet servers and e-mail applications.

Last, continuous-availability environments seek to ensure that there are no planned or unplanned outages. To achieve this level of availability, companies must use dual servers or clusters of redundant servers in which one server automatically takes over if another server goes down. Continuous-availability environments are used in commerce and mission-critical applications.

As network computing is being integrated more and more into the present commercial environment, the importance of having high availability for distributed systems on clusters of computer processors has been realized, especially for enterprises that run mission-critical applications. Networks with high availability characteristics have procedures within the cluster to deal with failures in the service groups, and make provisions for the failures. High availability means a computing configuration that recovers from failures and provides a better level of protection against system downtime than standard hardware and software alone.

Conventionally, the strategy for handling failures is through a failfast or failstop function. A computer module executed on a computer cluster is said to be failfast if it stops execution as soon as it detects a severe enough failure and if it has a small error latency. Such a strategy has reduced the possibility of cascaded failures due to a single failure occurrence.

Another strategy for handling system failures is through fault containment. Fault containment endeavors to place barriers between components so that an error or fault in one component would not cause a failure in another.

With respect to clusters, there is an increased need for high availability as clusters grow ever larger. But growth in the size of these clusters increases the risk of failure within the cluster from many sources, such as hardware failures, program failures, resource exhaustion, operator or end-user errors, or any combination of these.

Up to now, high availability has been limited to hardware recovery in a cluster having only a handful of nodes. But hardware techniques are not enough to ensure high availability: hardware recovery can compensate only for hardware failures, which account for only a fraction of the availability risk factors.

One example of providing high availability has been software applications clustering support. This technique has implemented software techniques for shared system resources such as a shared disk and a communication protocol.

Another example of providing high availability has been network systems clustering support. With systems clustering support, failover is initiated in the case of hardware failures such as the failure of a node or a network adapter.

Generally, a need exists for simplified and local management of shared resources such as databases, in which local copies of the resource are maintained at each member node of the cluster. Such efficient administrative functions aid the availability of the cluster and allow processor resources to be used for the execution and operation of software applications for a user.

SUMMARY OF THE INVENTION

Thus, provided herein is a method and apparatus for providing a recent set of replicas for a cluster data resource within a cluster having a plurality of nodes, each of the nodes having a group services client with membership and voting services. The method of the present invention concerns broadcasting a data resource open request to the nodes of the cluster, determining the most recent replica of the cluster data resource among the nodes, and distributing the recent replica to the nodes of the cluster.

The apparatus of the present invention is for providing a recent set of replicas for a cluster data resource. The apparatus has a cluster having a plurality of nodes in a peer relationship, each node having an electronic memory for storing a local replica of the cluster data resource. A group services client, which is executable by each node of the cluster, has cluster broadcasting and cluster voting capability. A database conflict resolution protocol ("DCRP"), which is executable by each node of the cluster, interacts with the group services clients such that the DCRP broadcasts to the plurality of nodes a data resource modification request having a data resource identifier and a timestamp. The DCRP determines a recent replica of the cluster data resource among the nodes with respect to the timestamp of the broadcast data resource modification request relative to a local timestamp associated with the data resource identifier, and distributes the recent replica of the cluster data resource to each required node of the plurality of nodes.
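The determine-and-distribute step can be illustrated with a minimal sketch. It is not the patented implementation: the function name `choose_recent_replica`, the dictionary layout, the use of plain integers as timestamps, and the tie-break on the lowest node number are all assumptions made for the example.

```python
def choose_recent_replica(local_timestamps):
    """Return (source_node, nodes_needing_copy) for one data resource.

    local_timestamps maps a node id to the timestamp of its local replica,
    or to None when the node holds no replica (hypothetical layout).
    """
    # Keep only the nodes that actually hold a replica.
    holders = {n: ts for n, ts in local_timestamps.items() if ts is not None}
    if not holders:
        raise ValueError("no node holds a replica of the resource")
    # The most recent replica carries the greatest timestamp.
    newest = max(holders.values())
    # Assumption: among equally recent holders, the lowest node id serves the copy.
    source = min(n for n, ts in holders.items() if ts == newest)
    # Only nodes whose replica is missing or older need the copy.
    stale = sorted(n for n, ts in local_timestamps.items() if ts != newest)
    return source, stale
```

Under these assumptions, nodes that already hold the newest timestamp receive nothing, which corresponds to distributing the replica only to "each required node."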

The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram representation of a computer used for providing a node in the cluster of the present invention;

FIG. 2 is a block diagram representing a cluster having a plurality of nodes;

FIG. 3 is a flow chart of a database conflict resolution protocol ("DCRP") of the present invention executed by the nodes of the cluster;

FIG. 4 is an example of the DCRP of the present invention applied with the member nodes of the cluster having the same timestamp for a shared data resource; and

FIG. 5 is an example of the DCRP of the present invention applied when some of the member nodes have dissimilar timestamps for a shared data resource.

DETAILED DESCRIPTION


In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. It should be noted, however, that those skilled in the art are capable of practicing the present invention without such specific details. In other instances, well-known circuits have been shown in block diagram form in order not to obscure the present invention in unnecessary detail.

Although the present invention is described with reference to a specific embodiment for a technique to provide an aspect of high-availability to a cluster, it should be understood that the present invention can be adapted for use with other high-availability techniques. All such variations are intended to be included within the scope of the present invention. It will be recognized that, in the drawings, only those signal lines and processor blocks necessary for the operation of the present invention are shown.

Referring to the drawings, depicted elements are not necessarily shown to scale, and like or similar elements are designated by the same reference numeral through the several views.

Referring to FIG. 1, shown is a block diagram representation of a computer 100 used for providing a node in a cluster of the present invention. The computer 100 has suitable hardware and operating system capabilities for providing networking capabilities for communication between different computers, or nodes, in a cluster 200. Each computer 100 used in the cluster has an executable core cluster services software component 102. The core cluster services software component 102 is a middle-ware layer having a set of executables and libraries that run on the resident operating system 104. The core cluster services are 32-bit and SMP ready. The core cluster services software component 102 has sub-components that include a portability layer 106, a cluster coordinator 108, a topology service 110, group services 112, and Cluster Search Query Language ("CSQL") services 114.

The portability layer 106 provides a set of common functions used by the other components to access the resident operating system 104 while also masking operating system-dependent implementations, and functions relating to Reliability-Availability-Serviceability ("RAS") facilities such as tracing and logging of computer operations. The portability layer 106 in effect encapsulates operating-system dependent interfaces. Accordingly, the remaining sub-components of the core cluster services software component 102 may interact with the operating system 104 without having to be structured to interact with the particulars of that operating system 104.

The cluster coordinator sub-component 108 provides software facilities for start-up, stop, and restart of the core cluster services 102. Each computer in the cluster has a cluster coordinator, but the individual cluster coordinators do not communicate with each other; the scope of each cluster coordinator sub-component 108 is restricted to the computer 100 on which it runs. The cluster coordinator sub-component 108 is executed first, and then it brings up the other core cluster services sub-components. Also, the cluster coordinator sub-component 108 monitors each of the other services, and restarts the core cluster services component 102 in the event of a failure.
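The start-up, monitoring, and restart behavior described above can be sketched as a small node-local supervisor. This is only an illustration under stated assumptions: the `Service` and `ClusterCoordinator` classes, their method names, and the in-memory `running` flag are hypothetical stand-ins for real processes.

```python
class Service:
    """Stand-in for one core cluster services sub-component (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class ClusterCoordinator:
    """Node-local supervisor; it never talks to coordinators on other nodes."""

    def __init__(self, services):
        self.services = list(services)
        self.restarts = 0

    def start_all(self):
        # The coordinator runs first, then brings up the other sub-components.
        for service in self.services:
            service.start()

    def stop_all(self):
        for service in reversed(self.services):
            service.stop()

    def check(self):
        # If any monitored service has failed, restart the whole component.
        if any(not s.running for s in self.services):
            self.stop_all()
            self.start_all()
            self.restarts += 1
```

A caller would invoke `check()` periodically; restarting every sub-component rather than only the failed one mirrors the statement that the coordinator restarts the core cluster services component as a whole.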

The topology services sub-component 110 exchanges heartbeat messages with topology services in other computers. Heartbeat messages are used to determine which nodes of a cluster are active and running. Each node of a cluster checks the heartbeat of its neighbor node. Through knowledge of the configuration of the cluster and alternate paths, the topology services sub-component 110 can determine if the loss of a heartbeat represents an adapter failure or a node failure. The topology services sub-component 110 maintains information about which nodes are reachable from other nodes, and this information is used to build a reliable messaging facility.
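One way to picture the adapter-versus-node distinction is the sketch below. The rule it uses is an assumption, not taken from the text: a neighbor that is silent on some adapters but still heard on an alternate path is treated as having an adapter failure, while a neighbor silent on every known path is treated as failed. The function name `classify` and the `last_heard` layout are hypothetical.

```python
def classify(last_heard, now, timeout):
    """Classify a neighbor node from per-adapter heartbeat times.

    last_heard maps an adapter name to the time a heartbeat was last
    received over that adapter (hypothetical layout).
    """
    # Adapters on which the heartbeat has been missing for too long.
    silent = [a for a, t in last_heard.items() if now - t > timeout]
    if not silent:
        return "alive"
    if len(silent) < len(last_heard):
        # Still heard over an alternate path: blame the silent adapter(s).
        return "adapter failure"
    # Silent on every known path: treat the node itself as down.
    return "node failure"
```

For instance, `classify({"en0": 5.0, "en1": 9.8}, 10.0, 1.0)` reports an adapter failure because heartbeats still arrive over the second path.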

The group services sub-component, or client, 112 allows the formation of process groups containing processes on the same or different machines in the cluster. A process can join a group as a provider or a subscriber. Providers participate in protocol action on the group while subscribers are notified on changes to the state of the group or membership in the group. The group services client 112 supports notification on joins and departures of processes to a process group. The group services client 112 also supports a host group that can be subscribed to in order to obtain the status of all the nodes in the cluster. This status is a consistent view of the node status information maintained by the topology services sub-component 110.
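The provider and subscriber roles can be sketched with a small in-memory model. The class `ProcessGroup`, its methods, and the tuple-shaped notifications are hypothetical; a real group services client would deliver these events over the reliable messaging facility rather than through local lists.

```python
class ProcessGroup:
    """In-memory model of a process group (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.providers = set()
        self.subscribers = {}  # process id -> list of received notifications

    def subscribe(self, pid):
        # Subscribers only observe; they take no part in protocol actions.
        self.subscribers[pid] = []

    def _notify(self, event):
        for inbox in self.subscribers.values():
            inbox.append(event)

    def join(self, pid):
        # Providers participate in protocol actions on the group.
        self.providers.add(pid)
        self._notify(("join", pid))

    def leave(self, pid):
        self.providers.discard(pid)
        self._notify(("leave", pid))
```

Each join or departure changes the provider set and is appended, in order, to every subscriber's notification list.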

With respect to the present invention, the group services client 112 provides cluster-aware functions to handle failure and reintegration of members in a process group. These functions are built on top of the reliable messaging facility, using either atomic broadcast or n-phase commit protocols.

The CSQL services sub-component 114 provides support for databases, which may contain configuration and status information. The CSQL services sub-component 114 can operate in stand-alone or cluster mode. The database of the CSQL services sub-component 114 is a distributed resource which, through the use of the group services client 112, is guaranteed to be coherent and highly available. Each database is replicated across all nodes and checkpointed to disk so that changes are retained across reboots of the core cluster services 102. The CSQL services sub-component 114 serves or provides each cluster node with an identical copy of data.
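Checkpointing a local replica so that it survives a restart might look like the sketch below. The storage format is an assumption: the text does not say how CSQL services write to disk, so this example simply serializes a dictionary as JSON and uses a write-then-rename pattern so that a crash cannot leave a half-written file. The names `checkpoint` and `restore` are hypothetical.

```python
import json
import os
import tempfile


def checkpoint(db, path):
    """Write the in-memory database to disk atomically (illustrative format)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(db, f)
        f.flush()
        os.fsync(f.fileno())
    # Replace the previous checkpoint in a single rename step.
    os.replace(tmp, path)


def restore(path):
    """Reload the last checkpoint, for example after a reboot."""
    with open(path) as f:
        return json.load(f)
```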

Referring to FIG. 2, shown is a block diagram representing a cluster 200. As an example, the cluster 200 represents an application with components operating on several nodes within the cluster 200. As shown, the cluster 200 has cluster nodes 202, 204, 206, 208, and 210, each executing a component of a software application. Each of the nodes is understood to be provided by a computer 100 as described in detail with respect to FIG. 1. Furthermore, each of the nodes 202, 204, 206, 208, and 210 is a member of the cluster 200 because each has a group services client application 112; these clients collectively provide the group services 212 for the cluster 200.

The members are coordinated by the group services 212. Each of the cluster nodes 202, 204, 206, 208, and 210 has a core cluster services software component 102 with a group services client 112 (see FIG. 1), and these nodes are peers with respect to each other.

The group services 212 is formed by the combination of the group services sub-components 112 of the cluster nodes 202, 204, 206, 208, and 210. The term "client" as used herein means, on a network, a computer that accesses shared network resources provided by another computer.

The group services 212 can also support entities known as subscribers. These are cluster nodes that do not directly participate with the group members in planning and executing recovery actions, but are interested in recovery actions taken by the group members.

Accordingly, the group services 212 of the present invention provides updates that are real-time representations that are stored as a replica or copy on each of the cluster nodes. The group services 212 also provides cooperative processes to coordinate the maintenance and recovery activities across the cluster 200. An example of an addition of a member or subscriber is shown in FIG. 2, where an application component on node 214 seeks to become a member of the cluster 200.

The inclusion of a node with respect to the present invention is a function of the shared resources of the cluster 200. For example, if the node 214 either lacks a data resource, such as a database, common to the other nodes of the cluster 200, or has an outdated database, the group services 212 coordinates the installation of a copy of the shared database.


Cluster functions are provided under an n-phase protocol. The n-phase protocol has a set of available votes, which for the present invention is the voting set of {CONTINUE, APPROVE, REJECT}. Each of the nodes participating in the cluster broadcasts a message having a header containing a VOTE field to convey the votes of the cluster nodes 202, 204, 206, 208, and 210, and membership seeking node 214. Such messaging formats are known to those skilled in the art. The term n-phase refers to the series of n broadcast/vote sequences generated by the members, or providers, of the cluster 200 to arrive at a consensus with respect to a proposed request.
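A broadcast/vote loop over the voting set can be sketched as follows. The way votes are combined is an assumption, since the passage above does not spell it out: here a single REJECT ends the protocol with rejection, any CONTINUE requests another phase, and unanimous APPROVE ends it with approval. The names `Vote`, `run_n_phase`, and `max_phases` are hypothetical.

```python
from enum import Enum


class Vote(Enum):
    CONTINUE = "CONTINUE"
    APPROVE = "APPROVE"
    REJECT = "REJECT"


def run_n_phase(voters, proposal, max_phases=10):
    """Run broadcast/vote phases until the providers reach a decision.

    voters is a list of callables taking (proposal, phase) and returning
    a Vote. Returns (decision, phases_used).
    """
    for phase in range(1, max_phases + 1):
        # Broadcast: every provider sees the same proposal, then votes.
        votes = [vote(proposal, phase) for vote in voters]
        if Vote.REJECT in votes:
            return Vote.REJECT, phase  # assumption: one REJECT vetoes
        if Vote.CONTINUE in votes:
            continue  # assumption: any CONTINUE asks for another phase
        return Vote.APPROVE, phase  # every provider voted APPROVE
    # Assumption: give up and reject if no consensus within max_phases.
    return Vote.REJECT, max_phases
```

With one voter that answers CONTINUE for the first two phases and APPROVE afterward, the loop finishes in three phases, so n is determined by the voters rather than fixed in advance.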

FIG. 3 is a flow chart of a database conflict resolution protocol ("DCRP") 300 executed by the nodes of the cluster 200. The DCRP 300 ensures that system resources accessed by the nodes of the cluster 200 are recent and valid among the nodes 202, 204, 206, 208, 210, and 214. The DCRP 300 is used with respect to a cluster resource having a distinct identifier. In the present example, the DCRP 300 is described with regard to a database having a timestamp as the distinct resource identifier.

At step 302, the DCRP 300 is started and an open_data_resource request is issued to the group services 212 (see FIG. 2) at step 304 by one of the cluster nodes 202, 204, 206, 208, 210, or 214 (see FIG. 2). The open_data_resource request contains the name of the requested database, and a timestamp provided by the node with respect to the local database copy stored on the requesting node.

The timestamp has three components: a timestamp portion, a node identifier, and a cyclical redundancy check ("CRC"), also referred to as a checksum. The open_data_resource request is broadcast to the cluster nodes 202, 204, 206, 208, 210, and 214. The term "local" as used herein describes a resource that is accessible on the cluster node at hand, rather than by remotely accessing another node for the information stored in the database. In this sense, the resources discussed herein are with respect to a distributed resource that stores information for the cluster 200 and that is kept consistent through local copies of the database on each of the nodes of the cluster 200.
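The contents of the request can be pictured as two small records. This is a sketch of one possible layout, not the message format of the invention: the class names, field names, and integer field types are assumptions, and only the three timestamp components and the database name come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaTimestamp:
    modified: int  # timestamp portion: last modification time
    node_id: int   # node identifier
    crc: int       # CRC (checksum) of the database contents


@dataclass(frozen=True)
class OpenDataResourceRequest:
    database_name: str           # name of the requested database
    timestamp: ReplicaTimestamp  # describes the requester's local copy


def make_open_request(name, modified, node_id, crc):
    """Build the request a node would broadcast for its local copy."""
    return OpenDataResourceRequest(name, ReplicaTimestamp(modified, node_id, crc))
```

Each receiving node could then compare `request.timestamp` with the timestamp of its own local copy of `request.database_name`.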

In general, checksums, or cyclic redundancy check values, may be maintained to better ensure database integrity. In the preferred embodiment, a single checksum for each database is maintained using an evaluation hierarchy from rows and columns to tables to the entire database. When a data item is updated, the checksums of the row and column containing the data item are updated. Some forms of checksum permit merging of the values computed for each row and the values computed for each column and arriving at the same checksum across a database table through either method. Other forms of checksum computation require a choice of merging either the values for every row or the values for every column. The merging may consist of computing checksums across those for every row or column. Some form of merging is performed on the checksums of all tables within the database to obtain a checksum for the database.
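The row-to-table-to-database hierarchy can be sketched with CRC-32 from Python's standard `zlib` module. This example takes the "merge the values for every row" option and merges by computing a CRC across the child checksums; the cell encoding, the separator byte, and the sorted table order are assumptions chosen to make the result deterministic.

```python
import zlib


def row_checksum(row):
    # CRC over the row's cells; a separator keeps ("ab", "c") distinct from ("a", "bc").
    data = "\x1f".join(str(cell) for cell in row).encode("utf-8")
    return zlib.crc32(data)


def merge(checksums):
    # Merge child checksums by computing a running CRC across them, in order.
    crc = 0
    for c in checksums:
        crc = zlib.crc32(c.to_bytes(4, "big"), crc)
    return crc


def table_checksum(rows):
    return merge(row_checksum(r) for r in rows)


def database_checksum(tables):
    # tables maps a table name to its list of rows; names are sorted for a stable order.
    return merge(table_checksum(tables[name]) for name in sorted(tables))
```

Because this sketch recomputes everything from scratch, it does not show incremental updates; a real implementation would cache the per-row and per-table values and refresh only those affected by a change.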

The checksum of a database can be appended as a low-order component of the last modification time used to resolve which copy of a database will be used as the master image, or copy, across a cluster 200. This ensures that two different copies of the database having identical modification dates, especially when those dates are kept with a low-resolution timer, will not be mistaken for being the same copy of the database.
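Appending the checksum as a low-order component can be sketched as a single integer key. The 32-bit checksum width, the whole-second modification time, and the names `master_key` and `pick_master` are assumptions made for illustration.

```python
def master_key(modified_seconds, crc32_value):
    """Combine modification time (high-order) and checksum (low-order)."""
    return (modified_seconds << 32) | (crc32_value & 0xFFFFFFFF)


def pick_master(copies):
    """Return the node whose copy has the greatest key.

    copies maps a node id to (modified_seconds, crc32_value). Assumption:
    copies with equal keys are identical, so the node id breaks the tie.
    """
    return max(copies, key=lambda n: (master_key(*copies[n]), n))
```

Two copies modified in the same second but with different contents yield different keys, so a coarse timer alone no longer makes them look like the same copy, while a later modification time always outranks any checksum value.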

The checksum of a database can be included as part of the result for a database update request, ensuring that all mem
