Summary of the invention
A server configured to deliver a service to at least one client connected to the server, the server being operable in either a master role or a slave role for each of a plurality of applications, the server comprising:
A network interface for connecting the server to at least one other similar server in a cluster of similar servers;
Service delivery logic, operable when the server is in the master role for an application, for hosting that application so as to deliver the service to the client;
Master logic, operable when the server is in the master role for an application, configured to replicate changes in the data of that application to a configurable number of servers of the cluster;
Slave logic, operable when the server is in the slave role for an application hosted on another server in the cluster, configured to receive replicated changes in data from the current master server of the cluster for that application and to maintain a version of the live application data for that application;
Control logic configured to detect an event in the cluster and, in response to the event, to autonomously switch the role of the server for one or more of the applications between slave and master, wherein a change of role from slave to master uses the maintained version to host the application.
A server can host one or more applications; that is, it can be the master server for one or more live applications. The same server can simultaneously act as a slave for one or more live applications hosted by a different server.
It will be apparent that the phrase "between master and slave" covers a change of role either from master to slave or from slave to master.
In an embodiment, the master logic can comprise a file system mount handler operable in a send mode to transmit the changes in data to the configurable number of servers of the cluster.
The master logic can comprise a snapshotter configured to take snapshots of the file system serving a currently hosted application.
The master logic can comprise at least one per-slave sender for replicating changes in the data of a live application hosted by the server to a respective server of the cluster.
The at least one per-slave sender can be instantiated by the snapshotter for each slave server, as required.
The slave logic can comprise a receiving replicator configured to receive the replicated changes in data, and a file system mount handler configured in a receive mode to maintain the version of the live application data.
The control logic can be configured to emit a periodic heartbeat signal indicating that the server is live in the cluster.
The control logic can be configured to receive heartbeat signals from the other similar servers in the cluster, and thereby to determine which servers are live in the cluster.
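The heartbeat mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class name `HeartbeatTracker`, the `timeout` parameter and the use of explicit timestamps are all introduced here for the example.

```python
class HeartbeatTracker:
    """Tracks which servers are considered live, based on heartbeat arrival times."""

    def __init__(self, timeout):
        # Seconds of silence after which a server is presumed gone (assumed policy).
        self.timeout = timeout
        self.last_seen = {}  # server name -> time of most recent heartbeat

    def record(self, server, now):
        """Record a heartbeat received from `server` at time `now`."""
        self.last_seen[server] = now

    def live_servers(self, now):
        """Return the set of servers heard from within the timeout window."""
        return {s for s, t in self.last_seen.items() if now - t <= self.timeout}


tracker = HeartbeatTracker(timeout=5.0)
tracker.record("A", now=0.0)
tracker.record("B", now=0.0)
tracker.record("A", now=4.0)
live = tracker.live_servers(now=6.0)  # B has been silent for longer than the timeout
```

Passing the clock value in explicitly keeps the sketch deterministic; a real server would read a monotonic clock instead.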
The control logic can be configured to detect an event selected from the following:
(i) failure of the current master server for an application;
(ii) a partition of the cluster;
(iii) a reduction in the number of servers in the cluster;
(iv) an increase in the number of servers in the cluster;
(v) the introduction into the cluster of a server for which a user has expressed a preference for hosting an application;
(vi) a change in the load of applications among the servers in the cluster, such that a load rebalancing event is required.
The control logic can be configured to send messages to, and receive messages from, other servers in the cluster, the messages conveying data from which an autonomous decision can be made about the role of the server for an application.
The messages can include binary data indicating the autonomous decision.
The control logic can be configured to detect messages from all live servers in the cluster, and to receive messages from all such servers before making a decision about its role for an application.
The network interface can be operable to maintain permanent connections to at least one other similar server in the cluster, over which messages between the servers can be exchanged.
The network interface, or a further network interface, can be configured to establish temporary sessions for transmitting the replicated changes in data.
The server can comprise a protocol handler operable to route requests for the service to the server when that server is hosting the live application.
According to a further aspect of the invention, there can be provided a system comprising a plurality of servers in accordance with any of the above server features.
According to a further aspect of the invention, there is provided a method of mounting, at a server, a file system holding data for a live application, the method comprising:
Prior to an event causing the application to be mounted at the server, receiving at the server changes in the live application data from the current master server hosting the application, and maintaining a version of the live application data;
In response to the event, the server identifying itself as the new master server and mounting the file system for the live application using its maintained version of the live application data;
Receiving requests for the application at the server and servicing the requests using the live application to deliver the service.
In an embodiment, the method can be used to recover from failure of the current master, and the failure can be detected autonomously by the slave that will form the new master server.
The method can be used to recover from failure of the current master server, and the failure can be detected autonomously by another server in a cluster of servers in which the master server and at least one other server are connected.
The method can be used to recover from a partition in a cluster of servers in which the current master server and at least one other server are connected. Following the partition, at least two servers can autonomously identify themselves as potential new master servers, and on recovery from the partition the potential new master servers can negotiate with each other and with the other servers of the cluster to determine whether the status of master server should be maintained or transferred.
The method can be used to manage load in a cluster of servers in which the master server is connected. The method can comprise detecting the number of servers in the cluster and their current application load, and exchanging messages with other servers in the cluster to migrate applications so as to balance the load.
After exchanging messages with other servers in the cluster to determine the version of the file system with the highest centre of mass metric, the server can identify itself as the new master server based on analysis of the snapshots of changes in the live application data which it has received.
Mounting the live application can comprise instantiating a replicator sending function for sending changes in the data of the newly mounted file system to at least one slave server in the cluster.
The current master can select a number of servers from the set of servers in the cluster as potential slave servers.
According to a further aspect of the invention, there is provided a method of managing a plurality of applications hosted by a cluster of servers, each server having an interface connectable to at least one client by a network, and each application delivering a service at the client, the method comprising:
Electing a server of the cluster as a master server, the master server hosting at least one live application;
While the master server is hosting the live application, replicating changes in the application data of the live application to a configurable number of servers in the cluster elected as slave servers, whereby each elected slave server maintains a version of the application data of the live application; wherein, in response to an event detected in the cluster, hosting of the application is transferred from the master server to one of the elected slave servers, determined without user intervention, and that elected slave server mounts the application using its current version of the application data and becomes the new master server.
In an embodiment, the event can be the detection of a preferred alternative master server in the cluster based on the loads of the servers in the cluster.
The event can be the detection of a preferred alternative master server based on the locality of servers in the cluster.
The event can be the detection of a preferred alternative master server in the cluster based on a predefined user preference.
A decision to migrate a live application from its current master server to one of its slave servers can be made when the load of the current master is greater than the average load of all servers in the cluster plus a damping factor (Q). Q, referred to herein as a "damping" factor (or "fudge" factor), is a value that prevents the servers in the cluster from constantly exchanging load.
The event can be detected by exchanging messages with other servers in the cluster.
The event can be the addition of a server to the cluster.
Before the addition of the server, the cluster can consist of a single server.
The event can be the removal of a server from the cluster, where the removal is anticipated and a controlled live migration is initiated.
When a server is added to the cluster, the new load of the servers can be determined, and a decision can be made as to which of the applications hosted by the cluster should be migrated to the newly added server.
The event can be the failure of a server in the cluster, where the data for the live applications hosted by the failed server can be recovered using the versions of the current application data held on servers of the cluster that continue to operate.
The event can be a partition of the cluster; after recovery from the partition, a preferred alternative master server can be selected from a plurality of potentially competing master servers as the server holding the more valuable version of the current application data.
A leader function hosted on one of the servers can determine the new master server for an application, where the server hosting the leader function can be different from the master.
According to a further aspect of the invention, there is provided a method of transferring an application from a master server, the master server receiving requests from a client for a service delivered by the application, the method comprising:
Prior to an event causing the transfer, replicating changes in the application state to at least one other server in the cluster;
In response to the event, autonomously pausing incoming requests at the master server for a period during which pending requests are handled;
And, after expiry of the period, where the pending requests have been handled, routing the requests to the at least one other server, at which a version of the application data for serving the application has been maintained.
In an embodiment, where the pending requests are not completed within the period, the requests are not routed to the at least one other server and the transfer of the application is abandoned.
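The pause-and-drain decision in the transfer method above can be sketched as follows. This is an illustration only: the function name `attempt_handover`, the outcome strings, and the representation of pending requests as completion times are assumptions made for the example.

```python
def attempt_handover(pending_completion_times, pause_window):
    """Decide the outcome of a handover attempt.

    pending_completion_times: for each in-flight request, the number of seconds
        after the start of the pause at which it finishes.
    pause_window: how long incoming requests are held back.
    """
    # The transfer proceeds only if every pending request drains within the window.
    if all(t <= pause_window for t in pending_completion_times):
        return "route_to_new_master"
    # Otherwise the paused requests stay with the current master.
    return "abandon_transfer"


fast = attempt_handover([0.1, 0.4], pause_window=1.0)
slow = attempt_handover([0.1, 3.0], pause_window=1.0)
```

A server with no pending requests drains immediately, so `attempt_handover([], 1.0)` also routes to the new master.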
After expiry of the period, the master server can autonomously adopt the role of slave for the application it previously hosted, by instantiating a replicator receiving function for receiving replicated changes in the application data.
A plurality of applications can be hosted by the master server, and the master server can replicate changes to a set of slave servers selected for each application.
The slave servers for each application can be selected based on at least one of load, user preference and locality.
A server can autonomously relinquish its role as master server on detecting a preferred alternative master server in the cluster of servers.
According to a further aspect of the invention, there is provided a method of hosting an application at a server, the server receiving requests from a client for a service delivered by the application, the method comprising:
Determining the number of modifications made in a time interval to the file system supporting the application;
Taking successive snapshots of the file system at configurable points in time, where the points in time depend on the number of modifications to the file system in the time interval; and
Sending the snapshots to a replicator for transmission from the server.
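One way in which snapshot timing could depend on the modification count is sketched below. The text does not specify the relationship, so the inverse-proportional rule, the function name `next_snapshot_delay` and the parameters `base_delay` and `min_delay` are all assumptions chosen for illustration.

```python
def next_snapshot_delay(modifications, base_delay=60.0, min_delay=1.0):
    """Return the delay in seconds before the next snapshot.

    Assumed policy: the busier the file system was in the last interval,
    the sooner the next snapshot is taken, down to a floor of min_delay.
    """
    return max(min_delay, base_delay / (1 + modifications))


idle = next_snapshot_delay(0)      # no modifications: wait the full base delay
busy = next_snapshot_delay(5)      # some modifications: snapshot sooner
flood = next_snapshot_delay(1000)  # heavy modification: clamped to the minimum
```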
According to a further aspect of the invention, there is provided a method of managing snapshots of a file system, where the file system is replicated across a plurality of servers connected in a cluster, the method comprising:
Identifying each snapshot by a snapnode object in the form of a binary sequence comprising a snapshot identifier, a parent pointer to an earlier snapshot on the specific server where the snapshot was taken, and the set of servers on which the snapshot is presently stored;
Storing, on each of the plurality of servers, a graph of the snapnode objects of the set of snapshots of the file system, one of the servers being the active master of the file system;
The active master taking a new snapshot of the file system and creating a snapnode object for the new snapshot which identifies the active master as a server on which the new snapshot is stored;
Transmitting the new snapshot to the other servers of the plurality of servers, and modifying the snapnode object to identify the other servers as servers on which the new snapshot is stored.
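The snapnode bookkeeping in the steps above can be sketched as follows. The field names, the dictionary used to hold the graph, and the helper functions `take_snapshot` and `record_replication` are hypothetical; the text fixes only the three pieces of information each snapnode carries.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SnapNode:
    snapshot_id: str                # identifier of this snapshot
    parent: Optional[str]           # identifier of the earlier snapshot, if any
    servers: set = field(default_factory=set)  # servers currently storing it


def take_snapshot(graph, snapshot_id, parent, master):
    """The active master creates a snapnode listing itself as the only holder."""
    graph[snapshot_id] = SnapNode(snapshot_id, parent, {master})


def record_replication(graph, snapshot_id, server):
    """After a transfer, add the receiving server to the snapnode's holder set."""
    graph[snapshot_id].servers.add(server)


graph = {}
take_snapshot(graph, "s1", None, "A")
take_snapshot(graph, "s2", "s1", "A")
record_replication(graph, "s1", "B")  # s1 has now been replicated to server B
```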
In an embodiment, the method can be used to manage recovery of the file system after an event in which the status of the active master is to be confirmed or changed.
The event can be a partition of the cluster of servers in which the active master and the other servers are connected. After recovery from the partition there can be at least two candidate master servers, each holding a graph of snapnode objects for the file system. The graph at each candidate master can be traversed to assess its value, and the candidate master whose graph indicates the highest value can adopt the role of new master for the file system.
Before the comparison is carried out, the snapshot data can be synchronised globally across the servers, so that the divergence of the version of the data at each candidate master can be assessed against the globally synchronised snapshot data.
The event can be the loss of at least one other server of the plurality of servers which was acting as a slave server to the active master. After a replacement slave has been designated by the master, the master can instruct the new slave to replicate the complete current state of the file system, so that replication can begin from the current point.
The method can comprise a step of saving snapshots from a given slice point in the graph to a local storage area.
The method can comprise a step of pruning snapshots.
The method can comprise a step of determining which action to take in order to resolve a divergence of the graphs on multiple servers representing the same file system, based on:
(1) the current master for the file system;
(2) the graph of snapnode objects for the global state of the file system;
(3) the current list of slave servers to that master for the file system.
The snapshot identifier can identify the time at which the snapshot was taken and the server on which the snapshot was taken.
In embodiments of any of the above servers or methods, a user interface can be presented to a user for permitting access, via the user interface, to snapshots selected by the user.
According to a further aspect of the invention, there can be provided a method of balancing load in a cluster of servers hosting a plurality of applications, the method comprising:
Determining the current load of each server;
Determining an average load taking into account the loads at the servers in the cluster;
Determining, for a server, whether its load is less than or greater than the average load plus a damping factor (Q);
Making a decision to migrate an application from the server when the load of the server is greater than the average plus the damping factor.
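The load-balancing test in the steps above can be sketched as follows. The function name `servers_to_offload` and the representation of loads as a dictionary are assumptions; the comparison itself (load greater than average plus Q) is taken from the text.

```python
def servers_to_offload(loads, q):
    """Return the servers whose load exceeds the cluster average plus Q.

    loads: mapping of server name -> current load (any consistent unit).
    q: damping factor that stops servers from trading load back and forth.
    """
    average = sum(loads.values()) / len(loads)
    return sorted(s for s, load in loads.items() if load > average + q)


# Average is 5; with Q = 2 only a load above 7 triggers a migration.
unbalanced = servers_to_offload({"A": 10, "B": 2, "C": 3}, q=2)
# Average is again 5, but no server exceeds 7, so nothing moves.
balanced = servers_to_offload({"A": 5, "B": 4, "C": 6}, q=2)
```

The second call shows the purpose of Q: small differences around the average do not trigger migration.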
According to a further aspect of the invention, there can be provided a computer program product comprising a computer-readable medium on which is stored a set of computer instructions which, when executed by a processing device, carry out operations in accordance with any of the above server or method features.
Embodiments of the invention provide mechanisms and apparatus for both mediating access to the hosted applications and controlling the data replication described above, so that applications can be seamlessly live-migrated between servers in response to changing load and per-application topological preferences.
Embodiments of the invention provide a stashing capability. Generally speaking, stashing occurs when a file system diverges. Divergence can be caused, for example, by a network partition, or by pruning that occurs while a slave server is offline, so that the latest snapshot on the failed and re-introduced slave is no longer a valid slice point for a new replication. Stashing results in part or all of the file system on a slave receiving a replication being stored in a special local storage area called the "stash", rather than in the main storage area where live file systems reside.
According to a further aspect of the invention, there is provided a system for dynamically migrating applications between servers, the system comprising a plurality of servers for hosting applications, each of the plurality of servers comprising a protocol handler for receiving requests for applications, where the protocol handler is configured to pause incoming requests for an application during migration of the application between servers.
The system can further comprise a load balancer for measuring the load on one of the plurality of servers caused by one or more applications hosted on that server, the load balancer being configured to initiate migration of one or more applications from the measured server to another server when a predetermined load condition of the measured server is met.
Each of the plurality of servers can have a controller which maintains a record of the server on which each application is currently hosted, and the protocol handler can be configured to inspect the record to determine the server to which an incoming application request is to be directed.
The protocol handler can be configured to pause incoming requests for an application and to terminate current requests for the application after a predetermined period of time.
Additionally or alternatively, the protocol handler can be configured to pause incoming requests for an application for a predetermined period of time and to release the paused requests if the current requests for the application have not completed within the predetermined period.
According to a further aspect in the invention, provide a kind of for the method for xcopy system between first server and second server before the subregion between first server and second server and afterwards, described method comprises: at first server place, predetermined point of time place after the modification of file system takes the snapshot of file system current state, and each snapshot is recorded in the current state of file system on server and the difference between the state of file system on the time point place server of previous snapshot; Once taking snapshot, the snapshot continuous replication taking in first server is arrived to second server; When subregion being detected, the two becomes main equipment and the new modification of acceptance to file system for file system the first and second servers; After the recovery of subregion, carry out renewal process with transaction file system, described renewal process comprises: which in sign first server and second server contains the most current version of file system; The server identifying is like this appointed as to master server and another server is appointed as from server; Sign is to master server with from the two common snapshot of server; And snapshot is subsequently copied to from server from master server.
Identifying which of the first server and the second server contains the most current (that is, the most valuable) version of the file system can comprise calculating a centre of mass metric for the version of the file system on each of the servers, the centre of mass metric representing the average age of the snapshots of the file system on each server and the number of changes to the file system represented by the snapshots on each server.
Identifying which of the first server and the second server contains the most current (that is, the most valuable) version of the file system can further comprise identifying, for each server, a set of snapshots of the file system, each set containing the snapshots that exist only on that server, and calculating the centre of mass metric for each server based on that server's set of snapshots.
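A possible shape for the centre of mass calculation is sketched below. The text says only that the metric reflects the average age of the snapshots and the number of changes they represent, computed over the snapshots unique to each server. The exact formula here (total changes, plus a change-weighted mean timestamp) and the tuple comparison used to pick a winner are assumptions made for illustration, as are all names.

```python
def unique_snapshots(own, other):
    """Snapshots held only by one server. Each dict maps id -> (timestamp, changes)."""
    return [own[k] for k in own.keys() - other.keys()]


def centre_of_mass(snapshots):
    """Return (mass, centre): total changes and change-weighted mean timestamp."""
    mass = sum(changes for _, changes in snapshots)
    if mass == 0:
        return (0, 0.0)
    centre = sum(t * changes for t, changes in snapshots) / mass
    return (mass, centre)


# s1 and s2 are shared; the a* and b* snapshots were taken during the partition.
server_a = {"s1": (100, 5), "s2": (200, 1), "a3": (300, 3), "a4": (400, 1)}
server_b = {"s1": (100, 5), "s2": (200, 1), "b3": (310, 2)}

com_a = centre_of_mass(unique_snapshots(server_a, server_b))
com_b = centre_of_mass(unique_snapshots(server_b, server_a))
# Assumed tie-break: more changes wins, then the more recent centre.
winner = "A" if com_a > com_b else "B"
```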
The update process can further comprise storing the snapshots of the slave server that were taken after the common snapshot.
According to a further aspect of the invention, there is provided a system for replicating a file system between a first server and a second server before and after a partition between the first server and the second server, the system comprising: snapshotting means for taking snapshots of the current state of the file system on the first server at predetermined points in time following modification of the file system, each snapshot recording the differences between the current state of the file system on the server and the state of the file system on the server at the time of the previous snapshot; replicator means for continuously replicating the snapshots taken on the first server to the second server as soon as they are taken; detection means configured such that, on detection of a partition, both the first and second servers become masters for the file system and accept new modifications to the file system; and updating means configured to perform an update process to update the file system after recovery from the partition, the update process comprising: identifying which of the first server and the second server contains the most current (that is, the most valuable) version of the file system; designating the server so identified as the master server and the other server as the slave server; identifying a snapshot that is common to both the master server and the slave server; and replicating subsequent snapshots from the master server to the slave server.
Identifying which of the first server and the second server contains the most current (that is, the most valuable) version of the file system can comprise calculating a centre of mass metric for the version of the file system on each of the servers, the centre of mass metric representing the average age of the snapshots of the file system on each server and the number of changes to the file system represented by the snapshots on each server.
Identifying which of the first server and the second server contains the most current (that is, the most valuable) version of the file system can further comprise identifying, for each server, a set of snapshots of the file system, each set containing the snapshots that exist only on that server, and calculating the centre of mass metric for each server based on that server's set of snapshots.
The update process can further comprise storing the snapshots of the slave server that were taken after the common snapshot.
The system can further comprise storage means for storing the snapshots taken of the file system, such that a previous snapshot of the file system can be selected by a user from the stored snapshots to restore the system to its state at the time of the selected snapshot.
The previous snapshot of the file system can be selectable by means of a user interface presented to the user.
According to a further aspect of the invention, there is provided computer software which, when executed by suitable processing means, causes the processing means to implement the systems and methods of the first, second and third aspects of the invention.
Detailed description of embodiments
Terminology
Figure 1A illustrates a schematic architecture of a computer system in which the various aspects of the invention discussed herein can usefully be implemented. It will readily be appreciated that this is only one example, and that many variations of server clusters can be envisaged (including a cluster of one server).
Figure 1A illustrates a set of servers 1 which operate as a cluster. The cluster is formed of two subsets: a first set in which the servers are labelled 1E, and a second set in which the servers are labelled 1W. The subsets can be geographically separated; for example, the servers 1E could be on the east coast of the United States, while the servers labelled 1W could be on the west coast. The servers 1E of subset E are connected by a switch 3E. The switch can be implemented in any form; all that is required is a mechanism by which each server in the subset can communicate with another server in that subset. The switch can be an actual physical switch with ports connected to the servers, or more probably a local area network or intranet. The servers 1W of the western subset are similarly connected by a switch 3W. The switches 3E and 3W are themselves interconnected via a network, which could be any suitable network capable of spanning the geographic distance. The Internet is one possibility. The network is designated 8 in Figure 1A.
Each server is associated with a local storage facility 6, which can be any suitable storage device, for example a disk or another form of memory. The storage facility 6 supports a file system 10. The file system 10 supports applications running on the server 1, which are for example delivering a service to one or more client terminals 7 via the Internet. Embodiments of the invention are particularly advantageous in the field of delivering web-based applications over the Internet. Embodiments of the invention are equally useful for supporting an email server, with the benefits discussed herein, where the file systems support mailboxes.
The servers and their associated file systems are substantially similar or homogeneous, in the sense that each can operate autonomously as master or slave, depending on the elected mode.
Defining objects:
Each client 7 can launch a software program, such as a web browser, for accessing an application (or database) delivered by one of the servers. The server hosting the application communicates with the client, for example to deliver a web page using the HTTP protocol. Herein, the term application is intended to cover any live software or data structure at a server which can be accessed by a client, and in particular, but without limitation, covers programs, databases and email (mailbox) or other messaging structures.
In this description we make use of basic mathematical notation, including:
Definitions: A := definition
Sets: {1, 2, 3} for unique elements 1, 2, 3
Mappings: [1 → A, 2 → B] for unique keys 1, 2
Ordered tuples: (1, 2, B), whose elements can be of differing types
Compound type definitions (named tuples): Type(A, B, C)
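For readers who prefer code to notation, each of these constructs has a direct counterpart in Python. The variable names and field values below are illustrative only; they do not appear in the specification.

```python
from collections import namedtuple

elements = {1, 2, 3}            # set of unique elements 1, 2, 3
mapping = {1: "A", 2: "B"}      # mapping with unique keys 1, 2
ordered = (1, 2, "B")           # ordered tuple with elements of differing types

# Compound type definition (named tuple) with fields A, B, C.
Type = namedtuple("Type", ["A", "B", "C"])
value = Type(A=1, B="x", C=3.0)
```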
Assumptions
We assume the existence of two underlying systems on which embodiments of the invention depend:
1. A file system 10 local to each server, which can contain arbitrarily many sub-file systems (one for each application or database). Note that in the following a file system supports an application or database and there can be more than one per server; that is, a sub-file system is hereinafter referred to as a "file system". The file system can be held on any suitable storage device, for example a disk. Each file system can have arbitrarily many consistent point-in-time snapshots, each named with a locally unique string, and there is furthermore a mechanism for replicating the difference between two snapshots from one machine to another. One example of a file system which satisfies these requirements is the open-source ZFS file system. As described more fully herein, embodiments of the invention use this ability of the file system to replicate differences from one server to another to allow servers to migrate applications and databases autonomously. Strictly by way of example, on server A there might be a file system F with snapshots {1, 2, 3}, and on server B there might be snapshots {1, 2} of the same file system. Note that a snapshot does not have to be stored and transmitted as an entire bit image of the file system: the file system allows us to replicate the difference between snapshots 2 and 3 (for example, only the blocks on disk which have changed) to bring server B up to date so that it contains snapshots {1, 2, 3}. However, although in the following embodiments it is convenient to transmit only differences, full images could be transmitted.
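The server A and server B example above can be sketched in a few lines. This models snapshots as names in an ordered list and a "difference" as a (previous, next) pair; it does not call ZFS or any real file system, and the function name `replicate_missing` is hypothetical.

```python
def replicate_missing(source, target):
    """Bring `target` up to date with `source`, sending only what it lacks.

    source, target: ordered lists of snapshot names (oldest first).
    Returns the list of transfers as (previous_snapshot, snapshot) pairs;
    a previous_snapshot of None stands for a full image rather than a difference.
    """
    transfers = []
    prev = None
    for snap in source:
        if snap not in target:
            transfers.append((prev, snap))  # send the difference prev -> snap
            target.append(snap)
        prev = snap
    return transfers


server_a = [1, 2, 3]
server_b = [1, 2]
sent = replicate_missing(server_a, server_b)  # only the 2 -> 3 difference is needed
```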
2. A group messaging service GMS, supported by the network 8, which allows messages M1, M2, ..., Mn to be sent between the servers in the cluster. Crucially, the group messaging service provides certain guarantees about messages which are broadcast to the group: even over lossy, high-latency network links, message delivery is guaranteed to all currently live members of the group, and message ordering is logically consistent across all servers. Strictly by way of example, if server A sends one message to the group and server B simultaneously sends another message, all members of the group, including A and B, will receive the messages in the same order. One example of a group messaging system which satisfies these requirements is the open-source Spread toolkit.
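The ordering guarantee described above can be illustrated with a toy, single-process model. This is not the Spread API and makes no attempt at networking or fault tolerance; the class `ToyGroup` is a hypothetical stand-in that serialises every broadcast through one point so that all members observe the same sequence.

```python
class ToyGroup:
    """In-memory stand-in for a totally ordered group messaging service."""

    def __init__(self, members):
        self.inboxes = {m: [] for m in members}  # member name -> received messages

    def broadcast(self, sender, payload):
        # Every member, including the sender, receives the message, and because
        # broadcasts are applied one at a time the order is identical everywhere.
        for inbox in self.inboxes.values():
            inbox.append((sender, payload))


group = ToyGroup(["A", "B", "C"])
group.broadcast("A", "M1")
group.broadcast("B", "M2")
```

In a real deployment the hard part is agreeing on that single order across unreliable links, which is exactly what the group messaging service is assumed to provide.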
Figure 1B is a schematic diagram of a single server 1. The server comprises a processor suitable for executing instructions to deliver the different functions discussed more fully herein. In addition, the server comprises memory (not shown) for supporting the operation of the processor. This memory is distinct from the storage facility 6 supporting the file system 10. As will readily be understood from the following, a server 1 can support multiple applications at any given time. These are shown in diagrammatic form by the circles labelled app. The app shown cross-hatched denotes an application which has recently been mounted on the server 1 as a result of the application migration process discussed later. The app shown in dotted lines illustrates an application which has just been migrated away from the server 1. The server 1 supports a protocol handler 5, which cooperates with similar protocol handlers of the other servers in the cluster to provide the distributed protocol handler mechanism discussed more fully herein. The server 1 also supports a group messaging protocol 4. The protocol handler 5 receives requests from the client terminals 7 over the network 8 and is responsible for routing each request to the application to which it is addressed, and for returning to the client the service data provided by the application so as to deliver the service to the client 7. The group messaging protocol 4 is responsible for exchanging messages m1 to m4 with the other servers in the cluster. Note that these messages can be exchanged locally through the local switches 3E/3W and/or remotely via the network 8. The server 1 also supports sending and receiving replicators 34/38, which are responsible for sending and receiving snapshots of the application data in the file systems supporting the applications. Note that Figure 1B is highly schematic, particularly in the sense that the server 1 can have a single port or multiple ports over which all the internal and external communication paths are provided. Thus, although it is possible to provide dedicated ports for the delivery of the service to client terminals, for messages, and for the receipt and sending of snapshots, this is not necessary, and all of this could occur over the same physical port of the server. Note at this juncture that the relationship between the server 1 and the storage facility 6 is not discussed further herein, because any suitable file system arrangement could be used. Therefore, although shown as a separate module, the file system storage facility could be integrated within the server if desired or necessary.
It will be understood that the structures and functions defined in the following can be implemented in any suitable combination of hardware, firmware or software. In particular, the functions can be implemented by suitable code running on the processor of the server. Thus, when used herein, the term module does not necessarily mean a separate physical module, but can also denote an architectural description of a set of instructions providing a particular function.
Overview
During normal operation, the system elects one master server for each application, so that each application is hosted on exactly one server in the cluster. Changes that occur to the application on the master are replicated asynchronously to n slave servers for that file system, giving n+1 copies of the file system in total. This makes the system n-redundant, because it can tolerate the failure of n servers. A change to an application is a change in application or database state (data), as recorded in the file system for that application while it is live.
Replication system
SnapshotGraphForest data structure
In this example, the SnapshotGraphForest provides the ability to replicate data between the servers of the cluster under arbitrary failure and partition conditions. This data structure represents the global state of a given file system across all servers in the cluster.
We begin with the simple case of a cluster with one file system F.
Fig. 1 shows server A replicating a second snapshot 2 to server B, which already holds snapshot 1. Server A is the master server hosting the application whose file system is F, and it is replicating changes in state to at least one elected slave server B.
A snapshot graph forest is a set of snapshot graphs G. A snapshot graph is a directed acyclic graph (DAG) of snapshot nodes. A snapshot node is a specific, globally unique version of the file system, including a set of parent edges that identify the position of the snapshot node in the graph.
The graph is a DAG because a snapshot node can have multiple parents and multiple children. It is acyclic because a parent snapshot is always older than a child snapshot, so a cycle can never form in the graph.
Each snapshot node is defined by an object of type SnapNode(id, [id_p → (srvs, count, imm)]), where id is the globally unique snapshot identifier; id_p is the parent pointer, which refers to the id of the earlier snapshot on which this snapshot is based on the particular servers holding it (this can be NULL if it is the first snapshot, in which case the snapshot is said to be based on the origin); srvs is the set of servers on which the snapshot is currently stored; count represents the number of file system modifications captured by the snapshot relative to its parent snapshot; and imm represents whether the given snapshot is immutable on a given server (that is, whether it may be deleted). We will ignore imm until we discuss pruning later. The snapshot identifier identifies where (the host server) and when (the timestamp) the snapshot was taken.
Observe that a SnapNode object can represent the file system state on multiple servers simultaneously, and captures the fact that the parent snapshot of a given snapshot may differ from server to server even though the data captured by the snapshot is identical.
A snapshot graph is defined as SnapGraph(set of snapshot nodes), where all snapshot nodes in the graph are reachable via the parent and child pointers of those nodes.
In the example of Fig. 1, before the replication indicated by arrow R, the forest contains a graph G:
Snapshot 1 is the initial snapshot; it is stored on both A and B, with two changes recorded between the origin and this snapshot. Snapshot 2 is based on snapshot 1 (it has snapshot 1 as its parent) and has a copy only on server A. The changes are recorded in file system F as a result of the live application executing at server A.
The complete snapshot graph forest for this configuration is SnapForest({G}). That is, there is only one graph G in this forest (there are no completely disconnected sets of snapshot nodes; all nodes are connected to all other nodes).
After snapshot 2 has been replicated to B, the graph G' has the new state:
Note that B now has a copy of snapshot 2, indicated in bold above.
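The SnapNode definition and the Fig. 1 example can be sketched in Python as follows. This is only an illustrative sketch: the class and field layout, the use of the letters "A" and "B" in place of IP addresses, and the choice of a single boolean for imm are assumptions made for the example. The count of 3 on snapshot 2 is the weight used for that edge in the centre-of-mass example later in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class SnapId:
    timestamp: int  # milliseconds since the UNIX epoch
    server: str     # server that took the snapshot (an IP address in the text)


@dataclass
class ParentEdge:
    srvs: Set[str]     # servers currently storing the snapshot with this parent
    count: int         # modifications captured relative to the parent
    imm: bool = False  # immutability flag, ignored until pruning is discussed


@dataclass
class SnapNode:
    id: SnapId
    # parent id (None means "based on the origin") -> edge data
    parents: Dict[Optional[SnapId], ParentEdge] = field(default_factory=dict)

    def servers(self) -> Set[str]:
        """All servers holding a copy of this snapshot, whatever its parent."""
        return set().union(*(edge.srvs for edge in self.parents.values()))


s1, s2 = SnapId(1, "A"), SnapId(2, "A")
graph_g = {
    s1: SnapNode(s1, {None: ParentEdge({"A", "B"}, count=2)}),
    s2: SnapNode(s2, {s1: ParentEdge({"A"}, count=3)}),
}
before = graph_g[s2].servers()  # snapshot 2 is held by A only

# Replicating snapshot 2 to B only extends the srvs set on the existing edge.
graph_g[s2].parents[s1].srvs.add("B")
after = graph_g[s2].servers()   # snapshot 2 is now held by A and B
```

Keying the parent edges by parent id is what lets one node describe the same snapshot data with different parents on different servers.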
Diverged graphs
Fig. 2a illustrates a partitioned server cluster.
Consider a cluster that goes from a single group of servers {a_1, ..., a_m, a_m+1, ..., a_n} (where n > m) to being partitioned into two groups of servers L: {a_1, ..., a_m} and R: {a_m+1, ..., a_n}. In practice a failure may cause arbitrarily many partitions, but we describe the case of two partitions, which generalizes to arbitrarily many.
Observe that in fact all failures can be generalized to partitions. For example, the failure of a single server a_i can be regarded as a partition into the groups {a_j | j ≠ i} and {a_i}. The failure of a network switch can be regarded as a partition into num-ports many groups, each containing a single server.
During a partition, each side of the partition elects new masters for all available file systems. The data on the two sides of the partition may then begin to diverge as changes are made to the file systems on both sides.
Fig. 2 shows the same cluster as before, but with a network partition. Servers A and B can no longer talk to each other, so each elects itself as the new master for the file system F in question. Both servers may then observe modifications (changes) to their copy of file system F; server A may take snapshot 3, which captures 1 modification, and server B may take snapshot 3', which captures 4 modifications.
The global state of the SnapshotGraphForest for this system is now:
That is, there are now four SnapNode objects, one for each distinct file system state captured by the system. Because snapshots 3 and 3' both have snapshot 2 as their parent, the file system state is said to have diverged. Note that only after the network partition has been recovered and A and B can communicate again can they discover this complete graph, by sending messages that include their file system state.
We now consider one final example, which shows why it may be necessary to be able to express a forest of completely disconnected graphs. Suppose that servers A and B remain disconnected and that users on both sides of the partition happen to add a file system G with the same name on each side. Suppose that the system then takes initial snapshots:
on A's side of the partition
on B's side of the partition
The resulting snapshot graphs are not connected, so the forest contains two disconnected graphs:
Multiple graphs may also arise when one server A is offline for long enough that the other server B has deleted all common snapshots of the file system by the time A comes back online.
It is sometimes useful to refer to a local forest, which contains only information about the file system on a particular server. Observe that a local forest always consists of a single linear graph with no divergences, because the file system on a single server must always have a linear structure from the earliest snapshot to the latest.
Finally, a note about snapshot identifiers (id). These are defined as the tuple SnapId(timestamp, server), where timestamp is the number of milliseconds since the UNIX epoch and server is the globally unique primary IP address of the server that took the snapshot. Note the difference between the server field of the SnapId, which describes where the snapshot was originally taken, and the srvs field of the SnapNode, which indicates where copies of the snapshot are currently stored.
Exploring the SnapshotGraphForest: calculating divergences, heads, centre of mass and candidate masters, and finding updates
Given a global snapshot graph forest representing the current global state of the file systems on the cluster, the aim of the system is to perform operations on the local file systems on each server in order to return to a globally consistent state in which replication of changes from master to slaves can continue.
The operations that we can perform on a file system (called manipulator operations) are:
1. Snapshot: take a new snapshot.
2. Send: send one or more incremental snapshots, or a full file system, from one server to another.
3. Receive: receive snapshots (an incremental update or a full replication stream) sent from a master server to a slave server.
4. Stash: stash (save) the snapshots from a given "slice point" to a local stash area.
5. Prune: prune (delete) a snapshot.
Here we describe a process that can be used to detect divergences and decide which actions to perform. An important action is to determine, in response to a disruptive event such as a failure or a partition, how to elect a new master server for a live application so that it continues without significant interruption.
First we define a traversal function (traverse) which, given a starting node (a snapshot identifier), visits each SnapNode in its connected graph via its parent pointers and its (inferred) child pointers. It constructs mappings of child pointers and parent pointers and then performs a search of the graph reachable from the starting node, remembering which snapshots it has already seen in order to avoid loops.
From this we can define a graphs function which, given a set of SnapNode objects, removes a SnapNode from the set and adds its complete graph to the set of graphs, until no SnapNode objects remain. It thereby turns an unstructured set of snapshot nodes into a set of snapshot graphs by establishing which nodes are interconnected.
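The two functions just described can be sketched as follows. This is a minimal sketch under simplifying assumptions: the forest is reduced to a dictionary from snapshot id to the set of its parent ids (an empty set meaning "based on the origin"), and the ids "a1" and "b1" for the two unrelated initial snapshots are hypothetical names chosen for the example.

```python
from collections import defaultdict


def traverse(parents, start):
    """Return every snapshot id reachable from `start` via parent or child edges."""
    # Child pointers are not stored, so infer them from the parent pointers.
    children = defaultdict(set)
    for node, node_parents in parents.items():
        for parent in node_parents:
            children[parent].add(node)

    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:  # remember visited snapshots to avoid looping
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
        stack.extend(children.get(node, ()))
    return seen


def graphs(parents):
    """Split an unstructured set of snapshot nodes into connected graphs."""
    remaining = set(parents)
    result = []
    while remaining:
        component = traverse(parents, next(iter(remaining)))
        result.append(component)
        remaining -= component
    return result


# The diverged graph of Fig. 2 plus two unrelated initial snapshots.
forest = {"1": set(), "2": {"1"}, "3": {"2"}, "3'": {"2"}, "a1": set(), "b1": set()}
components = sorted(sorted(c) for c in graphs(forest))
```

Here components holds three graphs: the four connected snapshots of file system F, and one single-node graph for each of the two disconnected initial snapshots.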
We can now define a heads function to calculate which snapshots in a given graph are the competing most recent versions of each file system, the "heads" of the divergences. Given a graph as calculated by graphs, the heads of that graph are exactly the elements of the graph that have zero children.
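As a brief illustration, heads can be computed by collecting every snapshot that appears as somebody's parent and returning the rest. The dictionary representation (snapshot id to set of parent ids) is an assumption made for this sketch.

```python
def heads(parents, graph):
    """Return the snapshots in `graph` that have no children."""
    has_child = {parent for node in graph for parent in parents[node]}
    return {node for node in graph if node not in has_child}


# Diverged graph of Fig. 2: snapshots 3 and 3' both have snapshot 2 as parent.
parents = {"1": set(), "2": {"1"}, "3": {"2"}, "3'": {"2"}}
diverged_heads = heads(parents, set(parents))
```

A graph with more than one head is a diverged graph; a graph with exactly one head has a single most recent version.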
We can define the restricted graph with respect to a server as the set of snapshot nodes restricted to those snapshots of which the given server has a copy. Thus in Fig. 2 the complete graph is {1, 2, 3, 3'}, but the graph restricted to server A is {1, 2, 3} and the graph restricted to B is {1, 2, 3'}. Note that a snapshot node in a restricted graph always has exactly one parent edge.
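A restricted graph can be sketched as a filter over the parent edges. In this sketch the graph representation (snapshot id to a mapping from parent id to the set of servers holding the snapshot with that parent) is an assumption, with None standing for the origin.

```python
def restrict(graph, server):
    """Return {snapshot: parent} for the snapshots that `server` holds."""
    restricted = {}
    for node, edges in graph.items():
        for parent, srvs in edges.items():
            if server in srvs:
                # A server stores a snapshot with exactly one parent.
                restricted[node] = parent
    return restricted


# Global state of Fig. 2 after the partition.
fig2 = {
    "1": {None: {"A", "B"}},
    "2": {"1": {"A", "B"}},
    "3": {"2": {"A"}},
    "3'": {"2": {"B"}},
}
on_a = restrict(fig2, "A")
on_b = restrict(fig2, "B")
```

The result for each server is a simple linear chain, which matches the observation that a local forest never contains divergences.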
We can now define a centreOfMass function on a restricted graph, which calculates a weighted sum: a time-like value that is the average timestamp of all snapshots in the restricted graph, weighted by the number of modifications on the single parent edge of each node. Intuitively, a graph with a more recent centre of mass is more valuable than a graph with an older centre of mass, because a more recent centre of mass corresponds to more recent and more significant changes.
The following formula can be used to calculate the centreOfMass of a graph G restricted to server A:
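Written out from the verbal definition given in the next paragraph, the formula takes the following form. The notation is an assumption made for this rendering: G|_A denotes the graph G restricted to server A, time(g) the timestamp of snapshot g, and weight(g) the modification count on the single parent edge of g.

```latex
\[
\operatorname{tail}(G|_A) = \{\, g \in G|_A : \operatorname{parent}(g) \neq \text{origin} \,\}
\]
\[
\operatorname{centreOfMass}(G|_A) =
\frac{\displaystyle\sum_{g \in \operatorname{tail}(G|_A)} \operatorname{weight}(g)\,
      \tfrac{1}{2}\bigl(\operatorname{time}(g) + \operatorname{time}(\operatorname{parent}(g))\bigr)}
     {\displaystyle\sum_{g \in \operatorname{tail}(G|_A)} \operatorname{weight}(g)}
\]
```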
First we define the tail of a restricted graph simply as all snapshots in the graph other than the first snapshot. This is because the midpoint of a snapshot node g and its parent is defined only when parent(g) is not the origin. We can then define the centreOfMass of a restricted graph as the sum, over the snapshots in the tail of the graph, of the midpoint in time between each snapshot and its parent, weighted by the weight of that snapshot (the number of changes between the snapshot and its immediate parent), divided by the total weight of the tail of the graph.
As an example, consider which of the restricted graphs in Fig. 2 has the highest centre of mass: the graph restricted to A has centreOfMass (3*(2+1)*0.5 + 1*(3+2)*0.5)/(3+1) = 1.75, and the graph restricted to B has centreOfMass (3*(2+1)*0.5 + 4*(3+2)*0.5)/(3+4) = 2.071. Intuitively, the graph restricted to B wins, and B should be elected as the new master (because its data captures a greater weight of recent changes). Note that we do not count the weight between snapshot 1 and the origin, but this does not matter because it is equal in both cases.
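The worked example can be reproduced with a few lines of Python. As in the example itself, snapshot numbers double as timestamps; the tuple layout and the choice to return 0.0 for a graph with an empty tail are assumptions of this sketch.

```python
def centre_of_mass(restricted):
    """restricted: {snapshot_time: (parent_time or None, modification_count)}."""
    # The tail excludes the first snapshot, whose parent is the origin.
    tail = [(t, p, c) for t, (p, c) in restricted.items() if p is not None]
    total_weight = sum(c for _, _, c in tail)
    if total_weight == 0:
        return 0.0  # assumption: a graph with no weighted tail scores lowest
    return sum(c * (t + p) * 0.5 for t, p, c in tail) / total_weight


# Restricted graphs of Fig. 2: snapshot 3 captures 1 change, 3' captures 4.
graph_a = {1: (None, 2), 2: (1, 3), 3: (2, 1)}
graph_b = {1: (None, 2), 2: (1, 3), 3: (2, 4)}

com_a = centre_of_mass(graph_a)  # (3*1.5 + 1*2.5) / 4 = 1.75
com_b = centre_of_mass(graph_b)  # (3*1.5 + 4*2.5) / 7, approximately 2.071
```

Because com_b exceeds com_a, server B would be preferred as the new master in this example.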
To formalize this intuition, we define a chooseCandidateMasters function, which allows the system to handle the situation in which two or more servers have become competing masters for a file system as a result of a network partition. When the network partition is recovered, the servers observe that they are in competition by exchanging lists of which file systems each server believes it is master for and which it is not (called current masters messages). In addition, they exchange the snapshot data necessary to construct the global forest, in order to decide which server should continue as master.
The chooseCandidateMasters function operates as follows: given a graph, it calculates the set of servers involved in the graph (that is, those that hold a copy of any snapshot node in the graph) and, for each such server, calculates the graph restricted to that server. For each restricted graph it calculates the centre of mass, and finally it returns the set of servers that tie for the maximum centre of mass.
When servers detect, by examining their current masters messages, that they are both currently masters, they both run the chooseCandidateMasters function on the globally synchronized snapshot data. Whichever server finds that it is the best candidate master asserts ownership of the site, and the other servers yield to the new master (they become slaves). If there is a tie, one server is chosen arbitrarily, by taking the server with the lexicographically lowest IP address.
If a slave observes that its graph is completely disconnected from that of the master (separate graphs), the weights of the disconnected segments are compared, and the winning side (the new master) instructs the losing side (the new slave) to stash its entire file system so that replication can start from scratch. That is, if there is no common snapshot between master and slave (the graphs are "completely disconnected"), the slave must stash the whole file system and the master must replicate the entire history, from the NULL snapshot up to the latest snapshot.
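The election described above might be sketched as follows. The flat edge-list representation, the IP addresses and the helper name elect are assumptions made for the example; the tie-break compares IP addresses as strings, which is what "lexicographically lowest" implies.

```python
def choose_candidate_masters(times, edges):
    """times: {snap_id: timestamp}; edges: [(snap_id, parent_id, servers, count)]."""
    servers = set().union(*(srvs for _, _, srvs, _ in edges))
    scores = {}
    for server in servers:
        weighted = total = 0.0
        for snap, parent, srvs, count in edges:
            # Restrict to this server and skip edges from the origin (the tail rule).
            if server in srvs and parent is not None:
                weighted += count * (times[snap] + times[parent]) * 0.5
                total += count
        scores[server] = weighted / total if total else 0.0
    best = max(scores.values())
    return {server for server, score in scores.items() if score == best}


def elect(candidates):
    """Break ties by taking the lexicographically lowest IP address string."""
    return min(candidates)


A, B = "10.0.0.1", "10.0.0.2"
times = {"1": 1, "2": 2, "3": 3, "3'": 3}
edges = [
    ("1", None, {A, B}, 2),
    ("2", "1", {A, B}, 3),
    ("3", "2", {A}, 1),
    ("3'", "2", {B}, 4),
]
candidates = choose_candidate_masters(times, edges)
new_master = elect(candidates)
```

For the Fig. 2 data the only candidate is the server standing in for B, in agreement with the centre-of-mass values 1.75 and 2.071 computed in the text.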
We can now define a process findUpdates which, given the following arguments: 1. the server that has been elected as master, 2. the complete SnapshotGraphForest for the global state of the file system, and 3. a list of slave server names, decides which actions to take in order to resolve divergences on those slaves and allow normal replication to continue.
The findUpdates function works by using the traverse function, starting at the most recent snapshot id on the current master (master_head) and working backwards, visiting each (parent, child) pair. As soon as it finds a snapshot in common with a given slave, it knows that this parent is the "slice point" for that slave, so it records the update slave → (snapshot_id, master_head). The output of findUpdates is therefore a set of replication actions:
{slave → (start_snapshot_id, end_snapshot_id)}
This corresponds to the actions that need to be taken to bring the slaves (machines that have any copy of a file system with the same name, and that may have some common snapshots on which to base a replication) up to date with the master. This may require a slave to stash some data in case its data has diverged, in which case start_snapshot_id corresponds to a non-head snapshot on the slave. Otherwise it is the most recent ("head") snapshot on the slave, and the replication event is known simply as a "fast-forward" update.
The start and end snapshot nodes may be more than one edge apart in the graph, because the underlying file system can send more than one snapshot in a single replication event.
In the unlikely case where there is no divergence but the given master has an earlier head snapshot than the slave (that is, the snapshots on the slave back to the first common snapshot are a strict superset of the snapshots on the master), the master is respected and the slave is instructed to stash the file system back to the point from which the master can continue replicating. This special case is expressed as an update in which start_snapshot_id and end_snapshot_id are identical. This should not occur in practice.
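A simplified sketch of findUpdates is given below. It is not the full procedure: instead of the complete forest it takes each server's local linear graph as a {snapshot: parent} mapping (None for the origin), and the helper names head_of and needs_stash, the server names C and D, and the choice to skip slaves that are already up to date are assumptions made for illustration.

```python
def head_of(local):
    """Most recent snapshot of a linear local graph: the one that is nobody's parent."""
    return (set(local) - set(local.values())).pop()


def find_updates(master, slaves, local):
    """local: {server: {snapshot: parent or None}}. Returns {slave: (start, end)}."""
    master_head = head_of(local[master])
    updates = {}
    for slave in slaves:
        held = local[slave]
        if held and head_of(held) == master_head:
            continue  # assumption: an up-to-date slave needs no action
        snap = master_head
        # Walk backwards from the master's head to the first common snapshot.
        while snap is not None and snap not in held:
            snap = local[master][snap]
        # snap is the slice point; None means replicate everything from the origin.
        updates[slave] = (snap, master_head)
    return updates


def needs_stash(held, start):
    """True unless `start` is the slave's head, i.e. unless this is a fast-forward."""
    return bool(held) and head_of(held) != start


local = {
    "A": {"1": None, "2": "1", "3": "2"},   # diverged side of Fig. 2
    "B": {"1": None, "2": "1", "3'": "2"},  # elected master
    "C": {"1": None, "2": "1"},             # slave that is merely behind
    "D": {},                                # newly added slave with no data
}
updates = find_updates("B", ["A", "C", "D"], local)
```

In this example A must stash snapshot 3 before receiving from slice point 2, C receives a fast-forward update from 2, and D receives the whole history from the origin.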
The master runs the findUpdates function and sends the result, for each slave, as an instruction to that slave to begin replication (a replicate message). We now cover the details of how replication proceeds, in terms of the state transitions between the participating components on the master and its slaves.
Stashed data can optionally be made available to the user, in case the user wishes to recover data from the losing side of the partition.
Replicator
Overview
As shown in Fig. 3, five types of object participate in the mounting and unmounting of file systems, the taking of snapshots, the replication of data to slaves and the pruning of snapshots. These objects can be implemented as software on a suitably programmed processor, or in hardware, firmware, state machines or any other manner.
1. Controllers 30, of which there is exactly one per server. A controller 30 is responsible for synchronizing global state across all servers, electing masters, adding and removing slaves, and brokering communication between the state machines and the group messaging protocol. It also implements load balancing in terms of live migration.
2. Mount handlers 32, which handle the safe mounting and unmounting of file systems. These exist on both masters and slaves, one per file system.
3. Snapshotters 34, which exist on masters (one per file system) and which receive notifications that a file system has been modified and decide when to take new snapshots.
4. Per-slave sending replicators 36, which exist on masters (one per slave per file system) and which communicate with receiving replicators 38 through the group messaging protocol 4 (via the controller; note that this path is not shown in Fig. 3) in order to mediate the transmission of snapshot data from master to slave according to the results of the findUpdates function on the snapshot graph.
5. Receiving replicators 38, which communicate with the per-slave sending replicators to mediate the reception of snapshot data from master to slave.
Fig. 3 shows one possible configuration, in which there are three servers A, B and C and two file systems F and G. This diagram corresponds to the following currentMasters mapping:
[F → server A,
G → server B]
In this example, server A is the master for file system F and server B is the master for file system G. Server C is a slave for both file systems, and the cluster is configured to replicate file system data to two slaves per file system. The thick lines in Fig. 3 represent the flow of file system snapshot data.
Controllers and mount handlers
Each controller 30 has a file system mount handler 32 for each file system, and each file system mount handler is in one of two states, receiving or sending. If a mount handler 32 is in receiving, its file system (for example G on server A) is unmounted and it has a receiving replicator 38. If a mount handler is in sending, its file system is mounted (for example F on server A) and it has a sending replicator 34. The file system F is being actively changed by the application and snapshotted by the snapshotter 34, and the per-slave replicators of the sending replicator (for example 36B, 36C, one per slave) are responsible for sending the snapshots to the waiting receivers.
The following flowcharts represent the operation of the objects described above.
The thick lines in Figs. 4, 5 and 6 correspond to the usual success case; the other lines correspond to error handling or partition recovery states.
Snapshotter states
See Fig. 4.
The snapshotter 34 receives notifications of file system modifications and schedules when snapshots are to be taken. When a snapshot is taken, it informs its per-slave sending replicators 36 that they should check whether to initiate a replication event to their slaves, each of which has a receiving replicator 38 set up ready to receive.
It begins in the LOADING state, which means that it is interrogating the file system for its current snapshot state and loading it into its forest. When this completes, it enters the READY state. When it reaches the READY state, it informs the controller 30 of the new state, and the controller broadcasts the new state to the other nodes in the cluster. When a scheduled snapshot is due, it enters the SNAPSHOTTING state for the duration of the snapshot.
It maintains a global forest 35 (Fig. 3), which represents the global state of the snapshot data for that file system on all nodes. It is informed about the state of other servers through the informGlobalState interface, which its controller calls when it receives an update about global state from another server in the cluster.
The scheduling of snapshots in response to modification notifications works as follows:
If the file system receives only one modification, it is snapshotted within the SNAPSHOT_QUICK timeout, which is based on the previous period between modifications.
If the file system receives many modifications within the SNAPSHOT_QUICK interval, it takes a snapshot at the SNAPSHOT_INTERVAL timeout, which is longer.
This means that if a file system is heavily modified it is snapshotted every SNAPSHOT_INTERVAL seconds, whereas if it is modified only once it is snapshotted within SNAPSHOT_QUICK seconds. Sample values are 30 seconds for SNAPSHOT_QUICK and 60 seconds for SNAPSHOT_INTERVAL.
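One way to read the scheduling rule is sketched below. It is an interpretation rather than the described implementation: it uses the fixed sample values instead of deriving SNAPSHOT_QUICK from the previous period between modifications, and the function name snapshot_deadline and its list-of-timestamps argument are hypothetical.

```python
SNAPSHOT_QUICK = 30     # seconds, sample value from the text
SNAPSHOT_INTERVAL = 60  # seconds, sample value from the text


def snapshot_deadline(modification_times):
    """Time at which pending modifications should be snapshotted.

    modification_times: sorted timestamps (in seconds) of the modifications
    notified since the last snapshot. Returns None if nothing is pending.
    """
    if not modification_times:
        return None
    first = modification_times[0]
    # Count the modifications that arrived within the quick window.
    burst = [t for t in modification_times if t - first < SNAPSHOT_QUICK]
    if len(burst) == 1:
        return first + SNAPSHOT_QUICK    # isolated change: snapshot soon
    return first + SNAPSHOT_INTERVAL     # busy file system: snapshot less often


single = snapshot_deadline([100])
busy = snapshot_deadline([100, 105, 110])
```

With these values an isolated modification at time 100 is snapshotted by time 130, while a burst starting at time 100 is deferred until time 160.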
When a snapshot completes, the replicator also handles pruning asynchronously, in order to keep the number of snapshots to a reasonable level (typically around 100 per file system). Pruning is described in detail later.
Snapshotting databases
Snapshotting a database requires cooperation from the database in order to force it to make its on-disk state consistent, by holding a lock on the database during the snapshot operation. In one embodiment, the invention achieves this by issuing a "FLUSH TABLES WITH READ LOCK" query to a MySQL database. Other database engines can be integrated with the invention using equivalent mechanisms. This allows databases, as well as applications and mailboxes, to be snapshotted, automatically recovered and live-migrated between servers. Database snapshots and the related file system snapshots can be coordinated in time, so that the state of the application on disk and in the database is consistent.
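The lock-snapshot-unlock sequence can be sketched as a context manager. The sketch deliberately does not depend on any particular MySQL client library: execute stands for any callable that runs a SQL statement on one open database session, and take_snapshot for whatever takes the file system snapshot; both names are assumptions. The two SQL statements are standard MySQL, and the lock is only held for as long as the issuing session stays open, so both statements must be run on the same session.

```python
from contextlib import contextmanager


@contextmanager
def database_read_lock(execute):
    """Hold a global read lock on the database for the duration of the block."""
    execute("FLUSH TABLES WITH READ LOCK")
    try:
        yield
    finally:
        # Always release the lock, even if the snapshot fails.
        execute("UNLOCK TABLES")


def consistent_snapshot(execute, take_snapshot):
    """Take a file system snapshot while the database is quiesced."""
    with database_read_lock(execute):
        return take_snapshot()


# Demonstration with stand-ins that just record the order of operations.
log = []
result = consistent_snapshot(log.append, lambda: log.append("snapshot") or "snap-1")
```

Releasing the lock in a finally clause keeps a failed snapshot from leaving the database locked against writes.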
Per-slave sending replicator states
See Fig. 5.
The per-slave sending replicator 36 is responsible for initiating replication events in conjunction with a remote receiving replicator 38. It begins in the READY state (there is no need to load anything, because it refers to the forest of its parent snapshotter). When check is called on it, either because a new snapshot has been created or because a server has just been added as a slave and a new per-slave sending replicator has been created for it, it calls findUpdates on its forest.
When findUpdates indicates that a particular data stream (with defined start and end snapshot ids) should be sent from the local server to the remote slave for which the per-slave sending replicator was set up, it sends a message to the remote receiving replicator 38 through the group messaging protocol and enters the SENDING_WAITING state. If the remote receiving replicator 38 accepts the replication attempt, the per-slave sending replicator 36 enters the SENDING_RUNNING state and snapshot data begins to flow over the network. When all the snapshot data has been sent, the sending replicator 34 enters the WAIT_FOR_ACK state, which means that it is waiting for the remote receiving replicator to acknowledge correct receipt and storage of the indicated data. When that happens (again via the group messaging protocol), the per-slave sending replicator re-enters the READY state.
If a failure message is received from the remote side at any point, or if a timeout fires (which may happen if the remote machine fails or the network becomes partitioned), the state machine transitions to PAUSE and then, after a further timeout, transitions back to READY. This allows replication to continue without causing a flood of messages to be sent in case the remote side is temporarily unable to receive new replication events.
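The state machine of Fig. 5 can be summarized as a transition table. The state names are those used in the text; the event names (updates_found, accepted, all_data_sent, ack, failure, timeout) are hypothetical labels chosen for this sketch, as is the decision to ignore events that have no transition from the current state.

```python
TRANSITIONS = {
    ("READY", "updates_found"): "SENDING_WAITING",
    ("SENDING_WAITING", "accepted"): "SENDING_RUNNING",
    ("SENDING_RUNNING", "all_data_sent"): "WAIT_FOR_ACK",
    ("WAIT_FOR_ACK", "ack"): "READY",
    ("PAUSE", "timeout"): "READY",
}
IN_FLIGHT = {"SENDING_WAITING", "SENDING_RUNNING", "WAIT_FOR_ACK"}


def step(state, event):
    """Return the next state of a per-slave sending replicator."""
    # Any failure or timeout during a replication attempt backs off to PAUSE.
    if state in IN_FLIGHT and event in ("failure", "timeout"):
        return "PAUSE"
    return TRANSITIONS.get((state, event), state)


state = "READY"
for event in ["updates_found", "accepted", "all_data_sent", "ack"]:
    state = step(state, event)
```

After the four events of the success path the replicator is back in READY, and a failure in any in-flight state leads through PAUSE before the next attempt.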
Receiving replicator states
See Fig. 6.
When a server is a slave for a file system 10, the file system mount handler 32 is in RECEIVING mode, has ensured that the file system itself is unmounted, and is available to receive file system updates from a remote per-slave sending replicator 36. There will usually be exactly one of these, because there is only one master for each file system in any given network partition; if there is more than one master after a network partition and subsequent recovery, the master negotiation described above ensures that one master yields within a short time so that replication can continue.
The receiving replicator 38 begins in the LOADING state, in which it is interrogating the file system for its current snapshot data. When it has received the file system data, it informs its controller 30 of the current snapshot state. The controller 30 informs the other servers in the cluster of this, and the receiving replicator 38 enters the READY state. Once the other servers have been informed of the current state, they can decide, based on their global forest calculations, whether the slave has diverged or whether it needs a simple "fast-forward" update.
If the update is a fast-forward update, the replicator proceeds directly to the RECEIVING state, and snapshot data flows over the network. When this completes, it transitions to the LOADING state, checks that the expected data was received correctly, then initiates asynchronous pruning and immediately becomes ready for the next replication event.
If the update is not a fast-forward update, the replicator instead transitions to the STASHING state, in which it stores, in a local "stash directory", a binary copy of the snapshots between the "slice point" (the end_snapshot indicated by the sending replicator, which is the most recent common snapshot between master and slave) and the current head of the file system on the slave. Once this stashing is complete, the file system is immediately ready to receive the changes and replication proceeds as normal. The start snapshot is then marked as immutable so that the stashing process can be reversed.
In some cases the local file system on a slave may have been modified (even though it is meant to be unmounted, an administrator may, for example, accidentally mount it and modify it). In this case replication fails; however, the receiving replicator detects this situation and transitions to LOCAL_MODS, which causes the local modifications to be snapshotted and safely stashed. The receiving replicator sends a failure message, and the per-slave sender transitions to PAUSE and tries again when its timeout fires, so that replication can continue.
Pruning algorithm
The processes above describe the creation of snapshots, but not their destruction. It is important to destroy old snapshots in order to limit the number of snapshots to a reasonable level. File system operations become slow when there are more than a few hundred snapshots. To a user, the difference between two point-in-time snapshots taken one minute apart more than a year ago is likely to be less important than the difference between two point-in-time snapshots from the last few minutes, so it makes sense to prune older snapshots more aggressively than newer ones. Pruning is the process of collapsing the changes from a number of consecutive snapshots into a single snapshot.
An important property of the pruning process is that it results in the same snapshots being chosen for deletion on all servers in the cluster. This means that the findUpdates process will find a recent common snapshot and avoid sending unnecessarily large amounts of replication data.
The pruning algorithm works by defining a set of sections, typically the last hour, the last day, the last week and the last month, and then using "waypoints" to fill the gaps between the sections. For example, the system can be configured so that all snapshots from the last 60 minutes are kept, hourly snapshots are kept for the last day, daily snapshots are kept for the last week, and so on.
Snapshots are suggested for deletion by the suggestedDeletions function if they are not the closest snapshot to a waypoint.
Because the waypoints are quite stable with respect to the passage of time, almost the same pruning decisions are taken on all servers, even if pruning occurs at slightly different times on different servers.
Very recent snapshots are also excluded from consideration for deletion, and immutable snapshots are never deleted. A snapshot is marked as immutable (locally, on a particular server only) if a stash has been made that is based on that snapshot. In order to recover a stash based on an intermediate snapshot, the intermediate snapshot must still exist; therefore, for the stash to remain usable for recovering data, the snapshot on which it is based must be made immutable and must never be deleted until the stash is discarded.
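A possible shape for suggestedDeletions is sketched below. Several details are assumptions rather than part of the description: timestamps are integer seconds, only the three sections spelled out in the example are configured, waypoints are placed on absolute multiples of the spacing so that every server computes the same ones, and snapshots older than the last section survive only if they are the nearest snapshot to some waypoint.

```python
HOUR, DAY = 3600, 86400
# (maximum age of the section, waypoint spacing); None means "keep everything".
SECTIONS = [(HOUR, None), (DAY, HOUR), (7 * DAY, DAY)]


def suggested_deletions(snapshots, now, immutable=frozenset()):
    """Return the snapshot timestamps that may be pruned."""
    snapshots = sorted(snapshots)
    if not snapshots:
        return []
    keep = set(immutable)  # immutable snapshots are never deleted
    newest_edge = 0        # age at which the current section begins
    for oldest_edge, spacing in SECTIONS:
        if spacing is None:
            keep.update(t for t in snapshots if now - t <= oldest_edge)
        else:
            # First waypoint: oldest section boundary rounded up to the grid.
            first = -(-(now - oldest_edge) // spacing) * spacing
            for waypoint in range(first, now - newest_edge + 1, spacing):
                # Keep only the snapshot closest to each waypoint.
                keep.add(min(snapshots, key=lambda t: abs(t - waypoint)))
        newest_edge = oldest_edge
    return [t for t in snapshots if t not in keep]


now = 10 * DAY
every_ten_minutes = [now - k * 600 for k in range(19)]  # the last three hours
deleted = suggested_deletions(every_ten_minutes, now)
```

In this example the seven snapshots from the last hour are kept, together with the snapshots two and three hours old that sit on hourly waypoints, and the ten snapshots in between are suggested for deletion.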
Both the snapshotting replicator 34 and the receiving replicator 38 use this pruning algorithm to keep the number of snapshots on masters and slaves within reasonable limits.
The system optionally exposes an interface that allows users to roll back to a particular snapshot, to clone new applications and databases from a snapshot at a given point in time, and to manually mark certain snapshots as immutable.
Controller
This section explains the overall "controller" process, which is responsible for being aware of which servers are online within the current network partition (if any) and therefore which server should be elected as the master for each site. It is also responsible for adding slaves if a file system is under-replicated, and for removing slaves if a file system is over-replicated.
Cluster boot and merge process
Referring to Fig. 7.
During normal operation, servers broadcast several messages over the group messaging system at appropriate intervals:
1. Heartbeat messages M1, asserting the liveness of each server and that each server is passing its own self-test (all systems and processes are operating correctly on that server). This data is stored on every machine in a mapping called the liveness mapping.
2. Available data messages M2, stating which snapshots of which file systems each server has, used to determine file system state and to inform replication decisions, as described. This data is stored on every machine in a mapping called the available data mapping.
3. Current masters messages M3, stating which servers are currently master for which file systems. This data is stored on every machine in a mapping called the current masters mapping.
4. Load value messages M4, stating the amount of load currently being generated by each application on each server, used in the load balancing calculations.
There are also a number of periodic checks, which can run at configured intervals:
1. send heartbeat (S4)
2. send current main equipment message (S4)
3. check dead file system (S6)
4. check load balance (S7)
5. check redundancy (over-replication or under-replication) (S8)
When a server starts, it begins by reading the current file system and snapshot state S1. If the last shutdown was clean, it can read this data from a local cache file, which also contains data about the previous current masters state and about the servers that were live just before this server was last shut down (a CHECK_TIMEOUT grace period is applied for each previously live server to come back online before the controller "rescues" its sites). This facilitates quick cluster restarts where necessary, because excessive remounting (which is slow) is avoided.
Heartbeat message
The controller 30 uses the group messaging system to emit a heartbeat from each server every second. The system records the last time it heard from each server S2, and each server can therefore detect, based on a CHECK_TIMEOUT interval, which servers are live (in the same partition as itself) and which are silent (failed or partitioned).
Avoiding premature actions
When a server is starting up, it may observe state that appears to indicate that it should take some action, such as rescuing apparently dead file systems. However, this behaviour may be wholly incorrect, because the server may not yet have heard all the information it needs to make the correct decision. We therefore define a concept called heardFromAllServers S3, S5, which requires that the set of live servers (servers from which a heartbeat has been heard in the last CHECK_TIMEOUT seconds) be a subset of the keys of the mapping in question. The periodic checks that could take such potentially damaging actions are guarded by a heardFromAllServers check (checking that available data or current masters messages have been heard from all servers).
Fig. 7 therefore describes the states through which a server passes when it starts up, and how a new server that comes online and sends heartbeats, but has not yet asserted ownership of file systems, causes the other servers in the cluster to delay running their loops again until the new server has sent a datasets message. Only when all servers have heard (S3) datasets messages from all other live servers will any server be allowed to send a current masters message (S4), and only when there is global consensus on the current masters state will any server be able to run checkDeadSites (S6). This makes the cluster very robust to servers or networks failing and being brought back online, without partially informed decisions being made. Such decisions could have unfortunate consequences, such as an old server coming online and claiming to be master for a large number of file systems when in fact its copies of all the data are two weeks old.
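The heardFromAllServers guard amounts to a subset test, which the following Python sketch illustrates. The `CHECK_TIMEOUT` value and the function and parameter names are hypothetical; only the subset rule itself is taken from the description.

```python
from typing import Dict, Iterable, Set

CHECK_TIMEOUT = 10.0  # seconds; hypothetical value


def live_servers(last_heard: Dict[str, float], now: float) -> Set[str]:
    """Servers from which a heartbeat was heard in the last CHECK_TIMEOUT seconds."""
    return {server for server, t in last_heard.items() if now - t <= CHECK_TIMEOUT}


def heard_from_all_servers(last_heard: Dict[str, float],
                           mapping_keys: Iterable[str], now: float) -> bool:
    """True when every live server already has an entry in the given mapping
    (for example the available data mapping or the current masters mapping)."""
    return live_servers(last_heard, now) <= set(mapping_keys)
```

A periodic check such as the dead file system check would run only when this function returns True for the relevant mapping.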
Decision making using leaders
The system defines the leader for a file system as the server with the lexicographically lowest IP address that has a copy of that file system. For example, in Fig. 1A, the current master may be server A, but the leader may be server B. This breaks the symmetry in an otherwise homogeneous distributed system.
Note that being the leader for a file system is very different from being its master. Leadership is used only to establish which server may make decisions about changing which server is the current master for that file system. This mechanism stops multiple servers from attempting conflicting migrations of a file system at the same time. In some cases, of course, the leader server will also be the current master: the leader role is defined separately from the master role, but both roles may reside on the same server.
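As a minimal sketch of the leader rule, assuming the available data mapping is held as a dictionary from server IP address to the set of file systems that server holds (the function name and data layout are illustrative only):

```python
from typing import Dict, Optional, Set


def leader_for(filesystem: str, available_data: Dict[str, Set[str]]) -> Optional[str]:
    """Return the lexicographically lowest IP address among the servers
    that hold a copy of the file system, or None if no server has one."""
    holders = [ip for ip, filesystems in available_data.items()
               if filesystem in filesystems]
    return min(holders) if holders else None
```

The comparison is a plain string comparison, so "10.0.0.10" sorts before "10.0.0.2". The ordering only needs to be the same on every server; it does not need to be numerically meaningful.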
Current masters messages emit binary values to converge on global state consensus
A current masters message M3 contains, from each server 1, a list of which sites it is hosting and which sites it is not hosting. This allows all servers to construct a globally consistent current masters mapping and to resolve competing masters after a partition recovers.
It is on receipt of a current masters message M3 that the case of two competing masters in a recently merged partition can be detected and handled. This is done using the chooseCandidateMasters function described in the snapshot graph section.
The system broadcasts a binary value, True or False, for each file system. By looking at the totality of the current masters messages from all servers, and comparing them with the system's own current masters mapping, the global state is synchronised correctly using the following logic:
if a server claims to host a file system but we do not think it is hosted there, or a server claims not to host a file system but we think it is hosted there,
and we are the leader for that file system,
then move it to the best server, based on the candidate masters calculation.
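The rule above can be sketched in Python as follows. The data layout (a dictionary of per-server True/False claims, a current masters mapping and a leaders mapping) and the function name are assumptions for illustration. The actual move to the best server would be delegated to the candidate masters calculation, which is not reproduced here.

```python
from typing import Dict, Set


def filesystems_to_move(claims: Dict[str, Dict[str, bool]],
                        current_masters: Dict[str, str],
                        leaders: Dict[str, str],
                        me: str) -> Set[str]:
    """Return the file systems for which a received claim disagrees with the
    local current masters mapping and for which this server is the leader."""
    to_move: Set[str] = set()
    for server, hosted in claims.items():
        for fs, claims_hosting in hosted.items():
            we_think_hosted_there = current_masters.get(fs) == server
            if claims_hosting != we_think_hosted_there and leaders.get(fs) == me:
                to_move.add(fs)
    return to_move
```

Only the leader acts on a disagreement, so two servers never attempt to resolve the same conflict at once.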
Local and remote redundancy calculations (addSlaves)
The replication checking loop checks two things for each file system 10 for which the server 1 is currently master. The first check is whether the file system is under-replicated, in which case it calls addSlaves on the snapshotting replicator 34, which creates a number of new per-slave sending replicators 36 for the chosen new slave servers (these in turn automatically create new receiving replicators, and the file system is copied to the new slaves).
The second check is whether the file system is over-replicated, in which case it sends a deleteFilesystem message, which causes the remote slaves to discard their copies of the file system and shuts down the per-slave sending replicators 36 for those slaves.
In one embodiment, the cluster knows which servers are in the local data center and which are in remote data centers. This allows it to decide more intelligently, based on the configuration of the cluster, how many slaves to replicate to in each locality. For example, a cluster administrator can decide that she wishes to have a localRedundancy value of 2, which means that, in addition to the master, two servers in the local data center have each file system replicated to them (so that the cluster can cope with the failure of 2 local servers); a globalRedundancy value of 1, which means that two other data centers (localities) must have each file system replicated to them; and a slavesPerRemoteLocality value of 1, which means that each remote locality must have one server that obtains a copy of the file system.
Because file systems and applications can be live-migrated from one data center to another, additional replicas may be created automatically in the new data center when the file system arrives there, and some replicas in the old data center may be removed.
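The under-replication and over-replication checks can be sketched as a comparison between desired and actual slave counts per locality. This is an illustrative Python sketch only: the function names and the dictionary layout are assumptions, and the choice of which remote localities must hold a copy (which the description derives from globalRedundancy) is passed in rather than computed.

```python
from typing import Dict, Iterable, Tuple


def desired_slaves(local: str, remote_localities: Iterable[str],
                   local_redundancy: int,
                   slaves_per_remote_locality: int) -> Dict[str, int]:
    """Desired number of slaves per locality for one file system."""
    desired = {local: local_redundancy}
    for locality in remote_localities:
        desired[locality] = slaves_per_remote_locality
    return desired


def replication_delta(desired: Dict[str, int],
                      current: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (to_add, to_remove): slaves missing and slaves in excess, per locality."""
    to_add: Dict[str, int] = {}
    to_remove: Dict[str, int] = {}
    for locality in set(desired) | set(current):
        diff = desired.get(locality, 0) - current.get(locality, 0)
        if diff > 0:
            to_add[locality] = diff      # under-replicated here: add slaves
        elif diff < 0:
            to_remove[locality] = -diff  # over-replicated here: delete copies
    return to_add, to_remove
```

After a live migration to another data center, recomputing `desired_slaves` with the new local locality yields additions in the new data center and removals in the old one.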
Checking dead file systems
If a server fails, some file systems will cease to be hosted on any live server. In this case, the checkDeadFilesystems loop on each server calculates the set of dead file systems about which it can do something, that is, its concerns: those file systems of which the server has a copy and whose current master (if any) is not presently live.
For each in these file system, each server determines whether it is the current leader for this document system, and if its be, its based on from
chooseCandidateMastersa new main equipment of electing for this document system in the optimal service device of (selecting candidate's main equipment) function.
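A minimal Python sketch of this loop follows. The `choose_candidate_masters` argument stands in for the chooseCandidateMasters function, whose logic is not reproduced here; it is assumed to return candidate servers ranked best first. All other names and the data layout are likewise hypothetical.

```python
from typing import Callable, Dict, List, Set


def dead_filesystems(my_copies: Set[str], current_masters: Dict[str, str],
                     live: Set[str]) -> Set[str]:
    """File systems this server holds whose current master (if any) is not live."""
    return {fs for fs in my_copies if current_masters.get(fs) not in live}


def elect_new_masters(me: str, my_copies: Set[str],
                      current_masters: Dict[str, str], live: Set[str],
                      leaders: Dict[str, str],
                      choose_candidate_masters: Callable[[str], List[str]]) -> Dict[str, str]:
    """For each dead file system that this server leads, pick the best candidate."""
    return {fs: choose_candidate_masters(fs)[0]
            for fs in dead_filesystems(my_copies, current_masters, live)
            if leaders.get(fs) == me}
```

A file system with no recorded master is treated as dead in this sketch, which matches the "(if any)" wording above.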
Distributed protocol handler
Mediating all protocol access between clients and the system (example protocols: HTTP, HTTPS, MySQL client protocol, SMTP, POP and IMAP) are the distributed protocol handlers 5 depicted in Fig. 8.
They allow any request for any file system to be directed to any server in the cluster. This means, for example, that a DNS configuration can be set up so that a website has multiple 'A' records, each pointing to a different server in the cluster, to take advantage of the (limited) built-in redundancy in HTTP, whereby a web browser will try an alternative 'A' record if the first one it tries is unavailable.
On each server 1, the protocol handler 5 "sits in front of" the actual application servers (example application servers: Apache, MySQL server, Exim, Dovecot). In addition, the protocol handler is connected to the controller described above and has access to its current masters mapping. The protocol handler can "speak" just enough of each protocol to establish which file system a request should be routed to. The example of Fig. 8 shows a configuration of two servers 1, in which a request for file system F comes from a client 7 via the network 8 to server A and is received by the incoming proxy 80A. The protocol handler chooses the backend by inspecting the current masters mapping of the controller at server A, and finds that it needs to route the request to server B, so its outgoing proxy 82 connects to the incoming proxy 80B of server B. Server B then inspects its own current masters mapping (which agrees with that of server A by virtue of the global state consensus described above) and routes the request to its own "backend server". At this point the connection is "seamlessly joined up", so that neither the client 7 nor the backend server (B in this case) can tell that this is not a perfectly ordinary client connection. The client and the correct backend server then communicate as normal (for example, the server sends the client a web page over an HTTP connection), while the protocol handlers keep track of the connections passing through them.
They need to keep track of connections because they have the ability to pause new requests on demand. This is in order to implement seamless live migration. If the controller 30 has requested that a protocol handler 5 pause connections to a given server 1, it will do so in one of two modes. It waits, up to a timeout, for the "in-flight" connections to close naturally, while pausing all new incoming connections, and then:
1. If the pause is forced, and the current in-flight connections have not closed naturally, it terminates them forcibly.
2. If the pause is not forced, it waits, up to a timeout, for the connections to die naturally, while pausing all new incoming connections. If the in-flight connections do not complete within the allotted time, the pause attempt is abandoned and the paused new connections are "unleashed".
If the pause succeeds, it waits until the controller 30 requests that the pause be "unblocked". At this point the system checks which backend server 1 it should connect to by asking the controller again (crucially, the backend may have changed during the pause operation), connects to the potentially different backend server, and unleashes the "flood of requests" that built up during the pausing process onto the new server, which can then process them as usual. If the delay is sufficiently short, end users will notice only a small delay.
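The pause and unblock behaviour can be modelled with a small single-threaded Python sketch. It is a simplification under stated assumptions: real connections and timeouts are replaced by a `drained_before_timeout` flag, and the class name `PausableProxy`, its methods and the `delivered` log are hypothetical names used only for illustration.

```python
from typing import Callable, List, Tuple


class PausableProxy:
    """Toy model of a protocol handler that can hold new requests while paused."""

    def __init__(self, lookup_backend: Callable[[str], str]) -> None:
        self.lookup_backend = lookup_backend      # e.g. the current masters mapping
        self.paused = False
        self.queue: List[Tuple[str, str]] = []    # requests held while paused
        self.delivered: List[Tuple[str, str]] = []  # (backend, request) log

    def request(self, filesystem: str, req: str) -> None:
        if self.paused:
            self.queue.append((filesystem, req))
        else:
            self.delivered.append((self.lookup_backend(filesystem), req))

    def pause(self, forced: bool, drained_before_timeout: bool) -> bool:
        """Return True if the pause succeeded, False if it was abandoned."""
        self.paused = True
        if drained_before_timeout or forced:
            return True   # in-flight connections closed naturally or were terminated
        self._release()   # unforced pause timed out: abandon and unleash
        return False

    def unblock(self) -> None:
        self._release()

    def _release(self) -> None:
        self.paused = False
        queued, self.queue = self.queue, []
        for filesystem, req in queued:
            self.request(filesystem, req)  # backend is looked up again here
```

Because the backend is looked up again on release, requests queued during a migration reach the new master rather than the old one.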
Live migration
We now have all the pieces of the puzzle needed to describe the complete live migration process, with reference to Fig. 9. To recap, we can:
ensure that replication proceeds to slave servers, even under failure and partition conditions, and recovers after those conditions are resolved;
control inbound connections with the distributed protocol handlers, so that any request can be addressed to any server in the system, and so that the system can momentarily pause inbound connections, wait for in-flight (pending) ones to complete, and redirect requests to a different server.
We can now describe the live migration state machine transitions and protocol. The controller may choose to initiate a live migration of an application from one server to another, either under the direction of a user or because of one of the two mechanisms described below.
The controller 30 of the "thrower" server (the master server) creates a Thrower object in state INIT 90, which is responsible for simultaneously controlling the replication system and the distributed protocol handler. This Thrower object sends a requestmoveload message to the remote controller of the target server (the "catcher", which is the new master), which attempts 92 to allocate a slot for the live migration (a limited number of live migrations is allowed to occur in parallel). If a slot is allocated, it creates a Catcher object in state INIT 94, and the catcher sends an acceptmoveload message. The thrower then instructs 96 its snapshotting replicator 34 to construct a per-slave sending replicator 36 for the target server, in case it is not already a slave. The thrower then sends a latestsnapshot message, which instructs the catcher to enter the PREREPLICATION state 98 until that snapshot has been received. This may not be the final snapshot used in the replication, but it at least gets the catching server "quite up to date", so that the critical-path element of the live migration, during which inbound requests for the file system are momentarily blocked, is as short as possible. If the catcher observes that it already has this snapshot, it can bypass 180 the pre-replication phase and immediately send a continuemoveload message. Otherwise, it sends 99 a prereplication message, and then, when the catcher's replication system observes the snapshot arriving, it informs the thrower that it may continue by sending a continuemoveload message.
The thrower then instructs 102 its distributed protocol handler to begin pausing all new incoming requests and to notify it when all current in-flight requests have completed. The catcher does the same 100. The entire live migration process can now be in one of two modes, forced or unforced. If the mode is unforced and there are long-lived connections to the current master (for example, an IDLE IMAP connection), the pause can be abandoned, which causes the entire live migration to be abandoned. (It can be useful, for example when a server must be shut down completely, to force live migrations so that they always succeed within a short time, at the cost of possibly closing some long-running connections.)
When the distributed protocol handlers on both sides have succeeded in closing all current connections and pausing all new incoming connections, the thrower instructs 104 its file system mount handler to unmount the file system, so that no further changes can be made to it. At this point it takes a final snapshot of the file system and replicates 106 this final snapshot to the catcher, all while new incoming requests for the application are paused. When the replication 108 succeeds, the catcher mounts 110 the file system and sends a completemoveload message, which causes both the thrower and the catcher to unblock 112 their respective distributed protocol handlers, so that the flood of paused requests (users having waited patiently for the few seconds this process takes) is unleashed on the new master for the site.
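The message exchange described above can be summarised as an ordered list, sketched here in Python. The message names are those used in the description; the function name, the `(sender, message)` tuple layout and the omission of slot allocation failures and abandoned pauses are simplifications made for illustration.

```python
from typing import List, Tuple


def migration_messages(catcher_has_latest_snapshot: bool) -> List[Tuple[str, str]]:
    """Return the (sender, message) sequence for a successful live migration."""
    messages = [
        ("thrower", "requestmoveload"),  # ask the catcher to allocate a slot
        ("catcher", "acceptmoveload"),   # slot allocated, Catcher object created
        ("thrower", "latestsnapshot"),   # tell the catcher which snapshot to wait for
    ]
    if not catcher_has_latest_snapshot:
        # Catcher enters PREREPLICATION until the snapshot arrives.
        messages.append(("catcher", "prereplication"))
    messages.append(("catcher", "continuemoveload"))  # thrower may pause and unmount
    messages.append(("catcher", "completemoveload"))  # final snapshot mounted
    return messages
```

Pausing of requests, unmounting, the final snapshot and its replication all happen between continuemoveload and completemoveload, which is the only window in which inbound requests are blocked.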
Driving live migration
The controller 30 has two mechanisms for automatically initiating live migration events. These are the load balancing mechanism and the application locality preference mechanism.
Load balancing: load > av + Q
All servers 1 in the cluster continually exchange information about the current level of load being generated by each application, for example by measuring the total number of requests for that application over a ten-second period. These measurements are "smoothed" over 10 minutes using an exponential decay algorithm (the same algorithm as is used for the UNIX load average calculation). Servers continually check (in the checkLoadBalancing loop) whether their total load (the sum of the load across all their applications) exceeds the average load in the cluster plus a "fudge factor" Q, which exists to stop servers constantly trading load. If a server's load exceeds av + Q, the server elects a recipient server, which is the server with the lowest load of all the servers, and picks a site from its current sites, namely the most heavily loaded site that will not cause the recipient itself to consider itself overloaded. This is known as the "anti-hot-potato choice function", because it stops servers constantly trading load. The chosen site is live-migrated to the recipient.
The emergent behaviour from this simple set of rules is that servers automatically load-balance themselves by migrating whole applications between the servers in the cluster. Furthermore, if one particular application gets a large spike in traffic, that application itself will not be live-migrated (because the anti-hot-potato choice function forbids it). Instead, all the other applications on that server will be migrated away, leaving that server to become a dedicated server for that application.
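The load check and the anti-hot-potato choice can be sketched in Python as follows. The data layout (server to application to smoothed load), the tie-breaking rules and the function names are assumptions for illustration; the smoothing function uses the standard exponential decay form with a hypothetical ten-second sampling interval and ten-minute window.

```python
import math
from typing import Dict, Optional, Tuple


def smooth(previous: float, sample: float,
           interval: float = 10.0, window: float = 600.0) -> float:
    """Exponentially decayed average, in the style of the UNIX load average."""
    decay = math.exp(-interval / window)
    return previous * decay + sample * (1.0 - decay)


def choose_migration(loads: Dict[str, Dict[str, float]], me: str,
                     q: float) -> Optional[Tuple[str, str]]:
    """Return (application, recipient) to live-migrate, or None if no move is due."""
    totals = {server: sum(apps.values()) for server, apps in loads.items()}
    average = sum(totals.values()) / len(totals)
    if totals[me] <= average + q:
        return None  # not overloaded: load <= av + Q
    recipient = min((s for s in totals if s != me), key=lambda s: (totals[s], s))
    # Anti-hot-potato: only consider applications that would not overload the recipient.
    fitting = {app: load for app, load in loads[me].items()
               if totals[recipient] + load <= average + q}
    if not fitting:
        return None
    app = max(fitting, key=lambda a: (fitting[a], a))
    return app, recipient
```

With server A hosting two spiking applications and server B hosting quiescent ones, the sketch moves one of the spiking applications to B. With a single very large application on A, only the small applications are eligible to move, so A tends towards becoming a dedicated server for the large one.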
Application locality preference
Recall that the cluster may be distributed across different geographical regions. A user may wish to express a preference that, if a given region is available (that is, if there are servers online there), their website should be primarily hosted there. If the user specifies or changes this preference (which may be stored in a database), the controller detects the change and initiates a live migration of both the application and any associated databases. This is important so that applications and their databases are always stored in the same geographically local region, because database access is often assumed to be low-latency. It may also be important for an application not to be hosted in, or replicated to, a given locality, in order to comply with local legislation.
Protection against user error
In data protection systems that protect against hardware failure, such as RAID or synchronous replication, if a user accidentally deletes data, the deletion is replicated to the replica device or devices and the deleted data is permanently lost.
As explained above, the system of the present invention continuously takes point-in-time snapshots of all the data stored on the system, and these snapshots are stored so that they can be accessed by a user, for example via a web interface that presents a graphical representation of the available snapshots. If a user accidentally deletes data from the system, a previous snapshot of the data can be selected through this interface, by selecting one of the graphically represented snapshots, and the system can be restored or reverted to its state at the time the selected snapshot was taken, for example before the deletion, without the intervention of a system administrator.
The embodiments of the invention described above deliver a number of features and advantages, as set out here:
1. Automatic recovery from server or data center failure
Returning to Fig. 1A, when heartbeat messages indicate that a server, a whole subset of servers, or perhaps even the connection to a subset of servers (such as, for example, switch 3E or 3W) has failed, the fact that the file systems attached to those servers and supporting particular applications are now dead can be recognised. The situation can then be recovered automatically, with minimal interruption, if any, to the client terminals 7 receiving the service supported by the application whose file system has died.
This aspect is supported by the mechanism of continuous file system replication, whereby a master server continuously transmits snapshots of the file system images of the live applications it supports to a designated set of slave servers. Returning again to Fig. 1A, a master in the east subset can, for example, ensure that it always designates at least one other server in the east subset and at least one other server in the west subset to support delivery of a particular application.
The current master for an application can perform the local and remote redundancy calculations (the addSlaves function) to check redundancy and to increase it if necessary. A master can thus autonomously determine not only the number of slaves to which it replicates application data, but also the nature and location of those slaves. Among other things, this can be guided by user input or user preferences.
2. Recovery from network partitions: the most valuable data is chosen
Referring back to Fig. 2A, on recovery from a partition, a leader server can autonomously decide which of a number of potential masters should be elected as the new master. It will readily be apparent that, after a partition, there may be servers on either side of the partition that each consider themselves to be the master of an application. In that case, the graph weighting function described earlier can be used to determine the greater centre of mass and thereby to determine which master has the most valuable data. Note that recovery from a partition is aided considerably by the fact that file system replication was ongoing before the partition, so that each of the potential new masters already has a version of the file system. It is then a question of determining which has the best version.
3. Migration criteria
In addition to providing mechanisms that support automatic recovery from failure and recovery from network partitions, the embodiments described above can deliver live migration for optimisation purposes. Servers in the cluster can thus autonomously determine, through their exchange of messages, that an application would be better served on a different master, even where there is no failure of the current master. This can be done in a way that balances load or that delivers a locality preference for an application (such as may be input by a user or an administrator). The mechanism of comparing load against the average load across the servers in the cluster plus a factor Q allows vertical scaling of an application, delivering a dedicated server on demand. That is, where an application is taking up a significant amount of the current server's resources, a decision can be made to migrate the other applications away from that server to different servers, rather than to move that application off the server, thereby allowing the application to increase its resources on the current server.
4. Live migration
Live migration as discussed above is supported by the control of replication once a live migration has been initiated, and by the handling of requests during the migration by the protocol handlers.
5. Interactive control of the point-in-time restore feature. This is supported by a user interface that allows a user to select the point in time to which a file system is to be restored. This can be particularly useful for email, databases and files, supporting snapshots, rollback and browsing at different points in time. It provides protection against user error, particularly when a user has deleted something at the application level that they did not intend to delete. Although the deletion may have taken effect, it is possible to restore an earlier snapshot in which the deleted item is still present, for presentation to the user at the interface.
6. Horizontal scalability
A significant advantage of the embodiments of the invention described above is the ability to add servers to, or remove servers from, a cluster in order to increase or decrease its overall capacity. For example, the cluster can be managed by moving all the sites off one server, so that the server can be upgraded or taken offline, for example by a managed replication process that migrates the applications before the upgrade or offlining procedure. It is to be expected that this can be managed substantially autonomously by the cluster, by having the leader server for an application make decisions about the new master based on the current masters messages, which emit binary values to converge on a global state consensus about which server would be the best master. Thus, if the addition of a new server or the removal of a server is detected, a leader server can autonomously take on the task of designating a new master in the context of the newly formed cluster (which may now contain more or fewer servers). In this context, the existing cluster may be a cluster of one server, in the sense that additional servers can be added to it.
A particularly useful feature of the described mechanisms in supporting this is the avoidance of premature actions: servers that are new to the cluster do nothing until they have received sufficient information about the whole system to make a suitable decision. The load balancing mechanism helps new servers to have load moved onto them, because when a new server is added (before it supports any file systems) the global average load level decreases, and decisions to migrate some of the load to the new server can be taken autonomously.
The embodiments of the invention described above address the redundancy problem in the following way:
Changes in application state are replicated asynchronously to a configurable number of other servers in the system. A point-in-time snapshot of each application's data is taken within a configurable number of seconds of a change to the application data being detected, and the differences between these snapshots are replicated between the servers. This allows an application to be recovered automatically from a very recent copy of the data in the event that the system detects the failure of a component, a server, network equipment or even an entire data center. Since there is no reliance on shared storage, no quorum, fencing or STONITH setup is required.
The embodiments of the invention described above address the load balancing problem in the following way:
The load caused by applications is continuously measured by the system and used in a distributed decision-making process to initiate seamless live migrations of applications between servers. For example, if server A is hosting applications {1, 2, 3} and server B is hosting applications {4, 5, 6}, and applications 1 and 2 both experience a spike in load while the rest are quiescent, the system may elect to live-migrate application 2 to server B, giving the balanced configuration A → {1, 3}, B → {2, 4, 5, 6}.
In the embodiments described above, a "seamless" live migration is one in which all modifications made to the application's file system on the old master by in-flight requests to the application are completed before the final snapshot is taken and replicated to the new master, so that, when the file system is mounted on the new master, neither the application code nor the client can ever tell that a live migration has occurred, and no data is lost.