|Publication number||US8886603 B2|
|Publication type||Grant|
|Application number||US 13/864,877|
|Publication date||Nov. 11, 2014|
|Filing date||Apr. 17, 2013|
|Priority date||Jan. 31, 2012|
|Also published as||US8447730, US20130232114|
|Publication number variants||13864877, 864877, US 8886603 B2, US 8886603B2, US-B2-8886603, US8886603 B2, US8886603B2|
|Inventors||Masood Mortazavi, Jacob Y. Levy, Sahaya Andrews Albert|
|Original Assignee||Yahoo! Inc.|
This application is a continuation of and claims the benefit of the filing date of U.S. application Ser. No. 13/363,148, titled “PROBE SYSTEM FOR REPLICATION MONITORING”, filed on Jan. 31, 2012, which application is incorporated by reference.
The present invention relates to Replica systems and, in particular, to a Probe System that determines the latency and health of paths within such systems.
In high-replica count systems or in systems where replication can be performed across a large number of zones or regions, it is critical to measure and monitor all possible replication paths to ensure that those paths are healthy and functioning. Failures can occur due to large backlogs, replication volume failures, end-point failures, transmission message failures or other kinds of failures along any one of the paths. Furthermore, such monitoring is best performed by a probe system that does not unduly burden the serving system or affect the multi-tenant cloud services it provides. In other words, the resources used for probing should be negligible compared to general serving resources. Finally, the results of monitoring computations should be reliable and available almost immediately.
One embodiment of the present invention is a probe system that has a completely decentralized architecture which partitions replication monitoring based on the origin of the replicated transaction. Each replica has its own instance of a Probe System. Each Probe System includes a Probe Head and a Probe Echo. The probe head issues experiments on a set of Target Records, marking them with an Epoch value and a time stamp. (Target Records are selected to exercise all Replication Channels.) The Probe Echo instance completes its leg of the experiment by echoing the Head's probe, with the same experiment Epoch value and a latency value. Finally, the probe head computes and reports replication metrics based on received echoes. The system can operate with partial failures and new Probe Systems can be added for new replicas without having to reconfigure any of the existing probe systems.
Yet another embodiment of the present invention is a method of probing replication paths in a computer system. The embodiment includes (i) creating a new instance of a probe system on a first replica computer system, where the probe system includes a probe head and a probe echo, (ii) creating a target record on a second replica computer system, where the target record includes a head field and an echo field, the head field includes an epoch value and a time stamp, and the echo field includes an epoch value and a latency value, (iii) sending a probe head from the first replica to the second replica to update the epoch value and the time stamp in the head field, (iv) sending a probe echo from the first replica to the second replica to update the epoch value and the latency value in the echo field if the epoch value in the echo field is less than the epoch value in the head field, wherein the latency value is a measurement of time to communicate between first and second replicas, and (v) obtaining a report that gives the communication latency for a communication starting in the first replica and ending in the second replica.
One advantage of the new probe system is its peer-to-peer architecture. Another advantage is the system's capability to maintain high measurement fidelity across the range of write-throughput budgets that the cloud storage system allows the probe system to consume.
Yet another advantage is the system's distributed protocol, algorithm and data schema, which together allow online monitoring of a replication system without any reliance on logging or off-line analysis.
Yet another advantage is the system's broad applicability to a large class of high-replica, distributed and high-scale cloud storage systems and services. The probe system writes keys and values, where the values are composed of disjoint columns, and transactions specify the columns they are updating.
Yet another advantage is the system's completely decentralized architecture, with no master or slave.
Yet another advantage is the system's capability to be deployed independently of the storage system itself.
Yet another advantage is the system's capability to monitor and exercise all available replication channels.
Yet another advantage is the system's capability to continue operating despite partial failures, i.e., no operational interdependence is required or assumed.
Yet another advantage is the system's capability to add new probe systems without any requirement to configure any other, active probe systems.
Subject matter will now be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific example embodiments. Subject matter may, however, be embodied in a variety of different forms and, therefore, covered or claimed subject matter is intended to be construed as not being limited to any example embodiments set forth herein; example embodiments are provided merely to be illustrative. Likewise, a reasonably broad scope for claimed or covered subject matter is intended. Among other things, for example, subject matter may be embodied as methods, devices, components, or systems. Accordingly, embodiments may, for example, take the form of hardware, software, firmware or any combination thereof (other than software per se). The following detailed description is, therefore, not intended to be taken in a limiting sense.
In the drawings:
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
Throughout the specification and claims, terms may have nuanced meanings suggested or implied in context beyond an explicitly stated meaning. Likewise, the phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment and the phrase “in another embodiment” as used herein does not necessarily refer to a different embodiment. It is intended, for example, that claimed subject matter include combinations of example embodiments in whole or in part.
Content server 107 includes a content delivery network or “content distribution network” (CDN). This generally refers to a distributed content delivery system that comprises a collection of computers or computing devices linked by a network or networks. A CDN may employ software, systems, protocols or techniques to facilitate various services, such as storage, caching, communication of content, or streaming media or applications. Services may also make use of ancillary technologies including, but not limited to, “cloud computing,” distributed storage, Domain Name System (DNS) request handling, provisioning, signal monitoring and reporting, content targeting, personalization, or business intelligence. A CDN may also enable an entity to operate or manage another's site infrastructure, in whole or in part.
In an embodiment, the present invention operates in the environment of the one or more servers of a network, as shown in
One embodiment of the present invention is a Probe System that operates to determine the health and latency of replica paths within a Replica system. A Replica system is one that includes replication channels that can be tested, where a channel is any resource along a path that is used in multiplexed mode, potentially by multiple replication messages. The Probe System includes for each Replica a ProbeHead and a ProbeEcho. In one embodiment, the Probe Head is an instance of an Object of the ProbeHead class and the ProbeEcho is an instance of an Object of a ProbeEcho class. The ProbeHead updates data in a Target Record for a channel and the ProbeEcho uses that data to help determine the latency of a path to the system having the Target Record. Thus, the Target Record includes information relating to the latency of all paths that lead to the system having the Target Record. By constructing a report using the Target Record, information about the health and latency of each path to the system having the Target Record is available.
The Target Record
In each Target Record, a Head_i field 202, 208, 214 contained therein includes (a) a time stamp and (b) in one embodiment, a monotonically increasing value, called the Epoch value. For example, in
In each Target Record, an Echo_i_j field 204, 206, 210, 212, 216, 218 contained therein includes (a) an arrival latency value and (b) an Epoch value. The Echo field thus represents echoes from Replica_i to Replica_j in response to updates from Replica_j which occur in response to reading a new Epoch value in the Head_j field of a given TargetRecord.
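The field layout described above can be sketched as a small data structure. This is only an illustration: the class names, the dictionary keys, and the choice of Python are assumptions, not part of the specification, and the latency unit is left unspecified because the description does not state one. The sample values are those the description gives for ProbeSystem 2.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Head:
    epoch: int        # monotonically increasing experiment identifier
    timestamp: float  # time at which the ProbeHead wrote this epoch


@dataclass
class Echo:
    epoch: int      # epoch of the Head update being echoed
    latency: float  # arrival latency observed for that epoch


@dataclass
class TargetRecord:
    # Head_j fields keyed by replica index j (hypothetical layout).
    heads: Dict[int, Head] = field(default_factory=dict)
    # Echo_i_j fields keyed by the pair (i, j), with i != j.
    echoes: Dict[Tuple[int, int], Echo] = field(default_factory=dict)


# Echo fields as described for ProbeSystem 2.
record = TargetRecord()
record.echoes[(3, 2)] = Echo(epoch=100, latency=3)  # Echo_3_2
record.echoes[(1, 2)] = Echo(epoch=100, latency=2)  # Echo_1_2
```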
Referring again to
For ProbeSystem 2, Echo_3_2 210 has a latency of 3 and an Epoch value of 100 and Echo_1_2 212 has a latency of 2 and an Epoch value of 100. This means that there is a latency of 3 between replica_3 and replica_2 and a latency of 2 between replica_1 and replica_2. For ProbeSystem 2, ProbeEcho_1 updates the latency of Echo_1_2 212 and ProbeEcho_3 updates the latency in Echo_3_2 210.
For ProbeSystem 3, Echo_1_3 216 has a latency of 3 and an Epoch value of 206 and Echo_2_3 218 has a latency of 2 and an Epoch value of 205. This means that there is a latency of 3 between replica_1 and replica_3 and a latency of 2 between replica_2 and replica_3. For ProbeSystem 3, ProbeEcho_2 updates the latency of Echo_2_3 218 and ProbeEcho_1 updates the latency of Echo_1_3 216.
In general, the Echo_i_j field of TargetRecord_j is updated by ProbeEcho_i where i=1, 2, . . . M and i≠j. In effect, the index j for the TargetRecord identifies a channel in which replication can occur.
Thus, if N is the number of replicas and M is the number of Probe System instances, M is always less than or equal to N. Additionally, there are M active Head_i fields, where i=1, 2 . . . M, and M·(M−1) active Echo_i_j fields, for i,j=1, 2, . . . M and i≠j.
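The field counts above can be checked by enumerating the field names for a given M. The helper below is a hypothetical illustration; only the Head_i and Echo_i_j naming comes from the description.

```python
from itertools import permutations


def active_fields(m):
    """Return the active Head_i and Echo_i_j field names for m Probe Systems."""
    indices = range(1, m + 1)
    heads = [f"Head_{i}" for i in indices]
    # permutations(..., 2) yields every ordered pair (i, j) with i != j.
    echoes = [f"Echo_{i}_{j}" for i, j in permutations(indices, 2)]
    return heads, echoes


heads, echoes = active_fields(3)
print(len(heads), len(echoes))  # M heads and M * (M - 1) echoes
```

For M = 3 this gives three Head fields and six Echo fields, matching the three-replica example in the description.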
As Probe Systems for new replicas are added, new fields are added to each Target Record by the ProbeHead and ProbeEcho for the Probe System corresponding to a given Replica. As Probe Systems go dormant, their fields become dormant. For example, if ProbeSystem_i goes dormant, then there are no updates to Head_i and Echo_i_j. Dormancy is evaluated and determined by the ProbeHead during reporting. So, dormancy of ProbeSystem_i is evaluated by all other Probe Systems, according to some policy or rule. For example, if a replica associated with a Probe System goes down, the ProbeSystem must be evaluated by other Probe Systems so that the dormancy can be reported. In one embodiment, a policy or rule for this evaluation is the length of time since a Head_i update was issued by the ProbeSystem_i.
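One way to express the time-based dormancy rule is sketched below. The function name, the `max_silence` threshold, and the assumption that the evaluating Probe System can compare Head time stamps against its own clock are all illustrative choices that the description does not specify.

```python
def dormant_probe_systems(head_timestamps, now, max_silence):
    """Return indices i whose Head_i has not been updated within max_silence.

    head_timestamps maps replica index i to the time stamp last read from
    Head_i; now and max_silence use the same (assumed) time unit.
    """
    return sorted(
        i for i, stamp in head_timestamps.items() if now - stamp > max_silence
    )


# Hypothetical readings: Head_2 was last updated 65 time units ago.
print(dormant_probe_systems({1: 100.0, 2: 40.0, 3: 95.0}, now=105.0, max_silence=30.0))
```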
In practice in some embodiments, multiple potential echoes are batched in ProbeEcho_j and written together in the Target Record. Batching induces a delay in discovery but does not affect fidelity of latency measurements. Writing the Echo fields in batches also allows the ProbeEcho to conserve write throughput budget used up by the Probe System.
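A minimal sketch of such batching follows, assuming the latency is computed when the echo is observed, so that delaying the write delays discovery but not the measured value. The `EchoBatcher` class, its `write_fn` callback, and the size-based flush trigger are hypothetical; the description does not say how a batch is bounded.

```python
class EchoBatcher:
    """Collects pending Echo_i_j updates in memory and writes them together."""

    def __init__(self, write_fn, max_pending):
        self._write_fn = write_fn        # called once per batch with a dict
        self._max_pending = max_pending  # assumed flush threshold
        self._pending = {}

    def add(self, i, j, epoch, latency):
        # A newer echo for the same field replaces the pending one.
        self._pending[(i, j)] = (epoch, latency)
        if len(self._pending) >= self._max_pending:
            self.flush()

    def flush(self):
        # One write call covers every pending echo, conserving write budget.
        if self._pending:
            self._write_fn(dict(self._pending))
            self._pending.clear()
```

A time-based flush would serve equally well; the size threshold is used here only to keep the sketch short.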
The Probe System
The ProbeHead is the component of Probe System that is responsible for updating the Head_i field for a specific replica i. ProbeHead is instantiated as ProbeHead_i, for Replica_i and ProbeHead_i is the only instance that updates Head_i. Not only does the ProbeHead_i instance decide when to issue a new Epoch value in Head_i, but ProbeHead_i also reads Echo_j_i, which is updated by ProbeEcho_j, and directly publishes latencies (in “real time”) to the monitoring system.
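The two duties of ProbeHead_i can be sketched as follows. The dictionary standing in for a Target Record, the `publish` callback, and the rule that only echoes matching the current epoch are published are assumptions made for illustration and are not taken from the specification.

```python
import time


class ProbeHead:
    """Sketch of ProbeHead_i; record is a dict standing in for a Target Record."""

    def __init__(self, i, record, publish, clock=time.time):
        self.i = i
        self.record = record
        self.publish = publish  # hypothetical hook into the monitoring system
        self.clock = clock
        self.epoch = 0

    def issue(self):
        # Start a new experiment: bump the Epoch value and stamp the time.
        self.epoch += 1
        self.record[("Head", self.i)] = {
            "epoch": self.epoch,
            "timestamp": self.clock(),
        }

    def report(self):
        # Publish each Echo_j_i that answers the current epoch.
        for key in sorted(self.record):
            if key[0] == "Echo" and key[2] == self.i:
                echo = self.record[key]
                if echo["epoch"] == self.epoch:
                    self.publish(key[1], self.i, echo["latency"])
```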
The ProbeEcho is the component of Probe System responsible for updating the Echo field for a specific replica pair. ProbeEcho is instantiated as ProbeEcho_i, for Replica_i and ProbeEcho_i is the only instance updating Echo_i_j, in response to all Head_j updates issued by ProbeHead_j at Replica_j. In other words, ProbeEcho_i reads each TargetRecord to determine whether the value of Head_j has been updated. If Head_j has been updated, ProbeEcho_i issues the Echo_i_j update. In one embodiment the system batches these updates to minimize impact on system write throughput.
In the case of one system per Container, each probe system runs separately in its own Container, one per replica. Thus, ProbeSystem_i runs in Container_i and probes replication from Replica_i. The Container has access to Replica_i.
In the case of multiple probe systems per Container, multiple ProbeSystems are bundled and run on the same Container. The Container has access to the multiple ProbeSystems.
In the case of multiple Probe Systems per process, each probe system is implemented as a composite object that contains the ProbeHead object and the ProbeEcho object. The composite object has an instance per replica. The same composite object can be used in the case of one system per Container and multiple systems per Container. In the current embodiment, each ProbeHead and each ProbeEcho object instance runs as a separate thread in the same multi-threaded process. The process has access, via the containing Container, to all replica systems. Choosing multiple Probe Systems per process depends on resource availability and the maximum expected load on a single process probe system collection.
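The thread-per-object arrangement might look like the sketch below, which starts one ProbeHead thread and one ProbeEcho thread per replica inside a single process. The function name, the loop signatures, and the use of a shared stop event are assumptions; the description specifies only that each object instance runs as a separate thread.

```python
import threading


def run_probe_threads(replica_ids, head_loop, echo_loop, stop):
    """Start a ProbeHead thread and a ProbeEcho thread for each replica.

    head_loop and echo_loop are caller-supplied callables taking
    (replica_id, stop); stop is a threading.Event used to end the loops.
    """
    threads = []
    for i in replica_ids:
        for role, loop in (("Head", head_loop), ("Echo", echo_loop)):
            thread = threading.Thread(
                target=loop, args=(i, stop), name=f"Probe{role}_{i}", daemon=True
            )
            thread.start()
            threads.append(thread)
    return threads
```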
Interactions Between Probe System and Target Record
Arrow 2 414 in
Arrow 3 416 in
let t=TargetRecord(k) in
if (t.Echo(i,j).epoch<t.Head(j).epoch) then
As mentioned above, in one embodiment, these updates occur in memory and are then collected to update the Target Record in order to improve efficiency.
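The conditional above can be sketched in Python as follows. The guard mirrors the epoch comparison in the pseudocode; the body of the update is an assumption, since it is not shown here: the echo is taken to adopt the Head's epoch and to record the latency as the observation time minus the Head's time stamp, which presumes comparable clocks at both replicas. The dictionary layout and function name are likewise hypothetical.

```python
def echo_update(record, i, j, now):
    """Apply ProbeEcho_i's update of Echo_i_j if Head_j carries a newer epoch.

    record is a dict standing in for TargetRecord(k). Returns True when an
    update was made.
    """
    head = record[("Head", j)]
    # A missing echo is treated as older than any issued epoch (assumption).
    echo = record.get(("Echo", i, j), {"epoch": -1, "latency": None})
    if echo["epoch"] < head["epoch"]:
        record[("Echo", i, j)] = {
            "epoch": head["epoch"],
            "latency": now - head["timestamp"],  # assumed latency definition
        }
        return True
    return False


# Hypothetical time stamps chosen so the result matches Echo_1_2 in the example.
record = {("Head", 2): {"epoch": 100, "timestamp": 10.0}}
echo_update(record, 1, 2, now=12.0)
```

Because the guard fails once the echo's epoch equals the Head's epoch, re-reading the same Head value later does not overwrite the recorded latency.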
The quality of the latency measurement in the ProbeSystem in an embodiment of the present invention depends to a significant extent on the ProbeEcho's read frequency, which, in turn, depends on the read budget that is available to the Probe System.
Delays in generating latency reports depend, in one embodiment, largely on the length of experimental cycles or ProbeHead write frequencies, which, in turn, depend on the write budget that is available to the Probe System.
The term “storage media” as used herein refers to any non-transitory media that stores data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage devices 612, 614, 616. Volatile media includes dynamic memory, such as main memory 656 in
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 664 in
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 652 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over the external bus interface 660 or the high speed network interface 662, which in some embodiments can be 10G Ethernet.
In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.
|U.S. Classification||707/634|
|International Classification||G06F11/20, G06F11/34, G06F17/30|
|Cooperative Classification||G06F17/30, G06F17/30575, G06F11/3495, G06F11/2094, G06F11/3419|
|Apr. 18, 2016||AS||Assignment|
Owner name: EXCALIBUR IP, LLC, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YAHOO! INC.;REEL/FRAME:038383/0466
Effective date: 20160418
|Jun. 1, 2016||AS||Assignment|
Owner name: YAHOO! INC., CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:EXCALIBUR IP, LLC;REEL/FRAME:038951/0295
Effective date: 20160531
|Jun. 3, 2016||AS||Assignment|
Owner name: EXCALIBUR IP, LLC, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YAHOO! INC.;REEL/FRAME:038950/0592
Effective date: 20160531