US20160034370A1 - Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system - Google Patents

Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system

Info

Publication number
US20160034370A1
Authority
US
United States
Prior art keywords
segment
storage
storage unit
data
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/463,567
Inventor
Anil Nanduri
Chunqi Han
Murali Krishna Vishnumolakala
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Enterprise Development LP
Original Assignee
Nimble Storage, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nimble Storage, Inc. filed Critical Nimble Storage, Inc.
Priority to US14/463,567
Publication of US20160034370A1
Assigned to Hewlett Packard Enterprise Development LP (assignment of assignors interest; assignors: Nimble Storage, Inc.)
Legal status: Abandoned

Classifications

    • G06F 11/2094 — Redundant storage or storage space
    • G06F 11/1076 — Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G06F 11/1092 — Rebuilding, e.g. when physically replacing a failing disk
    • G06F 11/2082 — Data synchronisation (redundancy by mirroring of persistent mass storage)
    • G06F 3/0614 — Improving the reliability of storage systems
    • G06F 3/065 — Replication mechanisms
    • G06F 3/0689 — Disk arrays, e.g. RAID, JBOD
    • G06F 2201/84 — Using snapshots, i.e. a logical point-in-time copy of the data
    • G06F 2211/104 — Metadata, i.e. metadata associated with RAID systems with parity

Detailed Description

  • FIG. 1 depicts system 10, in which storage system 12 may be communicatively coupled to host 14, in accordance with one embodiment.
  • Host 14 may transmit read and/or write requests to storage system 12, which in turn may process the read and/or write requests.
  • Storage system 12 may be communicatively coupled to host 14 via a network. The network may include a LAN, WAN, MAN, wired or wireless network, private or public network, etc.
  • Storage controller 16 of storage system 12 may receive the read and/or write requests and may process them by, among other things, communicating with one or more of a plurality of storage units (28, 30, 32, 34).
  • The plurality of storage units may be collectively referred to as storage array 26. While each of the storage units is depicted as a disk drive in FIG. 1, the techniques of the present invention are not limited to storage devices employing magnetic disk based storage. More generally, techniques of the present invention may be applied to a plurality of storage units including one or more solid-state drives (e.g., flash drives), magnetic disk drives (e.g., hard disk drives), optical drives, etc. While four disk drives have been depicted in storage array 26, this is not necessarily so, and a different number of disk drives may be employed in storage array 26.
  • Storage controller 16 may include processor 18, random access memory (RAM) 20 and non-volatile random access memory (NVRAM) 22.
  • Processor 18 may direct the handling of read and/or write requests, and may oversee the reconstruction of at least some of the contents of a failed storage unit. More specifically, processor 18 may perform any of the processes described below in association with FIGS. 15-20.
  • RAM 20 may store instructions that, when executed by processor 18, cause processor 18 to perform one or more of the processes of FIGS. 15-20.
  • RAM 20 may also act as a buffer, storing yet to be processed read and/or write requests, storing data that has been retrieved from storage array 26 but not yet provided to host 14, etc.
  • NVRAM 22 may store data that must be maintained despite a loss of power to storage system 12.
  • Segment map 36 may be stored in NVRAM 22 or storage array 26 (as is depicted in FIG. 1), or both. In a preferred embodiment, a plurality of updates to segment map 36 may be aggregated in NVRAM 22 before being written to storage array 26 in a batch. Segment map 36 associates a plurality of segment identifiers with a plurality of stripe numbers, and is further described below in association with FIGS. 3 and 6.
  • FIG. 2 depicts one possible arrangement of data blocks and error-correction blocks on storage array 26. It is noted that the information depicted in FIG. 2 may only be a partial representation of the information stored on storage array 26. For example, segment map 36 is depicted in the storage array 26 of FIG. 1, but is not depicted in FIG. 2 so as to not unnecessarily clutter the presentation of FIG. 2.
  • The term “error-correction block(s)” will be used to generally refer to any block(s) of information that is dependent on one or more data blocks and can be used to recover one or more data blocks.
  • An example of an error-correction block is a parity block, which is typically computed using XOR operations.
  • An XOR operation, however, is only one operation that may be used to compute an error-correction block; more generally, an error-correction block may be computed based on a code, such as a Reed-Solomon code.
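As a minimal illustration of the parity computation just described (not taken from the patent; the helper name xor_blocks is an assumption), a parity block can be formed by XORing the data blocks of a stripe, and any single lost block can be regenerated by XORing the surviving blocks with the parity:

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Parity for a stripe of three data blocks:
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\x0f"
parity = xor_blocks([d0, d1, d2])

# If d1 is lost, it can be recovered from the surviving blocks and the parity:
assert xor_blocks([d0, d2, parity]) == d1
```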
  • The term “data block(s)” will be used to generally refer to any block(s) of information that might be transmitted to or from host 14.
  • The term “block” is used to generally refer to any collection of information typically represented as one or more binary strings (e.g., “01010100”).
  • In FIG. 2, reference labels are used to refer to particular data blocks. For instance, d.00 is a reference label used to refer to a data block stored on disk 0.
  • Reference labels associated with data blocks begin with the letter “d”, and reference labels associated with error-correction blocks begin with the letter “P”. Error-correction blocks are illustrated with a striped pattern.
  • The information stored by a data block is typically in the form of a binary string (e.g., “0010101001 . . . ”), as is the information stored by an error-correction block (e.g., “101010100 . . . ”). Entries of the storage unit without any data or error-correction blocks have been left blank.
  • The arrangement of data blocks and error-correction blocks of FIG. 2 is representative of a RAID 4 data redundancy scheme, in which one of the storage units (i.e., disk 3 in FIG. 2) is dedicated to storing error-correction blocks, and all other storage units (i.e., disks 0-2) are dedicated to storing data blocks.
  • The data blocks in each row of the arrangement may belong to a data segment. For example, data blocks d.00, d.01 and d.02 may belong to a data segment, and that data segment along with its error-correction block P.0 may be stored at the location of stripe 0 (which corresponds to the top row of the arrangement).
  • A stripe may be interpreted as a container for storing a data segment (and its associated error-correction block), and a stripe number may be used to identify a particular stripe. A stripe typically includes a plurality of storage locations distributed across the storage units of storage array 26. While a RAID 4 data redundancy scheme is used to explain techniques of the present invention (for ease of explanation), the techniques of the present invention can be applied to other redundancy schemes, such as RAID 5, RAID 6, RAID 7, etc.
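A short sketch may make this arrangement concrete. The following is an illustrative model (not the patent's interface) of a four-unit RAID 4 array in which disks 0-2 hold a stripe's data blocks and disk 3 holds its parity block; it reuses the xor_blocks helper from the earlier sketch:

```python
# Hypothetical model of the FIG. 2 arrangement: each "disk" maps a stripe
# number to the block stored at that stripe's location on the unit.
disks = [dict() for _ in range(4)]   # disks 0-2: data; disk 3: parity

def write_stripe(stripe_number, data_blocks):
    """Write one data segment (three data blocks) plus its parity as a stripe."""
    assert len(data_blocks) == 3                        # one block per data disk
    for disk, block in enumerate(data_blocks):
        disks[disk][stripe_number] = block              # d.<stripe><disk> in the text's labels
    disks[3][stripe_number] = xor_blocks(data_blocks)   # P.<stripe> on the dedicated parity disk

# Store the data segment of stripe 0 (blocks d.00, d.01, d.02 and parity P.0):
write_stripe(0, [b"\x01", b"\x02", b"\x03"])
```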
  • FIG. 3 depicts segment map 36, which allows storage system 12 to determine where a data segment is stored in storage array 26.
  • For example, storage system 12 can determine that the data segment with segment identifier 112 is stored at stripe number 5 (i.e., the stripe comprising the data blocks d.50, d.51 and d.52). More generally, segment map 36 associates each segment identifier with a stripe number.
  • In one embodiment, each new data segment is assigned a segment identifier that is greater than the maximum segment identifier that previously existed on the storage system.
  • For example, if the maximum segment identifier in segment map 36 were 115, the next segment identifier added to segment map 36 could be segment identifier 116. Consequently, the sequence of segment identifiers that is allocated over time may be a monotonically increasing sequence (or a strictly monotonically increasing sequence).
  • In one embodiment, a segment identifier is a 64-bit number, so there is no concern that storage system 12 will ever reach the maximum segment identifier and need to wrap the segment identifier around to 0.
  • Segment map 36 may be viewed as a timeline, recording the order in which storage system 12 has written to the stripes of storage array 26 over time. Segment map 36 indicates that a data segment was written to stripe number 6, then a data segment was written to stripe number 0, then a data segment was written to stripe number 1, and so on. To be more precise, segment map 36 may only provide a partial timeline, as entries (i.e., rows) of segment map 36 could be deleted. In other words, the stripe numbers are ordered chronologically (with respect to ascending segment identifiers), but the sequence of stripe numbers could have some missing entries due to deleted data segments. For example, if the data segment with segment identifier 113 were deleted, the row with segment identifier 113 could be deleted from segment map 36.
  • When a data segment is modified, it is assigned a new segment identifier. For instance, if the data segment with segment identifier 111 were modified, a new row in segment map 36 could be created, associating segment identifier 116 with stripe number 1 (i.e., the stripe number formerly associated with segment identifier 111); and the row with segment identifier 111 could be deleted.
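The following minimal sketch (the names and structure are assumptions, not the patent's implementation) captures the two behaviors just described: new segments receive strictly increasing identifiers, and a modified segment receives a fresh identifier while its old row is deleted:

```python
class SegmentMap:
    """Illustrative segment map: segment identifier -> stripe number."""

    def __init__(self, next_id=0):
        self.rows = {}          # segment identifier -> stripe number
        self.next_id = next_id  # always greater than any identifier handed out

    def write_segment(self, stripe_number):
        """A new data segment gets the next (strictly increasing) identifier."""
        seg_id = self.next_id
        self.next_id += 1
        self.rows[seg_id] = stripe_number
        return seg_id

    def modify_segment(self, old_seg_id):
        """A modified segment is assigned a new identifier; the old row is deleted."""
        stripe_number = self.rows.pop(old_seg_id)
        return self.write_segment(stripe_number)

# Mirroring the text's example: modifying the segment with identifier 111
# (stored at stripe 1) deletes that row and adds a row mapping 116 -> 1.
smap = SegmentMap(next_id=116)
smap.rows[111] = 1
assert smap.modify_segment(111) == 116 and smap.rows[116] == 1
```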
  • While a log structured file system could have a sequence of monotonically increasing segment identifiers, other sequences of segment identifiers could be used, so long as the sequence can be used to distinguish at least two points in time.
  • For example, a monotonically decreasing (or strictly monotonically decreasing) sequence of segment identifiers could be used, in which time progression could be associated with decreasing segment identifiers.
  • It is noted that another name for a monotonically increasing sequence is a non-decreasing sequence, and another name for a monotonically decreasing sequence is a non-increasing sequence. Increasing (or decreasing) segment identifiers could be associated with progressing time, while a run of identical segment identifiers could be associated with a time period.
  • With reference to FIGS. 4-14, a partial rebuild process employing techniques of one embodiment of the present invention is described in more detail.
  • FIG. 4 depicts the scenario in which disk 1 has failed. All the contents of disk 1 are no longer accessible, and hence the contents of disk 1 are represented as “--”.
  • As depicted in FIG. 5, information for facilitating a partial rebuild of a failed storage unit may be stored on persistent storage (e.g., storage unit 32, also referred to as disk 2).
  • The information may include an identifier of the failed storage unit (e.g., a serial number of the failed storage unit). In the current example, the identifier of disk 1 (which has failed) is 0001, so 0001 is stored on the persistent storage.
  • The information may also include the segment identifier associated with the last data segment that was written to storage array 26 prior to the failure of disk 1.
  • In the current example, assume that the data segment with segment identifier 114 was the last data segment to be fully written to storage array 26 (referring to FIG. 2); accordingly, the segment identifier stored on the persistent storage is 114.
  • It is noted that the segment identifier that is stored on the persistent storage may not be the maximum segment identifier present in segment map 36.
  • In the current example, segment identifier 114 was stored on the persistent storage, but segment map 36 also contained segment identifier 115.
  • Such segment identifiers (i.e., those greater than the segment identifier written to the persistent storage) could correspond to data segments that were still in the process of being written to storage array 26 when the storage unit failed; such segment identifiers could also correspond to data segments located in a write buffer (e.g., in a portion of RAM 20 or in a portion of NVRAM 22) that had not yet been written to storage array 26 prior to the failure of the storage unit.
  • In the example of FIG. 5, the information used to facilitate a partial rebuild process is stored on disk 2 (e.g., in a RAID superblock of disk 2).
  • More generally, such information could be stored on one or more of the storage units that are still in operation. Indeed, such information could be stored on disk 0, disk 2 and disk 3 for extra reliability (e.g., in RAID superblocks of disks 0, 2 and 3).
  • Alternatively or in addition, such information could be stored in NVRAM 22.
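As a hedged sketch of this bookkeeping (the function names and the JSON encoding are illustrative assumptions; the patent simply stores these two values in a RAID superblock and/or NVRAM), the metadata written at failure time could look like:

```python
import json

def record_failure(superblock_path, failed_unit_id, last_segment_id):
    """Persist the failed unit's identifier and the last segment identifier
    fully written before the failure, e.g. {"failed_unit": "0001",
    "segment_id": 114} in the current example."""
    with open(superblock_path, "w") as f:
        json.dump({"failed_unit": failed_unit_id, "segment_id": last_segment_id}, f)

def read_failure_record(superblock_path):
    """Read the metadata back when the failed unit comes online again."""
    with open(superblock_path) as f:
        return json.load(f)

record_failure("/tmp/raid_superblock.json", "0001", 114)
```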
  • While disk 1 is in a failed state, storage system 12 may process additional write requests; for purposes of explanation, assume that two additional write requests are received.
  • The state of segment map 36 following these two write requests is depicted in FIG. 6. Assume the first request is to modify the data segment with segment identifier 111, and the second request (following the first request) is to modify the data segment with segment identifier 109.
  • As a result of the first request, the entry with segment identifier 111 is deleted from segment map 36 and a new entry mapping segment identifier 116 to stripe number 1 (i.e., the stripe number formerly associated with segment identifier 111) is added to segment map 36.
  • As a result of the second request, the entry with segment identifier 109 is deleted from segment map 36 and a new entry mapping segment identifier 117 to stripe number 6 (i.e., the stripe number formerly associated with segment identifier 109) is added to segment map 36.
  • FIG. 8 depicts the state of storage array 26 after disk 1 has been recovered. Assume that the failure of disk 1 did not affect the contents of disk 1, so that upon its recovery, the contents of disk 1 are identical to the contents of disk 1 immediately prior to its failure. One can see that the contents of disk 1 are identical between FIG. 2 and FIG. 8. Storage system 12 is now tasked with determining which data of disk 1 is stale and requires rebuilding.
  • From the arrangement of data and error-correction blocks of FIG. 8 alone, storage system 12 may be able to determine that stripe 4 needs to be rebuilt on disk 1 (as the absence of data on disk 1 for stripe 4 and the presence of data on the other disks for stripe 4 would indicate that a write to stripe 4 occurred during the failure of disk 1), but it would not be able to determine which other stripes need to be rebuilt on disk 1.
  • Upon the recovery of disk 1, storage system 12 reads the storage unit identifier that has been stored in persistent storage (i.e., the information depicted in FIG. 5). From the storage unit identifier (i.e., 0001 in the current example), storage system 12 can determine that a partial rebuild may need to be performed on disk 1. Further, storage system 12 reads the segment identifier that has been stored in persistent storage (i.e., 114 in the current example). From this information, storage system 12 can determine which data segments to rebuild. Since the segment identifiers are allocated in a monotonically increasing manner (at least in the presently discussed embodiment), any data segments that are written to storage array 26 subsequent to the failure of disk 1 will have a segment identifier that is greater than the stored segment identifier.
  • Accordingly, storage system 12 can rebuild the data segments corresponding to segment identifiers that are greater than the stored segment identifier.
  • In the current example, the stored segment identifier was 114, so the data segments that will be rebuilt on disk 1 are those with segment identifiers 115, 116 and 117, which correspond to stripes 4, 1 and 6, respectively.
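The selection itself reduces to a comparison against the stored identifier. In the sketch below, the mappings 112→5, 115→4, 116→1 and 117→6 come from the example in the text; the rows for 113 and 114 are made-up filler:

```python
segment_map = {112: 5, 113: 2, 114: 3, 115: 4, 116: 1, 117: 6}  # id -> stripe
stored_segment_id = 114  # last segment fully written before the failure

# Every segment newer than the stored identifier was written while disk 1
# was down, so only those stripes need to be rebuilt.
stripes_to_rebuild = [stripe for seg_id, stripe in sorted(segment_map.items())
                      if seg_id > stored_segment_id]
assert stripes_to_rebuild == [4, 1, 6]
```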
  • FIGS. 9-14 illustrate an iterative method to partially rebuild the contents of a failed storage unit, in accordance with one embodiment.
  • First, storage system 12 determines whether the stored segment identifier (i.e., 114 in the current example) is the maximum segment identifier present in segment map 36. If so, the process ends. Since it is not the maximum segment identifier present in segment map 36 (the maximum segment identifier in the current example is 117), storage system 12 rebuilds a portion of the data segment associated with the next higher segment identifier in segment map 36 on disk 1. In the present example, the next higher segment identifier is 115, which is mapped to stripe number 4.
  • Accordingly, storage system 12 generates data block d.41′ (from data blocks d.40′ and d.42′ and error-correction block P.4′) and stores data block d.41′ on disk 1, as depicted in FIG. 9.
  • The specific techniques used to generate a data block from other data blocks and redundant information (e.g., error-correction blocks) are known in the art and are not described in detail herein.
  • Next, the stored segment identifier is advanced to the next higher segment identifier in the segment map. In the present example, the stored segment identifier is advanced to 115, as depicted in FIG. 10.
  • Storage system 12 then determines whether the stored segment identifier (i.e., segment identifier 115) is the maximum segment identifier present in segment map 36. If so, the process ends. Since it is not the maximum segment identifier present in segment map 36, storage system 12 rebuilds a portion of the data segment associated with the next higher segment identifier on disk 1. In the present example, the next higher segment identifier is 116, which is mapped to stripe number 1. Accordingly, storage system 12 generates data block d.11′ (from data blocks d.10′ and d.12′ and error-correction block P.1′) and stores data block d.11′ on disk 1, as depicted in FIG. 11. Next, the stored segment identifier is advanced to the next higher segment identifier in the segment map. In the present example, the stored segment identifier is advanced to 116, as depicted in FIG. 12.
  • Storage system 12 then determines whether the stored segment identifier (i.e., segment identifier 116) is the maximum segment identifier present in segment map 36. If so, the process ends. Since it is not the maximum segment identifier present in segment map 36, storage system 12 rebuilds a portion of the data segment associated with the next higher segment identifier on disk 1. In the present example, the next higher segment identifier is 117, which is mapped to stripe number 6. Accordingly, storage system 12 generates data block d.61′ (from data blocks d.60′ and d.62′ and error-correction block P.6′) and stores data block d.61′ on disk 1, as depicted in FIG. 13. Next, the stored segment identifier is advanced to the next higher segment identifier in the segment map. In the present example, the stored segment identifier is advanced to 117, as depicted in FIG. 14.
  • Finally, storage system 12 determines whether the stored segment identifier (i.e., 117) is the maximum segment identifier present in segment map 36. Since it is the maximum segment identifier in segment map 36, the partial rebuild process concludes.
  • Storage system 12 was able to determine which data segments to rebuild based solely on segment map 36 and a single segment identifier stored on the persistent storage.
  • Segment map 36 is required for normal operation of storage system 12, so the only storage overhead required to enable a partial rebuild of a failed storage unit is the storing of the segment identifier on the persistent storage.
  • The determination of whether a data segment needs to be rebuilt is also a computationally efficient step, requiring only that the segment identifier of the data segment be compared to the stored segment identifier.
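Putting the iteration of FIGS. 9-14 together, a sketch of the rebuild loop might look as follows; rebuild_stripe_on_unit and persist_segment_id are hypothetical stand-ins for the array's block-reconstruction and superblock-update routines:

```python
def partial_rebuild(segment_map, stored_segment_id,
                    rebuild_stripe_on_unit, persist_segment_id):
    """Rebuild every segment newer than the stored identifier, one stripe at
    a time, checkpointing progress after each stripe (cf. FIGS. 10, 12, 14)."""
    while stored_segment_id != max(segment_map):
        # Next higher identifier in the map = next segment written during the failure.
        next_id = min(s for s in segment_map if s > stored_segment_id)
        rebuild_stripe_on_unit(segment_map[next_id])  # regenerate the failed unit's block
        stored_segment_id = next_id
        persist_segment_id(stored_segment_id)         # survive a crash mid-rebuild
```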
  • FIG. 15 depicts flow diagram 100 for storing information that may be used in the process of reconstructing at least some of the contents of a storage unit, and using that information to reconstruct at least some of the contents of the storage unit, in accordance with one embodiment.
  • Storage system 12 may maintain a segment map in a persistent storage, the segment map associating a plurality of segment identifiers with a plurality of stripe numbers.
  • Prior to a first one of the storage units failing, storage system 12 may process a first sequence of write requests, the first sequence of write requests being associated with a first sequence of the segment identifiers.
  • In one embodiment, the first sequence of the segment identifiers is a monotonically increasing sequence, while in another embodiment, the first sequence of the segment identifiers is a monotonically decreasing sequence.
  • In response to the first storage unit failing, storage system 12 may store a first one of the segment identifiers from the segment map on a second one of the storage units, the first segment identifier being associated with the last write request that was processed from the first sequence of write requests.
  • Storage system 12 may also store an identifier of the first storage unit on the second storage unit.
  • Subsequent to the first storage unit failing and prior to its recovery, storage system 12 may process a second sequence of write requests, the second sequence of write requests being associated with a second sequence of the segment identifiers.
  • In one embodiment, the second sequence of the segment identifiers is a monotonically increasing sequence, while in another embodiment, the second sequence of the segment identifiers is a monotonically decreasing sequence.
  • Subsequent to the first storage unit being recovered, storage system 12 may determine, based on the identifier of the first storage unit that was stored on the second storage unit, that a partial rebuild process needs to be performed on the first storage unit.
  • Storage system 12 may then determine a set of stripe numbers associated with the content to be rebuilt on the first storage unit, the determination being based on the segment map and the first segment identifier. In one embodiment, the determination may be based solely on the segment map and the first segment identifier. The set of stripe numbers that are determined may identify stripes associated with data segments having segment identifiers greater than the first segment identifier. Finally, at step 116, storage system 12 may, for each stripe identified in the set of stripe numbers, rebuild on the first storage unit a portion of the content that belongs to the stripe and the first storage unit.
  • FIG. 16 depicts flow diagram 200 that elaborates upon step 116 of FIG. 15 .
  • At step 202, storage system 12 may determine whether the first segment identifier is the maximum segment identifier. If so, the partial rebuild process ends. Otherwise, storage system 12 may rebuild on the first storage unit a portion of a data segment associated with the next higher segment identifier in the segment map (step 204).
  • Next, the first segment identifier may be advanced to the next largest segment identifier in segment map 36. The process may then return to previously described step 202.
  • FIG. 17 depicts flow diagram 300 for processing a read request while at least some of the contents of a storage unit are being reconstructed, in accordance with one embodiment.
  • Storage system 12 may receive a read request from host 14 for data on the first storage unit while contents of the first storage unit are being reconstructed.
  • Storage system 12 may then determine whether the requested data needs to be reconstructed. In one embodiment, such a determination may involve comparing the segment identifier associated with the read request with the first segment identifier. If the segment identifier of the read request is less than or equal to the first segment identifier, the requested data may not need to be reconstructed, and the requested data can be directly read from the first storage unit (step 306) and transmitted to the host device (step 308). Otherwise, if the segment identifier of the read request is greater than the first segment identifier, the requested data may be reconstructed (step 310) and the reconstructed data may be transmitted to the host device (step 312).
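A sketch of the decision in flow diagram 300 follows; read_from_unit and reconstruct_from_peers are hypothetical helpers, not routines named by the patent:

```python
def read_during_rebuild(read_segment_id, stored_segment_id,
                        read_from_unit, reconstruct_from_peers):
    """Serve a read while the partial rebuild of the first storage unit is
    still in progress."""
    if read_segment_id <= stored_segment_id:
        return read_from_unit()        # the unit's copy is current (steps 306/308)
    return reconstruct_from_peers()    # still stale on the unit (steps 310/312)
```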
  • FIGS. 18-20 depict a variant of the processes depicted in FIGS. 15-17. For conciseness, steps that have already been described are not described again.
  • In the process of FIG. 18, storage system 12, in addition to recording the first segment identifier, also records a second segment identifier (step 402).
  • The second segment identifier may be associated with the last write request that was processed from the second sequence of write requests (i.e., the sequence of write requests processed while the storage array was in a degraded mode of operation).
  • In other words, the second segment identifier may identify the last data segment that was written during the degraded mode of operation, and hence may identify the last data segment that requires reconstruction.
  • Storage system 12 may determine a set of stripe numbers associated with content to be rebuilt on the first storage unit, the determination being based on the segment map and the first and second segment identifiers. In one embodiment, the determination is based solely on the segment map and the first and second segment identifiers.
  • The set of stripe numbers that are determined may identify stripes associated with data segments having segment identifiers greater than the first segment identifier and less than or equal to the second segment identifier.
  • The process of FIG. 18 may eliminate some redundant computation, as compared to the process of FIG. 15. For instance, if data segments are written to storage array 26 after the first storage unit has been recovered, no data reconstruction should be needed for such data segments. Nevertheless, the process of FIG. 15 may perform data reconstruction for such data segments (even though it is not needed), while the process of FIG. 18 will avoid such data reconstruction.
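A sketch of the two-identifier selection, using the same illustrative segment-map structure as before: only identifiers in the half-open range (first_id, second_id] are rebuilt, so segments written after recovery are skipped:

```python
def stripes_to_rebuild(segment_map, first_id, second_id):
    """Stripes for segments written while the unit was down: identifiers
    greater than first_id and less than or equal to second_id."""
    return [stripe for seg_id, stripe in sorted(segment_map.items())
            if first_id < seg_id <= second_id]
```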
  • Flow diagram 500 (depicted in FIG. 19 ) is similar to flow diagram 200 (depicted in FIG. 16 ), except for step 502 .
  • At step 502, storage system 12 may determine whether the first segment identifier is equal to the second segment identifier as the termination condition, rather than determining whether the first segment identifier is the maximum segment identifier (as in step 202).
  • Flow diagram 600 (depicted in FIG. 20 ) is similar to flow diagram 300 (depicted in FIG. 17 ), except for step 604 .
  • At step 604, storage system 12 may determine whether the requested data needs to be reconstructed by comparing the segment identifier associated with the read request with the first and second segment identifiers. More specifically, storage system 12 may determine whether the segment identifier associated with the read request is greater than the first segment identifier and less than or equal to the second segment identifier. If so, the requested data is reconstructed (step 310); otherwise, the requested data can be read from the first storage unit (step 306).
  • FIG. 21 provides an example of a system 700 that is representative of any of the computing systems discussed herein. Further, computer system 700 may be representative of a system that performs any of the processes depicted in FIGS. 15-20. Note that not all of the various computer systems have all of the features of system 700. For example, certain of the computer systems discussed above may not include a display, inasmuch as the display function may be provided by a client computer communicatively coupled to the computer system, or a display function may be unnecessary. Such details are not critical to the present invention.
  • System 700 includes a bus 702 or other communication mechanism for communicating information, and a processor 704 coupled with the bus 702 for processing information.
  • Computer system 700 also includes a main memory 706 , such as a random access memory (RAM) or other dynamic storage device, coupled to the bus 702 for storing information and instructions to be executed by processor 704 .
  • Main memory 706 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 704 .
  • Computer system 700 further includes a read only memory (ROM) 708 or other static storage device coupled to the bus 702 for storing static information and instructions for the processor 704 .
  • A storage device 710, which may be one or more of a floppy disk, a flexible disk, a hard disk, a flash memory-based storage medium, magnetic tape or other magnetic storage medium, a compact disk (CD)-ROM, a digital versatile disk (DVD)-ROM, or other optical storage medium, or any other storage medium from which processor 704 can read, is provided and coupled to the bus 702 for storing information and instructions (e.g., operating systems, application programs and the like).
  • Computer system 700 may be coupled via the bus 702 to a display 712 , such as a flat panel display, for displaying information to a computer user.
  • An input device 714, such as a keyboard including alphanumeric and other keys, may be coupled to the bus 702 for communicating information and command selections to the processor 704.
  • Another type of user input device is cursor control device 716, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 704 and for controlling cursor movement on the display 712.
  • Other user interface devices, such as microphones, speakers, etc. are not shown in detail but may be involved with the receipt of user input and/or presentation of output.
  • The processes described herein may be implemented by processor 704 executing appropriate sequences of computer-readable instructions contained in main memory 706. Such instructions may be read into main memory 706 from another computer-readable medium, such as storage device 710, and execution of the sequences of instructions contained in the main memory 706 causes the processor 704 to perform the associated actions.
  • In alternative embodiments, hard-wired circuitry or firmware-controlled processing units (e.g., field programmable gate arrays) may be used in place of, or in combination with, processor 704 and its associated computer-readable instructions.
  • the computer-readable instructions may be rendered in any computer language including, without limitation, C#, C/C++, Fortran, COBOL, PASCAL, assembly language, markup languages (e.g., HTML, SGML, XML, VoXML), and the like, as well as object-oriented environments such as the Common Object Request Broker Architecture (CORBA), JavaTM and the like.
  • Computer system 700 also includes a communication interface 718 coupled to the bus 702 .
  • Communication interface 718 may provide a two-way data communication channel with a computer network, which provides connectivity to and among the various computer systems discussed above.
  • communication interface 718 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN, which itself is communicatively coupled to the Internet through one or more Internet service provider networks.
  • The precise details of such communication paths are not critical to the present invention. What is important is that computer system 700 can send and receive messages and data through the communication interface 718 and in that way communicate with hosts accessible via the Internet.

Abstract

The failure of a storage unit in a storage array of a storage system may render the storage unit unresponsive to any requests. Any writes to the storage system that occur during the failure of the storage unit will not be reflected on the failed unit, rendering some of the failed unit's data stale. Assuming the failed unit's data is not corrupted but is just stale, a partial rebuild may be performed on the failed unit, selectively reconstructing only data that is needed to replace the stale data. Described herein are techniques for storing information that identifies the data that needs to be rebuilt. When the storage unit fails, the segment identifier associated with the last data segment written to the storage system may be stored. Upon the storage unit recovering, the storage system can rebuild only those data segments whose identifier is greater than the stored segment identifier.

Description

    RELATED APPLICATIONS
  • This application is a Continuation of U.S. application Ser. No. 14/446,191 filed on Jul. 29, 2014, incorporated herein by reference.
  • FIELD OF THE INVENTION
  • The present invention relates to methods and systems for reconstructing at least some of the contents of a storage unit following the failure of the storage unit, and more particularly relates to efficiently storing information that facilitates such reconstruction process.
  • BACKGROUND
  • In a storage system with a plurality of storage units, data is often stored in a redundant manner. When one or more of the storage units experiences a failure and its associated data is lost, data redundancy allows the data of the failed storage units to be recovered from the operational storage units (assuming there is sufficient redundancy). While it is certainly beneficial that data on a failed storage unit can be recovered, there are certain costs (and concerns) associated with the data recovery process.
  • First, data recovery consumes resources of the storage system that would otherwise be available to process read and/or write requests of a host. For example, data recovery in most cases involves reading content from the operational storage units in order to recover the lost data. In many cases, once the content is read (e.g., in the form of data blocks and parity blocks), it must be further processed in order to reconstruct the lost data. Such reads and processing of a data recovery process may increase the time it takes for a storage system to respond to read and write requests from a host.
  • Second, the longer the data recovery process takes, the longer the storage system operates in a degraded mode of operation. In a degraded mode, any data requested from the failed storage unit must be first reconstructed (if it has not already been reconstructed) before the request can be fulfilled, increasing a storage system's response time to read requests. Further, a reduced level of data redundancy makes the storage system more vulnerable to permanent data loss.
  • One way to address such concerns is to shorten the data recovery process, and one way to shorten the data recovery process is to reduce the amount of data that needs to be recovered. Such an approach, of course, is not always possible. Indeed, if all the data of a storage unit were lost and that data is needed, there is no choice but to reconstruct all the data of the storage unit, in a process known as a “full rebuild” or a “full reconstruction”. In other cases, however, rebuilding only a subset of the data may be sufficient.
  • For example, when a storage unit fails, sometimes its data is not lost. In other words, a failure of a storage unit may render the storage unit unresponsive to any read or write requests, but its data is left intact. Upon recovery of the failed storage unit, the problem is that any writes to the storage system that occurred during the failure of the storage unit will not be reflected on the failed storage unit, rendering some of its data “stale”. In this scenario, it is possible to perform a partial rebuild (rather than a full rebuild) on the failed unit, only reconstructing data that is needed to replace the stale data.
  • While a partial rebuild is preferable to a full rebuild (reducing the amount of time that the system is in a degraded mode of operation and reducing the processing of the storage system), a tradeoff is that the storage system is required to keep track of which data needs to be rebuilt, which takes additional resources as compared to a full rebuild process.
  • SUMMARY OF THE INVENTION
  • In one embodiment, information from a log structured file system may be utilized to determine which portions of data to rebuild in a partial rebuild process. In a log structured file system, data is written to a storage system in fixed-sized blocks called “data segments”, and a segment identifier is used to identify each data segment. Each new data segment may be assigned a segment identifier that is greater than the maximum segment identifier that previously existed in the storage system. Consequently, the sequence of segment identifiers that is allocated over time could be a monotonically increasing sequence (or a strictly monotonically increasing sequence).
  • In one embodiment, when a first one of the storage units fails, an identifier of the first storage unit and the segment identifier associated with the last data segment that was written to the storage system (prior to the failure of the first storage unit) are stored in a persistent storage. Upon the first storage unit being recovered (and assuming that none of its data is lost), the storage system can refer to the information in the persistent storage to facilitate a partial rebuild of the failed storage unit. First, the storage system may determine which storage unit needs the partial rebuild based on the storage unit identifier stored in the persistent storage. Second, the storage system may rebuild only those data segments whose segment identifier is greater than the stored segment identifier.
  • In another embodiment, when a first one of the storage units fails, the identifier of the first storage unit and a first segment identifier associated with the last data segment that was written to the storage system (prior to the failure of the first storage unit) are stored in a persistent storage. Upon the first storage unit being recovered, a second segment identifier associated with the last data segment that was written to the storage system (prior to the recovery of the first storage unit) is stored in the persistent storage. Assuming that none of the data of the first storage unit was lost, the storage system can refer to the information in the persistent storage to facilitate a partial rebuild of the failed storage unit. First, the storage system may determine which storage unit needs the partial rebuild based on the storage unit identifier stored in the persistent storage. Second, the storage system may rebuild only those segments whose segment identifier is larger than the first segment identifier and less than or equal to the second segment identifier.
  • In another embodiment, the storage system may maintain a segment map in a persistent storage, the segment map associating a plurality of segment identifiers with a plurality of stripe numbers. Prior to a first one of the storage units failing, the storage system may process a first sequence of write requests, the first sequence of write requests being associated with a first sequence of the segment identifiers. In response to the first storage unit failing, the storage system may store a first one of the segment identifiers from the segment map on a second one of the storage units, the first segment identifier being associated with the last write request that was processed from the first sequence of write requests. Subsequent to the first storage unit failing and prior to a recovery of the first storage unit, the storage system may process a second sequence of write requests, the second sequence of write requests being associated with a second sequence of the segment identifiers. Subsequent to the first storage unit being recovered, the storage system may determine a set of stripe numbers associated with content to be rebuilt on the first storage unit, the determination being based on the segment map and the first segment identifier.
  • In another embodiment, the storage system may maintain a segment map in a persistent storage, the segment map associating a plurality of segment identifiers with a plurality of stripe numbers. In response to a first one of the storage units failing, the storage system may store a first one of the segment identifiers from the segment map on a second one of the storage units, the first segment identifier being associated with the last data segment that was written on the storage array before the failure of the first storage unit. In response to the first storage unit being recovered, the storage system may store a second one of the segment identifiers from the segment map on the second storage unit, the second segment identifier being associated with the last data segment that was written on the storage array before the recovery of the first storage unit. Based on the segment map and the first and second segment identifiers, the storage system may determine a set of stripe numbers associated with content to be rebuilt on the first storage unit.
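  • The embodiments above share a common computation: given the segment map and one or two stored segment identifiers, derive the set of stripe numbers whose content must be rebuilt. One way to picture that computation is the following minimal sketch (hypothetical function and parameter names; the segment map is modeled as a Python dict):

```python
def stripes_to_rebuild(segment_map, first_id, second_id=None):
    """Return the stripe numbers whose data segments must be rebuilt.

    segment_map: dict mapping segment identifier -> stripe number.
    first_id:    identifier of the last data segment fully written
                 before the storage unit failed.
    second_id:   optionally, the identifier of the last data segment
                 written before the unit recovered; if given, only
                 identifiers in the interval (first_id, second_id]
                 are selected.
    """
    return [stripe for seg_id, stripe in sorted(segment_map.items())
            if seg_id > first_id
            and (second_id is None or seg_id <= second_id)]

# Loosely modeled on the example discussed with FIGS. 3-9 below:
segment_map = {112: 5, 114: 7, 115: 4, 116: 1, 117: 6}
assert stripes_to_rebuild(segment_map, first_id=114) == [4, 1, 6]
assert stripes_to_rebuild(segment_map, 114, second_id=116) == [4, 1]
```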
  • These and other embodiments of the invention are more fully described in association with the drawings below.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 depicts a storage system communicatively coupled to a host, in accordance with one embodiment.
  • FIG. 2 depicts an arrangement of data blocks and error-correction blocks on a storage array, in accordance with one embodiment.
  • FIG. 3 depicts an example of a segment map, in accordance with one embodiment.
  • FIG. 4 depicts the data blocks and error-correction blocks of a storage array after the failure of one of its storage units, in accordance with one embodiment.
  • FIG. 5 depicts a storage system communicatively coupled to a host, and in particular information on a storage array that may facilitate the reconstruction of at least some of the contents of a storage unit of the storage array, in accordance with one embodiment.
  • FIG. 6 depicts the segment map of FIG. 3 at a later point in time, in accordance with one embodiment.
  • FIG. 7 depicts the data blocks and error-correction blocks of a storage array after the failure of one of its storage units and after the processing of a plurality of write requests, in accordance with one embodiment.
  • FIG. 8 depicts the data blocks and error-correction blocks of a storage array after the failed storage unit has transitioned back to an operational state, in accordance with one embodiment.
  • FIG. 9 depicts the data blocks and error-correction blocks of a storage array while one of its storage units is undergoing a partial rebuild process, in accordance with one embodiment.
  • FIG. 10 depicts a storage system communicatively coupled to a host, and in particular information on a storage array that may facilitate the reconstruction of at least some of the contents of a storage unit of the storage array, in accordance with one embodiment.
  • FIG. 11 depicts the data blocks and error-correction blocks of a storage array while one of its storage units is undergoing a partial rebuild process, in accordance with one embodiment.
  • FIG. 12 depicts a storage system communicatively coupled to a host, and in particular information on a storage array that may facilitate the reconstruction of at least some of the contents of a storage unit of the storage array, in accordance with one embodiment.
  • FIG. 13 depicts the data blocks and error-correction blocks of a storage array while one of its storage units is undergoing a partial rebuild process, in accordance with one embodiment.
  • FIG. 14 depicts a storage system communicatively coupled to a host, and in particular information on a storage array that may facilitate the reconstruction of at least some of the contents of a storage unit of the storage array, in accordance with one embodiment.
  • FIG. 15 depicts a flow diagram for storing information that may be used in the process of reconstructing at least some of the contents of a storage unit, and using that information to reconstruct at least some of the contents of the storage unit, in accordance with one embodiment.
  • FIG. 16 depicts a flow diagram for reconstructing at least some of the contents of a storage unit, in accordance with one embodiment.
  • FIG. 17 depicts a flow diagram for processing a read request while at least some of the contents of a storage unit are being reconstructed, in accordance with one embodiment.
  • FIG. 18 depicts a flow diagram for storing information that may be used in the process of reconstructing at least some of the contents of a storage unit, and using that information to reconstruct at least some of the contents of the storage unit, in accordance with one embodiment.
  • FIG. 19 depicts a flow diagram for reconstructing at least some of the contents of a storage unit, in accordance with one embodiment.
  • FIG. 20 depicts a flow diagram for processing a read request while at least some of the contents of a storage unit are being reconstructed, in accordance with one embodiment.
  • FIG. 21 depicts components of a computer system in which computer readable instructions instantiating the methods of the present invention may be stored and executed.
  • DETAILED DESCRIPTION OF THE INVENTION
  • In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention. Description associated with any one of the figures may be applied to a different figure containing like or similar components/steps. While the flow diagrams each present a series of steps in a certain order, the order of the steps is for one embodiment and it is understood that the order of steps may be different for other embodiments.
  • FIG. 1 depicts system 10 in which storage system 12 may be communicatively coupled to host 14, in accordance with one embodiment. Host 14 may transmit read and/or write requests to storage system 12, which in turn may process the read and/or write requests. While not depicted, storage system 12 may be communicatively coupled to host 14 via a network. The network may include a LAN, WAN, MAN, wired or wireless network, private or public network, etc.
  • Storage controller 16 of storage system 12 may receive the read and/or write requests and may process the read and/or write requests by, among other things, communicating with one or more of a plurality of storage units (28, 30, 32, 34). The plurality of storage units may be collectively referred to as storage array 26. While each of the storage units is depicted as a disk drive in FIG. 1, the techniques of the present invention are not limited to storage devices employing magnetic disk based storage. More generally, techniques of the present invention may be applied to a plurality of storage units including one or more solid-state drives (e.g., flash drives), magnetic disk drives (e.g., hard disk drives), optical drives, etc. While four disk drives have been depicted in storage array 26, this is not necessarily so, and a different number of disk drives may be employed in storage array 26.
  • Storage controller 16 may include processor 18, random access memory (RAM) 20 and non-volatile random access memory (NVRAM) 22. Processor 18 may direct the handling of read and/or write requests, and may oversee the reconstruction of at least some of the contents of a failed storage unit. More specifically, processor 18 may perform any of the processes described below in association with FIGS. 15-20. RAM 20 may store instructions that, when executed by processor 18, cause processor 18 to perform one or more of the processes of FIGS. 15-20. RAM 20 may also act as a buffer, storing yet to be processed read and/or write requests, storing data that has been retrieved from storage array 26 but not yet provided to host 14, etc. NVRAM 22 may store data that must be maintained, despite a loss of power to storage system 12.
  • Segment map 36 may be stored in NVRAM 22 or storage array 26 (as is depicted in FIG. 1), or both. In a preferred embodiment, a plurality of updates to segment map 36 may be aggregated in NVRAM 22, before being written to storage array 26 in a batch. Segment map 36 associates a plurality of segment identifiers with a plurality of stripe numbers, and is further described below in association with FIGS. 3 and 6.
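  • A minimal sketch of such batching (hypothetical class; Python dicts stand in for NVRAM 22 and the copy of the map on storage array 26) might look as follows:

```python
class BatchedSegmentMap:
    """Aggregates segment-map updates before flushing them in a batch.

    Hypothetical sketch: 'staged' stands in for the NVRAM staging
    area and 'on_array' for the copy of the segment map kept on the
    storage array; a real implementation would issue actual I/O.
    """

    def __init__(self, flush_threshold=64):
        self.staged = {}              # updates accumulated in NVRAM
        self.on_array = {}            # segment map as stored on the array
        self.flush_threshold = flush_threshold

    def update(self, segment_id, stripe_number):
        self.staged[segment_id] = stripe_number
        if len(self.staged) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # A single batched write replaces many small writes to the array.
        self.on_array.update(self.staged)
        self.staged.clear()
```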
  • FIG. 2 depicts one possible arrangement of data blocks and error-correction blocks on storage array 26. It is noted that the information depicted in FIG. 2 may only be a partial representation of the information stored on storage array 26. For example, segment map 36 is depicted in the storage array 26 of FIG. 1, but is not depicted in FIG. 2 so as to not unnecessarily clutter the presentation of FIG. 2. The term “error-correction block(s)” will be used to generally refer to any block(s) of information that is dependent on one or more data blocks and can be used to recover one or more data blocks. An example of an error-correction block is a parity block, which is typically computed using XOR operations. It is noted that an XOR operation is only one operation that may be used to compute an error-correction block, and more generally, an error-correction block may be computed based on a code, such as a Reed-Solomon code. The term “data block(s)” will be used to generally refer to any block(s) of information that might be transmitted to or from host 14. Further, it is noted that the term “block” is used to generally refer to any collection of information typically represented as one or more binary strings (e.g., “01010100”).
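  • As a concrete instance of the XOR case mentioned above, the sketch below (hypothetical helper; real arrays operate on much larger blocks) computes a parity block and shows that any single missing block can be recovered by XORing the parity block with the remaining blocks:

```python
def parity_block(blocks):
    """Bytewise XOR of equal-length blocks, as in RAID parity."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

d0, d1, d2 = b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"
p = parity_block([d0, d1, d2])
# XOR is its own inverse, so the parity block plus the surviving
# data blocks reproduce the missing data block:
assert parity_block([p, d0, d2]) == d1
```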
  • For clarity of description, reference labels are used to refer to particular data blocks. For instance, d.00 is a reference label used to refer to a data block stored on disk 0. For clarity of notation, reference labels associated with data blocks begin with the letter “d”, while reference labels associated with error-correction blocks begin with the letter “P”. For clarity of presentation, error-correction blocks are illustrated with a striped pattern. The information stored by a data block is typically in the form of a binary string (e.g., “0010101001 . . . ”). Similarly, the information stored by an error-correction block is typically in the form of a binary string (e.g., “10101010100 . . . ”). Entries of the storage unit without any data or error-correction blocks have been left blank.
  • The arrangement of data blocks and error-correction blocks of FIG. 2 is representative of a RAID 4 data redundancy scheme, in which one of the storage units (i.e., disk 3 in FIG. 2) is dedicated for storing error-correction blocks, and all other storage units (i.e., disks 0-2) are dedicated for storing data blocks. The data blocks in each row of the arrangement may belong to a data segment. For example, data blocks d.00, d.01 and d.02 may belong to a data segment, and the data segment (i.e., including d.00, d.01 and d.02) along with its error-correction block P.0 may be stored at the location of stripe 0 (which corresponds to the top row of the arrangement). To elaborate, a stripe may be interpreted as a container for storing a data segment (and its associated error-correction block), and a stripe number may be used to identify a particular stripe. A stripe typically includes a plurality of storage locations distributed across the storage units of storage array 26. While a RAID 4 data redundancy scheme is used to explain techniques of the present invention (for ease of explanation), the techniques of the present invention can be applied to other redundancy schemes, such as RAID 5, RAID 6, RAID 7, etc.
  • FIG. 3 depicts segment map 36 that allows storage system 12 to determine where a data segment is stored in storage array 26. For example, from segment map 36, storage system 12 can determine that the data segment with segment identifier 112 is stored at stripe number 5 (i.e., comprising the data blocks d.50, d.51 and d.52). More generally, segment map 36 associates each segment identifier with a stripe number.
  • In one embodiment of a log structured file system, each new data segment is assigned a segment identifier that is greater than the maximum segment identifier that previously existed on the storage system. For example, the next segment identifier added to segment map 36 could be segment identifier 116. Consequently, the sequence of segment identifiers that is allocated over time may be a monotonically increasing sequence (or a strictly monotonically increasing sequence). Typically, a segment identifier is a 64-bit number, so there is not a concern that storage system 12 will ever reach the maximum segment identifier and need to wrap the segment identifier around to 0.
  • Conceptually, segment map 36 may be viewed as a timeline, recording the order in which storage system 12 has written to the stripes of storage array 26 over time. Segment map 36 indicates that a data segment was written to stripe number 6, then a data segment was written to stripe number 0, then a data segment was written to stripe number 1, and so on. To be more precise, segment map 36 may only provide a partial timeline, as entries (i.e., rows) of the segment map 36 could be deleted. In other words, the stripe numbers are ordered in chronological order (with respect to ascending segment identifiers), but the sequence of the stripe numbers could have some missing entries due to deleted data segments. For example, if the data segment with segment identifier 113 were deleted, the row with segment identifier 113 could be deleted from segment map 36.
  • When a data segment is modified, it is assigned a new segment identifier. For instance, if the data segment with segment identifier 111 was modified, a new row in segment map 36 could be created, associating segment identifier 116 with stripe number 1 (i.e., the stripe number formerly associated with segment identifier 111); and the row with segment identifier 111 could be deleted.
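  • A minimal sketch of this modify behavior (hypothetical class; the row deletion and insertion of the preceding paragraph map onto dict operations, and identifier allocation is simplified to max-plus-one) is shown below:

```python
class SegmentMap:
    """Maps segment identifiers to stripe numbers (hypothetical sketch)."""

    def __init__(self, rows=None):
        self.rows = dict(rows or {})          # segment id -> stripe number

    def _next_id(self):
        # New identifiers exceed every identifier previously issued.
        return max(self.rows, default=-1) + 1

    def modify(self, old_segment_id):
        """Re-bind a modified data segment to a fresh, higher identifier."""
        stripe = self.rows.pop(old_segment_id)   # delete the old row
        new_id = self._next_id()
        self.rows[new_id] = stripe               # add the new row
        return new_id
```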
  • While a log structured file system could have a sequence of monotonically increasing segment identifiers, other sequences of segment identifiers could be used, so long as the sequence can be used to distinguish at least two points in time. For example, a monotonically decreasing (or strictly monotonically decreasing) sequence of segment identifiers could be used, in which time progression could be associated with decreasing segment identifiers. It is noted that another name for a monotonically increasing sequence is a non-decreasing sequence and another name for a monotonically decreasing sequence is a non-increasing sequence. Increasing (or decreasing) segment identifiers could be associated with progressing time, while a run of identical segment identifiers could be associated with a time period.
  • In FIGS. 4-14, a partial rebuild process employing techniques of one embodiment of the present invention is described in more detail. FIG. 4 depicts the scenario in which disk 1 has failed. The contents of disk 1 are no longer accessible, and hence are represented as “--”.
  • In FIG. 5, information for facilitating a partial rebuild process of a failed storage unit may be stored on persistent storage (e.g., storage unit 32, also referred to as disk 2). The information may include an identifier of the failed storage unit (e.g., a serial number of the failed storage unit). In the current example, assume the identifier of disk 1 (which has failed) is 0001, so 0001 is stored on the persistent storage. The information may also include the segment identifier associated with the last data segment that was written to storage array 26 prior to the failure of disk 1. In the current example, assume that the data segment with segment identifier 114 was the last data segment to be fully written to storage array 26. In other words (referring to FIG. 2), d.70, d.71, d.72 and P.7 were the last information blocks to be written to storage array 26 before the failure of disk 1. Therefore, as depicted in FIG. 5, the segment identifier stored on the persistent storage is 114.
  • It is noted that the segment identifier that is stored on the persistent storage may not be the maximum segment identifier that is present in segment map 36. In the currently discussed example, segment identifier 114 was stored on the persistent storage, but segment map 36 also contained segment identifier 115. Such segment identifiers (i.e., those greater than the segment identifier written to the persistent storage) could correspond to a data segment that was only partially written to storage array 26 when one of its storage units failed. Alternatively or in addition, such segment identifiers could also correspond to data segments located in a write buffer (e.g., in a portion of RAM 20 or in a portion of NVRAM 22) that have not yet been written to storage array 26 prior to the failure of the storage unit.
  • In FIG. 5, the information to be used to facilitate a partial rebuild process is stored on disk 2 (e.g., in a RAID superblock of disk 2). In another embodiment, such information could be stored on one or more of the storage units that are still in operation. Indeed, such information could be stored on disk 0, disk 2 and disk 3, for extra reliability (e.g., in the RAID superblocks of disks 0, 2 and 3). In yet another embodiment, such information could be stored in NVRAM 22.
  • After the failure of disk 1, storage system 12 may process additional write requests, and for purposes of explanation, assume that two additional write requests are received. The state of segment map 36 is depicted in FIG. 6 following these two write requests. The first request is to modify the data segment with segment identifier 111, and the second request (following the first request) is to modify the data segment with segment identifier 109. As a result of the first request, the entry with segment identifier 111 is deleted from segment map 36 and a new entry mapping segment identifier 116 to stripe number 1 (i.e., the stripe number formerly associated with segment identifier 111) is added to segment map 36. As a result of the second request, the entry with segment identifier 109 is deleted from segment map 36 and a new entry mapping segment identifier 117 to stripe number 6 (i.e., the stripe number formerly associated with segment identifier 109) is added to segment map 36.
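  • Under the hypothetical SegmentMap sketch given earlier, these two write requests would be applied as follows (a usage example, not part of the specification):

```python
smap = SegmentMap({109: 6, 111: 1, 114: 7, 115: 4})
assert smap.modify(111) == 116       # stripe 1 re-bound to identifier 116
assert smap.rows[116] == 1
assert smap.modify(109) == 117       # stripe 6 re-bound to identifier 117
assert smap.rows[117] == 6
```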
  • Further assume that the data segments with segment identifiers 115, 116 and 117 are written to storage array 26 while disk 1 has failed (i.e., while storage array 26 is in a degraded mode of operation). The state of storage array 26 is depicted in FIG. 7 following such data segments being written to the storage array. Stripe 4 of storage array 26 is now occupied with data blocks d.40′ and d.42′, and error-correction block P.4′. Notice that no data block has been written to disk 1, since it has failed. Stripe 1 of storage array 26 is now occupied with data blocks d.10′ and d.12′, and error-correction block P.1′, and stripe 6 of storage array 26 is now occupied with data blocks d.60′ and d.62′, and error-correction block P.6′.
  • It is noted that some of the data and error-correction blocks have been labeled with an apostrophe to indicate that such blocks were written after the failure of disk 1. Such designation is for illustration purposes only (for clarity of explanation) and storage array 26 may not actually store a designator with each block to indicate whether it was written and/or modified prior to or after the failure of disk 1.
  • FIG. 8 depicts the state of storage array 26 after disk 1 has been recovered. Assume that the failure of disk 1 did not affect its contents, such that upon its recovery, the contents of disk 1 are identical to the contents of disk 1 immediately prior to its failure. One can see that the contents of disk 1 are identical between FIG. 2 and FIG. 8. Now, storage system 12 is tasked with determining which data of disk 1 is stale and requires rebuilding. From the arrangement of data and error-correction blocks of FIG. 8, storage system 12 may be able to determine that stripe 4 needs to be rebuilt on disk 1 (as the absence of data on disk 1 for stripe 4 and the presence of data on the other disks for stripe 4 would indicate that a write to stripe 4 occurred during the failure of disk 1), but it would not be able to determine (from only the information presented in FIG. 8) which other stripes need to be rebuilt on disk 1.
  • According to one embodiment, storage system 12 reads the storage unit identifier that has been stored in persistent storage (i.e., the information depicted in FIG. 5). From the storage unit identifier (i.e., 0001 in the current example), storage system 12 can determine that a partial rebuild may need to be performed on disk 1. Further, storage system 12 reads the segment identifier that has been stored in persistent storage (i.e., 114 in the current example). From this information, storage system 12 can determine which data segments to rebuild. Since the segment identifiers are allocated in a monotonically increasing manner (at least in the presently discussed embodiment), any data segments that are written to storage array 26 subsequent to the failure of disk 1 will have a segment identifier that is greater than the stored segment identifier. Therefore, in one embodiment, storage system 12 can rebuild the data segments corresponding to segment identifiers that are greater than the stored segment identifier. In the present example, the stored segment identifier was 114, so the data segments that will be rebuilt on disk 1 are 115, 116 and 117, which correspond to stripes 4, 1 and 6, respectively.
  • FIGS. 9-14 illustrate an iterative method to partially rebuild the contents of a failed storage unit, in accordance with one embodiment. First, storage system 12 determines whether the stored segment identifier (i.e., 114 in the current example) is the maximum segment identifier present in segment map 36. If so, the process ends. Since it is not the maximum segment identifier present in segment map 36 (i.e., the maximum segment identifier in the current example is 117), storage system 12 rebuilds a portion of the data segment associated with the next higher segment identifier (i.e., the next higher segment identifier in segment map 36) on disk 1. In the present example, the next higher segment identifier is 115, which is mapped to stripe number 4. Accordingly, storage system 12 generates data block d.41′ (from data blocks d.40′ and d.42′ and error-correction block P.4′) and stores data block d.41′ on disk 1, as depicted in FIG. 9. It is noted that the specific techniques for generating a data block from other data blocks and redundant information (e.g., error-correction blocks) are known in the art, and will not be explained herein for conciseness. Next, the stored segment identifier is advanced to the next higher segment identifier in the segment map. In the present example, the stored segment identifier is advanced to 115, as depicted in FIG. 10.
  • Next, storage system 12 determines whether the stored segment identifier (i.e., segment identifier 115) is the maximum segment identifier present in segment map 36. If so, the process ends. Since it is not the maximum segment identifier present in segment map 36, storage system 12 rebuilds a portion of the data segment associated with the next higher segment identifier on disk 1. In the present example, the next higher segment identifier is 116, which is mapped to stripe number 1. Accordingly, storage system 12 generates data block d.11′ (from data blocks d.10′ and d.12′ and error-correction block P.1′) and stores data block d.11′ on disk 1, as depicted in FIG. 11. Next, the stored segment identifier is advanced to the next higher segment identifier in the segment map. In the present example, the stored segment identifier is advanced to 116, as depicted in FIG. 12.
  • Next, storage system 12 determines whether the stored segment identifier (i.e., segment identifier 116) is the maximum segment identifier present in segment map 36. If so, the process ends. Since it is not the maximum segment identifier present in segment map 36, storage system 12 rebuilds a portion of the data segment associated with the next higher segment identifier on disk 1. In the present example, the next higher segment identifier is 117, which is mapped to stripe number 6. Accordingly, storage system 12 generates data block d.61′ (from data blocks d.60′ and d.62′ and error-correction block P.6′) and stores data block d.61′ on disk 1, as depicted in FIG. 13. Next, the stored segment identifier is advanced to the next higher segment identifier in the segment map. In the present example, the stored segment identifier is advanced to 117, as depicted in FIG. 14.
  • Next, storage system 12 determines whether the stored segment identifier (i.e., 117) is the maximum segment identifier present in segment map 36. Since it is the maximum segment identifier in segment map 36, the partial rebuild process concludes.
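  • Putting the iterations of FIGS. 9-14 together, the loop can be sketched as follows (hypothetical function; read_block and write_block are stand-ins for the storage system's block I/O, blocks are modeled as byte strings, and the XOR reconstruction is one instance of the techniques noted above as known in the art):

```python
def partial_rebuild(segment_map, stored_id, surviving_disks, failed_disk,
                    read_block, write_block):
    """Iteratively rebuild the failed disk's portion of each stale stripe.

    Hypothetical sketch: advance the stored segment identifier to the
    next higher identifier in the segment map, regenerate the missing
    block of the associated stripe by XORing the surviving blocks
    (data and error-correction), and repeat until the stored
    identifier equals the maximum identifier in the map.
    """
    while stored_id != max(segment_map):
        stored_id = min(i for i in segment_map if i > stored_id)
        stripe = segment_map[stored_id]
        missing = None
        for disk in surviving_disks:
            block = read_block(disk, stripe)
            missing = block if missing is None else bytes(
                a ^ b for a, b in zip(missing, block))
        write_block(failed_disk, stripe, missing)

# Toy usage mirroring the example: disks 0, 2 and 3 survive, disk 1
# has failed, and stripes 4, 1 and 6 are rebuilt, in that order.
array = {(d, s): bytes([16 * d + s]) for d in (0, 2, 3) for s in range(8)}
rebuilt = []
partial_rebuild({114: 7, 115: 4, 116: 1, 117: 6}, 114, (0, 2, 3), 1,
                read_block=lambda d, s: array[(d, s)],
                write_block=lambda d, s, b: rebuilt.append(s))
assert rebuilt == [4, 1, 6]
```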
  • To emphasize the advantages of one embodiment of the present invention, storage system 12 was able to determine which data segments to rebuild based solely on segment map 36 and a single segment identifier stored on the persistent storage. Segment map 36 is required for normal operation of storage system 12, so the only storage overhead required to enable a partial rebuild of a failed storage unit is the storing of the segment identifier on the persistent storage. During the partial rebuild process, the determination of whether a data segment needs to be rebuilt is also a computationally efficient step, only requiring that the segment identifier of the data segment be compared to the stored segment identifier.
  • In FIGS. 15-20, various processes associated with embodiments of the present invention are depicted in the form of flow diagrams. FIG. 15 depicts flow diagram 100 for storing information that may be used in the process of reconstructing at least some of the contents of a storage unit, and using that information to reconstruct at least some of the contents of the storage unit, in accordance with one embodiment. At step 102, storage system 12 may maintain a segment map in a persistent storage, the segment map associating a plurality of segment identifiers with a plurality of stripe numbers. At step 104, prior to a first one of the storage units failing, storage system 12 may process a first sequence of write requests, the first sequence of write requests being associated with a first sequence of the segment identifiers. In one embodiment, the first sequence of the segment identifiers is a monotonically increasing sequence, while in another embodiment, the first sequence of the segment identifiers is a monotonically decreasing sequence. At step 106, in response to the first storage unit failing, storage system 12 may store a first one of the segment identifiers from the segment map on a second one of the storage units, the first segment identifier being associated with the last write request that was processed from the first sequence of write requests. At step 108, in response to the first storage unit failing, storage system 12 may store an identifier of the first storage unit on the second storage unit. At step 110, subsequent to the first storage unit failing and prior to a recovery of the first storage unit, storage system 12 may process a second sequence of write requests, the second sequence of write requests being associated with a second sequence of the segment identifiers. In one embodiment, the second sequence of the segment identifiers is a monotonically increasing sequence, while in another embodiment, the second sequence of the segment identifiers is a monotonically decreasing sequence. At step 112, subsequent to the first storage unit being recovered, storage system 12 may determine, based on the identifier of the first storage unit that was stored on the second storage unit, that a partial rebuild process needs to be performed on the first storage unit. At step 114, storage system 12 may determine a set of stripe numbers associated with the content to be rebuilt on the first storage unit, the determination being based on the segment map and the first segment identifier. In one embodiment, the determination may be based solely on the segment map and the first segment identifier. The set of stripe numbers that are determined may identify stripes associated with data segments having segment identifiers greater than the first segment identifier. Finally, at step 116, storage system 12 may, for each stripe identified in the set of stripe numbers, rebuild on the first storage unit a portion of the content that belongs to the stripe and the first storage unit.
  • FIG. 16 depicts flow diagram 200 that elaborates upon step 116 of FIG. 15. At step 202, storage system 12 may determine whether the first segment identifier is the maximum segment identifier. If so, the partial rebuild process ends. Otherwise, storage system 12 may rebuild on the first storage unit a portion of a data segment associated with the next higher segment identifier in segment map 36 (step 204). At step 206, the first segment identifier may be advanced to the next higher segment identifier in segment map 36. The process may then repeat from previously described step 202.
  • FIG. 17 depicts flow diagram 300 for processing a read request while at least some of the contents of a storage unit are being reconstructed, in accordance with one embodiment. At step 302, storage system 12 may receive a read request from host 14 for data on the first storage unit while contents of the first storage unit are being reconstructed. At step 304, storage system 12 may determine whether the requested data needs to be reconstructed. In one embodiment, such determination may involve comparing the segment identifier associated with the read request with the first segment identifier. If the segment identifier of the read request is less than or equal to the first segment identifier, the requested data may not need to be reconstructed and the requested data can be directly read from the first storage unit (step 306) and transmitted to the host device (308). Otherwise, if the segment identifier of the read request is greater than the first segment identifier, the requested data may be reconstructed (step 310) and the reconstructed data may be transmitted to the host device (step 312).
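  • The read-path decision of FIG. 17 reduces to a single comparison, as in this minimal sketch (hypothetical names; read_direct and reconstruct stand in for steps 306 and 310):

```python
def handle_read(request_seg_id, first_id, read_direct, reconstruct):
    """Serve a read that targets the recovering storage unit.

    Hypothetical sketch: data written no later than the stored (first)
    segment identifier is still valid on the recovering unit and is
    read directly; anything newer is stale there and must be
    reconstructed from the surviving blocks.
    """
    if request_seg_id <= first_id:
        return read_direct()      # step 306: read from the first unit
    return reconstruct()          # step 310: reconstruct on the fly
```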
  • FIGS. 18-20 depict a variant of the processes depicted in FIGS. 15-17. For conciseness, steps that have already been described will not be described again. In flow diagram 400 of FIG. 18, storage system 12, in addition to recording the first segment identifier, also records a second segment identifier (step 402). The second segment identifier may be associated with the last write request that was processed from the second sequence of write requests (i.e., the sequence of write requests while the storage array was in a degraded mode of operation). In other words, the second segment identifier may identify the last data segment that was written during the degraded mode of operation, and hence may identify the last data segment that requires reconstruction. At step 404, storage system 12 may determine a set of stripe numbers associated with content to be rebuilt on the first storage unit, the determination being based on the segment map and the first and second segment identifiers. In one embodiment, the determination is solely based on the segment map and the first and second segment identifiers. The set of stripe numbers that are determined may identify stripes associated with data segments having segment identifiers greater than the first segment identifier and less than or equal to the second segment identifier.
  • It is noted that the process of FIG. 18 may eliminate some redundant computation, as compared to the process of FIG. 15. For instance, if data segments are written to storage array 26 after the first storage unit has been recovered, no data reconstruction should be needed for such data segments. Nevertheless, the process of FIG. 15 may perform data reconstruction for such data segments (even though it is not needed), while the process of FIG. 18 will avoid such data reconstruction.
  • Flow diagram 500 (depicted in FIG. 19) is similar to flow diagram 200 (depicted in FIG. 16), except for step 502. In step 502, storage system 12 may determine whether the first segment identifier is equal to the second segment identifier as a termination condition, rather than determining whether the first segment identifier is the maximum segment identifier (as in step 202).
  • Flow diagram 600 (depicted in FIG. 20) is similar to flow diagram 300 (depicted in FIG. 17), except for step 604. In step 604, storage system 12 may determine whether the requested data needs to be reconstructed by comparing the segment identifier associated with the read request with the first and second segment identifiers. More specifically, storage system 12 may determine whether the segment identifier associated with the read request is greater than the first segment identifier and less than or equal to the second segment identifier. If so, the requested data is reconstructed (step 310), otherwise, the requested data can be read from the first storage unit (step 306).
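  • Under the two-identifier variant, the test of step 604 becomes a range check, as in this minimal sketch (hypothetical name):

```python
def needs_reconstruction(request_seg_id, first_id, second_id):
    """True only for segments written while the array was degraded,
    i.e. identifiers in the half-open interval (first_id, second_id]."""
    return first_id < request_seg_id <= second_id
```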
  • As is apparent from the foregoing discussion, aspects of the present invention involve the use of various computer systems and computer readable storage media having computer-readable instructions stored thereon. FIG. 21 provides an example of a system 700 that is representative of any of the computing systems discussed herein. Further, computer system 700 may be representative of a system that performs any of the processes depicted in FIGS. 15-20. Note that not all of the various computer systems have all of the features of system 700. For example, certain ones of the computer systems discussed above may not include a display, inasmuch as the display function may be provided by a client computer communicatively coupled to the computer system, or a display function may be unnecessary. Such details are not critical to the present invention.
  • System 700 includes a bus 702 or other communication mechanism for communicating information, and a processor 704 coupled with the bus 702 for processing information. Computer system 700 also includes a main memory 706, such as a random access memory (RAM) or other dynamic storage device, coupled to the bus 702 for storing information and instructions to be executed by processor 704. Main memory 706 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 704. Computer system 700 further includes a read only memory (ROM) 708 or other static storage device coupled to the bus 702 for storing static information and instructions for the processor 704. A storage device 710, which may be one or more of a floppy disk, a flexible disk, a hard disk, flash memory-based storage medium, magnetic tape or other magnetic storage medium, a compact disk (CD)-ROM, a digital versatile disk (DVD)-ROM, or other optical storage medium, or any other storage medium from which processor 704 can read, is provided and coupled to the bus 702 for storing information and instructions (e.g., operating systems, applications programs and the like).
  • Computer system 700 may be coupled via the bus 702 to a display 712, such as a flat panel display, for displaying information to a computer user. An input device 714, such as a keyboard including alphanumeric and other keys, may be coupled to the bus 702 for communicating information and command selections to the processor 704. Another type of user input device is cursor control device 716, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 704 and for controlling cursor movement on the display 712. Other user interface devices, such as microphones, speakers, etc. are not shown in detail but may be involved with the receipt of user input and/or presentation of output.
  • The processes referred to herein may be implemented by processor 704 executing appropriate sequences of computer-readable instructions contained in main memory 706. Such instructions may be read into main memory 706 from another computer-readable medium, such as storage device 710, and execution of the sequences of instructions contained in the main memory 706 causes the processor 704 to perform the associated actions. In alternative embodiments, hard-wired circuitry or firmware-controlled processing units (e.g., field programmable gate arrays) may be used in place of or in combination with processor 704 and its associated computer software instructions to implement the invention. The computer-readable instructions may be rendered in any computer language including, without limitation, C#, C/C++, Fortran, COBOL, PASCAL, assembly language, markup languages (e.g., HTML, SGML, XML, VoXML), and the like, as well as object-oriented environments such as the Common Object Request Broker Architecture (CORBA), Java™ and the like. In general, all of the aforementioned terms are meant to encompass any series of logical steps performed in a sequence to accomplish a given purpose, which is the hallmark of any computer-executable application. Unless specifically stated otherwise, it should be appreciated that throughout the description of the present invention, use of terms such as “processing”, “computing”, “calculating”, “determining”, “displaying”, “receiving”, “transmitting” or the like, refer to the action and processes of an appropriately programmed computer system, such as computer system 700 or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within its registers and memories into other data similarly represented as physical quantities within its memories or registers or other such information storage, transmission or display devices.
  • Computer system 700 also includes a communication interface 718 coupled to the bus 702. Communication interface 718 may provide a two-way data communication channel with a computer network, which provides connectivity to and among the various computer systems discussed above. For example, communication interface 718 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN, which itself is communicatively coupled to the Internet through one or more Internet service provider networks. The precise details of such communication paths are not critical to the present invention. What is important is that computer system 700 can send and receive messages and data through the communication interface 718 and in that way communicate with hosts accessible via the Internet.
  • Thus, methods and systems for reconstructing at least some of the contents of a storage unit following the failure of the storage unit have been described. It is to be understood that the above-description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (20)

What is claimed is:
1. A method for a storage system having a storage array, the storage array having a plurality of storage units, the method comprising:
maintaining a segment map in a persistent storage, the segment map associating a plurality of segment identifiers with a plurality of stripe numbers;
in response to a first one of the storage units failing, storing a first one of the segment identifiers from the segment map on a second one of the storage units, the first segment identifier being associated with the last data segment that was written on the storage array before the failure of the first storage unit;
in response to the first storage unit being recovered, storing a second one of the segment identifiers from the segment map on the second storage unit, the second segment identifier being associated with the last data segment that was written on the storage array before the recovery of the first storage unit; and
determining a set of stripe numbers associated with content to be rebuilt on the first storage unit, the determination being based on the segment map and the first and second segment identifiers.
2. The method of claim 1, wherein the first segment identifier identifies the last data segment that was written on the storage array before the failure of the first storage unit.
3. The method of claim 1, wherein the second segment identifier identifies the last data segment that was written on the storage array before the recovery of the first storage unit.
4. The method of claim 1, wherein the set of stripe numbers are associated with ones of the segment identifiers that are greater than the first segment identifier and less than or equal to the second segment identifier.
5. The method of claim 1, further comprising, for each stripe identified in the set of stripe numbers, rebuilding on the first storage unit a portion of the content that belongs to the stripe and the first storage unit.
6. The method of claim 1, wherein the persistent storage is a part of the plurality of storage units.
7. The method of claim 1, wherein the persistent storage is a part of a non-volatile random access memory (NVRAM) of the storage system.
8. The method of claim 1, wherein the determination is solely based on the segment map and the first and second segment identifiers.
9. The method of claim 1, further comprising, each time a portion of a data segment is rebuilt on the first storage unit, advancing the first segment identifier to a segment identifier in the segment map that is higher in value than the first segment identifier.
10. The method of claim 9, further comprising:
receiving a read request from a host device for data on the first storage unit while a partial rebuild process is being performed on the first storage unit;
determining whether the read request is directed at one or more stripes having stripe numbers within the set of stripe numbers; and
if so, reconstructing the requested data and transmitting the reconstructed data to the host device, otherwise, transmitting the requested data from the first storage unit to the host device.
11. A storage system, comprising:
a plurality of storage units;
a persistent storage, wherein the persistent storage is a part of the plurality of storage units;
a main memory;
a processor communicatively coupled to the plurality of storage units and the main memory; and
software instructions on the main memory that, when executed by the processor, cause the processor to:
maintain a segment map in the persistent storage, the segment map associating a plurality of segment identifiers with a plurality of stripe numbers;
in response to a first one of the storage units failing, store a first one of the segment identifiers from the segment map on a second one of the storage units, the first segment identifier being associated with the last data segment that was written on the storage array before the failure of the first storage unit;
in response to the first storage unit being recovered, store a second one of the segment identifiers from the segment map on the second storage unit, the second segment identifier being associated with the last data segment that was written on the storage array before the recovery of the first storage unit; and
determine a set of stripe numbers associated with content to be rebuilt on the first storage unit, the determination being based on the segment map and the first and second segment identifiers.
12. The storage system of claim 11, wherein the first segment identifier identifies the last data segment that was written on the storage array before the failure of the first storage unit.
13. The storage system of claim 11, wherein the second segment identifier identifies the last data segment that was written on the storage array before the recovery of the first storage unit.
14. The storage system of claim 11, wherein the set of stripe numbers are associated with ones of the segment identifiers that are greater than the first segment identifier and less than or equal to the second segment identifier.
15. The storage system of claim 11, further comprising software instructions on the main memory that, when executed by the processor, cause the processor to, for each stripe identified in the set of stripe numbers, rebuild on the first storage unit a portion of the content that belongs to the stripe and the first storage unit.
16. A non-transitory machine-readable storage medium for a storage system having a plurality of storage units, a main memory, and a processor communicatively coupled to the plurality of storage units and the main memory, the non-transitory machine-readable storage medium comprising software instructions that, when executed by the processor, cause the processor to:
maintain a segment map in one or more of the storage units, the segment map associating a plurality of segment identifiers with a plurality of stripe numbers;
in response to a first one of the storage units failing, store a first one of the segment identifiers from the segment map on a second one of the storage units, the first segment identifier being associated with the last data segment that was written on the storage array before the failure of the first storage unit;
in response to the first storage unit being recovered, store a second one of the segment identifiers from the segment map on the second storage unit, the second segment identifier being associated with the last data segment that was written on the storage array before the recovery of the first storage unit; and
determine a set of stripe numbers associated with content to be rebuilt on the first storage unit, the determination being based on the segment map and the first and second segment identifiers.
17. The non-transitory machine-readable storage medium of claim 16, wherein the first segment identifier identifies the last data segment that was written on the storage array before the failure of the first storage unit.
18. The non-transitory machine-readable storage medium of claim 16, wherein the second segment identifier identifies the last data segment that was written on the storage array before the recovery of the first storage unit.
19. The non-transitory machine-readable storage medium of claim 16, wherein the set of stripe numbers are associated with ones of the segment identifiers that are greater than the first segment identifier and less than or equal to the second segment identifier.
20. The non-transitory machine-readable storage medium of claim 16, further comprising software instructions that, when executed by the processor, cause the processor to, for each stripe identified in the set of stripe numbers, rebuild on the first storage unit a portion of the content that belongs to the stripe and the first storage unit.
US14/463,567 2014-07-29 2014-08-19 Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system Abandoned US20160034370A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/463,567 US20160034370A1 (en) 2014-07-29 2014-08-19 Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14/446,191 US10684927B2 (en) 2014-07-29 2014-07-29 Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system
US14/463,567 US20160034370A1 (en) 2014-07-29 2014-08-19 Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/446,191 Continuation US10684927B2 (en) 2014-07-29 2014-07-29 Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system

Publications (1)

Publication Number Publication Date
US20160034370A1 true US20160034370A1 (en) 2016-02-04

Family

ID=55180069

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/446,191 Active 2035-06-18 US10684927B2 (en) 2014-07-29 2014-07-29 Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system
US14/463,567 Abandoned US20160034370A1 (en) 2014-07-29 2014-08-19 Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US14/446,191 Active 2035-06-18 US10684927B2 (en) 2014-07-29 2014-07-29 Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system

Country Status (1)

Country Link
US (2) US10684927B2 (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10140185B1 (en) * 2015-03-31 2018-11-27 Maginatics Llc Epoch based snapshot summary
US10263881B2 (en) * 2016-05-26 2019-04-16 Cisco Technology, Inc. Enforcing strict shortest path forwarding using strict segment identifiers
US10341221B2 (en) 2015-02-26 2019-07-02 Cisco Technology, Inc. Traffic engineering for bit indexed explicit replication
CN110058791A (en) * 2018-01-18 2019-07-26 伊姆西Ip控股有限责任公司 Storage system and corresponding method and computer-readable medium
US10382334B2 (en) 2014-03-06 2019-08-13 Cisco Technology, Inc. Segment routing extension headers
US10409682B1 (en) * 2017-02-24 2019-09-10 Seagate Technology Llc Distributed RAID system
US10469370B2 (en) 2012-10-05 2019-11-05 Cisco Technology, Inc. Segment routing techniques
US10469325B2 (en) 2013-03-15 2019-11-05 Cisco Technology, Inc. Segment routing: PCE driven dynamic setup of forwarding adjacencies and explicit path
US20200043524A1 (en) * 2018-08-02 2020-02-06 Western Digital Technologies, Inc. RAID Storage System with Logical Data Group Priority
US10601707B2 (en) 2014-07-17 2020-03-24 Cisco Technology, Inc. Segment routing using a remote forwarding adjacency identifier
US20210027115A1 (en) * 2019-07-22 2021-01-28 EMC IP Holding Company LLC Generating compressed representations of sorted arrays of identifiers
US11032197B2 (en) 2016-09-15 2021-06-08 Cisco Technology, Inc. Reroute detection in segment routing data plane
US11132256B2 (en) * 2018-08-03 2021-09-28 Western Digital Technologies, Inc. RAID storage system with logical data group rebuild
US11722404B2 (en) 2019-09-24 2023-08-08 Cisco Technology, Inc. Communicating packets across multi-domain networks using compact forwarding instructions
US20230251932A1 (en) * 2020-09-11 2023-08-10 Netapp Inc. Persistent memory file system reconciliation

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104244B (en) * 2018-10-29 2023-08-29 伊姆西Ip控股有限责任公司 Method and apparatus for reconstructing data in a storage array set

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060161805A1 (en) * 2005-01-14 2006-07-20 Charlie Tseng Apparatus, system, and method for differential rebuilding of a reactivated offline RAID member disk
US20150127891A1 (en) * 2013-11-04 2015-05-07 Falconstor, Inc. Write performance preservation with snapshots

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5533190A (en) 1994-12-21 1996-07-02 At&T Global Information Solutions Company Method for maintaining parity-data consistency in a disk array
US6594745B2 (en) 2001-01-31 2003-07-15 Hewlett-Packard Development Company, L.P. Mirroring agent accessible to remote host computers, and accessing remote data-storage devices, via a communcations medium
US7055058B2 (en) * 2001-12-26 2006-05-30 Boon Storage Technologies, Inc. Self-healing log-structured RAID
US7103884B2 (en) * 2002-03-27 2006-09-05 Lucent Technologies Inc. Method for maintaining consistency and performing recovery in a replicated data storage system
US8429654B2 (en) * 2006-07-06 2013-04-23 Honeywell International Inc. Apparatus and method for guaranteed batch event delivery in a process control system
US8812901B2 (en) * 2011-09-23 2014-08-19 Lsi Corporation Methods and apparatus for marking writes on a write-protected failed device to avoid reading stale data in a RAID storage system
JP5954081B2 (en) * 2012-09-26 2016-07-20 富士通株式会社 Storage control device, storage control method, and storage control program
US9454434B2 (en) * 2014-01-17 2016-09-27 Netapp, Inc. File system driven raid rebuild technique

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060161805A1 (en) * 2005-01-14 2006-07-20 Charlie Tseng Apparatus, system, and method for differential rebuilding of a reactivated offline RAID member disk
US20150127891A1 (en) * 2013-11-04 2015-05-07 Falconstor, Inc. Write performance preservation with snapshots

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10469370B2 (en) 2012-10-05 2019-11-05 Cisco Technology, Inc. Segment routing techniques
US11424987B2 (en) 2013-03-15 2022-08-23 Cisco Technology, Inc. Segment routing: PCE driven dynamic setup of forwarding adjacencies and explicit path
US10469325B2 (en) 2013-03-15 2019-11-05 Cisco Technology, Inc. Segment routing: PCE driven dynamic setup of forwarding adjacencies and explicit path
US11689427B2 (en) 2013-03-15 2023-06-27 Cisco Technology, Inc. Segment routing over label distribution protocol
US11784889B2 (en) 2013-03-15 2023-10-10 Cisco Technology, Inc. Segment routing over label distribution protocol
US11290340B2 (en) 2013-03-15 2022-03-29 Cisco Technology, Inc. Segment routing over label distribution protocol
US10764146B2 (en) 2013-03-15 2020-09-01 Cisco Technology, Inc. Segment routing over label distribution protocol
US10382334B2 (en) 2014-03-06 2019-08-13 Cisco Technology, Inc. Segment routing extension headers
US11336574B2 (en) 2014-03-06 2022-05-17 Cisco Technology, Inc. Segment routing extension headers
US11374863B2 (en) 2014-03-06 2022-06-28 Cisco Technology, Inc. Segment routing extension headers
US10601707B2 (en) 2014-07-17 2020-03-24 Cisco Technology, Inc. Segment routing using a remote forwarding adjacency identifier
US10341221B2 (en) 2015-02-26 2019-07-02 Cisco Technology, Inc. Traffic engineering for bit indexed explicit replication
US10693765B2 (en) 2015-02-26 2020-06-23 Cisco Technology, Inc. Failure protection for traffic-engineered bit indexed explicit replication
US10958566B2 (en) 2015-02-26 2021-03-23 Cisco Technology, Inc. Traffic engineering for bit indexed explicit replication
US10341222B2 (en) 2015-02-26 2019-07-02 Cisco Technology, Inc. Traffic engineering for bit indexed explicit replication
US10983868B2 (en) 2015-03-31 2021-04-20 EMC IP Holding Company LLC Epoch based snapshot summary
US10140185B1 (en) * 2015-03-31 2018-11-27 Maginatics Llc Epoch based snapshot summary
US10742537B2 (en) 2016-05-26 2020-08-11 Cisco Technology, Inc. Enforcing strict shortest path forwarding using strict segment identifiers
US11671346B2 (en) 2016-05-26 2023-06-06 Cisco Technology, Inc. Enforcing strict shortest path forwarding using strict segment identifiers
US10263881B2 (en) * 2016-05-26 2019-04-16 Cisco Technology, Inc. Enforcing strict shortest path forwarding using strict segment identifiers
US11323356B2 (en) 2016-05-26 2022-05-03 Cisco Technology, Inc. Enforcing strict shortest path forwarding using strict segment identifiers
US11489756B2 (en) 2016-05-26 2022-11-01 Cisco Technology, Inc. Enforcing strict shortest path forwarding using strict segment identifiers
US11032197B2 (en) 2016-09-15 2021-06-08 Cisco Technology, Inc. Reroute detection in segment routing data plane
US10409682B1 (en) * 2017-02-24 2019-09-10 Seagate Technology Llc Distributed RAID system
CN110058791A (en) * 2018-01-18 2019-07-26 EMC IP Holding Company LLC Storage system and corresponding method and computer-readable medium
US20200043524A1 (en) * 2018-08-02 2020-02-06 Western Digital Technologies, Inc. RAID Storage System with Logical Data Group Priority
US10825477B2 (en) * 2018-08-02 2020-11-03 Western Digital Technologies, Inc. RAID storage system with logical data group priority
US11132256B2 (en) * 2018-08-03 2021-09-28 Western Digital Technologies, Inc. RAID storage system with logical data group rebuild
US11720251B2 (en) * 2019-07-22 2023-08-08 EMC IP Holding Company LLC Generating compressed representations of sorted arrays of identifiers
US20210027115A1 (en) * 2019-07-22 2021-01-28 EMC IP Holding Company LLC Generating compressed representations of sorted arrays of identifiers
US11722404B2 (en) 2019-09-24 2023-08-08 Cisco Technology, Inc. Communicating packets across multi-domain networks using compact forwarding instructions
US11855884B2 (en) 2019-09-24 2023-12-26 Cisco Technology, Inc. Communicating packets across multi-domain networks using compact forwarding instructions
US20230251932A1 (en) * 2020-09-11 2023-08-10 Netapp Inc. Persistent memory file system reconciliation

Also Published As

Publication number Publication date
US10684927B2 (en) 2020-06-16
US20160034209A1 (en) 2016-02-04

Similar Documents

Publication Title
US10684927B2 (en) Methods and systems for storing information that facilitates the reconstruction of at least some of the contents of a storage unit on a storage system
US10496481B2 (en) Methods and systems for rebuilding data subsequent to the failure of a storage unit
US9417963B2 (en) Enabling efficient recovery from multiple failures together with one latent error in a storage array
US7036066B2 (en) Error detection using data block mapping
US9424128B2 (en) Method and apparatus for flexible RAID in SSD
US7206991B2 (en) Method, apparatus and program for migrating between striped storage and parity striped storage
US20190332477A1 (en) Method, device and computer readable storage medium for writing to disk array
US10831401B2 (en) Method, device and computer program product for writing data
US11449400B2 (en) Method, device and program product for managing data of storage device
US9063869B2 (en) Method and system for storing and rebuilding data
US20220253385A1 (en) Persistent storage device management
US20040034817A1 (en) Efficient mechanisms for detecting phantom write errors
US20170123915A1 (en) Methods and systems for repurposing system-level over provisioned space into a temporary hot spare
US11314594B2 (en) Method, device and computer program product for recovering data
US20200174689A1 (en) Update of raid array parity
US10860446B2 (en) Failed storage device rebuild using dynamically selected locations in overprovisioned space
US11537330B2 (en) Selectively improving RAID operations latency
CN110737395B (en) I/O management method, electronic device, and computer-readable storage medium
US10664346B2 (en) Parity log with by-pass
US10783035B1 (en) Method and system for improving throughput and reliability of storage media with high raw-error-rate
KR102032878B1 (en) Method for correcting error of flash storage controller
US20220261314A1 (en) Read request response for reconstructed data in a degraded drive
CN115809011A (en) Data reconstruction method and device in storage system

Legal Events

Date Code Title Description
AS Assignment

Owner name: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, TEXAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NIMBLE STORAGE, INC.;REEL/FRAME:042810/0906

Effective date: 20170601

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION