WO2005001841A2 - Safe write to multiply-redundant storage - Google Patents
- Publication number
- WO2005001841A2 (PCT/EP2004/051150)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- write
- parity
- data
- storage
- mark
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B20/00—Signal processing not specific to the method of recording or reproducing; Circuits therefor
- G11B20/10—Digital recording or reproducing
- G11B20/18—Error detection or correction; Testing, e.g. of drop-outs
- G11B20/1833—Error detection or correction; Testing, e.g. of drop-outs by adding special lists or symbols to the coded information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2211/00—Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
- G06F2211/10—Indexing scheme relating to G06F11/10
- G06F2211/1002—Indexing scheme relating to G06F11/1076
- G06F2211/1009—Cache, i.e. caches used in RAID system with parity
Definitions
- The present invention relates to the field of storage, and more specifically to the operation of multiply-redundant arrays.
Background Art
- an array of independent storage devices can be configured to operate as a single virtual storage device using a technology known as RAID (Redundant Array of Independent Disks).
- a computer system configured to operate with a RAID storage system is able to perform input and output (I/O) operations (such as read and write operations) on the RAID storage system as if the RAID storage system were a single storage device.
- a RAID storage system includes an array of independent storage devices and a RAID controller.
- the RAID controller provides a virtualized view of the array of independent storage devices - this means that the array of independent storage devices appear as a single virtual storage device with a sequential list of storage elements.
- the storage elements are commonly known as blocks of storage, and the data stored within them are known as data blocks.
- I/O operations are qualified with reference to one or more blocks of storage in the virtual storage device.
- The RAID controller maps the I/O operation onto the array of independent storage devices.
- the RAID controller may employ any of several standard RAID techniques well known to those skilled in the data processing art.
- Striping involves spreading data blocks across storage devices in a round-robin fashion.
- a number of data blocks known as a strip is stored in each storage device.
- the size of a strip may be determined by a particular RAID implementation or may be configurable.
- a row of strips comprising a first strip stored on a first storage device and subsequent strips stored on subsequent storage devices is known as a stripe.
- the size of a stripe is the total size of all strips comprising the stripe.
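- The strip and stripe layout described above can be sketched as a simple address mapping. This is a minimal illustration only, assuming plain data striping with no parity rotation; the names `locate_block`, `stripe_size`, `num_devices` and `blocks_per_strip` are hypothetical and do not come from the patent.

```python
from typing import NamedTuple


class Location(NamedTuple):
    device: int  # index of the storage device holding the block
    stripe: int  # index of the row of strips (the stripe)
    offset: int  # block offset inside the strip


def locate_block(virtual_block: int, num_devices: int, blocks_per_strip: int) -> Location:
    """Map a block of the virtual device onto the array, round-robin by strip."""
    strip_number = virtual_block // blocks_per_strip  # strips are numbered across devices
    return Location(
        device=strip_number % num_devices,
        stripe=strip_number // num_devices,
        offset=virtual_block % blocks_per_strip,
    )


def stripe_size(num_devices: int, blocks_per_strip: int) -> int:
    """The size of a stripe is the total size of all strips comprising it."""
    return num_devices * blocks_per_strip
```

- With four devices and two blocks per strip, virtual block 9 falls in the fifth strip, so it wraps round to the first device in the second stripe.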
- Physical storage devices such as disk storage devices are renowned for poor reliability and it is a further function of a RAID controller to provide a reliable storage system.
- One technique to provide reliability involves the storage of check information along with data in an array of independent storage devices.
- Check information is redundant information that allows regeneration of data which has become unreadable due to a single point of failure, such as the failure of a single storage device in an array of such devices. Unreadable data is regenerated from a combination of readable data and redundant check information.
- Check information is recorded as parity data which occupies a single strip in a stripe, and is calculated by applying the EXCLUSIVE OR (XOR) logical operator to all data strips in the stripe.
- A stripe comprising data strips A, B and C would be further complemented by a parity strip calculated as A XOR B XOR C.
- the parity strip is used to regenerate an inaccessible data strip. For example, if a stripe comprising data strips A, B, C and PARITY is stored across four independent storage devices W, X, Y and Z respectively, and storage device X fails, strip B stored on device X would be inaccessible.
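- The parity calculation and the regeneration of an inaccessible strip can be illustrated with a short sketch. It assumes strips are equal-length byte strings; the helper name `xor_strips` and the sample strip contents are illustrative assumptions, not part of the patent.

```python
from functools import reduce


def xor_strips(*strips: bytes) -> bytes:
    """Byte-wise XOR of equal-length strips."""
    if not strips:
        raise ValueError("at least one strip is required")
    if any(len(s) != len(strips[0]) for s in strips):
        raise ValueError("all strips must have the same length")
    # zip(*strips) walks the strips column by column; each column is XORed together.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*strips))


# Data strips A, B and C, stored on devices W, X and Y (sample values only).
strip_a = b"\x0f\xf0"
strip_b = b"\x33\x33"
strip_c = b"\x55\xaa"

# The parity strip stored on device Z is A XOR B XOR C.
parity = xor_strips(strip_a, strip_b, strip_c)

# If device X fails, strip B is regenerated from the readable strips and the parity.
regenerated_b = xor_strips(strip_a, strip_c, parity)
```

- Because XOR is its own inverse, the same function serves both to compute the parity and to regenerate any single missing strip.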
- a multiply redundant RAID array is a storage device built out of independent hard disk drives which stores data and parity information in such a way that several of the component hard disk drives can fail without losing any of the data written to the array. For simplicity, this will be described here in terms of doubly redundant arrays; the techniques of the preferred embodiments of the present invention can be extended to arrays with more redundancy.
- Each column represents a disk.
- Each cell in the figure represents a strip of data (top row) or parity (bottom row) on the disk.
- Adapters for RAID-5, a singly redundant type of RAID, typically manage this by storing in a small non-volatile memory (NVRAM) the indices of each stripe being written.
- The adapter looks through its NVRAM and discovers which of the patterns on the array need to be resynchronized. This resynchronization is done by reading all of the data strips and computing the new parity.
- This approach needs ~32 bits of NVRAM for each pattern that needs to be written concurrently. This technique only works if all of the data strips can be read. If a single data strip cannot be read (e.g. because a disk has failed), the parity strip cannot be computed and data is lost. This borders on the acceptable with RAID-5, which is designed to survive a single failure (there have arguably been two failures: the disk failure and the power failure or controller reset).
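- The known resynchronization approach, and the way it breaks down when a data strip is unreadable, can be sketched as follows. This is an illustrative model only: the in-memory dictionary standing in for the array, the `dirty` set standing in for the NVRAM contents, and the names `resync_parity` and `UnreadableStrip` are all assumptions made for the example.

```python
from functools import reduce
from typing import Dict, List, Optional, Set


class UnreadableStrip(Exception):
    """Raised when a data strip needed to recompute parity cannot be read."""


def resync_parity(
    stripes: Dict[int, List[Optional[bytes]]],  # stripe index -> data strips (None = unreadable)
    dirty: Set[int],                            # stripe indices recorded in NVRAM
) -> Dict[int, bytes]:
    """Recompute parity for every stripe that was being written at the time of the reset."""
    new_parity: Dict[int, bytes] = {}
    for index in sorted(dirty):
        data = stripes[index]
        if any(strip is None for strip in data):
            # With a failed disk the new parity cannot be computed, so data is lost.
            raise UnreadableStrip(f"stripe {index} has an unreadable data strip")
        new_parity[index] = bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*data))
    return new_parity
```

- In this model, a stripe with all data strips readable is simply given fresh parity, while a stripe with one unreadable strip raises an error, which corresponds to the combined disk failure and reset described above.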
- There are at least two controllers of a RAID array, so that failure of a single controller does not cause loss of access to the array. These controllers must communicate to synchronise access to the RAID array. Any non-volatile information must be visible to both controllers. In the simplest case, the controllers communicate using the device network, but if the contents of the NVS are to be shadowed, network bandwidth is disadvantageously consumed, which bandwidth would otherwise be available for user I/O.
- the present invention accordingly provides, in a first aspect an arrangement of apparatus for safely writing data and parity to multiply-redundant storage comprising a first storage component operable to store at least a first mark in a storage device to index uniquely a pattern to be written by at least a data write; a write component operable to perform said at least data write; a further storage component operable to overwrite a mark in said storage device with at least a further mark to index uniquely a pattern to be written by a parity write; and a further write component operable to perform said parity write.
- said first storage component further comprises a second storage component operable to overwrite said at least first mark in said storage device with a second mark to index a pattern to be written by a first parity write; and said write component is further operable to perform said first parity write.
- the arrangement of the first aspect may be preferred to have double redundancy.
- the apparatus of the first aspect may be preferred to have
- the arrangement of the first aspect may be adapted to serialise each data and parity write operation.
- the arrangement of the first aspect may further comprise RAID storage, which may preferably be RAID-5 storage.
- the arrangement of the first aspect may be adapted to perform lazy parity update.
- the arrangement of the first aspect may preferably further comprise a fastwrite cache.
- the present invention provides a method for safely writing data and parity to multiply-redundant storage comprising storing at least a first mark in a storage device to index uniquely a pattern to be written by at least a data write; performing said at least data write; overwriting a mark in said storage device with at least a further mark to index uniquely a pattern to be written by a parity write; and performing said parity write.
- said step of storing at least a first mark in a storage device to index uniquely a pattern to be written by at least a data write further comprises overwriting said at least first mark in said storage device with a second mark to index a pattern to be written by a first parity write; and said step of performing said at least data write comprises performing said first parity write.
- the multiply redundant storage is doubly redundant.
- the multiply redundant storage has greater than double redundancy.
- said multiply redundant storage comprises RAID storage.
- said RAID storage is RAID-5 storage.
- the method of the second aspect preferably further comprises a step of performing a lazy parity update.
- the method of the second aspect preferably further comprises a step of performing a fastwrite caching operation.
- the present invention provides a computer program comprising computer program code embodied in a tangible medium, to, when loaded into a computer system and executed thereon, perform all the steps of the method of the second aspect.
- the present invention provides protection against a combination of a disk failure and a power reset or of a disk failure and a controller reset.
- the present invention alleviates the problem of bandwidth consumption reducing the available bandwidth for user I/O.
- a further advantage of the present invention is that it is scalable from doubly-redundant arrays to arrays having higher orders of redundancy.
- Figure 1 shows a well known encoding scheme for a four-disk array.
- Figure 2 shows a block schematic of an arrangement of apparatus according to a preferred embodiment of the present invention.
- Figure 3 shows the steps of a method according to a preferred embodiment of the present invention.
Mode for the Invention
- [050] To appreciate the preferred embodiments of the present invention, consider again the known encoding scheme shown in Figure 1. Writes to this pattern are serialised. If writes for both strips A and B are submitted at the same time, the array controller will do one first and then the other.
- A mark is stored in NVRAM to uniquely identify the strip within the pattern that is being written on each of writes #0, #1, and #2.
- In FIG. 2 there is shown an arrangement of apparatus 100 for safely writing data and parity to multiply-redundant storage comprising a first write component 112 comprising a writer 102 which cooperates with a first storage component 108 to store a mark to index uniquely a pattern that is to be written by a data write, and which then performs a write operation to storage 110.
- Write component 112 may further contain a second writer 104 which also cooperates with the first storage component 108 to store a mark to index uniquely a pattern that is to be written by a first parity write, and which then performs a write operation to storage 110.
- the data and parity may be written in parallel as shown by parallel writes 114.
- component 112 is effectively decomposed to provide two independent writers 102, 104, and the writes shown in the parallel write component 112 are decomposed into a serialized pair of writes, the first of data, and the second of the first parity (and note that the second parity write must also be serialized).
- a third writer 106 cooperates with storage component 108 to overwrite the mark in storage with at least a further mark to index uniquely a pattern to be written by a second parity write, and then writes the parity to storage 110.
- The information stored in storage component 108, which is an NVRAM, is:
- The second and third items uniquely identify which strip within the pattern is being written. There are alternative ways of representing this information, including, for example but not limited to, the use of X-Y coordinates within the pattern. It will be clear to one skilled in the data processing art that there are many alternative representations.
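- One possible representation of such a mark is sketched below, purely as an illustration. It assumes the mark holds a pattern index, a strip index and a write index (the three values named in the method steps of FIG. 3) and packs them into a single 32-bit word; the field widths (24, 6 and 2 bits) and the names `Mark`, `pack_mark` and `unpack_mark` are assumptions of this example, not a layout given in the patent.

```python
from typing import NamedTuple

# Assumed field widths, chosen only so that the mark fits in 32 bits.
PATTERN_BITS, STRIP_BITS, WRITE_BITS = 24, 6, 2


class Mark(NamedTuple):
    pattern: int  # which pattern on the array is being written
    strip: int    # which strip within the pattern
    write: int    # which write of the sequence (#0, #1 or #2)


def pack_mark(mark: Mark) -> int:
    """Pack a mark into one 32-bit word for storage in NVRAM."""
    if not 0 <= mark.pattern < (1 << PATTERN_BITS):
        raise ValueError("pattern index out of range")
    if not 0 <= mark.strip < (1 << STRIP_BITS):
        raise ValueError("strip index out of range")
    if not 0 <= mark.write < (1 << WRITE_BITS):
        raise ValueError("write index out of range")
    return (mark.pattern << (STRIP_BITS + WRITE_BITS)) | (mark.strip << WRITE_BITS) | mark.write


def unpack_mark(word: int) -> Mark:
    """Recover the three indices from a packed 32-bit mark."""
    write = word & ((1 << WRITE_BITS) - 1)
    strip = (word >> WRITE_BITS) & ((1 << STRIP_BITS) - 1)
    pattern = word >> (STRIP_BITS + WRITE_BITS)
    return Mark(pattern, strip, write)
```

- An X-Y coordinate representation, as mentioned above, would replace the strip and write fields with a row and column within the pattern; the packing approach would otherwise be the same.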
- In FIG. 3 there is shown a flow diagram of the steps of a method according to the preferred embodiments of the present invention.
- a mark is written to NVRAM to store the pattern index, strip index and write index for the data write.
- the data is written at step 204.
- a mark is written to NVRAM to store the pattern index, strip index and write index for the first parity write.
- the first parity is written at step 208. In the case of a non-atomic update, these writes may occur in parallel. Otherwise, they must be serialized, and the second parity write (to be discussed below) must also be serialized.
- a mark is written to NVRAM to store the pattern index, strip index and write index for the second parity write.
- the second parity is written at step 212.
- the NVRAM is invalidated.
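- The serialized sequence of steps above (mark, data write, mark, first parity write, mark, second parity write, invalidate) can be sketched as follows. This is a simplified model under stated assumptions: the NVRAM is modelled as a single overwritable slot, the disks as a dictionary keyed by (pattern, strip), and the `crash_before` parameter is a hypothetical hook used only to simulate a power or controller reset.

```python
from typing import Dict, List, Optional, Tuple

Mark = Tuple[int, int, int]  # (pattern index, strip index, write index)


class Nvram:
    """Single-slot non-volatile store; each new mark overwrites the previous one."""

    def __init__(self) -> None:
        self.mark: Optional[Mark] = None

    def store(self, mark: Mark) -> None:
        self.mark = mark

    def invalidate(self) -> None:
        self.mark = None


def safe_write(
    nvram: Nvram,
    disk: Dict[Tuple[int, int], bytes],
    pattern: int,
    writes: List[Tuple[int, bytes]],  # ordered: data, first parity, second parity
    crash_before: Optional[int] = None,
) -> None:
    """Perform the writes one at a time, marking each in NVRAM before it starts."""
    for write_index, (strip_index, payload) in enumerate(writes):
        nvram.store((pattern, strip_index, write_index))  # mark precedes the write
        if crash_before == write_index:
            raise RuntimeError("simulated reset")  # the mark survives in NVRAM
        disk[(pattern, strip_index)] = payload
    nvram.invalidate()  # all writes complete, nothing left to resynchronize
```

- After a simulated reset, the mark left in the NVRAM identifies the pattern, the strip and the write that was in progress, which is the information the resynchronization needs. The sketch shows only the serialized case; where atomicity is not required, the data write and first parity write may occur in parallel, as noted above.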
- the pattern can be resynchronized because all strips on the remaining disks are already in sync.
- The pattern can be resynchronized as follows:
- [085] 1. If the NVRAM says "Write 0" then the situation is similar to one where two disks have failed: the one that has really failed and the one containing the data strip being written. Because the pattern is multiply redundant, all data can be reconstructed, returning the pattern to the state it was in before the write started.
- [086] 2.
- the data strips on the failed disk can always be reconstructed and the pattern resynchronized to complete the interrupted write.
- Each data strip on the failed disk has two parity strips. Neither of these is on the failed disk.
- At most one of these parity strips is shared with the data strip written during "Write 0", and so there is always at least one way of reconstructing the data strip.
- the first preferred embodiment of this invention requires that although the various reads may be issued in parallel, the three writes must be issued sequentially — write #2 is not begun until write #1 has completed etc.
- One disadvantage of this embodiment is that the three disk writes are serialised, meaning that a write sent to the array will take longer than a write to the equivalent array using the NVS approach, where the three disk writes may be issued in parallel.
- [100] This may be alleviated somewhat if the RAID array is behind a fastwrite cache and so at least partially isolated from host response time.
- [101] An additional potential method of alleviating this disadvantage is by signalling successful completion to the host write I/O after step 8A; this is the equivalent of the well-known "Lazy parity update" technique often applied to RAID-5.
- [102] In a second embodiment of the present invention, the serialisation is unnecessary unless the application requires the additional property that a write to the RAID array be atomic.
- the present invention may suitably be embodied as a computer program product for use with a computer system.
- Such an implementation may comprise a series of computer readable instructions either fixed on a tangible medium, such as a computer readable medium, for example, diskette, CD-ROM, ROM, or hard disk, or transmittable to a computer system, via a modem or other interface device, over either a tangible medium, including but not limited to optical or analogue communications lines, or intangibly using wireless techniques, including but not limited to microwave, infrared or other transmission techniques.
- the series of computer readable instructions embodies all or part of the functionality previously described herein.
- Such computer readable instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including but not limited to, semiconductor, magnetic, or optical, or transmitted using any communications technology, present or future, including but not limited to optical, infrared, or microwave. It is contemplated that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation, for example, shrink-wrapped software, pre-loaded with a computer system, for example, on a system ROM or fixed disk, or distributed from a server or electronic bulletin board over a network, for example, the Internet or World Wide Web.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2006516161A JP4848272B2 (en) | 2003-06-28 | 2004-06-17 | Apparatus and method for secure writing to multiplexed redundant storage |
EP04741822A EP1639467A2 (en) | 2003-06-28 | 2004-06-17 | Safe write to multiply-redundant storage |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0315157.8 | 2003-06-28 | ||
GB0315157A GB0315157D0 (en) | 2003-06-28 | 2003-06-28 | Safe write to multiply-redundant storage |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2005001841A2 (en) | 2005-01-06 |
WO2005001841A3 (en) | 2005-09-09 |
Family
ID=27676272
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2004/051150 WO2005001841A2 (en) | 2003-06-28 | 2004-06-17 | Safe write to multiply-redundant storage |
Country Status (6)
Country | Link |
---|---|
EP (1) | EP1639467A2 (en) |
JP (1) | JP4848272B2 (en) |
CN (1) | CN100359478C (en) |
GB (1) | GB0315157D0 (en) |
TW (1) | TWI315873B (en) |
WO (1) | WO2005001841A2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3713094A1 (en) * | 2019-03-22 | 2020-09-23 | Zebware AB | Application of the mojette transform to erasure correction for distributed storage |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0569212A1 (en) * | 1992-05-05 | 1993-11-10 | International Business Machines Corporation | Method and means for fast writing data to LRU cached based DASD arrays under diverse fault tolerant modes |
WO1994029795A1 (en) * | 1993-06-04 | 1994-12-22 | Network Appliance Corporation | A method for providing parity in a raid sub-system using a non-volatile memory |
US5574882A (en) * | 1995-03-03 | 1996-11-12 | International Business Machines Corporation | System and method for identifying inconsistent parity in an array of storage |
US5774643A (en) * | 1995-10-13 | 1998-06-30 | Digital Equipment Corporation | Enhanced raid write hole protection and recovery |
US20020161970A1 (en) * | 2001-03-06 | 2002-10-31 | Busser Richard W. | Utilizing parity caching and parity logging while closing the RAID 5 write hole |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5195100A (en) * | 1990-03-02 | 1993-03-16 | Micro Technology, Inc. | Non-volatile memory storage of write operation identifier in data sotrage device |
JP2857288B2 (en) * | 1992-10-08 | 1999-02-17 | 富士通株式会社 | Disk array device |
JPH05341921A (en) * | 1992-06-05 | 1993-12-24 | Hitachi Ltd | Disk array device |
JP3181398B2 (en) * | 1992-10-06 | 2001-07-03 | 三菱電機株式会社 | Array type recording device |
US5522032A (en) * | 1994-05-05 | 1996-05-28 | International Business Machines Corporation | Raid level 5 with free blocks parity cache |
KR100267366B1 (en) * | 1997-07-15 | 2000-10-16 | Samsung Electronics Co Ltd | Method for recoding parity and restoring data of failed disks in an external storage subsystem and apparatus therefor |
JP3618529B2 (en) * | 1997-11-04 | 2005-02-09 | 富士通株式会社 | Disk array device |
JP3590015B2 (en) * | 2001-11-30 | 2004-11-17 | 株式会社東芝 | Disk array device and method of restoring consistency of logical drive having redundant data |
-
2003
- 2003-06-28 GB GB0315157A patent/GB0315157D0/en not_active Ceased
-
2004
- 2004-06-04 TW TW93116234A patent/TWI315873B/en not_active IP Right Cessation
- 2004-06-17 EP EP04741822A patent/EP1639467A2/en not_active Withdrawn
- 2004-06-17 JP JP2006516161A patent/JP4848272B2/en not_active Expired - Fee Related
- 2004-06-17 WO PCT/EP2004/051150 patent/WO2005001841A2/en active Application Filing
- 2004-06-17 CN CNB2004800135130A patent/CN100359478C/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
CN100359478C (en) | 2008-01-02 |
WO2005001841A3 (en) | 2005-09-09 |
JP4848272B2 (en) | 2011-12-28 |
GB0315157D0 (en) | 2003-08-06 |
CN1791863A (en) | 2006-06-21 |
TWI315873B (en) | 2009-10-11 |
TW200518089A (en) | 2005-06-01 |
EP1639467A2 (en) | 2006-03-29 |
JP2009514047A (en) | 2009-04-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AK | Designated states | Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
 | AL | Designated countries for regional patents | Kind code of ref document: A2 Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
 | WWE | Wipo information: entry into national phase | Ref document number: 20048135130 Country of ref document: CN |
 | WWE | Wipo information: entry into national phase | Ref document number: 2004741822 Country of ref document: EP |
 | WWE | Wipo information: entry into national phase | Ref document number: 2006516161 Country of ref document: JP |
 | WWP | Wipo information: published in national office | Ref document number: 2004741822 Country of ref document: EP |