WO2017061891A1 - Coding for distributed storage system - Google Patents

Coding for distributed storage system

Info

Publication number
WO2017061891A1
Authority
WO
WIPO (PCT)
Prior art keywords
symbols
codeword
unknown
marked
node
Prior art date
Application number
PCT/RU2015/000655
Other languages
French (fr)
Other versions
WO2017061891A9 (en
Inventor
Peter Vladimirovich Trifonov
Yunfeng Shao
Yuangang WANG
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/RU2015/000655 priority Critical patent/WO2017061891A1/en
Priority to CN201580083657.1A priority patent/CN108141228A/en
Publication of WO2017061891A1 publication Critical patent/WO2017061891A1/en
Publication of WO2017061891A9 publication Critical patent/WO2017061891A9/en

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/13 Linear codes
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/37 Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35
    • H03M13/373 Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35 with erasure correction and erasure determination, e.g. for packet loss recovery or setting of erasures for the decoding of Reed-Solomon codes

Abstract

The present invention relates to a method for encoding input data in a codeword, wherein the codeword is obtained as a product of a vector u' and a matrix A, the vector u' comprises symbols (I), the remaining positions (II) comprise the input data and the matrix A is representable as (III), wherein B is a permutation matrix and $F_0, F_1, \ldots, F_{m-1}$ are $l_i \times l_i$ matrices over $GF(2^\mu)$ not being permutation-equivalent to diagonal matrices, wherein the set ℱ comprises integers s with (IV), the set ℱ comprises integers (V) such that (VI), or the set ℱ comprises integers (VII).

Description

CODING FOR DISTRIBUTED STORAGE SYSTEM
TECHNICAL FIELD
The present invention relates to a method for encoding input data in a codeword and to a method for updating a codeword. The present invention also relates to a storage controller and to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out one of the above methods.
BACKGROUND
In an information storage system that consists of $N_s$ servers, each equipped with $N_d$ storage devices, the servers, the disks and the blocks on them may fail or go temporarily offline at any time for many different reasons. In order to ensure that the information stored in the system is continuously available, one may store multiple copies of data on different servers/disks (this approach is adopted in Google File System and Hadoop Distributed File System). However, this severely increases storage requirements and the overall equipment cost. Another solution is to employ some kind of erasure coding, i.e. partition a chunk of data (stripe) into $k$ information blocks (symbols), compute for them $n - k$ parity (check) blocks (symbols), and store these blocks on different disks and servers. If any of them fails, one can consider the corresponding symbols as erased, and try to recover the missing blocks by means of erasure decoding of the corresponding code.
Numerous erasure correcting codes for network storage systems have been suggested, including Reed-Solomon codes, Pyramid codes, EvenOdd and RDP codes, parity splitting codes, and Zigzag codes. An $(n, k, n - k + 1)$ Reed-Solomon code provides protection against any combination of up to $n - k$ erasures, and therefore has the lowest possible redundancy. However, recovering any erased symbol requires one to access at least $k$ surviving symbols. Pyramid and parity splitting codes provide the ability to recover a number of erasures by accessing at most $l < k$ non-erased symbols. This is achieved at the expense of higher redundancy of the code. Essentially, these constructions are obtained from some maximum distance separable code (e.g. Reed-Solomon) by introducing into codewords additional check symbols, which depend only on some subsets of information symbols.
Array codes, such as EvenOdd, RDP and Zigzag codes, are defined over a vector alphabet. This enables one to design efficient encoding algorithms, as well as to reduce the amount of data to be transmitted over the network in case of erasure recovery. However, there are still no explicit and efficient methods for construction of these codes for arbitrary values of code dimension $k$ and redundancy $n - k$.
The performance of a network storage system depends on the amount of traffic generated and the number of servers contacted during encode and rebuild operations, on the disk access rate and on the computational complexity of the associated algorithms. An important problem arising in such systems is that applications tend to write data in relatively small chunks consisting of fewer than $k$ blocks. This requires one to implement buffering, i.e. to accumulate the data somewhere until a sufficient amount of it is collected. This approach may result in data loss, since the storage device used for buffering may itself fail. Furthermore, applications may need to update some information blocks in previously stored stripes. In order to keep check blocks consistent, one needs to fetch their old values, as well as old values of information blocks, compute their difference, and update them to reflect new data. This involves many input/output and network transfer operations, which severely degrade system performance.
If erasure decoding is performed, one should contact as few servers as possible, since each network data transfer induces a very high performance penalty. This problem is addressed with the construction of locally decodable codes, such as pyramid and parity splitting ones. However, since these codes have more check symbols than comparable maximum distance separable ones, partial update introduces even higher performance overhead. Therefore, this method has been used only with immutable data.
SUMMARY OF THE INVENTION
It is an objective of the present invention to provide a method for encoding data and a method for updating a codeword, wherein the methods overcome one or more of the above-mentioned problems of the prior art. In particular, an objective of the present invention can include ensuring data availability in a distributed storage system, which may suffer from block, device and server failures.
A first aspect of the invention provides a method for encoding input data in a codeword, wherein the codeword is obtained as a product of a vector $u'$ and a matrix $A$, the vector $u'$ comprises symbols $u'_i = 0$ for $i \in \mathcal{F}$, the remaining positions $i \notin \mathcal{F}$ comprise the input data and the matrix $A$ is representable as $A = B F_0 \otimes F_1 \otimes \cdots \otimes F_{m-1}$, wherein $B$ is a permutation matrix and $F_0, F_1, \ldots, F_{m-1}$ are $l_i \times l_i$ matrices over $GF(2^\mu)$ not being permutation-equivalent to diagonal matrices,
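- the set $\mathcal{F}$ comprises integers $s$ with $0 \le s < r \prod_i l_i$ (the product as shown in figure imgf000004_0001),
- the set $\mathcal{F}$ comprises integers $s = \sum_{i=0}^{m-1} s_i \prod_{j=0}^{i-1} l_j$, $0 \le s_i < l_i$, such that $\prod_{i=0}^{m-1} (s_i + 1) < d$, or
- the set $\mathcal{F}$ comprises integers $s' = i \prod_{j=0}^{m-2} l_j + s$, $0 \le i < l_{m-1}$, $0 \le s < p \prod_{j=0}^{m-3} l_j$.
In the above, $GF(2^\mu)$ stands for a finite field of order $2^\mu$, wherein $\mu$ is a natural number.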
The different choices of the set $\mathcal{F}$ are advantageous for different application scenarios:
• In order to ensure that $r$ erasures can be recovered locally within each block of $l$ symbols, one can include in $\mathcal{F}$ the integers given by the expression in figure imgf000004_0002.
• In order to ensure that the code can correct any combination of $d - 1$ erasures (i.e. device failures), one can include in $\mathcal{F}$ integers $s = \sum_{i=0}^{m-1} s_i \prod_{j=0}^{i-1} l_j$, $0 \le s_i < l_i$, such that the condition in figure imgf000004_0003 holds.
• In order to ensure that the code can correct any combination of $p$ server failures, one can include in $\mathcal{F}$ integers $s' = i \prod_{j=0}^{m-2} l_j + s$, $0 \le i < l_{m-1}$, $0 \le s < p \prod_{j=0}^{m-3} l_j$.
• In order to ensure that the data loss probability is upper bounded by some pre-defined value $\pi$, one can include in $\mathcal{F}$ integers $s = \sum_{i=0}^{m-1} s_i \prod_{j=0}^{i-1} l_j$, $0 \le s_i < l_i$, such that $\sum_{s \notin \mathcal{F}} P(P(\cdots P(P(p, s_{m-1}), s_{m-2}) \cdots, s_1), s_0) < \pi$. Here $p$ is the probability of a server (node) failure within a given time interval (no replacements), and $P(p, t)$ is the function given in figure imgf000004_0005.
The method of the first aspect can use polar codes. These can be generated by some rows of matrix $A = B F_0 \otimes F_1 \otimes \cdots \otimes F_{m-1}$, where $B$ is a permutation matrix corresponding to the mapping $\sum_{i=0}^{m-1} s_i \prod_{j=0}^{i-1} l_j \to \sum_{i=0}^{m-1} s_i \prod_{j=i+1}^{m-1} l_j$, wherein $0 \le s_i < l_i$, and $F_i$ are some $l_i \times l_i$ matrices over $GF(2^\mu)$ not being permutation-equivalent to diagonal matrices.
Accordingly, there is presented a method for encoding the data with a polar code and storing the encoded values on the elements of storage system, as well as an efficient method implementing the encoding and decoding operations.
In an embodiment of the first aspect, there is provided a method for encoding data in a storage system, where the data are placed into selected positions within a codeword $c$, which are declared known, the positions of check symbols are declared unknown, and the values of the unknown symbols are recursively determined so that a codeword $c = u'A$, $A = B F_0 \otimes F_1 \otimes \cdots \otimes F_{m-1}$, is obtained, where $u'$ is a vector with $u'_i = 0$, $i \in \mathcal{F}$. The symbol $c_i$ is stored on the $i$-th node (server or disk within a server).
In embodiments of the invention, one can obtain codes with the same erasure correcting properties in several different ways. For example, one can permute columns of matrices $F_i$, and multiply these columns by arbitrary non-zero values. Furthermore, there exist many different ways to assign information and check symbols within a codeword.
In a first implementation of the method according to the first aspect, the method is a method for recovering erased data, wherein the method is based on an encoding scheme with nodes F_i that correspond to the matrices $F_i$, and the method comprises:
- marking one or more erased symbols of the codeword as unknown and one or more non-erased symbols of the codeword as known,
- marking one or more symbols with $i \in \mathcal{F}$ as known,
- for any node F_i in the encoding scheme, if $t$ of its output symbols are marked unknown, marking $t$ of its topmost input symbols as unknown, unless they have already been marked known,
- if a node F_i has $t$ known input symbols and $t$ unknown output symbols, marking the remaining output symbols as unknown, unless they are marked known,
- repeating until all symbols of the codeword are marked known: if a node has $t$ known input symbols, $t$ unknown and $l - t$ known output symbols, recovering the unknown output symbols by local decoding at the node and marking all output symbols as known, and if all output symbols of a node are known, computing the unknown input symbols and marking them known.
Thus, the method of the first implementation provides an efficient method for erasure recovery. In particular, the task of erasure recovery can be reduced to a plurality of local decoding tasks, for which efficient implementations exist, as further outlined below.
In a second implementation of the method according to the first aspect, the method is a method for systematic encoding, and obtaining the codeword comprises initial steps of:
- selecting data positions of the codeword as $p' = \sum_{i=0}^{m-1} p_{m-1-i} \prod_{j=0}^{i-1} l_j$, where $p = \sum_{i=0}^{m-1} p_i \prod_{j=0}^{i-1} l_j \in \{0, \ldots, \prod_{i=0}^{m-1} l_i - 1\} \setminus \mathcal{F}$,
- placing the input data into the data positions of the codeword, and marking the remaining codeword positions as erased, and
- employing the above described method for recovering erased data.
This presents a particularly efficient way of systematic encoding, which is important for practical systems, where the input data should appear as part of the codeword.
In a third implementation of the method according to the first aspect, the method further comprises a step of recovering the symbols in the remaining codeword positions using the method of the first implementation of the first aspect.
The proposed method for selecting data positions within a codeword ensures that the described erasure recovery method always recovers all the check symbols of the codeword.
In a fourth implementation of the method according to the first aspect, the local decoding comprises:
- solving equations $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$, $0 \le i < t$, where $B = F^{-1}$, for unknown output symbols $c_{s_j}$, $0 \le j < t$, using known input symbols $S_i$, and
- as soon as all output symbols become known, computing unknown input symbols $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$.
In a fifth implementation of the method according to the first aspect, the local decoding is implemented as:
- computing unknown output symbols $c_{s_j}$ by the expression shown in figure imgf000007_0001, where $\Gamma(x) \equiv S(x)\Lambda(x) \bmod x^t$, and
- as soon as all output symbols are known, computing unknown input symbols $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$.
This has the advantage that low-complexity polynomial evaluation algorithms with complexity $O(t^2)$ can be used, instead of the generic Gaussian elimination method for solving systems of linear equations, which has complexity $O(t^3)$.
In a sixth implementation of the method according to the first aspect, the matrices $F_0, F_1, \ldots, F_{m-1}$ are selected as $F_s = (f_{ij})$, $f_{ij} = \alpha_{j,s}^{l_s - 1 - i}$, wherein the $\alpha_{j,s}$ are arranged into a sequence of cyclotomic cosets.
This provides a practical implementation of the method of the fifth implementation.
In a seventh implementation of the method according to the first aspect, a Fast Fourier Transform, FFT, is utilized for determining unknown values from known values.
In particular, the FFT can be used for evaluating the polynomials, as required by the above described local decoding method, thus providing a particularly efficient way of performing the local decoding.
In an eighth implementation of the method according to the first aspect, the method further comprises a step of partially encoding input data with a length that is less than the code dimension.
Thus, encoding can be performed partially until unknown symbols can be recovered. Encoding can be resumed as soon as additional data arrives. Long codes are needed in order to maximize the payload capacity of a storage system given some target data loss probability. However, the dimension of such codes may be too high compared to the amount of data which can be produced at once by an application. Therefore the eighth implementation provides a delayed encoding method, which can be used in order to generate a few check symbols for small pieces of data as soon as it arrives, until sufficient amount of data is accumulated in order to produce the whole codeword.
The method of the eighth implementation can make use of the idea to designate initially all codeword symbols as unknown, and put the information symbols into appropriate positions, designating them as known, as soon as they arrive. Then one can execute the above described systematic encoding algorithm, which may stop at some points due to lack of known symbols, and resume as soon as they appear. Observe that it is not likely that many devices fail within a short time span which is needed to accumulate t data blocks. Therefore, a few check symbols obtained during incomplete execution of the encoding algorithm may be sufficient to cope with such failures. As soon as the encoding algorithm completes its execution, the whole set of check symbols can be obtained, which ensures protection against many device failures, as required for long-term data storage.
Therefore, the method of the eighth implementation provides an efficient way of dealing with large codes and small information chunks provided by an application.
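The stop-and-resume behaviour of delayed encoding described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the 16-position layout, the dependency map `CHECK_DEPS` and the use of XOR as a stand-in for the actual check-symbol computation are all hypothetical and are not taken from the description.

```python
from functools import reduce
from operator import xor

# Hypothetical layout: data positions, and for each check position the
# data positions it depends on. None of this is specified by the source.
DATA = (0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13)
CHECK_DEPS = {3: (0, 1, 2), 7: (4, 5, 6), 11: (8, 9, 10), 14: (12, 13), 15: DATA}


class DelayedEncoder:
    """Keeps symbols marked known/unknown and encodes as far as possible."""

    def __init__(self):
        self.known = {}  # position -> value; a missing position is "unknown"

    def write(self, pos, value):
        """Store an arriving data symbol, then resume the encoding."""
        self.known[pos] = value
        return self._resume()

    def _resume(self):
        produced = []
        for chk, deps in CHECK_DEPS.items():
            if chk not in self.known and all(d in self.known for d in deps):
                # Stand-in computation (assumption): XOR of the dependencies.
                self.known[chk] = reduce(xor, (self.known[d] for d in deps))
                produced.append(chk)
        return produced  # check positions that became known in this step


enc = DelayedEncoder()
log = [enc.write(pos, pos + 1) for pos in DATA]
```

Each entry of `log` lists the check positions that could be completed after the corresponding write; most writes complete nothing, which mirrors the encoder stopping for lack of known symbols.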
In a ninth implementation of the method according to the first aspect, the method further comprises a step of storing an i-th element of the codeword on an i-th node, wherein the node is a server or a disk within a server.
A second aspect of the invention refers to a method for updating a codeword encoded according to the method of the first aspect or one of its implementations, the method comprising:
- marking a symbol of the codeword to be updated as obsolete,
- storing a new value of the symbol to be updated,
- marking one or more check symbols of the codeword as obsolete,
- allocating one or more new check symbols and marking them as unknown, and
- recursively determining values of one or more symbols marked as unknown, wherein when a symbol is marked known, its previous value is marked obsolete, and when all symbols of a codeword are either known or obsolete, one or more positions marked as obsolete are erased.
An implementation of the second aspect can provide a method for updating the data encoded according to the method of the first aspect, where the symbol to be updated is declared obsolete, a new value is stored on the device, the old check symbols are declared obsolete, and new check symbols are allocated and declared initially as unknown. Then the values of unknown symbols are recursively determined as in the case of the encoding procedure. Here, every time a symbol is declared known, the block storing its previous value is declared obsolete. When all blocks corresponding to a codeword are either known or obsolete, the obsolete blocks are erased.
A third aspect of the invention refers to a storage controller, configured to carry out the method of one of the previous claims. The controller can be implemented either in software or in hardware (e.g. ASIC, FPGA). The controller can be directly connected to the storage devices or it can be connected to the storage devices through a network connection, wherein e.g. the storage devices are connected to the network through a further controller.
A fourth aspect of the invention refers to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of the first or second aspect or one of the implementations of the first or second aspect.
BRIEF DESCRIPTION OF THE DRAWINGS
To illustrate the technical features of embodiments of the present invention more clearly, the accompanying drawings provided for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description are merely some embodiments of the present invention, but modifications on these embodiments are possible without departing from the scope of the present invention as defined in the claims.
FIG. 1 is a block diagram illustrating an encoder structure in accordance with an embodiment of the present invention,
FIGS. 2A to 2D are block diagrams illustrating processing steps of an erasure recovery method in accordance with a further embodiment of the present invention,
FIG. 3 is a schematic diagram illustrating a method for delayed encoding in accordance with a further embodiment of the present invention, and
FIG. 4 is a schematic diagram illustrating a method for updating a codeword in accordance with a further embodiment of the present invention.
Detailed Description of the Embodiments
In a first embodiment, polar codes are generated by some rows of matrix $A = B F_0 \otimes F_1 \otimes \cdots \otimes F_{m-1}$, where $B$ is a permutation matrix corresponding to the mapping $\sum_{i=0}^{m-1} s_i \prod_{j=0}^{i-1} l_j \to \sum_{i=0}^{m-1} s_i \prod_{j=i+1}^{m-1} l_j$, $0 \le s_i < l_i$, and $F_i$ are some $l_i \times l_i$ matrices over $GF(2^\mu)$ not being permutation-equivalent to diagonal matrices. That is, the non-systematic encoding operation is performed as $c = vA$, where $v$ is a vector of length $n = \prod_{i=0}^{m-1} l_i$ having 0 at positions $i \in \mathcal{F}$ and information symbols $u_j$ at the remaining positions. Here it will be convenient to assume that $u_i$ and $c_i$ are elements of $GF(2^\mu)$, although in practice these may be column vectors (blocks) of $GF(2^\mu)$ values. The set $\mathcal{F}$ will be referred to as the set of frozen symbols. For the sake of simplicity, in what follows we will assume that $F_i = F$ and $l_i = l$, although the proposed construction is generic and can be used for any combination of $l_i$ values.
FIG. 1 illustrates an encoding scheme in accordance with an embodiment of the present invention. A system 100 comprises a set of nodes 102 to 112, wherein each of the nodes, denoted by "F", implements multiplication of a vector of input values (left-hand side inputs) by matrix F, and the result is passed via right-hand side terminals. The encoded symbols can be stored in the following ways:
1. Each symbol is stored on its own device, where the output of each "F" node in the right-hand side layer is stored in a single group of devices (e.g. within one server).
2. Each symbol is stored in its own block, the output of each node is stored on the same device, and the outputs of $l$ adjacent nodes are stored within a single server.
It must be recognized that other mappings of codeword symbols onto storage devices are also possible. For the sake of concreteness, method 2 will be considered in what follows. Observe that failure of any block causes the data stored on it to be unavailable. This results in the corresponding codeword symbols being erased.
This general construction requires, however, one to specify the particular matrix $F$ and a method for finding the set of frozen channels $\mathcal{F}$. In the context of storage systems it is advantageous to employ the Reed-Solomon kernel, which is given by $F_{ij} = \alpha_j^{l-1-i}$, where $\alpha_i$, $0 \le i < l$, are some distinct elements of $GF(2^\mu)$. More details on the selection of $\alpha_i$ will be given below. Observe that for the case of the Reed-Solomon kernel the last $i$ rows of $F$ generate an $(l, i, l - i + 1)$ code, $1 \le i \le l$. This enables one to construct the set of frozen channels $\mathcal{F}$ as follows:
1. In order to ensure that $r$ erasures can be recovered locally within each block of $l$ symbols, include into $\mathcal{F}$ integers $s$, $0 \le s < r\,l^{m-1}$. For example, this enables one to perform recovery of a failed device locally within each server without accessing the network.
2. In order to ensure that the code can correct any combination of $d - 1$ erasures (i.e. device failures), include into $\mathcal{F}$ integers $s = \sum_{i=0}^{m-1} s_i l^i$, $0 \le s_i < l$, such that $\prod_{i=0}^{m-1} (s_i + 1) < d$.
3. In order to ensure that the code can correct any combination of $p$ server failures, include into $\mathcal{F}$ integers $t + ls$, $0 \le t < l$, $0 \le s < p\,l^{m-2}$.
4. In order to ensure that the data loss probability is upper bounded by some pre-defined value $\pi$, include into $\mathcal{F}$ integers $s = \sum_{i=0}^{m-1} s_i l^i$, $0 \le s_i < l$, such that $\sum_{s \notin \mathcal{F}} P(P(\cdots P(p, s_{m-1}), \ldots), s_0) < \pi$. Here $p$ is the probability of a server (node) failure within a given time interval (no replacements), and $P(p, t)$ is the function given in figure imgf000011_0002.
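Rules 1, 2 and 4 above can be illustrated with a short Python sketch for the case $l_i = l$. The form of $P(p, t)$ used here is an assumption, since the source gives it only as a figure: it is taken to be the probability that at least $t$ out of $l$ independent units fail, which reproduces the values $P(0.01, 1) = 0.0297$ and $P(0.01, 2) = 2.98 \cdot 10^{-4}$ quoted in the worked example of the description. The function names and the ordering of the nested probabilities are likewise illustrative.

```python
from itertools import product
from math import comb, prod


def frozen_by_locality(l, m, r):
    """Rule 1: integers s with 0 <= s < r * l^(m-1)."""
    return set(range(r * l ** (m - 1)))


def frozen_by_distance(l, m, d):
    """Rule 2: s = sum_i s_i l^i such that prod_i (s_i + 1) < d."""
    out = set()
    for digits in product(range(l), repeat=m):  # digits = (s_0, ..., s_{m-1})
        if prod(s + 1 for s in digits) < d:
            out.add(sum(s * l ** i for i, s in enumerate(digits)))
    return out


def tail(p, t, l):
    """Assumed P(p, t): probability that at least t of l units fail."""
    return sum(comb(l, j) * p ** j * (1 - p) ** (l - j) for j in range(t, l + 1))


l, m = 3, 2
frozen = frozen_by_locality(l, m, r=1) | frozen_by_distance(l, m, d=3)

# Nested probabilities for m = 2, listed in an order that is an assumption.
values = [tail(tail(0.01, t1, l), t0, l) for t0 in (1, 2, 3) for t1 in (1, 2, 3)]
# Rule 4: sum over the symbols that are not frozen.
residual = sum(v for idx, v in enumerate(values) if idx not in frozen)
```

For $l = 3$, $m = 2$, $r = 1$, $d = 3$ the two rules together freeze four symbols, and `residual` can then be compared against the target bound $\pi$ to decide whether further symbols need to be frozen.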
In the example shown in FIG. 1, a set of 4 channels is frozen, indicated with reference number 120. Input data are provided as a set of symbols $u_0$ to $u_4$, indicated with reference number 122. Output symbols $c_0$ to $c_8$ are indicated with reference number 124.
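A minimal Python sketch of the non-systematic encoding $c = vA$ for a two-layer scheme with $l = 3$ over $GF(2^2)$ is given below. It assumes the Reed-Solomon kernel with $\alpha_j = \alpha^j$, the frozen set $\{0, 1, 2, 3\}$ and a digit-reversal permutation $B$; the integer encoding of field elements and all function names are choices made for this illustration only.

```python
# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3, with a a root of x^2 + x + 1.
EXP = [1, 2, 3]               # a^0, a^1, a^2
LOG = {1: 0, 2: 1, 3: 2}


def gf_mul(x, y):
    """Multiply in GF(4); addition in GF(4) is plain XOR."""
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]


L = 3                         # kernel size l; m = 2 layers, so n = 9
N = L * L
# Assumed kernel: F[i][j] = a_j^(l-1-i) with a_j = a^j.
F = [[1, 3, 2],
     [1, 2, 3],
     [1, 1, 1]]
FROZEN = {0, 1, 2, 3}


def kron(X, Y):
    """Kronecker product over GF(4): K[3a+b][3c+d] = X[a][c] * Y[b][d]."""
    k = len(Y)
    size = len(X) * k
    return [[gf_mul(X[i // k][j // k], Y[i % k][j % k]) for j in range(size)]
            for i in range(size)]


def digit_reverse(s):
    """Swap the two base-l digits of s (the permutation B for m = 2)."""
    return (s % L) * L + s // L


def encode(data):
    """Non-systematic encoding c = v A with A = B (F kron F)."""
    assert len(data) == N - len(FROZEN)
    it = iter(data)
    v = [0 if i in FROZEN else next(it) for i in range(N)]
    w = [v[digit_reverse(j)] for j in range(N)]   # w = v B
    K = kron(F, F)
    c = []
    for j in range(N):
        acc = 0
        for i in range(N):
            acc ^= gf_mul(w[i], K[i][j])
        c.append(acc)
    return c


codeword = encode([1, 2, 3, 1, 2])
```

Each slice `codeword[3*i:3*i+3]` corresponds to the output of one node in the right-hand layer; with this frozen set the 0-th input of every such node is zero.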
An erasure recovery method
An embodiment of the invention presents the following algorithm for correction of erasures in a codeword of a polar code:
1. Mark all erased codeword symbols as "unknown", and non-erased as "known".
2. If a symbol $v_i$ corresponds to a frozen channel, mark it "known".
3. For any node F in the encoding scheme, if $t$ of its output symbols are "unknown", then mark $t$ of its topmost input symbols as "unknown" (unless they are already marked "known").
4. If a node F has $t$ known input symbols and $t$ unknown output symbols, mark the remaining output symbols as "unknown", unless they are marked "known".
5. Repeat until all codeword symbols become known:
a. If a node has $t$ known input symbols, $t$ unknown and $l - t$ known output symbols, recover unknown symbols by local decoding (see below) at the node. Mark all output symbols as "known".
b. If all output symbols of a node are known, compute unknown input symbols and mark them "known".
FIGS. 2A to 2D present an example of application of the above described erasure recovery method. The example system comprises five nodes, indicated with reference numbers 202, 204, 206, 208 and 210. Unknown values are shown with dashed lines, and known values are shown with uninterrupted lines.
FIG. 2A shows the initial situation. The symbols $c_0$, $c_1$, $c_3$ and $c_6$ have been erased and are marked as unknown. The first input of the first node 202 corresponds to a frozen channel and is therefore marked as known. The first inputs of the third, fourth and fifth nodes 206, 208, 210 are also frozen channels and are therefore also marked as known.
The third node 206 has two unknown output symbols; therefore, according to above rule 3), the second input is also marked as unknown. In a first processing step, according to above rule 5) a), local decoding is performed at the fourth node 208 and the fifth node 210. Thus, their outputs $c_3$ and $c_6$ can be computed and marked as known. Consequently, all outputs of the fourth node 208 and the fifth node 210 are known and their inputs can be marked as known, according to above rule 5) b). The situation after applying rules 5) a) and 5) b) in the first processing step is shown in FIG. 2B. Subsequently, the topmost output symbol of the first node 202 can be computed according to rule 5) a). The subsequent situation is shown in FIG. 2C. Finally, the output symbols $c_0$ and $c_1$ of the third node 206 can be computed by performing local decoding. Subsequently, all symbols $c_0$ to $c_8$ are known, as shown in FIG. 2D.
Observe that the most typical failure patterns include just one erasure. If the code is properly designed (i.e. $r \ge 1$), this erasure can be recovered in a single iteration of this algorithm without transfer of information between nodes.
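The single-erasure case can be sketched in Python for one node with $l = 3$ over $GF(2^2)$. The sketch assumes that the 0-th input of the node is frozen (equal to 0) and that the kernel $F$ and the first column of $B = F^{-1}$ are those of the worked example in the description; the encoding of field elements as integers 0 to 3 is a choice made here.

```python
# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3; addition is XOR.
EXP = [1, 2, 3]
LOG = {1: 0, 2: 1, 3: 2}


def gf_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]


def gf_inv(x):
    return EXP[(-LOG[x]) % 3]


F = [[1, 3, 2],
     [1, 2, 3],
     [1, 1, 1]]            # assumed kernel, F[i][j] = a^(j*(2-i))
B_COL0 = [1, 2, 3]         # first column of B = F^-1: (1, a, a^2)


def node_output(inputs):
    """Output of one node: the row vector `inputs` times F."""
    out = []
    for d in range(3):
        acc = 0
        for i in range(3):
            acc ^= gf_mul(inputs[i], F[i][d])
        out.append(acc)
    return out


def recover_single(outputs, erased):
    """Solve 0 = sum_j c_j B[j][0] for the single erased output c_erased."""
    acc = 0
    for j, cj in enumerate(outputs):
        if j != erased:
            acc ^= gf_mul(cj, B_COL0[j])
    return gf_mul(acc, gf_inv(B_COL0[erased]))


original = node_output([0, 2, 1])      # 0-th input frozen to 0
recovered = []
for e in range(3):
    damaged = original[:]
    damaged[e] = None                  # erase one symbol
    recovered.append(recover_single(damaged, e))
```

Only the surviving symbols of the same node are read, so no data from other nodes is needed for this repair.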
Local decoding at a node can be performed as follows. Let $S_i$ and $c_j$ denote input and output symbols, respectively.
1. Solve equations $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$, $0 \le i < t$, where $B = F^{-1}$, for unknown output symbols $c_{s_j}$, $0 \le j < t$, using known input symbols $S_i$.
2. As soon as all output symbols become known, compute unknown input symbols $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$.
Let $\hat{S}_i = \sum_{j \in K} c_j B_{ji}$, where $K$ is the set of known output symbol indices. Then one obtains
$S_i - \hat{S}_i = \sum_{j \notin K} c_j B_{ji}$. (1)
The task of recovering unknown values $c_j$ can be recognized as the problem of erasure decoding of a code with check matrix $H = (B_{ji})$, $0 \le j < l$, $0 \le i < t$, for the case of syndrome vector $(S_0 - \hat{S}_0, \ldots, S_{t-1} - \hat{S}_{t-1})$, and the task of computing $\hat{S}_i$ can be recognized as the task of syndrome vector evaluation. Observe that syndrome vector evaluation can be performed using the cyclotomic fast Fourier transform.
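The two local decoding steps can be sketched in Python by solving system (1) directly with Gauss-Jordan elimination over $GF(2^2)$. The matrices $F$ and $B = F^{-1}$ are assumed to be those of the worked example in the description, the $t$ known inputs are assumed to be the topmost ones, and the function names are illustrative.

```python
# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3; addition is XOR.
EXP = [1, 2, 3]
LOG = {1: 0, 2: 1, 3: 2}


def gf_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]


def gf_inv(x):
    return EXP[(-LOG[x]) % 3]


F = [[1, 3, 2], [1, 2, 3], [1, 1, 1]]   # assumed kernel
B = [[1, 1, 1], [2, 3, 1], [3, 2, 1]]   # B = F^-1


def node_output(inputs):
    """Output of one node: the row vector `inputs` times F."""
    return [gf_mul(inputs[0], F[0][d]) ^ gf_mul(inputs[1], F[1][d])
            ^ gf_mul(inputs[2], F[2][d]) for d in range(3)]


def solve(A, b):
    """Solve A x = b over GF(4) by Gauss-Jordan elimination (A invertible)."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = gf_inv(M[col][col])
        M[col] = [gf_mul(inv, x) for x in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [x ^ gf_mul(f, y) for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def local_decode(outputs, known_inputs):
    """outputs: node outputs with None at erased positions.
    known_inputs: S_0, ..., S_{t-1} for t = number of erased outputs."""
    erased = [j for j, c in enumerate(outputs) if c is None]
    t = len(erased)
    A = [[B[j][i] for j in erased] for i in range(t)]
    rhs = []
    for i in range(t):
        acc = known_inputs[i]              # S_i minus the known part
        for j, c in enumerate(outputs):
            if c is not None:
                acc ^= gf_mul(c, B[j][i])
        rhs.append(acc)
    full = outputs[:]
    for j, val in zip(erased, solve(A, rhs)):
        full[j] = val
    return full


decoded = local_decode([None, None, 0], [0, 2])
```

Once all outputs are restored, the remaining inputs follow from step 2 as $S_i = \sum_j c_j B_{ji}$.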
Observe that for the case of the Reed-Solomon kernel, $B$ turns out to be a Vandermonde matrix, so syndrome evaluation reduces to computing $t$ components of the discrete Fourier transform of vector $c$. Furthermore, one can construct a polynomial $S(x)$ with coefficients given by the elements of the syndrome vector. Then the values of the erased symbols $c_{s_j}$ can be obtained via the Forney formula (2), shown in figure imgf000014_0001, where
$\Gamma(x) \equiv S(x)\Lambda(x) \bmod x^t$ (3)
is the erasure evaluator polynomial, $s_j$ are positions of erased symbols, $\Lambda(x) = \prod_{j=0}^{t-1} (1 - \alpha_{s_j} x)$ is the erasure locator polynomial, and $\Lambda'(x)$ is its formal derivative. These expressions can be used to obtain the values of erased symbols with complexity $O(t^2)$ instead of solving the system of equations (1).
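A Forney-type computation can be sketched in Python over $GF(2^2)$. Since formula (2) is available only as a figure, the sketch uses a textbook form under an explicit assumption: the syndromes are $S_i = \sum_j c_{s_j} X_j^i$, $0 \le i < t$, with locators $X_j = \alpha_{s_j}$, in which case $c_{s_j} = X_j \Gamma(X_j^{-1}) / \Lambda'(X_j^{-1})$ in characteristic 2. A different exponent offset in the syndromes changes the leading factor, so this is not necessarily the exact expression of the source.

```python
# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3; addition is XOR.
EXP = [1, 2, 3]
LOG = {1: 0, 2: 1, 3: 2}


def gf_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]


def gf_inv(x):
    return EXP[(-LOG[x]) % 3]


def poly_mul(p, q):
    """Product of polynomials given as coefficient lists, lowest degree first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] ^= gf_mul(pi, qj)
    return out


def poly_eval(p, x):
    acc = 0
    for coef in reversed(p):          # Horner's rule
        acc = gf_mul(acc, x) ^ coef
    return acc


def poly_deriv(p):
    """Formal derivative; in characteristic 2 only odd-degree terms survive."""
    return [p[k] if k % 2 else 0 for k in range(1, len(p))]


def syndromes(values, locators):
    """Assumed convention: S_i = sum_j values[j] * locators[j]^i, 0 <= i < t."""
    powers = [1] * len(locators)
    out = []
    for _ in range(len(locators)):
        acc = 0
        for v, pw in zip(values, powers):
            acc ^= gf_mul(v, pw)
        out.append(acc)
        powers = [gf_mul(pw, X) for pw, X in zip(powers, locators)]
    return out


def forney(S, locators):
    t = len(locators)
    lam = [1]
    for X in locators:
        lam = poly_mul(lam, [1, X])   # (1 - X x) equals (1 + X x) here
    gamma = poly_mul(S, lam)[:t]      # Gamma(x) = S(x) Lambda(x) mod x^t
    dlam = poly_deriv(lam)
    out = []
    for X in locators:
        xi = gf_inv(X)
        num = gf_mul(X, poly_eval(gamma, xi))
        out.append(gf_mul(num, gf_inv(poly_eval(dlam, xi))))
    return out


restored = forney(syndromes([3, 2], [1, 2]), [1, 2])
```

The work per erased symbol is one evaluation of $\Gamma$ and one of $\Lambda'$, which is where the quadratic overall cost comes from.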
Fast systematic encoding
Practical storage systems require systematic encoding, i.e. an encoding method in which the information symbols appear as a part of a codeword.
The above described erasure decoding algorithm can be used to implement systematic encoding. To do this, one can place the information symbols to positions $\sum_{i=0}^{m-1} w_i l^{m-1-i}$ within the codeword, where $\sum_{i=0}^{m-1} w_i l^i \notin \mathcal{F}$, $0 \le w_i < l$, mark the remaining ones as erased, and execute the above described erasure recovery method.
Advantageously, one can select the values $\alpha_i$ so that $\alpha_{s_j}$ form a set of conjugate elements, i.e. $\Lambda(x)$ is a polynomial with binary coefficients. In this case evaluation of $\Gamma$ can be performed using the inverse cyclotomic fast Fourier transform algorithm, which requires O(t log ^ ) multiplications. Furthermore, computing $\Gamma(x)$ from (3) does not require any multiplications. Observe that such a choice is always possible, provided that $\mu$ is a power of 2.
Consider for example the case of a code over $GF(2^2)$ constructed for $l = 3$, $m = 2$. Let us introduce the requirement of $r = 1$ symbols being recoverable locally and any combination of 2 erasures being recoverable (i.e. $d = 3$). This implies that $\mathcal{F}$ must contain elements 0, 1, 2, 3. Let us further assume that the probability of storage device failure within a certain time interval (e.g. one year) is $p = 0.01$, and require that the annual data loss probability does not exceed $10^{-4}$. One obtains $P(0.01, 1) = 0.0297$, $P(0.01, 2) = 2.98 \cdot 10^{-4}$, $P(0.01, 3) = 10^{-6}$, and the following sequence of values $P_{3s_1+s_0} = P(P(0.01, s_1), s_0)$:
0.86e-1, 0.89e-3, 0.3e-5, 0.26e-2, 0.27e-6, 0.3e-11, 0.26e-4, 0.26e-10, 0.1e-17
The sum of all these values except the first four (since they are already included into $\mathcal{F}$) is $2.6 \cdot 10^{-5}$, so no additional symbols need to be frozen, i.e. one can set $\mathcal{F} = \{0, 1, 2, 3\}$. This results in a non-systematic encoder structure. Let
$F = \begin{pmatrix} 1 & \alpha^2 & \alpha \\ 1 & \alpha & \alpha^2 \\ 1 & 1 & 1 \end{pmatrix}$,
where $\alpha$ is a primitive root of $x^2 + x + 1$. Observe that
$B = F^{-1} = \begin{pmatrix} 1 & 1 & 1 \\ \alpha & \alpha^2 & 1 \\ \alpha^2 & \alpha & 1 \end{pmatrix}$.
In order to implement systematic encoding, one places the data to be encoded in symbols $c_2, c_4, c_5, c_7, c_8$, and marks the corresponding symbols known. Codeword symbols $c_0, c_1, c_3, c_6$ are unknown. Observe that the 0-th input symbol of all F nodes in the rightmost layer in FIG. 1 is set to 0, i.e. it is known. Hence, the above described encoding algorithm requires one first to solve the equation $0 = B_{00} c_{3i} + B_{10} c_{3i+1} + B_{20} c_{3i+2}$ for unknown symbols $c_{3i}$, $0 < i < 3$. This can be done immediately.
Having recovered $c_{3i}$, one computes $S_1^{(i)} = B_{01} c_{3i} + B_{11} c_{3i+1} + B_{21} c_{3i+2}$, $1 \le i < 3$, where the superscript denotes the node of the rightmost layer, i.e. the two last output symbols of node 1 in the leftmost layer of the scheme. To recover the 0-th output symbol $S_1^{(0)}$ of this node (observe that the 0-th input symbol of this block is known to be equal to 0), one again needs to solve $0 = B_{00} S_1^{(0)} + B_{10} S_1^{(1)} + B_{20} S_1^{(2)}$. Now one has both input symbols of the 0-th node at the rightmost layer known. They are equal to $S_0^{(0)} = 0$ and $S_1^{(0)}$. Observe that
$S'_0 = S_0^{(0)} - B_{20} c_2 = B_{00} c_0 + B_{10} c_1$
$S'_1 = S_1^{(0)} - B_{21} c_2 = B_{01} c_0 + B_{11} c_1$
One needs to solve this system of equations to recover $c_0$, $c_1$. Instead of applying Gaussian elimination, one can use the above described method based on the Forney formula. Let us define $\Lambda(x) = (1 - \alpha^0 x)(1 - \alpha^1 x) = 1 + (\alpha + 1)x + \alpha x^2$. It can be seen that $\Lambda'(x) = \alpha + 1 + 2\alpha x = \alpha + 1$. One computes $\Gamma(x) = (S'_0 + S'_1 x)\Lambda(x) \bmod x^2$, and obtains $c_0$ and $c_1$ from the Forney formula, in which the denominator is $\Lambda' = \alpha + 1$ for both symbols.
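The steps of this worked example can be sketched end to end in Python. The sketch takes $B$ as given in the description and encodes $GF(2^2)$ elements as integers 0 to 3, which is a choice made here; for the final $2 \times 2$ system it uses a direct solve by Cramer's rule in place of the Forney-based step, and `decode_inputs` is a hypothetical helper that only serves to check the result.

```python
from itertools import product

# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3; addition is XOR.
EXP = [1, 2, 3]
LOG = {1: 0, 2: 1, 3: 2}


def gf_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]


def gf_inv(x):
    return EXP[(-LOG[x]) % 3]


B = [[1, 1, 1],
     [2, 3, 1],
     [3, 2, 1]]                 # B = F^-1 from the description
DATA_POS = (2, 4, 5, 7, 8)


def comb(symbols, col):
    """sum_j symbols[j] * B[j][col] over GF(4)."""
    acc = 0
    for j, s in enumerate(symbols):
        acc ^= gf_mul(s, B[j][col])
    return acc


def systematic_encode(data):
    c = [0] * 9
    for pos, val in zip(DATA_POS, data):
        c[pos] = val
    inv00 = gf_inv(B[0][0])
    # Step 1: right-layer nodes 1 and 2 have a zero 0-th input -> c3, c6.
    for i in (1, 2):
        c[3 * i] = gf_mul(inv00, gf_mul(c[3 * i + 1], B[1][0]) ^ gf_mul(c[3 * i + 2], B[2][0]))
    # Step 2: input 1 of right-layer nodes 1 and 2.
    s1 = [0, comb(c[3:6], 1), comb(c[6:9], 1)]
    # Step 3: left-layer node 1 has a zero 0-th input -> its 0-th output.
    s1[0] = gf_mul(inv00, gf_mul(s1[1], B[1][0]) ^ gf_mul(s1[2], B[2][0]))
    # Step 4: solve the 2x2 system for c0, c1 (Cramer's rule).
    r0 = gf_mul(c[2], B[2][0])             # S'_0, since input 0 is zero
    r1 = s1[0] ^ gf_mul(c[2], B[2][1])     # S'_1
    det = gf_mul(B[0][0], B[1][1]) ^ gf_mul(B[1][0], B[0][1])
    c[0] = gf_mul(gf_mul(r0, B[1][1]) ^ gf_mul(r1, B[1][0]), gf_inv(det))
    c[1] = gf_mul(gf_mul(r1, B[0][0]) ^ gf_mul(r0, B[0][1]), gf_inv(det))
    return c


def decode_inputs(c):
    """Invert both layers; returns w with w[3a+b] the a-th input of left node b."""
    t = [[comb(c[3 * n:3 * n + 3], b) for b in range(3)] for n in range(3)]
    return [comb([t[n][b] for n in range(3)], a) for a in range(3) for b in range(3)]


codeword = systematic_encode([1, 2, 3, 1, 2])
```

A codeword is consistent with the frozen set $\{0, 1, 2, 3\}$ when the four frozen inputs come out as zero, which here are entries 0, 1, 3 and 6 of `decode_inputs(codeword)` because of the digit-reversal permutation.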
Observe that one could also use matrix F = [figure: alternative kernel matrix] or, equivalently, place the data into symbols $c_0, c_4, c_5, c_7, c_8$. In this case one would obtain $\Lambda(x) = (1 - \alpha^1 x)(1 - \alpha^2 x) = 1 + x + x^2$. In this case computing $\Gamma(x)$ would not require any multiplications, and the check symbols would be given by the expressions $c_1 = \frac{\alpha\,\Gamma(\alpha^{-1})}{\Lambda'(\alpha^{-1})} = \alpha(\Gamma_0 + \Gamma_1\alpha + \Gamma_1)$, $c_2 = \frac{\alpha^2\,\Gamma(\alpha^{-2})}{\Lambda'(\alpha^{-2})} = \alpha^2(\Gamma_0 + \Gamma_1\alpha)$. Observe that $(\Gamma_0 + \Gamma_1\alpha)$ can be computed once, and re-used in both of these expressions.
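The first worked example (data in c2, c4, c5, c7, c8; frozen set {0,1,2,3}) can be checked numerically. The following is a minimal sketch, not the patented implementation: it assumes GF(4) elements are encoded as the integers 0, 1, 2 (= α), 3 (= α²), that the kernel is the 3×3 matrix whose inverse is the matrix B quoted in the text, and that input j of node i in the codeword-side layer feeds output i of node j in the other layer. The helper names (`mul`, `inv`, `encode`, `unencode`) are hypothetical, and the 2×2 system for c0, c1 is solved directly by Cramer's rule (the elimination route mentioned in the text) rather than by the Forney-type shortcut.

```python
# GF(4) = {0, 1, a, a^2} encoded as integers 0, 1, 2, 3; addition is XOR.
EXP = [1, 2, 3]                 # a^0, a^1, a^2
LOG = {1: 0, 2: 1, 3: 2}        # discrete logarithms to base a

def mul(x, y):
    """Multiply two GF(4) elements."""
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]

def inv(x):
    """Multiplicative inverse of a non-zero GF(4) element."""
    return EXP[(-LOG[x]) % 3]

def dot(u, v):
    """Inner product of two vectors over GF(4)."""
    acc = 0
    for x, y in zip(u, v):
        acc ^= mul(x, y)
    return acc

def col(M, j):
    return [row[j] for row in M]

def matmul(X, Y):
    return [[dot(row, col(Y, j)) for j in range(3)] for row in X]

A1, A2 = 2, 3                                   # a and a^2
F = [[1, A2, A1], [1, A1, A2], [1, 1, 1]]       # assumed kernel, inverse of B
B = [[1, 1, 1], [A1, A2, 1], [A2, A1, 1]]       # B = F^{-1} as quoted in the text

def encode(data):
    """Systematic encoding: data = (c2, c4, c5, c7, c8) -> 9-symbol codeword."""
    c = [0] * 9
    c[2], c[4], c[5], c[7], c[8] = data
    # 0 = B00*c[3i] + B10*c[3i+1] + B20*c[3i+2], with B00 = 1, for i = 1, 2
    for i in (1, 2):
        c[3 * i] = mul(B[1][0], c[3 * i + 1]) ^ mul(B[2][0], c[3 * i + 2])
    # Output symbols 1 and 2 of node 1 in the other layer
    s1 = [0, dot(c[3:6], col(B, 1)), dot(c[6:9], col(B, 1))]
    # Its 0-th input is frozen: 0 = B00*s1[0] + B10*s1[1] + B20*s1[2]
    s1[0] = mul(B[1][0], s1[1]) ^ mul(B[2][0], s1[2])
    # S0' = 0 - B20*c2 and S1' = s1[0] - B21*c2 (minus equals plus in GF(4))
    S0 = mul(B[2][0], c[2])
    S1 = s1[0] ^ mul(B[2][1], c[2])
    # Solve S0' = B00*c0 + B10*c1, S1' = B01*c0 + B11*c1 by Cramer's rule
    det_inv = inv(mul(B[0][0], B[1][1]) ^ mul(B[1][0], B[0][1]))
    c[0] = mul(det_inv, mul(S0, B[1][1]) ^ mul(S1, B[1][0]))
    c[1] = mul(det_inv, mul(S1, B[0][0]) ^ mul(S0, B[0][1]))
    return c

def unencode(c):
    """Map a codeword back to the input vector u (index 3*j + m, see lead-in)."""
    v = [[dot(c[3 * i:3 * i + 3], col(B, j)) for j in range(3)] for i in range(3)]
    return [dot([v[i][j] for i in range(3)], col(B, m))
            for j in range(3) for m in range(3)]
```

Under these assumptions, `unencode(encode(data))` should have zeros in positions 0 to 3 for any choice of the five data symbols, which is the defining property of a codeword with frozen set {0,1,2,3}.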
Delayed encoding
FIGS. 3 and 4 illustrate embodiments of methods for delayed encoding and for updating encoded codewords. In FIGS. 3 and 4, it is assumed that check symbols are located in positions 3, 7, 11, 14, and 15.
FIG. 3 illustrates a method for delayed encoding. Positions 0, 1 and 2 comprise previously encoded data symbols x0, x1, and x2. Position 3, indicated with reference number 300, comprises a parity symbol p0 that is marked as unknown.
In a first processing step S10, a step of the encoding algorithm is performed, the parity symbol p0 is computed, stored in position 3, and marked as known, as indicated with reference number 310.
In a second processing step S12, new data symbols x3, x4, and x5 arrive. They are stored in positions 4, 5 and 6, and marked as known, as indicated with reference numbers 320, 322 and 324.
In a third processing step S14, a further step of the encoding algorithm is performed and the parity symbol p1 is computed and stored in position 7, as indicated with reference number 330.
In a fourth processing step S16, new data symbols x6, x7, x8, x9 and x10 arrive. They are stored in positions 8, 9, 10, 12 and 13, and marked as known, as indicated with reference numbers 340, 342, 344, 346 and 348.
In a fifth processing step S18, a further step of the encoding algorithm is performed and the parity symbols p2 and p3 are computed and stored in positions 11 and 14, as indicated with reference number 330.
In a sixth processing step S20, a further step of the encoding algorithm is performed and the global parity symbol p4 is computed. The above approach can be extended in order to implement partial update of the information symbols. This is illustrated in FIG. 4.
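The delayed-encoding flow of FIG. 3 can be sketched as follows. This is an illustration only: the check symbols are assumed here to be plain XOR parities over hypothetical groups of data positions, which stands in for the polar-code encoding step actually used by the method, and the names `Stripe`, `write` and `encode_step` are invented for the example. Only the layout (check symbols in positions 3, 7, 11, 14 and 15) is taken from the text.

```python
from functools import reduce

# Layout from the text: check symbols sit in positions 3, 7, 11, 14 and 15.
DATA_POS = [0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13]
# ASSUMPTION: each check symbol is the XOR of the listed data positions.
# This is a stand-in for the polar-code encoding step, not the real code.
CHECKS = {3: [0, 1, 2], 7: [4, 5, 6], 11: [8, 9, 10], 14: [12, 13], 15: DATA_POS}

class Stripe:
    def __init__(self):
        self.value = [None] * 16              # None means "marked unknown"

    def known(self, pos):
        return self.value[pos] is not None

    def write(self, symbols):
        """Store newly arrived data symbols in the next free data positions."""
        free = [p for p in DATA_POS if not self.known(p)]
        for pos, sym in zip(free, symbols):
            self.value[pos] = sym             # the position is now "known"

    def encode_step(self):
        """Compute every check symbol whose inputs are all known.

        Returns the list of check positions filled in by this call."""
        done = []
        for pos, deps in CHECKS.items():
            if not self.known(pos) and all(self.known(d) for d in deps):
                self.value[pos] = reduce(lambda a, b: a ^ b,
                                         (self.value[d] for d in deps))
                done.append(pos)
        return done
```

Calling `write` with three symbols followed by `encode_step` fills position 3, and so on as further chunks arrive; check symbols whose inputs are incomplete simply stay unknown until a later call, so no data needs to be buffered outside the stripe.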
In a first processing step S20 of the updating method, symbol x0 is updated, i.e., it is marked as obsolete, indicated with reference number 410, and the new symbol x0' is stored, indicated with reference number 412. A corresponding check symbol p0' is marked as unknown, indicated with reference number 414.
In a second processing step S21, the previous check symbol p0 is marked as obsolete, indicated with reference number 420, and the new check symbol p0' is marked as known, indicated with reference number 422.
In a third processing step S22, the check symbols are updated and stored in place of the old check symbols. This is indicated with reference numbers 430, 432, and 434.
To do this, one can store new values of information symbols in some other blocks on the same storage devices as the old ones, and mark the blocks storing old and new values of information symbols as obsolete and known, respectively. Furthermore, one should allocate the blocks to store the updated values of check symbols, and mark them unknown. Then the above described encoding algorithm should be executed until all unknown blocks become known. Every time a new value of a check symbol is computed, the block storing the old value should be marked obsolete. After all new values of check symbols have been computed, the corresponding obsolete blocks should be released. Observe that if some devices fail before the above described update process completes, one can still recover the corresponding data by employing obsolete data blocks.
Implementing this approach requires one to maintain a directory, which stores the addresses of the actual and obsolete blocks corresponding to a stripe, as well as their status (known/unknown/obsolete).
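A directory of this kind might be organized as in the sketch below. It is a simplified illustration under stated assumptions: a stripe carries a single check symbol computed as the XOR of its data symbols (a stand-in for the polar-code check symbols), block addresses are consecutive integers, and the names `Block`, `StripeDirectory`, `begin_update` and `finish_update` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Block:
    address: int
    status: str          # "known", "unknown" or "obsolete"
    value: int = 0

class StripeDirectory:
    """Tracks the current and obsolete blocks of one stripe.

    ASSUMPTION: one XOR check symbol per stripe, used only for illustration."""

    def __init__(self, data):
        self.next_address = 0
        self.data = [self._alloc("known", v) for v in data]
        self.check = self._alloc("known", self._parity())
        self.pending = None      # newly allocated check block, still unknown
        self.obsolete = []       # old blocks kept until the update completes

    def _alloc(self, status, value=0):
        blk = Block(self.next_address, status, value)
        self.next_address += 1
        return blk

    def _parity(self):
        p = 0
        for blk in self.data:
            p ^= blk.value
        return p

    def begin_update(self, index, new_value):
        """Store the new data value in a fresh block; keep the old one as obsolete."""
        old = self.data[index]
        old.status = "obsolete"
        self.obsolete.append(old)
        self.data[index] = self._alloc("known", new_value)
        self.pending = self._alloc("unknown")

    def finish_update(self):
        """Compute the new check symbol, retire the old one, release obsolete blocks."""
        self.pending.value = self._parity()
        self.pending.status = "known"
        self.check.status = "obsolete"
        self.obsolete.append(self.check)
        self.check, self.pending = self.pending, None
        released = [blk.address for blk in self.obsolete]
        self.obsolete.clear()
        return released
```

Between `begin_update` and `finish_update` the old data block and the old check block are both still present, which is what allows recovery from the obsolete blocks if a device fails in the middle of an update.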
To summarize, the invention provides a method for encoding the data in a storage system with a polar code, which includes a method for finding parameters of the polar code, a systematic encoding algorithm, a delayed encoding method, and a method for partial updating of the encoded data.
Embodiments of the present invention employ polar codes for encoding the data in a distributed storage system, and provide a method for their construction which enables local data recovery and protection against a given number of block, disk and server failures, as well as a fast algorithm for their encoding and erasure decoding. Furthermore, the invention presents techniques which enable the data to be written to the system in small blocks, and provide an efficient implementation of the partial update operation. This can provide one or more of the following advantages compared to existing approaches such as the Reed-Solomon code used in HDFS-RAID:
1. The ability to recover the data stored on failed disks within a server locally without accessing any other servers, thus reducing the network traffic.
2. The ability to recover the data on failed servers by contacting at most $l_i$ surviving servers, thus avoiding costly network data transfers during the rebuild phase. HDFS-RAID does not provide this feature at all.
3. The ability to perform delayed encoding, i.e. to encode small chunks of data as soon as they arrive, thus avoiding buffering. Observe that if buffering is used, the data may be lost if the system crashes before it is written to disks. Therefore, the proposed approach reduces the probability of data loss due to such failures.
4. Furthermore, one can balance the computational load over time, and thus improve the overall system responsiveness, by delaying computation of some of the check blocks until system load becomes sufficiently low.
5. The ability to perform efficient encoding and updating of small chunks of data enables one to employ long codes in a storage system. This in turn allows one to increase the payload capacity of the system for a fixed data loss probability.
6. The complexity of the proposed systematic encoding algorithm for polar codes is $O(n \log n)$, where $n$ is the code length. For pyramid, parity splitting and EvenOdd/RDP codes, only encoding algorithms with complexity $O(n^2)$ have been published up to now. High encoding complexity limits practical application of these codes to those with small $n$.
7. The proposed method enables one to construct codes over the field $GF(q)$, $q \ge \max l_i$, while the constructions based on parity splitting and pyramid codes require field size $q \ge n$. This results in reduced complexity of arithmetic operations.
8. The proposed fast algorithm for systematic encoding of data with a polar code with Reed-Solomon kernel is also applicable in telecommunication systems employing the corresponding polar codes.
These advantages make it possible to construct large-scale distributed fault tolerant storage systems.
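The complexity gap named in advantage 6 comes from the layered (Kronecker product) structure of the code. The sketch below illustrates it under a simplifying assumption: it uses the binary 2×2 Arikan kernel F = [[1, 0], [1, 1]] over GF(2) instead of the Reed-Solomon kernels of the description, and it shows only the non-systematic transform, not the systematic encoder. The function names are hypothetical.

```python
def kron_transform(u):
    """Return u * (F kron ... kron F) over GF(2), F = [[1, 0], [1, 1]].

    len(u) must be a power of two. There are log2(n) layers and each layer
    performs n/2 XORs, so the total work is O(n log n)."""
    x = list(u)
    n = len(x)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for i in range(start, start + step):
                x[i] ^= x[i + step]       # one 2x2 kernel node
        step *= 2
    return x

def dense_transform(u):
    """Reference: build the n x n matrix explicitly and multiply, O(n^2) work."""
    n = len(u)
    G = [[1]]
    while len(G) < n:
        # Kronecker product F kron G = [[G, 0], [G, G]]
        G = [row + [0] * len(G) for row in G] + [row + row for row in G]
    return [sum(u[i] & G[i][j] for i in range(n)) % 2 for j in range(n)]
```

Both functions compute the same vector; the layered version touches each symbol once per layer, whereas the dense version touches every matrix entry.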
The foregoing descriptions are only implementation manners of the present invention; the scope of protection of the present invention is not limited to them. Variations or replacements can easily be made by a person skilled in the art. Therefore, the protection scope of the present invention should be subject to the protection scope of the attached claims.

Claims

1. A method for encoding input data (122) in a codeword (124), wherein the codeword is obtained as a product of a vector $u'$ and a matrix $A$, the vector $u'$ comprises symbols $u'_i = 0$ for $i \in \mathcal{F}$, the remaining positions $i \notin \mathcal{F}$ comprise the input data and the matrix $A$ is representable as $A = B F_0 \otimes F_1 \otimes \cdots \otimes F_{m-1}$, wherein $B$ is a permutation matrix and $F_0, F_1, \ldots, F_{m-1}$ are $l_i \times l_i$ matrices over $GF(2^{t_i})$ not being permutation-equivalent to diagonal matrices, wherein
the set 7 comprises integers s wi
the set 7 comprises integers s =
Figure imgf000020_0001
< lit such that o^5; + 1) < d, or
the set 7 comprises integers s = i Π™ο2 lj> 0 < i < lm-\, 0≤ s < p T\™o lj.
2. The method of claim 1, wherein the method is a method for recovering erased data and wherein the method is based on an encoding scheme based on nodes F_i (102-112, 202-210) that correspond to the matrices $F_i$ and the method comprises:
marking one or more erased symbols of the codeword as unknown and one or more non-erased symbols of the codeword as known,
marking one or more symbols $u_i$ with $i \in \mathcal{F}$ as known,
for any node F_i in the encoding scheme, if t of its output symbols are marked unknown, marking t of its topmost input symbols as unknown, unless they have already been marked known,
if a node F_i has t known input symbols and t unknown output symbols, mark the remaining output symbols as unknown, unless they are marked known,
repeating until all symbols of the codeword are marked known:
if a node has t known input symbols, t unknown and $l - t$ known output symbols, recover unknown output symbols by local decoding at the node and mark all output symbols as known, and
if all output symbols of a node are known, compute unknown input symbols and mark them known.
3. The method of claim 1, wherein the method is a method for systematic encoding and obtaining the codeword comprises initial steps of:
selecting data positions of the codeword as p' = Σ^ο1 Pm-i-i Π;=ο (/wnere
Figure imgf000021_0001
placing the input data into the data positions of the codeword and marking remaining positions of the codeword as erased.
4. The method of claim 3, further comprising a step of recovering the symbols in the remaining codeword positions using the method of claim 2.
5. The method of claim 2, wherein the local decoding comprises:
solving equations $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$, $0 \le i < t$, where $B = F^{-1}$, for unknown output symbols $c_{s_j}$, $0 \le j < t$, using known input symbols $S_i$, and
as soon as all output symbols become known, computing unknown input symbols $S_i = \sum_{j=0}^{l-1} B_{ij} c_j$.
6. The method of claim 5, wherein the local decoding comprises:
computing unknown output symbols as $c_{s_j} = \frac{\alpha_{s_j}\,\Gamma(\alpha_{s_j}^{-1})}{\Lambda'(\alpha_{s_j}^{-1})}$, where $\Gamma(x) \equiv S(x)\Lambda(x) \bmod x^t$, and
as soon as all output symbols are known, computing unknown input symbols $S_i = \sum_{j} B_{ij} c_j$.
7. The method of one of the previous claims, wherein the matrices $F_0, F_1, \ldots, F_{m-1}$ are selected as $F_s = (f_{ij})$, $f_{ij} = \alpha_j^{l_s - 1 - i}$, wherein $\alpha_j$ are arranged into a sequence of cyclotomic cosets.
8. The method of one of claims 2 to 7, wherein a Fast Fourier Transform, FFT, is utilized for determining unknown values from known values.
9. The method of one of the previous claims, further comprising a step of partially encoding data with a length that is less than code dimension $k = \prod_{i=0}^{m-1} l_i - |\mathcal{F}|$.
10. The method of one of the previous claims, further comprising a step of storing an i-th element of the codeword on an i-th node, wherein the node is a server or a disk within a server.
11. A method for updating a codeword encoded according to the method of one of the previous claims, the method comprising:
marking a symbol of the codeword to be updated as obsolete,
storing a new value of the symbol to be updated,
marking one or more check symbols of the codeword as obsolete, allocating one or more new check symbols of the codeword and marking them as unknown,
recursively determining values of one or more symbols marked as unknown, wherein when a symbol is marked known, its previous value is marked obsolete and when all symbols of a codeword are either known or obsolete, one or more positions marked as obsolete are erased.
12. Storage controller, configured to carry out the method of one of the previous claims.
13. A computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of one of claims 1 to 10.
PCT/RU2015/000655 2015-10-09 2015-10-09 Coding for distributed storage system WO2017061891A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/RU2015/000655 WO2017061891A1 (en) 2015-10-09 2015-10-09 Coding for distributed storage system
CN201580083657.1A CN108141228A (en) 2015-10-09 2015-10-09 The coding of distributed memory system

Publications (2)

Publication Number Publication Date
WO2017061891A1 true WO2017061891A1 (en) 2017-04-13
WO2017061891A9 WO2017061891A9 (en) 2017-06-15

Family

ID=55967384

Country Status (2)

Country Link
CN (1) CN108141228A (en)
WO (1) WO2017061891A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110208996A1 (en) * 2010-02-22 2011-08-25 International Business Machines Corporation Read-other protocol for maintaining parity coherency in a write-back distributed redundancy data storage system
US20140331083A1 (en) * 2012-12-29 2014-11-06 Emc Corporation Polar codes for efficient encoding and decoding in redundant disk arrays

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7676735B2 (en) * 2005-06-10 2010-03-09 Digital Fountain Inc. Forward error-correcting (FEC) coding and streaming
CN101834898B (en) * 2010-04-29 2013-01-30 中科院成都信息技术有限公司 Method for storing network distributed codes
CN102624866B (en) * 2012-01-13 2014-08-20 北京大学深圳研究生院 Data storage method, data storage device and distributed network storage system
US9203902B2 (en) * 2012-01-31 2015-12-01 Cleversafe, Inc. Securely and reliably storing data in a dispersed storage network
US8996950B2 (en) * 2012-02-23 2015-03-31 Sandisk Technologies Inc. Erasure correction using single error detection parity
CN103336785B (en) * 2013-06-04 2016-12-28 华中科技大学 A kind of distributed storage method based on network code and device thereof
CN107844268B (en) * 2015-06-04 2021-09-14 华为技术有限公司 Data distribution method, data storage method, related device and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ERDAL ARIKAN: "A survey of reed-muller codes from polar coding perspective", INFORMATION THEORY WORKSHOP (ITW), 2010 IEEE, IEEE, PISCATAWAY, NJ, USA, 6 January 2010 (2010-01-06), pages 1 - 5, XP031703947, ISBN: 978-1-4244-6372-5 *
ESMAILI KYUMARS SHEYKH ET AL: "CORE: Cross-object redundancy for efficient data repair in storage systems", 2013 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, IEEE, 6 October 2013 (2013-10-06), pages 246 - 254, XP032535096, DOI: 10.1109/BIGDATA.2013.6691581 *
ESMAILI KYUMARS SHEYKH ET AL: "Efficient updates in cross-object erasure-coded storage systems", 2013 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, IEEE, 6 October 2013 (2013-10-06), pages 28 - 32, XP032535038, DOI: 10.1109/BIGDATA.2013.6691658 *
HUANG PENGFEI ET AL: "Cyclic linear binary locally repairable codes", 2015 IEEE INFORMATION THEORY WORKSHOP (ITW), IEEE, 26 April 2015 (2015-04-26), pages 1 - 5, XP032788757, ISBN: 978-1-4799-5524-4, [retrieved on 20150624], DOI: 10.1109/ITW.2015.7133128 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018228357A1 (en) * 2017-06-15 2018-12-20 Huawei Technologies Co., Ltd. Methods and apparatus for encoding and decoding based on layered polar code
US10505566B2 (en) 2017-06-15 2019-12-10 Huawei Technologies Co., Ltd. Methods and apparatus for encoding and decoding based on layered polar code
CN111066250A (en) * 2017-06-15 2020-04-24 华为技术有限公司 Method and device for encoding and decoding based on layered polarization code
CN111066250B (en) * 2017-06-15 2021-11-19 华为技术有限公司 Method and device for encoding and decoding based on layered polarization code

Also Published As

Publication number Publication date
CN108141228A (en) 2018-06-08
WO2017061891A9 (en) 2017-06-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15860015

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15860015

Country of ref document: EP

Kind code of ref document: A1