WO2017061891A1

WO2017061891A1 - Coding for distributed storage system

Info

Publication number: WO2017061891A1
Application number: PCT/RU2015/000655
Authority: WO
Inventors: Peter Vladimirovich Trifonov; Yunfeng Shao; Yuangang WANG
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2015-10-09
Filing date: 2015-10-09
Publication date: 2017-04-13
Also published as: CN108141228A; WO2017061891A9

Abstract

The present invention relates to a method for encoding input data in a codeword, wherein the codeword is obtained as a product of a vector u' and a matrix A, the vector u' comprises symbols (I), the remaining positions (II) comprise the input data and the matrix A is representable as (III), wherein B is a permutation matrix and $F_0, F_1, \ldots, F_{m-1}$ are $l_i \times l_i$ matrices over $GF(2^\mu)$ not being permutation-equivalent to diagonal matrices, wherein the set ℱ comprises integers s with (IV), the set ℱ comprises integers (V) such that (VI), or the set ℱ comprises integers (VII).

Description

CODING FOR DISTRIBUTED STORAGE SYSTEM

TECHNICAL FIELD

The present invention relates to a method for encoding input data in a codeword and to a method for updating a codeword. The present invention also relates to a storage controller and to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out one of the above methods.

BACKGROUND

In an information storage system that consists of N_s servers, each equipped with N_d storage devices, the servers, the disks and the blocks on them may fail or go temporarily offline at any time for many different reasons. In order to ensure that the information stored in the system is continuously available, one may store multiple copies of data on different servers/disks (this approach is adopted in the Google File System and the Hadoop Distributed File System). However, this severely increases storage requirements and the overall equipment cost. Another solution is to employ some kind of erasure coding, i.e. partition a chunk of data (stripe) into k information blocks (symbols), compute for them n − k parity (check) blocks (symbols), and store these blocks on different disks and servers. If any of them fails, one can consider the corresponding symbols as erased, and try to recover the missing blocks by means of erasure decoding of the corresponding code.

Numerous erasure correcting codes for network storage systems have been suggested, including Reed-Solomon codes, Pyramid codes, EvenOdd and RDP codes, parity splitting codes, and Zigzag codes. An (n, k, n − k + 1) Reed-Solomon code provides protection against any combination of up to n − k erasures, and therefore has the lowest possible redundancy. However, recovering any erased symbol requires one to access at least k surviving symbols. Pyramid and parity splitting codes provide the ability to recover a number of erasures by accessing at most l < k non-erased symbols. This is achieved at the expense of higher redundancy of the code. Essentially, these constructions are obtained from some maximum distance separable code (e.g. Reed-Solomon) by introducing into codewords additional check symbols, which depend only on some subsets of information symbols.

Array codes, such as EvenOdd, RDP and Zigzag codes, are defined over a vector alphabet. This enables one to design efficient encoding algorithms, as well as to reduce the amount of data to be transmitted over the network in case of erasure recovery. However, there are still no explicit and efficient methods for construction of these codes for arbitrary values of code dimension k and redundancy n − k.

The performance of a network storage system depends on the amount of traffic generated and the number of servers contacted during encode and rebuild operations, on the disk access rate and on the computational complexity of the associated algorithms. An important problem arising in such systems is that applications tend to write data in relatively small chunks consisting of fewer than k blocks. This requires one to implement bufferization, i.e. to accumulate the data somewhere until a sufficient amount of it is collected. This approach may result in data loss, since the storage device used for bufferization may itself fail. Furthermore, applications may need to update some information blocks in previously stored stripes. In order to keep check blocks consistent, one needs to fetch their old values, as well as old values of information blocks, compute their difference, and update them to reflect new data. This involves many input/output and network transfer operations, which severely degrade system performance.

If erasure decoding is performed, one should contact as few servers as possible, since each network data transfer incurs a very high performance penalty. This problem is addressed with the construction of locally decodable codes, such as pyramid and parity splitting ones. However, since these codes have more check symbols than comparable maximum distance separable ones, partial update introduces even higher performance overhead. Therefore, this method has been used only with immutable data.

SUMMARY OF THE INVENTION

It is an objective of the present invention to provide a method for encoding data and a method for updating a codeword, wherein the methods overcome one or more of the above-mentioned problems of the prior art. In particular, an objective of the present invention can include ensuring data availability in a distributed storage system, which may suffer from block, device and server failures.

A first aspect of the invention provides a method for encoding input data in a codeword, wherein the codeword is obtained as a product of a vector $u'$ and a matrix $A$, the vector $u'$ comprises symbols $u'_i = 0$ for $i \in \mathcal{F}$, the remaining positions $i \notin \mathcal{F}$ comprise the input data and the matrix $A$ is representable as $A = B F_0 \otimes F_1 \otimes \cdots \otimes F_{m-1}$, wherein $B$ is a permutation matrix and $F_0, F_1, \ldots, F_{m-1}$ are $l_i \times l_i$ matrices over $GF(2^\mu)$ not being permutation-equivalent to diagonal matrices, wherein

- the set ℱ comprises integers s with $0 \le s < r \prod_{j=0}^{m-2} l_j$,

- the set ℱ comprises integers $s = \sum_{i=0}^{m-1} s_i \prod_{j=0}^{i-1} l_j$, $0 \le s_i < l_i$, such that $\prod_{i=0}^{m-1} (s_i + 1) < d$, or

- the set ℱ comprises integers $i \prod_{j=0}^{m-2} l_j + s$, $0 \le i < l_{m-1}$, $0 \le s < p \prod_{j=0}^{m-3} l_j$.

In the above, $GF(2^\mu)$ stands for a finite field of order $2^\mu$, wherein μ is a natural number.

The different choices of the set ℱ are advantageous for different application scenarios:

• In order to ensure that r erasures can be recovered locally within each block of l symbols, one can include in ℱ integers s with $0 \le s < r \prod_{j=0}^{m-2} l_j$.

• In order to ensure that the code can correct any combination of d − 1 erasures (i.e. device failures), one can include in ℱ integers $s = \sum_{i=0}^{m-1} s_i \prod_{j=0}^{i-1} l_j$, $0 \le s_i < l_i$, such that $\prod_{i=0}^{m-1} (s_i + 1) < d$.

• In order to ensure that the code can correct any combination of p server failures, one can include in ℱ integers $i \prod_{j=0}^{m-2} l_j + s$, $0 \le i < l_{m-1}$, $0 \le s < p \prod_{j=0}^{m-3} l_j$.

• In order to ensure that the data loss probability is upper bounded by some pre-defined value π, one can include in ℱ integers $s = \sum_{i=0}^{m-1} s_i \prod_{j=0}^{i-1} l_j$, $0 \le s_i < l_i$, such that $\sum_{s \notin \mathcal{F}} P(\cdots P(P(p, s_{m-1}), s_{m-2}) \cdots, s_0) < \pi$. Here p is the probability of a server (node) failure within a given time interval (no replacements), and P(p, t) =

The method of the first aspect can use polar codes. These can be generated by some rows of matrix $A = B F_0 \otimes F_1 \otimes \cdots \otimes F_{m-1}$, where $B$ is a permutation matrix corresponding to the mapping $\sum_{i=0}^{m-1} s_i \prod_{j=0}^{i-1} l_j \rightarrow \sum_{i=0}^{m-1} s_i \prod_{j=i+1}^{m-1} l_j$, wherein $0 \le s_i < l_i$ and $F_i$ are some $l_i \times l_i$ matrices over $GF(2^\mu)$ not being permutation-equivalent to diagonal matrices.

Accordingly, there is presented a method for encoding the data with a polar code and storing the encoded values on the elements of storage system, as well as an efficient method implementing the encoding and decoding operations.

In an embodiment of the first aspect, there is provided a method for encoding data in a storage system, where the data are placed into selected positions within codeword c, which are declared known, and positions of check symbols are declared unknown, and the values of the unknown symbols are recursively determined so that a codeword $c = u'A$, $A = B F_0 \otimes F_1 \otimes \cdots \otimes F_{m-1}$, is obtained, where $u'$ is a vector with $u'_i = 0$, $i \in \mathcal{F}$. The symbol $c_i$ is stored on the i-th node (server or disk within a server).

In embodiments of the invention, one can obtain codes with the same erasure correcting properties in several different ways. For example, one can permute columns of matrices $F_i$, and multiply these columns by arbitrary non-zero values. Furthermore, there exist many different ways to assign information and check symbols within a codeword.

In a first implementation of the method according to the first aspect, the method is a method for recovering erased data, wherein the method is based on an encoding scheme based on nodes $F_i$ that correspond to the matrices $F_i$, and the method comprises:

- marking one or more erased symbols of the codeword as unknown and one or more non-erased symbols of the codeword as known,

- marking one or more symbols $u'_i$ with $i \in \mathcal{F}$ as known,

- for any node $F_i$ in the encoding scheme, if t of its output symbols are marked unknown, marking t of its topmost input symbols as unknown, unless they have already been marked known,

- if a node $F_i$ has t known input symbols and t unknown output symbols, marking the remaining output symbols as unknown, unless they are marked known,

- repeating until all symbols of the codeword are marked known:

if a node has t known input symbols, t unknown and l − t known output symbols, recovering unknown output symbols by local decoding at the node and marking all output symbols as known, and if all output symbols of a node are known, computing unknown input symbols and marking them known.

Thus, the method of the first implementation provides an efficient method for erasure recovery. In particular, the task of erasure recovery can be reduced to a plurality of local decoding tasks, for which efficient implementations exist, as further outlined below.

In a second implementation of the method according to the first aspect, the method is a method for systematic encoding and obtaining the codeword comprises initial steps of:

- selecting data positions of the codeword as $p' = \sum_{i=0}^{m-1} p_{m-1-i} \prod_{j=0}^{i-1} l_j$, where $p = \sum_{i=0}^{m-1} p_i \prod_{j=0}^{i-1} l_j \in \{0, \ldots, \prod_{i=0}^{m-1} l_i - 1\} \setminus \mathcal{F}$,

- placing the input data into the data positions of the codeword, and marking the remaining codeword positions as erased, and

- employing the above described method for recovering erased data.

This presents a particularly efficient way of systematic encoding, which is important for practical systems, where the input data should appear as part of the codeword.

In a third implementation of the method according to the first aspect, the method further comprises a step of recovering the symbols in the remaining codeword positions using the method of the first implementation of the first aspect.

The proposed method for selecting data positions within a codeword ensures that the described erasure recovery method always recovers all the check symbols of the codeword.

In a fourth implementation of the method according to the first aspect, the local decoding comprises:

- solving equations $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$, $0 \le i < t$, where $B = F^{-1}$, for unknown output symbols $c_{s_j}$, $0 \le j < t$, using known input symbols $S_i$, and

- as soon as all output symbols become known, computing unknown input symbols $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$.

In a fifth implementation of the method according to the first aspect, the local decoding is implemented as:

- computing unknown output symbols $c_{s_j}$ via the Forney formula from the polynomial $\Gamma(x) \equiv S(x)\Lambda(x) \bmod x^t$, and

- as soon as all output symbols are known, computing unknown input symbols $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$.

This has the advantage that low-complexity polynomial evaluation algorithms with complexity $O(t^2)$ can be used, instead of the generic Gaussian elimination method for solving systems of linear equations with complexity $O(t^3)$.

In a sixth implementation of the method according to the first aspect, the matrices $F_0, F_1, \ldots, F_{m-1}$ are selected as $F_s = (f_{ij})$, $f_{ij} = \alpha_j^{l_s - 1 - i}$, wherein the elements $\alpha_j$ are arranged into a sequence of cyclotomic cosets.

This provides a practical implementation of the method of the fifth implementation.

In a seventh implementation of the method according to the first aspect, a Fast Fourier Transform, FFT, is utilized for determining unknown values from known values.

In particular, the FFT can be used for evaluating the polynomials, as required by the above described local decoding method, thus providing a particularly efficient way of performing the local decoding.

In an eighth implementation of the method according to the first aspect, the method further comprises a step of partially encoding input data with a length that is less than the code dimension.

Thus, encoding can be performed partially until unknown symbols can be recovered. Encoding can be resumed as soon as additional data arrives. Long codes are needed in order to maximize the payload capacity of a storage system given some target data loss probability. However, the dimension of such codes may be too high compared to the amount of data which can be produced at once by an application. Therefore the eighth implementation provides a delayed encoding method, which can be used in order to generate a few check symbols for small pieces of data as soon as it arrives, until sufficient amount of data is accumulated in order to produce the whole codeword.

The method of the eighth implementation can make use of the idea to designate initially all codeword symbols as unknown, and put the information symbols into appropriate positions, designating them as known, as soon as they arrive. Then one can execute the above described systematic encoding algorithm, which may stop at some points due to lack of known symbols, and resume as soon as they appear. Observe that it is not likely that many devices fail within a short time span which is needed to accumulate t data blocks. Therefore, a few check symbols obtained during incomplete execution of the encoding algorithm may be sufficient to cope with such failures. As soon as the encoding algorithm completes its execution, the whole set of check symbols can be obtained, which ensures protection against many device failures, as required for long-term data storage.

Therefore, the method of the eighth implementation provides an efficient way of dealing with large codes and small information chunks provided by an application.

In a ninth implementation of the method according to the first aspect, the method further comprises a step of storing an i-th element of the codeword on an i-th node, wherein the node is a server or a disk within a server.

A second aspect of the invention refers to a method for updating a codeword encoded according to the method of the first aspect or one of its implementations, the method comprising:

- marking a symbol of the codeword to be updated as obsolete,

- storing a new value of the symbol to be updated,

- marking one or more check symbols of the codeword as obsolete,

- allocating one or more new check symbols and marking them as unknown, and

- recursively determining values of one or more symbols marked as unknown, wherein when a symbol is marked known, its previous value is marked obsolete and when all symbols of a codeword are either known or obsolete, one or more positions marked as obsolete are erased.

An implementation of the second aspect can provide a method for updating the data encoded according to the method of the first aspect, where the symbol to be updated is declared obsolete, a new value is stored on the device, the old check symbols are declared obsolete, and new check symbols are allocated and declared initially as unknown. Then the values of unknown symbols are recursively determined as in the case of the encoding procedure. Here every time a symbol is declared known, the block storing its previous value is declared obsolete. When all blocks corresponding to a codeword are either known or obsolete, obsolete blocks are erased.

A third aspect of the invention refers to a storage controller, configured to carry out the method of one of the previous claims. The controller can be implemented either in software or in hardware (e.g. ASIC, FPGA). The controller can be directly connected to the storage devices or it can be connected to the storage devices through a network connection, wherein e.g. the storage devices are connected to the network through a further controller.

A fourth aspect of the invention refers to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of the first or second aspect or one of the implementations of the first or second aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

To illustrate the technical features of embodiments of the present invention more clearly, the accompanying drawings provided for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description are merely some embodiments of the present invention, but modifications on these embodiments are possible without departing from the scope of the present invention as defined in the claims.

FIG. 1 is a block diagram illustrating an encoder structure in accordance with an embodiment of the present invention,

FIGs. 2A to 2D are block diagrams illustrating processing steps of an erasure recovery method in accordance with a further embodiment of the present invention,

FIG. 3 is a schematic diagram illustrating a method for delayed encoding in accordance with a further embodiment of the present invention, and

FIG. 4 is a schematic diagram illustrating a method for updating a codeword in accordance with a further embodiment of the present invention.

Detailed Description of the Embodiments

In a first embodiment, polar codes are generated by some rows of matrix $A = B F_0 \otimes F_1 \otimes \cdots \otimes F_{m-1}$, where $B$ is a permutation matrix corresponding to the mapping $\sum_{i=0}^{m-1} s_i \prod_{j=0}^{i-1} l_j \rightarrow \sum_{i=0}^{m-1} s_i \prod_{j=i+1}^{m-1} l_j$, $0 \le s_i < l_i$, and $F_i$ are some $l_i \times l_i$ matrices over $GF(2^\mu)$ not being permutation-equivalent to diagonal matrices. That is, the non-systematic encoding operation is performed as $c = vA$, where $v$ is a vector of length $n = \prod_{i=0}^{m-1} l_i$ having 0 at positions $i \in \mathcal{F}$ and information symbols $u_j$ at the remaining positions. Here it will be convenient to assume that $u_i$ and $c_i$ are elements of $GF(2^\mu)$, although in practice these may be column vectors (blocks) of $GF(2^\mu)$ values. Set ℱ will be referred to as the set of frozen symbols. For the sake of simplicity, in what follows we will assume that $F_i = F$, although the proposed construction is generic and can be used for any combination of $l_i$ values.
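The non-systematic encoding c = vA can be illustrated with a small sketch for m = 2 and l = 3 over GF(4). Several things here are assumptions made only for illustration: field elements 0, 1, a, a² are encoded as the integers 0, 1, 2, 3; the kernel F is chosen as the inverse of the matrix B = F^{-1} that appears in the worked example of this description; and the helper names (`gf4_mul`, `kron_row`, `build_A`, `encode`) are hypothetical.

```python
# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3; addition is XOR.
LOG = {1: 0, 2: 1, 3: 2}
EXP = [1, 2, 3]

def gf4_mul(x, y):
    """Multiply two GF(4) elements via discrete logarithms (a^3 = 1)."""
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]

L = 3
# Assumed kernel: F[i][j] = (a^j)^(L-1-i), i.e. the inverse of the matrix B
# given in the worked example.
F = [[1, 3, 2],
     [1, 2, 3],
     [1, 1, 1]]

def kron_row(x, y):
    """Kronecker product of two row vectors over GF(4)."""
    return [gf4_mul(xi, yj) for xi in x for yj in y]

def build_A():
    """Rows of A = B (F (x) F) for m = 2: because of the digit-reversal
    permutation B, row s = s0 + L*s1 of A equals F[s0] (x) F[s1]."""
    return [kron_row(F[s % L], F[s // L]) for s in range(L * L)]

def encode(v, A):
    """Non-systematic encoding c = v A over GF(4)."""
    c = [0] * len(A[0])
    for s, v_s in enumerate(v):
        for j, a_sj in enumerate(A[s]):
            c[j] ^= gf4_mul(v_s, a_sj)
    return c

A = build_A()
v = [0] * 9          # positions 0..3 would be frozen in the worked example
v[5] = 1
print(encode(v, A))  # [1, 2, 3, 1, 2, 3, 1, 2, 3]
```

Only a single information symbol is set in the usage lines, so the output is simply row 5 of A; a real stripe would fill all non-frozen positions.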

FIG. 1 illustrates an encoding scheme in accordance with an embodiment of the present invention. A system 100 comprises a set of nodes 102 to 112, wherein each of the nodes, denoted by "F", implements multiplication of a vector of input values (left-hand side inputs) by matrix F, and the result is passed via right-hand side terminals. The encoded symbols can be stored in the following ways:

1. Each symbol is stored on its own device, where the output of each "F" node in the right-hand side layer is stored in a single group of devices (e.g. within one server).

2. Each symbol is stored in its own block, the output of each node is stored on the same device, and the outputs of l adjacent nodes are stored within a single server.

It must be recognized that other mappings of codeword symbols onto storage devices are also possible. For the sake of concreteness, method 2 will be considered in what follows. Observe that failure of any block causes the data stored on it to be unavailable. This results in the corresponding codeword symbols being erased.

This general construction requires, however, one to specify the particular matrix F and a method for finding the set of frozen channels ℱ. In the context of storage systems it is advantageous to employ the Reed-Solomon kernel, which is given by $F_{ij} = \alpha_j^{l-1-i}$, where $\alpha_i$, $0 \le i < l$, are some distinct elements of $GF(2^\mu)$. More details on selection of $\alpha_i$ will be given below. Observe that for the case of the Reed-Solomon kernel the last i rows of F generate an $(l, i, l - i + 1)$ code, $1 \le i \le l$. This enables one to construct the set of frozen channels ℱ as follows:

1. In order to ensure that r erasures can be recovered locally within each block of l symbols, include into ℱ integers s, $0 \le s < r l^{m-1}$. For example, this enables one to perform recovery of a failed device locally within each server without accessing the network.

2. In order to ensure that the code can correct any combination of d − 1 erasures (i.e. device failures), include into ℱ integers $s = \sum_{i=0}^{m-1} s_i l^i$, $0 \le s_i < l$, such that $\prod_{i=0}^{m-1} (s_i + 1) < d$.

3. In order to ensure that the code can correct any combination of p server failures, include into ℱ integers $t + ls$, $0 \le t < l$, $0 \le s < p l^{m-2}$.

4. In order to ensure that the data loss probability is upper bounded by some pre-defined value π, include into ℱ integers $s = \sum_{i=0}^{m-1} s_i l^i$, $0 \le s_i < l$, such that $\sum_{s \notin \mathcal{F}} P(\cdots P(P(p, s_{m-1}), s_{m-2}) \cdots, s_0) < \pi$. Here p is the probability of a server (node) failure within a given time interval (no replacements).

In the example shown in FIG. 1, a set of 4 channels is frozen, indicated with reference number 120. Input data are provided as a set of symbols $u_0$ to $u_4$, indicated with reference number 122. Output symbols $c_0$ to $c_8$ are indicated with reference number 124.
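Rules 1 and 2 of the frozen-set construction are simple enough to check mechanically. The following is a minimal sketch for equal kernel sizes $l_i = l$; the function names are hypothetical and rules 3 and 4 are deliberately left out. For l = 3, m = 2, r = 1, d = 3 it yields the four frozen channels of the FIG. 1 example.

```python
from itertools import product
from math import prod

def digits_to_index(digits, l):
    """Map digits (s_0, ..., s_{m-1}) to s = sum_i s_i * l**i."""
    return sum(s_i * l**i for i, s_i in enumerate(digits))

def frozen_set(l, m, r=0, d=1):
    """Frozen indices from rule 1 (local recovery of r erasures per block)
    and rule 2 (correction of any d - 1 erasures)."""
    frozen = set(range(r * l**(m - 1)))           # rule 1: 0 <= s < r * l^(m-1)
    for digits in product(range(l), repeat=m):    # rule 2: prod(s_i + 1) < d
        if prod(s_i + 1 for s_i in digits) < d:
            frozen.add(digits_to_index(digits, l))
    return frozen

print(sorted(frozen_set(l=3, m=2, r=1, d=3)))  # [0, 1, 2, 3]
```

Rule 1 alone contributes {0, 1, 2} here, and rule 2 contributes {0, 1, 3}; the union is what gets frozen.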

An erasure recovery method

An embodiment of the invention presents the following algorithm for correction of erasures in a codeword of a polar code:

1. Mark all erased codeword symbols as "unknown", and non-erased as "known".

2. If a symbol $v_i$ corresponds to a frozen channel, mark it "known".

3. For any node F in the encoding scheme, if t of its output symbols are "unknown", then mark t of its topmost input symbols as also "unknown" (unless they are already marked "known").

4. If a node F has t known input symbols and t unknown output symbols, mark the remaining output symbols as "unknown", unless they are marked "known".

5. Repeat until all codeword symbols become known:

a. If a node has t known input symbols, t unknown and l − t known output symbols, recover unknown symbols by local decoding (see below) at the node. Mark all output symbols as "known".

b. If all output symbols of a node are known, compute unknown input symbols and mark them "known".

FIGs. 2A to 2D present an example of application of the above described erasure recovery method. The example system comprises five nodes, indicated with reference numbers 202, 204, 206, 208 and 210. Unknown values are shown with dashed lines, and known values are shown with uninterrupted lines.

FIG. 2A shows the initial situation. The symbols $c_0$, $c_1$, $c_3$, and $c_6$ have been erased and are marked as unknown. The first input of the first node 202 corresponds to a frozen channel and is therefore marked as known. The first inputs of the third, fourth and fifth nodes 206, 208, 210 are also frozen channels and are therefore also marked as known.

The third node 206 has two unknown output symbols; therefore, according to above rule 3), the second input is also marked as unknown. In a first processing step, according to above rule 5) a), local decoding is performed at the fourth node 208 and the fifth node 210. Thus, their outputs $c_3$ and $c_6$ can be computed and marked as known. Consequently, all outputs of the fourth node 208 and the fifth node 210 are known and their inputs can be marked as known, according to above rule 5) b). The situation after applying rules 5) a) and 5) b) in the first processing step is shown in FIG. 2B. Subsequently, the topmost output symbol of the first node 202 can be computed according to rule 5) a). The subsequent situation is shown in FIG. 2C. Finally, the output symbols $c_0$ and $c_1$ of the third node 206 can be computed by performing local decoding. Subsequently, all symbols $c_0$ to $c_8$ are known, as shown in FIG. 2D.

Observe that the most typical failure patterns include just one erasure. If the code is properly designed (i.e. r ≥ 1), this erasure can be recovered in a single iteration of this algorithm without transfer of information between nodes.
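The recovery order described for FIGs. 2A to 2D can be traced with a short sketch. It assumes the two-layer scheme with l = 3 over GF(4), the matrix B = F^{-1} from the worked example of this description, and the integer encoding 0, 1, 2, 3 for 0, 1, a, a². The schedule is hard-coded for the erasure pattern $c_0$, $c_1$, $c_3$, $c_6$ rather than derived by the general marking rules, and a brute-force search over GF(4) stands in for the local decoding step; all function names are hypothetical.

```python
from itertools import product

# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3; addition is XOR.
LOG = {1: 0, 2: 1, 3: 2}
EXP = [1, 2, 3]

def gf4_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]

B = [[1, 1, 1],
     [2, 3, 1],
     [3, 2, 1]]   # inverse kernel B = F^{-1} of the worked example

def node_inputs(outputs):
    """Inputs of a node from its outputs: S_i = sum_j c_j * B[j][i]."""
    S = [0, 0, 0]
    for i in range(3):
        for j in range(3):
            S[i] ^= gf4_mul(outputs[j], B[j][i])
    return S

def solve_node(inputs, outputs):
    """Fill in erased outputs (None) so that they agree with the known
    inputs (None = unknown input). Brute force replaces local decoding."""
    erased = [j for j, c in enumerate(outputs) if c is None]
    found = []
    for guess in product(range(4), repeat=len(erased)):
        trial = list(outputs)
        for j, g in zip(erased, guess):
            trial[j] = g
        S = node_inputs(trial)
        if all(s is None or s == S[i] for i, s in enumerate(inputs)):
            found.append((S, trial))
    if len(found) != 1:
        raise ValueError("node is not uniquely decodable yet")
    return found[0]

def recover(c):
    """Recover erased c_0, c_1, c_3, c_6 (given as None)."""
    blocks = [c[0:3], c[3:6], c[6:9]]
    S = [None, None, None]
    for i in (1, 2):   # nodes producing c_3 and c_6: frozen first input
        S[i], blocks[i] = solve_node([0, None, None], blocks[i])
    # left-hand node with frozen first input: recover its topmost output
    _, y = solve_node([0, None, None], [None, S[1][1], S[2][1]])
    # the node producing c_0, c_1 now has two known inputs
    S[0], blocks[0] = solve_node([0, y[0], None], blocks[0])
    return blocks[0] + blocks[1] + blocks[2]

print(recover([None, None, 1, None, 2, 3, None, 0, 1]))
# [3, 0, 1, 1, 2, 3, 3, 0, 1]
```

A production decoder would replace `solve_node` with one of the local decoding methods described in this section and derive the schedule from the known/unknown marks.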

Local decoding at a node can be performed as follows. Let $S_i$ and $c_j$ denote input and output symbols, respectively.

1. Solve equations $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$, $0 \le i < t$, where $B = F^{-1}$, for unknown output symbols $c_{s_j}$, $0 \le j < t$, using known input symbols $S_i$.

2. As soon as all output symbols become known, compute unknown input symbols $S_i = \sum_{j=0}^{l-1} c_j B_{ji}$.

Let $\hat{S}_i = \sum_{j \in K} c_j B_{ji}$, where K is the set of known output symbol indices. Then one obtains

$S_i - \hat{S}_i = \sum_{j \notin K} c_j B_{ji}$. (1)

The task of recovering unknown values $c_j$ can be recognized as the problem of erasure decoding of a code with check matrix $H = (B_{ji})$, $0 \le j < l$, $0 \le i < t$, for the case of syndrome vector $(S_0 - \hat{S}_0, \ldots, S_{t-1} - \hat{S}_{t-1})$, and the task of computing $\hat{S}_i$ can be recognized as the task of syndrome vector evaluation. Observe that syndrome vector evaluation can be performed using the cyclotomic fast Fourier transform.
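The two local decoding steps and equation (1) can be sketched with plain Gaussian elimination over GF(4). This is only an illustration under stated assumptions: l = 3, the matrix B = F^{-1} is the one from the worked example of this description, field elements 0, 1, a, a² are encoded as 0, 1, 2, 3, and the helper names are hypothetical.

```python
# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3; addition is XOR.
LOG = {1: 0, 2: 1, 3: 2}
EXP = [1, 2, 3]

def gf4_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]

def gf4_inv(x):
    """Inverse of a non-zero element: a^(-k) = a^(3-k)."""
    return EXP[(3 - LOG[x]) % 3]

B = [[1, 1, 1],
     [2, 3, 1],
     [3, 2, 1]]   # B = F^{-1} from the worked example

def solve_linear(M, rhs):
    """Solve M x = rhs over GF(4) by Gauss-Jordan elimination
    (M is t x t and assumed invertible)."""
    t = len(rhs)
    rows = [list(M[i]) + [rhs[i]] for i in range(t)]
    for col in range(t):
        pivot = next(r for r in range(col, t) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = gf4_inv(rows[col][col])
        rows[col] = [gf4_mul(inv, x) for x in rows[col]]
        for r in range(t):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [x ^ gf4_mul(f, y) for x, y in zip(rows[r], rows[col])]
    return [rows[i][t] for i in range(t)]

def local_decode(outputs, known_inputs):
    """outputs: list with None at erased positions; known_inputs: S_0..S_{t-1}.
    Returns (completed outputs, all inputs)."""
    t = len(known_inputs)
    erased = [j for j, c in enumerate(outputs) if c is None]
    known = [j for j, c in enumerate(outputs) if c is not None]
    assert len(erased) == t
    rhs = []
    for i in range(t):                     # right-hand side of (1)
        s_hat = 0
        for j in known:
            s_hat ^= gf4_mul(outputs[j], B[j][i])
        rhs.append(known_inputs[i] ^ s_hat)
    M = [[B[j][i] for j in erased] for i in range(t)]
    full = list(outputs)
    for j, v in zip(erased, solve_linear(M, rhs)):
        full[j] = v
    inputs = [0] * len(full)               # step 2: S_i = sum_j c_j B[j][i]
    for i in range(len(full)):
        for j in range(len(full)):
            inputs[i] ^= gf4_mul(full[j], B[j][i])
    return full, inputs

print(local_decode([None, None, 2], [0, 2]))  # ([1, 0, 2], [0, 2, 3])
```

Subtraction and addition coincide in characteristic 2, which is why the syndrome difference in (1) is computed with XOR.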

Observe that in the case of the Reed-Solomon kernel B turns out to be a Vandermonde matrix, so that syndrome evaluation reduces to computing t components of the discrete Fourier transform of vector c. Furthermore, one can construct polynomial S(x) with coefficients given by the elements of the syndrome vector. Then the values of the erased symbols can be obtained via the Forney formula, where

$\Gamma(x) \equiv S(x)\Lambda(x) \bmod x^t$ (3)

is the erasure evaluator polynomial, $s_j$ are positions of erased symbols, $\Lambda(x) = \prod_{j=0}^{t-1} (1 - \alpha_{s_j} x)$ is the erasure locator polynomial, and $\Lambda'(x)$ is its formal derivative. These expressions can be used to obtain the values of erased symbols with complexity $O(t^2)$ instead of solving the system of equations (1).

Fast systematic encoding

Practical storage systems require systematic encoding, i.e. an encoding method in which the information symbols appear as a part of the codeword.

The above described erasure decoding algorithm can be used to implement systematic encoding. To do this, one can place the information symbols to positions $\sum_{i=0}^{m-1} w_i l^{m-1-i}$ within the codeword, where $\sum_{i=0}^{m-1} w_i l^i \notin \mathcal{F}$, $0 \le w_i < l$, mark the remaining ones as erased, and execute the above described erasure decoding algorithm.

Advantageously, one can select the values $\alpha_i$ so that $\alpha_{s_j}$ form a set of conjugate elements, i.e. $\Lambda(x)$ is a polynomial with binary coefficients. In this case evaluation of Γ can be performed using the inverse cyclotomic fast Fourier transform algorithm, which requires $O(t \log \ldots)$ multiplications. Furthermore, computing $\Gamma(x)$ from (3) does not require any multiplications. Observe that such a choice is always possible, provided that μ is a power of 2.

Consider for example the case of a code over $GF(2^2)$ constructed for l = 3, m = 2. Let us introduce the requirement of r = 1 symbols being recoverable locally and any combination of 2 erasures being recoverable (i.e. d = 3). This implies that ℱ must contain elements 0, 1, 2, 3. Let us further assume that the probability of storage device failure within a certain time interval (e.g. one year) is p = 0.01, and require that the annual data loss probability does not exceed $10^{-4}$. One obtains $P(0.01, 1) = 0.0297$, $P(0.01, 2) = 2.98 \cdot 10^{-4}$, $P(0.01, 3) = 10^{-6}$, and the following sequence of values $P_{3 s_1 + s_0} = P(P(0.01, s_1), s_0)$:

.86e-1, .89e-3, .3e-5, .26e-2, .27e-6, .3e-11, .26e-4, .26e-10, .1e-17

The sum of all these values except the first four ones (since they are already included into ℱ) is $2.6 \cdot 10^{-5}$, so no additional symbols need to be frozen, i.e. one can set ℱ = {0, 1, 2, 3}. This results in a non-systematic encoder structure. Let $F = \begin{pmatrix} 1 & a^2 & a \\ 1 & a & a^2 \\ 1 & 1 & 1 \end{pmatrix}$, where a is a primitive root of $x^2 + x + 1$. Observe that $B = F^{-1} = \begin{pmatrix} 1 & 1 & 1 \\ a & a^2 & 1 \\ a^2 & a & 1 \end{pmatrix}$.

In order to implement systematic encoding, one places the data to be encoded in symbols $c_2, c_4, c_5, c_7, c_8$, and marks the corresponding symbols known. Codeword symbols $c_0, c_1, c_3, c_6$ are unknown. Observe that the 0-th input symbol of all F nodes in the rightmost layer in FIG. 1 is set to 0, i.e. it is known. Hence, the above described encoding algorithm requires one first to solve the equation $0 = B_{00} c_{3i} + B_{10} c_{3i+1} + B_{20} c_{3i+2}$ for unknown symbols $c_{3i}$, $0 < i < 3$. This can be done immediately.
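The loss-probability figures quoted in this example can be reproduced numerically. The closed form of P(p, t) is not shown in the text, so the sketch below rests on an assumption: P(p, t) is taken to be the probability that at least t out of l = 3 independent units fail, with t running over 1, 2, 3. Under that assumption the values 0.0297 and 2.98·10⁻⁴ quoted for P(0.01, 1) and P(0.01, 2) come out as stated; the name `tail` is hypothetical.

```python
from math import comb

L = 3  # kernel size l of the worked example

def tail(p, t, l=L):
    """ASSUMED meaning of P(p, t): probability that at least t of l
    independent units fail, each with probability p."""
    return sum(comb(l, i) * p**i * (1 - p)**(l - i) for i in range(t, l + 1))

p = 0.01
# Nested evaluation P(P(p, inner), outer); the inner argument runs fastest,
# which matches the order of the nine quoted values.
values = [tail(tail(p, inner), outer) for outer in (1, 2, 3) for inner in (1, 2, 3)]

# Contribution of the symbols outside the frozen set {0, 1, 2, 3}.
residual = sum(values[4:])

print([f"{v:.2e}" for v in values])
print(f"residual loss bound: {residual:.2e}, below target: {residual < 1e-4}")
```

With these assumptions the residual sum is a few times 10⁻⁵, i.e. below the 10⁻⁴ target, so no further channels would need to be frozen.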

Having recovered $c_{3i}$, one computes $S_1^{(i)} = B_{01} c_{3i} + B_{11} c_{3i+1} + B_{21} c_{3i+2}$, $1 \le i < 3$, i.e. the two last output symbols of node 1 in the leftmost layer of the scheme. To recover the 0-th output symbol $S_1^{(0)}$ of this node (observe that the 0-th input symbol of this block is known to be equal to 0), one again needs to solve $0 = B_{00} S_1^{(0)} + B_{10} S_1^{(1)} + B_{20} S_1^{(2)}$. Now one has both input symbols of the 0-th node at the rightmost layer known. They are equal to $S_0^{(0)} = 0$ and $S_1^{(0)}$. Observe that

$S'_0 = S_0^{(0)} - B_{20} c_2 = B_{00} c_0 + B_{10} c_1$

$S'_1 = S_1^{(0)} - B_{21} c_2 = B_{01} c_0 + B_{11} c_1$

One needs to solve this system of equations to recover $c_0$, $c_1$. Instead of applying Gaussian elimination, one can use the above described method based on the Forney formula. Let us define $\Lambda(x) = (1 - a^0 x)(1 - a^1 x) = 1 + (a + 1)x + a x^2$. It can be seen that $\Lambda'(x) = a + 1 + 2ax = a + 1$. One computes $\Gamma(x) = (S'_0 + S'_1 x)\Lambda(x) \bmod x^2$, and obtains $c_0 = \frac{\Gamma(1)}{a + 1}$, $c_1 = \frac{a\,\Gamma(a^{-1})}{a + 1}$.

Observe that one could also use a different matrix F or, equivalently, place the data into symbols $c_0, c_4, c_5, c_7, c_8$. In this case one would obtain $\Lambda(x) = (1 - a^1 x)(1 - a^2 x) = 1 + x + x^2$. In this case computing $\Gamma(x)$ would not require any multiplications, and the check symbols would be given by the expressions $c_1 = \frac{a\,\Gamma(a^{-1})}{\Lambda'(a^{-1})} = a(\Gamma_0 + \Gamma_1 a + \Gamma_1)$, $c_2 = \frac{a^2\,\Gamma(a^{-2})}{\Lambda'(a^{-2})} = a^2(\Gamma_0 + \Gamma_1 a)$. Observe that $(\Gamma_0 + \Gamma_1 a)$ can be computed once, and re-used in both of these expressions.
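As a cross-check of the systematic encoding example with data in $c_2, c_4, c_5, c_7, c_8$, the check symbols can also be found by exhaustive search: try every value of $c_0, c_1, c_3, c_6$ and keep the codeword whose pre-image has zeros at the frozen positions {0, 1, 2, 3}. This is not the fast method described above, only a sanity check. Assumptions: the two-layer layout of FIG. 1 with l = 3, the matrix B = F^{-1} of the example with F taken as its inverse, integer encoding 0, 1, 2, 3 of GF(4), and hypothetical function names.

```python
from itertools import product

# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3; addition is XOR.
LOG = {1: 0, 2: 1, 3: 2}
EXP = [1, 2, 3]

def gf4_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]

F = [[1, 3, 2], [1, 2, 3], [1, 1, 1]]   # assumed kernel (inverse of B)
B = [[1, 1, 1], [2, 3, 1], [3, 2, 1]]   # B = F^{-1} from the example

def vec_mat(v, M):
    """Row vector times 3 x 3 matrix over GF(4)."""
    out = [0, 0, 0]
    for i in range(3):
        for j in range(3):
            out[j] ^= gf4_mul(v[i], M[i][j])
    return out

def transform(x, M):
    """Two layers of 3-input nodes, each multiplying by M.
    transform(u, F) encodes; transform(c, B) inverts the encoding."""
    first = [vec_mat(x[3 * k:3 * k + 3], M) for k in range(3)]
    out = []
    for i in range(3):   # node i of the second layer takes output i of each first-layer node
        out += vec_mat([first[k][i] for k in range(3)], M)
    return out

FROZEN = (0, 1, 2, 3)
DATA_POS = (2, 4, 5, 7, 8)
CHECK_POS = (0, 1, 3, 6)

def systematic_encode(data):
    """Exhaustive search over the 4^4 possible check-symbol values."""
    solutions = []
    for checks in product(range(4), repeat=len(CHECK_POS)):
        c = [0] * 9
        for pos, val in zip(DATA_POS, data):
            c[pos] = val
        for pos, val in zip(CHECK_POS, checks):
            c[pos] = val
        u = transform(c, B)
        if all(u[i] == 0 for i in FROZEN):
            solutions.append(c)
    assert len(solutions) == 1, "check symbols should be uniquely determined"
    return solutions[0]

c = systematic_encode([1, 2, 3, 0, 1])
print(c)  # [3, 0, 1, 1, 2, 3, 3, 0, 1]
```

The data symbols appear unchanged at positions 2, 4, 5, 7, 8, and re-encoding the pre-image `transform(c, B)` with `transform(..., F)` returns the same codeword.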

Delayed encoding

FIGs. 3 and 4 illustrate embodiments of methods for delayed encoding and for updating encoded codewords. In FIGs. 3 and 4, it is assumed that check symbols are located in positions 3, 7, 11, 14, and 15.

FIG. 3 illustrates a method for delayed encoding. Positions 0, 1 and 2 comprise previously encoded data symbols $x_0$, $x_1$, and $x_2$. Position 3, indicated with reference number 300, comprises a parity symbol $p_0$ that is marked as unknown.

In a first processing step S10, a step of the encoding algorithm is performed, the parity symbol $p_0$ is computed, stored in position 3, and marked as known, as indicated with reference number 310.

In a second processing step S12, new data symbols $x_3$, $x_4$, and $x_5$ arrive. They are stored in positions 4, 5 and 6, and marked as known, as indicated with reference numbers 320, 322 and 324.

In a third processing step S14, a further step of the encoding algorithm is performed and the parity symbol $p_1$ is computed and stored in position 7, as indicated with reference number 330.

In a fourth processing step S16, new data symbols $x_6$, $x_7$, $x_8$, $x_9$ and $x_{10}$ arrive. They are stored in positions 8, 9, 10, 12 and 13, and marked as known, as indicated with reference numbers 340, 342, 344, 346 and 348.

In a fifth processing step S18, a further step of the encoding algorithm is performed and the parity symbols $p_2$ and $p_3$ are computed and stored in positions 11 and 14, as indicated with reference number 330.

In a sixth processing step S20, a further step of the encoding algorithm is performed and the global parity symbol $p_4$ is computed.

The above approach can be extended in order to implement partial update of the information symbols. This is illustrated in FIG. 4.
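The idea of delayed encoding, producing a check symbol as soon as the data it depends on has arrived, can be sketched on the smaller 9-symbol GF(4) example of this description rather than on the 16-position layout of FIG. 3. The sketch only handles the two local check symbols $c_3$ and $c_6$, each obtained from the frozen-input equation $0 = B_{00} c_{3i} + B_{10} c_{3i+1} + B_{20} c_{3i+2}$ with $B_{00} = 1$, $B_{10} = a$, $B_{20} = a^2$; the remaining check symbols need the full recursive algorithm. The class `DelayedEncoder` and its methods are hypothetical.

```python
# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3; addition is XOR.
LOG = {1: 0, 2: 1, 3: 2}
EXP = [1, 2, 3]

def gf4_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 3]

class DelayedEncoder:
    """Hypothetical helper tracking known/unknown symbols of one stripe."""

    # local check position -> data positions feeding the same node
    LOCAL_CHECKS = {3: (4, 5), 6: (7, 8)}

    def __init__(self):
        self.symbols = [None] * 9   # None means "unknown"

    def put(self, pos, value):
        """Store an arriving data symbol; return check positions that
        became computable as a result."""
        self.symbols[pos] = value
        return self._advance()

    def _advance(self):
        produced = []
        s = self.symbols
        for check, (p1, p2) in self.LOCAL_CHECKS.items():
            if s[check] is None and s[p1] is not None and s[p2] is not None:
                # frozen input: 0 = c_check + a * c_p1 + a^2 * c_p2
                s[check] = gf4_mul(2, s[p1]) ^ gf4_mul(3, s[p2])
                produced.append(check)
        return produced

enc = DelayedEncoder()
print(enc.put(4, 2))   # []   : c_5 has not arrived yet
print(enc.put(5, 3))   # [3]  : c_3 can now be computed
print(enc.symbols[3])  # 1
```

Until the whole stripe is filled, only these partial check symbols protect the data, which mirrors the trade-off described for short accumulation periods.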

In a first processing step S20 of the updating method, symbol $x_0$ is updated, i.e., it is marked as obsolete, indicated with reference number 410, and new symbol $x_0'$ is stored, indicated with reference number 412. A corresponding check symbol $p_0'$ is marked as unknown, indicated with reference number 414.

In a second processing step S21, the previous check symbol p₀ is marked as obsolete, indicated with reference number 420, and the new check symbol p₀' is marked as known, indicated with reference number 422.

In a third processing step S22, the check symbols are updated, i.e. they are stored instead of the old check symbols. This is indicated with reference numbers 430, 432, and 434.

To do this, one can store new values of information symbols in some other blocks on the same storage devices as old ones, and mark the blocks storing old and new values of information symbols as obsolete and known, respectively. Furthermore, one should allocate the blocks to store the updated values of check symbols, and mark them unknown. Then the above described encoding algorithm should be executed until all unknown blocks become known. Every time a new value of a check symbol is computed, the block storing the old value should be marked obsolete. After all new values of check symbols have been computed, the corresponding obsolete blocks should be released. Observe that if some devices fail before the above described update process completes, one can still recover the corresponding data by employing obsolete data blocks.

Implementing this approach requires one to maintain a directory, which stores the addresses of the actual and obsolete blocks corresponding to a stripe, as well as their status (known/unknown/obsolete).
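The bookkeeping described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the invention: the names `Status`, `Block`, `StripeDirectory` and all of its methods are hypothetical, block addresses are plain counters, and the new check value is supplied by the caller rather than computed by the encoding algorithm.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    KNOWN = "known"
    UNKNOWN = "unknown"
    OBSOLETE = "obsolete"


@dataclass
class Block:
    address: int                 # hypothetical block address on a device
    status: Status
    value: Optional[int] = None  # None while the block is still unknown


class StripeDirectory:
    """Hypothetical per-stripe directory: symbol name -> its blocks."""

    def __init__(self) -> None:
        self.blocks: Dict[str, List[Block]] = {}
        self._next_address = 0

    def _allocate(self, symbol: str, status: Status,
                  value: Optional[int] = None) -> Block:
        block = Block(self._next_address, status, value)
        self._next_address += 1
        self.blocks.setdefault(symbol, []).append(block)
        return block

    def _find(self, symbol: str, status: Status) -> Optional[Block]:
        for block in self.blocks.get(symbol, []):
            if block.status is status:
                return block
        return None

    def store(self, symbol: str, value: int) -> None:
        """Initial write of a symbol whose value is already known."""
        self._allocate(symbol, Status.KNOWN, value)

    def update_information(self, symbol: str, new_value: int,
                           check_symbols: List[str]) -> None:
        """Old value becomes obsolete, new value known, checks unknown."""
        old = self._find(symbol, Status.KNOWN)
        if old is not None:
            old.status = Status.OBSOLETE
        self._allocate(symbol, Status.KNOWN, new_value)
        for check in check_symbols:
            self._allocate(check, Status.UNKNOWN)

    def complete_check(self, symbol: str, value: int) -> None:
        """A new check value was computed: mark it known, old one obsolete."""
        old = self._find(symbol, Status.KNOWN)
        new = self._find(symbol, Status.UNKNOWN)
        if new is None:
            raise KeyError(f"no unknown block allocated for {symbol}")
        new.value, new.status = value, Status.KNOWN
        if old is not None:
            old.status = Status.OBSOLETE

    def release_obsolete(self) -> bool:
        """Release obsolete blocks once no unknown blocks remain."""
        all_blocks = [b for bs in self.blocks.values() for b in bs]
        if any(b.status is Status.UNKNOWN for b in all_blocks):
            return False  # update still in progress, keep old data
        for symbol, bs in self.blocks.items():
            self.blocks[symbol] = [b for b in bs
                                   if b.status is not Status.OBSOLETE]
        return True

    def current(self, symbol: str) -> Optional[int]:
        block = self._find(symbol, Status.KNOWN)
        return None if block is None else block.value
```

In this sketch the obsolete blocks are kept until every unknown block has become known, which mirrors the observation that the old data remain usable for recovery if a device fails in the middle of an update.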

To summarize, the invention provides a method for encoding the data in a storage system with a polar code, which includes a method for finding parameters of the polar code, a systematic encoding algorithm, a delayed encoding method, and a method for partial updating of the encoded data.

Embodiments of the present invention employ polar codes for encoding the data in a distributed storage system, and provide a method for their construction which enables local data recovery and protection against a given number of block, disk and server failures, as well as a fast algorithm for their encoding and erasure decoding. Furthermore, the invention presents techniques which enable the data to be written to the system in small blocks, and provide an efficient implementation of the partial update operation. This can provide one or more of the following advantages compared to existing approaches such as the Reed-Solomon code used in HDFS-RAID:

1. The ability to recover the data stored on failed disks within a server locally without accessing any other servers, reducing thus the network traffic.

2. The ability to recover the data on failed servers by contacting at most l_i surviving servers, avoiding thus costly network data transfers during the rebuild phase. HDFS-RAID does not provide this feature at all.

3. The ability to perform delayed encoding, i.e. to perform encoding of small chunks of data as soon as they arrive, avoiding thus bufferization. Observe that if bufferization is used, the data may be lost if the system crashes before it is written to disks. Therefore, the proposed approach reduces the probability of data loss due to such failures.

4. Furthermore, one can balance over the time the computational load and improve thus the overall system responsiveness by delaying computation of some of the check blocks until system load becomes sufficiently low.

5. The ability to perform efficient encoding and updating of small chunks of data enables one to employ long codes in a storage system. This in turn allows one to increase the payload capacity of the system for a fixed data loss probability.

6. The complexity of the proposed systematic encoding algorithm for polar codes is given by O(n log n), where n is the code length. For the case of pyramid, parity splitting and EvenOdd/RDP codes, only encoding algorithms with complexity O(n²) have been published up to now. High encoding complexity limits practical application of these codes to those with small n.

7. The proposed method enables one to construct codes over field GF(q), q ≥ max l_i, while the constructions based on parity splitting and pyramid codes require field size q ≥ n. This results in reduced complexity of arithmetic operations.

8. The proposed fast algorithm for systematic encoding of data with a polar code with Reed-Solomon kernel is also applicable in telecommunication systems employing the corresponding polar codes.

These advantages make it possible to construct large-scale distributed fault tolerant storage systems.
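The complexity figures in item 6 can be illustrated by comparing a dense multiplication by a Kronecker product of small kernels with a layer-by-layer evaluation of the same product. The following is a minimal sketch under stated assumptions, and it is not the systematic encoding algorithm of the invention: it works over GF(2) (i.e. µ = 1), takes the permutation matrix B to be the identity, and uses the hypothetical function names `kron`, `dense_encode` and `fast_encode`.

```python
from functools import reduce


def kron(a, b):
    """Dense Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def dense_encode(u, kernels):
    """Reference: c = u (F_0 x F_1 x ... x F_m-1) over GF(2), about n^2 operations."""
    a = reduce(kron, kernels)
    n = len(u)
    return [sum(u[i] * a[i][j] for i in range(n)) % 2 for j in range(n)]


def fast_encode(u, kernels):
    """Same product, one layer per kernel: about n * sum(l_i) operations."""
    c = list(u)
    n = len(c)
    stride = n
    for f in kernels:
        l = len(f)
        stride //= l  # distance between symbols meeting in one l x l node
        for block in range(0, n, stride * l):
            for off in range(stride):
                idx = [block + off + t * stride for t in range(l)]
                vals = [c[i] for i in idx]
                # Row vector times kernel: out[j] = sum_t vals[t] * f[t][j]
                for j, pos in enumerate(idx):
                    c[pos] = sum(vals[t] * f[t][j] for t in range(l)) % 2
    return c
```

Each layer touches every symbol once and performs l_i multiply-adds per symbol, so for kernels of bounded size the total grows as n log n, whereas the dense product grows as n².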

The foregoing descriptions are only implementation manners of the present invention; the protection scope of the present invention is not limited thereto. Variations or replacements can readily be made by a person skilled in the art. Therefore, the protection scope of the present invention should be subject to the protection scope of the attached claims.

Claims

1. A method for encoding input data (122) in a codeword (124), wherein the codeword is obtained as a product of a vector u' and a matrix A, the vector u' comprises symbols u'_i = 0 for i ∈ ℱ, the remaining positions i ∉ ℱ comprise the input data and the matrix A is representable as A = B F₀ ⊗ F₁ ⊗ ··· ⊗ F_m-1, wherein B is a permutation matrix and F₀, F₁, ..., F_m-1 are l_i x l_i matrices over GF(2^µ) not being permutation-equivalent to diagonal matrices, wherein

the set 7 comprises integers s wi

the set 7 comprises integers s =

< l_it such that o^⁵; + 1) < d, or

the set 7 comprises integers s = i Π™ο² l_j> 0 < i < l_m-_\, 0≤ s < p T\™o l_j.

2. The method of claim 1, wherein the method is a method for recovering erased data and wherein the method is based on an encoding scheme based on nodes F_i (102-112, 202-210) that correspond to the matrices F_i and the method comprises:

marking one or more symbols u_i with i ∈ ℱ as known,

if a node F_i has t known input symbols and t unknown output symbols, mark the remaining output symbols as unknown, unless they are marked known,

repeating until all symbols of the codeword are marked known:

if a node has t known input symbols, t unknown and l − t known output symbols, recover unknown output symbols by local decoding at the node and mark all output symbols as known, and

if all output symbols of a node are known, compute unknown input symbols and mark them known.

3. The method of claim 1, wherein the method is a method for systematic encoding and obtaining the codeword comprises initial steps of:

selecting data positions of the codeword as p' = Σ^ο¹ P_m-_i-_i Π;=ο (_/'» ^wnere

placing the input data into the data positions of the codeword and marking remaining positions of the codeword as erased.

4. The method of claim 3, further comprising a step of recovering the symbols in the remaining codeword positions using the method of claim 2.

5. The method of claim 2, wherein the local decoding comprises:

solving equations Sj =∑;=o CjBji , 0≤ i < t, where B = F^_1, for unknown output symbols c_Sj, 0 < j < t, using known input symbols S and

as soon as all output symbols become known, computing unknown input symbols =∑^l }_Q BijCj.

6. The method of claim 5, wherein the local decoding comprises: computing unknown output symbols as c_s . =— , ,

^ ^asj )

where Γ(χ)≡ S(x)A(x) mod x^l, and

as soon as all output symbols are known, computing unknown input symbols

$i Bij^cj-

7. The method of one of the previous claims, wherein the matrices F₀, F₁, ..., F_m-1 are selected as F_s = (f_ij), f_ij = α_j^(l_s-1-i), wherein the α_j are arranged into a sequence of cyclotomic cosets.

8. The method of one of claims 2 to 7, wherein a Fast Fourier Transform, FFT, is utilized for determining unknown values from known values.

9. The method of one of the previous claims, further comprising a step of partially encoding data with a length that is less than code dimension k = Π^ο^{1 -} 1^1·

10. The method of one of the previous claims, further comprising a step of storing an i-th element of the codeword on an i-th node, wherein the node is a server or a disk within a server.

11. A method for updating a codeword encoded according to the method of one of the previous claims, the method comprising:

marking a symbol of the codeword to be updated as obsolete,

storing a new value of the symbol to be updated,

marking one or more check symbols of the codeword as obsolete,

allocating one or more new check symbols of the codeword and marking them as unknown,

recursively determining values of one or more symbols marked as unknown, wherein when a symbol is marked known, its previous value is marked obsolete and when all symbols of a codeword are either known or obsolete, one or more positions marked as obsolete are erased.

12. Storage controller, configured to carry out the method of one of the previous claims.

13. A computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of one of claims 1 to 10.