Methods for data recovery of a distributed storage system and storage medium thereof
11500725 · 2022-11-15
Assignee
Inventors
Cpc classification
G06F11/3034
PHYSICS
G06F17/16
PHYSICS
H03M13/373
ELECTRICITY
H03M13/616
ELECTRICITY
G06F3/0619
PHYSICS
G06F3/067
PHYSICS
International classification
G06F11/10
PHYSICS
G06F11/07
PHYSICS
H03M13/37
ELECTRICITY
G06F17/16
PHYSICS
Abstract
A method of data recovery for a distributed storage system recovers multiple failed nodes concurrently with the minimum feasible bandwidth when failed nodes exist in a distributed storage system. Assistant nodes are selected, helper data sub-blocks are obtained through computation at the selected assistant nodes, a repair matrix is computed, and the missing data blocks are reconstructed by multiplying the repair matrix with the helper data sub-blocks; alternatively, the missing data blocks are reconstructed by decoding. The method is applicable to data recovery in the case of any number of failed nodes and any reasonable combination of coding parameters. The data recovery herein can reach the theoretical lower limit of the minimum recovery bandwidth.
Claims
1. A method of data recovery for a distributed storage system having totally n storage nodes each configured to store data, the system coding data to be stored with minimum storage regenerating code via product matrix construction C(n′,k′,d′), where n′=n+δ, k′=k+δ, d′=d+δ, δ denotes number of virtual nodes which are not real storage nodes, but theoretical storage nodes for computation, k indicates that data is equally divided into k segments, each segment contains α data sub-blocks, d is repair degree, C is the resulting n′×α coded matrix and each entry thereof is a data sub-block after encoding, first δ rows of C are zero, rows δ+1 to δ+k are source data and last m=n′−k′ rows are coded parity data, such that only n=n′−δ data blocks of the resulting n′ coded data blocks are needed to be stored, the method comprising following steps: Step 1: selecting assistant nodes from surviving nodes, wherein the surviving nodes are non-failed storage nodes, and the assistant nodes are storage nodes, selected from the surviving nodes, which provide helper data for data recovery, calculating helper data sub-blocks by using the assistant nodes, and sending the helper data sub-blocks to a regenerating node which is a storage node for reconstructing missing data of multiple failed storage nodes; letting {N.sub.i|1≤i≤n′} denote a set of both virtual nodes and real nodes, the real nodes being real storage nodes, where N.sub.1˜N.sub.δ are virtual nodes and N.sub.δ+1˜N.sub.n′ are real nodes, number of the surviving nodes is n′−t, t is number of failed nodes which are the storage nodes missing the data and t≥1, defining X={x.sub.i|i=1, . . . ,t, δ<x.sub.i≤n′} as a loss list, letting c.sub.x.sub.
Ω.sub.i,j=−θ.sub.x.sub.
Θ.sub.i=[θ.sub.x.sub.
2. The method according to claim 1, wherein the repair matrix works for t<min{k, α}, and decoding is used for recovering missing data for t≥min{k, α}: choosing k nodes as assistant nodes randomly from surviving nodes, downloading k×α data sub-blocks from the assistant nodes, then decoding the data sub-blocks to obtain source data.
3. The method according to claim 1, wherein Ξ.sub.X is an invertible matrix, or Ξ.sub.X becomes invertible by adding a node to the loss list, or Ξ.sub.X becomes invertible by replacing one or some nodes in the loss list with other nodes; otherwise decoding to reconstruct missing data.
4. The method according to claim 1, wherein the method includes centralized data recovery, under which number of the assistant nodes is chosen according to number of the failed nodes to be recovered, and collection of the helper data sub-blocks, computation of the repair matrix and reconstruction of the missing data are implemented by a central agent.
5. The method according to claim 1, wherein the method includes distributed data recovery, under which number of the assistant nodes is chosen according to number of the failed nodes to be recovered, each new node reconstructs data blocks stored on the failed node that it substitutes, and collection of the helper data sub-blocks, computation of the repair matrix and reconstruction of the missing data are implemented by each corresponding new node.
6. The method according to claim 5, wherein data blocks reconstructed by the new node further comprise data blocks needed by other new nodes.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(8) The present disclosure will be further described in detail below through specific embodiments in combination with the accompanying drawings. Many details described in the following embodiments are for the purpose of better understanding the present disclosure. However, a person skilled in the art will realize that some of these features can be omitted in different cases or be replaced by other methods. For clarity, some operations related to the present disclosure are not shown or illustrated herein, so as to prevent the core of the present disclosure from being obscured by excessive description. For the person skilled in the art, such operations need not be explained in detail; they can fully understand the related operations according to the description in the specification and the general technical knowledge in the field.
(9) In addition, the features, operations or characteristics described in the specification may be combined in any suitable manner to form various embodiments. At the same time, the steps or actions in the described method can also be reordered or adjusted in a manner apparent to those skilled in the art. Therefore, the various sequences in the specification and the drawings are only for the purpose of describing particular embodiments, and are not intended to imply a required order, unless it is otherwise stated that a certain sequence must be followed.
(10) In an embodiment of the present disclosure, a method of data recovery for a distributed storage system with MSR code constructed via PM (PM MSR) is proposed. The basic idea of this method is to construct a repair matrix that is applicable to any number of failed nodes and any valid combination of coding parameters, so that the lost data can be obtained by simply multiplying the helper data with the repair matrix.
(11) Separate data recovery algorithms for PM MSR codes are further proposed for centralized and distributed modes in the embodiments of the present disclosure.
(12) For convenience, the notations that are used throughout this disclosure and their definitions will first be presented. A minimum storage regenerating code via product matrix construction (PM MSR) is denoted by C(n,k,d), where n is the total number of nodes, k is the number of “systematic nodes” on which the uncoded original data blocks are stored; thus m=n−k is the number of “parity nodes” which hold the encoded data blocks. d is the “repair degree”, which is defined to be the number of assistant nodes required for the recovery when there is only one failed node.
(13) Assume that the total amount of data to be stored is B. Per the requirement of the PM MSR code, data of amount B is divided into k segments of the same size, and then encoded into n data blocks denoted by b.sub.1, b.sub.2, . . . , b.sub.n with block ID 1, 2, . . . , n respectively. Each data block b.sub.i consists of α sub-blocks s.sub.i1, s.sub.i2, . . . , s.sub.iα. According to the relevant theory of MSR codes, there are two equations as follows:
α=d−k+1 (1)
and
B=kα=k(d−k+1) (2)
(14) PM MSR code requires that d≥2k−2, thus n≥d+1≥2k−1 and α≥k−1.
(15) For this general case, the data amount B will be encoded into n′ data blocks based on the extended PM MSR code C(n′,k′,d′), where n′=n+δ, k′=k+δ and d′=d+δ. δ denotes the number of virtual nodes added by the extension, and it is chosen to make d′=2k′−2; then:
δ=d−(2k−2) (3)
and
d′=2(k+δ)−2=2(d−k+1)=2α (4)
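The relations (1) to (4) above can be checked with a short sketch; the function name and the example parameters below are illustrative assumptions, not part of the disclosure.

```python
# Illustrative (hypothetical) helper deriving the extended PM MSR parameters
# of equations (1)-(4) from the base code parameters (n, k, d).
def pm_msr_parameters(n, k, d):
    """Return the derived PM MSR parameters as a dict."""
    if d < 2 * k - 2:
        raise ValueError("PM MSR requires d >= 2k - 2")
    alpha = d - k + 1                  # sub-blocks per node, eq. (1)
    B = k * alpha                      # total source sub-blocks, eq. (2)
    delta = d - (2 * k - 2)            # number of virtual nodes, eq. (3)
    n_p, k_p, d_p = n + delta, k + delta, d + delta
    assert d_p == 2 * k_p - 2 == 2 * alpha   # eq. (4): d' = 2k' - 2 = 2*alpha
    return {"alpha": alpha, "B": B, "delta": delta,
            "n'": n_p, "k'": k_p, "d'": d_p}

# Example: C(6, 2, 4) gives alpha = 3, delta = 2, and the extended C(8, 4, 6)
params = pm_msr_parameters(6, 2, 4)
```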
(16) PM MSR code is constructed through the following formula:
C=ΨM (5)
(17) where M is the d′×α message matrix with
(18) M=[S.sub.1 S.sub.2].sup.T (6), that is, the α×α blocks S.sub.1 and S.sub.2 stacked vertically into the d′×α matrix M,
(19) where S.sub.1 and S.sub.2 are α×α symmetric matrices constructed such that the (α+1)α/2 entries in the upper-triangle of each matrix are filled by distinct data sub-blocks. Ψ=[ΦΛΦ] is the n′×d′ encoding matrix, and the ith row of Φ is denoted by ϕ.sub.i. Λ is an n′×n′ diagonal matrix as
(20) Λ=diag(λ.sub.1,λ.sub.2, . . . ,λ.sub.n′) (7)
(21) C is the resulted n′×α coded matrix, each entry thereof is a data sub-block after encoded, and the ith row of C is:
c.sub.i=φ.sub.iM=[ϕ.sub.iλ.sub.iϕ.sub.i]M=ϕ.sub.iS.sub.1+λ.sub.iϕ.sub.iS.sub.2 (8)
(22) where φ.sub.i=[ϕ.sub.iλ.sub.iϕ.sub.i] is the ith row of Ψ. M is encoded in such a way that the resulting C is systematic; that is, the first δ rows of C are all zero, and rows δ+1 to δ+k are exactly the source data, while the last m=n′−k′ rows are coded parity data. The advantage of this is that the source data can be read without decoding when there is no failed systematic node. It should be noted that all the matrices and data mentioned in this embodiment are represented by entries of the finite field GF(q).
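As a minimal sketch of the product C=ΨM in the formula (5) only: the code below multiplies an encoding matrix Ψ=[Φ ΛΦ] with a stacked message matrix over a prime field GF(p). The Vandermonde choice of Φ, the values λ.sub.i=i.sup.α, and all names are assumptions for illustration; the systematic arrangement of the disclosure is not reproduced here.

```python
# Sketch of C = Psi * M (formula (5)) over GF(p); illustrative only.
p = 257                                    # prime modulus for GF(p)

def matmul(A, B):
    """Multiply two matrices over GF(p)."""
    return [[sum(a * b for a, b in zip(row, col)) % p
             for col in zip(*B)] for row in A]

def encode(n_p, alpha, S1, S2):
    # Phi: n' x alpha Vandermonde block, row i = [i^0, i^1, ..., i^(alpha-1)]
    Phi = [[pow(i, j, p) for j in range(alpha)] for i in range(1, n_p + 1)]
    lam = [pow(i, alpha, p) for i in range(1, n_p + 1)]  # assumed lambda_i
    # Psi = [Phi | Lambda*Phi]: n' x d' encoding matrix, d' = 2*alpha
    Psi = [Phi[i] + [(lam[i] * v) % p for v in Phi[i]] for i in range(n_p)]
    M = S1 + S2                            # S1 stacked on S2: d' x alpha
    return matmul(Psi, M)                  # C: n' x alpha coded matrix

# Example with alpha = 2, n' = 5: S1, S2 are symmetric 2x2 sub-block matrices
S1 = [[1, 2], [2, 3]]
S2 = [[4, 5], [5, 6]]
C = encode(5, 2, S1, S2)
```

Row i of the result equals ϕ.sub.iS.sub.1+λ.sub.iϕ.sub.iS.sub.2, matching the formula (8).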
(23) Before recovery, when a node fails, δ virtual nodes on which the data are all zero are first added to the system. Let {N.sub.i|1≤i≤n′} denote the set of both virtual nodes and real nodes, where N.sub.1˜N.sub.δ are virtual nodes and N.sub.δ+1˜N.sub.n′ are real nodes. Hence it can be viewed that {c.sub.i|1≤i≤n′} are held by n′ nodes. Note that the “node” used herein can be either physical or logical; say, multiple logical nodes can reside on one or more physical machines, which does not affect the effectiveness of the present disclosure. In addition, the virtual nodes mentioned herein are not real; they are conceived for theoretical reasoning, reflecting the thinking process of the inventor and facilitating understanding by those skilled in the art.
(24) The number of failed nodes is denoted by t≥1. The set of the indices of all missing blocks X={x.sub.i|i=1, . . . , t, δ<x.sub.i≤n′} is defined as a loss list. The missing blocks are denoted by c.sub.x.sub.i, 1≤i≤t. To repair the t failed nodes concurrently, d′−t+1 surviving nodes are selected as assistant nodes, the set of which is denoted by N.sub.a, and the extended set is defined as
N.sub.e={N.sub.j|N.sub.j∈N.sub.a or j∈X} (9)
(25) It can be seen that there are d′+1 members in the union set N.sub.e.
(26) During recovery, each assistant node N.sub.j∈N.sub.a first calculates the inner products between its data block c.sub.j (regarded as a vector composed of sub-blocks [c.sub.j1, c.sub.j2, . . . , c.sub.jα]) and {ϕ.sub.x.sub.i|x.sub.i∈X}:
{h.sub.x.sub.i.sub.,j=c.sub.jϕ.sub.x.sub.i.sup.T|x.sub.i∈X} (10)
(27) Then the encoded sub-blocks are sent to the regenerating node as helper data. Thus the regenerating node will get t(d′−t+1) helper sub-blocks in total. Note that the premise of selecting d′−t+1 nodes as assistant nodes from the surviving nodes mentioned above is to concurrently repair t failed nodes; however, the failed nodes to be repaired can also be selected according to actual conditions, in which case the number of the selected assistant nodes will change, which will be further described below when discussing the centralized and distributed recovery modes.
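The per-assistant computation described above can be sketched as follows; the function name, field modulus and example vectors are illustrative assumptions, not taken from the disclosure.

```python
# Sketch of the inner-product helper computation of paragraph (26):
# assistant node j computes one sub-block per failed node x_i, namely the
# inner product of its data block c_j with the encoding row phi_{x_i}.
p = 257                                 # prime modulus for GF(p), assumed

def helper_subblocks(c_j, phis_of_failed):
    """Return one helper sub-block per failed node from assistant node j."""
    return [sum(s * f for s, f in zip(c_j, phi)) % p
            for phi in phis_of_failed]

# Example: alpha = 3, two failed nodes, one assistant data block c_j
c_j = [7, 8, 9]
phis = [[1, 1, 1], [1, 2, 4]]           # assumed rows phi_{x_1}, phi_{x_2}
h = helper_subblocks(c_j, phis)         # two helper sub-blocks for this node
# With t failed nodes and d' - t + 1 assistants, the regenerating node
# receives t * (d' - t + 1) helper sub-blocks in total.
```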
(28) The regenerating node recovers the missing data blocks through “repair matrix” which is obtained as follows.
(29) First, for each x.sub.i, x.sub.j∈X, j≠i, according to the following formula (11), calculating a matrix
Ω.sub.i,j=−θ.sub.x.sub.
where
θ.sub.x.sub.
and
Ψ.sub.x.sub.
(30) that is, Ψ.sub.x.sub.
(31) Next, let X.sub.i=X\x.sub.i be the ordered set of the remaining indices of the loss list X without x.sub.i, and G.sub.i={g.sub.j.sup.i|j≠i} be the ordered set of indices of columns in Ψ.sub.x.sub.
Θ.sub.i=[θ.sub.x.sub.
(32) where the computation of θ.sub.x.sub.
(33) Combining the computation mentioned above, we have
(34)
(35) where {right arrow over (h)}.sub.x.sub.i is a vector which consists of the coded sub-blocks calculated by each assistant node through making the inner product between the data block it holds and ϕ.sub.x.sub.i.
(36) For each x.sub.i, the size of the vector {right arrow over (h)}.sub.x.sub.
(37) Let
(38)
(39) If the matrix Ξ.sub.X is invertible, the repair matrix can be obtained by the following formula:
(40) R.sub.X=Ξ.sub.X.sup.−1diag(Θ.sub.1′,Θ.sub.2′, . . . ,Θ.sub.t′) (18)
(41) After the repair matrix is computed, the missing data blocks can be reconstructed by left multiplying the vector [{right arrow over (h)}′.sub.x.sub.
(42) The repair matrix in the formula (18) works for any t<min{k,α}. For t≥min{k,α}, decoding can be used for recovery, that is, choosing k nodes as the assistant nodes randomly from the surviving nodes, downloading k×α data sub-blocks from all the k assistant nodes, then decoding these data sub-blocks to get the source data and finally encoding the source data into the lost data blocks. But if t>m, the lost data are not recoverable, since the number of surviving nodes is then n−t<k, derived from m=n−k.
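The case analysis of the preceding paragraph can be summarized in a small dispatcher; the function and the returned labels are hypothetical names for illustration only.

```python
# Hypothetical summary of paragraph (42): repair matrix for t < min{k, alpha},
# decoding for min{k, alpha} <= t <= m, and no recovery possible for t > m.
def recovery_strategy(n, k, d, t):
    alpha = d - k + 1                # eq. (1)
    m = n - k                        # number of parity nodes
    if t > m:
        return "unrecoverable"       # fewer than k surviving nodes remain
    if t < min(k, alpha):
        return "repair-matrix"       # minimum-bandwidth concurrent repair
    return "decode"                  # download k*alpha sub-blocks and decode
```

For example, with C(6,2,4) a single failure uses the repair matrix, two to four failures fall back to decoding, and five failures are unrecoverable.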
(43) Besides, if the matrix Ξ.sub.X is not invertible, the repair matrix cannot be calculated through the formula (18). Several solutions can handle this situation, including but not limited to: 1) adding one or several nodes to X to make Ξ.sub.X invertible; 2) replacing one or some nodes in X with other nodes to make Ξ.sub.X invertible; and 3) decoding to implement data reconstruction. Since the probability of such a situation is rather small, any of these solutions has only a marginal effect on the overall performance. When adopting solution 1) and/or 2), the actually repaired data blocks include not only the real lost data blocks, but also the data blocks on the newly added or replacing nodes.
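The invertibility check that gates solutions 1) and 2) can be performed generically over a prime field; the sketch below is an assumed standalone test by Gaussian elimination, and the construction of Ξ.sub.X itself is out of its scope.

```python
# Generic invertibility test over GF(p) by Gaussian elimination (sketch).
p = 257                                   # prime modulus, assumed

def is_invertible(A):
    """Check whether a square matrix is invertible over GF(p)."""
    A = [row[:] for row in A]             # work on a copy
    n = len(A)
    for col in range(n):
        # find a non-zero pivot at or below the diagonal
        pivot = next((r for r in range(col, n) if A[r][col] % p), None)
        if pivot is None:
            return False                  # singular: no pivot in this column
        A[col], A[pivot] = A[pivot], A[col]
        inv = pow(A[col][col], p - 2, p)  # Fermat inverse in GF(p)
        for r in range(col + 1, n):
            f = (A[r][col] * inv) % p
            A[r] = [(a - f * b) % p for a, b in zip(A[r], A[col])]
    return True
```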
(44) The computation of the repair matrix is summarized as step 102 to step 106 shown in
(45) Step 1: for each x.sub.i∈X, computing Ψ.sub.x.sub.
(46) Step 2: for each x.sub.i, x.sub.j∈X, j≠i, computing Ω.sub.i,j according to
(47) Step 3: based on the result in Step 2, constructing Ξ.sub.X according to the formula (17);
(48) Step 4: for each x.sub.i∈X, computing Θ.sub.i′ according to
(49) Step 5: if Ξ.sub.X is invertible, left multiplying the general diagonal matrix resulted from Step 4 with Ξ.sub.X.sup.−1 to get the repair matrix R.sub.X according to the formula (18).
(50) As shown in
(51) The method of concurrently recovering multiple distributedly stored data blocks will be explained in detail for the centralized and distributed modes of data recovery.
(52) The concurrent recovery scheme for PM MSR regenerating codes that can jointly regenerate data blocks in the centralized mode is presented in
(53) (1) if t≥min{k,α}:
(54) Step 611: selecting k nodes as the assistant nodes randomly from the surviving nodes, the central agent sending a request to the assistant nodes asking them to offer their stored data blocks to the central agent;
(55) Step 612: the central agent waiting until receiving all the k×α helper data sub-blocks;
(56) Step 613: the central agent decoding the received data sub-blocks to get the source data; and
(57) Step 614: the central agent reconstructing the missing data blocks through encoding the source data and sending the reconstructed data blocks to t new nodes.
(58) (2) if t<min{k,α}:
(59) Step 621: computing Ξ.sub.X according to Step 102 to Step 104 in
(60) Step 622: going to one of the following three operations a) to c) when Ξ.sub.X is not invertible, otherwise performing next step;
(61) a) returning to Step 611 and reconstructing data by decoding;
(62) b) adding a node to X and recalculating Ξ.sub.X, then going to Step 623 when it is invertible or else performing a), b) or c);
(63) c) replacing a node in X with another node outside X and calculating Ξ.sub.X again, then going to Step 623 when it is invertible or else performing a), b) or c);
(64) Let the number of entries in X be z. When all possible combinations with z<min{k,α} in performing b) and/or c) have been gone through, it is necessary to perform operation a).
(65) Step 623: selecting d−t+1 nodes from the surviving nodes as the assistant nodes according to step 101 in
(66) Step 624: the central agent waiting until receiving all t(d−t+1) helper data sub-blocks;
(67) Step 625: computing repair matrix R.sub.X according to step 106 in
(68) Step 626: rearranging the received helper data sub-blocks according to the formula (16) so that it is corresponded to the general diagonal matrix in the formula (18); and
(69) Step 627: regenerating the missing data blocks by left multiplying the vector composed of the helper data sub-blocks re-ordered in step 626 with the repair matrix R.sub.X as shown in step 107 in
(70) For the distributed recovery mode, each new node only needs to regenerate the data blocks stored on the failed node it substitutes. If t≤n−d, a new node can choose d surviving nodes as the assistant nodes and regenerate its own missing data blocks by obtaining one coded helper data sub-block from each corresponding assistant node; thus the recovery bandwidth is td sub-blocks. If t>n−d, it is impossible for a single new node to regenerate only its own missing data blocks because there are not enough assistant nodes; in this situation, each new node has to regenerate at least t−(n−d)+1 missing data blocks concurrently with the aid of the n−t assistant nodes.
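The bandwidth counts of the preceding paragraph can be tallied in a short sketch; the function name is an assumption, and the collaborative-repair savings discussed later in paragraph (97) are deliberately ignored.

```python
# Illustrative tally of the distributed-mode recovery bandwidth, in sub-blocks.
def distributed_bandwidth(n, d, t):
    if t <= n - d:
        # each of the t new nodes downloads one sub-block from d assistants
        return t * d
    # otherwise each new node repairs u + 1 = t - (n - d) + 1 blocks,
    # receiving u + 1 sub-blocks from each of the n - t assistant nodes
    u = t - n + d
    return t * (n - t) * (u + 1)       # total over all t new nodes
```

For instance, with n=10 and d=6, three failures cost 18 sub-blocks, while five failures cost 50 (before any collaborative sharing between new nodes).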
(71) The concurrent failure recovery scheme for PM MSR regenerating codes that can jointly regenerate missing data blocks in the distributed mode is presented in
(72) As shown in
(73) (1) if t≥n−d−1+min{k,α}:
(74) Step 711: selecting k surviving nodes as the assistant nodes, the new node sending a request to the assistant nodes to inform them to offer their stored data blocks to the new node;
(75) Step 712: the new node waiting until receiving all k×α data sub-blocks;
(76) Step 713: decoding the received helper data sub-blocks to obtain source data; and
(77) Step 714: regenerating the missing data blocks through encoding, and returning.
(78) (2) If t≤n−d:
(79) Step 721: selecting d surviving nodes as the assistant nodes, the new node sending a request to the assistant nodes to inform the assistant nodes to calculate according to the formula (10) respectively and offer one helper data sub-block;
(80) Step 722: the new node waiting until receiving all d helper data sub-blocks;
(81) Step 723: computing repair matrix R.sub.{x.sub.
(82) Step 724: re-arranging the received helper data sub-blocks according to formula (16) so that it is corresponded to the general diagonal matrix in the formula (18); and
(83) Step 725: regenerating the missing data blocks by left multiplying the vectors formed by the re-arranged helper data sub-blocks in Step 724 with the repair matrix R.sub.{x.sub.
(84) (3) other situation:
(85) Step 731: selecting another u=t−n+d missing data blocks c.sub.y.sub.
(86) Step 732: calculating Ξ.sub.X according to the steps shown in
(87) Step 733: going to one of the following three operations a) to c) when Ξ.sub.X is not invertible, otherwise performing next step;
(88) a) returning to Step 711 and reconstructing data by decoding;
(89) b) adding a node to X and recalculating Ξ.sub.X, then going to step 734 when Ξ.sub.X is invertible or else performing a), b) or c);
(90) c) replacing a node in X with another node outside X, recalculating Ξ.sub.X and going to Step 734 when it is invertible or else performing a), b) or c);
(91) Let the number of entries in X be z. When all possible combinations with z<min{k,α} in performing b) and/or c) have been gone through, it is necessary to perform operation a).
(92) Step 734: selecting n−t surviving nodes as the assistant nodes, the new node sending a request to the assistant nodes to inform each assistant node to calculate according to the formula (10) and offer u+1 helper data sub-blocks;
(93) Step 735: the new node waiting until receiving all the (n−t)(u+1) helper data sub-blocks;
(94) Step 736: calculating the repair matrix R.sub.X according to the step 106 in
(95) Step 737: re-arranging the received helper data sub-blocks according to the formula (16) so that it is corresponded to the general diagonal matrix in the formula (18); and
(96) Step 738: regenerating the missing data blocks by left multiplying the vectors formed by the re-arranged helper data sub-blocks in Step 737 with the repair matrix R.sub.X, and going back.
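The three distributed-mode cases above can be mirrored by a small dispatcher; the function and its returned labels are hypothetical names for illustration.

```python
# Hypothetical dispatcher for the distributed-mode case analysis:
# decoding (steps 711-714), single-block repair (steps 721-725),
# or joint repair of u + 1 blocks per new node (steps 731-738).
def distributed_case(n, k, d, t):
    alpha = d - k + 1                    # eq. (1)
    if t >= n - d - 1 + min(k, alpha):
        return "decode"                  # steps 711-714
    if t <= n - d:
        return "single-block repair"     # steps 721-725
    return "joint repair"                # steps 731-738, u = t - n + d
```

For example, with n=10, k=4, d=6 (so α=3): three failures use single-block repair, five failures require joint repair, and six failures fall back to decoding.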
(97) Note that when applying the above-mentioned algorithm, if t>n−d, a new node may not only reconstruct the data blocks stored on the failed node that it substitutes, but may also reconstruct the data blocks required by other substitute nodes at the same time. In this case, the substitute node can choose to send the additionally reconstructed data blocks to the other substitute nodes that need them, so as to prevent those substitute nodes from rebuilding the data blocks by themselves. This can further reduce the recovery bandwidth and computing overhead, but requires coordination and cooperation between nodes, and may increase the repair time; it falls within the scope of collaborative repair. In practical applications, trade-offs should be made based on system performance needs.
(98) The principle and implementation manners of the present disclosure have been described above with reference to specific embodiments, which are merely provided for the purpose of understanding the present disclosure and are not intended to limit it. It will be possible for those skilled in the art to make variations based on the principle of the present disclosure.