Methods for data recovery of a distributed storage system and storage medium thereof
11500725 · 2022-11-15
Assignee
Inventors
Cpc classification
G06F11/3034
PHYSICS
G06F17/16
PHYSICS
H03M13/373
ELECTRICITY
H03M13/616
ELECTRICITY
G06F3/0619
PHYSICS
G06F3/067
PHYSICS
International classification
G06F11/10
PHYSICS
G06F11/07
PHYSICS
H03M13/37
ELECTRICITY
G06F17/16
PHYSICS
Abstract
A method of data recovery for a distributed storage system recovers multiple failed nodes concurrently with the minimum feasible bandwidth when failed nodes exist in a distributed storage system. Assistant nodes are selected, helper data sub-blocks are obtained through computation at the selected assistant nodes, a repair matrix is computed, and the missing data blocks are reconstructed by multiplying the repair matrix with the helper data sub-blocks; alternatively, the missing data blocks are reconstructed by decoding. The method is applicable to data recovery in the case of any number of failed nodes and any reasonable combination of coding parameters. The data recovery herein can reach the theoretical lower limit of the minimum recovery bandwidth.
Claims
1. A method of data recovery for a distributed storage system having totally n storage nodes each configured to store data, the system coding data to be stored with minimum storage regenerating code via product matrix construction C(n′,k′,d′), where n′=n+δ, k′=k+δ, d′=d+δ, δ denotes number of virtual nodes which are not real storage nodes, but theoretical storage nodes for computation, k indicates that data is equally divided into k segments, each segment contains α data sub-blocks, d is repair degree, C is the resulting n′×α coded matrix and each entry thereof is a data sub-block after encoding, first δ rows of C are zero, rows δ+1 to δ+k are source data and last m=n′−k′ rows are coded parity data, such that only n=n′−δ data blocks of the resulting n′ coded data blocks are needed to be stored, the method comprising following steps: Step 1: selecting assistant nodes from surviving nodes, wherein the surviving nodes are non-failed storage nodes, and the assistant nodes are storage nodes, selected from the surviving nodes, which provide helper data for data recovery, calculating helper data sub-blocks by using the assistant nodes, and sending the helper data sub-blocks to a regenerating node which is a storage node for reconstructing missing data of multiple failed storage nodes; letting {N.sub.i|1≤i≤n′} denote a set of both virtual nodes and real nodes, the real nodes being real storage nodes, where N.sub.1˜N.sub.δ are virtual nodes and N.sub.δ+1˜N.sub.n′ are real nodes, number of the surviving nodes is n′−t, t is number of failed nodes which are the storage nodes missing the data and t≥1, defining X={x.sub.i|i=1, . . . ,t, δ<x.sub.i≤n′} as a loss list, letting c.sub.x.sub.
Ω.sub.i,j=−θ.sub.x.sub.
Θ.sub.i=[θ.sub.x.sub.
2. The method according to claim 1, wherein the repair matrix works for t<min{k, α}, and decoding is used for recovering missing data for t≥min{k, α}: choosing k nodes as assistant nodes randomly from surviving nodes, downloading k×α data sub-blocks from the assistant nodes, then decoding the data sub-blocks to obtain source data.
3. The method according to claim 1, wherein Ξ.sub.X is an invertible matrix, or Ξ.sub.X becomes invertible by adding a node to the loss list, or Ξ.sub.X becomes invertible by replacing one or some nodes in the loss list with other nodes; otherwise decoding to reconstruct missing data.
4. The method according to claim 1, wherein the method includes centralized data recovery, under which number of the assistant nodes is chosen according to number of the failed nodes to be recovered, and collection of the helper data sub-blocks, computation of the repair matrix and reconstruction of the missing data are implemented by a central agent.
5. The method according to claim 1, wherein the method includes distributed data recovery, under which number of the assistant nodes is chosen according to number of the failed nodes to be recovered, each new node reconstructs data blocks stored on the failed node that it substitutes, and collection of the helper data sub-blocks, computation of the repair matrix and reconstruction of the missing data are implemented by each corresponding new node.
6. The method according to claim 5, wherein data blocks reconstructed by the new node further comprise data blocks needed by other new nodes.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(8) The present disclosure will be further described in detail below through specific embodiments in combination with the accompanying drawings. Many details described in the following embodiments are for the purpose of better understanding the present disclosure. However, a person skilled in the art will realize that some of these features can be omitted in different cases or be replaced by other methods. For clarity, some operations related to the present disclosure are not shown or illustrated herein, so as to prevent the core of the present disclosure from being obscured by excessive description. For the person skilled in the art, such operations need not be explained in detail; they can fully understand the related operations according to the description in the specification and the general technical knowledge in the field.
(9) In addition, the features, operations or characteristics described in the specification may be combined in any suitable manner to form various embodiments. At the same time, the steps or actions in the described method can also be reordered or adjusted in a manner apparent to those skilled in the art. Therefore, the various sequences in the specification and the drawings are only for the purpose of describing particular embodiments, and are not intended to imply a required order, unless it is otherwise stated that a certain sequence must be followed.
(10) In an embodiment of the present disclosure, a method of data recovery for a distributed storage system with MSR code constructed via PM (PM MSR) is proposed. The basic idea of this method is to construct a repair matrix that is applicable to any number of failed nodes and any valid combination of coding parameters, so that the lost data can be obtained by simply multiplying the helper data with the repair matrix.
(11) Separate data recovery algorithms for PM MSR codes are further proposed for centralized and distributed modes in the embodiments of the present disclosure.
(12) For convenience, the notations that are used throughout this disclosure and their definitions will first be presented. A minimum storage regenerating code via product matrix construction (PM MSR) is denoted by C(n,k,d), where n is the total number of nodes, k is the number of “systematic nodes” on which the uncoded original data blocks are stored; thus m=n−k is the number of “parity nodes” which hold the encoded data blocks. d is the “repair degree”, which is defined to be the number of assistant nodes required for the recovery when there is only one failed node.
(13) Assume that the total amount of data to be stored is B. Per the requirement of the PM MSR code, data of amount B is divided into k segments of the same size, and then encoded into n data blocks denoted by b.sub.1, b.sub.2, . . . , b.sub.n with block ID 1, 2, . . . , n respectively. Each data block b.sub.i consists of α sub-blocks s.sub.i1, s.sub.i2, . . . , s.sub.iα. According to the relevant theory of MSR codes, there are two equations as follows:
α=d−k+1 (1)
and
B=kα=k(d−k+1) (2)
(14) PM MSR code requires that d≥2k−2, thus n≥d+1≥2k−1 and α≥k−1.
(15) For this general case, the data amount B will be encoded into n′ data blocks based on the extended PM MSR code C(n′,k′,d′), where n′=n+δ, k′=k+δ and d′=d+δ. δ denotes the number of virtual nodes added by the extension, and it is chosen to make d′=2k′−2; then:
δ=d−(2k−2) (3)
and
d′=2(k+δ)−2=2(d−k+1)=2α (4)
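The relations (1) to (4) above can be checked with a short sketch; the function name and the example parameters below are illustrative assumptions, not part of the disclosure.

```python
# Illustrative (hypothetical) helper deriving the extended PM MSR parameters
# of equations (1)-(4) from the base code parameters (n, k, d).
def pm_msr_parameters(n, k, d):
    """Return the derived PM MSR parameters as a dict."""
    if d < 2 * k - 2:
        raise ValueError("PM MSR requires d >= 2k - 2")
    alpha = d - k + 1                  # sub-blocks per node, eq. (1)
    B = k * alpha                      # total source sub-blocks, eq. (2)
    delta = d - (2 * k - 2)            # number of virtual nodes, eq. (3)
    n_p, k_p, d_p = n + delta, k + delta, d + delta
    assert d_p == 2 * k_p - 2 == 2 * alpha   # eq. (4): d' = 2k' - 2 = 2*alpha
    return {"alpha": alpha, "B": B, "delta": delta,
            "n'": n_p, "k'": k_p, "d'": d_p}

# Example: C(6, 2, 4) gives alpha = 3, delta = 2, and the extended C(8, 4, 6)
params = pm_msr_parameters(6, 2, 4)
```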
(16) PM MSR code is constructed through the following formula:
C=ΨM (5)
(17) where M is the d′×α message matrix with
(18) M=[S.sub.1 S.sub.2].sup.T (6), that is, the α×α blocks S.sub.1 and S.sub.2 stacked vertically into the d′×α matrix M,
(19) where S.sub.1 and S.sub.2 are α×α symmetric matrices constructed such that the (α+1)α/2 entries in the upper-triangle of each matrix are filled by distinct data sub-blocks. Ψ=[ΦΛΦ] is the n′×d′ encoding matrix, and the ith row of Φ is denoted by ϕ.sub.i. Λ is an n′×n′ diagonal matrix as
(20) Λ=diag(λ.sub.1,λ.sub.2, . . . ,λ.sub.n′) (7)
(21) C is the resulted n′×α coded matrix, each entry thereof is a data sub-block after encoded, and the ith row of C is:
c.sub.i=φ.sub.iM=[ϕ.sub.iλ.sub.iϕ.sub.i]M=ϕ.sub.iS.sub.1+λ.sub.iϕ.sub.iS.sub.2 (8)
(22) where φ.sub.i=[ϕ.sub.iλ.sub.iϕ.sub.i] is the ith row of Ψ. M is encoded in such a way that the resulting C is systematic; that is, the first δ rows of C are all zero, and rows δ+1 to δ+k are exactly the source data, while the last m=n′−k′ rows are coded parity data. The advantage of this is that the source data can be read without decoding when there is no failed systematic node. It should be noted that all the matrices and data mentioned in this embodiment are represented by entries of the finite field GF(q).
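As a minimal sketch of the product C=ΨM in the formula (5) only: the code below multiplies an encoding matrix Ψ=[Φ ΛΦ] with a stacked message matrix over a prime field GF(p). The Vandermonde choice of Φ, the values λ.sub.i=i.sup.α, and all names are assumptions for illustration; the systematic arrangement of the disclosure is not reproduced here.

```python
# Sketch of C = Psi * M (formula (5)) over GF(p); illustrative only.
p = 257                                    # prime modulus for GF(p)

def matmul(A, B):
    """Multiply two matrices over GF(p)."""
    return [[sum(a * b for a, b in zip(row, col)) % p
             for col in zip(*B)] for row in A]

def encode(n_p, alpha, S1, S2):
    # Phi: n' x alpha Vandermonde block, row i = [i^0, i^1, ..., i^(alpha-1)]
    Phi = [[pow(i, j, p) for j in range(alpha)] for i in range(1, n_p + 1)]
    lam = [pow(i, alpha, p) for i in range(1, n_p + 1)]  # assumed lambda_i
    # Psi = [Phi | Lambda*Phi]: n' x d' encoding matrix, d' = 2*alpha
    Psi = [Phi[i] + [(lam[i] * v) % p for v in Phi[i]] for i in range(n_p)]
    M = S1 + S2                            # S1 stacked on S2: d' x alpha
    return matmul(Psi, M)                  # C: n' x alpha coded matrix

# Example with alpha = 2, n' = 5: S1, S2 are symmetric 2x2 sub-block matrices
S1 = [[1, 2], [2, 3]]
S2 = [[4, 5], [5, 6]]
C = encode(5, 2, S1, S2)
```

Row i of the result equals ϕ.sub.iS.sub.1+λ.sub.iϕ.sub.iS.sub.2, matching the formula (8).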
(23) Before recovery, when a node fails, δ virtual nodes on which the data are all zero are first added to the system. Let {N.sub.i|1≤i≤n′} denote the set of both virtual nodes and real nodes, where N.sub.1˜N.sub.δ are virtual nodes and N.sub.δ+1˜N.sub.n′ are real nodes. Hence it can be viewed that {c.sub.i|1≤i≤n′} are held by n′ nodes. Note that the “node” used herein can be either physical or logical; say, multiple logical nodes can reside on one or more physical machines, which does not affect the effectiveness of the present disclosure. In addition, the virtual nodes mentioned herein are not real; they are conceived for theoretical reasoning, reflecting the thinking process of the inventor and facilitating understanding by those skilled in the art.
(24) The number of failed nodes is denoted by t≥1. The set of the indices of all missing blocks X={x.sub.i|i=1, . . . , t, δ<x.sub.i≤n′} is defined as a loss list. The missing blocks are denoted by c.sub.x.sub.i, 1≤i≤t. To repair the t failed nodes concurrently, d′−t+1 surviving nodes are selected as assistant nodes, the set of which is denoted by N.sub.a, and the extended set is defined as
N.sub.e={N.sub.j|N.sub.j∈N.sub.a or j∈X} (9)
(25) It can be seen that there are d′+1 members in the union set N.sub.e.
(26) During recovery, each assistant node N.sub.j∈N.sub.a first calculates the inner products between its data block c.sub.j (regarded as a vector composed of sub-blocks [c.sub.j1, c.sub.j2, . . . , c.sub.jα]) and {ϕ.sub.x.sub.i|x.sub.i∈X}:
{h.sub.x.sub.i.sub.,j=c.sub.jϕ.sub.x.sub.i.sup.T|x.sub.i∈X} (10)
(27) Then the encoded sub-blocks are sent to the regenerating node as helper data. Thus the regenerating node will get t(d′−t+1) helper sub-blocks in total. Note that the premise of selecting d′−t+1 nodes as assistant nodes from the surviving nodes mentioned above is to concurrently repair t failed nodes; however, the failed nodes to be repaired can also be selected according to actual conditions, in which case the number of the selected assistant nodes will change, which will be further described below when discussing the centralized and distributed recovery modes.
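The per-assistant computation described above can be sketched as follows; the function name, field modulus and example vectors are illustrative assumptions, not taken from the disclosure.

```python
# Sketch of the inner-product helper computation of paragraph (26):
# assistant node j computes one sub-block per failed node x_i, namely the
# inner product of its data block c_j with the encoding row phi_{x_i}.
p = 257                                 # prime modulus for GF(p), assumed

def helper_subblocks(c_j, phis_of_failed):
    """Return one helper sub-block per failed node from assistant node j."""
    return [sum(s * f for s, f in zip(c_j, phi)) % p
            for phi in phis_of_failed]

# Example: alpha = 3, two failed nodes, one assistant data block c_j
c_j = [7, 8, 9]
phis = [[1, 1, 1], [1, 2, 4]]           # assumed rows phi_{x_1}, phi_{x_2}
h = helper_subblocks(c_j, phis)         # two helper sub-blocks for this node
# With t failed nodes and d' - t + 1 assistants, the regenerating node
# receives t * (d' - t + 1) helper sub-blocks in total.
```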
(28) The regenerating node recovers the missing data blocks through “repair matrix” which is obtained as follows.
(29) First, for each x.sub.i, x.sub.j∈X, j≠i, according to the following formula (11), calculating a matrix
Ω.sub.i,j=−θ.sub.x.sub.
where
θ.sub.x.sub.
and
Ψ.sub.x.sub.
(30) that is, Ψ.sub.x.sub.
(31) Next, let X.sub.i=X\x.sub.i be the ordered set of the remaining indices of the loss list X without x.sub.i, and G.sub.i={g.sub.j.sup.i|j≠i} be the ordered set of indices of columns in Ψ.sub.x.sub.
Θ.sub.i=[θ.sub.x.sub.
(32) where the computation of θ.sub.x.sub.
(33) Combining the computation mentioned above, we have
(34)
(35) where {right arrow over (h)}.sub.x.sub.i is a vector which consists of the coded sub-blocks calculated by each assistant node through making the inner product between the data block it holds and ϕ.sub.x.sub.i.
(36) For each x.sub.i, the size of the vector {right arrow over (h)}.sub.x.sub.
(37) Let
(38)
(39) If the matrix Ξ.sub.X is invertible, the repair matrix can be obtained by the following formula:
(40) R.sub.X=Ξ.sub.X.sup.−1diag(Θ.sub.1′,Θ.sub.2′, . . . ,Θ.sub.t′) (18)
(41) After the repair matrix is computed, the missing data blocks can be reconstructed by left multiplying the vector [{right arrow over (h)}′.sub.x.sub.
(42) The repair matrix in the formula (18) works for any t<min{k,α}. For t≥min{k,α}, decoding can be used for recovery, that is, choosing k nodes as the assistant nodes randomly from the surviving nodes, downloading k×α data sub-blocks from all the k assistant nodes, then decoding these data sub-blocks to get the source data and finally encoding the source data into the lost data blocks. But if t>m, the lost data are not recoverable, since the number of surviving nodes is then n−t<k, derived from m=n−k.
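The case analysis of the preceding paragraph can be summarized in a small dispatcher; the function and the returned labels are hypothetical names for illustration only.

```python
# Hypothetical summary of paragraph (42): repair matrix for t < min{k, alpha},
# decoding for min{k, alpha} <= t <= m, and no recovery possible for t > m.
def recovery_strategy(n, k, d, t):
    alpha = d - k + 1                # eq. (1)
    m = n - k                        # number of parity nodes
    if t > m:
        return "unrecoverable"       # fewer than k surviving nodes remain
    if t < min(k, alpha):
        return "repair-matrix"       # minimum-bandwidth concurrent repair
    return "decode"                  # download k*alpha sub-blocks and decode
```

For example, with C(6,2,4) a single failure uses the repair matrix, two to four failures fall back to decoding, and five failures are unrecoverable.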
(43) Besides, if the matrix Ξ.sub.X is not invertible, the repair matrix cannot be calculated through the formula (18). Several solutions can handle this situation, including but not limited to: 1) adding one or several nodes to X to make Ξ.sub.X invertible; 2) replacing one or some nodes in X with other nodes to make Ξ.sub.X invertible; and 3) decoding to implement data reconstruction. Since the probability of such a situation is rather small, any of these solutions has only a marginal effect on the overall performance. When adopting solution 1) and/or 2), the actually repaired data blocks include not only the real lost data blocks, but also the data blocks on the newly added or replacing nodes.
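The invertibility check that gates solutions 1) and 2) can be performed generically over a prime field; the sketch below is an assumed standalone test by Gaussian elimination, and the construction of Ξ.sub.X itself is out of its scope.

```python
# Generic invertibility test over GF(p) by Gaussian elimination (sketch).
p = 257                                   # prime modulus, assumed

def is_invertible(A):
    """Check whether a square matrix is invertible over GF(p)."""
    A = [row[:] for row in A]             # work on a copy
    n = len(A)
    for col in range(n):
        # find a non-zero pivot at or below the diagonal
        pivot = next((r for r in range(col, n) if A[r][col] % p), None)
        if pivot is None:
            return False                  # singular: no pivot in this column
        A[col], A[pivot] = A[pivot], A[col]
        inv = pow(A[col][col], p - 2, p)  # Fermat inverse in GF(p)
        for r in range(col + 1, n):
            f = (A[r][col] * inv) % p
            A[r] = [(a - f * b) % p for a, b in zip(A[r], A[col])]
    return True
```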
(44) The computation of the repair matrix is summarized as step 102 to step 106 shown in
(45) Step 1: for each x.sub.i∈X, computing Ψ.sub.x.sub.
(46) Step 2: for each x.sub.i, x.sub.j∈X, j≠i, computing Ω.sub.i,j according to
(47) Step 3: based on the result in Step 2, constructing Ξ.sub.X according to the formula (17);
(48) Step 4: for each x.sub.i∈X, computing Θ.sub.i′ according to
(49) Step 5: if Ξ.sub.X is invertible, left multiplying the general diagonal matrix resulted from Step 4 with Ξ.sub.X.sup.−1 to get the repair matrix R.sub.X according to the formula (18).
(50) As shown in
(51) The method of concurrently recovering multiple distributedly stored data blocks will be explained in detail for the centralized and distributed modes of data recovery.
(52) The concurrent recovery scheme for PM MSR regenerating codes that can jointly regenerate data blocks in the centralized mode is presented in
(53) (1) if t≥min{k,α}:
(54) Step 611: selecting k nodes as the assistant nodes randomly from the surviving nodes, the central agent sending a request to the assistant nodes asking them to offer their stored data blocks to the central agent;
(55) Step 612: the central agent waiting until receiving all the k×α helper data sub-blocks;
(56) Step 613: the central agent decoding the received data sub-blocks to get the source data; and
(57) Step 614: the central agent reconstructing the missing data blocks through encoding the source data and sending the reconstructed data blocks to t new nodes.
(58) (2) if t<min{k,α}:
(59) Step 621: computing Ξ.sub.X according to Step 102 to Step 104 in
(60) Step 622: going to one of the following three operations a) to c) when Ξ.sub.X is not invertible, otherwise performing next step;
(61) a) returning to Step 611 and reconstructing data by decoding;
(62) b) adding a node to X and recalculating Ξ.sub.X, then going to Step 623 when it is invertible or else performing a), b) or c);
(63) c) replacing a node in X with another node outside X and calculating Ξ.sub.X again, then going to Step 623 when it is invertible or else performing a), b) or c);
(64) Let the number of entries in X be z. When all possible combinations with z<min{k,α} in performing b) and/or c) have been gone through, it is necessary to perform operation a).
(65) Step 623: selecting d−t+1 nodes from the surviving nodes as the assistant nodes according to step 101 in
(66) Step 624: the central agent waiting until receiving all t(d−t+1) helper data sub-blocks;
(67) Step 625: computing repair matrix R.sub.X according to step 106 in
(68) Step 626: rearranging the received helper data sub-blocks according to the formula (16) so that it is corresponded to the general diagonal matrix in the formula (18); and
(69) Step 627: regenerating the missing data blocks by left multiplying the vector composed of the helper data sub-blocks re-ordered in step 626 with the repair matrix R.sub.X as shown in step 107 in
(70) For the distributed recovery mode, each new node only needs to regenerate the data blocks stored on the failed node it substitutes. If t≤n−d, a new node can choose d surviving nodes as the assistant nodes and regenerate its own missing data blocks by obtaining one coded helper data sub-block from each corresponding assistant node; thus the recovery bandwidth is td sub-blocks. If t>n−d, it is impossible for a single new node to regenerate only its own missing data blocks because there are not enough assistant nodes; in this situation, each new node has to regenerate at least t−(n−d)+1 missing data blocks concurrently with the aid of the n−t assistant nodes.
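The bandwidth counts of the preceding paragraph can be tallied in a short sketch; the function name is an assumption, and the collaborative-repair savings discussed later in paragraph (97) are deliberately ignored.

```python
# Illustrative tally of the distributed-mode recovery bandwidth, in sub-blocks.
def distributed_bandwidth(n, d, t):
    if t <= n - d:
        # each of the t new nodes downloads one sub-block from d assistants
        return t * d
    # otherwise each new node repairs u + 1 = t - (n - d) + 1 blocks,
    # receiving u + 1 sub-blocks from each of the n - t assistant nodes
    u = t - n + d
    return t * (n - t) * (u + 1)       # total over all t new nodes
```

For instance, with n=10 and d=6, three failures cost 18 sub-blocks, while five failures cost 50 (before any collaborative sharing between new nodes).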
(71) The concurrent failure recovery scheme for PM MSR regenerating codes that can jointly regenerate missing data blocks in the distributed mode is presented in
(72) As shown in
(73) (1) if t≥n−d−1+min{k,α}:
(74) Step 711: selecting k surviving nodes as the assistant nodes, the new node sending a request to the assistant nodes to inform them to offer their stored data blocks to the new node;
(75) Step 712: the new node waiting until receiving all k×α data sub-blocks;
(76) Step 713: decoding the received helper data sub-blocks to obtain source data; and
(77) Step 714: regenerating the missing data blocks through encoding, and returning.
(78) (2) If t≤n−d:
(79) Step 721: selecting d surviving nodes as the assistant nodes, the new node sending a request to the assistant nodes to inform the assistant nodes to calculate according to the formula (10) respectively and offer one helper data sub-block;
(80) Step 722: the new node waiting until receiving all d helper data sub-blocks;
(81) Step 723: computing repair matrix R.sub.{x.sub.
(82) Step 724: re-arranging the received helper data sub-blocks according to formula (16) so that it is corresponded to the general diagonal matrix in the formula (18); and
(83) Step 725: regenerating the missing data blocks by left multiplying the vectors formed by the re-arranged helper data sub-blocks in Step 724 with the repair matrix R.sub.{x.sub.
(84) (3) other situation:
(85) Step 731: selecting another u=t−n+d missing data blocks c.sub.y.sub.
(86) Step 732: calculating Ξ.sub.X according to the steps shown in
(87) Step 733: going to one of the following three operations a) to c) when Ξ.sub.X is not invertible, otherwise performing next step;
(88) a) returning to Step 711 and reconstructing data by decoding;
(89) b) adding a node to X and recalculating Ξ.sub.X, then going to step 734 when Ξ.sub.X is invertible or else performing a), b) or c);
(90) c) replacing a node in X with another node outside X, recalculating Ξ.sub.X and going to Step 734 when it is invertible or else performing a), b) or c);
(91) Let the number of entries in X be z. When all possible combinations with z<min{k,α} in performing b) and/or c) have been gone through, it is necessary to perform operation a).
(92) Step 734: selecting n−t surviving nodes as the assistant nodes, the new node sending a request to the assistant nodes to inform each assistant node to calculate according to the formula (10) and offer u+1 helper data sub-blocks;
(93) Step 735: the new node waiting until receiving all the (n−t)(u+1) helper data sub-blocks;
(94) Step 736: calculating the repair matrix R.sub.X according to the step 106 in
(95) Step 737: re-arranging the received helper data sub-blocks according to the formula (16) so that it is corresponded to the general diagonal matrix in the formula (18); and
(96) Step 738: regenerating the missing data blocks by left multiplying the vectors formed by the re-arranged helper data sub-blocks in Step 737 with the repair matrix R.sub.X, and going back.
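The three distributed-mode cases above can be mirrored by a small dispatcher; the function and its returned labels are hypothetical names for illustration.

```python
# Hypothetical dispatcher for the distributed-mode case analysis:
# decoding (steps 711-714), single-block repair (steps 721-725),
# or joint repair of u + 1 blocks per new node (steps 731-738).
def distributed_case(n, k, d, t):
    alpha = d - k + 1                    # eq. (1)
    if t >= n - d - 1 + min(k, alpha):
        return "decode"                  # steps 711-714
    if t <= n - d:
        return "single-block repair"     # steps 721-725
    return "joint repair"                # steps 731-738, u = t - n + d
```

For example, with n=10, k=4, d=6 (so α=3): three failures use single-block repair, five failures require joint repair, and six failures fall back to decoding.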
(97) Note that when applying the above-mentioned algorithm, if t>n−d, a new node may not only reconstruct the data blocks stored on the failed node that it substitutes, but may also reconstruct the data blocks required by other substitute nodes at the same time. In this case, the substitute node can choose to send the additionally reconstructed data blocks to the other substitute nodes that need them, so as to prevent those substitute nodes from rebuilding the data blocks by themselves. This can further reduce the recovery bandwidth and computing overhead, but requires coordination and cooperation between nodes, and may increase the repair time; it falls within the scope of collaborative repair. In practical applications, trade-offs should be made based on system performance needs.
(98) The principle and implementation manners of the present disclosure have been described above with reference to specific embodiments, which are merely provided for the purpose of understanding the present disclosure and are not intended to limit it. It will be possible for those skilled in the art to make variations based on the principle of the present disclosure.