Data storage method, device and distributed network storage system

Abstract

A data storage method, device and system are disclosed, comprising: splitting a file of size M into k blocks, each of size M/k; distributing the k blocks across k different storage nodes in the distributed network storage system; constructing, from the k blocks, n-k independent blocks via a linear coding method such that any k of the n encoded blocks can be used to reconstruct the original data in the file, which means the linear coding method is a Maximum-Distance Separable (MDS) code; and distributing the n-k encoded blocks to the remaining n-k different storage nodes in the distributed network storage system.

Claims

1. A method of storing data, comprising: splitting a file of size M into k blocks, wherein M is a measure of an amount of data in the file and k is a positive integer and is the smallest number of nodes required to reconstruct the file and such that each block is of size M/k; distributing the k blocks across k first storage nodes in a distributed network storage system via a first block allocation device; constructing n-k independent blocks via a linear coding method employing a Maximum-Distance Separable (MDS) code, where n is a total number of nodes in the distributed network storage system and n>k and satisfying that any k of the n encoded blocks can reconstruct the original data in the file; distributing the n-k encoded blocks to n-k different second storage nodes in the distributed network storage systems via a second block allocation device; determining that nodes have failed and that no more than n-k first and second nodes have failed; and recovering data stored in failed first and second nodes by a node recovery device through linear encoding via at least k intact nodes, wherein data in the failed first nodes is exactly regenerated with interference alignment and wherein data in the failed second nodes is regenerated maintaining a MDS code property.

2. A data storage device comprising: a data block device which splits a file of size M into k blocks, wherein M is a measure of an amount of data in the file and k is a positive integer and is the smallest number of nodes required to reconstruct the file, where each block is of size M/k; a first block allocation device which distributes the k blocks into k first nodes in a distributed network storage system; a first encoding device which constructs n-k independent blocks via linear coding employing a Maximum-Distance Separable (MDS) code from the k blocks and satisfying a property that arbitrary k of the n encoded blocks can reconstruct the file; a second block allocation device which distributes n-k encoding blocks into n-k second storage nodes wherein, n, k are both positive integers, and satisfy n>k, and n is the number of total nodes in distributed network storage system, while k is the least number of nodes needed to reconstruct the file; and a node recovery device which determines that nodes have failed and that no more than n-k first and second nodes have failed and recovers data stored in failed first and second nodes through linear encoding via at least k intact nodes, wherein data in the failed first nodes is exactly regenerated with interference alignment and wherein data in the failed second nodes is regenerated maintaining a MDS code property.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 illustrates the overall framework of the schematic diagram of a distributed network storage system that this invention case provides;

(2) FIG. 2 illustrates the flow chart of the method of a data storage that this invention case 1 provides;

(3) FIG. 3 illustrates the flow chart of the method of a data storage that this invention case 2 provides;

(4) FIG. 4a illustrates the schematic diagram of a (4,2) MDS code that this invention case provides;

(5) FIG. 4b illustrates the schematic diagram of nodes failure recovery of an MDS code that this invention case provides;

(6) FIG. 4c illustrates another schematic diagram of nodes failure recovery of an MDS code that this invention case provides;

(7) FIG. 5 illustrates the schematic diagram of different versions of repair and main techniques that this invention case provides;

(8) FIG. 6 illustrates the information process schematic diagram of the functional repair of a (4,2) MDS code that this invention case provides;

(9) FIG. 7 illustrates the schematic diagram of the optimal tradeoff curve of node storage and repair bandwidth that this invention case provides;

(10) FIG. 8 illustrates the schematic diagram of the exact repair of a (5,3) MBR code that this invention case provides;

(11) FIG. 9 illustrates the schematic diagram of the exact repair of a (4,2) MSR code that this invention case provides;

(12) FIG. 10 illustrates the schematic diagram of the exact repair of a (6,3) MBR code that this invention case provides;

(13) FIG. 11 illustrates the schematic diagram of the exact repair of systematic parts that this invention case provides;

(14) FIG. 12 illustrates structure diagram of a data storage device that this invention case 3 provides; and

(15) FIG. 13 illustrates structure diagram of a data storage device that this invention case 4 provides.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

(16) The embodiment of the invention provides a data storage method, a device and a distributed network storage system, aiming to solve the problem that prior-art data storage methods cannot guarantee the reliability of a distributed network storage system.

(17) This invention provides a method of storing data, the method comprising:

(18) Splitting a file of size M into k blocks, each block being of size M/k;

(19) Distributing the k blocks across k different storage nodes in the distributed network storage system;

(20) Constructing n-k independent blocks via a linear coding method, satisfying the property that any k of the n encoded blocks can be used to reconstruct the original data in the file, which means the linear coding method is a Maximum-Distance Separable (MDS) code;

(21) Distributing the n-k encoded blocks to the remaining n-k different storage nodes in the distributed network storage system.

(22) Wherein n and k are both positive integers satisfying n>k; n is the total number of nodes in the distributed network storage system, and k is the least number of nodes needed to reconstruct the original file.

(23) For another, this invention provides a device of storing data, the device comprising:

(24) A data block unit, used to split a file of size M into k blocks, each block being of size M/k;

(25) A first block allocation unit, used to distribute the k blocks to k different nodes in the distributed network storage system;

(26) An encoding unit, used to construct n-k independent blocks from the k blocks via linear coding, satisfying the property that any k of the n encoded blocks can be used to reconstruct the original data, which means the linear coding method is a Maximum-Distance Separable (MDS) code;

(27) A second block allocation unit, used to distribute the n-k encoded blocks to the remaining n-k different storage nodes in the distributed network storage system.

(28) Wherein n and k are both positive integers satisfying n>k; n is the total number of nodes in the distributed network storage system, and k is the least number of nodes needed to reconstruct the original file.

(29) Furthermore, this invention also provides a distributed network storage system including a client; the system further comprises a data storage apparatus connected to the client. The data storage apparatus consists of the data nodes and the index server of the distributed network storage system.

(30) In the embodiment of this invention, the data is divided into k blocks and stored in k different nodes. The k blocks are then used to construct n-k independent blocks via a linear coding method (a Maximum-Distance Separable (MDS) code), satisfying the property that any k of the n encoded blocks can be used to reconstruct the original data in the file. After that, the n-k encoded blocks are distributed to the remaining n-k different storage nodes. This invention enables the system to tolerate the simultaneous failure of at most n-k nodes without losing data, keeps the redundancy of the system at a constant level, and thereby ensures the reliability of the distributed network storage system.

(31) In order to make the objectives, technical scheme and advantages of the present invention more apparent, the present invention is described in further detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.

(32) In a distributed storage system, data is scattered across multiple independent devices. A traditional network storage system uses a centralized storage server to store all data, so the storage server becomes the bottleneck of system performance as well as the focal point of system reliability and security. A traditional network storage system therefore cannot meet the needs of large-scale storage applications.

(33) A distributed network storage system uses a scalable architecture: it shares the storage load among multiple storage servers and uses a location server to store location information. This not only improves system reliability, availability and access efficiency, but also makes the system easy to extend.

(34) FIG. 1 is a schematic diagram of the overall framework of the distributed network storage system. The system contains four storage nodes (SN1-SN4), an index server IS, and a client represented by the PC; IS and the SNs constitute the server side. IS stores the naming and routing information of each node, the directory structure of the entire system, the mapping between file names and blocks, the storage location of each file block, and so on. The role of IS is similar to that of the file partition table in a traditional file system. When a user wants to upload or download a file, the client looks up the specified SN through IS and then interacts with that SN.

(35) The embodiment of the invention introduces redundancy into the distributed network storage system through linear network coding in order to increase the reliability of the system. When coding is used, the problem of repairing failed nodes arises: if a node that stores encoded information fails, the encoded information needs to be reconstructed in a new node in order to maintain the same level of system reliability. This amounts to a partial recovery of the code. However, conventional error-correcting codes focus on recovering the whole information from a subset of the encoded blocks. This leads to new design challenges with respect to the network load of repair. Recently, network coding techniques have proved helpful for these challenges: compared with standard error-correcting codes, network coding can reduce bandwidth usage by orders of magnitude.

(36) Invention Case 1

(37) FIG. 2 illustrates the implementation flow of the data storage method provided by invention case 1, which mainly includes:

(38) In step S201, a file of size M is split into k blocks, each of size M/k.

(39) In the present embodiment, a file of size M is to be stored in the distributed network storage system. The file is first divided equally into k blocks, so that each block is of size M/k.

(40) In step S202, the k blocks are distributed across k different storage nodes in the distributed network storage system.

(41) In the present embodiment, the k different blocks of equal size are distributed to k different nodes in the distributed network storage system. These nodes are known as the system nodes.

(42) In step S203, n-k independent blocks are constructed from the k blocks via a linear coding method, satisfying the property that any k of the n encoded blocks can be used to reconstruct the original data in the file; the linear coding method is a Maximum-Distance Separable (MDS) code.

(43) In the present embodiment, a full-rank matrix of size k×(n-k) is first randomly generated, and n-k independent blocks are then constructed from the k blocks and the full-rank matrix. The generated blocks are linear combinations of the original information blocks and satisfy the MDS property that any k of the n blocks are able to reconstruct the entire original file. Wherein n and k are both positive integers satisfying n>k; n is the total number of nodes in the distributed network storage system, and k is the least number of nodes needed to reconstruct the original file.

(44) In step S204, the n-k encoded blocks are distributed to the remaining n-k different storage nodes in the distributed network storage system.

(45) In the present embodiment, the n-k encoded blocks are distributed to the remaining n-k different storage nodes; these nodes are known as the non-system nodes.
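Steps S201 to S204 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: arithmetic is done in the prime field GF(257) so that every byte is a field element, and a Cauchy matrix is used in place of the randomly generated full-rank matrix because every square submatrix of a Cauchy matrix is invertible, which guarantees the MDS property. The names `split_file`, `encode`, `reconstruct` and `cauchy_parity_matrix` are hypothetical.

```python
import itertools

P = 257  # assumed prime field size: every byte value 0..255 is a field element


def cauchy_parity_matrix(n, k):
    """k x (n-k) Cauchy matrix over GF(P) with entries 1 / (x_i + y_j)."""
    # x_i = i and y_j = k + j, so x_i + y_j is never 0 mod P while n + k - 2 < P.
    return [[pow(x + y, -1, P) for y in range(k, n)] for x in range(k)]


def split_file(data, k):
    """S201: split bytes into k equal blocks of size M/k (zero-padded tail)."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [list(padded[i * size:(i + 1) * size]) for i in range(k)]


def encode(blocks, n):
    """S202-S204: k systematic blocks followed by n-k parity blocks."""
    k = len(blocks)
    C = cauchy_parity_matrix(n, k)
    # Parity symbols live in 0..256, so they may not fit in a single byte.
    parity = [
        [sum(C[i][j] * blocks[i][t] for i in range(k)) % P
         for t in range(len(blocks[0]))]
        for j in range(n - k)
    ]
    return blocks + parity


def reconstruct(available, n, k):
    """Recover the k original blocks from k stored blocks.

    `available` maps a node index (0..n-1) to the block stored on that node.
    """
    C = cauchy_parity_matrix(n, k)
    nodes = sorted(available)[:k]
    # One linear equation per node over the k unknown original blocks.
    rows = [
        [int(i == v) for i in range(k)] if v < k else [C[i][v - k] for i in range(k)]
        for v in nodes
    ]
    rhs = [list(available[v]) for v in nodes]
    for col in range(k):  # Gauss-Jordan elimination modulo P
        piv = next(r for r in range(col, k) if rows[r][col])
        rows[col], rows[piv] = rows[piv], rows[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        inv = pow(rows[col][col], -1, P)
        rows[col] = [x * inv % P for x in rows[col]]
        rhs[col] = [x * inv % P for x in rhs[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[col])]
                rhs[r] = [(a - f * b) % P for a, b in zip(rhs[r], rhs[col])]
    return rhs


blocks = split_file(b"distributed storage!", 3)   # k = 3
stored = encode(blocks, 6)                        # n = 6
# Any 3 of the 6 nodes suffice to rebuild the original blocks.
ok = all(reconstruct({v: stored[v] for v in s}, 6, 3) == blocks
         for s in itertools.combinations(range(6), 3))
```

With n=6 and k=3 the sketch tolerates the loss of any n-k=3 nodes; a production system would more likely use a binary extension field such as GF(2^8) so that parity symbols also fit in one byte.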

(46) Invention Case 2

(47) FIG. 3 illustrates the implementation flow of the data storage method provided by invention case 2, which mainly includes:

(48) In step S301, a file of size M is split into k blocks, each of size M/k.

(49) In step S302, the k blocks are distributed across k different storage nodes in the distributed network storage system.

(50) In step S303, n-k independent blocks are constructed from the k blocks via a linear coding method, satisfying the property that any k of the n encoded blocks can be used to reconstruct the original data in the file; the linear coding method is a Maximum-Distance Separable (MDS) code.

(51) In step S304, the n-k encoded blocks are distributed to the remaining n-k different storage nodes in the distributed network storage system.

(52) In step S305, if nodes have failed and the number of failed nodes is no larger than n-k, the data stored in the failed nodes is recovered via at least k live nodes.

(53) In the present embodiment, the nodes in the distributed network storage system are monitored to determine whether any nodes have failed (for example through hard disk damage, network disconnection or power loss) and to count the number of nodes that have failed at the same time.

(54) If the number of failed nodes is larger than n-k, then, according to the MDS property, at least k nodes would be needed to recover the information of the original file. Since fewer than k nodes remain active, the contents of the file cannot be recovered and the system loses the file.

(55) If the number of failed nodes is no larger than n-k, then, according to the MDS property, the relevant information can be obtained from the remaining nodes, of which there are at least k, and the information stored in the failed nodes can be recovered through linear coding. After the failed nodes are repaired, the redundancy of the system is unchanged and the code is still an MDS code, so the reliability of the system is ensured.

(56) Invention case 1 and invention case 2 above mainly apply to MDS codes. An MDS code offers the best tradeoff between redundancy and reliability, because k blocks contain the minimum information needed to restore the original data. In a distributed network storage system, the n different nodes (such as disks, servers, or endpoints) that store the encoded data blocks are spread over the network, and the system can tolerate the failure of (n-k) nodes without loss of data.

(57) FIG. 4a is an example of a (4,2) MDS code. A file of size 4, consisting of A1, A2, B1 and B2, is stored. The file is divided into 2 parts, each of size 2. The blocks stored on the first two nodes are uncoded, so these are the system nodes. The last two nodes store linear combinations of the original information A1, A2, B1, B2. It can be observed that any k(=2) of the n(=4) nodes can recover all the original data.

(58) The following focuses on how to recover the data on failed nodes from no fewer than k live nodes when nodes have failed and the total number of failed nodes does not exceed n-k.

(59) FIG. 4b illustrates the case in which the first node has failed; the newcomer node can recover the lost data from the other three existing nodes. Repairing the data lost from a single node is much cheaper than reconstructing the whole file. In this case, the lost information can be recovered by communicating with the three existing nodes: more precisely, three blocks (B2, A2+B2, A1+A2+B2) are downloaded to complete the recovery.

(60) FIG. 4c illustrates the repair of the fourth storage node. This can also be achieved by downloading only three blocks, but one key difference is that the second node needs to compute a linear combination of its stored blocks B1 and B2, so the block actually communicated is B1+B2. This clearly shows the necessity of network coding, that is, creating linear combinations in intermediate nodes during the repair process.
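The two repairs can be checked with a short Python sketch in which addition is XOR, i.e. arithmetic over GF(2). The text names only some of the stored combinations (B2, A2+B2, A1+A2+B2 and the transmitted B1+B2); the remaining parity blocks A1+B1 and A2+B1, and the choice of A1 and A2+B2 as the other two downloads in the FIG. 4c repair, are assumptions made here so that the example is complete. All function names are hypothetical.

```python
from itertools import combinations

# Unknowns are ordered (A1, A2, B1, B2); each stored block is a GF(2) bit mask over them.
A1, A2, B1, B2 = 0b0001, 0b0010, 0b0100, 0b1000

# Assumed node layout: nodes 1 and 2 are systematic, nodes 3 and 4 hold parity.
NODES = {
    1: [A1, A2],
    2: [B1, B2],
    3: [A1 ^ B1, A2 ^ B2],        # A1+B1 is an assumption; A2+B2 is named in the text
    4: [A2 ^ B1, A1 ^ A2 ^ B2],   # A2+B1 is an assumption; A1+A2+B2 is named in the text
}


def rank_gf2(masks):
    """Rank over GF(2) of a list of bit-mask row vectors."""
    basis = []
    for m in masks:
        for b in basis:
            m = min(m, m ^ b)     # clear the leading bit of b from m if it is set
        if m:
            basis.append(m)
    return len(basis)


def stored_value(mask, file):
    """Evaluate one stored block: XOR of the file symbols selected by `mask`."""
    out = 0
    for bit, symbol in enumerate(file):
        if mask >> bit & 1:
            out ^= symbol
    return out


def repair_node1(b2, a2_b2, a1_a2_b2):
    """FIG. 4b: rebuild (A1, A2) from B2, A2+B2 and A1+A2+B2."""
    return [a1_a2_b2 ^ a2_b2, b2 ^ a2_b2]


def repair_node4(a1, b1_b2, a2_b2):
    """FIG. 4c: rebuild (A2+B1, A1+A2+B2); node 2 sends the combination B1+B2."""
    return [a2_b2 ^ b1_b2, a1 ^ a2_b2]


# MDS check: the four blocks held by any two nodes span all four unknowns.
is_mds = all(rank_gf2(NODES[u] + NODES[v]) == 4 for u, v in combinations(NODES, 2))
```

In both repairs the newcomer downloads three blocks instead of the four needed to reconstruct the whole file.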

(61) If network bandwidth is a more critical resource than disk access, as is often the case, an important consideration is to find the minimum required repair bandwidth and the codes that can achieve it.

(62) In the repair examples shown in FIG. 4b and FIG. 4c, the newcomer reconstructs exactly the two blocks that were stored in the failed node. Note, however, that the definition of repair only requires that the new node, when combined with the existing nodes, forms an (n, k) MDS code (that is, any k nodes out of n suffice to recover the whole original data). In other words, the new node may form new linear combinations that differ from the ones in the lost node, a requirement that is strictly easier to satisfy.

(63) Thus three versions of repair are considered in the method of recovering the data stored in the failed node from no fewer than k live nodes, as illustrated in FIG. 5: exact repair, functional repair, and hybrid repair (exact repair of the systematic part). In functional repair, it suffices that the newly generated blocks maintain the MDS code property. Functional repair can be reduced to a multicasting problem on an appropriately constructed graph called the information flow graph, so network coding is the main technique used. In exact repair, the failed blocks are exactly regenerated; the main techniques used are network coding and interference alignment. Exact repair of the systematic part is a hybrid repair model lying between exact repair and functional repair: the systematic blocks must be exactly repaired upon failure, while the nonsystematic parts follow the functional repair model. It can be viewed as a relaxation of exact repair, so the main techniques used are again network coding and interference alignment.

(64) Wherein, functional repair means that the data contained in the blocks of the new node, which are constructed via linear network coding, need not be exactly the same as the data stored in the failed node, while the MDS property of the distributed network storage system is maintained after repair.

(65) Specifically, functional repair includes:

(66) Downloading β bits of coded information from each of any d live nodes, and repairing the data stored in the failed nodes via linear network coding;

(67) Wherein d≤n-1, and n is the total number of nodes in the distributed network storage system.

(68) The detailed description of functional repair is listed as follows:

(69) The functional repair problem can be represented as multicasting over an information flow graph. The information flow graph represents the evolution of information flow as nodes join and leave the storage network. For any per-node storage α≥α*(n,k,d,γ), the points (n,k,d,α,γ) are feasible, and linear network codes suffice to achieve them.

(70) It is information-theoretically impossible to achieve points with α<α*(n,k,d,γ). The threshold function α*(n,k,d,γ) is the following:

(71) $\begin{matrix} \alpha^{*}(n, k, d, \gamma) = \begin{cases} \frac{M}{k}, & \gamma \in [f(0), +\infty) \\ \frac{M - g(i)\gamma}{k - i}, & \gamma \in [f(i), f(i - 1)) \end{cases} & (1) \\ f(i) = \frac{2Md}{(2k - i - 1)i + 2k(d - k + 1)} & (2) \\ g(i) = \frac{(2d - 2k + i + 1)i}{2d} & (3) \end{matrix}$

(72) where d≤n-1. Given (n,k,d), the minimum repair bandwidth is

(73) $\begin{matrix} \gamma_{\min} = f(k - 1) = \frac{2Md}{2kd - k^{2} + k}. & (4) \end{matrix}$

(74) One important observation is that the minimum repair bandwidth γ=dβ is a decreasing function of the number d of nodes that participate in the repair. As the newcomer communicates with more nodes, the size β of each communicated packet shrinks fast enough to make the product dβ decrease. Therefore, the minimum repair bandwidth is achieved when d=n-1.
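Formulas (1) to (4) can be evaluated exactly with rational arithmetic. The following Python sketch is illustrative only; the function names `f`, `g` and `alpha_star` mirror the symbols of the formulas, and the factor i in g(i) and the factor γ in the second branch of formula (1) are taken as assumptions required for the branches of α* to join continuously.

```python
from fractions import Fraction


def f(i, M, k, d):
    """Formula (2): breakpoints of the repair bandwidth gamma."""
    return Fraction(2 * M * d, (2 * k - i - 1) * i + 2 * k * (d - k + 1))


def g(i, k, d):
    """Formula (3)."""
    return Fraction((2 * d - 2 * k + i + 1) * i, 2 * d)


def alpha_star(gamma, M, k, d):
    """Formula (1): minimum per-node storage for total repair bandwidth gamma.

    Returns None when gamma is below the minimum repair bandwidth f(k-1).
    """
    if gamma >= f(0, M, k, d):
        return Fraction(M, k)
    for i in range(1, k):
        if f(i, M, k, d) <= gamma < f(i - 1, M, k, d):
            return (M - g(i, k, d) * gamma) / (k - i)
    return None


# Formula (4) for k = 5 and M = 1: gamma_min shrinks as d grows from k to n-1 = 9.
gamma_min_by_d = {d: f(4, 1, 5, d) for d in range(5, 10)}
```

For k=5 and M=1 the dictionary `gamma_min_by_d` decreases from 1/3 at d=5 to 9/35 at d=9, which matches the observation that contacting more nodes lowers the total repair bandwidth.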

(75) As mentioned above, code repair can be achieved if and only if the underlying information flow graph has a sufficiently large min-cut. This condition leads to the repair rates computed in formula (1) to formula (4). When these conditions are met, simple random linear combinations suffice with high probability as the field size over which coding is performed grows.

(76) Two kinds of codes are considered, namely minimum-storage regenerating (MSR) codes and minimum-bandwidth regenerating (MBR) codes. According to formula (1) to formula (4) above, it can be verified that the minimum storage point is achieved by

(77) $\begin{matrix} (\alpha_{MSR}, \gamma_{MSR}) = (\frac{M}{k}, \frac{Md}{k(d - k + 1)}). & (5) \end{matrix}$

(78) As discussed, the repair bandwidth $\gamma_{MSR}=d\beta_{MSR}$ is a decreasing function of the number of nodes d that participate in the repair. Since MSR codes store M/k bits at each node while ensuring the MDS-code property, they are equivalent to standard MDS codes. Observe that when d=k, the total communication for repair is M (the size of the original file). Therefore, if a newcomer is allowed to contact only k nodes, it must download the whole data object to repair one failure; this is the naive repair method, which can be performed for any MDS code.

(79) However, by allowing a newcomer to contact more than k nodes, MSR codes can reduce the repair bandwidth $\gamma_{MSR}$, which is minimized when d=n-1:

(80) $\begin{matrix} (\alpha_{MSR}, \gamma_{MSR}^{\min}) = (\frac{M}{k}, \frac{M}{k} \cdot \frac{n - 1}{n - k}). & (6) \end{matrix}$

(81) The M/k factor in $\gamma_{MSR}^{\min}$ has been separated to illustrate that MSR codes communicate a factor of (n-1)/(n-k) more than what they store. This represents a fundamental expansion necessary for MDS constructions that are optimal on the reliability-redundancy tradeoff. For example, consider an (n,k)=(14,7) code. In this case, the newcomer needs to download only M/49 bits from each of the d=n-1=13 active storage nodes, making the repair bandwidth equal to (M/7)·(13/7). Notice that only an expansion factor of 13/7 is needed, while a factor of 7 is required for the naive repair method.
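The (14,7) figures follow directly from formula (5). A small Python check, with the hypothetical helper `msr_point` and the file size chosen as M=49 units so that all quantities are integers:

```python
from fractions import Fraction


def msr_point(M, k, d):
    """Formula (5): per-node storage alpha, per-helper download beta, total gamma."""
    alpha = Fraction(M, k)
    gamma = Fraction(M * d, k * (d - k + 1))
    beta = gamma / d          # gamma = d * beta
    return alpha, beta, gamma


M, n, k = 49, 14, 7
alpha, beta, gamma = msr_point(M, k, d=n - 1)     # contact all 13 survivors
_, _, gamma_naive = msr_point(M, k, d=k)          # contact only k nodes
expansion = gamma / alpha                         # (n-1)/(n-k)
naive_expansion = gamma_naive / alpha             # k
```

With d=13 each helper sends β=M/49 and the expansion factor is 13/7; with d=k=7 the newcomer downloads the whole file, an expansion factor of 7.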

(82) At the other end of the tradeoff are MBR codes, which have minimum repair bandwidth. It can be verified that the minimum repair bandwidth point is achieved by

(83) $\begin{matrix} (\alpha_{MBR}, \gamma_{MBR}) = (\frac{2Md}{2kd - k^{2} + k}, \frac{2Md}{2kd - k^{2} + k}). & (7) \end{matrix}$

(84) Note that in minimum-bandwidth regenerating codes, the storage size α is equal to γ, the total number of bits communicated during repair. If we set the optimal value d=n-1, we obtain

(85) $\begin{matrix} (\alpha_{MBR}^{\min}, \gamma_{MBR}^{\min}) = (\frac{M}{k} \cdot \frac{2n - 2}{2n - k - 1}, \frac{M}{k} \cdot \frac{2n - 2}{2n - k - 1}). & (8) \end{matrix}$
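Formulas (7) and (8) can be checked numerically in the same way. In this Python sketch the helper name `mbr_point` and the sample file sizes are illustrative assumptions.

```python
from fractions import Fraction


def mbr_point(M, k, d):
    """Formula (7): at the MBR point, storage alpha equals repair bandwidth gamma."""
    gamma = Fraction(2 * M * d, 2 * k * d - k * k + k)
    return gamma, gamma / d, gamma     # alpha, beta, gamma


def mbr_storage_expansion(n, k):
    """Factor by which alpha exceeds M/k in formula (8), i.e. at d = n - 1."""
    return Fraction(2 * n - 2, 2 * n - k - 1)


# (n, k) = (14, 7) with M = 70 units, contacting d = n - 1 = 13 nodes.
alpha, beta, gamma = mbr_point(70, 7, 13)
# (n, k) = (5, 3) with M = 9 and d = 4.
alpha_53, beta_53, gamma_53 = mbr_point(9, 3, 4)
```

For (14,7) each node stores 13 units instead of the 10 units an MSR code would store, an expansion of 26/20, while the repair download equals the amount stored.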

(86) Notice that $\alpha_{MBR}^{\min}=\gamma_{MBR}^{\min}$: MBR codes incur no repair bandwidth expansion at all, just like a replication system, downloading during a repair exactly the amount of information stored. However, MBR codes require an expansion factor of (2n-2)/(2n-k-1) in the amount of stored information and are no longer optimal in terms of their reliability for the given redundancy. FIG. 6 gives an example of an information flow graph of a (4, 2) MDS code. In FIG. 6, each storage node is represented by a pair of nodes $X^{i}_{in}$ and $X^{i}_{out}$ connected by an edge whose capacity is the storage capacity of the node. There is a virtual source node s corresponding to the origin of the data object. Suppose that initially a file of size M=4 blocks is stored at four nodes, where each node stores α=2 blocks and the file can be reconstructed from any two nodes. Virtual sink nodes called data collectors connect to any subset of k nodes and ensure that the code has the MDS property (that any k out of n nodes suffice to recover the file). Suppose storage node 4 fails; the goal is then to create a new storage node, $X^{5}_{in}$, which downloads the minimum amount of information and then stores α=2 blocks. Node $X^{5}_{in}$ is connected to the d=3 active storage nodes. Assuming β bits are communicated from each active storage node, the quantity of interest is the minimum β required. The min-cut separating the source and the data collector must be at least M=4 blocks for regeneration to be possible. For FIG. 6, the min-cut value is α+2β, implying that communicating β≥1 block from each node is sufficient and necessary. The total repair bandwidth to repair one failure is therefore γ=dβ=3 blocks.

(87) An information flow graph corresponds to a particular evolution of the distributed network storage system after a certain number of failures and repairs. Each failure/repair is called a stage: in each stage, a single storage node fails and the code is repaired by downloading β bits from each of any d surviving nodes. Therefore, the total repair bandwidth is γ=dβ. In the example shown in FIG. 6, the system consists of nodes 1, 2, 3, and 4 in the initial stage, and of nodes 2, 3, 4 and 5 in the second stage.
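The min-cut argument for FIG. 6 can be reproduced numerically. The Python sketch below uses the general worst-case min-cut expression for information flow graphs, the sum over i=0..k-1 of min(α, (d-i)β); that expression is a known result brought in here as an assumption, since this document states only the FIG. 6 value α+2β. The function names are hypothetical.

```python
from fractions import Fraction


def min_cut(alpha, beta, k, d):
    """Worst-case min-cut between the source and a data collector (assumed formula)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))


def repair_feasible(M, alpha, beta, k, d):
    """Regeneration is possible only if the min-cut is at least the file size M."""
    return min_cut(alpha, beta, k, d) >= M


# FIG. 6 parameters: M = 4 blocks, alpha = 2 blocks per node, k = 2, d = 3.
cut_at_one_block = min_cut(2, 1, k=2, d=3)             # alpha + 2*beta with beta = 1
enough = repair_feasible(4, 2, 1, k=2, d=3)
too_little = repair_feasible(4, 2, Fraction(9, 10), k=2, d=3)
```

With β=1 the cut equals 4=M, so repair is feasible with γ=dβ=3 blocks, while any β below 1 leaves the cut short of M.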

(88) As mentioned above, code repair can be achieved if and only if the underlying information flow graph has a sufficiently large min-cut, and simple random linear combinations suffice with high probability as the field size over which coding is performed grows. According to formula (1) to formula (4), FIG. 7 shows the optimal tradeoff curve between storage α and repair bandwidth γ for the case (k=5, n=10). Here M=1 and d=n-1. Note that, for the parameters (k=5, n=10), traditional erasure coding corresponds to the point (α=0.2, γ=1).

(89) It is of interest to study the two extremal points on the optimal tradeoff curve, which correspond to the best storage efficiency and the minimum repair bandwidth, respectively. We call codes that attain these points minimum-storage regenerating (MSR) codes and minimum-bandwidth regenerating (MBR) codes, respectively.

(90) As discussed, the repair-storage tradeoff for functional repair can be completely characterized by analyzing the cutset of the information flow graphs. However, as mentioned earlier, functional repair is of limited practical interest since there is a need to maintain the code in systematic form. Also, under functional repair, significant system overhead is incurred in order to continually update the repairing-and-decoding rules whenever a failure occurs. Moreover, the random-network-coding-based solution for functional repair can require a huge finite-field size to support a dynamically expanding graph size (due to continual repair). This can significantly increase the computational complexity of encoding and decoding. Furthermore, functional repair is undesirable in storage security applications in the face of eavesdroppers. In this case, information leakage occurs continually due to the dynamics of the repairing-and-decoding rules, which can potentially be observed by eavesdroppers. These drawbacks motivate the need for exact repair of failed nodes.

(91) Wherein, exact repair means that the data stored in the blocks of the new node, which are constructed via linear network coding and interference alignment, is exactly regenerated; that is, the lost blocks are restored as exact replicas.

(92) Specifically, exact repair includes:

(93) Downloading β bits of coded information from each of any d live nodes, and repairing the data stored in the failed nodes via linear network coding;

(94) Reducing the dimension of the interference at the newly generated node through interference alignment;

(95) Wherein, for Minimum-Bandwidth Regenerating (MBR) codes, d=n-1; for Minimum-Storage Regenerating (MSR) codes, d∈[2k-1,n-1] and k/n≤1/2; n is the total number of nodes in the distributed network storage system.

(96) Wherein, the detailed description of exact repair is listed as follows:

(97) For exact-MBR codes, the cutset lower bound can be achieved with a deterministic scheme that requires a finite-field alphabet size of at most (n-1)n/2 when d=n-1. FIG. 8 illustrates the idea through the example of (n,k,d,α,γ)=(5,3,4,4,4), where the maximum file size of M=9 (matching the cutset bound) can be stored. Each node stores four blocks of the form $a^{t}v_{i}$, where $v_{i}$ can be interpreted as a 1-D subspace of the data file. For brevity, only the subspace vector is written to represent an actually stored block. Notice that the degree d is equal to the number of stored blocks to be repaired, i.e., the number of available equations matches the number of desired variables for exact repair of a single node. Hence, for exact repair, there must be at least one duplicated block between node 1 and node i for all i≠1.

(98) The idea is to have each other node i (i≠1) store one block of node 1: nodes 2, 3, 4, and 5 store $a^{t}v_{1}$, $a^{t}v_{2}$, $a^{t}v_{3}$, and $a^{t}v_{4}$, respectively, in their own first place. Notice that to ensure repair, it suffices to have only one duplicated block between any two storage nodes. Hence, node 2 can store three new blocks, $a^{t}v_{5}$, $a^{t}v_{6}$, and $a^{t}v_{7}$, in its remaining places. Following the same procedure, nodes 3, 4, and 5 then each copy one of these three blocks into their own space. This procedure is repeated until 10=(4+3+2+1) distinct blocks are stored in total. This construction guarantees exact repair of any failed node, since at least one block is duplicated between any two storage nodes and the duplicated blocks are distinct.

(99) The remaining issue is to design these ten subspace vectors $v_{i}$, i=1, . . . , 10. The detailed construction comes from the MDS-code property that any three nodes out of five need to recover the whole data file. Observe in FIG. 8 that nine distinct vectors can be downloaded from any three nodes. Hence, any (10,9) MDS code can be used to construct the $v_{i}$. In this example, using the parity-check code defined over GF(2), the $v_{i}$ can be designed as follows: $v_{i}=e_{i}$, i=1, . . . , 9, and $v_{10}=[1, . . . , 1]^{t}$. This idea can be extended to an arbitrary (n, k) case. The construction can be interpreted as an optimal interference avoidance technique. To see this, observe in the figure that the number of desired blocks for exact repair matches the number of available equations that can be downloaded. Hence, the involvement of any undesired blocks (interference) precludes exact repair.
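The (5,3) construction can be sketched in Python as follows. The sketch assumes, in line with the description above, that block $a^{t}v_{i}$ is shared by the i-th pair of nodes in the order (1,2), (1,3), (1,4), (1,5), (2,3), ..., (4,5), and that addition over GF(2) is applied bitwise as XOR on integer symbols. The function names are hypothetical.

```python
from functools import reduce
from itertools import combinations
from operator import xor

N, K = 5, 3
PAIRS = list(combinations(range(1, N + 1), 2))   # 10 node pairs, one coded block each


def encode(data):
    """Single-parity-check (10,9) code: v_i = e_i for i <= 9, v_10 = all ones."""
    assert len(data) == 9
    return list(data) + [reduce(xor, data)]


def place(coded):
    """Node i stores the four blocks whose node pair contains i."""
    return {i: {p: c for p, c in zip(PAIRS, coded) if i in p}
            for i in range(1, N + 1)}


def repair(nodes, failed):
    """Exact repair: each survivor sends the single block it shares with the failed node."""
    return {p: nodes[j][p]
            for j in nodes if j != failed
            for p in nodes[j] if failed in p}


def reconstruct(nodes, chosen):
    """Recover the 9 data symbols from any K = 3 nodes (9 distinct coded blocks)."""
    seen = {}
    for j in chosen:
        seen.update(nodes[j])
    coded = [seen.get(p) for p in PAIRS]
    if None in coded[:9]:
        # One data symbol is missing: it equals the XOR of the other nine blocks.
        missing = coded.index(None)
        coded[missing] = reduce(xor, (c for c in coded if c is not None))
    return coded[:9]


data = [3, 1, 4, 1, 5, 9, 2, 6, 5]
nodes = place(encode(data))
```

Each repair downloads d=4 blocks, one per survivor, which equals the α=4 blocks stored on the failed node, and no decoding is needed because the downloaded blocks are the lost blocks themselves.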

(100) Although this interference-avoidance technique provides solutions for MBR codes, the amount of data stored on each node is large, so it cannot be applied to MSR codes. A new idea is therefore needed for MSR codes: interference alignment. The idea of interference alignment is to align multiple interference signals in a signal subspace whose dimension is smaller than the number of interferers. FIG. 9 illustrates interference alignment for exact repair of failed node 1 for (n,k,d,α,γ)=(4,2,3,2,3), where the maximum file size of M=4 can be stored. Matrix notation is introduced for illustration purposes. Let $a=(a_{1},a_{2})^{t}$ and $b=(b_{1},b_{2})^{t}$ be 2-D information-unit vectors. Let $A_{i}$ and $B_{i}$ be 2-by-2 encoding matrices for parity node i (i=1,2), which contain the encoding coefficients for the linear combination of $(a_{1},a_{2})$ and $(b_{1},b_{2})$, respectively. For example, parity node 1 stores blocks in the form of $a^{t}A_{1}+b^{t}B_{1}$, as shown in FIG. 9. The encoding matrices for the systematic nodes are not explicitly defined since they are trivially inferred. Finally, 2-D projection vectors $v_{i}$ (i=1, 2, 3) are defined because β=1.

(101) Let us explain the interference-alignment scheme through the example shown by FIG. 9. First, two blocks in each storage node are projected into a scalar with projection vectors v.sub.i's. By connecting to three nodes, we get v.sub.1.sup.tb; (A.sub.1v.sub.2).sup.ta+(B.sub.1v.sub.2).sup.tb; (A.sub.2v.sub.3).sup.ta+(B.sub.2v.sub.3).sup.tb. Here the goal is to decode two desired unknowns out of three equations including four unknowns. To achieve this goal, we need

(102) $\operatorname{rank}\left(\begin{bmatrix} (A_{1} v_{2})^{t} \\ (A_{2} v_{3})^{t} \end{bmatrix}\right) = 2$ $\operatorname{rank}\left(\begin{bmatrix} v_{1}^{t} \\ (B_{1} v_{2})^{t} \\ (B_{2} v_{3})^{t} \end{bmatrix}\right) = 1.$

(103) The second condition can be met by setting v.sub.2=B.sub.1.sup.-1v.sub.1 and v.sub.3=B.sub.2.sup.-1v.sub.1, so that B.sub.1v.sub.2 and B.sub.2v.sub.3 both equal v.sub.1. This choice forces the interference space to collapse into a 1-D linear subspace, thereby achieving interference alignment. The first condition can also be satisfied by carefully choosing the A.sub.i's and B.sub.i's. The same idea applies to exact repair of node 2.
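The two rank conditions for the (4, 2) example can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the matrices A.sub.1, A.sub.2, B.sub.1, B.sub.2 and the vector v.sub.1 are arbitrary values chosen here (they are not taken from FIG. 9 and are not checked for the MDS property), and arithmetic is done over the rationals rather than a finite field.

```python
from fractions import Fraction

def mat_vec(M, x):
    """Multiply matrix M (list of rows) by column vector x."""
    return [sum(Fraction(m) * xi for m, xi in zip(row, x)) for row in M]

def inverse_2x2(M):
    """Exact inverse of a 2-by-2 matrix."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def rank(rows):
    """Rank of a list of row vectors via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(len(rows)):
            if r != rk and rows[r][col] != 0:
                factor = rows[r][col] / rows[rk][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rk])]
        rk += 1
    return rk

# Illustrative encoding matrices (assumed values, not from the figures).
A1, A2 = [[1, 0], [0, 1]], [[1, 1], [1, 2]]
B1, B2 = [[1, 1], [0, 1]], [[1, 0], [1, 1]]

v1 = [1, 1]                          # projection vector used by systematic node 2
v2 = mat_vec(inverse_2x2(B1), v1)    # v2 = B1^{-1} v1
v3 = mat_vec(inverse_2x2(B2), v1)    # v3 = B2^{-1} v1

desired_rank = rank([mat_vec(A1, v2), mat_vec(A2, v3)])            # should be 2
interference_rank = rank([v1, mat_vec(B1, v2), mat_vec(B2, v3)])   # should be 1
print(desired_rank, interference_rank)
```

With these choices the three interference vectors coincide with v.sub.1, so the b terms occupy one dimension and the two components of a can be solved from the remaining two dimensions.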

(104) In order to achieve the cutset bound of exact repair for arbitrary (k/n)(), we use simultaneous interference alignment. FIG. 10 illustrates the interference alignment technique through the example of (n,k,d,,)=(6,3,5,3,3) where M=9. Let a=(a.sub.1,a.sub.2,a.sub.3).sup.t, b=(b.sub.1,b.sub.2,b.sub.3).sup.t and c=(c.sub.1,c.sub.2,c.sub.3).sup.t be 3-D information-unit vectors. Let A.sub.i, B.sub.i and C.sub.i be 3-by-3 encoding matrices for parity node i (i=1, 2, 3). We define 3-D projection vectors v.sub.i (i=1, . . . , 5). When node 1 fails, we get five equations by connecting to five nodes. To recover the desired signal components of a, the matrix associated with a should have full rank 3, while the matrices corresponding to b and c should each have rank 1. Following the (4, 2) code example in FIG. 9, if one were to set v.sub.3=B.sub.1.sup.-1v.sub.1, v.sub.4=B.sub.2.sup.-1v.sub.1 and v.sub.5=B.sub.3.sup.-1v.sub.1, then it is possible to achieve interference alignment with respect to b. However, this choice also fixes the interference space of c. If the B.sub.i's and C.sub.i's are not designed judiciously, interference alignment is not guaranteed for c. Hence, it is not evident how to achieve interference alignment for b and c at the same time.

(105) In order to address the challenge of simultaneous interference alignment, a common eigenvector concept is invoked. The idea consists of two parts: 1) designing the (A.sub.i, B.sub.i, C.sub.i)'s such that v.sub.1 is a common eigenvector of the B.sub.i's and C.sub.i's, but not of the A.sub.i's; and 2) repairing by having survivor nodes project their data onto the linear subspace spanned by this common eigenvector v.sub.1. We can then achieve interference alignment for b and c at the same time by setting v.sub.i=v.sub.1 for all i. As long as [A.sub.1v.sub.1, A.sub.2v.sub.1, A.sub.3v.sub.1] is invertible, we can also guarantee the decodability of a.

(106) The challenge is now to design encoding matrices that guarantee the existence of a common eigenvector while also satisfying the decodability of the desired signals. The difficulty comes from the fact that in the (6,3,5) code example, these constraints need to be satisfied for all six possible failure configurations. The structure of elementary matrices (generalizations of Householder and Gauss matrices) gives insight into this. To see this, consider a 3-by-3 elementary matrix A: A=uv.sup.t+I, where u and v are 3-D vectors. Note that the null space of v.sup.t has dimension 2 and any null vector v.sup.⊥ is an eigenvector of A, i.e., Av.sup.⊥=v.sup.⊥. This motivates the following structure:
A.sub.1=u.sub.1v.sub.1.sup.t+α.sub.1I, A.sub.2=u.sub.2v.sub.1.sup.t+α.sub.2I, A.sub.3=u.sub.3v.sub.1.sup.t+α.sub.3I
B.sub.1=u.sub.1v.sub.2.sup.t+β.sub.1I, B.sub.2=u.sub.2v.sub.2.sup.t+β.sub.2I, B.sub.3=u.sub.3v.sub.2.sup.t+β.sub.3I
C.sub.1=u.sub.1v.sub.3.sup.t+γ.sub.1I, C.sub.2=u.sub.2v.sub.3.sup.t+γ.sub.2I, C.sub.3=u.sub.3v.sub.3.sup.t+γ.sub.3I

(107) where the v.sub.i's are 3-D linearly independent vectors and so are the u.sub.i's. The values of the α.sub.i's, β.sub.i's and γ.sub.i's can be arbitrary nonzero values. For simplicity, we consider the case where the v.sub.i's are orthonormal, although they need only be linearly independent. We then see that, for i=1, 2, 3,
A.sub.iv.sub.1=α.sub.iv.sub.1+u.sub.i, B.sub.iv.sub.1=β.sub.iv.sub.1, C.sub.iv.sub.1=γ.sub.iv.sub.1.

(108) Importantly, v.sub.1 is a common eigenvector of the B.sub.i's and C.sub.i's, while the vectors A.sub.iv.sub.1 are linearly independent. Hence, by setting v.sub.i=v.sub.1 for all i, it is possible to achieve simultaneous interference alignment while also guaranteeing the decodability of the desired signals. This structure also guarantees exact repair for b and c. We use v.sub.2 for exact repair of b: it is a common eigenvector of the C.sub.i's and A.sub.i's, while [B.sub.1v.sub.2, B.sub.2v.sub.2, B.sub.3v.sub.2] is invertible. Similarly, v.sub.3 is used for exact repair of c.
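The common-eigenvector property of this elementary-matrix structure can be verified on a small instance. The following sketch makes several assumptions that are not in the description: the v.sub.i's are taken as the standard basis (an orthonormal choice), the u.sub.i's and the nonzero scalar coefficients of the identity terms (stored in `COEF`) are arbitrary illustrative values, and arithmetic is over the rationals. It checks, for each failed systematic node, that the two interfering matrix families are aligned along the chosen vector and that the desired family yields three independent vectors.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a list of row vectors via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(len(rows)):
            if r != rk and rows[r][col] != 0:
                factor = rows[r][col] / rows[rk][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rk])]
        rk += 1
    return rk

def elementary(u, v, scale):
    """Return the matrix u v^t + scale * I as a list of rows."""
    n = len(u)
    return [[u[r] * v[c] + (scale if r == c else 0) for c in range(n)]
            for r in range(n)]

def mat_vec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

V = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]     # orthonormal v_1, v_2, v_3 (assumed)
U = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]     # linearly independent u_1, u_2, u_3 (assumed)
COEF = [[1, 2, 3], [2, 3, 5], [1, 4, 2]]  # nonzero scalars for the A, B, C families

# ENC[j][i] is the matrix of family j (0: A, 1: B, 2: C) for parity node i.
ENC = [[elementary(U[i], V[j], COEF[j][i]) for i in range(3)] for j in range(3)]

def check_repair(j):
    """Check alignment and decodability when systematic node j is repaired."""
    vj = V[j]
    aligned = all(
        mat_vec(ENC[m][i], vj) == [COEF[m][i] * x for x in vj]
        for m in range(3) if m != j
        for i in range(3)
    )
    desired = [mat_vec(ENC[j][i], vj) for i in range(3)]
    return aligned, rank(desired)

results = [check_repair(j) for j in range(3)]
print(results)
```

For other choices of the u.sub.i's and scalars the decodability rank should be rechecked, since the sketch only tests the specific values listed.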

(109) Parity nodes can be repaired by drawing a dual relationship with systematic nodes. The procedure has two steps. The first is to remap the parity nodes to a′, b′ and c′, respectively. Systematic nodes can then be rewritten in terms of the primed notation:
a.sup.t=a′.sup.tA′.sub.1+b′.sup.tB′.sub.1+c′.sup.tC′.sub.1, b.sup.t=a′.sup.tA′.sub.2+b′.sup.tB′.sub.2+c′.sup.tC′.sub.2, c.sup.t=a′.sup.tA′.sub.3+b′.sup.tB′.sub.3+c′.sup.tC′.sub.3

(110) where the newly mapped encoding matrices (A′.sub.i, B′.sub.i, C′.sub.i)'s are defined as

(111) $\begin{bmatrix} A_{1}' & A_{2}' & A_{3}' \\ B_{1}' & B_{2}' & B_{3}' \\ C_{1}' & C_{2}' & C_{3}' \end{bmatrix} := \begin{bmatrix} A_{1} & A_{2} & A_{3} \\ B_{1} & B_{2} & B_{3} \\ C_{1} & C_{2} & C_{3} \end{bmatrix}^{-1}.$

(112) With this remapping, one can dualize the relationship between systematic and parity node repair. Specifically, if all of the A′.sub.i's, B′.sub.i's and C′.sub.i's are elementary matrices and form a similar code structure, exact repair of the parity nodes follows in the same way as for the systematic nodes. It was shown that a special relationship between [u.sub.1,u.sub.2,u.sub.3] and [v.sub.1,v.sub.2,v.sub.3], through the correct choice of the scalar coefficients (α.sub.i,β.sub.i,γ.sub.i) of the identity terms, can also guarantee the dual structure.
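The remapping can be spelled out in one step. As a brief derivation sketch, write a′, b′, c′ for the contents of the three parity nodes and G for the block matrix of original encoding matrices (the symbol G is introduced here only for illustration). G is invertible because the three parity nodes alone must be able to reconstruct the file under the MDS property.

```latex
\begin{aligned}
\begin{bmatrix} a'^{t} & b'^{t} & c'^{t} \end{bmatrix}
  &= \begin{bmatrix} a^{t} & b^{t} & c^{t} \end{bmatrix} G,
  \qquad
  G := \begin{bmatrix} A_{1} & A_{2} & A_{3} \\ B_{1} & B_{2} & B_{3} \\ C_{1} & C_{2} & C_{3} \end{bmatrix}, \\
\begin{bmatrix} a^{t} & b^{t} & c^{t} \end{bmatrix}
  &= \begin{bmatrix} a'^{t} & b'^{t} & c'^{t} \end{bmatrix} G^{-1}.
\end{aligned}
```

Reading the second line block column by block column gives the expressions for a.sup.t, b.sup.t and c.sup.t in terms of the blocks of the inverse matrix, so the systematic nodes play the role of parity nodes with respect to a′, b′ and c′.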

(113) Hybrid repair is defined as follows: if the failed nodes in the distributed network storage system are systematic nodes, the data stored in the blocks of the new nodes, which are constructed via linear network coding and interference alignment, is exactly regenerated; if the failed nodes are non-systematic nodes, the data stored in the blocks of the new nodes, which are constructed via linear network coding, is not identical to the lost blocks of the failed nodes, but the MDS property of the distributed network storage system is maintained after repair.

(114) Specifically, hybrid repair includes:

(115) downloading coded bits from any d live nodes, and repairing the data stored in the failed nodes via linear network coding;

(116) reducing the dimension of the interference at the newly generated nodes through linear alignment;

(117) wherein d=k+1, k is the smallest number of nodes needed to reconstruct the original file, and n is the total number of nodes in the distributed network storage system.

(118) In practice, hybrid repair lies between functional repair and exact repair: the systematic part requires exact repair, that is, the systematic nodes adopt exact repair while the non-systematic nodes need only functional repair. Derived from [Y. Wu. (2009, August). A construction of systematic MDS codes with minimum repair bandwidth. IEEE Trans. Inf. Theory], a construction of systematic (n, k) MDS codes for n≥2k achieves the minimum repair bandwidth when repairing from k+1 nodes. FIG. 11 illustrates the construction scheme of this repair. In FIG. 11, x∈F.sup.2k is a vector consisting of the 2k original information symbols. Each node stores two symbols x.sup.Tu.sub.i and x.sup.Tv.sub.i. The vectors {u.sub.i} do not change over time, but the {v.sub.i} change as the code is repaired. We maintain the invariant that the 2n length-2k vectors {v.sub.i,u.sub.i} form a (2n, 2k)-MDS code; that is, any 2k vectors in the set {v.sub.i,u.sub.i} have full rank 2k. This implies that the n nodes form an (n, k)-MDS code. We initialize the code using any (2n, 2k) systematic MDS code over F.
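The invariant described above (any 2k of the 2n vectors have full rank 2k) can be tested by brute force on a small instance. The sketch below is an illustration under assumptions not made in the description: the field is GF(11), the parameters are k=2 and n=4, and the vectors are rows of a Vandermonde matrix, which is MDS but not systematic and is used here only to exercise the check. It requires Python 3.8 or later for modular inverses via `pow`.

```python
from itertools import combinations

def rank_mod_p(rows, p):
    """Rank of a list of row vectors over the prime field GF(p)."""
    rows = [[x % p for x in row] for row in rows]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        inv = pow(rows[rk][col], -1, p)          # modular inverse of the pivot
        rows[rk] = [(x * inv) % p for x in rows[rk]]
        for r in range(len(rows)):
            if r != rk and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(x - f * y) % p for x, y in zip(rows[r], rows[rk])]
        rk += 1
    return rk

def is_mds(vectors, dim, p):
    """True if every choice of `dim` vectors has full rank `dim` over GF(p)."""
    return all(rank_mod_p(list(s), p) == dim for s in combinations(vectors, dim))

P, K, N = 11, 2, 4
# 2n = 8 Vandermonde vectors of length 2k = 4, one per distinct evaluation point.
vectors = [[pow(x, e, P) for e in range(2 * K)] for x in range(1, 2 * N + 1)]

invariant_holds = is_mds(vectors, 2 * K, P)

# Node i stores the pair (u_i, v_i); any k nodes then give 2k independent vectors.
nodes = [(vectors[2 * i], vectors[2 * i + 1]) for i in range(N)]
node_level_ok = all(
    rank_mod_p([vec for node in chosen for vec in node], P) == 2 * K
    for chosen in combinations(nodes, K)
)

# A repair that reused an existing vector would break the invariant.
broken = vectors[:-1] + [vectors[0]]
print(invariant_holds, node_level_ok, is_mds(broken, 2 * K, P))
```

A repair procedure could call a check like `is_mds` on the candidate replacement vector before accepting it, although an exhaustive check is practical only for small n and k.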

(119) Now consider a repair. Without loss of generality, suppose node n has failed and is repaired by accessing nodes 1, . . . , k+1. As illustrated in FIG. 11, the replacement node downloads α.sub.ix.sup.Tu.sub.i+β.sub.ix.sup.Tv.sub.i from each node i in {1, . . . , k+1}. Using these k+1 downloaded symbols, the replacement node computes two symbols x.sup.Tu.sub.n and x.sup.Tv′.sub.n as follows:

(120) $\sum_{i = 1}^{k + 1} (\alpha_{i} x^{T} u_{i} + \beta_{i} x^{T} v_{i}) = x^{T} u_{n}$ $\sum_{i = 1}^{k + 1} \rho_{i} (\alpha_{i} x^{T} u_{i} + \beta_{i} x^{T} v_{i}) = x^{T} v_{n}'$

(121) Note that v′.sub.n is allowed to differ from v.sub.n; the property that we maintain is that the repaired code continues to be a (2n, 2k)-MDS code. Here {α.sub.i,β.sub.i,ρ.sub.i} and v′.sub.n are the variables that we can control, and we choose them so that the repaired code continues to be a (2n, 2k)-MDS code.

(122) Invention Case 3

(123) FIG. 12 illustrates the structure of the data storage device provided by invention case 3. For ease of explanation, only the portions related to this embodiment of the invention are shown. The device can carry out the method of invention case 1 described above. The nodes and index server of the device in the network storage system, together with clients, can constitute the system shown in FIG. 1. In this invention case, the device includes: a data block unit 121, a first block allocation unit 122, an encoding unit 123, and a second block allocation unit 124.

(124) The data block unit 121 is used to split a file of size M into k blocks, each block being of size M/k;

(125) The first block allocation unit 122 is used to distribute the k blocks to k different nodes in the distributed network storage system;

(126) The encoding unit 123 is used to construct n-k independent blocks from the k blocks via linear coding, such that any k of the n encoded blocks can be used to reconstruct the original data, which means the linear coding method is a Maximum-Distance Separable (MDS) code;

(127) The second block allocation unit 124 is used to distribute the n-k encoded blocks to the remaining n-k different storage nodes in the distributed network storage system.

(128) Here n and k are both positive integers satisfying n>k; n is the total number of nodes in the distributed network storage system, and k is the smallest number of nodes needed to reconstruct the original file.
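The roles of units 121 to 124 can be illustrated with a very small code. The sketch below is not the disclosed encoder: it assumes the special case n=k+1, where a single XOR parity block already gives an MDS code, pads the file with zero bytes to a multiple of k, and keeps the original length separately. All function names are hypothetical.

```python
def split_file(data, k):
    """Split data into k equal blocks, zero-padding the tail if needed."""
    block_len = -(-len(data) // k)              # ceiling division
    padded = data.ljust(block_len * k, b"\0")
    return [padded[i * block_len:(i + 1) * block_len] for i in range(k)]

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def encode(data, k):
    """Return k systematic blocks followed by one XOR parity block (n = k + 1)."""
    systematic = split_file(data, k)
    return systematic + [xor_blocks(systematic)]

def reconstruct(available, k, length):
    """Rebuild the file from any k of the k + 1 blocks.

    `available` maps block index (0..k, where k is the parity) to its bytes.
    """
    if len(available) < k:
        raise ValueError("need at least k blocks")
    blocks = dict(available)
    missing = [i for i in range(k) if i not in blocks]
    if missing:
        # XOR of the other k blocks (including parity) restores the lost one.
        blocks[missing[0]] = xor_blocks(list(available.values()))
    return b"".join(blocks[i] for i in range(k))[:length]

data = b"distributed storage"
blocks = encode(data, 3)                        # n = 4 blocks for n = 4 nodes
recovered = [
    reconstruct({i: b for i, b in enumerate(blocks) if i != lost}, 3, len(data))
    for lost in range(4)
]
print(all(r == data for r in recovered))
```

For n-k greater than one, a code such as Reed-Solomon over a larger field would be needed in place of the XOR parity; that case is outside this sketch.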

(129) The data storage device provided by this invention case can be used in the corresponding method embodiment described above (invention case 1). Refer to the method shown in FIG. 2 for specific details; the device is not described further here.

(130) Invention Case 4

(131) FIG. 13 illustrates the structure of the data storage device provided by invention case 4. For ease of explanation, only the portions related to this embodiment of the invention are shown. The device can carry out the method of invention case 2 described above. The nodes and index server of the device in the network storage system, together with clients, can constitute the system shown in FIG. 1. In this invention case, the device includes: a data block unit 131, a first block allocation unit 132, an encoding unit 133, a second block allocation unit 134 and a node recovery unit 135.

(132) The data block unit 131 is used to split a file of size M into k blocks, each block being of size M/k;

(133) The first block allocation unit 132 is used to distribute the k blocks to k different nodes in the distributed network storage system;

(134) The encoding unit 133 is used to construct n-k independent blocks from the k blocks via linear coding, such that any k of the n encoded blocks can be used to reconstruct the original data, which means the linear coding method is a Maximum-Distance Separable (MDS) code;

(135) The second block allocation unit 134 is used to distribute the n-k encoded blocks to the remaining n-k different storage nodes in the distributed network storage system.

(136) Here n and k are both positive integers satisfying n>k; n is the total number of nodes in the distributed network storage system, and k is the smallest number of nodes needed to reconstruct the original file.

(137) The node recovery unit 135 is used, when nodes have failed and the number of failed nodes is no larger than n-k, to recover the data stored in the failed nodes via at least k live nodes.

(138) The node recovery unit 135 includes at least one of three blocks: a functional repair block, an exact repair block, and a hybrid repair block:

(139) The functional repair block is used where the data contained in the blocks of the new nodes, which are constructed via linear network coding, is not exactly the same as the data stored in the failed nodes, while the MDS property of the distributed network storage system is maintained after repair;

(140) The exact repair block is used where the data stored in the blocks of the new nodes, which are constructed via linear network coding and interference alignment, is exactly regenerated, that is, the lost blocks are restored as exact replicas;

(141) The hybrid repair block is used in two situations: if the failed nodes in the distributed network storage system are systematic nodes, the data stored in the blocks of the new nodes, which are constructed via linear network coding and interference alignment, is exactly regenerated; if the failed nodes are non-systematic nodes, the data stored in the blocks of the new nodes, which are constructed via linear network coding, is not identical to the lost blocks of the failed nodes, but the MDS property of the distributed network storage system is maintained after repair.

(142) The functional repair block includes:

(143) The first coding sub-block, which is used to download coded bits from any d live nodes and to repair the data stored in the failed nodes via linear network coding, wherein d≤n-1 and n is the total number of nodes in the distributed network storage system;

(144) The exact repair block includes:

(145) The second coding sub-block, which is used to download coded bits from any d live nodes and to repair the data stored in the failed nodes via linear network coding, wherein, for MBR codes, d=n-1; for MSR codes, d∈[2k-1, n-1] and k/n≤1/2; and n is the total number of nodes in the distributed network storage system;

(146) The first interference alignment sub-block, which is used to reduce the dimension of the interference at the newly generated nodes through linear alignment;

(147) The hybrid repair block includes:

(148) The third coding sub-block, which is used to download coded bits from any d live nodes and to repair the data stored in the failed nodes via linear network coding, wherein d=k+1 and k is the smallest number of nodes needed to reconstruct the original file;

(149) The second interference alignment sub-block, which is used to reduce the dimension of the interference at the newly generated nodes through linear alignment.

(150) The data storage device provided by this invention case can be used in the corresponding method embodiment described above (invention case 2). Refer to the method shown in FIG. 3 for specific details; the device is not described further here.

(151) In the present invention case, a file is split into k blocks that are stored on k nodes. From these k blocks, n-k independent blocks are constructed via a linear Maximum-Distance Separable (MDS) code, such that any k of the n encoded blocks can be used to reconstruct the original data in the file. The n-k coded blocks are then distributed to n-k different storage nodes in the distributed network storage system. This enables the distributed network storage system to tolerate up to n-k simultaneous node failures without losing data, keeps the redundancy of the system at a constant level, and ensures the reliability of the distributed network storage system. In addition, when nodes fail, three kinds of repair are taken into consideration, which reduces the repair load as well as the amount of data stored on each node. The functional repair problem is in essence a problem of multicasting from a source to an unbounded number of receivers over an unbounded graph. As we showed, there is a tradeoff between storage and repair bandwidth, and the two extremal points are achieved by MBR and MSR codes. The repair bandwidth is characterized by the min-cut bounds. Problems that require exact repair correspond to network coding problems having sinks with overlapping subset demands. For MBR codes, the repair bandwidth given by the cutset bound is achievable for the interesting case of d=n-1. For MSR codes, the cutset bound can be matched when d∈[2k-1, n-1], d≤n-1. Hybrid repair is a construction of an (n, k) MDS code for the case n≥2k, and the minimum repair bandwidth can be achieved when communicating with d=k+1 nodes.

(152) The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the invention.

CPC classification: H03M13/1515 (ELECTRICITY); H03M13/033 (ELECTRICITY); H04L67/1095 (ELECTRICITY); H04L67/1097 (ELECTRICITY); G06F11/1092 (PHYSICS)

International classification: H04L29/08 (ELECTRICITY); H03M13/15 (ELECTRICITY); G06F11/10 (PHYSICS); H03M13/03 (ELECTRICITY)
