RAID-based globally resource-shared data storage system
10997025 · 2021-05-04
Assignee
Inventors
Cpc classification
G06F11/1096
PHYSICS
G06F11/1076
PHYSICS
G06F2211/104
PHYSICS
International classification
Abstract
The data storage system is a RAID-based data storage system in which resources are globally shared. This storage system includes the first number of disks, and the RAID mechanism is used to store data on each disk. The blocks on different disks form stripes, and at least one of the blocks on the stripe stores the parity information, wherein the width of the stripe is less than the first number. The data layout of the data storage system satisfies the following characteristics: any two physical blocks in the stripe are distributed on different disks; the data blocks distributed on each disk are the same, and the distributed parity blocks are also the same; other data in the stripe associated with any piece of disk data is evenly distributed across all the remaining disks. Normal data layout and degraded data layout can be implemented by orthogonal Latin squares. This system can remove the limitation that the number of disks in the normal data storage system is equal to the stripe width, and break the resource isolation between the disk groups. And in the event of a disk failure, this invention can achieve a complete equalization of the reconstructed read load.
Claims
1. A Redundant Array of Independent Disks (RAID)-based data storage system in which resources are globally shared, including a first number of disks, wherein a RAID mechanism is used to store data on each disk, blocks on different disks form stripes, and at least one of the blocks in a stripe stores parity information, wherein a width of a stripe is less than the first number, and a data layout of the data storage system satisfies the following characteristics: any two physical blocks in a stripe are distributed on different disks; numbers of data blocks distributed on each disk are the same, and numbers of parity blocks distributed on each disk are the same; and blocks sharing stripes with blocks on any given disk are distributed evenly among all other disks; wherein the first number is a power of a prime number, and the first number is represented by a number n, which is greater than or equal to 4, stripe width is indicated by a number k, and for the number n, (n−1) orthogonal Latin squares being capable of being obtained, and the data layout of the data storage system is generated as follows: k mutual orthogonal Latin squares in (n−1) mutual orthogonal Latin squares are obtained with rows having same element values in the k mutual orthogonal Latin squares ignored, and then all remaining positions in the k mutual orthogonal Latin squares are traversed in a row-first order, combining element values in same row and column into a mapping group, and each mapping group corresponds to one stripe, and a value of each element in each mapping group indicates an ordinal number of a disk on which each block in a corresponding stripe is placed.
2. The data storage system of claim 1, wherein the k mutual orthogonal Latin squares are generated according to the following theorem: a first row among all the rows of each generated Latin square is ignored and a first mutual orthogonal Latin square in k generated squares is represented by L.sub.0, assuming the element on the i.sup.th row and j.sup.th column of the m.sup.th orthogonal Latin square to be L.sub.m−1[ij], the mapping group (L.sub.0[ij], . . . , L.sub.m−1[ij], . . . , L.sub.k−1[ij]) indicates ordinal numbers of disks on which respective blocks on the ((i−1)*n+j).sup.th stripe are placed, wherein a first block is placed on the L.sub.0[ij].sup.th disk, an m.sup.th block is placed on the L.sub.m−1 [ij].sup.th disk, and a k.sup.th block is placed on the L.sub.k−1[ij].sup.th disk, wherein data for each disk are placed in blocks, theorem: for a complete set of mutually orthogonal Latin squares with its order being a power of a prime number, the i.sup.th Latin square f.sub.i (i∈[1, n−1]) has an element value fi[x, y]=i.Math.x+y in the x.sup.th row and y.sup.th column, where the operator “.Math.” and ‘+’ are the multiplication and addition in a finite field.
3. The data storage system of claim 1, wherein when one disk fails, for each failed stripe associated with the failed disk, data from other disks associated with the failed stripe for calculating reconstructed data are concurrently read, and the reconstructed data are stored in free space reserved on all other disks, and an ordinal number of the disk on which the reconstructed data is written is determined as follows: selecting a Latin square from (n−1) mutual orthogonal Latin squares other than the k mutual orthogonal Latin squares, and referring the selected Latin square as the (k+1).sup.th Latin square, for each failed stripe associated with the failed disk, identifying a position on the Latin squares corresponding to the failed stripe, and obtaining a first element value at the position on the (k+1).sup.th Latin square, the first element value indicating an ordinal number of the disk on which the reconstructed data is placed, and storing the reconstructed data in free space of the disk indicated by the ordinal number.
4. The data storage system of claim 3, wherein when another disk fails, a Latin square of (n−1) mutual orthogonal Latin squares other than the k mutual orthogonal Latin squares and the (k+1).sup.th Latin square is selected and referred to as the (k+2).sup.th Latin square, for each failed stripe associated with the failed disk, a position on the Latin squares corresponding to the each failed stripe is identified, and a second element value at the position on the (k+2).sup.th Latin square is obtained, the second element value indicating the ordinal number of the disk on which the reconstructed data block is placed, and the reconstructed data block is stored in free space of the disk indicated by the number.
5. The data storage system of claim 1, wherein when p disks fail simultaneously, a stripe associated with any one of the p disks is determined; for any stripes associated with any of the p failed disks, a number of data blocks in stripes that locate on the p failed disks is determined; a higher recovery priority for a stripe having a larger number of data blocks located in the p failed disks is assigned; and the stripe with higher priority is recovered with priority.
6. The data storage system of claim 1, wherein the data storage system stores data with different storage templates; for first data to be stored according to a first template; a first corresponding space in the first number of disks is allocated to the first data; and for the first corresponding space in the first number of disks, mapping a relationship between data stripes in the first template; for second data to be stored according to a second template; a second corresponding space in the first number of disks is allocated to the second data; and for the second corresponding space in the first number of disks, mapping a relationship between the data stripes in the second template.
7. The data storage system of claim 6, wherein the first template and the second template at least differ in one aspect of RAID levels, stripe width, physical block size, and inter-stripe addressing policy.
8. The data storage system of claim 6, wherein the first corresponding space is denoted as a logical volume, and each logical volume uses a same type of data template as granularity of storage space allocation, wherein an indexing technique is used to track a physical location of each data template in a logical volume, metadata is maintained to realize a map between user volumes and physical space, and the metadata is cached in memory.
9. The data storage system of claim 8, wherein when a user request arrives, a specific physical access location is located by querying index tables, locating data templates, locating stripes, locating internal block locations of stripes, and calculating global locations.
10. The data storage system of claim 1, wherein when data to store is desired to be stored in a read-friendly ordering; determining a mapping relationship between stripes and disks wherein a parity block in one stripe is a last data block of the one stripe; and individual stripes are sorted so that an ordinal number of a disk on which the last block in a stripe is located is an ordinal number of a disk on which the first data block in the next stripe is located.
11. The data storage system of claim 1, wherein when data to store is desired to be stored in a write-friendly ordering; a parity block in one stripe is a last block of the one stripe; individual stripes are sorted so that an ordinal number of a disk on which the last block in a stripe is located is an amount less than an ordinal number of a disk on which the first block in the next stripe is located; and the amount is a row number of a position of a mapping group corresponding to the stripe in the k mutual orthogonal Latin squares.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
DETAILED DESCRIPTION
(5) To clearly display the technical problems, our technical solution and the advantages of our method, the following description will be combined with the attached drawings and practical embodiment.
(6) The first thing to note is that the word “same” in the sentence “numbers of data blocks distributed on each disk are the same, and numbers of parity blocks distributed on each disk are also the same” should be comprehended as exactly or nearly exactly the same. That is to say, when the amount of data is large enough, this data distribution can be guaranteed by mutual orthogonal Latin squares in the invention. However, it is not excluded that the data blocks are basically but not exactly the same when the amount of data is relatively small in practical application. Similarly, “blocks sharing stripes with blocks on any given disk are distributed evenly among all the other disks” should also be applied to large data amount. Obviously, under some extreme circumstances, like the total amount of the stripes is only one or two, it is impossible to satisfy that the remaining data of the stripes that are associated to data on disk A uniformly distribute on disk B, disk C, disk D, etc.
(7) The phrase Stripe Width or Stripe Size means the amount of disks that a stripe spans.
(8) The embodiments of the invention utilize the mathematical features of orthogonal Latin squares in Combinatorial mathematics to construct a storage system (sometimes referred as RAID+ in the following parts). The perfect features of orthogonal Latin squares support the uniformly distributed data expected in the invention.
(9) The definition and theorem of mutually orthogonal Latin square are briefly presented below. Detailed information can be found in related materials like reference [1]: Bose R C, Shrikhande S S. On the construction of sets of mutually orthogonal Latin squares and the falsity of a conjecture of euler. Transactions of the American Mathematical Society, 1960, 95(2):191-209.
(10) Definition 1: A Latin square of order n is a n×n array filled with n different items, each occurring exactly once in each row and column.
(11) Definition 2: Let L1 and L2 be two n-order Latin squares. L1 and L2 are mutually orthogonal if, when superimposed, each of the n.sup.2 ordered pairs occur exactly once across the overlapping cells of the two squares.
(12) Definition 3: Given a set of Latin squares, if all its member pairs are mutually orthogonal, we call it a set of mutually orthogonal Latin squares (MOLS).
(13) Theorem 1: With any given order n, there can be at most (n−1) MOLS, with this upper bound achieved when n is a power of a prime number.
(14) Theorem 2: For a complete set of MOLS with its order a power of a prime number, the i.sup.th Latin square f.sub.i(i∈[1, n−1]) has the element value fi[x; y]=i.Math.x+y in x.sup.th row and y.sup.th column. Detailed information is available in reference 1 mentioned above. Here the operator “.Math.” and “+” are the multiplication and summation in finite field.
(15) Theorem 3: If two Latin squares of order n are mutually orthogonal, for the n positions that any number d(d∈[0; n−1]) appears in one of the two Latin squares, numbers on corresponding positions in the other Latin square are exactly n different numbers.
(16) As is mentioned above, based on the theory discovery, when the order of matrix is a power of a prime number, at most n−1 different mutually orthogonal Latin squares of this order can be constructed. For example, as shown in
(17) Based on the mathematical features of orthogonal Latin squares that this invention utilizes to implement the storage array technology, the normal data layouts generated satisfy three features: First, any two physical blocks in one stripe distribute on different disks, making the layout fault-tolerant. Second, the number of data blocks and parity blocks are identical in every disk, which makes the loads in all disks totally balanced. Third, the remaining parts of the data associated to any disk distribute equally among the remaining disks, to make the read load totally balanced when reconstructing data.
(18) In an embodiment of the invention, wherein the first number is a power of a prime number, and the first number is represented by a number n, which is greater than or equal to 4, stripe width is indicated by a number k, and for the number n, (n−1) orthogonal Latin squares being capable of being obtained, data layout of the data storage system is generated as follows: k mutual orthogonal Latin squares in (n−1) mutual orthogonal Latin squares are obtained with rows having same element values in the k mutual orthogonal Latin squares ignored, and then all remaining positions in the k mutual orthogonal Latin squares are traversed in a row-first order, combining element values in same row and column into a mapping group, and each mapping group corresponds to one stripe, and value of each element in the mapping group indicates ordinal number of a disk on which each block in a corresponding stripe is placed.
(19) What needs illustration is that although it sounds rigorous to restrict the disk number to be a prime number, a large quantity of numbers can be chosen in practical application, and the intervals of them are not very large, so most need of users can be satisfied. For instance, if the disk number is limited between 4 and 128, n has 42 legal options: 4, 5, 7, 8, 9, 11, 13, 16, 17, 19, 23, 25, 27, 29, 31, 32, 37, 41, 43, 47, 49, 53, 59, 61, 64, 67, 71, 73, 79, 81, 83, 89, 97, 101, 103, 107, 109, 113, 121, 125, 127 and 128. These legal numbers are displayed in
(20) Specifically, in an embodiment of the invention, wherein the k mutual orthogonal Latin squares are generated according to the aforementioned theorem 2: the first row of each Latin square is ignored, and the first mutual orthogonal Latin square is represented by L.sub.0, assuming the element on the i.sup.th row and j.sup.th column of the m.sup.th orthogonal Latin square to be L.sub.m−1[ij], the mapping group (L.sub.0[ij], L.sub.m−1[ij], . . . , L.sub.k−1[ij]) indicates ordinal numbers of disks on which respective blocks on the ((i−1)*n+j).sup.th stripe are placed, wherein the first block is placed on the L.sub.0[ij].sup.th disk, the m.sup.th block is placed on the L.sub.m−1[ij].sup.th disk, and the k.sup.th block is placed on the L.sub.k−1[ij].sup.th disk, wherein data for each of these disks are placed in blocks
(21)
(22) In
(23) According to the mathematical characteristics of orthogonal Latin square that this invention resorts to, the generated normal data layout satisfies three features: First, any two physical blocks in one stripe distribute on different disks, making the layout fault-tolerant. Second, the number of data blocks and parity blocks are identical in every disk, which makes the load in all disks totally balanced. Third, the remaining parts of the data associated to any disk distribute equally among the remaining disks, to make the read load totally balanced when reconstructing data.
(24) The normal data layout of storage arrays according to the embodiments of the present invention is introduced above, and centralized hot standby disks are replaced by distributed free blocks. The following part describes the solution to disk failures in storage array based on the embodiment of this invention.
(25) According to the embodiments of the present invention, when a disk fails in the storage array, data reconstructed will be uniformly written into the surviving disks.
(26) In an embodiment of the invention, wherein when one disk fails, for each failed stripe associated with the failed disk, data from other disks associated with the failed stripe for calculating reconstructed data are concurrently read, and the reconstructed data are stored in free space reserved on all other disks,
(27) In specific, an ordinal number of the disk on which the reconstructed data is written can be determined as follows:
(28) selecting a Latin square from (n−1) mutual orthogonal Latin squares other than the k mutual orthogonal Latin squares, and referring it as the (k+1).sup.th Latin square,
(29) for each failed stripe associated with the failed disk, identifying position on the Latin squares corresponding to the failed stripe, and obtaining element value at this position on the (k+1).sup.th Latin square, this element value indicating the ordinal number of the disk on which the reconstructed data block is placed,
(30) In this way, the reconstructed data block can be stored in free space of the disk indicated by the number.
(31) In an embodiment of the invention, wherein when there is another disk failure, a Latin square of (n−1) mutual orthogonal Latin squares other than aforementioned (k+1) Latin squares is selected and referred as the (k+2).sup.th Latin square, for each failed stripe associated with the failed disk, position on the Latin squares corresponding to the failed stripe is identified, and element value at the position on the (k+2).sup.th Latin square is obtained, the element value indicating the ordinal number of the disk on which the reconstructed data block is placed, in this way the reconstructed data block can be stored in free space of the disk indicated by the number.
(32) In another embodiment of the invention, wherein when p disks fail simultaneously, the following processes are performed: a stripe associated with any one of the p disks is determined, for any of the stripes associated with any of the p failed disks, the number of data blocks in stripe that locate on those p failed disks is determined; a higher recovery priority for a stripe having a larger number of data blocks located in the p failed disks is assigned; the stripe with higher priority is recovered with priority.
(33) Then an example is used to illustrate how to construct degraded data layout when a disk fails.
(34) The degraded data layout in the invention utilizes the reserved free space in every disk to store reconstructed data, and orthogonal Latin squares are used to reconstruct data distribution. When the order of matrix n is a power of a prime number, we can get n−1 different orthogonal Latin squares. The normal data layout utilizes k of them, with the remaining n−k−l squares intact. When constructing degraded data distribution, an intact square is first selected from these n−k−l squares, then the numbers in the same position of wrong stripes in this new square are the numbers of disks for the redistribution of loss data. Finally, data lost by wrong stripes will be reconstructed in the free areas of this disk.
(35) In the above introduction of data layout, the number of disks in the disk array in
(36) Without loss of generality, we assume disk D.sub.0 fails.
(37) As shown in
(38) To recover these 12 wrong stripes, the data storage systems based on the invention not only need to know the physical location of the other intact data on these stripes, but also have to find new spaces on the disks to store the data reconstructed.
(39) The middle column of
(40)
(41) If failures happen again after the reconstruction of a single disk, we only need to repeat the construction of degraded data layout with new orthogonal Latin squares to realize continuous quick recovery. Specifically, for the t.sup.th disk failure, an orthogonal Latin square should be picked from the n−k−t squares to guide the construction of degraded data layout, and we can get the degraded data layout of n−t disks. When disk failure occurs n−k−l times, the number of disks available decrease to k+1. If another failure occurs in this time, the disk selection of reconstruction only has one choice as the stripe width k is equal to the disk number. Thus we can get the degraded data layout under k disks. In this process, as the degraded data maintains fault tolerance, the storage array can achieve continuous self-recovery without the need to manually replace the bad disks.
(42) In an embodiment of the invention, the data storage system stores data with different storage templates, for first data to be stored in a first template manner, a first corresponding space in the first number of disks is allocated to the first data; for the first corresponding space in the first number of disk, mapping relationship between data stripes in the first template and the first corresponding space is established according to the Latin square based method mentioned before; for second data to be stored in a second template manner, a second corresponding space in the first number of disks is allocated to the second data; for the second corresponding space in the first number of disk, mapping relationship between the data stripes in the second template and the second corresponding space is established according to the Latin square based method mentioned before.
(43) In an embodiment of the invention, wherein the different templates at least differ in one aspect of RAID levels, stripe width, physical block size, and inter-stripe addressing policy.
(44) A main advantage of the data storage system RAID+ based on the invention is that it can provide multiple virtual RAID arrays (that is RAID volumes) in the same disk pool. Each user volume serves different users or loads. When there are multiple users, every user is allocated different logical volume. Every logical volume utilizes the same data template as granularity of space allocation.
(45) Configuration of data templates are mutually independent among different logical volumes. The system administrator can allocate multiple volumes with different stripe width, size of physical blocks, the rank of RAID or the addressing strategy.
(46) RAID+ can track the physical location of each data template in a logical volume through indexing techniques. As each data template is relatively large, containing n×(n−1)×k physical blocks, and the metadata just need to record the physical location of data templates without individually mapping every physical block inside the template, RAID+ can achieve mapping between user volumes and physical space with low metadata overheads. By caching metadata in memory, RAID+ can achieve fast query of the templates' physical location and reduce processing latency in address translation.
(47) When a user request arrives, RAID+ follows the 5 steps below to locate specific physical access locations (Algorithm 1):
(48) 1. Locating data templates: Based on the logical location x the user requests, template number #.sub.t and offset δ.sub.t inside template can be calculated by combing x with the size of user data space it inside single data template.
(49) 2. Querying index tables: Based on the index table of user volume, a function is called to get the physical offset off.sub.t of template #.sub.t.
(50) 3. Locating stripes: With the offset δ.sub.t inside template and the size of data space l.sub.s in stripes, we can get the stripe number #, of the request inside the template and the offset δ.sub.s inside the stripe. Then by querying the mapping between stripes and disks, we can get the set of disk numbers disks that the physical blocks of stripes #.sub.s belong to.
(51) 4. Locating internal locations of stripes: Considering the offset δ.sub.s inside stripes and the access order of logical blocks under stripes' internal layout, we can get id, the number of physical blocks visited inside the stripe, and off, the offset of logical blocks inside templates.
(52) 5. Calculating global locations: The number of disks to visit disk[id] can derived from physical block number id and the mapping relations between stripes and disk number set disks. Then global disk offset is equal to template offset off.sub.t and logical block offset off.
(53) In an embodiment of the invention, for data storage system deployed by the features of the orthogonal Latin squares, for data to store, wherein when data to store is desired to be stored in a read-friendly ordering, mapping relationship between stripes and disks is determined according to the Latin square based method mentioned before, wherein a parity block in one stripe is the last data block of the stripe, individual stripes are sorted so that the ordinal number of the disk on which the last block in a stripe is located is the ordinal number of the disk on which the first data block in the next stripe is located.
(54) In an embodiment of the invention, for data storage system deployed by the features of the orthogonal Latin squares, for data to store, wherein when data to store is desired to be stored in a write-friendly ordering, a parity block in one stripe is the last block of the stripe. Individual stripes are sorted so that the ordinal number of the disk on which the last block in a stripe is located is a certain amount less than the ordinal number of the disk on which the first block in the next stripe is located, the certain amount is a row number of a position of a mapping group corresponding to the stripe in the Latin squares used.
(55) Various embodiments of the invention have been described above. The above description is exemplary, not exhaustive, and the present invention is not limited to the disclosed embodiments. Without deviating from the scope and spirit of the illustrated embodiments, many modifications and changes are obvious to the normal technician in this technical field. Therefore, the scope of protection of the invention shall be subject to the scope of protection claimed.