Accelerated Galois field coding for storage systems
10291265 · 2019-05-14
Assignee
Inventors
- Maxim Trusov (Saint Petersburg, RU)
- Mikhail Danilov (Saint Petersburg, RU)
- Ivan Tchoub (Saint Petersburg, RU)
- Sergey Karpenkov (Vsevolozhsk, RU)
- Sergey Koyushev (Saint Petersburg, RU)
CPC classification
H03M13/154
ELECTRICITY
H03M13/3761
ELECTRICITY
International classification
H03M13/00
ELECTRICITY
G06F11/10
PHYSICS
Abstract
A technique to accelerate Galois Field (GF) arithmetic. The technique, which does not rely on any specific processor instruction set, can be used to accelerate erasure coding within storage systems.
Claims
1. A method for use with a distributed storage system having a plurality of storage nodes each having locally attached storage devices, the method comprising: retrieving a single-element multiplication matrix for a Galois field (GF); generating, by the distributed storage system, a multi-element multiplication matrix for the GF using the single-element multiplication matrix; receiving, at the distributed storage system, a request to store data; erasure-coding the data to produce a plurality of data fragments and a plurality of coded fragments, the erasure-coding being performed by at least one processor in the distributed storage system based on the multi-element multiplication matrix, the erasure-coding being performed by using one or more XOR-based addition operations; storing the plurality of data fragments within local storage devices of at least two of the plurality of storage nodes; and storing the plurality of coded fragments within local storage devices of at least two of the plurality of storage nodes.
2. The method of claim 1 wherein the Galois field includes 2^4 elements and the single-element multiplication matrix is a 16×16 matrix.
3. The method of claim 2 wherein the multi-element multiplication matrix is a 16×256 matrix.
4. The method of claim 1, wherein erasure-coding the data includes: dividing the data into the plurality of data fragments arranged as a column vector; retrieving a coding matrix having elements in the Galois field; and calculating the dot product of ones of the data fragments and rows of the coding matrix using the multi-element multiplication matrix to generate the plurality of coded fragments.
5. The method of claim 1 further comprising storing the multi-element multiplication matrix to non-volatile memory.
6. The method according to claim 1, further including generating the multi-element multiplication matrix for the GF using the single-element multiplication matrix prior to encoding data.
7. A storage system, comprising: a plurality of storage nodes each having locally attached storage devices; and a processor configured to: retrieve a single-element multiplication matrix for a Galois field (GF); generate a multi-element multiplication matrix for the GF using the single-element multiplication matrix; receive a request to store data; erasure-code the data to produce a plurality of data fragments and a plurality of coded fragments, the erasure coding being performed by using the multi-element multiplication matrix, the erasure coding being performed by using one or more XOR-based addition operations; store the plurality of data fragments within local storage devices of at least two of the plurality of storage nodes; and store the plurality of coded fragments within local storage devices of at least two of the plurality of storage nodes.
8. The storage system of claim 7 wherein the Galois field includes 2^4 elements and the single-element multiplication matrix is a 16×16 matrix.
9. The storage system of claim 8 wherein the multi-element multiplication matrix is a 16×256 matrix.
10. The storage system of claim 7 wherein erasure-coding the data includes: dividing the data into the plurality of data fragments arranged as a column vector; retrieving a coding matrix having elements in the Galois field; and calculating the dot product of ones of the data fragments and rows of the coding matrix using the multi-element multiplication matrix to generate the plurality of coded fragments.
11. The storage system of claim 7 further comprising a non-volatile memory, wherein the processor is further configured to store the multi-element multiplication matrix to the non-volatile memory.
12. The storage system according to claim 7, wherein the multi-element multiplication matrix for the GF is generated using the single-element multiplication matrix prior to encoding data.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The concepts, structures, and techniques sought to be protected herein may be more fully understood from the following detailed description of the drawings, in which:
(9) The drawings are not necessarily to scale, or inclusive of all elements of a system, emphasis instead generally being placed upon illustrating the concepts, structures, and techniques sought to be protected herein.
DETAILED DESCRIPTION
(10) Before describing embodiments of the structures and techniques sought to be protected herein, some terms are explained. As used herein, the phrases computer, computing system, computing environment, processing platform, data memory and storage system, and data memory and storage system environment are intended to be broadly construed so as to encompass, for example, private or public cloud computing or storage systems, or parts thereof, as well as other types of systems comprising distributed virtual infrastructure and those not comprising virtual infrastructure. The terms application, program, application program, and computer application program herein refer to any type of software application, including desktop applications, server applications, database applications, and mobile applications.
(11) As used herein, the term storage device refers to any non-volatile memory (NVM) device, including hard disk drives (HDDs), flash devices (e.g., NAND flash devices), and next generation NVM devices, any of which can be accessed locally and/or remotely (e.g., via a storage attached network (SAN)). The term storage device can also refer to a storage array comprising one or more storage devices.
(13) In general operation, clients 102 issue requests to the storage cluster 104 to read and write data. Write requests may include requests to store new data and requests to update previously stored data. Data read and write requests include an ID value to uniquely identify the data within the storage cluster 104. A client request may be received by any available storage node 106. The receiving node 106 may process the request locally and/or may delegate request processing to one or more peer nodes 106. For example, if a client issues a data read request, the receiving node may delegate/proxy the request to a peer node where the data resides. In various embodiments, the cluster 104 uses accelerated erasure coding to protect data stored therein, as described below.
(14) In various embodiments, the distributed storage system 100 comprises an object storage system, wherein data is read and written in the form of objects, which are uniquely identified by object IDs. In some embodiments, the storage cluster 104 utilizes Elastic Cloud Storage (ECS) from EMC Corporation of Hopkinton, Mass.
(15) In some embodiments, the system 100 employs a flat cluster architecture whereby cluster-level services are distributed evenly among the nodes. To implement cluster-level services using a flat cluster architecture, processing may be coordinated and shared among several nodes using the concept of object ownership. An object stored within the system 100, including system objects and user data, may be owned by a single node 106 at any given time. When a node owns an object, it may be solely responsible for handling updates to the object or for performing other processing associated with the object. Notably, a given node may own an object (e.g., user data) without having a copy of that object's data stored locally (i.e., the object data can be stored on one or more remote nodes).
(17) In the example shown, a storage node 106 includes the following services: an authentication service 108a to authenticate requests from clients 102; storage API services 108b to parse and interpret requests from clients 102; a storage chunk management service 108c to facilitate storage chunk allocation/reclamation for different storage system needs and to monitor storage chunk health and usage; a storage server management service 108d to manage available storage device capacity and to track storage device states; and a storage server service 108e to interface with the storage devices 110.
(18) A storage device 110 may comprise one or more physical and/or logical storage devices attached to the storage node 106a. A storage node 106 may utilize VNX, Symmetrix VMAX, and/or Fully Automated Storage Tiering (FAST), which are available from EMC Corporation of Hopkinton, Mass. While vendor-specific terminology may be used to facilitate understanding, it is understood that the concepts, techniques, and structures sought to be protected herein are not limited to use with any specific commercial products. In various embodiments, the storage node 106 uses accelerated erasure coding to protect data stored within the storage devices 110, as described below.
(20) The distribution matrix 204 may be a (k+m)×k matrix comprising a first sub-matrix 204a having k rows and a second sub-matrix (referred to as the coding matrix) 204b having m rows. The first sub-matrix 204a may be an identity matrix, as shown. In this form, the distribution matrix 204 can be multiplied by a data column vector 202 to produce a data-and-coding column vector 206 comprising the k data fragments 206a and the m coded fragments 206b.
(21) The coding matrix 204b includes coefficients X_i,j, which may be selected using known erasure coding techniques. In some embodiments, the coding coefficients are selected such that the system can tolerate the loss of any m fragments. The coefficients X_i,j may be selected based upon the specific erasure coding algorithm used.
(22) It will be appreciated that the encoding process can be performed as m independent dot products using individual rows from the coding matrix 204b and the data column vector 202. In particular, the i-th coded fragment C_i can be calculated as the dot product of the i-th row of the coding matrix 204b with the data column vector 202.
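The per-row dot product can be sketched as follows. The names `code_fragment` and `gf_mul` are illustrative, not from the patent: `gf_mul` stands in for the Galois-field multiplication developed later in the description, while addition is the XOR-based operation recited in the claims.

```python
# Sketch: coded fragment i is the GF dot product of row i of the coding
# matrix with the k data fragments, applied byte-by-byte.
# gf_mul(coefficient, byte) is a placeholder for GF multiplication.

def code_fragment(coding_row, data_fragments, gf_mul):
    """C_i = X_i,1*D_1 ^ X_i,2*D_2 ^ ... ^ X_i,k*D_k (per byte)."""
    frag_len = len(data_fragments[0])
    coded = bytearray(frag_len)
    for j, fragment in enumerate(data_fragments):
        for pos in range(frag_len):
            # addition in GF(2^w) is XOR
            coded[pos] ^= gf_mul(coding_row[j], fragment[pos])
    return bytes(coded)
```

Each of the m coded fragments can be produced by one such call, so the m dot products are independent and may be computed in parallel.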
(23) The data fragments D_1, D_2, . . . , D_k and coded fragments C_1, C_2, . . . , C_m may be distributed among the cluster storage nodes 106.
(24) If a data fragment is lost (e.g., due to a node failure, a storage device failure, or data corruption), the lost fragment may be regenerated using a decoding matrix (not shown), the available data fragments from among D_1, D_2, . . . , D_k, and the coded fragments C_1, C_2, . . . , C_m. The decoding matrix can be constructed as an inverse of the modified distribution matrix 204 using known techniques (which may take into account which data fragments were lost). At least k unique available fragments (either data fragments or coded fragments) may be required to decode a lost data fragment.
(25) The erasure coding technique described above may be classified as matrix-based Reed-Solomon erasure coding. In some embodiments, the distributed storage system 100 performs erasure coding over a Galois field (GF), which defines the arithmetic operations used by erasure coding (e.g., addition and multiplication). The GF also defines the set of elements from which the coding matrix 204b coefficients X_i,j are selected.
(26) In various embodiments, the distributed storage system performs erasure coding over GF(2^w), meaning a GF having 2^w elements. Typical values for w include 4, 8, 16, and 32. Advantageously, when GF(2^w) is used, the addition operation can be implemented using XOR (i.e., the exclusive OR binary operator). There is a need for more efficient implementations of the multiplication operation in GF(2^w) that do not depend on a specific processor instruction set.
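For concreteness, a minimal sketch of GF(2^4) multiplication, which can be used to populate the single-element multiplication matrix described below. The reduction polynomial x^4 + x + 1 (0b10011) is a common choice assumed here; the patent does not fix a particular polynomial, and the function name `gf16_mul` is illustrative.

```python
# Sketch: multiplication in GF(2^4) via shift-and-XOR, reduced modulo an
# assumed primitive polynomial x^4 + x + 1 (0b10011).

def gf16_mul(a, b, poly=0b10011):
    """Return a * b in GF(2^4)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^w) is XOR
        b >>= 1
        a <<= 1
        if a & 0b10000:          # product degree reached 4: reduce
            a ^= poly
    return result

# SingleEltMatrix[x][y] == x * y for all GF(2^4) elements x, y
SingleEltMatrix = [[gf16_mul(x, y) for y in range(16)] for x in range(16)]
```

Because the field has only 16 elements, the full 16×16 product table fits in 256 bytes and every multiplication thereafter is a single table lookup.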
(28) When new data D is added to the system (e.g., via a client 102 request), the system 300 divides the data into k fragments D_1, D_2, . . . , D_k, generates m coded fragments C_1, C_2, . . . , C_m therefrom, and stores the data fragments and the coded fragments across various nodes 301-316. In this example, the twelve data fragments D_1, D_2, . . . , D_12 are stored evenly across nodes 301-312 (i.e., one fragment per node) and four coded fragments C_1, C_2, C_3, and C_4 are stored evenly across nodes 313-316, as shown.
(29) Depending on the number of fragments generated and the number of nodes within a cluster, a given node 301-316 could store multiple data and/or coded fragments. Conversely, a given node 301-316 may not store any data or coded fragments.
(32) The right and left half-bytes 422, 424 may be calculated as follows:

Right half-byte = Data byte & 00001111   (1)
Left half-byte = (Data byte & 11110000) >> 4   (2)

where & is the bitwise AND operation and >> is the bitwise right shift operation.
(33) The two half-byte values 422, 424 can then be multiplied separately by elements in GF(2^4), such as coefficients X_i,j from an erasure coding matrix 204b, as follows:
Resulting right half-byte = MultMatrix[X_n][Right half-byte]   (3)
Resulting left half-byte = MultMatrix[X_n][Left half-byte]   (4)

where X_n represents an element of GF(2^4), such as an erasure coding coefficient X_i,j, and MultMatrix represents the 16×16 multiplication matrix 400. The multiplication matrix 400 can be implemented as a two-dimensional array or other suitable data structure.
(35) The two resulting half-bytes can then be combined back into a single byte of data, as follows:
Resulting byte = (Resulting left half-byte << 4) ^ Resulting right half-byte   (5)

where ^ is the bitwise XOR operation and << is the bitwise left shift operation.
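Equations (1) through (5) combine into a single per-byte multiply routine. A sketch, assuming the 16×16 table is populated with the reduction polynomial x^4 + x + 1 (an assumption; the patent leaves the polynomial unspecified) and using the illustrative name `mul_byte`:

```python
# Sketch of equations (1)-(5): multiply a full data byte by a GF(2^4)
# element x_n using two half-byte lookups in the 16x16 matrix.

def gf16_mul(a, b, poly=0b10011):
    # GF(2^4) multiply; x^4 + x + 1 is an assumed reduction polynomial
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= poly
    return r

MultMatrix = [[gf16_mul(x, y) for y in range(16)] for x in range(16)]

def mul_byte(x_n, data_byte):
    right = data_byte & 0b00001111             # equation (1)
    left = (data_byte & 0b11110000) >> 4       # equation (2)
    res_right = MultMatrix[x_n][right]         # equation (3)
    res_left = MultMatrix[x_n][left]           # equation (4)
    return (res_left << 4) ^ res_right         # equation (5)
```

Each half-byte is a GF(2^4) element, so the two 4-bit products are independent and the recombination in equation (5) is exact.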
(36) Referring to
(37) The multi-element multiplication matrix 440 can be computed directly from the single-element multiplication matrix 400. An illustrative routine for pre-computing the multi-element multiplication matrix 440 is shown in TABLE 1, where ^ is the bitwise XOR operation, MultipleEltMatrix is the multi-element multiplication matrix 440, and SingleEltMatrix is the single-element multiplication matrix 400. It should be understood that although the implementation shown involves bitwise operations, these operations can be performed prior to encoding (i.e., the multi-element multiplication matrix 440 can be pre-computed offline).
(38) TABLE 1

for ( I in [0 : 16] ) {
    for ( J in [0 : 256] ) {
        L = (J & 11110000) >> 4;
        R = J & 00001111;
        MultipleEltMatrix[I][J] = (SingleEltMatrix[I][L] << 4) ^ SingleEltMatrix[I][R];
    }
}
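A runnable version of the TABLE 1 routine, with the right half-byte looked up in SingleEltMatrix consistently with equations (3) through (5). SingleEltMatrix is seeded here with the assumed reduction polynomial x^4 + x + 1; the patent does not specify one.

```python
# Pre-compute the 16x256 multi-element multiplication matrix: one row per
# GF(2^4) element, one column per possible data byte.

def gf16_mul(a, b, poly=0b10011):
    # GF(2^4) multiply; the reduction polynomial is an assumption
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= poly
    return r

SingleEltMatrix = [[gf16_mul(x, y) for y in range(16)] for x in range(16)]

MultipleEltMatrix = [[0] * 256 for _ in range(16)]
for I in range(16):
    for J in range(256):
        L = (J & 0b11110000) >> 4
        R = J & 0b00001111
        # each half-byte is multiplied independently, then recombined
        MultipleEltMatrix[I][J] = (SingleEltMatrix[I][L] << 4) ^ SingleEltMatrix[I][R]
```

The whole table occupies 16 × 256 = 4096 entries, so the one-time pre-computation cost is negligible next to the per-byte savings during encoding.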
(39) Using the multi-element multiplication matrix 440, the resulting byte 430 can be calculated with a single lookup, as follows:

Resulting byte = MultipleEltMatrix[X_n][Data byte]   (6)

where X_n represents an element in GF(2^4), as described above in conjunction with equations (3) and (4).
(40) In various embodiments, the multi-element multiplication matrix 440 is pre-computed offline (i.e., prior to commencing erasure encoding), stored in memory (volatile and/or non-volatile), and used to accelerate the multiplication operations required during subsequent encoding operations. In real-world applications, these techniques may improve elapsed encoding time by a factor of two compared to existing implementations. The described techniques are not specific to any particular processor instruction set.
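Putting the pieces together, encoding with the pre-computed table can be sketched as follows: per equation (6), each byte multiplication becomes a single lookup, and each coded fragment is the XOR of the looked-up bytes across the k data fragments. The function name `encode_fragment` and the reduction polynomial x^4 + x + 1 are assumptions for illustration.

```python
# Sketch: erasure-encode one coded fragment using the 16x256 table.

def gf16_mul(a, b, poly=0b10011):
    # GF(2^4) multiply; the reduction polynomial is an assumption
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= poly
    return r

SingleEltMatrix = [[gf16_mul(x, y) for y in range(16)] for x in range(16)]
MultipleEltMatrix = [
    [(SingleEltMatrix[i][j >> 4] << 4) ^ SingleEltMatrix[i][j & 0x0F]
     for j in range(256)]
    for i in range(16)
]

def encode_fragment(coding_row, data_fragments):
    """C_i = X_i,1*D_1 ^ ... ^ X_i,k*D_k, one table lookup per byte."""
    coded = bytearray(len(data_fragments[0]))
    for x_n, fragment in zip(coding_row, data_fragments):
        row = MultipleEltMatrix[x_n]       # equation (6): single lookup
        for pos, byte in enumerate(fragment):
            coded[pos] ^= row[byte]        # XOR-based addition
    return bytes(coded)
```

Compared to the two-lookup half-byte scheme of equations (1) through (5), this halves the per-byte lookup count in the inner loop, which is the source of the speedup claimed above.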
(41) The multiplication acceleration technique described above can be applied to units of data larger than 1 byte, such as 2-byte values or even 4-byte values. The dimensions of the multi-element multiplication matrix 440 would be modified accordingly. For example, using 2-byte data values, the multi-element multiplication matrix 440 becomes a 16×65536 matrix. Thus, the memory required to store the multiplication matrix might be a limiting factor.
(44) Once generated, the multi-element multiplication matrix 440 can be used to encode and decode data.
(45) At block 506, a request to store new data may be received from a client (e.g., a user application). At block 508, the data may be encoded using the multi-element multiplication matrix 440.
(46) At block 510, the encoded data may be stored across one or more nodes 106 of the cluster 104.
(48) Processing may be implemented in hardware, software, or a combination of the two. In various embodiments, processing is provided by computer programs executing on programmable computers/machines that each includes a processor, a storage medium or other article of manufacture that is readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code may be applied to data entered using an input device to perform processing and to generate output information.
(49) The system can perform processing, at least in part, via a computer program product, (e.g., in a machine-readable storage device), for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). Each such program may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system. However, the programs may be implemented in assembly or machine language. The language may be a compiled or an interpreted language and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer. Processing may also be implemented as a machine-readable storage medium, configured with a computer program, where upon execution, instructions in the computer program cause the computer to operate.
(50) Processing may be performed by one or more programmable processors executing one or more computer programs to perform the functions of the system. All or part of the system may be implemented as special purpose logic circuitry (e.g., an FPGA (field programmable gate array) and/or an ASIC (application-specific integrated circuit)).
(51) All references cited herein are hereby incorporated herein by reference in their entirety.
(52) Having described certain embodiments, which serve to illustrate various concepts, structures, and techniques sought to be protected herein, it will be apparent to those of ordinary skill in the art that other embodiments incorporating these concepts, structures, and techniques may be used. Elements of different embodiments described hereinabove may be combined to form other embodiments not specifically set forth above and, further, elements described in the context of a single embodiment may be provided separately or in any suitable sub-combination. Accordingly, it is submitted that the scope of protection sought herein should not be limited to the described embodiments but rather should be limited only by the spirit and scope of the following claims.