Storage-optimized data-atomic systems and techniques for handling erasures and errors in distributed storage systems
11226951 · 2022-01-18
Assignee
- Massachusetts Institute Of Technology (Cambridge, MA)
- Northeastern University (Boston, MA)
- University Of Connecticut (Farmington, CT)
Inventors
- Muriel Medard (Belmont, MA)
- Kishori Mohan Konwar (Revere, MA)
- Prakash Narayana Moorthy (Hillsboro, OR)
- Nancy Ann Lynch (Brookline, MA)
- Erez Kantor (Brookline, MA)
- Alexander Allister Schwarzmann (Storrs, CT)
Cpc classification
H04L69/26
ELECTRICITY
G06F16/27
PHYSICS
G06F16/00
PHYSICS
H04L67/1097
ELECTRICITY
International classification
G06F16/00
PHYSICS
G06F16/27
PHYSICS
Abstract
Described are devices, systems and techniques for implementing atomic memory objects in a multi-writer, multi-reader setting. In an embodiment, the devices, systems and techniques use maximum distance separable (MDS) codes, and may be specifically designed to optimize a total storage cost for a given fault-tolerance requirement. Also described is an embodiment to handle the case where some of the servers can return erroneous coded elements during a read operation.
Claims
1. A data storage node for a data storage system having a plurality of logically numbered data storage nodes, of which at most f are faulty, the data storage node comprising: a node interface configured for: receiving a message that is associated with a reader and includes a tag that is totally orderable with respect to other such tags, and responsive to a first receiving of the message, if the logical number of the data storage node is no greater than f+1, forwarding the message to each data storage node having a logical number greater than that of the data storage node; a coding processor configured for computing a coded element from a version of a value that is uniquely identified by the tag; a tag memory configured for storing the tag; a coded element memory configured for storing the computed coded element; a reader registry configured for registering the stored tag in association with the reader; and a reader interface configured for, responsive to a first receiving of the message and when the tag is not higher ordered than a given tag previously stored in the tag memory, sending to the reader the given tag and a previously stored coded element computed from a version of a value that is uniquely identified by the given tag.
2. A data storage node according to claim 1, wherein the node interface is further configured to inform each other data storage node in the plurality about such sending to the reader by the reader interface.
3. A data storage node according to claim 1, wherein: the coding processor is configured for computing the coded element according to both an encoding scheme and the logical number of the data storage node; and the node interface is further configured to replace, within the message, the version of the value to be written by the coded element.
4. A data storage node according to claim 3, wherein the message includes a tag that is totally orderable with respect to other such tags, and wherein the reader interface is further configured to relay the tag and the coded element to a reader when the tag is at least as highly ordered as a tag previously registered in association with the reader in the reader registry.
5. A data storage node according to claim 4, wherein the node interface is further configured to inform each other data storage node in the plurality about such relaying of the tag and the coded element by the reader interface.
6. A data storage node according to claim 4, wherein the tag memory and the coded element memory are further configured for storing the tag and the coded element, respectively, when the tag is higher ordered than a tag previously stored in the tag memory.
7. A data storage network comprising a plurality of logically numbered data storage nodes, of which at most f are faulty, wherein each of the data storage nodes in the plurality is configured to: respond to a first receiving of a message, that pertains to a data storage protocol, is associated with a reader, and includes a tag that is totally orderable with respect to other such tags, by: (a) if the logical number of the data storage node is no greater than f+1, forwarding the message to each data storage node having a logical number greater than that of the data storage node; and (b) processing the message according to the data storage protocol, including registering the tag in association with the reader, and when the tag is not higher ordered than a previously stored tag, sending to the reader the previously stored tag and a previously stored coded element computed from a version of a value uniquely identified by the previously stored tag.
8. A data storage network according to claim 7, wherein each of the data storage nodes in the plurality is further configured to inform each other data storage node in the plurality about such sending to the reader.
9. A data storage network according to claim 7, wherein: processing the message by each of the data storage nodes in the plurality comprises replacing within the message the version of the value to be written by the coded element computed therefrom according to both an encoding scheme and the logical number of the data storage node; and forwarding the message includes, for each data storage node having a logical number greater than f+1, replacing within the message the version of the value to be written by the coded element.
10. A data storage network according to claim 9, wherein the message includes a tag that is totally orderable with respect to other such tags, and wherein processing the message by each of the data storage nodes in the plurality includes relaying the tag and the coded element to a reader when the tag is at least as highly ordered as a tag previously registered in association with the reader.
11. A data storage network according to claim 10, wherein each of the data storage nodes in the plurality is further configured to inform each other data storage node in the plurality about such relaying of the tag and the coded element.
12. A data storage network according to claim 10, wherein processing the message by each of the data storage nodes in the plurality further includes storing the tag and the coded element when the tag is higher ordered than a previously stored tag.
13. A data storage network according to claim 9, wherein the data storage nodes are logically numbered 1 through n, and the encoding scheme uses an [n, k] Maximum Distance Separable (MDS) code, where k≤n−f.
14. A method of operating a data storage node, in a data storage network having a plurality of logically numbered data storage nodes, of which at most f are faulty, the method comprising: responding to a first receiving of a message, that pertains to a data storage protocol, is associated with a reader, and includes a tag that is totally orderable with respect to other such tags, by: (a) if the logical number of the data storage node is no greater than f+1, forwarding the message to each data storage node having a logical number greater than that of the data storage node; and (b) processing the message according to the data storage protocol, including registering the tag in association with the reader, and when the tag is not higher ordered than a previously stored tag, sending to the reader the previously stored tag and a previously stored coded element computed from a version of a value uniquely identified by the previously stored tag.
15. A method according to claim 14, wherein the data storage node is further configured to inform each other data storage node in the plurality of data storage nodes about such sending to the reader.
16. A method according to claim 14, wherein: processing the message by each of the data storage nodes in the plurality comprises replacing within the message the version of the value to be written by the coded element computed therefrom according to both an encoding scheme and the logical number of the data storage node; and forwarding the message includes, for each data storage node having a logical number greater than f+1, replacing within the message the version of the value to be written by the coded element.
17. A method according to claim 16, wherein the message includes a tag that is totally orderable with respect to other such tags, and wherein processing the message includes relaying the tag and the coded element to a reader when the tag is at least as highly ordered as a tag previously registered in association with the reader.
18. A method according to claim 17, further comprising the data storage node informing each other data storage node in the plurality of data storage nodes about such relaying of the tag and the coded element.
19. A method according to claim 17, wherein processing the message further includes storing the tag and the coded element when the tag is higher ordered than a previously stored tag.
20. A method according to claim 16, wherein the data storage nodes are logically numbered 1 through n, and the encoding scheme uses an [n, k] Maximum Distance Separable (MDS) code, where k≤n−f.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The foregoing features may be more fully understood from the following description of the drawings.
DETAILED DESCRIPTION
(14) Before describing concepts, systems, devices and techniques that allow concurrent read and write operations by several reader and writer clients to one or more storage nodes, while offering an atomic consistency guarantee and tolerating failures without violating that guarantee, some introductory concepts and terminology are explained.
(15) As used herein, the term “storage device” (also sometimes referred to herein as a “storage”) refers to any electronic machine or manufacture, such as a volatile or non-volatile memory, that stores digital data for later retrieval.
(16) The term “storage node” (also sometimes referred to herein as a “storage server” or sometimes simply as “server”) refers to any electronic machine or manufacture, such as a computer server, that provides, as an electronic service to another machine or manufacture, the capability to store or retrieve digital data in a storage device.
(17) The term “data communication network” refers to any shared means of data communication by and between two or more computing devices (including storage nodes).
(18) The term “data storage network” refers to two or more storage nodes that cooperate using a data communication network to store data in, and retrieve data from, one or more storage devices.
(19) The term “writer” refers to a computerized process that writes data to a data storage network. The term “reader” refers to a computerized process that reads data from a data storage network. The term “data storage system” (also sometimes referred to herein as a “distributed storage system” or “DSS”) refers to a data storage network having at least one reader and at least one writer.
(20) The term “end user client” refers to a computerized process that communicates with a DSS to store and retrieve data therewith. The term “end user” refers to an individual or juristic entity that controls the operation of an end user client. Writers and readers are sometimes collectively referred to herein as “proxy clients” of the distributed storage system.
(21) The term “version” refers to a particular syntactic value of semantically identified data. For example, a text file typically is semantically identified by a file name or an inode, and a version of that file is the text it contains at any given moment.
(22) It should be appreciated that in some applications it is desirable to intentionally store one or more “copies” of the same data. This is accomplished by referring to each copy by a separate (semantic) name or number. For the purposes of the description herein below, such copies are not considered to be versions of the same data. It should thus be appreciated that it is possible to have multiple copies of the same data (with each copy considered to be a different file) and that each copy may have associated versions thereof at particular points in time.
(23) The term “tag” refers to data that uniquely associates a version of a value v to be written with a requesting writer. For example, a tag may be a pair of totally orderable identifiers (z, w), where z identifies a version of the value v to be written, and w identifies a writer. The identifier z may be, for example, a sequence number or a time stamp. The identifier w may be, for example, a string of letters, an Internet Protocol (IP) address, or a number. Any two such tags may be compared in lexicographic (dictionary) order; that is, for any two tags t.sub.1 and t.sub.2, write t.sub.2>t.sub.1 if either (i) t.sub.2.z>t.sub.1.z or (ii) t.sub.2.z=t.sub.1.z and t.sub.2.w>t.sub.1.w. Thus, use of the term “highest” with respect to a tag need not imply a comparison of numerical values. A person of ordinary skill in the art will recognize that tags may be implemented in a manner not disclosed herein that nevertheless is compatible with the disclosed embodiments.
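The lexicographic comparison just described can be sketched as follows (a non-normative illustration; the function name is chosen for exposition, and Python's built-in tuple comparison happens to be exactly the lexicographic order of rules (i) and (ii) above):

```python
# Illustrative sketch: a tag is a pair (z, w), where z is a version number
# and w is a writer identifier. Python tuples compare lexicographically,
# matching the ordering rules (i) and (ii) in the text above.

def new_tag(z, w):
    """Build a tag from a version number z and a writer identifier w."""
    return (z, w)

t1 = new_tag(3, "writer-a")
t2 = new_tag(3, "writer-b")   # same version, higher-ordered writer id
t3 = new_tag(4, "writer-a")   # higher version

assert t2 > t1                # rule (ii): z equal, compare writer ids
assert t3 > t2                # rule (i): higher version wins
assert max(t1, t2, t3) == t3  # "highest" tag under lexicographic order
```

Note that the "highest" tag need not have the numerically largest writer identifier; the version number z dominates the comparison.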
(24) Referring now to the drawings, an illustrative DSS 10 includes a data storage network having storage nodes 11a-11d.
(25) DSS 10 further includes a plurality of writers 12a-12N and a plurality of readers 13a-13N which communicate with the data storage network 11a-11d over a communication channel. The number of readers need not equal the number of writers; in general the two numbers differ.
(26) A plurality of end user clients 14a-14N, generally denoted 14, communicate with DSS 10 via wired or wireless communication paths. It should be noted that some end user clients (e.g. end user client 14a) may be directly coupled to DSS 10 while other end user clients (e.g. end user clients 14b-14N) may be coupled to DSS 10 through a network 15. Network 15 may, for example, correspond to an intranet or an internet or to the Internet (i.e. the global system of interconnected computer networks that use the Internet protocol suite (TCP/IP) to link devices worldwide). It should be appreciated that end user clients 14 and network 15 are here shown in phantom since they are not properly a part of DSS 10.
(27) End user clients 14 issue read and/or write instructions to the data storage network 11 via one or more of the readers 13 and/or writers 12, respectively (i.e. to either read data from or write data to storage) on behalf of end users (not shown). The writers 12 and readers 13 interface with both the end user clients 14 and data storage network 11 to read and write data to the storage nodes in response to requests from one or more of the end user clients 14. Thus, the writers 12 and readers 13 may sometimes be referred to herein as proxy clients 16 of data storage network 11, as they act as an interface or an intermediary for requests from end user clients 14 seeking read and/or write resources.
(28) Writers, readers and storage nodes operate in accordance with the techniques described herein below.
(29) In some embodiments, DSS 10 is able to store massive datasets across several hundreds of storage nodes and is appropriate for use in both industrial and scientific applications. DSS 10 may physically reside, for example, in one or more datacenters. DSS 10 may also be deployed in commercial settings by industrial organizations such as Amazon, Google, Facebook, etc. to store user data and deploy applications including, but not limited to social networks, file sharing, and financial transactions.
(30) In some embodiments, the components of DSS 10 may themselves be distributed across geographical regions and in communication via communication paths (which may include any type of wired or wireless communication path including optical paths).
(31) It is assumed that every proxy client (i.e., writer and reader) is connected to every server through a reliable communication link. This means that as long as the destination process is non-faulty, any message sent on the link is guaranteed to eventually reach the destination process. The model allows the sender process to fail after placing the message in the channel; message-delivery depends only on whether the destination is non-faulty. Reliable connectivity between every pair of servers in the system also is assumed. No assumption is made regarding relative order of message delivery in the same channel.
(32) Techniques for imposing consistency during concurrent access of data by readers 13 and writers 12 are described herein below.
(33) In describing a write operation, it is assumed that [n, k=n−f] MDS erasure codes are used, where n corresponds to the total number of storage nodes and f corresponds to the number of nodes allowed to fail while still ensuring completion of a write operation. In the particular example described below, the data storage network includes n=5 storage nodes 22a-22e and tolerates f=2 faulty nodes.
(34) It should be appreciated that a choice of f may depend on properties of the data storage system and the environment in which it is deployed. Typically, the choice of f is based on a balance between an application-specific need for crash fault tolerance and the financial cost of providing it. By way of illustration, in a typical datacenter about 5% of the servers are inoperative or faulty at any given moment. In such an environment, f need not be large, for example about 10% of the number n of total storage nodes (i.e., twice the average number of failed servers). In any event, as is known in the art, the number f is constrained to be less than 50% of the number n of total storage nodes to permit recovery of a stored value using an [n, n−f] erasure code.
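The parameter choice above can be checked with a few lines of arithmetic (an illustrative sketch; the function name is not from the patent):

```python
# Illustrative sketch: choosing [n, k] MDS code parameters with k = n - f,
# subject to the constraint f < n/2 noted in the text above.

def mds_params(n, f):
    """Return (n, k) for an [n, k = n - f] MDS code, validating f < n/2."""
    if not 0 <= f < n / 2:
        raise ValueError("f must satisfy 0 <= f < n/2")
    return n, n - f

# Example from the text: about 10% of nodes allowed to fail.
n, f = 100, 10
print(mds_params(n, f))  # (100, 90): any 90 of the 100 coded elements recover v
```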
(35) It also should be appreciated, as will become apparent from the description herein below, that only one coded element is stored in a node at any one time. It also should be appreciated that one challenge in the write operation is to send the coded elements to all nodes, despite a writer crash (i.e. to complete a write operation despite a writer crash).
(36) Returning to the write operation:
(37) To ensure crash fault-tolerance, upon receiving the tagged value (t, v), each node (logically numbered i) of the first f+1 nodes forwards the information (t, v) to nodes i+1 through f+1 inclusive.
(38) Finally, each of the first f+1 nodes computes, for each of the remaining n−f−1 nodes, the coded element corresponding to that node, and sends it to that node.
(39) In other words, each of the non-faulty nodes among the first f+1 nodes (here 22a-22c) receives a value, computes a coded element for itself, and computes and sends a coded element to each of the remaining n−f−1 nodes (here 22d-22e). With this approach, even in the event of a writer crash, if even one of the first f+1 nodes receives a value v to write collectively using the storage nodes 22a-22e, coded elements are sent to all storage nodes 22a-22e, and received by all such storage nodes that are non-faulty. Moreover, because each coded element is a function of the logical number of the node by which it is to be stored, each of the remaining nodes j receives the same coded element f+1 times, once from each of the first f+1 storage nodes. Further details of the write operation are described below.
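The dissemination pattern above can be simulated in a few lines. This is an illustrative sketch, not the patent's pseudocode: coded elements are modeled only by which node ends up holding one, and message delivery is modeled as a direct function call.

```python
# Minimal simulation of the fault-tolerant write dissemination described
# above: the writer sends (t, v) to some subset of the first f+1 nodes;
# each non-faulty node i <= f+1, on first receipt, forwards (t, v) to
# nodes i+1..f+1 and sends a coded element to every node j > f+1.

def simulate_write(n, f, faulty, writer_reaches):
    """Return the set of nodes (1-based) holding a coded element afterward."""
    received_value = set()   # nodes among 1..f+1 that got (t, v)
    has_coded = set()        # nodes that ended up with their coded element

    def deliver_value(i):
        if i in faulty or i in received_value:
            return
        received_value.add(i)
        has_coded.add(i)                      # node i codes for itself
        for j in range(i + 1, f + 2):         # forward (t, v) to i+1 .. f+1
            deliver_value(j)
        for j in range(f + 2, n + 1):         # send c_j to each remaining node
            if j not in faulty:
                has_coded.add(j)

    for i in writer_reaches:                  # the writer may crash mid-send
        deliver_value(i)
    return has_coded

# Even if the writer crashes after reaching a single non-faulty node among
# the first f+1, every non-faulty node ends up with its coded element.
n, f = 5, 2
assert simulate_write(n, f, faulty={2}, writer_reaches={1}) == {1, 3, 4, 5}
```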
(40) Referring now to the read operation:
(41) The reader 30 receives a request to read a value v stored in the storage nodes 22a-22e. Such a read request may originate from an end user client, such as one of the end user clients 14 described above.
(42) Then, in
(43) The SODA algorithm is illustrated in connection with the flow diagrams described below.
(44) Each server stores three state variables. The first state variable is (t,c.sub.s), a tag and coded element pair which is initially set to (t.sub.0,c.sub.0). The second state variable is denoted R.sub.c, an initially empty set of pairs of the form (r,t.sub.r), each pair indicating that the reader r is being currently served by this server with respect to a tag t.sub.r. The third state variable is H, an initially empty set of tuples (t,s′,r), each tuple indicating that the server s′ has sent a coded element corresponding to the tag t, to reader r.
(45) Two types of messages are sent: messages that carry metadata, and messages that comprise in part or full an object value. The messages sent from the proxy clients are labeled with phase names, such as READ-VALUE or READ-COMPLETE.
(46) Bracketed rectangular elements (typified by elements 41 and 43) represent processing blocks and subprocessing blocks in the flow diagrams.
(47) The processing and subprocessing blocks may represent steps performed by functionally equivalent circuits such as a digital signal processor (DSP) circuit, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a central processing unit (CPU) or any type of processor or processing element. The flow diagrams do not depict the syntax of any particular programming language, but rather illustrate the functional information one of ordinary skill in the art requires to fabricate circuits or to generate computer software to perform the processing required of the particular apparatus. It should be noted that many routine program elements, such as initialization of loops and variables and the use of temporary variables, may be omitted for clarity. The particular sequence of blocks described is illustrative only and can be varied without departing from the spirit of the concepts, structures, and techniques sought to be protected herein. Thus, unless otherwise stated, the blocks described below are unordered, meaning that, when possible, the functions represented by the blocks can be performed in any convenient or desirable order.
(49) The write operation consists of two phases. In the first phase, the writer queries all servers for the locally stored tags, awaits responses from a majority, and then picks the highest-ordered tag t.sub.max. The writer w creates a new tag given by t.sub.w=(t.sub.max.z+1,w). In the second phase, the writer sends the message (t.sub.w,v) to all servers in S, via md-meta-send(t.sub.w,v), and this ensures that every server that is non-faulty will eventually receive the message (t.sub.w,c.sub.s), where c.sub.s denotes the coded element corresponding to server s. If the server s finds that t.sub.w>t, then the local tag and coded element are replaced by (t.sub.w,c.sub.s). In any case, the server sends an acknowledgment back to the writer w. A few additional steps are performed by the server while responding to the message (t.sub.w,c.sub.s); these are explained below in connection with the read operation.
(50) In relation to the processes illustrated in
(51) It should be appreciated that, if two or more writers attempt to update the value v concurrently, there exists a race condition in which they both obtain the same value of t.sub.max. According to lexicographic ordering, the highest-ordered writer will have the highest-ordered tag, and as described below may ultimately obtain precedence for storing its value v over the value of a lower-ordered writer. If it is desired for the lower-ordered writer to obtain precedence, the process 42 may use an alternate computation to produce an alternate write tag given by the formula t=(t.sub.max.z+2, w). Such a write tag will obtain storage precedence over other simultaneous writers that produce tags in accordance with the formula in the previous paragraph (i.e., whose version increment is “+1” rather than “+2”). This line of reasoning easily may be generalized to permit writers to store data values according to an arbitrary precedence value.
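The writer's tag computation, including the precedence variant just described, can be sketched as follows (a hedged illustration; the function name and the `precedence` parameter are expository, not from the patent text):

```python
# Illustrative sketch of the writer's tag computation: phase one collects
# tags from a majority of servers; the writer then forms
# t_w = (t_max.z + precedence, w), where precedence is 1 by default and may
# be raised (e.g. to 2) to win races against "+1" writers, as described above.

def next_write_tag(collected_tags, w, precedence=1):
    """Pick the highest-ordered collected tag, then bump its version number."""
    z_max, _ = max(collected_tags)        # lexicographic max, as in SODA
    return (z_max + precedence, w)

collected = [(4, "a"), (5, "b"), (5, "a")]
# Two writers racing from the same t_max = (5, "b"):
t_low = next_write_tag(collected, "a")    # (6, "a")
t_high = next_write_tag(collected, "b")   # (6, "b"), wins by writer id
assert t_high > t_low
# The lower-ordered writer can take precedence with a "+2" bump:
assert next_write_tag(collected, "a", precedence=2) > t_high
```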
(52) Also in relation to
(53) As shown in
(55) Thus, in a first process 51, the storage node i determines whether any readers have registered to obtain versions of the value v being concurrently written; that is, whether t≥t.sub.r, where t.sub.r.z is the older version number. If such a condition holds, then the just-received message (t, c′) is relayed to each such registered reader in a process 52. To ensure that state variables are eventually cleaned up in case the reader itself fails, in a subprocess block 53 the other storage nodes are sent a metadata update using the md-meta-send process.
(56) Once any registered readers have been notified, the storage node i determines, in a process 54, whether its own information is more recent than the new information. That is, the tag stored for the value v is compared to the tag for the value v just received from the writer. The tag comparisons of process 51 and process 54 may be accomplished, for example, by consulting a local tag memory.
(57) If the stored tag is more recent than the new tag, then the method continues to process 57, in which the writer is acknowledged.
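The tag-comparison rule of the two paragraphs above can be sketched as a small state transition (illustrative names; an abstract model rather than the patent's pseudocode):

```python
# Minimal sketch of the server's write-message handling: the server adopts
# the incoming (tag, coded element) pair only if the incoming tag is higher
# ordered than its stored tag; either way, it acknowledges the writer.

def on_write_message(stored, t_new, c_new):
    """Return (new local state, reply) for an incoming (t_new, c_new)."""
    t, _c = stored
    if t_new > t:
        stored = (t_new, c_new)   # newer version: store it
    return stored, "ack"          # stale version: keep state, still ack

state = ((3, "w1"), "c_old")
state, reply = on_write_message(state, (4, "w2"), "c_new")
assert state == ((4, "w2"), "c_new") and reply == "ack"
state, reply = on_write_message(state, (2, "w1"), "c_stale")
assert state == ((4, "w2"), "c_new")   # a stale write does not overwrite
```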
(58) The SODA read operation is now described.
(59) Any server that receives m registers the (r,t.sub.r) pair locally. Here, the term “register” means adding the pair (r,t.sub.r) to R.sub.c by executing the step R.sub.c←R.sub.c∪{(r,t.sub.r)} during the read-value phase at the server. Similarly, by “unregister” is meant the opposite, i.e., removing the pair from R.sub.c. The server sends the locally available (t,c.sub.s) pair to the reader if t≥t.sub.r. Furthermore, every time a new message (t.sub.w,c.sub.s) is received at the server, due to some concurrent write with (t.sub.w,v), the server sends the message (t.sub.w,c.sub.s) to r if t.sub.w≥t.sub.r. Note that there can be situations where the server does not store c.sub.s locally, for instance, if the local tag t is higher-ordered than the writer's tag t.sub.w, but simply sends the coded element c.sub.s to r. The reader keeps accumulating (t,c.sub.s) pairs it receives from various servers, until the reader has k coded elements corresponding to some tag t.sub.read. At this point the reader decodes the value (t.sub.read, v). Before returning the value v, the reader sends a READ-COMPLETE message so that the reader can be unregistered by the active servers, i.e., (r,t.sub.r) is removed from their local variable R.sub.c.
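The reader's accumulation rule above can be sketched as follows (an illustrative model; the MDS decoding step is abstracted away, and the data layout is expository):

```python
# Minimal sketch of the reader side of the read operation: keep
# (tag, server, coded element) responses until some tag has k distinct
# coded elements, at which point decoding becomes possible.

from collections import defaultdict

def accumulate(responses, k):
    """Return (tag, elements) for the first tag reaching k distinct coded
    elements, or None if more responses are still needed."""
    by_tag = defaultdict(dict)           # tag -> {server: coded element}
    for tag, server, coded in responses:
        by_tag[tag][server] = coded      # duplicates from a server collapse
        if len(by_tag[tag]) >= k:        # enough distinct elements to decode
            return tag, by_tag[tag]
    return None                          # must keep waiting

resp = [((1, "w"), "s1", "c1"), ((2, "w"), "s2", "c2"),
        ((2, "w"), "s3", "c3"), ((2, "w"), "s2", "c2"),  # duplicate ignored
        ((2, "w"), "s4", "c4")]
tag, elems = accumulate(resp, k=3)
assert tag == (2, "w") and len(elems) == 3
```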
(60) The algorithm ensures that a failed reader is not sent messages indefinitely by any server. Assume that the pair (r,t.sub.r) is registered at server s, which continues sending coded elements from new writes for tags higher-ordered than or equal to t.sub.r. Once k distinct coded elements for such a tag are known to have been sent, reader r will be unregistered, and server s no longer sends messages for that read. In order to implement this, any server s′ that sends a coded element corresponding to tag t′ to reader r also sends (s′,t′,r) to all the other servers via md-meta-send.
(61) Server s accumulates any received (s′,t′,r′) tuple in its history variable H, even if reader r′ has not yet been registered by it. The use of the message-disperse primitive by r′, by calling md-meta-send(READ-VALUE (r′,t.sub.r′)), is described below.
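The server-side bookkeeping of the two paragraphs above can be sketched as a predicate over the history variable H (a hedged illustration; the function name is expository, and H is modeled as a set of (t, s′, r) tuples as defined earlier):

```python
# Illustrative sketch: a server unregisters reader r once it knows that k
# distinct servers have sent r coded elements for some single tag t >= t_r.

def should_unregister(H, r, t_r, k):
    """True once k distinct servers are known to have sent reader r coded
    elements for one tag higher-ordered than or equal to t_r."""
    per_tag = {}
    for (t, s, reader) in H:             # H holds (tag, server, reader) tuples
        if reader == r and t >= t_r:
            per_tag.setdefault(t, set()).add(s)
    return any(len(servers) >= k for servers in per_tag.values())

H = {((2, "w"), "s1", "r1"), ((2, "w"), "s2", "r1"),
     ((2, "w"), "s3", "r1"), ((1, "w"), "s4", "r1")}
assert should_unregister(H, "r1", (1, "w"), k=3)       # tag (2,"w") has 3 senders
assert not should_unregister(H, "r1", (1, "w"), k=4)   # no tag has 4 senders
```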
(62) Since no order in message arrivals is assumed, messages concerning the same read may arrive at different servers in different orders.
(63) During each read operation the reader appends a unique identifier (e.g., a counter or a time stamp) in addition to its own id r. Though it can be proved that every server will eventually stop sending coded elements to any reader r, it can happen that the entries in H corresponding to r are not entirely cleared. The usage of unique identifiers for distinct read operations from the same reader ensures that the stale entries in H do not affect new reads.
(64) Turning now to the reader's processing:
(65) The processes described below are performed by the reader.
(66) Thus, in a process 64, the reader determines whether it has received k coded elements for the tag most recently received. If not, then it must await the arrival of further tagged, coded elements, and returns to the process 63. However, if it has received k coded elements, it has enough information to decode the value v. To permit state variables and their associated resources to be released at the earliest possible moment, in process 65 the reader completes the read by using “md-meta-send” to inform the DSS that the read is complete. This mechanism cooperates with the “md-meta-send” subprocess 53 described above.
(68) In process 72, the storage node determines whether its stored value is at least as recent as (if not newer than) the requested version in the received tag t. If not, no further action needs to be taken. However, if the stored value is at least as recent as the requested version, then its coded element is sent to the reader.
(69) Thus, in process 73 the storage node retrieves the coded element from the storage device 56. Then, in process 74 the storage node tags the coded element with its own highest version, and sends the tagged, coded element to the requesting reader. Finally, in process 75 the storage node informs the other storage nodes that it has relayed a tagged, coded element to the reader. The process 75 permits the other storage nodes to clean up their state variables absent reception of a “read-complete” metadata message from the reader in process 65 described above.
(71) In a first process 81, a writer (such as a writer 12 described above) or a reader (such as a reader 13) queries each storage node i in the data storage network for its locally stored tag.
(72) The writer or reader continues in process 84, in which it receives the tag t.sub.i from storage node i. However, as concurrent writes of the value v may be ongoing, the tag t.sub.i need not reflect the most recent write operation. To guard against this condition, the writer or reader waits to hear from a sufficient number of the storage nodes in the data storage network, as indicated by decision block 85.
(73) Thus, processing block 84 and decision block 85 implement a loop in which the writer or reader awaits a response from a number of storage nodes sufficient to determine which tag to use in further processing. In one embodiment, a sufficient number of storage nodes corresponds to a majority of the storage nodes. Thus, in the case where there are 100 storage nodes, a sufficient number would be 51 storage nodes. In another embodiment, a sufficient number of storage nodes corresponds to the number k of coded values required to recover a value stored according to an [n, k] MDS encoding, which may be much more than the majority. Thus, in the case where up to 10 storage nodes are allowed to fail and k is chosen to be 100−10=90, a sufficient number would be 90 storage nodes. Once the writer or reader has received tags from a sufficient number of storage nodes, the writer or reader selects the tag which indicates the most recent version of the value (e.g. a highest-ordered tag denoted t.sub.max).
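The two "sufficient number" choices above reduce to simple arithmetic, sketched here for concreteness (illustrative only; function names are not from the patent):

```python
# Illustrative sketch: a "sufficient number" of responding storage nodes is
# either a simple majority, or the k of an [n, k] MDS encoding, per the text.

def majority(n):
    """Smallest majority of n nodes."""
    return n // 2 + 1

n, f = 100, 10
k = n - f
assert majority(n) == 51   # the majority quorum from the example above
assert k == 90             # the MDS quorum, much larger than a majority
```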
(74) Next is explained a modification of the SODA algorithm that handles the case where some of the non-faulty servers can return erroneous coded elements during a read operation. Here the parameter k is chosen as k=n−f−2e. The encoding and distribution of n coded elements among the n servers remain the same as above. While decoding, any f missing coded elements, as well as e erroneous coded elements among the remaining elements, must be tolerated.
(75) For example, assume that coded elements c.sub.1, . . . , c.sub.n-f are available to the decoder (the servers which store the remaining coded elements might have crashed), that e out of these n−f elements are erroneous, and that the decoder does not know the error locations. It is well known that [n,k] MDS codes can tolerate any pattern of f erasures and e errors if k=n−f−2e. Use ϕ.sub.err.sup.−1 to denote the decoder used to recover the value v; in this example, v=ϕ.sub.err.sup.−1({c.sub.1, . . . , c.sub.n-f}). Once again, it is assumed that the decoder is aware of the index set I corresponding to the n−f=k+2e coded elements that are being used in the decoder.
(76) Two modifications are needed to SODA to implement these features. First, during the read-value phase initiated by the reader, any reader must wait until it accumulates k+2e coded elements corresponding to a tag before it can decode. Recall that the SODA algorithm only requires k coded elements before the reader can decode. Also note that the decoder ϕ.sub.err.sup.−1 for the modified algorithm is different from that used for SODA, since it must accept k+2e coded elements, of which e elements are possibly erroneous. Second, the server-side bookkeeping changes correspondingly: a reader is unregistered only once k+2e distinct coded elements, rather than k, are known to have been sent for a suitable tag.
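The modified parameter choice can be checked with a short arithmetic sketch (illustrative only; the function name is expository):

```python
# Illustrative sketch: with up to f erasures and e errors, the code dimension
# becomes k = n - f - 2e, and a reader must gather k + 2e = n - f coded
# elements before it can decode, per the two modifications described above.

def error_tolerant_k(n, f, e):
    """Code dimension k for an [n, k] MDS code tolerating f erasures, e errors."""
    k = n - f - 2 * e
    if k <= 0:
        raise ValueError("n too small for the given f and e")
    return k

n, f, e = 10, 2, 1
k = error_tolerant_k(n, f, e)
assert k == 6
assert k + 2 * e == n - f   # the reader's accumulation threshold
```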
(77) Now the message-disperse (MD) services that are used to disseminate messages in SODA are described in connection with the corresponding figures.
(78) The services are provided in terms of the
(79) Data Types and State Variables: In an IO Automata specification of
(81) Transitions: In
(82) Explanation of the Protocol: The basic idea of the
(83) Next,
(85) The method continues in a process 92, in which each non-faulty one of the first f+1 servers receives the tagged value message.
(86) A further process 93 in server i determines whether server i is receiving the message m for the first time. If not, then server i takes as given that it already has processed the message m, so no further processing is required, and the method terminates, as indicated. However, if message m is arriving at server i for the first time, the method proceeds to a process 94.
(87) The process 94 ensures that the message eventually will be processed by every one of the non-faulty servers in the plurality of n servers, by sending the identical message m to each server in the first f+1 servers. It can be shown that including this process 94 in the method results in communication overhead on the order of f.sup.2, which advantageously is independent of the number n of servers in the plurality. Moreover, from the precondition that no more than f servers have failed, it can be shown that including the process 94 in the method results in every non-failed server processing the message m, even if the sender in process 91 fails after contacting only a single such non-failed server.
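A hypothetical sketch of this first phase, combining the first-receipt check of process 93 with the echo of process 94; the `Network` class standing in for the transport, and all identifiers, are illustrative assumptions.

```python
class Network:
    """In-memory stand-in for the transport between servers."""
    def __init__(self):
        self.servers = {}
        self.messages_sent = 0

    def send(self, dst, msg):
        self.messages_sent += 1
        self.servers[dst].receive(msg)

class MDServer:
    """Server i: on FIRST receipt of m, if i <= f+1, echo m to every
    server in the first f+1 (process 94), then process m locally."""
    def __init__(self, index, f, network):
        self.index, self.f = index, f
        self.network = network
        self.seen = set()

    def receive(self, m):
        if m in self.seen:                 # process 93: already handled
            return
        self.seen.add(m)
        if self.index <= self.f + 1:       # member of the relay group
            for j in range(1, self.f + 2):
                self.network.send(j, m)
        self.process(m)

    def process(self, m):
        pass                               # local delivery / storage
```

Simulating n = 6 servers with f = 2 and delivering m to server 1 produces exactly (f+1).sup.2 = 9 messages, regardless of n, matching the overhead bound stated above.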
(88) The method of
(89) Since at least one of the first f+1 servers is guaranteed to perform the process 95, each of the remaining n−(f+1) servers, say server j, is guaranteed to receive the message (t, c.sub.j) in process 97. As above, it should be appreciated that processes 97-99 are performed by each of the last n−(f+1) servers, generally concurrently, and are shown only with respect to a single server j. To ensure that each server in this latter subset does not process the message multiple times, a process 98 determines whether the server j is receiving the message (t, c.sub.j) for the first time. As above, if not, then server j takes as given that it already has processed the message, so no further processing is required, and the method terminates, as indicated. Otherwise, the server j delivers the message (t, c.sub.j) to itself for local storage processing, which is described above in connection with the corresponding figure.
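The second phase described above can be sketched in the same style: a relay-group server sends to every server j outside the group that server's own coded element (t, c.sub.j), and each receiving server delivers the pair to itself only on first receipt, so duplicates from multiple relays are harmless. Function and class names are illustrative assumptions.

```python
def relay_coded_elements(t, coded, f, send):
    """Process 95, run by a relay-group server: coded[j-1] is c_j.
    Sends (t, c_j) to each server j in the last n - (f+1) servers."""
    n = len(coded)
    for j in range(f + 2, n + 1):          # 1-based server indices
        send(j, (t, coded[j - 1]))

class TailServer:
    """One of the last n - (f+1) servers: processes 97-99."""
    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, msg):
        if msg in self.seen:               # process 98: not first receipt
            return
        self.seen.add(msg)
        self.delivered.append(msg)         # process 99: local storage
```

Even if every one of the f+1 relay servers repeats the send, each tail server delivers its coded element exactly once.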
(91) It should be appreciated that, while the method of
(92) The method of
(93) The method of
(95) Thus, in a first process 111, a first storage server receives metadata from a reader or another storage server. Next, in a process 112, the first storage server discriminates between types of messages. In connection with the algorithms and processes described above, if the message is a read-complete message from a reader, then the storage server must unregister the read operation, as described below in connection with process 114. Otherwise, the message is a read-disperse message from another server.
(96) As is known in the art, if a value v is stored using an [n, k] MDS code, then a reader may decode the value from any k different coded elements. Moreover, as disclosed above in processes 53 and 75 in connection with various embodiments, each time a tagged, coded element is delivered to a reader, such delivery is communicated in a crash fault-tolerant manner to each data storage server in the data storage system. Using this mechanism, once any given storage server receives an indication that k different storage servers (counting itself) have communicated to a particular reader their respective coded elements for a given value v and a given tag t, the given storage server can take as fact that the associated reader has sufficient data to decode the value v, and thus may delete its own state information related to such decoding and unregister the read operation. Such an indication is determined in process 113.
(97) The storage server unregisters the read operation in process 114. In connection with the state variables described above, this process 114 may include, for example, deleting all tuples (t, s, r), where r identifies the reader in question. Alternately, if no such tuples exist, then the process 114 may include storing a tuple (t0, s, r), where t0 is an empty tag, to record that a read operation completed for the reader with respect to the stored value v.
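Processes 113 and 114 can be sketched as a small registry: the server tracks, per (tag, reader) pair, which servers have reported sending their coded element to that reader, and unregisters the read once k distinct senders have been counted. The class and method names are assumptions for illustration; the tuple layout follows the (t, s, r) state variables described above.

```python
class ReadRegistry:
    """Per-server bookkeeping for in-progress reads."""

    def __init__(self, k):
        self.k = k
        self.tuples = {}                   # (t, r) -> set of server ids s

    def on_read_disperse(self, t, s, r):
        """Process 113: record that server s sent its coded element for
        tag t to reader r; once k distinct senders are counted, reader r
        can decode, so the read is unregistered."""
        senders = self.tuples.setdefault((t, r), set())
        senders.add(s)
        if len(senders) >= self.k:
            self.unregister(t, r)

    def unregister(self, t, r):
        """Process 114: delete all tuples (t, s, r) for this reader.
        (The alternate branch above -- recording an empty-tag tuple
        (t0, s, r) when none exist -- is omitted from this sketch.)"""
        self.tuples.pop((t, r), None)
```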
(98) Finally, having described above the functions performed by various embodiments, the components required inside a data storage server are described with reference to the corresponding figure.
(99) During a write operation (e.g. as described above in conjunction with the corresponding figures), the writer interface 122 of storage node 120 receives the tag t and the value v from the writer.
(100) A coding processor 126 also receives the value v from the writer interface 122. Coding processor 126 generates coded elements according to the MDS encoding, and forwards them to other storage nodes, as illustrated in the corresponding figure.
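A hypothetical sketch of this write path through the node's components (tag memory, coded element memory, coding processor) follows. The injected `encode` callable stands in for the [n, k] MDS encoder, `forward` for the inter-node transport, and the newest-tag-wins rule reflects the total order on tags; all identifiers are illustrative assumptions rather than the patented structure itself.

```python
class StorageNode:
    def __init__(self, index, n, k, encode, forward):
        self.index, self.n, self.k = index, n, k
        self.encode = encode               # stands in for the MDS encoder
        self.forward = forward             # forward(j, msg): send to node j
        self.tag = None                    # tag memory
        self.element = None                # coded element memory

    def on_write(self, t, v):
        coded = self.encode(v, self.n, self.k)   # c_1..c_n
        if self.tag is None or t > self.tag:     # keep newest version only
            self.tag, self.element = t, coded[self.index - 1]
        for j in range(1, self.n + 1):           # disperse to peer nodes
            if j != self.index:
                self.forward(j, (t, coded[j - 1]))
```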
(101) During a read operation (e.g. as described above in conjunction with the corresponding figures), the reader interface 121 of storage node 120 receives messages from a reader.
(102) In response to receiving a tag query, storage node 120 (like every non-faulty node) sends the highest-ordered tag from tag memory 125 to the reader via reader interface 121. Subsequently, the reader registers itself in the reader registry 128, as explained in connection with process 71 described above.
(103) Having described preferred embodiments which serve to illustrate various concepts, systems, circuits and techniques, which are the subject of this patent, it will now become apparent to those of ordinary skill in the art that other embodiments incorporating these concepts, systems, circuits and techniques may be used. For example, it should be noted that individual concepts, features (or elements) and techniques of different embodiments described herein may be combined to form other embodiments not specifically set forth above. Furthermore, various concepts, features (or elements) and techniques, which are described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination. It is thus expected that other embodiments not specifically described herein are also within the scope of the following claims.
(104) In addition, it is intended that the scope of the present claims include all other foreseeable equivalents to the elements and structures as described herein and with reference to the drawing figures. Accordingly, the subject matter sought to be protected herein is to be limited only by the scope of the claims and their equivalents.
(105) It should thus be appreciated that elements of different embodiments described herein may be combined to form other embodiments which may not be specifically set forth herein. Various elements, which are described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination. Other embodiments not specifically described herein are also within the scope of the following claims.
(106) It is felt, therefore, that the concepts, systems, circuits and techniques described herein should not be limited by the above description, but only as defined by the spirit and scope of the following claims which encompass, within their scope, all such changes and modifications.
(107) All publications and references cited herein are expressly incorporated herein by reference in their entirety.