Computer-implemented method, computer program and data processing system

Abstract

A computer-implemented method for the random-based leader election in a distributed network of data processing devices, said distributed network including a plurality of identified asynchronous processes, wherein all said identified processes or a subset thereof are running processes participating in the leader election, including the following steps: a) a random information is generated by each running process and shared with the other running processes, so that each running process maintains a set of said random information, b) a distributed random information is calculated by each running process from the set of random information by applying a first shared transformation function, so that the same distributed random information is made available to each running process, c) a designator of a single one of said running processes is calculated from the distributed random information by means of a second shared transformation function, d) said designator is used to elect a leader amongst said running processes.

Claims

1. A computer-implemented method for a random-based leader election in a distributed network of data processing devices, said distributed network comprising a plurality of identified asynchronous processes, wherein all said identified processes or a subset thereof are running processes participating in the leader election, said method comprising the following steps: a) a random information is generated by each running process and shared with the other running processes, so that each running process maintains a set of said random information, b) a distributed random information is calculated by each running process from the set of random information by applying a first shared transformation function, so that the same distributed random information is made available to each running process, c) a designator of a single one of said running processes is calculated from the distributed random information by means of a second shared transformation function, d) said designator is used to elect a leader amongst said running processes.

2. The method according to claim 1, wherein the sequence of steps a)-d) of claim 1 is repeated at regular or irregular intervals in order to randomly change the elected leader.

3. The method according to claim 1, wherein each running process maintains a sorted set of all running processes and a total number of running processes.

4. The method according to claim 3, wherein the second shared transformation function is defined as
m=R (mod k), wherein m is the designator of the elected leader, R is the distributed random information, k is the total number of running processes, and mod is the modulo operation, wherein the elected leader is elected by selecting the running process that corresponds to the m.sup.th element in said sorted set of running processes.

5. The method according to claim 3, wherein each running process maintains information on the total number of all identified processes and verifies if the total number of running processes corresponds to a predefined quorum of the total number of all identified processes, wherein at least one of steps b), c) or d) is performed only if the quorum is fulfilled.

6. The method according to claim 1, wherein the first shared transformation function is
R=Π.sub.i=1.sup.nr.sub.i(mod o), wherein R is the distributed random information, r is the random information, mod is the modulo operation, and is a Mersenne prime defined as o=2.sup.n−1.

7. The method according to claim 3, wherein the sorted set of running processes is updated to include a process joining the group of running processes, wherein each running process, including the joining process, in step a) of claim 1, is sharing the sorted set of all running processes with the other processes and the sorted set maintained in each running process is merged with the shared sorted set.

8. The method according to claim 3, wherein the sorted set of running processes is updated to remove a process leaving the group of running processes, wherein the leaving process is sending a leave message comprising a process identifier to the other running processes, and the leaving process is removed from the sorted set of running processes.

9. The method according to claim 3, wherein the sorted set of running processes is updated to remove a failing running process, wherein each process identifies that it has not received any random information being shared by the failing process, each process sends a fail message to all remaining running processes inquiring whether the failing process has been identified in the remaining running processes, and removing the failing process from the sorted set of running processes upon receipt of a confirmation messages from all remaining running processes.

10. The method according to claim 1, wherein the sharing of random information in step a) of claim 1 comprises the steps of: each running process submitting its random information to a total order broadcast system, the total order broadcast system broadcasting the random information received from all running processes to each running process in the same order.

11. The method according to claim 3, wherein the sharing of random information in step a) of claim 1 comprises the steps of: each running process assigning a generator round identifier to the generated random information so as to obtain a tuple each consisting of a random information and a generator round identifier, each running process sending the tuple to all other running processes, each running process collecting tuples received from the other running processes, so as to obtain said set of random information, which takes the form of a collection of tuples consisting of tuples having the same generator round identifier, comparing the number of tuples in said collection of tuples with the total number of running processes; and step b) of claim 1 is initiated if the number of tuples in the local collection is equal to the total number of running processes.

12. The method according to claim 3, wherein the sharing of random information in step a) of claim 1 comprises the steps of: each running process assigning a generator round identifier to each generated random information so as to obtain tuples each consisting of a random information and a generator round identifier, each running process sending the tuples directly to all other running processes, each running process collecting tuples received from the other running processes, so as to obtain sets of random information, which take the form of collections of tuples, each collection consisting of tuples having the same generator round identifier, a generator round is being marked as locally complete if the number of tuples in a collection of tuples is equal to the total number of running processes; and step b) of claim 1 is initiated with regard to the completed generator round.

13. The method according to claim 12, wherein a maximum number of collections of tuples is defined and a collection of tuples is deleted if the maximum number of collections is exceeded.

14. The method according to claim 1, wherein a plurality of concurrent overlapping random-based leader election rounds, each round being defined by the sequence of steps a) to d) of claim 1, is performed, wherein a plurality of random-based leader election rounds are maintained by each said running process and a random information is generated over a field of Mersenne prime order by each running process for each particular round exchanged with the other running processes, so that each running process maintains a collection of said random information per concurrent random-based leader election round, each process upon receiving a random information determines a local completeness of random information received from other running processes for a particular random-based leader election round, and a distributed random information is calculated over the multiplicative field of Mersenne prime order by said running process from the locally complete collection of random information of a particular random-based leader election round, so that the same distributed random information is made available to each running process, each process upon receiving a plurality of random information belonging to other random-based leader election rounds than the current locally complete round derives from said plurality of random information the global completeness of said locally complete random-based leader election round and calculates a designator of a single one of said running processes from the distributed random information of such said locally complete random-based leader election round, a leader is elected among said running processes concurrently for each globally complete random-based leader election round based on said designator.

15. The method according to claim 14, wherein each process determinates a state of global completeness across all known running processes for each generator round, comprising: each running process collecting information received from the other running processes for a plurality of concurrent election rounds and such information to be received in any arbitrary order, so as to obtain collections of random information, which take the form of collections of tuples, each collection consisting of tuples having the same generator round identifier, each running process considering a generator round to be in a state of global completeness if any number of tuples received for a plurality of subsequent generator rounds results in the receiving process determining the number of tuples received for some subsequent generator round exceeding a predefined minimum quorum for said subsequent generator round and said quorum to be greater than half of the number of running processes; and step b) of claim 1 is initiated with regard to the completed generator round.

16. A method of randomly assigning a data processing task to one out of a plurality of processes running in a distributed network of data processing devices, the method comprising sending a task request to the plurality of running processes, electing a leader among said running processes by means of the method according to claim 1, and assigning the task to the leader.

17. A computer program product comprising instructions which, when the program is executed by data processing devices, such as computers, arranged in a distributed network cause the data processing devices to carry out the method of claim 1.

18. A data processing system comprising a plurality of data processing devices comprising means for carrying out the method of claim 1.

19. The method according to claim 4, wherein the second shared transformation function is defined as
m=R (mod k)+1, wherein m is the designator of the elected leader, R is the distributed random information, k is the total number of running processes, and mod is the modulo operation, wherein the elected leader is elected by selecting the running process that corresponds to the m.sup.th element in said sorted set of running processes.

20. The method according to claim 6, wherein the first shared transformation function is
R=Π.sub.i=1.sup.nr.sub.i(mod o), wherein R is the distributed random information, r is the random information, mod is the modulo operation, and is a Mersenne prime defined as o=2.sup.n−1, with n being ≥31.

Description

DETAILED DESCRIPTION OF THE INVENTION

(1) In the following, the invention will be described in more detail by reference to specific preferred embodiments of the invention.

(2) 1. Generating a Distributed Random Number

(3) The following describes a non-interactive protocol used by n processes in a distributed network to generate a distributed random number R in one round. At periodic intervals each process generates a random number r and sends it to all other processes. Each process collects the random numbers from all other processes and, once the collection is complete for a given round, generates a composite number R under a first transformation function. This sending and collecting is called a leader election round. Since the reception of random numbers r for a particular leader election round can occur at different times and out of order and for different leader election rounds concurrently, every process implements a vector {right arrow over (V)} of size σ to support the processing of a concurrent rounds.

(4) FIG. 1 illustrates a set of identified processes P={p.sub.1, p.sub.2, . . . p.sub.n} in a distributed network, such that the number of processes n=|P|, wherein λ is the maximum network latency in milliseconds amongst all processes ∈P. Let ϕ be the frequency of leader election rounds, i.e. how frequently per second a new leader election round is started, such that ϕ>>λ.

(5) Let e∈ custom character be an integer exponent of 2, with e≥31, and identical for all p∈P. Let .sub.o.sup.+ be a finite field of prime order o, such that o is the largest prime in 2.sup.e. For example 65521 is the largest prime in 2.sup.16 and .sub.o.sup.+ would contain the integers {0, 1, 2, . . . 65520}. Ideally, the use of a Mersenne prime of the form M.sub.e=2.sup.e−1 is preferred to eliminate the overshoot of randoms in custom character .sub.o.sup.+ beyond o, which otherwise would create a bias towards some first elements in .sub.o.sup.+ at a probability or

(6) $\frac{e - o}{2^{e}} .$
Preferred Mersenne primes for use with the invention are 2.sup.31−1 and 2.sup.61−1. Let r be a random number generated over n, in the interval [2, o−1]. Because of the transformation function outlined below, elements 0 and 1 must be excluded.

(7) Let f.sub.R (r.sub.1, r.sub.2, . . . r.sub.n).fwdarw.R be a first transformation function taking random numbers r to produce a distributed random number R. According to a preferred embodiment of the invention, the first transformation function is based on multiplications within the finite field custom character .sub.o.sup.+ as (a.Math.b).sub.o=(a.Math.b) (mod o) and defined as:

(8) $f_{R} (r_{1}, r_{2}, .Math. r_{n}) = {.Math.}_{i = 1}^{n} r_{i} (\mod o)$

(9) In the following example, the first transformation function is used to transform 3 random numbers r.sub.1, r.sub.2 and r.sub.3 into a distributed random number R using finite field multiplication. Assuming r.sub.1=58042, r.sub.2=41007, r.sub.3=27559, o=65521 the calculation is done as follows: 1. r.sub.1.Math.r.sub.2 (mod m)=58042.Math.41007 (mod m)=12448.sub.65521 2. 12448.Math.r.sub.3 (mod m)=12448.Math.27559 (mod m)=51997.sub.65521 3. The calculated distributed random number is R=51997

(10) FIG. 2 shows the entropy of the transformation function in 100 million rounds of 5 random numbers r using the above first transformation function f.sub.R over a finite field of Mersenne prime order custom character in a histogram of 10,000 buckets. The 500 million random numbers were generated using the Mersenne Twister 19937 generator.

(11) Each generator round is defined as the exchange between processes of random numbers r attributable to the same leader election round. Leader election rounds are numbered and denoted by g, where g∈ custom character , and start at 1 for each process on distributed network bootstrap.

(12) Let g.sub.i be the current generator round at process p.sub.i. At short periodic intervals (preferably every 1 second) each process p.sub.i∈P starts a new generator round by incrementing its counter g.sub.i by one and generating a random number r.sub.i in custom character .sub.o.sup.+. The tuple (g.sub.ir.sub.i) is then immediately broadcast to all the other processes in P\{p.sub.i}. To denote the strict consecutive order, in which random numbers must be generated, r.sub.i′ is defined to be the random number in .sub.o.sup.+ generated by p.sub.i in the previous round g.sub.i′=g.sub.i−1.Math.r.sub.i′ shall strictly precede r.sub.i, which is denoted as r.sub.i′ custom character r.sub.i.

(13) Let C.sub.9 be a collection of tuples (g.sub.n, r.sub.n) within a process p∈P for its round g, where the tuple (g.sub.n, r.sub.n) represents the random r.sub.n created by process p.sub.n in round g.sub.n as received by process p.sub.i. C.sub.g may or may not exist for a round g within a process p. It follows that C.sub.g comes into existence for round g on process p.sub.i when either a) p.sub.i generates a random r.sub.i for round g and adds the tuple (g.sub.i, r.sub.i) to its C.sub.g, or b) the process p.sub.i receives the tuple (g.sub.n, r.sub.n) from process p.sub.n and adds it to its C.sub.g|g=g.sub.n.

(14) As an example, FIG. 3 shows the generator round 7 comprising 3 processes sending tuples to each other in a distributed network, where every process p.sub.i generates its own random number r and receives tuples (g.sub.n, r.sub.n) from other processes p.sub.n.

(15) Let {right arrow over (V)} be a vector of collections C created by a process so that C.sub.n is the element at position n in {right arrow over (V)}. Let custom character be the maximum size of that vector so that ≥|{right arrow over (V)}|. On process p.sub.i the tuple (g.sub.i, r.sub.i) is the random number r.sub.i generated by the local process p.sub.i for round g.sub.i and stored within C.sub.gi at position k within {right arrow over (V)}. The order of random numbers generated is denoted as:
r.sub.i-1∈C.sub.i-1 custom character r.sub.i∈C.sub.i∀i∈{1,2, . . . m}, m=|{right arrow over (V)}|.

(16) The following example shows a process' collection vector {right arrow over (V)} over 7 rounds:

(17) $\overset{\overset{.fwdarw.}{V}}{(\begin{matrix} C_{1} {1, 872}_{1} & {1, 283}_{2} & {1, 924}_{3} \\ C_{2} {2, 276}_{1} & {2, 982}_{2} & {2, 124}_{3} \\ C_{3} {3, 842}_{1} & {3, 294}_{2} & {3, 628}_{3} \\ C_{4} {4, 824}_{1} & {4, 877}_{2} & {4, 482}_{3} \\ C_{5} {5, 926}_{1} & {}_{2} & {}_{3} \\ C_{6} {6, 436}_{1} & {6, 614}_{2} & {6, 944}_{3} \\ C_{7} {}_{1} & {7, 193}_{2} & {}_{3} \end{matrix})}$ (1) Process p.sub.1 maintains a local vector {right arrow over (V)} comprising collections C.sub.1, C.sub.2, . . . C.sub.n. (2) C.sub.5 for round 5 on p.sub.1 comes into existence when p.sub.1 creates its random for that round; at this time the collection does not yet contain any other tuples received by p.sub.1. (3) If a process p.sub.2 generates a random for round 5 and sends the tuple (g.sub.5, r.sub.5)=(5, 719) to process p.sub.1, process p.sub.1, will add it to its C.sub.5. (4) C, comes into existence on p.sub.1, when it receives the tuple (7,193) from p.sub.2. It is to be noted that at this time p.sub.1 has not yet generated its own random for round 7.

(18) As custom character defines the maximum size of {right arrow over (V)}, when a process adds a set for a new round to its vector and |{right arrow over (V)}|≥, it first needs to remove the oldest round(s) to achieve |{right arrow over (V)}|=−1 before adding the new round.

(19) As P is a statically defined set of all identified processes in the distributed network, each process also maintains a sorted set (vector) {right arrow over (K)} of known (running) processes ∈P.

(20) Therefore, a generator round g is considered to be locally complete for a process p.sub.i, when |C.sub.g|=|{right arrow over (K)}|, as in the example shown above for the rounds C.sub.1, C.sub.2, C.sub.3, C.sub.4, C.sub.6.

(21) C.sub.g.sup.p is defined as a collection C for round g on process p. Further, C.sub.g.sup.q is defined as a collection C for round g on process q. Equality for tuples is defined as (g.sub.p,r.sub.p)=(g.sub.q, r.sub.q)|g.sub.p=g.sub.q∧r.sub.p=r.sub.q. The collection C.sub.g.sup.p for round g on process p is congruent to collection C.sub.g.sup.q for round g on process q when all tuples match:
C.sub.g.sup.p≡C.sub.g.sup.q|(g.sub.k.sup.p,r.sub.k.sup.p)=(g.sub.k.sup.q,r.sub.k.sup.q)∀(g.sub.k.sup.p,r.sub.k.sup.p)∈(g.sub.k.sup.q,r.sub.k.sup.q)∈C.sub.g.sup.p,
k∈{1,2, . . . i}, i=|{right arrow over (K)}|.

(22) Finally, a generator round g is defined to be globally complete, when the C.sub.g on all processes p.sub.n are congruent for a round g, that is:
C.sub.g.sup.k≡C.sub.g.sup.l|p.sub.k≠p.sub.l∀k,l∈{1,2, . . . i},i=|{right arrow over (K)}|.

(23) Any time a collection C.sub.g for a process p becomes locally complete, the process calculates the distributed random number R.sub.g by applying the first transformation function previously defined:
f.sub.R(r.sub.1,r.sub.2, . . . r.sub.n).fwdarw.R.sub.g|r.sub.i∈C.sub.g∀i∈{1,2, . . . i},i=|{right arrow over (K)}|.

(24) Given the ordering of round numbers, it follows that:
R.sub.x custom character R.sub.y|x<y∀x,y∈{1,2, . . . m}|m=|{right arrow over (V)}|.

(25) 2. Bootstrapping and Quorum

(26) In the following a preferred method will be described how to bootstrap a distributed network of processes so that they find a quorum, and how processes come to consider a leader election round locally complete in order to commence normal operation. As before, a static complete set of identified processes is denoted as P.

(27) Let q be the minimum quorum necessary to run a leader election round, and let n=|P| and P.sub.q⊂P be a quorum subset of P, where |P.sub.q|=q, q<|P|. Further, P.sub.0 is defined as the remainder subset P.sub.0=P\P.sub.q of processes joining the distributed network at a later time, and constrain P.sub.0 with P.sub.q∩P.sub.o=Ø.

(28) The example shown in FIG. 4 Illustrates a quorum and 2 remainder subsets. FIG. 4 illustrates a total of 5 processes |P|=n=5. The quorum q is defined as a simple majority of 3 out of 5 processes (inner circle). The remainder sets define the two processes p.sub.4 and p.sub.5 joining later. The outer circle Illustrates a future leaver set comprised of one process p.sub.5 that might leave the set of processes later on.

(29) Within the context of the invention, different types of quorum may be applied. A simple majority quorum is defined as

(30) $q_{simple} = \frac{n}{2} + 1.$
However, in order to achieve byzantine fault tolerance, a byzantine majority quorum may be used instead of a simple majority quorum. In this connection, f shall be the maximum number of tolerated faulty nodes and the relationship between n and f shall be constrained to be n=3f+1. A byzantine majority quorum is derived by using the simple majority formula and ignoring f faulty processes, and define:

(31) $q_{byzantine} = \frac{n}{2} + f + 1$

(32) Example calculation for number of nodes and their quorum types, for the tuples: (f, n, q.sub.simple, q.sub.byzantine) (1, 4, 3, 4) (2, 7, 4, 6) (3, 10, 6, 9) (4, 13, 7, 11) (5, 16, 9, 14) (6, 19, 10, 16) (7, 22, 12, 19) (8, 25, 13, 21) (9, 28, 15, 24)

(33) For each process the following states can be defined: joining, running, leaving, failing. When a process first starts, its state is “joining”. It initializes its current round to g=1 and commences creating and sending out tuples (g,r) to other processes ∈P. Such other processes may exist, and if they exist they eventually receive the tuple sent.

(34) When a process p.sub.i in state “joining” receives a tuple (g.sub.k, r.sub.k).sub.m from process p.sub.m for round k, and g.sub.k>g.sub.i, where g.sub.i is the process' current round, the process adopts the higher round number, adds (g.sub.k,r.sub.k).sub.m to its collection C.sub.k, immediately generates a random for round k and sends its tuple (g.sub.k,r.sub.k).sub.i to all other processes ∈P. C.sub.k fulfils the quorum at process p.sub.i, when |C.sub.k|≥q, and in such case the process switches to state “running”.

(35) FIG. 5 is an illustration of the process p.sub.1 joining the quorum subset ⊂P, where q=3. (1) Process p.sub.1 starts in “joining” state. (2) p.sub.1 commences to send out tuples, starting from round r=1. (3) p.sub.1 receives a tuple from p.sub.2 for round 14 and adds it to its local C.sub.14. (4) Since this round 14 is further ahead than its local round, p.sub.1 adopts this round, generates a random for it and sends the tuple to other processes ∈P (which might not be running at that time). It can also safely abandon all previously created rounds C.sub.k|k<14. (5) Now p.sub.1 receives tuples for round 17 from p.sub.2 and p.sub.3. It can be concluded that p.sub.2 and p.sub.3 somehow exchanged tuples in parallel and bootstrapped up to round 17 as a set of two without reaching quorum whilst p.sub.1 was still busy starting up. (6) p.sub.1 generates a random for it and sends the tuple to other processes E P of which it has p.sub.2 and p.sub.3 in its vector {right arrow over (K)} already. (7) After p.sub.1 adds its tuple for round 17 to its collection for that round, then C.sub.17 becomes locally complete. p.sub.1 can also rightfully assume that p.sub.2 and p.sub.3 will shortly be able to add its tuple to their C.sub.17 causing local completeness there as well. (8) Having determined its first locally complete round, p.sub.1 switches to state “running” and continues normal operation. It can also be concluded that p.sub.2 and p.sub.3 will also switch to state “running” had they not been in that state already. This will result in a state of global completeness for the distributed network, so it can be considered up and ready for client requests.

(36) 3. Joining a Process

(37) The method for joining a process p.sub.new to the local sorted vectors of known (running) processes {right arrow over (K)} in a distributed network is similar to the bootstrap process described above. As p.sub.new joins at some later point in time, |{right arrow over (K)}| will have at least the minimum quorum size defined for the network at the other running processes. When the new process joins, it will be added to the vector of known processes {right arrow over (K)}′={right arrow over (K)}∪{p.sub.new}. To enable the addition and subtraction of processes from the respective vectors {right arrow over (K)} at each process, when sending tuples each process also sends along its vector {right arrow over (K)} of all known processes. Whenever a new process starts and commences sending its tuples, it will also start receiving tuples from other processes and build its {right arrow over (K)} by merging its local sorted vector with the one received {right arrow over (K)}′.sub.local={right arrow over (K)}.sub.local∪{right arrow over (K)}.sub.received. The other processes will add this new process to their {right arrow over (K)} and each process in the distributed network will from the following round on base its considerations on {right arrow over (K)}′.

(38) As an example, FIG. 6 is an illustration of processes p.sub.1 joining p.sub.2 and p.sub.3, then p.sub.4 joining the three. (1) Processes p.sub.2 and p.sub.3 are the initial processes running in a set of two. The quorum is defined as q=2. They exchange their vectors {right arrow over (K)}={2,3} with each other. The current set size is |{right arrow over (K)}|=2. (2) Process p.sub.1 starts and joins the distributed network. (3) p.sub.1 sends its vector {right arrow over (K)}={1} along with its tuples to other, yet unknown (running or not), processes in the network. (4) p.sub.2 learns about the existence of p.sub.1 and adds it to its {right arrow over (K)}. It sends the new vector in the next tuple message. (5) p.sub.1 hears back from p.sub.2 and merges its local vector with the vector that p.sub.2 sent along. (6) p.sub.3 receives a tuple from p.sub.1 and merges its local vector with the vector that p.sub.1 sent along. (7) In all messages the processes now have the new vector {right arrow over (K)}′={1,2,3} included. The whole distributed network now has |{right arrow over (K)}′|=3 as criterion for local completeness. (8) Process p.sub.4 starts and sends its vector {right arrow over (K)}={4} along in its messages to other, yet unknown, processes in the distributed network. (9) p.sub.3 learns of p.sub.4's existence and merges the vector sent along with its own vector {right arrow over (K)}′={1,2,3}∪{4}. (10) It sends along this updated vector in its next tuple message so that processes p.sub.1 and p.sub.2 will merge their vectors as well. (11) Processes p.sub.1, p.sub.2, p.sub.3 now recognise p.sub.4 as new member in P and use their updated vector from the next tuple message on. (12) p.sub.4 finds the complete vector in the messages received and merges with its local vector. Now all processes know that the distributed network has grown to 4 members and use their updated {right arrow over (K)}, |{right arrow over (K)}|=4 from now on to trigger local completeness.

(39) 4. Processes Leaving

(40) Processes ∈E P can decide to stop operating at any time. When a process p.sub.s stops, it sends the message leave (p.sub.s,r.sub.s) to all other processes, where r.sub.s is the next round that p.sub.s would normally use to send tuples. p.sub.s also stops sending and receiving tuples.

(41) When a process p.sub.i receives a leave (p.sub.s,r.sub.s) message, it takes p.sub.s out of its local vector {right arrow over (K)} of known running processes and checks whether there is still the minimum number of quorum processes alive. Should the number of remaining running processes be less than the minimum required quorum the process switches back to bootstrap state, but in this case will not reset its round number to 1.

(42) FIG. 7 shows an example of process p.sub.1 leaving the set of p.sub.1, p.sub.2, p.sub.3 (1) Processes p.sub.1, p.sub.2, p.sub.3, p.sub.4, p.sub.5 operate in normal mode “running” at a quorum q=3. (2) The processes are exchanging tuples for 18 rounds, sending along their sorted vector of known running processes {right arrow over (K)}={p.sub.1, p.sub.2, p.sub.3, p.sub.4, p.sub.5}. (3) p.sub.1 decides to stop running. It is currently in round 18, so it sends the message leave (1,19) to all other processes in P. This is the last message p.sub.1 sends to the network. (4) Currently running processes p.sub.2,p.sub.3,p.sub.4,p.sub.5 remove the leaver from their vector {right arrow over (K)}′={right arrow over (K)}\{p.sub.1}. (5) p.sub.2,p.sub.3,p.sub.4,p.sub.5 check whether they still form a quorum. As q=3 they do and so they continue normal operation, now using {right arrow over (K)}={p.sub.2,p.sub.3,p.sub.4,p.sub.5} along with their tuples.

(43) 5. Processes Failing

(44) Processes can be failing without a chance to send a leave message. In this section a preferred method is presented to detect failing processes and remove them from the set of currently running processes, which effects the determination of local completeness for leader election rounds.

(45) Let p.sub.i∈P be a running process gathering tuples received for round k into its local collection C.sub.k. Let {right arrow over (V)} be its vector of collections C and let y define the maximum size of {right arrow over (V)}. A process p.sub.a is considered to be alive for the collection C.sub.k, if (g.sub.k, r.sub.k).sub.a∈C.sub.1, and a process is considered to have failed for C.sub.k otherwise.

(46) Let {right arrow over (A)}.Math.{right arrow over (V)} be the set of collections in {right arrow over (V)} containing tuples from a process p.sub.f. The process p.sub.f is defined to be failing when |{right arrow over (A)}|<< custom character ∧|{right arrow over (V)}|=. One can derive that failure detection can become available once the processes have initially gone through rounds in running state.

(47) When a process p.sub.i detects a failing process p.sub.f at the start of a new round, it sends the message fail(g.sub.i,p.sub.f) to all other processes.

(48) When a process p.sub.j receives fail(g.sub.j,p.sub.f), it checks the condition |{right arrow over (A)}|<< custom character ∧|{right arrow over (V)}|= for his local {right arrow over (V)}. If the condition is true, it sends confirm(g.sub.j,p.sub.f) to all other processes. If the condition is false, it sends alive(g.sub.j,p.sub.f) to all other processes.

(49) When the process p.sub.i receives an alive(g.sub.j,p.sub.f) message from any other process, it continues normal operation and stops issuing fail messages. One can conclude that p.sub.i must have missed some messages from p.sub.f in the past, and its local {right arrow over (K)} or {right arrow over (V)} might not be congruent with the other processes. In particular this can occur when the network partitions, dealt with further below.

(50) A process p.sub.i collects confirm(g.sub.k,p.sub.f) messages for a previously sent fail(p.sub.i,p.sub.f) until it has collected confirmations from all known processes ∈{right arrow over (K)}\{p.sub.f}. It then removes p.sub.f from its {right arrow over (K)}. p.sub.i keeps sending fail(g.sub.j,p.sub.f) in each round until it either receives a alive message or it can eventually remove p.sub.f from its {right arrow over (K)}. After removal of p.sub.f the remaining running processes need to verify if they still form a quorum, or would otherwise revert back to bootstrap state.

(51) FIG. 8 illustrates an example of process p.sub.1 failing, the others detecting it and removing it. (1) Processes p.sub.1, p.sub.2, p.sub.3, p.sub.4, p.sub.5 form a set of running processes in a quorum q=3. (2) The processes exchange tuple messages during normal operation. (3) Process p.sub.1 fails. The other processes stop receiving tuple messages from it. (4) p.sub.2 finds that |{right arrow over (A)}|<< custom character ∧|{right arrow over (V)}|= is true for p.sub.1 and sends fail(23,1) to the other processes (23 is its current round, 1 the process). (5) p.sub.2 hears back confirm(24,1) from p.sub.3 and p.sub.4. It registers the message and can ignore the difference in the round number. (6) p.sub.2 hears back confirm(24,1) from p.sub.5 and eventually it has now gathered all feedback. (7) p.sub.2 now removes p.sub.1 from its sorted vector of running processes {right arrow over (K)}. Given the symmetry of the message passing, it is concluded that the other processes have run through the same workflow. (8) Since p.sub.2,p.sub.3,p.sub.4,p.sub.5 still form a quorum, they can continue in running state.

(52) 6. Network Partitioning

(53) A network partition is defined to be a network split due to some failure of network connections between processes. The following section describes a preferred partition-tolerant behaviour of subnets for the leader election method.

(54) FIG. 9 illustrates an example of a distributed network configured for 5 processes and a quorum q=3. (1) Some components fail so that the communication is partially broken and the network splits into two segments to the left and right sides of the black bar. The dotted lines depict the interrupted communication channels. (2) Processes p.sub.1, P.sub.4, p.sub.s are in the first subnet. They remain in state “running” as they form a quorum of 3. (3) Processes p.sub.2, p.sub.3 form the second subnet. Since they cannot form a quorum they change their state to “bootstrap”.

(55) Let P be the set of all processes and q a single quorum, as used before. Let P.sub.1 be the set of processes in subnet 1, and P.sub.2 the set of processes in subnet 2. A network partition is defined as forming exclusive subsets P.sub.k⊂P, so that P.sub.i∩P.sub.j=Ø|i≠j∀i,j∈{1, 2, . . . k}.

(56) Let n.sub.1=|P.sub.1| and n.sub.2=|P.sub.2| of a partitioned distributed network of 2 segments. The single majority to form a quorum is

(57) $q = \frac{n_{1} + n_{2}}{2} + 1,$
where n.sub.1+n.sub.2=|P|. Since the quorum q needs to be greater than half of the total number of processes so that

(58) $q > \frac{n_{1} + n_{2}}{2},$
one can follow that either n.sub.1≥q or n.sub.2≥q or neither of the two are. One can generalize: let S={P.sub.1, P.sub.2, P.sub.k} be exclusive subsets of P and n.sub.i=|P.sub.i|∀P.sub.i∈S. Then

(59) $q = \frac{n_{1} + n_{2} + .Math. n_{k}}{2} + 1$
and either exactly one segment forms a quorum n.sub.i≥q,n.sub.j<q|i≠j∀j∈{1,2,i−1, i+2, . . . |S|} or no segments do n.sub.j<q∀j∈{1, 2, . . . |S|}.

(60) It can be concluded that whenever a network partitions into 2 or more parts, a maximum of 1 part can remain in the status “running”.

(61) FIG. 10 illustrates an example of the partitioning of a network during normal operation. (1) A distributed network of 5 processes is running with q=3. (2) All processes are exchanging tuples. (3) A network partitioning occurs, splitting the set of processes into a left p.sub.1, p.sub.4, p.sub.5 and right p.sub.2, p.sub.3 segment. (4) The processes in the left segment see p.sub.2 and p.sub.3 failing after some rounds and perform the procedure for failing processes described above. p.sub.2 and p.sub.3 see the processes p.sub.1, p.sub.4, p.sub.s failing and execute the same workflow. (5) In the left and right segments the processes confirm to each other their view on the failed processes. (6) When confirmed, processes in both segments remove the failed processes from their {right arrow over (K)}. (7) During failed process removal, processes in the left segment still form a quorum of 3 and keep their status “running”, while the processes in the left segment miss the quorum and switch to state “bootstrap”. (8) Processes in the right segment keep sending each other bootstrap messages. They send those messages to all processes E P. At some later point in time, when the partition ends and the segments will be reunited, the processes in the former left segment will receive bootstrap processes from the right segment, and all processes will add each other again into their {right arrow over (K)}.

(62) 7. Determining a Leader Acting as a Transaction Master

(63) At any given time a distributed network will have {right arrow over (K)} known running processes. A request is a message from an external process in the same network (such as a client, see below) to processes ∈{right arrow over (K)}. For each request a master process is determined, using leader election rounds for a leader election as described above.

(64) Let request(m, r, data) be a request message sent by an external process, where m∈{1, 2, . . . |{right arrow over (K)}|} is denoted as “master” and specifies an index in {right arrow over (K)}, r is the leader election round, and data denotes some structured dataset supplied by the client, to be processed by the leader.

(65) The leader is selected by calculating a designator of a single one of the running processes from the distributed random information by means of a second shared transformation function, which, according to a preferred embodiment is defined as m=(R.sub.r (mod|{right arrow over (K)}|))+1.

(66) A process p.sub.i receiving a request message determines whether it is the master for such request by checking whether it is the m.sup.th element in its {right arrow over (K)}. It verifies p.sub.i={right arrow over (K)}[m] and m=(R.sub.r (mod|{right arrow over (K)}|))+1. If the first equation matches, p.sub.i is the addressee for the request; if the second equation matches, p.sub.1 is rightfully the leader for round r. If p.sub.i≠{right arrow over (K)}[m], then p.sub.i can safely ignore the request. For m=(R.sub.r (mod|{right arrow over (K)}|))+1 two error conditions are possible: 1) If the left and right sides of the equation do not match, then the request was addressed to p.sub.i under wrong assumptions by the external process (error, adversary attack). In this case p.sub.i sends back the rejection message error(notmaster) to the external process. 2) if R.sub.r or K.sub.r does not exist on p.sub.i, then the external process might be using a round that is too advanced, too old, or referring to a round that is incomplete. In that case p.sub.i sends back the rejection message error(noround) to the external process.

(67) FIG. 11 illustrates an external process sending requests to a distributed network. (1) Processes p.sub.1 . . . p.sub.5 form a distributed network with quorum q=3. They know about rounds 6,7,8 and maintain the respective distributed randoms R.sub.k and {right arrow over (K.sub.k)}. (2) External process sends a request to the network, addressing p.sub.1 in round 7 who verifies that its position in {right arrow over (K.sub.7)} is indeed at index 1. Additionally p.sub.1 verifies R.sub.7 (mod|{right arrow over (K)}.sub.7|)+1=860 (mod 5)+1=1. This is the case, so p.sub.1 can process the request. (3) External process sends a request to the network, addressing p.sub.1 in round 8 who verifies its position in {right arrow over (K.sub.8)} is indeed at index 1. Additionally, p.sub.1 checks R.sub.8 (mod|{right arrow over (K.sub.8)}|)+1=131 (mod 5)+1=2≠1. So p.sub.1 sends back an error. (4) External process sends a request to the network, addressing p.sub.1 in round 9 who cannot check its position since {right arrow over (K.sub.9)} does not yet exist on p.sub.1. So p.sub.1 sends back an error. p.sub.2 cannot check its position either, but since its last known position at {right arrow over (K)}.sub.max={right arrow over (K.sub.8)} was 2 it will reject the request.

(68) 8. Clients

(69) A client-process (client) is an external process that sends requests to running processes, i.e. processes participating in the leader election. A master is the leader within the distributed network, responsible for a client-process' request, and is expected to send a response back to the client.

(70) In this example, clients join the distributed network and take a read-only role in leader election rounds, i.e. they have their own vectors {right arrow over (V)} and {right arrow over (K)} together with collections C.sub.k∈{right arrow over (V)} to maintain knowledge about current processes, rounds, distributed random numbers, joiners and leavers. A client will thus be able to determine the master for every round.

(71) When a client prepares a request to be sent, it starts with the latest locally complete round g in {right arrow over (V)} and calculates the master m from the distributed random number R.sub.g as m=R.sub.g (mod|{right arrow over (K.sub.g)}|)+1. It then uses request(m, g, data) to send its message to either all processes on the distributed network, or directly to the master process p.sub.n={right arrow over (K)}.sub.g[m].

(72) When using broadcasting mode, non-master nodes—while ignoring the master request at leader election level—might kick-off higher level functionality within a process for optimisation, e.g. non-master processes can send their vote on a consensus to the master immediately, saving the master sending out the request first.

(73) When a honest client receives an error or no response within a given time out, it creates a new request using a previous locally complete round's data: request(m.sub.i, i, data), where i=max(j)|j<g∧locallyComplete(C.sub.j)=1∧C.sub.j∈{right arrow over (V)} until it succeeds or the oldest round within {right arrow over (V)} has been tried. If all tries remain erroneous, the client has to give up.

(74) Interpretation of Error Conditions for the Client: error (notmaster) the local {right arrow over (V)} and {right arrow over (K)} seem to be not in-sync with the distributed network, try the previous round. error(noround) the local {right arrow over (V)} occurs to be more advanced than {right arrow over (V)} on master p.sub.m, so trying the previous round might be more successful. timeout when the client has not heard back from master p.sub.m, it can assume that p.sub.in is probably failing and tries the previous round.

(75) FIG. 12 illustrates a client process sending a request using a too advanced round. (1) Processes p.sub.1 . . . p.sub.5 form a distributed network with quorum q=3. They know about rounds 6,7,8 and maintain the respective distributed randoms R.sub.k and {right arrow over (K.sub.k)}. (2) Client sends a request for round 9, but since round 9 is not yet known or locally complete on p.sub.1 the client receives back a error(noround). (3) The client can decrement its round since C.sub.8∈{right arrow over (V)} and tries again for that round.

(76) 9. Master Sharding

(77) According to a further example of an embodiment of the invention, requests to a distributed network can be sharded so that several masters are allowed within the same leader election round. To achieve this, client messages must use structured datasets and an element of that dataset must be ∈ custom character .sub.o.sup.+. To constrain the client from picking arbitrary content to manipulate the master in its favour, that element must either follow some order or have an intrinsic meaning for the higher-level routines associated with the request, such as a serial number, a UUID version 1 with monotonic time increment, or an account number that a particular request binds to.

(78) Using the element ε∈ custom character .sub.o.sup.+ from a client's dataset, ε is multiplied with the actual round's distributed random number R.sub.9 over the finite field .sub.o.sup.+ to obtain the master for a particular client request:
m=((R.sub.g.Math.ε)(mod o))(mod|{right arrow over (K)}|)+1

(79) 10. Preferred Embodiment of the Invention when Used in Total Order Broadcast Systems

(80) The complexity of processing messages of arbitrary order in distributed systems is greatly reduced by relying on group communication primitives that provide higher guarantees than standard multicast communication. One such primitive is called total order broadcast. Informally, the primitive ensures that messages sent to a set of processes are received in the same order by every member of the set. Furthermore, when designing a variant of the inventive leader election method, such total order broadcast systems can be used to deliver each message reliably and exactly once to each and every process. As such, in this type of total order broadcast system, random numbers generated by the running processes are received by all processes in the same order, with random number transporting messages being interspersed among the totally ordered stream of arbitrary messages.

(81) As described earlier, each process generates a new random number at regular or irregular intervals, but this time it submits it to the total order broadcast system. To ensure a process uses its own random number at the right place within the total order, that process needs to put its own random number in its local collection set only at the time it receives it back from the total order broadcast system.

(82) For this variant of the invention each process maintains exactly one collection set of random numbers with elements from all running processes. When a process bootstraps, it waits until its collection set is locally complete, then it switches to normal operation mode.

(83) Whenever a process p.sub.i receives a random r.sub.j from process p.sub.j, it replaces the current element in its collection set at position j with r.sub.j.

(84) As an option, before processing any next message received, p.sub.i calculates a new distributed random number by applying the first transformation function f.sub.R(r.sub.1, r.sub.2, . . . r.sub.n) custom character R. Since all processes perform the same calculation in the same order, R is considered to be globally complete over the set of processes P. As a result each process ∈P knows the current leader at any position in the message stream. It can be followed that in this variant there is no requirement for separate leader election rounds.

(85) As another option, for each client request received a leader is determined by using an extended first transformation function f.sub.R (r.sub.1, r.sub.2, . . . r.sub.n, C) custom character R to generate a per-request random, where C denotes some data taken from the client request, e.g. its signature. Besides the multiplication of integers in a field of prime order as explained earlier, other transformation functions such as (but not limited to) concatenation, xor, and hashing, and a combination thereof, can be used to generate R. Let l be the maximum number of bits required to represent all processes (e.g. 8-bits=256 processes) and L be the length of R in bits. β is denoted to be the maximum bias probability towards the first i processes in P as β=1/2.sup.(L-l), where i=2.sup.l (mod|P|). To follow the recommendation on page 46 of the Federal Information Processing Standard 186-4 of the National Institute of Standards and Technology (NIST) for handling bias probability in calculus with low modulo, having i>0, it is recommended to use a transformation function resulting in R having L≥l+64.

(86) As the leader election in such total order broadcast systems ensures R is totally complete whenever any other message is received, it follows that requiring a client to be involved in leader election messages, and to pick a round and a master, does not apply any more. Client processes send their message to the total order broadcast system, instead to one particular running process (or many). The master for the request is relayed to the clients upon reception of the messages by the appropriate leader.

Computer-implemented method, computer program and data processing system

Assignee

Inventors

Cpc classification

Classification Explorer

G06F9/52

PHYSICS

Classification Explorer

H04L41/0816

ELECTRICITY

Classification Explorer

H04L41/30

ELECTRICITY

Classification Explorer

G06F9/5027

PHYSICS

Classification Explorer

H04L67/1008

ELECTRICITY

Classification Explorer

H04L67/1034

ELECTRICITY

Classification Explorer

H04L67/1019

ELECTRICITY

International classification

Classification Explorer

H04L67/1019

ELECTRICITY

Classification Explorer

H04L67/1034

ELECTRICITY

Classification Explorer

H04L67/1008

ELECTRICITY

Classification Explorer

H04L41/0816

ELECTRICITY

Classification Explorer

H04L41/00

ELECTRICITY

Abstract

Claims

Description