Highly-Available Cluster Leader Election in a Distributed Routing System

20230224243 · 2023-07-13

    Inventors

    Cpc classification

    International classification

    Abstract

    A distributed routing system is provided for use in a communication network, wherein the distributed routing system includes at least one cluster comprising a first plurality of cluster elements from which a second plurality of cluster elements is selected, wherein each of the cluster elements comprised in the second plurality of cluster elements is configured to operate as a cluster leader candidate and wherein one of that second plurality of cluster elements is selected on a temporary basis to act as cluster leader.

    Claims

    1. A distributed routing system for use in a communication network, wherein said distributed routing system includes at least one cluster comprising a first plurality of cluster elements from which a second plurality of cluster elements is selected , wherein each of the cluster elements comprised in the second plurality of cluster elements is configured to operate as a cluster leader candidate (CLC) and wherein one of said second plurality of cluster elements is selected on a temporary basis to act as a cluster leader.

    2. The distributed routing system of claim 1, further comprising a managing software configured to manage routing operations within said at least one cluster, and wherein said management software is divided into a plurality of fragments stored at different cluster elements’ managers and at said cluster manager.

    3. The distributed routing system of claim 2, wherein each fragment belonging to said plurality of fragments, is encompassed within a respective communication container.

    4. The distributed routing system of claim 1, further comprising at least one processor configured to derive information from reports provided by one or more of the cluster elements’ managers, wherein said information pertains to which of the cluster elements’ managers that are associated with said second plurality of cluster elements, are eligible to be selected as cluster leader candidates, based on information derived from each respective one or more cluster elements’ managers.

    5. The distributed routing system of claim 4, wherein said reports are generated on demand and/or upon occurrence of a change in said distributed routing system.

    6. The distributed routing system of claim 4, wherein said reports belong to one of two distinct types, being reports that pertain to logical state of connections extending to the cluster managers, and reports that pertain to acknowledgements made by the cluster leader, of messages that were sent along said connections to that cluster leader.

    7. The distributed routing system of claim 4, wherein in case said reports are inconclusive, said at least one processor is further configured to initiate forwarding messages within a period of time during which a cluster leader is selected, wherein the forwarding of said messages is carried out via one or more intermediate cluster elements’ managers, and wherein each of said messages is associated with a request to receive information on cluster visibility from the cluster manager receiving the respective message.

    8. The distributed routing system of claim 7, wherein said at least one processor is further configured to divide said period of time during which a cluster leader is selected into query periods, the lengths of which are set according to interconnect capabilities of the cluster leader candidates.

    9. The distributed routing system of claim 8, wherein said at least one processor is further configured either to accept the cluster state at the end of a query period, or to start another query period if the information received during the preceding query period is insufficient.

    10. A method for selecting a cluster element that will act as a cluster manager in a distributed routing system operative in a communication network, wherein the distributed routing system includes at least one cluster comprising a first plurality of cluster elements from which a second plurality of cluster elements is selected, said method comprises the steps of: providing information derived from reports generated by one or more of the cluster elements’ managers, wherein the information pertains to which of the cluster elements’ managers that belong to the cluster, are eligible to be selected as cluster leader candidates, based on information derived from each respective one or more cluster elements’ managers; initiating messages within a period of time during which the selection of a cluster leader is carried out, wherein each of these messages is associated with a request to receive information on cluster visibility from all cluster elements’ managers receiving the respective message; and based on responses received to said messages relating to the eligibility of each of the cluster leader candidates to act as a cluster leader, selecting a cluster leader from among these cluster leader candidates.

    11. The method of claim 10, wherein the reports are generated on demand and/or upon occurrence of a change in the distributed routing system.

    12. The method of claim 10, wherein the reports belong to one of two distinct types, being reports that pertain to logical state of connections extending to the cluster managers, and reports that pertain to acknowledgements made by the cluster leader, of messages that were sent along said connections to that cluster leader.

    13. The method of claim 10, wherein in case the reports are inconclusive, the method further comprising a step of initiating forwarding messages within a period of time during which a cluster leader is selected, wherein the forwarding of the messages is carried out via one or more intermediate cluster elements’ managers, and wherein each of the messages is associated with a request to receive information on cluster visibility from the cluster manager receiving the message.

    14. The method of claim 10, further comprising a step of dividing the period of time during which a cluster leader is selected into query periods, the lengths of which are set according to interconnect capabilities of the cluster leader candidates.

    15. The method of claim 10, further comprising a step of determining either to accept the cluster state at the end of a query period, or to start another query period if the information received during the preceding query period is insufficient.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0046] The accompanying drawings, which are incorporated herein and constitute a part of this specification, illustrate several embodiments of the disclosure and, together with the description, serve to explain the principles of the embodiments disclosed herein.

    [0047] FIG. 1 illustrates a distributed router’s traffic-carrying interconnect;

    [0048] FIG. 2 illustrates a distributed router’s internal control traffic-carrying interconnect;

    [0049] FIG. 3 demonstrates a system architecture that includes various element masters and cluster managers and their connectivity, according to an embodiment of the present invention; and

    [0050] FIG. 4 exemplifies a method for carrying out an embodiment construed in accordance with the present disclosure.

    DESCRIPTION OF EXEMPLARY EMBODIMENTS

    [0051] Some of the specific details and values in the following detailed description refer to certain examples of the disclosure. However, this description is provided only by way of example and is not intended to limit the scope of the invention in any way. As will be appreciated by those skilled in the art, the claimed method and device may be implemented by using other methods that are known in the art per se. In addition, the described embodiments comprise different steps that are carried out, not all of which are required in all embodiments of the invention. The scope of the invention can be summarized by referring to the appended claims.

    [0052] The present invention aims to provide a new, partially synchronous approach for ensuring that a highly available cluster leader is selected, thereby facilitating the required operations that need to be carried out by that cluster of communication elements.

    [0053] One of the underlying principles of the present disclosure is the use of timeouts available in control-plane managed protocols. Typically, the time constraints are loose enough to allow full teardown and bring up of the control-plane services between cluster leader candidates (CLCs) without having an adverse effect on peers’ states. Any low-latency protocols negotiated over physical interfaces should preferably be handled by the Forwarding Cluster Element (FCE) until the cluster manager (CM) is operational at the elected cluster leader (CL). Consequently, customer-facing behavior remains unaffected by the migration of control-plane services to the new cluster leader.

    [0054] By relying on nonparticipating CEs, the proposed solution of the present disclosure intends to become resilient to byzantine failures. Should a CE wrongly report a certain CLC as being an unavailable CLC, other CLCs can circumvent the erroneous report by relying on other CEs′ reports. On the other hand, should a CE wrongly report a CLC as being available, the CLCs may try to communicate through it to learn the reported CLC’s true state.

    [0055] Through usage of infrequent messages of small size, the proposed solution aims to impose low interconnect overhead. The aforementioned tolerant timeouts obviate the need for mass data store synchronization, thereby facilitating a low-overhead inter-CLC communications protocol.

    [0056] Lastly, by splitting the leader election (LE) logic between the CM and an element-local application manager, micro-services can be added and removed from elements without affecting the LE logic. The splitting of the logic software between EMs and CMs, each encompassed within an appropriate container, ensures avoidance of byzantine failures and at the same time, decoupling of the cluster leader services and the LE logic.

    [0057] FIG. 2 illustrates an example of a mesh connectivity of CEs that provides such reliability of interconnectivity.

    [0058] By having a redundancy of switching elements, interconnect availability is maintained during individual switch failure. Furthermore, by querying these switching elements, visibility reports can be generated that also rely on Link Layer Discovery Protocol (LLDP) neighborship.

    [0059] While CMs run specifically on CLCs to negotiate the identity of the elected cluster leader (CL), they make use of the cluster elements’ managers that run on all cluster elements to do so. It is then possible to freely modify the composition of application layers such as the control-plane stack as the LE process, is configuration-agnostic in this regard.

    [0060] FIG. 3 is used to further clarify the nature of EM-CM relationships. It illustrates the location and relationships of these entities, emphasizing their containerized environment to allow execution alongside other applications on the CEs.

    [0061] FIG. 4 exemplifies a method construed in accordance with an embodiment of the present disclosure for selecting a cluster leader candidate that will act as a cluster leader in a distributed routing system operative in a communication network.

    [0062] First, information that has been derived from reports generated by one or more of the cluster elements’ managers, is provided (step 410). This information pertains to which CLCs are connected, i.e., which CMs are connected to a given reporting element manager, or in other words which cluster managers are “visible” from each element manager, whether a specific CLC-EM is eligible to become a CL or not, and the identity of the CL, if known. The provisioning of these reports by the elements’ managers (EMs) to the cluster managers (CMs), may preferably be affected either on-demand or upon occurrence of certain changes in the cluster elements. Thus, the visibility of the CM of the cluster elements is always up to date.

    [0063] According to an embodiment of the disclosure, the requests for providing visibility reports, may be classified into two different types, namely, logical and immediate requests. The former type of requests relies on the logical state of the connection to the CM, whereas the latter requests are requests that are sent along certain connections, and the reports generated based on these requests are whether the target CM acknowledged receipt of the respective message, or not.

    [0064] The next step involves initiating messages within a period during which the selection of a cluster leader is carried out (step 420). Each of these messages is associated with a request to receive information on cluster visibility from a cluster leader candidate receiving the respective message. When the reports received are inconclusive, tally requests between CLCs may be used during LE. The term “tally” as used herein, refers to a continuous count of something, in our case, the number of EMs that are able to confirm that a specific CM is currently active, and by proxy to turn its CLC to the cluster leader (CL).

    [0065] The messages will be sent through an intermediate EM to each detected CM to request their cluster visibility, comprised of a tally of EM votes for the current leader’s identity, to which CMs reply if no other CE request had been made beforehand.

    [0066] The leader election (LE) period is preferably divided into query periods, the lengths of which are set according to the capabilities of the CLCs′ interconnect and with respect to the control-plane routing protocols tolerance, that serve as timeouts for visibility queries, logical or immediate, and at the end of which, either the cluster state is accepted, or rejected. In the latter case, a tally may be issued, or another query period is started if the information received is insufficient.

    [0067] Finally, based on responses received to these messages relating to the eligibility of each of the cluster leader candidates to act as a cluster leader, selecting a cluster leader from among these cluster leader candidates (step 430). A tally is completed if a CE receives tally replies from most of the approached CMs, that are then informed of the new leader’s identity along with all reachable EMs. Should no single CLC receive a greater number of tally responses than all other CLCs, a tally termination is sent to the requested CMs to free them for new tally requests, and a random timeout is imposed before attempting to initiate a new tally. Optionally, a pre-determined criterion may be selected to resolve cases where there is a tie between two (or more) CLCs. For example, it may be determined that in case of such a tie, the CLC which is a node having a lower ID number between the two (or more) CLCs receiving the same numbers of tally responses, would be selected.

    [0068] The term node “ID number” as used herein relates to the fact that nodes included in the cluster need to have some sort of identification, which may be used in context of the present disclosure, to correlate visibility reports with the reporters making them, but as mentioned above, they may also be used as tiebreakers in the case where votes received by two or more CLCs for becoming a CL, are the same. Typically, there is no requirement for the identifier besides being comparable and unique.

    [0069] The present invention has been described using detailed descriptions of embodiments thereof that are provided by way of example and are not intended to limit the scope of the invention in any way. The described embodiments comprise different features, not all of which are required in all embodiments of the invention. Some embodiments of the present invention utilize only some of the features or possible combinations of the features. Variations of embodiments of the present invention that are described and embodiments of the present invention comprising different combinations of features noted in the described embodiments will occur to persons of the art. The scope of the invention is limited only by the following claims.