Systems and methods for quality of service (QoS) based management of bottlenecks and flows in networks
11398984 · 2022-07-26
Assignee
Inventors
CPC classification
H04L43/0876
ELECTRICITY
H04L47/24
ELECTRICITY
H04L41/0823
ELECTRICITY
International classification
H04L12/28
ELECTRICITY
H04L47/24
ELECTRICITY
H04L47/80
ELECTRICITY
Abstract
Techniques based on the Theory of Bottleneck Ordering can reveal the bottleneck structure of a network, and the Theory of Flow Ordering can take advantage of the revealed bottleneck structure to manage and configure network flows so as to improve the overall network performance. These two techniques provide insights into the inherent topological properties of a network in at least three areas: (1) identification of the regions of influence of each bottleneck; (2) the order in which bottlenecks (and flows traversing them) may converge to their steady-state transmission rates in distributed congestion control algorithms; and (3) the design of optimized traffic engineering policies.
Claims
1. A method for improving network utilization by configuring processing of flows therein, the method comprising: in a network comprising a plurality of links and having a plurality of flows that are processed according to P policies, P being greater than 1, evaluating a plurality of candidate partitionings of the plurality of flows into P groups, using a QoS function that, using network topology, determines network performance for each candidate partitioning, wherein: according to a candidate partitioning, a first flow belongs to a first group of the P groups and one or more flows belong to a second group of the P groups; and to determine the network performance, the QoS function accounts for effect of rate drift in a rate of the first flow on respective rates of all other flows in the plurality of flows; and designating a candidate partitioning that maximizes the network performance as best partitioning, wherein the plurality of flows are allocated to P groups according to the best partitioning and one or more flows in each group are processed according to a respective one of the P policies.
2. The method of claim 1, wherein the QoS function comprises a generalized P-partitioned set function receiving (P−1) sets as inputs.
3. The method of claim 1, wherein the QoS function comprises a max-min-max-flow (m3f) function.
4. The method of claim 3, wherein the max-min-max-flow (m3f) function comprises: (i) a weighted max-min-max-flow function, (ii) a max-min-weighted-max-flow function, or (iii) a weighted max-min-weighted-max-flow function.
5. The method of claim 1, wherein the QoS function guarantees for at least one flow from the plurality of flows a specified minimum flow rate.
6. The method of claim 1, wherein the QoS function is inclusive.
7. The method of claim 6, wherein: the plurality of candidate partitionings comprises one or more sets of partitionings, one or more partitionings in each set corresponding to a respective value of a parameter (λ); and evaluating the plurality of candidate partitionings comprises: selecting a plurality of parameter values in order and, for each parameter value: evaluating from a corresponding set one or more candidate partitionings using the QoS function; and designating a candidate partitioning that maximizes the network performance as optimal partitioning for that parameter value, yielding a plurality of optimal partitionings corresponding to the plurality of parameter values; and designating as the best partitioning from the plurality of optimal partitionings, a partitioning for which the network performance according to the QoS function is maximum.
8. The method of claim 7, wherein evaluating the one or more candidate partitionings for a current parameter value comprises identifying a nested neighborhood of the optimal partitioning for a preceding parameter value.
9. The method of claim 8, wherein: in the optimal partitioning for the preceding parameter value, each flow belonging to a first group in the P groups is larger in size than any flow in a second group in the P groups; each partitioning is encoded using P symbols; and identifying the nested neighborhood comprises selecting a candidate partitioning such that: Manhattan distance between an encoding of the candidate partitioning and an encoding of the optimal partitioning for the preceding parameter value is less than or equal to a specified threshold; and each flow designated to the first group in the optimal partitioning for the preceding parameter value is designated to a corresponding first group in the candidate partitioning.
10. The method of claim 1, wherein evaluating a particular candidate partitioning comprises: constructing a flow gradient graph for the network; decrementing a respective flow rate of each flow in at least one partition according to the candidate partitioning, by a unit rate (δ); propagating reduction in the flow rates through the flow gradient graph to obtain a flow rate drift at each flow in the network; and aggregating the flow rate drifts using a function based on the QoS function to obtain the network performance.
11. The method of claim 10, wherein constructing the flow gradient graph comprises: creating a link vertex for each bottleneck link in the network and a flow vertex for each flow in the network; for each link in the network: identifying one or more flows bottlenecked by that link, and adding a respective directed edge from a link vertex corresponding to that link to the respective flow vertices corresponding to each of the one or more bottlenecked flows; and identifying non-bottlenecked flows passing through the link, and adding a respective directed edge from a flow vertex corresponding to each non-bottlenecked flow to the link vertex corresponding to the link.
12. The method of claim 11, wherein the one or more bottlenecked flows and the one or more non-bottlenecked flows are identified by constructing a bottleneck precedence graph.
13. The method of claim 1, wherein the network is selected from the group consisting of a data network, an energy-distribution network, a cellular network, and a goods-distribution network.
14. The method of claim 1, wherein: P is equal to two; and according to the best partitioning, one or more flows from the plurality of flows belong to a first group and are designated elephant flows, and remaining flow or flows belong to a second group and are designated mouse flows.
15. The method of claim 1, wherein each of the P policies defines a respective processing priority for flows belonging to the corresponding one of the P groups.
16. A system for improving network utilization by configuring processing of flows therein, the system comprising: a first processor; a first memory in electrical communication with the first processor, the first memory comprising instructions which, when executed by a processing unit comprising at least one of the first processor and a second processor, and in electronic communication with a memory module comprising at least one of the first memory and a second memory, program the processing unit to: in a network comprising a plurality of links and having a plurality of flows that are processed according to P policies, P being greater than 1, evaluate a plurality of candidate partitionings of the plurality of flows into P groups, using a QoS function that, using network topology, determines network performance for each candidate partitioning, wherein: according to a candidate partitioning, a first flow belongs to a first group of the P groups and one or more flows belong to a second group of the P groups; and to determine the network performance, the QoS function accounts for effect of rate drift in a rate of the first flow on respective rates of all other flows in the plurality of flows; and designate a candidate partitioning that maximizes the network performance as best partitioning, wherein the plurality of flows are allocated to P groups according to the best partitioning and one or more flows in each group are processed according to a respective one of the P policies.
17. The system of claim 16, wherein the QoS function comprises a generalized P-partitioned set function receiving (P−1) sets as inputs.
18. The system of claim 16, wherein the QoS function comprises a max-min-max-flow (m3f) function.
19. The system of claim 18, wherein the max-min-max-flow (m3f) function comprises: (i) a weighted max-min-max-flow function, (ii) a max-min-weighted-max-flow function, or (iii) a weighted max-min-weighted-max-flow function.
20. The system of claim 16, wherein the QoS function guarantees for at least one flow from the plurality of flows a specified minimum flow rate.
21. The system of claim 16, wherein the QoS function is inclusive.
22. The system of claim 21, wherein: the plurality of candidate partitionings comprises one or more sets of partitionings, one or more partitionings in each set corresponding to a respective value of a parameter (λ); and to evaluate the plurality of candidate partitionings, the instructions program the processing unit to: select a plurality of parameter values in order and, for each parameter value: evaluate from a corresponding set one or more candidate partitionings using the QoS function; and designate a candidate partitioning that maximizes the network performance as optimal partitioning for that parameter value, yielding a plurality of optimal partitionings corresponding to the plurality of parameter values; and designate as the best partitioning from the plurality of optimal partitionings, a partitioning for which the network performance according to the QoS function is maximum.
23. The system of claim 22, wherein to evaluate the one or more candidate partitionings for a current parameter value, the instructions program the processing unit to: identify a nested neighborhood of the optimal partitioning for a preceding parameter value.
24. The system of claim 23, wherein: in the optimal partitioning for the preceding parameter value, each flow belonging to a first group in the P groups is larger in size than any flow in a second group in the P groups; each partitioning is encoded using P symbols; and to identify the nested neighborhood the instructions program the processing unit to select a candidate partitioning such that: Manhattan distance between an encoding of the candidate partitioning and an encoding of the optimal partitioning for the preceding parameter value is less than or equal to a specified threshold; and each flow designated to the first group in the optimal partitioning for the preceding parameter value is designated to a corresponding first group in the candidate partitioning.
25. The system of claim 16, wherein to evaluate a particular candidate partitioning, the instructions program the processing unit to: construct a flow gradient graph for the network; decrement a respective flow rate of each flow in at least one partition according to the candidate partitioning, by a unit rate (δ); propagate reduction in the flow rates through the flow gradient graph to obtain a flow rate drift at each flow in the network; and aggregate the flow rate drifts using a function based on the QoS function to obtain the network performance.
26. The system of claim 25, wherein to construct the flow gradient graph the instructions program the processing unit to: create a link vertex for each bottleneck link in the network and a flow vertex for each flow in the network; for each link in the network: identify one or more flows bottlenecked by that link, and add a respective directed edge from a link vertex corresponding to that link to the respective flow vertices corresponding to each of the one or more bottlenecked flows; and identify non-bottlenecked flows passing through the link, and add a respective directed edge from a flow vertex corresponding to each non-bottlenecked flow to the link vertex corresponding to the link.
27. The system of claim 26, wherein to identify the one or more bottlenecked flows and the one or more non-bottlenecked flows, the instructions program the processing unit to: construct a bottleneck precedence graph.
28. The system of claim 16, wherein the network is selected from the group consisting of a data network, an energy-distribution network, a cellular network, and a goods-distribution network.
29. The system of claim 16, wherein: P is equal to two; and according to the best partitioning, one or more flows from the plurality of flows belong to a first group and are designated elephant flows, and remaining flow or flows belong to a second group and are designated mouse flows.
30. The system of claim 16, wherein each of the P policies defines a respective processing priority for flows belonging to the corresponding one of the P groups.
Description
BRIEF DESCRIPTION OF DRAWINGS
(1) The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee. Various embodiments of the present invention taught herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
DETAILED DESCRIPTION
(38) Part 1: A Theory of Bottleneck Ordering for Networks
(39) We describe below the Theory of Bottleneck Ordering, a mathematical framework that reveals the bottleneck structure of networks in general, including data networks, energy-distribution networks, cellular networks, networks for distributing physical goods, etc. This theoretical framework provides insights into the inherent topological properties of a network in at least three areas: (1) it identifies the regions of influence of each bottleneck; (2) it reveals the order in which bottlenecks (and flows traversing them) converge to their steady-state transmission rates in distributed congestion control algorithms; and (3) it provides key insights into the design of optimized traffic engineering policies.
(40) We demonstrate the efficacy of the proposed theory in TCP congestion-controlled networks for two broad classes of algorithms: congestion-based algorithms (BBR) and loss-based additive-increase/multiplicative-decrease algorithms (Cubic and Reno). Among other results, our network experiments show that: (1) qualitatively, both classes of congestion control algorithms behave as predicted by the bottleneck structure of the network; (2) flows generally compete for bandwidth only with other flows operating at the same bottleneck level; (3) BBR flows can achieve higher performance and fairness than Cubic and Reno flows due to their ability to operate at the right bottleneck level; (4) the bottleneck structure of a network is generally ever-changing and its levels can be folded due to variations in the flows' round-trip times; and (5) against conventional wisdom, low-hitter flows can have a large impact on the overall performance of a network.
(41) Transmission of Bottleneck Information
(42) In the context of data networks, it is well known that, regardless of how many links a connection traverses, from TCP's viewpoint any network path generally behaves as a single link with the same round-trip time and bottleneck rate. What is less understood is the fact that not all bottleneck links are of equal importance. To provide some intuition for this argument, let us consider some initial simple network examples.
(43) Consider the network configurations illustrated in
(44) Consider now the network configuration in
(45) While this network configuration is still very simple, an interesting question arises from its topological structure: are both bottleneck links l.sub.1 and l.sub.2 equally important with regard to the overall performance of the network? For instance, it is easy to see that the derivative of s.sub.2 with respect to c.sub.1 is 0.5, ∂s.sub.2/∂c.sub.1=0.5, while the derivative of s.sub.1 with respect to c.sub.2 is 0, ∂s.sub.1/∂c.sub.2=0. That means the performance of link l.sub.2 (measured in terms of its fair share) depends on the performance of link l.sub.1, but not vice versa, thus revealing a notion of hierarchy or ordering between these two bottleneck links. The bottleneck structure that we formally introduce herein is a graph that captures these relationships.
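This asymmetric dependence is easy to check numerically. The sketch below (hypothetical Python, not part of the patent) assumes one plausible two-link topology, with link l.sub.1 carrying flows f.sub.1 and f.sub.2 and link l.sub.2 carrying flows f.sub.2 and f.sub.3; the exact topology of the figure, and hence the sign of the drift, are assumptions here, but the magnitudes mirror the derivatives above:

```python
def fair_shares(c1, c2):
    """Max-min fair shares for an assumed two-link topology:
    l1 carries f1 and f2; l2 carries f2 and f3 (hypothetical configuration)."""
    s1 = c1 / 2               # l1 splits its capacity evenly between f1 and f2
    r_f2 = min(s1, c2 / 2)    # f2 is bottlenecked at whichever link offers less
    s2 = c2 - r_f2            # l2's leftover capacity is f3's share, i.e., l2's fair share
    return s1, s2

# Perturbing c1 moves s2 by half a unit per unit (|ds2/dc1| = 0.5):
# fair_shares(10, 13) -> (5.0, 8.0); fair_shares(12, 13) -> (6.0, 7.0)
# Perturbing c2 leaves s1 unchanged (ds1/dc2 = 0):
# fair_shares(10, 15)[0] == fair_shares(10, 13)[0] == 5.0
```

In this assumed configuration the fair share of l.sub.2 depends on c.sub.1 with magnitude 0.5 while the fair share of l.sub.1 is insensitive to c.sub.2, mirroring the one-way influence described above.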
(46) An intuitive way to understand the bottleneck structure of a network involves modeling the bottleneck links as nodes communicating with each other, where these nodes implement a distributed congestion control algorithm to determine their fair share while using the smallest possible number of iterations. In
(47) Consider yet another network configuration having three links and four flows, as shown in
(48) While the bottleneck structures of the second and third examples are apparently similar (both have two bottleneck links connected by an edge), there is a subtle yet important difference: In the second example (
(49) Note that the BPG also reveals the number of iterations required by links to converge to their fair share. For instance, while two links related by direct precedence can communicate in one iteration (
(50) The above examples helped us to informally introduce the bottleneck structure of some simple network configurations providing initial intuition behind the existence of (1) bottleneck hierarchies, (2) the notion of convergence time for a distributed congestion control algorithm, and (3) the concept of direct and indirect precedence between two bottlenecks. In the discussion below, we formalize all these intuitive concepts into a mathematical theory of bottleneck ordering and introduce an algorithm to compute the bottleneck precedence graph for arbitrary network topologies.
(51) Parallel Convergence in the Water-Filling Algorithm
(52) We start our work with an implicit assumption: that, in steady state, all congestion-controlled flows are bottlenecked at at least one link. The general intuition for this statement is that all congestion control algorithms, by their fundamental objective, are designed not to leave unused bandwidth. This fact is certainly true for algorithms implemented as part of a TCP/IP stack. The second observation we make is that, as shown above, the max-min rate allocation provides a natural way to identify the bottleneck link of every flow in steady state. Intuitively, a flow cannot get a rate above its bottleneck's fair share, as that would lead to a capacity violation if all other flows acted in the same way.
(53) Thus, in what follows, we use the max-min assignment to analyze the influence that the bottlenecks exert onto each other and to reveal the topological structure that inter-relates them. Two notes are relevant in this regard. First, the same general bottleneck principles apply to allocation schemes other than max-min (for instance, weighted max-min or proportional fairness). The bottleneck structure in these other schemes might be different, but the concept of a bottleneck structure is universal and exists in them too, for the reason mentioned above. Secondly, as we show in our experiments discussed below, TCP's congestion control (a widely used algorithm in modern networks) follows very closely the bottleneck structure revealed by the max-min assumption, which makes this choice highly practical.
(54) Next, we describe an algorithm to compute the max-min rate allocation of a network. The original algorithm, called the water-filling algorithm, was introduced by Bertsekas and Gallager (Dimitri P. Bertsekas, Robert G. Gallager, and Pierre Humblet, Data Networks, volume 2, Prentice-Hall International, New Jersey (1992)). To develop the theoretical framework, however, we use a variant of the water-filling algorithm introduced by Ros-Giralt and Tsai (Jordi Ros and Wei Kang Tsai, “A Theory of Convergence Order of Maxmin Rate Allocation and an Optimal Protocol,” Proceedings IEEE INFOCOM 2001, volume 2, pages 717-726 (2001)). This variant, presented in Algorithm 1, exploits a parallelization property in the original water-filling algorithm that allows certain bottleneck links to converge concurrently, leading to a lower execution time when the algorithm runs on a parallel computing architecture.
TABLE-US-00001 Algorithm 1: FastWaterFilling
1 L := Set of links in the input network;
2 F := Set of flows in the input network;
3 F.sub.l := Set of flows going through link l;
4 c.sub.l := Capacity of link l;
5 B := Set of bottleneck links;
6 r.sub.f := Rate of flow f;
7 U.sup.k := Set of unresolved links at iteration k;
8 K.sup.k := Set of converged flows at iteration k;
9 U.sup.0 = L; K.sup.0 = {∅};
10 k = 0;
11 while K.sup.k ≠ F do
12 s.sub.l.sup.k = (c.sub.l − Σ.sub.f∈F.sub.l∩K.sup.k r.sub.f)/|F.sub.l \ K.sup.k|, ∀l ∈ U.sup.k;
13 u.sub.l.sup.k = min{s.sub.l′.sup.k | F.sub.l′ ∩ F.sub.l ≠ {∅}, ∀l′ ∈ U.sup.k};
14 for l ∈ U.sup.k, s.sub.l.sup.k = u.sub.l.sup.k do
15 r.sub.f = s.sub.l.sup.k, ∀f ∈ F.sub.l \ K.sup.k;
16 U.sup.k = U.sup.k \ {l};
17 K.sup.k = K.sup.k ∪ {f, ∀f ∈ F.sub.l \ K.sup.k};
18 end for
19 U.sup.k+1 = U.sup.k; K.sup.k+1 = K.sup.k;
20 k = k + 1;
21 end while
22 B = L \ U.sup.k;
23 return ⟨B, {r.sub.f, ∀f ∈ F}, {s.sub.l.sup.k, ∀l ∈ B}, k⟩;
(56) The algorithm relies on two parameters: the fair share of link l at iteration k, s.sub.l.sup.k, which corresponds to the fair share of a link after removing all the flows that have converged up to iteration k and is computed in line 12; and the upstream fair share of link l at iteration k, u.sub.l.sup.k, which corresponds to the minimum fair share among all the links sharing a flow with l and is computed in line 13.
(57) The algorithm tracks the set of unresolved links U.sup.k (links whose final fair share has not been determined yet) and the set of converged flows K.sup.k (flows whose final transmission rate has been determined). At each iteration k, for each unresolved link l that has a fair share equal to its upstream fair share, s.sub.l.sup.k=u.sub.l.sup.k (line 14), three actions are taken: (1) the transmission rate of the remaining flows going through link l is assigned the fair share value (line 15), (2) link l is removed from the set of unresolved links (line 16), and (3) all of the remaining flows going through it are added to the set of converged flows (line 17). This process ends when all the flows and all the bottleneck links have converged to their final transmission rate and fair share values, respectively. Upon completion, the algorithm returns the set of bottleneck links B, the transmission rates of all the flows {r.sub.f, ∀f∈F}, the fair shares of all the bottleneck links {s.sub.l.sup.k, ∀l∈B}, and the value of k, which corresponds to the depth of the bottleneck structure, as we demonstrate below.
(58) The FastWaterFilling algorithm exploits the following property, formally proven elsewhere: a link that has the minimum fair share among all the links it shares flows with can immediately converge to its final fair share. Because, in general, more than one link satisfies this property at each iteration k, multiple links can converge concurrently at the same iteration. On average, this approach reduces the number of iterations from O(|L|) in Bertsekas's original water-filling algorithm down to O(log(|L|)). In the techniques described herein, we observe that this parallelization property captures the exact natural convergence time of bottleneck links in real networks. Our observation relies on the fact that modern networks use distributed congestion control algorithms, such as TCP BBR, Cubic, and Reno, that effectively behave like a large parallel computer. This property plays a crucial role in understanding the bottleneck structure of a network, as we show next.
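As a concrete illustration, the FastWaterFilling procedure can be sketched in a few lines of Python. This is a hypothetical implementation written for this description, not code from the patent; the function and variable names are our own:

```python
from math import inf

def fast_water_filling(capacity, routes):
    """Sketch of FastWaterFilling: capacity maps each link to its capacity,
    routes maps each flow to the set of links it traverses.
    Returns (bottleneck links, max-min flow rates, depth k)."""
    member = {l: {f for f, path in routes.items() if l in path} for l in capacity}
    rate, unresolved, converged, k = {}, set(capacity), set(), 0
    while converged != set(routes):
        # Fair share of every unresolved link once converged flows are removed (line 12).
        s = {}
        for l in unresolved:
            rest = member[l] - converged
            used = sum(rate[f] for f in member[l] & converged)
            s[l] = (capacity[l] - used) / len(rest) if rest else inf
        # A link converges when it holds the minimum fair share among all the
        # unresolved links it shares a flow with (lines 13-14).
        done = [l for l in unresolved if s[l] < inf and
                s[l] <= min(s[m] for m in unresolved if m == l or member[m] & member[l])]
        for l in done:                       # lines 15-16: assign rates, resolve the link
            for f in member[l] - converged:
                rate.setdefault(f, s[l])
            unresolved.discard(l)
        for l in done:                       # line 17: mark the link's flows converged
            converged |= member[l]
        k += 1
    return set(capacity) - unresolved, rate, k

# A two-link example: l1 (capacity 10) carries f1 and f2; l2 (capacity 13)
# carries f2 and f3. The network resolves in two parallel iterations:
B, r, depth = fast_water_filling(
    {"l1": 10, "l2": 13},
    {"f1": {"l1"}, "f2": {"l1", "l2"}, "f3": {"l2"}})
# B == {"l1", "l2"}; r == {"f1": 5.0, "f2": 5.0, "f3": 8.0}; depth == 2
```

In this small example, l.sub.1 converges in the first iteration and l.sub.2 in the second, mirroring the direct-precedence relationship between the two links.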
(59) Precedent Link Relations
(60) A more detailed analysis of the FastWaterFilling algorithm shows that the order of execution of its while loop reveals a bottleneck structure that is unique to every network. In this hidden structure, bottleneck links relate to each other through well-defined mathematical relationships that describe how (and in what magnitude) one link can affect the performance of another. In particular, we show that there exist only two essential relationships between bottlenecks, direct precedence and indirect precedence, which we formally introduce as follows:
(61) Definition 1 Direct precedence. Let l and l′ be two links such that (1) l converges at iteration k, (2) l′ converges at iteration k′, for some k′>k, and (3) F.sub.l∩F.sub.l′≠{Ø}. Then we say that link l is a direct precedent link (or simply a direct precedent) of link l′.
(62) Definition 2 Indirect precedence. Let l and l′ be two links such that (1) l converges at iteration k, (2) l′ converges at iteration k′, for some k′>k, and (3) F.sub.l∩F.sub.l′={Ø}, but there exists another link l.sub.r (a relay link) such that (4.1) F.sub.l∩F.sub.l.sub.r≠{Ø} and (4.2) F.sub.l′∩F.sub.l.sub.r≠{Ø}. Then we say that link l is an indirect precedent link (or simply an indirect precedent) of link l′.
(63) The relevancy of these two definitions is justified in the next two lemmas, which state the order of bottleneck convergence and the degree to which one bottleneck can affect the performance of another one in arbitrary networks.
(64) Lemma 1 Bottleneck convergence order. A link l is removed from the set of unresolved links U.sup.k at iteration k, U.sup.k=U.sup.k\{l}, if and only if all of its direct and indirect precedent links have already been removed from this set at iteration k−1.
(65) Proof. See Appendix A.1.2.
(66) This lemma introduces the notion of link convergence. We use this term to indicate that a link's fair share has been resolved and will no longer change as the FastWaterFilling algorithm continues iterating.
(67) Intuitively, Lemma 1 says that for a link l to converge, all of its direct and indirect precedent links must have converged first (this is the necessary condition). Moreover, it also says that if k is the highest iteration at which any of its direct and indirect precedent links converges, then link l will converge immediately after it at iteration k+1 (this is the sufficient condition). The importance of this lemma lies in the fact that it reveals the hidden bottleneck structure of the network. Bottleneck links in a network are not all equal with respect to the performance impact they exert onto each other. On the contrary, they follow a well-defined structure uniquely characterized by the topological properties of the network. The following lemma characterizes the influence of bottlenecks onto each other:
(68) Lemma 2 Bottleneck influence. A bottleneck l can influence the performance of another bottleneck l′, i.e., ∂s.sub.l′/∂c.sub.l≠0, if and only if there exists a set of bottlenecks {l.sub.1, l.sub.2, . . . , l.sub.n} such that l.sub.i is a direct precedent of l.sub.i+1, for 1≤i≤n−1, l.sub.1=l and l.sub.n=l′.
(69) Proof. See Appendix A.1.3.
(70) Note that in the above lemma we capture the influence of a link l on another link l′ using the derivative ∂s.sub.l′/∂c.sub.l. That is, if a change in the effective capacity c.sub.l of link l changes the fair share s.sub.l′ of another link l′, then we say that link l influences link l′. The lemma states that a bottleneck link l influences another bottleneck link l′ if and only if there exists a set of direct precedent links that provides a path between l and l′.
(71) In
(72) Next, we provide an algorithm to compute all the precedent links to help construct the bottleneck structure of arbitrary networks and a detailed example to illustrate the practical implications of Lemmas 1 and 2.
(73) The Bottleneck Structure of Networks
(74) To compute the bottleneck structure of a network, we need to obtain the direct and indirect precedent links of every bottleneck link. These precedent links correspond to edges in a digraph that has all the bottleneck links as its vertices, revealing the inherent mesh structure that characterizes the relationships between the bottlenecks and how they influence each other. This structure can be formally defined as follows:
(75) Definition 3 Bottleneck precedence graph. We define the bottleneck precedence graph (BPG) as a tuple ⟨V, E⟩ of vertices V and edges E such that V=B and E={D.sub.l.sup.k, ∀l∈B}∪{I.sub.l.sup.k, ∀l∈B}. To differentiate them, we typically represent edges corresponding to direct and indirect precedent links with solid and dashed lines, respectively. We also equivalently refer to the BPG graph as the bottleneck structure of a network.
(76) As noted above, the FastWaterFilling algorithm already computes the set of bottleneck links B. Thus, to obtain the bottleneck structure, we can extend this algorithm with additional logic to compute the direct and indirect precedent links. We refer to this procedure as the Bottleneck Precedence Graph (BPG) Algorithm, which we introduce in Algorithm 2.
TABLE-US-00002 Algorithm 2: BPG
1 L, F, F.sub.l, c.sub.l, B, r.sub.f, U.sup.k, K.sup.k are as in Algorithm 1;
2 D.sub.l.sup.k := Set of direct precedents of link l at iteration k;
3 I.sub.l.sup.k := Set of indirect precedents of link l at iteration k;
4 R.sub.l.sup.k := Set of relays of link l at iteration k;
5 U.sup.0 = L; K.sup.0 = {∅};
6 D.sub.l.sup.0 = I.sub.l.sup.0 = R.sub.l.sup.0 = {∅}, ∀l ∈ L;
7 k = 0;
8 while K.sup.k ≠ F do
9 s.sub.l.sup.k = (c.sub.l − Σ.sub.f∈F.sub.l∩K.sup.k r.sub.f)/|F.sub.l \ K.sup.k|, ∀l ∈ U.sup.k;
10 u.sub.l.sup.k = min{s.sub.l′.sup.k | F.sub.l′ ∩ F.sub.l ≠ {∅}, ∀l′ ∈ U.sup.k};
11 for l ∈ U.sup.k, s.sub.l.sup.k = u.sub.l.sup.k do
12 r.sub.f = s.sub.l.sup.k, ∀f ∈ F.sub.l \ K.sup.k;
13 U.sup.k = U.sup.k\{l};
14 K.sup.k = K.sup.k ∪ {f, ∀f ∈ F.sub.l \ K.sup.k};
15 for l′ ∈ U.sup.k, F.sub.l′ ∩ F.sub.l ≠ {∅} do
16 D.sub.l′.sup.k = D.sub.l′.sup.k ∪ {l};
17 end for
18 for l′, l.sub.r ∈ U.sup.k, F.sub.l′ ∩ F.sub.l.sub.r ≠ {∅}, s.sub.l.sub.r.sup.k = u.sub.l′.sup.k do
19 R.sub.l′.sup.k = R.sub.l′.sup.k ∪ {l.sub.r};
20 end for
21 for l′ ∈ U.sup.k, F.sub.l′ ∩ F.sub.l = {∅}, l.sub.r ∈ R.sub.l′.sup.k, F.sub.l.sub.r ∩ F.sub.l ≠ {∅} do
22 I.sub.l′.sup.k = I.sub.l′.sup.k ∪ {l};
23 end for
24 end for
25 U.sup.k+1 = U.sup.k; K.sup.k+1 = K.sup.k;
26 D.sub.l.sup.k+1 = D.sub.l.sup.k; I.sub.l.sup.k+1 = I.sub.l.sup.k; R.sub.l.sup.k+1 = R.sub.l.sup.k;
27 k = k + 1;
28 end while
29 B = L \ U.sup.k;
30 return ⟨B, {D.sub.l.sup.k, ∀l ∈ B}, {I.sub.l.sup.k, ∀l ∈ B}⟩;
(78) For every link l, the BPG Algorithm keeps track of the sets of direct precedents D.sub.l.sup.k, indirect precedents I.sub.l.sup.k, and relays R.sub.l.sup.k (lines 2-4). Every time a link converges, these sets are updated as follows (lines 15-23): Direct precedents (lines 15-17): when a link l converges, it becomes a direct precedent of all the links that it shares flows with and that have not converged yet. Relay links (lines 18-20): for any two links l′ and l.sub.r that have still not converged, if they share a flow and s.sub.l.sub.r.sup.k equals the upstream fair share u.sub.l′.sup.k, then l.sub.r becomes a relay of link l′. Indirect precedents (lines 21-23): when a link l converges, it also becomes an indirect precedent of every unconverged link l′ with which it shares no flow but with one of whose relay links it shares a flow.
(79) The BPG algorithm returns a tuple of vertices (corresponding to the bottleneck links B) and edges (corresponding to the sets of direct precedents D.sub.l.sup.k and indirect precedents I.sub.l.sup.k of every bottleneck link l, where k is the algorithm's last iteration), which provide all the necessary information to construct the bottleneck structure of the network. The last value of the iterator k is also of relevance because it corresponds to the length of the longest path in the BPG graph, as shown in the next lemma:
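Continuing the sketch style used for the water-filling procedure, the BPG construction might look as follows in Python. This is hypothetical code, not from the patent; in particular, the relay condition used here (a not-yet-converged link l.sub.r becomes a relay of l′ when the two share a flow and l.sub.r holds l′'s upstream fair share) is our assumed reading of lines 18-20 of Algorithm 2:

```python
from math import inf

def bpg(capacity, routes):
    """Sketch of the BPG algorithm: water-filling augmented to record direct
    precedents D, indirect precedents I, and relays R (hypothetical code;
    the relay condition is an assumed reading of the patent's Algorithm 2)."""
    member = {l: {f for f, p in routes.items() if l in p} for l in capacity}
    rate, unresolved, converged, k = {}, set(capacity), set(), 0
    D = {l: set() for l in capacity}   # direct precedents of each link
    I = {l: set() for l in capacity}   # indirect precedents of each link
    R = {l: set() for l in capacity}   # relays of each link
    while converged != set(routes):
        s, u = {}, {}
        for l in unresolved:
            rest = member[l] - converged
            used = sum(rate[f] for f in member[l] & converged)
            s[l] = (capacity[l] - used) / len(rest) if rest else inf
        for l in unresolved:
            u[l] = min(s[m] for m in unresolved if m == l or member[m] & member[l])
        done = [l for l in unresolved if s[l] < inf and s[l] <= u[l]]
        for l in done:
            for f in member[l] - converged:
                rate.setdefault(f, s[l])
            unresolved.discard(l)
        for l in done:
            converged |= member[l]
        # Relays: an unresolved link lr relays influence to l2 if they share a
        # flow and lr holds l2's upstream fair share (assumed condition).
        for l2 in unresolved:
            for lr in unresolved:
                if lr != l2 and member[lr] & member[l2] and s[lr] == u[l2]:
                    R[l2].add(lr)
        for l in done:
            for l2 in unresolved:
                if member[l2] & member[l]:
                    D[l2].add(l)               # l is a direct precedent of l2
                elif any(member[lr] & member[l] for lr in R[l2]):
                    I[l2].add(l)               # l reaches l2 through a relay
        k += 1
    B = set(capacity) - unresolved
    return B, {l: D[l] for l in B}, {l: I[l] for l in B}, k
```

Run on the example network discussed below, this sketch yields the direct edges l.sub.15→l.sub.4, l.sub.1→l.sub.4 and l.sub.4→l.sub.6, and an indirect edge from l.sub.15 to l.sub.1 relayed by l.sub.4.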
(80) Lemma 3 Depth of the bottleneck structure. The diameter of the BPG graph is equal to the last value of the iterator k in the BPG algorithm. We refer to this value as the depth or simply the number of levels of the bottleneck structure.
(81) Proof. See Appendix A.1.4.
(82) Next, we use two examples to illustrate how the BPG graph is constructed. In both these examples, we use the network shown in
(83) Example 1 Computation of the BPG graph. In this simple example we assume the presence of five flows {f.sub.1, . . . , f.sub.5}, with L={l.sub.1, . . . , l.sub.19} and F={f.sub.1, . . . , f.sub.5}; F.sub.1={f.sub.1, f.sub.2}, F.sub.4={f.sub.1, f.sub.4, f.sub.5}, F.sub.5={f.sub.3}, F.sub.6={f.sub.3, f.sub.4}, F.sub.10={f.sub.5}, F.sub.12={f.sub.5}, F.sub.15={f.sub.5}. We also assume that the links' effective capacities are c.sub.1=80, c.sub.4=110, c.sub.6=130, and c.sub.15=20 Gbps, while all the other links are assumed to have a capacity large enough that they are not bottlenecks. Note that in general these effective capacities need not be equal to the physical capacities of the links. This is because it is possible to reserve a fraction of the physical capacity for high-priority traffic (traffic whose bandwidth must be guaranteed), so that the analysis of the bottleneck structure is performed without taking such traffic into account, effectively reducing the available capacity of the links.
(84)
(85) Note also that there exists an indirect precedence between links l.sub.15 and l.sub.1. As in the direct precedent case, this also means that flows bottlenecked at link l.sub.1 cannot converge until flows bottlenecked at link l.sub.15 have converged. The difference here is that there exists no common flow going through both links, unlike in the case of direct precedents. Instead, execution of the BPG algorithm shows that links l.sub.15 and l.sub.1 have link l.sub.4 as a relay link.
(86) Example 2 B4's bottleneck structure for full-mesh/shortest-path. In this example, we compute the bottleneck structure of the B4 network for a more realistic case considering a larger number of flows. In particular, we assume the existence of a flow between each pair of data centers—i.e., a full mesh connectivity—with each flow taking the shortest path. As a first-order approximation, we reason that the full mesh assumption is intuitively meaningful if we consider that, in general, every data center may need to communicate with every other. Also, the choice of shortest paths can serve as a rough approximation of the actual routing. In a production system, both of these assumptions can be relaxed by using actual historical or real-time flow and routing information of the network to compute its precise bottleneck structure; this information could, for instance, be inferred using NetFlow, sFlow or BGP-LS, among other protocols (see also Appendix A.5 for details). The full-mesh/shortest-path model provides a simple but powerful base benchmark that can help measure the approximate bottleneck structure of a network by using only its fixed topological structure.
(87) The BPG graph of the B4 network under the full-mesh/shortest-path assumption and with all link capacities set to 100 Gbps is presented in
(88) In contrast, links l.sub.16, l.sub.11, l.sub.9, l.sub.18, l.sub.19, l.sub.14, l.sub.15, l.sub.3, l.sub.6, l.sub.7, l.sub.12, l.sub.1 are leaves in the bottleneck structure, which means they have no impact on the performance of any other link. This analysis reveals that when looking for high-impact bottlenecks, network operators should consider links that are located at higher levels in the BPG graph. It is of interest to compare the influence of links l.sub.8 and l.sub.10, both being transatlantic connections between the US and EU regions. The bottleneck structure (which takes into account the structural properties of the topology, the routing and the flow information of the network) indicates that link l.sub.8 has a higher influence than l.sub.10 (from
(89) Furthermore, the graph shows that link l.sub.8 can affect the performance of link l.sub.10 (since there is a directed path from l.sub.8 to l.sub.10), but not vice versa. Similarly, on the transpacific connections side, the BPG graph shows that link l.sub.2 has a more significant influence than link l.sub.4 in terms of nodes they can affect, but in this case, neither of these two links can influence the performance of each other since there is no directed path connecting them. Accordingly, the kind of structural information that the BPG graph reveals helps network operators identify and organize the relevancy of each bottleneck.
(90)
(91) Property 1 Monotonic fair shares along a precedence path. Let s.sub.l.sub.
(92) Proof. See Appendix A.1.1.
(93) Minimum Convergence Time of Distributed Congestion Control Algorithms
(94) We described above that an intuitive way to understand the bottleneck structure of a network is to model its bottleneck links as nodes that are trying to compute a globally optimal fair-share value by exchanging locally available information. This dualism between the two models (bottleneck structure and distributed communication) provides a robust framework to measure the minimum convergence time attainable by any distributed congestion control algorithm. We capture this concept in the following lemma:
(95) Lemma 4 Minimum convergence time of a distributed congestion control algorithm. Let τ(l.sub.i,l.sub.j) be a weight assigned to each edge (l.sub.i,l.sub.j) of the BPG graph as follows: (1) If l.sub.i is a direct precedent of l.sub.j, then τ(l.sub.i,l.sub.j) is the time that it takes for a message to be sent from l.sub.i to l.sub.j; (2) If l.sub.i is an indirect precedent of l.sub.j, then τ(l.sub.i,l.sub.j)=max{τ(l.sub.i,l.sub.r)+τ(l.sub.r,l.sub.j) | l.sub.r is a relay link between l.sub.i and l.sub.j}. Let l.sub.1−l.sub.2− . . . −l.sub.n be a longest path terminating at link l.sub.n according to these weights. Then the minimum convergence time for link l.sub.n is Σ.sub.1≤i≤n−1 τ(l.sub.i,l.sub.i+1).
(96) Proof. See Appendix A.1.5.
(97) It is important to note that the above lower bound is not attainable by TCP congestion control algorithms because, being end-to-end protocols, they have no direct communication mechanism between links. Instead, TCP algorithms rely on implicit feedback based on signals such as packet loss or round-trip time, which takes significantly longer to propagate from one link to another. Nevertheless, the importance of the above lemma lies in the claim that convergence time increases as the depth of the bottleneck structure increases, and this general principle holds even for end-to-end protocols. In the discussion below, we test this lemma against a TCP congestion control algorithm and empirically demonstrate that convergence time does increase as the depth of the bottleneck structure increases.
(98) The Influence of Flows onto Bottleneck Links and onto Each Other
(99) While the theory of bottleneck ordering helps understand the relations that exist between links and the influences they exert onto each other, the BPG graph resulting from this mathematical framework does not reveal any information regarding the performance of flows. There exists, however, a simple way to transform the BPG graph into a structure that does take into account flows and helps characterize their performance. We refer to this new structure as the flow gradient graph, formally defined as follows:
(100) Definition 4 Flow gradient graph. The flow gradient graph is a digraph such that: For every bottleneck link and for every flow, there exists a vertex. For every flow f: (1) If f is bottlenecked at link l then there exists a directed edge from l to f; (2) If f is not bottlenecked at link l but it passes through it, then there exists a directed edge from f to l.
(101) The flow gradient graph can be constructed using the FastWaterFilling algorithm described above, since this procedure computes the bottleneck links and, for every flow in the network, discovers that flow's bottleneck at line 17 (Algorithm 2), when the flow is entered into the converged set. We use an example to illustrate how this graph is constructed.
(102) Example 3 Computation of the flow gradient graph. Consider the network in
(103) The flow gradient graph is a useful extension of the BPG since it allows the bottleneck analysis to be generalized onto flows. We can see bottlenecks and flows as two sides of the same coin; if the network were an economy, links and flows would correspond to the supply and the demand, respectively. This duality principle implies that all the general lemmas and properties described herein regarding the bottleneck structure of a network have a dual correspondence in the domain of flows. For instance, Lemma 2 (bottleneck influence) can be translated to the domain of flows as follows:
(104) Lemma 5 Flow influence. A flow f can influence the performance of another flow f′, i.e., ∂r.sub.f′/∂r.sub.f≠0, if and only if there exists a directed path from f to f′ in the flow gradient graph.
(105) Proof. See Appendix A.1.6.
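Definition 4 and Lemma 5 can be exercised together with a small sketch (the helper names and the toy topology are ours): build the digraph from each flow's path and its bottleneck link, then test flow influence by graph reachability.

```python
from collections import defaultdict, deque

def flow_gradient_graph(flow_paths, bottleneck_of):
    """Digraph of Definition 4: edge l -> f if f is bottlenecked at l,
    edge f -> l if f traverses l without being bottlenecked there.

    flow_paths:    dict flow -> list of links on the flow's path
    bottleneck_of: dict flow -> the link at which the flow is bottlenecked
    """
    succ = defaultdict(set)
    for f, path in flow_paths.items():
        succ[bottleneck_of[f]].add(f)
        for l in path:
            if l != bottleneck_of[f]:
                succ[f].add(l)
    return succ

def influences(succ, f, f_prime):
    """Per the flow-influence lemma: True iff a directed path f -> ... -> f_prime exists."""
    seen, queue = {f}, deque([f])
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            if v == f_prime:
                return True
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False
```

For instance, if f.sub.2 is bottlenecked at l.sub.1 but also crosses l.sub.2, where f.sub.3 is bottlenecked, then a rate drift of f.sub.2 can influence f.sub.3 (path f.sub.2.fwdarw.l.sub.2.fwdarw.f.sub.3), but not vice versa.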
(106) Using this Lemma, we can infer the following properties from the flow gradient graph in
(107) Experimental Results
(108) To experimentally evaluate the ability of the proposed theory to predict bottleneck and flow performance in real networks, we developed a sandbox using Mininet and the POX SDN controller. The sandbox takes as input information related to a network's topology, routing, and traffic flows. With that, the sandbox can create the desired network configurations, including the configurations presented in
(109) On the Bottleneck Structure of TCP Networks
(110) In this initial experiment, we tested the hypothesis that TCP congestion-controlled networks behave as predicted by the theory of bottleneck ordering. In particular, we experimented with the three networks presented in
(111) We show the corresponding BPG graphs in . The rest of the links are assumed to have a capacity large enough so that they are not bottlenecks. All flows are configured to transmit a data set of 250 MB using TCP, and all links have a 2-millisecond latency.
(112) In this experiment, we also measured the degree of fairness achieved by each congestion control algorithm using Jain's fairness index normalized to the max-min rate allocation. (See Appendix A.4 for a summarized description.) Since the bottleneck structure of the network is also the solution to the theoretical max-min problem, this index effectively serves as an indicator of the degree to which the congestion control algorithm under test can assign rates according to the bottleneck structure itself.
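The normalized index used in this experiment is straightforward to compute: each measured rate is first divided by its theoretical max-min rate, so an algorithm whose allocation matches the bottleneck structure exactly scores 1.0. A minimal sketch (variable and function names are ours):

```python
def jain_index(xs):
    """Classic Jain fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(xs)
    return sum(xs) ** 2 / (n * sum(x * x for x in xs))

def normalized_jain_index(measured, maxmin):
    """Jain index of the ratios measured_i / maxmin_i, as used in Table 2."""
    return jain_index([m / t for m, t in zip(measured, maxmin)])
```

The index equals 1.0 when all ratios are equal and degrades toward 1/n as the allocation concentrates on a single flow.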
(113) The results for BBR and Cubic are presented in
(114) TABLE 1 Completion time of the slowest flow (seconds).
Algorithm  2-Level  3-Level  4-Level
BBR            230      235      230
Cubic          380      385      810
Reno           375      370      750
(115) TABLE 2 Jain's fairness index.
Algorithm  2-Level  3-Level  4-Level
BBR           0.99     0.98     0.92
Cubic         0.94     0.96     0.72
Reno          0.93     0.96     0.73
The Effect of Latency on the Bottleneck Structure of a Network
(116) In the discussion above, we saw that AIMD algorithms fail to identify the bottleneck structure of the test networks, and as a consequence, they perform significantly poorer than BBR. We now study this phenomenon in more detail.
(117) The first observation we make is that differences in flow RTTs can play a role in the ability of a congestion control algorithm to identify the bottleneck structure. For instance, in
(118) Due to its lower RTT, the lower level flow f.sub.1 can capture bandwidth from the higher-level flow f.sub.2, and this leads to the collapse of the 2 levels in the Cubic case. This is in contrast with BBR (
(119) To demonstrate the effect of RTT, we tested the 3-level network (
(120) The above results show that BBR does significantly better at identifying each flow's true bottleneck, but can we still fool this congestion control algorithm into collapsing its bottleneck structure? In the next experiment, we pursue this objective by using the 3-level network, initially setting all link latencies to zero, and progressively increasing the RTT of the flows on the upper bottleneck levels to verify whether the structure collapses. To control the RTT of each flow individually, we add a fictitious link attached to the source of each flow with a large capacity (so it does not become a bottleneck) and a latency that we adjust to achieve a specific end-to-end RTT value for each flow. In
(121) On How Low-Hitter Flows Can Be High-Impact Flows
(122) Next, we illustrate an example of how the theory of bottleneck ordering can be used to support traffic engineering operations. We start with an initial network configuration, and our objective is to identify the flows that have the largest impact on the performance of the rest of the flows. Based on the information from this analysis, a network operator can choose to re-route high-impact traffic or assign it to a lower priority queue to help improve the overall performance of the network.
(123) To drive this experiment, we consider the network introduced in
(124) In
(125) This result might seem counter-intuitive if we consider the fact that flow f.sub.6 is among the smallest throughput flows, but it is easy to explain if we take into account the bottleneck structure of the network. In particular, from the flow gradient graph in
(126) This result indicates that traditional approaches, which focus on finding the heavy-hitter flows, might lead to sub-optimal choices because they do not take into account the full bottleneck structure of the network. We reason that network operators interested in finding the high-impact flows should also consider the BPG and the flow gradient graph structures to aid their optimization process.
(127) TABLE 3 As predicted by the theory of bottleneck ordering, flow f.sub.6 is a significantly higher impact flow than flow f.sub.5.
Comp. time (secs)      f.sub.1  f.sub.2  f.sub.3  f.sub.4  f.sub.5  f.sub.6  Slowest
With all flows             664      340      679      331       77      636      679
Without flow f.sub.5       678      350      671      317        —      611      678
Without flow f.sub.6       416      295      457      288       75        —      457
Avg rate (Mbps)        f.sub.1  f.sub.2  f.sub.3  f.sub.4  f.sub.5  f.sub.6  Total
With all flows             7.7     15.1      7.5     15.4     65.8      8.1    119.6
Without flow f.sub.5       7.5     14.5      7.6     16.1        —      8.3     54
Without flow f.sub.6      12.2     17.2     11.1     17.7     68.1        —    126.3
(128) Finally, we also note that we performed the same experiments using TCP Cubic and Reno and obtained the same behavior in both algorithms. (For the sake of brevity, we include those results in Appendix A.3.) Overall, the results show that the same traffic engineering analysis based on the bottleneck structure of the network applies not only to BBR but to TCP congestion-controlled networks in general.
(129) Convergence Time with Increased Number of Bottleneck Levels and Flow Competition
(130) In the discussion above, we saw that the depth of the bottleneck structure of a network is closely related to the convergence time of a distributed congestion control algorithm. We set out to experimentally verify this property by using the network configuration in
(131) To complement the convergence time analysis, we also tested the impact of increasing the number of flows at each level. With this objective, we implemented a flow multiplier factor as part of our sandbox. Multiplying the number of flows by a factor m means that each flow in a given network is replicated (preserving its source, destination, and route) m times. This approach can effectively increase by m the number of flows competing at the same level, without modifying the bottleneck structure of the network.
(132) Table 4 presents the results of running the described experiment varying the number of levels n and the flow multiplier factor m from 1 to 4, for a total of 16 experiments, using the BBR algorithm. The results show the behavior predicted by the theoretical framework: Convergence time increases (1) as the number of bottleneck levels increases, and (2) as the number of flows at the same level increases.
(133) TABLE 4 Convergence time increases with the number of levels and the number of level-competing flows.
                 1-Level  2-Level  3-Level  4-Level
num. flows × 1         2        2        2        2
num. flows × 2         2        2       12       26
num. flows × 3         4       16       14       54
num. flows × 4        14       26       34       72
(134) While network operators cannot influence the number of flows, they have some ability to alter the number of levels in the BPG graph. For instance, one strategy would be to configure routes (or even design network topologies) that tend to generate bottleneck structures of smaller depth. This strategy would help flows converge faster and lead to networks operating at higher utilization. The study and design of routes and topologies that yield shallow bottleneck structures, while very interesting, is left for future work.
(135) To the best of our knowledge, the proposed theoretical framework is the first to address the problem of characterizing the bottleneck structure of a network and the influences and relationships existing between bottlenecks from both formal and practical standpoints. A known technique refers to the bottleneck relationships as dependency chains and observes that performance perturbation of bottleneck links may affect other bottleneck links, but it does not address the problem of formally identifying the hidden structure that controls such perturbations.
(136) Various other bottleneck characterization techniques appear to assume implicitly that the bottlenecks in a network have a flat structure. Traditional traffic optimization approaches generally only focus on heavy-hitter flows using a single-link model. In contrast, the techniques described herein reveal the bottleneck structure of a network, which allows optimizing traffic engineering by taking into account the network's topological, routing, and flow properties.
(137) In the techniques described herein, we base the bottleneck analysis on max-min rate assignment. An initial algorithm to compute the max-min rate allocation of a network is known, as is an improved version of it. Our algorithm to construct the bottleneck structure is based on a further extension of that improved algorithm.
(138) Congestion control algorithms have been one of the most widely researched subjects in the field of computer communications since the Internet collapsed in 1986 due to the lack of them. Starting from the first TCP congestion control algorithm implemented in 1987, key research has focused on characterizing the bottleneck of each flow to help maximize the performance of flows individually while achieving fairness collectively. Such a single-bottleneck-oriented approach, however, has left a gap in the understanding of the global structure of the problem.
(139) In this first part, we addressed this gap by introducing a theory of bottleneck ordering. The theoretical framework leads to the BPG algorithm, a procedure that reveals the bottleneck structure of a network in the form of a graph—the BPG graph. This graph provides vital insights to the understanding of the collective bottleneck behavior in a distributed network, among others: (1) The implicit relationships that exist between bottlenecks; (2) the bounded regions in the network that links and flows can influence; and (3) the convergence properties of the congestion control algorithm run by the network.
(140) Our experimental results show that: (1) real networks do have a bottleneck structure that may follow the BPG graph closely; (2) RTT variations can lead to imperfections (levels that fold) in the bottleneck structure, which can severely affect performance; and (3) congestion-based algorithms such as BBR are able to infer the bottleneck structure significantly better than loss-based/AIMD algorithms such as Cubic and Reno, which leads them to achieve superior performance in both throughput and fairness. The theory also reveals intriguing results, such as the finding that low-hitter flows can have a higher impact on a network than heavy-hitter flows, and the notion that routing and topology can play a role in the design of shallower bottleneck structures that would improve the convergence time of congestion control algorithms.
(141) Part 2: A Theory of Flow Ordering for Networks
(142) How large is a network flow? Existing work in the literature has addressed this question by considering metrics such as the number of bytes, transmission rate, duration, jitter or burstiness, to name a few. But is one metric better than another and, more importantly, what is the true meaning of the elusive concept of flow size? In the techniques described herein we show that a formal mathematical definition of flow size depends on the topological structure of the network and a utility function that describes its QoS properties. This leads to a theory of flow ordering, a mathematical framework that explains the connection between the abstract concept of flow size and the QoS properties of networks, including data networks, energy-distribution networks, cellular networks, and networks for the transport of physical goods. To put the theory to the test, we apply it to derive computationally tractable algorithms that resolve the problem of topologically ordering TCP flows according to their QoS impact.
INTRODUCTION
(143) The objective of this part of the discussion is to develop the following concept: the size of a flow should not only be measured in terms of arbitrary scalar metrics such as rate, byte counts or duration, as has been done in the past, but also through a formal framework that takes into account the mathematical properties of a data network, in terms of its topology and its QoS utility function. The contributions of this technique are as follows: We demonstrate that the flow ordering of a data network can be derived as the solution to a problem we refer to as the l-partitioned QoS problem. This approach is general in that it takes into account the topological and the QoS properties of the network. The particular case of the 2-partitioned QoS problem (l=2) leads to a general definition of the concepts of elephant and mouse flows. We demonstrate that all existing metric-based definitions of elephant flow implicitly assume a single-bottleneck network topology. The present theory makes it possible to identify the flow ordering in arbitrary network topologies. The general problem of identifying the flow ordering is NP-complete. Fortunately, we show that if the QoS function satisfies the property of nested convexity (as defined herein), then the greedy algorithm (introduced below) is optimal and can find the flow ordering in polynomial time. This opens a study area around the problem of designing and analyzing QoS functions that are nested convex. Based on the concepts of max-min and max-flow optimization, we present an initial candidate QoS function, and analyze its nested convexity properties under a variety of scenarios that are suitable to TCP networks. To illustrate the applicability of the proposed framework, we use it to study the problem of flow ordering in Google's B4 network.
(144) In the following, we present the main body of this work. We first introduce the concept of partitioned QoS, the general definition of elephant flow, the topological properties of inclusiveness and nested convexity and an optimal greedy algorithm.
(145) Thereafter, we provide empirical measurements to help validate the theory and illustrate its applicability, and provide conclusions.
(146) Network Model and QoS Functions
(147) We start by introducing a definition of the network model that will be used throughout the discussion below. As used herein, a network can be a data network, an energy (e.g., electricity) distribution network, a cellular network, a network for transporting goods, etc.
(148) Definition 5 Network. We define a network N as a tuple ⟨T,F⟩ such that: T=⟨S,L⟩ is a graph having a set of vertices S and a set of edges L. F={f.sub.1, f.sub.2, . . . , f.sub.|F|} is a set of flows, where each flow f.sub.i is an ordered list of vertices v.sub.i,1.fwdarw.v.sub.i,2.fwdarw.( . . . ).fwdarw.v.sub.i,|f.sub.i.sub.|, such that v.sub.i,j∈S and (v.sub.i,j, v.sub.i,j+1)∈L, for all 1≤i≤|F| and 1≤j<|f.sub.i|.
(149) In the language of data networks, the vertices in S correspond to switches (or routers) whereas the edges in L correspond to links connecting the switches in the network. Together they uniquely characterize the topology of the network T. In addition, a flow f.sub.i∈F corresponds to a specific communication connection (e.g., depending on the network architecture and model, this could correspond to a TCP/UDP connection, an MPLS path, an IP tunnel, etc.) between a source vertex and a destination vertex.
(150) Next we introduce the concept of partitioned QoS function, the starting point from where our theoretical framework is constructed:
(151) Definition 6 Partitioned QoS function. Let N=⟨T,F⟩ be a network and assume that each flow f.sub.i∈F is assigned resources from N according to one of l possible policies (P.sub.1, P.sub.2, . . . , P.sub.l). For instance, while not limited to, an example of policies can be those conforming to the rule "flows assigned to policy P.sub.i are forwarded at a higher priority than flows assigned to policy P.sub.i+1". Let the tuple ⟨F.sub.1, F.sub.2, . . . , F.sub.l⟩ be an l-part partition on the set of flows F such that the flows in F.sub.i are assigned to policy P.sub.i, for 1≤i≤l. We will say that q(F.sub.1, F.sub.2, . . . , F.sub.l) is an l-partitioned QoS function of network N if q(F.sub.1, F.sub.2, . . . , F.sub.l)>q(F.sub.1′, F.sub.2′, . . . , F.sub.l′) implies that the quality of service (QoS) achieved with configuration ⟨F.sub.1, F.sub.2, . . . , F.sub.l⟩ is greater than with configuration ⟨F.sub.1′, F.sub.2′, . . . , F.sub.l′⟩. When there is no need to make the number of partitions explicit, we will also say that q( ) is a partitioned QoS function or simply a QoS function.
(152) A partitioned QoS function is therefore a utility function on the space of all the possible l-part partitions of F. Throughout the discussion below, the following encoding will provide a convenient way to express some of the mathematical and algorithmic results:
(153) Definition 7 Flow encoding. A network configuration expressed as an l-part partition of a set of flows F, ⟨F.sub.1, F.sub.2, . . . , F.sub.l⟩, can be conveniently encoded as an |F|-dimensional vector p where each element p.sub.i is such that:
p.sub.i=k iff f.sub.i∈F.sub.k (1)
(154) Hence, any l-part partition has one corresponding vector in the space [l].sup.|F|, where [l]={k∈ℕ|k≤l}. Using this encoding, a partitioned QoS function can be expressed as a utility function mapping the set of vectors in [l].sup.|F| onto the set of real numbers ℝ:
q:[l].sup.|F|.fwdarw.ℝ (2)
(155) A partitioned QoS function can also be expressed in terms of finding ways to store |F| objects into l boxes according to a utility function q( ). In this case, the search space is given by the sum of all multinomial coefficients, which corresponds also to the size of the space [l].sup.|F|:
Σ.sub.k.sub.1.sub.+ . . . +k.sub.l.sub.=|F|(|F|!/(k.sub.1! . . . k.sub.l!))=l.sup.|F| (3)
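The encoding of Definition 7 and the size of this search space are easy to make concrete. A brief sketch (the helper names are ours): every l-part partition maps to a vector in [l].sup.|F|, and exhaustively scoring all l.sup.|F| vectors is exactly the brute-force version of the partitioned QoS problem.

```python
from itertools import product

def encode(partition, flows):
    """Definition 7: p_i = k iff flow f_i belongs to part F_k (parts are 1-indexed).

    partition: list of sets [F_1, ..., F_l]
    flows:     ordered list [f_1, ..., f_|F|]
    """
    return [next(k for k, part in enumerate(partition, start=1) if f in part)
            for f in flows]

def brute_force_qos(flows, l, q):
    """Scan all l**len(flows) configuration vectors and return one maximizing q."""
    return max(product(range(1, l + 1), repeat=len(flows)), key=q)
```

This brute force is only viable for tiny instances, which is precisely why the nested-convexity property and the greedy algorithm discussed below matter.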
(156) The problem of finding an optimal network configuration ⟨F.sub.1, F.sub.2, . . . , F.sub.l⟩ can now be reduced to the problem of identifying the element in [l].sup.|F| that maximizes q( ):
(157) Definition 8 Partitioned QoS problem. We will say that an l-part partition ⟨F.sub.1*, F.sub.2*, . . . , F.sub.l*⟩ of F is a QoS optimal partition if q(F.sub.1*, F.sub.2*, . . . , F.sub.l*)≥q(F.sub.1, F.sub.2, . . . , F.sub.l) for any other l-part partition ⟨F.sub.1, F.sub.2, . . . , F.sub.l⟩ of F. We will refer to the problem of finding such an optimal partition as the partitioned QoS problem, or the l-partitioned QoS problem if there is a need to make the number of partitions explicit.
(158) General Definition of Elephant and Mouse Flow
(159) The previous formulation leads to a formal interpretation and definition of the concepts of elephant and mouse flows:
(160) Definition 9 General definition of elephant and mouse flows. Let N=⟨T,F⟩ be a network and assume that each of its flows is assigned resources from the network according to two possible policies, P.sub.1 and P.sub.2. Let also q( ) be a 2-partitioned QoS function. Then we will say that F.sub.e⊆F is the set of elephant flows if and only if ⟨F.sub.e,F\F.sub.e⟩ is a QoS optimal partition. Consequently, we will say that F\F.sub.e is the set of mouse flows.
(161) The above definition shows that the problem of identifying the set of elephant flows in a network corresponds to an l-partitioned QoS problem with l=2. It also allows us to establish a direct relationship between the meaning of elephant flow and its effects on the QoS of a network: the set of elephant flows is one that, when exclusively assigned to policy P.sub.1, makes the resulting QoS of the network maximal. The same can be said of mouse flows: the set of mouse flows is one that, when exclusively assigned to policy P.sub.2, also makes the resulting QoS maximal. Notice that without loss of generality, we will take the convention ⟨F.sub.1,F.sub.2⟩=⟨F.sub.e,F\F.sub.e⟩, so that the sets of elephant and mouse flows are assigned policies P.sub.1 and P.sub.2, respectively.
(162) Following the notation in equation (1), any 2-partitioned QoS function q( ) can be encoded using a mapping between the set of |F|-dimensional binary vectors and the set of real numbers:
q:b.sup.|F|.fwdarw.ℝ (4)
where b=[2]={0,1}.
(163) Since the above is a general definition of elephant and mouse flows, then any of the definitions proposed in the existing literature should correspond to a particular solution of a specific 2-partitioned QoS problem. The next lemma reveals the structural connection of the existing solutions—characterized by the problem's implicit network topology and QoS function—and the general definition of elephant and mouse flows:
(164) Lemma 6 Metric-based partitioned QoS. Let m(f) be any function mapping the set of flows F to the set of nonnegative real numbers, m:F.fwdarw.ℝ.sup.+. We will call m(f) a flow-metric function, or simply the metric function when the context is understood. Examples of metric functions include the number of bytes or the transmission rate of a flow. Assume the following 2-partitioned QoS function:
(165)
where r(f) is the rate of flow f, C is a network capacity parameter, K.sub.1 and K.sub.2 are two arbitrary positive numbers chosen so that K.sub.2>K.sub.1, and π(f) and ρ(f) are two penalty and reward functions satisfying the following conditions:
π(f.sub.i)>π(f.sub.j)⇔m(f.sub.i)>m(f.sub.j) and ρ(f.sub.i)>ρ(f.sub.j)⇔m(f.sub.i)<m(f.sub.j), for all f.sub.i,f.sub.j∈F (6)
(166) Then the set of elephant flows corresponds to F.sub.1=F.sub.e={f.sub.1, f.sub.2, . . . , f.sub.λ*} such that m(f.sub.i)>m(f.sub.i+1), Σ.sub.∀i>λ*r(f.sub.i)≤C, and F.sub.e has minimal cardinality.
(167) Proof. Equation (5) can be modeled using the network described in
(168) The optimal 2-partitioned QoS is therefore one that redirects the minimum number of the largest flows—according to the metric function m( )—to the low-priority queue without exceeding the capacity C of the bottleneck link. This partitioned QoS problem can be trivially solved using the following greedy algorithm. Let {f.sub.1, f.sub.2, . . . , f.sub.|F|} be the list of flows ordered according to the rule m(f.sub.i)≥m(f.sub.i+1). Then the set of elephant flows corresponds to F.sub.1=F.sub.e={f.sub.1, f.sub.2, . . . , f.sub.λ*} such that Σ.sub.∀i>λ*r(f.sub.i)≤C and F.sub.e has minimal cardinality, i.e., λ* is minimal.
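The greedy procedure just described can be written directly. The following is a sketch (function names and toy numbers are ours): sort the flows by the metric m, then peel off the largest ones until the residual aggregate rate fits within the bottleneck capacity C.

```python
def metric_greedy_elephants(flows, m, r, C):
    """Smallest prefix of the m-ordered flows whose removal leaves the
    total residual rate within capacity C, per the greedy in Lemma 6.

    flows: iterable of flow identifiers
    m:     metric function (e.g. rate, byte count)
    r:     rate function
    C:     capacity of the (implicit single) bottleneck link
    """
    ordered = sorted(flows, key=m, reverse=True)   # m(f_i) >= m(f_{i+1})
    residual = sum(r(f) for f in ordered)
    elephants = []
    for f in ordered:
        if residual <= C:
            break
        elephants.append(f)      # redirect f to the other policy/queue
        residual -= r(f)
    return elephants             # minimal-cardinality elephant set F_e
```

For example, with rates {a: 10, b: 5, c: 1} and C=6, only flow a must be classified as an elephant, since removing it leaves a residual of 6 ≤ C.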
(169) The 2-partitioned QoS function in equation (5) demonstrates that for any elephant flow detection algorithm in the literature which uses a metric function m(f) to classify elephant flows, there exists at least one partitioned QoS function—e.g., that in equation (5)—whose solution leads to the ordering given by the metric. For instance, elephant flow definitions such as those based on the metrics of flow rate, byte count or burstiness are all equivalent to solving the partitioned QoS problem in
(170) General Definitions of Flow Ordering and Flow Size
(171) We now focus on the problem of deriving a formal definition of flow ordering and flow size for data networks. While we will discuss this subject for the case l=2, the following results can be demonstrated to be applicable to the general case of an arbitrary number of policies l≥2.
(172) Definition 10 λ-optimal QoS partition. Let ⟨N,P.sub.1,P.sub.2,q( )⟩ be a 2-partitioned QoS problem. We will say that F.sub.λ⊂F defines a λ-optimal QoS partition on the set of flows F if q(F.sub.λ,F\F.sub.λ)≥q(F′,F\F′) for all F′ such that F′⊂F and |F.sub.λ|=|F′|=λ. When the meaning is clear, we will simply use the term λ-optimal partition.
(173) Notice that using Definitions 6 and 10, we have that the set of elephant flows F.sub.e is an |F.sub.e|-optimal QoS partition. The intuition behind the concept of a λ-optimal QoS partition is as follows. Assume first that λ=1. The 1-optimal QoS partition corresponds to a single-flow partition F.sub.1={f.sub.1} such that the exclusive mapping of flow f.sub.1 onto policy P.sub.1 leads to the best QoS configuration among all other possible single-flow partition configurations. It follows that flow f.sub.1 has the biggest QoS impact—as its mapping onto policy P.sub.1 is QoS maximal—and we can conclude that among all flows, f.sub.1 is the largest one from a QoS impact standpoint. This reveals a recursive strategy to identify the ordering of flows according to an abstract concept of size as follows: Start by finding F.sub.1={f.sub.1}⊂F such that F.sub.1 is a 1-optimal partition. Mark f.sub.1 as the largest flow. Next, find F.sub.2={f.sub.1,f.sub.2}⊂F such that F.sub.2 is a 2-optimal partition. Mark f.sub.2 as the second largest flow. Continue recursively for all possible λ-optimal partitions until all the flows are marked.
(174) Notice that in order for the above construction process to succeed, the condition F.sub.λ⊂F.sub.λ+1 must be satisfied at each iteration. This leads to the first property that a 2-partitioned QoS function needs to satisfy to ensure the flows can be ordered according to their size:
(175) Property 2 Inclusive QoS functions. We will say that a 2-partitioned QoS function is inclusive if its λ-optimal partition F.sub.λ includes the (λ−1)-optimal partition F.sub.λ−1, for 2≤λ≤|F|. Equivalently, assuming the vector p.sub.λ is the configuration encoding of (F.sub.λ, F\F.sub.λ) (see definition 7), a 2-partitioned QoS function is inclusive if p.sub.λ−1=p.sub.λ−1∧p.sub.λ, where ∧ is the bitwise logical conjunction.
(176) We are now in a position to formally introduce the concept of flow size ordering:
(177) Definition 11 Flow ordering. Let ⟨N, P.sub.1, P.sub.2, q( )⟩ be a 2-partitioned QoS problem and assume that q( ) is inclusive. Then we will say that f.sub.1 is the largest elephant flow in the set F if and only if:
q({f.sub.1},F\{f.sub.1})≥q({f′},F\{f′}), for all f′∈F. (7)
(178) Similarly, we will say that a flow f.sub.2 is the second largest flow in the set F if and only if:
q({f.sub.1,f.sub.2},F\{f.sub.1,f.sub.2})≥q({f.sub.1,f′},F\{f.sub.1,f′}), (8)
for all f′∈F and f′≠f.sub.1. More generally, we will say that a flow f.sub.i∈F is the i-th largest flow in the set F if and only if:
q(F.sub.i,F\F.sub.i)≥q(F′,F\F′) (9)
where
F.sub.i=F.sub.i−1∪{f.sub.i}, f.sub.i∉F.sub.i−1
F′=F.sub.i−1∪{f′}, for all f′∉F.sub.i−1 (10)
for 1≤i≤|F| and F.sub.0={Ø}.
(179) We will also say that the ordered list {f.sub.1, f.sub.2, . . . , f.sub.|F|} defines the flow size ordering of the 2-partitioned QoS problem ⟨N, P.sub.1, P.sub.2, q( )⟩.
(180) Definition 11 introduces both the mathematical definition of flow ordering and a greedy algorithm to order the flows according to such ordering. It is easy to see that if the QoS function is inclusive, then the greedy algorithm is optimal, i.e., it yields the correct flow ordering and the correct set of λ-optimal partitions. Further, for the particular case of l=2, the top flows in this ordering correspond to the set of elephant flows. We will elaborate a bit more on this point below.
(181) While knowing the flow ordering is often enough to characterize and optimize network problems, sometimes it can be useful to have an exact measurement or metric for the size of a flow. For instance, as we have seen, all metric based solutions found in the literature rely on such metric to provide the definition of elephant flow and to construct heavy hitter detection algorithms. Next, we construct a formal metric that captures the mathematical meaning of flow size according to the QoS properties of the network.
(182) Property 3 Decreasing returns to gains. Let {f.sub.1, f.sub.2, . . . , f.sub.|F|} be the flow size ordering of a 2-partitioned QoS problem ⟨N, P.sub.1, P.sub.2, q( )⟩ and let σ.sub.q(f.sub.i)=q(F.sub.i,F\F.sub.i)−q(F.sub.i−1, F\F.sub.i−1). We will say that q( ) has decreasing returns to gains if the following is true:
σ.sub.q(f.sub.i)≤σ.sub.q(f.sub.i−1), for 2≤i≤|F| (11)
(183) Intuitively, the property of decreasing returns to gains tells us that as more flows are moved into policy P.sub.1, the QoS gains achieved by performing such action become smaller. This property leads to a natural definition of flow size as follows:
(184) Definition 12 Flow size and q-size metric. Let {f.sub.1, f.sub.2, . . . , f.sub.|F|} be the flow size ordering of a 2-partitioned QoS problem ⟨N, P.sub.1, P.sub.2, q( )⟩ and assume q( ) has decreasing returns to gains. We define σ.sub.q(f.sub.i)=q(F.sub.i, F\F.sub.i)−q(F.sub.i−1, F\F.sub.i−1) as the size of flow f.sub.i. We will also refer to the function σ.sub.q(f.sub.i) as the q-size metric of the QoS problem ⟨N, P.sub.1, P.sub.2, q( )⟩. When the subindex q is redundant, we will simply use the expression σ(f.sub.i).
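Given any flow size ordering, the q-size metric and Property 3 can be evaluated with a short sketch. The function names and the additive toy QoS are ours, for illustration only:

```python
def q_sizes(ordering, all_flows, q):
    """sigma_q(f_i) = q(F_i, F \\ F_i) - q(F_{i-1}, F \\ F_{i-1}),
    computed along a given flow size ordering (Definition 12)."""
    sizes = {}
    F_i = set()
    prev = q(F_i, set(all_flows))
    for f in ordering:
        F_i = F_i | {f}
        cur = q(F_i, set(all_flows) - F_i)
        sizes[f] = cur - prev
        prev = cur
    return sizes

def has_decreasing_returns(ordering, sizes):
    """Property 3: sigma_q(f_i) <= sigma_q(f_{i-1})."""
    vals = [sizes[f] for f in ordering]
    return all(vals[i] <= vals[i - 1] for i in range(1, len(vals)))

# Toy additive QoS with illustrative per-flow gains.
gain = {"f1": 5.0, "f2": 3.0, "f3": 1.0}
q = lambda F1, F2: sum(gain[f] for f in F1)
sizes = q_sizes(["f1", "f2", "f3"], gain, q)
print(sizes)  # {'f1': 5.0, 'f2': 3.0, 'f3': 1.0}
print(has_decreasing_returns(["f1", "f2", "f3"], sizes))  # True
```

For the additive toy QoS the q-size of each flow collapses to its per-flow gain, mirroring the metric-based case of Lemma 6.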
(185) The formal concept of flow size can be understood by analogy with an intuitive method that one can use to measure the size of objects in nature. Suppose that we have different types of objects and that we are interested in knowing their volume ordering. We can immerse each object into a pool of water and measure the amount of water displaced. The object with the n-th largest volume will displace the n-th largest amount of water. This approach is equivalent to the definition of flow size ordering introduced above. The water in the pool can be seen as the total level of QoS available. When a flow is classified as an elephant, it is moved into policy P.sub.1, as if an object were removed from the pool, enhancing the overall QoS of the network by an amount equal to the amount of water displaced. The size of flow f.sub.i corresponds to the quantity of water displaced, which mathematically can be expressed as the difference in QoS between the two configurations ⟨F.sub.i, F\F.sub.i⟩ and ⟨F.sub.i−1, F\F.sub.i−1⟩. This value corresponds to σ(f.sub.i).
(186) The property of decreasing returns to gains is necessary to provide a natural metric onto the set of real numbers upon which we can measure in absolute terms the size of a flow, but it is not needed in order to define the ordering of the flows according to their size. To define the ordering, a 2-partitioned QoS function q( ) only needs to satisfy the inclusiveness property. In the discussion below, we identify the convexity properties that need to be supported by q( ) to ensure inclusiveness. Before doing that, we complete this analysis showing that all metric-based definitions of flow ordering have decreasing returns to gains:
(187) Lemma 7 Decreasing returns to gains for metric-based ordering. All metric-based partitioned QoS functions have decreasing returns to gains.
(188) Proof. Consider the metric-based partitioned QoS problem described in Lemma 6. Using the condition m(f.sub.i)>m(f.sub.i+1) and equation (6), we can apply equation (9) to derive the flow size ordering as follows:
q({Ø},F)=K.sub.1−Σ.sub.∀f∈Fπ(f)=q.sub.0
q(F.sub.1,F\F.sub.1)=q.sub.0+π(f.sub.1)=q.sub.1
q(F.sub.2,F\F.sub.2)=q.sub.1+π(f.sub.2)=q.sub.2
( . . . )
q(F.sub.λ*−1,F\F.sub.λ*−1)=q.sub.λ*−2+π(f.sub.λ*−1)=q.sub.λ*−1
q(F.sub.λ*,F\F.sub.λ*)=K.sub.2+Σ.sub.∀f∈F\F.sub.λ*ρ(f)=q.sub.λ*
q(F.sub.λ*+1,F\F.sub.λ*+1)=q.sub.λ*−ρ(f.sub.λ*+1)=q.sub.λ*+1
( . . . )
q(F.sub.|F|−1,F\F.sub.|F|−1)=q.sub.|F|−2−ρ(f.sub.|F|−1)=q.sub.|F|−1
q(F,{Ø})=q.sub.|F|−1−ρ(f.sub.|F|)=q.sub.|F| (12)
where F.sub.i=F.sub.i−1∪{f.sub.i}, for all 1≤i≤|F|, F.sub.0={Ø} and F.sub.|F|=F.
(189) The above sequence has two separate regions, graphically illustrated in
(190) The second region generates the sequence F.sub.λ*, . . . , F.sub.|F|, which corresponds to all possible λ-optimal partitions under the assumption that the network is not congested (the lower level of equation (5)). In this region, moving a flow f.sub.i from policy P.sub.2 to policy P.sub.1 decreases the QoS of the network by ρ(f.sub.i), i.e., the QoS gain foregone by flow f.sub.i when scheduled as low priority. Similarly, using definition 9, {f.sub.λ*+1, . . . , f.sub.|F|} corresponds to the set of mouse flows in the network. The optimal partition is reached when λ=λ*, that is, when all elephant flows are moved to policy P.sub.1, which is accomplished with the configuration (F.sub.λ*, F\F.sub.λ*), and the total QoS is:
K.sub.2+Σ.sub.∀f∈F\F.sub.λ*ρ(f) (13)
(191) It is easy to see that σ(f.sub.i)=π(f.sub.i) for 1≤i≤λ* and σ(f.sub.i)=−ρ(f.sub.i) for λ*<i≤|F|. Now using equation (6), we can conclude that all metric-based QoS functions have decreasing returns to gains.
(192) Study of Inclusive QoS Functions
(193) The property of inclusiveness is relevant because if a QoS function satisfies it, then the greedy algorithm introduced in Definition 11 is optimal and the problem of ordering the flows in a network becomes tractable. We now study the mathematical properties of inclusive QoS functions. We start with a simple example.
(194) Example 4 Inclusive paths on 3 flow networks. Let ⟨N, P.sub.1, P.sub.2, q( )⟩ be a 2-partitioned QoS problem and assume that F={f.sub.1,f.sub.2,f.sub.3}. We are interested in understanding which possible mappings of the flows onto policies P.sub.1 and P.sub.2 preserve the property of inclusiveness. Using the encoding in definition 7 and since |F|=3 and l=2, the set of possible flow mappings corresponds to the following binary symbols:
[l].sup.|F|=[2].sup.3={{000},{001},{010}, {011},{100},{101},{110},{111}} (14)
(195) The encoding of the solution set works as follows. A zero in the i-th position of each symbol means that flow f.sub.i is classified as a mouse flow, hence mapped onto policy P.sub.2. Similarly, a one in the j-th position of each symbol means that flow f.sub.j is classified as an elephant flow, hence mapped onto policy P.sub.1. To find an optimal mapping, we start with the symbol {000} that maps all flows onto policy P.sub.2 and attempt to increase the QoS of the network by migrating flows to policy P.sub.1 following the recursive strategy described in equation (9). The complete set of possible optimization paths following this strategy is presented in
(196) As indicated in the graph, out of a total of nine possible paths, there exist six paths that satisfy the property of inclusiveness:
(197)
(198) Example 4 illustrates with a simple network configuration that not all possible optimization paths lead to a well defined flow ordering. Instead, only those that satisfy the property of inclusiveness do. We are now interested in identifying the topological properties that partitioned QoS functions must satisfy to ensure the property of inclusiveness is satisfied. To this end, we introduce first the concepts of partition distance and nested neighborhood:
(199) Definition 13 Partition and nested neighborhoods. Let ⟨F.sub.1, F.sub.2⟩ and ⟨F.sub.1′, F.sub.2′⟩ be two 2-part partitions of a set of flows F and let the vectors p and p′ be their corresponding configuration encodings (definition 7). We define the partition distance d.sub.π( ) between ⟨F.sub.1, F.sub.2⟩ and ⟨F.sub.1′, F.sub.2′⟩ as the difference between the number of flows assigned to F.sub.1 and F.sub.1′. (Equivalently, it is also the difference between the number of flows assigned to F.sub.2 and F.sub.2′.) Mathematically:
d.sub.π(⟨F.sub.1,F.sub.2⟩,⟨F′.sub.1,F′.sub.2⟩)=||F.sub.1|−|F′.sub.1||=||F.sub.2|−|F′.sub.2|| (16)
or, in a more compact form using the configuration encoding:
d.sub.π(p,p′)=|∥p∥.sub.1−∥p′∥.sub.1| (17)
where ∥x∥.sub.1 is the Manhattan norm. We also define the partition neighborhood n.sub.π( ) of a 2-part partition as the set of all 2-part partitions that are at a partition distance of 1 from it:
n.sub.π(⟨F.sub.1,F.sub.2⟩)={⟨F′.sub.1,F′.sub.2⟩ s.t. d.sub.π(⟨F.sub.1,F.sub.2⟩,⟨F′.sub.1,F′.sub.2⟩)=1} (18)
equivalently:
n.sub.π(p)={p′ s.t. d.sub.π(p,p′)=1} (19)
(200) The partition neighborhood allows us to define the concept of nested neighborhood n.sub.o( ) as follows:
n.sub.o(⟨F.sub.1,F.sub.2⟩)={⟨F′.sub.1,F′.sub.2⟩∈n.sub.π(⟨F.sub.1,F.sub.2⟩) s.t. F.sub.1⊂F′.sub.1}⇔n.sub.o(p)={p′∈n.sub.π(p) s.t. p=p∧p′} (20)
where ∧ is the bitwise logical conjunction.
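Using the 0/1 configuration encoding of definition 7, the partition distance and the two neighborhoods reduce to simple bit operations. A minimal sketch (the function names are ours):

```python
def popcount(p):
    """Manhattan norm (number of ones) of a configuration encoding."""
    return bin(p).count("1")

def d_pi(p, q):
    """Partition distance of equation (17)."""
    return abs(popcount(p) - popcount(q))

def n_pi(p, n_flows):
    """Partition neighborhood of equation (19)."""
    return {q for q in range(2 ** n_flows) if d_pi(p, q) == 1}

def n_o(p, n_flows):
    """Nested neighborhood of equation (20): p = p AND p'."""
    return {q for q in n_pi(p, n_flows) if p == p & q}

# |F| = 3, p = {100}: the nested neighbors are {110} and {101}.
print(sorted(n_o(0b100, 3)))  # [5, 6]
```

Note that `p == p & q` is exactly the bitwise form of F.sub.1⊂F′.sub.1: every elephant flow of p must remain an elephant flow of p′.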
(201) Notice that the nested neighborhood of a 2-part partition is contained inside its partition neighborhood, n.sub.o(p)⊂n.sub.π(p), hence the name nested. We can now introduce the concept of nested convexity which will allow us to characterize the property of inclusiveness:
(202) Definition 14 Nested convexity on 2-partitioned QoS functions. Let a configuration encoding p be a ∥p∥.sub.1-optimal partition of an arbitrary 2-partitioned QoS problem ⟨N, P.sub.1, P.sub.2, q( )⟩. The QoS function q( ) is said to be nested convex in the vicinity of p if the following is true:
∀p′∈n.sub.π(p) ∃p″∈n.sub.o(p) s.t. q(p″)≥q(p′) (21)
(203) Finally, we are in a position to characterize the property of inclusiveness using the convexity properties of the QoS function:
(204) Lemma 8 Nested convexity and inclusiveness. A 2-partitioned QoS function is inclusive if and only if it is nested convex in the vicinity of its λ-optimal partitions, for all 1≤λ≤|F|.
(205) Proof. Let ⟨N, P.sub.1, P.sub.2, q( )⟩ be a 2-partitioned QoS problem and assume that q( ) is nested convex in the vicinity of its λ-optimal partitions but not inclusive. Then, from Property 2, there must be a λ-optimal partition which does not include the (λ−1)-optimal partition, for some λ between 2 and |F|. For such λ, we have that p.sub.λ−1≠p.sub.λ−1∧p.sub.λ and, from equation (20), the following holds:
p.sub.λ∉n.sub.o(p.sub.λ−1) (22)
(206) Since p.sub.λ is a λ-optimal partition, we also have that:
q(p.sub.λ)>q(p),∀p∈n.sub.o(p.sub.λ−1) (23)
(207) But equations (22) and (23) contradict the nested convexity definition in equation (21). This proves the “if” part of the lemma. Assume now that q( ) is inclusive but not nested convex in the vicinity of some λ-optimal partition p.sub.λ and let p.sub.λ+1 be a (λ+1)-optimal partition. By definition, we have that p.sub.λ+1 is in the partition neighborhood of p.sub.λ, that is, p.sub.λ+1∈n.sub.π(p.sub.λ). From equation (21) and since q( ) is not nested convex in the vicinity of p.sub.λ, we also have that q(p.sub.λ+1)>q(p′) for all p′∈n.sub.o(p.sub.λ). This implies that p.sub.λ+1 is not in the nested neighborhood of p.sub.λ, and hence that p.sub.λ≠p.sub.λ∧p.sub.λ+1, which contradicts the assumption that q( ) is inclusive from Property 2.
(208) In Appendix B.1 we provide a topological interpretation of the property of nested convexity and the reason it leads to a polynomial time greedy algorithm to compute the λ-optimal partitions. We conclude this initial theoretical framework by summarizing in
(209) Identification of the Flow Ordering in TCP Networks
(210) To network operators, understanding which flows are responsible for the largest degradation of performance is critical in order to quickly debug network problems. Consider for example the case of a misconfigured BGP route. Such a situation causes certain flows to use routes that are less efficient, thereby negatively impacting the overall QoS of the network. According to the theory of flow ordering described herein, if we are able to correctly order the flows in a network based on their QoS impact, these large flows should reveal themselves at the top of the list. Another common example is the case of flows that violate their service level agreement (SLA), whether due to a network misconfiguration or to a violation of the contract between the user and the network provider.
(211) Traditional methods attempt to identify such flows by looking at scalar metrics such as flow transmission rate, total number of bytes transmitted, or jitter. As shown in Lemma 6, however, such approaches do not take into consideration the topological structure of the network—e.g., the graph structure of the network, the routes followed by the flows, the capacity of the links, or the inner effects of the congestion control algorithm regulating the flows' transmission rates—and are unable to capture the QoS utility function inherent to a network. We now show how the theory of flow ordering can be used in practice to order TCP flows according to their impact to the overall QoS of the network and its topological structure.
(212) The theoretical framework described herein contemplates an area of research around the problem of designing suitable partitioned QoS functions that are practical and useful to network operators. Towards this objective, in what follows we describe and study a first class of QoS functions that are applicable to real networks.
(213) Definition 15 Max-min/max-flow (m3f) QoS function. Let N=⟨T, F⟩ be a network and let F.sub.e be an arbitrary subset of flows, F.sub.e⊂F. We define the max-min/max-flow (m3f) QoS function as follows:
(214) q(F.sub.e,F\F.sub.e)=Σ.sub.∀f∈F r*.sub.f (24)
where r* is the max-min rate allocation corresponding to network N when every flow in F.sub.e is traffic shaped by δ, and δ is a non-negative real number which we refer to as the traffic shaper hyperparameter.
(215) The above QoS function corresponds to the sum of all the flow rates subject to the constraint that all flows are transmitting at the max-min rate while traffic shaping the flows in F.sub.e by an amount δ. Intuitively, this QoS function models the following properties: The QoS measures the global welfare of the network by summing up (or, more generally, aggregating) the transmission rates of all flows. To build an analogy, if the network were an economy, a flow would correspond to the economic output of a single economic agent, and the QoS value would correspond to the gross domestic product of the economy. The model assumes that flows operate under a congestion control algorithm that assigns the max-min rate to each flow. The QoS of the network is controlled by the hyperparameter δ. Those flows that are moved onto policy P.sub.1 (the flows in the set F.sub.e, see Definition 9) are forced to reduce their max-min rate by an amount of δ. This new constraint brings the network to a new optimal max-min solution, and therefore to a new value of the QoS.
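A minimal sketch of how the m3f QoS function could be evaluated, assuming a standard progressive-filling (water-filling) max-min allocator; the function names, data structures and toy topology below are ours, not the patent's:

```python
def max_min_rates(links, routes, caps=None):
    """Progressive-filling (water-filling) max-min allocation.
    links: {link: capacity}; routes: {flow: set of links traversed};
    caps: optional per-flow rate ceilings (e.g., traffic shapers)."""
    caps = caps or {}
    rates = {}
    remaining = dict(links)   # unallocated capacity per link
    active = set(routes)      # flows whose rate is not yet frozen
    while active:
        # Fair share each link could offer its remaining active flows.
        share = {l: remaining[l] / sum(1 for f in active if l in routes[f])
                 for l in remaining if any(l in routes[f] for f in active)}
        r = min(share.values())
        # Freeze flows whose shaper cap binds before the fair share...
        frozen = {f for f in active if caps.get(f, float("inf")) <= r}
        if frozen:
            for f in frozen:
                rates[f] = caps[f]
        else:
            # ...otherwise freeze the flows of the tightest (bottleneck) link.
            bottleneck = min(share, key=share.get)
            frozen = {f for f in active if bottleneck in routes[f]}
            for f in frozen:
                rates[f] = r
        for f in frozen:
            for l in routes[f]:
                remaining[l] -= rates[f]
        active -= frozen
    return rates

def m3f_qos(links, routes, Fe, delta):
    """m3f sketch (Definition 15): total throughput when the flows in
    Fe are traffic shaped delta below their unshaped max-min rate."""
    base = max_min_rates(links, routes)
    caps = {f: base[f] - delta for f in Fe}
    return sum(max_min_rates(links, routes, caps).values())

# One unit-capacity link shared by two flows: shaping f1 by 0.25
# lets f2 absorb the freed bandwidth, keeping total QoS at 1.0.
links, routes = {"l1": 1.0}, {"f1": {"l1"}, "f2": {"l1"}}
print(m3f_qos(links, routes, {"f1"}, 0.25))  # 1.0
```

The toy example illustrates the intuition behind m3f: shaping a flow does not necessarily lower the global welfare, because other flows sharing its bottleneck absorb the freed capacity.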
(216) We reason that the m3f QoS function is suitable to model networks that are operated by TCP congestion control algorithms (such as Cubic or BBR, among many others) because such algorithms aim at maximizing network utilization while ensuring fairness among all flows. Some TCP congestion control algorithms are explicitly designed to converge to max-min (e.g., XCP) while others converge to other fairness criteria such as proportional fairness. In addition, it is well known that algorithms using additive increase/multiplicative decrease (AIMD), a type of control loop used by many TCP variants, converge to use equal amounts of bandwidth from their bottleneck link, similar to max-min. Equivalent mathematical models for other bandwidth allocation schemes using the framework described herein are contemplated.
(217) A key characteristic of our model in conjunction with the theory of flow ordering is that it allows us to reason about the optimal value of F.sub.e (the set of flows that need to be traffic shaped) that maximizes the overall QoS (the welfare) of the network. In this model the hyperparameter δ can also be interpreted as the delta step in a gradient descent algorithm. In particular, in the case of the m3f QoS function, the objective is to identify the subset of flows F.sub.e⊂F such that, when they are traffic shaped by a small step size δ, the new resulting max-min rate allocation increases maximally towards the max-flow rate allocation.
(218) It should be understood that m3f is only one example of a QoS function that may be used for ordering flows according to flow sizes or flow metrics so as to improve the overall QoS of a network. In general, the size or metric of a flow represents the network resources required for processing the flow, where the larger the size/metric of a flow, the greater the network resources required for processing the flow. Network resources include one or more of link bandwidth, processing capacity, processing time, memory, etc. When the designation of a flow is changed from a relatively less resource consuming flow to a relatively more resource consuming flow, a suitable QoS function may apply or increase a penalty associated with the metric of that flow. The QoS function may also apply or increase a reward associated with the respective metrics of one or more other flows. Thus, the QoS function accounts for the overall QoS of the network, which may also be referred to as network performance or global welfare of the network.
(219) The penalty and reward, which are described above in connection with Equations (5) and (6), represent a reduction and an increase, respectively, in transmission rates associated with a flow. A reduction in the rate of a flow can decrease the instantaneous resource requirement of that flow, making a greater portion of the available network resource(s) available to one or more other flows, which can increase the overall QoS of the network. Examples of QoS functions include set functions. The m3f function can also be weighted. In particular, max-min can be weighted, max-flow can be weighted, or they both can be weighted. Some QoS functions are constrained such that one or more flows are guaranteed at least respective specified minimum transmission rates or bandwidths. Some QoS functions may be constrained such that one or more flows are limited to at most respective specified maximum transmission rates or bandwidths.
(220) This formulation provides us with a natural way to compute derivatives or gradients in the form of infinitesimal adjustments to flow rates that help improve the QoS of a network. We formalize this concept in the next definition:
(221) Definition 16 Flow gradients. Let N=⟨T,F⟩ be a network associated with a QoS function q( ) (Definition 6). We define the gradient of a flow f∈F as:
(222) ∇.sub.f=lim.sub.δ→0(q({f},F\{f})−q({Ø},F))/δ (25)
(223) More generally, we define the gradient produced by the set of flows F.sub.e⊂F as:
∇.sub.F.sub.e=lim.sub.δ→0(q(F.sub.e,F\F.sub.e)−q({Ø},F))/δ (26)
(224) Note that this notation should not be directly compared with the traditional definition of gradient used in the field of calculus. The main difference is that in our problem we deal with discrete set functions, instead of continuous and differentiable functions. Unlike the general definition of gradient, which results in a vector-valued real function, the flow gradient is a scalar-valued set function. Thus one could reason that the name derivative would be more suitable than gradient. We use instead the more general terminology of gradient to better reflect the concept of incremental path, since our objective is to identify optimization paths within the problem's feasible set that lead to better QoS. Note that since derivatives are a special case of gradients for unidimensional functions, the gradient terminology is still mathematically accurate.
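A finite-difference sketch of Definition 16 (all names and the toy network are ours): shaping the flows in F.sub.e by a small δ and dividing the resulting QoS change by δ approximates the flow gradient, since the max-min allocation is piecewise linear in δ.

```python
def flow_gradient(q_of, Fe, delta=0.25):
    """Finite-difference sketch of a flow gradient: the QoS change per
    unit of traffic shaping applied to the flows in Fe.
    q_of(Fe, delta) -> network QoS when the flows in Fe are shaped by delta."""
    return (q_of(Fe, delta) - q_of(set(), 0.0)) / delta

# Toy network: f1 crosses links l1 and l2 (capacity 1 each), f2 only
# l1, f3 only l2. Max-min gives every flow 1/2; shaping f1 by delta
# frees delta on both links, which f2 and f3 absorb.
def q_toy(Fe, delta):
    d = delta if "f1" in Fe else 0.0
    return (0.5 - d) + (0.5 + d) + (0.5 + d)

print(flow_gradient(q_toy, {"f1"}))  # 1.0
```

The positive gradient of f1 reflects that f1 consumes two bottlenecks at once: every unit of rate taken from it returns two units to the rest of the network, for a net QoS gain of one.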
(225) The following example illustrates how the proposed framework can be used to order the flows for the case of the max-min/max-flow QoS function:
(226) Example 5 Ordering flows in a network. Let N be the network example described in
(227)
(228) In
(229) Each gradient ∇.sub.F.sub.e can be computed according to the following expression:
(230) ∇.sub.F.sub.e=lim.sub.δ→0(Σ.sub.∀f.sub.i.sub.∈F r.sub.i(F.sub.e)−Σ.sub.∀f.sub.i.sub.∈F r.sub.i({Ø}))/δ (27)
where r.sub.i(X) corresponds to the max-min rate of flow f.sub.i after all the flows in X are traffic shaped by δ.
(231) Computing such an expression poses two challenges. First, it requires the calculation of a new max-min solution for the whole network after the traffic shapers are added. Since real networks can run millions of flows (or many more), this could be a costly computation, especially when such gradients need to be computed in real time as network conditions vary. Secondly, the computation of the limit in expression (27) requires choosing a small enough δ to ensure the value of the gradient is accurate, but it is not clear how this value can be computed efficiently without requiring multiple iterations, with each iteration involving another calculation of the max-min rate allocation. The algorithm we propose next overcomes both of these challenges by using the knowledge gained from the topological structure of the network. Before describing the algorithm, we first need to introduce the concept of flow gradient graph:
(232) Definition 17 Flow gradient graph. (In what follows we will take the classic definitions of saturated link and bottleneck link in the max-min sense.)
(233) Let ℒ.sub.b(f) and ℒ.sub.b̄(f) be defined as follows:
(234) ℒ.sub.b(f) is the set of saturated links traversed by flow f that are a bottleneck to flow f. ℒ.sub.b̄(f) is the set of saturated links traversed by flow f that are not a bottleneck to flow f.
(235) We define the flow gradient graph of a network N=⟨⟨S, L⟩, F⟩ as a directed graph Γ(N)=⟨V, E⟩ constructed as follows: 1. Start with V=E={Ø}. 2. For each saturated link l∈L, add l as a vertex to V. We will call these link vertices. 3. For each flow f∈F, add f as a vertex to V. We will call these flow vertices. 4. For each flow f, add a directed edge (l,f) for all l∈ℒ.sub.b(f). 5. For each flow f, add a directed edge (f,l) for all l∈ℒ.sub.b̄(f).
(236) In the example network, the bottleneck link sets are ℒ.sub.b(f.sub.1)={l.sub.1}, ℒ.sub.b(f.sub.2)={l.sub.2}, ℒ.sub.b(f.sub.3)={l.sub.1}, ℒ.sub.b(f.sub.4)={l.sub.2}, ℒ.sub.b(f.sub.5)={l.sub.3, l.sub.4} and ℒ.sub.b(f.sub.6)={l.sub.1}; the corresponding non-bottleneck saturated link sets ℒ.sub.b̄(f.sub.i) follow from the routes of the flows.
(237) The key characteristic of the flow gradient graph is that it encodes the necessary information to efficiently compute the flow gradients based on the bottleneck structure of the network. In particular, the gradient of any flow f∈F can be computed by applying a rate shift of δ to the flow vertex associated with f and measuring the resulting rate drift propagation on all other flows by traveling through the directed graph. For this process to be successful, there are three invariants that must be preserved when computing the resulting drift for the affected flows: Invariant 1: Drift conservation. The sum of rate drifts arriving to and departing from a link vertex must be equal to zero. Invariant 2: Fairness conservation. The rate drifts departing from a link vertex must be equally distributed. Invariant 3: Bottleneck consistency. The rate drifts leaving a flow vertex must be equal to the minimum of the rate drifts arriving to it.
(238) The first invariant is needed to ensure that the rate drifts maintain all links saturated—otherwise links could become either unsaturated or the total link utilization could become larger than its capacity—while the second invariant is needed to ensure max-min fairness is preserved. Invariants 1 and 2 work together to compute the resulting rate drifts in the link vertices and are graphically represented in
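The three invariants can be followed numerically. The sketch below (our own worked arithmetic, in units of δ) assumes the topology of the running example, where f6 is bottlenecked at l1 and also crosses l2 and l3, f1 and f3 are bottlenecked at l1, f2 and f4 at l2, and f5 at both l3 and l4:

```python
delta = 1.0                       # work in units of delta

drift_f6 = -delta                 # the traffic shaper applied to f6

# Link l1: Invariants 1 and 2 split the +delta balance equally
# between its other bottlenecked flows f1 and f3.
drift_f1 = drift_f3 = -drift_f6 / 2

# Link l2 receives the drifts of f3 and f6 (total -delta/2) and
# splits the balance equally over f2 and f4 (Invariants 1 and 2).
in_l2 = drift_f6 + drift_f3
drift_f2 = drift_f4 = -in_l2 / 2  # +delta/4 each

# Link l3 receives the drifts of f4 and f6 (total -3*delta/4) and
# would push +3*delta/4 to f5, but f5's second bottleneck l4 has a
# drift of 0, so Invariant 3 takes the minimum of the two.
in_l3 = drift_f6 + drift_f4
drift_f5 = min(-in_l3, 0.0)

gradient_f6 = (drift_f1 + drift_f2 + drift_f3 + drift_f4 +
               drift_f5 + drift_f6) / delta
print(gradient_f6)                # 0.5
```

Summing the six drifts and dividing by δ reproduces the gradient of ½ for flow f6 derived in the walkthrough below.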
(239) For the sake of brevity, we only include the formal description of the flow gradient algorithm as supplemental material in Appendix B.2 and we focus the next discussion around an example to describe how the algorithm works.
(240)
(241) This ensures that the sum of all rate drifts arriving and departing from l.sub.1 is equal to zero (Invariant 1) and that all the departing rate drifts are equal to each other (Invariant 2). Since link vertex l.sub.2 has a total input drift of −δ+δ/2=−δ/2 arriving from flow vertices f.sub.3 and f.sub.6, applying again Invariants 1 and 2 the output drift corresponding to flows f.sub.2 and f.sub.4 must be equal to δ/4. Now at link vertex l.sub.3 there is a total input drift of −δ+δ/4=−3δ/4 arriving from flow vertices f.sub.4 and f.sub.6, which leads to an output drift onto flow vertex f.sub.5 of 3δ/4. However, since flow f.sub.5 has a second bottleneck with a drift of 0, applying invariant 3 we have that the total drift leaving flow vertex f.sub.5 is 0. The actual gradient of flow f.sub.6 can now be easily obtained according to equation (27) by dividing the sum of all drifts by δ, (−δ+δ/2+δ/2+δ/4+δ/4+0)/δ, which results in the expected value of ½, in agreement with the value shown in
(242) As illustrated in the above example, the flow gradient graph provides two beneficial properties. First, the algorithm only requires recomputing the parts of a network that are strictly affected by the rate drift. In particular, the flows that need to be recomputed to derive the gradient of flow f correspond to the subset of flows that are descendants to flow f's parent link vertices. For instance, the gradient of flow f.sub.5 is trivially equal to −δ and it can be computed directly without recomputing the rate of any other flow since flow f.sub.5's parent vertices l.sub.3 and l.sub.4 do not have any children vertices in the flow gradient graph other than f.sub.5 itself. Thus the proposed algorithm can be seen as a streaming algorithm in that it avoids recomputing the max-min rates for all flows by only considering the necessary rate drift changes from a given initial max-min solution.
(243) This property is important to ensure the scalability of the real-time computation for networks involving a very large number of flows. Secondly, the algorithm is independent of the value of δ. This property is a result of the fact that the max-min optimization problem can be characterized by a piecewise linear function, since it can be expressed as a recurrence of linear programming problems. The flow gradient graph is able to capture this property in that the sum of all the rate drifts is proportional to δ, and this factor gets canceled out by the denominator in equation (27). Thus in practice we can fully eliminate the variable δ from the flow gradient graph and work only with the actual rate drift weights, as is shown in the rest of the graphs in
(244) Notice that the flow gradient graph is a valid framework to compute gradients provided that the value of δ is small enough so that the overall bottleneck structure of the network is preserved. Choosing a δ too large could change the structure of the flow gradient graph itself resulting in an erroneous computation. Fortunately, the case of small δ is precisely the case we are interested in, since our objective is to obtain the flow ordering by computing changes in QoS when flows are subject to infinitesimal traffic shapers according to equation (27).
(245) Measurements and Evaluation
(246) We describe below two experiments that we conducted to help empirically validate some of the principles described in our theoretical framework. In the first experiment, we aimed at quantifying and providing some empirical evidence on the property of nested convexity for small networks. In the second experiment, we demonstrate how the theory of flow ordering can be used to identify large flows in Google's B4 network.
(247) Measurements of Nested Convexity in Small Networks
(248) In this experiment we developed a network generator and measured the property of nested convexity for the max-min/max-flow QoS function. To bound the number of possible networks in the search space, we limit this experiment to networks where all links have the same capacity. While real networks in general have links with different capacities, we reason that the case of equal link capacity is relevant in controlled environments such as enterprise networks and serves also as a first order approximation for networks that do not satisfy this requirement. Our network generator makes use of the combinatorial properties of the problem to ensure the networks generated are unique, eliminating duplicate networks that are symmetric with each other.
(249) We present measurements for networks of up to 7 links and 7 flows. Measuring networks with a higher number of links and flows rapidly becomes intractable, thus for this experiment we only focus on these smaller networks. While such small networks may not be realistic in the real world, the following results may still be interesting in small networking environments or in scenarios where flows are aggregated into a reduced number of “superflows”. For instance, in MPLS networks, many IP connections can be aggregated into a reduced number of MPLS paths. Such a set of MPLS paths can be modeled as flows, in which case our objective becomes the ordering of these paths according to their overall impact on the QoS of the network.
(250) Table 5 presents the results. All measurements are done by inspecting all possible networks with the corresponding number of links and flows except for the last three cases at the lower right corner of the table which are marked with “*”. These three cases generated an intractable amount of networks and instead we randomly explored 100 million networks for each of them and projected the results as estimations. The four numbers measured in each cell are (in this order): the total number of networks, the number of non-nested convex networks, the number of singular networks and the number of singular networks that are non-nested convex.
(251) We define a singular network as a network that has at least one flow with more than one bottleneck link. We use the term singular because, in the continuum space of rate allocations, the case where a flow happens to have the exact same bottleneck rate in more than one link can be seen as a singularity. Notice that as the number of flows increases, the chances of having singular networks decrease because it becomes more difficult for two links to yield the exact same bottleneck rate to a given flow. In real world scenarios, singular networks are very unlikely or may only manifest in infrequent instantaneous situations.
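Under the classic max-min definition, singularity can be detected by simple inspection of the rate allocation. A sketch (our names; a link is treated as a bottleneck of a flow when it is saturated and the flow receives the link's largest, i.e., fair-share, rate):

```python
def singular_flows(rates, links, routes, tol=1e-9):
    """Flows with more than one bottleneck link. A network is singular
    exactly when this list is non-empty."""
    out = []
    for f, r in rates.items():
        bottlenecks = 0
        for l in routes[f]:
            load = sum(rates[g] for g in routes if l in routes[g])
            saturated = abs(load - links[l]) <= tol
            fair_share = max(rates[g] for g in routes if l in routes[g])
            if saturated and abs(r - fair_share) <= tol:
                bottlenecks += 1
        if bottlenecks > 1:
            out.append(f)
    return out

# f1 crosses both unit-capacity links and is bottlenecked at each,
# so this toy network is singular.
rates = {"f1": 0.5, "f2": 0.5, "f3": 0.5}
links = {"l1": 1.0, "l2": 1.0}
routes = {"f1": {"l1", "l2"}, "f2": {"l1"}, "f3": {"l2"}}
print(singular_flows(rates, links, routes))  # ['f1']
```

This inspection runs in time polynomial in the number of links and flows, which is the mechanism Lemma 9 relies on to verify nested convexity for small networks.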
(252) A surprising result derived from Table 5 is that for all networks explored, non-singularity is a sufficient condition for nested convexity. This can be seen by the fact that the second and fourth numbers (the number of non-nested convex networks and the number of singular networks that are non-nested) in each cell are identical. We formalize this in a lemma:
(253) Lemma 9 Nested convexity for small networks. Let N be a non-singular network with constant link capacities and assume its QoS can be characterized with the m3f function described in equation (24). If N has up to 6 links and up to 6 flows, then N is deterministically nested convex. If N has up to 7 links and up to 7 flows, then N is nested convex with very high probability.
(254) The contribution of the above lemma is that it enables a very simple mechanism to verify whether a small network is nested convex. This is because the property of singularity can be identified by simple inspection of the max-min rate in each bottleneck link of the network, a procedure well known to be solvable in polynomial time. At the time of this writing we are still running our network generator to evaluate whether all the networks with 7 links and 7 flows are deterministically nested convex. While it will not necessarily add more insight to our argument, this computation will complete in about two weeks, at which point we will be able to assert whether the above lemma also holds deterministically for all networks up to 7 links and 7 flows. We note also that the above lemma does not hold for larger networks. In Appendix B.3 we present the smallest network we found that does not satisfy the above lemma.
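The inspection procedure suggested by the lemma can be sketched in code. The following is an illustrative Python implementation (our own sketch, not the patented embodiment; all names are ours) of max-min rate allocation by progressive filling, followed by the singularity check: a network is flagged singular when some flow attains its bottleneck rate at more than one of the links it traverses.

```python
def max_min_allocation(capacities, flow_paths):
    """Progressive filling: returns (flow rates, link fair shares).

    capacities: {link: capacity}; flow_paths: {flow: [links traversed]}.
    """
    remaining = dict(capacities)              # spare capacity per link
    active_flows = set(flow_paths)            # flows not yet bottlenecked
    flows_of = {l: {f for f, p in flow_paths.items() if l in p}
                for l in capacities}
    rates, shares = {}, {}
    while active_flows:
        # Fair share of every link that still carries unresolved flows.
        candidate = {l: remaining[l] / len(fs & active_flows)
                     for l, fs in flows_of.items() if fs & active_flows}
        l_min = min(candidate, key=candidate.get)
        s = candidate[l_min]
        shares[l_min] = s
        for f in flows_of[l_min] & active_flows:
            rates[f] = s
            active_flows.discard(f)
            for l in flow_paths[f]:
                remaining[l] -= s
    return rates, shares


def is_singular(capacities, flow_paths, eps=1e-9):
    """Singular network: some flow attains its bottleneck rate
    at more than one of the links it traverses."""
    rates, shares = max_min_allocation(capacities, flow_paths)
    for f, path in flow_paths.items():
        bottlenecks = [l for l in path
                       if l in shares and abs(shares[l] - rates[f]) < eps]
        if len(bottlenecks) > 1:
            return True
    return False
```

For example, with two links of capacities 10 and 20 and three flows, no flow has two bottlenecks; equalizing both capacities makes the flow crossing both links singular.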
(255) TABLE 5. Nested convexity measurements for small networks with constant link capacity. For each number of links (rows) and number of flows (columns, 1 through 7), the four lines of each cell give, in this order: (1) total networks; (2) non-nested convex networks; (3) singular networks; (4) singular networks that are non-nested convex. Entries marked “*” are estimates obtained by randomly exploring 100 million networks.

1 link:
  (1) 1, 1, 1, 1, 1, 1, 1
  (2) 0, 0, 0, 0, 0, 0, 0
  (3) 0, 0, 0, 0, 0, 0, 0
  (4) 0, 0, 0, 0, 0, 0, 0
2 links:
  (1) 3, 6, 10, 15, 21, 28, 36
  (2) 0, 0, 0, 0, 0, 0, 0
  (3) 1, 1, 2, 2, 3, 3, 4
  (4) 0, 0, 0, 0, 0, 0, 0
3 links:
  (1) 7, 28, 84, 219, 462, 924, 1716
  (2) 0, 0, 0, 3, 6, 15, 27
  (3) 4, 13, 41, 81, 185, 295, 546
  (4) 0, 0, 0, 3, 6, 15, 27
4 links:
  (1) 15, 120, 680, 3060, 11628, 38760, 116280
  (2) 0, 0, 0, 114, 452, 1757, 5159
  (3) 11, 86, 500, 1989, 7650, 21646, 64884
  (4) 0, 0, 0, 114, 452, 1757, 5159
5 links:
  (1) 31, 496, 5456, 46376, 324632, 1947792, 10295472
  (2) 0, 0, 0, 2713, 22460, 115791, 809369
  (3) 26, 431, 4851, 38091, 274241, 1108290, 7844214
  (4) 0, 0, 0, 2715, 22460, 115791, 809369
6 links:
  (1) 63, 2016, 43680, 720720, 9657648, 109453344, 1078897248*
  (2) 0, 0, 0, 51870, 159332, 5810982, 121577707
  (3) 57, 1905, 42004, 678894, 1566495, 96492451, 963645397
  (4) 0, 0, 0, 51870, 159332, 5810982, 121577707
7 links:
  (1) 127, 8128, 349504, 11358880, 297602656, 6547258432*, 124397910208*
  (2) 0, 0, 0, 136211, 226206, 916018544, 9298946216
  (3) 120, 7953, 345521, 1623094, 1938302, 6261684990, 117290829240
  (4) 0, 0, 0, 126211, 226206, 916018544, 9298946216
(256) Table 5 shows nested convexity measurements for small networks with constant link capacity. Each cell shows (in this order) the total number of (1) networks, (2) non-nested convex networks, (3) singular networks and (4) singular networks that are non-nested convex. (Results marked with “*” were obtained by randomly exploring 100 million networks.)
(257) Identifying Top Flows in Google's B4 Network
(258) We describe several experiments to help illustrate how a network operator can use the proposed framework to identify large flows according to their impact on the QoS of the network and based on the topological structure of the network. To help bring some realism to the experiment, we chose the topology of a real-world network, in this case Google's B4 network. This topology is described in Appendix B.4. In our tests, we assumed that all links have a constant capacity. Note that under this assumption, the actual value of the link capacity is irrelevant to the analysis.
(259) Our first experiment involved measuring the flow ordering assuming there exists a flow between each pair of nodes in the network. Because there are 12 nodes, this results in a total of 144 flows. These flows can be interpreted as traffic-engineered paths used to move traffic from one site to another, and our objective was to determine how efficient each path is according to the topological structure of the network. We started by assuming each flow used a shortest path between its endpoints; later, we introduced less efficient flows and verified whether the algorithm captured their larger QoS impact. For this experiment we assumed the max-min/max-flow QoS function.
(260) Using the flow gradient graph algorithm to perform an exhaustive computation of all gradients for all possible subsets of flows up to cardinality 4, we obtain the λ-optimal partitions {f.sub.12}, {f.sub.12,f.sub.131}, {f.sub.12,f.sub.131,f.sub.132}, {f.sub.12,f.sub.131,f.sub.132,f.sub.142} for λ=1, 2, 3, 4, respectively. Thus the largest flow is f.sub.12, followed by f.sub.131, f.sub.132 and f.sub.142. For 5 and more flows the search space becomes intractable, since from equation (3) a total of 2.sup.144 executions of the flow gradient graph algorithm would be needed for the full exploration. It is in this regard that our theoretical framework can be used to identify the flow ordering in a tractable manner. Assuming the property of nested convexity (equivalently, inclusiveness, as shown in Lemma 8), we can compute the full ordering of all 144 flows in polynomial time (a total of 144*143/2 executions of the flow gradient graph algorithm) by using the greedy algorithm introduced in Definition 11. Using this approach, we obtained the flow ordering f.sub.12>f.sub.131>f.sub.132>f.sub.142>f.sub.11>f.sub.37>f.sub.47>f.sub.130>f.sub.10>f.sub.109>( . . . )>f.sub.123>f.sub.40>f.sub.136>f.sub.53>f.sub.105>f.sub.27>f.sub.79>f.sub.66>f.sub.14>f.sub.92, where we only present the top and bottom 10 flows. Appendix B.4 presents the complete ordering, including the values of the flow gradients. We observed that Google's B4 network is at least nested convex for all values of λ up to 4, since the top 4 flows returned under the assumption of nested convexity are exactly the same as those obtained from the exhaustive combinatorial search. It is possible that the network is also nested convex for values of λ above 4, but this cannot be verified due to the intractability of the problem.
By assuming nested convexity, we obtain a complete ordering of all the flows, which allows us to identify traffic shaping strategies that lead to increments in QoS without the need to perform combinatorial exploration. For instance, we obtain that the maximal λ with positive gradient is equal to 132, with a gradient value of ∇.sub.F.sub.
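The greedy search described above can be sketched as follows. This is an illustrative Python sketch (our own names, not the patent's code); `flow_gradient` is a hypothetical stand-in for the flow gradient graph algorithm, to be supplied by the caller. Under nested convexity (inclusiveness), the λ-optimal set extends the (λ−1)-optimal set by a single flow, so roughly n²/2 gradient evaluations replace the 2ⁿ evaluations of an exhaustive search.

```python
def greedy_flow_ordering(flows, flow_gradient):
    """Order flows by QoS impact, assuming nested convexity.

    flow_gradient(subset) -> gradient of removing `subset` from the
    network (a stand-in for the flow gradient graph algorithm).
    """
    ordered, remaining = [], set(flows)
    while remaining:
        # Extend the current optimal set with the single best flow.
        best = max(remaining, key=lambda f: flow_gradient(ordered + [f]))
        ordered.append(best)
        remaining.discard(best)
    return ordered
```

With a toy additive gradient (each flow contributing a fixed weight), the greedy scan simply orders flows by weight, which illustrates the mechanics without modeling a real network.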
(261) One observation from the experiment is that many of the positive flow gradients are very similar to each other and have a small value. (See Appendix B.4 for full detail on all the gradients.) For instance, the largest flow f.sub.12 has a gradient of 0.66, which from equation (7) implies that the gradient of any other flow f.sub.i≠f.sub.12 must be no larger than 0.66. This reflects the notion that all flows are relatively efficient and that no flow stands out as substantially larger than any other flow.
(262) We reason that this is true because we chose each flow to follow a shortest path between its endpoints, which makes the flows efficient from a topological standpoint; the m3f QoS function is able to capture this property. To further test this characteristic, we add 12 more flows to our network experiment, f*.sub.1, . . . , f*.sub.12, and we force each flow f*.sub.i to go through links l.sub.0, . . . , l.sub.i−1, for 1≤i≤12. Our goal is to verify that the proposed theoretical framework and the m3f function are capable of identifying flow f*.sub.12 as a very inefficient and large flow in the network, since it goes through all possible links. Indeed, by computing the flow ordering for all 156 flows (the previous 144 plus the new 12 flows), we now obtain the top 10 flow ordering f*.sub.12>f*.sub.10>f*.sub.9>f*.sub.11>f*.sub.8>f.sub.7>f.sub.5>f.sub.47>f.sub.130>f.sub.11, which correctly identifies f*.sub.12 as the top flow.
(263) Interestingly, the m3f QoS function assigns a higher size to f*.sub.10 than to f*.sub.11, even though the latter uses one more link. This is the type of subtle insight that the proposed topological approach to flow classification brings, since the ordering is based not only on the number of links used by a flow, but on a comprehensive view of the complete structure of the network, including the inner interactions among all the flows with respect to each other and the links they traverse. Finally, we also observe that the proposed framework is able to quantify the magnitude by which a flow is larger than other flows. Indeed, the gradient of the top flow f*.sub.12 is equal to 3.87, ∇.sub.f*.sub.
(264) Conclusions and Open Questions
(265) Suppose that you are a network operator and suddenly detect that the performance of your network has been drastically reduced by an unexpected set of misbehaving flows, e.g., due to a misconfigured BGP route, or because those flows are not respecting their service level agreement. You are interested in identifying these flows, but your network is running millions of flows and you cannot afford to debug each of them individually. The flow ordering techniques described herein provide a mathematical tool to tackle this problem in a tractable manner. By providing a real-time list of flows ordered according to their QoS impact, network operators can focus on debugging the top flows that are most likely to have caused the collapse of the network.
Appendix A: Additional Material to the Theory of Bottleneck Ordering
(266) A.1 Mathematical Proofs
(267) Property 4 Monotonic fair shares. Let s_l^k be the fair share of link l at iteration k of the BPG algorithm. Then s_l^k ≤ s_l^{k+1}.
(268) Proof. This result corresponds to Property 3.6 in: Jordi Ros-Giralt and Wei Kang Tsai, “A Theory of Convergence Order of Maxmin Rate Allocation and an Optimal Protocol,” Proceedings IEEE INFOCOM 2001, volume 2, pp. 717-726 (2001), which is incorporated herein by reference in its entirety.
(269) A.1.1 Property 1
(270) Monotonic fair shares along a precedence path. Let s_{l_1}, . . . , s_{l_n} be the fair shares of a sequence of links l_1, . . . , l_n such that l_i is a direct precedent of l_{i+1}, for 1 ≤ i ≤ n−1. Then s_{l_1} ≤ s_{l_2} ≤ . . . ≤ s_{l_n}.
(271) Proof. By induction, it is sufficient to prove that the property holds for n=2. Let k_1 and k_2 be the values of k at the iterations when links l_1 and l_2 are removed from the set of unresolved links (BPG algorithm/line 13). Since l_1 is a direct precedent of l_2, Lemma 1 implies k_1 < k_2. A link is removed from the set of unresolved links when its fair share is minimal among the unresolved links it shares a flow with, so s_{l_1} = s_{l_1}^{k_1} ≤ s_{l_2}^{k_1}, and by Property 4 (monotonic fair shares), s_{l_2}^{k_1} ≤ s_{l_2}^{k_2} = s_{l_2}. Hence s_{l_1} ≤ s_{l_2}.
(272) A.1.2 Lemma 1
(273) Bottleneck convergence order. A link l is removed from the set of unresolved links L̂^k at iteration k of the BPG algorithm, L̂^{k+1} = L̂^k \ {l}, if and only if all of its direct and indirect precedent links have already been removed from this set at iteration k−1.
(274) Proof. We start with the sufficient condition. Assume that at an arbitrary iteration k all direct and indirect precedents of a link l_1 have converged. We will also assume that link l_1 does not converge at iteration k and arrive at a contradiction. There must then exist another link l_r sharing at least one unresolved flow with l_1, F_{l_1} ∩ F_{l_r} ≠ {∅}, such that at iteration k it has not converged yet, l_r ∈ L̂^k, and s_{l_r}^k < s_{l_1}^k. But such a link l_r is a direct precedent of link l_1 that has not converged at iteration k, contradicting the assumption. Hence link l_1 converges at iteration k, and since all of its direct precedents have converged by iteration k, its fair share can no longer change, s_{l_1} = s_{l_1}^k.
(275) To address the necessary condition, suppose that a link l_1 converges at iteration k and that another link l_2 that has not converged yet at the same iteration is either a direct or indirect precedent of link l_1. By construction of the BPG algorithm, however, this is not possible: once link l_1 converges at iteration k, it is removed from the set of unresolved links L̂^k, and thus no further direct or indirect precedents can be added to it.
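The convergence order described by Lemma 1 can be illustrated with a short Python sketch (our own illustrative code with hypothetical names, not the patented embodiment). It assumes the resolution rule used in the discussion above: a link resolves once its fair share is no larger than that of every unresolved link with which it shares an active flow; the iteration at which each link resolves is its level in the bottleneck structure.

```python
def convergence_levels(capacities, flow_paths):
    """Return ({link: resolution iteration}, {flow: rate}).

    Assumed rule: a link resolves once its fair share is minimal
    among the unresolved links it shares an active flow with.
    """
    links_of = {l: {f for f, p in flow_paths.items() if l in p}
                for l in capacities}
    remaining = dict(capacities)   # spare capacity per link
    active = set(flow_paths)       # flows not yet converged
    unresolved = set(capacities)
    levels, rates, level = {}, {}, 0
    while True:
        unresolved = {l for l in unresolved if links_of[l] & active}
        if not unresolved:
            break
        level += 1
        share = {l: remaining[l] / len(links_of[l] & active)
                 for l in unresolved}
        # A link resolves when its share is minimal among the
        # unresolved links it shares an active flow with.
        resolved = set()
        for l in unresolved:
            nbrs = {l2 for f in links_of[l] & active
                    for l2 in flow_paths[f]} & unresolved
            if all(share[l] <= share[l2] for l2 in nbrs):
                resolved.add(l)
        done_flows = set()
        for l in resolved:
            levels[l] = level
            for f in links_of[l] & active:
                rates[f] = share[l]
                done_flows.add(f)
        for f in done_flows:
            for l in flow_paths[f]:
                remaining[l] -= rates[f]
        active -= done_flows
        unresolved -= resolved
    return levels, rates
```

In a two-link example where a flow spans both links, the lower-share link (the direct precedent) resolves at iteration 1 and the other link only at iteration 2, as Lemma 1 predicts.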
(276) A.1.3 Lemma 2
(277) Bottleneck influence. A bottleneck l can influence the performance of another bottleneck l′, i.e., ∂s_{l′}/∂c_l ≠ 0, if and only if there exists a set of bottlenecks {l_1, l_2, . . . , l_n} such that l_i is a direct precedent of l_{i+1}, for 1 ≤ i ≤ n−1, l_1 = l and l_n = l′.
(278) Proof. We start with the sufficient condition. By induction, it is enough to show the case n=2. Assume that link l_1 is a direct precedent of link l_2; then there must exist a flow f bottlenecked at link l_1 that traverses both l_1 and l_2, f ∈ F_{l_1} ∩ F_{l_2}. Note that any infinitesimal change in the capacity of link l_1 changes the rate r_f of flow f, since this flow is bottlenecked at that link (BPG algorithm/lines 9 and 12). Note also that such a variation propagates to link l_2, inducing a change in its fair share (line 9), which implies ∂s_{l_2}/∂c_{l_1} ≠ 0.
(279) To address the necessary condition, note first that the performance of a link is uniquely determined by the fair share equation (line 9):

s_l^k = (c_l − Σ_{f ∈ F_l ∩ F̂^k} r_f) / |F_l \ F̂^k|, ∀ l ∈ L̂^k   (28)

where F̂^k denotes the set of flows already resolved at iteration k. This equation depends on internal link parameters (such as its capacity c_l and the number of flows going through it) as well as external parameters, namely the rates of the flows bottlenecked at some other link, {r_f | ∀f ∈ F̂^k ∩ F_l}. Thus s_l can only be externally influenced by changing these rates. It is easy to see that these rates correspond to flows that are bottlenecked at links that are direct precedents of link l. Thus the fair share of a link can only change if the rate of one or more of its flows bottlenecked at one of its direct precedent links changes. Since the rates of these flows are equal to the fair shares of those direct precedent links, inducing a change in the fair share s_l requires changing the capacity of one or more of its direct precedent links.
A.1.4 Lemma 3
(280) Depth of the bottleneck structure. The diameter of the BPG graph—which we will also refer as the depth or the number of levels of the bottleneck structure—is equal to the last value of the iterator k in the BPG algorithm.
(281) Proof. Since the diameter of the BPG graph corresponds to the maximum length over all of its paths, and since at every iteration the algorithm adds one more vertex to any longest path in the graph, the value of k at the end of the algorithm must equal the length of a longest path.
(282) A.1.5 Lemma 4
(283) Minimum convergence time of a distributed congestion control algorithm. Let τ(l_i, l_j) be a weight assigned to each edge (l_i, l_j) of the BPG graph as follows: (1) if l_i is a direct precedent of l_j, then τ(l_i, l_j) is the time that it takes for a message to be sent from l_i to l_j; (2) if l_i is an indirect precedent of l_j, then τ(l_i, l_j) = max{τ(l_i, l_r) + τ(l_r, l_j) | for any relay link l_r between l_i and l_j}. Let l_1 − l_2 − . . . − l_n be a longest path terminating at link l_n according to these weights. Then the minimum convergence time for link l_n is Σ_{1≤i≤n−1} τ(l_i, l_{i+1}).
(284) Proof. This lemma is a consequence of Lemma 1. From Lemma 1, we know that a link l cannot resolve its fair share until all of its direct and indirect precedents have resolved their own fair shares. Furthermore, if all the direct and indirect precedents of a link l have been resolved at iteration k, then link l can converge immediately after, at iteration k+1. To derive the current lemma, we only need to account for the time that it takes to communicate a message from the direct and indirect precedents of link l to link l itself. Because direct precedent links share a flow, communication can propagate directly. In the case of indirect precedence, communication has to go through a relay link; in particular, we must select the relay link that imposes the longest distance, as that is the longest time it can take to propagate state between a link and its indirect precedent. The τ( ) function introduced in the lemma captures these communication times for the two possible relations between bottleneck links.
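The longest-path computation behind Lemma 4 can be sketched as follows; this is a minimal illustrative Python implementation (our own names), assuming the BPG graph is acyclic and the weights τ are given as a mapping from edges to delays.

```python
from functools import lru_cache


def min_convergence_time(tau, target):
    """Lemma 4: minimum convergence time of link `target` is the
    weight of a longest path terminating at it in the (acyclic)
    BPG graph; tau[(li, lj)] is the propagation delay of edge
    (li, lj)."""
    preds = {}
    for (a, b), w in tau.items():
        preds.setdefault(b, []).append((a, w))

    @lru_cache(maxsize=None)
    def longest(l):
        # Longest-path recursion over the DAG of precedents;
        # a link with no precedents can converge at time 0.
        return max((longest(a) + w for a, w in preds.get(l, [])),
                   default=0.0)

    return longest(target)
```

For example, with edges l1→l2 (delay 2), l2→l3 (delay 3) and l1→l3 (delay 1), the longest path terminating at l3 is l1−l2−l3, giving a minimum convergence time of 5.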
(285) A.1.6 Lemma 5
(286) Flow influence. A flow f can influence the performance of another flow f′, i.e., ∂r_{f′}/∂r_f ≠ 0, if and only if there exists a set of bottlenecks {l_1, l_2, . . . , l_n} such that (1) l_i is a direct precedent of l_{i+1}, for 1 ≤ i ≤ n−1, (2) flow f′ is bottlenecked at link l_n and (3) flow f goes through l_1.
(287) Proof. Let f and f′ be two arbitrary flows and assume that flow f′ is bottlenecked at link l′. From lines 9 and 12 of the BPG algorithm, the rate of a flow corresponds to the fair share of its bottleneck. Thus flow f can only influence flow f′ if it can affect link l′'s fair share, s_{l′}. Assume that we apply an infinitesimal change to flow f's rate r_f. This could be achieved by applying a traffic shaper that slightly decreases its rate (negative infinitesimal) or by imposing a minimum rate constraint that assigns a rate slightly higher than the fair share of its bottleneck link s_l (positive infinitesimal). For any link l traversed by flow f, such a small perturbation leads to an infinitesimal change in its fair share s_l. From Lemma 2, we know that a link l can influence another link l′ if and only if there exists a directed path from l to l′ in the BPG graph. Thus both the necessary and sufficient conditions of this lemma must hold if we set l = l_1 and l′ = l_n.
(288) A.2 Results of the Bottleneck Structure with TCP Reno
(289) We demonstrated above that BBR generally performs better in capturing a given network's bottleneck structure than AIMD-based protocols such as Cubic and Reno. While the results of BBR and Cubic are presented in
(290) A.3 Results of the Low-Hitter Flow Experiment for TCP Cubic and Reno
(291) The plots in
(292) A.4 Jain's Fairness Index
(293) Jain's index is a metric that rates the fairness of a set of values x.sub.1, x.sub.2, . . . , x.sub.n according to the following equation:
(294) J(x_1, x_2, . . . , x_n) = (Σ_{1≤i≤n} x_i)² / (n Σ_{1≤i≤n} x_i²)
The index value ranges from 1/n (worst case) to 1 (best case).
(295) For multi-link networks the values x_i must be normalized to an optimal fairness allocation. We normalize x_i as the ratio f_i/O_i, where f_i is the rate achieved by flow i in the experiments and O_i is its expected max-min fair throughput.
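For illustration, Jain's index and its normalized multi-link variant can be computed with a few lines of Python (a minimal sketch, not from the original text):

```python
def jains_index(xs):
    """Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2),
    ranging from 1/n (worst case) to 1 (best case)."""
    n = len(xs)
    return sum(xs) ** 2 / (n * sum(x * x for x in xs))


def normalized_jains_index(achieved, optimal):
    """Normalize each rate by its expected max-min fair throughput
    before applying the index, as done for multi-link networks."""
    return jains_index([f / o for f, o in zip(achieved, optimal)])
```

A perfectly fair allocation yields 1, while a single flow monopolizing the network among n flows yields 1/n.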
(296) A.5 Notes on Using the Framework in Production Networks
(297) We now present practical considerations that would allow the integration of the proposed theory into modern data networks. The proposed theoretical framework requires three inputs. Network topology: the set of links, L, and their capacities, {c_l | ∀l ∈ L}. Real-time topology information can normally be obtained from software-defined networking (SDN) components, or it can be inferred using techniques from the field of network tomography. Flow information: the set of flows, F. This input requires knowing the IP tuple (source and destination IP addresses and port numbers), which can be obtained using monitoring protocols (e.g., NetFlow or sFlow) or security monitoring tools. Routing information: the set of links traversed by each flow or, equivalently, the set of flows traversing each link, {F_l | ∀l ∈ L}. This information can be inferred from several vantage points, including switch routing tables and SDN controllers.
(298) Of the multiple avenues of potential operational use cases, we list the following. Offline capacity planning: having access to BPG graphs, network operators can identify high-impact bottlenecks and make trade-offs regarding link upgrades and capacity planning. For instance, bottlenecks that are consistently closer to the root of the BPG graph may be more important to upgrade than those located near the leaves, because their influence is greater (Lemma 2). Real-time traffic engineering: as an online tool to help optimize traffic engineering, the proposed framework can be employed to compute the bottleneck structure and identify high-impact bottlenecks and flows in real time.
Appendix B: Additional Material to the Theory of Flow Ordering
(299) B.1 Topological Interpretation of the Property of Nested Convexity
(300)
(301) B.2 Flow Gradient Computation Algorithm
(302) Algorithm 3 provides the pseudocode of the algorithm introduced above in connection with Definition 16.
(303) Algorithm 3: ComputeFlowGradient(L, F, f*)

 1  % Local variables
 2  L_b(f) := Flow f's bottleneck links;
 3  L_nb(f) := Flow f's non-bottleneck links;
 4  F_b(l) := Link l's bottleneck flows;
 5  F_nb(l) := Link l's non-bottleneck flows;
 6  Δ(f) := Flow f's rate drift;
 7  β(l) := Link l's bandwidth drift;
 8  H := List of links to be processed;
 9  % Initialization code
10  Set Δ(f) = 0 for all f ∈ F;
11  Set β(l) = 0 for all l ∈ L;
12  Set H = {∅};
13  % Root initialization
14  Δ(f*) = −1;
15  H = H ∪ L_nb(f*);
16  for l ∈ L_b(f*) do
17      β(l) = β(l) − 1;
18      for f ∈ F_b(l) and f ≠ f* do
19          Δ(f) = β(l)/(|F_b(l)| − 1);
20          H = H ∪ L_nb(f);
21      end for
22  end for
23  % Main rate propagation loop
24  while H ≠ {∅} do
25      l = EXTRACT_HIGHEST_LINK(H);
26      β(l) = β(l) − Σ_{f ∈ F_nb(l)} Δ(f);
27      for f ∈ F_b(l) do
28          Δ(f) = min{Δ(f), β(l)/|F_b(l)|};
29          H = H ∪ L_nb(f);
30      end for
31  end while
32  return Σ_{f ∈ F} Δ(f);
B.3 Smallest Non-Nested Convex Networks
(304)
(305)
(306) B.4 Google's B4 Network Topology and Flow Gradients
(307)
(308)
(309) It is clear that there are many ways to configure the device and/or system components, interfaces, communication links, and methods described herein. The disclosed methods, devices, and systems can be deployed on convenient processor platforms, including network servers, personal and portable computers, and/or other processing platforms. Other platforms can be contemplated as processing capabilities improve, including personal digital assistants, computerized watches, cellular phones and/or other portable devices. The disclosed methods and systems can be integrated with known network management systems and methods. The disclosed methods and systems can operate as an SNMP agent, and can be configured with the IP address of a remote machine running a conformant management platform. Therefore, the scope of the disclosed methods and systems are not limited by the examples given herein, but can include the full scope of the claims and their legal equivalents.
(310) The methods, devices, and systems described herein are not limited to a particular hardware or software configuration, and may find applicability in many computing or processing environments. The methods, devices, and systems can be implemented in hardware or software, or a combination of hardware and software. The methods, devices, and systems can be implemented in one or more computer programs, where a computer program can be understood to include one or more processor executable instructions. The computer program(s) can execute on one or more programmable processing elements or machines, and can be stored on one or more storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), one or more input devices, and/or one or more output devices. The processing elements/machines thus can access one or more input devices to obtain input data, and can access one or more output devices to communicate output data. The input and/or output devices can include one or more of the following: Random Access Memory (RAM), Redundant Array of Independent Disks (RAID), floppy drive, CD, DVD, magnetic disk, internal hard drive, external hard drive, memory stick, or other storage device capable of being accessed by a processing element as provided herein, where such aforementioned examples are not exhaustive, and are for illustration and not limitation.
(311) The computer program(s) can be implemented using one or more high level procedural or object-oriented programming languages to communicate with a computer system; however, the program(s) can be implemented in assembly or machine language, if desired. The language can be compiled or interpreted. Sets and subsets, in general, include one or more members.
(312) As provided herein, the processor(s) and/or processing elements can thus be embedded in one or more devices that can be operated independently or together in a networked environment, where the network can include, for example, a Local Area Network (LAN), wide area network (WAN), and/or can include an intranet and/or the Internet and/or another network. The network(s) can be wired or wireless or a combination thereof and can use one or more communication protocols to facilitate communication between the different processors/processing elements. The processors can be configured for distributed processing and can utilize, in some embodiments, a client-server model as needed. Accordingly, the methods, devices, and systems can utilize multiple processors and/or processor devices, and the processor/processing element instructions can be divided amongst such single or multiple processor/devices/processing elements.
(313) The device(s) or computer systems that integrate with the processor(s)/processing element(s) can include, for example, a personal computer(s), workstation (e.g., Dell, HP), personal digital assistant (PDA), handheld device such as cellular telephone, laptop, handheld, or another device capable of being integrated with a processor(s) that can operate as provided herein. Accordingly, the devices provided herein are not exhaustive and are provided for illustration and not limitation.
(314) References to “a processor”, or “a processing element,” “the processor,” and “the processing element” can be understood to include one or more microprocessors that can communicate in a stand-alone and/or a distributed environment(s), and can thus be configured to communicate via wired or wireless communication with other processors, where such one or more processors can be configured to operate on one or more processor-controlled devices that can be similar or different devices. Use of such “microprocessor,” “processor,” or “processing element” terminology can thus also be understood to include a central processing unit, an arithmetic logic unit, an application-specific integrated circuit (ASIC), and/or a task engine, with such examples provided for illustration and not limitation.
(315) Furthermore, references to memory, unless otherwise specified, can include one or more processor-readable and accessible memory elements and/or components that can be internal to the processor-controlled device, external to the processor-controlled device, and/or can be accessed via a wired or wireless network using a variety of communication protocols, and unless otherwise specified, can be arranged to include a combination of external and internal memory devices, where such memory can be contiguous and/or partitioned based on the application. For example, the memory can be a flash drive, a computer disc, CD/DVD, distributed memory, etc. References to structures include links, queues, graphs, trees, and such structures are provided for illustration and not limitation. References herein to instructions or executable instructions, in accordance with the above, can be understood to include programmable hardware.
(316) Although the methods and systems have been described relative to specific embodiments thereof, they are not so limited. As such, many modifications and variations may become apparent in light of the above teachings. Many additional changes in the details, materials, and arrangement of parts, herein described and illustrated, can be made by those skilled in the art. Accordingly, it will be understood that the methods, devices, and systems provided herein are not to be limited to the embodiments disclosed herein, can include practices otherwise than specifically described, and are to be interpreted as broadly as allowed under the law.