Systems and methods for tensor scheduling
11372629 · 2022-06-28
Assignee
Inventors
CPC classification
International classification
Abstract
A technique for efficient scheduling of operations in a program for parallelized execution thereof using a multi-processor runtime environment having two or more processors includes constraining the type or number of loop optimization transforms that may be explored such that memory and processing capacity available for the scheduling task are not exceeded, while facilitating a tradeoff between memory locality, parallelization, and/or data communication between memory modules of the multi-processor runtime environment.
Claims
1. A method for transforming a program having a loop nest for parallelized execution using at least two processors, the method comprising: determining a first statement within a first loop nest within a specified program, the first statement accessing one or more tensors from memory; for scheduling operations defined by the first statement at a first dimension of the first loop nest, selecting a first statement grouping that comprises the first statement; and specifying constraints that limit a scheduler to applying either a loop fusion transform or a loop permutation transform to the first statement grouping, wherein specifying the constraints comprises: determining a first loop index maximizing an objective function; designating as permutable in the first loop nest a first loop corresponding to the first loop index; and designating all other loops in the first loop nest as nonpermutable.
2. The method of claim 1, wherein: the specified program comprises a second statement that accesses at least one of the one or more tensors from the memory, creating a read-after-write, write-after-read, or write-after-write dependence between the first and the second statements.
3. The method of claim 2, wherein the first statement grouping comprises the second statement.
4. The method of claim 1, wherein the first statement grouping is a strongly connected component (SCC).
5. The method of claim 1, further comprising: selecting a second statement grouping that: (i) comprises a second statement that is included within a second loop nest and that accesses from the memory at least one of the one or more tensors, and (ii) has a dependence relation with the first statement grouping.
6. The method of claim 5, wherein specifying the constraints further comprises: limiting the scheduler to applying the loop fusion transform only if a dependence distance between the first statement grouping and the second statement grouping is zero.
7. The method of claim 5, wherein specifying the constraints further comprises: limiting the scheduler to applying the loop fusion transform only if a maximum dependence distance between the first statement grouping and the second statement grouping is less than or equal to a specified threshold distance.
8. The method of claim 5, wherein specifying the constraints comprises: determining a second loop index maximizing the objective function; designating as permutable in the second loop nest a second loop corresponding to the second loop index; and designating all other loops in the second loop nest as nonpermutable.
9. The method of claim 8, wherein the objective function assigns: a respective weight to each loop in the first loop nest; a respective weight to each loop in the second loop nest; and a penalty proportional to a maximum dependence distance between the first statement grouping and the second statement grouping.
10. The method of claim 9, wherein: the weight assigned to a candidate loop in the first loop nest is a first function of a dimension of the candidate loop within the first loop nest; or the weight assigned to a candidate loop in the second loop nest is a second function of a dimension of the candidate loop within the second loop nest; and the first function or the second function is a linear, non-linear, or procedural function.
11. The method of claim 10, wherein the non-linear function is a Gaussian or mixed Gaussian function.
12. The method of claim 9, wherein the respective weight is assigned to a candidate loop within the first loop nest or the second loop nest according to a cost function representing one or more of: locality of memory access resulting from permutation of the candidate loop, permutability of the candidate loop with other loops, or parallelism of operations of the candidate loop.
13. The method of claim 12, wherein the cost function is non-linear or procedural.
14. The method of claim 9, wherein specifying the constraints comprises designating the first loop and the second loop as fusable if the penalty is less than a sum of the respective weights assigned to the first and second loop indices.
15. The method of claim 9, wherein the respective weight for each loop in the first loop nest and the respective weight for each loop in the second loop nest are determined using an artificial neural network (ANN).
16. The method of claim 1, wherein the objective function is specified or is implemented using an artificial neural network (ANN).
17. The method of claim 1, further comprising: repeating the specifying constraints step for scheduling the operations defined by the first statement at a second dimension of the first loop nest, wherein the second dimension is inner relative to the first dimension.
18. The method of claim 1, wherein the scheduler is a polyhedral scheduler.
19. A method for transforming a program having a loop nest for parallelized execution using at least two processors, the method comprising: evaluating, using a non-linear or procedural evaluation function, respective computational improvements in scheduling operations defined by a first loop nest within a specified program by transforming by a scheduler the first loop nest according to a plurality of candidate transforms, wherein evaluation of a particular candidate transform by the scheduler requires a particular memory space and a particular processing capacity; and limiting a total number of the plurality of candidate transforms such that a total memory space collectively required by the scheduler in evaluating the total number of the plurality of candidate transforms does not exceed an allocated memory space or a total processing capacity collectively required by the scheduler in evaluating the total number of the plurality of candidate transforms does not exceed an allocated processing capacity.
20. The method of claim 19, wherein the processing capacity comprises processor cycles or processor time.
21. The method of claim 19, wherein: the non-linear or procedural evaluation function represents one or more of: locality of memory access resulting from a permutation of a loop in the first loop nest, permutability of one or more loops in the first loop nest, or parallelism of operations associated with the loop in the first loop nest.
22. The method of claim 19, wherein a first statement within the first loop nest accesses from memory one or more tensors.
23. The method of claim 22, wherein: a second statement within a second loop nest accesses from the memory at least one of the one or more tensors, creating a dependence between the first loop nest and the second loop nest; and each candidate transform in the plurality of candidate transforms maintains the dependence.
24. The method of claim 19, wherein the scheduler is a polyhedral scheduler or an artificial neural network (ANN)-based scheduler.
25. The method of claim 19, wherein the non-linear evaluation function is a Gaussian function or a mixed Gaussian function.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
(2) The present disclosure will become more apparent in view of the attached drawings and accompanying detailed description. The embodiments depicted therein are provided by way of example, not by way of limitation, wherein like reference numerals/labels generally refer to the same or similar elements. In different drawings, the same or similar elements may be referenced using different reference numerals/labels, however. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating aspects of the invention. In the drawings:
DETAILED DESCRIPTION
I. Motivation
(15) It is well known that most of the computation time in many compute-intensive programs is spent in a small section of the program that contains nested loops. These loops can often be parallelized and optimized for cache locality, vectorization, etc. Manual loop optimization and parallelization, however, can be tedious and error-prone. The polyhedral model offers a compact and precise way of representing operations performed in such loops, as well as transformations thereof, data spaces, and dependences as sets of points in a vector space. Such points can be defined by affine constraints, i.e., polyhedra. These constraints form a static representation and encapsulate a conservative approximation of the program semantics.
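By way of a non-limiting illustration (the loop nest and function names below are hypothetical), the iteration domain of a small triangular loop nest is exactly the set of integer points satisfying a handful of affine constraints; the sketch simply enumerates those points for a concrete size:

```python
# Illustrative sketch: the iteration domain of the nest
#   for i in 0..N-1:
#     for j in i..N-1:
#       S(i, j)
# is the set of integer points satisfying affine constraints.

def in_domain(i, j, N):
    # Affine constraints (the faces of the polyhedron): 0 <= i <= j < N
    return 0 <= i <= j < N

def domain_points(N):
    # A polyhedral compiler reasons about the constraints symbolically;
    # here we enumerate the integer points for a concrete N.
    return [(i, j) for i in range(N) for j in range(N) if in_domain(i, j, N)]
```

For N = 3 this yields the six points of the triangular domain, without ever listing the constraints more than once.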
(16) A vast body of literature relies on its mathematical representation to perform a broad range of program optimizations, which subsume loop transformations as known traditionally. Polyhedral optimization is usually performed as a series of passes, with specific roles such as exposing parallelism and data locality, forming tasks, distributing such tasks across processor grids, generating communications and synchronizations, and more. Once optimized, programs are rendered into an imperative (but optimized and usually parallel) form, in which they can be optimized further through traditional compiler flows.
(17) However, the tractability of polyhedral compilation remains limited because some of its passes rely on analyzing high-dimensional polyhedra with algorithms (such as integer linear programming (ILP)) that do not scale well with the number of dimensions of the problem. Polyhedral scheduling, whose typical role is to expose loop parallelism and tilability while optimizing for data locality, is one such pass. In fact, in our experience, most of the polyhedral compilation time tends to be spent in the polyhedral scheduling pass, and existing literature also acknowledges this issue.
(18) This lack of tractability is due to the way schedules are being computed, which is roughly as follows. The ordering of data accesses in the original program (i.e., dependences) is expressed in the product space of iteration spaces of all statements considered for scheduling. The space of semantically correct transformations is expressed by introducing new variables (so-called Farkas multipliers) that relate the constraints of the former space to coefficients of the affine transformation that the algorithm is looking for. The result is a polyhedral set of legal transformations, within which an optimal solution is searched using linear programming. Since we consider the Cartesian product of the per-statement constraints, the dimensionality of the polyhedron expressing the set of legal transformations grows as the product of the number of statements and the number of iteration space constraints per statement. The optimality of the solution is defined by the (linear) objective function chosen by the particular polyhedral scheduler under consideration. Modern objective functions typically express a trade-off between often-conflicting factors such as parallelism and data locality. Formulations in use in current polyhedral mappers usually look for optimal integer schedules (using an ILP), which does not scale well with the size of the problem.
(19) Here, we acknowledge that a significant portion of loop codes already have enough loop parallelism to fulfill the needs of the targeted architecture. We show that for these codes, more tractable polyhedral scheduling algorithms are applicable, which only deal with obtaining locality while preserving enough parallelism for these programs. Because most of the adequate codes involve tensors (data structures having three or more (e.g., 4, 8, 12, 20, etc.) dimensions), we call these scheduling techniques “tensor schedulers.”
(20) The rest of the discussion is organized as follows: After describing polyhedral scheduling in section II, we present the general idea that the tensor schedulers are based upon, in section III. Then, we detail three tensor scheduler versions, with increasing sophistication, in section IV. We show in particular how the problem can be reduced to very small linear programming (LP) problems, and that non-linear cost functions can be introduced, to more accurately model cache behavior and model tradeoffs between parallelism, locality, and permutability at different levels. We evaluate scalability vs. schedule quality gains in section V, and offer conclusions in section VI.
II. Background
(21) The polyhedral abstraction represents program semantics in a concise, mathematical way through sets of affine constraints that make up the faces of a polyhedron. Changing the shape of this polyhedron affects the execution order of the code, which impacts performance. The mathematical representation allows for the use of well-established linear programming techniques to find an ideal shape. Note that the polyhedral model handles not only polyhedra, but also unions thereof, which we call domains from this point on.
(22) Given an input code, a generalized dependence graph (GDG) is constructed. This is a directed multigraph in which nodes represent statements and edges represent dependences. Statements are represented through their iteration domains and array access functions. An iteration domain is formed from the bounds of the loops surrounding a statement and contains all of the instances in which the statement is executed.
(23) A point in the iteration domain, i_S1, represents a timestep, or instance, in which the statement is executed. Specifically, it defines specific values for the surrounding loop iterators. Array access functions show the array accesses of a statement as a matrix in which the rows map to a dimension of the array and the columns correspond to the loop iterators, loop-invariant expressions (called parameters) and constants.
(24) A dependence occurs when two statements access some memory location and at least one of those accesses is a write. A legal schedule is one that respects ordering constraints defined by dependences. Dependences are represented through a polyhedral domain. A dependence polyhedron D_{S→T} represents all of the iterations in which statement T depends on statement S and is defined by the constraints:

(25) D_{S→T} = { (s, t) : s ∈ D_S, t ∈ D_T, f_S(s) = f_T(t), s ≺ t }

where f_S and f_T define the aliasing memory accesses of statements S and T, and D_S and D_T are the iteration domains of S and T. The relation s ≺ t is the lexicographic precedence constraint, which denotes that iteration vector s (also denoted i_S) comes before the iteration vector t (also denoted i_T) and is defined as:

s ≺ t ⟺ ∃i, (s_1, s_2, . . . , s_i) = (t_1, t_2, . . . , t_i) ∧ s_{i+1} < t_{i+1}
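The lexicographic precedence relation can be sketched directly (an illustrative, non-limiting example; the function name is ours):

```python
def lex_precedes(s, t):
    """True iff there exists i with s[:i] == t[:i] and s[i] < t[i]."""
    for a, b in zip(s, t):
        if a != b:
            return a < b   # first differing coordinate decides the order
    return False           # identical vectors do not strictly precede each other

# (1, 5) precedes (2, 0): the first differing coordinate (1 vs. 2) is smaller,
# regardless of the later coordinates.
```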
(26) The goal of a polyhedral scheduler is to determine when statements are executed. The scheduling process is done by finding a one-dimensional scheduling function (also often called a hyperplane), θ_S, for each dimension of each statement S:

θ_S(i_S) = α_S · i_S + γ_S · n + γ_S^0 (1)

where α_S and γ_S are vectors of constants and n is a vector of program parameters. Each function maps to a loop in the transformed code and, together, these functions form the multi-dimensional affine schedule function:
(27) Θ_S(i_S) = [θ_S^1(i_S), θ_S^2(i_S), . . . , θ_S^m(i_S)]^T (2)

(28) where S is an m-dimensional statement. The α_S and γ_S vectors of the θ functions in Equation (2) are often represented with their own matrices, A_S and Γ_S, such that:

(29) Θ_S(i_S) = A_S · i_S + Γ_S · n + γ_S^0 (3)

where the rows of A_S are the α_S vectors, the rows of Γ_S are the γ_S vectors, and γ_S^0 collects the constant terms.
(30) Dependences provide a convenient encoding of program semantics. A transformation is legal if and only if the dependences are preserved, i.e., Θ_S(s) ≺ Θ_T(t) for every dependence instance (s, t). The exact representation of dependence polyhedra expands the transformation options available compared to traditional compiler optimizations.
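A non-limiting sketch of this legality test on enumerated dependence instances (the matrices and dependence set below are hypothetical; a real scheduler works on the symbolic polyhedra):

```python
# Legality check: a transformation is legal iff Theta_S(s) lexicographically
# precedes Theta_T(t) for every dependence instance (s, t).

def apply_schedule(A, i_vec):
    """Multi-dimensional schedule Theta(i) = A·i (constant parts omitted)."""
    return tuple(sum(a * x for a, x in zip(row, i_vec)) for row in A)

def legal(A_S, A_T, dependences):
    # Python tuples compare lexicographically, matching the precedence relation.
    return all(apply_schedule(A_S, s) < apply_schedule(A_T, t)
               for s, t in dependences)

identity    = [[1, 0], [0, 1]]    # original loop order
interchange = [[0, 1], [1, 0]]    # loop permutation
reversal    = [[-1, 0], [0, 1]]   # outer-loop reversal
deps = [((0, 1), (1, 1))]         # dependence carried by the outer loop
```

Here both the identity and the interchange preserve the dependence, while reversing the outer loop would violate it.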
III. General Idea
(31) An important idea of our work is that there exists a class of programs in which many of the transformations that are typically considered in the scheduling process are not needed. For this class of programs, which we refer to as tensor codes, transformations such as skewing, scaling, and loop reversal will not necessarily yield better code. These are codes that do not have many loop-carried dependences in the input code, so identifying parallel loops is generally straightforward. Performance in these codes comes from identifying the right combination of fusion/fission and loop permutation. Limiting to this subset of transformations opens opportunities for both improving compiler scalability and improving performance cost models.
(32) Let us define a permutation hyperplane, θ_{p_i}, as a special hyperplane that replaces α_S in Equation (1) with the i-th standard basis vector e_i:

(33) θ_{p_i}(i_S) = e_i · i_S (4)
The tensor scheduler constructs a schedule exclusively from permutation hyperplanes. In fact, there is a bijective mapping from loops in the input code to permutation hyperplanes.
A. Scalability Improvement
(34) We use schedulers that build one scheduling dimension at a time as a starting point. Schedulers that build a multi-dimensional schedule at once exist, but they require solving much more complex problems, and are consequently less scalable. Polyhedral scheduling generally has three stages. The first stage defines a set of constraints, based on dependences, that describe the set of legal schedules. The second stage identifies an optimal schedule by solving an ILP problem within the set of legal schedules. Finally, additional constraints are added to the space of legal solutions to ensure that all subsequent schedules are linearly independent with the schedule found in stage two. Unfortunately, none of these stages scales smoothly with the number of statements and loop dimensions.
(35) We address scalability in the first stage by using a single schedule per strongly connected component (SCC) instead of per statement. SCCs are formed by statements that have cyclic dependences (i.e., they depend on each other). This reduces the dimensionality of the polyhedra containing the set of legal solutions. Other types of groups of related statements, generally referred to as statement groupings, may also be used.
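As a non-limiting sketch of this grouping step (the GDG below and the function are hypothetical), SCCs of a dependence multigraph can be computed with a standard two-pass, Kosaraju-style algorithm:

```python
# Group GDG statements into strongly connected components (SCCs) so the
# scheduler solves one problem per group instead of one per statement.

def sccs(graph):
    """graph: dict node -> set of successors. Returns a list of frozensets."""
    order, seen = [], set()

    def dfs(g, start, visit):
        # Iterative DFS that calls `visit` on each node at finish time.
        stack = [(start, iter(g.get(start, ())))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(g.get(nxt, ()))))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                visit(node)

    # Pass 1: record nodes in order of DFS finish time.
    for n in graph:
        if n not in seen:
            dfs(graph, n, order.append)
    # Pass 2: DFS on the transposed graph, in reverse finish order.
    transposed = {n: set() for n in graph}
    for n, succs in graph.items():
        for m in succs:
            transposed[m].add(n)
    seen, comps = set(), []
    for n in reversed(order):
        if n not in seen:
            comp = []
            dfs(transposed, n, comp.append)
            comps.append(frozenset(comp))
    return comps

# S1 and S2 depend on each other (a cycle); S3 only consumes their result.
gdg = {"S1": {"S2"}, "S2": {"S1", "S3"}, "S3": set()}
comps = sccs(gdg)
```

With this GDG, {S1, S2} collapse into one group and S3 remains alone, so the scheduler handles two groups rather than three statements.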
(36) One other form of grouping may take into account the number of instructions to be executed per group at each iteration. Parallelization among the processing elements of a vector processor can require smaller groups (down to one operation) to optimize for instruction-level parallelism, while parallelization across cores may not need to take into account this kind of parallelism and can afford bigger groups in order to have a more scalable scheduling process.
(37) Another type of grouping may be based on the number of store operations. Dataflow analysis can be more difficult to apply to statements that perform more than one write/store. Grouping may also be based on the weakness of data accesses. It is sometimes possible to group some or all conditionally-controlled accesses to a data element into one group. When all of them can be grouped, grouping turns “weak” accesses (i.e., accesses that may occur when a corresponding condition is true) into “strong” accesses (i.e., accesses that generally do occur). In some cases, grouping may be based on the ability to hide local, temporary values from the scheduler. Such values may be produced by one operation in the group and fully consumed by other operations in the same group, within the same iteration.
(38) Groupings may also be based on the amount of similar array accesses in the to-be-grouped statements. The number of constraints in the feasible space of a scheduling problem is generally closely related to the number of array references per statement. Grouping operations with equal array references can make it possible to express these equal or equivalent references as a single array reference in the scheduling problem. In some cases, grouping takes into account heuristics about the likelihood that some operations will require the same schedule. Among them, if the linear part (i.e., the part that does not involve loop-constants) of data-structure references accessed by two operations is the same, we may choose to place both operations in the same group.
(39) The second stage relies on solving an ILP optimality problem, which is NP-hard. To be tractable, the number of constraints and dimensions in the problem needs to stay low enough. The ILP constraints in traditional polyhedral schedulers have O((m+n)mn) size on average for m constraints with n variables. Our tensor scheduler improves scalability in three ways for this stage. First, the restriction to permutations introduces many equalities in the ILP constraints, effectively reducing the dimensionality of the problem. Second, we only have one schedule to find per SCC, as explained above. Third, we only schedule two SCCs (or fused SCCs) at a time. This effectively makes n depend only upon the maximum loop depth in either SCC, as opposed to the number of statements times such depth.
(40) The final stage of the scheduling process augments the ILP formulation to ensure that the next schedule dimensions to be found are linearly independent from the scheduling hyperplanes found so far. This is typically done by finding the orthogonal subspace H_S^⊥ of the multi-dimensional schedule found so far for S, H_S:

(41) H_S^⊥ = I − H_S^T (H_S H_S^T)^{-1} H_S

(42) For the next schedule θ_S to be linearly independent with H_S, the following condition must hold true:

(43) H_S^⊥ · θ_S ≠ 0 (5)

Alternatively, if bounded coefficients are imposed for θ_S, orthogonality can be ensured by adding extra ILP constraints that exactly exclude being in H_S.
(44) For our tensor schedulers, the orthogonal subspace can be identified in a straightforward manner. Given a permutation hyperplane θ_{p_i}, linear independence can be achieved by adding the constraint α_i = 0 to the space of legal solutions.
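A minimal sketch of this observation (the helper below is hypothetical, not from the original text): once a loop index has been consumed by a permutation hyperplane, the remaining candidates are simply the unused basis vectors.

```python
def remaining_hyperplanes(depth, used):
    """Candidate loop indices (permutation hyperplanes e_i) that stay linearly
    independent from the schedule built so far: exactly the unused indices."""
    return [i for i in range(depth) if i not in used]

# After permuting loop 1 outermost in a 3-deep nest, loops 0 and 2 remain
# available for the inner schedule dimensions.
```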
(45) B. Cost Model Improvements
(46) Because the polyhedral model depends on solving an ILP based on affine constraints, the performance cost models must also be affine. This is typically done with the minimum latency formulation. Given a vector of program parameters, n, the minimum latency formulation proposes to bound the maximum dependence distance with an affine function L of the parameters:

L(n) = h · n + k

such that,

L(n) − θ(i_S, i_T, n) ≥ 0, ∀ (i_S, i_T, n) ∈ D_{S→T} (6)

where θ(i_S, i_T, n) denotes the dependence distance θ_T(i_T) − θ_S(i_S).
(47) In practice, however, this is not a particularly beneficial model of performance. It does not take into account effects due to caching or memory access patterns. Modeling performance as a linear function is a difficult problem since performance is inherently non-linear.
(48) The restriction to permutation hyperplanes presents an opportunity to improve the quality of the cost models. The bijective mapping of loops to hyperplanes means that the number of potential schedules at any level is relatively low (i.e. the dimensionality of the loop nest). This allows us to examine each loop in the input code and come up with a weighting for determining how desirable it would be to permute a loop to a particular level. This weighting vector can then be incorporated into the ILP objective function.
(49) argmax(Σ_i ω_i α_i) (7)
(50) This objective function does not need to be hierarchical. As a result, we can use off-the-shelf ILP solvers, which are much faster than any parametric hierarchical ILP.
ω = [1 0 100]^T
(51) Solving the ILP to maximize Equation (7) gives the schedule shown in
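Because each candidate schedule is a basis vector, the objective reduces to an argmax over per-loop weights; a hypothetical sketch (the weights and legal-loop sets below are illustrative):

```python
# With permutation hyperplanes, maximizing the weighted objective amounts to
# picking, per nest, the legal loop whose weight is largest.

def best_loop(weights, legal_loops):
    """Return the index of the legal loop with the largest weight."""
    return max(legal_loops, key=lambda i: weights[i])

w = [1, 0, 100]
# If all three loops are legal at this level, loop 2 (weight 100) is chosen;
# if loop 2 is excluded by a dependence, loop 0 (weight 1) wins instead.
```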
(52) C. Applicability
(53) A significant portion of codes considered for polyhedral optimization have plenty of natural loop parallelism, including elementwise computations, matrix multiplications, tensor contractions, and sequences thereof. In hierarchical polyhedral optimization, code gets scheduled once per level of the hierarchy. The first mapping level typically extracts a lot of loop parallelism, so that subsequent levels often need only a tensor scheduler.
IV. Scalable Scheduling in the Permutation, Fusion Space
(54) The family of schedulers we propose here combine several features, each of which can improve scalability. First, we consider one schedule transformation per SCC of the GDG (or per statement grouping, in general), which can greatly reduce the dimensionality of the problem. This is especially true since all the statements nested in the same loop in the input program have, by definition, similar loop constraints. We reduce the dimensionality of the search space by collapsing all per-statement schedule dimensions onto one set of dimensions for the whole statement grouping (e.g., an SCC), effectively intersecting their individual schedule validity constraints. The discussion below uses SCCs as an example, for the simplicity of discussion. The techniques described below apply to other types of statement groupings, as well.
(55) Second, SCCs may be greedily scheduled by pairs, in an order given by an estimate of the amount of data shared among the SCCs. Such amount is computed from the dependence relations between the SCCs. The goal is to evaluate: (1) what is the best fusion and permutation combination, and (2) is it worth fusing, if the only way to do so is by permuting in a highly undesirable way? The goal may also include evaluating whether it is worth permuting if the cost of fusing is too high. When SCC nodes are fused, they form a fused subgraph, which gets scheduled with another adjacent SCC. Statements in a fused subgraph will share a common loop in the generated code.
(56) Third, we use an adaptive cost model. When one or both nodes being scheduled by the tensor scheduler belong to a fused subgraph, the criteria for fusion become stricter. Nodes in a subgraph cannot have their schedule changed, because altering the schedule could violate the conditions under which the nodes were fused in the first place. Also, merging a new node into a fused subgraph requires all SCC edges between the node and the subgraph to be evaluated.
(57) In the remainder of this section, we present three types of tensor schedulers. The first performs fusion only if there is a zero dependence distance between two SCC nodes. The second is parametric, which allows for fusion plus a constant shift. The third incorporates permutation to find the best combination of permutation and fusion.
(58) A. Simple Tensor Scheduler (STS)
(59) The simple tensor scheduler only attempts to fuse SCC nodes, without any loop permutations. As a result, the A transformation matrix for each statement remains the identity. The STS attempts to maximally fuse all SCC nodes at each scheduling level, but only if the dependence distance is zero. Although fusion is possible with a non-zero dependence distance, it is not always desirable. Fusion with a non-zero dependence distance may turn a parallel loop into a sequential one (the non-zero distance meaning that the dependence is carried by the loop). Furthermore, if the dependence distance is large, fusion may not provide any benefits to locality, since there is a higher chance that data may be evicted from cache before it can be reused. The STS considers zero-distance fusion to be always desirable, which is mostly true since it does not affect the level of parallelism and data between the fused statements is reused in the same iteration. An exception is when fusion introduces too much register pressure, which can be handled by cost models as presented below.
(60) For two SCC nodes, A and B, at scheduling level i, we check the following condition:

θ_{p_i}^B(t) − θ_{p_i}^A(s) = 0, ∀ (s, t) ∈ D_{A→B} (8)

where θ_{p_i}^B and θ_{p_i}^A are B's and A's i-th permutation hyperplanes, respectively. When Equation (8) holds true, this indicates that the i-th loop levels of A and B can be safely executed at the same time without violating D_{A→B}.
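A non-limiting sketch of this test on enumerated dependence instances (the dependence set below is hypothetical; the STS evaluates the condition on the symbolic dependence polyhedron):

```python
# STS fusion test: SCC nodes A and B may be fused at level i only if every
# dependence instance from A to B has distance zero at that level.

def zero_distance_at_level(dependences, i):
    """dependences: list of (s, t) iteration-vector pairs from A to B."""
    return all(t[i] - s[i] == 0 for s, t in dependences)

deps = [((0, 0), (0, 1)), ((1, 0), (1, 2))]
# Level 0 distances are all zero (fusable at the outer level);
# level 1 distances are 1 and 2 (not fusable there under the STS).
```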
B. Parametric Tensor Scheduler (pTS)
(61) The parametric tensor scheduler relaxes the zero dependence distance constraint required by the STS. The pTS fuses SCCs as long as the dependence distance is bounded by a constant. Furthermore, it identifies a γ_0 shift that will minimize this dependence distance. This is done by solving a linear programming problem subject to a set of clamping constraints. Clamping constraints define a set of parallel hyperplanes that bound the dependence polyhedron. Every point (s, t) ∈ D_{A→B} is contained by the half spaces:

θ_B(t) − θ_A(s) ≥ 0
d − θ_B(t) + θ_A(s) ≥ 0 (9)

Equation (9) defines two parallel hyperplanes separated by a constant distance, d, called the clamping distance. Minimizing the clamping distance gives a scheduling hyperplane that minimizes the dependence distance and allows A and B to be fused.
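The clamping computation can be sketched by enumeration rather than an LP solver (illustrative only; the dependence set is hypothetical and a real pTS operates on the symbolic dependence polyhedron):

```python
# pTS clamping sketch: find the smallest constant d (and a shift) bounding the
# dependence distance at level i over all dependence instances.

def clamp(dependences, i):
    """Return (shift, d) so that 0 <= (t[i] + shift) - s[i] <= d for all deps."""
    dists = [t[i] - s[i] for s, t in dependences]
    shift = -min(dists)          # gamma_0 shift aligning the closest pair
    d = max(dists) - min(dists)  # clamping distance after shifting
    return shift, d

deps = [((0,), (2,)), ((1,), (4,))]
# Raw distances are 2 and 3; shifting by -2 clamps them into [0, 1], so the
# SCCs are fusable with clamping distance d = 1.
```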
(62) When the distance is constant, there is a tradeoff between γ_0 and d. By setting γ_B
(63) C. Permutation Tensor Scheduler (PERTS)
(64) The permutation tensor scheduler extends the parametric tensor scheduler with the inclusion of permutation transformations. This can greatly improve the quality of the generated schedules, by exploring tradeoffs between fusion and permutation. Fusing some statements could restrict the permutations available, and vice versa. For the other tensor schedulers, fusion is always beneficial because the loop ordering is fixed: the scheduling matrix for each statement remains the identity regardless of fusion. This tradeoff leads to two types of fusion heuristics, max fuse and smart fuse. The max fuse scheduler always performs fusion when possible, whereas the smart fuse scheduler only fuses when it is expected to give better performance.
(65) Algorithm 1 shown in
(66) Σ_{i=1}^{m_S} α_i = 1, α_i ∈ {0, 1} (10)

where m_S is the dimensionality of a statement grouping S, i.e., the number of dimensions of a loop nest associated with the statement grouping. Equation (10) requires that of the m_S loops in the loop nest, one and only one can be permuted.
(67) Selection of the permutation hyperplane is handled by assigning a weight ω to the source and destination permutation hyperplanes for the current schedule dimension, and considering the following objective function:

argmax(Σ_i ω_{src,i} α_{src,i} + Σ_j ω_{dst,j} α_{dst,j} − 100·d) (11)
Equation (11) considers together a pair of SCCs, designated as a source (src) SCC and a destination (dst) SCC, and having a dependence therebetween. Maximizing this objective function yields a value of a loop index i and a loop index j. These loop indices indicate that the i-th loop in the loop nest associated with the source SCC and/or the j-th loop in the loop nest associated with the destination SCC may be permuted, and/or the two SCCs may be fused. The permutation may be beneficial only if the cost of fusion (e.g., in terms of loss of parallelism), as indicated by the clamping distance d, is not excessive. In some cases, the weights are normalized to a range [0, 100] and, as such, the clamping distance is also weighted by the constant 100. This constant and the weight ranges are adjustable.
(68) The weighting vectors in Equation (11) may be computed from a non-linear cost model that may include one or more of the three separate models shown in
(70) Specifically, locality is generally improved if SCCs (statement groups, in general) involving memory access with large strides (e.g., greater than a few tens of bytes, a few hundreds of bytes, a few thousands of bytes, etc.) are not fused or scheduled at the inner dimensions of a loop nest. Likewise, locality can be improved by allowing or encouraging fusion of SCCs (statement groups, in general) involving memory access with small strides (e.g., less than a few thousands of bytes, a few hundreds of bytes, a few bytes, etc.) at the inner dimensions of a loop nest. If a loop nest has m dimensions, where the innermost dimension is referred to as dimension “m” and the outermost dimension is referred to as dimension “1,” the inner dimensions may be defined as dimensions greater than or equal to k, where 1≤k<m. In general, m can be 3, 4, 8, 10, 15, etc. and k can be 2, 5, 12, 15, etc.
(71) Since scheduling is typically performed starting with the outermost dimension and progressing to the innermost dimension, a cost model can be designed such that large strides correspond to higher costs or weights, as shown in
(72) The Gaussian 404 in
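One plausible shape for such a weighting is a Gaussian over the access stride (a hypothetical parameterization; the center, width, and scale used by the patent's actual cost model are not reproduced here): small, cache-friendly strides receive high weight, encouraging fusion at inner dimensions, while large strides decay toward zero.

```python
import math

def gaussian_stride_weight(stride_bytes, center=8.0, width=128.0):
    """Hypothetical Gaussian weight over access stride: strides near
    `center` bytes score close to 100 (the normalized weight range),
    and large strides decay toward zero weight."""
    return 100.0 * math.exp(-((stride_bytes - center) ** 2) / (2 * width ** 2))
```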
(75) Given a pair of SCCs, Equation (11) indicates whether the source and destination SCCs can be fused. The terms source SCC and destination SCC indicate a dependence from the destination SCC to the source SCC in the SCC graph. This equation is evaluated at each scheduling dimension, yielding a value of dimension (depth) i for the source SCC and a value of dimension (depth) j for the destination SCC. In general, while performing scheduling at dimension k, the values of i and j that maximize Equation (11) can be used as follows: (i) the source and destination SCCs may be fused within a loop at depth i or j; and/or (ii) a loop at depth i in the source SCC may be permuted; and/or (iii) a loop at depth j in the destination SCC may be permuted. As noted above, permutation may not be performed if the cost of fusion is excessive because the clamping distance d is large.
(76) In general, the weights used in Equation (11) can be derived from a cost function that can be linear, non-linear (such as those described with reference to
(77) Referring again to Algorithm 1 (
(78) Smart Fuse: The PERTS under maximum fusion may fuse whenever the ILP formed from the clamping constraints of Equation (9) has a solution. Under smart fusion, scheduling is performed twice—once for fusion and once for fission. Scheduling with fission is straightforward since it simply requires selecting the loop with the highest weight derived from the selected cost model. Either the fusion or fission transformation is then selected based on the following equation:
(79) F(θ) + P(θ_fusion) ≥ P(θ_fission)  (12)
(80) where F(θ) represents the data reuse between the two statement groups and P(θ) represents the benefit (i.e., the value of the cost function) of the best permutation found in each case. Thus, if fission enables permutations whose benefit offsets the benefit of fusion, the loops are not fused. F(θ) can be a 1D Gaussian centered at zero and indexed by the product of the dependence distance and the number of bytes shared between the statement groups. For the fission schedule, the trip count of the loop may be treated as the dependence distance. In the event the trip count is parametric, we can treat it as a large fixed constant (e.g., 1024, 4096, etc.). This allows the PERTS to tailor its transformations based on the size of the loop. P(θ) can be computed using a linear, a non-linear (e.g., the Gaussian mixture shown in
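Under these definitions, the smart-fuse decision can be sketched as follows (all names, the Gaussian width, and the decision rule's exact form are illustrative assumptions): F(θ) is a 1D Gaussian centered at zero over the product of dependence distance and shared bytes, a parametric trip count is replaced by a large constant, and fusion is kept only if its reuse benefit plus its best permutation benefit is at least the fission schedule's best permutation benefit.

```python
import math

PARAMETRIC_TRIP_COUNT = 1024  # stand-in constant for unknown loop bounds

def reuse_benefit(dep_distance, shared_bytes, sigma=4096.0):
    """F(theta): 1-D Gaussian centered at zero, indexed by the product
    of dependence distance and bytes shared between statement groups."""
    x = dep_distance * shared_bytes
    return 100.0 * math.exp(-(x ** 2) / (2 * sigma ** 2))

def choose_fusion(dep_distance, shared_bytes,
                  perm_benefit_fused, perm_benefit_fissioned):
    """Fuse iff F + P(fusion) >= P(fission): fission wins only when the
    permutations it unlocks offset the data reuse lost by not fusing."""
    return (reuse_benefit(dep_distance, shared_bytes)
            + perm_benefit_fused >= perm_benefit_fissioned)
```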
V. Generalized Schedulers
(82) The use of non-linear and/or procedural cost functions is not limited to scheduling according to Algorithm 1 (
(83) In order to ensure that the scheduler does not run out of memory or does not take excessively long to derive a schedule, memory space only up to a certain size, e.g., 128 MB, 1 GB, etc., may be allocated for the scheduler. In addition, or alternatively, processing capacity and/or time, e.g., in terms of maximum CPU time, a maximum number of CPU cycles, maximum CPU load, a maximum actual time, etc. may be allocated. For each candidate transform, the memory and/or processing requirements of the transform are obtained, e.g., from past experience.
(84) These requirements may be parameterized in terms of one or more characteristics of the program to be scheduled. These parameters may include the number of loop nests in the program, the average or maximum depth of the loop nests, the maximum or average number of tensors accessed in the program or in each loop nest, the maximum or average size of the tensors, which can be large (e.g., 10 MB, 200 MB, 1 GB, 150 GB, or more), and the maximum or average number of dimensions of the tensors accessed in the program or in each loop nest, which can also be large (e.g., 5, 12, 20, 40, etc.).
(85) The number of different transforms the scheduler may explore may be determined based on the respective requirements of different transforms and the allocated memory size and processing capacity, such that applying the different transforms would not exceed the allocated memory size and processing capacity. While this can limit the number of transforms explored by the scheduler, the non-linear or procedural cost functions (which may also be considered tradeoff or benefit-penalty functions) enable exploring diverse candidate solutions, so that a schedule that can simultaneously optimize objectives such as locality, parallelization, data reuse, data communication, processing system utilization, etc., can be obtained.
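This budget-constrained exploration can be sketched as a simple greedy filter (all names and resource estimates are hypothetical; the patent does not specify a selection policy):

```python
def select_transforms(candidates, mem_budget, cpu_budget):
    """Keep candidate transforms whose estimated memory and CPU
    requirements (e.g., derived from past runs, here supplied directly
    as (name, mem, cpu) tuples) fit within the allocated budgets."""
    chosen, mem_used, cpu_used = [], 0, 0
    # Favor cheaper transforms so more of the space can be explored.
    for name, mem, cpu in sorted(candidates, key=lambda c: (c[1], c[2])):
        if mem_used + mem <= mem_budget and cpu_used + cpu <= cpu_budget:
            chosen.append(name)
            mem_used += mem
            cpu_used += cpu
    return chosen
```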
VI. Evaluation
(86) We implemented embodiments of the tensor schedulers described above in the R-Stream compiler (rcc for short), which is a polyhedral compiler, and we compared their performance with that of a conventional scheduler (specifically, rcc's Pluto-style scheduler). The comparison faced one challenge: in rcc, the scheduling process is a series of many passes or phases, some of which correct (using loop permutations, skewings, etc., and sometimes fissions) the mistakes of previous passes. As a result, the effects of a scheduler are exposed only quite indirectly in the final version of the compiled/scheduled program.
(87) In order to address this problem, we analyzed the properties of the GDG right after scheduling. We compared several features over a variety of 66 benchmarks, which include stencils, basic linear algebra subprograms (BLAS) codes, radar codes, tensor codes, etc. The compared features include: the running times of the schedulers, an indicator of the quality of fusion, and an indicator of parallelism.
(88) To obtain an indicator of the quality of fusion, we ranked the program statements by their so-called “beta” coordinates, a vector of integers that reflects how they are fused. If the common prefix of two statements' beta vectors is of length x, the statements have been fused up to dimension x. Hence, we sort statements by their beta coordinates, and the fusion indicator is the sum (plus one) of the common-prefix lengths of consecutive statements in this sorted list.
(89) An indicator of parallelism is available from the per-dimension parallelism annotations on statements. While we could easily perform a weighted sum of parallelism annotations, we found it hard to choose weights that express how important doall parallelism is compared to permutability or the availability of a reduction loop. We therefore focused on counting the sequential loops in which parallelism is lost. The parallelism score was computed as: 1.1 − (#sequential loops / total #loops).
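The two indicators can be sketched concretely (hypothetical helper names; the beta vectors below are illustrative, and we take the em-dash in the score formula above as a minus sign):

```python
def fusion_indicator(betas):
    """Sum (plus one) of common-prefix lengths of consecutive beta
    vectors, after sorting statements by their beta coordinates."""
    def common_prefix(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n
    betas = sorted(betas)
    return 1 + sum(common_prefix(a, b) for a, b in zip(betas, betas[1:]))

def parallelism_score(num_seq_loops, total_loops):
    """Parallelism score as given in the text:
    1.1 - (#sequential loops / total #loops)."""
    return 1.1 - num_seq_loops / total_loops
```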
(90) Table 1 shows the average, min, and max of the ratio between indicators of the base scheduler and the corresponding indicators of tensor schedulers.
(91) TABLE 1: Comparison of Scheduler Performance (avg; min; max)

Scheduler      time                  fusion               parallelism
simple         (1.71; 0.86; 4.41)    (1.51; 0.70; 13)     (1.34; 0.56; 7.00)
parametric²    (1.53; 0.52; 2.52)    (2.04; 0.70; 13)     (1.47; 0.56; 7.67)
perm-simple    (1.15; 0.34; 2.40)    (1.28; 0.70; 7.0)    (1.50; 0.57; 11.0)
perm-smart     (1.12; 0.33; 2.29)    (1.32; 0.70; 7.0)    (1.50; 0.57; 11.0)
(92) We observe a general speedup of the scheduling time, which decreases as we increase the sophistication of the tensor scheduler from simple (STS), to parametric (pTS), to perm-simple (PERTS-Max Fuse), to perm-smart (PERTS-Smart Fuse). In some experiments, pTS performed better with fusion. One observation is that none of the schedulers is always the fastest, the best fuser, or the best parallelizer, although the non-permutation schedulers are marginally slower on small input programs. The RStream-TF front-end generates code for combinations of deep learning network layers and optimizes them through R-Stream. In this case, tensor schedulers can be invoked by RStream-TF on the appropriate layers.
VII. Conclusion
(93) We introduced four new lightweight schedulers of increasing capabilities and complexity. The different scheduling techniques described above, as well as the methodology proposed to apply them, provide several new alternatives to heavy-weight polyhedral schedulers described in earlier work. These new schedulers provide more efficient and scalable scheduling capabilities for the codes that already expose enough parallelism in their original formulation. The reduced scheduling/compilation times offered by these schedulers are of general interest but also have important and immediate applications in just-in-time (JIT) compilers, for which compilation speed is typically paramount.
(94) It is clear that there are many ways to configure the device and/or system components, interfaces, communication links, and methods described herein. The disclosed methods, devices, and systems can be deployed on convenient processor platforms, including network servers, personal and portable computers, and/or other processing platforms. Other platforms can be contemplated as processing capabilities improve, including personal digital assistants, computerized watches, cellular phones and/or other portable devices. The disclosed methods and systems can be integrated with known network management systems and methods. The disclosed methods and systems can operate as an SNMP agent, and can be configured with the IP address of a remote machine running a conformant management platform. Therefore, the scope of the disclosed methods and systems are not limited by the examples given herein, but can include the full scope of the claims and their legal equivalents.
(95) The methods, devices, and systems described herein are not limited to a particular hardware or software configuration, and may find applicability in many computing or processing environments. The methods, devices, and systems can be implemented in hardware or software, or a combination of hardware and software. The methods, devices, and systems can be implemented in one or more computer programs, where a computer program can be understood to include one or more processor-executable instructions. The computer program(s) can execute on one or more programmable processing elements or machines, and can be stored on one or more storage media readable by the processor (including volatile and non-volatile memory and/or storage elements), one or more input devices, and/or one or more output devices. The processing elements/machines thus can access one or more input devices to obtain input data, and can access one or more output devices to communicate output data. The input and/or output devices can include one or more of the following: Random Access Memory (RAM), Redundant Array of Independent Disks (RAID), floppy drive, CD, DVD, magnetic disk, internal hard drive, external hard drive, memory stick, or other storage device capable of being accessed by a processing element as provided herein, where such aforementioned examples are not exhaustive, and are for illustration and not limitation.
(96) The computer program(s) can be implemented using one or more high level procedural or object-oriented programming languages to communicate with a computer system; however, the program(s) can be implemented in assembly or machine language, if desired. The language can be compiled or interpreted. Sets and subsets, in general, include one or more members.
(97) As provided herein, the processor(s) and/or processing elements can thus be embedded in one or more devices that can be operated independently or together in a networked environment, where the network can include, for example, a Local Area Network (LAN), wide area network (WAN), and/or can include an intranet and/or the Internet and/or another network. The network(s) can be wired or wireless or a combination thereof and can use one or more communication protocols to facilitate communication between the different processors/processing elements. The processors can be configured for distributed processing and can utilize, in some embodiments, a client-server model as needed. Accordingly, the methods, devices, and systems can utilize multiple processors and/or processor devices, and the processor/processing element instructions can be divided amongst such single or multiple processor/devices/processing elements.
(98) The device(s) or computer systems that integrate with the processor(s)/processing element(s) can include, for example, a personal computer(s), workstation (e.g., Dell, HP), personal digital assistant (PDA), handheld device such as cellular telephone, laptop, handheld, or another device capable of being integrated with a processor(s) that can operate as provided herein. Accordingly, the devices provided herein are not exhaustive and are provided for illustration and not limitation.
(99) References to “a processor”, or “a processing element,” “the processor,” and “the processing element” can be understood to include one or more microprocessors that can communicate in a stand-alone and/or a distributed environment(s), and can thus be configured to communicate via wired or wireless communication with other processors, where such one or more processors can be configured to operate on one or more processor/processing element-controlled devices that can be similar or different devices. Use of such “microprocessor,” “processor,” or “processing element” terminology can thus also be understood to include a central processing unit, an arithmetic logic unit, an application-specific integrated circuit (IC), and/or a task engine, with such examples provided for illustration and not limitation.
(100) Furthermore, references to memory, unless otherwise specified, can include one or more processor-readable and accessible memory elements and/or components that can be internal to the processor-controlled device, external to the processor-controlled device, and/or can be accessed via a wired or wireless network using a variety of communication protocols, and unless otherwise specified, can be arranged to include a combination of external and internal memory devices, where such memory can be contiguous and/or partitioned based on the application. For example, the memory can be a flash drive, a computer disc, CD/DVD, distributed memory, etc. References to structures include links, queues, graphs, trees, and such structures are provided for illustration and not limitation. References herein to instructions or executable instructions, in accordance with the above, can be understood to include programmable hardware.
(101) Although the methods and systems have been described relative to specific embodiments thereof, they are not so limited. As such, many modifications and variations may become apparent in light of the above teachings. Many additional changes in the details, materials, and arrangement of parts, herein described and illustrated, can be made by those skilled in the art. Accordingly, it will be understood that the methods, devices, and systems provided herein are not to be limited to the embodiments disclosed herein, can include practices otherwise than specifically described, and are to be interpreted as broadly as allowed under the law.