Compiler-level general matrix multiplication configuration optimization
11842178 · 2023-12-12
Assignee
Inventors
Cpc classification
International classification
Abstract
A system and method is provided for optimizing general matrix multiplication (GEMM) on target hardware by splitting matrices to be multiplied into tiles and formulating a tiling configuration search problem for matrices to be multiplied that explores a configuration search space to identify an optimal tiling configuration that minimizes running time on the target hardware for multiplication of matrices A (m×k) and B (k×n) on the target hardware for respective configuration states as a function of matrix parameters m, k, and n, and numbers of respective nested loops for each dimension m, k, and n, respectively. The optimal tiling configuration for the target hardware is obtained by implementing a Greedy Best-First-Search (GBFS) algorithm or a Neighborhood Actor Advantage Critic (N-A2C) algorithm that optimizes the running time for multiplication of the matrices on the target hardware, and the target hardware is configured and computations are run accordingly.
Claims
1. A computer-implemented method of optimizing general matrix multiplication (GEMM) on target hardware, the method comprising: splitting, by one or more processing devices, matrices to be multiplied into tiles; formulating, by the one or more processing devices, a tiling configuration search problem for the matrices to be multiplied that explores a configuration search space to identify an optimal tiling configuration that minimizes a running time on the target hardware for multiplication of the matrices on the target hardware, the formulating including formulating the tiling configuration search problem into following optimization problem: min cost (s; m, k, n, d.sub.m, d.sub.k, d.sub.n), s where: s denotes a configuration state, wherein the configuration state s is defined as s=[s.sub.m s.sub.k, s.sub.n, J], where s.sub.m=[m.sub.0, m.sub.1, . . . , m.sub.dm-1], s.sub.k=[k.sub.0, k.sub.1, . . . , k.sub.dk-1], s.sub.n=[n.sub.0, n.sub.1, . . . , n.sub.dn-1], and J is a binary number indicating whether the configuration state s satisfies search constraints for searching tiling configurations on the matrices, wherein an action space for exploring from a current state s[i,j] to another state is defined such that s[i]=s[i]*2 and s[j]=s[j]/2 where i and j are in the same split of m defined as from dimension 0 to dimension d.sub.m−1, k defined from dimension d.sub.m to dimension d.sub.m+d.sub.n−1, or n defined from dimension d.sub.m+d.sub.n to dimension d.sub.m+d.sub.n+d.sub.k−1, wherein each configuration state s is a unique state and a transition from one configuration state to another is defined as s′=step (s, a), where a is a particular action that transitions a current state s to state s′, further comprising defining a reward function for the optimization problem as:
r(s,a)=1,
cost(s′,m,k,n,d.sub.m,d.sub.k,d.sub.n) where for each configuration state s′ which can be accessed by any possible action a from one configuration state s, the state s′ and the state s are neighbor states, cost denotes a running time for the configuration state s, m, k, and n are matrix parameters for multiplication of a first matrix A (m×k) by a second matrix B (k×n), to produce a matrix C (m×n), where m is a number of rows of a first matrix, k is number of columns of the first matrix and number of rows of a second matrix, and n is the number of columns of the second matrix, and d.sub.m, d.sub.k, d.sub.n are numbers of respective nested loops for each dimension m, k, and n, respectively; and running, by the one or more processing device, a GEMM computation on the target hardware based on the optimal tiling configuration.
2. The computer-implemented method as in claim 1, wherein the configuration search space for the optimal tiling configuration is limited by d.sub.m, d.sub.k, and d.sub.n, where dm is restricted to 2.sup.x, d.sub.k is restricted to 2.sup.y, and d.sub.n is restricted to 2.sup.z, where x, y, and z are non-negative integers.
3. The computer-implemented method as in claim 1, wherein selecting the optimal tiling configuration for the matrices to be multiplied that optimizes the running time for multiplication of the matrices on the target hardware comprises implementing a Greedy Best-First-Search (GBFS) algorithm to search for an optimal tiling configuration starting at a random starting state so or a selected starting state s.sub.0 to run respective configurations on the target hardware to find an optimal state s′ with a minimum cost within one of a predetermined search time and a predetermined fraction of the configuration search space for the GBFS algorithm.
4. The computer-implemented method as in claim 3, wherein the configuration search space is defined for the optimal tiling configuration for a random subset of neighbors p of the configuration state s where 1≤p≤total number of neighbors of s and the neighbor state s′ for s in response to action a is s′=step (s, a).
5. The computer-implemented method as in claim 1, wherein selecting the optimal tiling configuration for the matrices to be multiplied that optimizes the running time for multiplication of the matrices on the target hardware comprises implementing a Neighborhood Actor Advantage Critic (N-A2C) algorithm to search for an optimal tiling configuration starting at a random or selected starting state so to run respective configurations on the target hardware to find an optimal state s′ with a minimum cost within one of a predetermined search time and a predetermined fraction of the configuration search space.
6. The computer-implemented method as in claim 5, wherein neural networks of an actor and critic in the N-A2C algorithm are initialized with random weights and, for each episode of the N-A2C algorithm, the configuration search space is searched for the optimal tiling configuration from the starting state s.sub.0 in T continuous steps using an ε—greedy algorithm for each step, where (1) with a probability of ε an action a guided by a policy generated by the actor's neural network for each state s is taken to get a next configuration state s′ or (2) with a probability of 1−ε a random action a is chosen from a current configuration state s to get a next configuration state s′.
7. The computer-implemented method as in claim 6, wherein when the configuration state s′ has not been visited before, the state s′ is added to a collected candidate set and the T continuous steps are repeated until a number of collected states s′ in the collected candidate set reaches a predefined threshold.
8. The computer-implemented method as in claim 6, wherein the number of the T continuous steps decays in value over iterations of exploration of the configuration search space.
9. A computing device for optimizing general matrix multiplication (GEMM) on target hardware, comprising: at least one processor; and a machine-readable medium coupled to the at least one processor, with the machine-readable medium comprising instructions thereon that, when executed by the at least one processor, causes the at least one processor to perform operations including: splitting matrices to be multiplied into tiles; formulating a tiling configuration search problem for the matrices to be multiplied that explores a configuration search space to identify an optimal tiling configuration that minimizes a running time on the target hardware for multiplication of the matrices on the target hardware, the formulating including formulating the tiling configuration search problem into following optimization problem: min cost (s; m, k, n, d.sub.m, d.sub.k, d.sub.n), s where: s denotes a configuration state, wherein the configuration state s is defined as s=[s.sub.m, s.sub.k, s.sub.n, J], where s.sub.m=[m.sub.0, m.sub.1, . . . , m.sub.dm-1], s.sub.k=[k.sub.0, k.sub.1, . . . , k.sub.dk-1], s.sub.n=[n.sub.0, n.sub.1, . . . , n.sub.dn-1], and J is a binary number indicating whether the configuration state s satisfies search constraints for searching tiling configurations on the matrices, wherein an action space for exploring from a current state s[i,j] to another state is defined such that s[i]=s[i]*2 and s[j]=s[j]/2 where i and j are in the same split of m defined as from dimension 0 to dimension d.sub.m−1, k defined from dimension d.sub.m to dimension d.sub.m+d.sub.n−1, or n defined from dimension d.sub.m+d.sub.n to dimension d.sub.m+d.sub.n+d.sub.k−1, wherein each configuration state s is a unique state and a transition from one configuration state to another is defined as s′=step (s, a), where a is a particular action that transitions a current state s to state s′, further comprising defining a reward function for the optimization problem as:
r(s,a)=1,
cost(s′,m,k,n,d.sub.m,d.sub.k,d.sub.n) where for each configuration state s′ which can be accessed by any possible action a from one configuration state s, the state s′ and the state s are neighbor states, cost denotes a running time for the configuration state s, m, k, and n are matrix parameters for multiplication of a first matrix A (m×k) by a second matrix B (k×n), to produce a matrix C (m×n), where m is a number of rows of a first matrix, k is number of columns of the first matrix and number of rows of a second matrix, and n is the number of columns of the second matrix, and d.sub.m, d.sub.k, d.sub.n are numbers of respective nested loops for each dimension m, k, and n, respectively; and running a GEMM computation on the target hardware based on the optimal tiling configuration.
10. The device as in claim 9, wherein the instructions further comprise instructions that when executed by the at least one processor limit the configuration search space for the optimal tiling configuration by d.sub.m, d.sub.k, and d.sub.n, where d.sub.m is restricted to 2.sup.x, d.sub.k is restricted to 2.sup.y, and d.sub.n is restricted to 2.sup.z, where x, y, and z are non-negative integers.
11. The device as in claim 9, wherein the instructions further comprise instructions that when executed by the at least one processor select the optimal tiling configuration for the matrices to be multiplied that optimizes the running time for multiplication of the matrices on the target hardware by implementing a Greedy Best-First-Search (GBFS) algorithm to search for an optimal tiling configuration starting at a random starting state so or a selected starting state so to run respective configurations on the target hardware to find an optimal state s′ with a minimum cost within one of a predetermined search time and a predetermined fraction of the configuration search space for the GBFS algorithm.
12. The device as in claim 11, wherein the instructions further comprise instructions that when executed by the at least one processor define the configuration search space for the optimal tiling configuration for a random subset of neighbors p of the configuration state s where 1≤p≤total number of neighbors of s and the neighbor state s′ for s in response to action a is s′=step (s, a).
13. The device as in claim 9, wherein the instructions further comprise instructions that when executed by the at least one processor select the optimal tiling configuration for the matrices to be multiplied that optimizes the running time for multiplication of the matrices on the target hardware by implementing a Neighborhood Actor Advantage Critic (N-A2C) algorithm to search for an optimal tiling configuration starting at a random or selected starting state so to run respective configurations on the target hardware to find an optimal state s′ with a minimum cost within one of a predetermined search time and a predetermined fraction of the configuration search space.
14. The device as in claim 13, wherein the instructions further comprise instructions that when executed by the at least one processor initialize neural networks of an actor and critic in the N-A2C algorithm with random weights and, for each episode of the N-A2C algorithm, search the configuration search space for the optimal tiling configuration from the starting state s.sub.0 in T continuous steps using an ε—greedy algorithm for each step, where (1) with a probability of ε an action a guided by a policy generated by the actor's neural network for each state s is taken to get a next configuration state s′ or (2) with a probability of 1−ε a random action a is chosen from a current configuration state s to get a next configuration state s′.
15. The device as in claim 14, wherein the instructions further comprise instructions that when executed by the at least one processor add the state s′ to a collected candidate set and repeat the T continuous steps until a number of collected states s′ in the collected candidate set reaches a predefined threshold when the configuration state s′ has not been visited before.
16. The device as in claim 14, wherein the instructions further comprise instructions that when executed by the at least one processor cause the number of the T continuous steps to decay in value over iterations of exploration of the configuration search space.
17. A computing device for optimizing general matrix multiplication (GEMM) on target hardware, comprising: at least one processor; a tiling module splitting matrices to be multiplied into tiles; an optimization module formulating a tiling configuration search problem for the matrices to be multiplied that explores a configuration search space to identify an optimal tiling configuration that minimizes a running time on the target hardware for multiplication of the matrices on the target hardware, the formulating including formulating the tiling configuration search problem into the following optimization problem: min cost (s; m, k, n, d.sub.m, d.sub.k, d.sub.n), s where: s denotes a configuration state, wherein the configuration state s is defined as s=[s.sub.m, s.sub.k, s.sub.n, J], where s.sub.m=[m.sub.0, m.sub.1, . . . , m.sub.dm-1], s.sub.k=[k.sub.0, k.sub.1, . . . , k.sub.dk-1], s.sub.n=[n.sub.0, n.sub.1, . . . , n.sub.dn-1], and J is a binary number indicating whether the configuration state s satisfies search constraints for searching tiling configurations on the matrices, wherein an action space for exploring from a current state s[i,j] to another state is defined such that s[i]=s[i]*2 and s[j]=s[j]/2 where i and j are in the same split of m defined as from dimension 0 to dimension d.sub.m−1, k defined from dimension d.sub.m to dimension d.sub.m+d.sub.n−1, or n defined from dimension d.sub.m+d.sub.n to dimension d.sub.m+d.sub.n+d.sub.k−1, wherein each configuration state s is a unique state and a transition from one configuration state to another is defined as s′=step (s, a), where a is a particular action that transitions a current state s to state s′, further comprising defining a reward function for the optimization problem as:
r(s,a)=1,
cost(s′,m,k,n,d.sub.m,d.sub.k,d.sub.n) where for each configuration state s′ which can be accessed by any possible action a from one configuration state s, the state s′ and the state s are neighbor states, cost denotes a running time for the configuration state s, m, k, and n are matrix parameters for multiplication of a first matrix A (m×k) by a second matrix B (k×n), to produce a matrix C (m×n), where m is a number of rows of a first matrix, k is number of columns of the first matrix and number of rows of a second matrix, and n is the number of columns of the second matrix, and d.sub.m, d.sub.k, d.sub.n are numbers of respective nested loops for each dimension m, k, and n, respectively; and an execution module running a GEMM computation on the target hardware based on the optimal tiling configuration.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
DETAILED DESCRIPTION
(18) In the following description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural, logical and electrical changes may be made without departing from the scope of the present invention. The following description of example embodiments is, therefore, not to be taken in a limited sense, and the scope of the present invention is defined by the appended claims.
(19) The functions or algorithms described herein may be implemented in software in one embodiment. The software may consist of computer executable instructions stored on computer readable media or computer readable storage device such as one or more non-transitory memories or other type of hardware-based storage devices, either local or networked. Further, such functions correspond to modules, which may be software, hardware, firmware or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server or other computer system, turning such computer system into a specifically programmed machine.
(20) The systems and methods described herein optimize matrix multiplication using GEMM and matrix tiling techniques. An overview of these concepts is provided before formulating a GEMM tiling optimization problem addressed by the techniques described herein.
(21) Matrix Multiplication
(22) Most prior art deep learning frameworks rely on hardware-specific operator libraries for optimization, which is mostly achieved by manual tuning. Existing approaches perform either graph-level or operator-level optimization during compilation and are not generic enough to accommodate all AI frameworks and hardware. For a compiler-level GEneral Matrix Multiplication (GEMM) operator, there exists configurations for matrix tiling. Different configuration settings significantly impact the deep learning performance. Current GEMM configuration optimization relies on manual tuning, based on limited hardware architectures. For general hardware architectures, it is beneficial to set a smart tuner to automatically and efficiently search the GEMM configuration space and to select the one with the highest performance. The configurations can be automatically tuned in a grid search, where all possible configuration candidates in the configuration space are tested in the system and the one with the highest performance is selected. However, for general GEMM configuration optimization, it is time consuming for the system to test with a large majority of configuration candidates. The configuration space is large, but relations exist for similar configurations. A smart and efficient configuration tuner is required to search the optimal configuration with a low tuning time.
(23) Matrix multiplication is an important component in AI and in many machine learning algorithms, particularly in the domain of deep learning. Training parameters (weights) of a deep neural network in a vectorized fashion essentially involves multiplication of matrices with various sizes. The input matrices in deep learning are normally very large (e.g., a 256×1024 matrix is multiplied with a 1024×128 matrix). Compilers are used to translate the matrix multiplication routines into low-level code optimized for specific hardware. Compiler-level optimization may be used to significantly improve performance.
(24) Fully-Connected (FC) layers (
(25) GEMM and Matrix Tiling
(26) GEMM is a general procedure ubiquitously used in linear algebra, machine learning, statistics, and many other areas and is implemented in the BLAS (Basic Linear Algebra Subprograms) library. GEMM multiplies two input matrices to produce an output matrix and has the following iterative form: C←αAB+βC. The key difference between GEMM in deep learning and regular matrix multiplication is that the input matrices handled in deep learning are normally much larger. For example, a single layer in a typical convolution neural network may require multiplication of a 256×1024 matrix by a 1024×128 matrix to produce a 256×128 matrix. Regular three-for-loop matrix multiplication computation as shown below requires 34 million (256×1024×128) floating point operations (FLOPs): For i in range(m): For j in range(n):
C[i][j]=0 For l in range(k):
C[i][j]+=A[i][l]×B[l][j]
Modern deep neural networks may have hundreds of convolutional layers (e.g., ResNet152), and such networks may need several billions of FLOPs to finish operations in all layers for an input image.
(27) The time it takes to complete a GEMM computation largely depends on the cache hit rate of memory access. The large sizes of matrices usually prevent the entire matrices from being loaded into memory or cache. However, GEMM can optimize memory access by iteratively splitting computation into smaller tiles, often referred to as the tiling process. A resultant matrix is initialized with zeros. GEMM uses outer products to compute part of a tile of the result and accumulates it on top of what has been stored in that tile. A tile is loaded from memory into cache and accumulates a new result on top of that.
(28) Original memory access patterns need to be transformed to adapt to the cache policy of particular hardware. It is not straightforward to decide an optimal tiling strategy because it requires an accurate estimate of accessed array regions in loops to match with cache size of target hardware and to meet other constraints. An optimal tiling configuration chooses a tile size for each loop to collectively achieve the lowest running time on target hardware. The rest of this description addresses how to compute such an optimal tiling configuration.
(29) Problem Formulation
(30) TVM, as described by Chen, et al., in “TVM: An automated end-to-end optimizing compiler for deep learning,” 13.sup.th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578-594, Carlsbad, Calif. (2018), is used to investigate the performance of matrix tiling for GEMM. TVM is an open source deep learning compiler stack for CPUs, GPUs, and specialized accelerators. Further details regarding TVM may be found at TVM.ai. The objective is to find an optimal matrix tiling configuration that has the minimal running time on target hardware.
(31) TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding, for example. TVM also offers automated optimization of low-level programs to hardware characteristics by employing a novel learning-based cost modeling method for rapid exploration of code optimizations.
(32) TVM facilitates matrix tiling optimization for GEMM by generating Intermediate Representation (IR) of a particular configuration. For example, a simple example IR of a GEMM tiling configuration with a blocking factor of 32 on an x86 CPU for GEMM with (m=1024, k=1024, n=1024) (shortened as (1024, 1024, 1024)) may be illustrated as:
(33) TABLE-US-00001 produce C { for (x.outer, 0, 32) { for (y.outer, 0, 32) { for (x.inner.init, 0, 32) { for (y.inner.init, 0, 32) { C[(((((x.outer*1024) + y.outer) + (x.inner.init*32))*32) + y.inner.init)] = 0.000000f }} for (k.outer, 0, 256) { for (k.inner, 0, 4) { for (x.inner, 0, 32) { for (y.inner, 0, 32) { C[(((((x.outer*1024) + y.outer) + (x.inner.init*32))*32) + y.inner)] += A[(((((x.outer*8192) + k.outer)*4 + k.inner) + (x.inner* 1024))] * B[(((((y.outer+ (k.outer)*128)) + (k.inner*32))*32) + y.inner)] }}}}}}}
(34) Generally, a GEMM tiling configuration can be defined as:
{right arrow over (ξ)}={right arrow over (ξ.sub.m)}×{right arrow over (ξ.sub.i)}×{right arrow over (ξ.sub.n)} (1)
where
(35)
(36)
(37)
(38) Multiplication of two matrices A(m×k) and B(k×n) produces matrix C(m×n). d.sub.m, d.sub.k and d.sub.n are the number of nested loops for each dimension m, k and n, respectively, where d.sub.m is restricted to 2.sup.x, d.sub.k is restricted to 2.sup.y, and d.sub.n is restricted to 2.sup.z, where x, y, and z are integers. m.sub.i, k.sub.i, n.sub.j, ∀.sub.i∈[0, d.sub.m) ∀.sub.l∈[0, d.sub.k) ∀.sub.j ∈[0, d.sub.n), are the number of iterations of a respective loop. The configuration in the IR above is m.sub.0=m.sub.1=32, k.sub.0=256, k.sub.1=4, n.sub.0=n.sub.1=32, and d.sub.m=d.sub.k=d.sub.n=2.
(39) The optimal tiling configuration search problem can be formulated into the following optimization problem:
(40)
When applying the matrix multiplication between A and B, the objective is thus to find an optimal configuration of tiling parameters [m.sub.0, m.sub.1, . . . , k.sub.0, k.sub.1, . . . , n.sub.0, n.sub.1, . . . ] that has the minimal running time on the target hardware. In this example, cost( ) denotes the running time for the configuration s, given the dimension of matrices (m, k, n) and the number of the nested loops on each dimension d.sub.m, d.sub.k, and d.sub.n. As illustrated in
Methodology
(41) The methodology described herein solves the GEMM problem formulated above by modeling the optimization problem as a search problem through which an optimal configuration can be identified. Relationships between eligible configurations are identified, and a network of such configuration is created. Unlike XGBoost, the starting point of a new route is the optimal point from the previous exploration routes. The new route is determined by both random exploration and search direction using the Compiler-Level GEMM Optimization methods described herein. In sample embodiments, Compiler-Level GEMM Optimization is implemented using methods described herein as the G-BFS method guided by Greedy Best-First-Search and the N-A2C method using Neighborhood Actor Advantage Critic Reinforcement Learning that explore this network to find the best configuration. These two configuration tuning methods and the configuration search model together allow exploitation of relations between similar configurations.
(42) Configuration Search Modeling
(43) The configuration tuning problem is modeled as a Markov Decision Process (MDP), where each configuration is regarded as a unique state (or vertex) s in MDP defined as follows:
s=[s.sub.m,s.sub.k,s.sub.n,J] (5)
where s.sub.m=[m.sub.0, m.sub.1, . . . , m.sub.d.sub.
(44) As in the GEMM application, with similar configuration settings (i.e., the configuration parameters for each dimension of two states are equal or close), the performance of the two states is more likely to be similar. In sample embodiments, the states represent different ways to multiply the matrices together. As explained below, tiling techniques are used to determine how to most efficiently store the states in hardware with limited memory. Taking advantage of the relationship between similar configurations and considering the constraints of the matrix size in each configuration, the action space (or edges in the matrix) may be defined as follows:
A={s.sub.x[i]←2s.sub.x[i] and s.sub.x[j]←s.sub.x[j]/2} (6)
where ∀x∈{m, k, n}, ∀i, j∈(0, d.sub.x), and i≠j and are in the same split of m defined as dm−1, k defined as dm+dn−1, or n defined as dm+dn+dk−1. Accordingly, a step function step may be defined as the transition from one state to another:
s′=step(s,a) (7)
With the input of a particular action a∈A, the current state s transitions to state s′. The neighbors s′ are thus defined as half or double the current state.
(45) In addition, if the agent takes action a and transitions from state s to state s′, the reward function may be defined as follows:
(46)
The reward function may also be represented as T(s′)−T(s) where T(s′) is the actual running time of the next state and T(s) is the actual running time of the current state. Following the MDP model, the agent is expected to determine its policy π to efficiently approach and discover a state s* with the lowest running time in the target hardware system. It will be appreciated that this approach takes advantage of relations with similar configurations.
(47) Two different configuration tuning approaches for compiler-level GEMM optimization based on the configuration search model, guided by G-BFS and N-A2C reinforcement learning, respectively, will now be described as well as a discussion of their strengths in different scenarios. It will be appreciated that the optimization may be a one-time optimization for dimensions of matrices of a set size but that the optimization may be run each time there is a change in the matrix configurations or the hardware.
(48) G-BFS Guided Tuning Method
(49) The G-BFS guided tuning method is guided by a modified Greedy Best-First-Search guided approach and follows the flowchart in
g(s)=[s′=step(s,a)∀a∈A] (9)
(50) ρ(ρ∈1, 2, . . . , len(g(s))}) states are randomly selected from g(s) at 404 and are tested in hardware at 406 to get the computation time of all explored parameter candidates. For each state s′ from g(s), if s′ is legitimate and has not been visited before, s′ and its running time cost(s′) is enqueued into priority queue Q 400 at 408 and states is added in the visited list S.sub.v. If its running time cost(s) is smaller than the current minimum running time, state s′ is set as the optimal state visited and its running time is recorded in the priority queue Q 400 as cost.sub.min. The iteration continues until the priority queue Q 400 is empty or the search time reaches the maximum time T.sub.max specified by the user. The current optimal state s* and its running time cost.sub.min are returned as configuration tuning results at 410. The configuration with the reduced running time may be used to determine the configuration tiling strategy. In this fashion, the G-BFS method does not need to expand to consider all neighboring states or vertices. A summary of the G-BFS algorithm is shown in Algorithm 1 below.
(51) TABLE-US-00002 Algorithm 1: G-BFS Method 1: Initialization: Q = PriorityQueue( ), S.sub.v, s.sub.0 2: Q.push((cost(s.sub.0), s.sub.0)); 3: Add s.sub.0 in S.sub.v; 4: while Q ≠ Ø and t.sub.search < T.sub.max and Configs.sub.searched < p.Configs.sub.total do 5: (cost(s), s) = Q.pop( ); 6: β = Take ρ neighbors randomly from g(s); 7: for s.sup.1 in β do 8: if s.sup.1 is legitimate and s.sup.1 not ϵ S.sub.v then 9: Q.push((cost(s.sup.1), s.sup.1)); 10: Add s.sup.1 in S.sub.v; 11: if cost.sub.min > cost(s.sup.1) then 12: cost.sub.min = cost(s.sup.1); 13: s.sup.* = s.sup.1; 14: end if 15: end if 16: end for 17: end while 18: Return: Configuration s.sup.* with cost.sub.min.
(52)
(53)
(54) An advantage of the G-BFS method is that it is a light-weighted algorithm where the priority queue ranks the computation time of all stored states and each time pops the configuration candidate with the lowest computation time. The guidance from the modified greedy best first search model is able to reduce the ratio of the tested parameter candidates from the entire GEMM configuration space and to decrease the probability of testing parameter candidates that require a large computation time. The G-BFS method for parameter tuning thus may significantly reduce the searching time and the ratio of visited configuration candidates for the tuner.
(55) N-A2C Method
(56) As the G-BFS method explores only one step from the considered state for each iteration, its performance may be affected when the cost from similar states exhibits large random noise. In the N-A2C method, on the other hand, as shown in
(57) The N-A2C method is summarized in Algorithm 2 below. As illustrated, the method starts with the configuration search model described above. The system initializes a random or hand-crafted starting state so, a fixed-size memory buffer M to record the latest search information, and an empty hashtable Ho to record all visited states with the associated cost. For the A2C model, both actor and critic initialize their neural networks with random initial weights. For each episode, from the same starting point, the agent 600 explores T continuous steps by taking an N-step action-guided walk at 602. For each step, the agent follows the c-greedy algorithm, where with probability of c, the agent takes random action a guided by the policy π(s) generated by the actor's neural network. With a probability of 1−∈, the agent chooses a random action a from the current state and takes an action following the policy output from the neural network of the actor. Based on the current state s and action a, the next state s′ is obtained from Eqn. (7). If the next state s′ has not been visited before, s′ is added into collected candidate set β.sub.collect. The step is stored as (state, action, reward, next_state). The state with the minimum running time in this episode is chosen as the initial state from the next episode. The process iterates at 604 until the number of collected states reaches the predefined threshold, and the hardware executes the GEMM computation code generated with the collected configuration candidates. The hashmap H.sub.v and memory buffer M are then updated with the configuration candidates and the associated running time at 606. The exploration data stored in M is used to incrementally-train the neural networks of the critic and actor of A2C at 608. The training step is based on the predicted performance (reward) of the critic. In sample embodiments, the training may be transferred to another matrix region or algorithm as appropriate.
(58) TABLE-US-00003 Algorithm 2: N-A2C Method 1: Initialization: s.sub.0, M, H.sub.v, cost.sub.min 2: for each episode do 3: while len(β.sub.collect) < len(β.sub.test) do 4: s = s.sub.0; 5: for each step until T steps do 6: if rand( ) < ϵ then 7: a follows π(s); 8: else 9: a is randomly selected from A; 10: end if 11: s' = step(s, a); 12: if s' not in H.sub.v then 13: Add s' in β.sub.collect; 14: end if 15: s = s'; 16: end for 17: end while 18: for s' in β.sub.collect do 19: if cost.sub.min > cost(s') then 20: cost.sub.min = cost(s'); 21: s.sup.* = s'; 22: s.sub.0 = s.sup.*; 23: end if 24: H.sub.v[s'] = cost(s'); 25: Store (s, a, r(s, a), s') to M, where ∀.sub.s, ∀.sub.a satisfy step(s,a) = s'; 26: Train actor's and critic's neural networks with M; 27: end for 28: if t.sub.search ≥ T.sub.max or Configs.sub.searched ≥ ρ.Configs.sub.total do 29: Return: Configuration s.sup.* with cost.sub.min 30: end if 31: end for
(59) Generally, the N-A2C method may efficiently search the optimal GEMM configuration with fixed exploration steps in each episode. Nevertheless, heuristics can be applied. Just like learning-rate decay in training deep neural networks, the exploration step T can have a decay process, i.e., starting with a large value and gradually reducing to a small number. In addition, the exploration step T can also increase to explore new configuration neighborhoods.
(60)
(61)
(62) The guidance from the N-A2C reinforcement learning model is also able to reduce the ratio of the tested parameter candidates from the entire GEMM configuration space and to decrease the probability of testing parameter candidates that require a large computation time. The N-A2C method for parameter tuning thus also may significantly reduce the searching time and the ratio of visited configuration candidates for the tuner.
(63) Experimental Results
(64) The performance of the GEMM configuration tuning approaches have been evaluated on an Nvidia Titan Xp GPU in the TVM framework. Following similar settings in TVM for GPUs, the number of nested loops for each dimension is set as d.sub.m=4, d.sub.k=2, d.sub.n=4. The random selection parameter ρ=5 is set for the G-BFS method, and the maximum number of search steps T=3 is set for the N-A2C method. Without losing generality, the initial state is set for both methods as s.sub.0=[[m, 1, 1, 1], [k, 1], [n, 1, 1, 1]], which represents the configuration without multi-level matrix tiling. The performance of the configuration tuning methods can be further improved by setting the initial state to a more meaningful configuration. In order to investigate the performance of the configuration search methods, the methods are compared with state-of-the-art algorithms including the XGBoost method in the TVM framework and the general configuration optimization method using a RNN controller by Google researchers. Unless otherwise specified, the computation time for each configuration is the arithmetic mean for 10 repeated trials on the tested GPU hardware.
(65) Without losing their applicability to deep learning, the four approaches were evaluated on a perceptron network, which is the fundamental building block for state-of-the-art neural network architectures while mainly including GEMM for computation. The computation in the perceptron network is denoted as Y=W.sup.TX, where W∈R.sup.(k,m), X∈R.sup.(k,n), and Y∈R.sup.(m,n); n is the batch size, k is the input dimension, and m is the output dimension. Accordingly, the size of perceptron network is denoted in the format of (m, k, n), which corresponds to matrices A(m×k) and B(k×n) for GEMM.
(66) In
(67)
(68)
(69) A Greedy Best First Search (G-BFS) method and a Neighborhood Actor Advantage Critic (N-A2C) method have been described for compiler-level GEMM optimization that take advantage of the performance of neighborhood configurations. The G-BFS method outperforms the XGBoost and RNN methods consistently, and the N-A2C method performs even better for large matrices. Empirical results show that both methods achieve significant performance improvement over state-of-the-art configuration tuning methods such as those using XGBoost and the RNN controller. Both methods are general in the sense that they are applicable to other compiler-level tuning tasks and can be used for optimization of other tensor operators with a large configuration space.
(70) Also, the G-BFS method and the N-A2C method provide an efficient balancing between exploration and exploitation. While the prior art XGBoost system always starts the random walk for exploration at a fixed or random point with a totally random walk over a large exploration space, the G-BFS method and the N-A2C method each consider a parameter candidate where the exploration is limited to its neighbors. The number of random selections of the neighbors affects the speed to find an optimal configuration parameter or parameters. Since the value of the random selection number is small, it is suitable for a dataset with many local optimal solutions where the method is able to quickly bypass the local optimal solution and find the optimal solution. Also, a large value of the random selection number is suitable for a convex dataset where the closest path to the optimal solution may be found quickly.
(71) Neural Networks
(72) In some of the above steps, neural networks are established for an actor and for a critic in the N-A2C method with random initial weights. The following paragraphs provide an overview of the use and operation of such neural networks.
(73) Artificial intelligence (AI) is a field concerned with developing decision-making systems to perform cognitive tasks that have traditionally required a living actor, such as a person. Artificial neural networks (ANNs) are computational structures that are loosely modeled on biological neurons. Generally, ANNs encode information (e.g., data or decision making) via weighted connections (e.g., synapses) between nodes (e.g., neurons). Modern ANNs are foundational to many AI applications, such as automated perception (e.g., computer vision, speech recognition, contextual awareness, etc.), automated cognition (e.g., decision-making, logistics, routing, supply chain optimization, etc.), automated control (e.g., autonomous cars, drones, robots, etc.), among others.
(74) Many ANNs are represented as matrices of weights that correspond to the modeled connections. ANNs operate by accepting data into a set of input neurons that often have many outgoing connections to other neurons. At each traversal between neurons, the corresponding weight modifies the input and is tested against a threshold at the destination neuron. If the weighted value exceeds the threshold, the value is again weighted, or transformed through a nonlinear function, and transmitted to another neuron further down the ANN graph-if the threshold is not exceeded then, generally, the value is not transmitted to a down-graph neuron and the synaptic connection remains inactive. The process of weighting and testing continues until an output neuron is reached; the pattern and values of the output neurons constituting the result of the ANN processing.
(75) The correct operation of most ANNs relies on correct weights. However, ANN designers do not generally know which weights will work for a given application. Instead, a training process is used to arrive at appropriate weights. ANN designers typically choose a number of neuron layers or specific connections between layers including circular connection, but the ANN designer does not generally know which weights will work for a given application. Instead, a training process generally proceeds by selecting initial weights, which may be randomly selected. Training data is fed into the ANN and results are compared to an objective function that provides an indication of error. The error indication is a measure of how wrong the ANN's result was, compared to an expected result. This error is then used to correct the weights. Over many iterations, the weights will collectively converge to encode the operational data into the ANN. This process may be called an optimization of the objective function (e.g., a cost or loss function), whereby the cost or loss is minimized.
(76) A gradient descent technique is often used to perform the objective function optimization. A gradient (e.g., partial derivative) is computed with respect to layer parameters (e.g., aspects of the weight) to provide a direction, and possibly a degree, of correction, but does not result in a single correction to set the weight to a “correct” value. That is, via several iterations, the weight will move towards the “correct,” or operationally useful, value. In some implementations, the amount, or step size, of movement is fixed (e.g., the same from iteration to iteration). Small step sizes tend to take a long time to converge, whereas large step sizes may oscillate around the correct value or exhibit other undesirable behavior. Variable step sizes may be attempted to provide faster convergence without the downsides of large step sizes.
(77) Backpropagation is a technique whereby training data is fed forward through the ANN-here “forward” means that the data starts at the input neurons and follows the directed graph of neuron connections until the output neurons are reached—and the objective function is applied backwards through the ANN to correct the synapse weights. At each step in the backpropagation process, the result of the previous step is used to correct a weight. Thus, the result of the output neuron correction is applied to a neuron that connects to the output neuron, and so forth until the input neurons are reached. Backpropagation has become a popular technique to train a variety of ANNs. Any well-known optimization algorithm for back propagation may be used, such as SGD, Adam, etc.
(78)
(79) The processing node 1010 may be a CPU, GPU, field programmable gate array (FPGA), digital signal processor (DSP), application specific integrated circuit (ASIC), or other processing circuitry. In an example, multiple processing nodes may be employed to train different layers of the ANN 1000, or even different nodes 1020 within layers. Thus, a set of processing nodes 1010 may be arranged to perform the training of the ANN 1000.
(80) The set of processing nodes 1010 is arranged to receive a training set 1030 for the ANN 1000. The ANN 1000 comprises a set of nodes 1020 arranged in layers (illustrated as rows of nodes 1020) and a set of inter-node weights 1040 (e.g., parameters) between nodes in the set of nodes 1020. In an example, the training set 1030 is a subset of a complete training set. Here, the subset may enable processing nodes 1020 with limited storage resources to participate in training the ANN 1000.
(81) The training data may include multiple numerical values representative of a domain, such as red, green, and blue pixel values and intensity values for an image or pitch and volume values at discrete times for speech recognition. Each value of the training, or input 1050 to be classified once ANN 1000 is trained, is provided to a corresponding node 1020 in the first layer or input layer of ANN 1000. The values propagate through the layers and are changed by the objective function.
(82) As noted above, the set of processing nodes 1020 is arranged to train the neural network to create a trained neural network. Once trained, data input into the ANN 1000 will produce valid classifications 1060 (e.g., the input data 1050 will be assigned into categories), for example. The training performed by the set of processing nodes 1020 is iterative. In an example, each iteration of the training of the neural network is performed independently between layers of the ANN 1000. Thus, two distinct layers may be processed in parallel by different members of the set of processing nodes 1010. In an example, different layers of the ANN 1000 are trained on different hardware.
(83) ANN 1000 may calculate one or more neuron or synapse weights 1040 for criteria based upon one or more machine learning algorithms. During training, historical action information representing past actions may be labeled with an indication of whether the decision made was ultimately successful, in this case, the reward. Thus, the reward, which is based on both robot navigation and the ability to track the object, is used to update the network weights 1040. Note that in various networks, initial weights may be pre-set. In other networks, initial weights may be randomized. In one embodiment, a module or processor executing computer instructions to effectuate the neural network learning operations modifies a source neuron's output with a synapse weight to determine the contribution of the source neuron to cause the sink neuron to fire. Practically, in this embodiment, a single and modified value is integrated at the sink neuron in response to the source neuron activation.
(84) Computer Architecture
(85)
(86) One example computing device in the form of a computer 1100 may include a processing unit 1102, memory 1103, removable storage 1110, and non-removable storage 1112. Although the example computing device is illustrated and described as computer 1100, the computing device may be in different forms in different embodiments. For example, the computing device may instead be a smartphone, a tablet, smartwatch, or other computing device including the same or similar elements as illustrated and described with regard to
(87) Memory 1103 may include volatile memory 1114 and non-volatile memory 1108. Computer 1100 may include—or have access to a computing environment that includes—a variety of computer-readable media, such as volatile memory 1114 and non-volatile memory 1108, removable storage 1110 and non-removable storage 1112. Computer storage includes random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM) or electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD ROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
(88) Computer 1100 may include or have access to a computing environment that includes input interface 1106, output interface 1104, and a communication interface 1116. Output interface 1104 may include a display device, such as a touchscreen, that also may serve as an input device. The input interface 1106 may include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the computer 1100, and other input devices.
(89) The computer may operate in a networked environment using a communication connection to connect to one or more remote computers, such as database servers. The remote computer may include a personal computer (PC), server, router, network PC, a peer device or other common DFD network switch, or the like. The communication connection may include a Local Area Network (LAN), a Wide Area Network (WAN), cellular, Wi-Fi, Bluetooth, or other networks. According to one embodiment, the various components of computer 1100 are connected with a system bus 1120.
(90) Computer-readable instructions stored on a computer-readable medium are executable by the processing unit 1102 of the computer 1100, such as a program 1118. The program 1118 in some embodiments comprises software that, when executed by the processing unit 1102, performs operations according to any of the embodiments included herein. A hard drive, CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium such as a storage device. The terms computer-readable medium and storage device do not include carrier waves to the extent carrier waves are deemed too transitory. Storage can also include networked storage, such as a storage area network (SAN). Computer program 1118 may be used to cause processing unit 1102 to perform one or more methods or algorithms such as the G-BFS and N-A2C algorithms described herein.
(91) In an example embodiment, the computing device 1100 is configured to optimize general matrix multiplication (GEMM) on target hardware. The computing device 1100 includes at least one processor 1102 and a machine-readable medium 1103 coupled to the at least one processor 1102. The machine-readable medium includes instructions 1118 thereon that, when executed by the at least one processor 1102, causes the at least one processor 1102 to implement a tiling module, an optimizer module, and an execution module. The tiling module includes software that splits matrices to be multiplied into respective tiles and the optimizer module includes software that formulates an optimal tiling configuration for the matrices to be multiplied that collectively achieves a lowest running time for multiplication of the matrices on the target hardware into the following optimization problem,
(92)
where s denotes a configuration state, cost denotes a running time for the configuration state s, m, k, and n are matrix parameters for multiplication of a first matrix A (m×k) by a second matrix B (k×n), to produce a matrix C (m×n), where m is a number of rows of a first matrix, k is number of columns of the first matrix and number of rows of a second matrix, and n is the number of columns of the second matrix, and d.sub.m, d.sub.k, d.sub.n are numbers of respective nested loops for each dimension m, k, and n, respectively. The optimal tiling configuration for the matrices to be multiplied is the tiling configuration that optimizes the running time for multiplication of the matrices on the target hardware. The computer 1100 further implements an execution module that runs a computation on the target hardware based on the optimal tiling configuration. In some embodiments, the computer 1100 may include other or additional modules for performing any one of or combination of steps described in the embodiments. Further, any of the additional or alternative embodiments or aspects of the method, as shown in any of the figures or recited in any of the claims, are also contemplated to include similar modules.
(93) Although a few embodiments have been described in detail above, other modifications are possible. For example, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Other embodiments may be within the scope of the following claims.