DEEP LEARNING ACCELERATOR MODELS AND HARDWARE
20220358349 · 2022-11-10
CPC Classification
G06F15/80 (PHYSICS)
G06F9/5027 (PHYSICS)
G06F9/5038 (PHYSICS)
International Classification
G06F15/80 (PHYSICS)
Abstract
A first deep learning accelerator (DLA) model can be executed using a first subset of a plurality of DLA cores of a DLA chip. A second DLA model can be executed using a second subset of the plurality of DLA cores of the DLA chip. The first subset can include a first quantity of the plurality of DLA cores. The second subset can include a second quantity of the plurality of DLA cores that is different than the first quantity of the plurality of DLA cores.
Claims
1. A method, comprising: executing a first deep learning accelerator (DLA) model using a first subset of a plurality of DLA cores of a DLA chip; and executing a second DLA model using a second subset of the plurality of DLA cores of the DLA chip, wherein the first subset comprises a first quantity of the plurality of DLA cores and the second subset comprises a second quantity of the plurality of DLA cores that is different than the first quantity of the plurality of DLA cores.
2. The method of claim 1, further comprising assigning the first quantity of the plurality of DLA cores to the first subset of the DLA cores based at least in part on a first computational capability of the first DLA model.
3. The method of claim 2, further comprising assigning the second quantity of the plurality of DLA cores to the second subset of the DLA cores based at least in part on a second computational capability of the second DLA model, wherein the second computational capability is greater than the first computational capability.
4. The method of claim 3, wherein assigning the second quantity of the plurality of DLA cores comprises assigning a greater quantity of the plurality of DLA cores to the second subset of the plurality of DLA cores than the first quantity of the plurality of DLA cores assigned to the first subset of the plurality of DLA cores.
5. The method of claim 3, further comprising assigning less than all of the plurality of DLA cores to a respective subset of the DLA cores.
6. The method of claim 3, further comprising assigning the first quantity and the second quantity of the plurality of DLA cores without regard to a total quantity of the plurality of DLA cores.
7. The method of claim 1, further comprising executing the first DLA model using the first subset of the plurality of DLA cores and the second DLA model using the second subset of the plurality of DLA cores at least partially concurrently.
8. The method of claim 1, further comprising: executing a third DLA model using a third subset of the plurality of DLA cores of the DLA chip, wherein the third subset comprises a third quantity of the plurality of DLA cores that is different than the first and second quantities of the plurality of DLA cores; and assigning the third quantity of the plurality of DLA cores to the third subset of the DLA cores based at least in part on a third computational capability of the third DLA model, wherein the third computational capability is different than the first and second computational capabilities.
9. An apparatus, comprising: a physical deep learning accelerator (DLA) chip comprising a plurality of DLA cores; and a compiler coupled to the physical DLA chip and configured to: assign a number of DLA cores of the physical DLA chip to a virtual DLA chip; and cause the number of DLA cores to execute a DLA model having a computational capability that is less than a cumulative computational capability of the plurality of DLA cores.
10. The apparatus of claim 9, wherein the compiler is further configured to assign the number of DLA cores to the virtual DLA chip based at least in part on a size of a computational layer of the DLA model.
11. The apparatus of claim 9, wherein the compiler is further configured to: assign a different number of DLA cores of the physical DLA chip to a different virtual DLA chip, and cause the different number of DLA cores of the different virtual DLA chip to execute a different DLA model.
12. The apparatus of claim 11, wherein the compiler is further configured to: assign the number of DLA cores to the virtual DLA chip based at least in part on a size of a computational layer of the DLA model; and assign the different number of DLA cores to the different virtual DLA chip based at least in part on a size of a computational layer of the different DLA model, wherein the size of the computational layer of the DLA model is different than the size of the computational layer of the different DLA model.
13. The apparatus of claim 11, wherein the compiler is further configured to: assign the number of DLA cores to the virtual DLA chip based at least in part on a computational capability of the DLA model; and assign the different number of DLA cores to the different virtual DLA chip based at least in part on a computational capability of the different DLA model, wherein the computational capability of the DLA model is different than the computational capability of the different DLA model.
14. The apparatus of claim 11, wherein the compiler is further configured to assign the number of DLA cores to the virtual DLA chip based at least in part on signaling indicative of a user-defined quantity of DLA cores to assign to the virtual DLA chip.
15. The apparatus of claim 11, wherein the compiler is further configured to assign the number of DLA cores to the virtual DLA chip based at least in part on signaling indicative of a user-defined subset of the plurality of DLA cores of the physical DLA chip to assign to the virtual DLA chip.
16. A non-transitory machine-readable medium storing instructions executable by a processing resource to: assign a first quantity of a plurality of deep learning accelerator (DLA) cores of a physical DLA chip to a first virtual DLA chip based at least in part on a first processing requirement of a first DLA model; assign a second quantity of the plurality of DLA cores of the physical DLA chip to a second virtual DLA chip based at least in part on a second processing requirement of a second DLA model; execute the first DLA model using the first virtual DLA chip; and execute the second DLA model using the second virtual DLA chip.
17. The medium of claim 16, further storing instructions to: assign a greater quantity of the plurality of DLA cores to the first virtual DLA chip than to the second virtual DLA chip in response to the first processing requirement being greater than the second processing requirement; and assign a lesser quantity of the plurality of DLA cores to the first virtual DLA chip than to the second virtual DLA chip in response to the second processing requirement being greater than the first processing requirement.
18. The medium of claim 16, further storing instructions to: responsive to instructions to execute a third DLA model, assign a third quantity of the plurality of DLA cores to a third virtual DLA chip based at least in part on a third processing requirement of the third DLA model, wherein the third processing requirement is different than the first and second processing requirements; and execute the third DLA model using the third virtual DLA chip.
19. The medium of claim 18, further storing instructions to: responsive to subsequent instructions to execute the first DLA model, assign the first quantity of the plurality of DLA cores to the first virtual DLA chip; and execute the first DLA model using the first virtual DLA chip having the first quantity of the plurality of DLA cores assigned thereto.
20. The medium of claim 18, further storing instructions to: responsive to instructions to execute a fourth DLA model, assign a fourth quantity of the plurality of DLA cores to a fourth virtual DLA chip based at least in part on a fourth processing requirement of the fourth DLA model, wherein the fourth processing requirement is different than the third processing requirement; and execute the fourth DLA model using the fourth virtual DLA chip.
21. A non-transitory machine-readable medium storing instructions executable by a processing resource to: determine whether execution of a computational layer of a first deep learning accelerator (DLA) model on representative data, using a first virtual DLA chip, yields results having at least a threshold confidence value, wherein the first virtual DLA chip comprises a first plurality of DLA cores of a physical DLA chip; responsive to determining that execution of the computational layer of the first DLA model yields results having less than the threshold confidence value, execute a second DLA model, using a second virtual DLA chip, on results from execution of the computational layer of the first DLA model, wherein the second virtual DLA chip comprises a second plurality of DLA cores of the physical DLA chip that is greater in quantity than the first plurality of DLA cores; and responsive to determining that execution of a respective last computational layer of the first DLA model yields results having less than the threshold confidence value: assign an additional DLA core of the physical DLA chip to the first virtual DLA chip; and execute the first DLA model on data received by the physical DLA chip using the first virtual DLA chip including the additional DLA core.
22. The medium of claim 21, further storing instructions to determine whether execution of the computational layer of the first DLA model on the representative data yields results having at least the threshold confidence value at a compile time.
23. The medium of claim 21, further storing instructions to: determine whether the execution of the second DLA model provides at least a threshold quantity of correct inferences per second per watt; and responsive to determining that execution of the second DLA model yields results having less than the threshold quantity of correct inferences per second per watt: assign another additional DLA core of the physical DLA chip to the second virtual DLA chip; and execute the second DLA model on the data received by the physical DLA chip using the second virtual DLA chip including the other additional DLA core.
24. A method, comprising: determining which computational layers of a first deep learning accelerator (DLA) model to execute on data received by a physical DLA chip subsequent to compile time by: executing, at the compile time and using a first virtual DLA chip, a first number of computational layers of the first DLA model on representative data; executing a second DLA model, using a second virtual DLA chip, on results from execution of the first number of computational layers of the first DLA model on the representative data, wherein the first virtual DLA chip comprises a different quantity of DLA cores of the physical DLA chip than the second virtual DLA chip; and determining whether results from execution of the second DLA model on results from execution of the first number of computational layers of the first DLA model have a confidence value that is at least a threshold confidence value.
25. The method of claim 24, further comprising, responsive to determining that the results from execution of the second DLA model on the results from execution of the first number of computational layers of the first DLA model have a confidence value that is at least the threshold confidence value: executing, subsequent to the compile time and using the first virtual DLA chip, the first number of computational layers of the first DLA model on data received by the physical DLA chip; and executing the second DLA model on results from execution of the first number of computational layers of the first DLA model.
26. The method of claim 25, further comprising, responsive to determining that the results from execution of the second DLA model on the results from execution of the first number of computational layers of the first DLA model have a confidence value that is less than the threshold confidence value: executing, using the first virtual DLA chip, a second number of computational layers of the first DLA model on the representative data, wherein the second number of computational layers includes an additional computational layer of the first DLA model or excludes a computational layer of the first number of computational layers; executing, using the second virtual DLA chip, the second DLA model on results from execution of the second number of computational layers of the first DLA model on the representative data; and determining whether results from execution of the second DLA model on results from execution of the second number of computational layers of the first DLA model have a confidence value that is at least the threshold confidence value.
27. The method of claim 26, further comprising, responsive to determining that the results from execution of the second DLA model on the results from execution of the second number of computational layers of the first DLA model have a confidence value that is at least the threshold confidence value: executing, using the first virtual DLA chip, the second number of computational layers of the first DLA model on data received by the physical DLA chip subsequent to the compile time.
28. The method of claim 26, further comprising, responsive to determining that the results from execution of the second DLA model on the results from execution of the first number of computational layers of the first DLA model have a confidence value that is less than the threshold confidence value: executing a number of computational layers of the second DLA model, using the second virtual DLA chip, on the results from execution of the second number of computational layers of the first DLA model on the representative data, wherein the number of computational layers includes an additional computational layer of the second DLA model or excludes a computational layer of the second DLA model executed on the results from execution of the first number of computational layers of the first DLA model.
Description
DETAILED DESCRIPTION
[0011] The present disclosure includes apparatuses and methods related to executing deep learning accelerator (DLA) models using subsets of DLA cores of a DLA chip. Artificial intelligence (AI) can be employed on devices and/or systems that have a limited power supply. As used herein, artificial intelligence refers to the ability to improve a machine through “learning” such as by storing patterns and/or examples which can be utilized to take actions at a later time. Deep learning refers to a device's ability to learn from data provided as examples. Deep learning can be a subset of artificial intelligence. Artificial neural networks, among other types of networks, can be classified as deep learning.
[0012] Non-limiting examples of AI applications include deep-learning edge applications such as object detection, classification, tracking, and navigation. Deep-learning edge applications can be deployed on unmanned vehicles (e.g., drones) that are dependent on battery-based power supplies. How deep-learning edge applications are deployed on and/or utilized with such power-constrained devices and/or systems is contingent on efficient energy utilization by those applications. Deep-learning edge applications may be executed by DLAs. However, DLAs typically are not re-configurable and cannot be partitioned: during manufacturing, a DLA is produced to meet the requirements of a workload, but it cannot be adapted to changes in the workload post-manufacturing. Some previous approaches to improving the energy efficiency of DLAs include using DLA application specific integrated circuits (ASICs).
[0013] Multiple DLA models (e.g., deep learning models) of the same type (e.g., MobileNet, ResNet, VGG19, etc.) may be executed to perform detection and/or classification for a given deep learning task. Each of the DLA models may be deployed on a respective DLA ASIC. The DLA ASICs may have different computational capabilities and/or processing requirements corresponding to computational capabilities and/or processing requirements of the DLA models. As used herein, “computational capability” refers to capability to perform computations whereas “processing requirements” refer to requirements to perform computations. To modify the computational capabilities and/or processing requirements of such a DLA package post-manufacturing would require an ability to modify the hardware of the DLA ASICs.
[0014] As used herein, “execution of a DLA model on data” refers to performance of calculations on the data using a DLA chip according to parameters of the DLA model. A DLA model can have parameters (be configured) such that execution of the DLA model yields results of at least a particular confidence value (e.g., accuracy value). As used herein, “accuracy of results yielded from execution of a DLA model” refers to a quantity of correct predictions made by the DLA model relative to a quantity of total predictions made by the DLA model. Confidence in particular results yielded from execution of a DLA model can be referred to as, and expressed as, an accuracy value. Examples of parameters of a DLA model include, but are not limited to, a maximum quantity of multiply-accumulate circuits (MACs) of a DLA to be used during execution of the DLA model. Other non-limiting examples of parameters of a DLA model can be a maximum quantity of iterations of computations during execution of the DLA model and a maximum quantity of computations to be performed during execution of the DLA model. In at least one embodiment, execution of a DLA model implemented on a DLA can include utilization of at most a particular quantity (e.g., a subset) of MACs implemented on the DLA. Such parameters of a DLA model can limit the computational capability of the DLA model, which, in turn, can limit the accuracy of results yielded from execution of the DLA model. However, what may be lost in computational capability can be gained in reduced resource consumption.
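By way of illustration only, the following Python sketch shows one way such model parameters might be represented; the class and field names (e.g., max_macs) are hypothetical assumptions for illustration and are not part of this disclosure.

    from dataclasses import dataclass

    @dataclass
    class DLAModelParams:
        """Hypothetical parameters bounding a DLA model's computational capability."""
        max_macs: int          # maximum multiply-accumulate circuits (MACs) used per execution
        max_iterations: int    # maximum iterations of computations per execution
        max_computations: int  # maximum total computations per execution

    # A tightly bounded model yields lower-accuracy results but consumes
    # fewer resources than a less tightly bounded model.
    little_model = DLAModelParams(max_macs=64, max_iterations=2, max_computations=10_000)
    big_model = DLAModelParams(max_macs=512, max_iterations=8, max_computations=1_000_000)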
[0015] A DLA model configured to yield high-accuracy results can utilize a greater quantity of MACs, perform a greater quantity of iterations of computations, and/or perform a greater quantity of computations during execution of the DLA model than a different DLA model configured to yield low-accuracy results. Execution of a DLA model configured to yield high-accuracy results can consume more resources than execution of a DLA model configured to yield low-accuracy results. For example, execution of a DLA model configured to yield high-accuracy results can have greater power requirements (greater power consumption) than execution of a DLA model configured to yield results of a lower accuracy.
[0016] In some previous approaches, DLA models configured to yield high-accuracy results may be executed in situations that do not require high-accuracy results. Thus, some previous approaches may expend more power executing a DLA model having high computational capability when executing a DLA model having low computational capability yields sufficiently accurate results. Executing a DLA model having high computational capability in such circumstances expends excess power relative to executing a DLA model having low computational capability. In low-power devices, such as Internet-of-Things (IoT) devices, reducing excess power expenditures is important.
[0017] Aspects of the present disclosure address the above and other deficiencies. For instance, execution of various DLA models can be assigned to subsets of DLA cores of a DLA chip. The quantity of DLA cores assigned to execute a DLA model can be based on the computational capability and/or processing requirements of the DLA model. Some embodiments of the present disclosure provide post-manufacturing flexibility not available in previous approaches. For example, the quantity of DLA cores of a DLA chip assigned to a subset and/or the quantity of subsets can be modified in response to modification of workloads and/or DLA models. Subsets of DLA cores can be configured on-demand (“on-the-fly”) at any time. An advantage of some embodiments described herein is an ability for on-demand workload aware compute deployment, utilization, and/or management. Computational capability of a DLA chip can be available on-demand and is scalable to satisfy changing requirements of deep-learning edge applications.
[0018] As used herein, the singular forms “a”, “an”, and “the” include singular and plural referents unless the content clearly dictates otherwise. Furthermore, the word “may” is used throughout this application in a permissive sense (i.e., having the potential to, being able to), not in a mandatory sense (i.e., must). The term “include,” and derivations thereof, mean “including, but not limited to.” The term “coupled” means directly or indirectly connected.
[0019] The figures herein follow a numbering convention in which the first digit or digits correspond to the drawing figure number and the remaining digits identify an element or component in the drawing. Similar elements or components between different figures may be identified by the use of similar digits. For example, 230 may reference element “30” in FIG. 2.
[0021] The memory device 104 and the host 102 can be included in a satellite, a communications tower, a personal laptop computer, a desktop computer, a digital camera, a mobile telephone, a memory card reader, an IoT-enabled device, or an automobile, among various other types of systems. For clarity, the system 100 has been simplified to focus on features with particular relevance to the present disclosure. The host 102 can include a number of processing resources (e.g., one or more processors, microprocessors, or some other type of controlling circuitry) capable of accessing the memory device 104.
[0022] The memory device 104 can provide main memory for the host 102 or can be used as additional memory or storage for the host 102. By way of example, the memory device 104 can be a dual in-line memory module (DIMM) including memory arrays 110 operated as double data rate (DDR) DRAM, such as DDR5, a graphics DDR DRAM, such as GDDR6, or another type of memory system. Embodiments are not limited to a particular type of memory device 104. Other examples of memory arrays 110 include RAM, ROM, SDRAM, LPDRAM, PCRAM, RRAM, flash memory, and three-dimensional cross-point, among others. A cross-point array of non-volatile memory can perform bit storage based on a change of bulk resistance, in conjunction with a stackable cross-gridded data access array. Additionally, in contrast to many flash-based memories, cross-point non-volatile memory can perform a write in-place operation, where a non-volatile memory cell can be programmed without the non-volatile memory cell being previously erased.
[0023] The control circuitry 106 can decode signals provided by the host 102. The control circuitry 106 can also be referred to as a command input and control circuit and can represent the functionality of different discrete ASICs or portions of different ASICs depending on the implementation. The signals can be commands provided by the host 102. These signals can include chip enable signals, write enable signals, and address latch signals, among others, that are used to control operations performed on the memory array 110. Such operations can include data read operations, data write operations, data erase operations, data move operations, etc. The control circuitry 106 can comprise a state machine, a sequencer, and/or some other type of control circuitry, which may be implemented in the form of hardware, firmware, or software, or any combination of the three.
[0024] Data can be provided to and/or from the memory array 110 via data lines coupling the memory array 110 to input/output (I/O) circuitry 122 via read/write circuitry 114. The I/O circuitry 122 can be used for bi-directional data communication with the host 102 over an interface. The read/write circuitry 114 is used to write data to the memory array 110 or read data from the memory array 110. As an example, the read/write circuitry 114 can comprise various drivers, latch circuitry, etc. In some embodiments, the data path can bypass the control circuitry 106.
[0025] The memory device 104 includes address circuitry 120 to latch address signals provided over an interface. Address signals are received and decoded by a row decoder 118 and a column decoder 116 to access the memory array 110. Data can be read from memory array 110 by sensing voltage and/or current changes on the sense lines using sensing circuitry 112. The sensing circuitry 112 can be coupled to the memory array 110. The sensing circuitry 112 can comprise, for example, sense amplifiers that can read and latch a page (e.g., row) of data from the memory array 110. Sensing (e.g., reading) a bit stored in a memory cell can involve sensing a relatively small voltage difference on a pair of sense lines, which may be referred to as digit lines or data lines.
[0026] The memory array 110 can comprise memory cells arranged in rows coupled by access lines (which may be referred to herein as word lines or select lines) and columns coupled by sense lines (which may be referred to herein as digit lines or data lines). Although the memory array 110 is shown as a single memory array, the memory array 110 can represent a plurality of memory arrays arranged in banks of the memory device 104. The memory array 110 can include a number of memory cells, such as volatile memory cells (e.g., DRAM memory cells, among other types of volatile memory cells) and/or non-volatile memory cells (e.g., RRAM memory cells, among other types of non-volatile memory cells).
[0027] The memory device 104 can include a DLA 130 (e.g., a DLA ASIC). Hereinafter, the DLA 130 can be referred to as a physical DLA chip. The DLA 130 can be implemented on or near an edge of the memory device 104, for example, at a periphery of the memory device 104, as illustrated by FIG. 1.
[0028] The DLA 130 can be coupled to the control circuitry 106. The control circuitry 106 can control the DLA 130. For example, the control circuitry 106 can provide signaling to the row decoder 118 and the column decoder 116 to cause the transferring of data from the memory array 110 to the DLA 130 to provide an input to the DLA 130. The control circuitry 106 can cause the output of the DLA 130 to be provided to the I/O circuitry 122 and/or be stored back to the memory array 110.
[0029] The DLA 130 can be controlled, by the control circuitry 106, for example, to execute an artificial neural network (ANN) 109. A DLA model is a non-limiting example of an ANN. The ANN 109 can include hardware and/or firmware to implement a DLA model for performing operations on data. In some embodiments, the memory device 104 can be configured to store an ANN (e.g., the ANN 109) and the DLA 130 can be used to supplement operation of the ANN for various functions. For example, the DLA 130 and ANN 109 can be used to identify an object in an image and/or changes in images. Data indicative of an image can be input to the DLA 130.
[0030] In some embodiments, a compiler 103 can be hosted by the host 102. As used herein, “compiler” refers to hardware and/or software that compiles instructions from a source device to cause an action at a destination device. For example, the compiler 103 can compile instructions from the host 102 to cause the DLA 130 to execute one or more DLA models in accordance with the instructions. As used herein, “a compiler being configured to X” and “a compiler being used to X” refer to the compiler compiling instructions to cause X.
[0031] As described herein, and particularly in association with the figures that follow, DLA cores of the DLA 130 can be assigned to one or more subsets (e.g., virtual DLA chips), and each subset can execute a respective DLA model.
[0032] The compiler 103 can be configured to assign a number of DLA cores of the DLA 130 to a subset of DLA cores (e.g., a virtual DLA chip) and cause the number of DLA cores to execute a DLA model having a computational capability that is less than a cumulative computational capability of the plurality of DLA cores. The compiler 103 can be configured to assign a different number of DLA cores of the physical DLA chip to a different virtual DLA chip and cause the different number of DLA cores of the different virtual DLA chip to execute a different DLA model. The compiler 103 can be configured to assign the number of DLA cores based on a size of a computational layer of the DLA model and assign the different number of DLA cores based on a size of a computational layer of the different DLA model. The size of the computational layer of the DLA model can be different than the size of the computational layer of the different DLA model. The compiler 103 can be configured to assign the number of DLA cores based on a computational capability of the DLA model and assign the different number of DLA cores based on a computational capability of the different DLA model. The computational capability of the DLA model can be different than the computational capability of the different DLA model. The compiler 103 can be configured to assign the number of DLA cores based on a user-defined quantity of DLA cores and/or a user-defined subset of the plurality of DLA cores of the DLA 130.
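As a minimal sketch of this assignment logic, and assuming a hypothetical heuristic that sizes a virtual DLA chip in proportion to the largest computational layer of the model (with an optional user-defined override), the compiler-side decision might resemble the following Python; cores_per_unit and the interfaces shown are assumptions for illustration.

    def assign_virtual_chip(total_cores, layer_sizes, cores_per_unit=1_000_000,
                            user_defined=None):
        """Return a quantity of DLA cores to assign to a virtual DLA chip.

        layer_sizes: per-layer computation counts of the DLA model (hypothetical).
        user_defined: optional user-specified quantity of cores, which takes
        precedence over the layer-size heuristic.
        """
        if user_defined is not None:
            return min(user_defined, total_cores)
        needed = -(-max(layer_sizes) // cores_per_unit)  # ceiling division
        return min(max(needed, 1), total_cores)

    # Example: a physical DLA chip with 14 cores hosting two models.
    little_chip_cores = assign_virtual_chip(14, [2_500_000, 1_200_000])  # -> 3
    big_chip_cores = assign_virtual_chip(14, [9_800_000, 7_000_000])     # -> 10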
[0033] The compiled instructions generated by the compiler 103 can be provided to the control circuitry 106 to cause the control circuitry 106 to execute the compiled instructions. Once the compiled instructions are stored in the memory array 110, the host 102 can provide commands to the memory device 104 to execute the compiled instructions utilizing the DLA 130. The compiled instructions can be executed by the DLA 130 to execute the ANN 109. The control circuitry 106 can cause the compiled instructions to be provided to the DLA 130. The control circuitry 106 can cause the DLA 130 to execute the compiled instructions. The control circuitry 106 can cause the output of the DLA 130 to be stored back to the memory array 110, to be returned to the host 102, and/or to be used to perform additional computations in the memory device 104.
[0034] The control circuitry 106 can also include assigning circuitry 108. In some embodiments, the assigning circuitry 108 can comprise an ASIC configured to assign DLA cores to one or more subsets of DLA cores as described herein. In some embodiments, the assigning circuitry 108 can represent functionality of the control circuitry 106 that is not embodied in separate discrete circuitry. The control circuitry 106 and/or the assigning circuitry 108 can be configured to assign execution of a DLA model to one or more DLA cores of a DLA chip (e.g., the DLA 130). The control circuitry 106 and/or the assigning circuitry 108 can be configured to assign execution of a first DLA model to a first subset of DLA cores and execution of a second DLA model to a second subset of DLA cores. The control circuitry 106 and/or the assigning circuitry 108 can be configured to assign a quantity of DLA cores to a subset based on processing requirements of a DLA model to be executed by the subset of DLA cores. The control circuitry 106 and/or the assigning circuitry 108 can be configured to receive user-defined subsets of DLA cores and/or user-defined quantities of DLA cores.
[0037] The virtual DLA chips 234 and 236 can each be configured to execute a DLA model having a computational capability that is less than a cumulative computational capability of the plurality of DLA cores 232. The virtual DLA chip 234 includes 4 of the 14 DLA cores 232 of the physical DLA chip 230, the DLA cores 232-1, 232-2, 232-8, and 232-9. The virtual DLA chip 236 includes 10 of the 14 DLA cores 232 of the physical DLA chip 230, the DLA cores 232-3, 232-4, 232-5, 232-6, 232-7, 232-10, 232-11, 232-12, 232-13, and 232-14. In this example, each of the 14 DLA cores 232 is thus assigned to one of the virtual DLA chips 234 and 236.
[0038] The virtual DLA chip 234 includes fewer of the DLA cores 232 than the virtual DLA chip 236. Thus, the virtual DLA chip 234 can be referred to as a “little” virtual DLA chip and the virtual DLA chip 236 can be referred to as a “big” virtual DLA chip. The virtual DLA chip 234 can execute a DLA model having lower computational capability and/or processing requirements whereas the virtual DLA chip 236 can execute a DLA model having higher computational capability and/or processing requirements. In some embodiments, the virtual DLA chips 234 and 236 can execute respective DLA models concurrently. For example, the virtual DLA chip 234 can execute a DLA model concurrently with execution of another DLA model, by the virtual DLA chip 236, having higher computational capability and/or processing requirements. In some embodiments, multiple virtual DLA chips can execute the same DLA model and/or DLA models having similar computational capabilities and/or processing requirements concurrently.
[0039] In some embodiments, the virtual DLA chip 234 can execute computational layers of a DLA model until the confidence of the results falls below a threshold. In response to the confidence of the results falling below the threshold, another DLA model having higher computational capability and/or processing requirements can be executed by the virtual DLA chip 236 using the results from the virtual DLA chip 234 as input. Such embodiments provide energy savings by not executing a DLA model using the “big” virtual DLA chip 236, which consumes more energy than the “little” virtual DLA chip 234, until the “big” virtual DLA chip 236 is needed.
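A minimal Python sketch of this early-exit flow, assuming hypothetical model interfaces (each little-model layer returns its results along with a confidence value) and an assumed threshold value:

    CONFIDENCE_THRESHOLD = 0.9  # assumed value; the threshold is implementation-defined

    def run_with_early_exit(little_model_layers, big_model_run, data):
        """Execute the 'little' virtual DLA chip's model layer by layer and
        hand off to the 'big' virtual DLA chip only when confidence drops."""
        results = data
        for layer in little_model_layers:
            results, confidence = layer(results)
            if confidence < CONFIDENCE_THRESHOLD:
                # Early exit: the big virtual DLA chip executes its model on
                # the little chip's intermediate results.
                return big_model_run(results)
        return results  # the little chip sufficed; the big chip never ran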
[0042] The quantity of member DLA cores of a virtual DLA chip can be modified in response to changes to a DLA model to be executed by the virtual DLA chip.
[0043] In some embodiments, the quantity of member DLA cores of a virtual DLA chip can be user-defined. A user can provide input specifying respective quantities of DLA cores to assign to one or more virtual DLA chips. For example, a user can provide input that the virtual DLA chip 234 is to include 4 of the DLA cores 232 as described in the example above.
[0044] Although particular quantities of DLA cores 232 and virtual DLA chips are illustrated, embodiments are not so limited. A physical DLA chip can include more or fewer DLA cores, and more or fewer virtual DLA chips can be formed therefrom.
[0046] In some embodiments, if and when to switch execution of DLA models and corresponding virtual DLA chips can be determined at compile time based on data representative of data on which execution of the DLA models is anticipated (hereinafter referred to as representative data). Instead of determining if and when to switch execution of DLA models and corresponding virtual DLA chips reactively based on confidence of results from execution of the DLA models on data received after compile time, in some embodiments if and when to switch execution of DLA models and corresponding virtual DLA chips can be determined proactively based on execution of DLA models on representative data. Results from execution of the DLA models on the representative data can be evaluated (e.g., confidence of results can be evaluated) to determine if and when to switch execution of DLA models and corresponding virtual DLA chips. A quantity of computational layers of a first DLA model to be executed prior to switching to execution of a second DLA model can be determined.
[0047] Subsequent to executing the first and second DLA models on the representative data, the determined quantity of computational layers of the first DLA model can be executed on data received by the physical DLA chip 330 before switching to execution of the second DLA model, regardless of confidence of results from execution of the first DLA model and/or the second DLA model on the data. Executing DLA models on representative data at compile time to determine if and when to exit early from execution of the first DLA model and/or the second DLA model can improve execution of the first DLA model and/or the second DLA model on data subsequent to compile time by eliminating evaluation of results from execution of the first DLA model and/or the second DLA model on data subsequent to compile time. Eliminating evaluation of results subsequent to compile time can decrease the amount of time between executions of computational layers of a DLA model and/or execution of a computational layer of a DLA model and execution of a computational layer of different DLA model.
[0049] The representative data 340 can be chosen based on expected data on which DLA models will be executed. In the example described here, two candidate sequences, an upper sequence and a lower sequence, are executed on the representative data 340 at compile time.
[0050] At 342 of the upper sequence, computational layer L1 of the first DLA model is executed on the representative data 340 using the virtual DLA chip 334. At 348, an early exit from execution of the first DLA model occurs and the results from execution of the computational layer L1 are input to the second virtual DLA chip 336. At 344 and 346, respectively, two computational layers of the second DLA model, computational layer L1 and computational layer L2, are executed using the second virtual DLA chip 336. The upper sequence yields results having 1,000 correct inferences per second per watt (inf/s/w).
[0051] At 341 and 343 of the lower sequence, respectively, computational layer L1 and computational layer L2 of the first DLA model are executed on the representative data 340 using the virtual DLA chip 334. At 349, an early exit from execution of the first DLA model occurs and the results from execution of the computational layers L1 and L2 of the first DLA model are input to the second virtual DLA chip 336. At 345, a computational layer of the second DLA model, computational layer L1, is executed using the second virtual DLA chip 336. The lower sequence yields results having 1,500 inf/s/w. Thus, the lower sequence yields more accurate results than the upper sequence. The lower sequence can be selected for execution of the first and second DLA models on data received by the physical DLA chip 330 after compile time based on the higher accuracy of that sequence (1,500 inf/s/w versus 1,000 inf/s/w) or the accuracy being at least a threshold accuracy (e.g., at least 1,250 inf/s/w).
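One way to express this compile-time selection is sketched below in Python; each candidate sequence is modeled as a callable returning its measured inf/s/w on the representative data, the 1,250 inf/s/w threshold follows the example above, and all interfaces are hypothetical:

    def select_sequence(candidate_sequences, representative_data,
                        threshold_inf_s_w=1_250):
        """Pick, at compile time, which split of computational layers between
        the little and big virtual DLA chips to use on live data."""
        best_sequence, best_score = None, 0.0
        for sequence in candidate_sequences:
            score = sequence(representative_data)  # e.g., 1_000 or 1_500 inf/s/w
            if score >= threshold_inf_s_w and score > best_score:
                best_sequence, best_score = sequence, score
        return best_sequence  # reused after compile time without re-evaluation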
[0052] In some embodiments, executing DLA models on representative data can be used to determine if the quantity of member DLA cores of virtual DLA chips (e.g., the virtual DLA chips 334 and 336) needs to be changed to improve the accuracy of results. For example, instead of or in addition to changing which computational layers of DLA models to execute to improve the accuracy of the results, additional DLA cores can be assigned to one or more of the virtual DLA chips 334 and 336. In some embodiments, the quantity of member DLA cores of one or more of the virtual DLA chips 334 and 336 can be decreased if the accuracy of results from the representative data 340 is more than needed to reduce energy consumption by (improve energy efficiency of) the physical DLA chip 330.
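The resizing described in this paragraph might be sketched as follows, assuming a hypothetical chip object with a mutable core count and an evaluate() callback that reports accuracy on the representative data; the margin parameter (how much surplus accuracy justifies releasing a core) is an assumption:

    def tune_chip_size(chip, evaluate, target_accuracy, total_cores, margin=1.2):
        """Grow or shrink a virtual DLA chip's quantity of member DLA cores
        at compile time based on accuracy over representative data."""
        accuracy = evaluate(chip)
        while accuracy < target_accuracy and chip.cores < total_cores:
            chip.cores += 1                # assign an additional DLA core
            accuracy = evaluate(chip)
        removed = False
        while accuracy >= target_accuracy * margin and chip.cores > 1:
            chip.cores -= 1                # release a core to save energy
            accuracy = evaluate(chip)
            removed = True
        if removed and accuracy < target_accuracy:
            chip.cores += 1                # restore the last core removed
        return chip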
[0054] At block 460, the method can include executing a first DLA model using a first subset of a plurality of DLA cores of a DLA chip. The first subset can include a first quantity of the plurality of DLA cores. The first quantity of the plurality of DLA cores can be assigned to the first subset of the DLA cores based on a first computational capability of the first DLA model.
[0055] At block 462, the method can include executing a second DLA model using a second subset of the plurality of DLA cores of the DLA chip. The second subset can include a second quantity of the plurality of DLA cores that is different than the first quantity of the plurality of DLA cores. The second quantity of the plurality of DLA cores can be assigned to the second subset of the DLA cores based on a second computational capability of the second DLA model. If the second computational capability is greater than the first computational capability, a greater quantity of the plurality of DLA cores can be assigned to the second subset of the plurality of DLA cores than the first quantity of the plurality of DLA cores assigned to the first subset of the plurality of DLA cores. Alternatively, if the second computational capability is less than the first computational capability, a lesser quantity of the plurality of DLA cores can be assigned to the second subset than the first quantity assigned to the first subset.
[0056] Although not specifically illustrated, the method can include executing the first DLA model using the first subset of the plurality of DLA cores and the second DLA model using the second subset of the plurality of DLA cores at least partially concurrently.
[0057] Although not specifically illustrated, the method can include executing a third DLA model using a third subset of the plurality of DLA cores of the DLA chip. The third subset can include a third quantity of the plurality of DLA cores that is different than the first and second quantities of the plurality of DLA cores. The third quantity of the plurality of DLA cores can be assigned to the third subset of the DLA cores based on a third computational capability of the third DLA model. The third computational capability can be different than the first and second computational capabilities.
[0059] At block 570, the method can include determining which computational layers of a first DLA model to execute on data received by a physical DLA chip subsequent to a compile time. Determining which computational layers of the first DLA model to execute can include, at block 571, executing, at compile time and using a first virtual DLA chip, a first number of computational layers of a first DLA model on representative data and, at block 572, executing a second DLA model, using a second virtual DLA chip, on results from execution of the first number of computational layers of the first DLA model on the representative data. The first virtual DLA chip can include a different quantity of DLA cores of the physical DLA chip than the second virtual DLA chip. Determining which computational layers of the first DLA model to execute can further include, at block 573, determining whether results from execution of the second DLA model on results from execution of the first number of computational layers of the first DLA model have a confidence value that is at least a threshold confidence value.
[0060] Although not specifically illustrated, the method can include, responsive to determining that the results from execution of the second DLA model on the results from execution of the first number of computational layers of the first DLA model have a confidence value that is at least the threshold confidence value, executing the first number of computational layers of the first DLA model, subsequent to the compile time and using the first virtual DLA chip, on data received by the physical DLA chip. The method can include, responsive to determining that the results from execution of the second DLA model on the results from execution of the first number of computational layers of the first DLA model have a confidence value that is less than the threshold confidence value, executing a second number of computational layers of the first DLA model, using the first virtual DLA chip, on the representative data. The second number of computational layers can include an additional computational layer of the first DLA model or exclude a computational layer of the first number of computational layers. The second DLA model can be executed, using the second virtual DLA chip, on results from execution of the second number of computational layers of the first DLA model on the representative data. Whether results from execution of the second DLA model on results from execution of the second number of computational layers of the first DLA model have a confidence value that is at least the threshold confidence value can then be determined.
[0061] The method can include, responsive to determining that the results from execution of the second DLA model on the results from execution of the second number of computational layers of the first DLA model have a confidence value that is at least the threshold confidence value, executing, subsequent to the compile time and using the first virtual DLA chip, the second number of computational layers of the first DLA model on data received by the physical DLA chip. The method can include, responsive to determining that the results from execution of the second DLA model on the results from execution of the first number of computational layers of the first DLA model have a confidence value that is less than the threshold confidence value, executing a number of computational layers of the second DLA model, using the second virtual DLA chip, on the results from execution of the second number of computational layers of the first DLA model on the representative data. The number of computational layers of the second DLA model can include an additional computational layer of the second DLA model or exclude a computational layer of the second DLA model executed on the results from execution of the first number of computational layers of the first DLA model.
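A compact Python sketch of this compile-time search, assuming hypothetical run_layers/run interfaces and results exposing a confidence attribute (the linear scan below covers the "additional layer" case; excluding layers would extend the search space):

    def find_switch_point(first_model, second_model, representative_data, threshold):
        """Determine, at compile time, how many computational layers of the
        first DLA model to execute before switching to the second DLA model."""
        for count in range(1, len(first_model.layers) + 1):
            intermediate = first_model.run_layers(representative_data, count=count)
            results = second_model.run(intermediate)
            if results.confidence >= threshold:
                return count  # after compile time, run this many layers, then switch
        return None  # no split met the threshold; consider assigning more DLA cores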
[0063] The machine can be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
[0064] The example computer system 690 includes a processing device 691, a main memory 693 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 697 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage system 699, which communicate with each other via a bus.
[0065] The processing device 691 represents one or more general-purpose processing devices such as a microprocessor, a central processing unit, or the like. More particularly, the processing device can be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets, or processors implementing a combination of instruction sets. The processing device 691 can also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processing device 691 is configured to execute instructions 692 for performing the operations and steps discussed herein. The computer system 690 can further include a network interface device 695 to communicate over the network 696.
[0066] The data storage system 699 can include a machine-readable storage medium 689 (also known as a computer-readable medium) on which is stored one or more sets of instructions 692 or software embodying any one or more of the methodologies or functions described herein. The instructions 692 can also reside, completely or at least partially, within the main memory 693 and/or within the processing device 691 during execution thereof by the computer system 690, the main memory 693 and the processing device 691 also constituting machine-readable storage media.
[0067] In some embodiments, the instructions 692 include instructions to implement functionality corresponding to the host 102 and/or the memory device 104. The instructions 692 can be executed to cause the machine to assign a first quantity of a plurality of DLA cores of a physical DLA chip to a first virtual DLA chip based on a first processing requirement of a first DLA model and assign a second quantity of the plurality of DLA cores of the physical DLA chip to a second virtual DLA chip based on a second processing requirement of a second DLA model. The instructions 692 can be executed to cause the machine to execute the first DLA model using the first virtual DLA chip and execute the second DLA model using the second virtual DLA chip. The instructions 692 can be executed to cause the machine to assign a greater quantity of the plurality of DLA cores to the first virtual DLA chip than to the second virtual DLA chip in response to the first processing requirement being greater than the second processing requirement. The instructions 692 can be executed to cause the machine to assign a lesser quantity of the plurality of DLA cores to the first virtual DLA chip than to the second virtual DLA chip in response to the second processing requirement being greater than the first processing requirement.
[0068] The instructions 692 can be executed to cause the machine to, responsive to instructions to execute a third DLA model, assign a third quantity of the plurality of DLA cores to a third virtual DLA chip based on a third processing requirement of the third DLA model. The third processing requirement can be different than the first and second processing requirements. The instructions 692 can be executed to cause the machine to execute the third DLA model using the third virtual DLA chip. The instructions 692 can be executed to cause the machine to, responsive to subsequent instructions to execute the first DLA model, assign the first quantity of the plurality of DLA cores to the first virtual DLA chip and execute the first DLA model using the first virtual DLA chip having the first quantity of the plurality of DLA cores assigned thereto. The instructions 692 can be executed to cause the machine to, responsive to instructions to execute a fourth DLA model, assign a fourth quantity of the plurality of DLA cores to a fourth virtual DLA chip based on a fourth processing requirement of the fourth DLA model and execute the fourth DLA model using the fourth virtual DLA chip. The fourth processing requirement can be different than the third processing requirement.
[0069] The instructions 692 can be executed to cause the machine to determine whether execution of a computational layer of a first DLA model on representative data, using a first virtual DLA chip, yields results having at least a threshold confidence value. A non-limiting example of a confidence value is an accuracy value (e.g., correct inferences per second per watt). The first virtual DLA chip can include a first plurality of DLA cores of a physical DLA chip. The instructions 692 can be executed to cause the machine to, responsive to determining that execution of the computational layer of the first DLA model yields results having less than the threshold confidence value, execute a second DLA model, using a second virtual DLA chip, on results from execution of the computational layer of the first DLA model. The second virtual DLA chip can include a second plurality of DLA cores of the physical DLA chip that is greater in quantity than the first plurality of DLA cores. The instructions 692 can be executed to cause the machine to, responsive to determining that execution of a respective last computational layer of the first DLA model yields results having less than the threshold confidence value, assign an additional DLA core of the physical DLA chip to the first virtual DLA chip and execute the first DLA model on data received by the physical DLA chip using the first virtual DLA chip including the additional DLA core. The instructions 692 can be executed to cause the machine to determine whether execution of the computational layer of the first DLA model on the representative data yields results having at least the threshold confidence value at a compile time.
[0070] The instructions 692 can be executed to cause the machine to determine whether the execution of the second DLA model provides at least a threshold quantity of correct inferences per second per watt. The instructions 692 can be executed to cause the machine to, responsive to determining that execution of the second DLA model yields results having less than the threshold quantity of correct inferences per second per watt, assign another additional DLA core of the physical DLA chip to the second virtual DLA chip and execute the second DLA model on the data received by the physical DLA chip using the second virtual DLA chip including the other additional DLA core.
[0071] While the machine-readable storage medium 689 is shown in an example embodiment to be a single medium, the term “machine-readable storage medium” should be taken to include a single medium or multiple media that store the one or more sets of instructions. The term “machine-readable storage medium” shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term “machine-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.
[0072] Although specific embodiments have been illustrated and described herein, those of ordinary skill in the art will appreciate that an arrangement calculated to achieve the same results can be substituted for the specific embodiments shown. This disclosure is intended to cover adaptations or variations of various embodiments of the present disclosure. It is to be understood that the above description has been made in an illustrative fashion, and not a restrictive one. Combinations of the above embodiments, and other embodiments not specifically described herein will be apparent to those of skill in the art upon reviewing the above description. The scope of the various embodiments of the present disclosure includes other applications in which the above structures and methods are used. Therefore, the scope of various embodiments of the present disclosure should be determined with reference to the appended claims, along with the full range of equivalents to which such claims are entitled.
[0073] In the foregoing Detailed Description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the disclosed embodiments of the present disclosure have to use more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment.