ARTIFICIAL NEURAL NETWORK AND COMPUTATIONAL ACCELERATOR STRUCTURE CO-EXPLORATION APPARATUS AND METHOD
20230077987 · 2023-03-16
Cpc classification
G06F9/5027
PHYSICS
G06F18/217
PHYSICS
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G06N3/0985
PHYSICS
G06F18/285
PHYSICS
Abstract
An artificial neural network and computational accelerator structure co-exploration apparatus includes: a neural architecture search (NAS) module configured to determine a neural network architecture, and a differentiable accelerator and network co-exploration (DANCE) evaluation module configured to determine an accelerator architecture according to the determined neural network architecture and predict hardware metrics for the determined accelerator architecture.
Claims
1. An artificial neural network and computational accelerator structure co-exploration apparatus, comprising: a neural architecture search (NAS) module configured to determine neural network architecture; and a differentiable accelerator and network co-exploration (DANCE) evaluation module configured to determine accelerator architecture according to the determined neural network architecture and predict hardware metrics for the determined accelerator architecture.
2. The apparatus of claim 1, wherein the NAS module simultaneously evaluates a plurality of candidate neural network architectures to select the neural network architecture and calculate a cross-entropy loss (LossCE).
3. The apparatus of claim 1, wherein the DANCE evaluation module is constructed through pre-training, and includes: a hardware generation network configured to be built through pre-training, explore optimal hardware according to the determined neural network architecture as the accelerator architecture, and determine at least one of a processing element (PE) array configuration (PEx and PEy), a register file (RF) configuration, and a dataflow (DF) configuration; and a cost estimation network configured to predict the hardware metrics based on configurations of the accelerator architecture.
4. The apparatus of claim 3, wherein the hardware generation network generates random networks within a network architecture space and determines one of the random networks as the optimal hardware.
5. The apparatus of claim 4, wherein the hardware generation network explores the random networks by being configured as multi-layer perceptrons using a rectified linear unit (ReLU) as an activation function.
6. The apparatus of claim 5, wherein the hardware generation network makes an output value approach an input value of the cost estimation network in a manner of feature forwarding the output value to the input value by connecting the last of the multi-layer perceptrons with Gumbel-Softmax.
7. The apparatus of claim 3, wherein the cost estimation network is configured as a multi-layer regression that uses a rectified linear unit (ReLU) as an activation function and applies batch normalization to each layer.
8. The apparatus of claim 7, wherein the cost estimation network predicts the hardware metrics by determining latency, area, and energy consumption through the multi-layer regression.
9. The apparatus of claim 8, wherein the cost estimation network predicts the hardware metrics by calculating a linear combination or a product of the latency, the area, and the energy consumption.
10. An artificial neural network and computational accelerator structure co-exploration method, comprising: performing a NAS module that determines neural network architecture; and performing a DANCE evaluation module that determines accelerator architecture according to the determined neural network architecture and predicts hardware metrics for the determined accelerator architecture.
11. The method of claim 10, wherein the performing of the DANCE evaluation module constructed through pre-training includes: performing a hardware generation network that explores optimal hardware according to the determined neural network architecture as the accelerator architecture, and determines at least one of a processing element (PE) array configuration (PEx and PEy), a register file (RF) configuration, and a dataflow (DF) configuration; and performing a cost estimation network that predicts the hardware metrics based on configurations of the accelerator architecture.
12. The method of claim 11, wherein the performing of the hardware generation network includes generating random networks within a network architecture space and determining one of the random networks as the optimal hardware.
13. The method of claim 12, wherein the performing of the hardware generation network includes exploring the random networks by being configured as multi-layer perceptrons using a rectified linear unit (ReLU) as an activation function.
14. The method of claim 11, wherein the performing of the cost estimation network includes configuring the cost estimation network as a multi-layer regression that uses a rectified linear unit (ReLU) as an activation function and applies batch normalization to each layer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0047]
[0048]
[0049]
[0050]
[0051]
[0052]
[0053]
[0054]
[0055]
DETAILED DESCRIPTION
[0056] Since the description of the present disclosure is merely an embodiment for structural or functional explanation, the scope of the present disclosure should not be construed as being limited by the embodiments described in the text. That is, since the embodiments may be variously modified and may have various forms, the scope of the present disclosure should be construed as including equivalents capable of realizing the technical idea. In addition, a specific embodiment is not construed as including all the objects or effects presented in the present disclosure or only the effects, and therefore the scope of the present disclosure should not be understood as being limited thereto.
[0057] On the other hand, the meaning of the terms described in the present application should be understood as follows.
[0058] Terms such as “first” and “second” are intended to distinguish one component from another component, and the scope of the present disclosure should not be limited by these terms. For example, a first component may be named a second component and the second component may also be similarly named the first component.
[0059] It is to be understood that when one element is referred to as being “connected to” another element, it may be connected directly to or coupled directly to another element or be connected to another element, having the other element intervening therebetween. On the other hand, it is to be understood that when one element is referred to as being “connected directly to” another element, it may be connected to or coupled to another element without the other element intervening therebetween. In addition, other expressions describing a relationship between components, that is, “between”, “directly between”, “neighboring to”, “directly neighboring to” and the like, should be similarly interpreted.
[0060] It should be understood that singular expressions include plural expressions unless the context clearly indicates otherwise, and it will be further understood that the terms “comprises” or “have” used in this specification specify the presence of stated features, numerals, steps, operations, components, parts, or a combination thereof, but do not preclude the presence or addition of one or more other features, numerals, steps, operations, components, parts, or a combination thereof.
[0061] In each step, an identification code (for example, a, b, c, and the like) is used for convenience of description, and the identification code does not describe the order of each step, and each step may be different from the specified order unless the context clearly indicates a particular order. That is, the respective steps may be performed in the same sequence as the described sequence, be performed at substantially the same time, or be performed in an opposite sequence to the described sequence.
[0062] The present disclosure may be embodied as computer readable code on a computer-readable recording medium, and the computer-readable recording medium includes all types of recording devices in which data may be read by a computer system. Examples of the computer readable recording medium may include a read only memory (ROM), a random access memory (RAM), a compact disk read only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage, or the like. In addition, the computer readable recording medium may be distributed in computer systems connected to each other through a network, such that the computer readable codes may be stored in a distributed scheme and executed.
[0063] Unless defined otherwise, all the terms used herein including technical and scientific terms have the same meaning as meanings generally understood by those skilled in the art to which the present disclosure pertains. It should be understood that the terms defined by the dictionary are identical with the meanings within the context of the related art, and they should not be ideally or excessively formally defined unless the context clearly dictates otherwise.
[0064]
[0065] Referring to
[0066] The NAS module 110 may perform an operation of determining the neural network architecture, and the DANCE evaluation module 130 may perform an operation of determining accelerator architecture corresponding to neural network architecture determined by the NAS module 110 and predicting hardware metrics for the accelerator architecture.
[0067] More specifically, the NAS module 110 may select the neural network architecture by simultaneously evaluating a plurality of candidate neural network architectures, and may calculate a cross-entropy loss (Loss.sub.CE) related thereto.
[0068] In one embodiment, the DANCE evaluation module 130 may be constructed through pre-training, and may be configured to include two networks. That is, the DANCE evaluation module 130 may include a hardware generation network and a cost estimation network.
[0069] First, the hardware generation network may explore the optimal hardware accelerator architecture for the neural network architecture determined by the NAS module 110, and may determine at least one of a processing element (PE) array configuration (PEx and PEy), a register file (RF) configuration, and a dataflow (DF) configuration for the accelerator architecture. That is, the hardware generation network may be pre-trained to explore the optimal hardware architecture, and may generate the configurations of that architecture as parameters. For example, the hardware generation network may generate, as an output, the PE array dimensions PEx and PEy, the register file RF, the dataflow DF, and the like for the optimal hardware architecture.
[0070] In one embodiment, the hardware generation network may generate random networks within the network architecture space and determine one of the random networks as the optimal hardware. That is, the hardware generation network may receive a random network as an input and may generate an output that may be used as a ground-truth for training the evaluator network.
[0071] In an embodiment, the hardware generation network may explore the random networks by being configured as multi-layer perceptrons using a rectified linear unit (ReLU) as an activation function. For example, as illustrated in
[0072] In an embodiment, the hardware generation network 131 may make an output value approach an input value of the cost estimation network in a manner of feature forwarding the output value to the input value by connecting the last of the multi-layer perceptrons with Gumbel-Softmax. For example, as illustrated in
[0073] In addition, the cost estimation network may perform an operation of predicting hardware metrics based on configurations related to the accelerator architecture. In an embodiment, the cost estimation network may be configured as a multi-layer regression that uses the rectified linear unit (ReLU) as the activation function and applies batch normalization to each layer. For example, the cost estimation network 133 may be configured as the five-layer regression as illustrated in
[0074] In an embodiment, the cost estimation network may predict the hardware metrics by determining latency, area, and energy consumption through multi-layer regression. In this case, the ground truth generated by the evaluation software may be used in the cost estimation process.
[0075] In one embodiment, the cost estimation network may predict the hardware metrics by calculating a linear combination or a product of the latency, the area, and the energy consumption. That is, the cost estimation network may predict the hardware metrics using a cost function, and the cost function may be defined either as a linear combination of the latency, the area, and the energy consumption, or as the product of the latency, the area, and the energy consumption.
[0076]
[0077] Referring to
[0078] Hereinafter, the artificial neural network and computational accelerator structure co-exploration method according to the present disclosure will be described in more detail with reference to
[0079] Neural architecture search (NAS) may automate the design of DNN architectures to cope with increasing network sizes and the corresponding manual design effort. Early NAS approaches adopted reinforcement learning (RL) or evolutionary algorithms (EA) for network generation.
[0080] However, with these algorithms the search cost may be very high, consuming several thousand GPU-days because every candidate requires full training. To alleviate this cost, the differentiable neural architecture search according to the present disclosure may build a supergraph and find a path within it. In other words, the differentiable neural architecture search may find networks with state-of-the-art performance in a time that is orders of magnitude shorter.
[0081] Hardware accelerators for DNNs focus on parallel execution of multiple MAC (Multiply-Accumulate) operations, which are the most common operations in recent CNNs.
[0082] In general, the DNN layer may include multiple dimensions of computational operations. For example, the convolutional layer may include seven computing operation layers as illustrated in
[0083] Analysis of how each choice affects the DNN latency in the accelerator design may be performed by a simulator or an analytical evaluation tool. The co-exploration method according to the present disclosure may utilize Timeloop combined with Accelergy as a state-of-the-art accelerator evaluation toolchain for training evaluation networks on the DANCE framework.
[0084] For the co-exploration of network architecture and accelerator design, existing methods may use reinforcement learning (RL) as a controller because it offers a relatively simple way to formulate the problem. However, all of these methods may suffer from the same search cost problem that occurs in reinforcement learning-based NAS algorithms.
[0085] In contrast, the method according to the present disclosure may apply the idea of a differentiable NAS to the joint exploration problem, which may greatly reduce the cost of exploration while creating a network and accelerator design with the highest accuracy. The existing method called EDD can provide a differentiable method for joint exploration problems. However, the method may have some important limitations. The EDD may model latency as the total FLOPs of the network divided by the amount of computational resources. As a result, the true relationship between the network architecture and the accelerator design may not be considered in the co-search. In principle, this may prevent the search from covering some important characteristics such as the dataflow or the register file size. Also, the main focus of the EDD may be to use various types of quantization for each layer. Therefore, the EDD may assume that there is (shareable) dedicated hardware for each layer, which differs from general accelerators.
[0086] Referring to
[0087] Referring to
[0088] The right part of
$$\mathrm{Loss} = \mathrm{Loss}_{CE} + \lambda_1 \lVert w \rVert + \lambda_2 \, \mathrm{Cost}_{HW}$$ [Equation 1]
[0089] Here, λ.sub.1 and λ.sub.2 are hyperparameters that adjust a trade-off between terms. The Loss.sub.CE is a cross-entropy loss and ∥w∥ is a weight decay term. Also, Cost.sub.HW is a cost function of the hardware accelerator calculated from the output value of the evaluator network. For example, the cost function may correspond to a linear combination of the latency, the area, and the energy consumption, or correspond to energy-delay-area product (EDAP).
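Equation 1 can be illustrated with a small numeric sketch. The function name `total_loss` and all sample values are hypothetical, and the weight decay term is taken here as the Euclidean norm of the weights, which is an assumption because the text only writes ∥w∥.

```python
import math

def total_loss(loss_ce, weights, cost_hw, lambda_1, lambda_2):
    """Equation 1: Loss = Loss_CE + lambda_1 * ||w|| + lambda_2 * Cost_HW."""
    # Assumption: ||w|| is the Euclidean (L2) norm of the weights.
    weight_norm = math.sqrt(sum(w * w for w in weights))
    return loss_ce + lambda_1 * weight_norm + lambda_2 * cost_hw

# Hypothetical values chosen only to make the arithmetic easy to follow.
loss = total_loss(loss_ce=0.5, weights=[3.0, 4.0], cost_hw=2.0,
                  lambda_1=0.1, lambda_2=0.25)
print(loss)
```

Raising `lambda_2` in this sketch makes the hardware cost dominate the total, which is the trade-off the hyperparameters are meant to control.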
[0090] Original (non-differentiable) cost estimation software may be composed of a hardware generation tool and a cost estimation tool. A hardware generation tool may use network architecture as an input and generate a hardware accelerator design as an output.
[0091] The co-exploration method according to the present disclosure may use the dataflow, the number of PEs in the X and Y dimensions, and the register file size as the search space of the hardware accelerator design. The cost estimation tool may then generate cost metrics as an output using the hardware accelerator design and the network architecture. In general, the hardware generation tool may be implemented as an outer loop containing the cost estimation tool. That is, the hardware generation tool may output an optimal solution for a given network architecture A within a hardware search space H by using an exact algorithm such as exhaustive exploration or a branch-and-bound algorithm.
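The outer-loop structure described above can be sketched as follows. This is a minimal illustration under stated assumptions: `exhaustive_hw_search` and `toy_cost` are hypothetical names, the candidate values are a reduced subset, and the toy cost model is invented for the example and does not represent Timeloop or Accelergy.

```python
from itertools import product

def exhaustive_hw_search(network, search_space, estimate_cost):
    """Outer loop: try every hardware configuration and keep the cheapest."""
    best_cfg, best_cost = None, float("inf")
    for pe_x, pe_y, rf_size, dataflow in product(*search_space):
        cfg = {"pe_x": pe_x, "pe_y": pe_y,
               "rf_size": rf_size, "dataflow": dataflow}
        cost = estimate_cost(network, cfg)  # inner cost estimation tool
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

def toy_cost(network, cfg):
    """Hypothetical stand-in for a real estimator; all numbers are invented."""
    num_pes = cfg["pe_x"] * cfg["pe_y"]
    latency = network["macs"] / num_pes              # more PEs, less time
    area = num_pes * (1.0 + cfg["rf_size"] / 64.0)   # more PEs/RF, more area
    dataflow_factor = {"WS": 1.0, "OS": 1.1, "RS": 0.9}[cfg["dataflow"]]
    return latency * dataflow_factor + area

# Reduced search space: (PE_X choices, PE_Y choices, RF sizes, dataflows).
search_space = ([8, 16, 24], [8, 16, 24], [4, 16, 64], ["WS", "OS", "RS"])
best_cfg, best_cost = exhaustive_hw_search({"macs": 36864}, search_space, toy_cost)
print(best_cfg, best_cost)
```

Because every configuration triggers one call to the estimator, the running time grows with the size of H, which is the cost that a learned hardware generation network is intended to avoid.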
[0092] In an embodiment, the co-exploration method according to the present disclosure may use Timeloop for latency and Accelergy for energy/area in the cost estimation process. In this case, the Timeloop and Accelergy may correspond to a state-of-the-art cost estimation toolchain. The co-exploration method according to the present disclosure may design a unique hardware generation tool using the cost estimation tool. The co-exploration method according to the present disclosure may generate a random network as an input on the network architecture space A, and the output of the toolchain may be used as the ground-truth for training the components of the evaluator network.
[0093] The evaluator network according to the present disclosure may include two modules: a hardware generation network and a cost estimation network. Referring to
[0094] The cost estimation network may be modeled as a five-layer regression with residual connections. The cost estimation network may include the ReLU as the activation function, and the batch normalization may be applied to all layers. The cost estimation network may generate as output the three cost metrics of interest (i.e., latency, area, and energy consumption) based on the ground truth generated by the evaluation software. For example, the evaluation software may include Timeloop and Accelergy. The present disclosure may use a mean squared relative error (MSRE) loss to train each evaluator network, which may be expressed as Equation 2 below.
$$\mathrm{Loss}_{MSRE} = \sum_i \left( 1 - \frac{\hat{y}_i}{y_i} \right)^2$$ [Equation 2]
[0095] Here, y.sub.i is the hardware cost function (Cost.sub.HW) for each metric generated from the results of Timeloop+Accelergy, and ŷ.sub.i is the same cost function calculated using the network output. A general MSE loss could be used, but it may give inappropriately large weight to samples with high metric values. For example, the latency values in the search space may range from 8 ns to 100 ns or more per layer. With the MSE loss, a 10 ns error on an 8 ns latency and a 10 ns error on a 100 ns latency are treated as the same, which unfairly favors accurate modeling of long-latency cases. Because the goal is to find accelerators with low latency, the MSRE loss may be more desirable.
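The 8 ns versus 100 ns example can be checked numerically. The sketch below uses hypothetical helper names `mse` and `msre`, and implements the relative error in the summed form of Equation 2.

```python
def mse(pred, target):
    """Mean squared error over paired predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def msre(pred, target):
    """Equation 2: sum over i of (1 - y_hat_i / y_i)^2."""
    return sum((1.0 - p / t) ** 2 for p, t in zip(pred, target))

# A 10 ns error on a short (8 ns) layer and on a long (100 ns) layer.
mse_short, mse_long = mse([18.0], [8.0]), mse([110.0], [100.0])
msre_short, msre_long = msre([18.0], [8.0]), msre([110.0], [100.0])

print(mse_short, mse_long)    # identical absolute penalty
print(msre_short, msre_long)  # the short-latency error is penalized far more
```

Under MSE both errors contribute equally, whereas under the relative form the error on the 8 ns layer is weighted more than a hundred times as heavily as the error on the 100 ns layer.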
[0096] In the evaluator architecture, a cost estimation network that outputs the hardware cost metrics directly from the network architecture would have to model two functions internally: finding the optimal hardware and estimating the metrics. A standalone cost estimation network may already show fairly high accuracy, but the accuracy may be further improved by adding a feature forwarding path from the output of the hardware generation network. That is, the output of the hardware generation network may be fed, together with the network architecture, as an input to the cost estimation network. For example, when the Gumbel softmax is used as the last layer of the hardware generation network, the output value of the hardware generation network may be made as close as possible to the input of the cost estimation network.
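How a Gumbel softmax turns logits into a nearly one-hot yet still differentiable vector can be sketched in plain Python. The function name `gumbel_softmax`, the temperature values, and the example logits are assumptions for illustration and are not taken from the disclosure.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over three dataflow choices (WS, OS, RS).
random.seed(0)
probs = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
print(probs)
```

Lower temperatures push the output closer to a one-hot vector, which is what lets the hardware generation output resemble the one-hot inputs that the cost estimation network was trained on.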
[0097] Compared to optimizing the classification accuracy of an application, optimizing the cost metrics may be a relatively easy task for gradient descent. For example, selecting Zero for most of the layers may quickly reduce the latency, the area, and the energy consumption all at once. Once the network architecture falls into such solutions, it may be difficult to recover the more critical architectures that are needed to reach the highest accuracy. To mitigate this effect, hyperparameter warm-up scheduling may be used. The hyperparameter warm-up scheduling may set λ.sub.2 in Equation 1 to a small value for the first few epochs and increase λ.sub.2 to the desired value later, after the network architecture has reached a stage of sufficiently high accuracy.
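The warm-up idea can be sketched as a simple schedule for λ.sub.2. The step shape, the function name `lambda2_schedule`, and the numeric values are assumptions; the text only states that λ.sub.2 is small at first and raised later.

```python
def lambda2_schedule(epoch, warmup_epochs, target, initial=0.0):
    """Assumed step schedule: `initial` during warm-up, `target` afterwards."""
    return initial if epoch < warmup_epochs else target

# Hypothetical setting: 40 warm-up epochs, then a hardware-cost weight of 0.5.
for epoch in (0, 39, 40, 119):
    print(epoch, lambda2_schedule(epoch, warmup_epochs=40, target=0.5))
```

A gradual ramp would serve the same purpose; the key point is that the hardware cost term stays weak until the accuracy term has shaped the architecture.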
[0098] Basically, the hardware cost function may use a linear combination of the three hardware cost metrics as the cost function Cost.sub.HW of Equation 1, and may be expressed as Equation 3 below.
$$\mathrm{Cost}_{HW\_linear} = \lambda_E \, \mathrm{Energy} + \lambda_L \, \mathrm{Latency} + \lambda_A \, \mathrm{Area}$$ [Equation 3]
[0099] By controlling λ.sub.E, λ.sub.L, and λ.sub.A, the balance among the cost metrics may be set. To match the scale of these hyperparameters, units of mJ, ms, and μm.sup.2 may be used for the energy, the latency, and the area, respectively.
[0100] In addition, the hardware cost function may use a product of all metrics as the cost function, and may be expressed as Equation 4 below.
$$\mathrm{Cost}_{HW\_EDAP} = \mathrm{Energy} \times \mathrm{Latency} \times \mathrm{Area}$$ [Equation 4]
[0101] Here, the EDAP (energy-delay-area product) is a common metric used to evaluate hardware. This form may be advantageous in that it requires no additional hyperparameters and no unit matching.
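The two cost functions described above, the linear combination of Equation 3 and the product of all three metrics (EDAP), can be compared in a short sketch. The function names and the sample metric values are hypothetical.

```python
def cost_hw_linear(energy, latency, area, lam_e, lam_l, lam_a):
    """Equation 3: weighted sum of energy, latency, and area."""
    return lam_e * energy + lam_l * latency + lam_a * area

def cost_hw_edap(energy, latency, area):
    """Energy-delay-area product: no weights to tune."""
    return energy * latency * area

# Hypothetical metric values, used only to show the arithmetic.
energy, latency, area = 2.0, 3.0, 4.0

balanced = cost_hw_linear(energy, latency, area, lam_e=1.0, lam_l=0.5, lam_a=0.25)
latency_oriented = cost_hw_linear(energy, latency, area, lam_e=0.0, lam_l=1.0, lam_a=0.0)
edap = cost_hw_edap(energy, latency, area)
print(balanced, latency_oriented, edap)
```

Changing the three weights steers the linear form toward a particular metric, while the product form treats a relative improvement in any metric equally.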
[0102] Hereinafter, the experimental results regarding the present disclosure will be described.
[0103] Several experiments may be performed based on the CIFAR-10 and ImageNet (ILSVRC2012) datasets for the co-exploration method (i.e., DANCE) according to the present disclosure. All algorithms may be implemented in PyTorch and run on four RTX2080Ti GPUs.
[0104] Search Space
[0105] For H, which is the hardware accelerator search space, the accelerator Eyeriss may be used as a backbone. The number of PEs, the RF size, and the dataflow may be used as design parameters. For the two-dimensional PE array, the variables PEX and PEY may be allocated separately for each dimension, and each may range from 8 to 24. In this configuration, a larger PEX may favor layers with more channels, and a larger PEY may favor larger feature maps for parallelization. The RF size per PE may have a value between 4 and 64. For the dataflow, three dataflows may be selected from existing hardware accelerators: weight stationary (WS), output stationary (OS), and row stationary (RS). HBM with a bandwidth of about 128 GB/s may be assumed for the off-chip memory. Each variable in the evaluator network may be formulated as a one-hot vector to simplify the cascaded connection between the hardware generation network and the cost estimation network.
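The one-hot formulation of the hardware variables can be sketched as below. The ranges come from the text, but the granularity is an assumption: a step of 1 for the PE dimensions and power-of-two RF sizes are chosen only for illustration, and the names `one_hot` and `encode_hw` are hypothetical.

```python
PE_CHOICES = list(range(8, 25))    # 8..24 inclusive; a step of 1 is assumed
RF_CHOICES = [4, 8, 16, 32, 64]    # between 4 and 64; powers of two are assumed
DF_CHOICES = ["WS", "OS", "RS"]    # weight, output, and row stationary

def one_hot(value, choices):
    """Encode `value` as a one-hot list over `choices`."""
    vec = [0.0] * len(choices)
    vec[choices.index(value)] = 1.0
    return vec

def encode_hw(pe_x, pe_y, rf_size, dataflow):
    """Concatenate the one-hot vectors of all four design parameters."""
    return (one_hot(pe_x, PE_CHOICES) + one_hot(pe_y, PE_CHOICES)
            + one_hot(rf_size, RF_CHOICES) + one_hot(dataflow, DF_CHOICES))

vec = encode_hw(pe_x=16, pe_y=8, rf_size=32, dataflow="RS")
print(len(vec), sum(vec))
```

With this encoding each design parameter becomes a classification target for the hardware generation network, and the concatenated vector can be passed directly to the cost estimation network.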
[0106] In the case of A, which is a network architecture search space, ProxylessNAS may be used as a backbone network architecture. There are 13 layers in the network, and the number of channels may increase for every 3 layers.
[0107] In addition to the skip connection, each of the nine intermediate layers may include seven candidate operations: MBConv3×3_expand3, MBConv3×3_expand6, MBConv5×5_expand3, MBConv5×5_expand6, MBConv7×7_expand3, MBConv7×7_expand6, and Zero. When Zero is selected, only the skip connection remains, and the layer effectively disappears from the network. The architectural parameters may be learned through a binarized method (e.g., ProxylessNAS).
[0108] Evaluator Network Results
[0109] 1) Cost Estimation Network: Table 1 below corresponds to the experimental results for the components of the evaluator network.
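TABLE 1
Network                                     Metric      Accuracy
Hardware Generation                         PE.sub.X    98.9%
Hardware Generation                         PE.sub.Y    98.3%
Hardware Generation                         RF_Size     98.3%
Hardware Generation                         Dataflow    98.8%
Cost Estimation (w/o feature forwarding)    Latency     93.7%
Cost Estimation (w/o feature forwarding)    Energy      96.3%
Cost Estimation (w/o feature forwarding)    Area        92.8%
Cost Estimation (w/ feature forwarding)     Latency     99.6%
Cost Estimation (w/ feature forwarding)     Energy      99.7%
Cost Estimation (w/ feature forwarding)     Area        99.9%
Overall Evaluator                           Latency     98.3%
Overall Evaluator                           Energy      98.3%
Overall Evaluator                           Area        99.2%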
[0110] The cost estimation network and the hardware generation network may be trained independently based on the ground truth values, and then combined with each other. Each layer of the cost estimation network may have a width of 256, and the network may be trained using an Adam optimizer with a learning rate of 0.0001 for 200 epochs. A batch size of 256 may be applied. The cost estimation network may be trained on 1.8 million cases generated by Timeloop+Accelergy in the search space and validated on 450,000 cases. As a result, all three cost metrics may be shown to be sufficiently accurate, each exceeding 99% accuracy. Also, it may be observed that the feature forwarding improves the accuracy by 4.3 %p on average.
[0111] 2) Hardware generation network: For the hardware generation network, the layer width may be set to 128. As the loss function, a general CE loss may be used, which may be denoted as Loss.sub.CE_HW. The hardware generation network may be trained using SGD with a batch size of 128 for 200 epochs, and the learning rate may start at 0.001 and decrease by a factor of 0.1 every 50 epochs. In addition, 50,000 network cases may be generated in the search space, and 10,000 cases may be used for validation. It may be confirmed that the accuracy of the hardware generation network is almost 99% for all hardware accelerator design parameters, which is sufficiently accurate. In other words, the hardware generation network is not only accurate and differentiable, but may also run much faster than the original generation toolchain. The inference of the hardware generation network, which performs the same function, takes about 0.5 ms on a single GPU, while the generation tool takes about 112 seconds using 48 threads on 24 cores of two Intel Xeon Silver-4214 CPUs.
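The step decay of the learning rate and the ratio of the two reported run times can be worked out in a few lines. The function name `step_lr` is hypothetical, and the ratio is only a rough indication because the two measurements were taken on different hardware.

```python
def step_lr(epoch, base_lr=0.001, gamma=0.1, step_size=50):
    """Learning rate that is multiplied by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate at the start of each 50-epoch stage of the 200-epoch run.
for epoch in (0, 50, 100, 150):
    print(epoch, step_lr(epoch))

# Reported run times: about 112 s for the generation tool, 0.5 ms for inference.
speedup = 112.0 / 0.5e-3
print(round(speedup))
```

With these figures the network inference is roughly 224,000 times faster per query than the exact generation tool.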
[0112] 3) End-to-end evaluator network results: The entire evaluator network may be tested with the hardware generation network and the cost estimation network combined. Even if the intermediate value passed between the two networks is not an exact one-hot vector, the Gumbel softmax may approximate it well and still maintain about 99% accuracy for the cost metrics.
[0113] Co-Exploration Results
[0114] 1) Experimental results for CIFAR-10: For the first baseline, a search may be performed using ProxylessNAS, and hardware generation may be performed on the searched network using an exhaustive exploration tool. This may represent the typical separate design flow used in practice. The search may be performed for 120 epochs with a batch size of 256, of which 40 epochs are used for warm-up. An SGD optimizer with Nesterov momentum may be used for the search, with cosine scheduling, a learning rate of 0.025, a weight decay of 0.00004 (λ.sub.1), label smoothing of 0.1, and momentum of 0.9. After the search, the final network may be trained from scratch for 300 epochs. The hyperparameters for this training may be identical except that the learning rate is 0.008 and the weight decay factor is 0.001. In addition, the EDD may be used as a second baseline. Since the EDD may not be applied to the hardware parameters for dataflow and register files, the co-exploration is performed based on only the number of PEs, and a post search may be performed on the remaining parameters. A problem that may occur in the EDD is that it uses a loss function that multiplies a classification loss by a latency loss, which may be expressed as Equation 5 below.
$$\mathrm{Loss} = \lambda_2 \cdot \mathrm{Loss}_{CE} \cdot \sum \mathrm{Latency}$$ [Equation 5]
[0115] Here, because the two terms are multiplied, λ.sub.2 only scales the overall loss and does not adjust the relative weight between the two terms. This may lead to a serious problem in which the network shrinks too much in order to quickly optimize latency. As a result, the solution may provide very low hardware cost but unacceptable accuracy. Therefore, to alleviate the problem, an additional experiment may be performed in which the loss function is replaced with that of Equation 1, which may be denoted as EDD+Proposed Loss func.
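The reason the multiplicative form offers no trade-off control can be seen from its gradient. The derivation below is a sketch that assumes the loss is differentiated with respect to architecture parameters, written here as α (a symbol introduced only for this illustration).

```latex
\nabla_\alpha \left( \lambda_2 \cdot \mathrm{Loss}_{CE} \cdot \sum \mathrm{Latency} \right)
= \lambda_2 \left( \sum \mathrm{Latency} \cdot \nabla_\alpha \mathrm{Loss}_{CE}
  + \mathrm{Loss}_{CE} \cdot \nabla_\alpha \sum \mathrm{Latency} \right)
```

The factor λ.sub.2 multiplies both gradient terms equally, so the relative weight of the accuracy and latency gradients is fixed by the current values of the two losses rather than by the hyperparameter. In the additive form of Equation 1, by contrast, λ.sub.2 scales only the hardware cost gradient.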
[0116] Using the DANCE, the co-exploration may be performed based on the cost functions. For Cost.sub.HW_linear, three cost functions may be set: latency-oriented, energy-oriented, and balanced. All other hyperparameters may be set to be the same as the baseline. As with the post-search training, one exact hardware generation may be performed after the search to obtain an optimal hardware accelerator design.
[0117] Overall, the DANCE may achieve a better network and accelerator design than the baselines. For comparison, two designs may be reported, one with high accuracy (−A) and the other with an efficient hardware design (−B). For the high-accuracy design (−A), the DANCE may achieve almost the same accuracy as the baseline (no penalty). For the efficient hardware design (−B), the design with the best cost function may be selected within a 1 to 2% accuracy reduction. The DANCE may perform efficient co-exploration to achieve up to 10× better EDAP or 3× better latency. With the latency-oriented cost function, the latency may be much lower than with the other functions, while the energy-oriented cost function may achieve better energy consumption than the other two functions. As a result, when using the DANCE, a user may tune the cost hyperparameters to obtain the solution of interest.
[0118] Referring to
[0119] 2) Experimental results for ImageNet: Table 2 below shows the performance of DANCE on the ImageNet dataset.
TABLE 2
Method                           Acc.      Latency    Energy     EDAP
Baseline (No penalty) + HW       71.12%    23.3 ms    71.6 mJ    3014.0
Baseline (Flops Penalty) + HW    70.56%    13.4 ms    70.9 mJ    2709.0
EDD + Proposed Loss func.        70.34%    28.1 ms    94.8 mJ    5642.5
DANCE (Cost.sub.HW.sub.
[0120] The baseline with a separate hardware search gives 71.12% accuracy, but the resulting hardware may be expensive. Applying the Flops Penalty or the EDD may not find an efficient solution. The DANCE may find a good trade-off point and may provide much better cost metrics with only a slight reduction in accuracy, with up to a 3× EDAP benefit.
[0121] Network and Accelerator Design Searched by DANCE
[0122] Referring to
[0123] The latency-oriented network (
[0124] The energy-oriented network (
[0125] Comparison of DANCE with Existing Co-Exploration Algorithms
[0126] Table 3 below may correspond to the result of comparing the DANCE with other accelerator/network co-exploration algorithms (i.e., Alg. [10] to [14] and [17]).
TABLE 3
Alg.     Backbone        Dataset      Acc.(%)    GPU-hours    Candidates    Method      Net-HW Relation
[11]     Custom          DAC-SDC      68.6       N/A          68            CD*         ✓
[12]     Custom          CIFAR-10     89.7       N/A          N/A           RL          ✓
[13]     ResNet-9        CIFAR-10     93.2       3.5 h        ~160          RL          ✓
[14]     NASBench        CIFAR-100    74.2       2300 h       2300          RL          ✓
[10]     ProxylessNAS    CIFAR-10     85.2       103.9 h      308           RL          ✓
[17]†    ProxylessNAS    CIFAR-10     94.4       3 h          1             gradient    X
DANCE    ProxylessNAS    CIFAR-10     95.0       3 h          1             gradient    ✓
†Reproduced and modified for the same setting
*CD = Coordinate Descent
[0127] Since all environments are different (e.g., ASIC vs FPGA, different technology nodes, different NAS backbones, etc.), it is not possible to directly compare the measured values. Also, even the accuracy may not be directly compared because it relies on the underlying NAS algorithm. However, if the difference is large, it can imply the searching capability of the method, so accuracy and search cost can be summarized for rough comparison.
[0128] Most co-exploration algorithms may utilize reinforcement learning and may have a problem of having to train many candidates in the exploration process. As a result, many of them may only output suboptimal network architectures with poor accuracy.
[0129] The search time may also be an advantage of the DANCE, which may be much faster than RL-based works. For the algorithm [13], the difference is small, but this is because its backbone is a manually fine-tuned architecture with a small model size. The ‘candidates’ column is an attempt to compare search costs fairly in consideration of this case. That is, it may correspond to the number of candidates that each algorithm needs to train during the search. The RL-based co-exploration algorithms may need to train hundreds to thousands of candidates, but the DANCE may train only one. Algorithm [17] is differentiable and may provide similar accuracy and search cost when reproduced with the same NAS backbone. However, since the algorithm [17] may not reflect the network-hardware relationship, its co-exploration solution may be of much lower quality than that of the DANCE.
[0130] The DANCE, the co-exploration method according to the present disclosure, may correspond to a new differentiable method to jointly explore hardware accelerators and network architectures targeting both high accuracy and low cost metrics. The co-exploration method according to the present disclosure may model neural network-based hardware evaluators to obtain efficient hardware designs without compromising accuracy with very low search costs. The co-exploration method according to the present disclosure may reduce costs for the co-exploration problem in many future fields, such as video or natural language processing.
[0131] Although exemplary embodiments of the present invention have been disclosed hereinabove, it may be understood by those skilled in the art that the present invention may be variously modified and altered without departing from the scope and spirit of the present invention described in the following claims.