G06N3/0495

Automatic Selection of Quantization and Filter Pruning Optimization Under Energy Constraints

Systems and methods for producing a neural network architecture with improved energy consumption and performance tradeoffs, such as for deployment on mobile or other resource-constrained devices, are disclosed. In particular, the present disclosure provides systems and methods for searching a network search space to jointly optimize the size of a layer of a reference neural network model (e.g., the number of filters in a convolutional layer or the number of output units in a dense layer) and the quantization of values within the layer. By defining the search space to correspond to the architecture of a reference neural network model, examples of the disclosed network architecture search can optimize models of arbitrary complexity. The resulting neural network models can be run using fewer computing resources (e.g., less processing power, less memory usage, less power consumption, etc.) while remaining competitive with, or even exceeding, the performance (e.g., accuracy) of current state-of-the-art, mobile-optimized models.
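
To make the joint search concrete, the following is a minimal Python sketch of a random search over a per-layer space of filter fractions and quantization bit widths derived from a reference model. All names, candidate choices, and the energy proxy are illustrative assumptions, not the disclosed implementation; a real search would also train and score each candidate for accuracy.

```python
# Minimal sketch (not the patented method): random search over a joint
# per-layer space of filter counts and quantization bit widths, derived
# from a reference model. The energy proxy is an invented stand-in.
import random

reference_layers = [64, 128, 256]        # filters per conv layer in a reference model
filter_choices = [0.25, 0.5, 0.75, 1.0]  # fraction of reference filters to keep
bit_choices = [4, 8, 16]                 # quantization bit widths

def energy_proxy(candidate):
    # Crude proxy: energy grows with kept filters and bit width.
    return sum(int(f * ref) * bits
               for ref, (f, bits) in zip(reference_layers, candidate))

def sample_candidate():
    return [(random.choice(filter_choices), random.choice(bit_choices))
            for _ in reference_layers]

def search(budget, trials=1000):
    # Keep the lowest-energy feasible candidate; a real system would
    # evaluate accuracy as well and trade it off against energy.
    best = None
    for _ in range(trials):
        cand = sample_candidate()
        e = energy_proxy(cand)
        if e <= budget and (best is None or e < best[0]):
            best = (e, cand)
    return best

print(search(budget=4000))
```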

SYSTEMS AND METHODS FOR TRAINING ENERGY-EFFICIENT SPIKING GROWTH TRANSFORM NEURAL NETWORKS
20230021621 · 2023-01-26 ·

Growth-transform (GT) neurons and their population models allow for independent control over spiking statistics and transient population dynamics while optimizing a physically plausible distributed energy functional involving continuous-valued neural variables. A backpropagation-less learning approach trains a network of spiking GT neurons by enforcing sparsity constraints on the network's overall spiking activity. Spike responses are generated as a result of constraint violations. Optimal parameters for a given task are learned in an online manner using neurally relevant local learning rules. The GT network optimizes itself to encode the solution with as few spikes as possible and to operate at a solution with maximum dynamic range, away from saturation. Further, the framework is flexible enough to incorporate additional structural and connectivity constraints on the GT network. The formulation is used to design neuromorphic tinyML systems that are constrained in energy, resources, and network structure.
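
A toy sketch of the central idea, spikes generated by constraint violations while a local sparsity rule keeps overall spiking low, is shown below. The dynamics, energy function, and parameters are illustrative assumptions and not the disclosed growth-transform formulation.

```python
# Toy sketch, not the disclosed GT formulation: neurons relax a simple
# quadratic energy under a bound constraint; a spike is emitted whenever
# the bound is violated, and a local rule shrinks the offending drive so
# future spiking decreases. All dynamics here are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 8
b = rng.normal(scale=2.0, size=n)   # external input (drive)
v = np.zeros(n)                     # continuous neural variables
bound, lam, lr = 1.0, 0.1, 0.2

spikes = 0
for _ in range(200):
    v -= lr * (v - b)               # gradient step on the energy 0.5*||v - b||^2
    violated = np.abs(v) > bound    # constraint violation triggers a spike
    spikes += int(violated.sum())
    v[violated] = np.sign(v[violated]) * bound  # spike response: clamp to the bound
    b[violated] *= (1.0 - lam)      # local sparsity rule: shrink the drive that
                                    # caused the spike, so spiking dies out

print("total spikes:", spikes)
print("solution:", np.round(v, 3))
```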

METHOD AND APPARATUS FOR COMPRESSION AND TRAINING OF NEURAL NETWORK
20230229894 · 2023-07-20 ·

A neural-network-based signal processing method and apparatus according to the present invention may: receive a bitstream including information about a neural network model, wherein the bitstream includes at least one neural network access unit; obtain information about the at least one neural network access unit from the bitstream; and reconstruct the neural network model on the basis of the information about the at least one neural network access unit.
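
As a rough illustration of a bitstream built from access units, the sketch below uses a made-up container in which each neural network access unit is a length-prefixed payload. The actual syntax of the disclosed bitstream differs, and every field layout here is an assumption.

```python
# Illustrative sketch only: a hypothetical container where each "neural
# network access unit" is a 4-byte length prefix followed by its payload.
import struct

def write_bitstream(units):
    # Serialize each access unit as: big-endian uint32 length + payload.
    out = bytearray()
    for payload in units:
        out += struct.pack(">I", len(payload)) + payload
    return bytes(out)

def read_bitstream(data):
    # Walk the bitstream, recovering each access unit in order.
    units, pos = [], 0
    while pos < len(data):
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        units.append(data[pos:pos + length])
        pos += length
    return units

bs = write_bitstream([b"layer0-weights", b"layer1-weights"])
print(read_bitstream(bs))  # reconstruct model pieces from the access units
```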

METHOD AND COMPUTING DEVICE FOR DETERMINING OPTIMAL PARAMETER

Provided are a method and computing device for determining an optimal parameter set. The method includes: receiving an inference model, a dataset, and a constraint; configuring a set of compression methods and a set of parameters; applying a first compression method and a first parameter related to the first compression method to the inference model through a compression pipeline; determining whether a compressed inference model is generated from the inference model through the compression pipeline; when it is determined that the compressed inference model is not generated, applying a second compression method and a second parameter to the inference model, following the first compression method; when it is determined that the compressed inference model is generated, transmitting the compressed inference model to a target device; and determining an optimal set of parameters on the basis of the performance of the compressed inference model received from the target device.
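
A hedged sketch of the described control flow follows; the function names and the toy compression methods are hypothetical stand-ins. It tries each compression method in order through a pipeline, falls through to the next method when no compressed model is produced, and otherwise scores the result as a target device would.

```python
# Sketch of the pipeline control flow (names are hypothetical): try
# methods in order; if one fails to produce a compressed model, apply
# the next; on success, evaluate on the target device and keep the
# best-performing parameters.

def run_pipeline(model, methods, params, evaluate_on_device):
    best = None
    for method, param in zip(methods, params):
        compressed = method(model, param)        # may return None on failure
        if compressed is None:
            continue                             # fall through to the next method
        score = evaluate_on_device(compressed)   # performance from target device
        if best is None or score > best[0]:
            best = (score, method.__name__, param)
    return best

# Toy stand-ins for compression methods and device feedback.
def prune(model, p):
    return None if p > 0.9 else {"size": model["size"] * (1 - p)}

def quantize(model, bits):
    return {"size": model["size"] * bits / 32}

result = run_pipeline({"size": 100.0}, [prune, quantize], [0.95, 8],
                      evaluate_on_device=lambda m: 1.0 / m["size"])
print(result)
```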

Optimization for artificial neural network model and neural processing unit
11710026 · 2023-07-25 ·

A computer-implemented apparatus, installed and executed on a computer to search for an optimal design of a neural processing unit (NPU), a hardware accelerator used for driving a computer-implemented artificial neural network (ANN), is disclosed. The NPU comprises a plurality of blocks connected in the form of a pipeline, and the number of blocks and the number of layers within each block need to be optimized to reduce the hardware resource demand and power consumption of the ANN while maintaining the inference accuracy of the ANN at an acceptable level. The computer-implemented apparatus searches for and then outputs an optimal L value and an optimal C value when a first set of candidate values for a number of layers L and a second set of candidate values for a number of channels C per layer of the ANN are provided.
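
As a simple illustration of searching candidate (L, C) pairs, the sketch below exhaustively scores each pair with assumed cost and accuracy proxies and returns the cheapest design that keeps accuracy acceptable. Both proxies are invented for the example and are not the disclosed search criteria.

```python
# Minimal sketch (illustrative, not the disclosed search): score every
# candidate (L, C) pair and return the lowest-cost pair whose accuracy
# proxy meets the acceptable level.

def search_npu_design(L_candidates, C_candidates, min_accuracy):
    def cost(L, C):       # assumed proxy for hardware resources / power
        return L * C * C
    def accuracy(L, C):   # assumed proxy: deeper/wider designs score higher
        return 1.0 - 1.0 / (L * C)
    feasible = [(cost(L, C), L, C)
                for L in L_candidates for C in C_candidates
                if accuracy(L, C) >= min_accuracy]
    return min(feasible, default=None)  # optimal (cost, L, C), or None

print(search_npu_design([2, 4, 8], [16, 32, 64], min_accuracy=0.99))
```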

INLINE DECOMPRESSION
20230223954 · 2023-07-13 ·

Techniques and apparatuses to decompress data that has been stack compressed are described. Stack compression refers to compression of data in one or more dimensions. For uncompressed data blocks that are very sparse, i.e., data blocks that contain many zeros, stack compression can be effective. In stack compression, an uncompressed data block is compressed into a compressed data block by removing one or more zero words from the uncompressed data block. Map metadata that maps the zero words of the uncompressed data block is generated during compression. Using the map metadata, the compressed data block can be decompressed to restore the uncompressed data block.
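
The zero-removal idea can be sketched in a few lines; the code below illustrates the concept and is not the actual hardware format. It drops zero words from a block, records a one-entry-per-word map, and reinserts zeros on decompression.

```python
# Concept sketch of stack compression (not the disclosed hardware format):
# compress by dropping zero words and recording a per-word map; decompress
# by reinserting zeros where the map marks them.

def stack_compress(words):
    zero_map = [w == 0 for w in words]     # map metadata: True marks a zero word
    packed = [w for w in words if w != 0]  # compressed block: nonzero words only
    return zero_map, packed

def stack_decompress(zero_map, packed):
    it = iter(packed)
    return [0 if is_zero else next(it) for is_zero in zero_map]

block = [0, 7, 0, 0, 3, 0, 9, 0]
meta, comp = stack_compress(block)
assert stack_decompress(meta, comp) == block
print(meta, comp)
```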
