Patent classifications
G06F7/50
Processing system and method for binary weight convolutional neural network
The present invention provides a processing system for a binary weight convolutional neural network. The system comprises: at least one storage unit for storing data and instructions; at least one control unit for acquiring the instructions stored in the storage unit and sending out a control signal; and at least one calculation unit for acquiring, from the storage unit, the node values of a layer in a convolutional neural network and the corresponding binary weight data, and obtaining the node values of the next layer by performing addition and subtraction operations. With the system of the present invention, the data bit width during the calculation process of a convolutional neural network is reduced, the convolutional operation speed is improved, and the required storage capacity and operational energy consumption are reduced.
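When every weight is constrained to +1 or -1, each multiply-accumulate in the convolution collapses to adding or subtracting an input node value, which is the arithmetic the abstract describes. A minimal Python sketch of that idea (hypothetical code, not taken from the patent; the function name and array shapes are assumptions):

    import numpy as np

    def binary_weight_conv2d(x, w_sign):
        """2-D valid-mode convolution whose weights are all +1 or -1.

        x:      (H, W) input feature map of node values
        w_sign: (kH, kW) weight matrix with entries in {+1, -1}
        The result uses only additions and subtractions, no multiplies.
        """
        kH, kW = w_sign.shape
        out = np.zeros((x.shape[0] - kH + 1, x.shape[1] - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = x[i:i + kH, j:j + kW]
                # +1 weights add the input value; -1 weights subtract it.
                out[i, j] = patch[w_sign > 0].sum() - patch[w_sign < 0].sum()
        return out

    x = np.arange(16.0).reshape(4, 4)
    w = np.array([[1, -1], [-1, 1]])
    print(binary_weight_conv2d(x, w))

Because each weight carries only a sign, it can be stored in a single bit, which is where the reduced data bit width and storage footprint come from.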
USING AND/OR REDUCE CARRY CHAINS ON PROGRAMMABLE HARDWARE
The present disclosure relates to a carry chain logic system that leverages the carry-in and carry-out signals of logic blocks to implement logic functions on programmable hardware (e.g., FPGA hardware). In particular, implementations of the carry chain logic system facilitate the implementation of logic gates (e.g., AND/OR gates) with a high number of input signals without incurring the routing delays caused by routing output signals between logic components implemented across different logic stages. For example, implementations described herein involve feeding carry-out signals between adders of a logic chain across multiple logic components on a common logic stage, thus reducing the routing penalties caused by routing signals via the routing fabric of the programmable hardware.
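The adder identity behind this can be modeled in a few lines (a sketch of the arithmetic only, not of the claimed FPGA circuit): over an n-bit word, adding 1 produces a carry out exactly when all bits are 1 (an n-input AND-reduce), and adding 2^n - 1 produces a carry out exactly when any bit is 1 (an n-input OR-reduce).

    def and_reduce(bits: int, n: int) -> int:
        # Carry out of (bits + 1) over n bits is 1 iff every bit is set.
        return (bits + 1) >> n

    def or_reduce(bits: int, n: int) -> int:
        # Carry out of (bits + (2**n - 1)) over n bits is 1 iff any bit is set.
        return (bits + (1 << n) - 1) >> n

    assert and_reduce(0b1111, 4) == 1
    assert and_reduce(0b1101, 4) == 0
    assert or_reduce(0b0000, 4) == 0
    assert or_reduce(0b0100, 4) == 1

On hardware, the carry out of one adder can feed the carry in of the next adder in the chain directly, which is how the disclosure keeps wide reductions off the general routing fabric.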
Computer-implemented perceptual apparatus
A method for compressing a digital representation of a stimulus includes encoding the digital representation as a feature vector within a feature space. The method also includes multiplying the feature vector with a Jacobian that maps the feature space to a non-Euclidean perceptual space according to a perceptual system that is capable of perceiving the stimulus. This multiplication generates a perceptual vector within the non-Euclidean perceptual space. The method also includes applying an update operator to the perceptual vector to move the perceptual vector in the perceptual space to an updated vector such that the updated vector has a lower entropy than the perceptual vector. The method also includes rounding the updated vector into a compressed vector that is smaller than the feature vector.
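The claimed pipeline can be sketched end to end with stand-in components (hypothetical Python; the random Jacobian and the shrink-toward-zero update operator are placeholders, since the abstract does not specify them):

    import numpy as np

    rng = np.random.default_rng(0)

    def compress(feature, jacobian, step=0.5):
        # Map the feature vector into the perceptual space.
        perceptual = jacobian @ feature
        # Stand-in update operator: shrinking toward zero lowers the
        # entropy of the rounded representation that follows.
        updated = (1.0 - step) * perceptual
        # Round to integers to obtain the compressed vector.
        return np.round(updated).astype(np.int32)

    feature = rng.normal(size=8)        # 8-D feature vector
    jacobian = rng.normal(size=(4, 8))  # maps 8-D features to 4-D percepts
    print(compress(feature, jacobian))  # 4-D integer compressed vector

Here the Jacobian maps into a lower-dimensional space, so the rounded output is smaller than the feature vector, matching the final step of the claim.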
COMPUTE IN MEMORY ARCHITECTURE AND DATAFLOWS FOR DEPTH-WISE SEPARABLE CONVOLUTION
Certain aspects of the present disclosure provide a method, including: storing a depthwise convolution kernel in a first one or more columns of a compute-in-memory (CIM) array; storing a fused convolution kernel in a second one or more columns of the CIM array; storing pre-activations in one or more input data buffers associated with a plurality of rows of the CIM array; processing the pre-activations with the depthwise convolution kernel in order to generate a depthwise output; modifying one or more of the pre-activations based on the depthwise output to generate modified pre-activations; and processing the modified pre-activations with the fused convolution kernel to generate a fused output.
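The dataflow amounts to two matrix-vector products over one array, which a short model makes concrete (hypothetical Python that treats the CIM array as a plain weight matrix; the abstract leaves the exact modification step open, so a simple scaling stands in for it):

    import numpy as np

    def cim_matvec(columns, row_inputs):
        # Each column accumulates input * stored weight down its rows,
        # mimicking the multiply-accumulate of a CIM array column.
        return row_inputs @ columns

    rows, dw_cols, fused_cols = 9, 1, 4
    rng = np.random.default_rng(1)
    cim = np.empty((rows, dw_cols + fused_cols))
    cim[:, :dw_cols] = rng.normal(size=(rows, dw_cols))     # depthwise kernel
    cim[:, dw_cols:] = rng.normal(size=(rows, fused_cols))  # fused kernel

    pre_acts = rng.normal(size=rows)                 # buffered pre-activations
    dw_out = cim_matvec(cim[:, :dw_cols], pre_acts)  # depthwise pass
    # Stand-in modification: scale the pre-activations by the depthwise
    # output before the second pass.
    modified = pre_acts * dw_out[0]
    fused_out = cim_matvec(cim[:, dw_cols:], modified)  # fused pass
    print(fused_out)

Storing both kernels in separate columns of one array lets both passes run against the same row buffers, which is the reuse the dataflow is built around.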
Accelerating binary neural networks within latch structure of non-volatile memory devices
A non-volatile memory device includes an array of non-volatile memory cells configured to store the weights of a neural network. Associated with the array is a data latch structure that includes a page buffer, which can store the weights for a layer of the neural network read out of the array, and a transfer buffer, which can store inputs for the neural network. The memory device can perform multiply-and-accumulate operations between the inputs and weights of the neural network within the latch structure, avoiding the need to transfer data out of the array and its associated latch structure for portions of an inference operation. By using binary weights and inputs, multiplication can be performed by bit-wise XNOR operations. The results can then be summed and an activation applied, all within the latch structure.
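The XNOR trick the abstract relies on is the standard one for binary networks: encode {-1, +1} as {0, 1} bits, so each product becomes a bit-wise XNOR and the accumulation becomes a population count. A minimal sketch (hypothetical Python; the latch-level mechanics are not modeled):

    def xnor_mac(inputs: int, weights: int, n: int) -> int:
        """Binary multiply-accumulate over n bit positions.

        With bits standing for {-1, +1} values, each XNOR is one
        product, and the sum rescales as 2 * popcount - n.
        """
        xnor = ~(inputs ^ weights) & ((1 << n) - 1)  # bit-wise XNOR
        popcount = bin(xnor).count("1")
        return 2 * popcount - n

    # 4 inputs vs. 4 weights: matching bits contribute +1, mismatches -1.
    print(xnor_mac(0b1010, 0b1000, 4))  # 3 matches, 1 mismatch -> 2

A threshold on this sum would then serve as the binary activation, which the abstract notes can likewise be applied without leaving the latch structure.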