ACCELERATING TABLE LOOKUPS USING A DECOUPLED LOOKUP TABLE ACCELERATOR IN A SYSTEM ON A CHIP

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two-point and two-by-two-point lookups, and per-memory-bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic, region-based data movement operations.
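
As a concrete illustration of the "two-point" lookup the abstract mentions, the sketch below performs a two-point table read with linear interpolation in plain C. The Q24.8 index format, table size, and function names are assumptions for illustration; in the patent's hardware the two adjacent entries would come from a lookup table laid out across memory banks so that both can be fetched in a single access.

/* Minimal sketch of a two-point table lookup with linear interpolation,
 * the kind of operation the abstract's "two-point" lookup support targets.
 * Table layout, fixed-point format, and function names are illustrative
 * assumptions, not the patent's actual interface. */
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 256
#define FRAC_BITS  8                /* 8 fractional bits in the Q24.8 index */

static int32_t table[TABLE_SIZE + 1];  /* +1 guard entry for index+1 reads */

/* Two-point lookup: fetch table[i] and table[i+1] and blend by the
 * fractional part of the index. */
static int32_t lookup_2pt(uint32_t q_index)
{
    uint32_t i    = q_index >> FRAC_BITS;              /* integer part    */
    uint32_t frac = q_index & ((1u << FRAC_BITS) - 1); /* fractional part */
    int32_t  a = table[i];
    int32_t  b = table[i + 1];
    /* Linear interpolation: a + (b - a) * frac / 2^FRAC_BITS */
    return a + (int32_t)(((int64_t)(b - a) * frac) >> FRAC_BITS);
}

int main(void)
{
    for (int i = 0; i <= TABLE_SIZE; i++)
        table[i] = i * i;                    /* toy table: f(x) = x^2     */
    /* x = 3.5 in Q24.8; prints 12 (12.5 truncated by the integer math). */
    printf("%d\n", lookup_2pt((3u << FRAC_BITS) | (1u << (FRAC_BITS - 1))));
    return 0;
}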

SYSTEMS, METHODS, AND APPARATUSES FOR TILE LOAD

Embodiments detailed herein relate to matrix operations, in particular the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in the form of decode circuitry to decode an instruction having fields for an opcode, a destination matrix operand identifier, and source memory information, and execution circuitry to execute the decoded instruction to load groups of strided data elements from memory into configured rows of the identified destination matrix operand.
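
A minimal software model of the described tile-load semantics, under assumed tile geometry and types (real implementations target dedicated tile registers, not a C struct): each configured row of the destination is filled from a strided group in memory.

/* Sketch: TILELOAD dst, [base + row * stride] loads one strided group
 * per configured row. Tile shape and struct layout are illustrative
 * assumptions, not the instruction's actual encoding. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define MAX_ROWS 16
#define MAX_COLS 16

typedef struct {
    uint32_t rows, cols;                   /* configured tile shape */
    int32_t  data[MAX_ROWS][MAX_COLS];
} tile_t;

static void tile_load(tile_t *dst, const int32_t *base, size_t stride_elems)
{
    for (uint32_t r = 0; r < dst->rows; r++)         /* one group per row */
        memcpy(dst->data[r], base + r * stride_elems,
               dst->cols * sizeof(int32_t));
}

int main(void)
{
    int32_t memory[64];
    for (int i = 0; i < 64; i++) memory[i] = i;

    tile_t t = { .rows = 4, .cols = 4 };
    tile_load(&t, memory, 8);              /* rows start 8 elements apart */

    printf("%d %d\n", t.data[1][0], t.data[3][3]);  /* prints: 8 27 */
    return 0;
}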

SYSTEMS, METHODS, AND APPARATUSES FOR TILE STORE

Embodiments detailed herein relate to matrix operations, in particular the storing of a matrix (tile) to memory. For example, support for a storing instruction is described in the form of decode circuitry to decode an instruction having fields for an opcode, a source matrix operand identifier, and destination memory information, and execution circuitry to execute the decoded instruction to store each data element of configured rows of the identified source matrix operand to memory based on the destination memory information.
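
The store direction is the mirror image. Continuing the previous sketch (and reusing its tile_t type), each configured row of the source operand is written to memory at the destination stride:

/* Companion to the tile_load sketch above; same illustrative assumptions. */
static void tile_store(const tile_t *src, int32_t *base, size_t stride_elems)
{
    for (uint32_t r = 0; r < src->rows; r++)      /* one group per row */
        memcpy(base + r * stride_elems, src->data[r],
               src->cols * sizeof(int32_t));
}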

OPERATIONS ON MATRIX OPERANDS IRRESPECTIVE OF WHERE OPERANDS ARE STORED IN MEMORY
20230229588 · 2023-07-20

Apparatus, systems, and techniques to transform data in memory for deep learning operations. In at least one embodiment, a compiler inserts one or more data transforms into a software program to transform one or more data elements arbitrarily arranged in memory and improve performance of one or more deep learning operations.
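
A minimal sketch of what such a compiler-inserted transform might do, with illustrative names and a hypothetical transform/kernel split: elements arbitrarily strided in memory are gathered into the dense row-major layout a matrix kernel expects.

/* Inserted transform (illustrative): gather a rows x cols operand with
 * arbitrary row/column strides (in elements) into a dense row-major
 * buffer before the deep learning kernel consumes it. */
#include <stddef.h>
#include <stdio.h>

static void gather_to_dense(const float *src, float *dst,
                            size_t rows, size_t cols,
                            size_t row_stride, size_t col_stride)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[r * cols + c] = src[r * row_stride + c * col_stride];
}

int main(void)
{
    /* A 2x3 operand stored column-major (row_stride = 1, col_stride = 2). */
    float col_major[6] = { 1, 4, 2, 5, 3, 6 };
    float dense[6];
    gather_to_dense(col_major, dense, 2, 3, 1, 2);
    for (int i = 0; i < 6; i++) printf("%g ", dense[i]);  /* 1 2 3 4 5 6 */
    printf("\n");
    return 0;
}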

COMBINED DIVIDE/SQUARE ROOT PROCESSING CIRCUITRY AND METHOD
20230017462 · 2023-01-19

An apparatus comprises combined divide/square root processing circuitry to perform, in response to a divide instruction, a given radix-64 iteration of a radix-64 divide operation, and in response to a square root instruction, a given radix-64 iteration of a radix-64 square root operation; in which: the combined divide/square root processing circuitry comprises shared circuitry to generate at least one output value for the given radix-64 iteration on a same data path used for both the radix-64 divide operation and the radix-64 square root operation.
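
The shared data path is easiest to see in the textbook digit-recurrence formulation that radix-64 divide and square root both follow; the notation below is the standard SRT form, not necessarily the patent's.

    divide:       $r_{j+1} = 64\,r_j - q_{j+1}\,d$
    square root:  $r_{j+1} = 64\,r_j - q_{j+1}\left(2Q_j + q_{j+1}\cdot 64^{-(j+1)}\right)$

Here $r_j$ is the partial remainder after iteration $j$, $q_{j+1}$ is the radix-64 result digit selected in that iteration, $d$ is the divisor, and $Q_j$ is the partial root developed so far. Both updates have the same shape, "scale the remainder by 64, then subtract the selected digit times an operation-dependent factor," which is why a single data path can host the remainder registers, digit selection, and subtraction for both operations.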

CARRY-LOOKAHEAD ADDER, SECURE ADDER AND METHOD FOR PERFORMING CARRY-LOOKAHEAD ADDITION
20230214189 · 2023-07-06

A carry-lookahead adder is provided. A first mask unit performs a first mask operation on first input data with a first mask value to obtain first masked data. A second mask unit performs a second mask operation on second input data with a second mask value to obtain second masked data. A first XOR gate receives the first and second mask values to provide a variable value. A half adder receives the first and second masked data to generate a propagation value and an intermediate generation value. A third mask unit performs a third mask operation on the propagation value with a third mask value to obtain third masked data. A carry-lookahead generator provides a carry output and a carry value according to a carry input, a generation value, and the propagation value. A second XOR gate receives the third masked data and the carry value to provide a sum output.
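
The abstract is easier to follow against the unmasked carry-lookahead recurrence it protects. The sketch below implements the plain (unmasked) generate/propagate lookahead for 4 bits in C; comments mark where the patent's mask units and variable value would intervene. The masking itself is not implemented here.

/* 4-bit carry-lookahead addition; a, b in [0,15], cin in {0,1}.
 * UNMASKED for clarity; the patent's mask operations are noted in comments. */
#include <stdio.h>

static unsigned cla4(unsigned a, unsigned b, unsigned cin)
{
    unsigned g = a & b;  /* generate:  the patent computes this from masked data */
    unsigned p = a ^ b;  /* propagate: masked p differs from true p by the       */
                         /* "variable value" (XOR of the two mask values)        */
    unsigned g0 = (g >> 0) & 1, g1 = (g >> 1) & 1, g2 = (g >> 2) & 1, g3 = (g >> 3) & 1;
    unsigned p0 = (p >> 0) & 1, p1 = (p >> 1) & 1, p2 = (p >> 2) & 1, p3 = (p >> 3) & 1;

    /* Lookahead: every carry is formed directly from cin, not by rippling. */
    unsigned c1 = g0 | (p0 & cin);
    unsigned c2 = g1 | (p1 & g0) | (p1 & p0 & cin);
    unsigned c3 = g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & cin);
    unsigned c4 = g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0)
                     | (p3 & p2 & p1 & p0 & cin);

    unsigned carries = cin | (c1 << 1) | (c2 << 2) | (c3 << 3);
    /* Sum: the patent XORs the (third-)masked propagate with the carry value. */
    return ((p ^ carries) & 0xF) | (c4 << 4);
}

int main(void)
{
    printf("%u %u\n", cla4(9, 7, 0), cla4(5, 3, 1));  /* prints: 16 9 */
    return 0;
}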

METHOD AND SYSTEM FOR ON-DEVICE INFERENCE IN A DEEP NEURAL NETWORK (DNN)

The disclosure relates to a method and system for on-device inference in a deep neural network (DNN). The method comprises: determining whether one or more layers of the DNN satisfy one of a first, a second, and a third condition, the one or more layers including one or more convolution layers and one or more resampling layers; and performing the on-device inference based on the determination, wherein performing the on-device inference comprises at least one of: optimizing the one or more convolution layers in one or more parallel branches based on the one or more layers of the DNN satisfying the first condition, optimizing at least one of the resampling layers based on the one or more layers of the DNN satisfying the second condition, and modifying operation of at least one of the resampling layers based on the one or more layers of the DNN satisfying the third condition.
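
As a structural illustration only, the three-way dispatch the method describes might look like the following. The predicates are hypothetical placeholders, since the abstract does not define the first, second, and third conditions.

/* Illustrative sketch of the per-layer dispatch; predicates and action
 * names are placeholders, not the patent's actual conditions. */
#include <stdio.h>

typedef struct {
    int is_conv;             /* layer is a convolution layer        */
    int is_resample;         /* layer is a resampling layer         */
    int in_parallel_branch;  /* layer sits inside a parallel branch */
} layer_t;

typedef enum {
    NO_CHANGE,
    OPTIMIZE_PARALLEL_CONVS, /* first condition satisfied  */
    OPTIMIZE_RESAMPLING,     /* second condition satisfied */
    MODIFY_RESAMPLING        /* third condition satisfied  */
} action_t;

static action_t plan_layer(const layer_t *l)
{
    if (l->is_conv && l->in_parallel_branch)      /* placeholder condition 1 */
        return OPTIMIZE_PARALLEL_CONVS;
    if (l->is_resample && l->in_parallel_branch)  /* placeholder condition 2 */
        return OPTIMIZE_RESAMPLING;
    if (l->is_resample)                           /* placeholder condition 3 */
        return MODIFY_RESAMPLING;
    return NO_CHANGE;
}

int main(void)
{
    layer_t upsample = { .is_resample = 1 };
    printf("%d\n", plan_layer(&upsample));   /* prints: 3 (MODIFY_RESAMPLING) */
    return 0;
}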
