Patent classifications
G06F9/3001
Discrete Three-Dimensional Processor
A discrete three-dimensional (3-D) processor comprises first and second dice. The first die comprises 3-D random-access memory (3D-RAM) arrays, whereas the second die comprises logic circuits and at least an off-die peripheral-circuit component of the 3D-RAM arrays. The first die does not comprise the off-die peripheral-circuit component of the 3D-RAM arrays.
BITWISE PRODUCT-SUM ACCUMULATIONS WITH SKIP LOGIC
A method, device, and system for performing a partial sum accumulation of a product of input vectors and weight vectors in a wordwise-input and bitwise-weight manner results in a partial accumulated product sum. The partial accumulated product sum is compared with a threshold condition after each weight bit, and when the partial accumulated product sum meets the threshold condition, a skip indicator is asserted to indicate that remaining computations of a sum accumulation are skipped.
SIMD DATA PATH ORGANIZATION TO INCREASE PROCESSING THROUGHPUT IN A SYSTEM ON A CHIP
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
REDUCED MEMORY WRITE REQUIREMENTS IN A SYSTEM ON A CHIP USING AUTOMATIC STORE PREDICATION
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
Signed multiplication using unsigned multiplier with dynamic fine-grained operand isolation
An N×N multiplier may include a N/2×N first multiplier, a N/2×N/2 second multiplier, and a N/2×N/2 third multiplier. The N×N multiplier receives two operands to multiply. The first, second and/or third multipliers are selectively disabled if an operand equals zero or has a small value. If the operands are both less than 2.sup.N/2, the second or the third multiplier are used to multiply the operands. If one operand is less than 2.sup.N/2 and the other operand is equal to or greater than 2.sup.N/2, the first multiplier is used or the second and third multipliers are used to multiply the operands. If both operands are equal to or greater than 2.sup.N/2, the first, second and third multipliers are used to multiply the operands.
Variable-length instruction buffer management
A vector processor is disclosed including a variety of variable-length instructions. Computer-implemented methods are disclosed for efficiently carrying out a variety of operations in a time-conscious, memory-efficient, and power-efficient manner. Methods for more efficiently managing a buffer by controlling the threshold based on the length of delay line instructions are disclosed. Methods for disposing multi-type and multi-size operations in hardware are disclosed. Methods for condensing look-up tables are disclosed. Methods for in-line alteration of variables are disclosed.
Neural network training mechanism
- Gokcen Cilingir ,
- Elmoustapha Ould-Ahmed-Vall ,
- Rajkishore Barik ,
- Kevin Nealis ,
- Xiaoming Chen ,
- Justin E. Gottschlich ,
- Prasoonkumar Surti ,
- Chandrasekaran Sakthivel ,
- Abhishek Appu ,
- John C. Weast ,
- Sara S. Baghsorkhi ,
- Barnan Das ,
- Narayan Biswal ,
- Stanley J. Baran ,
- Nilesh V. Shah ,
- Archie Sharma ,
- Mayuresh M. Varerkar
An apparatus to facilitate neural network (NN) training is disclosed. The apparatus includes training logic to receive one or more network constraints and train the NN by automatically determining a best network layout and parameters based on the network constraints.
Systems and methods for performing horizontal tile operations
Disclosed embodiments relate to systems and methods for performing instructions specifying horizontal tile operations. In one example, a processor includes fetch circuitry to fetch an instruction specifying a horizontal tile operation, a location of a M by N source matrix comprising K groups of elements, and locations of K destinations, wherein each of the K groups of elements comprises the same number of elements, decode circuitry to decode the fetched instruction, and execution circuitry to respond to the decoded instruction by generating K results, each result being generated by performing the specified horizontal tile operation across every element of a corresponding group of the K groups, and writing each generated result to a corresponding location of the K specified destination locations.
Extended memory operations
Systems, apparatuses, and methods related to extended memory operations are described. Extended memory operations can include operations specified by a single address and operand and may be performed by a computing device that includes a processing unit and a memory resource. The computing device can perform extended memory operations on data streamed through the computing tile without receipt of intervening commands. In an example, a computing device is configured to receive a command to perform an operation that comprises performing an operation on a data with the processing unit of the computing device and determine that an operand corresponding to the operation is stored in the memory resource. The computing device can further perform the operation using the operand stored in the memory resource.
Systems, apparatuses, and methods for controllable sine and/or cosine operations
Embodiments of systems, apparatuses, and methods for performing vector-packed controllable sine and/or cosine operations in a processor are described. For example, execution circuitry executes a decoded instruction to compute at least a real output value and an imaginary output value based on at least a cosine calculation and a sine calculation, the cosine and sine calculations each based on an index value from a packed data source operand, add the index value with an index increment value from the packed data source operand to create an updated index value, and store the real output value, the imaginary output value, and the updated index value to a packed data destination operand.