Patent classifications
G06F15/8053
VECTOR COMPUTATIONAL UNIT
A microprocessor system comprises a computational array and a vector computational unit. The computational array includes a plurality of computation units. The vector computational unit is in communication with the computational array and includes a plurality of processing elements. The processing elements are configured to receive output data elements from the computational array and process in parallel the received output data elements.
Vector table load instruction with address generation field to access table offset value
A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop.
Application specific integrated circuit accelerators
An application specific integrated circuit (ASIC) chip includes: a systolic array of cells; and multiple controllable bus lines configured to convey data among the systolic array of cells, in which the systolic array of cells is arranged in multiple tiles, each tile of the multiple tiles including 1) a corresponding sub array of cells of the systolic array of cells, 2) a corresponding subset of controllable bus lines of the multiple controllable bus lines, and 3) memory coupled to the subarray of cells.
HARDWARE ACCELERATED ANOMALY DETECTION IN A SYSTEM ON A CHIP
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
Sparse SIMD Cross-lane Processing Unit
Aspects of the disclosure are directed to a cross-lane processing unit (XPU) for performing data-dependent operations across multiple data processing lanes of a processor. Rather than implementing operation-specific circuits for each data-dependent operation, the XPU can be configured to perform different operations in response to input signals configuring individual operations performed by processing cells and crossbars arranged as a stacked network in the XPU. Each processing cell can receive and process data across multiple data processing lanes. Aspects of the disclosure include configuring the XPU to use a vector sort network to perform a duplicate count eliminating the need to configure the XPU separately for sorting and duplicate counting.
HYBRID HARDWARE ACCELERATOR AND PROGRAMMABLE ARRAY ARCHITECTURE
Techniques are disclosed for the use of a hybrid architecture that combines a programmable processing array and a hardware accelerator. The hybrid architecture dedicates the most computationally intensive blocks to the hardware accelerator, while maintaining flexibility for additional computations to be performed by the programmable processing array. An interface is also described for coupling the processing array to the hardware accelerator, which achieves a division of functionality and connects the programmable processing array components to the hardware accelerator components without sacrificing flexibility. This results in a balance between power/area and flexibility.
Method and apparatus for performing reduction operations on a set of vector elements
An apparatus and method are described for performing SIMD reduction operations. For example, one embodiment of a processor comprises: a value vector register containing a plurality of data element values to be reduced; an index vector register to store a plurality of index values indicating which values in the value vector register are associated with one another; single instruction multiple data (SIMD) reduction logic to perform reduction operations on the data element values within the value vector register by combining data element values from the value vector register which are associated with one another as indicated by the index values in the index vector register; and an accumulation vector register to store results of the reduction operations generated by the SIMD reduction logic.
Method for forming constant extensions in the same execute packet in a VLIW processor
In a very long instruction word (VLIW) central processing unit instructions are grouped into execute packets that execute in parallel. A constant may be specified or extended by bits in a constant extension instruction in the same execute packet. If an instruction includes an indication of constant extension, the decoder employs bits of a constant extension instruction to extend the constant of an immediate field. Two or more constant extension slots are permitted in each execute packet, each extending constants for a different predetermined subset of functional unit instructions. In an alternative embodiment, more than one functional unit may have constants extended from the same constant extension instruction employing the same extended bits. A long extended constant may be formed using the extension bits of two constant extension instructions.
USING A VECTOR PROCESSOR TO CONFIGURE A DIRECT MEMORY ACCESS SYSTEM FOR FEATURE TRACKING OPERATIONS IN A SYSTEM ON A CHIP
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
Simultaneous multi-processor (SiMulPro) apparatus, simultaneous transmit and receive (STAR) apparatus, DRAM interface apparatus, and associated methods
Infection by viruses and rootkits from data memory devices, data messages and data operations are rendered impossible by construction for the Simultaneous Multi-Processor (SiMulPro) cores, core modules, Programmable Execution Modules (PEM), PEM Arrays, STAR messaging protocol implementations, integrated circuits (referred to as chips herein), and systems composed of these components. Greatly improved energy efficiency is disclosed. A system implementation of an Application Specific Integrated Circuit (ASIC) communicating with a DRAM controller interacting with a DRAM array is presented with this resistance to virus and rootkit infection, and simultaneously capable of 1 Teraflop (Tflop) FP16, 1 TFlop FP32 and 1 Tflop FP64 performance while accessing 1 Tbyte of DRAM with a power budget comparable to today's desktop or notebook computers accessing 8 Gbytes of DRAM. Innovations to the STAR communication apparatus will enable the optical communication between chips to carry at least ½ Tbit/second data to and from DRAMs, and each other.