Patent classifications
G06F9/325
REDUCED MEMORY WRITE REQUIREMENTS IN A SYSTEM ON A CHIP USING AUTOMATIC STORE PREDICATION
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
EFFICIENT PROCESSING OF NESTED LOOPS FOR COMPUTING DEVICE WITH MULTIPLE CONFIGURABLE PROCESSING ELEMENTS USING MULTIPLE SPOKE COUNTS
Disclosed in some examples, are methods, systems, devices, and machine-readable mediums which provide for more efficient CGRA execution by assigning different initiation intervals to different PEs executing a same code base. The initiation intervals may be a multiple of each other and the PE with the lowest initiation interval may be used to execute instructions of the code that is to be executed at a greater frequency than other instructions than other instructions that may be assigned to PEs with higher initiation intervals.
METHOD AND APPARATUS FOR VECTOR SORTING USING VECTOR PERMUTATION LOGIC
A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, generating a control input vector for vector permutation logic comprised in the processor based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction and storing the control input vector in a storage location.
Built-in self-test for a programmable vision accelerator of a system on a chip
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
ADVANCED PROCESSOR ARCHITECTURE
The invention relates to a method for processing instructions out-of-order on a processor comprising an arrangement of execution units. The inventive method comprises: 1) looking up operand sources in a Register Positioning Table and setting operand input references of the instruction to be issued accordingly; 2) checking for an Execution Unit (EXU) available for receiving a new instruction; and 3) issuing the instruction to the available Execution Unit and enter a reference of the result register addressed by the instruction to be issued to the Execution Unit into the Register Positioning Table (RPT).
Control registers to store thread identifiers for threaded loop execution in a self-scheduling reconfigurable computing fabric
Representative apparatus, method, and system embodiments are disclosed for configurable computing. A representative system includes an interconnection network; a processor; and a plurality of configurable circuit clusters. Each configurable circuit cluster includes a plurality of configurable circuits arranged in an array; a synchronous network coupled to each configurable circuit of the array; and an asynchronous packet network coupled to each configurable circuit of the array. A representative configurable circuit includes a configurable computation circuit and a configuration memory having a first, instruction memory storing a plurality of data path configuration instructions to configure a data path of the configurable computation circuit; and a second, instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices for selection of a master synchronous input, a current data path configuration instruction, and a next data path configuration instruction for a next configurable computation circuit.
DETECTING INFINITE LOOPS IN A PROGRAMMABLE ATOMIC TRANSACTION
Disclosed in some examples are systems, methods, devices, and machine-readable mediums to detect and terminate programmable atomic transactions that are stuck in an infinite loop. In order to detect and terminate these transactions, the programmable atomic unit may use an instruction counter that increments each time an instruction is executed during execution of a programmable atomic transaction. If the instruction counter meets or exceeds a threshold instruction execution limit without reaching the termination instruction, the programmable atomic transaction may be terminated, all resources used (e.g., memory locks) may be freed, and a response may be sent to a calling processor.
METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.
METHOD AND DEVICE FOR PROVIDING A VECTOR STREAM INSTRUCTION SET ARCHITECTURE EXTENSION FOR A CPU
A method and device for providing a vector stream instruction set architecture extension for a CPU. In one aspect, there is provided a vector stream engine unit comprising: a first fast memory storage for temporarily storing data of vector data streams from a memory for loading into a vector register file; a second fast memory storage for temporarily storing data of the vector data streams from the vector register file for loading into the memory; a prefetcher configured to prefetch data of the vector data streams from the memory into the first fast storage memory, and prefetch data of the vector data streams from the vector register file into the second fast storage memory; and a stream configuration table (SCT) storing stream information for prefetching data from the vector data streams.
Processor for executing a loop acceleration instruction to start and end a loop
A processor achieving a zero-overhead loop, includes instruction stream control circuitry and loop control circuitry. The loop control circuitry includes loop address detecting circuitry and loop end determining circuitry. By combining instructions and hardware, the loop control circuitry eliminates additional control instructions required b each loop iteration and can achieve loop acceleration with zero overhead, thereby improving the loop execution efficiency.