Patent classifications
G06F15/8061
ELECTRONIC DEVICE AND PROCESSING METHOD
An electronic device includes a first processing unit and a second processing unit, a first memory unit correspondingly set for the first processing unit and configured for data access by the first processing unit, and a second memory unit correspondingly set for the second processing unit and configured for data access by the second processing unit. The first processing unit occupies at least part of storage space of the second memory unit when a first criteria is met, and/or the second processing unit occupies at least part of storage space of the first memory unit when a second criteria is met.
Lazy push strategies for vectorized D-Heaps
Techniques are provided for lazy push optimization, allowing for constant time push operations. A d-heap is used as the underlying data structure for indexing values being inserted. The d-heap is vectorized by storing values in a contiguous memory array. Heapify operations are delayed until a retrieve operation occurs, improving insert performance of vectorized d-heaps that use horizontal aggregation SIMD instructions at the cost of slightly lower retrieve performance.
Using a vector processor to configure a direct memory access system for feature tracking operations in a system on a chip
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
Method and Apparatus for Desynchronizing Execution in a Vector Processor
In one implementation a vector processor unit having preload registers for at least some of vector length, vector constant, vector address, and vector stride. Each preload register has an input and an output. All the preload register inputs are coupled to receive a new vector parameters. Each of the preload registers' outputs are coupled to a first input of a respective multiplexor, and the second input of all the respective multiplexors are coupled to the new vector parameters.
BUILT-IN SELF-TEST FOR A PROGRAMMABLE VISION ACCELERATOR OF A SYSTEM ON A CHIP
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
Programmable vision accelerator
In one embodiment of the present invention, a programmable vision accelerator enables applications to collapse multi-dimensional loops into one dimensional loops. In general, configurable components included in the programmable vision accelerator work together to facilitate such loop collapsing. The configurable elements include multi-dimensional address generators, vector units, and load/store units. Each multi-dimensional address generator generates a different address pattern. Each address pattern represents an overall addressing sequence associated with an object accessed within the collapsed loop. The vector units and the load store units provide execution functionality typically associated with multi-dimensional loops based on the address pattern. Advantageously, collapsing multi-dimensional loops in a flexible manner dramatically reduces the overhead associated with implementing a wide range of computer vision algorithms. Consequently, the overall performance of many computer vision applications may be optimized.
Asymmetrical processor memory architecture
An asymmetrical processing system is provided. The processor has a vector unit comprised of one or more computational units coupled with a vector memory space and a scalar unit coupled with a data memory space and the vector memory space, the scalar unit accessing one or more memory locations within the vector memory space.
Techniques to configure physical compute resources for workloads via circuit switching
Embodiments are generally directed apparatuses, methods, techniques and so forth to select two or more processing units of the plurality of processing units to process a workload, and configure a circuit switch to link the two or more processing units to process the workload, the two or more processing units each linked to each other via paths of communication and the circuit switch.
USING A VECTOR PROCESSOR TO CONFIGURE A DIRECT MEMORY ACCESS SYSTEM FOR FEATURE TRACKING OPERATIONS IN A SYSTEM ON A CHIP
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
Instruction and logic to provide stride-based vector load-op functionality with mask duplication
Instructions and logic provide vector load-op and/or store-op with stride functionality. Some embodiments, responsive to an instruction specifying: a set of loads, a second operation, destination register, operand register, memory address, and stride length; execution units read values in a mask register, wherein fields in the mask register correspond to stride-length multiples from the memory address to data elements in memory. A first mask value indicates the element has not been loaded from memory and a second value indicates that the element does not need to be, or has already been loaded. For each having the first value, the data element is loaded from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. Then the second operation is performed using corresponding data in the destination and operand registers to generate results. The instruction may be restarted after faults.