Patent classifications
G06F15/8007
HARDWARE ACCELERATED ANOMALY DETECTION IN A SYSTEM ON A CHIP
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two-point and two-by-two point lookups, and per-memory-bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region-based data movement operations.
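Of the features listed above, the min/max collector is the simplest to illustrate in software. Below is a minimal sketch, assuming a collector that observes each result as it leaves the datapath so no separate reduction pass over the output is needed; the names and the 32-bit saturating datapath are illustrative, not taken from the patent.

#include <cstdint>
#include <limits>

// Software model of a min/max collector: running minimum and maximum are
// captured as a side effect of the computation itself.
struct MinMaxCollector {
    int32_t min = std::numeric_limits<int32_t>::max();
    int32_t max = std::numeric_limits<int32_t>::min();
    void observe(int32_t v) {
        if (v < min) min = v;
        if (v > max) max = v;
    }
};

// Example datapath operation (hypothetical): a saturating add whose result
// is fed to the collector inline, with no extra pass over the data.
int32_t saturating_add(int32_t a, int32_t b, MinMaxCollector& mm) {
    int64_t s = static_cast<int64_t>(a) + b;   // widen to detect overflow
    if (s > std::numeric_limits<int32_t>::max()) s = std::numeric_limits<int32_t>::max();
    if (s < std::numeric_limits<int32_t>::min()) s = std::numeric_limits<int32_t>::min();
    int32_t r = static_cast<int32_t>(s);
    mm.observe(r);                             // collection happens inline
    return r;
}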
Control barrier network for reconfigurable data processors
A processing system comprises a control bus and a plurality of logic units. The control bus is configurable by configuration data to form signal routes in a control barrier network coupled to processing units in an array of processing units. The plurality of logic units has inputs and outputs connected to the control bus and to the array of processing units. A logic unit in the plurality of logic units is operatively coupled to a processing unit in the array of processing units and is configurable by the configuration data to consume source tokens and a status signal from the processing unit on the inputs and to produce barrier tokens and an enable signal on the outputs based on the source tokens and the status signal on the inputs.
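A rough software model of one such logic unit is sketched below: it consumes source tokens and a status signal on its inputs, and produces a barrier token and an enable signal once every configured source has been seen. The mask-based configuration and single-bit signals are assumptions made for illustration, not the patent's encoding.

#include <cstdint>

// One configurable logic unit of the control barrier network. Configuration
// data selects which source-token lanes must arrive before the barrier fires.
struct BarrierLogicUnit {
    uint8_t required_mask;   // configuration: source-token inputs that matter
    uint8_t arrived = 0;     // source tokens consumed so far

    void consume(uint8_t token_lane) { arrived |= (1u << token_lane); }

    // status = status signal from the coupled processing unit.
    // Outputs: a barrier token and an enable signal, produced together.
    bool step(bool status, bool& barrier_token, bool& enable) {
        bool fire = ((arrived & required_mask) == required_mask) && status;
        if (fire) arrived = 0;   // tokens are consumed when the barrier fires
        barrier_token = fire;
        enable = fire;
        return fire;
    }
};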
Use of global interactions in efficient quantum circuit constructions
The disclosure describes various aspects of techniques for using global interactions in efficient quantum circuit constructions. More specifically, this disclosure describes ways to use a global entangling operator to efficiently implement circuitry common to a selection of important quantum algorithms. The circuits may be constructed with global Ising entangling gates (e.g., global Mølmer–Sørensen gates or GMS gates) and arbitrary addressable single-qubit gates. Examples of the types of circuits that can be implemented include stabilizer circuits, Toffoli-4 gates, Toffoli-n gates, quantum Fourier transform (QFT) circuits, and quantum Fourier adder (QFA) circuits. In certain instances, the use of global operations can substantially improve the entangling gate count.
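For reference, one common convention (not necessarily the patent's) writes the n-qubit global Mølmer–Sørensen gate as a simultaneous XX (Ising) interaction applied to every pair of qubits:

\mathrm{GMS}(\theta) \;=\; \exp\!\left(-\,i\,\frac{\theta}{2}\sum_{1 \le j < k \le n} \sigma_x^{(j)}\,\sigma_x^{(k)}\right)

Because a single GMS pulse entangles all \binom{n}{2} pairs at once, a circuit that would otherwise need many pairwise entangling gates can sometimes be compressed into a few global operations plus single-qubit gates, which is the source of the entangling-gate-count savings claimed above.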
APPARATUS AND METHOD WITH PARALLEL DATA PROCESSING
An apparatus with parallel processing includes: a first processor module; and a second processor module configured to perform parallel processing in synchronization with the first processor module. The first processor module is configured to: determine first operation result data using an operation process in a first time interval; transmit the first operation result data to the second processor module; determine second operation result data using the operation process in a second time interval; and determine whether to transmit the second operation result data to the second processor module. In response to the first processor module determining not to transmit the second operation result data, the second processor module is configured to determine, using a prediction process based on the first operation result data received from the first processor module, second prediction result data corresponding to the second operation result data.
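The transmit-or-predict handshake can be sketched as follows; the send-only-on-large-change policy and the hold-last-value predictor are illustrative assumptions, since the abstract fixes neither.

#include <cstdlib>
#include <optional>

// The first module may skip sending a result it judges predictable; the
// second module then substitutes a prediction derived from the last result
// it actually received.
struct FirstModule {
    int last_sent = 0;
    std::optional<int> step(int result) {
        if (std::abs(result - last_sent) > 3) {  // assumed transmit policy
            last_sent = result;
            return result;                       // transmit the new result
        }
        return std::nullopt;                     // skip; peer will predict
    }
};

struct SecondModule {
    int last_received = 0;
    int step(std::optional<int> msg) {
        if (msg) { last_received = *msg; return *msg; }
        return last_received;  // prediction process: hold last received value
    }
};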
Method and apparatus to process SHA-2 secure hashing algorithm
A processor includes an instruction decoder to receive a first instruction to process a secure hash algorithm 2 (SHA-2) hash algorithm, the first instruction having a first operand associated with a first storage location to store a SHA-2 state and a second operand associated with a second storage location to store a plurality of messages and round constants. The processor further includes an execution unit coupled to the instruction decoder to perform one or more iterations of the SHA-2 hash algorithm on the SHA-2 state specified by the first operand and the plurality of messages and round constants specified by the second operand, in response to the first instruction.
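For concreteness, one SHA-256 round (SHA-256 being a member of the SHA-2 family) looks like the following in software, arranged so that state plays the role of the first operand and wk, a message word pre-added to its round constant, the role of the second. The real instruction may process several rounds per invocation and pack the state differently.

#include <cstdint>

static inline uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

// One SHA-256 round: state[0..7] holds the working variables a..h, and wk
// is W[t] + K[t], i.e., the message word combined with its round constant.
void sha256_round(uint32_t state[8], uint32_t wk) {
    uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
    uint32_t e = state[4], f = state[5], g = state[6], h = state[7];
    uint32_t S1  = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25);
    uint32_t ch  = (e & f) ^ (~e & g);
    uint32_t t1  = h + S1 + ch + wk;
    uint32_t S0  = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22);
    uint32_t maj = (a & b) ^ (a & c) ^ (b & c);
    uint32_t t2  = S0 + maj;
    // Rotate the working variables for the next round.
    state[7] = g; state[6] = f; state[5] = e; state[4] = d + t1;
    state[3] = c; state[2] = b; state[1] = a; state[0] = t1 + t2;
}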
RECONFIGURABLE SIMD ENGINE
An exemplary SIMD computing system comprises a SIMD processing element (SPE) configured to perform a selected operation on a portion of a processor input data word, with the operation selected by control signals read from a control memory location addressed by a decoded instruction. The SPE may comprise one or more adders, multipliers, or multiplexers coupled to the control signals. The control signals may comprise one or more bits read from the control memory. The control memory may be an M×N (M rows by N columns) memory holding M possible SIMD operations, each encoded as N control signals. Each decoded instruction may select an SPE operation from among the M rows. A plurality of SPEs may receive the same control signals. The control memory may be rewritable, advantageously permitting customizable SIMD operations that can be reconfigured by storing, in the control memory locations, control signals designed to cause the SPE to perform selected operations.
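A toy software model of the row-select scheme follows; the four operations and the two-bit encoding are invented for illustration, and a real control word would carry many more signals.

#include <array>
#include <cstdint>

// Control memory: M rows (one per SIMD operation), each holding the control
// bits that steer the processing element's datapath. Rewriting a row
// reconfigures what the corresponding opcode does.
constexpr int M = 4;
std::array<uint8_t, M> control_memory = {
    0b00,  // row 0: add
    0b01,  // row 1: subtract
    0b10,  // row 2: multiply
    0b11,  // row 3: multiplexer passes operand b
};

// A decoded instruction addresses one row; the row's bits select the
// datapath behavior for this lane of the input data word.
int32_t spe_execute(uint8_t decoded_row, int32_t a, int32_t b) {
    uint8_t ctrl = control_memory[decoded_row];  // the N control signals
    switch (ctrl) {
        case 0b00: return a + b;
        case 0b01: return a - b;
        case 0b10: return a * b;
        default:   return b;
    }
}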
ACCELERATED PROCESSING DEVICE AND METHOD OF SHARING DATA FOR MACHINE LEARNING
A processing device is provided which comprises: a plurality of compute units configured to process data; a plurality of arithmetic logic units, instantiated separate from the plurality of compute units and configured to store the data at the arithmetic logic units and perform calculations using the data; and an interconnect network connecting the arithmetic logic units and configured to provide the arithmetic logic units with shared access to the data for communication between the arithmetic logic units. The interconnect network is also configured to provide the compute units with shared access to the data for communication between the compute units.
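The sharing arrangement might be modeled as below, with each ALU holding data locally and an interconnect granting peers read access; the multiply-accumulate role of the ALUs and the four-unit topology are assumptions made for illustration.

#include <array>

// Each ALU stores data at the ALU itself rather than in a compute unit's
// local memory.
struct Alu {
    float stored = 0.0f;
    float mac(float x, float w) { stored += x * w; return stored; }
};

// The interconnect exposes any ALU's data to the other ALUs (and to the
// compute units) without a round trip through a compute unit.
struct Interconnect {
    std::array<Alu, 4>& alus;
    float read(int alu_id) const { return alus[alu_id].stored; }  // shared access
};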
Compiler-assisted inter-SIMD-group register sharing
Systems, apparatuses, and methods for efficiently sharing registers among threads are disclosed. A system includes at least a processor, control logic, and a register file with a plurality of registers. The processor assigns a base set of registers to each thread of a plurality of threads executing on the processor. When a given thread needs more than the base set of registers to execute a given phase of program code, the given thread executes an acquire instruction to acquire exclusive access to an extended set of registers from a shared register pool. When the given thread no longer needs the additional registers, the given thread executes a release instruction to release the extended set of registers back into the shared register pool for other threads to use. In one implementation, the compiler inserts acquire and release instructions into the program code based on a register liveness analysis performed during compilation.
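The acquire/release discipline can be sketched as a pool object; real hardware would track physical register ranges per SIMD group rather than a simple count, so treat this only as a model of the contract the compiler-inserted instructions enforce.

#include <mutex>
#include <optional>

// Shared register pool: threads keep their base set unconditionally and
// bracket register-hungry phases with acquire/release of an extended set.
class RegisterPool {
    int free_regs;
    std::mutex m;
public:
    explicit RegisterPool(int total) : free_regs(total) {}

    // Acquire exclusive access to `count` extra registers; fails (and the
    // thread would stall or spill) if the pool cannot satisfy the request.
    std::optional<int> acquire(int count) {
        std::lock_guard<std::mutex> lock(m);
        if (free_regs < count) return std::nullopt;
        free_regs -= count;
        return count;
    }

    // Release returns the extended set for other threads to use.
    void release(int count) {
        std::lock_guard<std::mutex> lock(m);
        free_regs += count;
    }
};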
Computer architecture with synergistic heterogeneous processors
A computer architecture employs multiple special-purpose processors having different affinities for program execution to execute substantial portions of a general-purpose program, providing improved performance relative to a general-purpose processor executing the program alone.
STREAMING ENGINE WITH FLEXIBLE STREAMING ENGINE TEMPLATE SUPPORTING DIFFERING NUMBER OF NESTED LOOPS WITH CORRESPONDING LOOP COUNTS AND LOOP OFFSETS
A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements for the nested loops. A stream head register stores the data elements next to be supplied to functional units for use as operands. A stream template specifies a loop count and loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently, enabling a trade-off between the number of loops supported and the size of the loop counts and loop dimensions.
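The loop-count/loop-dimension semantics can be illustrated with a simple address generator; the variable field widths governed by the format definition field are not modeled here, and the struct layout is an assumption.

#include <cstdint>
#include <vector>

// Each loop level carries a count and a dimension (byte stride per
// iteration), mirroring the stream template's per-loop fields.
struct LoopSpec { uint32_t count; int64_t dim; };

// Walk the nested loops innermost-first, emitting the fixed read-only
// stream of element addresses the engine would fetch.
std::vector<int64_t> stream_addresses(int64_t base, const std::vector<LoopSpec>& loops) {
    std::vector<int64_t> out;
    std::vector<uint32_t> idx(loops.size(), 0);
    for (;;) {
        int64_t addr = base;
        for (size_t l = 0; l < loops.size(); ++l) addr += idx[l] * loops[l].dim;
        out.push_back(addr);
        size_t l = 0;   // advance like an odometer, innermost loop first
        while (l < loops.size() && ++idx[l] == loops[l].count) idx[l++] = 0;
        if (l == loops.size()) break;   // all loops exhausted
    }
    return out;
}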