G06F15/8092

Systems And Methods For Systolic Array Design From A High-Level Program
20210081354 · 2021-03-18 ·

Systems and methods for automated systolic array design from a high-level program are disclosed. One implementation of a systolic array design supporting a convolutional neural network includes a two-dimensional array of reconfigurable processing elements arranged in rows and columns. Each processing element has an associated SIMD vector and is connected through a local connection to at least one other processing element. An input feature map buffer having a double buffer is configured to store input feature maps, and an interconnect system is configured to pass data to neighboring processing elements in accordance with a processing element scheduler. A CNN computation is mapped onto the two-dimensional array of reconfigurable processing elements using an automated system configured to determine suitable reconfigurable processing element parameters.

Vector processing unit

A vector processing unit is described, and includes processor units that each include multiple processing resources. The processor units are each configured to perform arithmetic operations associated with vectorized computations. The vector processing unit includes a vector memory in data communication with each of the processor units and their respective processing resources. The vector memory includes memory banks configured to store data used by each of the processor units to perform the arithmetic operations. The processor units and the vector memory are tightly coupled within an area of the vector processing unit such that data communications are exchanged at a high bandwidth based on the placement of respective processor units relative to one another, and based on the placement of the vector memory relative to each processor unit.

Static shared memory access with one piece of input data to be reused for successive execution of one instruction in a reconfigurable parallel processor

Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise an arithmetic logic unit (ALU), a data buffer associated with the ALU, and an indicator associated with the data buffer to indicate whether a piece of data inside the data buffer is to be reused for repeated execution of a same instruction as a pipeline stage.

Technologies for efficient exit from hyper dimensional space in the presence of errors
10949214 · 2021-03-16 · ·

Technologies for performing hyper-dimensional operations in memory includes a device with a memory media and a memory controller. The memory controller is configured to receive a query from a requestor and determine, in response to receiving the query, a reference hyper-dimensional vector associated with the query. The memory controller is further configured to perform a nearest neighbor search by searching columns of a stochastic associative array in the memory media to determine a number of matching bit values for each row relative to the reference hyper-dimensional vector, wherein each bit in a column of the stochastic associative array represents a bit value of a corresponding row, identify a closest matching row that has a highest number of matching bit values, and output data of the closest matching row.

DATA PROCESSING
20210042261 · 2021-02-11 ·

Data processing apparatus comprises processing circuitry to apply processing operations to one or more data items of a linear array comprising a plurality, n, of data items at respective positions in the linear array, the processing circuitry being configured to access an array of nn storage locations, where n is an integer greater than one, the processing circuitry comprising: instruction decoder circuitry to decode program instructions; and instruction processing circuitry to execute instructions decoded by the instruction decoder circuitry; wherein the instruction decoder circuitry is responsive to an array access instruction, to control the instruction processing circuitry to access, as a linear array, a set of n storage locations arranged in an array direction selected, under control of the array access instruction, from a set of candidate array directions comprising at least a first array direction and a second array direction different to the first array direction.

Processor core design optimized for machine learning applications

A computing system includes a plurality of functional units, each functional unit having one or more inputs and an output. There is a shared memory block coupled to the inputs and outputs of the plurality of functional units. There is a private memory block assigned to each of the plurality of functional units. An inter functional unit data bypass (IFUDB) block is coupled to the plurality of functional units. The IFUDB is configured to route signals between the one or more functional units without use of the shared memory block.

RECONFIGURABLE PARALLEL PROCESSING
20210019281 · 2021-01-21 ·

Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise a configuration buffer, a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs, and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.

APPARATUSES, METHODS, AND SYSTEMS FOR VECTOR PROCESSOR ARCHITECTURE HAVING AN ARRAY OF IDENTICAL CIRCUIT BLOCKS

Systems, methods, and apparatuses relating to vector processor architecture having an array of identical circuit blocks are described. In one embodiment, a processor includes a single centralized circuit comprising an instruction decoder and a controller; and a plurality of circuit slices that each comprise an arithmetic logic unit, a multiplier, a register file, a local memory, and a same plurality of logic circuits and a packed data datapath in between, wherein each circuit slice includes a physical port that provides a unique identification value that identifies a circuit slice from the other circuit slices, and the controller is to broadcast a same configuration value to the plurality of circuit slices to cause a first circuit slice to enable a first logic circuit and enable a second logic circuit of the first circuit slice based on its unique identification value and the configuration value, and cause a second circuit slice to enable a same, first logic circuit and disable a same, second logic circuit of the second circuit slice based on its unique identification value and the configuration value.

Method and system for partial wavefront merger

A method and system for partial wavefront merger is described. Vector processing machines employ the partial wavefront merger to merge partial wavefronts into one or more wavefronts. The system includes a partial wavefront manager and unified registers. The partial wavefront manager detects wavefronts in different single-instruction-multiple-data (SIMD) units which contain inactive work items and active work items (hereinafter referred to as partial wavefronts), moves the partial wavefronts into one or more SIMD unit(s) and merges the partial wavefronts into one or more wavefront(s). The unified register allows each active work item in the one or more merged wavefront(s) to access the previously allocated registers in the originating SIMD units. Consequently, the contents of the unified registers do not have to be copied to the SIMD unit(s) executing the one or merged wavefront(s).

METHOD AND SYSTEM FOR SAVING POWER IN A REAL TIME HARDWARE PROCESSING UNIT

The present invention provides an analog-digital hybrid architecture, which performs 256 multiplications and additions at a time. The system comprises 256 Processing Elements (PE) (108), which are arranged in a matrix form (16 rows and 16 columns). The digital inputs (110) are converted to analog signal (114) using digital to analog converters (DAC) (102). One PE (108) produces one analog output (115) which is nothing but the multiplication of the analog input (114) and the digital weight input (112). The implementation of PE is done by using i) capacitors and switches and ii) resistor and switches. The outputs from multiple PEs (108) in a column are connected together to produce one analog MAC output (116). In the similar manner, the system produces 16 MAC outputs (118) corresponding to 16 columns. Analog to digital converters (ADC) (104) are used to convert the analog MAC output (116) to digital form (118).