Patent classifications
G06F9/3889
WAVEFRONT SELECTION AND EXECUTION
Techniques are provided for executing wavefronts. The techniques include at a first time for issuing instructions for execution, performing first identifying, including identifying that sufficient processing resources exist to execute a first set of instructions together within a processing lane; in response to the first identifying, executing the first set of instructions together; at a second time for issuing instructions for execution, performing second identifying, including identifying that no instructions are available for which sufficient processing resources exist for execution together within the processing lane; and in response to the second identifying, executing an instruction independently of any other instruction.
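The two issue-time decisions in this abstract can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Instr` type, the `alu_slots` resource cost, the pairwise grouping, and the `select_issue` name are all assumptions introduced here.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Instr:
    name: str
    alu_slots: int  # hypothetical per-instruction resource cost


def select_issue(ready, lane_slots):
    """Pick the instructions to issue at one issue time.

    If any pair of ready instructions fits in the lane together, issue that
    pair; otherwise issue a single instruction independently.
    """
    for a, b in combinations(ready, 2):
        if a.alu_slots + b.alu_slots <= lane_slots:
            return [a, b]
    return list(ready[:1])


ready = [Instr("mul", 2), Instr("add", 1), Instr("mov", 1)]
together = select_issue(ready, lane_slots=2)  # "add" and "mov" fit together
alone = select_issue(ready, lane_slots=1)     # no pair fits, so issue one
```

With two slots available the sketch issues `add` and `mov` together; with one slot no pair fits, so it falls back to issuing `mul` on its own.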
Information processing system and method for controlling information processing system
A method for controlling an information processing system, the information processing system including multiple information processing devices coupled to each other, each of the multiple information processing devices including multiple main operation devices and multiple aggregate operation devices that are coupled to each other, the method including: acquiring, by each of the aggregate operation devices, array data items from a main operation device coupled to that aggregate operation device; determining the order of dimensions in which a process is executed and in which the information processing devices are coupled to each other; executing, for each of the dimensions in accordance with the order of the dimensions, a process of halving the array data items and distributing the array data items to information processing devices arranged in the dimension; and executing a process of transmitting, to information processing devices arranged in the dimension, operation results calculated based on data items.
ARITHMETIC LOGIC UNIT LAYOUT FOR A PROCESSOR
A processor has first, second and third ALUs. The first ALU has on a first side an input and an output. The second ALU has a first side facing the first side of the first ALU, an input and an output on the first side of the second ALU and being in a rotated orientation relative to the input and the output of the first side of the first ALU, and an output on a second side of the second ALU. The third ALU has a first side facing the second side of the second ALU, and an input and an output on the first side of the third ALU. The input of the first side of the first ALU is logically directly connected to the output of the first side of the second ALU.
NEURAL PROCESSING ACCELERATOR
A system for calculating. A scratch memory is connected to a plurality of configurable processing elements by a communication fabric including a plurality of configurable nodes. The scratch memory sends out a plurality of streams of data words. Each data word is either a configuration word used to set the configuration of a node or of a processing element, or a data word carrying an operand or a result of a calculation. Each processing element performs operations according to its current configuration and returns the results to the communication fabric, which conveys them back to the scratch memory.
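The split between configuration words and data words can be illustrated with a small software model of one processing element. This is only a sketch under stated assumptions: the `(kind, payload)` word encoding, the `OPS` table, and the `run_processing_element` name are hypothetical and do not come from the abstract.

```python
import operator

# Hypothetical set of configurations a processing element can adopt.
OPS = {"add": operator.add, "mul": operator.mul}


def run_processing_element(stream):
    """Consume a stream of (kind, payload) words and return the results.

    A "config" word changes the current configuration; a "data" word carries
    a pair of operands that is processed under the current configuration.
    """
    config = None
    results = []
    for kind, payload in stream:
        if kind == "config":
            config = payload
        elif kind == "data":
            if config is None:
                raise ValueError("operand arrived before any configuration word")
            a, b = payload
            results.append(OPS[config](a, b))
        else:
            raise ValueError(f"unknown word kind: {kind!r}")
    return results


stream = [("config", "add"), ("data", (2, 3)), ("config", "mul"), ("data", (2, 3))]
results = run_processing_element(stream)  # [5, 6]
```

The same operands produce different results depending on the most recent configuration word, which is the behavior the abstract describes.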
MEMORY-BASED DISTRIBUTED PROCESSOR ARCHITECTURE
Distributed processors and methods for compiling code for execution by distributed processors are disclosed. In one implementation, a distributed processor may include a substrate; a memory array disposed on the substrate; and a processing array disposed on the substrate. The memory array may include a plurality of discrete memory banks, and the processing array may include a plurality of processor subunits, each one of the processor subunits being associated with a corresponding, dedicated one of the plurality of discrete memory banks. The distributed processor may further include a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding, dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.
Arithmetic processing unit and control method for arithmetic processing unit
An arithmetic processing unit includes an instruction decoder, first to fourth reservation stations, first and second computing units, first and second load-store units, and an allocation unit. When the execution instruction is a first instruction that is executable in the first and second computing units but not executable in the first and second load-store units, the allocation unit allocates the first instruction to the first or second reservation station based on a first allocation table. When the execution instruction is a second instruction that is executable in the first and second load-store units but not executable in the first and second computing units, the allocation unit allocates the second instruction to the third or fourth reservation station based on a second allocation table.
SPARSE MATRIX CALCULATIONS UTILIZING TIGHTLY COUPLED MEMORY AND GATHER/SCATTER ENGINE
A processor for sparse matrix calculation can include an on-chip memory, a cache, a gather/scatter engine and a core. The on-chip memory can be configured to store a first matrix or vector, and the cache can be configured to store a compressed sparse second matrix data structure. The compressed sparse second matrix data structure can include: a value array including non-zero element values of the sparse second matrix, where each entry includes a given number of element values; and a column index array where each entry includes the given number of offsets matching the value array. The gather/scatter engine can be configured to gather element values of the first matrix or vector using the column index array of the sparse second matrix. In a horizontal implementation, the gather/scatter engine can be configured to gather sets of element values from different sub-banks within a same row based on the column index array of the sparse matrix. In a vertical implementation, the gather/scatter engine can be configured to gather sets of element values from different rows based on the column index array of the sparse matrix. In a hybrid horizontal/vertical implementation, the gather/scatter engine can be configured to gather sets of element values from sets of rows and from different sub-banks within the same rows based on the column index array of the sparse matrix. The core can be configured to perform sparse matrix-matrix multiplication or sparse-matrix-vector multiplication using the gathered elements of the first matrix or vector and the value array of the compressed sparse second matrix.
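The value array, column index array, and gather step can be sketched in software as below. This is an illustrative model only: the `row_ptr` array, the zero-padding of short entries, the entry width of 2, and the function names are assumptions added here, and the sub-bank and row layout of the hardware is not modeled.

```python
def gather(dense, offsets):
    """Gather the elements of the dense vector named by one index entry."""
    return [dense[o] for o in offsets]


def spmv(values, col_index, row_ptr, x):
    """Sparse matrix-vector product over a blocked compressed layout.

    values[e] and col_index[e] each hold the same fixed number of items;
    row_ptr (an assumption of this sketch) marks which entries belong to
    each row. Padding slots carry a value of 0 so they do not contribute.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for e in range(row_ptr[r], row_ptr[r + 1]):
            gathered = gather(x, col_index[e])
            acc += sum(v * g for v, g in zip(values[e], gathered))
        y.append(acc)
    return y


# Sparse matrix [[1, 0, 2, 0], [0, 0, 0, 3], [4, 5, 0, 6]], two values per entry.
values = [(1, 2), (3, 0), (4, 5), (6, 0)]
col_index = [(0, 2), (3, 0), (0, 1), (3, 0)]
row_ptr = [0, 1, 2, 4]
x = [1, 2, 3, 4]
y = spmv(values, col_index, row_ptr, x)  # [7.0, 12.0, 38.0]
```

Each entry of `col_index` drives one gather from the dense operand, and the matching entry of `values` supplies the multipliers, mirroring the pairing described in the abstract.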
Process scheduling in a processing system having at least one processor and shared hardware resources
A method for enabling scheduling of processes in a processing system having at least one processor and associated hardware resources, at least one of the hardware resources being shared by at least two of the processes. The method is characterized by controlling execution of a process based on a usage bound of the number of allowable accesses, by the process, to a shared hardware resource by halting execution of the process when the number of allowable accesses has been reached, and enabling idle mode or start of execution of a next process. In this way, costly hardware overprovisioning and/or the need for shutting down processor cores can be avoided. By controlling execution of a process based on a usage bound of the number of allowable accesses to a shared hardware resource, instead of simply dividing CPU time between processes, highly efficient shared-resource-based process scheduling can be achieved.
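The usage-bound rule can be sketched as a toy scheduler. This is a simplified model under stated assumptions: the list-of-operations process representation, the `bounds` mapping, the trace format, and the `run_schedule` name are hypothetical, and idle mode is not modeled.

```python
def run_schedule(processes, bounds):
    """Run each process in turn, halting it once its access bound is reached.

    processes: dict mapping a process name to a list of operations, each
               either "access" (uses the shared resource) or "compute".
    bounds:    dict mapping a process name to its allowed number of accesses.
    Returns a trace of (process, event) tuples.
    """
    trace = []
    for name, ops in processes.items():
        used = 0
        for op in ops:
            trace.append((name, op))
            if op == "access":
                used += 1
                if used >= bounds[name]:
                    # Bound reached: halt this process and move to the next.
                    trace.append((name, "halted"))
                    break
        else:
            trace.append((name, "done"))
    return trace


processes = {
    "A": ["compute", "access", "access", "compute"],
    "B": ["access", "compute"],
}
trace = run_schedule(processes, bounds={"A": 2, "B": 5})
```

Process `A` is halted after its second access and never reaches its final `compute`, while `B` stays under its bound and runs to completion.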
INSTRUCTION SCHEDULING
Apparatuses and methods for instruction scheduling in an out-of-order decoupled access-execute processor are disclosed. The instructions for the decoupled access-execute processor comprise access instructions and execute instructions, where access instructions comprise load instructions and instructions which provide operand values to load instructions. Schedule patterns of groups of linked execute instructions are monitored, where the execute instructions in a group of linked execute instructions are linked by data dependencies. On the basis of an identified repeating schedule pattern, configurable execution circuitry adopts a configuration to perform the operations defined by the group of linked execute instructions of the repeating schedule pattern.
Reconfigurable multi-thread processor for simultaneous operations on split instructions and operands
A superscalar processor has a thread mode of operation for supporting multiple instruction execution threads which are full data path wide instructions, and a micro-thread mode of operation where each thread supports two micro-threads which independently execute instructions. An executed instruction sets a micro-thread mode and an executed instruction sets the thread mode.