G06F9/3013

REGISTER FILE FOR SYSTOLIC ARRAY

A processing apparatus includes a general-purpose parallel processing engine including a set of multiple processing elements including a single precision floating-point unit, a double precision floating point unit, and an integer unit; a matrix accelerator including one or more systolic arrays; a first register file coupled with a first read control circuit, wherein the first read control circuit couples with the set of multiple processing elements and the matrix accelerator to arbitrate read requests to the first register file from the set of multiple processing elements and the matrix accelerator; and a second register file coupled with a second read control circuit, wherein the second read control circuit couples with the matrix accelerator to arbitrate read requests to the second register file from the matrix accelerator and limit access to the second register file by the set of multiple processing elements.

ULTRA-LOW-POWER AND LOW-AREA SOLUTION OF BINARY MULTIPLY-ACCUMULATE SYSTEM AND METHOD

Data structure and microcontroller architecture performing binary multiply-accumulate operations using multiple partial copies of weights. Destination-register location, source-register location, and weight-register location are received. Using the weight-register location, a sub-set of the weight bits is copied a select number of times based on a filter index value that is received. Each copy of the sub-set of weights is executed in parallel. Using the source-register location, a sub-set of the input bits is selected based on the size of the sub-set of weights, wherein the sub-set of input bits is shifted one bit from a previous sub-set of input bits. XOR operation is performed on each corresponding bit in the copy of the sub-set of weights with each corresponding bit in the selected sub-set of input bits. In a corresponding destination sub-location, output of each XOR operation is aggregated with each other and with current value of the corresponding destination sub-location.

Mapping convolution to a channel convolution engine

A processor system comprises a first and second group of registers and a hardware channel convolution processor unit. The first group of registers is configured to store data elements of channels of a portion of a convolution data matrix. Each register stores at least one data element from each channel. The second group of registers is configured to store data elements of convolution weight matrices including a separate convolution weight matrix for each channel. Each register stores at least one data element from each convolution weight matrix. The hardware channel convolution processor unit is configured to multiply each data element in the first group of registers with a corresponding data element in the second group of registers and sum together the multiplication results for each specific channel to determine corresponding channel convolution result data elements in a corresponding channel convolution result matrix.

PROGRAM EVENT RECORDING STORAGE ALTERATION PROCESSING FOR A NEURAL NEWORK ACCELERATOR INSTRUCTION
20220405120 · 2022-12-22 ·

Instruction processing is performed for an instruction. The instruction is configured to perform a plurality of functions, in which a function of the plurality of functions is to be performed in a plurality of processing phases. A processing phase is defined to store up to a select amount of data. The select amount of data is based on the function to be performed. At least one function of the plurality of functions has a different value for the select amount of data than at least one other function. A determination is made as to whether a store into a designated area occurred based on processing a select processing phase of a select function. Based on determining that the store into the designated area occurred, an interrupt is presented, and based on determining that the store into the designated area did not occur, instruction processing is continued.

System and method for instruction mapping in an out-of-order processor
11531549 · 2022-12-20 · ·

A system and corresponding method map instructions in an out-of-order (OoO) processor. The system comprises a mapper, integer snapshot circuitry, and floating-point (FP) snapshot circuitry. The mapper maps instructions by mapping integer and FP architectural registers (ARs) of the instructions to integer and FP physical registers of the OoO processor, respectively. The mapper records, via at least one present FP indicator, presence of FP ARs used as destinations in the instructions. The mapper copies, periodically, the integer mapper state to the integer snapshot circuitry and copies, intermittently, based on the at least one FP present indicator, the FP mapper state to the FP snapshot circuitry. Copies of the integer and FP mapper state in the integer and FP snapshot circuitry, respectively, improve performance for instruction unwinding caused, for example, by an exception, branch/jump mispredict, etc. By copying the FP mapper state, intermittently, power efficiency of the OoO processor is improved.

SYSTEMS AND METHODS FOR UPDATING METADATA

Systems and methods for updating metadata. In some embodiments, in response to detecting an instruction executed by a hardware system, a source location of the instruction may be identified. First metadata associated with the instruction may be used to determine whether the instruction is allowed. In response to determining that the instruction is allowed, the source location of the instruction may be associated with second metadata.

Method of secure memory addressing
11593277 · 2023-02-28 · ·

The problem to be solved is to seek an alternative to known addressing methods which provides the same or similar effects or is more secure. Solution The problem is solved by a method (40) of addressing memory in a data-processing apparatus (10) comprising, when a central processing unit (11), while performing a task (31, 32, 33, 34) of the apparatus (10), executes an instruction involving a pointer (59) into a segment (s, r, d, h, f, o, i, c) of the memory: decoding the instruction by means of an instruction decoder (12), generating a virtual address (45) within the memory by means of a safe pointer operator (41) operating on the pointer (59), augmenting the virtual address (45) by an identifier (43) of the task (31, 32, 33, 34) and an identifier (44) of the segment (s, r, d, h, f, o, i, c), said identifiers (43, 44) being hardware-controlled (42), and, based on the augmented address (45), dereferencing the pointer (59) via a memory management unit (13).

System and method for instruction unwinding in an out-of-order processor
11593116 · 2023-02-28 · ·

A system and corresponding method unwind instructions in an out-of-order (OoO) processor. The system comprises a mapper. In response to a restart event causing at least one instruction to be unwound, the mapper restores a present integer mapper state and present floating-point (FP) mapper state, used for mapping instructions, to a former integer mapper state and former FP mapper state, respectively. The mapper stores integer snapshots and FP snapshots of the present integer and FP mapper state, respectively, to expedite restoration to the former integer and FP mapper state, respectively. Access to the FP snapshots is blocked, intermittently, as a function of at least one FP present indicator used by the mapper to record presence of FP registers used as destinations in the instructions. Blocking the access, intermittently, improves power efficiency of the OoO processor.

True/false vector index registers and methods of populating thereof
11507374 · 2022-11-22 · ·

Disclosed herein are vector index registers for storing or loading indexes of true and/or false results of comparison operations in vector processors. Each of the vector index registers store multiple addresses for accessing multiple positions in operand vectors.

STREAMING ENGINE WITH FLEXIBLE STREAMING ENGINE TEMPLATE SUPPORTING DIFFERING NUMBER OF NESTED LOOPS WITH CORRESPONDING LOOP COUNTS AND LOOP OFFSETS
20230053842 · 2023-02-23 ·

A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces address of data elements for the nested loops. A steam head register stores data elements next to be supplied to functional units for use as operands. A stream template specifies loop count and loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently enabling trade off between the number of loops supported and the size of the loop counts and loop dimensions.