G06F9/3557

Program counter (PC)-relative load and store addressing for fused instructions

Load store addressing can include a processor, which fuses two consecutive instruction determined to be prefix instructions and treats the two instructions as a single fused instruction. The prefix instruction of the fused instruction is auto-finished at dispatch time in an issue unit of the processor. A suffix instruction of the fused instruction and its fields and the prefix instruction's fields are issued from an issue queue of the issue unit, wherein an opcode of the suffix instruction is issued to a load store unit of the processor, and fields of the fused instruction are issued to the execution unit of the processor. The execution unit forms operands of the suffix instruction, at least one operand formed based on a current instruction address of the single fused instruction. The load store unit executes the suffix instruction using the operands formed by the execution unit.

UNIVERSAL FLOATING-POINT INSTRUCTION SET ARCHITECTURE FOR COMPUTING DIRECTLY WITH DECIMAL CHARACTER SEQUENCES AND BINARY FORMATS IN ANY COMBINATION
20220156070 · 2022-05-19 ·

A universal floating-point Instruction Set Architecture (ISA) compute engine implemented entirely in hardware. The ISA compute engine computes directly with human-readable decimal character sequence floating-point representation operands without first having to explicitly perform a conversion-to-binary-format process in software. A fully pipelined convertToBinaryFromDecimalCharacter hardware operator logic circuit converts one or more human-readable decimal character sequence floating-point representations to IEEE 754-2008 binary floating-point representations every clock cycle. Following computations by at least one hardware floating-point operator, a convertToDecimalCharacterFromBinary hardware conversion circuit converts the result back to a human-readable decimal character sequence floating-point representation.

Fully pipelined binary conversion hardware operator logic circuit
20220113968 · 2022-04-14 ·

A universal floating-point Instruction Set Architecture (ISA) implemented entirely in hardware. Using a single instruction, the universal floating-point ISA has the ability, in hardware, to compute directly with dual decimal character sequences up to IEEE 754-2008 “H=20” in length, without first having to explicitly perform a conversion-to-binary-format process in software before computing with these human-readable floating-point or integer representations. The ISA does not employ opcodes, but rather pushes and pulls “gobs” of data without the encumbering opcode fetch, decode, and execute bottleneck. Instead, the ISA employs stand-alone, memory-mapped operators, complete with their own pipeline that is completely decoupled from the processor's primary push-pull pipeline. The ISA employs special three-port, 1024-bit wide SRAMS; a special dual asymmetric system stack; memory-mapped stand-alone hardware operators with private result buffers having simultaneously readable side-A and side-B read ports; and dual hardware H=20 convertFromDecimalCharacter conversion operators.

Universal floating-point instruction set architecture for computing directly with decimal character sequences and binary formats in any combination
11275584 · 2022-03-15 ·

A universal floating-point Instruction Set Architecture (ISA) implemented entirely in hardware. Using a single instruction, the universal floating-point ISA has the ability, in hardware, to compute directly with dual decimal character sequences up to IEEE 754-2008 “H=20” in length, without first having to explicitly perform a conversion-to-binary-format process in software before computing with these human-readable floating-point or integer representations. The ISA does not employ opcodes, but rather pushes and pulls “gobs” of data without the encumbering opcode fetch, decode, and execute bottleneck. Instead, the ISA employs stand-alone, memory-mapped operators, complete with their own pipeline that is completely decoupled from the processor's primary push-pull pipeline. The ISA employs special three-port, 1024-bit wide SRAMS; a special dual asymmetric system stack; memory-mapped stand-alone hardware operators with private result buffers having simultaneously readable side-A and side-B read ports; and dual hardware H=20 convertFromDecimalCharacter conversion operators.

Block-based processor core composition register

Systems, apparatuses, and methods related to a block-based processor core composition register are disclosed. In one example of the disclosed technology, a processor can include a plurality of block-based processor cores for executing a program including a plurality of instruction blocks. A respective block-based processor core can include one or more sharable resources and a programmable composition control register. The programmable composition control register can be used to configure which resources of the one or more sharable resources are shared with other processor cores of the plurality of processor cores.

Apparatus and methods for matrix multiplication

Aspects for matrix multiplication in neural network are described herein. The aspects may include a controller unit configured to receive a matrix-multiply-matrix (MM) instruction that includes a first starting address of a first matrix, a first size of the first matrix, a second starting address of a second matrix, and a second size of the second matrix; multiple computation modules configured to respectively multiply, in response to the MM instruction, row vectors of the first matrix with column vectors of the second matrix to generate one or more result elements; and an interconnection unit configured to combine the result elements to generate one or more row vectors of a result matrix.

MEMORY APPARATUS WITH A BUILT-IN LOGIC OPERATION UNIT AND OPERATION METHOD THEREOF
20210240483 · 2021-08-05 · ·

A memory apparatus and an operation method thereof are provided. The memory apparatus includes a mode configuration register, a system memory array, a pointer and an arithmetic circuit including logic operation units. The mode configuration register stores weight matrix information and a base address. The system memory array stores feature values in a feature map from the base address according to the weight matrix information. The pointer stores the base address and a weight matrix size to provide pointer information. The arithmetic circuit sequentially or parallelly reads the feature values according to the pointer information. The arithmetic circuit parallelly arranges weight coefficients of a selected weight matrix and the corresponding feature values in each of the corresponding logic operation units according to the weight matrix information, and causes the logic operation units to perform computing operations parallelly to output intermediate layer feature values to an external processing unit.

DELIVERING IMMEDIATE VALUES BY USING PROGRAM COUNTER (PC)-RELATIVE LOAD INSTRUCTIONS TO FETCH LITERAL DATA IN PROCESSOR-BASED DEVICES
20210271480 · 2021-09-02 ·

Delivering immediate values by using program counter (PC)-relative load instructions to fetch literal data in processor-based devices is disclosed. In this regard, a processing element (PE) of a processor-based device provides an execution pipeline circuit that comprises an instruction processing portion and a data access portion. Using a literal data access logic circuit, the PE detects a PC-relative load instruction within a fetch window that includes multiple fetched instructions. The PE determines that the PC-relative load instruction can be serviced using literal data that is available to the instruction processing portion of the execution pipeline circuit (e.g., located within the fetch window containing the PC-relative load instruction, or stored in a literal pool buffer), The PE then retrieves the literal data within the instruction processing portion of the execution pipeline circuit, and executes the PC-relative load instruction using the literal data.

System and method for addressing data in memory

A digital signal processor having a CPU with a program counter register and, optionally, an event context stack pointer register for saving and restoring the event handler context when higher priority event preempts a lower priority event handler. The CPU is configured to use a minimized set of addressing modes that includes using the event context stack pointer register and program counter register to compute an address for storing data in memory. The CPU may also eliminate post-decrement, pre-increment and post-decrement addressing and rely only on post-increment addressing.

Apparatus and method for interpreting permissions associated with a capability
11023237 · 2021-06-01 · ·

An apparatus and method are provided for interpreting permissions associated with a capability. The apparatus has processing circuitry for executing instructions in order to perform operations, and a capability storage element accessible to the processing circuitry and arranged to store a capability used to constrain at least one operation performed by the processing circuitry when executing the instructions. The capability identifies a plurality N of default permissions whose state, in accordance with a default interpretation, is determined from N permission flags provided in the capability. In accordance with the default interpretation, each permission flag is associated with one of the default permissions. The processing circuitry is then arranged to analyse the capability in accordance with an alternative interpretation, in order to derive, from logical combinations of the N permission flags, state for an enlarged set of permissions, where the enlarged set comprises at least N+1 permissions. This provides a mechanism for encoding additional permissions into capabilities without increasing the number of permission flags required, whilst still retaining desirable behaviour.