G06F9/3854

HIGH PERFORMANCE PROCESSOR SYSTEM AND METHOD BASED ON GENERAL PURPOSE UNITS
20190138313 · 2019-05-09 ·

This invention provides a high performance processor system and a method based on a common general purpose unit, it may be configured into a variety of different processor architectures; before the processor executes instructions, the instruction is filled into the instruction read buffer, which is directly accessed by the processor core, then instruction read buffer actively provides instructions to processor core to execute, achieving a high cache hit rate.

SYSTEM AND METHOD OF VLIW INSTRUCTION PROCESSING USING REDUCED-WIDTH VLIW PROCESSOR

Very long instruction word (VLIW) instruction processing using a reduced-width processor is disclosed. In a particular embodiment, a VLIW processor includes a control circuit configured to receive a VLIW packet that includes a first number of instructions and to distribute the instructions to a second number of instruction execution paths. The first number is greater than the second number. The VLIW processor also includes physical registers configured to store results of executing the instructions and a register renaming circuit that is coupled to the control circuit.

Neural network unit with output buffer feedback and masking capability

An output buffer holds N words arranged as N/J mutually exclusive output buffer word groups (OBWG) of J words each. N processing units (PU) are arranged as N/J mutually exclusive PU groups each having an associated OBWG. Each PU has an accumulator, an arithmetic unit, and first and second multiplexed registers each having at least J+1 inputs and an output. A first input receives a memory operand and the other J inputs receive the J words of the associated OBWG. Each accumulator provides its output to a respective output buffer word. Each arithmetic unit performs an operation on the first and second multiplexed register outputs and the accumulator output to generate a result for accumulation into the accumulator. A mask input to the output buffer controls which words, if any, of the N words retain their current value or are updated with their respective accumulator output.

EFFICIENT MANAGEMENT OF SCRATCH REGISTERS

Embodiments of the invention are directed to methods for handling scratch registers in a processor. The method includes receiving a cracked instruction in an instruction dispatch unit of the processor. The method further includes decoding the cracked instruction into a group of micro-operations. Based on a determination that the instruction group uses a scratch register, determining if the scratch register is used in other groups of micro-operations. Based on a determination that the scratch register is not used in other instruction groups, allocating a physical register for use as the scratch register without creating a mapper entry for the scratch register.

Tri-configuration neural network unit

A neural network unit configurable to first/second/third configurations has N narrow and N wide accumulators, multipliers and adders. Each multiplier performs a narrow/wide multiply on first and second narrow/wide inputs to generate a narrow/wide product. A first adder input receives a corresponding narrow/wide accumulator's output and third input receives a widened corresponding narrow multiplier's narrow product in the third configuration. In the first configuration, each narrow/wide adder performs a narrow/wide addition on the first and second inputs to generate a narrow/wide sum for storage into the corresponding narrow/wide accumulator. In the second configuration, each wide adder performs a wide addition on the first and a second input to generate a wide sum for storage into the corresponding wide accumulator. In the third configuration, each wide adder performs a wide addition on the first, second and third inputs to generate a wide sum for storage into the corresponding wide accumulator.

Processor with architectural neural network execution unit

A processor has an instruction fetch unit that fetches ISA instructions from memory and execution units that perform operations on instruction operands to generate results according to the processor's ISA. A hardware neural network unit (NNU) execution unit performs computations associated with artificial neural networks (ANN). The NNU has an array of ALUs, a first memory that holds data words associated with ANN neuron outputs, and a second memory that holds weight words associated with connections between ANN neurons. Each ALU multiplies a portion of the data words by a portion of the weight words to generate products and accumulates the products in an accumulator as an accumulated value. Activation function units normalize the accumulated values to generate outputs associated with ANN neurons. The ISA includes at least one instruction that instructs the processor to write data words and the weight words to the respective first and second memories.

Load/store unit for a processor, and applications thereof

A load/store unit for a processor, and applications thereof. In an embodiment, the load/store unit includes a load/store queue configured to store information and data associated with a particular class of instructions. Data stored in the load/store queue can be bypassed to dependent instructions. When an instruction belonging to The particular class of instructions graduates and the instruction is associated with a cache miss, control logic causes a pointer to be stored in a load/store graduation buffer that points to an entry in the load/store queue associated with the instruction. The load/store graduation buffer ensures that graduated instructions access a shared resource of the load/store unit in program order.

Memory sequencing with coherent and non-coherent sub-systems

Operations associated with a memory and operations associated with one or more functional units may be received. A dependency between the operations associated with the memory and the operations associated with one or more of the functional units may be determined. A first ordering may be created for the operations associated with the memory. Furthermore, a second ordering may be created for the operations associated with one or more of the functional units based on the determined dependency and the first operating of the operations associated with the memory.

PROCESSORS, METHODS, AND SYSTEMS WITH A CONFIGURABLE SPATIAL ACCELERATOR HAVING A SEQUENCER DATAFLOW OPERATOR

Systems, methods, and apparatuses relating to a sequencer dataflow operator of a configurable spatial accelerator are described. In one embodiment, an interconnect network between a plurality of processing elements receives an input of a dataflow graph comprising a plurality of nodes forming a loop construct, wherein the dataflow graph is overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements and at least one dataflow operator controlled by a sequencer dataflow operator of the plurality of processing elements, and the plurality of processing elements is to perform an operation when an incoming operand set arrives at the plurality of processing elements and the sequencer dataflow operator generates control signals for the at least one dataflow operator in the plurality of processing elements.

Managing an effective address table in a multi-slice processor

Methods and apparatus for managing an effective address table (EAT) in a multi-slice processor including receiving, from an instruction sequence unit, a next-to-complete instruction tag (ITAG); obtaining, from the EAT, a first ITAG from a tail-plus-one EAT row, wherein the EAT comprises a tail EAT row that precedes the tail-plus-one EAT row; determining, based on a comparison of the next-to-complete ITAG and the first ITAG, that the tail EAT row has completed; and retiring the tail EAT row based on the determination.