G06F9/30079

Inserting a proxy read instruction in an instruction pipeline in a processor

Inserting a proxy read instruction in an instruction pipeline in a processor is disclosed. A scheduler circuit is configured to recognize when a produced value generated by execution of a producer instruction in the instruction pipeline will not be available through a data forwarding path to be consumed for processing of a subsequent consumer instruction. In this case, the scheduling circuit is configured to insert a proxy read instruction in the instruction pipeline to cause execution of an operation to generate the same produced value as was generated by previous execution of producer instruction in the instruction pipeline. Thus, the produced value will remain available in the instruction pipeline to again be available through a data forwarding path to an earlier stage of the instruction pipeline to be consumed by a consumer instruction, which may avoid a pipeline stall.

Calculation method and related product

The present disclosure provides a computing method that is applied to a computing device. The computing device includes: a memory, a register unit, and a matrix computing unit. The method includes the following steps: controlling, by the computing device, the matrix computing unit to obtain a first operation instruction, where the first operation instruction includes a matrix reading instruction for a matrix required for executing the instruction; controlling, by the computing device, an operating unit to send a reading command to the memory according to the matrix reading instruction; and controlling, by the computing device, the operating unit to read a matrix corresponding to the matrix reading instruction in a batch reading manner, and executing the first operation instruction on the matrix. The technical solutions in the present disclosure have the advantages of fast computing speed and high efficiency.

MULTIPLY-ACCUMULATE WITH VARIABLE FLOATING POINT PRECISION

An integrated circuit including a multiplier-accumulator execution pipeline including a plurality of multiplier-accumulator circuits to, in operation, perform multiply and accumulate operations, wherein each multiplier-accumulator circuit includes: (i) a multiplier to multiply first input data, having a first floating point data format, by a filter weight data, having the first floating point data format, and generate and output a product data having a second floating point data format, and (ii) an accumulator, coupled to the multiplier of the associated MAC circuit, to add second input data and the product data output by the associated multiplier to generate sum data. The plurality of multiplier-accumulator circuits of the multiplier-accumulator execution pipeline may be connected in series and, in operation, perform a plurality of concatenated multiply and accumulate operations.

Hardware-implemented universal floating-point instruction set architecture for computing directly with human-readable decimal character sequence floating-point representation operands
11635957 · 2023-04-25 ·

A universal floating-point Instruction Set Architecture (ISA) compute engine implemented entirely in hardware. The ISA compute engine computes directly with human-readable decimal character sequence floating-point representation operands without first having to explicitly perform a conversion-to-binary-format process in software. A fully pipelined convertToBinaryFromDecimalCharacter hardware operator logic circuit converts one or more human-readable decimal character sequence floating-point representations to IEEE 754-2008 binary floating-point representations every clock cycle. Following computations by at least one hardware floating-point operator, a convertToDecimalCharacterFromBinary hardware conversion circuit converts the result back to a human-readable decimal character sequence floating-point representation.

Methods, systems, and computer readable media for on-demand, on-device compiling and use of programmable pipeline device profiles

A method for on-demand, on-device compiling and use of programmable pipeline device profiles includes storing, on a network test or visibility device, programmable pipeline device source code and a plurality of different programmable pipeline device profile definitions containing parameters for implementing different programmable pipeline device profile variations. The method further include implementing, on the network test or visibility device, a compiler that receives the programmable pipeline device source code and one of the profile definitions as input and that produces as output a programmable pipeline device profile including compiled object code for configuring a programmable pipeline device to implement a network test or network visibility function. The method further includes invoking the compiler to compile, using one of the profile definitions, the programmable pipeline device source code into a programmable pipeline device profile for implementing a network test or visibility function and loading the profile on the network test or visibility device to configure the programmable pipeline device for implementing the network test or network visibility function.

SUPPORTING LARGE-WORD OPERATIONS IN A REDUCED INSTRUCTION SET COMPUTER ("RISC") PROCESSOR

A Reduced Instruction Set Computer (“RISC”) supporting large-word operations in a computing environment is disclosed. In one implementation, in response to receiving one or more control signals from a central processing unit (“CPU”), a set of operations are executed on a state of a special purpose execution unit (“SPU”) having a plurality of SPU registers, the SPU being associated with the CPU and the state of the SPU having word widths of one or more of the plurality of registers being greater in size than word widths of a plurality of CPU registers of a computing system and a set of state-master bits to synchronize the state of the SPU and a state of the CPU. The results of the set of operations are stored in the plurality of CPU registers or an alternative set of the plurality of SPU registers.

TILE-BASED RESULT BUFFERING IN MEMORY-COMPUTE SYSTEMS
20230067771 · 2023-03-02 ·

A reconfigurable compute fabric can include multiple nodes, and each node can include multiple tiles with respective processing and storage elements. A first tile in a first node can include a processor with a processor output and a first register network configured to receive information from the processor output and information from one or more of the multiple other tiles in the first node. In response to an output instruction and a delay instruction, the register network can provide an output signal to one of the multiple other tiles in the first node. Based on the output instruction, the output signal can include one or the other of the information from the processor output and the information from one or more of the multiple other tiles in the first node. A timing characteristic of the output signal can depend on the delay instruction.

Counters For Ensuring Transactional Ordering in I/O Agent

Techniques are disclosed relating to an I/O agent circuit. The I/O agent circuit may include a transaction pipeline and a pool of counters. The I/O agent circuit may initialize a first counter included in the pool of counters with an initial counter value. The I/O agent circuit may assign the first counter to a specific transaction type. The I/O agent circuit may increment the first counter as a part of allocating a transaction of a transaction type included in a set of transaction types different than the specific transaction type. Based on receiving a transaction request to process a first transaction of the specific transaction type, the I/O agent circuit may bind the first transaction to the first counter. The I/O agent circuit may issue the first transaction to the transaction pipeline based on a counter value stored by the first counter matching the initial counter value.

IC including logic tile, having reconfigurable MAC pipeline, and reconfigurable memory
11663016 · 2023-05-30 · ·

An integrated circuit including configurable multiplier-accumulator circuitry, wherein, during processing operations, a plurality of the multiplier-accumulator circuits are serially connected into pipelines to perform concatenated multiply and accumulate operations. The integrated circuit includes a first memory and a second memory, and a switch interconnect network, including configurable multiplexers arranged in a plurality of switch matrices. The first and second memories are configurable as either a dedicated read memory or a dedicated write memory and connected to a given pipeline, via the switch interconnect network, during a processing operation performed thereby; wherein, during a first processing operations, the first memory is dedicated to write data to a first pipeline and the second memory is dedicated to read data therefrom and, during a second processing operation, the first memory is dedicated to read data from a second pipeline and the second memory is dedicated to write data thereto.

System and method for providing granular processor performance control

A basic input/output system provides an interface for a core aggregation layout that identifies a grouping of processor cores into core aggregations, wherein each of the core aggregations is associated with a maximum allowable C-state. A processor may monitor an information handling system during operation of an application to gather data associated with latency sensitivity of the application, update the core aggregation layout based on the data gathered during the operation of the application, and pin a thread for execution to one of the processor cores based on the latency sensitivity of the application and the maximum allowable C-state.