G06F9/3818

THREAD FORWARD PROGRESS AND/OR QUALITY OF SERVICE
20230034933 · 2023-02-02

Methods, systems, and apparatuses support thread forward progress in a processing system and improve quality of service. One system includes a processor; a bus coupled to the processor; a memory coupled to the processor via the bus; and a floating point unit coupled to the processor via the bus, wherein the floating point unit comprises hardware control logic operative to: store, by a scheduler of the floating point unit, a counter for each thread; increase, by the scheduler, the value of the counter corresponding to a thread when at least one source-ready operation exists for the thread; compare, by the scheduler, the value of the counter to a predetermined threshold; and make other threads ineligible to be picked by the scheduler when the counter is greater than or equal to the predetermined threshold.
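
The counter mechanism in this abstract can be illustrated with a minimal software sketch. It is not the patented hardware logic. The class name, the fixed-priority default pick, and the reset of a counter when its thread is picked are all assumptions made for illustration; the abstract specifies only the per-thread counter, the increment condition, the threshold comparison, and the ineligibility rule.

```python
class ForwardProgressScheduler:
    """Toy model of a picker with one starvation counter per thread."""

    def __init__(self, num_threads, threshold):
        self.threshold = threshold
        self.counters = [0] * num_threads  # one counter stored per thread

    def pick(self, ready):
        """Return the thread picked this cycle, or None if nothing is ready.

        ready[t] is True when thread t has at least one source-ready operation.
        """
        threads = range(len(self.counters))
        # Increase the counter of every thread that has a source-ready operation.
        for t in threads:
            if ready[t]:
                self.counters[t] += 1
        # Compare each counter to the predetermined threshold.
        starved = [t for t in threads
                   if ready[t] and self.counters[t] >= self.threshold]
        # When a counter has reached the threshold, other threads are ineligible.
        eligible = starved if starved else [t for t in threads if ready[t]]
        if not eligible:
            return None
        picked = eligible[0]       # assumption: fixed priority, lowest id first
        self.counters[picked] = 0  # assumption: a pick resets that counter
        return picked
```

With two threads that are always ready and a threshold of 3, six calls to `pick([True, True])` return 0, 0, 1, 0, 0, 1: the fixed-priority pick would otherwise starve thread 1, and the counter forces it through every third cycle.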

Techniques for configuring a processor to execute instructions efficiently

Systems and techniques for improving the performance of circuits while adapting to dynamic voltage drops caused by the execution of noisy instructions (e.g., high power consuming instructions) are provided. The performance is improved by selectively slowing down the frequency of operation for types of noisy instructions. An example technique controls a clock by detecting an instruction of a predetermined noisy type that is predicted to have a predefined noise characteristic (e.g., a high level of noise generated on the voltage rails of a circuit due to the greater amount of current drawn by the instruction), and, responsive to the detecting, decreasing a frequency of the clock. The detecting occurs before execution of the instruction. Changing the frequency in accordance with instruction type enables the circuits to be operated at high frequencies even if some of the workloads include instructions for which the frequency of operation is slowed down.

METHOD AND APPARATUS FOR IMPLIED BIT HANDLING IN FLOATING POINT MULTIPLICATION
20230085048 · 2023-03-16

A method is provided that includes performing, by a processor in response to a floating point multiply instruction, multiplication of floating point numbers, wherein determination of values of implied bits of leading bit encoded mantissas of the floating point numbers is performed in parallel with multiplication of the encoded mantissas, and storing, by the processor, a result of the floating point multiply instruction in a storage location indicated by the floating point multiply instruction.
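
One way to see why the implied bits need not be known before the mantissa multiplication starts is to expand the significand product algebraically. The sketch below assumes an IEEE-754-style encoding in which the implied bit is 1 for a nonzero exponent field and 0 otherwise; the function names and the 23-bit fraction width are illustrative assumptions and are not taken from the abstract. With fraction width F, (ia·2^F + fa)(ib·2^F + fb) = fa·fb + (ia·fb + ib·fa)·2^F + ia·ib·2^(2F), so the fa·fb term is independent of the implied bits.

```python
FRAC_BITS = 23  # assumption: single-precision-style fraction width


def implied_bit(exp_field):
    """Implied leading bit: 1 for normal numbers, 0 when the exponent field is 0."""
    return 1 if exp_field != 0 else 0


def significand_product(exp_a, frac_a, exp_b, frac_b, frac_bits=FRAC_BITS):
    """Multiply two significands without first attaching the implied bits."""
    # This product uses only the encoded mantissas, so in hardware it could
    # proceed while the implied bits are still being determined.
    partial = frac_a * frac_b
    # Implied bits, derived independently from the exponent fields.
    ia = implied_bit(exp_a)
    ib = implied_bit(exp_b)
    # Correction terms that account for the implied bits afterwards.
    correction = ((ia * frac_b + ib * frac_a) << frac_bits) \
        + ((ia & ib) << (2 * frac_bits))
    return partial + correction
```

The result equals the product of the full significands `(ia << frac_bits) | frac_a` and `(ib << frac_bits) | frac_b`, which is what a design that resolves the implied bits first would compute.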

Object-oriented memory for client-to-client communications
11635992 · 2023-04-25

Systems and corresponding methods employ an object-oriented (OO) memory (OOM) to effect inter-hardware-client (IHC) communication among a plurality of hardware clients included in the system. One system comprises a centralized OOM device and the plurality of hardware clients, which communicate directly with the centralized OOM device via OO message transactions. The centralized OOM device effects IHC communication among the plurality of hardware clients based on the OO message transactions. Another system comprises a plurality of OO memories (OOMs) capable of inter-object-oriented-memory-device communication. A hardware client communicates directly with a respective OOM device via OO message transactions. The inter-object-oriented-memory-device communication effects IHC communication among the plurality of hardware clients based on the OO message transactions.

Homogenizing data sparsity using a butterfly multiplexer

A data-sparsity homogenizer includes a plurality of multiplexers and a controller. The plurality of multiplexers receives 2^N bit streams of non-homogeneous sparse data in which non-zero value data is clumped together. The plurality of multiplexers is arranged in 2^N rows and N columns. Each input of a multiplexer in a first column receives a respective bit stream of the 2^N bit streams of non-homogeneous sparse data, and the multiplexers in a last column output 2^N bit streams of sparse data that is more homogeneous than the non-homogeneous sparse data of the 2^N input bit streams. The controller controls the plurality of multiplexers so that the multiplexers in the last column output the 2^N channels of bit streams of sparse data that is more homogeneous than the non-homogeneous sparse data of the 2^N input bit streams.
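
The row-and-column arrangement can be modeled in software for a single time step. The sketch assumes a classic butterfly wiring in which the multiplexer at column c, row r is a 2-to-1 multiplexer choosing between row r and its partner row r XOR 2^c; the abstract does not state the wiring or the controller policy, so the function name, the select-bit layout, and the example select pattern are hypothetical.

```python
def butterfly_route(values, selects):
    """Route one time step of 2**N lane values through N multiplexer columns.

    values:  list of length 2**N, one value per row (lane).
    selects: selects[c][r] is True when the mux at column c, row r takes the
             value from its partner row r ^ (1 << c) instead of its own row.
    """
    n_rows = len(values)
    if n_rows < 1 or n_rows & (n_rows - 1):
        raise ValueError("number of rows must be a power of two")
    n_cols = n_rows.bit_length() - 1
    if len(selects) != n_cols:
        raise ValueError("need one list of select bits per column")
    current = list(values)
    for c in range(n_cols):
        # Every mux in column c picks its own row or its butterfly partner.
        current = [current[r ^ (1 << c)] if selects[c][r] else current[r]
                   for r in range(n_rows)]
    return current
```

For example, with N = 2 and non-zero values clumped in rows 0 and 1, `butterfly_route([1, 2, 0, 0], [[True, True, False, False], [True, False, True, False]])` returns `[0, 1, 2, 0]`, moving one of the non-zero values into the lower half of the lanes.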

INSTRUCTION EXECUTION METHOD AND INSTRUCTION EXECUTION DEVICE
20230161594 · 2023-05-25

An instruction configuration and execution method includes the following steps. A target instruction is received through an instruction cache. The target instruction is decoded by an instruction translator. It is determined whether the target instruction has the authority to read or write a model specific register in an unprivileged state. It is determined whether the model specific register index of the target instruction corresponds to a specific model specific register, so as to order the microprocessor to perform an instruction serialization operation.

Storing multiple instructions in a single reordering buffer entry
11467844 · 2022-10-11

Embodiments of the present disclosure provide an instruction processing apparatus, comprising an instruction decoding circuitry configured to decode a set of instructions; a buffer comprising one or more buffer entries associated with the set of instructions, wherein the one or more buffer entries are configured to store information corresponding to at least one instruction of the set of instructions decoded by the instruction decoding circuitry; and an instruction executing circuitry configured to execute the at least one instruction, wherein a buffer entry storing the information corresponding to the at least one instruction is updated to indicate that the at least one instruction has been executed to enable retiring the set of instructions after the set of instructions have been executed.
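
The idea of one buffer entry tracking several instructions can be sketched as follows. This is a toy model under stated assumptions: the class name, the per-slot bitmask, and the rule that the entry becomes retirable only once every slot is marked are illustrative choices, and the abstract does not say how the executed status is encoded.

```python
class RobEntry:
    """Toy reorder-buffer entry that holds a set of decoded instructions."""

    def __init__(self, instructions):
        self.instructions = tuple(instructions)
        self.executed_mask = 0  # assumption: one executed bit per instruction

    def mark_executed(self, slot):
        """Update the entry to record that the instruction in `slot` has executed."""
        if not 0 <= slot < len(self.instructions):
            raise IndexError("no such instruction slot in this entry")
        self.executed_mask |= 1 << slot

    def can_retire(self):
        """The whole set may retire once every instruction has executed."""
        return self.executed_mask == (1 << len(self.instructions)) - 1
```

An entry created with two instructions reports `can_retire()` as false until `mark_executed(0)` and `mark_executed(1)` have both been called, in either order.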

Methods and apparatus for thread-based scheduling in multicore neural networks
11625592 · 2023-04-11

Systems, apparatus, and methods for thread-based scheduling within a multicore processor are disclosed. Neural networking uses a network of connected nodes (aka neurons) to loosely model the neuro-biological functionality found in the human brain. Various embodiments of the present disclosure use thread dependency graph analysis to decouple scheduling across many distributed cores. Rather than using thread dependency graphs to generate a sequential ordering for a centralized scheduler, the individual thread dependencies define a count value for each thread at compile-time. Threads and their thread dependency counts are distributed to each core at run-time. Thereafter, each core can dynamically determine which threads to execute based on fulfilled thread dependencies without requiring a centralized scheduler.
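
The dependency-count idea can be illustrated with a short single-process simulation. It models only the counting rule (a thread becomes runnable when its count of unfulfilled dependencies reaches zero), not the distribution across cores. The function name, the dictionary layout, and the first-in-first-out ready queue are assumptions for illustration.

```python
from collections import deque


def run_by_dependency_count(dep_count, dependents):
    """Return an execution order driven only by per-thread dependency counts.

    dep_count:  thread -> number of unfulfilled dependencies (fixed at compile time).
    dependents: thread -> threads that depend on it.
    """
    remaining = dict(dep_count)  # run-time copy of the compile-time counts
    ready = deque(sorted(t for t, count in remaining.items() if count == 0))
    order = []
    while ready:
        thread = ready.popleft()
        order.append(thread)  # "execute" the thread
        for waiter in dependents.get(thread, ()):
            remaining[waiter] -= 1  # one dependency fulfilled
            if remaining[waiter] == 0:
                ready.append(waiter)  # runnable without consulting a global order
    return order
```

For a diamond-shaped graph with counts `{"a": 0, "b": 1, "c": 1, "d": 2}` and dependents `{"a": ["b", "c"], "b": ["d"], "c": ["d"]}`, the function returns `["a", "b", "c", "d"]`. No sequential ordering is precomputed; the order emerges from the counts as dependencies are fulfilled.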

SUPPORTING 8-BIT FLOATING POINT FORMAT OPERANDS IN A COMPUTING ARCHITECTURE

An apparatus to facilitate supporting 8-bit floating point format operands in a computing architecture is disclosed. The apparatus includes a processor comprising: a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is a matrix instruction that operates on 8-bit floating point operands to cause the processor to perform a parallel dot product operation; a controller to schedule the decoded instruction and provide input data for the 8-bit floating point operands in accordance with an 8-bit floating point data format indicated by the decoded instruction; and systolic dot product circuitry to execute the decoded instruction using systolic layers, each systolic layer comprising one or more sets of interconnected multipliers, shifters, and adders, each set of multipliers, shifters, and adders to generate a dot product of the 8-bit floating point operands.

Systems and methods for improving cache efficiency and utilization

Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache coupled to the processing resources. The cache controller is configured to control cache priority by determining whether default settings or an instruction will control cache operations for the cache.