G06F9/30156

Method and processor for implementing an instruction including encoding a stopbit in the instruction to indicate whether the instruction is executable in parallel with a current instruction, and recording medium therefor

In a method to execute instructions, at least one instruction to be executed in a predetermined cycle is acquired based on information included in each of a plurality of instructions, and a code included in the at least one acquired instruction is analyzed. An instruction is allocated to at least one slot based on the analysis result, and only a slot necessary to execute the instruction is selectively used. Accordingly, power consumption of a device using the method may be reduced.
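The stop-bit mechanism in the title can be sketched in software. The fragment below is a minimal, hypothetical model of how a decoder might use a per-instruction stop bit to split a linear stream into groups that can issue in the same cycle; the opcodes and encoding are invented for illustration, not taken from the patent.

```python
# Hypothetical sketch: grouping instructions into parallel issue
# groups using a per-instruction stop bit (1 = last instruction of
# its group).  Names and encoding are illustrative only.

def issue_groups(instructions):
    """Split a linear instruction stream into groups that may
    execute in the same cycle; a set stop bit ends the group."""
    groups, current = [], []
    for opcode, stop_bit in instructions:
        current.append(opcode)
        if stop_bit:                # cannot run in parallel with
            groups.append(current)  # the following instruction
            current = []
    if current:
        groups.append(current)
    return groups

stream = [("add", 0), ("mul", 1), ("ld", 0), ("st", 0), ("br", 1)]
print(issue_groups(stream))  # [['add', 'mul'], ['ld', 'st', 'br']]
```

Only as many slots as a group contains would then need to be powered, which is the power-saving idea the abstract describes.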

EFFICIENT ENCODING OF HIGH FANOUT COMMUNICATIONS

Efficient encoding of high fanout communication patterns in computer programming is achieved through utilization of producer and move instructions in an instruction set architecture (ISA) that supports direct instruction communication, where a producer encodes identities of consumers of results directly within an instruction. The producer instructions may fully encode the targeted consumers with an explicit target distance or utilize compressed target encoding in which a field in the instruction provides a bit vector for one-hot encoding. A variety of move instructions target different numbers of consumers and may also utilize full or compressed target encoding. In consumer paths where a producer is unable to target all consumers, a compiler may utilize various combinations of producer and move instructions, using full and/or compressed target encoding, to build a fanout tree that efficiently propagates the producer results to all the targeted consumers.
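The compressed target encoding can be illustrated with a short sketch. Here the bit-vector field covers a fixed window of following instructions, with bit *i* meaning "the instruction at distance *i*+1 consumes this result"; the window size and function names are assumptions for illustration, and a `None` result stands in for the case where the compiler must fall back to a move/fanout tree.

```python
def encode_targets_compressed(producer_idx, consumer_idxs, window=8):
    """Compressed target encoding sketch: a bit vector where bit i
    means 'the instruction at distance i+1 consumes this result'.
    Returns None if any consumer falls outside the encodable window
    (the case where move instructions would build a fanout tree)."""
    bits = 0
    for c in consumer_idxs:
        d = c - producer_idx
        if not (1 <= d <= window):
            return None
        bits |= 1 << (d - 1)
    return bits

# producer at index 10, consumers at indices 11, 13 and 14
v = encode_targets_compressed(10, [11, 13, 14])
print(f"{v:08b}")  # 00001101  (bits 0, 2 and 3 set)
```

A full encoding, by contrast, would spend one explicit distance field per targeted consumer, which is why the bit vector pays off at high fanout.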

Accelerator systems and methods for matrix operations

The present disclosure is directed to systems and methods for performing one or more operations on a two-dimensional tile register using an accelerator that includes a tiled matrix multiplication unit (TMU). The processor circuitry includes reservation station (RS) circuitry to communicatively couple the processor circuitry to the TMU. The RS circuitry coordinates the operations performed by the TMU. TMU dispatch queue (TDQ) circuitry in the TMU maintains the operations received from the RS circuitry in the order that the operations are received. Since the duration of each operation is not known prior to execution by the TMU, the RS circuitry maintains shadow dispatch queue (RS-TDQ) circuitry that mirrors the operations in the TDQ circuitry. Communication between the RS circuitry and the TMU provides the RS circuitry with notification of successfully executed operations and allows the RS circuitry to cancel operations where the operations are associated with branch mispredictions and/or non-retired speculatively executed instructions.
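The shadow-queue behavior can be modeled with a small sketch. This is a software analogy of the hardware RS-TDQ, with invented names: completions drain from the head in dispatch order, while a branch misprediction cancels the younger (wrong-path) entries from the tail.

```python
from collections import deque

class ShadowDispatchQueue:
    """Illustrative mirror of an in-order accelerator queue: entries
    complete from the head in dispatch order, and speculative entries
    can be cancelled from the tail on a branch misprediction.
    (Names are hypothetical; the patent's RS-TDQ is hardware.)"""
    def __init__(self):
        self.q = deque()

    def dispatch(self, op):
        self.q.append(op)

    def complete(self):
        return self.q.popleft()      # oldest op finishes first (FIFO)

    def cancel_younger_than(self, tag):
        # drop all ops dispatched after `tag` (wrong-path work)
        while self.q and self.q[-1] != tag:
            self.q.pop()

sdq = ShadowDispatchQueue()
for op in ["tmul0", "tmul1", "tmul2", "tmul3"]:
    sdq.dispatch(op)
print(sdq.complete())             # tmul0
sdq.cancel_younger_than("tmul1")  # tmul2/tmul3 were speculative
print(list(sdq.q))                # ['tmul1']
```

Because operation durations are unknown up front, this mirrored bookkeeping is what lets the RS reconcile out-of-band completion notifications with in-order dispatch.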

TRANSCEIVER AND DRIVER ARCHITECTURE WITH LOW EMISSION AND HIGH INTERFERENCE TOLERANCE

Circuitry of a physical layer for interfacing with a communication bus of a wired local area network is disclosed. The circuitry includes a variable delay driver operably coupled to a communication bus. The communication bus includes a shared transmission medium. The variable delay driver is configured to control a slew rate of a driven transmit signal at the driver output. The circuitry also includes receiver circuitry operably coupled to the communication bus. The circuitry further includes a common mode dimmer operably coupled to the receiver circuitry and the communication bus. The common mode dimmer is configured to protect the receiver circuitry from common mode interference.

OPERATION CACHE COMPRESSION
20210064533 · 2021-03-04 ·

A data processing apparatus is provided. The data processing apparatus includes fetch circuitry to fetch instructions from storage circuitry. Decode circuitry decodes each of the instructions into one or more operations and provides the one or more operations to one or more execution units. The decode circuitry is adapted to decode at least one of the instructions into a plurality of operations. Cache circuitry caches the one or more operations and at least one entry of the cache circuitry is a compressed entry that represents the plurality of operations.
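One way to picture a compressed entry is as a single cached record that expands into the full operation sequence on a hit, instead of one cache slot per decoded operation. The sketch below is a deliberately simplified software model; the template table and class names are invented and stand in for whatever compressed representation the hardware actually uses.

```python
# Sketch of a compressed operation-cache entry: one entry stores a
# template id that expands to the full decoded-operation sequence on
# a hit.  Templates and names are invented for illustration.

TEMPLATES = {
    "load_op_store": ["uop_load", "uop_alu", "uop_store"],
    "simple":        ["uop_alu"],
}

class OpCache:
    def __init__(self):
        self.entries = {}                  # pc -> template id

    def fill(self, pc, template_id):
        self.entries[pc] = template_id     # one compressed entry

    def lookup(self, pc):
        tid = self.entries.get(pc)
        return None if tid is None else list(TEMPLATES[tid])

oc = OpCache()
oc.fill(0x40, "load_op_store")
print(oc.lookup(0x40))   # ['uop_load', 'uop_alu', 'uop_store']
print(len(oc.entries))   # 1 entry representing 3 operations
```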

CONTEXT SWITCHING LOCATIONS FOR COMPILER-ASSISTED CONTEXT SWITCHING
20210208886 · 2021-07-08 ·

Generating context switching locations for compiler-assisted context switching. A set of possible locations is determined for preferred preemption points in a set of threads based on (i) an identification of a set of candidate markers for preferred preemption points and (ii) a type of characteristic that is associated with a possible location included in the set of possible locations. A modified set of possible locations is generated in a data structure based on the type of characteristic, wherein the modified set of possible locations indicates one or more preferred preemption points in the set of threads.
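The filtering step can be sketched as a simple pass over candidate markers. The marker fields and characteristic names below are invented for illustration; the point is only that the modified set is derived by selecting candidates whose characteristic type qualifies them as preferred preemption points.

```python
# Hypothetical sketch: filtering candidate preemption markers by a
# characteristic type to produce the modified set of preferred
# preemption points.  Fields and type names are invented.

def preferred_preemption_points(candidates, wanted_type):
    """candidates: [(location, characteristic_type), ...]"""
    return [loc for loc, ctype in candidates if ctype == wanted_type]

candidates = [
    (0x1004, "loop_back_edge"),
    (0x1010, "call_site"),
    (0x1024, "loop_back_edge"),
]
print(preferred_preemption_points(candidates, "loop_back_edge"))
```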

METHOD AND APPARATUS TO PROCESS SHA-2 SECURE HASHING ALGORITHM

A processor includes an instruction decoder to receive a first instruction to process a secure hash algorithm 2 (SHA-2) hash algorithm, the first instruction having a first operand associated with a first storage location to store a SHA-2 state and a second operand associated with a second storage location to store a plurality of messages and round constants. The processor further includes an execution unit coupled to the instruction decoder to perform one or more iterations of the SHA-2 hash algorithm on the SHA-2 state specified by the first operand and the plurality of messages and round constants specified by the second operand, in response to the first instruction.
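The round iterations such an instruction accelerates are the standard SHA-256 compression function. The software model below is not the patent's hardware, just the FIPS 180-4 algorithm the instruction operates on: `_rounds` corresponds to the per-round work done on the SHA-2 state, with the message words and round constants supplied alongside, and the result can be checked against Python's `hashlib`.

```python
import struct

# FIPS 180-4 SHA-256 round constants (fractional parts of the cube
# roots of the first 64 primes).
K = [
 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
 0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
 0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
 0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
 0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
 0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
 0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2,
]

def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & 0xffffffff

def _rounds(state, w):
    """The inner loop a SHA-2 instruction accelerates: 64 rounds
    updating the eight 32-bit state words from messages w and K."""
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        s1 = _rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + K[i] + w[i]) & 0xffffffff
        s0 = _rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (s0 + maj) & 0xffffffff
        a, b, c, d, e, f, g, h = (
            (t1 + t2) & 0xffffffff, a, b, c,
            (d + t1) & 0xffffffff, e, f, g)
    return [(x + y) & 0xffffffff for x, y in zip(state, (a, b, c, d, e, f, g, h))]

def sha256(msg):
    state = [0x6a09e667,0xbb67ae85,0x3c6ef372,0xa54ff53a,
             0x510e527f,0x9b05688c,0x1f83d9ab,0x5be0cd19]
    length = len(msg) * 8
    msg += b"\x80" + b"\x00" * ((55 - len(msg)) % 64) + struct.pack(">Q", length)
    for block in range(0, len(msg), 64):
        w = list(struct.unpack(">16I", msg[block:block + 64]))
        for i in range(16, 64):  # message schedule expansion
            s0 = _rotr(w[i-15], 7) ^ _rotr(w[i-15], 18) ^ (w[i-15] >> 3)
            s1 = _rotr(w[i-2], 17) ^ _rotr(w[i-2], 19) ^ (w[i-2] >> 10)
            w.append((w[i-16] + s0 + w[i-7] + s1) & 0xffffffff)
        state = _rounds(state, w)
    return struct.pack(">8I", *state)

print(sha256(b"abc").hex())
```

The two-operand split in the abstract maps naturally onto this model: the first operand names the eight-word state, the second supplies the `w[i] + K[i]` material consumed round by round.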

Transmitting DBI over strobe in nonvolatile memory

A methodology and structure for encoding a data state signal in the data clock signal, e.g., the data strobe signal such as DQS. The data strobe signal can maintain the clock continuity, e.g., the rise and fall edges are at the timing signal, and the data inversion can be based on the amplitude of the data strobe signal. This allows the data sent on the data lines, e.g., DQ0-DQ7, to either be non-inverted or inverted, to save power consumed in the memory device.
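The inversion decision itself is the classic DBI rule. The sketch below shows one common variant (invert when more than half the bits are 1, so fewer lines drive the power-hungry level); it models only the per-byte decision, and the strobe-amplitude signaling of the flag described in the abstract is not represented, the flag is simply returned.

```python
def dbi_encode(byte):
    """DC-balance DBI sketch: invert the byte when it has more than
    four 1 bits, so at most four of the eight data lines drive the
    power-hungry level.  Per the abstract the inversion flag would
    ride on the strobe amplitude; here it is just returned."""
    invert = bin(byte).count("1") > 4
    return (byte ^ 0xff if invert else byte), invert

data, flag = dbi_encode(0b11110111)  # seven ones -> invert
print(f"{data:08b}", flag)           # 00001000 True
```

The receiver undoes the inversion using the recovered flag, so no information is lost while the average number of driven lines drops.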

Method for managing computation tasks on a functionally asymmetric multi-core processor

A method for managing a computation task on a functionally asymmetric multi-core processor comprising a plurality of cores, at least one of which includes at least one hardware extension for executing specialized instructions, comprises the following steps: a) starting the execution of the computation task on a core of the processor; b) monitoring a parameter indicative of a quality of service of the computation task, and at least a number of specialized instructions loaded by the core; c) identifying instants splitting an application period of the computation task into a predetermined number of portions; d) computing costs or gains in quality of service and in energy consumption corresponding to different management options of the computation task; and e) making a management choice according to the costs or gains thus computed. A computer program product, a processor and a computer system for implementing such a method are also provided.
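Step (e) reduces to picking the option with the best trade-off among the estimates computed in step (d). The sketch below is a hypothetical scoring scheme with invented option names and numbers; the patent does not specify this particular weighting.

```python
# Hypothetical sketch of step (e): choose the management option with
# the best weighted QoS-gain / energy-cost trade-off.  Options,
# numbers and the linear scoring rule are invented for illustration.

def choose_option(estimates, qos_weight=0.5):
    """estimates: {option: (qos_gain, energy_cost)}; returns the
    option maximizing weighted QoS gain minus weighted energy cost."""
    def score(item):
        _, (qos_gain, energy_cost) = item
        return qos_weight * qos_gain - (1 - qos_weight) * energy_cost
    return max(estimates.items(), key=score)[0]

estimates = {
    "stay_on_big_core":   (0.9, 0.8),  # keeps the hardware extension
    "migrate_to_little":  (0.3, 0.1),  # extension emulated in software
    "migrate_back_later": (0.7, 0.6),
}
print(choose_option(estimates))  # migrate_to_little
```

Shifting `qos_weight` toward 1.0 models a QoS-critical task that should stay near its hardware extension; shifting it toward 0.0 models an energy-dominated policy.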

Encoding and decoding variable length instructions

Methods of encoding and decoding are described which use a variable number of instruction words to encode instructions from an instruction set, such that different instructions within the instruction set may be encoded using different numbers of instruction words. To encode an instruction, the bits within the instruction are re-ordered and formed into instruction words based upon their variance as determined using empirical or simulation data. The bits in the instruction words are compared to corresponding predicted values and some or all of the instruction words that match the predicted values are omitted from the encoded instruction.
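The omission step can be sketched concisely. This toy model assumes the variance-based bit re-ordering has already been done and simplifies to omitting only trailing words that match the decoder's predictions (the abstract allows omitting "some or all" matching words); the word width, predicted values, and function names are invented for illustration.

```python
# Sketch of the omission step: split an instruction into fixed-width
# words and drop trailing words equal to the decoder's predicted
# values, which the decoder regenerates.  Values are hypothetical.

PREDICTED = [0x0, 0x0, 0xF]    # assumed per-position predictions

def encode(words):
    n = len(words)
    while n > 1 and words[n - 1] == PREDICTED[n - 1]:
        n -= 1                 # omit a word the decoder can predict
    return words[:n]

def decode(words, total=3):
    return words + PREDICTED[len(words):total]

instr = [0x7, 0x0, 0xF]        # last two words match predictions
enc = encode(instr)
print(enc, decode(enc))        # [7] [7, 0, 15]
```

Because the re-ordering concentrates the low-variance (highly predictable) bits into the same words, those words match their predictions often, which is what makes the variable-length encoding pay off on average.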