G06F9/3009

Performance scaling for binary translation

Embodiments relate to improving user experiences when executing binary code that has been translated from other binary code. Binary code (instructions) for a source instruction set architecture (ISA) cannot natively execute on a processor that implements a target ISA. The instructions in the source ISA are binary-translated to instructions in the target ISA and are executed on the processor. The overhead of performing binary translation and/or the overhead of executing binary-translated code are compensated for by increasing the speed at which the translated code is executed, relative to non-translated code. Translated code may be executed on hardware that has one or more power-performance parameters of the processor set to increase the performance of the processor with respect to the translated code. The increase in power-performance for translated code may be proportional to the degree of translation overhead.

METHOD AND APPARATUS FOR IMPLIED BIT HANDLING IN FLOATING POINT MULTIPLICATION
20230085048 · 2023-03-16 ·

A method is provided that includes performing, by a processor in response to a floating point multiply instruction, multiplication of floating point numbers, wherein determination of values of implied bits of leading bit encoded mantissas of the floating point numbers is performed in parallel with multiplication of the encoded mantissas, and storing, by the processor, a result of the floating point multiply instruction in a storage location indicated by the floating point multiply instruction.

Cooperative workgroup scheduling and context prefetching based on predicted modification of signal values

A first workgroup is preempted in response to threads in the first workgroup executing a first wait instruction including a first value of a signal and a first hint indicating a type of modification for the signal. The first workgroup is scheduled for execution on a processor core based on a first context after preemption in response to the signal having the first value. A second workgroup is scheduled for execution on the processor core based on a second context in response to preempting the first workgroup and in response to the signal having a second value. A third context it is prefetched into registers of the processor core based on the first hint and the second value. The first context is stored in a first portion of the registers and the second context is prefetched into a second portion of the registers prior to preempting the first workgroup.

Methods and devices for hardware characterization of computing devices

A machine characterization device for determining one or more machine characterization parameters of a computing device depending on a machine signature determined from sets of timing measurements associated with at least one machine characterization instruction executed by one or more processors comprised in the computing device using at least two machine configurations. A machine configuration comprises a sequence of two or more machine configuration instructions defining an order of execution of one or more instructions by the one or more processors.

MECHANISM TO TRIGGER EARLY TERMINATION OF COOPERATING PROCESSES
20230074452 · 2023-03-09 ·

Devices and techniques for triggering early termination of cooperating processes in a processor are described herein. A system includes multiple memory-compute nodes, wherein a memory-compute node comprises: event manager circuitry configured to establish a broadcast channel to receive event messages; and thread manager circuitry configured to organize a plurality of threads to perform portions of a cooperative task, wherein the plurality of threads each monitor the broadcast channel to receive event messages on the broadcast channel, and wherein upon achieving a threshold operation, the thread manager circuitry is to use the event manager circuitry to broadcast, on the broadcast channel, an event message indicating that the cooperative task is complete, causing other threads, in response to receiving the event message, to terminate execution of their respective portions of the cooperative task.

Execution control of a multi-threaded, self-scheduling reconfigurable computing fabric
11635959 · 2023-04-25 · ·

Representative apparatus, method, and system embodiments are disclosed for configurable computing. A representative system includes an interconnection network; a processor; and a plurality of configurable circuit clusters. Each configurable circuit cluster includes a plurality of configurable circuits arranged in an array; a synchronous network coupled to each configurable circuit of the array; and an asynchronous packet network coupled to each configurable circuit of the array. A representative configurable circuit includes a configurable computation circuit and a configuration memory having a first, instruction memory storing a plurality of data path configuration instructions to configure a data path of the configurable computation circuit; and a second, instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices for selection of a master synchronous input, a current data path configuration instruction, and a next data path configuration instruction for a next configurable computation circuit.

GPU NETWORKING USING AN INTEGRATED COMMAND PROCESSOR

Systems, apparatuses, and methods for generating network messages on a parallel processor are disclosed. A system includes at least a parallel processor, a general purpose processor, and a network interface unit. The parallel processor includes at least a plurality of compute units, a command processor, and a cache. A thread within a kernel executing on a compute unit of the parallel processor generates a network message and stores the network message and a corresponding indication in the cache. In response to detecting the indication of the network message in the cache, the command processor processes and conveys the network message to the network interface unit without involving the general purpose processor.

CONFIGURABLE PROCESSOR PARTITIONING
20230118662 · 2023-04-20 ·

Apparatuses, systems, and techniques to configure processor partitioning for a multi-process service. In at least one embodiment, a multi-process service configures a set of streaming multiprocessors of one or more parallel processing units to perform one or more threads based on one or more user-defined data values accessible to a parallel processing library, such as compute uniform device architecture (CUDA).

TILE-BASED RESULT BUFFERING IN MEMORY-COMPUTE SYSTEMS
20230067771 · 2023-03-02 ·

A reconfigurable compute fabric can include multiple nodes, and each node can include multiple tiles with respective processing and storage elements. A first tile in a first node can include a processor with a processor output and a first register network configured to receive information from the processor output and information from one or more of the multiple other tiles in the first node. In response to an output instruction and a delay instruction, the register network can provide an output signal to one of the multiple other tiles in the first node. Based on the output instruction, the output signal can include one or the other of the information from the processor output and the information from one or more of the multiple other tiles in the first node. A timing characteristic of the output signal can depend on the delay instruction.

REDUCING CALL STACK USAGE FOR FUTURE OBJECT COMPLETIONS
20230064547 · 2023-03-02 ·

Reducing call stack usage for future object completions is disclosed herein. In one example, a processor device of a computing device employs a completion queue when managing completions of future objects. When a future object is determined to have a status of complete, the processor device determines whether the current thread of the future object is associated with a completion queue. If so, a completion operation of the future object is enqueued in the completion queue. If the current thread is not associated with a completion queue, one is created and associated with the current thread, and the completion operation of the future object is performed. After completion, if the completion queue is not empty, any enqueued completion operations are dequeued and performed. Once the completion queue is empty, the completion queue is removed.