G06F9/30123

RECONFIGURABLE SIMD ENGINE
20230214351 · 2023-07-06

An exemplary SIMD computing system comprises a SIMD processing element (SPE) configured to perform a selected operation on a portion of a processor input data word, with the operation selected by control signals read from a control memory location addressed by a decoded instruction. The SPE may comprise one or more adders, multipliers, or multiplexers coupled to the control signals. The control signals may comprise one or more bits read from the control memory. The control memory may be an MxN (M rows by N columns) memory having M possible SIMD operations and N control signals. Each decoded instruction may select an SPE operation from among the M rows. A plurality of SPEs may receive the same control signals. The control memory may be rewritable, advantageously permitting customizable SIMD operations that are reconfigurable by storing, in the control memory locations, control signals designed to cause the SPE to perform selected operations.
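
A minimal C++ sketch of the control-memory-driven SPE described above may help fix the idea. The row/column sizes, the 3-bit control encoding, and the choice of adder/multiplier/multiplexer operations are assumptions made for illustration, not details taken from the publication.

    #include <array>
    #include <bitset>
    #include <cstdint>
    #include <iostream>

    constexpr int M = 4;               // selectable SIMD operations (rows)
    constexpr int N = 3;               // control signals (columns)
    using ControlWord = std::bitset<N>;

    // Rewritable control memory: the decoded instruction supplies the row index.
    std::array<ControlWord, M> control_memory = {
        ControlWord{0b001},            // row 0: add
        ControlWord{0b010},            // row 1: multiply
        ControlWord{0b100},            // row 2: multiplexer forwards operand a
        ControlWord{0b000},            // row 3: zero
    };

    // One SPE operating on one portion (lane) of the processor input data word.
    uint16_t spe_execute(uint16_t a, uint16_t b, const ControlWord& ctrl) {
        if (ctrl[0]) return static_cast<uint16_t>(a + b);   // adder enabled
        if (ctrl[1]) return static_cast<uint16_t>(a * b);   // multiplier enabled
        if (ctrl[2]) return a;                              // multiplexer selects a
        return 0;
    }

    int main() {
        int decoded_row = 1;                        // decoded instruction addresses row 1
        ControlWord ctrl = control_memory[decoded_row];
        uint16_t lanes_a[4] = {1, 2, 3, 4}, lanes_b[4] = {10, 20, 30, 40};
        // A plurality of SPEs receive the same control signals, one per lane.
        for (int lane = 0; lane < 4; ++lane)
            std::cout << spe_execute(lanes_a[lane], lanes_b[lane], ctrl) << ' ';
        std::cout << '\n';                          // prints 10 40 90 160
    }

Rewriting a row of control_memory changes what the same decoded instruction does, which is the reconfigurability the abstract points to.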

Mechanism for interrupting and resuming execution on an unprotected pipeline processor

Techniques related to executing a plurality of instructions by a processor, comprising: receiving a first instruction for execution on an instruction execution pipeline; beginning execution of the first instruction; receiving one or more second instructions for execution on the instruction execution pipeline, the one or more second instructions associated with a higher priority task than the first instruction; storing a register state associated with the execution of the first instruction in one or more registers of a capture queue associated with the instruction execution pipeline; copying the register state from the capture queue to a memory; determining that the one or more second instructions have been executed; copying the register state from the memory to the one or more registers of the capture queue; and restoring the register state to the instruction execution pipeline from the capture queue.
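
The save/restore sequence lends itself to a small software model. In this sketch the CaptureQueue type, the eight-register state, and the backing_store vector are invented stand-ins for the hardware structures named in the abstract.

    #include <array>
    #include <cstdint>
    #include <vector>

    struct RegisterState { std::array<uint64_t, 8> regs{}; };

    struct CaptureQueue {                 // staging buffer next to the pipeline
        RegisterState slot;
        void capture(const RegisterState& pipeline_regs) { slot = pipeline_regs; }
        RegisterState restore() const { return slot; }
    };

    int main() {
        RegisterState pipeline{{1, 2, 3, 4, 5, 6, 7, 8}};  // state of the first (lower-priority) instruction
        CaptureQueue cq;
        std::vector<RegisterState> backing_store;          // memory behind the capture queue

        // Higher-priority instructions arrive: stage the state, spill it to memory.
        cq.capture(pipeline);
        backing_store.push_back(cq.restore());

        // ... the higher-priority task executes on the pipeline here ...

        // Higher-priority work finished: memory -> capture queue -> pipeline.
        cq.capture(backing_store.back());
        backing_store.pop_back();
        pipeline = cq.restore();                           // first instruction resumes
    }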

Fast incremental shared constants

This disclosure provides systems, devices, apparatus, and methods, including computer programs encoded on storage media, for fast incremental shared constants. In aspects, a CPU may determine/update shared constant data for a first draw call of a plurality of draw calls. The shared constant data, which may correspond to at least one shader, may be updated based on a draw call update for the first draw call. The CPU may communicate the updated shared constant data for the first draw call to a GPU. The GPU may receive, in at least one register, the updated shared constant data from the CPU and configure the at least one register based on the updated shared constant data corresponding to the draw call update of the first draw call of the plurality of draw calls.
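
A rough CPU-side sketch of the incremental update, in C++. The SharedConstants layout, the register count, and write_gpu_register() are placeholders for whatever the driver and hardware actually expose; only words changed by a draw call's update are re-sent.

    #include <array>
    #include <cstdint>
    #include <cstdio>

    struct SharedConstants { std::array<uint32_t, 4> words{}; };  // visible to at least one shader

    void write_gpu_register(int index, uint32_t value) {
        // Stand-in for the write that lands the value in a GPU register.
        std::printf("reg[%d] <= 0x%08x\n", index, value);
    }

    void update_for_draw_call(SharedConstants& sc, int word_index, uint32_t new_value) {
        if (sc.words[word_index] == new_value) return;   // unchanged, skip the register write
        sc.words[word_index] = new_value;
        write_gpu_register(word_index, new_value);       // GPU configures the register from this
    }

    int main() {
        SharedConstants sc;
        update_for_draw_call(sc, 0, 0xCAFE);   // first draw call updates word 0
        update_for_draw_call(sc, 0, 0xCAFE);   // next draw call: no change, no traffic
        update_for_draw_call(sc, 2, 0xBEEF);   // later draw call updates word 2 only
    }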

VECTOR PROCESSOR UTILIZING MASSIVELY FUSED OPERATIONS

Techniques are disclosed for the use of fused vector processor instructions by a vector processor architecture. Each fused vector processor instruction may include a set of fields associated with individual vector processing instructions. The vector processor architecture may implement local buffers facilitating a single vector processor instruction to be used to execute each of the individual vector processing instructions without re-accessing vector registers between each executed individual vector processing instruction. The vector processor architecture enables less communication across the interconnection network, thereby increasing interconnection network bandwidth and the speed of computations, and decreasing power usage.
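
The buffering idea can be sketched in a few lines of C++. The two-operation field layout, the op names, and the 16-entry vector register file are assumptions for illustration; the point is that the intermediate result stays in a local buffer rather than being written back to and re-read from the vector registers.

    #include <array>
    #include <cstdint>

    constexpr int kLanes = 8;
    using Vec = std::array<int32_t, kLanes>;

    enum class VecOp { Add, Mul };

    struct FusedInstruction {            // one fused instruction, two operation fields
        VecOp first;
        VecOp second;
        int   src_a, src_b, src_c, dst;  // vector register indices
    };

    Vec apply(VecOp op, const Vec& x, const Vec& y) {
        Vec r{};
        for (int i = 0; i < kLanes; ++i)
            r[i] = (op == VecOp::Add) ? x[i] + y[i] : x[i] * y[i];
        return r;
    }

    void execute(const FusedInstruction& fi, std::array<Vec, 16>& vregs) {
        Vec local_buffer = apply(fi.first, vregs[fi.src_a], vregs[fi.src_b]); // not written back
        vregs[fi.dst]    = apply(fi.second, local_buffer, vregs[fi.src_c]);   // single write-back
    }

    int main() {
        std::array<Vec, 16> vregs{};
        vregs[1].fill(2); vregs[2].fill(3); vregs[3].fill(1);
        execute({VecOp::Mul, VecOp::Add, 1, 2, 3, 0}, vregs);   // v0 = v1 * v2 + v3
    }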

System, Apparatus And Methods For Minimum Serialization In Response To Non-Serializing Register Write Instruction
20220413860 · 2022-12-29

In one embodiment, a processor includes: a plurality of registers; a front end circuit to fetch and decode a non-serializing register write instruction, the non-serializing register write instruction to cause a value to be stored in a first register of the plurality of registers; and an execution circuit coupled to the front end circuit. The execution circuit, in response to the non-serializing register write instruction, is to determine an amount of serialization for the non-serializing register write instruction and execute the non-serializing register write instruction according to the amount of serialization. Other embodiments are described and claimed.
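
As a software analogy, the sketch below picks "how much" serialization a register write needs instead of fully draining the pipeline for every write. The three-level policy and the per-register rules are invented for the example and are not taken from the claims.

    #include <cstdint>
    #include <cstdio>

    enum class Serialization { None, DrainStores, FullFence };

    // The execution circuit inspects the target register and returns the
    // minimum serialization that still preserves correctness.
    Serialization required_serialization(int reg_id) {
        switch (reg_id) {
            case 0:  return Serialization::None;         // scratch register: no ordering needed
            case 1:  return Serialization::DrainStores;  // affects memory ordering only
            default: return Serialization::FullFence;    // mode-changing register: serialize fully
        }
    }

    void execute_register_write(int reg_id, uint64_t value) {
        switch (required_serialization(reg_id)) {
            case Serialization::FullFence:
                std::puts("drain pipeline and refetch younger instructions");
                [[fallthrough]];
            case Serialization::DrainStores:
                std::puts("wait for pending stores");
                [[fallthrough]];
            case Serialization::None:
                std::printf("write reg %d = %llu\n", reg_id, (unsigned long long)value);
        }
    }

    int main() {
        execute_register_write(0, 7);   // completes with no serialization
        execute_register_write(5, 9);   // pays for full serialization
    }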

PROCESSING DEVICE AND METHOD OF USING A REGISTER CACHE
20220413858 · 2022-12-29

A processing device is provided which comprises memory, a plurality of registers, and a processor. The processor is configured to execute a plurality of portions of a program, allocate a number of the registers per portion of the program such that a number of remaining registers are available as a register cache, and transfer data between the allocated registers and the register cache. The processor loads data to the allocated registers to execute a portion of the program, stores data resulting from execution of the portion in the register cache, reloads the data into the allocated registers, and executes another portion of the program using the data reloaded to the allocated registers. A called function uses the number of allocated registers, which is less than an architectural limit of registers allocated per portion of the program.
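
A toy C++ model of the split may make the allocation clearer: each portion of the program gets fewer registers than the architectural limit, and the leftover registers act as a cache that results are parked in between portions. All of the sizes here are invented.

    #include <array>
    #include <cstdint>
    #include <iostream>

    constexpr int kArchRegs      = 16;  // architectural register count
    constexpr int kAllocated     = 4;   // registers allocated per portion of the program
    constexpr int kRegisterCache = kArchRegs - kAllocated;

    std::array<uint32_t, kAllocated>     allocated{};       // working set of the current portion
    std::array<uint32_t, kRegisterCache> register_cache{};  // remaining registers used as a cache

    int main() {
        // A portion executes in the allocated registers...
        for (int i = 0; i < kAllocated; ++i) allocated[i] = 10 * (i + 1);
        // ...and its results are stored in the register cache instead of memory.
        for (int i = 0; i < kAllocated; ++i) register_cache[i] = allocated[i];

        // A called function reuses the same allocated registers.
        for (int i = 0; i < kAllocated; ++i) allocated[i] = 0;

        // Data is reloaded from the register cache so another portion can continue.
        for (int i = 0; i < kAllocated; ++i) allocated[i] = register_cache[i];
        std::cout << allocated[3] << '\n';   // prints 40
    }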

Compiler-assisted inter-SIMD-group register sharing

Systems, apparatuses, and methods for efficiently sharing registers among threads are disclosed. A system includes at least a processor, control logic, and a register file with a plurality of registers. The processor assigns a base set of registers to each thread of a plurality of threads executing on the processor. When a given thread needs more than the base set of registers to execute a given phase of program code, the given thread executes an acquire instruction to acquire exclusive access to an extended set of registers from a shared resource pool. When the given thread no longer needs additional registers, the given thread executes a release instruction to release the extended set of registers back into the shared register pool for other threads to use. In one implementation, the compiler inserts acquire and release instructions into the program code based on a register liveness analysis performed during compilation.
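
The acquire/release protocol can be mimicked in ordinary C++; the pool size and the acquire()/release() calls below are stand-ins for the instructions the compiler would insert around a register-hungry phase, under the assumption that an extended set is either granted exclusively or refused.

    #include <mutex>
    #include <optional>

    struct SharedRegisterPool {
        int free_extended_sets = 4;     // extended register sets shared by all threads
        std::mutex m;

        // Compiler-inserted "acquire": grants exclusive use of one extended set.
        std::optional<int> acquire() {
            std::lock_guard<std::mutex> g(m);
            if (free_extended_sets == 0) return std::nullopt;   // caller must wait or spill
            return --free_extended_sets;
        }
        // Compiler-inserted "release": returns the set for other threads to use.
        void release() {
            std::lock_guard<std::mutex> g(m);
            ++free_extended_sets;
        }
    };

    int main() {
        SharedRegisterPool pool;
        // A phase that liveness analysis says needs more than the base set:
        if (auto set = pool.acquire()) {
            // ... execute the phase using the base registers plus the extended set ...
            pool.release();              // phase over, registers go back to the pool
        }
    }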

Experience convergence for multiple platforms

A processor may launch a common layer including a browser and a shell. The processor may identify a request for data required by the software platform. The processor may determine a context in which the request was generated by detecting, by the shell, whether the software platform is an on-premise solution or a cloud solution; examining, by the shell, the request to determine whether it requires local resources or cloud-based resources; and designating, by the shell, the context as a cloud-based context or an on-premise context based on the detecting and the examining. In response to the determining, the processor may perform processing related to a request generated in the on-premise context using at least one local resource stored in a local data store; and communicate, by the browser, with at least one cloud-based resource to perform processing related to a request generated in the cloud-based context.
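
The routing decision the shell makes can be expressed as a small C++ sketch; the detection helpers and the "local:" request marker are hypothetical, standing in for however the shell actually distinguishes the two contexts.

    #include <iostream>
    #include <string>

    enum class Context { OnPremise, CloudBased };

    bool platform_is_on_premise() { return true; }            // stand-in for the shell's detection
    bool request_needs_local_resources(const std::string& request) {
        return request.rfind("local:", 0) == 0;               // assumed marker, for illustration
    }

    Context designate_context(const std::string& request) {
        if (platform_is_on_premise() && request_needs_local_resources(request))
            return Context::OnPremise;
        return Context::CloudBased;
    }

    int main() {
        std::string request = "local:settings";
        if (designate_context(request) == Context::OnPremise)
            std::cout << "process using a resource from the local data store\n";
        else
            std::cout << "browser communicates with a cloud-based resource\n";
    }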

ADJUSTING STORE GATHER WINDOW DURATION IN A DATA PROCESSING SYSTEM SUPPORTING SIMULTANEOUS MULTITHREADING
20220405125 · 2022-12-22

In at least some embodiments, a store-type operation is received and buffered within a store queue entry of a store queue associated with a cache memory of a processor core capable of executing multiple simultaneous hardware threads. A thread identifier indicating a particular hardware thread among the multiple hardware threads that issued the store-type operation is recorded. An indication of whether the store queue entry is a most recently allocated store queue entry for buffering store-type operations of the hardware thread is also maintained. While the indication indicates the store queue entry is a most recently allocated store queue entry for buffering store-type operations of the particular hardware thread, the store queue extends a duration of a store gathering window applicable to the store queue entry. For example, the duration may be extended by decreasing a rate at which the store gathering window applicable to the store queue entry ends.
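
A simplified model of the window extension, with invented tick counts: while a store queue entry remains the most recently allocated one for its hardware thread, its gather window counter is decremented more slowly, so the window stays open longer.

    #include <cstdint>
    #include <iostream>

    struct StoreQueueEntry {
        uint8_t thread_id;
        int     gather_window = 8;          // ticks remaining before gathering closes
        bool    most_recent_for_thread = true;
    };

    // Called once per cycle for each valid store queue entry.
    void tick(StoreQueueEntry& e) {
        // Most-recent entries age at a reduced rate, extending the gather window.
        int decrement = e.most_recent_for_thread ? 1 : 2;
        e.gather_window -= decrement;
        if (e.gather_window < 0) e.gather_window = 0;
    }

    int main() {
        StoreQueueEntry a{0}, b{1};
        b.most_recent_for_thread = false;   // thread 1 has since allocated a newer entry
        for (int cycle = 0; cycle < 4; ++cycle) { tick(a); tick(b); }
        std::cout << a.gather_window << ' ' << b.gather_window << '\n';   // prints 4 0
    }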

EVICTING AND RESTORING INFORMATION USING A SINGLE PORT OF A LOGICAL REGISTER MAPPER AND HISTORY BUFFER IN A MICROPROCESSOR COMPRISING MULTIPLE MAIN REGISTER FILE ENTRIES MAPPED TO ONE ACCUMULATOR REGISTER FILE ENTRY

A computer system, processor, programming instructions, and/or method of processing data that includes a main register file having a plurality of entries for storing data; an accumulator register file having a plurality of entries for storing data, wherein multiple main register file entries are mapped to one entry of the accumulator register file; a logical register mapper to track and map logical registers to main register file entries; and a history buffer. Processing wide data width instructions includes evicting and restoring information from a single primary entry in the logical register mapper, through a single read or write port in the logical register mapper, without evicting or restoring the other main register file entries mapped to the accumulator register file entry.
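
A toy mapping model in C++: here four main register file (MRF) entries back one accumulator register file entry, and eviction or restoration touches only one primary logical-register-mapper slot. The sizes and field names are illustrative, not taken from the claims.

    #include <iostream>

    constexpr int kMrfPerAcc = 4;         // main register file entries per accumulator entry

    struct MapperEntry {                  // one logical register mapper slot
        int  mrf_primary    = -1;         // primary MRF entry recorded in the mapper
        bool in_accumulator = false;      // data currently lives in the accumulator entry
    };

    // Evict: a single write through the mapper's single port records the primary
    // MRF entry; the other kMrfPerAcc - 1 mapped MRF entries are left untouched.
    void evict(MapperEntry& m, int primary_mrf) {
        m.mrf_primary    = primary_mrf;
        m.in_accumulator = true;
    }

    // Restore: again a single mapper access reinstates the mapping.
    void restore(MapperEntry& m) {
        m.in_accumulator = false;
    }

    int main() {
        MapperEntry acc0_mapping;
        evict(acc0_mapping, 12);          // wide-data instruction moves state into accumulator entry 0
        restore(acc0_mapping);            // a later instruction brings it back
        std::cout << acc0_mapping.mrf_primary << '\n';   // prints 12
    }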