G06F9/3826

DECOUPLED ACCESS-EXECUTE PROCESSING
20220391214 · 2022-12-08 ·

An apparatus comprises first instruction execution circuitry, second instruction execution circuitry, and a decoupled access buffer. Instructions of an ordered sequence of instructions are issued to one of the first and second instruction execution circuitry for execution in dependence on whether the instruction has a first type label or a second type label. An instruction with the first type label is an access-related instruction which determines at least one characteristic of a load operation to retrieve a data value from a memory address. Instruction execution by the first instruction execution circuitry of instructions having the first type label is prioritised over instruction execution by the second instruction execution circuitry of instructions having the second type label. Data values retrieved from memory as a result of execution of instructions having the first type label are stored in the decoupled access buffer.
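A minimal sketch of the issue logic described above, in Python. The labels, the toy memory, and the queue structure are all assumptions for illustration: access-labelled instructions run first, their loaded values land in a decoupled access buffer, and execute-labelled instructions consume from that buffer rather than from memory.

```python
from collections import deque

MEMORY = {0x10: 7, 0x20: 42}  # toy memory: address -> value (hypothetical)

def run(instructions):
    # Partition by type label, as the abstract's issue logic does.
    access_q = deque(i for i in instructions if i["label"] == "access")
    execute_q = deque(i for i in instructions if i["label"] == "execute")
    dab = {}  # decoupled access buffer: address -> loaded value
    # Access-labelled instructions are prioritised and run first.
    while access_q:
        ins = access_q.popleft()
        dab[ins["addr"]] = MEMORY[ins["addr"]]
    results = []
    while execute_q:
        ins = execute_q.popleft()
        # Execute-side instructions consume values from the buffer,
        # never touching memory directly.
        results.append(ins["op"](dab[ins["addr"]]))
    return results

prog = [
    {"label": "execute", "addr": 0x10, "op": lambda v: v * 2},
    {"label": "access", "addr": 0x10},
    {"label": "access", "addr": 0x20},
    {"label": "execute", "addr": 0x20, "op": lambda v: v + 1},
]
print(run(prog))  # [14, 43]
```

Note that the execute-side multiply on 0x10 appears before its access instruction in program order, yet still finds its operand buffered, because access execution was prioritised.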

INSERTING A PROXY READ INSTRUCTION IN AN INSTRUCTION PIPELINE IN A PROCESSOR
20220365780 · 2022-11-17 ·

Inserting a proxy read instruction in an instruction pipeline in a processor is disclosed. A scheduler circuit is configured to recognize when a produced value generated by execution of a producer instruction in the instruction pipeline will not be available through a data forwarding path to be consumed for processing of a subsequent consumer instruction. In this case, the scheduler circuit is configured to insert a proxy read instruction in the instruction pipeline to cause execution of an operation that generates the same produced value as was generated by the previous execution of the producer instruction. Thus, the produced value remains available in the instruction pipeline, reachable again through a data forwarding path to an earlier stage of the pipeline, to be consumed by a consumer instruction, which may avoid a pipeline stall.
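The scheduling decision can be sketched as follows. The fixed forwarding window and the instruction encoding are assumptions: if a consumer issues more than `FORWARD_WINDOW` slots after its producer, the value has left the forwarding network, so a proxy read is inserted to regenerate it.

```python
FORWARD_WINDOW = 2  # assumed number of slots a result stays forwardable

def schedule(instrs):
    issued = []
    last_def = {}  # register -> issue slot of its most recent producer
    for ins in instrs:
        for src in ins.get("srcs", []):
            if src in last_def and len(issued) - last_def[src] > FORWARD_WINDOW:
                # Value fell out of the forwarding path: insert a proxy read
                # so the same value is re-produced and forwardable again.
                issued.append({"op": "proxy_read", "dst": src})
                last_def[src] = len(issued) - 1
        issued.append(ins)
        if "dst" in ins:
            last_def[ins["dst"]] = len(issued) - 1
    return issued

prog = [
    {"op": "mul", "dst": "r1"},
    {"op": "nop"}, {"op": "nop"}, {"op": "nop"},
    {"op": "add", "srcs": ["r1"]},
]
print([i["op"] for i in schedule(prog)])
# ['mul', 'nop', 'nop', 'nop', 'proxy_read', 'add']
```

The `add` would otherwise stall waiting for `r1` to be read from the register file; the inserted `proxy_read` keeps the value reachable via forwarding.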

Convolution operator system to perform concurrent convolution operations

Disclosed is a convolution operator system for performing convolution operations concurrently on an image. An input router receives image data. A controller allocates the image data to a set of computing blocks based on the size of the image data and the number of available computing blocks. Each computing block produces a convolution output corresponding to a row of the image. The controller allocates a plurality of groups, each having one or more computing blocks, to generate a set of convolution outputs. Further, a pipeline adder aggregates the set of convolution outputs to produce an aggregated convolution output. An output router transmits either the convolution output or the aggregated convolution output for performing a subsequent convolution operation to generate a convolution result for the image data.
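A toy sketch of the row-wise allocation, assuming a 1-D kernel and round-robin assignment (the allocation policy is not specified in the abstract): the controller hands rows to a fixed number of computing blocks, each block convolves its rows, and the per-row outputs are reassembled.

```python
def conv_row(row, kernel):
    # Valid (no-padding) 1-D convolution of one image row.
    k = len(kernel)
    return [sum(row[i + j] * kernel[j] for j in range(k))
            for i in range(len(row) - k + 1)]

def allocate(rows, n_blocks):
    # Round-robin allocation of rows to blocks, based on the row count
    # and the number of available computing blocks (assumed policy).
    blocks = [[] for _ in range(n_blocks)]
    for i, row in enumerate(rows):
        blocks[i % n_blocks].append((i, row))
    return blocks

def convolve_image(image, kernel, n_blocks=2):
    blocks = allocate(image, n_blocks)
    out = [None] * len(image)
    for block in blocks:  # blocks would run concurrently in hardware
        for i, row in block:
            out[i] = conv_row(row, kernel)
    return out

img = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(convolve_image(img, [1, -1]))  # [[-1, -1, -1], [-1, -1, -1]]
```

In hardware the blocks run in parallel and a pipeline adder would sum partial outputs across a group; the sequential loop here only models the data flow.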

STORE-TO-LOAD FORWARDING CORRECTNESS CHECKS AT STORE INSTRUCTION COMMIT
20220357955 · 2022-11-10 ·

A microprocessor includes a load queue, a store queue, and a load/store unit. During execution of a store instruction, the load/store unit records store information to a store queue entry; the store information comprises store address and store size information about the store data to be stored by the store instruction. During execution of a load instruction that is younger in program order than the store instruction, the load/store unit performs forwarding behavior (forwarding or not forwarding the store data from the store instruction to the load instruction), records load information comprising load address and load size information about the load data to be loaded, and records the forwarding behavior, all in a load queue entry. During commit of the store instruction, the load/store unit uses the recorded store information, the recorded load information, and the recorded forwarding behavior to check the correctness of the forwarding behavior.
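The commit-time check can be sketched as a pure address-range comparison. The record layout is an assumption; the rule checked is the standard one: forwarding was legal only if the store's byte range fully covered the load's, and not forwarding was legal only if the ranges did not overlap at all.

```python
def ranges_overlap(addr_a, size_a, addr_b, size_b):
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a

def forwarding_correct(store, load):
    # The store covers the load iff its byte range contains the load's
    # byte range; only then was forwarding the store data legal.
    covers = (store["addr"] <= load["addr"] and
              store["addr"] + store["size"] >= load["addr"] + load["size"])
    overlaps = ranges_overlap(store["addr"], store["size"],
                              load["addr"], load["size"])
    if load["forwarded"]:
        return covers           # forwarded: store must have covered the load
    return not overlaps         # not forwarded: there must be no overlap

store = {"addr": 0x100, "size": 8}
good_load = {"addr": 0x104, "size": 4, "forwarded": True}
bad_load = {"addr": 0x106, "size": 4, "forwarded": True}  # spills past store
print(forwarding_correct(store, good_load))  # True
print(forwarding_correct(store, bad_load))   # False
```

A failed check at commit would trigger a pipeline flush and re-execution of the load; that recovery path is outside this sketch.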

Thwarting Store-to-Load Forwarding Side Channel Attacks by Pre-Forwarding Matching of Physical Address Proxies and/or Permission Checking
20220358209 · 2022-11-10 ·

A method and system for mitigating against side channel attacks (SCA) that exploit speculative store-to-load forwarding is described. The method comprises ensuring that the physical load and store addresses match and/or that permissions are present before speculatively store-to-load forwarding. Various improvements maintain a short load-store pipeline, including usage of a virtual level-one data cache (DL1), usage of an inclusive physical level-two data cache (DL2), storage and lookup of physical data address equivalents in the DL1, and using a memory dependence predictor (MDP) to speed up or replace store queue camming of load data addresses against store data addresses.
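The gating condition can be sketched as below. The TLB layout is hypothetical; the point is the order of checks: no speculative forwarding unless the load has read permission and the physical addresses of the load and store actually match.

```python
# Hypothetical TLB: virtual address -> physical address + permissions.
TLB = {0x1000: {"pa": 0x9000, "perms": "rw"},
       0x2000: {"pa": 0x9000, "perms": ""}}   # aliased page, no permission

def may_forward(store_va, load_va):
    s, l = TLB.get(store_va), TLB.get(load_va)
    if s is None or l is None:
        return False            # no translation: do not forward speculatively
    if "r" not in l["perms"]:
        return False            # permission check fails: no forwarding
    return s["pa"] == l["pa"]   # physical addresses must match

print(may_forward(0x1000, 0x1000))  # True
print(may_forward(0x1000, 0x2000))  # False: same PA but no read permission
```

Refusing to forward to a load that lacks permission closes the channel in which an attacker observes secret store data via a faulting, speculatively-forwarded load.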

System and Method for Implementing Strong Load Ordering in a Processor Using a Circular Ordering Ring

A system and corresponding method enforce strong load ordering in a processor. The system comprises an ordering ring that stores entries corresponding to in-flight memory instructions associated with a program order, scanning logic, and recovery logic. The scanning logic scans the ordering ring in response to execution or completion of a given load instruction of the in-flight memory instructions and detects an ordering violation in the event that at least one entry indicates that a younger load instruction has completed and is associated with an invalidated cache line. In response to the ordering violation, the recovery logic allows the given load instruction to complete, flushes the younger load instruction, and restarts execution of the processor after the given load instruction in program order. This causes data returned by the given and younger load instructions to be consistent with execution according to the program order, satisfying strong load ordering.
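The scan itself is a simple pass over the ring, sketched here with an assumed entry layout: on completion of a given load, flag any younger load that already completed against a cache line that has since been invalidated.

```python
def scan_ordering_ring(ring, given_idx):
    # Entries at indices greater than given_idx are younger in program
    # order (simplified: a real ring wraps around a head pointer).
    violations = []
    for idx, entry in enumerate(ring):
        if (idx > given_idx
                and entry["kind"] == "load"
                and entry["completed"]
                and entry["line_invalidated"]):
            violations.append(idx)
    return violations

ring = [
    {"kind": "load", "completed": True, "line_invalidated": False},   # given
    {"kind": "store", "completed": False, "line_invalidated": False},
    {"kind": "load", "completed": True, "line_invalidated": True},    # violator
]
print(scan_ordering_ring(ring, 0))  # [2]
```

On a non-empty result, recovery logic would let the given load complete, flush the flagged younger loads, and restart execution just after the given load.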

UNIFIED AUTOMATION OF APPLICATION DEVELOPMENT

Unified automation of application development and delivery is provided. An automation pipeline execution coordinator may define a pipeline specification that includes actions to be performed, a triggering event definition, and a specification for determining execution context. The coordinator may concurrently detect triggering events for multiple pipelines matching the pipeline specification and, responsive to the detecting, determine execution contexts for the pipelines. The coordinator may then execute the multiple pipelines, where execution may proceed independently for pipelines with differing execution contexts. For pipelines sharing an execution context, execution of various actions of the respective pipelines may be coordinated. Execution context may be determined according to the specification for determining execution context, which may include an overridable default specification that determines context by locations of source data related to the triggering event. Pipeline specifications may be defined using pipeline specification templates and input from users obtained via various user interfaces.
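The context-grouping step can be sketched as follows. The event fields and the default context function are assumptions, modeled on the overridable default the abstract describes (context derived from the location of the triggering event's source data); pipelines that land in the same group would be coordinated, others proceed independently.

```python
def default_context(event):
    # Overridable default: context is the location of the source data
    # related to the triggering event (assumed field name).
    return event["source_location"]

def group_pipelines(events, context_fn=default_context):
    groups = {}
    for ev in events:
        groups.setdefault(context_fn(ev), []).append(ev["pipeline"])
    return groups

events = [
    {"pipeline": "build-a", "source_location": "repo1"},
    {"pipeline": "test-a", "source_location": "repo1"},
    {"pipeline": "build-b", "source_location": "repo2"},
]
print(group_pipelines(events))
# {'repo1': ['build-a', 'test-a'], 'repo2': ['build-b']}
```

Passing a different `context_fn` models overriding the default specification for determining execution context.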

Reuse in-flight register data in a processor

Devices and techniques for short-thread rescheduling in a processor are described herein. When an instruction for a thread completes, a result is produced. The condition that the same thread is scheduled in a next execution slot and that the next instruction of the thread will use the result can be detected. In response to this condition, the result can be provided directly to an execution unit for the next instruction.
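A sketch of the detection and bypass, with an assumed slot/instruction encoding: when the same thread occupies the next execution slot and its next instruction reads the register just written, the result is handed straight to the execution unit instead of being read back from the register file.

```python
def execute(slots):
    regfile = {}
    trace = []   # records how each source operand was obtained
    prev = None  # (thread, dst_reg, value) produced in the previous slot
    for thread, ins in slots:
        srcs = {}
        for r in ins["srcs"]:
            if prev and prev[0] == thread and prev[1] == r:
                srcs[r] = prev[2]          # reuse in-flight result directly
                trace.append("forwarded")
            else:
                srcs[r] = regfile[r]       # normal register-file read
                trace.append("regfile")
        val = ins["op"](*[srcs[r] for r in ins["srcs"]])
        regfile[ins["dst"]] = val
        prev = (thread, ins["dst"], val)
    return regfile, trace

regs, trace = execute([
    ("t0", {"op": lambda: 5, "srcs": [], "dst": "r1"}),
    ("t0", {"op": lambda v: v + 1, "srcs": ["r1"], "dst": "r2"}),
])
print(regs["r2"], trace)  # 6 ['forwarded']
```

If a different thread had been scheduled in the second slot, the operand would have come from the register file and the bypass would not fire.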

GATHERING PAYLOAD FROM ARBITRARY REGISTERS FOR SEND MESSAGES IN A GRAPHICS ENVIRONMENT

An apparatus to facilitate gathering payload from arbitrary registers for send messages in a graphics environment is disclosed. The apparatus includes processing resources comprising execution circuitry to receive a send gather message instruction identifying a number of registers to access for a send message and identifying IDs of a plurality of individual registers corresponding to the number of registers; decode a first phase of the send gather message instruction; based on decoding the first phase, cause a second phase of the send gather message instruction to bypass an instruction decode stage; and dispatch the first phase subsequently followed by dispatch of the second phase to a send pipeline. The apparatus can also perform an immediate move of the IDs of the plurality of individual registers to an architectural register of the execution circuitry and include a pointer to the architectural register in the send gather message instruction.
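The payload-gathering step can be sketched as below. The data structures are assumptions; the shape follows the abstract: the register IDs are first moved into an architectural register, the instruction carries a pointer to that register, and the payload is gathered from the named (non-contiguous) registers.

```python
def build_payload(regfile, arch_regs, send_gather):
    # Immediate move of the register IDs into an architectural register;
    # the send gather message instruction then carries only a pointer.
    arch_regs[send_gather["ptr"]] = send_gather["reg_ids"]
    ids = arch_regs[send_gather["ptr"]]
    assert len(ids) == send_gather["count"]
    # Gather the payload from arbitrary, non-contiguous registers.
    return [regfile[i] for i in ids]

regfile = {0: "a", 3: "b", 7: "c", 9: "d"}
arch = {}
msg = {"count": 3, "reg_ids": [7, 0, 9], "ptr": "addr0"}
print(build_payload(regfile, arch, msg))  # ['c', 'a', 'd']
```

The two-phase decode (second phase bypassing the decode stage) is a pipeline detail not modeled here; this only shows how arbitrary register IDs become a message payload.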

Computer Architecture with Register Name Addressing and Dynamic Load Size Adjustment
20230089349 · 2023-03-23 ·

A computer architecture allows load instructions to fetch from cache memory "fat" loads having more data than necessary to satisfy execution of the load instruction, for example, loading a full cache line instead of a required word. The fat load allows load instructions having spatiotemporal locality to share the data of the fat load, avoiding cache accesses. Rapid access to local data structures is provided by using base register names to directly access those structures as a proxy for the actual load base register address.
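A toy model of the fat-load sharing, with an assumed 4-word line and a buffer keyed by the base register's name (the register-name proxy the abstract describes): the first load pulls a whole line; nearby loads through the same base register are served from it without touching the cache.

```python
LINE = 4  # words per cache line (toy size)
cache = {addr: addr * 10 for addr in range(32)}  # toy memory/cache

class FatLoadBuffer:
    def __init__(self):
        self.by_base = {}        # base register name -> (line addr, line data)
        self.cache_accesses = 0

    def load(self, base_name, addr):
        line_addr = addr - addr % LINE
        hit = self.by_base.get(base_name)
        if hit and hit[0] == line_addr:
            return hit[1][addr - line_addr]    # served from the fat load
        self.cache_accesses += 1               # one real cache access
        line = [cache[line_addr + i] for i in range(LINE)]
        self.by_base[base_name] = (line_addr, line)
        return line[addr - line_addr]

buf = FatLoadBuffer()
vals = [buf.load("r5", a) for a in (8, 9, 10)]  # same line via base r5
print(vals, buf.cache_accesses)  # [80, 90, 100] 1
```

Three word-sized loads with spatiotemporal locality cost a single cache access; keying the buffer by register name rather than resolved address is what enables the lookup before the base register's value is even known.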