G06F8/4451

Electronic apparatus, processor and control method including a compiler scheduling instructions to reduce unused input ports

An electronic apparatus is provided for obtaining compiling data used in an external processor including a function unit including a plurality of input ports. The electronic apparatus includes a storage configured to store a plurality of instructions, and a processor configured to schedule each of the plurality of instructions in a plurality of cycles, assign a plurality of input data corresponding to the plurality of instructions to the plurality of input ports in a corresponding cycle, and if an unassigned input port among the plurality of input ports is present in a first cycle, assign a part of input data corresponding to an instruction scheduled in a second cycle after the first cycle to the unassigned input port in the first cycle, and obtain the compiling data by assigning remaining data of the input data corresponding the instruction to one of the plurality of input ports in the second cycle.

Reducing stalling in a simultaneous multithreading processor by inserting thread switches for instructions likely to stall

A system and method suppresses occurrence of stalling caused by data dependency other than register dependency in an out-of-order processor. A stall reducing method includes a handler for detecting a stall occurring during execution of execution code using a performance monitoring unit, and for identifying, based on dependencies, a second instruction on which a first instruction is data dependent, the stall based on this dependency. A profiler registers the second instruction as profile information. An optimization module inserts a thread yield instruction in an appropriate position inside execution code or an original code file based on the profile information, and outputs optimized execution code.

PADDED VECTORIZATION WITH COMPILE TIME KNOWN MASKS
20200073662 · 2020-03-05 ·

A computing system includes a processing unit and a memory storing instructions that, when executed by the processor, cause the processor to receive program source code in a compiler, identify in the program source code a set of operations for vectorizing, where each operation in the set of operations specifies a set of one or more operands, in response to identifying the set of operations, vectorize the set of operations by, based on the number of operations in the set of operations and a total number of lanes in a first vector register, generating a mask indicating a first unmasked lane and a first masked lane in the first vector register, based on the mask, generating a set of one or more instructions for loading into the first unmasked lane a first operand of a first operation of the set of operations, and loading the first operand into the first masked lane.

Background processing during remote memory access
11966619 · 2024-04-23 · ·

An apparatus for executing a software program, comprising at least one hardware processor configured for: identifying in a plurality of computer instructions at least one remote memory access instruction and a following instruction following the at least one remote memory access instruction; executing after the at least one remote memory access instruction a sequence of other instructions, where the sequence of other instructions comprises a return instruction to execute the following instruction; and executing the following instruction; wherein executing the sequence of other instructions comprises executing an updated plurality of computer instructions produced by at least one of: inserting into the plurality of computer instructions the sequence of other instructions or at least one flow-control instruction to execute the sequence of other instructions; and replacing the at least one remote memory access instruction with at least one non-blocking memory access instruction.

Method and system for optimizing data transfer from one memory to another memory
11954497 · 2024-04-09 · ·

A method and system for moving data from a source memory to a destination memory by a processor are disclosed. The processor has a plurality of registers and the source memory stores a sequence of instructions that include one or more load instructions and one or more store instructions. The processor moves the load instructions from the source memory to the destination memory. Then, the processor initiates execution of the load instructions from the destination memory in order to load the data from the source memory to one or more registers in the processor. Execution then returns to the sequence of instructions stored in the source memory, and the processor stores the data from the registers to the destination memory.

System And Method For Improving Streaming Video Via Better Buffer Management
20190306551 · 2019-10-03 ·

Disclosed are solutions for improving Internet video streaming. A first number is determined based on one or more parameters, including network conditions. A second number is then determined corresponding to a number of video segments that is greater than or equal in size to a third number determined based on a bandwidth-delay product of the network to a remote machine. The second number of video segments is then requested in a pipelined fashion. Pipelined requests are stopped when a predetermined size of the video has been requested that is greater than or equal to the first number. Alternatively, a request is sent to the remote machine to send a portion of the video, where the size of the portion of the video is equal to the first number or equal to the size of video remaining if less than the first number. Combined with pipelining, the approach achieves near-optimal throughput and fast bitrate adaptation, regardless of control plane algorithm.

Lightweight restricted transactional memory for speculative compiler optimization

Embodiments described herein utilize restricted transactional memory (RTM) instructions to implement speculative compile time optimizations that will be automatically rolled back by hardware in the event of a missed speculation. In one embodiment, a lightweight version of RTM for speculative compiler optimization is described to provide lower operational overhead in comparison to conventional RTM implementations used when performing SLE.

Program code optimization for reducing branch mispredictions

Systems, apparatuses, and methods for implementing an IF2FOR transformation are disclosed. In one embodiment, a first group of instructions include an IF-statement and one or more control dependent instructions. The first group of instructions are transformed into a second group of instructions if the first group of instructions meet one or more criteria. In one embodiment, the criteria includes the (1) IF-statement being part of a loop and (2) the control dependent instructions not having any inter-loop iteration dependency. The second group of instructions are executable to (1) store results of the IF-statement condition for a first number of iterations and (2) execute the control dependent instructions for a second number of iterations when the IF-statement condition evaluates to true.

PROGRAM CODE OPTIMIZATION FOR REDUCING BRANCH MISPREDICTIONS
20180349140 · 2018-12-06 ·

Systems, apparatuses, and methods for implementing an IF2FOR transformation are disclosed. In one embodiment, a first group of instructions include an IF-statement and one or more control dependent instructions. The first group of instructions are transformed into a second group of instructions if the first group of instructions meet one or more criteria. In one embodiment, the criteria includes the (1) IF-statement being part of a loop and (2) the control dependent instructions not having any inter-loop iteration dependency. The second group of instructions are executable to (1) store results of the IF-statement condition for a first number of iterations and (2) execute the control dependent instructions for a second number of iterations when the IF-statement condition evaluates to true.

REDUCING STALLING IN A SIMULTANEOUS MULTITHREADING PROCESSOR BY INSERTING THREAD SWITCHES FOR INSTRUCTIONS LIKELY TO STALL
20180336030 · 2018-11-22 ·

A system and method suppresses occurrence of stalling caused by data dependency other than register dependency in an out-of-order processor. A stall reducing method includes a handler for detecting a stall occurring during execution of execution code using a performance monitoring unit, and for identifying, based on dependencies, a second instruction on which a first instruction is data dependent, the stall based on this dependency. A profiler registers the second instruction as profile information. An optimization module inserts a thread yield instruction in an appropriate position inside execution code or an original code file based on the profile information, and outputs optimized execution code.