G06F9/30163

SCALAR CORE INTEGRATION

Methods and apparatus relating to scalar core integration in a graphics processor. In an example, an apparatus comprises a processor to receive a set of workload instructions for a graphics workload from a host complex, determine a first subset of operations in the set of operations that is suitable for execution by a scalar processor complex of the graphics processing device and a second subset of operations in the set of operations that is suitable for execution by a vector processor complex of the graphics processing device, assign the first subset of operations to the scalar processor complex for execution to generate a first set of outputs, assign the second subset of operations to the vector processor complex for execution to generate a second set of outputs. Other embodiments are also disclosed and claimed.

MULTIPLY-ACCUMULATION IN A DATA PROCESSING APPARATUS
20190369989 · 2019-12-05 ·

A data processing apparatus, a method of operating a data processing apparatus, a non-transitory computer readable storage medium, and an instruction are provided. The instruction specifies a first source register, a second source register, and a set of N accumulation registers. In response to the instruction control signals are generated, causing processing circuitry to extract N data elements from content of the first source register, perform a multiplication of each of the N data elements by content of the second source register, and apply a result of each multiplication to content of a respective target register of the set of N accumulation registers. As a result plural (N) multiplications are performed in a manner that effectively provides a multiplier N times the register width, but without requiring the register file to be made N times larger.

Apparatus and method for complex multiply and accumulate

An embodiment of the invention is a processor including execution circuitry to calculate, in response to a decoded instruction, a result of a complex multiply-accumulate of a first complex number, a second complex number, and a third complex number. The calculation includes a first operation to calculate a first term of a real component of the result and a first term of the imaginary component of the result. The calculation also includes a second operation to calculate a second term of the real component of the result and a second term of the imaginary component of the result. The processor also includes a decoder to decode an instruction to generate the decoded instruction and a first source register, a second source register, and a source and destination register to provide the first complex number, the second complex number, and the third complex number, respectively.

PROCESSORS, METHODS, SYSTEMS, AND INSTRUCTIONS TO GENERATE SEQUENCES OF INTEGERS IN WHICH INTEGERS IN CONSECUTIVE POSITIONS DIFFER BY A CONSTANT INTEGER STRIDE AND WHERE A SMALLEST INTEGER IS OFFSET FROM ZERO BY AN INTEGER OFFSET

A method of an aspect includes receiving an instruction. The instruction indicates an integer stride, indicates an integer offset, and indicates a destination storage location. A result is stored in the destination storage location in response to the instruction. The result includes a sequence of at least four integers in numerical order with a smallest one of the at least four integers differing from zero by the integer offset and with all integers of the sequence in consecutive positions differing by the integer stride. Other methods, apparatus, systems, and instructions are disclosed.

COALESCING ADJACENT GATHER/SCATTER OPERATIONS

According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read contiguous a first and a second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location.

REAL TIME STACK PROTECTION
20190227953 · 2019-07-25 ·

Methods, circuitries, and systems for real-time protection of a stack are provided. A stack protection circuitry includes interface circuitry and computation circuitry. The interface is circuitry configured to receive a return instruction from a central processing unit (CPU). The computation circuitry is configured to, in response to the return instruction, generate protection data that i) identifies a new topmost return address location that is below a current protected topmost return address location and ii) specifies read only access for the new topmost return address location. The interface circuitry is configured to provide the protection data to a memory protection unit to cause the memory protection unit to enforce a read only access restriction on the new topmost return address location.

Fetch unit for predicting target for subroutine return instructions
10360037 · 2019-07-23 · ·

A fetch unit configured to, in response to detecting a subroutine call and link instruction, calculate and store a predicted target address for the corresponding subroutine return instruction in a prediction stack, and if certain conditions are met, also cause to be stored in the prediction stack a predicted target instruction bundle. The fetch unit is also configured to, in response to detecting a subroutine return instruction, use the predicted target address in the prediction stack to determine the address of the next instruction bundle to be fetched, and if certain conditions are met, cause any valid predicted target instruction bundle in the prediction stack to be the next bundle to be decoded.

Providing efficient recursion handling using compressed return address stacks (CRASs) in processor-based systems

Providing efficient recursion handling using compressed return address stacks (CRASs) in processor-based systems is disclosed. In one aspect, a processor-based system provides a branch prediction circuit including a CRAS. Each CRAS entry within the CRAS includes an address field and a counter field. When a call instruction is encountered, a return address of the call instruction is compared to the address field of a top CRAS entry indicated by a CRAS top-of-stack (TOS) index. If the return address matches the top CRAS entry, the counter field of the top CRAS entry is incremented instead of adding a new CRAS entry for the return address. When a return instruction is subsequently encountered in the instruction stream, the counter field of the top CRAS entry is decremented if its value is greater than zero (0), or, if not, the top CRAS entry is removed from the CRAS.

APPARATUS AND METHOD FOR COMPLEX MULTIPLY AND ACCUMULATE

An embodiment of the invention is a processor including execution circuitry to calculate, in response to a decoded instruction, a result of a complex multiply-accumulate of a first complex number, a second complex number, and a third complex number. The calculation includes a first operation to calculate a first term of a real component of the result and a first term of the imaginary component of the result. The calculation also includes a second operation to calculate a second term of the real component of the result and a second term of the imaginary component of the result. The processor also includes a decoder to decode an instruction to generate the decoded instruction and a first source register, a second source register, and a source and destination register to provide the first complex number, the second complex number, and the third complex number, respectively.

EMPLOYING A STACK ACCELERATOR FOR STACK-TYPE ACCESSES

A stack accelerator is employed for stack-type accesses. An instruction stream is scanned for stack-type accesses. These stack-type accesses may include push and pop stack operations. Based on identifying a stack-type access in the instruction stream, memory operations are replaced with one or more operations that access a stack in a stack accelerator.