G06F9/30072

Responding to branch misprediction for predicated-loop-terminating branch instruction

A predicated-loop-terminating branch instruction controls, based on whether a loop termination condition is satisfied, whether the processing circuitry should process a further iteration of a predicated loop body or process a following instruction. If at least one unnecessary iteration of the predicated loop body is processed following a mispredicted-non-termination branch misprediction when the loop termination condition is mispredicted as unsatisfied for a given iteration when it should have been satisfied, processing of the at least one unnecessary iteration of the predicated loop body is predicated to suppress an effect of the at least one unnecessary iteration. When the mispredicted-non-termination branch misprediction is detected for the given iteration of the predicated-loop-terminating branch instruction, in response to determining that a flush suppressing condition is satisfied, flushing of the at least one unnecessary iteration of the predicated loop body is suppressed as a response to the mispredicted-non-termination branch misprediction.

EXCEPTION SUMMARY FOR INVALID VALUES DETECTED DURING INSTRUCTION EXECUTION

An exception summary is provided for an invalid value detected during instruction execution. An indication that a value determined to be invalid was included in input data to a computation of one or more computations or in output data resulting from the one or more computations is obtained. The value is determined to be invalid due to one exception of a plurality of exceptions. Based on obtaining the indication that the value is determined to be invalid, a summary indicator is set. The summary indicator represents the plurality of exceptions collectively.

Technology For Optimizing Memory-To-Register Operations

An apparatus comprises decoder circuitry to decode an instruction that includes an opcode to indicate a protected load operation, a source field for source memory address information, and a destination field to identify a destination register. The apparatus also comprises memory to store an allocate load-protect (LP) data structure with an entry for the identified destination register. The entry comprises an IP field and a status field. The apparatus also comprises load elision circuitry to (a) use the allocate LP data structure to determine whether the identified destination register has active status for the IP; (b) in response to determining that the identified destination register has active status for the IP, cause the instruction to be elided; and (c) in response to determining that the identified destination register does not have active status for the IP, cause the instruction to be executed. Other embodiments are described and claimed.

PROGRAM EVENT RECORDING STORAGE ALTERATION PROCESSING FOR A NEURAL NEWORK ACCELERATOR INSTRUCTION
20220405120 · 2022-12-22 ·

Instruction processing is performed for an instruction. The instruction is configured to perform a plurality of functions, in which a function of the plurality of functions is to be performed in a plurality of processing phases. A processing phase is defined to store up to a select amount of data. The select amount of data is based on the function to be performed. At least one function of the plurality of functions has a different value for the select amount of data than at least one other function. A determination is made as to whether a store into a designated area occurred based on processing a select processing phase of a select function. Based on determining that the store into the designated area occurred, an interrupt is presented, and based on determining that the store into the designated area did not occur, instruction processing is continued.

Multiply-accumulation in a data processing apparatus
11513796 · 2022-11-29 · ·

A data processing apparatus, a method of operating a data processing apparatus, a non-transitory computer readable storage medium, and an instruction are provided. The instruction specifies a first source register, a second source register, and a set of N accumulation registers. In response to the instruction control signals are generated, causing processing circuitry to extract N data elements from content of the first source register, perform a multiplication of each of the N data elements by content of the second source register, and apply a result of each multiplication to content of a respective target register of the set of N accumulation registers. As a result plural (N) multiplications are performed in a manner that effectively provides a multiplier N times the register width, but without requiring the register file to be made N times larger.

STREAMING ENGINE WITH FLEXIBLE STREAMING ENGINE TEMPLATE SUPPORTING DIFFERING NUMBER OF NESTED LOOPS WITH CORRESPONDING LOOP COUNTS AND LOOP OFFSETS
20230053842 · 2023-02-23 ·

A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces address of data elements for the nested loops. A steam head register stores data elements next to be supplied to functional units for use as operands. A stream template specifies loop count and loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently enabling trade off between the number of loops supported and the size of the loop counts and loop dimensions.

AN APPARATUS AND METHOD FOR CONTROLLING ACCESS TO A SET OF MEMORY MAPPED CONTROL REGISTERS
20230056039 · 2023-02-23 ·

A technique for controlling access to a set of memory mapped control registers. The apparatus has processing circuitry for executing program code to perform data processing operations, and a set of memory mapped control registers for storing control information used to control operation of the processing circuitry. Further, a lockdown register used to store a lockdown value. The processing circuitry is arranged to execute store instructions to perform write operations to a memory address space . Thethe processing circuitry is arranged to prevent a write operation being performed to change the control information in the memory mapped control registers . This significantly reduces the prospect of an attacker seeking to exploit a software vulnerability to change the control information in the memory mapped control registers.

Container isolation method and apparatus for netlink resource

A container isolation method for a netlink resource includes receiving, by a kernel executed by a processor, a trigger instruction from an application program. The method also includes creating, by the kernel according to the trigger instruction, a container corresponding to the application program, creating a netlink namespace for the container, and sending a notification to the application program indicating that the netlink namespace is created. The method further includes receiving, by the kernel, a netlink message from the container, wherein the netlink message comprises entries generated when the container runs. The method additionally includes storing, by the kernel, the entries based on an identifier of the netlink namespace for the container, to send an entry required by the container to user space of the container.

Using a vector processor to configure a direct memory access system for feature tracking operations in a system on a chip

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.

METHOD AND APPARATUS FOR IMPLIED BIT HANDLING IN FLOATING POINT MULTIPLICATION
20230085048 · 2023-03-16 ·

A method is provided that includes performing, by a processor in response to a floating point multiply instruction, multiplication of floating point numbers, wherein determination of values of implied bits of leading bit encoded mantissas of the floating point numbers is performed in parallel with multiplication of the encoded mantissas, and storing, by the processor, a result of the floating point multiply instruction in a storage location indicated by the floating point multiply instruction.