Patent classifications
G06F9/30134
System and method to implement masked vector instructions
A processor includes a register file comprising a length register, a vector register file comprising a plurality of vector registers, a mask register file comprising a plurality of mask registers, and a vector instruction execution circuit to execute a masked vector instruction comprising a first length register identifier representing the length register, a first vector register identifier representing a first vector register of the vector register file, and a first mask register identifier representing a first mask register of the mask register file, wherein the length register is to store a length value representing a number of operations to be applied to data elements stored in the first vector register, the first mask register is to store a plurality of mask bits, and a first mask bit of the plurality of mask bits determines whether a corresponding first one of the operations causes an effect.
Register pressure target function splitting
Provided are embodiments for a method of performing register pressure targeted function splitting. The method can include determining a candidate region of a function, the candidate region comprising variables, and determining a number of available registers in a computing system for allocating the variables of the function. The method can also include grouping the variables in the candidate region into first variables and second variables based at least in part on the number of available registers, and splitting the candidate region of the function into split functions based at least in part on the grouping of the variables. Also provided are embodiments for a computer program product and a system for performing register pressure targeted function splitting.
VALIDATION OF STORE COHERENCE RELATIVE TO PAGE TRANSLATION INVALIDATION
Systems and methods for invalidating page translation entries are described. A processing element may apply a delay to a drain cycle of a store reorder queue (SRQ) of a processing element. The processing element may drain the SRQ under the delayed drain cycle. The processing element may receive a translation lookaside buffer invalidation (TLBI) instruction from an interconnect connecting the plurality of processing elements. The TLBI instruction may be an instruction to invalidate a translation lookaside buffer (TLB) entry corresponding to at least one of a virtual memory page and a physical memory frame. The TLBI instruction may be broadcasted by another processing element. The application of the delay to the drain cycle of the SRQ may decrease a difference between the drain cycle of the SRQ and an invalidation cycle associated with the TLBI.
DEBUGGING COMMUNICATION AMONG UNITS ON PROCESSOR SIMULATOR
A method is provided for identifying a data transfer mismatch between a sender and a receiver from among units of a software simulator of a hardware processor. The simulator runs the plurality of the units which communicate with each other via First-In First-Outs (FIFOs). The method counts amounts of data the sender writes to the FIFOs and the receiver reads from the FIFOs for a given data transfer. The method avoids blocking during FIFO reading and writing operations by (i) reading dummy data by the receiver, even if the FIFOs are empty, when the receiver tries reading from the FIFOs, and (ii) discarding written data if the FIFOs are full, when the sender tries writing to the FIFOs. The method identifies mismatches in the amount of data the sender writes to the FIFOs versus the amount of data the receiver reads from the FIFOs for the given data transfer.
DISTANCE-BASED BRANCH PREDICTION AND DETECTION
Examples of techniques for distance-based branch prediction are disclosed. In one example implementation according to aspects of the present disclosure, a computer-implemented method includes: determining, by a processing system, a potential return instruction address (IA) by determining whether a relationship is satisfied between a first target IA and a first branch IA; storing a second branch IA as a return when a target IA of a second branch matches a potential return IA for the second branch; and applying the potential return IA for the second branch as a predicted target IA of a predicted branch IA stored as a return
DEEP VISION PROCESSOR
Disclosed herein is a processor for deep learning. In one embodiment, the processor comprises: a load and store unit configured to load and store image pixel data and stencil data; a register unit, implementing a banked register file, configured to: load and store a subset of the image pixel data from the load and store unit, and concurrently provide access to image pixel values stored in a register file entry of the banked register file, wherein the subset of the image pixel data comprises the image pixel values stored in the register file entry; and a plurality of arithmetic logic units configured to concurrently perform one or more operations on the image pixel values stored in the register file entry and corresponding stencil data of the stencil data.
SYSTEMS AND METHODS FOR DETECTING COROUTINES
Disclosed herein are systems and method for detecting coroutines. A method may include: identifying an application running on a computing device, wherein the application includes a plurality of coroutines; determining an address of a common entry point for coroutines, wherein the common entry point is found in a memory of the application; identifying, using an injected code, at least one stack trace entry for the common entry point; detecting coroutine context data based on the at least one stack trace entry; adding an identifier of a coroutine associated with the coroutine context data to a list of detected coroutines; and storing the list of detected coroutines in target process memory associated with the application.
Energy Efficient Processor Core Architecture for Image Processor
An apparatus is described. The apparatus includes a program controller to fetch and issue instructions. The apparatus includes an execution lane having at least one execution unit to execute the instructions. The execution lane is part of an execution lane array that is coupled to a two dimensional shift register array structure, wherein, execution lane s of the execution lane array are located at respective array locations and are coupled to dedicated registers at same respective array locations in the two-dimensional shift register array.
COMPILER MANAGED MEMORY FOR IMAGE PROCESSOR
A method is described. The method includes repeatedly loading a next sheet of image data from a first location of a memory into a two dimensional shift register array. The memory is locally coupled to the two-dimensional shift register array and an execution lane array having a smaller dimension than the two-dimensional shift register array along at least one array axis. The loaded next sheet of image data keeps within an image area of the two-dimensional shift register array. The method also includes repeatedly determining output values for the next sheet of image data through execution of program code instructions along respective lanes of the execution lane array, wherein, a stencil size used in determining the output values encompasses only pixels that reside within the two-dimensional shift register array. The method also includes repeatedly moving a next sheet of image data to be fully loaded into the two dimensional shift register array from a second location of the memory to the first location of the memory.
COMPILER FOR TRANSLATING BETWEEN A VIRTUAL IMAGE PROCESSOR INSTRUCTION SET ARCHITECTURE (ISA) AND TARGET HARDWARE HAVING A TWO-DIMENSIONAL SHIFT ARRAY STRUCTURE
A method is described that includes translating higher level program code including higher level instructions having an instruction format that identifies pixels to be accessed from a memory with first and second coordinates from an orthogonal coordinate system into lower level instructions that target a hardware architecture having an array of execution lanes and a shift register array structure that is able to shift data along two different axis. The translating includes replacing the higher level instructions having the instruction format with lower level shift instructions that shift data within the shift register array structure.