Patent classifications
G06F15/8084
Multithreading in vector processors
In one embodiment, a system includes a processor having a vector processing mode and a multithreading mode. The processor is configured to operate on one thread per cycle in the multithreading mode. The processor includes a program counter register having a plurality of program counters, and the program counter register is vectorized. Each program counter in the program counter register represents a distinct corresponding thread of a plurality of threads. The processor is configured to execute the plurality of threads by activating the plurality of program counters in a round robin cycle.
Hardware processors and methods for tightly-coupled heterogeneous computing
Methods and apparatuses relating to tightly-coupled heterogeneous computing are described. In one embodiment, a hardware processor includes a plurality of execution units in parallel, a switch to connect inputs of the plurality of execution units to outputs of a first buffer and a plurality of memory banks and connect inputs of the plurality of memory banks and a plurality of second buffers in parallel to outputs of the first buffer, the plurality of memory banks, and the plurality of execution units, and an offload engine with inputs connected to outputs of the plurality of second buffers.
Systems, apparatuses, and methods for setting an output mask in a destination writemask register from a source write mask register using an input writemask and immediate
Embodiments of systems, apparatuses, and methods for performing in a computer processor generation of a predicate mask based on vector comparison in response to a single instruction are described.
Temporal clones to identify valid items from a set of items
Techniques are provided for using bitmaps to indicate which items, in a set of items, are invalid. The bitmaps include an active bitmap and one or more temporal clones. The active bitmap indicates which items in the set are currently valid. The temporal clones are outdated versions of the active bitmap that indicate which items in the set were invalid at previously points in time. Temporal clones may not be very different from each other. Therefore, temporal clones may be efficiently compressed. For example, a bitmap may be selected as a base bitmap, and one or more other bitmaps are encoded using delta encoding. Run length encoding may then be applied to further compress the bitmap information. These bitmaps may then be used to determine which items are valid relative to past-version requests.
EXCEPTION PRESERVING PARALLEL DATA PROCESSING OF STRING AND UNSTRUCTURED TEXT
A parallel processing method, system, and/or computer program product for performing data parallel wide accesses on an unstructured text is provided. The parallel processing includes creating a pointer that points to a beginning of the unstructured text and loading into a vector register a string segment of the unstructured text based on the pointer. Then, access permissions of a first byte of the string segment are automatically tested. In turn, a determination is made as to whether the string segment includes an end indication, and a remaining portion of the unstructured text is validated by accessing and loading a last character identified by the end indication into the vector register when the string segment is determined to include the end indication.
Granular creation and refresh of columnar data
Techniques are provided for granular load and refresh of columnar data. In an embodiment, a particular data object that contains particular data formatted different from column-major format is maintained, the particular data including first data and second data. First and second data objects contain the first and second data, respectively, organized in the column-major format. In response to changes being committed to the first data in the particular data object, invalidating one or more rows of the first data object. In response to a number of invalidated rows of the first data object exceeding a threshold, automatically performing a refresh operation on the first data object independent of any refresh operation on the second data object.
Vector processor architectures
The present disclosure relates to an integrated circuit device that includes a plurality of vector registers configurable to store a plurality of vectors and switch circuitry communicatively coupled to the plurality of vector registers. The switch circuitry is configurable to route a portion of the plurality of vectors. Additionally, the integrated circuit device includes a plurality of vector processing units communicatively coupled to the switch circuitry. The plurality of vector processing units is configurable to receive the portion of the plurality of vectors, perform one or more operations involving the portion of the plurality of vector inputs, and output a second plurality of vectors generated by performing the one or more operations.
Predication in a vector processor
Embodiments relate to vector processor predication in an active memory device. An aspect includes a system for vector processor predication in an active memory device. The system includes memory in the active memory device and a processing element in the active memory device. The processing element is configured to perform a method including decoding an instruction with a plurality of sub-instructions to execute in parallel. One or more mask bits are accessed from a vector mask register in the processing element. The one or more mask bits are applied by the processing element to predicate operation of a unit in the processing element associated with at least one of the sub-instructions.
Vector processing in an active memory device
Embodiments relate to vector processing in an active memory device. An aspect includes a method for vector processing in an active memory device that includes memory and a processing element. The method includes decoding, in the processing element, an instruction including a plurality of sub-instructions to execute in parallel. An iteration count to repeat execution of the sub-instructions in parallel is determined. Based on the iteration count, execution of the sub-instructions in parallel is repeated for multiple iterations by the processing element. Multiple locations in the memory are accessed in parallel based on the execution of the sub-instructions.
Predication in a vector processor
Embodiments relate to vector processor predication in an active memory device. An aspect includes a method for vector processor predication in an active memory device that includes memory and a processing element. The method includes decoding, in the processing element, an instruction including a plurality of sub-instructions to execute in parallel. One or more mask bits are accessed from a vector mask register in the processing element. The one or more mask bits are applied by the processing element to predicate operation of a unit in the processing element associated with at least one of the sub-instructions.