G06F9/30145

Channel-parallel compression with random memory access
11716095 · 2023-08-01 · ·

A data compressor a zero-value remover, a zero bit mask generator, a non-zero values packer, and a row-pointer generator. The zero-value remover receives 2.sup.N bit streams of values and outputs 2.sup.N non-zero-value bit streams having zero values removed from each respective bit stream. The zero bit mask generator receives the 2.sup.N bit streams of values and generates a zero bit mask for a predetermined number of values of each bit stream in which each zero bit mask indicates a location of a zero value in the predetermined number of values corresponding to the zero bit mask. The non-zero values packer receives the 2.sup.N non-zero-value bit streams and forms a group of packed non-zero values. The row-pointer generator that generates a row-pointer for each group of packed non-zero values.

Systems for performing instructions to quickly convert and use tiles as 1D vectors

Disclosed embodiments relate to systems for performing instructions to quickly convert and use matrices (tiles) as one-dimensional vectors. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode, locations of a two-dimensional (2D) matrix and a one-dimensional (1D) vector, and a group of elements comprising one of a row, part of a row, multiple rows, a column, part of a column, multiple columns, and a rectangular sub-tile of the specified 2D matrix, and wherein the opcode is to indicate a move of the specified group between the 2D matrix and the 1D vector, decode circuitry to decode the fetched instruction; and execution circuitry, responsive to the decoded instruction, when the opcode specifies a move from 1D, to move contents of the specified 1D vector to the specified group of elements.

METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
20230229448 · 2023-07-20 ·

A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.

SYSTEMS AND METHODS TO LOAD A TILE REGISTER PAIR

Embodiments detailed herein relate to systems and methods to load a tile register pair. In one example, a processor includes: decode circuitry to decode a load matrix pair instruction having fields for an opcode and source and destination identifiers to identify source and destination matrices, respectively, each matrix having a PAIR parameter equal to TRUE; and execution circuitry to execute the decoded load matrix pair instruction to load every element of left and right tiles of the identified destination matrix from corresponding element positions of left and right tiles of the identified source matrix, respectively, wherein the executing operates on one row of the identified destination matrix at a time, starting with the first row.

Instructions for vector multiplication of unsigned words with rounding

Disclosed embodiments relate to executing a vector multiplication instruction. In one example, a processor includes fetch circuitry to fetch the vector multiplication instruction having fields for an opcode, first and second source identifiers, and a destination identifier, decode circuitry to decode the fetched instruction, execution circuitry to, on each of a plurality of corresponding pairs of fixed-sized elements of the identified first and second sources, execute the decoded instruction to generate a double-sized product of each pair of fixed-sized elements, the double-sized product being represented by at least twice a number of bits of the fixed size, and generate an unsigned fixed-sized result by rounding the most significant fixed-sized portion of the double-sized product to fit into the identified destination.

Integrated circuit, semiconductor device and control method for semiconductor device
11704041 · 2023-07-18 · ·

An integrated circuit for allowing a band of an external memory to be effectively used in processing a layer algorithm is disclosed. One aspect of the present disclosure relates to an integrated circuit including a first arithmetic part including a first arithmetic unit and a first memory, wherein the first arithmetic unit performs an operation and the first memory stores data for use in the first arithmetic unit and a first data transfer control unit that controls transfer of data between the first memory and a second memory of a second arithmetic part including a second arithmetic unit, wherein the second arithmetic part communicates with an external memory via the first arithmetic part.

ISA accessible physical unclonable function

Techniques for encrypting data using a key generated by a physical unclonable function (PUF) are described. An apparatus according to the present disclosure may include decoder circuitry to decode an instruction and generate a decoded instruction. The decoded instruction includes operands and an opcode. The opcode indicates that execution circuitry is to encrypt data using a key generated by a PUF. The apparatus may further include execution circuitry to execute the decoded instruction according to the opcode to encrypt the data to generate encrypted data using the key generated by the PUF.

ON-THE-FLY CONVERSION
20230018056 · 2023-01-19 ·

A data processing apparatus that converts a plurality of signed digits representing an input value in redundant representation comprises receiver circuitry to receive, at each of a plurality of iterations, a signed digit from the plurality of signed digits, and previous intermediate data from a previous iteration. Concatenation circuitry performs a concatenation of bits corresponding to the signed digit and bits of the previous intermediate data to produce updated intermediate data. Output circuitry provides the updated intermediate data as previous intermediate data of a next iteration. The previous intermediate data comprises S3[i] in non-redundant representation, which is at least part of the input value multiplied by 3 in non-redundant representation.

PROCESSOR CIRCUIT AND DATA PROCESSING METHOD
20230020505 · 2023-01-19 ·

A processor circuit includes an instruction decode unit, an instruction detector, an address generator and a data buffer. The instruction decode unit is configured to decode a first load instruction included in a plurality of load instructions to generate a first decoding result. The instruction detector, coupled to the instruction decode unit, is configured to detect if the load instructions use a same register. The address generator, coupled to the instruction decode unit, is configured to generate a first address requested by the first load instruction according to the first decoding result. The data buffer is coupled to the instruction detector and the address generator. When the instruction detector detects that the load instructions use the same register, the data buffer is configured to store the first address generated from the address generator, and store data requested by the first load instruction according to the first address.

LOCK FREE HIGH THROUGHPUT RESOURCE STREAMING
20230019646 · 2023-01-19 ·

Methods, systems and apparatuses may provide for technology that conducts, via a plurality of concurrent threads, transfers of graphics resources into and out of graphics memory, wherein the transfers bypass lock operations between the plurality of concurrent threads, generates frames based on the graphics resources in the graphics memory, and streams the frames to a display. In one example, the transfers also bypass explicit wait operations for the graphics resources to be fully resident in the graphics memory.