G06F9/3887

Thread group scheduling for graphics processing

Embodiments are generally directed to thread group scheduling for graphics processing. An embodiment of an apparatus includes a plurality of processors including a plurality of graphics processors to process data; a memory; and one or more caches for storage of data for the plurality of graphics processors, wherein the plurality of processors are to schedule a plurality of groups of threads for processing by the plurality of graphics processors, the scheduling of the plurality of groups of threads including the plurality of processors to apply a bias for scheduling the plurality of groups of threads according to a cache locality for the one or more caches.
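The locality bias described in this abstract can be illustrated with a minimal sketch. It is not the patented scheduler: the scoring rule, the `bias` weight, and the names `schedule`, `groups`, and `cache_contents` are all assumptions made for illustration.

```python
def schedule(groups, processors, cache_contents, bias=2.0):
    """Assign each thread group to a processor, favoring warm caches.

    groups: dict mapping group name -> set of data tags it reads (hypothetical).
    processors: list of processor ids.
    cache_contents: dict mapping processor id -> set of tags assumed cached.
    bias: assumed weight given to each cache hit relative to one unit of load.
    """
    # Work on a copy so the caller's cache model is left untouched.
    cached = {p: set(cache_contents.get(p, ())) for p in processors}
    load = {p: 0 for p in processors}
    assignment = {}
    for name, tags in groups.items():
        def score(p):
            hits = len(tags & cached[p])
            return bias * hits - load[p]
        best = max(processors, key=score)  # ties go to the earliest processor
        assignment[name] = best
        load[best] += 1
        cached[best] |= tags  # the group's data is now assumed to be cached
    return assignment


groups = {"g0": {"a"}, "g1": {"a"}}
print(schedule(groups, [0, 1], {}))            # both groups share data tag "a"
print(schedule(groups, [0, 1], {}, bias=0.0))  # pure load balancing
```

With a positive bias the second group follows the first onto the processor whose cache already holds its data; with a zero bias the sketch falls back to spreading load.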

METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
20230229448 · 2023-07-20

A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.
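As a rough software model of the behavior described, the sketch below sorts two portions of a vector in independently chosen orders. The split into equal halves and the flag names `first_descending` and `second_descending` are assumptions; the abstract does not specify how the instruction encodes the portions or the orders.

```python
def vector_sort(lanes, first_descending=False, second_descending=True):
    """Sort two portions of a vector in independently chosen orders.

    The portions are assumed to be the lower and upper halves of the lanes.
    """
    half = len(lanes) // 2
    first = sorted(lanes[:half], reverse=first_descending)
    second = sorted(lanes[half:], reverse=second_descending)
    return first + second


print(vector_sort([7, 3, 5, 1, 2, 8, 6, 4]))
# prints [1, 3, 5, 7, 8, 6, 4, 2]
```

Sorting one half ascending and the other descending yields a bitonic sequence, which is the input shape a bitonic merge step expects.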

COMPUTATIONAL MEMORY
20230229450 · 2023-07-20

An example device includes a plurality of computational memory banks. Each computational memory bank of the plurality of computational memory banks includes an array of memory units and a plurality of processing elements connected to the array of memory units. The device further includes a plurality of single instruction, multiple data (SIMD) controllers. Each SIMD controller of the plurality of SIMD controllers is contained within at least one computational memory bank of the plurality of computational memory banks. Each SIMD controller is to provide instructions to the at least one computational memory bank.

Graphics systems and methods for accelerating synchronization using fine grain dependency check and scheduling optimizations based on available shared memory space

Accelerated synchronization operations using fine grain dependency check are disclosed. A graphics multiprocessor includes a plurality of execution units and synchronization circuitry that is configured to determine availability of at least one execution unit. The synchronization circuitry is to perform a fine grain dependency check of availability of dependent data or operands in shared local memory or cache when at least one execution unit is available.

Integrated circuit, semiconductor device and control method for semiconductor device
11704041 · 2023-07-18

An integrated circuit that allows the bandwidth of an external memory to be used effectively in processing a layer algorithm is disclosed. One aspect of the present disclosure relates to an integrated circuit including a first arithmetic part including a first arithmetic unit and a first memory, wherein the first arithmetic unit performs an operation and the first memory stores data for use in the first arithmetic unit; and a first data transfer control unit that controls transfer of data between the first memory and a second memory of a second arithmetic part including a second arithmetic unit, wherein the second arithmetic part communicates with an external memory via the first arithmetic part.

NON-CRYPTOGRAPHIC HASHING USING CARRY-LESS MULTIPLICATION
20230015000 · 2023-01-19

Non-cryptographic hashing using carry-less multiplication and associated methods, software, and apparatus. Under one aspect, the disclosed hash solution expands on CRC technology that updates a polynomial expansion and final reduction, to use initialization (init), update and finalize stages with extended seed values. The hash solutions operate on input data partitioned into multiple blocks comprising sequences of byte data, such as ASCII characters. During multiple rounds of an update stage, operations are performed on sub-blocks of a given block in parallel including carry-less multiplication and shuffle operations. During a finalize stage, multiple SHA or carry-less multiplication operations are performed on data output following a final round of the update stage.
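For readers unfamiliar with the primitive, carry-less multiplication is ordinary shift-and-add multiplication with XOR in place of addition, so no carries propagate between bit positions. The bit-serial sketch below shows only that primitive; it is not the disclosed hash, and the function name `clmul` is chosen here for illustration.

```python
def clmul(a, b):
    """Carry-less (polynomial over GF(2)) product of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # XOR replaces addition, so nothing carries
        a <<= 1
        b >>= 1
    return result


print(bin(clmul(0b11, 0b11)))  # prints 0b101, i.e. (x + 1)^2 = x^2 + 1 over GF(2)
```

Hardware typically provides this operation as a single instruction on wide operands; the loop above is only a reference model.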

INFORMATION PROCESSING SYSTEM AND INFORMATION PROCESSING METHOD
20230009759 · 2023-01-12

One or more information processing apparatuses to process information are provided. The information processing apparatus includes: a division function that divides processing information into a plurality of pieces, under a division condition that designates parallel processing among the information processing apparatuses, the processing information indicating a data processing procedure from a plurality of start points to one or more end points; a determination function that uniquely determines an assignee of each piece of the processing information divided by the division function, as any of the information processing apparatuses; and an execution function that executes a process in the information processing apparatus determined by the determination function.

System and method to control the number of active vector lanes in a processor

In one disclosed embodiment, a processor includes a first execution unit and a second execution unit, a register file, and a data path including a plurality of lanes. The data path and the register file are arranged so that writing to the register file by the first execution unit and by the second execution unit is allowed over the data path, reading from the register file by the first execution unit is allowed over the data path, and reading from the register file by the second execution unit is not allowed over the data path. The processor also includes a power control circuit configured to, when a transfer of data between the register file and either of the first and second execution units uses less than all of the lanes, power down the lanes of the data path not used for the transfer of the data.
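The lane power-down behavior can be modeled as a per-lane enable mask. This is a sketch under stated assumptions: that a narrow transfer occupies the lowest-numbered lanes, and that a single boolean per lane stands in for the power control circuit. The name `lane_power_mask` is hypothetical.

```python
def lane_power_mask(total_lanes, lanes_used):
    """Return one flag per lane: True = powered, False = powered down.

    Assumes a transfer that uses fewer than all lanes occupies the
    lowest-numbered lanes of the data path.
    """
    if not 0 <= lanes_used <= total_lanes:
        raise ValueError("lanes_used must be between 0 and total_lanes")
    return [lane < lanes_used for lane in range(total_lanes)]


print(lane_power_mask(8, 3))
# prints [True, True, True, False, False, False, False, False]
```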

ANALYSIS AND DEBUGGING OF FULLY-HOMOMORPHIC ENCRYPTION
20230216657 · 2023-07-06

In response to identifying that a Single Instruction, Multiple Data (SIMD) operation has been instructed to be performed or has been performed by a Fully-Homomorphic Encryption (FHE) software on one or more original ciphertexts, performing the following steps: performing the same operation on one or more original plaintexts, respectively, that are each a decrypted version of one of the one or more original ciphertexts; decrypting a ciphertext resulting from the operation performed on the one or more original ciphertexts; comparing the decrypted ciphertext with a plaintext resulting from the same operation performed on the one or more original plaintexts; and, based on said comparison, performing at least one of: (a) determining an amount of noise caused by the operation, (b) determining whether unencrypted data underlying the one or more original ciphertexts has become corrupt by the operation, and (c) determining correctness of an algorithm which includes the operation.
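The compare-against-plaintext idea can be sketched with a toy stand-in for encryption. Nothing here is real FHE: `toy_encrypt` merely adds small random noise, `toy_decrypt` is the identity, and every name and the `tolerance` threshold are assumptions made for illustration.

```python
import random


def toy_encrypt(values, rng, noise=1e-3):
    """Stand-in for encryption: each slot gets small additive noise. NOT real FHE."""
    return [v + rng.uniform(-noise, noise) for v in values]


def toy_decrypt(ciphertext):
    """Stand-in for decryption: the toy ciphertext is already numeric."""
    return list(ciphertext)


def simd_add(x, y):
    """Slot-wise addition, applied identically to toy ciphertexts and plaintexts."""
    return [a + b for a, b in zip(x, y)]


def check_simd_op(op, ciphertexts, plaintexts, tolerance):
    """Run op on both sides, then compare the decrypted result with the plaintext one."""
    decrypted = toy_decrypt(op(*ciphertexts))
    expected = op(*plaintexts)
    noise = max(abs(d - e) for d, e in zip(decrypted, expected))
    return noise, noise <= tolerance


rng = random.Random(0)
a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
noise, ok = check_simd_op(
    simd_add, [toy_encrypt(a, rng), toy_encrypt(b, rng)], [a, b], tolerance=1e-2
)
print(ok)  # prints True
```

A measured deviation above the tolerance would suggest either excessive noise growth or corrupted underlying data; telling the two apart would need scheme-specific knowledge that this toy model does not capture.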

METHOD AND DEVICE FOR PROVIDING A VECTOR STREAM INSTRUCTION SET ARCHITECTURE EXTENSION FOR A CPU
20230214217 · 2023-07-06

A method and device for providing a vector stream instruction set architecture extension for a CPU. In one aspect, there is provided a vector stream engine unit comprising: a first fast memory storage for temporarily storing data of vector data streams from a memory for loading into a vector register file; a second fast memory storage for temporarily storing data of the vector data streams from the vector register file for loading into the memory; a prefetcher configured to prefetch data of the vector data streams from the memory into the first fast memory storage, and prefetch data of the vector data streams from the vector register file into the second fast memory storage; and a stream configuration table (SCT) storing stream information for prefetching data from the vector data streams.