G06F9/38

System having a hybrid threading processor, a hybrid threading fabric having configurable computing elements, and a hybrid interconnection network
11579887 · 2023-02-14 · ·

Representative apparatus, method, and system embodiments are disclosed for configurable computing. In a representative embodiment, a system includes an interconnection network, a processor, a host interface, and a configurable circuit cluster. The configurable circuit cluster may include a plurality of configurable circuits arranged in an array; an asynchronous packet network and a synchronous network coupled to each configurable circuit of the array; and a memory interface circuit and a dispatch interface circuit coupled to the asynchronous packet network and to the interconnection network. Each configurable circuit includes instruction or configuration memories for selection of a current data path configuration, a master synchronous network input, and a data path configuration for a next configurable circuit.

Non-cached loads and stores in a system having a multi-threaded, self-scheduling processor
11579888 · 2023-02-14 · ·

Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute instructions; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In a representative embodiment, the processor core is further adapted to execute a non-cached load instruction to designate a general purpose register rather than a data cache for storage of data received from a memory circuit. The core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, and to generate one or more work descriptor data packets to another circuit for execution of corresponding execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.

Platform independent GPU profiles for more efficient utilization of GPU resources

Disclosed are various examples for platform independent graphics processing unit (GPU) profiles for more efficient utilization of GPU resources. A virtual machine configuration can be identified to include a platform independent graphics computing requirement. Hosts can be identified as available in a computing environment based on the platform independent graphics computing requirement. The virtual machine can be placed on a host based on a consideration of host priority.

Method and apparatus to efficiently process and execute Artificial Intelligence operations
11580371 · 2023-02-14 · ·

A method, apparatus, and system are discussed to efficiently process and execute Artificial Intelligence operations. An integrated circuit has a tailored architecture to process and execute Artificial Intelligence operations, including computations for a neural network having weights with a sparse value. The integrated circuit contains at least a scheduler, one or more arithmetic logic units, and one or more random access memories configured to cooperate with each other to process and execute these computations for the neural network having weights with the sparse value.

Technologies for providing shared memory for accelerator sleds

Technologies for providing shared memory for accelerator sleds includes an accelerator sled to receive, with a memory controller, a memory access request from an accelerator device to access a region of memory. The request is to identify the region of memory with a logical address. Additionally, the accelerator sled is to determine from a map of logical addresses and associated physical address, the physical address associated with the region of memory. In addition, the accelerator sled is to route the memory access request to a memory device associated with the determined physical address.

Programmable instruction buffering for accumulating a burst of instructions

A processing system 2 includes a processing pipeline 12, 14, 16, 18, 28 which includes fetch circuitry 12 for fetching instructions to be executed from a memory 6, 8. Buffer control circuitry 34 is responsive to a programmable trigger, such as explicit hint instructions delimiting an instruction burst, or predetermined configuration data specifying parameters of a burst together with a synchronising instruction, to trigger the buffer control circuitry to stall a stallable portion of the processing pipeline (e.g. issue circuitry 16), to accumulate within one or more buffers 30, 32 fetched instructions starting from a predetermined starting instruction, and, when those instructions have been accumulated, to restart the stallable portion of the pipeline.

Transaction-enabled systems and methods for royalty apportionment and stacking

Transaction-enabled systems and methods for royalty apportionment and stacking are disclosed. An example system may include a plurality of royalty generating elements (a royalty stack) each related to a corresponding one or more of a plurality of intellectual property (IP) assets (an aggregate stack of IP). The system may further include a royalty apportionment wrapper to interpret IP licensing terms and apportion royalties to a plurality of owning entities corresponding to the aggregate stack of IP in response to the IP licensing terms and a smart contract wrapper. The smart contract wrapper is configured to access a distributed ledger, interpret an IP description value and IP addition request, to add an IP asset to the aggregate stack of IP, and to adjust the royalty stack.

METHOD AND APPARATUS FOR VECTOR SORTING USING VECTOR PERMUTATION LOGIC
20230037321 · 2023-02-09 ·

A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, generating a control input vector for vector permutation logic comprised in the processor based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction and storing the control input vector in a storage location.

SPARSE MATRIX OPERATIONS FOR DEEP LEARNING

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for parallelizing matrix operations. One of the methods includes implementing a neural network on a parallel processing device, the neural network comprising at least one sparse neural network layer, the sparse neural network layer being configured to receive an input matrix and perform matrix multiplication between the input matrix and a sparse weight matrix to generate an output matrix, the method comprising: for each row of the M rows of the output matrix, determining a plurality of tiles that each include one or more elements from the row; assigning, for each tile of each row, the tile to a respective one of a plurality of thread blocks of the parallel processing device; and computing, for each tile, respective values for each element in the tile using the respective thread block to which the tile was assigned.

VERIFYING PROCESSING LOGIC OF A GRAPHICS PROCESSING UNIT
20230043280 · 2023-02-09 ·

A method of verifying processing logic of a graphics processing unit receives a test task including a predefined set of instructions for execution on the graphics processing unit, the predefined set of instructions being configured to perform a predetermined set of operations on the graphics processing unit when executed for predefined input data. In a test phase, the test task is processed by executing the predefined set of instructions for the predefined input data first and second times at the graphics processing unit so as to, respectively, generate first and second outputs. A fault signal is raised if the first and second outputs do not match.