G06F15/781

On-chip heterogeneous AI processor with distributed tasks queues allowing for parallel task execution

Embodiments described herein provide an on-chip heterogeneous Artificial Intelligence (AI) processor comprising at least two different architectural types of computation units, wherein each of the computation units is associated with a respective task queue configured to store computation subtasks to be executed by the computation unit. The AI processor also comprises a controller configured to partition a received computation graph associated with a neural network into a plurality of computation subtasks according to a preset scheduling strategy and distribute the computation subtasks to the task queues of the computation units. The AI processor further comprises a storage unit configured to store data required by the computation units to execute their respective computation subtasks and an access interface configured to access an off-chip memory. Different application tasks are processed by managing and scheduling the different architectural types of computation units in an on-chip heterogeneous manner.

METHOD AND APPARATUS FOR PERMUTING STREAMED DATA ELEMENTS

A method is provided that includes receiving, in a permute network, a plurality of data elements for a vector instruction from a streaming engine, and mapping, by the permute network, the plurality of data elements to vector locations for execution of the vector instruction by a vector functional unit in a vector data path of a processor.

METHOD FOR MEASURING PERFORMANCE OF NEURAL PROCESSING DEVICE AND DEVICE FOR MEASURING PERFORMANCE
20230315608 · 2023-10-05 ·

A method for measuring performance of neural processing devices and devices for measuring performance are provided. The method for measuring performance of neural processing devices comprises receiving hardware information of a neural processing device, modeling hardware components according to the hardware information as agents, dividing a calculation task by events for the agents and modeling the calculation task, thereby generating an event model which includes nodes corresponding to the agents and edges corresponding to the events and measuring a total duration of the calculation task through simulation of the event model.

STANDARD CELL STRUCTURE

A standard cell includes plural of transistors including a first type transistor and a second type transistor, plural of contacts coupled to the transistors; at least one input line electrically coupled to the transistors; an output line electrically coupled to the transistors; a VDD contacting line electrically coupled to the transistors; a VSS contacting line electrically coupled to the transistors; wherein the first type transistor includes a first set of fin structures electrically coupled together, the second type transistor includes a second set of fin structures electrically coupled together, and a gap between the first type transistor and the second type transistor is not greater than 3×Fp minus A, wherein Fp is a pitch distance between two adjacent fin structures in the first type transistor and A is a minimum feature size of the standard cell.

Tracking streaming engine vector predicates to control processor execution

In a method of operating a computer system, an instruction loop is executed by a processor in which each iteration of the instruction loop accesses a current data vector and an associated current vector predicate. The instruction loop is repeated when the current vector predicate indicates the current data vector contains at least one valid data element and the instruction loop is exited when the current vector predicate indicates the current data vector contains no valid data elements.

Memory system and SoC including linear address remapping logic
11681449 · 2023-06-20 · ·

A system-on-chip is connected to a first memory device and a second memory device. The system-on-chip comprises a memory controller configured to control an interleaving access operation on the first and second memory devices. A modem processor is configured to provide an address for accessing the first or second memory devices. A linear address remapping logic is configured to remap an address received from the modem processor and to provide the remapped address to the memory controller. The memory controller performs a linear access operation on the first or second memory device in response to receiving the remapped address.

METHOD AND APPARATUS FOR VECTOR SORTING
20230099669 · 2023-03-30 ·

A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values are sorted in an order indicated by the vector sort instruction, and storing the sorted vector in a storage location.

Method and apparatus to sort a vector for a bitonic sorting algorithm

A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.

Method and Apparatus for Dual Issue Multiply Instructions
20230350813 · 2023-11-02 ·

Various configurations of processors are provided. In a configuration, the processor comprises first and second multiplication unit. Each of these multiplication units includes carry-save adder circuitry with a respective outputs, partial product alignment multiplexing logic coupled to the outputs of the associated carry-save adder circuitry. The processor further comprises communication paths coupled between the outputs of the carry-save adder circuitry of the first multiplication unit and the partial product alignment multiplexing logic of the second multiplication unit. In other configurations, each of the first and second multiplication units may include one or more instances of masking logic, one or more instances of a multiplier array coupled to the associated instance(s) of masking logic, and one or more instances of a multiplexer set coupled to the associated instance(s) of multiplier array(s). Each of multiplexer set instance(s) of a particular multiplication unit is coupled to the carry-save adder circuitry of that multiplication unit.

Method and apparatus for vector sorting using vector permutation logic

A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, generating a control input vector for vector permutation logic comprised in the processor based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction and storing the control input vector in a storage location.