G06F15/8023

System, apparatus and method for increasing bandwidth of edge-located agents of an integrated circuit

In one embodiment, a system on chip includes: a plurality of intellectual property (IP) agents formed on a semiconductor die; a mesh interconnect formed on the semiconductor die to couple the plurality of IP agents, and a plurality of mesh stops each to couple one or more of the plurality of IP agents to the mesh interconnect. The mesh interconnect may be formed of a plurality of rows each having one of a plurality of horizontal interconnects and a plurality of columns each having one of a plurality of vertical interconnects;, where at least one of the plurality of rows includes an asymmetrical number of mesh stops. Other embodiments are described and claimed.

Automatic out-of-bound access prevention in GPU kernels executed in a managed environment

Techniques are provided for an automated method of adding out-of-bound access prevention in GPU kernels executed in a managed environment. In an embodiment, a system of computers compiles a GPU kernel code function that includes one or more array references that are memory address dependent. The system of computers compiles the kernel code function by generating a rewritten GPU kernel code module that includes, within the function signature of the rewritten GPU kernel code module, a respective array size parameter for each array reference of the one or more array references included in the GPU kernel code function. The system of computers further compiles the kernel code function by adding bounding protection instructions to the one or more potential out-of-bound access instructions in the rewritten GPU kernel code module. The potential out-of-bound access instructions comprise instructions that reference each respective array size parameter of the one or more array references. Afterwards, the rewritten GPU kernel code module is loaded in a virtual machine. Loading the rewritten GPU kernel code module in the virtual machine comprises modifying a host application to automatically transmit, from the host application, one or more input array size values. The one or more input array size values is referenced by the one or more potential out-of-bound-access instructions.

PROCESSOR ARRAY AND MULTIPLE-CORE PROCESSOR
20220100698 · 2022-03-31 ·

The present disclosure provides a processor array and a multiple-core processor. The processor array includes a plurality of processing elements arranged in a two-dimensional array, a plurality of first load units correspondingly arranged and connected to the processing elements of the first edge row, respectively, a plurality of second load units correspondingly arranged and connected to the processing elements of the first edge column, respectively, a plurality of first store units correspondingly arranged and connected to the processing elements of the second edge column, respectively, a plurality of second store units correspondingly arranged and connected to the processing elements of the second edge row, respectively.

Neural processing device and load/store method of neural processing device

A neural processing device is provided. The neural processing device comprises: a processing unit configured to receive an input activation and a weight and perform a two-dimensional matrix calculation with the input activation and the weight to generate an output activation, a first memory, and a load-store unit (LSU) configured to perform memory access operations between the first memory and a second memory. The memory access operations include a main memory access operation for a current processing operation that is performed by the processing unit, and a standby memory access operation for a standby processing operation that is performed by the processing unit after the current processing operation. A level of the first memory is equal to a level of the processing unit, and a level of the second memory is different from the level of the first memory.

HIGHLY PARALLEL PROCESSING ARCHITECTURE WITH SHALLOW PIPELINE
20220075627 · 2022-03-10 · ·

Techniques for task processing using a highly parallel processing architecture with a shallow pipeline are disclosed. A two-dimensional array of compute elements is accessed. Each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the array of compute elements is provided on a cycle-by-cycle basis. The control is enabled by a stream of wide, variable length, microcode control words generated by the compiler. Relevant portions of the control word are stored within a cache associated with the array of compute elements. The control words are decompressed. The decompressing occurs cycle-by-cycle out of the cache over multiple cycles. A compiled task is executed on the array of compute elements, based on the decompressing. Simultaneous execution of two or more potential compiled task outcomes is provided.

SYSTEMS AND METHODS FOR IMPLEMENTING TILE-LEVEL PREDICATION WITHIN A MACHINE PERCEPTION AND DENSE ALGORITHM INTEGRATED CIRCUIT

Systems and methods of implementing tile-level predication of a computing tile of an integrated circuit includes identifying a plurality of distinct predicate state values for each of a plurality of distinct processing cores of the computing tile; calculating one or more summed predicate state values for an entirety of the plurality of distinct processing cores based on performing a summation operation of the plurality of distinct predicate state values; propagating the one or more summed predicate state values to an instructions generating circuit of the integrated circuit; and identifying, by the instructions generating circuit, a tile-level predication for the computing tile based on input of the one or more summed predicate state values.

Computational array microprocessor system using non-consecutive data formatting
11157441 · 2021-10-26 · ·

A microprocessor system comprises a computational array and a hardware data formatter. The computational array includes a plurality of computation units that each operates on a corresponding value addressed from memory. The values operated by the computation units are synchronously provided together to the computational array as a group of values to be processed in parallel. The hardware data formatter is configured to gather the group of values, wherein the group of values includes a first subset of values located consecutively in memory and a second subset of values located consecutively in memory. The first subset of values is not required to be located consecutively in the memory from the second subset of values.

Programmable cache coherent node controller

A computer system includes a first group of CPU modules operatively coupled to at least one first Programmable ASIC Node Controller configured to execute transactions directly or through a first interconnect switch to at least one second Programmable ASIC Node Controller connected to a second group of CPU modules running a single instance of an operating system.

MACHINE PERCEPTION AND DENSE ALGORITHM INTEGRATED CIRCUIT

A circuit that includes a plurality of array cores, each array core of the plurality of array cores comprising: a plurality of distinct data processing circuits; and a data queue register file; a plurality of border cores, each border core of the plurality of border cores comprising: at least a register file, wherein: [i] at least a subset of the plurality of border cores encompasses a periphery of a first subset of the plurality of array cores; and [ii] a combination of the plurality of array cores and the plurality of border cores define an integrated circuit array.

HIGH BANDWIDTH MEMORY SYSTEM WITH DISTRIBUTED REQUEST BROADCASTING MASTERS

A system comprises a processor and a plurality of memory units. The processor is coupled to each of the plurality of memory units by a plurality of network connections. The processor includes a plurality of processing elements arranged in a two-dimensional array and a corresponding two-dimensional communication network communicatively connecting each of the plurality of processing elements to other processing elements on same axes of the two-dimensional array. Each processing element that is located along a diagonal of the two-dimensional array is configured as a request broadcasting master for a respective group of processing elements located along a same axis of the two-dimensional array.