G06F8/441

Balanced partitioning of neural network based on execution latencies

Techniques to partition a neural network model for serial execution on multiple processing integrated circuit devices are described. An initial partitioning of the model into multiple partitions, each corresponding to a processing integrated circuit device, is performed. For each partition, an execution latency is calculated by aggregating the compute clock cycles needed to perform the computations in the partition and the weight-loading clock cycles determined from the number of weights used in the partition. The amount of data output from the partition is also determined. The partitions can be adjusted by moving computations from a source partition to a target partition to change the execution latencies of the partitions and the amount of data transferred between partitions.
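
A minimal Python sketch of the latency-balancing idea follows. The per-op cost fields (compute_cycles, num_weights), the one-cycle-per-weight load cost, and the greedy boundary move are all assumptions for illustration, not details from the abstract.

    CYCLES_PER_WEIGHT_LOAD = 1  # assumed cost, in cycles, to load one weight

    def partition_latency(ops):
        """Execution latency = aggregated compute cycles + weight-loading cycles."""
        compute = sum(op["compute_cycles"] for op in ops)
        weights = sum(op["num_weights"] for op in ops)
        return compute + weights * CYCLES_PER_WEIGHT_LOAD

    def rebalance_once(partitions):
        """Move one boundary computation out of the slowest partition;
        return True only if the worst-case latency improved."""
        lat = [partition_latency(p) for p in partitions]
        src = lat.index(max(lat))
        if len(partitions) < 2 or len(partitions[src]) <= 1:
            return False
        nbrs = [i for i in (src - 1, src + 1) if 0 <= i < len(partitions)]
        tgt = min(nbrs, key=lambda i: lat[i])
        op = partitions[src].pop(-1 if tgt > src else 0)
        partitions[tgt].insert(0 if tgt > src else len(partitions[tgt]), op)
        if max(partition_latency(p) for p in partitions) < max(lat):
            return True
        partitions[tgt].pop(0 if tgt > src else -1)  # undo: move did not help
        partitions[src].insert(len(partitions[src]) if tgt > src else 0, op)
        return False

Repeatedly applying rebalance_once until it returns False evens out the per-partition latencies, mirroring the adjustment step the abstract describes; a fuller model would also weigh the inter-partition data volume.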

Interoperable composite data units for use in distributed computing execution environments
11797274 · 2023-10-24

Disclosed implementations provide executable models, such as artificial intelligence models that can be owned, traded, and used in various execution environments. By coupling a model with a strictly defined interface definition, the model can be executed in various execution environments that support the interface. Coupling the model with a non-fungible cryptographic token allows the model and other components to be owned and traded as a unit. The tradeable composite units have utility across multiple supported execution environments, such as video game environments, chat bot environments and financial trading environments. Additionally, the interface allows for the creation of pipelines and systems from multiple complementary composite units.
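
As a purely illustrative sketch, the "strictly defined interface" that lets one composite unit run across environments might resemble the following Python; the class and method names (ExecutableModel, describe, infer) are hypothetical, not taken from the patent.

    from abc import ABC, abstractmethod

    class ExecutableModel(ABC):
        """Hypothetical interface a composite unit exposes so that any
        supporting environment (game, chat bot, trading) can run it."""

        @abstractmethod
        def describe(self) -> dict:
            """Metadata: input/output schema, version, owning token ID."""

        @abstractmethod
        def infer(self, inputs: dict) -> dict:
            """Execute the model on environment-supplied inputs."""

    def run_pipeline(units: list, inputs: dict) -> dict:
        """Pipelines compose complementary units through the shared interface."""
        for unit in units:
            inputs = unit.infer(inputs)
        return inputs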

Handling Interrupts from a Virtual Function in a System with a Reconfigurable Processor

A system is presented that includes a communication link, a runtime processor coupled to the communication link, and a reconfigurable processor. The reconfigurable processor is adapted for generating an interrupt to the runtime processor in response to a predetermined event and includes multiple arrays of coarse-grained reconfigurable (CGR) units and an interface to the communication link that couples the reconfigurable processor to the runtime processor via the communication link. The runtime processor is adapted for configuring the interface to the communication link to provide access to the multiple arrays of coarse-grained reconfigurable units from a physical function driver and from at least one virtual function driver, and the reconfigurable processor is adapted for sending the interrupt to the physical function driver and to a virtual function driver of the at least one virtual function driver within the runtime processor.
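
A toy behavioral model of the interrupt delivery path, assuming simple Driver objects and an event string; real physical/virtual function plumbing is considerably more involved.

    class Driver:
        def __init__(self, name):
            self.name = name
        def handle_interrupt(self, event):
            print(f"{self.name} handling {event}")

    class RuntimeProcessor:
        def __init__(self, num_vfs):
            self.pf_driver = Driver("PF")
            self.vf_drivers = [Driver(f"VF{i}") for i in range(num_vfs)]

        def deliver(self, event, vf_index):
            # The reconfigurable processor signals a predetermined event;
            # the interrupt goes to the PF driver and to one VF driver.
            self.pf_driver.handle_interrupt(event)
            self.vf_drivers[vf_index].handle_interrupt(event)

    RuntimeProcessor(num_vfs=2).deliver("execution-complete", vf_index=1)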

Configurable Access to a Reconfigurable Processor by a Virtual Function

A data processing system is presented that includes a communication link, a runtime processor coupled to the communication link, and one or more reconfigurable processors. A reconfigurable processor of the one or more reconfigurable processors is adapted for generating an interrupt to the runtime processor in response to a predetermined event and includes arrays of coarse-grained reconfigurable (CGR) units and an interface to the communication link that couples the reconfigurable processor to the runtime processor via the communication link. The runtime processor is adapted for configuring the interface to the communication link to provide access to the arrays of CGR units through the communication link from a physical function driver and from a virtual function driver.
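
One way to picture the configured access is an assumed per-driver table of reachable CGR arrays, checked at the communication-link interface; the table contents here are illustrative only.

    # Hypothetical access table: which CGR arrays each function driver
    # may reach through the communication-link interface.
    access_table = {
        "PF": {0, 1, 2, 3},   # physical function driver sees every array
        "VF0": {0, 1},        # each virtual function gets a subset
        "VF1": {2, 3},
    }

    def check_access(driver: str, array_id: int) -> bool:
        """Interface-side check before forwarding a request to a CGR array."""
        return array_id in access_table.get(driver, set())

    assert check_access("VF0", 1) and not check_access("VF0", 3)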

Compiler for a Fracturable Data Path in a Reconfigurable Data Processor

A compiler produces a configuration file to configure a fracturable data path of a configurable unit in a coarse-grained reconfigurable processor to concurrently generate different address sequences associated with different operations. The fracturable data path includes multiple computation stages, each including a pipeline register. The compiler analyzes a first address calculation and a second address calculation and, based on the analysis, assigns a first set of stages to the first operation to generate the first address sequence and a second set of stages to the second operation to generate the second address sequence. The compiler generates a configuration file for the configurable unit that assigns the first set of stages to the first operation and the second set of stages to the second operation and includes two or more immediate values for each computation stage.
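
A highly simplified sketch of fracturing: a fixed pool of pipeline stages is split between two independent address calculations so both sequences can be generated concurrently. The stage count and the contiguous assignment are assumptions.

    NUM_STAGES = 6  # assumed number of computation stages in the data path

    def assign_stages(stages_a: int, stages_b: int):
        """Return (first set, second set) of stage indices, or None if the
        two address calculations do not fit in one fractured path."""
        if stages_a + stages_b > NUM_STAGES:
            return None
        first = list(range(stages_a))
        second = list(range(stages_a, stages_a + stages_b))
        return first, second

    # e.g. a read-address calculation needing 2 stages alongside a
    # write-address calculation needing 3:
    print(assign_stages(2, 3))  # ([0, 1], [2, 3, 4])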

Space- And Time-Efficient Enumerations
20230385028 · 2023-11-30

Systems, computer instructions, and computer-implemented methods are disclosed for implementing space- and time-efficient enumerations. An instance of an enumeration class may be created with a constant plurality of enumerations. A plurality of objects corresponding to the respective enumerations may be stored in memory along with a lookup table indexed by the respective ordinal values of the enumerations, the lookup table including references to the stored objects of the instantiated enumeration class. A reference to an enumeration may be stored in a memory location by storing the ordinal value of the enumeration. A determination may then be made to convert a stored ordinal value to an object reference, and responsive to the determination, the ordinal value may be loaded and used as an index into the lookup table to obtain the reference to the object corresponding to the enumeration.
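
In Python terms, the scheme stores a compact ordinal where an object reference would otherwise go and recovers the reference through an ordinal-indexed lookup table. This is a loose sketch: the patent targets managed runtimes generally, not any one language.

    from enum import Enum

    class Color(Enum):      # enumeration class with a constant
        RED = 0             # plurality of enumerations
        GREEN = 1
        BLUE = 2

    # Lookup table indexed by ordinal value, holding references to the
    # per-enumeration objects.
    LOOKUP = [Color.RED, Color.GREEN, Color.BLUE]

    # Store only the compact ordinal where a reference would otherwise go:
    stored_ordinal = Color.GREEN.value

    # Later, convert the stored ordinal back into an object reference by
    # indexing the lookup table:
    recovered = LOOKUP[stored_ordinal]
    assert recovered is Color.GREEN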

Systems and methods for extending a live range of a virtual scalar register

Systems and methods are described for extending a live range for a virtual scalar register during compiling of a program, comprising: receiving an intermediate representation (IR) of a source code configured for implementing single-instruction-multiple-thread (SIMT) execution, the IR representing the source code as a control flow graph including a plurality of basic blocks (BB); and when a virtual scalar register defined in a first BB of the IR is last used in a second BB of the IR that is a divergent BB, modifying the IR to extend the live range of the virtual scalar register.
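
A toy IR-level illustration of the transformation, where inserting a pseudo-use past the divergent region keeps the register live; the IR shape and the pseudo_use mechanism are assumptions, not the claimed method.

    ir = {
        "bb0": ["s1 = load base"],              # virtual scalar register defined
        "bb1": ["branch_divergent cond, bb2"],  # control flow may diverge here
        "bb2": ["v0 = add v0, s1"],             # last use of s1 in a divergent BB
    }

    def extend_live_range(ir, reg, reconvergence_bb):
        """Append a pseudo-use of `reg` in the block where threads
        reconverge, so the allocator keeps the register live that long."""
        ir.setdefault(reconvergence_bb, []).append(f"pseudo_use {reg}")
        return ir

    extend_live_range(ir, "s1", "bb3")
    print(ir["bb3"])  # ['pseudo_use s1']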

Method and system for optimizing access to constant memory

The disclosed systems, structures, and methods are directed to optimizing memory access to constants in heterogeneous parallel computers, including systems that support OpenCL. This is achieved in an optimizing compiler that transforms program-scope constants and constants at the outermost scope of kernels into implicit constant pointer arguments. The optimizing compiler also attempts to determine access patterns for constants at compile time and places the constants in a variety of memory types available in a compute-device architecture based on these access patterns.
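
Conceptually, the pass hoists a program-scope constant into an implicit constant pointer argument and then picks a memory type from the compile-time access pattern. The before/after comment and the placement policy below are purely illustrative assumptions.

    # Assumed source-to-source view of the transformation:
    #
    #   // before: program-scope constant
    #   __constant float COEFS[4] = {1, 2, 3, 4};
    #   __kernel void f(__global float *out) { out[0] = COEFS[0]; }
    #
    #   // after: constant hoisted into an implicit pointer argument
    #   __kernel void f(__global float *out,
    #                   __constant float *COEFS /* implicit */) { ... }

    def place_constant(size_bytes: int, access_pattern: str) -> str:
        """Hypothetical placement policy: choose a memory type from the
        access pattern determined at compile time."""
        if access_pattern == "uniform":   # all work-items read the same value
            return "constant cache"
        if size_bytes <= 64:
            return "registers / immediate"
        return "global read-only memory"

    print(place_constant(16, "uniform"))  # constant cache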

Memory pool allocation for a multi-core system

An apparatus includes processing cores, memory blocks, a connection between each processing core and each memory block, a chip selection circuit, and chip selection circuit busses between the chip selection circuit and each of the memory blocks. Each memory block includes a data port and a memory check port. The chip selection circuit is configured to enable writing data from the highest-priority core through the respective data ports of the memory blocks. The chip selection circuit is further configured to enable writing data from other cores through the respective memory check ports of the memory blocks.
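
A hypothetical behavioral model of the chip-selection circuit, with an assumed fixed priority order: the highest-priority requesting core writes through a data port while other cores are routed through memory check ports.

    CORE_PRIORITY = [2, 0, 1]  # assumed priority order; core 2 is highest

    def route_write(core_id: int, requesting_cores: list) -> str:
        """Pick the port through which `core_id` may write this cycle."""
        active = [c for c in CORE_PRIORITY if c in requesting_cores]
        if not active:
            raise ValueError("no core is requesting a write")
        return "data port" if core_id == active[0] else "memory check port"

    print(route_write(2, [0, 1, 2]))  # data port
    print(route_write(0, [0, 1, 2]))  # memory check port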

Method of Using Multidimensional Blockification To Optimize Computer Program and Device Thereof

Disclosed embodiments relate to a method and device for optimizing compilation of source code. The proposed method receives a first intermediate representation of the source code and analyzes each of the basic block instructions contained in that representation for blockification. To blockify identical instructions, one or more groups of basic block instructions are assessed for blockification eligibility. Groups determined to be eligible are blockified using either one-dimensional or two-dimensional SIMD vectorization. The method then generates a second intermediate representation of the source code, which is translated into executable target code with improved processing efficiency.
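
A minimal sketch of one-dimensional blockification: identical scalar instructions in a basic block are fused into a single SIMD instruction when a full vector of them exists. The instruction tuples, lane width, and eligibility test are assumptions; two-dimensional blockification would additionally tile across a second axis.

    from collections import defaultdict

    block = [
        ("add", "a0", "b0"), ("add", "a1", "b1"),
        ("add", "a2", "b2"), ("add", "a3", "b3"),
    ]

    def blockify(instrs, width=4):
        """Group identical ops; emit a 1-D SIMD op for each full group."""
        groups = defaultdict(list)
        for op, dst, src in instrs:
            groups[op].append((dst, src))
        out = []
        for op, pairs in groups.items():
            if len(pairs) == width:            # eligible: full vector of
                out.append((f"v{op}.1d", pairs))  # identical instructions
            else:                              # ineligible: keep scalar form
                out.extend((op, d, s) for d, s in pairs)
        return out

    print(blockify(block))  # [('vadd.1d', [('a0', 'b0'), ..., ('a3', 'b3')])]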