G06F2209/485

Neural network processor using compression and decompression of activation data to reduce memory bandwidth utilization

A deep neural network (“DNN”) module compresses and decompresses neuron-generated activation data to reduce the utilization of memory bus bandwidth. The compression unit receives an uncompressed chunk of data generated by a neuron in the DNN module. The compression unit generates a mask portion and a data portion of a compressed output chunk. The mask portion encodes the presence and location of the zero and non-zero bytes in the uncompressed chunk of data. The data portion stores truncated non-zero bytes from the uncompressed chunk of data. A decompression unit receives a compressed chunk of data from memory in the DNN processor or memory of an application host. The decompression unit decompresses the compressed chunk of data using the mask portion and the data portion.
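
As a rough illustration of the mask-and-data scheme, here is a minimal Python sketch; the abstract does not specify the bit layout or how the non-zero bytes are truncated, so this version assumes a one-bit-per-byte mask and stores non-zero bytes verbatim.

```python
def compress(chunk: bytes) -> tuple[bytes, bytes]:
    """Split a chunk into a bitmask of non-zero positions (mask portion)
    and the non-zero bytes themselves (data portion)."""
    mask = bytearray((len(chunk) + 7) // 8)
    data = bytearray()
    for i, b in enumerate(chunk):
        if b != 0:
            mask[i // 8] |= 1 << (i % 8)  # record where non-zero bytes sit
            data.append(b)                # keep only the non-zero payload
    return bytes(mask), bytes(data)


def decompress(mask: bytes, data: bytes, length: int) -> bytes:
    """Rebuild the original chunk by scattering the data portion back
    into the positions flagged by the mask portion."""
    out = bytearray(length)
    it = iter(data)
    for i in range(length):
        if mask[i // 8] & (1 << (i % 8)):
            out[i] = next(it)
    return bytes(out)
```

In this sketch, mostly-zero activation data costs a fixed one-eighth of the chunk size for the mask while the data portion shrinks in proportion to the sparsity.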

Dynamically partitioning workload in a deep neural network module to reduce power consumption

A deep neural network (DNN) module is disclosed that can dynamically partition neuron workload to reduce power consumption. The DNN module includes neurons and a group partitioner and scheduler unit. The group partitioner and scheduler unit divides a workload for the neurons into partitions in order to maximize the number of neurons that can simultaneously process the workload. The group partitioner and scheduler unit then assigns a group of neurons to each of the partitions. The groups of neurons in the DNN module process the workload in their assigned partition to generate a partial output value. The neurons in each group can then sum their partial output values to generate a final output value for the workload. The neurons can be powered down once the groups of neurons have completed processing their assigned workload to reduce power consumption.
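
A toy Python sketch of the partition-and-combine flow follows; the round-robin partitioner and the `process` stand-in are assumptions, since the abstract does not detail how partitions are chosen.

```python
def process(item: int) -> int:
    return item * item  # stand-in for a neuron group's per-item computation


def partition(work: list[int], num_groups: int) -> list[list[int]]:
    """Deal work items across groups so as many groups as possible stay busy."""
    groups = [[] for _ in range(num_groups)]
    for i, item in enumerate(work):
        groups[i % num_groups].append(item)
    return [g for g in groups if g]  # empty groups can be powered down


def run_workload(work: list[int], num_groups: int) -> int:
    partials = [sum(process(x) for x in g) for g in partition(work, num_groups)]
    return sum(partials)  # combine per-group partial outputs into the final value
```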

METHOD AND PROCESSING UNIT FOR PERFORMING TASKS THROUGH MASTER SLAVE ROTATION
20220276902 · 2022-09-01

The present subject matter relates to a method comprising acquiring a master role by a processing unit of a multi-processor system and executing, by the processing unit, a master function part of a set of tasks, comprising searching for an available processing unit of the multi-processor system; wherein, in case an available processing unit is found, controlling the found processing unit to perform a slave function part of the set of tasks, and in case no available processing unit is found, executing by the processing unit the slave function part of the set of tasks, wherein the master function comprises a master-to-slave switching function for releasing the master role and the slave function comprises a slave-to-master switching function for acquiring the master role.
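
The rotation can be pictured with a short Python sketch in which a lock stands in for the master role; the lock, the `available` flag, and the class names are illustrative assumptions rather than the claimed mechanism.

```python
import threading


class Unit:
    master_role = threading.Lock()  # held by whichever unit is currently master

    def __init__(self, pool: list):
        self.pool = pool            # all processing units in the system
        self.available = True

    def run(self, tasks: list) -> None:
        with Unit.master_role:      # slave-to-master switch: acquire the role
            slave = next((u for u in self.pool
                          if u is not self and u.available), None)
            if slave is not None:
                slave.available = False   # reserve the found unit
        # master-to-slave switch: the role is released on leaving the block
        if slave is not None:
            slave.do_slave_part(tasks)    # control the found unit
        else:
            self.do_slave_part(tasks)     # none available: do the work ourselves

    def do_slave_part(self, tasks: list) -> None:
        for task in tasks:
            task()                        # perform the slave function part
        self.available = True
```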

SHARDING FOR SYNCHRONOUS PROCESSORS
20220300450 · 2022-09-22

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for sharding dataflow graphs for a device having multiple synchronous tiles. One of the methods includes receiving a representation of a dataflow graph comprising a plurality of nodes that each represent respective matrix operations to be performed by a device having a plurality of synchronous tiles. Candidate allocations of respective portions of the dataflow graph to each tile of the plurality of synchronous tiles are evaluated according to one or more resource constraints of the device. One of the candidate allocations is selected based on evaluating each candidate allocation.
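
The evaluate-and-select step might be sketched in Python as follows; the constraint predicates and the cost function are placeholders for the device's actual resource model.

```python
from typing import Callable, Iterable, Optional, TypeVar

Alloc = TypeVar("Alloc")


def select_allocation(candidates: Iterable[Alloc],
                      constraints: list[Callable[[Alloc], bool]],
                      cost: Callable[[Alloc], float]) -> Optional[Alloc]:
    """Keep allocations that satisfy every resource constraint,
    then pick the one the cost model scores lowest."""
    feasible = [a for a in candidates if all(c(a) for c in constraints)]
    return min(feasible, key=cost) if feasible else None
```

For example, a constraint could check that each tile's share of the graph fits in its local memory, while the cost could estimate cross-tile communication.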

Method, apparatus and computer program product for resource scheduling

Embodiments of the present disclosure provide a method, apparatus and computer program product for resource scheduling. The method comprises obtaining a processing requirement for a deep learning task, the processing requirement being specified by a user and at least including a requirement related to a completion time of the deep learning task. The method further comprises determining, based on the processing requirement, a resource required by the deep learning task such that processing of the deep learning task based on the resource satisfies the processing requirement. Through the embodiments of the present disclosure, resources can be scheduled reasonably and flexibly to satisfy the user's processing requirement for a particular deep learning task without requiring the user to manually specify resource requirements.
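
As a toy illustration of the determination step, the Python sketch below converts a completion-time requirement into a worker count under an assumed linear-throughput model; the abstract does not describe the actual determination method.

```python
import math


def workers_for_deadline(total_work: float,
                         deadline_s: float,
                         per_worker_rate: float) -> int:
    """Smallest number of workers whose combined throughput
    finishes total_work within the deadline."""
    required_rate = total_work / deadline_s
    return math.ceil(required_rate / per_worker_rate)


# e.g. 10,000 work units, a 600 s deadline, 5 units/s per worker -> 4 workers
print(workers_for_deadline(10_000, 600, 5))
```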

Enhancing processing performance of artificial intelligence/machine learning hardware by data sharing and distribution as well as reuse of data in neuron buffer/line buffer

An exemplary artificial intelligence/machine learning hardware computing environment having an exemplary DNN module cooperating with one or more memory components can perform data sharing and distribution as well as reuse of buffer data to reduce the number of memory component reads/writes, thereby enhancing overall hardware performance and reducing power consumption. Illustratively, data from a cooperating memory component is read according to a selected operation of the exemplary hardware and written to a corresponding other memory component for use by one or more processing elements (e.g., neurons). The data is read in such a manner as to optimize the engagement of the one or more processing elements for each processing cycle as well as to reuse data previously stored in the one or more cooperating memory components. Operatively, the written data is copied to a shadow memory buffer prior to being consumed by the processing elements.
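
A minimal Python sketch of the staging-and-shadow pattern appears below; the class and method names are invented for illustration, and actual hardware would move data between on-chip buffers rather than Python objects.

```python
class LineBuffer:
    """Stage data once, copy it to a shadow buffer, then let every
    processing element consume the shadow copy, so the primary buffer
    can be refilled while the previous data is still in use."""

    def __init__(self, size: int):
        self.primary = bytearray(size)
        self.shadow = bytearray(size)

    def fill(self, data: bytes) -> None:
        self.primary[:len(data)] = data     # one read from the memory component

    def publish(self) -> None:
        self.shadow[:] = self.primary       # copy to shadow before consumption

    def broadcast(self, neurons: list) -> None:
        for n in neurons:
            n.consume(bytes(self.shadow))   # shared reuse: no extra memory reads
```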

Worker thread manager
11379259 · 2022-07-05

A system includes determination of whether a current number of active worker threads of a client application is less than a maximum active worker thread limit, retrieval, if the number of active worker threads is less than the maximum active worker thread limit, of a first job associated with a first context from a job pool, determination of whether an inactive worker thread is associated with the first context, and, if an inactive worker thread is associated with the first context, execution of the first job on the inactive worker thread.
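
Sketched in Python with invented `Job` and `Worker` stand-ins, the decision sequence might look like this:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    context: str
    run: Callable[[], None]


@dataclass
class Worker:
    context: str
    active: bool = False


def dispatch(job_pool: list[Job], workers: list[Worker], max_active: int) -> None:
    if sum(w.active for w in workers) >= max_active or not job_pool:
        return                            # at the thread limit, or nothing queued
    job = job_pool.pop(0)                 # first job, with its context
    worker: Optional[Worker] = next(
        (w for w in workers if not w.active and w.context == job.context), None)
    if worker is not None:                # a context-matched inactive thread exists
        worker.active = True
        job.run()                         # execute the job on that thread
        worker.active = False
```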

DATA STRUCTURE EXECUTION FRAMEWORK USING VIRTUAL COMPUTING DOMAINS
20220222108 · 2022-07-14

Techniques and solutions are described for implementing virtual domains. Computing resources in a computing environment are determined and assigned to one or more virtual domains. One or more data structures can be located in a given virtual domain. The computing resources assigned to a virtual domain can be dynamically reconfigured without affecting processes that submit tasks to be performed on data structures in the virtual domains. Tasks can be submitted to a dispatcher, which can determine the appropriate virtual domain for the task and forward the task to the determined virtual domain. Tasks are received by virtual domains and assigned to worker threads, which can access a data structure specified for a given task.
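
A small Python sketch of the dispatch path follows; routing a task by the data structure it targets is one plausible reading of the abstract, and all names are illustrative.

```python
import queue
import threading


class VirtualDomain:
    """Holds the worker threads and data structures assigned to one domain."""

    def __init__(self, num_workers: int):
        self.tasks: queue.Queue = queue.Queue()
        for _ in range(num_workers):      # worker pool can be re-sized later
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self) -> None:
        while True:
            task = self.tasks.get()       # a worker thread picks up the task
            task()                        # the task accesses its data structure


class Dispatcher:
    def __init__(self, domain_of: dict):
        self.domain_of = domain_of        # data structure name -> virtual domain

    def submit(self, structure: str, task) -> None:
        self.domain_of[structure].tasks.put(task)  # forward to the owning domain
```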

Reducing power consumption in a neural network processor by skipping processing operations

A deep neural network (“DNN”) module can determine whether processing of certain values in an input buffer or a weight buffer by neurons can be skipped. For example, the DNN module might determine whether neurons can skip the processing of values in entire columns of a neuron buffer. Processing of these values might be skipped if an entire column of an input buffer or a weight buffer contains only zeros, for example. The DNN module can also determine whether processing of single values in rows of the input buffer or the weight buffer can be skipped (e.g. if the values are zero). Neurons that complete their processing early as a result of skipping operations can assist other neurons with their processing. Following the completion of processing, a combination operation can be performed that transfers the results of the processing operations performed by a neuron to their correct owner.
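
The value-level skip can be shown in a few lines of Python; the column-level check applies the same idea at buffer granularity.

```python
def dot_skip_zeros(inputs: list[float], weights: list[float]) -> float:
    """Skip multiply-accumulates whose result is guaranteed to be zero."""
    acc = 0.0
    for x, w in zip(inputs, weights):
        if x == 0.0 or w == 0.0:
            continue                  # skipped operation: no multiply, no add
        acc += x * w
    return acc


def column_all_zero(buffer: list[list[float]], col: int) -> bool:
    """A column of all zeros lets every neuron skip that whole column."""
    return all(row[col] == 0.0 for row in buffer)
```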

Systems and methods for providing low memory killer protection to non-system applications

Systems and methods for low memory killer (LMK) protection are disclosed. According to one embodiment, in an information processing apparatus comprising at least one computer processor and executing an operating system including an LMK subsystem, a method for providing LMK protection may include: (1) a non-system application embedded with an SDK initiating a foreground service at the beginning of a use case session; (2) the non-system application causing the foreground service to create an ongoing notification with the operating system, wherein the ongoing notification causes the non-system application to have no lower than a perceptible LMK status during the use case session; (3) the non-system application completing the use case session; and (4) the non-system application causing the foreground service to remove the ongoing notification.
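
The claimed four-step sequence amounts to a session lifecycle. The Python sketch below models it with stub methods; a real implementation would use Android's foreground-service and notification APIs, which are only paraphrased here.

```python
class StubApp:
    """Illustrative stand-in for the SDK's bindings to the operating system."""
    def start_foreground_service(self): print("foreground service started")
    def post_ongoing_notification(self): print("ongoing notification created")
    def remove_ongoing_notification(self): print("ongoing notification removed")


class ForegroundSessionGuard:
    """Steps (1)-(4) from the abstract as a context manager: the ongoing
    notification is held for exactly the duration of the use case session."""

    def __init__(self, app: StubApp):
        self.app = app

    def __enter__(self):
        self.app.start_foreground_service()     # (1) session begins
        self.app.post_ongoing_notification()    # (2) grants perceptible LMK status
        return self

    def __exit__(self, *exc):
        self.app.remove_ongoing_notification()  # (4) after the session completes (3)
        return False


with ForegroundSessionGuard(StubApp()):
    pass  # the use case session runs here, protected from the LMK
```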