Patent classifications
G06F8/457
Compile time logic for inserting a buffer between a producer operation unit and a consumer operation unit in a dataflow graph
A dataflow graph for an application has operation units that are configured to be producers and consumers of tensors. A write access pattern of a particular producer specifies an order in which the particular producer generates elements of a tensor, and a read access pattern of a corresponding consumer specifies an order in which the corresponding consumer processes the elements of the tensor. The technology disclosed detects conflicts between the producers and the corresponding consumers that have mismatches between the write access patterns and the read access patterns. A conflict occurs when the order in which the particular producer generates the elements of the tensor is different from the order in which the corresponding consumer processes the elements of the tensor. The technology disclosed resolves the conflicts by inserting buffers between the producers and the corresponding consumers.
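The conflict check described in this abstract can be illustrated with a minimal sketch. It assumes that an access pattern is just the ordered list of element indices an operation unit touches. The names `Op`, `has_conflict`, and `insert_buffers` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical operation unit: a name plus the order in which it
    writes (producer) or reads (consumer) the elements of a tensor."""
    name: str
    pattern: list


def has_conflict(producer, consumer):
    # A conflict exists when the write order differs from the read order.
    return producer.pattern != consumer.pattern


def insert_buffers(edges):
    """Return a new edge list with a buffer on every conflicting edge.

    `edges` is a list of (producer, consumer) pairs. The buffer is modeled
    as an extra node that hands elements on in the consumer's read order
    (an assumption of this sketch).
    """
    result = []
    for producer, consumer in edges:
        if has_conflict(producer, consumer):
            buf = Op(f"buf_{producer.name}_{consumer.name}", consumer.pattern)
            result.append((producer, buf))
            result.append((buf, consumer))
        else:
            result.append((producer, consumer))
    return result


# A 2x2 tensor written row by row but read column by column.
row_major = [(0, 0), (0, 1), (1, 0), (1, 1)]
col_major = [(0, 0), (1, 0), (0, 1), (1, 1)]
producer = Op("producer", row_major)
transposed_reader = Op("transposed_reader", col_major)
plain_reader = Op("plain_reader", row_major)

new_edges = insert_buffers([(producer, transposed_reader),
                            (producer, plain_reader)])
for src, dst in new_edges:
    print(src.name, "->", dst.name)
```

Only the edge to `transposed_reader` gains a buffer, because its read order differs from the producer's write order. The edge to `plain_reader` is left unchanged.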
COMPUTER SYSTEM AND METHOD FOR MULTI-PROCESSOR COMMUNICATION
A compiler system, computer-implemented method and computer program product for optimizing a program for multi-processor system execution. The compiler includes an interface component configured to load from a storage component program code to be executed by one or more processors (P1 to Pn) of a multi-processor system. The compiler further includes a static analysis component configured to determine data dependencies within the program code and to determine all basic blocks of the control flow graph that provide potential insertion positions along paths where communication statements can be inserted to enable data flow between different processors at runtime. An evaluation function component of the compiler is configured to evaluate each potential insertion position with regard to its impact on program execution on the multi-processor system at runtime by using a predefined execution evaluation function.
Data access layer for translating between a data structure used by a first software program and a proxy table used by a second software program
In one example, a system can receive information about a data structure including a set of data entries. The system can generate a proxy data table including a set of columns. The system can use a data access layer to generate a mapping from the data entries to the columns. The system can receive an input to cause an operation to be performed on the data structure by performing the operation on the data structure. Generating a result can involve issuing read commands to the data access layer to perform the operation on the data structure such that the data access layer obtains the associated data entries and provides them as responses to the read commands by performing a translation between the data entries and the columns based on the mapping. The system can then output the result of the operation.
METHOD, DEVICE, AND SYSTEM FOR CREATING A MASSIVELY PARALLELIZED EXECUTABLE OBJECT
The present invention provides a method, system and device for optimizing machine code to be executed on a device that comprises one or more busses and a plurality of processing elements. The machine code is configured to execute a task on the device comprising a plurality of subtasks. The method includes the steps of identifying for at least one subtask one or more processing elements from the plurality of processing elements that are capable of processing the subtask, identifying one or more paths for communicating with the one or more identified processing elements, predicting a cycle length for one or more of the identified processing elements and/or the identified paths, selecting a preferred processing element from the identified processing elements and/or selecting a preferred path from the identified paths, and generating the machine code sequence that comprises instructions that cause the device to communicate with the preferred processing element over the preferred path and/or to execute the subtask on the preferred processing element.
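The selection step in this abstract can be sketched as a search over capable processing elements and their paths. The sketch assumes that the predicted cycle length of a candidate is the sum of a per-element compute estimate and a per-path transfer estimate. That cost model, the function `select_target`, and the sample data are illustrative assumptions, not the patented method.

```python
def select_target(subtask, elements, paths):
    """Pick the (element, path) pair with the lowest predicted cycle length.

    elements: element name -> (set of supported subtasks, predicted compute cycles)
    paths:    element name -> list of (path name, predicted transfer cycles)
    Returns (element name, path name, total cycles), or None if no element
    can process the subtask.
    """
    best = None
    for name, (supported, compute_cycles) in elements.items():
        if subtask not in supported:
            continue  # this element cannot process the subtask
        for path_name, transfer_cycles in paths.get(name, []):
            total = compute_cycles + transfer_cycles
            if best is None or total < best[2]:
                best = (name, path_name, total)
    return best


# Hypothetical device with three processing elements and two busses.
elements = {
    "pe0": ({"fft"}, 100),
    "pe1": ({"fft", "sort"}, 80),
    "pe2": ({"sort"}, 10),
}
paths = {
    "pe0": [("bus0", 5)],
    "pe1": [("bus0", 40), ("bus1", 15)],
    "pe2": [("bus1", 1)],
}

print(select_target("fft", elements, paths))
```

A code generator would then emit instructions that send the subtask to the chosen element over the chosen path. That step is outside the scope of this sketch.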
Multiple reference point shortest path algorithm
Data are maintained in a distributed computing system that describe a directed graph representing relationships among items. The directed graph has a plurality of vertices representing the items and has edges with values representing distances between the items connected by the vertices. A multiple reference point algorithm is executed for a plurality of the vertices in the directed graph in parallel for a series of synchronized iterations to determine shortest distances between the vertices and the source vertex. After executing the algorithm on the vertices, value pairs associated with the vertices are aggregated. The aggregated value pairs indicate shortest distances from the respective vertices to the source vertex. The aggregated value pairs are outputted.
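The synchronized iterations can be illustrated with a single-machine sketch. It is deliberately simplified to one source vertex and runs sequentially. In each iteration every vertex reads only the distances computed in the previous iteration, which stands in for the synchronization barrier of a distributed system. The function name and the sample graph are hypothetical, and edge distances are assumed to be non-negative.

```python
import math


def distances_to_source(vertices, edges, source):
    """Shortest distance from every vertex to `source` in a directed graph.

    edges: list of (u, v, distance) for a directed edge u -> v.
    Each pass is one synchronized iteration: updates are computed from a
    snapshot of the previous iteration's values.
    """
    dist = {v: math.inf for v in vertices}
    dist[source] = 0
    changed = True
    while changed:
        changed = False
        snapshot = dict(dist)  # values visible during this iteration
        for u, v, w in edges:
            candidate = w + snapshot[v]  # go u -> v, then v -> source
            if candidate < dist[u]:
                dist[u] = candidate
                changed = True
    return dist


vertices = ["a", "b", "c", "d", "s"]
edges = [("a", "b", 1), ("b", "s", 2), ("a", "s", 5), ("c", "a", 1)]
print(distances_to_source(vertices, edges, "s"))
```

Vertex `d` has no path to `s`, so its distance stays at infinity. The final dictionary corresponds to the aggregated value pairs that the abstract describes as the output.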
Head of line blocking mitigation in a reconfigurable data processor
A coarse-grained reconfigurable (CGR) processor comprises a first network and a second network; a plurality of agents coupled to the first network; an array of CGR units coupled together by the second network; and a tile agent coupled between the first network and the second network. The tile agent comprises a plurality of links, a plurality of credit counters associated with respective agents of the plurality of agents, a plurality of credit-hog counters associated with respective links of the plurality of links, and an arbiter to manage access to the first network from the plurality of links based on their associated credit-hog counters. Furthermore, a credit-hog counter of the plurality of credit-hog counters changes in response to processing a request for a transaction from its associated link.
Threadsafe use of non-threadsafe libraries with multi-threaded processes
An apparatus includes a processor and a storage storing instructions causing the processor to determine whether an analysis routine is multi-threaded and calls a library function of a non-threadsafe library, and if so, causes the processor to: instantiate an analysis process for executing the analysis routine on multiple threads; instantiate an instance of the library for execution within an isolated library process; instantiate another instance of the library for execution within another isolated library process; retrieve library metadata providing a function prototype of the library function; employ the function prototype to generate an instance of a bridge routine to enable a call from the analysis routine on a first thread to the library function; employ the function prototype to generate another instance of the bridge routine to enable a call from the analysis routine on a second thread to the library function; and begin execution of the analysis routine.
Energy/performance with optimal communication in dynamic parallelization of single threaded programs
A method and apparatus for optimizing parallelized single threaded programs is herein described. Code regions, such as dependency chains, are replicated utilizing any known method, such as dynamic code replication. A flow network associated with a replicated code region is built and a minimum cut algorithm is applied to determine duplicated nodes, which may include a single instruction or a group of instructions, to be removed. The dependency of removed nodes is fulfilled with inserted communication to ensure proper data consistency of the original single-threaded program. As a result, both performance and power consumption are optimized for parallel code sections through removal of expensive workload nodes and replacement with communication between other replicated code regions to be executed in parallel.
GLOBAL DATA FLOW OPTIMIZATION FOR MACHINE LEARNING PROGRAMS
A method for global data flow optimization for machine learning (ML) programs. The method includes receiving, by a storage device, an initial plan for an ML program. A processor builds a nested global data flow graph representation using the initial plan. Operator directed acyclic graphs (DAGs) are connected using crossblock operators according to inter-block data dependencies. The initial plan for the ML program is re-written resulting in an optimized plan for the ML program with respect to its global data flow properties. The re-writing includes re-writes of: configuration dataflow properties, operator selection and structural changes.
INFORMATION PROCESSING APPARATUS AND COMPILATION METHOD
An apparatus includes one or more memories; and one or more processors configured to be coupled to the one or more memories, wherein the one or more processors are configured to generate, through compiling a source code, an object program, execute the object program as multiple processes generated by execution of the object program, allocate a first storage domain in the one or more memories for each of the multiple processes, allocate a variable in the first storage domain for each of the multiple processes, and notify the processes other than the own process of address information of a content of the variable for each of the multiple processes.