Patent classifications
G06F8/456
OFFLOAD SERVER AND OFFLOAD PROGRAM
An offload server includes: a parallel processing designation section configured to identify repeat statements in an application and specify a directive specifying application of parallel processing by an accelerator and perform compilation for each of the repeat statements; a parallel processing pattern creation section configured to create parallel processing patterns each of which specifies whether to perform parallel processing for repeat statements not causing a compilation error; a performance measurement section configured to compile the application with a parallel processing pattern, deploy the compiled application to a verification machine, and perform processing for a measurement of a performance of the application; and an executable file creation section configured to compile a parallel processing pattern with the highest processing performance to create an executable file.
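The selection flow in this abstract can be sketched as below. This is a minimal illustration, not the patented implementation: the abstract does not say how patterns are searched, so exhaustive enumeration is an assumption, and `fake_compiles_ok` and `fake_measure` are hypothetical stand-ins for real compilation and for measurement on a verification machine.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Sequence

Pattern = Dict[str, bool]  # loop id -> offload this loop to the accelerator?


def candidate_loops(loop_ids: Sequence[str],
                    compiles_ok: Callable[[str], bool]) -> List[str]:
    """Keep only loops whose parallel directive compiles without error."""
    return [loop for loop in loop_ids if compiles_ok(loop)]


def enumerate_patterns(loops: Sequence[str]) -> Iterator[Pattern]:
    """Yield every on/off assignment over the candidate loops (assumed search)."""
    for bits in product((False, True), repeat=len(loops)):
        yield dict(zip(loops, bits))


def best_pattern(loops: Sequence[str],
                 measure: Callable[[Pattern], float]) -> Pattern:
    """Return the pattern with the lowest measured run time."""
    return min(enumerate_patterns(loops), key=measure)


# Hypothetical stand-ins for the compiler and the verification machine.
def fake_compiles_ok(loop: str) -> bool:
    return loop != "loop_b"  # pretend loop_b fails to compile with a directive


def fake_measure(pattern: Pattern) -> float:
    cost = 10.0              # toy baseline run time in seconds
    if pattern.get("loop_a"):
        cost -= 4.0          # offloading loop_a helps
    if pattern.get("loop_c"):
        cost += 1.5          # offloading loop_c hurts (transfer overhead)
    return cost


loops = candidate_loops(["loop_a", "loop_b", "loop_c"], fake_compiles_ok)
chosen = best_pattern(loops, fake_measure)
```

Exhaustive enumeration grows as 2^n in the number of candidate loops, so a real system would likely need a cheaper search strategy.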
Method, apparatus, and electronic device for improving parallel performance of CPU
Implementations of the present specification provide a method, an apparatus, and an electronic device for improving parallel performance of a CPU. The method includes: attempting to acquire data requests that are of a same type and that are allocated to a CPU core; determining a number of requests that are specified by the acquired one or more data requests; and in response to determining that the number of requests is greater than or equal to a maximum degree of parallelism: executing executable codes corresponding to the maximum degree of parallelism, wherein the maximum degree of parallelism is a maximum number of parallel threads executable by the CPU, and wherein the executable codes comprise code programs that are compiled and linked based on the maximum degree of parallelism at a time that is prior to a time of the executing.
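The dispatch rule in this abstract can be sketched as follows. All names here (`MAX_PARALLELISM`, `HANDLERS`, `run_once`) are hypothetical. The abstract only covers the case where the request count reaches the maximum degree of parallelism, so the fallback for fewer requests is an assumption, and plain Python functions stand in for code compiled and linked ahead of time.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_PARALLELISM = 3  # assumed maximum number of parallel threads


def make_handler(width: int) -> Callable[[List[str]], List[str]]:
    """Build a handler specialised for exactly `width` requests."""
    def handler(batch: List[str]) -> List[str]:
        assert len(batch) == width
        return [f"done:{request}" for request in batch]
    return handler


# Built before any request arrives, mirroring code prepared prior to execution.
HANDLERS: Dict[int, Callable[[List[str]], List[str]]] = {
    width: make_handler(width) for width in range(1, MAX_PARALLELISM + 1)
}


def run_once(queue: Deque[str]) -> List[str]:
    """Take same-type requests from one core's queue and run one batch."""
    pending = len(queue)
    if pending == 0:
        return []
    if pending >= MAX_PARALLELISM:
        width = MAX_PARALLELISM  # case described in the abstract
    else:
        width = pending          # assumed fallback for a short queue
    batch = [queue.popleft() for _ in range(width)]
    return HANDLERS[width](batch)


queue: Deque[str] = deque(["r1", "r2", "r3", "r4", "r5"])
first = run_once(queue)
second = run_once(queue)
third = run_once(queue)
```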
Vectorizing conditional min-max sequence reduction loops
Algorithms, examples, and related technology for automatic vectorization of a particular class of loops are described. The loops, denoted “CMMSR loops”, operate to find an extremum and also utilize an index denoting the position of the extremum in an array or other multi-element input. CMMSR loops are identified in a language translator by matching a specified template or having a specified set of parsing results, or both. Generated vectorization code includes, for example, code to compute candidates for the extremum, code to select the same instance of the extremum as a scalar execution when the input contains multiple instances, and wind-down code to compute an index expression based on the selected instance of the extremum. Vectorizations may execute on SIMD hardware or other vector processors.
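A minimal sketch of the idea, assuming a minimum search in which the scalar loop keeps the first occurrence of the extremum. The lanes are simulated with plain Python lists rather than real SIMD instructions, and the function names are illustrative only.

```python
from typing import List, Sequence, Tuple


def scalar_argmin(values: Sequence[int]) -> Tuple[int, int]:
    """Reference scalar loop: strict '<' keeps the first minimum."""
    best, best_idx = values[0], 0
    for i in range(1, len(values)):
        if values[i] < best:
            best, best_idx = values[i], i
    return best, best_idx


def lanewise_argmin(values: Sequence[int], lanes: int = 4) -> Tuple[int, int]:
    """Simulated vector version: each lane tracks its own candidate."""
    if not values:
        raise ValueError("input must not be empty")
    lane_best: List[int] = [0] * lanes
    lane_idx: List[int] = [-1] * lanes  # -1 marks a lane with no element yet
    for base in range(0, len(values), lanes):
        for lane in range(lanes):
            i = base + lane
            if i >= len(values):
                continue  # tail elements beyond the input are masked off
            if lane_idx[lane] < 0 or values[i] < lane_best[lane]:
                lane_best[lane], lane_idx[lane] = values[i], i
    # Wind-down: reduce across lanes, then pick the smallest index among
    # lanes holding the extremum so the result matches the scalar loop.
    used = [(b, i) for b, i in zip(lane_best, lane_idx) if i >= 0]
    best = min(b for b, _ in used)
    best_idx = min(i for b, i in used if b == best)
    return best, best_idx


data = [5, 2, 9, 2, 7, 2, 8]
scalar_result = scalar_argmin(data)
vector_result = lanewise_argmin(data, lanes=4)
```

The tie-break in the wind-down step matters: lanes 1 and 3 both hold the value 2 here, and only choosing the smaller index reproduces the scalar answer.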
MULTI-DEVICE BASED INFERENCE METHOD AND APPARATUS
Disclosed is a multi-device based inference method and apparatus, where the multi-device based inference method includes receiving information related to operation devices performing an operation included in a neural network and a graph corresponding to the neural network, obtaining a size of an output of the operation in a forward direction of the graph based on the information and the graph, dividing an input of the operation in a backward direction of the graph based on the information, the graph, and the size of the output, and performing an inference based on the divided input.
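One possible reading of the forward and backward passes, sketched for a single sliding-window operation. The choice of operation, the even split of the output across devices, and every function name are assumptions made for illustration. The abstract does not specify any of them.

```python
from typing import List, Sequence, Tuple

Range = Tuple[int, int]  # half-open [start, stop)


def output_size(input_size: int, kernel: int, stride: int) -> int:
    """Forward direction: size of the operation's output (no padding)."""
    return (input_size - kernel) // stride + 1


def split_output(size: int, num_devices: int) -> List[Range]:
    """Divide the output positions as evenly as possible across devices."""
    base, extra = divmod(size, num_devices)
    ranges, start = [], 0
    for device in range(num_devices):
        stop = start + base + (1 if device < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def input_range_for(out_range: Range, kernel: int, stride: int) -> Range:
    """Backward direction: input slice needed to produce an output slice."""
    start, stop = out_range
    return (start * stride, (stop - 1) * stride + kernel)


def window_sums(signal: Sequence[int], kernel: int, stride: int) -> List[int]:
    """Toy operation: sum of each sliding window."""
    return [sum(signal[i:i + kernel])
            for i in range(0, len(signal) - kernel + 1, stride)]


def distributed_window_sums(signal: Sequence[int], kernel: int, stride: int,
                            num_devices: int) -> List[int]:
    """Run the toy operation per device on its input slice and concatenate."""
    result: List[int] = []
    size = output_size(len(signal), kernel, stride)
    for out_range in split_output(size, num_devices):
        if out_range[0] == out_range[1]:
            continue  # this device received no output positions
        lo, hi = input_range_for(out_range, kernel, stride)
        result.extend(window_sums(signal[lo:hi], kernel, stride))
    return result


signal = list(range(10))
full = window_sums(signal, kernel=3, stride=1)
parts = distributed_window_sums(signal, kernel=3, stride=1, num_devices=2)
```

Neighbouring devices receive overlapping input slices (here positions 4 and 5), which is why the split is derived from the output size rather than by cutting the input directly.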
API FOR RECURRENT NEURAL NETWORKS
Apparatuses, systems, and techniques to implement a recurrent neural network. In at least one embodiment, an application programming interface receives one or more API calls comprising a graph definition and a recurrence attribute, and executes a recurrent neural network based on the graph definition.
DATA STRUCTURE ALLOCATION INTO STORAGE CLASS MEMORY
A method, a computer program product, and a system for allocating a variable into storage class memory during compilation of a program. The method includes selecting a variable recorded in a symbol table during compilation and computing a variable size of the variable by analyzing attributes related to the variable. The method further includes computing additional attributes relating to the variable. The method also includes computing a control flow graph and analyzing the control flow graph and the additional attributes to determine an allocation location for the variable. The method further includes allocating the variable into a storage class memory based on the analysis performed.
Method and apparatus for remote field programmable gate array processing
In one embodiment, an apparatus comprises a fabric controller of a first computing node. The fabric controller is to receive, from a second computing node via a network fabric that couples the first computing node to the second computing node, a request to execute a kernel on a field-programmable gate array (FPGA) of the first computing node; instruct the FPGA to execute the kernel; and send a result of the execution of the kernel to the second computing node via the network fabric.
DYNAMIC DISTRIBUTION OF LOADS ACROSS HETEROGENOUS COMPUTING STRUCTURES IN COMPUTATIONAL RENDERING
Embodiments are provided for dynamically distributing loads in computational rendering in a computing environment. A computational rendering model, developed for a heterogenous computing architecture, exploits nested recursive parallelism within that architecture to enable communication overlap, memory transfer, and data and task management.
PRE-INSTRUCTION SCHEDULING REMATERIALIZATION FOR REGISTER PRESSURE REDUCTION
Examples are disclosed herein that relate to performing rematerialization operation(s) on program source code prior to instruction scheduling. In one example, a method includes, prior to performing instruction scheduling on program source code and for each basic block of the program source code: determining a register pressure at a boundary of the basic block; determining whether the register pressure at the boundary is greater than a target register pressure; based on the register pressure at the boundary being greater than the target register pressure, identifying one or more candidate instructions in the basic block suitable for rematerialization to reduce the register pressure at the boundary; and performing a rematerialization operation on at least one of the one or more candidate instructions to reduce the register pressure at the boundary to be less than the target register pressure.
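The per-block decision described above can be sketched as follows. It assumes register pressure at a boundary equals the number of values live across it, and that a hypothetical `rematerializable` flag marks values cheap enough to recompute at their use. A real compiler would derive both from liveness and cost analysis.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Value:
    name: str
    rematerializable: bool  # assumed flag: cheap to recompute where used


def reduce_boundary_pressure(live_out: Sequence[Value],
                             target: int) -> Tuple[List[Value], List[str]]:
    """Return (values still live at the boundary, names rematerialized)."""
    remaining = list(live_out)
    rematerialized: List[str] = []
    if len(remaining) <= target:
        return remaining, rematerialized  # pressure not above target
    for value in live_out:
        if len(remaining) < target:
            break  # pressure is now below the target
        if value.rematerializable:
            remaining.remove(value)       # no longer held across the boundary
            rematerialized.append(value.name)
    return remaining, rematerialized


live_out = [
    Value("a", False),
    Value("c1", True),
    Value("b", False),
    Value("c2", True),
    Value("c3", True),
]
remaining, rematerialized = reduce_boundary_pressure(live_out, target=4)
```

With five live values and a target of four, two candidates are rematerialized so that the pressure ends below the target, matching the wording of the abstract.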