Patent classifications
G06F8/45
Memory-based distributed processor architecture
Distributed processors and methods for compiling code for execution by distributed processors are disclosed. In one implementation, a distributed processor may include a substrate; a memory array disposed on the substrate; and a processing array disposed on the substrate. The memory array may include a plurality of discrete memory banks, and the processing array may include a plurality of processor subunits, each one of the processor subunits being associated with a corresponding, dedicated one of the plurality of discrete memory banks. The distributed processor may further include a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding, dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.
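The topology described in this abstract can be pictured with a small software model. The following is only a sketch under stated assumptions: the class `ProcessorSubunit`, the `inbox` field, and the chain-shaped reduction in `reduce_over_chain` are hypothetical names and choices made for illustration, not structures taken from the patent. Each subunit reads only its own dedicated bank (the first kind of bus) and exchanges values only with subunits it is explicitly linked to (the second kind of bus).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison: each subunit is a distinct piece of hardware
class ProcessorSubunit:
    bank: list                                      # dedicated memory bank (first kind of bus)
    neighbors: list = field(default_factory=list)   # links to other subunits (second kind of bus)
    inbox: list = field(default_factory=list)       # values received from other subunits

    def local_sum(self):
        # Only this subunit touches its own bank.
        return sum(self.bank)

    def send(self, target, value):
        # A value may only travel over an existing subunit-to-subunit bus.
        if target not in self.neighbors:
            raise ValueError("no bus to target subunit")
        target.inbox.append(value)


def reduce_over_chain(banks):
    """Sum all banks by passing partial sums along a chain of subunits (assumed layout)."""
    units = [ProcessorSubunit(bank=list(b)) for b in banks]
    for left, right in zip(units, units[1:]):
        left.neighbors.append(right)

    total = 0
    for unit in units:
        total = unit.local_sum() + sum(unit.inbox)
        if unit.neighbors:
            unit.send(unit.neighbors[0], total)
    return total


print(reduce_over_chain([[1, 2], [3], [4, 5]]))  # 15
```

In this model no subunit ever reads another subunit's bank; data crosses between banks only as messages on the subunit-to-subunit links.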
HIGHLY PARALLEL PROCESSING ARCHITECTURE USING DUAL BRANCH EXECUTION
Techniques for task processing in a highly parallel processing architecture using dual branch execution are disclosed. A two-dimensional array of compute elements is accessed. Each compute element within the array is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the array of compute elements is provided on a cycle-by-cycle basis. The control is enabled by a stream of wide, variable-length control words generated by the compiler. The control includes a branch. Both sides of the branch are executed in the array while waiting for a branch decision to be acted upon by control logic. The branch decision is based on computation results in the array. Data produced by the taken branch path is promoted. Results from the side of the branch not indicated by the branch decision are ignored or invalidated.
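The promote-or-invalidate behavior described in this abstract can be sketched in software. This is a minimal illustration, assuming a sequential model in which both sides are simply computed before the decision is consulted; the names `BranchResult` and `dual_branch_execute` are hypothetical and do not come from the patent, and no compute-element array or control-word stream is modeled.

```python
from dataclasses import dataclass


@dataclass
class BranchResult:
    value: int
    valid: bool = True


def dual_branch_execute(decide, if_true, if_false, x):
    """Run both sides of a branch, then promote one result and invalidate the other."""
    # Both sides are executed before the branch decision is acted upon.
    results = {
        True: BranchResult(if_true(x)),
        False: BranchResult(if_false(x)),
    }
    # The decision is itself derived from a computation on the data.
    decision = bool(decide(x))
    # The side not indicated by the decision is invalidated.
    results[not decision].valid = False
    # The result of the taken path is promoted.
    return results[decision].value, results


promoted, results = dual_branch_execute(
    decide=lambda v: v > 0,
    if_true=lambda v: v * 2,
    if_false=lambda v: -v,
    x=3,
)
print(promoted)              # 6
print(results[False].valid)  # False
```

The cost of this style is that the work on the untaken side is wasted; the benefit is that no cycles are spent idle while the decision is pending.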
Methods and apparatus for intentional programming for heterogeneous systems
Methods, apparatus, systems and articles of manufacture are disclosed for intentional programming for heterogeneous systems. An example apparatus includes a code lifter to identify annotated code corresponding to an algorithm to be executed on the heterogeneous system based on an identifier being associated with the annotated code, and convert the annotated code in the first representation to intermediate code in a second representation by identifying the intermediate code as having a first algorithmic intent that corresponds to a second algorithmic intent of the annotated code, a domain specific language (DSL) generator to translate the intermediate code in the second representation to DSL code in a third representation when the first algorithmic intent matches the second algorithmic intent, the third representation corresponding to a DSL representation, and a code replacer to invoke a compiler to generate an executable including variant binaries based on the DSL code.
Application division device, method and program
A function defined in the source code of an application is further partitioned into a plurality of logics, independently of the function definitions made by a developer. An application partitioning apparatus (1) for partitioning an application distributively processed by a plurality of information processing apparatuses into a plurality of logics includes an acquisition unit (121) which acquires source code of the application, a first partitioning unit (122) which identifies a plurality of functions defined in the source code and partitions the source code into the plurality of functions, a determination unit (123) which determines whether each of the partitioned functions can be further partitioned according to rules set in advance, and a second partitioning unit (124) which, when it is determined that a partitioned function can be further partitioned, partitions that function into a plurality of functions each including one or a plurality of rows.
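The two-stage partitioning described in this abstract can be sketched with Python's standard `ast` module. This is an illustration only: the rule used here (a function is split further when it has more than `max_rows` top-level statements) is an assumption standing in for the "rules set in advance", and `partition_source` is a hypothetical name. Each resulting logic is reported as a `(first_row, last_row)` range in the source.

```python
import ast


def partition_source(source, max_rows=2):
    """Split source into functions, then split each function into row ranges."""
    tree = ast.parse(source)  # acquisition: parse the application source
    logics = {}
    for node in tree.body:
        # First partitioning: one unit per function defined in the source.
        if not isinstance(node, ast.FunctionDef):
            continue
        stmts = node.body
        # Determination (assumed rule): split only if the body is longer than max_rows.
        if len(stmts) <= max_rows:
            groups = [stmts]
        else:
            # Second partitioning: groups of one or more rows.
            groups = [stmts[i:i + max_rows] for i in range(0, len(stmts), max_rows)]
        logics[node.name] = [(g[0].lineno, g[-1].end_lineno) for g in groups]
    return logics


SAMPLE = """def load(path):
    data = open(path).read()
    return data

def process(data):
    a = data.strip()
    b = a.upper()
    c = b.split()
    return c
"""

print(partition_source(SAMPLE))
# {'load': [(2, 3)], 'process': [(6, 7), (8, 9)]}
```

A real partitioner would also have to track which variables cross each cut so that the pieces can run on different machines; this sketch only locates the cut points.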
INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD AND COMPUTER READABLE MEDIUM
A task graph debranching section (109) determines, as a parallelizable number, the number of processes that can be executed in parallel when a program is executed. A schedule generation section (112) generates, as a parallelization execution schedule, an execution schedule for the program. A display processing section (114) computes a parallelization execution time, which is the time required to execute the program according to the parallelization execution schedule. Further, the display processing section (114) generates parallelization information indicating the parallelizable number, the parallelization execution schedule, and the parallelization execution time, and outputs the generated parallelization information.
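The three quantities named in this abstract can be illustrated on a small task graph. The sketch below makes several assumptions that are not in the source: tasks have known positive durations, the schedule starts every task as soon as its predecessors finish, and the parallelizable number is taken to be the peak number of tasks running at once under that schedule. The function names are hypothetical.

```python
def asap_schedule(durations, deps):
    """Start each task as early as its predecessors allow (assumed policy)."""
    start = {}

    def start_of(task):
        if task not in start:
            start[task] = max(
                (start_of(p) + durations[p] for p in deps.get(task, ())),
                default=0,
            )
        return start[task]

    for task in durations:
        start_of(task)
    return start


def parallelization_info(durations, deps):
    start = asap_schedule(durations, deps)
    finish = {t: start[t] + durations[t] for t in durations}
    # Peak concurrency: count the tasks running at each task's start time.
    peak = max(
        sum(1 for u in durations if start[u] <= start[t] < finish[u])
        for t in durations
    )
    return {
        "parallelizable_number": peak,
        "schedule": start,
        "execution_time": max(finish.values()),
    }


durations = {"A": 2, "B": 3, "C": 1, "D": 2}
deps = {"C": ["A"], "D": ["B", "C"]}
info = parallelization_info(durations, deps)
print(info["parallelizable_number"])  # 2
print(info["schedule"])               # {'A': 0, 'B': 0, 'C': 2, 'D': 3}
print(info["execution_time"])         # 5
```

Run sequentially, the same four tasks would take 8 time units, so reporting the schedule and its execution time together shows what the parallelization buys.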
Architecture and programming in a parallel processing environment with a tiled processor having a direct memory access controller
An integrated circuit includes a plurality of tiles. Each tile includes a processor, a switch including switching circuitry to forward data over data paths from other tiles to the processor and to switches of other tiles, and a switch memory that stores instruction streams that are able to operate independently for respective output ports of the switch. Also disclosed is a direct memory access (DMA) scheme in which sizes of DMA transfers are limited according to whether a cache miss has occurred.
GPU-based adaptive BLAS operation acceleration apparatus and method thereof
Disclosed herein are an apparatus and method for adaptively accelerating a BLAS operation based on a GPU. The apparatus for adaptively accelerating a BLAS operation based on a GPU includes a BLAS operation acceleration unit for setting optimal OpenCL parameters using machine-learning data attribute information and OpenCL device information and for creating a kernel in a binary format by compiling kernel source code; an OpenCL execution unit for creating an OpenCL buffer for a BLAS operation using information about an OpenCL execution environment and the optimal OpenCL parameters and for accelerating machine learning in an embedded system in such a way that a GPU that is capable of accessing the created OpenCL buffer performs the BLAS operation using the kernel; and an accelerator application unit for returning the result of the BLAS operation to a machine-learning algorithm.
SYSTEM AND METHOD FOR GENERATION OF EVENT DRIVEN, TUPLE-SPACE BASED PROGRAMS
In a system for automatic generation of event-driven, tuple-space based programs from a sequential specification, a hierarchical mapping solution can target different runtimes relying on event-driven tasks (EDTs). The solution uses loop types to encode short, transitive relations among EDTs that can be evaluated efficiently at runtime. Specifically, permutable loops translate immediately into conservative point-to-point synchronizations of distance one. A runtime-agnostic layer can be used to target the transformed code to different runtimes.
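The idea of distance-one point-to-point synchronization can be illustrated on a small grid of tasks. The sketch below assumes that each point of a permutable loop nest becomes one task and that a task waits only for its immediate predecessor along each loop dimension; longer dependences are then covered transitively. The names `distance_one_predecessors` and `run_edts` are hypothetical, and the queue-based executor is a stand-in for a real event-driven runtime.

```python
from collections import deque
from itertools import product


def distance_one_predecessors(tile):
    """Predecessors one step back along each loop dimension."""
    return [
        tile[:d] + (tile[d] - 1,) + tile[d + 1:]
        for d in range(len(tile))
        if tile[d] > 0
    ]


def run_edts(shape, body):
    """Run one task per tile, releasing a tile once all its predecessors are done."""
    tiles = list(product(*(range(n) for n in shape)))
    pending = {t: len(distance_one_predecessors(t)) for t in tiles}
    ready = deque(t for t in tiles if pending[t] == 0)
    order = []
    while ready:
        tile = ready.popleft()
        body(tile)
        order.append(tile)
        # Signal only the distance-one successors (point-to-point synchronization).
        for d in range(len(tile)):
            succ = tile[:d] + (tile[d] + 1,) + tile[d + 1:]
            if succ in pending:
                pending[succ] -= 1
                if pending[succ] == 0:
                    ready.append(succ)
    return order


print(run_edts((2, 2), lambda tile: None))
# [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Each task needs to know at most one predecessor per loop dimension, which keeps the dependence check cheap at runtime. The synchronization is conservative because it may order two tasks that have no actual data dependence between them.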