G06F8/45

LARGE LOOKUP TABLES FOR AN IMAGE PROCESSOR
20210042875 · 2021-02-11

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for supporting large lookup tables on an image processor. One of the methods includes receiving an input kernel program for an image processor having a two-dimensional array of execution lanes, a shift-register array, and a plurality of memory banks. If the kernel program has an instruction that reads a lookup table value for a lookup table partitioned across the plurality of memory banks, the instruction in the kernel program is replaced with a sequence of instructions that, when executed by an execution lane, causes the execution lane to read a first value from a local memory bank and a second value from the local memory bank on behalf of another execution lane belonging to a different group of execution lanes.
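A minimal Python sketch of the two-read scheme, under loudly stated assumptions: there are exactly two lane groups, each with one local bank; the table is interleaved so that entry `i` lives in bank `i % 2` at offset `i // 2`; and lane `k` of one group is paired with lane `k` of the other. The names `partition_table`, `read`, and `lookup` are hypothetical, and the hand-over of the "on behalf of" value (done by the shift-register array in the abstract) is modelled as a simple list swap.

```python
from typing import List, Optional

NUM_BANKS = 2  # assumption: two lane groups, each attached to one local memory bank


def partition_table(table: List[int]) -> List[List[int]]:
    """Interleave the table: entry i lives in bank i % NUM_BANKS at offset i // NUM_BANKS."""
    return [table[b::NUM_BANKS] for b in range(NUM_BANKS)]


def read(bank: List[int], offset: int) -> Optional[int]:
    """Speculative local read; an out-of-range offset yields None (that value is discarded)."""
    return bank[offset] if offset < len(bank) else None


def lookup(banks: List[List[int]], idx0: List[int], idx1: List[int]) -> List[List[Optional[int]]]:
    """Lane k of group 0 is paired with lane k of group 1; every lane reads only its own bank."""
    indices = [idx0, idx1]
    out: List[List[Optional[int]]] = [[], []]
    for lane in range(len(idx0)):
        # First read: each lane reads its local bank using its own index.
        first = [read(banks[g], indices[g][lane] // NUM_BANKS) for g in range(2)]
        # Second read: each lane reads its local bank using its partner's index.
        second = [read(banks[g], indices[1 - g][lane] // NUM_BANKS) for g in range(2)]
        for g in range(2):
            # The partner's second read is handed over (shift-register array in the abstract).
            remote = second[1 - g]
            own_bank_holds_entry = indices[g][lane] % NUM_BANKS == g
            out[g].append(first[g] if own_bank_holds_entry else remote)
    return out
```

For example, with `banks = partition_table(list(range(100, 110)))`, the call `lookup(banks, [0, 3, 5], [2, 7, 9])` yields `[[100, 103, 105], [102, 107, 109]]`: no lane ever touches a remote bank, yet every lane obtains its own entry.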

MEMORY-BASED DISTRIBUTED PROCESSOR ARCHITECTURE
20210090617 · 2021-03-25

Distributed processors and methods for compiling code for execution by distributed processors are disclosed. In one implementation, a distributed processor may include a substrate; a memory array disposed on the substrate; and a processing array disposed on the substrate. The memory array may include a plurality of discrete memory banks, and the processing array may include a plurality of processor subunits, each one of the processor subunits being associated with a corresponding, dedicated one of the plurality of discrete memory banks. The distributed processor may further include a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding, dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.
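The arrangement of subunits, dedicated banks, and the two kinds of buses can be pictured with a small Python model. This is an illustrative sketch, not the disclosed hardware: the class `ProcessorSubunit`, the chain topology in `build_chain`, and the reduction `chain_sum` are all hypothetical. Each subunit reads only its own bank (the first bus) and exchanges data only with subunits it is wired to (the second bus).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(eq=False)  # identity comparison: subunits reference each other through their buses
class ProcessorSubunit:
    uid: int
    bank: Dict[int, int] = field(default_factory=dict)  # dedicated memory bank (first bus)
    links: List["ProcessorSubunit"] = field(default_factory=list)  # second buses to other subunits
    inbox: List[Tuple[int, int]] = field(default_factory=list)  # (sender uid, value)

    def load(self, addr: int) -> int:
        return self.bank[addr]  # a subunit only ever touches its own bank

    def store(self, addr: int, value: int) -> None:
        self.bank[addr] = value

    def send(self, target: "ProcessorSubunit", value: int) -> None:
        if target not in self.links:
            raise ValueError(f"no bus from subunit {self.uid} to subunit {target.uid}")
        target.inbox.append((self.uid, value))


def build_chain(n: int) -> List[ProcessorSubunit]:
    """Assumed topology: each subunit is bused to its left and right neighbours."""
    units = [ProcessorSubunit(i) for i in range(n)]
    for a, b in zip(units, units[1:]):
        a.links.append(b)
        b.links.append(a)
    return units


def chain_sum(units: List[ProcessorSubunit], addr: int) -> int:
    """Sum the word at `addr` across all banks, passing partial sums along the chain."""
    partial = 0
    for i, unit in enumerate(units):
        if unit.inbox:
            _, partial = unit.inbox.pop()
        partial += unit.load(addr)
        if i + 1 < len(units):
            unit.send(units[i + 1], partial)
    return partial
```

The design point the model makes visible is that no subunit ever reads another subunit's bank directly; cross-bank data moves only as messages over the subunit-to-subunit buses.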

Performance Estimation-Based Resource Allocation for Reconfigurable Architectures
20210081769 · 2021-03-18

The technology disclosed relates to allocating available physical compute units (PCUs) and/or physical memory units (PMUs) of a reconfigurable data processor to operation units of an operation unit graph for execution thereof. In particular, it relates to selecting, for evaluation, an intermediate stage compute processing time between lower and upper search bounds of a generic stage compute processing time; determining the pipeline number of PCUs and/or PMUs required to process the operation unit graph; and, iteratively, initializing new lower and upper search bounds of the generic stage compute processing time and selecting, for evaluation in the next iteration, a new intermediate stage compute processing time, taking into account whether the pipeline number of PCUs and/or PMUs produced for the prior intermediate stage compute processing time in the previous iteration is lower or higher than the number of available PCUs and/or PMUs.
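The iterative search over the stage compute processing time is a bisection. The Python sketch below assumes a hypothetical cost model, `pipeline_units`, in which an operation of latency `L` needs `ceil(L / stage_time)` parallel units; the abstract does not specify how the pipeline number is computed. The helper names and the tolerance `tol` are likewise illustrative.

```python
import math
from typing import List, Tuple


def pipeline_units(latencies: List[float], stage_time: float) -> int:
    """Hypothetical cost model: an operation of latency L needs ceil(L / stage_time) units."""
    return sum(math.ceil(lat / stage_time) for lat in latencies)


def allocate(latencies: List[float], available: int, tol: float = 1e-3) -> Tuple[float, int]:
    """Bisect on the stage compute processing time until the bounds are within `tol`."""
    lower, upper = 0.0, float(max(latencies))  # at the upper bound every op gets one unit
    if pipeline_units(latencies, upper) > available:
        raise ValueError("graph needs more units than are available")
    while upper - lower > tol:
        mid = (lower + upper) / 2  # intermediate stage compute processing time
        if pipeline_units(latencies, mid) <= available:
            upper = mid  # fits: tighten from above and try a faster stage
        else:
            lower = mid  # over budget: relax the stage time
    return upper, pipeline_units(latencies, upper)
```

With operation latencies `[4.0, 8.0]` and three available units, `allocate` returns a stage time of `4.0` using three units. Any faster stage would need at least five units under this cost model.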

INFORMATION PROCESSING METHOD AND COMPUTER-READABLE RECORDING MEDIUM HAVING STORED THEREIN OPTIMIZATION PROGRAM
20210081210 · 2021-03-18

An information processing method executed by a computer includes: executing a target program to acquire the number of executions of each of a plurality of program codes; selecting, based on the acquired numbers of executions, a combination of program codes related to a plurality of assignment statements from among the program codes related to assignment statements with a higher number of executions; when the target program is changed so that parallel processing using a SIMD operation function is executed for each of the program codes related to the plurality of assignment statements in the selected combination, executing the changed target program to calculate an execution accuracy and an operation time; and searching for a combination for which the calculated execution accuracy and operation time satisfy a predetermined condition.
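The profile-then-search loop can be sketched in Python. Everything here is an assumption for illustration: `exec_counts` stands in for the profiling run, the callback `evaluate` stands in for rebuilding and running the program with the chosen statements vectorized, and the "predetermined condition" is taken to be "accuracy at least `min_accuracy`, fastest time wins".

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Optional, Tuple


def search_simd_combination(
    exec_counts: Dict[str, int],
    evaluate: Callable[[FrozenSet[str]], Tuple[float, float]],
    top_k: int,
    min_accuracy: float,
) -> Optional[Tuple[FrozenSet[str], float]]:
    """Return (statements to vectorize, time) for the fastest combination meeting the accuracy bound."""
    # Keep only the most frequently executed assignment statements.
    hot = sorted(exec_counts, key=exec_counts.get, reverse=True)[:top_k]
    best: Optional[Tuple[FrozenSet[str], float]] = None
    for r in range(1, len(hot) + 1):
        for combo in combinations(hot, r):
            chosen = frozenset(combo)
            accuracy, seconds = evaluate(chosen)  # hypothetical: run the changed program
            if accuracy >= min_accuracy and (best is None or seconds < best[1]):
                best = (chosen, seconds)
    return best
```

The exhaustive search is exponential in `top_k`, which is why restricting it to the hottest statements first matters.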

Spatially programmed logic array architecture
10963302 · 2021-03-30

A spatially programmed logic circuit (SPLC) array system performs spatial compilation of programs for use in the SPLCs to produce standardized compiled blocks representing predetermined portions of an SPLC. The blocks may be freely relocated in an SPLC after compilation by editing of the compiled file. Inter-block communication circuitry allows joining of blocks within an SPLC or across SPLCs to allow scalability and accommodation of different programs with efficient utilization of an SPLC for multiple programs, again without recompilation.
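Because the compiled blocks are standardized, relocating one reduces to rewriting its placement and leaving its compiled contents alone. The Python sketch below assumes, hypothetically, that each block occupies one slot of a rectangular grid and that the compiled file records a `(row, col)` placement per block. `CompiledBlock` and `relocate` are illustrative names, not part of the disclosure.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class CompiledBlock:
    name: str
    row: int       # placement on an assumed grid of uniform block slots
    col: int
    config: bytes  # opaque compiled contents; relocation never modifies them


def relocate(blocks: List[CompiledBlock], name: str, row: int, col: int,
             rows: int, cols: int) -> List[CompiledBlock]:
    """Move one block by editing only its placement fields in the compiled file."""
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("target slot is outside the array")
    occupied = {(b.row, b.col) for b in blocks if b.name != name}
    if (row, col) in occupied:
        raise ValueError("target slot is already occupied")
    return [replace(b, row=row, col=col) if b.name == name else b for b in blocks]
```

Since `config` is carried over unchanged, the relocated block needs no recompilation, which mirrors the abstract's claim.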

Synchronization of concurrent computation engines

Systems and methods are provided for synchronizing execution of program code for an integrated circuit device having multiple concurrently operating execution engines, where the operation of one execution engine may depend on the operation of another. Data or resource dependencies may be accommodated with a Set instruction, which causes a first execution engine to set a register value, and a Wait instruction, which causes a second execution engine to wait for a condition associated with that register value. The concurrent operation of the execution engines can thus be synchronized.
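The Set/Wait pair behaves like a shared register guarded by a condition variable. The Python sketch below models two execution engines as threads. `SyncRegister`, its `set`/`wait` methods, and `run_engines` are hypothetical names chosen for illustration; `threading.Condition.wait_for` is the standard-library call that blocks until a predicate holds.

```python
import threading
from typing import Callable, List, Optional


class SyncRegister:
    """Hypothetical register shared by two engines; models the Set/Wait instruction pair."""

    def __init__(self) -> None:
        self._value = 0
        self._cond = threading.Condition()

    def set(self, value: int) -> None:  # "Set": write the register and wake waiters
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def wait(self, condition: Callable[[int], bool], timeout: Optional[float] = None) -> bool:
        """"Wait": block until condition(register value) is true."""
        with self._cond:
            return self._cond.wait_for(lambda: condition(self._value), timeout)


def run_engines() -> List[object]:
    reg = SyncRegister()
    buffer: List[int] = []
    log: List[object] = []

    def producer() -> None:  # first engine: writes data, then signals
        buffer.extend([1, 2, 3])
        log.append("produced")
        reg.set(1)

    def consumer() -> None:  # second engine: blocks until the dependency is met
        reg.wait(lambda v: v >= 1)
        log.append(("consumed", sum(buffer)))

    threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Even though the consumer thread is started first, `run_engines()` always returns `["produced", ("consumed", 6)]`, because the Wait cannot complete until the producer's Set has executed.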

Method, apparatus, and computer-readable medium for parallelization of a computer program on a plurality of computing cores
11853256 · 2023-12-26

An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
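The decomposition and mapping steps can be illustrated with Python's `ast` module (Python 3.9+ for `ast.unparse`). This is a sketch under simplifying assumptions: only assignments built from binary operators are handled, each operator node becomes one cell of that command's systolic array, operands are treated as data nodes that receive no cell, and cells are dealt to cores round-robin. The function names `node_network` and `map_to_cores` are hypothetical.

```python
import ast
from typing import Dict, List, Tuple, Union

Operand = Union[int, Tuple[str, str]]  # int: id of an earlier node; tuple: a data node


def node_network(command: str) -> List[Tuple[int, str, Tuple[Operand, Operand]]]:
    """Decompose one assignment into operator nodes, listed in dependency (post-)order."""
    stmt = ast.parse(command).body[0]
    if not isinstance(stmt, ast.Assign):
        raise ValueError("sketch only handles simple assignments")
    nodes: List[Tuple[int, str, Tuple[Operand, Operand]]] = []

    def visit(n: ast.AST) -> Operand:
        if isinstance(n, ast.BinOp):
            left, right = visit(n.left), visit(n.right)
            nodes.append((len(nodes), type(n.op).__name__, (left, right)))
            return nodes[-1][0]
        return ("data", ast.unparse(n))  # leaf operand: a data node, not given a cell

    visit(stmt.value)
    return nodes


def map_to_cores(commands: List[str], num_cores: int) -> Dict[Tuple[int, int], Tuple[str, int]]:
    """One systolic array per command, one cell per operator node, cells assigned round-robin."""
    mapping: Dict[Tuple[int, int], Tuple[str, int]] = {}
    next_core = 0
    for cmd_idx, command in enumerate(commands):
        for cell, (_, op, _) in enumerate(node_network(command)):
            mapping[(cmd_idx, cell)] = (op, next_core % num_cores)
            next_core += 1
    return mapping
```

For `"y = a + b * c"` the network is a `Mult` node feeding an `Add` node. Mapping that command together with `"z = d - e"` onto two cores places `Mult` and `Sub` on core 0 and `Add` on core 1.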

Processor that includes a special store instruction used in regions of a computer program where memory aliasing may occur

Processor hardware detects when memory aliasing occurs, and assures proper operation of the code even in the presence of memory aliasing. The processor defines a special store instruction that is different from a regular store instruction. The special store instruction is used in regions of the computer program where memory aliasing may occur. Because the hardware can detect and correct for memory aliasing, this allows a compiler to make optimizations such as register promotion even in regions of the code where memory aliasing may occur.
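The abstract does not say how the hardware corrects for aliasing. The toy Python model below shows one plausible behaviour, labelled here as an assumption: promoted registers remember the address they mirror, and the special store forwards its value into any promoted copy of the same address. `ToyCore`, `promote`, `store`, and `special_store` are hypothetical names.

```python
from typing import Dict


class ToyCore:
    """Toy model: promoted registers remember which memory address they mirror."""

    def __init__(self, memory: Dict[int, int]) -> None:
        self.memory = memory
        self.regs: Dict[str, int] = {}
        self.mirrors: Dict[str, int] = {}  # register -> address promoted into it

    def promote(self, reg: str, addr: int) -> None:
        """Compiler-style register promotion of the value at `addr`."""
        self.regs[reg] = self.memory[addr]
        self.mirrors[reg] = addr

    def store(self, addr: int, value: int) -> None:
        """Regular store: no alias check, so a promoted copy can go stale."""
        self.memory[addr] = value

    def special_store(self, addr: int, value: int) -> None:
        """Special store (assumed behaviour): detect aliasing and update promoted copies."""
        self.memory[addr] = value
        for reg, mirrored in self.mirrors.items():
            if mirrored == addr:
                self.regs[reg] = value
```

If `r1` has been promoted from address `0x10` and a store through a pointer that happens to equal `0x10` uses the regular `store`, `r1` keeps the stale value. With `special_store` it sees the new one. This is what lets a compiler promote values to registers in regions where it cannot rule aliasing out.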

Failover of a hardware accelerator to software

An accelerator manager monitors hardware accelerators that are called by one or more computer programs. A virtual function table includes multiple entries, where each entry correlates a call from a computer program to a corresponding call to either a software library or a hardware accelerator. A call by the computer program to a function in the virtual function table results in the call being routed to either the software library or to a hardware accelerator depending on the contents of the corresponding entry in the virtual function table. The accelerator manager, in response to a detected failure in an accelerator, replaces one or more calls in the virtual function table to the failed accelerator with calls to the software library. The accelerator manager can then retry the call that caused the accelerator to fail, which will then be executed by the software library.
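The virtual function table and its failover can be sketched in a few lines of Python. Assumptions: every function has a software fallback, each table entry records which device (or `"sw"`) serves it, and a device failure surfaces as a hypothetical `AcceleratorError`. On failure, every entry pointing at the failed device is switched to software, and the failing call is retried.

```python
from typing import Any, Callable, Dict, Tuple


class AcceleratorError(Exception):
    """Raised by a (simulated) hardware accelerator when it fails."""


class AcceleratorManager:
    def __init__(self, software: Dict[str, Callable[..., Any]],
                 accelerated: Dict[str, Tuple[str, Callable[..., Any]]]) -> None:
        self.software = software  # assumption: every function keeps a software fallback
        # Virtual function table: name -> (device id or "sw", implementation).
        self.vtable: Dict[str, Tuple[str, Callable[..., Any]]] = {
            n: ("sw", f) for n, f in software.items()
        }
        self.vtable.update(accelerated)

    def call(self, name: str, *args: Any) -> Any:
        device, impl = self.vtable[name]
        try:
            return impl(*args)
        except AcceleratorError:
            if device == "sw":
                raise
            # Reroute every entry that points at the failed device to its software version.
            for other, (dev, _) in list(self.vtable.items()):
                if dev == device:
                    self.vtable[other] = ("sw", self.software[other])
            return self.call(name, *args)  # retry; now served by the software library
```

Callers only ever go through `call`, so the switch from accelerator to software is invisible to the program apart from speed.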

Optimizing program parameters in multithreaded programming

Optimizing program parameters in multithreaded programming may include: generating, for a program, a plurality of low-level metric functions, each calculating a respective one of a plurality of low-level metrics; generating one or more high-level metric functions for one or more high-level metrics, each high-level metric function comprising a piecewise-rational function based on one or more of the low-level metric functions; and generating one or more program parameter values for executing the program based on the one or more high-level metric functions, one or more data parameter values, and one or more hardware parameter values, wherein the one or more program parameter values are configured to optimize the one or more high-level metrics.
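The split into data, hardware, and program parameters can be shown with a hypothetical GPU-style occupancy model in Python. In this sketch, problem size `n` is the data parameter; `threads_per_sm` and `num_sms` are hardware parameters; and the block size is the program parameter being chosen. The low-level metrics (blocks launched, blocks resident per multiprocessor) use `ceil` and floor division, so the high-level metric `waves` is piecewise in the block size. The model and all names are assumptions, not the method's actual metrics.

```python
import math
from typing import Sequence, Tuple


def low_level_metrics(n: int, block_size: int, threads_per_sm: int) -> Tuple[int, int]:
    """Low-level metrics: blocks launched and blocks resident per multiprocessor."""
    blocks = math.ceil(n / block_size)
    blocks_per_sm = threads_per_sm // block_size
    return blocks, blocks_per_sm


def waves(n: int, block_size: int, threads_per_sm: int, num_sms: int) -> float:
    """High-level metric built from the low-level ones; ceil/floor make it piecewise."""
    blocks, per_sm = low_level_metrics(n, block_size, threads_per_sm)
    if per_sm == 0:
        return math.inf  # a block this large does not fit on a multiprocessor
    return math.ceil(blocks / (per_sm * num_sms))


def choose_block_size(n: int, threads_per_sm: int, num_sms: int,
                      candidates: Sequence[int] = (32, 64, 128, 256, 512, 1024)) -> int:
    """Program parameter: fewest waves, with ties broken toward larger blocks."""
    return min(candidates, key=lambda b: (waves(n, b, threads_per_sm, num_sms), -b))
```

With `n = 10000`, `threads_per_sm = 1536`, and `num_sms = 4`, a block size of 1024 leaves only one resident block per multiprocessor and needs 3 waves. The sketch therefore picks 512, the largest candidate that finishes in 2.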