G06F8/452

METHODS AND DEVICES FOR COMPUTING A MEMORY SIZE FOR SOFTWARE OPTIMIZATION
20230066702 · 2023-03-02

Methods and devices are provided for computing a tile size for software optimization. A method includes receiving, by a computing device, information indicative of one or more of a set of loop bounds and a set of data shapes; processing, by the computing device, the information to determine a computation configuration based on the received information, the computation configuration implementable by a compiler, said processing including evaluating at least the computation configuration based on a build cost model, the build cost model representative of a data transfer cost and a data efficiency of the computation configuration; and transmitting, by the computing device, instructions directing the compiler to implement the computation configuration.
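The evaluate-and-select step described in the abstract can be sketched as a search over candidate tile sizes scored by a cost model that weighs data transfer against cache efficiency. The constants, candidate set, and scoring formula below are illustrative assumptions, not the patent's build cost model:

```python
def evaluate_tile(tile, loop_bound, cache_bytes=32 * 1024, elem_bytes=4):
    """Score one candidate tile size with a toy cost model.

    transfer_cost: bytes moved across all tile passes over the loop.
    efficiency:    fraction of the cache the tile footprint occupies.
    All constants are hypothetical, not taken from the patent.
    """
    footprint = tile * elem_bytes
    if footprint > cache_bytes:            # tile does not fit in cache: reject
        return float("inf")
    passes = -(-loop_bound // tile)        # ceil(loop_bound / tile)
    transfer_cost = passes * footprint     # total bytes transferred
    efficiency = footprint / cache_bytes   # higher is better
    return transfer_cost / efficiency      # lower score wins

def select_tile(loop_bound, candidates=(16, 64, 256, 1024, 4096)):
    """Pick the candidate with the lowest modeled cost."""
    return min(candidates, key=lambda t: evaluate_tile(t, loop_bound))
```

With this toy model, larger tiles that still fit in cache win because the transfer term is roughly constant while efficiency grows with footprint.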

ANALYSIS FOR MODELING DATA CACHE UTILIZATION

Aspects include modeling data cache utilization for each loop in a loop nest; estimating total data cache lines fetched in one iteration of the loop; and determining the possibility of data cache reuse across loop iterations using data cache lines fetched and associativity constraints. Aspects also include estimating, for memory reference pairs, reuse by one reference of data cache line fetched by another; estimating total number of cache misses for all iterations of the loop; and estimating total number of cache misses of a reference for iterations of a next outer loop as equal to total cache misses for an entire inner loop. Aspects further include estimating memory cost of a loop unroll and jam transformation, without performing the transformation; and extending a data cache model to estimate best unroll-and-jam factors for the loop nest, capable of minimizing total cache misses incurred by the memory references in the loop body.
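The per-iteration cache-line estimate underlying this model can be illustrated with a simplified stride-based formula; cross-reference reuse and associativity constraints, which the abstract also covers, are ignored here, and all names are hypothetical:

```python
def lines_per_iteration(strides, line_bytes=64):
    """Estimate distinct cache lines fetched in one loop iteration.

    strides: per-reference stride in bytes between consecutive iterations.
    A unit-stride reference brings in a new line only every
    line_bytes/stride iterations; a stride >= line_bytes touches a new
    line on every iteration.
    """
    return sum(min(1.0, s / line_bytes) for s in strides)

def estimated_misses(strides, trip_count, line_bytes=64):
    """Total cache misses over all iterations of the loop, assuming no
    cross-reference reuse and no conflict misses (the abstract's model
    additionally accounts for both)."""
    return lines_per_iteration(strides, line_bytes) * trip_count
```

Per the abstract, this per-loop total would then serve as the per-iteration miss count of the next outer loop.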

SELECTING AN EPILOGUE VECTORIZATION FACTOR FOR USE IN COMPUTER PROCESSING
20230161573 · 2023-05-25

A vectorization factor to be used in vectorization of an epilogue loop in program code is automatically selected. The automatically selecting includes selecting the vectorization factor from a plurality of candidate vectorization factors based on one or more considerations relating to vectorizing the epilogue loop. The vectorization factor that is automatically selected is used in vectorizing the epilogue loop.
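A selection heuristic of this kind can be sketched as follows; the candidate set and cost function are assumptions for illustration, not the claimed selection criteria:

```python
def select_epilogue_vf(main_vf, trip_count, candidates=(2, 4, 8)):
    """Pick an epilogue vectorization factor from the candidates.

    Heuristic sketch: the epilogue must be narrower than the main loop's
    VF, and we minimize the work left after the main vector loop, counted
    as (vector epilogue iterations + scalar remainder iterations).
    """
    remainder = trip_count % main_vf
    best, best_cost = 1, remainder          # baseline: fully scalar epilogue
    for vf in candidates:
        if vf >= main_vf:
            continue                         # epilogue must be narrower
        vec_iters, scalar_left = divmod(remainder, vf)
        cost = vec_iters + scalar_left
        if cost < best_cost:
            best, best_cost = vf, cost
    return best
```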

TRANSFORMATION OF A LOOP WITHIN COMPUTER CODE TO MINIMIZE ITERATIONS
20230161575 · 2023-05-25

A loop within computer code is transformed to minimize loop iterations. A determination is made using statistical information relating to the loop whether the loop that has an early exit indication is to be transformed to minimize iterations of the loop. Based on determining that the loop is to be transformed, the loop is transformed.
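The statistical decision can be illustrated with a toy profile-driven predicate; the 10% threshold and the profile format are assumptions, not the claimed criteria:

```python
def should_transform_early_exit(profile_exit_iters, trip_count, threshold=0.1):
    """Decide from profile data whether an early-exit loop is worth
    transforming to minimize iterations.

    profile_exit_iters: observed iteration numbers at which the early
    exit fired in past runs (hypothetical profile format). If the exit
    usually fires within the first `threshold` fraction of the trip
    count, the transformation is judged profitable.
    """
    if not profile_exit_iters:
        return False                      # no statistics: leave loop alone
    avg_exit = sum(profile_exit_iters) / len(profile_exit_iters)
    return avg_exit <= threshold * trip_count
```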

APPARATUS AND METHOD WITH NEURAL NETWORK COMPUTATION SCHEDULING

An apparatus includes a processor configured to: generate intermediate representation codes, each corresponding to one of a plurality of loop structures obtained for a neural network computation, based on an input specification file of hardware; schedule instructions included in each of the intermediate representation codes corresponding to the plurality of loop structures; select, based on latency values predicted from the scheduling results of the intermediate representation codes, any one code among the intermediate representation codes; and allocate, based on a scheduling result of the selected intermediate representation code, instructions included in the selected intermediate representation code to resources of the hardware included in the apparatus.
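The select-by-predicted-latency step can be sketched generically, with the scheduler and latency predictor passed in as stand-ins for the hardware-specific components the abstract describes:

```python
def pick_loop_structure(ir_codes, schedule, predict_latency):
    """Schedule every candidate IR code and keep the one whose schedule
    has the lowest predicted latency.

    ir_codes:        candidate intermediate representation codes, one per
                     loop structure (any representation).
    schedule:        stand-in for the hardware-specific scheduler.
    predict_latency: stand-in for the latency cost predictor.
    Returns (selected code, its scheduling result).
    """
    results = [(code, schedule(code)) for code in ir_codes]
    return min(results, key=lambda cs: predict_latency(cs[1]))
```

The allocation step of the abstract would then map the winning schedule's instructions onto the hardware resources.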

Hardware and software solutions to divergent branches in a parallel pipeline
09830164 · 2017-11-28

A system and method for efficiently processing instructions in hardware parallel execution lanes within a processor. In response to a given divergence point within an identified loop, a compiler arranges instructions within the identified loop into very large instruction words (VLIWs). At least one VLIW includes instructions intermingled from different basic blocks between the given divergence point and a corresponding convergence point. The compiler generates code that, when executed, assigns at runtime the instructions within a given VLIW to multiple parallel execution lanes within a target processor. The target processor includes a single instruction multiple data (SIMD) micro-architecture. The assignment for a given lane is based on the branch direction found at runtime for that lane at the given divergence point. The target processor includes a vector register storing indications of which instruction within a fetched VLIW an associated lane is to execute.
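The per-lane selection from a fetched VLIW can be illustrated as follows, assuming a two-slot layout with the taken-path instruction in slot 0 and the not-taken-path instruction in slot 1; this layout is a hypothetical simplification of the abstract's vector-register mechanism:

```python
def select_vliw_slots(vliw, branch_taken):
    """For each SIMD lane, pick which instruction of the fetched VLIW to
    execute, based on that lane's branch direction at the divergence
    point (modeling the abstract's vector register of indications).

    vliw:         (taken_path_insn, not_taken_path_insn) -- assumed layout.
    branch_taken: per-lane branch directions observed at runtime.
    """
    return [vliw[0] if taken else vliw[1] for taken in branch_taken]
```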

Generating object code from intermediate code that includes hierarchical sub-routine information
09830134 · 2017-11-28

Examples are described for a device to receive intermediate code that was generated from compiling source code of an application. The intermediate code includes information generated from the compiling that identifies a hierarchical structure of lower level sub-routines in higher level sub-routines, and the lower level sub-routines are defined in the source code of the application to execute more frequently than the higher level sub-routines that identify the lower level sub-routines. The device is configured to compile the intermediate code to generate object code based on the information that identifies lower level sub-routines in higher level sub-routines, and store the object code.
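Compiling lower-level (more frequently executing) sub-routines before the higher-level routines that contain them can be sketched as a post-order walk of the hierarchy; the dictionary representation and function names are assumptions for illustration:

```python
def compile_order(hierarchy):
    """Flatten a sub-routine hierarchy so lower-level (hotter) routines
    come first in compilation order.

    hierarchy: dict mapping a routine name to the names of the lower
    level sub-routines it identifies (hypothetical encoding of the
    information carried in the intermediate code).
    """
    order = []

    def visit(name):
        for child in hierarchy.get(name, ()):
            visit(child)
        order.append(name)        # post-order: leaves (hottest) first

    # Roots are routines that appear in no other routine's child list.
    for root in hierarchy:
        if not any(root in kids for kids in hierarchy.values()):
            visit(root)
    return order
```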

Hardware acceleration method, compiler, and device

A hardware acceleration method includes: obtaining compilation policy information and source code, where the compilation policy information indicates that a first code type matches a first processor and a second code type matches a second processor; analyzing a code segment in the source code according to the compilation policy information; determining a first code segment belonging to the first code type or a second code segment belonging to the second code type; compiling the first code segment into first executable code and sending it to the first processor; and compiling the second code segment into second executable code and sending it to the second processor.
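The routing of code segments to their matching processors can be sketched as a table-driven dispatch; all names and the policy/compiler interfaces here are hypothetical stand-ins for the claimed components:

```python
def compile_and_dispatch(segments, policy, compilers, send):
    """Compile each source segment and ship it to the processor that its
    code type matches.

    segments:  iterable of (code_type, segment) pairs from analysis.
    policy:    compilation policy information, code_type -> processor id.
    compilers: processor id -> compile function for that target.
    send:      sends executable code to the given processor.
    """
    for code_type, segment in segments:
        proc = policy[code_type]
        executable = compilers[proc](segment)
        send(proc, executable)
```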

METHOD FOR CONTROLLING THE FLOW EXECUTION OF A GENERATED SCRIPT OF A BLOCKCHAIN TRANSACTION
20220350579 · 2022-11-03

A method and system for generating a transaction for a blockchain protocol are disclosed. The method comprises using a software resource to receive, generate, or derive at least one data item, and to insert, at least once, a portion of code into a script associated with the transaction, where the script is written in a functionally restricted language. Upon execution of the script, the portion of code provides the functionality of a control flow mechanism controlled or influenced by the at least one data item. The method further comprises using the software resource to generate the blockchain transaction comprising the script and to submit the blockchain transaction to a blockchain network.
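One common way to obtain loop-like control flow in a functionally restricted script language is bounded unrolling, with each inserted copy of the body guarded by a data-item-driven conditional. The sketch below uses placeholder opcode names, not a real script dialect, and is only an illustration of the general idea:

```python
def unroll_loop_into_script(body_ops, max_iters, guard):
    """Emulate bounded iteration in a loop-free script language by
    inserting the loop body up to max_iters times, each copy wrapped in
    a conditional controlled by the guard (an IF/ENDIF analogue whose
    outcome depends on a data item on the stack at run time).

    body_ops, guard: placeholder opcode names, purely illustrative.
    """
    script = []
    for _ in range(max_iters):
        script += [guard, "IF", *body_ops, "ENDIF"]
    return script
```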