G06F8/452

SYSTEMS AND METHODS FOR SCALABLE HIERARCHICAL POLYHEDRAL COMPILATION

A system for compiling programs for execution on a hierarchical processing system having two or more levels of memory hierarchy can perform memory-level-specific optimizations without exceeding a specified maximum compilation time. To this end, the compiler system employs a polyhedral model and limits the dimensions of the polyhedral program representation that the compiler processes at each level, using a focalization operator that temporarily removes one or more dimensions of the representation. Semantic correctness is provided via a defocalization operator that can restore all polyhedral dimensions that had been temporarily removed.
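The focalize/optimize/defocalize round trip can be pictured with a toy model. The sketch below is only an illustration under strong assumptions: it represents an iteration domain as a rectangular box (a dict of dimension name to inclusive bounds) rather than a general polyhedron, and the names `focalize`, `defocalize` and `tile` are hypothetical helpers, not the operators defined in the patent.

```python
def focalize(domain, hidden):
    """Split a box domain {dim: (lo, hi)} into the visible dims and a saved record of hidden dims."""
    visible = {d: b for d, b in domain.items() if d not in hidden}
    saved = {d: b for d, b in domain.items() if d in hidden}
    return visible, saved


def defocalize(visible, saved):
    """Restore every dimension that focalize() set aside."""
    return {**visible, **saved}


def tile(domain, dim, size):
    """Stand-in optimization: replace `dim` by a tile dimension and a point dimension."""
    lo, hi = domain[dim]
    tiled = {}
    for d, bounds in domain.items():
        if d == dim:
            tiled[f"{dim}_tile"] = (lo // size, hi // size)
            tiled[f"{dim}_point"] = (0, size - 1)
        else:
            tiled[d] = bounds
    return tiled


# Assumed example: a 3-deep loop nest; "k" is hidden while "i" is tiled.
full = {"i": (0, 1023), "j": (0, 1023), "k": (0, 63)}
visible, saved = focalize(full, hidden={"k"})
optimized = tile(visible, "i", 32)      # works on 2 dims instead of 3
restored = defocalize(optimized, saved)  # "k" is back, untouched
print(restored)
```

The optimization step only ever sees the reduced representation, which is the property the abstract relies on to bound compilation time at each level.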

System for co-ordination of logical sequence of instructions across electronic devices using visual programming and wireless communication

An orchestration engine provides a technical output across multiple programmable objects, such as electronic devices, virtual objects and cloud-based services, in response to user-specified logic. The orchestration engine may be deployed on a mobile computer, a tablet computer, a laptop computer, a desktop computer, a wired or wireless electronic device in the system, or a server computer connected via the internet. The orchestration engine is extensible: a plug-in framework expands support for similar common interaction methods to newer electronic devices by describing the communication protocol and capabilities of the new element in a markup language. The orchestration engine is provided along with a library of drag-and-drop Visual Programming Language steps that supply the executable computer program steps, so that a person with no knowledge of computer languages can specify the logic.

NEURAL NETWORK OPERATION REORDERING FOR PARALLEL EXECUTION

Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.
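The reordering idea can be sketched as a small scheduling experiment. This is a minimal illustration, not the patented compiler: it assumes each engine runs its operations in program order, that an operation starts once its engine is free and its dependencies have finished, and that the "runtime inefficiency" is simply a longer total runtime. The `Op` class, the engine names and the durations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    engine: str
    duration: int
    deps: tuple = ()


def is_valid(order):
    """An order is valid when every op appears after all of its dependencies."""
    seen = set()
    for op in order:
        if any(d not in seen for d in op.deps):
            return False
        seen.add(op.name)
    return True


def makespan(order):
    """Simulate in-order issue per engine and return the total runtime."""
    engine_free, finish = {}, {}
    for op in order:
        ready = max((finish[d] for d in op.deps), default=0)
        start = max(ready, engine_free.get(op.engine, 0))
        finish[op.name] = start + op.duration
        engine_free[op.engine] = finish[op.name]
    return max(finish.values(), default=0)


def reorder_once(order):
    """Try moving each single op to every other position; keep the best valid order."""
    best, best_span = list(order), makespan(order)
    for i in range(len(order)):
        for j in range(len(order)):
            if i == j:
                continue
            trial = list(order)
            trial.insert(j, trial.pop(i))
            if is_valid(trial) and makespan(trial) < best_span:
                best, best_span = trial, makespan(trial)
    return best, best_span


# Assumed workload: the activation engine idles while "conv_big" blocks "conv_small".
ops = [
    Op("conv_big", "pe", 5),
    Op("conv_small", "pe", 1),
    Op("activation", "act", 5, deps=("conv_small",)),
]
before = makespan(ops)
new_order, after = reorder_once(ops)
print(before, after, [op.name for op in new_order])
```

In this toy workload, issuing `conv_small` first lets the activation engine start early and overlap with `conv_big`, which is the kind of idle-engine inefficiency the abstract describes.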

SYSTEMS AND METHODS FOR OPTIMIZING NESTED LOOP INSTRUCTIONS IN PIPELINE PROCESSING STAGES WITHIN A MACHINE PERCEPTION AND DENSE ALGORITHM INTEGRATED CIRCUIT

In one embodiment, a method for improving the performance of an integrated circuit includes implementing one or more computing devices executing a compiler program that: (i) evaluates a target instruction set intended for execution by an integrated circuit; (ii) identifies one or more nested loop instructions within the target instruction set based on the evaluation; (iii) evaluates whether an innermost loop body within the one or more nested loop instructions comprises a candidate inner loop body that requires a loop optimization that mitigates an operational penalty to the integrated circuit based on one or more executional properties of the innermost loop instruction; and (iv) implements the loop optimization that modifies the target instruction set to include loop optimization instructions to control, at runtime, an execution and a termination of the innermost loop body, thereby mitigating the operational penalty to the integrated circuit.

Compilation to reduce number of instructions for deep learning processor

A method performed during execution of a compilation process for a program having nested loops is provided. The method replaces multiple conditional branch instructions for a processor which uses a conditional branch instruction limited to only comparing a value of a general register with a value of a special register that holds a loop counter value. The method generates, in replacement of the multiple conditional branch instructions, the conditional branch instruction limited to only comparing the value of the general register with the value of the special register that holds the loop counter value for the inner-most loop. The method adds (i) a register initialization outside the nested loops and (ii) a register value adjustment to the inner-most loop. The method defines the value for the general register for the register initialization and conditions for the generated conditional branch instruction, responsive to requirements of the multiple conditional branch instructions.

OPTIMIZING RUNTIME ALIAS CHECKS

Optimizing runtime alias checks includes identifying, by a compiler, a base pointer and a plurality of different memory accesses based on the base pointer in a code loop; generating, by the compiler, a first portion of runtime code to determine a minimum access and a maximum access of the plurality of different memory accesses; and generating, by the compiler, a second portion of runtime code including one or more runtime alias checks for the minimum access and one or more runtime alias checks for the maximum access.
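A minimal sketch of the min/max idea follows, written in Python for readability even though a compiler would emit this as runtime code in the loop preheader. It assumes the accesses are byte offsets from one base pointer and that the other memory region is a simple contiguous range; the function names and the 4-byte access width are illustrative assumptions.

```python
def access_bounds(offsets, width):
    """First portion: find the lowest byte touched and one past the highest byte touched."""
    return min(offsets), max(offsets) + width


def may_alias(base_a, offsets_a, base_b, size_b, width=4):
    """Second portion: one overlap test on the min/max bounds instead of one test per access."""
    lo, hi = access_bounds(offsets_a, width)
    start_a, end_a = base_a + lo, base_a + hi
    start_b, end_b = base_b, base_b + size_b
    return start_a < end_b and start_b < end_a


# Three accesses off the same base pointer (a[0], a[2], a[4] with 4-byte elements).
print(may_alias(1000, [0, 8, 16], 1016, 8))  # ranges [1000, 1020) and [1016, 1024) overlap
print(may_alias(1000, [0, 8, 16], 1020, 8))  # ranges [1000, 1020) and [1020, 1028) do not
```

Checking only the extreme accesses keeps the number of runtime comparisons constant no matter how many accesses share the base pointer.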

HIGH PERFORMANCE PROCESSOR
20210286755 · 2021-09-16

Implementations relate to a data processor that includes a data processing unit having a plurality of processing elements and a cache hierarchy including a plurality of levels of data caches. The data caches include a first level data cache connected to a second level data cache, and a main memory connected to the highest level cache of the cache hierarchy. At least one of the first level data cache or second level data cache is divided into a plurality of cache segments, and during operation of the data processor, at least some of the plurality of cache segments are excluded from cache operation. Each of the excluded cache segments is dedicated to an associated processing element as tightly coupled local access memory.

Buffer overflow detection based on a synthesis of assertions from templates and k-induction

A method for buffer overflow detection involves obtaining program code configured to access memory locations in a loop using a buffer index variable; obtaining an assertion template configured to capture a dependency between the buffer index variable and a loop index variable of the loop in the program code; generating an assertion using the assertion template; verifying that the assertion holds using k-induction; and determining whether a buffer overflow exists using the assertion.
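The pipeline of template, assertion, k-induction and overflow check can be sketched on a toy loop. Everything here is an assumption made for illustration: the loop body (`i += 1; idx += 2`), the linear template `idx == a*i + b`, and above all the inductive step, which is checked by brute force over a small bounded set of states where a real tool would query a solver over all states.

```python
def step(state):
    """One iteration of the toy loop: i += 1; idx += 2."""
    i, idx = state
    return (i + 1, idx + 2)


def instantiate(samples):
    """Fill the template idx == a*i + b from two observed (i, idx) states."""
    (i0, x0), (i1, x1) = samples
    a = (x1 - x0) // (i1 - i0)
    return a, x0 - a * i0


def k_induction(init, step, prop, k, domain):
    """Base case on the first k states, then the inductive step over `domain` (bounded stand-in for a solver)."""
    state = init
    for _ in range(k):
        if not prop(state):
            return False
        state = step(state)
    for start in domain:
        path = [start]
        for _ in range(k):
            path.append(step(path[-1]))
        # k consecutive states satisfying prop must force prop on the next one.
        if all(prop(s) for s in path[:k]) and not prop(path[k]):
            return False
    return True


def overflows(a, b, trip_count, buffer_len):
    """With idx == a*i + b proved and 0 <= i < trip_count (a, b >= 0), test the largest index."""
    return a * (trip_count - 1) + b >= buffer_len


init = (0, 0)
a, b = instantiate([init, step(init)])
prop = lambda s: s[1] == a * s[0] + b
domain = [(i, idx) for i in range(16) for idx in range(40)]
proved = k_induction(init, step, prop, k=1, domain=domain)
print(proved, overflows(a, b, trip_count=8, buffer_len=16), overflows(a, b, trip_count=8, buffer_len=12))
```

Once the assertion is proved, the overflow question reduces to arithmetic on the loop bound, which is why the assertion ties the buffer index to the loop index.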

Buffer Fusion and Layout Optimization

Buffer assignment in a contiguous area in a coarse-grained reconfigurable (CGR) array is optimized by temporarily assigning a first buffer portion and a second buffer portion to first and second physical memory units, routing connections in the contiguous area, and calculating a first cost. A list of candidates for a third physical memory unit is created, and a best cost and a best candidate are initialized. For each candidate, the first and second buffer portions are reassigned to the candidate, connections for data and dataflow control information in the contiguous area are routed, and a second cost is calculated. If the second cost is better than the best cost, the best cost and the best candidate are updated.
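The candidate loop in this abstract is a plain best-so-far search, sketched below. The details are assumptions: the best cost is initialized to the first cost of the split assignment, a lower cost is "better", and routing cost is approximated by Manhattan distance between grid coordinates of a hypothetical producer, consumer and memory unit.

```python
def routing_cost(unit, producer, consumer):
    """Stand-in for route-and-measure: Manhattan wire length to both endpoints."""
    def dist(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])
    return dist(unit, producer) + dist(unit, consumer)


def pick_fused_unit(candidates, first_cost, producer, consumer):
    """Return (best_candidate, best_cost); best_candidate is None if no candidate beats the split layout."""
    best_cost, best_candidate = first_cost, None
    for unit in candidates:
        second_cost = routing_cost(unit, producer, consumer)
        if second_cost < best_cost:
            best_cost, best_candidate = second_cost, unit
    return best_candidate, best_cost


# Assumed grid positions for the units that write to and read from the buffer.
producer, consumer = (1, 0), (1, 2)
candidates = [(0, 0), (1, 1), (2, 2)]
print(pick_fused_unit(candidates, first_cost=3, producer=producer, consumer=consumer))
```

Returning `None` when nothing beats the first cost keeps the original two-unit assignment in place, which matches the abstract's wording that the reassignment is only tentative.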