G06F9/3004

Chained buffers in neural network processor
11513799 · 2022-11-29

Embodiments of the present disclosure relate to chained buffers in a neural processor circuit. The neural processor circuit includes multiple neural engines, a planar engine, a buffer memory, and a flow control circuit. At least one neural engine operates as a first producer of first data or a first consumer of second data. The planar engine operates as a second consumer receiving the first data from the first producer or a second producer sending the second data to the first consumer. Data flow between the at least one neural engine and the planar engine is controlled using at least a subset of buffers in the buffer memory operating as at least one chained buffer that chains flow of the first data and the second data between the at least one neural engine and the planar engine.

PARALLEL PROCESSING ARCHITECTURE FOR ATOMIC OPERATIONS
20220374286 · 2022-11-24

Techniques for task processing in a parallel processing architecture for atomic operations are disclosed. A two-dimensional array of compute elements is accessed, where each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the array of compute elements is provided on a cycle-by-cycle basis. The control is enabled by a stream of wide control words generated by the compiler. At least one of the control words involves an operation requiring at least one additional operation. A bit of the control word is set, where the bit indicates a multicycle operation. The control word is executed, on at least one compute element within the array of compute elements, based on the bit. The multicycle operation comprises a read-modify-write operation.

STOCHASTIC HYPERDIMENSIONAL ARITHMETIC COMPUTING
20220374234 · 2022-11-24

Stochastic hyperdimensional arithmetic computing is provided. Hyperdimensional computing (HDC) is a neurally-inspired computation model based on the observation that the human brain operates on high-dimensional representations of data, called hypervectors. Although HDC is powerful at reasoning about and associating abstract information, it is weak at feature extraction from complex data. Consequently, most existing HDC solutions rely on expensive pre-processing algorithms for feature extraction. This disclosure proposes StocHD, a novel end-to-end hyperdimensional system that supports accurate, efficient, and robust learning over raw data. StocHD extends HDC to general arithmetic computation by mathematically defining stochastic arithmetic over HDC hypervectors. StocHD enables an entire learning application (including the feature extractor) to run on the HDC data representation, enabling uniform, efficient, robust, and highly parallel computation. This disclosure further provides a novel fully digital and scalable processing in-memory (PIM) architecture that exploits the memory-centric nature of HDC to support extensively parallel computation.

DATA ACCESS PERFORMANCE IN A MEMORY

In an approach for improving data access performance in memory, a processor monitors each data access to a data element in the memory from an application, wherein the application has a plurality of functions. A processor records, during runtime, each data access into a monitoring element table, wherein the record for each data access includes an identity, a start address, an end address, and a memory page number. A processor clusters recorded data accesses for each function based on a distance between data elements accessed in sequence. A processor allocates, based on the data element clustering result, the data elements in a same cluster into a same memory unit in the memory.

Widening memory access to an aligned address for unaligned memory operations

Unaligned atomic memory operations on a processor using a load-store instruction set architecture (ISA) that requires aligned accesses are performed by widening the memory access to an aligned address by the next larger power of two (e.g., 4-byte access is widened to 8 bytes, and 8-byte access is widened to 16 bytes). Data processing operations supported by the load-store ISA including shift, rotate, and bitfield manipulation are utilized to modify only the bytes in the original unaligned address so that the atomic memory operations are aligned to the widened access address. The aligned atomic memory operations using the widened accesses avoid the faulting exceptions associated with unaligned access for most 4-byte and 8-byte accesses. Exception handling is performed in cases in which memory access spans a 16-byte boundary.
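
The widening arithmetic described above can be sketched in Python. This is an illustration only, not the patented implementation: the function names are hypothetical, little-endian byte order is assumed, and the sketch raises an error whenever the access does not fit inside one widened window, which is a simplification of the exception handling the abstract reserves for accesses spanning a 16-byte boundary.

```python
def widen(addr, size):
    """Return (aligned_addr, wide_size, byte_offset) for a 4- or 8-byte access.

    The access is widened to the next larger power of two (4 -> 8, 8 -> 16)
    and the address is rounded down to that alignment.
    """
    if size not in (4, 8):
        raise ValueError("sketch only covers 4-byte and 8-byte accesses")
    wide = size * 2
    aligned = addr & ~(wide - 1)
    offset = addr - aligned
    if offset + size > wide:
        # Simplification: the access leaves the widened window.
        raise ValueError("access crosses the widened boundary")
    return aligned, wide, offset


def extract(wide_value, size, offset):
    """Pull the original bytes out of the widened load (shift, then mask)."""
    return (wide_value >> (offset * 8)) & ((1 << (size * 8)) - 1)


def merge(wide_value, value, size, offset):
    """Modify only the original bytes inside the widened value."""
    shift = offset * 8
    mask = ((1 << (size * 8)) - 1) << shift
    return (wide_value & ~mask) | ((value << shift) & mask)
```

In a real atomic sequence the merged value would be written back with an aligned atomic store or compare-and-swap on the widened address; the sketch shows only the address and bitfield arithmetic.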

Full asynchronous execution queue for accelerator hardware
11593157 · 2023-02-28

A method for providing an asynchronous execution queue for accelerator hardware includes replacing a malloc operation in an execution queue to be sent to an accelerator with an asynchronous malloc operation that returns a unique reference pointer. Execution of the asynchronous malloc operation in the execution queue by the accelerator allocates a requested memory size and adds an entry to a look-up table accessible by the accelerator that maps the reference pointer to a corresponding memory address.
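
A minimal Python sketch of the idea follows, under stated assumptions: the class names `HostQueue` and `Accelerator`, the tuple-based queue format, and the bump allocator are all hypothetical stand-ins, not the interface of any real accelerator runtime. It shows the host receiving a reference pointer immediately, with the real address resolved through the look-up table only when the accelerator executes the queue.

```python
import itertools


class HostQueue:
    """Host side: records operations without waiting for the accelerator."""

    def __init__(self):
        self.ops = []
        self._refs = itertools.count(1)

    def async_malloc(self, size):
        # Return a unique reference pointer at once; no memory exists yet.
        ref = next(self._refs)
        self.ops.append(("async_malloc", ref, size))
        return ref

    def write(self, ref, value):
        # Later operations name the allocation by its reference pointer.
        self.ops.append(("write", ref, value))


class Accelerator:
    """Accelerator side: allocates and fills the look-up table when run."""

    def __init__(self, base=0x1000):
        self.lookup = {}   # reference pointer -> device address
        self.memory = {}   # device address -> stored value
        self._next = base  # toy bump allocator (assumption)

    def run(self, ops):
        for op, ref, arg in ops:
            if op == "async_malloc":
                self.lookup[ref] = self._next
                self._next += arg
            elif op == "write":
                self.memory[self.lookup[ref]] = arg
```

Because the host never needs the device address, it can enqueue the allocation and every operation that depends on it without a round trip to the accelerator.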

INFORMATION PROCESSING DEVICE, CONTROL METHOD, AND NON-TRANSITORY COMPUTER READABLE MEDIUM

An information processing device that executes an arithmetic process includes a first processing circuit and a second processing circuit. Each of the two processing circuits executes the arithmetic process N times consecutively, where N is an integer of 2 or more. The first processing circuit and the second processing circuit continue to operate when at least one of the N results from the first processing circuit matches at least one of the N results from the second processing circuit. This limits the increase in hardware cost and avoids a temporary stop caused by a transient failure.
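
The continue-or-stop decision can be sketched as a comparison over the two result lists. The function name `should_continue` is hypothetical, and the sketch assumes results can be compared for plain equality; it does not model the circuits themselves.

```python
def should_continue(results_a, results_b):
    """Return True if any result from circuit A equals any result from circuit B.

    results_a and results_b hold the N consecutive results of each circuit.
    """
    n = len(results_a)
    if n < 2 or len(results_b) != n:
        raise ValueError("each circuit must supply the same N >= 2 results")
    return any(a == b for a in results_a for b in results_b)
```

For example, with N = 2, results `[42, 41]` and `[40, 42]` share the value 42, so a single corrupted run on each circuit does not force a stop, whereas `[1, 2]` and `[3, 4]` share nothing and the device would not continue.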

MANAGING RETURN PARAMETER ALLOCATION
20230058935 · 2023-02-23

A hybrid threading processor (HTP) supports thread creation by executing an instruction that indicates an amount of storage space to reserve for return values. Before a thread is created, the indicated amount of space is reserved. The newly created child thread sends a return packet back to the parent thread when the child thread completes. The child thread writes its return information into the reserved space and waits for the parent thread to execute a thread join instruction. The thread join instruction takes the returned information from the reserved space and transfers it to the parent thread's register state. The reserved space is released once the child thread is joined. Using a configurable amount of space for each child thread may allow for more child threads to be executed simultaneously.
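
The reserve, return, join, and release lifecycle can be sketched as a small bookkeeping class. Everything here is an assumption made for illustration: the class name `ReturnPool`, measuring space in words, and refusing thread creation when space runs out (the abstract does not say what the hardware does in that case).

```python
class ReturnPool:
    """Tracks return-value space reserved for child threads."""

    def __init__(self, capacity_words):
        self.free = capacity_words
        self.reserved = {}  # child id -> {"size": words, "values": list or None}

    def create_thread(self, child_id, ret_words):
        # Space is reserved before the thread is created.
        if ret_words > self.free:
            return False  # assumption: creation is refused when space is short
        self.free -= ret_words
        self.reserved[child_id] = {"size": ret_words, "values": None}
        return True

    def child_return(self, child_id, values):
        # The child writes its return information into its reserved space.
        slot = self.reserved[child_id]
        if len(values) > slot["size"]:
            raise ValueError("return data exceeds reserved space")
        slot["values"] = list(values)

    def join(self, child_id):
        # Join hands the values to the parent and releases the space.
        slot = self.reserved.pop(child_id)
        self.free += slot["size"]
        return slot["values"]
```

With a pool of 4 words, two children that each reserve 2 words fit at once; if every child had to reserve a fixed 4 words, only one could run, which is the motivation for making the amount configurable.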

Method, device and storage medium for processing overhead of memory access

A method for processing overhead of memory access includes: requesting memory used for value padding in at least one convolution operation of a deep learning model; determining input data of the deep learning model; performing deep learning processing on the input data by using the deep learning model; and releasing the memory after performing the deep learning processing.

Thread-based processor halting

Devices and techniques for thread-based processor halting are described herein. A processor monitors control-status register (CSR) values that correspond to a halt condition for a thread. The processor then compares the halt condition to a current state of the thread and halts in response to the current state of the thread meeting the halt condition.
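
A minimal sketch of the comparison step, assuming a halt condition made of a thread ID and a program counter. The CSR names `halt_enable`, `halt_tid`, and `halt_pc` are hypothetical; the abstract does not say which thread state the halt condition covers.

```python
def should_halt(csrs, thread_state):
    """Compare the CSR-encoded halt condition with a thread's current state."""
    return (
        csrs.get("halt_enable", 0) == 1
        and thread_state["tid"] == csrs["halt_tid"]
        and thread_state["pc"] == csrs["halt_pc"]
    )


def run_until_halt(csrs, trace):
    """Step through (tid, pc) pairs; return the index where the processor halts.

    Returns None if no step meets the halt condition.
    """
    for step, (tid, pc) in enumerate(trace):
        if should_halt(csrs, {"tid": tid, "pc": pc}):
            return step
    return None
```

Keying the condition on a thread ID means other threads passing through the same program counter do not halt the processor.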