G06F15/8023

GENERAL-PURPOSE PARALLEL COMPUTING ARCHITECTURE
20220269637 · 2022-08-25 ·

An apparatus includes multiple parallel computing cores and multiple parallel coprocessor/reducer cores associated with each computing core. Each computing core is configured to perform one or more processing operations, generate input data, and provide the input data to designated coprocessor/reducer cores associated with at least some of the computing cores. Each coprocessor/reducer core associated with a respective computing core is configured to generate output data. Some of the coprocessor/reducer cores associated with the respective computing core are configured to perform part of a distributed operation using the output data to generate intermediate results. A designated one of the coprocessor/reducer cores associated with the respective computing core is configured to provide one or more final results to the computing core. The coprocessor/reducer cores are arranged in rows and columns, each column is associated with a different computing core, and each computing core is communicatively coupled to its designated coprocessor/reducer cores in the columns.
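The column-organized flow described above can be sketched in plain Python. The grid size, the doubling performed by each coprocessor/reducer core, and the sum reduction are all illustrative assumptions; the abstract does not specify the operations, only the topology (columns of coprocessor/reducer cores, each column serving one computing core, with a designated core returning the final result).

```python
# Hypothetical sketch of the column-organized coprocessor/reducer grid:
# reducer cores are arranged ROWS x COLS, each column serving one
# computing core. The arithmetic here is illustrative only.

ROWS, COLS = 4, 4  # grid of coprocessor/reducer cores

def core_compute(core_id):
    """Each computing core performs its own processing and emits input data."""
    return [core_id + r for r in range(ROWS)]

def run_distributed_op():
    # Step 1: every computing core generates input data and provides it to
    # the coprocessor/reducer cores in its designated column.
    grid = [[None] * COLS for _ in range(ROWS)]
    for col in range(COLS):
        data = core_compute(col)
        for row in range(ROWS):
            # each coprocessor/reducer core generates output data
            grid[row][col] = data[row] * 2
    # Step 2: the cores in each column perform part of the distributed
    # operation, producing intermediate results; a designated core
    # (row 0 here) provides the final result back to the computing core.
    finals = []
    for col in range(COLS):
        intermediate = sum(grid[row][col] for row in range(ROWS))
        finals.append(intermediate)  # result returned by the designated core
    return finals

print(run_distributed_op())
```

The point of the sketch is the data movement, not the arithmetic: inputs fan out down a column, partial results accumulate within the column, and a single designated core hands one result back.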

GENERAL-PURPOSE PARALLEL COMPUTING ARCHITECTURE
20170220511 · 2017-08-03 ·

An apparatus includes multiple computing cores, where each computing core is configured to perform one or more processing operations and generate input data. The apparatus also includes multiple coprocessors associated with each computing core, where each coprocessor is configured to receive the input data from at least one of the computing cores, process the input data, and generate output data. The apparatus further includes multiple reducer circuits, where each reducer circuit is configured to receive the output data from each of the coprocessors of an associated computing core, apply one or more functions to the output data, and provide one or more results to the associated computing core. In addition, the apparatus includes multiple communication links communicatively coupling the computing cores and the coprocessors associated with the computing cores.

Execution engine for executing single assignment programs with affine dependencies

The execution engine is a new organization for a digital data processing apparatus, suitable for highly parallel execution of structured fine-grain parallel computations. The execution engine includes a memory for storing data and a domain flow program; a controller for requesting the domain flow program from the memory and for translating the program into programming information; a processor fabric for processing the domain flow programming information; and a crossbar for sending tokens and the programming information to the processor fabric.

Methods and apparatus for sharing nodes in a network with connections based on 1 to k+1 adjacency used in an execution array memory array (XarMa) processor
11249939 · 2022-02-15 ·

An Execution Array Memory Array (XarMa©) processor (pronounced "sharma", which means happiness in Sanskrit) is described for signal processing and internet of things (IoT) applications. The XarMa© processor uses a 1 to K+1 adjacency network in an array of execution units. The 1 to K+1 adjacency refers to connections made separately in rows and in columns of execution unit and local file nodes, where the number of Rows ≥ K > 1, the number of Columns ≥ K > 1, and K is an odd integer. Instead of a large central multi-ported register file, a distributed set of storage files local to each execution unit is used. The instruction set architecture uses instructions that specify forwarding of execution results to the execution units associated with destination instructions. This execution array is scalable to support cost-effective, low-power, high-performance application-specific processing focused on target product requirements.
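The abstract constrains the adjacency (K odd, Rows ≥ K > 1, Columns ≥ K > 1) but does not define the exact neighbor pattern. The following toy builder adopts one possible reading, purely for illustration: each node links to itself and the next K nodes in its row and in its column, wrapping around, giving K+1 connections per dimension.

```python
# Toy adjacency builder for a 1-to-K+1 style row/column network.
# The neighbor pattern below (self plus the next K nodes, with
# wraparound) is an assumption, not taken from the patent.

def row_col_neighbors(r, c, rows, cols, k):
    assert k % 2 == 1 and rows >= k and cols >= k  # K odd, Rows, Cols >= K
    row_links = [(r, (c + d) % cols) for d in range(k + 1)]  # K+1 in-row links
    col_links = [((r + d) % rows, c) for d in range(k + 1)]  # K+1 in-column links
    return row_links, col_links

rows_links, cols_links = row_col_neighbors(1, 2, 4, 4, 3)
print(rows_links)
print(cols_links)
```

Whatever the true pattern, the structural idea survives: connectivity is built per row and per column rather than through a central register file.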

APPARATUSES AND METHODS FOR MAP REDUCE

The present disclosure relates to a method and an apparatus for map reduce. In some embodiments, an exemplary processing unit includes: a 2-dimensional (2D) processing element (PE) array comprising a plurality of PEs, each PE comprising a first input and a second input, the first inputs of the PEs in a linear array in a first dimension of the PE array being connected in series and the second inputs of the PEs in a linear array in a second dimension of the PE array being connected in parallel, each PE being configured to perform an operation on data from the first input or second input; and a plurality of reduce tree units, each reduce tree unit being coupled with the PEs in a linear array in the first dimension or the second dimension of the PE array and configured to perform a first reduction operation.
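The row-coupled reduce tree described above can be modeled in a few lines. The per-PE operation (multiply) and the reduction (sum) are illustrative choices, not taken from the disclosure; the sketch shows only the shape of the computation: each PE combines its two inputs, then a tree unit reduces one linear array of PE outputs pairwise.

```python
# Minimal sketch of the map-reduce flow on a 2D PE array: each PE applies
# an operation to its two inputs, and a reduce tree unit coupled to one
# linear array (a row, here) performs the first reduction in pairwise steps.

def pe_op(a, b):
    return a * b  # each PE operates on data from its first or second input

def reduce_tree(values, combine):
    """Pairwise tree reduction over one row of PE outputs."""
    while len(values) > 1:
        values = [combine(values[i], values[i + 1])
                  for i in range(0, len(values), 2)]
    return values[0]

row_a = [1, 2, 3, 4]   # first inputs, connected in series along one dimension
row_b = [5, 6, 7, 8]   # second inputs, connected in parallel along the other
pe_out = [pe_op(a, b) for a, b in zip(row_a, row_b)]
print(reduce_tree(pe_out, lambda x, y: x + y))
```

With multiply as the PE op and sum as the reduction, the row computes a dot product, which is the canonical workload for this kind of map-reduce array.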

Reconfigurable Parallel Processing
20210382722 · 2021-12-09 ·

Processors, systems, and methods are provided for thread-level parallel processing. A processor may comprise a plurality of processing elements (PEs), each of which may comprise a configuration buffer; a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs; and a gasket memory coupled to the plurality of PEs and configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
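The sequencer/gasket-memory interplay can be modeled abstractly: the sequencer distributes one configuration per step, and the gasket memory carries one configuration's results into the next. All names and the example configurations below are illustrative assumptions.

```python
# Minimal model of reconfigurable execution: the sequencer distributes a
# PE configuration per step, and a gasket memory carries one
# configuration's execution results into the next configuration.

def run(configs, data):
    gasket = data                          # gasket memory seeds the first config
    for fn in configs:                     # sequencer distributes configurations
        gasket = [fn(x) for x in gasket]   # PEs execute; results stored in gasket
    return gasket

configs = [lambda x: x + 1, lambda x: x * 2]  # two successive PE configurations
print(run(configs, [1, 2, 3]))
```

The design point this captures is that intermediate results never round-trip through main memory between configurations; the gasket memory pipelines them directly.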

APPARATUSES, METHODS, AND SYSTEMS FOR INSTRUCTIONS FOR ALIGNING TILES OF A MATRIX OPERATIONS ACCELERATOR
20220206854 · 2022-06-30 ·

Systems, methods, and apparatuses relating to one or more instructions for element aligning of a tile of a matrix operations accelerator are described. In one embodiment, a system includes a matrix operations accelerator circuit comprising a two-dimensional grid of processing elements, a first plurality of registers that represents a first two-dimensional matrix coupled to the two-dimensional grid of processing elements, and a second plurality of registers that represents a second two-dimensional matrix coupled to the two-dimensional grid of processing elements; and a hardware processor core coupled to the matrix operations accelerator circuit and comprising a decoder circuit to decode a single instruction into a decoded instruction, the single instruction including a first field that identifies the first two-dimensional matrix, a second field that identifies the second two-dimensional matrix, and an opcode that indicates an execution circuit of the hardware processor core is to cause the matrix operations accelerator circuit to generate a third two-dimensional matrix from a proper subset of elements of a row or a column of the first two-dimensional matrix and a proper subset of elements of a row or a column of the second two-dimensional matrix and store the third two-dimensional matrix at a destination in the matrix operations accelerator circuit, and the execution circuit of the hardware processor core to execute the decoded instruction according to the opcode.
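A schematic (not ISA-accurate) model of the align operation: take a proper subset of elements from a row of each source matrix and pack them into a destination tile. The function name, the row/range parameters, and the two-row packing are assumptions for illustration only.

```python
# Toy model of the tile-align idea: build a third matrix from a proper
# subset of elements of a row of a first matrix and a proper subset of
# elements of a row of a second matrix. Packing layout is assumed.

def tile_align_rows(a, b, row, lo, hi):
    """Pack elements lo..hi-1 of one row of A and of B into a 2-row tile."""
    return [a[row][lo:hi], b[row][lo:hi]]

A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
B = [[9, 10, 11, 12],
     [13, 14, 15, 16]]
print(tile_align_rows(A, B, 0, 1, 3))  # -> [[2, 3], [10, 11]]
```

In hardware the subset selection would be encoded in the instruction's fields and executed inside the accelerator's register grid; the slicing above only mirrors the data selection.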

Fuseload architecture for system-on-chip reconfiguration and repurposing

Methods, systems, and devices that support fuseload architectures for system-on-chip (SoC) reconfiguration and repurposing are described. Trim data may be loaded from fuses to registers on a die based on a fuse header. For example, a set of registers coupled with a set of fuses on the die may be identified, where the set of fuses may store trim data to be copied to the registers as part of a fuseload procedure. In such cases, one or more fuse headers may be identified within the trim data, and each fuse header may correspond to a fuse group that includes a subset of fuses. Based on one or more subfields within a fuse header, a mapping between fuse addresses and register addresses may be determined, and the trim data from each fuse group may be copied into a set of registers based on the mapping.
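The header-driven copy can be sketched as a small interpreter over a fuse stream. The abstract says only that subfields within each header determine the fuse-to-register mapping; the specific subfields used here (a base register address and a group length) are hypothetical.

```python
# Simplified model of a fuseload procedure: each fuse group is preceded
# by a fuse header whose (assumed) subfields give the destination base
# register address and the number of trim values in the group.

def fuseload(fuses):
    """Copy trim data from each fuse group into registers per its header."""
    registers = {}
    i = 0
    while i < len(fuses):
        header = fuses[i]                      # one fuse header per fuse group
        base, length = header                  # hypothetical header subfields
        group = fuses[i + 1 : i + 1 + length]  # the group's trim data
        for offset, trim in enumerate(group):
            registers[base + offset] = trim    # fuse address -> register address
        i += 1 + length
    return registers

# Two fuse groups: two trim values destined for 0x10, one for 0x40.
fuses = [(0x10, 2), 0xAA, 0xBB, (0x40, 1), 0xCC]
print(fuseload(fuses))
```

The headers are what make the scheme reconfigurable: the same fuse macro can retarget trim data to different register blocks by rewriting headers rather than relocating fuses.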

COMPUTATIONAL MEMORY
20220171829 · 2022-06-02 ·

A processing device includes a two-dimensional array of processing elements, each processing element including an arithmetic logic unit to perform an operation. The device further includes interconnections among the two-dimensional array of processing elements to provide direct communication among neighboring processing elements of the two-dimensional array of processing elements. A processing element of the two-dimensional array of processing elements is connected to a first neighbor processing element that is immediately adjacent the processing element in a first dimension of the two-dimensional array. The processing element is further connected to a second neighbor processing element that is immediately adjacent the processing element in a second dimension of the two-dimensional array.
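The neighbor-only interconnect above can be exercised with a toy grid step. The averaging operation is purely illustrative (the abstract specifies an ALU per element, not which operation); what the sketch fixes is the communication pattern: each element reads only its immediately adjacent elements in the two dimensions.

```python
# Toy model of the computational-memory grid: each processing element
# holds a value and combines it with its immediate neighbors' values
# in the first and second dimensions. The averaging op is illustrative.

def neighbors(r, c, rows, cols):
    """Immediately adjacent PEs in the two dimensions (no diagonals)."""
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in cand if 0 <= i < rows and 0 <= j < cols]

def step(grid):
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ns = neighbors(r, c, rows, cols)
            # each PE's ALU combines its own value with neighbor values
            out[r][c] = (grid[r][c] + sum(grid[i][j] for i, j in ns)) / (len(ns) + 1)
    return out

grid = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
print(step(grid))
```

Stencil-style updates like this are the natural fit for such a fabric, since every operand arrives over a direct neighbor link rather than a shared bus.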

Information processing system and method for controlling information processing system
11327764 · 2022-05-10 ·

A method for controlling an information processing system, the information processing system including multiple information processing devices coupled to each other, each of the multiple information processing devices including multiple main operation devices and multiple aggregate operation devices that are coupled to each other, the method includes: acquiring, by each of the aggregate operation devices, array data items from a main operation device coupled to that aggregate operation device; determining the order of dimensions in which a process is executed and in which the information processing devices are coupled to each other; executing, for each of the dimensions in accordance with the order of the dimensions, a process of halving the array data items and distributing the array data items to information processing devices arranged in the dimension; and executing a process of transmitting, to information processing devices arranged in the dimension, operation results calculated based on the data items.
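The halving-and-distributing step reads like a recursive-halving collective. Below is a sketch for a single dimension with two devices; the elementwise sum is an assumed reduction operation, and the abstract's dimension ordering would repeat this exchange once per dimension.

```python
# Sketch of one halving-and-exchange step between two devices in one
# dimension (recursive-halving style): each device keeps one half of
# the array and reduces the peer's matching half into it.

def halve_and_exchange(a, b):
    """Device 0 reduces the low halves; device 1 reduces the high halves."""
    n = len(a) // 2
    low = [x + y for x, y in zip(a[:n], b[:n])]    # kept by device 0
    high = [x + y for x, y in zip(a[n:], b[n:])]   # kept by device 1
    return low, high

dev0 = [1, 2, 3, 4]
dev1 = [10, 20, 30, 40]
print(halve_and_exchange(dev0, dev1))  # -> ([11, 22], [33, 44])
```

Applying this per dimension, in the determined order, halves each device's working array at every step while spreading partial results across the machine, after which the transmit phase circulates the computed results back along each dimension.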