G06F15/8023

CPU AND MULTI-CPU SYSTEM MANAGEMENT METHOD
20170364475 · 2017-12-21

The present disclosure provides a multi-CPU system that includes at least two QuickPath Interconnect (QPI) domains, a first node controller (NC) group, and a second NC group. According to a CPU route configuration, at least one CPU can access a CPU in another QPI domain through the first NC group, and at least one CPU can access a CPU in another QPI domain through the second NC group. With this topology, an NC can be hot-swapped with relatively little impact on the system.
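The abstract does not say how routes change during a hot swap. As an illustration only, the Python sketch below assumes a route configuration that maps each CPU to the NC group it uses for cross-domain access, and assumes that CPUs on the group being serviced are temporarily pointed at the other group. The names `reroute_for_hot_swap`, `nc_group_1`, and `nc_group_2` are hypothetical.

```python
def reroute_for_hot_swap(routes, leaving_group, remaining_group):
    """Return a new route configuration in which every CPU that used
    `leaving_group` for cross-domain access uses `remaining_group` instead.
    (Assumed procedure; the abstract only states that hot swap is possible.)"""
    return {
        cpu: (remaining_group if group == leaving_group else group)
        for cpu, group in routes.items()
    }


# Hypothetical route configuration: some CPUs use each NC group.
routes = {
    "cpu0": "nc_group_1",
    "cpu1": "nc_group_2",
    "cpu2": "nc_group_1",
    "cpu3": "nc_group_2",
}

# While an NC in group 1 is swapped, all cross-domain traffic uses group 2.
during_swap = reroute_for_hot_swap(routes, "nc_group_1", "nc_group_2")
```

The original `routes` dictionary is left unchanged, so it can be restored once the swapped NC is back in service.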

Execution engine for executing single assignment programs with affine dependencies

The execution engine is a new organization for a digital data processing apparatus, suitable for highly parallel execution of structured fine-grain parallel computations. The execution engine includes a memory for storing data and a domain flow program, a controller for requesting the domain flow program from the memory, and further for translating the program into programming information, a processor fabric for processing the domain flow programming information and a crossbar for sending tokens and the programming information to the processor fabric.

Computational array microprocessor system using non-consecutive data formatting
11681649 · 2023-06-20

A microprocessor system comprises a computational array and a hardware data formatter. The computational array includes a plurality of computation units that each operates on a corresponding value addressed from memory. The values operated by the computation units are synchronously provided together to the computational array as a group of values to be processed in parallel. The hardware data formatter is configured to gather the group of values, wherein the group of values includes a first subset of values located consecutively in memory and a second subset of values located consecutively in memory. The first subset of values is not required to be located consecutively in the memory from the second subset of values.
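The gather step described above can be modeled in software. The Python sketch below is an illustration under stated assumptions, not the patented hardware: memory is a flat list, both subsets have the same length, and each computation unit simply doubles its value. The names `gather_group` and `compute_in_parallel` are hypothetical.

```python
def gather_group(memory, first_start, second_start, run_length):
    """Gather one group of values made of two runs. Each run is consecutive
    in memory, but the two runs need not be adjacent to each other."""
    first = memory[first_start:first_start + run_length]
    second = memory[second_start:second_start + run_length]
    return first + second


def compute_in_parallel(group, unit_op):
    """Model the computational array: one computation unit per value,
    all applied to the same group (sequential here, parallel in hardware)."""
    return [unit_op(value) for value in group]


# Flat memory whose addresses 0..31 hold the values 100..131.
memory = list(range(100, 132))

# Two runs of three values each, starting at addresses 4 and 20.
group = gather_group(memory, first_start=4, second_start=20, run_length=3)
results = compute_in_parallel(group, lambda v: 2 * v)
```

Here `group` combines addresses 4 to 6 and 20 to 22 into a single operand vector, which is the non-consecutive formatting the abstract refers to.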

COMPUTING MACHINE ARCHITECTURE FOR MATRIX AND ARRAY PROCESSING
20170337156 · 2017-11-23

This invention discloses a novel paradigm, method, and apparatus for matrix computing, including a novel machine architecture with an embedded storage space for holding matrices and arrays, which can be accessed by columns, by rows, or by both concurrently. Also disclosed are a large-capacity, multi-length instruction set with instructions and methods to load, store, and compute with these matrices and arrays; a method and apparatus by which a plurality of threads and processes secure, share, lock, and unlock this embedded space under the control of an operating system or a virtual machine monitor; a novel method and apparatus to handle immediate operands used by immediate instructions; and the structure of the instructions, with some key fields and a method for easily determining instruction length.
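The row-or-column access to the embedded matrix space can be pictured with a small software model. The Python sketch below is only an analogy under assumed names (`MatrixSpace`, `load_row`, `load_col`, `store_row`, `store_col`); it does not reproduce the disclosed instruction set, and it does not model concurrent access or locking.

```python
class MatrixSpace:
    """Toy model of a storage space that can be read or written
    one whole row or one whole column at a time."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self.cells = [[0] * cols for _ in range(rows)]

    def store_row(self, r, values):
        if len(values) != self.cols:
            raise ValueError("row length must equal the number of columns")
        self.cells[r] = list(values)

    def store_col(self, c, values):
        if len(values) != self.rows:
            raise ValueError("column length must equal the number of rows")
        for r, value in enumerate(values):
            self.cells[r][c] = value

    def load_row(self, r):
        return list(self.cells[r])

    def load_col(self, c):
        return [row[c] for row in self.cells]


space = MatrixSpace(rows=2, cols=3)
space.store_row(0, [1, 2, 3])
space.store_row(1, [4, 5, 6])
column = space.load_col(1)      # column access to data stored by rows
space.store_col(2, [30, 60])    # column write, visible through row reads
row = space.load_row(1)
```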

Incorporating a spatial array into one or more programmable processor cores

Functional units disposed in one or more processor cores are communicatively coupled using both a shared bypass network and a switched network. The shared bypass network enables the functional units to be operated conventionally for general processing while the switched network enables specialized processing in which the functional units are configured as a spatial array. In the spatial array configuration, operands produced by one functional unit can only be sent to a subset of functional units to which dependent instructions have been mapped a priori. The functional units may be dynamically reconfigured at runtime to toggle between operating in the general configuration and operating as the spatial array. Information to control the toggling between operating configurations may be provided in instructions received by the functional units.

Architecture of crossbar of inference engine

A programmable hardware system for machine learning (ML) includes a core and an inference engine. The core receives commands from a host. The commands are in a first instruction set architecture (ISA) format. The core divides the commands into a first set for performance-critical operations, in the first ISA format, and a second set for performance non-critical operations, in the first ISA format. The core executes the second set to perform the performance non-critical operations of the ML operations and streams the first set to the inference engine. The inference engine generates a stream of the first set of commands in a second ISA format based on the first set of commands in the first ISA format. The first set of commands in the second ISA format programs components within the inference engine to execute the ML operations to infer data.

COMPUTATIONAL ARRAY MICROPROCESSOR SYSTEM USING NON-CONSECUTIVE DATA FORMATTING
20220050806 · 2022-02-17

A microprocessor system comprises a computational array and a hardware data formatter. The computational array includes a plurality of computation units that each operates on a corresponding value addressed from memory. The values operated by the computation units are synchronously provided together to the computational array as a group of values to be processed in parallel. The hardware data formatter is configured to gather the group of values, wherein the group of values includes a first subset of values located consecutively in memory and a second subset of values located consecutively in memory. The first subset of values is not required to be located consecutively in the memory from the second subset of values.

Toroidal Systolic Array Processor for General Matrix Multiplication (GEMM) With Local Dot Product Output Accumulation
20220043769 · 2022-02-10

A toroidal systolic array processor for GEMM with local dot-product output comprises an array of processing elements (PEs) arranged in rows and columns. User input circuitry provides input arrays A and B (and optionally G) as initial first values and second values before the array operation begins. Then, for each step of the array operation, first values and second values are received from other PEs in the array in a toroidal fashion. Each PE performs a fused multiply-add (FMA) operation based upon first values and second values received, whether from the input circuitry or from other PEs. At the end of the array process, each PE provides an output, for example a_{0,1}b_{1,0} + a_{0,0}b_{0,0} for the upper-left PE in a 2×2 array. Depending upon user input, the array processor can compute A*B+G, A*B+C*D, etc.
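The data movement described above resembles Cannon's algorithm, so a software model can be sketched along those lines. The Python code below is an assumption-laden illustration, not the claimed circuit: it assumes square N×N inputs, an initial skew in which PE (i, j) starts with A[i][(i+j) mod N] and B[(i+j) mod N][j], A values moving one PE to the left and B values moving one PE up with wrap-around on each step, and G (if given) preloaded into each PE's local accumulator. The function name `toroidal_gemm` is hypothetical.

```python
def toroidal_gemm(A, B, G=None):
    """Simulate an N x N toroidal array of PEs computing A*B (+ G).
    Each PE keeps its own accumulator, i.e. the dot product stays local."""
    n = len(A)
    # Assumed initial skew (Cannon-style) so operands line up at every step.
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    acc = [[G[i][j] if G is not None else 0 for j in range(n)] for i in range(n)]

    for _ in range(n):
        # Every PE performs one multiply-add on the values it currently holds.
        for i in range(n):
            for j in range(n):
                acc[i][j] = a[i][j] * b[i][j] + acc[i][j]
        # Toroidal shift: PE (i, j) takes its next A value from the PE on
        # its right and its next B value from the PE below, with wrap-around.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return acc


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
product = toroidal_gemm(A, B)
product_plus_g = toroidal_gemm(A, B, G=[[1, 1], [1, 1]])
```

In this model the upper-left PE of the 2×2 case accumulates a_{0,0}b_{0,0} first and a_{0,1}b_{1,0} second, matching the example output in the abstract.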

Energy Efficient Processor Core Architecture for Image Processor

An apparatus is described. The apparatus includes a program controller to fetch and issue instructions. The apparatus includes an execution lane having at least one execution unit to execute the instructions. The execution lane is part of an execution lane array that is coupled to a two-dimensional shift register array structure, wherein execution lanes of the execution lane array are located at respective array locations and are coupled to dedicated registers at the same respective array locations in the two-dimensional shift register array.
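One way to picture the coupling between the execution lanes and the shift register array is a stencil computed by shifting the whole register plane, so that each lane reads a neighbour's value from its own array location. The Python sketch below is an illustration only; the zero fill at the edges and the names `shift_plane` and `horizontal_sum3` are assumptions, not details from the abstract.

```python
def shift_plane(plane, dx, dy, fill=0):
    """Shift a 2D register plane by (dx, dy). After the shift, location
    (r, c) holds the value that was at (r - dy, c - dx); locations with
    no source value receive `fill` (assumed edge behaviour)."""
    rows, cols = len(plane), len(plane[0])
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            src_r, src_c = r - dy, c - dx
            if 0 <= src_r < rows and 0 <= src_c < cols:
                out[r][c] = plane[src_r][src_c]
    return out


def horizontal_sum3(image):
    """Each lane adds its own value to its left and right neighbours' values,
    which arrive at its location through two shifts of the plane."""
    from_left = shift_plane(image, dx=1, dy=0)    # lane sees left neighbour
    from_right = shift_plane(image, dx=-1, dy=0)  # lane sees right neighbour
    return [
        [l + m + r for l, m, r in zip(left_row, mid_row, right_row)]
        for left_row, mid_row, right_row in zip(from_left, image, from_right)
    ]


image = [[1, 2, 3], [4, 5, 6]]
blurred = horizontal_sum3(image)
```

Every lane runs the same instruction sequence on the register at its own location, which is the single-program, lane-array style the abstract describes.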

COMPUTE-COMMUNICATE CONTINUUM TECHNOLOGY
20170230447 · 2017-08-10

The present disclosure relates to Compute-Communicate Continuum (“CCC”) technology, which challenges today's use model of computing and communications as independent but interfacing entities. CCC technology conflates computing and communications to create a new breed of device. Compute-Communicate Continuum metal algorithms allow a software programmer to compile, link, load, and run a software application directly on device hardware, providing supercomputing and Extreme Low Latency (ELL) links for demanding financial applications and other applications. Multiple CCC-based CCC-DEVICE hardware platforms can be interconnected using ELL “Metal Shared Memory Interconnects” to form what looks like a “single” machine that crosses different geographies, asset classes, and trading venues. Thus, the technology enables the creation of a new category of Compute-Communicate devices (CCC-DEVICE Series appliances) that can connect multiple geographically distributed locations with extreme low latency and provide supercomputing for distributed data using High Performance Embedded Computing (HPEC) and ELL communications.