Patent classifications
G06F15/17362
DETERMINING AN OPERATION STATE WITHIN A COMPUTING SYSTEM WITH MULTI-CORE PROCESSING DEVICES
Systems and methods for operating a processing device are provided. A method may comprise transmitting data on the processing device, monitoring state information for a plurality of buffers on the processing device for the transmitted data, aggregating the monitored state information, starting a timer in response to determining that all buffers of the plurality of buffers are empty, and asserting a drain state for the plurality of buffers in response to all buffers of the plurality of buffers remaining empty for the duration of the timer.
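The drain-detection flow in this abstract (aggregate buffer state, start a timer when all buffers are empty, assert a drain state if they stay empty for the timer's duration) can be sketched in Python. This is a hypothetical illustration, not the claimed hardware implementation; the function name, polling approach, and parameters are all assumptions.

```python
import time

def assert_drain_state(buffers, drain_window_s=0.001,
                       poll_interval_s=0.0001, timeout_s=1.0):
    """Return True once every buffer has stayed empty for drain_window_s."""
    deadline = time.monotonic() + timeout_s
    empty_since = None                           # timer not yet started
    while time.monotonic() < deadline:
        if all(len(b) == 0 for b in buffers):    # aggregate monitored state
            if empty_since is None:
                empty_since = time.monotonic()   # all empty: start the timer
            elif time.monotonic() - empty_since >= drain_window_s:
                return True                      # stayed empty: assert drain state
        else:
            empty_since = None                   # any refill resets the timer
        time.sleep(poll_interval_s)
    return False                                 # never drained within timeout
```

Resetting the timer whenever any buffer refills mirrors the requirement that all buffers remain empty for the full duration, not merely become empty once.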
Data transmission between memory and on chip memory of inference engine for machine learning via a single data gathering instruction
A system to support data gathering for a machine learning (ML) operation comprises a memory unit configured to maintain data for the ML operation in a plurality of memory blocks each accessible via a memory address. The system further comprises an inference engine comprising a plurality of processing tiles, each comprising one or more of an on-chip memory (OCM) configured to load and maintain data for local access by components in the processing tile. The system also comprises a core configured to program components of the processing tiles of the inference engine according to an instruction set architecture (ISA) and a data streaming engine configured to stream data between the memory unit and the OCMs of the processing tiles of the inference engine, wherein the data streaming engine is configured to gather data from multiple memory blocks at the same time via a single data gathering instruction of the ISA.
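The core of this abstract is a single gather instruction that collects data from non-contiguous memory addresses and streams it to the tiles' on-chip memories. A minimal Python sketch of that data movement, with purely illustrative names (`gather`, `scatter_to_ocms`) and a dict standing in for addressable memory:

```python
def gather(memory, addresses):
    """One gather: collect blocks at non-contiguous addresses into one buffer."""
    return [memory[a] for a in addresses]

def scatter_to_ocms(memory, tile_addresses):
    """Stream gathered data into each processing tile's on-chip memory (OCM)."""
    return {tile: gather(memory, addrs)
            for tile, addrs in tile_addresses.items()}
```

The point of folding this into one ISA instruction, per the abstract, is that the engine performs the whole gathering operation at once rather than issuing one load per address.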
RECONFIGURABLE COMPUTING PODS USING OPTICAL NETWORKS WITH ONE-TO-MANY OPTICAL SWITCHES
Methods, systems, and apparatus, including an apparatus for generating clusters of building blocks of compute nodes using an optical network. In one aspect, a method includes receiving data specifying requested compute nodes for a computing workload. The data specifies a target arrangement of the nodes. A subset of building blocks of a superpod is selected. A logical arrangement of the subset of compute nodes that matches the target arrangement is determined. A workload cluster of compute nodes that includes the subset of the building blocks is generated. For each dimension of the workload cluster, respective routing data for two or more OCS switches for the dimension is configured. One-to-many switches are configured such that a second compute node of each segment of compute nodes is connected to a same OCS switch as a corresponding first compute node of a corresponding segment to which the second compute node is connected.
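The last step of this abstract, wiring one-to-many switches so that the second compute node of each segment lands on the same OCS switch as the first node of the segment it connects to, can be sketched as a simple mapping. This is a hypothetical illustration of the wiring rule only; node and switch names are assumptions.

```python
def assign_ocs(segments, ocs_for_first):
    """For each (first, second) node segment, connect the second node to the
    same OCS switch that the corresponding first node already uses."""
    return {second: ocs_for_first[first] for first, second in segments}
```

Keeping segment peers on a common OCS switch is what lets the optical network present the selected building blocks as one logical workload cluster per dimension.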
Disjoint array computer
A hierarchical array computer architecture comprised of a master computer connected to a plurality of node computers wherein each node has a memory segment. A high speed connection scheme between the master computer and the nodes allows the master computer or individual nodes conditional access to the node memory segments. The resulting architecture creates an array computer with a large distributed memory in which each memory segment of the distributed memory has an associated computing element; the entire array being housed in a blade server type enclosure. The array computer created with this architecture provides a linear increase of processing speed corresponding to the number of nodes.
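The architecture above, a master with conditional access to per-node memory segments, can be sketched with locks standing in for the conditional-access scheme. This is a toy model, not the disclosed blade-server hardware; class names and the lock-based access control are assumptions.

```python
import threading

class NodeSegment:
    """One node's memory segment; access is granted conditionally via a lock."""
    def __init__(self, size):
        self.memory = bytearray(size)
        self._lock = threading.Lock()

    def write(self, offset, data):
        with self._lock:                  # master or node must hold the lock
            self.memory[offset:offset + len(data)] = data

    def read(self, offset, length):
        with self._lock:
            return bytes(self.memory[offset:offset + length])

class MasterComputer:
    """Master with high-speed access to every node's memory segment."""
    def __init__(self, num_nodes, segment_size):
        self.nodes = [NodeSegment(segment_size) for _ in range(num_nodes)]

    def broadcast(self, data):
        for node in self.nodes:           # master writes into each segment
            node.write(0, data)
```

Because each segment has its own computing element, work on the segments proceeds in parallel, which is the source of the claimed linear speedup with node count.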
Techniques for collective operations in distributed systems
Various embodiments are generally directed to techniques for collective operations among compute nodes in a distributed processing set, such as by utilizing ring sets and local sets of the distributed processing set. In some embodiments, a ring set may include a subset of the distributed processing set in which each compute node is connected to a network with a separate router. In various embodiments, a local set may include a subset of the distributed processing set in which each compute node is connected to a network with a common router. In one or more embodiments, each compute node in a distributed processing set may belong to one ring set and one local set.
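The partitioning described above (each compute node belongs to exactly one local set, nodes sharing a router, and one ring set, one node drawn from each router) can be sketched as follows. The grouping-by-index construction is an assumption for illustration; the patent does not prescribe how ring members are picked.

```python
def build_sets(nodes_by_router):
    """Split nodes into local sets (same router) and ring sets (one per router)."""
    local_sets = {r: list(ns) for r, ns in nodes_by_router.items()}
    ring_sets = []
    max_len = max(len(ns) for ns in nodes_by_router.values())
    for i in range(max_len):
        # the i-th node behind each router forms one ring set
        ring = [ns[i] for ns in nodes_by_router.values() if i < len(ns)]
        ring_sets.append(ring)
    return ring_sets, local_sets
```

With this layout a collective operation can reduce within cheap local sets first, exchange partial results across ring sets, then broadcast back locally.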
DATA ACTOR AND DATA PROCESSING METHOD THEREOF
Provided is a data actor, which is in data communication with a direct upstream actor and/or a direct downstream actor. The data actor includes a message bin, a finite state machine, a processing component, and an output data cache. The message bin is configured to receive a message from the upstream actor and/or the downstream actor; the finite state machine is configured to change a current state of the data actor based on the received message in the message bin and an operation of the processing component. When a state of the finite state machine reaches a trigger condition, the processing component directly reads output data in a readable state in the output data cache of the upstream actor, executes a predetermined operation, and then stores the result data in the output data cache of the data actor.
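The message bin / finite state machine / output cache interaction described above can be sketched in Python. The trigger message, the two-state machine, and the doubling operation are all illustrative assumptions standing in for the unspecified "predetermined operation".

```python
class DataActor:
    def __init__(self, upstream=None):
        self.message_bin = []        # received messages from neighbors
        self.state = "idle"          # finite state machine state
        self.output_cache = []       # readable results for downstream actors
        self.upstream = upstream

    def receive(self, message):
        self.message_bin.append(message)
        if message == "data_ready":          # FSM transition on message
            self.state = "triggered"
        if self.state == "triggered":        # trigger condition reached
            self.process()

    def process(self):
        data = self.upstream.output_cache    # read upstream's cache directly
        result = [x * 2 for x in data]       # predetermined operation (example)
        self.output_cache = result           # publish for downstream actors
        self.state = "idle"
```

Reading the upstream cache directly, rather than copying data through messages, is the distinctive point of the described actor.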
TECHNIQUES TO TRANSFER DATA AMONG HARDWARE DEVICES
Apparatuses, systems, and techniques to route data transfers between hardware devices. In at least one embodiment, a path over which to transfer data from a first hardware component of a computer system to a second hardware component of a computer system is determined based, at least in part, on one or more characteristics of different paths usable to transfer the data.
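Choosing a transfer path based on characteristics of the candidate paths, as the abstract describes, can be sketched as scoring each path. The bandwidth/latency score below is one plausible heuristic, not the patented method; the field names are assumptions.

```python
def choose_path(paths):
    """Pick the transfer path with the best characteristics: here, a simple
    score preferring high bandwidth and low latency."""
    return max(paths, key=lambda p: p["bandwidth_gbps"] / (1 + p["latency_us"]))
```

In practice the characteristics might also include hop count, contention, or topology (e.g. whether two devices share a switch), folded into the same scoring step.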
HETEROGENEOUS COMPUTE ARCHITECTURE HARDWARE/SOFTWARE CO-DESIGN FOR AUTONOMOUS DRIVING
Methods and apparatus relating to heterogeneous compute architecture hardware/software co-design for autonomous driving are described. In one embodiment, a heterogeneous compute architecture for autonomous driving systems (also interchangeably referred to herein as Heterogeneous Compute Architecture or “HCA” for short) integrates scalable heterogeneous processors, flexible networking, benchmarking tools, etc. to enable (e.g., system-level) designers to perform hardware and software co-design. With HCA, system engineers can rapidly architect, benchmark, and/or evolve vehicle system architectures for autonomous driving. Other embodiments are also disclosed and claimed.