Patent classifications
G06F9/3895
PARALLEL PROCESSOR, ADDRESS GENERATOR OF PARALLEL PROCESSOR, AND ELECTRONIC DEVICE INCLUDING PARALLEL PROCESSOR
Disclosed is a parallel processor. The parallel processor includes a processing element array including a plurality of processing elements arranged in rows and columns, a row memory group including row memories corresponding to rows of the processing elements, a column memory group including column memories corresponding to columns of the processing elements, and a controller to generate a first address and a second address, to send the first address to the row memory group, and to send the second address to the column memory group. The controller supports convolution operations having mutually different forms, by changing a scheme of generating the first address.
Address interleaving for machine learning
A system includes a memory, an inference engine, and a master. The memory is configured to store data. The inference engine is configured to receive the data and to perform one or more computation tasks of a machine learning (ML) operation associated with the data. The master is configured to interleave an address associated with a memory access transaction for accessing the memory. The master is further configured to provide content associated with the access to the inference engine.
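The abstract does not say which interleaving scheme the master uses. As a general illustration only, the following is a minimal sketch of low-order address interleaving, in which consecutive memory lines are spread across banks. The function name, the bank count, and the line size are all hypothetical choices made for the example, not details from the patent.

```python
from typing import Tuple


def interleave_address(addr: int, num_banks: int = 4, line_bytes: int = 64) -> Tuple[int, int]:
    """Map a flat byte address to (bank index, byte offset inside that bank).

    Assumption: low-order interleaving at line granularity, so line 0 goes
    to bank 0, line 1 to bank 1, and so on, wrapping around after num_banks.
    """
    line = addr // line_bytes          # which memory line the address falls in
    bank = line % num_banks            # consecutive lines rotate across banks
    local_line = line // num_banks     # position of that line inside its bank
    offset = local_line * line_bytes + addr % line_bytes
    return bank, offset


if __name__ == "__main__":
    for a in (0, 64, 70, 256):
        print(a, interleave_address(a))
```

With this layout, a sequential stream of accesses touches each bank in turn, which is the usual motivation for interleaving: adjacent requests can be serviced by different banks in parallel.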
Programmable coarse grained and sparse matrix compute hardware with advanced scheduling
- Eriko Nurvitadhi
- Balaji Vembu
- Nicolas C. Galoppo Von Borries
- Rajkishore Barik
- Tsung-Han Lin
- Kamal Sinha
- Nadathur Rajagopalan Satish
- Jeremy Bottleson
- Farshad Akhbari
- Altug Koker
- Narayan Srinivasa
- Dukhwan Kim
- Sara S. Baghsorkhi
- Justin E. Gottschlich
- Feng Chen
- Elmoustapha Ould-Ahmed-Vall
- Kevin Nealis
- Xiaoming Chen
- Anbang Yao
One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the compute apparatus to perform a complex compute operation.
Neural network accelerator with parameters resident on chip
One embodiment of an accelerator includes a computing unit; a first memory bank for storing input activations and a second memory bank for storing parameters used in performing computations, the second memory bank configured to store a sufficient amount of the neural network parameters on the computing unit to allow for latency below a specified level with throughput above a specified level. The computing unit includes at least one cell comprising at least one multiply accumulate (“MAC”) operator that receives parameters from the second memory bank and performs computations. The computing unit further includes a first traversal unit that provides a control signal to the first memory bank to cause an input activation to be provided to a data bus accessible by the MAC operator. The computing unit performs computations associated with at least one element of a data array, the one or more computations performed by the MAC operator.
POINT TO POINT CONNECTED PROCESSING ELEMENTS WITH DATA JOINER COMPONENTS
A system comprises a first processing element, a second processing element, a point-to-point connection between the first processing element and the second processing element, and a communication bus connecting together at least the first processing element and the second processing element. The first processing element includes a first matrix computing unit and the second processing element includes a second matrix computing unit. The point-to-point connection is configured to provide at least a result of the first processing element to a data joiner component of the second processing element configured to join at least the provided result of the first processing element with a result of the second matrix computing unit.
MECHANISM FOR REDUCING COHERENCE DIRECTORY CONTROLLER OVERHEAD FOR NEAR-MEMORY COMPUTE ELEMENTS
A parallel processing (PP) level coherence directory, also referred to as a Processing In-Memory Probe Filter (PimPF), is added to a coherence directory controller. When the coherence directory controller receives a broadcast PIM command from a host, or a PIM command that is directed to multiple memory banks in parallel, the PimPF accelerates processing of the PIM command by maintaining a directory for cache coherence that is separate from existing system level directories in the coherence directory controller. The PimPF maintains a directory according to address signatures that define the memory addresses affected by a broadcast PIM command. Two implementations are described: a lightweight implementation that accelerates PIM loads into registers, and a heavyweight implementation that accelerates both PIM loads into registers and PIM stores into memory.
MEMORY-BASED DISTRIBUTED PROCESSOR ARCHITECTURE
Distributed processors and methods for compiling code for execution by distributed processors are disclosed. In one implementation, a distributed processor may include a substrate; a memory array disposed on the substrate; and a processing array disposed on the substrate. The memory array may include a plurality of discrete memory banks, and the processing array may include a plurality of processor subunits, each one of the processor subunits being associated with a corresponding, dedicated one of the plurality of discrete memory banks. The distributed processor may further include a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding, dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.
METHODS AND SYSTEMS FOR MULTI-DIMENSIONAL AGGREGATION USING COMPOSITION
Multi-dimensional aggregation using user interface workflow composition is described. Data for a computer implemented process is in a set of related data objects in a data store with each object in the set of related data objects representing an entity modelled in the process. A number of levels for a multi-dimensional aggregation associated with a request is determined where each level of the multi-dimensional aggregation represents a different dimension of data values to be aggregated. Data is aggregated at the levels of aggregation based on the relationships between parent objects and children objects. The data for a final level of aggregation is output to a user interface. The final result includes multiple dimensions of data.
Neural network unit that manages power consumption based on memory accesses per period
An apparatus includes a first memory, processing units that access the first memory, a counter that, for each period of a sequence of periods, holds an indication of accesses to the first memory during the period, and control logic that, for each period of the sequence of periods, monitors the indication to determine whether it exceeds a threshold and, if so, stalls the processing units from accessing the first memory for a remaining portion of the period.
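The counter-and-stall behavior described in this abstract can be modeled in software. The following is a minimal sketch, assuming one counter shared by all processing units and assuming that the access which pushes the count over the threshold is itself allowed to complete. The class and method names are hypothetical and are not taken from the patent.

```python
class AccessThrottle:
    """Toy model of per-period memory access throttling."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # maximum access count before stalling
        self.count = 0              # accesses seen in the current period
        self.stalled = False        # True for the remainder of the period

    def start_period(self) -> None:
        """Begin a new period: clear the counter and lift any stall."""
        self.count = 0
        self.stalled = False

    def request_access(self) -> bool:
        """Return True if the access proceeds, False if it is stalled."""
        if self.stalled:
            return False
        self.count += 1
        if self.count > self.threshold:
            # Count now exceeds the threshold: stall until the period ends.
            self.stalled = True
        return True


if __name__ == "__main__":
    throttle = AccessThrottle(threshold=2)
    print([throttle.request_access() for _ in range(5)])
    throttle.start_period()
    print(throttle.request_access())
```

In hardware the period would be a fixed number of clock cycles; here `start_period` stands in for that boundary.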
Hardware accelerator for convolutional neural networks and method of operation thereof
An accelerator for processing of a convolutional neural network (CNN) includes a compute core having a plurality of compute units. Each compute unit includes a first memory cache configured to store at least one vector in a map trace, a second memory cache configured to store at least one vector in a kernel trace, and a plurality of vector multiply-accumulate units (vMACs) connected to the first and second memory caches. Each vMAC includes a plurality of multiply-accumulate units (MACs). Each MAC includes a multiplier unit configured to multiply a first word of the at least one vector in the map trace by a second word of the at least one vector in the kernel trace to produce an intermediate product, and an adder unit that adds the intermediate product to a third word to generate a sum of the intermediate product and the third word.
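The per-MAC arithmetic in this abstract is a multiply followed by an add. The following is a minimal sketch of that arithmetic in software. The abstract does not say how the MACs inside a vMAC are wired together, so the lane-wise arrangement in `vmac` is an assumption, and both function names are hypothetical.

```python
from typing import List


def mac(map_word: int, kernel_word: int, third_word: int) -> int:
    """One multiply-accumulate step: (map_word * kernel_word) + third_word."""
    intermediate_product = map_word * kernel_word  # multiplier unit
    return intermediate_product + third_word       # adder unit


def vmac(map_vec: List[int], kernel_vec: List[int], third_vec: List[int]) -> List[int]:
    """Apply one MAC per lane (assumed arrangement, not from the abstract)."""
    return [mac(m, k, t) for m, k, t in zip(map_vec, kernel_vec, third_vec)]


if __name__ == "__main__":
    print(mac(2, 3, 4))
    print(vmac([1, 2], [3, 4], [5, 6]))
```

Feeding each lane's result back in as its third word on the next step would accumulate a running sum of products, which is how a MAC is typically used in a convolution inner loop.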