Patent classifications
G06F15/825
DEPENDENCY-BASED QUEUING OF WORK REQUESTS IN DATAFLOW APPLICATIONS
A computer-implemented method comprises a server processing work requests of a work requester. The work requester can communicate to the server a processing dependency of one work request on a second work request. The server can associate the dependency with the work requests and/or a queue of work requests. The dependency can include a condition to be met in association with processing the work requests, and the condition can include an action for the server to take in association with processing a work request. A computing system can comprise a work requester, a server, and a set of dependency-aware queues for processing a set of work requests. A queue and/or work requests on the queues can be associated with a processing dependency, and the server can process work requests enqueued to the queues in an order based on the dependencies. A work requester/server interface can comprise a dependency framework.
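A minimal sketch of the dependency-aware queuing idea, assuming a single server and in-process requests; all class and field names (`WorkRequest`, `depends_on`, `on_ready`) are illustrative, not taken from the patent:

```python
from collections import deque

class WorkRequest:
    """Hypothetical work request that may depend on another request."""
    def __init__(self, name, depends_on=None, on_ready=None):
        self.name = name
        self.depends_on = depends_on  # processing dependency on another request
        self.on_ready = on_ready      # optional action the server takes
        self.done = False

class DependencyAwareServer:
    """Processes enqueued requests only once their dependencies are met."""
    def __init__(self):
        self.queue = deque()

    def enqueue(self, request):
        self.queue.append(request)

    def run(self):
        order = []
        while self.queue:
            req = self.queue.popleft()
            if req.depends_on is not None and not req.depends_on.done:
                self.queue.append(req)  # condition not met: requeue behind others
                continue
            if req.on_ready:
                req.on_ready(req)       # action associated with the dependency
            req.done = True
            order.append(req.name)
        return order
```

Even if "B" is enqueued before "A", a dependency of B on A forces the server to process A first. (A real implementation would also detect dependency cycles, omitted here.)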
HIGH PERFORMANCE LAYER NORMALIZATION FOR LARGE MODELS
As general matrix multiply (GEMM) bottlenecks are ameliorated by tensor parallelism distributed across several processors, layer normalization (LN) surfaces as a latent bottleneck because it is not amenable to distribution. LN runtime is linear in embedding size, which is extremely large in some AI models. Moreover, aggressive tiling prevents the use of internal pipelining. The disclosed implementation addresses this issue by composing LN from simpler operations; this composition is amenable to pipelining, facilitating efficient implementation of large AI models (e.g., GPTs). In both forward and backward propagation, the pipeline is stretched longer with improved balance across stages. This strategy improves throughput for larger batch sizes, as the workload benefits from pipelining operations for better performance. Furthermore, avoiding stochastic rounding further improves performance. In addition, LayerNorm checkpoints facilitate efficient computation of gradients during backward propagation.
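The decomposition being described can be sketched as LN built from three simpler stages (two reductions and one elementwise pass), each of which could in principle occupy its own pipeline stage; this is an illustrative scalar sketch, not the disclosed hardware implementation:

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm composed from simpler operations, one per pipeline
    stage in spirit (illustrative only)."""
    n = len(x)
    # Stage 1: reduction producing the mean
    mean = sum(x) / n
    # Stage 2: reduction producing the variance
    var = sum((v - mean) ** 2 for v in x) / n
    # Stage 3: elementwise normalize, scale, and shift
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]
```

Because each stage is a simple reduction or map, successive inputs can be streamed through the stages, which is the property that makes the composition pipeline-friendly for very large embedding sizes.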
DATA STRUCTURE DESCRIPTORS FOR DEEP LEARNING ACCELERATION
Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a respective compute element and a respective routing element. Instructions executed by the compute element include operand specifiers, some specifying a data structure register storing a data structure descriptor describing an operand as a fabric vector or a memory vector. The data structure descriptor further describes the memory vector as one of a one-dimensional vector, a four-dimensional vector, or a circular buffer vector. Optionally, the data structure descriptor specifies an extended data structure register storing an extended data structure descriptor. The extended data structure descriptor specifies parameters relating to a four-dimensional vector or a circular buffer vector.
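The descriptor hierarchy in the abstract can be modeled as plain data: a base descriptor naming the operand kind, plus an extended descriptor carrying the extra parameters that only 4-D and circular-buffer vectors need. All field names here are illustrative assumptions, not the actual register encoding:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtendedDescriptor:
    """Hypothetical extended data structure descriptor: parameters
    relating to a 4-D vector or a circular buffer vector."""
    dims: tuple = ()
    strides: tuple = ()
    wrap: int = 0

@dataclass
class DataStructureDescriptor:
    """Sketch of a descriptor held in a data structure register,
    describing an operand as a fabric vector or a memory vector."""
    kind: str          # "fabric", "mem1d", "mem4d", or "circular"
    base: int = 0
    length: int = 0
    extended: Optional[ExtendedDescriptor] = None

    def needs_extended(self):
        # Only 4-D and circular-buffer vectors carry parameters in an
        # extended data structure register.
        return self.kind in ("mem4d", "circular")
```

An instruction's operand specifier would then name the register holding such a descriptor rather than encoding the vector shape inline.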
SHARDING FOR SYNCHRONOUS PROCESSORS
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for sharding dataflow graphs for a device having multiple synchronous tiles. One of the methods includes receiving a representation of a dataflow graph comprising a plurality of nodes that each represent respective matrix operations to be performed by a device having a plurality of synchronous tiles. Candidate allocations of respective portions of the dataflow graph to each tile of the plurality of synchronous tiles are evaluated according to one or more resource constraints of the device. One of the candidate allocations is selected based on evaluating each candidate allocation.
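The evaluate-then-select step can be sketched as constraint filtering followed by a cost-based choice; the candidate encoding (a dict of tile name to resource cost) and the lowest-total-cost tie-break are assumptions for illustration:

```python
def select_allocation(candidates, tile_limits):
    """Evaluate candidate allocations against per-tile resource limits
    and select one feasible candidate (illustrative policy: lowest
    total cost). Each candidate maps tile -> resource cost."""
    feasible = [
        c for c in candidates
        if all(cost <= tile_limits[tile] for tile, cost in c.items())
    ]
    if not feasible:
        raise ValueError("no allocation satisfies the resource constraints")
    return min(feasible, key=lambda c: sum(c.values()))
```

A real sharder would evaluate many more constraints (memory, bandwidth, synchronization), but the shape of the search is the same: enumerate candidates, discard infeasible ones, pick by a scoring rule.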
Circuit unit, circuit module and apparatus for data statistics
Disclosed are a circuit unit, a circuit module and an apparatus for data statistics. The circuit unit comprises a first register and a second register, and stores data received via a first input terminal in the first register in a case where a first control terminal receives a valid control signal, stores data received via a second input terminal in the second register in a case where a second control terminal receives a valid control signal, and increases the value of data stored in the second register by 1 in a case where a third control terminal receives a valid control signal. The circuit module comprises one or more such circuit units, and the apparatus comprises one or more such circuit modules. The circuit module or the apparatus may complete data statistics with fewer resources and lower power consumption.
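A behavioral model of the described circuit unit, treating each call as one clock cycle; signal and register names are illustrative, not from the disclosure:

```python
class StatCircuitUnit:
    """Behavioral sketch: two registers driven by three control
    terminals, as described in the abstract."""
    def __init__(self):
        self.reg1 = 0  # first register: latches the first input
        self.reg2 = 0  # second register: latches the second input or counts

    def clock(self, in1=0, in2=0, ctrl1=False, ctrl2=False, ctrl3=False):
        if ctrl1:           # valid signal on the first control terminal
            self.reg1 = in1
        if ctrl2:           # valid signal on the second control terminal
            self.reg2 = in2
        if ctrl3:           # valid signal on the third control terminal
            self.reg2 += 1  # increment by 1, e.g. counting occurrences
```

Latching a value into the first register and then pulsing the third control terminal turns the unit into a per-value occurrence counter, which is the data-statistics use case the abstract gestures at.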
Reach-based explicit dataflow processors, and related computer-readable media and methods
Exemplary reach-based explicit dataflow processors and related computer-readable media and methods are disclosed. The reach-based explicit dataflow processors are configured to support execution of producer instructions encoded with explicit naming of consumer instructions intended to consume the values produced by the producer instructions. The reach-based explicit dataflow processors are configured to make available produced values as inputs to explicitly named consumer instructions as a result of processing producer instructions. The reach-based explicit dataflow processors support execution of a producer instruction that explicitly names a consumer instruction by using the producer instruction itself as a relative reference point. This reach-based explicit naming architecture does not require instructions to be grouped in instruction blocks to support a fixed block reference point for explicit naming of consumer instructions, and thus is not limited to explicit naming of consumer instructions only within the same instruction block of the producer instruction.
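The core encoding idea — a consumer named by its distance from the producer itself rather than from a block boundary — can be shown in a few lines; the forward-only reach and the list-of-strings program are simplifying assumptions:

```python
def resolve_consumer(program, producer_index, reach):
    """Resolve an explicitly named consumer by its "reach": a relative
    distance measured from the producer instruction itself, not from
    any instruction-block boundary (illustrative encoding)."""
    target = producer_index + reach
    if target >= len(program):
        raise IndexError("reach extends past end of program")
    return program[target]
```

Because the reference point moves with the producer, the same reach value works regardless of where the producer sits, and the consumer need not lie in the same instruction block.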
Merging Buffer Access Operations of a Compute Graph
A method for merging buffers and associated operations includes receiving a compute graph for a reconfigurable dataflow computing system and conducting a buffer allocation and merging process responsive to determining that a first operation specified by a first operation node is a memory indexing operation and that the first operation node is a producer for exactly one consuming node that specifies a second operation. The buffer allocation and merging process may include replacing the first operation node and the consuming node with a merged buffer node within the graph responsive to determining that the first operation and the second operation can be merged into a merged indexing operation and that the resource cost of the merged node is less than the sum of the resource costs of separate buffer nodes. A corresponding system and computer readable medium are also disclosed herein.
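The merge decision described above reduces to two checks — exactly one consumer, and a merged cost below the sum of the separate costs — before rewriting the graph. The dict-based graph encoding and node naming below are assumptions for illustration:

```python
def try_merge(graph, costs, producer, consumer, merged_cost):
    """Replace a producer indexing node and its sole consumer with one
    merged buffer node when that lowers resource cost (illustrative).
    `graph` maps node -> list of consuming nodes; `costs` maps node -> cost."""
    if graph.get(producer) != [consumer]:   # must feed exactly one consumer
        return False
    if merged_cost >= costs[producer] + costs[consumer]:
        return False                        # merge must be strictly cheaper
    merged = f"{producer}+{consumer}"
    graph[merged] = graph.pop(consumer, [])  # merged node inherits outputs
    del graph[producer]
    costs[merged] = merged_cost
    return True
```

When either check fails, the graph is left untouched and separate buffer nodes are allocated instead, matching the conditional flow of the abstract.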
Apparatus, methods, and systems for unstructured data flow in a configurable spatial accelerator with predicate propagation and merging
Systems, methods, and apparatuses relating to unstructured data flow in a configurable spatial accelerator are described. In one embodiment, a configurable spatial accelerator includes a data path having a first branch and a second branch, and the data path comprises at least one processing element; a switch circuit comprising a switch control input to receive a first switch control value to couple an input of the switch circuit to the first branch and a second switch control value to couple the input of the switch circuit to the second branch; a pick circuit comprising a pick control input to receive a first pick control value to couple an output of the pick circuit to the first branch and a second pick control value to couple the output of the pick circuit to a third branch of the data path; a predicate propagation processing element to output a first edge predicate value and a second edge predicate value based on (e.g., both of) a switch control value from the switch control input of the switch circuit and a first block predicate value; and a predicate merge processing element to output a pick control value to the pick control input of the pick circuit and a second block predicate value based on both of a third edge predicate value and one of the first edge predicate value or the second edge predicate value.
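A software sketch of the predicate-propagation idea: the switch control and an incoming block predicate together determine which outgoing edge is "live". The boolean encoding and function names are illustrative, not the circuit's actual signal protocol:

```python
def switch(value, ctrl):
    """Switch circuit: route `value` onto branch 0 or branch 1."""
    return (value, None) if ctrl == 0 else (None, value)

def pick(branch0, branch1, ctrl):
    """Pick circuit: select the value arriving on the named branch."""
    return branch0 if ctrl == 0 else branch1

def propagate_predicates(switch_ctrl, block_pred):
    """Predicate propagation PE sketch: derive the two edge predicates
    from the switch control value and a block predicate. At most one
    edge predicate is true, and only when the block itself is live."""
    return (block_pred and switch_ctrl == 0,
            block_pred and switch_ctrl == 1)
```

The complementary predicate-merge element would combine such edge predicates back into a pick control and a downstream block predicate, re-serializing the unstructured control flow.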
Overlay Layer Hardware Unit for Network of Processor Cores
Methods and systems for executing an application data flow graph on a set of computational nodes are disclosed. The computational nodes can each include a programmable controller from a set of programmable controllers, a memory from a set of memories, a network interface unit from a set of network interface units, and an endpoint from a set of endpoints. A disclosed method comprises configuring the programmable controllers with instructions. The method also comprises independently and asynchronously executing the instructions using the set of programmable controllers in response to a set of events exchanged between the programmable controllers themselves, between the programmable controllers and the network interface units, and between the programmable controllers and the set of endpoints. The method also comprises transitioning data in the set of memories on the computational nodes in accordance with the application data flow graph and in response to the execution of the instructions.
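The event-driven execution model — controllers that each run their own instruction stream, blocking until an awaited event arrives from elsewhere — can be sketched with a shared event bus. The `(wait_event, emit_event)` instruction format and round-robin scheduler are assumptions for illustration only:

```python
from collections import deque

class Controller:
    """Programmable controller sketch: executes its next instruction
    only when the event it waits on has been delivered."""
    def __init__(self, name, program):
        self.name = name
        self.program = deque(program)  # items: (wait_event, emit_event)
        self.inbox = set()

    def step(self, bus):
        if not self.program:
            return False
        wait, emit = self.program[0]
        if wait is not None and wait not in self.inbox:
            return False               # awaited event not yet received
        self.program.popleft()
        if emit is not None:
            bus.append((self.name, emit))
        return True

def run(controllers):
    """Drive all controllers until quiescent, exchanging events via a bus."""
    bus, trace = [], []
    progress = True
    while progress:
        progress = False
        while bus:                     # deliver pending events to everyone
            _, ev = bus.pop()
            trace.append(ev)
            for c in controllers:
                c.inbox.add(ev)
        for c in controllers:
            progress |= c.step(bus)
    return trace
```

Note there is no global schedule: ordering emerges purely from the events the controllers exchange, which is the asynchronous execution property the abstract emphasizes.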
Providing, in a configuration packet, data indicative of data flows in a processor with a data flow manager
Methods, apparatuses, and systems for implementing data flows in a processor are described herein. A data flow manager may be configured to generate a configuration packet for a compute operation based on status information regarding multiple processing elements of the processor. Accordingly, multiple processing elements of a processor may concurrently process data flows based on the configuration packet. For example, the multiple processing elements may implement a mapping of processing elements to memory, while also implementing identified paths, through the processor, for the data flows. After executing the compute operation at certain processing elements of the processor, the processing results may be provided. In speech signal processing operations, the processing results may be compared to phonemes to identify such components of human speech in the processing results. Once such components are dynamically identified, the processing elements may continue comparing additional components of human speech to facilitate processing of an audio recording, for example.
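A sketch of the configuration-packet idea: the data flow manager reads per-element status and emits a packet carrying both the element-to-memory mapping and the path through the processor. The packet fields, naming scheme, and availability-only policy are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ConfigPacket:
    """Hypothetical configuration packet: a mapping of processing
    elements to memory plus an identified path for the data flow."""
    pe_to_memory: dict = field(default_factory=dict)
    path: list = field(default_factory=list)

def build_config_packet(status):
    """Data flow manager sketch: include only processing elements that
    report themselves available, in ascending name order (illustrative
    policy), and assign each one a memory region."""
    available = sorted(pe for pe, ok in status.items() if ok)
    return ConfigPacket(
        pe_to_memory={pe: f"mem{i}" for i, pe in enumerate(available)},
        path=available,
    )
```

Each processing element would then consult the packet to learn both where its operands live and which element receives its output next.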