Patent classifications
G06F9/3877
Methods, systems and apparatus to improve deep learning resource efficiency
Methods, apparatus, systems and articles of manufacture are disclosed to improve deep learning resource efficiency. An example apparatus includes a graph monitor to select a candidate operation node in response to receiving an operation graph, the operation graph including one or more other operation nodes, a node rule evaluator to evaluate the candidate operation node based on an operating principle, the operating principle to determine an output storage destination of the candidate operation node based on a topology of the operation graph, and a tag engine to tag the candidate operation node with a memory tag value based on the determined output storage destination.
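The tagging flow described above can be sketched as follows. This is a minimal illustration, not the claimed apparatus: the abstract does not state the operating principle, so the rule used here (keep an output in fast local memory only when every consumer is the very next operation) is a hypothetical stand-in, as are the names `tag_nodes`, `LOCAL`, and `DRAM`.

```python
from typing import Dict, List


def tag_nodes(graph: Dict[str, List[str]], order: List[str]) -> Dict[str, str]:
    """Tag each operation node with a memory destination for its output.

    graph maps a node name to the names of the nodes that consume its output.
    order is the execution order of the nodes.
    """
    position = {name: i for i, name in enumerate(order)}
    tags: Dict[str, str] = {}
    for name in order:
        consumers = graph.get(name, [])
        # Hypothetical operating principle (an assumption, not from the source):
        # if the output is consumed only by the immediately following operation,
        # it can stay in local memory; otherwise it is written to DRAM.
        if consumers and all(position[c] == position[name] + 1 for c in consumers):
            tags[name] = "LOCAL"
        else:
            tags[name] = "DRAM"
    return tags


# Example operation graph: relu feeds both pool and a later add (a skip connection).
example_graph = {"conv": ["relu"], "relu": ["pool", "add"], "pool": ["add"], "add": []}
example_tags = tag_nodes(example_graph, ["conv", "relu", "pool", "add"])
```

Here the decision depends only on graph topology and execution order, which is the property the abstract attributes to the operating principle.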
Accelerated operation of a graph streaming processor
Methods, systems and apparatuses for graph processing are disclosed. One graph streaming processor includes a thread manager, wherein the thread manager is operative to dispatch operation of a plurality of threads of a plurality of thread processors before dependencies of the dependent threads have been resolved, maintain a scorecard of operation of the plurality of threads of the plurality of thread processors, and provide an indication to at least one of the plurality of thread processors of whether a dependency of at least one of the plurality of threads has or has not been satisfied. Further, a producer thread provides a response to the dependency when the dependency has been satisfied, and each of the plurality of thread processors is operative to provide processing updates to the thread manager, and provide queries to the thread manager upon reaching a dependency.
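The query-and-response interaction between thread processors and the thread manager can be sketched as follows. This is a simplified software model under stated assumptions: the class name `Scorecard`, its methods, and the use of string identifiers are hypothetical, and real hardware would track this state in registers rather than Python sets.

```python
from typing import Dict, Set


class Scorecard:
    """Tracks which dependencies are satisfied and which threads wait on them."""

    def __init__(self) -> None:
        self._done: Set[str] = set()
        self._waiting: Dict[str, Set[str]] = {}

    def query(self, thread_id: str, dependency: str) -> bool:
        """A dispatched thread reaches a dependency and asks whether it may proceed."""
        if dependency in self._done:
            return True
        # Not yet satisfied: remember the thread so it can be notified later.
        self._waiting.setdefault(dependency, set()).add(thread_id)
        return False

    def satisfy(self, dependency: str) -> Set[str]:
        """A producer thread reports completion; returns the threads to notify."""
        self._done.add(dependency)
        return self._waiting.pop(dependency, set())
```

Because threads are dispatched before their dependencies resolve, a consumer only stalls at the point where it actually needs the producer's result.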
Gateway pull model
A computer system comprising: (i) a computer subsystem configured to act as a work accelerator, and (ii) a gateway connected to the computer subsystem, the gateway enabling the transfer of data to the computer subsystem from external storage at pre-compiled data exchange synchronization points attained by the computer subsystem, which act as a barrier between a compute phase and an exchange phase of the computer subsystem, wherein the computer subsystem is configured to pull data from a gateway transfer memory of the gateway in response to the pre-compiled data exchange synchronization point attained by the subsystem, wherein the gateway comprises at least one processor configured to perform at least one operation to pre-load at least some of the data from a first memory of the gateway to the gateway transfer memory in advance of the pre-compiled data exchange synchronization point attained by the subsystem.
Neural network accelerator including bidirectional processing element array
Provided is a neural network accelerator which performs a calculation of a neural network provided with layers, the neural network accelerator including a kernel memory configured to store kernel data related to a filter, a feature map memory configured to store feature map data which are outputs of the layers, and a Processing Element (PE) array including PEs arranged along first and second directions, wherein each of the PEs performs a calculation using the feature map data transmitted in the first direction from the feature map memory and the kernel data transmitted in the second direction from the kernel memory, and transmits a calculation result to the feature map memory in a third direction opposite to the first direction.
Instructions for operating accelerator circuit
A system includes a memory to store input data, an accelerator circuit comprising an input command execution circuit, a neuron matrix command execution circuit, and an output command execution circuit, and a processor, communicatively coupled to the memory and the accelerator circuit, to generate a stream of instructions from source code targeting the accelerator circuit, each one of the stream of instructions comprising at least one of an input command, a neuron matrix command, or an output command, and issue the stream of instructions to the accelerator circuit for execution by the input command execution circuit, the neuron matrix command execution circuit, and the output command execution circuit.
Development and analysis of quantum computing programs
Techniques regarding the development and/or analysis of one or more quantum computing programs are provided. For example, one or more embodiments described herein can comprise a system, which can comprise a memory that can store computer executable components. The system can also comprise a processor, operably coupled to the memory, that can execute the computer executable components stored in the memory. The computer executable components can comprise a circuit component, operatively coupled to the processor, that can create a quantum computing program over a period of time. The computer executable components can also comprise a visualization component, operatively coupled to the processor, that can generate a quantum state visualization that depicts a characterization of the quantum computing program over the period of time.
Apparatus and method for acceleration data structure refit
Apparatus and method for acceleration data structure refit. For example, one embodiment of an apparatus comprises: a ray generator to generate a plurality of rays in a first graphics scene; a hierarchical acceleration data structure generator to construct an acceleration data structure comprising a plurality of hierarchically arranged nodes including inner nodes and leaf nodes stored in a memory in a depth-first search (DFS) order; traversal hardware logic to traverse one or more of the rays through the acceleration data structure; intersection hardware logic to determine intersections between the one or more rays and one or more primitives within the hierarchical acceleration data structure; and a node refit unit comprising circuitry and/or logic to read consecutively through at least the inner nodes in the memory in reverse DFS order to perform a bottom-up refit operation on the hierarchical acceleration data structure.
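The bottom-up refit can be sketched in software as follows. This is a minimal illustration assuming axis-aligned bounding boxes and a node array in pre-order DFS layout, so every child has a larger index than its parent; the `Node` type and the `union` and `refit` functions are hypothetical names, not taken from the source. Reading the array backwards guarantees that each child's box is final before its parent is visited.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Axis-aligned box: (min_x, min_y, min_z, max_x, max_y, max_z)
Box = Tuple[float, float, float, float, float, float]


@dataclass
class Node:
    box: Box
    children: List[int] = field(default_factory=list)  # empty for leaf nodes


def union(a: Box, b: Box) -> Box:
    """Smallest box enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), min(a[2], b[2]),
            max(a[3], b[3]), max(a[4], b[4]), max(a[5], b[5]))


def refit(nodes: List[Node]) -> None:
    """Recompute inner-node boxes in place by scanning in reverse DFS order."""
    for i in range(len(nodes) - 1, -1, -1):
        node = nodes[i]
        if not node.children:
            continue  # leaf boxes are assumed to be already updated from primitives
        box = nodes[node.children[0]].box
        for child in node.children[1:]:
            box = union(box, nodes[child].box)
        node.box = box


# DFS layout: root(0) -> inner(1) -> leaf(2), leaf(3); root(0) -> leaf(4)
stale = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
tree = [
    Node(stale, [1, 4]),
    Node(stale, [2, 3]),
    Node((0.0, 0.0, 0.0, 1.0, 1.0, 1.0)),
    Node((2.0, 0.0, 0.0, 3.0, 1.0, 1.0)),
    Node((0.0, 5.0, 0.0, 1.0, 6.0, 1.0)),
]
refit(tree)
```

A single linear pass suffices, with no recursion or stack, which is what makes the consecutive reverse read attractive for hardware.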
Panel self-refresh (PSR) transmission of bulk data
The present disclosure is directed to systems and methods of transferring bulk data, such as OLED compensation mask data, generated by a source device to a sink device using a high-bandwidth embedded DisplayPort (eDP) connection contemporaneous with an ENABLED Panel Self-Refresh (PSR) mode. Upon ENABLING the PSR mode, the source control circuitry causes the source transmitter circuitry, the sink receiver circuitry, and the eDP high-bandwidth communication link to remain active rather than inactive. The source control circuitry generates one or more data transport units (DTUs) having a header portion that contains data indicative of the presence of a bulk data payload and the non-display status of the bulk data payload carried by the DTUs.
Custom instruction implemented finite state machine engines for extensible processors
An extensible processor can include an execution pipeline, one or more extensible control engines and architectural visible control states. The extensible processor can be configured to determine a control state of the one or more extensible control engines from the architectural visible control states. The extensible processor can be further configured to initiate execution of a given one of the extensible control engines when a control state in the architectural visible control states corresponding to the given one of the extensible control engines is enabled, wherein the given one of the extensible control engines comprises control input and control outputs based on one or more control transitions of an instruction. The extensible processor can also be further configured to output a result of execution of the given one of the extensible control engines to the architectural visible control states.
Serialization floors and deadline driven control for performance optimization of asymmetric multiprocessor systems
Closed loop performance controllers of asymmetric multiprocessor systems may be configured and operated to improve performance and power efficiency of such systems by adjusting control effort parameters that determine the dynamic voltage and frequency state of the processors and coprocessors of the system in response to the workload. One example of such an arrangement includes applying hysteresis to the control effort parameter and/or seeding the control effort parameter so that the processor or coprocessor receives a returning workload in a higher performance state. Another example of such an arrangement includes deadline driven control, in which the control effort parameter for one or more processing agents may be increased in response to deadlines not being met for a workload and/or decreased in response to deadlines being met too far in advance. The performance increase/decrease may be determined by comparison of various performance metrics for each of the processing agents.
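Deadline driven control, as described above, can be sketched as a single update step. This is an illustrative sketch only: the function name, the normalized effort range of 0 to 1, the fixed `step`, and the `early_fraction` threshold that defines "too far in advance" are all assumptions, since the abstract gives no concrete values or update law.

```python
def update_control_effort(effort: float, finish_time: float, deadline: float,
                          step: float = 0.1, early_fraction: float = 0.25) -> float:
    """Return the next control effort, clamped to [0, 1].

    effort:      current control effort (higher means a higher DVFS state)
    finish_time: when the workload actually completed
    deadline:    when the workload was required to complete (same time unit)
    """
    slack = deadline - finish_time
    if slack < 0:
        # Deadline missed: raise performance.
        effort += step
    elif slack > early_fraction * deadline:
        # Deadline met too far in advance: lower performance to save power.
        effort -= step
    # Otherwise the deadline was met with modest slack: hold the current effort.
    return min(1.0, max(0.0, effort))
```

For example, with a deadline of 10 time units, finishing at 12 raises the effort, finishing at 5 lowers it, and finishing at 9 leaves it unchanged.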