Patent classifications
G06F9/3887
Lazy push strategies for vectorized D-Heaps
Techniques are provided for lazy push optimization, allowing for constant time push operations. A d-heap is used as the underlying data structure for indexing values being inserted. The d-heap is vectorized by storing values in a contiguous memory array. Heapify operations are delayed until a retrieve operation occurs, improving insert performance of vectorized d-heaps that use horizontal aggregation SIMD instructions at the cost of slightly lower retrieve performance.
PROVIDING NATIVE SUPPORT FOR GENERIC POINTERS IN A GRAPHICS PROCESSING UNIT
Embodiments are directed to systems and methods for supporting generic pointers in hardware of a GPU. According to one embodiment, a GPU includes multiple sub-cores each having a processing resource and a load/store pipeline. The processing resource is operable to receive a memory access message including a pointer and a memory type identifier indicative of the pointer representing a generic pointer. The processing resource is further operable to output a load or store operation to the load/store pipeline based on the memory access message, including computing an address for the load or store operation by adding a base address of a named memory type of a plurality of named memory types referenced by the generic pointer to an offset into a memory of the named memory type. The load/store pipeline is operable to, responsive to receipt of the load or store operation, access the memory at the address.
LOAD MULTIPLE PRIMITIVES PER THREAD IN A GRAPHICS PIPELINE
Systems, apparatuses, and methods for loading multiple primitives per thread in a graphics pipeline are disclosed. A system includes a graphics pipeline frontend with a geometry engine, shader processor input (SPI), and a plurality of compute units. The geometry engine generates primitives which are accumulated by the SPI into primitive groups. While accumulating primitives, the SPI tracks the number of vertices and primitives per group. The SPI determines wavefront boundaries based on mapping a single vertex to each thread of the wavefront while allowing more than one primitive per thread. The SPI launches wavefronts with one vertex per thread and potentially multiple primitives per thread. The compute units execute a vertex phase and a multi-cycle primitive phase for wavefronts with multiple primitives per thread.
GARBAGE COLLECTING WAVEFRONT
A processing system executes a specialized wavefront, referred to as a “garbage collecting wavefront” or GCWF, to identify and deallocate resources such as, for example, scalar registers, vector registers, and local data share space, that are no longer being used by wavefronts of a workgroup executing at the processing system (i.e., dead resources). In some embodiments, the GCWF is programmed to have compiler information regarding the resource requirements of the other wavefronts of the workgroup and specifies the program counter after which there will be a permanent drop in resource requirements for the other wavefronts. In other embodiments, the standard compute wavefronts signal the GCWF when they have completed using resources. The GCWF sends a command to deallocate the dead resources so the dead resources can be made available for additional wavefronts.
CONVOLUTIONAL NEURAL NETWORK OPERATIONS
Methods and systems are disclosed for executing operations on single-instruction-multiple-data (SIMD) units. Techniques disclosed perform a dot product operation on input data during one computer cycle, including convolving the input data, generating intermediate data, and applying one or more transitional operations to the intermediate data to generate output data. Aspects described, wherein the input data is an input to a layer of a convolutional neural network and the generated output data is the output of the layer.
UNIFIED AUTOMATION OF APPLICATION DEVELOPMENT
Unified automation of application development and delivery is provided. An automation pipeline execution coordinator may define a pipeline specification that includes actions to be performed, a triggering event definition and specification for determining execution context. The coordinator may concurrently detect triggering events for multiple pipelines matching the pipeline specification, and responsive to the detecting, determine execution contexts for the pipelines. The coordinator may then execute the multiple pipelines, where execution may proceed independently for pipelines with differing execution contexts. For pipelines sharing an execution context, execution of various actions of the respective pipelines may be coordinated. Execution context may be determined according to the specification for determining execution context, which may include an overridable default specification that determines context by locations of source data related to the triggering event. Pipeline specifications may be defined using pipeline specification templates and input from users obtained via various user interfaces.
EFFICIENT COMPRESSED VERBATIM COPY
Compressed verbatim copy can enable more efficient copying of compressed data. In one example, a compressed verbatim copy method involves receiving a command to copy compressed data from a source address of the memory device to a destination address. In response to the receipt of the command, the method involves copying the compressed data in a compressed format from the source address to the destination address without first decompressing the data. A second source address and a second destination address of metadata for the compressed data is determined, and the metadata is copied from the second source address to the second destination address.
Computational memory
An example device includes a plurality of computational memory banks. Each computational memory bank of the plurality of computational memory banks includes an array of memory units and a plurality of processing elements connected to the array of memory units. The device further includes a plurality of single instruction, multiple data (SIMD) controllers. Each SIMD controller of the plurality of SIMD controllers is contained within at least one computational memory bank of the plurality of computational memory banks. Each SIMD controller is to provide instructions to the at least one computational memory bank.
COMPUTING ARCHITECTURE
Computing architecture comprises an off-chip memory, an on-chip cache unit, a prefetching unit, a global scheduler, a transmitting unit, a pre-recombination network, a post-recombination network, a main computing array, a write-back cache unit, a data dependence controller and an auxiliary computing array. The architecture reads data tiles into an on-chip cache in a prefetching mode, and performs computing according to the data tiles; in the computing process of the tiles, a tile exchange network is adopted to recombine a data structure, and a data dependence module is arranged to process a data dependence relationship possibly existing between different tiles. According to the computing architecture, the data utilization rate can be increased, the data processing flexibility is improved, and therefore Cache Miss is reduced, and the memory bandwidth pressure is reduced.
TECHNIQUES FOR COMBINING OPERATIONS
Apparatuses, systems, and techniques to combine operations. In at least one embodiment, a processor causes two or more dependent reduction operations to be combined into a software kernel.