Patent classifications
G06F15/7839
MANAGEMENT APPARATUS AND INFORMATION PROCESSING SYSTEM
A management apparatus includes a memory and a processor coupled to the memory. The processor is configured to broadcast an activation request to a plurality of information processing devices having a reception period during which the activation request is received. The reception period occurs in a predetermined cycle. The processor is configured to receive a confirmation response from first information processing devices among the plurality of information processing devices. The first information processing devices are the devices that received the activation request. The processor is configured to issue an activation instruction to a predetermined number of second information processing devices among the first information processing devices. The activation instruction instructs the second information processing devices to activate. The processor is configured to issue, to the first information processing devices other than the second information processing devices, an activation prohibition instruction to prohibit activation.
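The broadcast, confirm, activate, and prohibit sequence described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Device` class, the `listening` flag (standing in for "the broadcast arrived inside the reception period"), the state names, and the first-come selection of responders are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    listening: bool          # hypothetical: True if the broadcast falls in this device's reception period
    state: str = "standby"   # hypothetical states: "standby", "active", "prohibited"

    def on_activation_request(self) -> bool:
        """Return True (a confirmation response) only if the request was received."""
        return self.listening


def activate_some(devices, needed):
    """Broadcast, collect confirmations, activate `needed` responders, prohibit the rest."""
    responders = [d for d in devices if d.on_activation_request()]  # the "first" devices
    chosen = responders[:needed]                                    # the "second" devices
    for d in chosen:
        d.state = "active"
    for d in responders[needed:]:
        d.state = "prohibited"   # responded, but not needed
    return chosen


devices = [Device("a", True), Device("b", False), Device("c", True), Device("d", True)]
chosen = activate_some(devices, needed=2)
states = {d.name: d.state for d in devices}
```

Device `b` misses the broadcast and is left untouched, while `d` confirms but receives the prohibition because the required count is already met.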
GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT
Described herein is a graphics processing unit (GPU) configured to receive an instruction having multiple operands, where the instruction is a single instruction multiple data (SIMD) instruction configured to use a bfloat16 (BF16) number format and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. The GPU can process the instruction using the multiple operands, where to process the instruction includes to perform a multiply operation, perform an addition to a result of the multiply operation, and apply a rectified linear unit function to a result of the addition.
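The multiply, add, and rectified linear unit sequence can be illustrated lane by lane in software. The sketch below is an assumption-laden model, not the GPU's behavior: `to_bf16` emulates BF16 by rounding a float32 bit pattern to its upper sixteen bits (round to nearest, ties to even, finite inputs only), `mad_relu` is a hypothetical name, and the addition is carried out in ordinary Python floating point because the abstract does not state the accumulation precision.

```python
import struct


def to_bf16(x):
    """Round a finite float to the nearest bfloat16 value (1 sign, 8 exponent, 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]       # float32 bit pattern
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000  # round to nearest even, drop low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mad_relu(a, b, c):
    """Per lane: multiply BF16 operands, add the third operand, then apply ReLU."""
    return [max(0.0, to_bf16(x) * to_bf16(y) + z) for x, y, z in zip(a, b, c)]


rounded = to_bf16(3.14159)
result = mad_relu([1.0, -2.0], [3.0, 4.0], [0.5, 1.0])
```

The second lane computes -2 × 4 + 1 = -7, which the rectified linear unit clamps to zero.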
Independent control of multiple concurrent application graphs in a reconfigurable data processor
A reconfigurable data processor includes a plurality of configurable units and a configuration controller. The configuration controller is configured to start execution of a first application graph in a first set of configurable units. Then, concurrently with the execution of the first application graph in the first set of configurable units, the configuration controller receives a command to load a configuration file into a second set of configurable units and obtains the configuration file. The configuration file contains information to configure the second set of configurable units to execute a second application graph. The configuration file is then loaded into the second set of configurable units and execution of the second application graph is started in the second set of configurable units.
COMPUTER FOR EXECUTING ALGORITHMS CARRIED OUT FROM MEMORIES USING MIXED TECHNOLOGIES
A computer for executing a computation algorithm involving a digital variable as per at least two operating phases is provided. The computer includes a memory stage having: a first set of memories for storing a first sub-word of each digital variable, each memory of the first set being non-volatile and having a first read endurance and a first write cyclability; and a second set of memories for storing a second sub-word of each digital variable, each memory of the second set having a second read endurance and a second write cyclability. The first read endurance is greater than the second read endurance, and the first write cyclability is less than the second write cyclability.
GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT
Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format with a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.
Graphic Processor Unit with Improved Energy Efficiency
A GPU architecture employs a crossbar switch to preferentially store operand vectors in a compressed form, which reduces the number of memory circuits that must be activated during an operand fetch and allows existing execution units to be used for scalar execution. Scalar execution can be performed during branch divergence.
Lossless tiling in convolution networks—padding before tiling, location-based tiling, and zeroing-out
Disclosed is a data processing system to receive a processing graph of an application. A compile time logic is configured to modify the processing graph and generate a modified processing graph. The modified processing graph is configured to apply a post-padding tiling after applying a cumulative input padding that confines padding to an input. The cumulative input padding pads the input into a padded input. The post-padding tiling tiles the padded input into a set of pre-padded input tiles with a same tile size, tiles an intermediate representation of the input into a set of intermediate tiles with a same tile size, and tiles an output representation of the input into a set of non-overlapping output tiles with a same tile size. Runtime logic is configured with the compile time logic to execute the modified processing graph to execute the application.
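The idea of padding once and then tiling can be shown on a one-dimensional, single-layer convolution. This is a simplified sketch under stated assumptions (odd kernel size, stride 1, zero padding, output length divisible by the tile size); the function names are hypothetical, and the multi-layer handling of intermediate tiles described above is not modeled.

```python
def conv1d_valid(x, w):
    """Plain 'valid' 1-D convolution (correlation form), no padding."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def pad_then_tile(x, k, out_tile):
    """Pad the whole input once, then cut equal-sized input tiles from the padded input."""
    pad = (k - 1) // 2                       # assumes odd kernel size k
    padded = [0] * pad + list(x) + [0] * pad
    in_tile = out_tile + k - 1               # each input tile yields out_tile outputs
    n_out = len(padded) - k + 1
    assert n_out % out_tile == 0, "sketch assumes the output divides evenly into tiles"
    return [padded[s:s + in_tile] for s in range(0, n_out, out_tile)]


x = [1, 2, 3, 4, 5, 6]
w = [1, 0, -1]
tiles = pad_then_tile(x, k=3, out_tile=2)

# Each tile is convolved independently; outputs do not overlap and simply concatenate.
tiled_out = [y for t in tiles for y in conv1d_valid(t, w)]
full_out = conv1d_valid([0] + x + [0], w)
```

Because padding is confined to the input, every tile has the same size and no tile needs its own padding, and the concatenated tile outputs match the untiled result.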
Tensor Partitioning and Partition Access Order
A method of processing partitions of a tensor in a target order includes receiving, by a reorder unit and from two or more producer units, a plurality of partitions of a tensor in a first order that is different from the target order, storing the plurality of partitions in the reorder unit, and providing, from the reorder unit, the plurality of partitions in the target order to one or more consumer units. In an example, the one or more consumer units process the plurality of partitions in the target order.
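A software analogue of the reorder unit might look like the sketch below. The `ReorderUnit` class and its `receive` method are hypothetical names, and releasing a partition as soon as its turn arrives (rather than after all partitions are stored) is an assumption of this example, not something the abstract specifies.

```python
class ReorderUnit:
    """Buffers partitions that arrive out of order and releases them in the target order."""

    def __init__(self, target_order):
        self._order = list(target_order)
        self._next = 0      # index into the target order of the next partition to release
        self._held = {}     # partition id -> data, for partitions that arrived early

    def receive(self, pid, data):
        """Store one partition and return every partition that can now be released."""
        self._held[pid] = data
        ready = []
        while self._next < len(self._order) and self._order[self._next] in self._held:
            key = self._order[self._next]
            ready.append((key, self._held.pop(key)))
            self._next += 1
        return ready


unit = ReorderUnit(target_order=[0, 1, 2, 3])
arrivals = [(2, "c"), (0, "a"), (3, "d"), (1, "b")]   # interleaved output of two producers
delivered = []
for pid, data in arrivals:
    delivered.extend(unit.receive(pid, data))
```

Consumers reading `delivered` see the partitions in the target order regardless of how the producers interleaved them.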
Lossless Tiling in Convolution Networks - Materialization of Tensors
Disclosed is a data processing system that includes a plurality of reconfigurable processors and processor memory. Runtime logic, operatively coupled to the plurality of reconfigurable processors and the processor memory, is configured to configure at least one reconfigurable processor in the plurality of reconfigurable processors with a first subgraph in a sequence of subgraphs of a graph; load an input onto the processor memory; on a tile-by-tile basis, process a first set of input tiles from the input through the first subgraph and generate a first set of intermediate tiles, load the first set of intermediate tiles onto the processor memory, and process the first set of intermediate tiles through the first subgraph and generate a first set of output tiles; and compose output tiles in the first set of output tiles into a first composed input, and load the first composed input onto the processor memory.
Efficient address translation caching in a processor that supports a large number of different address spaces
A processor includes a translation-lookaside buffer (TLB) and a mapping module. The TLB includes a plurality of entries, wherein each entry of the plurality of entries is configured to hold an address translation and a valid bit vector, wherein each bit of the valid bit vector indicates, for a respective address translation context, that the address translation is valid if set and invalid if clear. The TLB also includes an invalidation bit vector having bits corresponding to the bits of the valid bit vector of the plurality of entries, wherein a set bit of the invalidation bit vector indicates to simultaneously clear the corresponding bit of the valid bit vector of each entry of the plurality of entries. The mapping module generates the invalidation bit vector.
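The interaction between the per-entry valid bit vector and the invalidation bit vector can be modeled in a few lines. This is an illustrative sketch only: the `Tlb` class, its method names, and the linear search are assumptions, and the loop in `invalidate` stands in for what the abstract describes as a simultaneous clear across all entries.

```python
class Tlb:
    def __init__(self):
        self.entries = []   # each entry: [virtual_page, physical_page, valid_bits]

    def fill(self, vpage, ppage, context):
        """Mark a translation valid for one context, sharing an entry when the translation matches."""
        for entry in self.entries:
            if entry[0] == vpage and entry[1] == ppage:
                entry[2] |= 1 << context
                return
        self.entries.append([vpage, ppage, 1 << context])

    def lookup(self, vpage, context):
        """Return the physical page if an entry is valid for this context, else None."""
        for v, p, valid in self.entries:
            if v == vpage and valid & (1 << context):
                return p
        return None

    def invalidate(self, invalidation_bits):
        """Clear, in every entry, each valid bit whose invalidation bit is set."""
        for entry in self.entries:
            entry[2] &= ~invalidation_bits


tlb = Tlb()
tlb.fill(0x10, 0xA0, context=0)
tlb.fill(0x10, 0xA0, context=2)   # same translation shared by a second context
tlb.fill(0x20, 0xB0, context=2)
tlb.invalidate(0b100)             # invalidate everything belonging to context 2
```

After the invalidation, context 0 still hits on page `0x10`, while both translations for context 2 miss without any per-entry search for that context.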