Patent classifications
G06F9/30079
Optimized machine learning pipeline execution based on execution graphs
Embodiments provide a machine learning framework that enables developers to author and deploy machine learning pipelines into their applications regardless of the programming language in which the applications are structured. The framework may provide a programming language-specific API that enables the application to call a plurality of operators provided by the framework. The framework provides any number of APIs, each for a different programming language. The pipeline generated via the application is represented as an execution graph comprising node(s), where each node represents a particular operator. When a pipeline is submitted for execution, calls to the operators are detected, and nodes corresponding to the operators are generated for the execution graph. Once the execution graph is complete, the execution graph is provided to an execution engine, which reconstructs a corresponding execution graph that is structured in accordance with the operators' programming language and executes the reconstructed execution graph.
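The record-then-execute flow described in this abstract can be sketched as follows. This is a minimal illustration, not the patented framework: the `OPERATORS` registry, the `Pipeline.call` method, and the `execute` function are all hypothetical names, and the graph is assumed to be a simple linear chain.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical operator registry standing in for the framework's operators.
OPERATORS: Dict[str, Callable[[list], list]] = {
    "normalize": lambda xs: [x / max(xs) for x in xs],
    "square": lambda xs: [x * x for x in xs],
    "total": lambda xs: [sum(xs)],
}


@dataclass
class Node:
    op: str            # name of the operator this node represents
    inputs: List[int]  # indices of upstream nodes (empty for the first node)


@dataclass
class Pipeline:
    nodes: List[Node] = field(default_factory=list)

    def call(self, op: str) -> "Pipeline":
        # Calling an operator only records a node; nothing is executed yet.
        upstream = [len(self.nodes) - 1] if self.nodes else []
        self.nodes.append(Node(op, upstream))
        return self


def execute(pipeline: Pipeline, data: list) -> list:
    # The engine walks the recorded graph and runs the operator for each node.
    results: List[list] = []
    for node in pipeline.nodes:
        arg = results[node.inputs[0]] if node.inputs else data
        results.append(OPERATORS[node.op](arg))
    return results[-1]


pipe = Pipeline().call("normalize").call("square").call("total")
result = execute(pipe, [1.0, 2.0, 4.0])
```

Deferring execution until the graph is complete is what lets a separate engine rebuild and run the graph in the operators' own language; here both sides are Python only for brevity.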
SELECTIVELY PERFORMING INLINE COMPRESSION BASED ON DATA ENTROPY
A technique for managing data storage obtains a batch of chunks of data. The technique generates, using multiple pipelined instructions operating on the batch, a measure of data entropy for each of the chunks in the batch. The technique selectively compresses chunks in the batch based at least in part on the measures of data entropy generated for the respective chunks.
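The entropy-gated compression described above might look like the sketch below. The Shannon entropy measure, the threshold of 7.0 bits per byte, and the use of `zlib` are assumptions chosen for illustration; the abstract does not specify them, and the pipelined-instruction aspect is not modeled.

```python
import math
import zlib
from collections import Counter

ENTROPY_THRESHOLD = 7.0  # assumed cutoff in bits per byte, not from the source


def shannon_entropy(chunk: bytes) -> float:
    """Return bits per byte: about 0 for constant data, 8 for uniform data."""
    if not chunk:
        return 0.0
    n = len(chunk)
    return -sum(c / n * math.log2(c / n) for c in Counter(chunk).values())


def store_batch(chunks):
    # First pass: one entropy measure per chunk in the batch.
    entropies = [shannon_entropy(c) for c in chunks]
    # Second pass: compress only chunks that look compressible.
    stored = []
    for chunk, h in zip(chunks, entropies):
        if h < ENTROPY_THRESHOLD:
            stored.append(("zlib", zlib.compress(chunk)))
        else:
            stored.append(("raw", chunk))
    return stored


batch = [b"a" * 4096, bytes(range(256)) * 16]
stored = store_batch(batch)
```

Skipping high-entropy chunks avoids spending CPU time on data that is unlikely to shrink.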
PROGRAMMABLE INSTRUCTION BUFFERING
A processing system 2 includes a processing pipeline 12, 14, 16, 18, 28 which includes fetch circuitry 12 for fetching instructions to be executed from a memory 6, 8. Buffer control circuitry 34 is responsive to a programmable trigger, such as explicit hint instructions delimiting an instruction burst, or predetermined configuration data specifying parameters of a burst together with a synchronising instruction, to trigger the buffer control circuitry to stall a stallable portion of the processing pipeline (e.g. issue circuitry 16), to accumulate within one or more buffers 30, 32 fetched instructions starting from a predetermined starting instruction, and, when those instructions have been accumulated, to restart the stallable portion of the pipeline.
Check instruction for verifying correct code execution context
A data processing apparatus and method of data processing are provided which make use of a processor state check instruction to determine if the data processing apparatus is currently operating in a processor state, defined by at least one runtime processor state configuration value, which matches a processor state check value defined by the processor state check instruction. Dependent on the required runtime processor state configuration value(s) matching the processor state check value, the processor state check instruction is treated as an ineffective instruction. When the at least one runtime processor state configuration value does not match the processor state check value an exception is generated. Improved security of the data processing apparatus is thus provided.
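The behavior of the check instruction can be modeled in software as a rough sketch. The `Processor` class, the `SECURE` and `NON_SECURE` values, and the exception type are hypothetical; a real implementation is a hardware instruction, not a method call.

```python
SECURE = 0b01      # hypothetical runtime state configuration values
NON_SECURE = 0b00


class ProcessorStateCheckError(Exception):
    """Raised when the runtime state does not match the check value."""


class Processor:
    def __init__(self, state_config: int):
        self.state_config = state_config

    def state_check(self, check_value: int) -> None:
        # Mismatch: generate an exception instead of continuing.
        if self.state_config != check_value:
            raise ProcessorStateCheckError(
                f"state {self.state_config:#04b} != expected {check_value:#04b}"
            )
        # Match: the instruction has no effect, like a no-op.


def secure_routine(cpu: Processor) -> str:
    # Placing the check first guards the routine against being entered
    # from an unexpected execution context.
    cpu.state_check(SECURE)
    return "done"
```

Calling `secure_routine(Processor(SECURE))` returns normally, while `secure_routine(Processor(NON_SECURE))` raises before any of the routine's work runs.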
PERFORMING CYCLIC REDUNDANCY CHECKS USING PARALLEL COMPUTING ARCHITECTURES
Apparatuses, systems, and techniques use a graphics processing unit (GPU) to compute cyclic redundancy checks (CRCs). For example, in at least one embodiment, an input data sequence is distributed among GPU threads for parallel calculation of an overall CRC value for the input data sequence according to various novel techniques described herein.
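One well-known way to parallelize a CRC is to compute a partial CRC per chunk and then combine the partials using the linearity of the CRC over GF(2). The sketch below illustrates that idea on a CPU with Python threads standing in for GPU threads; it is not the patent's method. The helper names are hypothetical, and `advance_over_zeros` runs in linear time for clarity, where a production version would use a logarithmic-time operator such as the one behind zlib's `crc32_combine`.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

MASK = 0xFFFFFFFF


def advance_over_zeros(crc: int, n: int) -> int:
    # Run the raw CRC register (no pre/post inversion) over n zero bytes.
    # zlib.crc32 inverts its start value and its result, so undo both.
    return zlib.crc32(bytes(n), crc ^ MASK) ^ MASK


def combine(crc_a: int, crc_b: int, len_b: int) -> int:
    # CRC(A || B) = shift(CRC(A), len(B)) XOR CRC(B), by linearity.
    return advance_over_zeros(crc_a, len_b) ^ crc_b


def parallel_crc32(data: bytes, workers: int = 4) -> int:
    size = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Each worker computes the CRC of its own chunk independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(zlib.crc32, chunks))
    # Fold the partial CRCs left to right into the overall CRC.
    total = 0
    for chunk, crc in zip(chunks, partials):
        total = combine(total, crc, len(chunk))
    return total


data = bytes(range(256)) * 100 + b"tail"
overall = parallel_crc32(data)
```

The result matches a single sequential `zlib.crc32(data)` call, which is a convenient correctness check for any chunking scheme.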
UNIVERSAL FLOATING-POINT INSTRUCTION SET ARCHITECTURE FOR COMPUTING DIRECTLY WITH DECIMAL CHARACTER SEQUENCES AND BINARY FORMATS IN ANY COMBINATION
A universal floating-point Instruction Set Architecture (ISA) is implemented entirely in hardware. Using a single instruction, the universal floating-point ISA has the ability, in hardware, to compute directly with dual decimal character sequences up to IEEE 754-2008 H=20 in length, without first having to explicitly perform a conversion-to-binary-format process in software before computing with these human-readable floating-point or integer representations. The ISA does not employ opcodes, but rather pushes and pulls gobs of data without the encumbering opcode fetch, decode, and execute bottleneck. Instead, the ISA employs stand-alone, memory-mapped operators, complete with their own pipeline that is completely decoupled from the processor's primary push-pull pipeline. The ISA employs special three-port, 1024-bit wide SRAMs; a special dual asymmetric system stack; memory-mapped stand-alone hardware operators with private result buffers having simultaneously readable side-A and side-B read ports; and dual hardware H=20 convertFromDecimalCharacter conversion operators.
CACHE LINE DEMOTE INFRASTRUCTURE FOR MULTI-PROCESSOR PIPELINES
Examples described herein relate to a manner of demoting multiple cache lines to shared memory. In some examples, a shared cache is accessible by at least two processor cores and a region of the cache is larger than a cache line and is designated for demotion from the cache to the shared cache. In some examples, the cache line corresponds to a memory address in a region of memory. In some examples, an indication that the region of memory is associated with a cache line demote operation is provided in an indicator in a page table entry (PTE). In some examples, the indication that the region of memory is associated with a cache line demote operation is based on a command in an application executed by a processor. In some examples, the cache is a level 1 (L1) or level 2 (L2) cache.
Systems and methods for ISA support for indirect loads and stores for efficiently accessing compressed lists in graph applications
Disclosed embodiments relate to systems and methods for performing instructions to access a compressed graphic list. In one example, a processor includes fetch and decode circuitry to fetch and decode a single instruction to access the compressed graphic list, and execution circuitry to execute the decoded single instruction to cause access to the compressed graphic list by: receiving, from a load store queue, at a first op-engine associated with a first data location, an indirection request; computing, via the first op-engine, a second data location associated with a second op-engine; computing, via the second op-engine, a third data location associated with a third op-engine responsive to the indirection request; and providing, via the third op-engine, a data response to the load store queue responsive to receiving data from the third data location.
IC including Logic Tile, having Reconfigurable MAC Pipeline, and Reconfigurable Memory
An integrated circuit including configurable multiplier-accumulator circuitry, wherein, during processing operations, a plurality of the multiplier-accumulator circuits are serially connected into pipelines to perform concatenated multiply and accumulate operations. The integrated circuit includes a first memory and a second memory, and a switch interconnect network, including configurable multiplexers arranged in a plurality of switch matrices. The first and second memories are configurable as either a dedicated read memory or a dedicated write memory and connected to a given pipeline, via the switch interconnect network, during a processing operation performed thereby; wherein, during a first processing operation, the first memory is dedicated to write data to a first pipeline and the second memory is dedicated to read data therefrom and, during a second processing operation, the first memory is dedicated to read data from a second pipeline and the second memory is dedicated to write data thereto.
GLOBAL COHERENCE OPERATIONS
A method includes receiving, by an L2 controller, a request to perform a global operation on an L2 cache and preventing new blocking transactions from entering a pipeline coupled to the L2 cache while permitting new non-blocking transactions to enter the pipeline. Blocking transactions include read transactions and non-victim write transactions. Non-blocking transactions include response transactions, snoop transactions, and victim transactions. The method further includes, in response to an indication that the pipeline does not contain any pending blocking transactions, preventing new snoop transactions from entering the pipeline while permitting new response transactions and victim transactions to enter the pipeline; in response to an indication that the pipeline does not contain any pending snoop transactions, preventing all new transactions from entering the pipeline; and, in response to an indication that the pipeline does not contain any pending transactions, performing the global operation on the L2 cache.
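The staged draining described in this abstract can be read as a small state machine. The sketch below models it in software under simplifying assumptions: the phase names, the `L2Controller` class, and its methods are hypothetical, "write" stands for a non-victim write, and the pipeline is reduced to a count of pending transactions per type.

```python
from collections import Counter

BLOCKING = {"read", "write"}                    # "write" = non-victim write
NON_BLOCKING = {"response", "snoop", "victim"}
ALL_KINDS = BLOCKING | NON_BLOCKING


class L2Controller:
    # Transaction types admitted to the pipeline in each (hypothetical) phase.
    ALLOWED = {
        "idle": ALL_KINDS,
        "drain_blocking": NON_BLOCKING,
        "drain_snoops": {"response", "victim"},
        "drain_all": set(),
    }

    def __init__(self):
        self.pending = Counter()
        self.phase = "idle"
        self.global_ops_done = 0

    def request_global_op(self) -> None:
        self.phase = "drain_blocking"
        self._advance()

    def submit(self, kind: str) -> bool:
        # Returns False when the current phase keeps this type out.
        if kind not in self.ALLOWED[self.phase]:
            return False
        self.pending[kind] += 1
        return True

    def complete(self, kind: str) -> None:
        if self.pending[kind] == 0:
            raise ValueError(f"no pending {kind} transaction")
        self.pending[kind] -= 1
        self._advance()

    def _count(self, kinds) -> int:
        return sum(self.pending[k] for k in kinds)

    def _advance(self) -> None:
        # Each step tightens admission once the previous class has drained.
        if self.phase == "drain_blocking" and self._count(BLOCKING) == 0:
            self.phase = "drain_snoops"
        if self.phase == "drain_snoops" and self.pending["snoop"] == 0:
            self.phase = "drain_all"
        if self.phase == "drain_all" and self._count(ALL_KINDS) == 0:
            self.global_ops_done += 1  # the global operation runs here
            self.phase = "idle"
```

Admitting non-blocking transactions while blocking ones drain lets in-flight work finish, since a pending read may depend on a later response or victim transaction to complete.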