G06F9/355

Initialization of parameters for machine-learned transformer neural network architectures

An online system trains a transformer architecture using an initialization method that allows the transformer architecture to be trained without normalization layers or learning rate warmup, resulting in significant improvements in computational efficiency for transformer architectures. Specifically, an attention block included in an encoder or a decoder of the transformer architecture generates the set of attention representations by applying a key matrix to the input key, a query matrix to the input query, and a value matrix to the input value to generate an output, and applying an output matrix to the output to generate the set of attention representations. The initialization method may be performed by scaling the parameters of the value matrix and the output matrix by a factor that is inverse to the number of encoders in the set of encoders or the number of decoders in the set of decoders.
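As a sketch of the scaling idea, the value and output projection matrices can be initialized normally and then shrunk by a factor that is inverse in the layer count. The Xavier-style initializer and the exact exponent (here 1/sqrt(N)) are illustrative assumptions; the abstract only specifies a factor inverse to the number of encoders or decoders.

```python
import math
import random

def init_attention_block(d_model, num_layers, scale_exponent=0.5):
    """Initialize Q/K/V/output matrices for one attention block.

    The value and output matrices are scaled down by a factor that is
    inverse in the number of encoder (or decoder) layers, so the stack
    can train without normalization layers or learning rate warmup.
    The exponent (0.5, i.e. N**-0.5) is an assumption for illustration.
    """
    def xavier(rows, cols):
        # Uniform Xavier initialization bound.
        bound = math.sqrt(6.0 / (rows + cols))
        return [[random.uniform(-bound, bound) for _ in range(cols)]
                for _ in range(rows)]

    scale = num_layers ** (-scale_exponent)   # inverse in the layer count
    W_q = xavier(d_model, d_model)
    W_k = xavier(d_model, d_model)
    W_v = [[w * scale for w in row] for row in xavier(d_model, d_model)]
    W_o = [[w * scale for w in row] for row in xavier(d_model, d_model)]
    return W_q, W_k, W_v, W_o
```

Only `W_v` and `W_o` are scaled; the query and key matrices keep their ordinary initialization, matching the abstract's description.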

Vector convert hexadecimal floating point to scaled decimal instruction

An instruction to perform converting and scaling operations is provided. Execution of the instruction includes converting an input value in one format to provide a converted result in another format. The converted result is scaled to provide a scaled result. A result obtained from the scaled result is placed in a selected location. Further, an instruction to perform scaling and converting operations is provided. Execution of the instruction includes scaling an input value in one format to provide a scaled result and converting the scaled result from the one format to provide a converted result in another format. A result obtained from the converted result is placed in a selected location.
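As a sketch of the convert-then-scale form, the short (32-bit) IBM hexadecimal floating point layout can be decoded and the result scaled by a power of ten into a decimal value. The helper names and the choice of the short HFP format are assumptions for illustration.

```python
from decimal import Decimal

def hfp_to_float(word):
    """Decode a 32-bit IBM hexadecimal floating point value.

    Layout: 1 sign bit, 7-bit exponent (excess-64, base 16), and a
    24-bit fraction to the right of an implied radix point.
    """
    sign = -1 if (word >> 31) & 1 else 1
    exponent = ((word >> 24) & 0x7F) - 64
    fraction = (word & 0xFFFFFF) / float(1 << 24)
    return sign * fraction * (16 ** exponent)

def convert_then_scale(word, scale_power):
    """Convert HFP to decimal, then scale the result by 10**scale_power."""
    converted = Decimal(hfp_to_float(word))
    return converted.scaleb(scale_power)
```

The second instruction form described in the abstract would simply reverse the two steps: scale the HFP input first, then convert the scaled result to the decimal format.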

APPARATUS AND METHOD FOR SCALABLE QUBIT ADDRESSING
20230162075 · 2023-05-25

An apparatus and method for scalable qubit addressing. For example, one embodiment of a processor comprises: a decoder comprising quantum instruction decode circuitry to decode quantum instructions to generate quantum microoperations (uops) and non-quantum decode circuitry to decode non-quantum instructions to generate non-quantum uops; execution circuitry comprising: an address generation unit (AGU) to generate a system memory address responsive to execution of one or more of the non-quantum uops; and quantum index generation circuitry to generate quantum index values responsive to execution of one or more of the quantum uops, each quantum index value uniquely identifying a quantum bit (qubit) in a quantum processor; wherein to generate a first quantum index value for a first quantum uop, the quantum index generation circuitry is to read the first quantum index value from a first architectural register identified by the first quantum uop.
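A minimal software model of the index generation path: a quantum uop names an architectural register, and the index generation circuitry reads the qubit index stored there. The register naming and uop encoding below are illustrative assumptions.

```python
class QuantumIndexGenerator:
    """Toy model of the quantum index generation circuitry."""

    def __init__(self, num_qubits):
        self.registers = {}        # architectural register file
        self.num_qubits = num_qubits

    def write_register(self, reg, value):
        # A non-quantum uop can compute and store an index here.
        self.registers[reg] = value

    def generate_index(self, quantum_uop):
        """Read the qubit index from the register the uop identifies."""
        index = self.registers[quantum_uop["index_register"]]
        if not 0 <= index < self.num_qubits:
            raise ValueError(f"qubit index {index} out of range")
        return index
```

Because the index comes from a register rather than being fixed in the instruction encoding, the same quantum instruction sequence can address any qubit, which is what makes the scheme scalable.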

Processor with smart cache in place of register file for providing operands

A processor including a pointer storage that stores pointer descriptors each including addressing information, an arithmetic logic unit (ALU) configured to execute an instruction which includes operand indexes each identifying a corresponding pointer descriptor, multiple address generation units (AGUs), each configured to translate addressing information from a corresponding pointer descriptor into memory addresses for accessing corresponding operands stored in a memory, and a smart cache. The smart cache includes a cache storage, and uses the memory addresses from the AGUs to retrieve and store operands from the memory into the cache storage, and to provide the stored operands to the ALU when executing the instruction. The smart cache replaces a register file used by a conventional processor for retrieving and storing operand information. The pointer operands include post-update capability that reduces instruction fetches. Wasted memory cycles associated with cache speculation are avoided.
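A toy model of the operand path: each operand index selects a pointer descriptor, an AGU turns the descriptor into a memory address, the smart cache fills on a miss and supplies the operand, and the descriptor is post-updated. The descriptor fields and the add instruction are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class PointerDescriptor:
    base: int     # current operand address
    stride: int   # post-update increment applied after each access

class SmartCache:
    """Toy smart cache fed by AGU-generated addresses."""

    def __init__(self, memory):
        self.memory = memory    # backing "system memory" (addr -> value)
        self.storage = {}       # cache storage

    def load(self, address):
        if address not in self.storage:          # miss: fill from memory
            self.storage[address] = self.memory[address]
        return self.storage[address]

def execute_add(cache, descriptors, op_indexes):
    """Fetch operands via pointer descriptors, add them, post-update."""
    operands = []
    for idx in op_indexes:
        desc = descriptors[idx]
        address = desc.base           # AGU: descriptor -> memory address
        operands.append(cache.load(address))
        desc.base += desc.stride      # post-update, no extra fetch needed
    return sum(operands)
```

The post-update step is what reduces instruction fetches: advancing a pointer does not require a separate increment instruction.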

AUTOMATED PREDICTIVE INFRASTRUCTURE SCALING
20230116810 · 2023-04-13

Methods, apparatus, and processor-readable storage media for automated predictive infrastructure scaling are provided herein. An example computer-implemented method includes generating infrastructure scaling predictions by processing, using a motion-based model, historical data pertaining to a number of requests for resources for a given time interval and historical data pertaining to a rate of change in the number of requests; determining a trend based on moving average values pertaining to the historical data; determining a utilization target related to the resources based on the trend; calculating a standard deviation for resource demand based on historical utilization data pertaining to the resources; separating the standard deviation into zones related to levels of utilization of the resources; identifying one of the zones for infrastructure scaling in a future time interval based on the utilization target; updating the predictions by fitting the predictions to the identified zone; and performing automated actions based on the updated predictions.
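A simplified sketch of the zoning step: a moving average of request counts gives a trend, the trend sets a utilization target, and a band of one standard deviation around mean utilization is cut into zones. The window size, target rule, and zone boundaries are all assumptions; the abstract does not fix them.

```python
import statistics

def predict_scaling_zone(request_counts, utilization_history, num_zones=3):
    """Pick a utilization zone for a future interval (illustrative rules)."""
    window = min(3, len(request_counts))
    moving_avg = [statistics.mean(request_counts[i - window:i])
                  for i in range(window, len(request_counts) + 1)]
    trend = moving_avg[-1] - moving_avg[0]          # rising or falling demand

    mean_util = statistics.mean(utilization_history)
    stdev = statistics.pstdev(utilization_history)

    # Utilization target follows the trend direction (assumed rule).
    target = mean_util + (stdev if trend > 0 else -stdev if trend < 0 else 0)

    # Split the +/- one-standard-deviation band around the mean into zones.
    low, high = mean_util - stdev, mean_util + stdev
    zone_width = (high - low) / num_zones
    zone = min(num_zones - 1, max(0, int((target - low) / zone_width)))
    return zone
```

Rising demand lands in the highest zone and falling demand in the lowest, so an autoscaler acting on the returned zone would provision ahead of the trend rather than reacting to it.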

Physical Quantity Detection Device
20220318010 · 2022-10-06

A physical quantity detection device is provided that can improve arithmetic resolution while preventing an increase in memory capacity. A physical quantity detection device 100 according to the present invention includes: a physical quantity detection sensor that detects a physical quantity of a measurement target gas; a storage unit 120 that records a correction amount corresponding to a detection value of the physical quantity detection sensor; and an arithmetic unit 110 that performs output adjustment of the detection value using the detection value and the correction amount. The resolution of the storage unit 120 is lower than the arithmetic resolution of the arithmetic unit 110.
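One plausible reading, sketched below: the storage unit holds correction amounts only at coarse intervals, and the arithmetic unit interpolates between entries at a finer resolution, so arithmetic resolution exceeds storage resolution without growing the table. The linear interpolation scheme is an assumption for illustration.

```python
def corrected_output(detection_value, correction_table, table_step):
    """Adjust a sensor reading using a coarse correction table.

    correction_table holds one correction amount per table_step of the
    detection value; intermediate corrections are interpolated, giving
    finer arithmetic resolution than the stored table provides.
    """
    position = detection_value / table_step
    i = max(0, min(int(position), len(correction_table) - 2))
    frac = position - i
    correction = (correction_table[i] * (1 - frac)
                  + correction_table[i + 1] * frac)
    return detection_value + correction
```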

INTERPOLATION ACCELERATION IN A PROCESSOR MEMORY INTERFACE
20220318162 · 2022-10-06

Linear interpolation is performed within a memory system. The memory system receives a floating-point index into an integer-indexed memory array. The memory system accesses the two values at the two adjacent integer indices, performs the linear interpolation, and provides the resulting interpolated value. In many system architectures, the critical limitation on system performance is the data transfer rate between memory and processing elements. Accordingly, reducing the amount of data transferred improves overall system performance and reduces power consumption.
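The operation itself is simple, which is what makes it attractive to push into the memory interface; a sketch of the computation the memory system would perform:

```python
def memory_lerp(array, float_index):
    """Linear interpolation done 'inside' the memory system.

    Reads the two elements adjacent to float_index and returns only
    the interpolated value, so a single result crosses the memory
    interface instead of two raw elements.
    """
    i = max(0, min(int(float_index), len(array) - 2))
    frac = float_index - i
    return array[i] * (1 - frac) + array[i + 1] * frac
```

Halving the transferred data per lookup is the efficiency argument the abstract makes: the processor sends one float index and receives one value, rather than fetching both neighbours and interpolating itself.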

DIRECT DATA TRANSFER SYSTEM

The present description concerns a system comprising at least a first and a second memory circuit; and a direct data transfer circuit adapted to receive specific instructions originating from an external processor and to decode specific instructions comprising: a specific instruction SET_REGION defining a sub-region in the first memory circuit to and from which data will be transferred; and a specific instruction of transfer between said sub-region and the second memory circuit, the specific transfer instruction comprising a first address field containing the relative coordinates, in said sub-region, of a first reference cell.
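A toy model of the two instructions: SET_REGION pins down a rectangular sub-region of the first memory, and a transfer addresses cells by coordinates relative to that sub-region. The flat 2-D layout and method signatures are assumptions for illustration.

```python
class DirectTransferEngine:
    """Toy model of SET_REGION plus a relative-coordinate transfer."""

    def __init__(self, first_memory, width):
        self.memory = first_memory   # flat first memory circuit
        self.width = width           # row width for 2-D addressing
        self.region = None

    def set_region(self, x0, y0, w, h):
        """SET_REGION: define the sub-region for subsequent transfers."""
        self.region = (x0, y0, w, h)

    def transfer(self, rel_x, rel_y, count):
        """Read `count` cells starting at (rel_x, rel_y) in the region.

        The coordinates are relative to the sub-region's origin, as in
        the transfer instruction's first address field.
        """
        x0, y0, _, _ = self.region
        start = (y0 + rel_y) * self.width + (x0 + rel_x)
        return self.memory[start:start + count]
```

Relative addressing keeps the transfer instruction's address field small: the full absolute coordinates are established once by SET_REGION and reused across many transfers.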

FORWARD TENSOR AND ACTIVATION SCALING FOR LOWER PRECISION NEURAL NETWORKS
20230205544 · 2023-06-29

A processing device is provided which comprises memory configured to store data and a processor configured to execute a forward activation of the neural network using a low precision floating point (FP) format, scale up values of numbers represented by the low precision FP format and process the scaled up values of the numbers as non-zero values for the numbers. The processor is configured to scale up the values of one or more numbers, via scaling parameters, to a scaled up value equal to or greater than a floor of a dynamic range of the low precision FP format. The scaling parameters are, for example, static parameters or alternatively, parameters determined during execution of the neural network.
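A sketch of the scaling criterion: a static scale parameter lifts small activation values to or above the format's dynamic-range floor so they survive as non-zero values. FP16's smallest normal (2**-14) stands in for the low-precision format's floor here; both that choice and the function shape are assumptions.

```python
def scale_up_activations(values, scale, fp_floor=2 ** -14):
    """Scale activations so small values survive a low-precision format.

    fp_floor approximates the floor of the format's dynamic range
    (FP16's smallest normal is used as a stand-in). Values lifted to
    or above the floor are processed as non-zero; anything still below
    it would be flushed to zero in the low-precision format.
    """
    scaled = [v * scale for v in values]
    lost = [v for v in scaled if 0 < abs(v) < fp_floor]
    return scaled, lost
```

An empty `lost` list indicates the chosen static scale is large enough for this batch of activations; a non-empty one suggests the scale parameter should be raised or determined dynamically during execution, matching the two options the abstract describes.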

Providing code sections for matrix of arithmetic logic units in a processor
11687346 · 2023-06-27

The present invention relates to a processor having a trace cache and a plurality of ALUs arranged in a matrix, comprising an analyser unit located between the trace cache and the ALUs, wherein the analyser unit analyses the code in the trace cache, detects loops, transforms the code, and issues to the ALUs sections of the code combined to blocks for joint execution for a plurality of clock cycles.
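A minimal sketch of the loop-detection step the analyser unit performs: a backward branch in the trace marks a loop, and the spanned instructions are grouped into one block for joint issue to the ALU matrix over multiple cycles. The trace format, a list of `(op, target)` tuples, is an assumption for illustration.

```python
def detect_loops(trace):
    """Find backward branches in a trace and group each loop body
    into a single block for joint execution on the ALU matrix."""
    blocks = []
    for i, (op, target) in enumerate(trace):
        # A branch whose target is at or before itself closes a loop.
        if op == "branch" and target is not None and target <= i:
            blocks.append(trace[target:i + 1])   # loop body as one block
    return blocks
```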