G06F9/3555

SPECIAL-PURPOSE DIGITAL-COMPUTE HARDWARE FOR EFFICIENT ELEMENT-WISE AGGREGATION, SCALING AND OFFSET

An efficient pipelined implementation of digital scaling, offset and aggregation operation supports element-by-element programmable scale and offset factors. The method includes time-multiplexed parallel pipelining of a plurality of digital data words, each of the plurality of digital data words encoding an N-bit signed integer, from one of a plurality of receive-registers through a datapath that can either (1) store the plurality of digital data words directly in a dedicated first memory, (2) store the plurality of digital data words directly in a dedicated second memory, or (3) direct the plurality of digital data words into a parallel set of fused-multiply-add units. The method further includes multiplying each digital data word by a corresponding data-word retrieved from the dedicated first memory to form product data words and adding the product data words to a corresponding data-word retrieved from the dedicated second memory to form an output sum-and-product data words.

APPARATUSES AND METHODS FOR PROCESSING SINGLE INSTRUCTION FOR IMAGE TRANSFORMATION FROM NON-INTEGRAL LOCATIONS

A processor pipeline circuit in a processor for non-integral transformation of an image utilizing a single instruction is disclosed. The processor pipeline circuit comprises a data fetch circuit configured to receive a memory address of the input image and fetch a plurality of pixels of the input image. The processor pipeline circuit further comprises a weights access circuit configured to receive an element of the array of offsets and the interpolation type parameter. The weights access circuit is configured to determine weights to be applied to the plurality of pixels of the input image. The processor pipeline circuit further comprises a multiply and add circuit configured to calculate the output pixel of the transformed image by multiplying the plurality of pixels of the input image by the weights and summing each resulting product.

AUTOSCALING AND THROTTLING IN AN ELASTIC CLOUD SERVICE

Techniques described herein can optimize usage of computing resources in a data system. Dynamic throttling can be performed locally on a computing resource in the foreground and autoscaling can be performed in a centralized fashion in the background. Dynamic throttling can lower the load without overshooting while minimizing oscillation and reducing the throttle quickly. Autoscaling may involve scaling in or out the number of computing resources in a cluster as well as scaling up or down the type of computing resources to handle different types of situations.

Instructions and logic for load-indices-and-prefetch-scatters operations

A processor includes an execution unit to execute instructions to load indices from an array of indices, optionally perform scatters, and prefetch (to a specified cache) contents of target locations for future scatters from arbitrary locations in memory. The execution unit includes logic to load, for each target location of a scatter or prefetch operation, an index value to be used in computing the address in memory for the operation. The index value may be retrieved from an array of indices identified for the instruction. The execution unit includes logic to compute the addresses based on the sum of a base address specified for the instruction, the index value retrieved for the location, and a prefetch offset (for prefetch operations), with optional scaling. The execution unit includes logic to retrieve data elements from contiguous locations in a source vector register specified for the instruction to be scattered to the memory.

VECTOR GENERATING INSTRUCTION

An apparatus and method are provided for performing vector processing operations. In particular the apparatus has processing circuitry to perform the vector processing operations and an instruction decoder to decode vector instructions to control the processing circuitry to perform the vector processing operations specified by the vector instructions. The instruction decoder is responsive to a vector generating instruction identifying a scalar start value and wrapping control information, to control the processing circuitry to generate a vector comprising a plurality of elements. In particular, the processing circuitry is arranged to generate the vector such that the first element in the plurality is dependent on the scalar start value, and the values of the plurality of elements follow a regularly progressing sequence that is constrained to wrap as required to ensure that each value is within bounds determined from the wrapping control information. The vector generating instruction can be useful in a variety of situations, a particular use case being to implement a circular addressing mode within memory, where the vector generating instruction can be coupled with an associated vector memory access instruction. Such an approach can remove the need to provide additional logic within the memory access path to support such circular addressing.

PROGRAMMABLE COMPUTE ENGINE HAVING TRANSPOSE OPERATIONS
20240111528 · 2024-04-04 ·

A technique to execute transpose and compute operations may include retrieving a set of machine instructions from an instruction buffer of a data processor. The instruction buffer has multiple entries, and each entry stores one machine instruction. A machine instruction from the set of machine instructions is executed to transpose a submatrix of an input tensor and perform computations on column elements of the submatrix. The machine instruction combines the transpose operation with computational operations into a single machine instruction.

WORKLOAD LINKED PERFORMANCE SCALING FOR SERVERS

Embodiments herein relate to providing uniform servicing of workloads at a set of servers in a computer network. A platform determines and meets the performance requirements of a workload by scaling a performance capability of a group of processing units such as central processing units (CPUs) which are assigned to service the workload. This can involve increasing the power (P) state of one or more of the processing units to a highest P state in the group, so that every processing units in the group provides the same performance for a given workload. The platform can manage scaling of the processing units performance by reading a performance profile list which indicates minimum and maximum scaling points for programs that are executed to service the workload.

CONFIGURING AND DYNAMICALLY RECONFIGURING CHAINS OF ACCELERATORS

A method of an aspect includes receiving a request for a chained accelerator operation, and configuring a chain of accelerators to perform the chained accelerator operation. This may include configuring a first accelerator to access an input data from a source memory location in system memory, process the input data, and generate first intermediate data. This may also include configuring a second accelerator to receive the first intermediate data, without the first intermediate data having been sent to the system memory, process the first intermediate data, and generate additional data. Other apparatus, methods, systems, and machine-readable medium are disclosed.

CHAINED ACCELERATOR OPERATIONS

A chip or other apparatus of an aspect includes a first accelerator and a second accelerator. The first accelerator has support for a chained accelerator operation. The first accelerator is to be controlled as part of the chained accelerator operation to access an input data from a source memory location in system memory, process the input data, and generate first intermediate data. The second accelerator also has support for the chained accelerator operation. The second accelerator is to be controlled as part of the chained accelerator operation to receive the first intermediate data, without the first intermediate data having been sent to the system memory, process the first intermediate data, and generate additional data. Other apparatus, methods, systems, and machine-readable medium are disclosed.

PROGRAM ACCELERATORS WITH MULTIDIMENSIONAL NESTED COMMAND STRUCTURES
20240127107 · 2024-04-18 ·

Embodiments of the present disclosure include techniques for machine language processing. In one embodiment, the present disclosure include commands with data structures comprising fields describing multi-dimensional data and fields describing synchronization. Large volumes of data may be processed and automatically synchronized by execution of a single command.