Patent classifications
G06F9/3555
HEAVY-WEIGHT/LIGHT-WEIGHT GPU SHADER CORE PARE ARCHITECTURE
A shader core includes a first processing element (PE), a second processing element, a register file and a warp sequencing unit. The first PE includes a first predetermined number of execution units, and the second PE includes a second predetermined number of execution units in which the second predetermined number of execution units is less than the first predetermined number of execution units. The register file shared by the first PE and the second PE. The warp sequencer unit (WSQ) is coupled to the first PE and to the second PE and schedules an instruction trace to execute on the first PE or the second PE based on information contained in a trace header of the instruction trace. The information contained in the trace header indicates whether the instruction trace is executable on the second PE.
Accessing data in multi-dimensional tensors
Methods, systems, and apparatus, including an apparatus for processing an instruction for accessing a N-dimensional tensor, the apparatus including multiple tensor index elements and multiple dimension multiplier elements, where each of the dimension multiplier elements has a corresponding tensor index element. The apparatus includes one or more processors configured to obtain an instruction to access a particular element of a N-dimensional tensor, where the N-dimensional tensor has multiple elements arranged across each of the N dimensions, and where N is an integer that is equal to or greater than one; determine, using one or more tensor index elements of the multiple tensor index elements and one or more dimension multiplier elements of the multiple dimension multiplier elements, an address of the particular element; and output data indicating the determined address for accessing the particular element of the N-dimensional tensor.
System, Apparatus And Method For Symbolic Store Address Generation For Data-Parallel Processor
In one embodiment, an apparatus includes: a plurality of execution lanes to perform parallel execution of instructions; and a unified symbolic store address buffer coupled to the plurality of execution lanes, the unified symbolic store address buffer comprising a plurality of entries each to store a symbolic store address for a store instruction to be executed by at least some of the plurality of execution lanes. Other embodiments are described and claimed.
TRANSACTION EXCHANGE PLATFORM TO DETERMINE SECONDARY WORKFLOWS FOR TRANSACTION OBJECT PROCESSING
Aspects described herein may relate to a transaction exchange platform using a streaming data platform (SDP) and microservices to process transactions in accordance with corresponding workflows. The transaction exchange platform may receive transactions from origination sources, which may be added to the SDP as transaction objects. Microservices on the transaction exchange platform may interact with the transaction objects based on configured workflows associated with the transactions. Further, the microservices may leverage machine-learning models to determine whether transaction objects may be more effectively processed using alternative or secondary workflows. Processing on the transaction exchange platform may facilitate clearing and settlement of transactions. Some aspects may provide for dynamic and flexible reconfiguration of workflows.
SYSTEM AND METHOD FOR VECTOR COMMUNICATION
There is disclosed in an example, an endpoint apparatus for an interconnect, comprising: a mechanical and electrical interface to the interconnect; and one or more logic elements comprising an interface vector engine to: receive a first scalar transaction for the interface; determine that the first scalar transaction meets a criterion for vectorization; receive a second scalar transaction for the interface; determine that the second transaction meets the criterion for vectorization; vectorize the first scalar transaction and second scalar transaction into a vector transaction; and send the vector transaction via the electrical interface
Distributed parallel processing routing
Examples described herein provide a non-transitory computer-readable medium storing instructions, which when executed on one or more processors, cause the one or more processors to perform operations. The operations include generating a plurality of child processes according to a number of a plurality of partitions in an integrated circuit (IC) design for an IC die, each of the plurality of child processes corresponding to and assigned to a respective one of the plurality of partitions. The operations include transmitting each of the plurality of partitions to a respective one of the plurality of child processes for routing, each of the plurality of partitions comprising a placement of components for the IC design. The operations include receiving a plurality of routings from the plurality of child processes. The operations include merging the plurality of routings into a global routing for the IC design by assembling together to form a global routing.
APPARATUS AND METHOD FOR TILE GATHER AND TILE SCATTER
An apparatus and method for tile-based gather and scatter operations. For example, one embodiment of a processor comprises: a destination tile register to store a 2-D arrangement of data elements; a first source tile register to store indices associated with the data elements; instruction fetch circuitry to fetch a tile gather instruction comprising operands identifying the first source tile register and the destination tile register; a decoder to decode the tile gather instruction; and execution circuitry to determine a plurality of system memory addresses based on the indices from the first source tile register and to load the data elements from the system memory addresses to the destination tile register.
Systems and methods for efficient scaling of quantized integers
The disclosed computer-implemented method may include receiving an input value and a floating-point scaling factor and determining (1) an integer scaling factor based on the floating-point scaling factor, (2) a pre-scaling adjustment value representative of a number of places by which to shift a binary representation of the input value prior to a scaling operation, and (3) a post-scaling adjustment value representative of a number of places by which to shift the binary representation of the input value following the scaling operation. The method may further include calculating a scaled result value by (1) shifting rightwards the binary representation of the input value by the pre-scaling adjustment value, (2) scaling the shifted binary representation of the input value by the integer scaling factor, and (3) shifting rightwards the shifted and scaled binary value by the post-scaling adjustment value. Various other methods, systems, and computer-readable media are also disclosed.
PROCESSING OF TEMPORARY-REGISTER-USING INSTRUCTION
An apparatus has a processing pipeline, and first and second register files. A temporary-register-using instruction is supported which controls the pipeline to perform an operation using a temporary variable derived from an operand stored in the first register file. In response to the instruction, when a predetermined condition is not satisfied, the pipeline processes at least one register move micro-operation to transfer data from the at least one source register of the first register file to at least one newly allocated temporary register of the second register file. When the condition is satisfied, the operation can be performed using a temporary variable already stored in the temporary register of the second register file used by an earlier temporary-register-using instruction specifying the same source register for determining the temporary variable, in the absence of an intervening instruction for rewriting the source register.
Large-scale data processing computer architecture
Computing architecture comprises an off-chip memory, an on-chip cache unit, a prefetching unit, a global scheduler, a transmitting unit, a pre-recombination network, a post-recombination network, a main computing array, a write-back cache unit, a data dependence controller and an auxiliary computing array. The architecture reads data tiles into an on-chip cache in a prefetching mode, and performs computing according to the data tiles; in the computing process of the tiles, a tile exchange network is adopted to recombine a data structure, and a data dependence module is arranged to process a data dependence relationship possibly existing between different tiles. According to the computing architecture, the data utilization rate can be increased, the data processing flexibility is improved, and therefore Cache Miss is reduced, and the memory bandwidth pressure is reduced.