G06F9/302

Storage organization for transposing a matrix using a streaming engine

Software instructions are executed on a processor within a computer system to configure a streaming engine to operate in either a linear mode or a transpose mode. A stream of addresses is generated using an address generator, in which the stream of addresses includes consecutive nested loop iterations for at least a first loop and a second loop. While in the linear mode, the first loop is treated as an inner loop. While in the transpose mode, the second loop is treated as the inner loop. A matrix can be fetched from memory in the linear mode to provide row-wise vectors. A matrix can be fetched from the memory in the transpose mode to provide column-wise vectors. Local storage on the streaming engine is organized as sectors based on the number of rows in the matrix to allow overlapping transposition processing and to minimize memory accesses.
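
The difference between the two modes can be illustrated with a minimal Python sketch of a two-level address generator. It assumes a row-major matrix in memory; the function name `stream_addresses` and its parameters are hypothetical and do not come from the abstract, which describes hardware rather than software.

```python
def stream_addresses(base, rows, cols, elem_size=1, transpose=False):
    """Yield byte addresses for a row-major rows x cols matrix.

    Assumption: the "first loop" walks along a row (columns) and the
    "second loop" walks down a column (rows).
    """
    row_stride = cols * elem_size
    if not transpose:
        # Linear mode: the first loop is innermost, so addresses come out row by row.
        for r in range(rows):
            for c in range(cols):
                yield base + r * row_stride + c * elem_size
    else:
        # Transpose mode: the second loop is innermost, so addresses come out column by column.
        for c in range(cols):
            for r in range(rows):
                yield base + r * row_stride + c * elem_size


linear = list(stream_addresses(0, rows=2, cols=3))
transposed = list(stream_addresses(0, rows=2, cols=3, transpose=True))
print(linear)      # [0, 1, 2, 3, 4, 5]
print(transposed)  # [0, 3, 1, 4, 2, 5]
```

Both modes visit the same set of addresses; only the loop nesting order changes. The sector organization of local storage mentioned in the abstract is not modeled here.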

Shift-folding for efficient load coalescing in a binary translation based processor

A processor includes an instruction fetch circuit to retrieve instructions from memory, and a decode unit circuit to decode retrieved instructions. The decode unit circuit identifies a shift instruction, accumulates a shift folded immediate value to track a number of bit positions shifted for a source register, and prevents the shift instruction from allocation to an execution unit of the processor.
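
As a rough illustration of the decode-stage behavior, the following Python sketch models a decoder that absorbs shift instructions into a per-register running total instead of dispatching them. The instruction tuple format `(op, reg, imm)`, the opcode name `"shl"`, and the class `Decoder` are all assumptions made for this example; the abstract does not say how the folded value is later consumed, so that part is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Decoder:
    # register name -> accumulated (folded) shift amount in bit positions
    folded_shift: dict = field(default_factory=dict)
    # instructions that were allocated to an execution unit
    dispatched: list = field(default_factory=list)

    def decode(self, instr):
        op, reg, imm = instr
        if op == "shl":
            # Shift identified: fold its immediate into the running total
            # for the source register and do not dispatch it.
            self.folded_shift[reg] = self.folded_shift.get(reg, 0) + imm
            return
        self.dispatched.append(instr)


decoder = Decoder()
for instr in [("shl", "r1", 8), ("load", "r2", 0), ("shl", "r1", 8)]:
    decoder.decode(instr)

print(decoder.folded_shift)  # {'r1': 16}
print(decoder.dispatched)    # [('load', 'r2', 0)]
```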

Architecture to support synchronization between core and inference engine for machine learning

A system to support a machine learning (ML) operation comprises a core configured to receive and interpret commands into a set of instructions for the ML operation and a memory unit configured to maintain data for the ML operation. The system further comprises an inference engine having a plurality of processing tiles, each comprising an on-chip memory (OCM) configured to maintain data for local access by components in the processing tile and one or more processing units configured to perform tasks of the ML operation on the data in the OCM. The system also comprises an instruction streaming engine configured to distribute the instructions to the processing tiles to control their operations and to synchronize data communication between the core and the inference engine so that data transmitted between them correctly reaches the corresponding processing tiles while ensuring coherence of data shared and distributed among the core and the OCMs.

Modular multi-computer monitor stand system for multi-computer monitor setup
10885815 · 2021-01-05

A modular multi-computer monitor stand system for multi-computer monitor setups is disclosed. The stand system provides various discrete interlocking components including a rail, slide members, elongated legs, base members and mounting plates that modularly interconnect to form an upright stand capable of holding multiple computer monitors thereon. The rail includes various grooves for slidably receiving the slide members. The slide members include projections slidably engageable with the grooves of the elongated rail and interlocking regions detachably engageable with the base members and the mounting plates. The base members include interlocking receptacles detachably engageable with the elongated legs and interlocking regions detachably engageable with the interlocking regions of the slide members. The mounting plates include interlocking regions detachably engageable with the interlocking regions of the slide members and universally positioned apertures for fastening the mounting plate to any type of computer monitor.

Multi-display apparatus and method of installing the same
10853018 · 2020-12-01

A multi-display apparatus including a main frame, a plurality of display modules arranged on the main frame, and a sub-frame coupled to each of the plurality of display modules. The sub-frame includes a rail portion and a hinge portion hingedly-coupled to the main frame. Each of the plurality of display modules includes a display panel, a back cover coupled to a rear surface of the display panel, and a movement guide fixed to the back cover and coupled to the rail portion in a sliding manner.

Accessing data in multi-dimensional tensors
10838724 · 2020-11-17

Methods, systems, and apparatus, including an apparatus for processing an instruction for accessing an N-dimensional tensor, the apparatus including multiple tensor index elements and multiple dimension multiplier elements, where each of the dimension multiplier elements has a corresponding tensor index element. The apparatus includes one or more processors configured to obtain an instruction to access a particular element of an N-dimensional tensor, where the N-dimensional tensor has multiple elements arranged across each of the N dimensions, and where N is an integer that is equal to or greater than one; determine, using one or more tensor index elements of the multiple tensor index elements and one or more dimension multiplier elements of the multiple dimension multiplier elements, an address of the particular element; and output data indicating the determined address for accessing the particular element of the N-dimensional tensor.
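
One common way to combine per-dimension indices and multipliers into an address is a sum of products, sketched below in Python. This is an illustrative assumption about the arithmetic, not a statement of the claimed method; the names `element_address` and `row_major_multipliers`, the `base` offset, and the row-major layout are hypothetical.

```python
def row_major_multipliers(shape):
    """Return one multiplier per dimension for a densely packed row-major tensor."""
    multipliers = []
    stride = 1
    for dim in reversed(shape):
        multipliers.append(stride)
        stride *= dim
    return list(reversed(multipliers))


def element_address(indices, multipliers, base=0):
    """Combine each tensor index with its corresponding dimension multiplier."""
    if len(indices) != len(multipliers):
        raise ValueError("each index needs exactly one multiplier")
    return base + sum(i * m for i, m in zip(indices, multipliers))


mults = row_major_multipliers((2, 3, 4))
print(mults)                              # [12, 4, 1]
print(element_address((1, 2, 3), mults))  # 23
```

For a 2 x 3 x 4 tensor, element (1, 2, 3) is the last element, and 1*12 + 2*4 + 3*1 = 23 matches the last offset of a 24-element buffer.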

Processor supporting arithmetic instructions with branch on overflow and methods
10768930 · 2020-09-08

A method provides for decoding, in a microprocessor, an instruction into data identifying a first register, a second register, an immediate value, and an opcode identifier. The opcode identifier is interpreted as indicating that an arithmetic operation is to be performed on the first register and the second register, and that the microprocessor is to perform a change of control operation in response to the addition of the first register and the second register causing overflow or underflow. The change of control operation is to a location in a program determined based on the immediate value. A processor can be provided with a decoder and other supporting circuitry to implement such method. Overflow/underflow can be checked on word boundaries of a double-word operation.
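
The semantics of such an instruction can be sketched in Python as follows. Several details are assumptions made for illustration and are not stated in the abstract: the operation is a signed 32-bit addition, the result is written back to the first register, the branch target is PC-relative (`pc + imm`), instructions are 4 bytes long, and the destination register is left unchanged on overflow. The function name `add_branch_on_overflow` is hypothetical.

```python
def add_branch_on_overflow(regs, first, second, imm, pc, width=32):
    """Add two registers; return the next program counter.

    On signed overflow or underflow, control transfers to pc + imm
    (assumed PC-relative) and no result is written (assumption).
    """
    lowest = -(1 << (width - 1))
    highest = (1 << (width - 1)) - 1
    total = regs[first] + regs[second]
    if total < lowest or total > highest:
        return pc + imm          # change of control operation
    regs[first] = total          # assumed destination: the first register
    return pc + 4                # assumed fixed 4-byte instruction length


regs = {"r1": 2**31 - 1, "r2": 1}
print(add_branch_on_overflow(regs, "r1", "r2", imm=64, pc=100))  # 164

regs = {"r1": 1, "r2": 2}
print(add_branch_on_overflow(regs, "r1", "r2", imm=64, pc=100))  # 104
print(regs["r1"])                                                # 3
```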

Processor and control method of processor for address generating and address displacement

A processor includes: an address generating unit that, when an instruction decoded by a decoding unit is an instruction to execute arithmetic processing on a plurality of operand sets each including a plurality of operands that are objects of the arithmetic processing, in parallel a plurality of times, generates an address set corresponding to each of the operand sets of the arithmetic processing for each time, based on a certain address displacement with respect to the plurality of operands included in each of the operand sets; a plurality of instruction queues that hold the generated address sets corresponding to the respective operand sets, in correspondence to respective processing units; and a plurality of processing units that perform the arithmetic processing in parallel on the operand sets obtained based on the respective address sets outputted by the plurality of instruction queues.
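
One possible reading of the address generation step is sketched below in Python: each repetition of the arithmetic processing gets its own address set, formed by offsetting every operand's base address by a fixed displacement. The function name `address_sets`, the per-operand base addresses, and the use of a single shared displacement are assumptions for illustration only; the queues and parallel processing units are not modeled.

```python
def address_sets(operand_bases, displacement, times):
    """Return one address set per repetition.

    Repetition t uses base + t * displacement for every operand
    (assumed interpretation of "a certain address displacement").
    """
    return [[base + t * displacement for base in operand_bases]
            for t in range(times)]


for addr_set in address_sets([0x100, 0x200], displacement=8, times=3):
    print([hex(a) for a in addr_set])
# ['0x100', '0x200']
# ['0x108', '0x208']
# ['0x110', '0x210']
```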

Core for a data processing engine in an integrated circuit

An example core for a data processing engine (DPE) includes a register file and a processor coupled to the register file. The processor includes a multiply-accumulate (MAC) circuit, and permute circuitry coupled between the register file and the MAC circuit, the permute circuitry configured to concatenate at least one pair of outputs of the register file to provide at least one input to the MAC circuit. The core further includes an instruction decoder, coupled to the processor, configured to decode a very large instruction word (VLIW) to set a plurality of parameters of the processor, the plurality of parameters including first parameters of the permute circuitry and second parameters of the MAC circuit.

Hardware accelerators and methods for high-performance authenticated encryption

Methods and apparatuses relating to high-performance authenticated encryption are described. A hardware accelerator may include a vector register to store an input vector of a round of an encryption operation; a circuit including a first data path including a first modular adder coupled to a first input from the vector register and a second input from the vector register, and a second modular adder coupled to the first modular adder and a second data path from the vector register, and the second data path including a first logical XOR circuit coupled to the second input and a third data path from the vector register, a first rotate circuit coupled to the first logical XOR circuit, a second logical XOR circuit coupled to the first rotate circuit and the third data path, and a second rotate circuit coupled to the second logical XOR circuit; and a control circuit to cause the first modular adder and the second modular adder of the first data path and the first logical XOR circuit, the second logical XOR circuit, the first rotate circuit, and the second rotate circuit of the second data path to perform a portion of the round according to one or more control values, and store a first result from the first data path for the portion and a second result from the second data path for the portion into the vector register.