G06F9/3889

Extended memory operations

Systems, apparatuses, and methods related to extended memory operations are described. Extended memory operations can include operations specified by a single address and operand, and may be performed by a computing device that includes a processing unit and a memory resource. The computing device can perform extended memory operations on data streamed through it without receipt of intervening commands. In an example, a computing device is configured to receive a command to perform an operation on data with its processing unit and to determine that an operand corresponding to the operation is stored in the memory resource. The computing device can then perform the operation using the operand stored in the memory resource.
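
The abstract describes a device that applies a single-command operation to streamed data using an operand it already holds locally. A minimal sketch in Python, with all names illustrative rather than taken from the patent:

    # Hypothetical model of an extended memory operation: one command carries
    # an address, and the device reuses the operand stored at that address in
    # its local memory resource for every element streamed through it.
    class ComputingDevice:
        def __init__(self):
            self.memory_resource = {}  # operand storage local to the device

        def store_operand(self, address, operand):
            self.memory_resource[address] = operand

        def extended_op(self, address, data_stream, op):
            # Determine that the operand is already stored in the local
            # memory resource, so no intervening commands are needed.
            operand = self.memory_resource[address]
            # Perform the operation on each streamed element.
            return [op(value, operand) for value in data_stream]

    device = ComputingDevice()
    device.store_operand(0x10, 3)  # operand resident before the stream starts
    print(device.extended_op(0x10, [1, 2, 4], lambda d, o: d * o))  # [3, 6, 12]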

Scheduling method and related apparatus

Disclosed are a scheduling method and a related apparatus. A computing apparatus in a server can be selected to execute a computation request, thereby improving the running efficiency of the server.
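
The abstract leaves the selection policy open; a minimal sketch assuming a simple least-loaded policy, with hypothetical names:

    def schedule(request, apparatuses):
        # Pick the computing apparatus with the lowest current load, so the
        # server's overall running efficiency improves.
        chosen = min(apparatuses, key=lambda a: a["load"])
        chosen["load"] += request["cost"]
        return chosen["name"]

    server = [{"name": "cpu0", "load": 5}, {"name": "npu0", "load": 2}]
    print(schedule({"cost": 3}, server))  # npu0: the least-loaded apparatus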

Queues for inter-pipeline data hazard avoidance

Methods and parallel processing units for avoiding inter-pipeline data hazards identified at compile time. For each identified inter-pipeline data hazard, the primary instruction and secondary instruction(s) thereof are identified as such and are linked by a counter used to track that hazard. When a primary instruction is output by the instruction decoder for execution, the value of the counter associated therewith is adjusted to indicate that there is a hazard related to the primary instruction; when the primary instruction has been resolved by one of multiple parallel processing pipelines, the counter value is adjusted to indicate that the hazard has been resolved. When a secondary instruction is output by the decoder for execution, it is stalled in a queue associated with the appropriate instruction pipeline if at least one counter associated with the primary instructions on which it depends indicates that there is a hazard.
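
A minimal sketch of the counter mechanism described above: issuing a primary instruction raises the counter linked to its hazard, resolution lowers it, and a secondary instruction stalls in its pipeline queue while any counter it depends on is nonzero. The data structures are illustrative:

    from collections import deque

    counters = {}  # hazard id -> count of unresolved primary instructions

    def issue_primary(hazard_id):
        counters[hazard_id] = counters.get(hazard_id, 0) + 1  # hazard pending

    def resolve_primary(hazard_id):
        counters[hazard_id] -= 1  # a pipeline resolved the primary instruction

    def try_issue_secondary(instr, queue):
        # Stall if at least one counter of a primary instruction this
        # secondary instruction depends on still indicates a hazard.
        if any(counters.get(h, 0) > 0 for h in instr["depends_on"]):
            queue.append(instr)
            return False
        return True

    pipe_queue = deque()
    issue_primary("h0")
    print(try_issue_secondary({"depends_on": ["h0"]}, pipe_queue))  # False: stalled
    resolve_primary("h0")
    print(try_issue_secondary({"depends_on": ["h0"]}, pipe_queue))  # True: issues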

ROUTING INSTRUCTIONS IN A MICROPROCESSOR

A computer system, processor, programming instructions, and/or method for balancing the workload of processing pipelines. An execution slice comprises at least two processing pipelines having one or more execution units for processing instructions, wherein at least a first processing pipeline and a second processing pipeline are capable of executing a first instruction type, and an instruction decode unit decodes instructions to determine which of the first or the second processing pipeline should execute the first instruction type. The processor is configured to calculate at least one of: the first processing pipeline workload, the second processing pipeline workload, and combinations thereof; and to select the first or the second processing pipeline to execute the first instruction type based upon at least one of those workloads.
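
The claim leaves the workload metric open; a minimal sketch assuming pending-queue depth as the metric, with both pipelines able to execute the instruction type:

    def route(instruction, pipeline_a, pipeline_b):
        # Calculate each pipeline's workload (here: pending-queue depth)
        # and send the instruction to the lighter one.
        target = pipeline_a if len(pipeline_a) <= len(pipeline_b) else pipeline_b
        target.append(instruction)
        return target

    pipe0, pipe1 = ["mul", "add"], ["add"]
    route("fma", pipe0, pipe1)
    print(pipe1)  # ['add', 'fma']: pipe1 had the smaller workload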

Thread associated memory allocation and memory architecture aware allocation
11520633 · 2022-12-06

A method and system for thread-aware, class-aware, and topology-aware memory allocations. Embodiments include a compiler configured to generate compiled code (e.g., for a runtime) that, when executed, allocates memory on a per-class, per-thread basis in a way that is aware of the system topology (e.g., a non-uniform memory architecture (NUMA)). Embodiments can further include an executable configured to allocate a respective memory pool during runtime for each instance of a class for each thread. The memory pools are local to the respective processor, core, etc., where each thread executes.
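
A minimal sketch of per-class, per-thread pooling using thread-local storage; actual NUMA placement (pinning each pool to the memory node of the thread's processor) is hardware-specific and only noted in the comments:

    import threading

    _local = threading.local()

    def pool_for(cls):
        # One free-list pool per class, per calling thread. A topology-aware
        # allocator would also place the pool on the NUMA node local to the
        # processor where this thread executes.
        pools = getattr(_local, "pools", None)
        if pools is None:
            pools = _local.pools = {}
        return pools.setdefault(cls, [])

    class Node:
        pass

    def alloc(cls):
        pool = pool_for(cls)
        return pool.pop() if pool else cls()  # reuse from this thread's pool

    def free(obj):
        pool_for(type(obj)).append(obj)  # return to this thread's pool

    n = alloc(Node)
    free(n)
    print(alloc(Node) is n)  # True: object reused from the thread's Node pool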

PARALLEL PROCESSING IN A SPIKING NEURAL NETWORK
20220383080 · 2022-12-01

The disclosed embodiments are related to storing critical data in a memory device such as a Flash or DRAM memory device. In one embodiment, a device comprising a plurality of parallel processors is disclosed, the plurality of parallel processors configured to: perform a search-and-match operation that loads a plurality of synaptic identifier bit strings and a plurality of spike identifier bit strings and generates a plurality of bitmasks; perform a synaptic integration phase that generates a plurality of synaptic current vectors based on the plurality of bitmasks, the synaptic current vectors associated with respective synthetic neurons; solve a neural membrane equation for each of the synthetic neurons; and update membrane potentials associated with the synthetic neurons, the membrane potentials being stored in a memory device.
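
A minimal sketch of the four phases, using plain integers for the identifier bit strings and a simple leaky-integrate update for the membrane equation (the abstract does not give the actual equation):

    def search_and_match(synapse_ids, spike_ids):
        # Phase 1: match synaptic identifiers against arriving spike
        # identifiers, producing one hit/miss bitmask entry per synapse.
        spikes = set(spike_ids)
        return [1 if s in spikes else 0 for s in synapse_ids]

    def synaptic_integration(bitmask, weights):
        # Phase 2: accumulate the weights of matched synapses into a
        # synaptic current for the neuron.
        return sum(w for hit, w in zip(bitmask, weights) if hit)

    def update_membrane(v, current, leak=0.9):
        # Phases 3-4: solve the membrane equation and return the updated
        # membrane potential to be written back to memory.
        return leak * v + current

    mask = search_and_match([7, 3, 9], [3, 9])              # [0, 1, 1]
    current = synaptic_integration(mask, [0.5, 0.2, 0.3])   # 0.5
    print(update_membrane(v=1.0, current=current))          # 1.4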

Error detection using vector processing circuitry

A data processing apparatus (2) has scalar processing circuitry (32-42) and vector processing circuitry (38, 40, 42). When executing main scalar processing on the scalar processing circuitry, or main vector processing using a subset of a plurality of lanes on the vector processing circuitry, checker processing is executed using at least one lane of the plurality of lanes, the checker processing comprising operations corresponding to at least part of the main scalar/vector processing. Errors can then be detected by comparing an outcome of the main processing with an outcome of the checker processing. This provides a technique for achieving functional safety in a high-end processor with better performance and lower hardware cost than a dual/triple-core lockstep approach.
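
A minimal sketch of the main/checker comparison; a deliberately injected fault stands in for a hardware error, and the lane separation is only modeled as a second function:

    def main_processing(x):
        return x * 2 + 1

    def checker_processing(x, inject_fault=False):
        # Operations corresponding to the main processing, which a real
        # design would run on a spare vector lane.
        result = x * 2 + 1
        return result + 1 if inject_fault else result

    def error_detected(x, inject_fault=False):
        # Compare the outcome of the main processing with the outcome of
        # the checker processing.
        return main_processing(x) != checker_processing(x, inject_fault)

    print(error_detected(5))                     # False: outcomes agree
    print(error_detected(5, inject_fault=True))  # True: mismatch flagged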

SYSTEM AND METHOD FOR HALTING PROCESSING CORES IN A MULTICORE SYSTEM

A multicore computing device includes a memory and a processor coupled to the memory. The processor includes plural cores and a multiple-input multiple-output (MIMO) block coupled to the cores. The MIMO block receives a halt request from a first core, transmits a core-halt request to one or more cores other than the first core to halt their execution, and permits the first core to lock a shared resource.
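
A minimal sketch of the halt handshake, modeling the MIMO block as a broadcaster of halt flags and the shared resource as a lock; the interface names are hypothetical:

    import threading

    class MimoBlock:
        def __init__(self, num_cores):
            self.halt_flags = [threading.Event() for _ in range(num_cores)]

        def request_halt(self, requester):
            # Broadcast a core-halt request to every core except the one
            # that asked for the halt.
            for core_id, flag in enumerate(self.halt_flags):
                if core_id != requester:
                    flag.set()

    shared_resource = threading.Lock()
    mimo = MimoBlock(num_cores=4)

    mimo.request_halt(requester=0)  # core 0 asks the other cores to halt
    with shared_resource:           # core 0 may now lock the shared resource
        print([f.is_set() for f in mimo.halt_flags])  # [False, True, True, True]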

MANAGING RETURN PARAMETER ALLOCATION
20230058935 · 2023-02-23

A hybrid threading processor (HTP) supports thread creation by executing an instruction that indicates an amount of storage space to reserve for return values. Before a thread is created, the indicated amount of space is reserved. The newly created child thread sends a return packet back to the parent thread when it completes: the child writes its return information into the reserved space and waits for the parent thread to execute a thread-join instruction. The thread-join instruction takes the returned information from the reserved space and transfers it to the parent thread's register state. The reserved space is released once the child thread is joined. Using a configurable amount of space for each child thread may allow more child threads to execute simultaneously.
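
A minimal sketch of the reserve/return/join sequence, with a dictionary standing in for the reserved storage and a list for the parent's register state; packet formats are not modeled:

    reserved = {}  # child thread id -> reserved return-value slots

    def thread_create(child_id, return_slots):
        # Reserve the indicated amount of space before the child runs.
        reserved[child_id] = [None] * return_slots
        return child_id

    def child_return(child_id, values):
        # The child writes its return information into the reserved space.
        reserved[child_id][:len(values)] = values

    def thread_join(child_id, parent_registers):
        # Join transfers the returned information to the parent's register
        # state and releases the reserved space for other child threads.
        parent_registers.extend(reserved.pop(child_id))

    regs = []
    tid = thread_create(child_id=1, return_slots=2)
    child_return(tid, [42, 7])
    thread_join(tid, regs)
    print(regs, reserved)  # [42, 7] {}: space released after the join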

MASKING FOR COARSE GRAINED RECONFIGURABLE ARCHITECTURE
20230058355 · 2023-02-23

A reconfigurable compute fabric can include multiple nodes, and each node can include multiple tiles with respective processing and storage elements. The tiles can be arranged in an array or grid and can be communicatively coupled. In an example, a first node can include a tile cluster of N memory-compute tiles, and the N memory-compute tiles can be coupled using a first portion of a synchronous compute fabric. Operations performed by the respective processing and storage elements of the N memory-compute tiles can be selectively enabled or disabled based on information in a mask field of data propagated through the first portion of the synchronous compute fabric.
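
A minimal sketch of mask-gated execution: data moving through the fabric carries a mask field, and each tile in the cluster performs its operation only when its mask bit is set. The tile behavior here is illustrative:

    def propagate(data, mask, tile_ops):
        # Each tile is selectively enabled or disabled by one bit of the
        # mask field carried with the propagated data.
        for tile_id, op in enumerate(tile_ops):
            if mask & (1 << tile_id):
                data = op(data)  # tile enabled: apply its operation
            # tile disabled: data passes through unchanged
        return data

    tiles = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    print(propagate(10, mask=0b101, tile_ops=tiles))  # tiles 0 and 2 run -> 8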