G06F9/3009

DISTRIBUTED GEOMETRY

Systems, apparatuses, and methods for performing geometry work in parallel on multiple chiplets are disclosed. A system includes a chiplet processor with multiple chiplets for performing graphics work in parallel. Instead of relying on a central distributor to distribute work to the individual chiplets, each chiplet determines on its own which work to perform. For example, during a draw call, each chiplet calculates which portions of one or more index buffer(s), corresponding to one or more graphics object(s) of the draw call, it will fetch and process. Once the portions are calculated, each chiplet fetches the corresponding indices and processes them. The chiplets perform these tasks in parallel and independently of each other. Once the index buffer(s) have been processed, the chiplets perform one or more subsequent step(s) of the graphics rendering process in parallel.
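
For illustration only, the following is a minimal C++ sketch of the distribution idea, with software threads standing in for chiplets; the partitioning scheme, structure names, and triangle-based split are assumptions made for the sketch, not the patent's actual hardware behavior.

    // Each "chiplet" is modeled as a thread that independently computes which
    // slice of the index buffer it owns, then fetches and processes only that
    // slice. There is no central distributor; the split depends only on the
    // chiplet ID and the total chiplet count (all names are illustrative).
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct DrawCall {
        const std::uint32_t* indices;  // index buffer of the draw call
        std::size_t          count;    // number of indices (multiple of 3 for triangles)
    };

    // Each chiplet derives its own range from its ID.
    void chipletProcess(int chipletId, int numChiplets, const DrawCall& dc) {
        const std::size_t tris        = dc.count / 3;
        const std::size_t trisPerChip = (tris + numChiplets - 1) / numChiplets;
        const std::size_t firstTri    = chipletId * trisPerChip;
        const std::size_t lastTri     = std::min(tris, firstTri + trisPerChip);

        for (std::size_t t = firstTri; t < lastTri; ++t) {
            // "Fetch" the three indices of this triangle and run geometry work on them.
            std::uint32_t i0 = dc.indices[3 * t + 0];
            std::uint32_t i1 = dc.indices[3 * t + 1];
            std::uint32_t i2 = dc.indices[3 * t + 2];
            std::printf("chiplet %d: triangle %zu -> %u %u %u\n", chipletId, t, i0, i1, i2);
        }
    }

    int main() {
        std::vector<std::uint32_t> ib = {0,1,2, 2,1,3, 3,1,4, 4,1,5};
        DrawCall dc{ib.data(), ib.size()};
        const int numChiplets = 2;

        std::vector<std::thread> chiplets;
        for (int id = 0; id < numChiplets; ++id)
            chiplets.emplace_back(chipletProcess, id, numChiplets, dc);
        for (auto& c : chiplets) c.join();
    }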

IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD, AND STORAGE MEDIUM
20230101232 · 2023-03-30 ·

An image processing apparatus capable of communicating with an information processing apparatus includes a reading unit that reads an image of a document and generates image data, a transmission unit that transmits the image data, and a reception unit that receives an instruction from the information processing apparatus. Based on the received instruction, the reading unit reads the image of the document and generates the image data, and the transmission unit transmits the generated image data to the information processing apparatus. The image data transmitted by the transmission unit is posted to a thread specified by a user via a terminal apparatus connected to the information processing apparatus, the thread being one of at least one thread included in a channel of a chat service provided by the information processing apparatus.

SYNCHRONIZING SCHEDULING TASKS WITH ATOMIC ALU
20230033355 · 2023-02-02 ·

A method of synchronizing a group of scheduled tasks within a parallel processing unit into a known state is described. The method uses a synchronization instruction in a scheduled task; in response to decoding the instruction, the instruction decoder places the scheduled task into a non-active state and forwards the decoded synchronization instruction to an atomic ALU for execution. When the atomic ALU executes the decoded synchronization instruction, it performs an operation and a check on data assigned to the group ID of the scheduled task, and if the check passes, all scheduled tasks having that group ID are removed from the non-active state.
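
As a rough illustration, the "operation and check" can be modeled in software as an atomic per-group counter: each task decrements the counter and becomes non-active, and when the check passes (the counter reaches zero) every task in the group is released. The class and member names below are invented for the sketch and are not the patent's hardware.

    #include <atomic>
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct GroupSync {
        std::atomic<int>        remaining;   // data assigned to this group ID
        std::mutex              m;
        std::condition_variable cv;
        explicit GroupSync(int n) : remaining(n) {}

        // Corresponds to executing the decoded synchronization instruction.
        void sync() {
            if (remaining.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                std::lock_guard<std::mutex> lk(m);   // avoid a missed wakeup
                cv.notify_all();                     // check passed: release the group
            } else {
                // Task stays non-active (blocked) until the whole group is released.
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return remaining.load(std::memory_order_acquire) <= 0; });
            }
        }
    };

    int main() {
        const int groupSize = 4;
        GroupSync group(groupSize);
        std::vector<std::thread> tasks;
        for (int i = 0; i < groupSize; ++i)
            tasks.emplace_back([&, i] {
                std::printf("task %d reached sync point\n", i);
                group.sync();
                std::printf("task %d resumed\n", i);
            });
        for (auto& t : tasks) t.join();
    }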

Coprocessor Register Renaming

A coprocessor with register renaming is disclosed. An apparatus includes a plurality of processors and a coprocessor respectively configured to execute processor instructions and coprocessor instructions. The coprocessor receives coprocessor instructions from ones of the processors. The coprocessor includes an array of processing elements and a result register set comprising storage elements respectively distributed within the array of processing elements. For a given member of the array of processing elements, a corresponding storage element is configured to store coprocessor instruction results generated by the given member. The result register set implements a plurality of contexts to store respective coprocessor states corresponding to coprocessor instructions received from different processors. Based on a determination that one of the contexts is inactive, the coprocessor is configured to store coprocessor instruction results corresponding to an active context within storage elements of the result register set corresponding to the inactive context.
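
The renaming idea can be sketched loosely in software: a result register set with one storage element per processing element and per context, where an inactive context's slots may hold extra results for the active context. Everything below (sizes, names, the useRenamedSlot flag) is an assumption made for illustration, not the disclosed hardware.

    #include <array>
    #include <cstdio>
    #include <optional>

    constexpr int kNumPEs      = 4;  // processing elements in the array
    constexpr int kNumContexts = 2;  // contexts, one per requesting processor

    struct ResultRegisterSet {
        // storage[ctx][pe]: storage element distributed within processing element `pe`
        std::array<std::array<float, kNumPEs>, kNumContexts> storage{};
        std::array<bool, kNumContexts> active{};

        // Write a result produced by processing element `pe` for context `ctx`.
        // If another context is inactive, its storage element may be reused
        // (renamed) to hold an additional result for the active context.
        void writeResult(int ctx, int pe, float value, bool useRenamedSlot) {
            if (useRenamedSlot) {
                if (std::optional<int> victim = findInactiveContext(ctx)) {
                    storage[*victim][pe] = value;   // renamed physical slot
                    std::printf("ctx %d result from PE %d stored in ctx %d's slot\n",
                                ctx, pe, *victim);
                    return;
                }
            }
            storage[ctx][pe] = value;               // architectural slot
        }

        std::optional<int> findInactiveContext(int exclude) const {
            for (int c = 0; c < kNumContexts; ++c)
                if (c != exclude && !active[c]) return c;
            return std::nullopt;
        }
    };

    int main() {
        ResultRegisterSet rrs;
        rrs.active[0] = true;    // context 0 is executing coprocessor instructions
        rrs.active[1] = false;   // context 1 has no in-flight instructions
        rrs.writeResult(/*ctx=*/0, /*pe=*/2, 3.14f, /*useRenamedSlot=*/true);
    }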

CIRCUITRY AND METHODS FOR ACCELERATING STREAMING DATA-TRANSFORMATION OPERATIONS
20230100586 · 2023-03-30 ·

Systems, methods, and apparatuses for accelerating streaming data-transformation operations are described. In one example, a system on a chip (SoC) includes a hardware processor core and an accelerator circuit coupled to the core. The processor core comprises a decoder circuit to decode an instruction comprising an opcode into a decoded instruction, and an execution circuit to execute the decoded instruction according to the opcode; the opcode indicates that the execution circuit is to generate a single descriptor and cause it to be sent to the accelerator circuit. The accelerator circuit comprises a work dispatcher circuit and one or more work execution circuits that respond to the single descriptor as follows: when a field of the single descriptor is a first value, the work dispatcher circuit sends a single job to a single work execution circuit to perform an operation indicated in the single descriptor and generate an output; when the field is a second, different value, the work dispatcher circuit sends a plurality of jobs to the one or more work execution circuits to perform the operation indicated in the single descriptor and generate the output as a single stream.
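
A software analogue of the descriptor-driven dispatch might look like the sketch below; the descriptor layout, the Split field, and the byte-transform "operation" are illustrative assumptions, not the actual accelerator interface.

    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    enum class Split : int { Single = 0, Multiple = 1 };

    struct Descriptor {
        Split              split;   // the "field" selecting single vs. multiple jobs
        const char*        input;
        std::size_t        length;
        std::vector<char>* output;  // the single output stream
    };

    // One "work execution circuit": here it just upper-cases a byte range.
    static void executeJob(const char* in, std::size_t len, std::vector<char>& out) {
        for (std::size_t i = 0; i < len; ++i)
            out.push_back(in[i] >= 'a' && in[i] <= 'z' ? static_cast<char>(in[i] - 32) : in[i]);
    }

    // The "work dispatcher circuit".
    static void dispatch(const Descriptor& d, int numExecUnits) {
        if (d.split == Split::Single) {
            executeJob(d.input, d.length, *d.output);                 // one job, one unit
        } else {
            std::size_t chunk = (d.length + numExecUnits - 1) / numExecUnits;
            for (std::size_t off = 0; off < d.length; off += chunk)   // several jobs,
                executeJob(d.input + off,                             // one result stream
                           off + chunk < d.length ? chunk : d.length - off,
                           *d.output);
        }
    }

    int main() {
        const char* msg = "streaming data transformation";
        std::vector<char> out;
        Descriptor d{Split::Multiple, msg, std::strlen(msg), &out};
        dispatch(d, 4);
        std::printf("%.*s\n", (int)out.size(), out.data());
    }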

DYNAMIC WORKLOAD DISTRIBUTION FOR DATA PROCESSING
20230030808 · 2023-02-02 ·

A computer-implemented method, according to one embodiment, includes receiving a data process that includes a plurality of sub-processes. A unique subset of the sub-processes is assigned to each of a managing thread and at least one other thread. Moreover, the performance characteristics of each of the threads are evaluated while the respective subsets of sub-processes are being performed, and a determination is made as to whether the performance characteristics of each thread are substantially equal to those of the other threads. In response to determining that the performance characteristics of the threads are not substantially equal, the subsets of the sub-processes are dynamically adjusted such that the performance characteristics of the threads become more equal. The adjusted subsets of the sub-processes are then reassigned to the managing thread and the at least one other thread.
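
One possible software reading of the rebalancing step is sketched below: after a measurement interval, sub-processes are redistributed in proportion to each thread's observed rate. The proportional policy and the structure names are assumptions for illustration, not the claimed method.

    #include <cstdio>
    #include <vector>

    struct ThreadState {
        std::vector<int> subProcesses;  // IDs of assigned sub-processes
        double           itemsPerSec;   // measured performance over the last interval
    };

    // Reassign sub-processes in proportion to each thread's measured rate.
    void rebalance(std::vector<ThreadState>& threads) {
        std::vector<int> all;
        double totalRate = 0.0;
        for (auto& t : threads) {
            all.insert(all.end(), t.subProcesses.begin(), t.subProcesses.end());
            t.subProcesses.clear();
            totalRate += t.itemsPerSec;
        }
        std::size_t next = 0;
        for (std::size_t i = 0; i < threads.size(); ++i) {
            double share = threads[i].itemsPerSec / totalRate;
            std::size_t want = (i + 1 == threads.size())
                                   ? all.size() - next                        // remainder
                                   : static_cast<std::size_t>(share * all.size());
            for (std::size_t k = 0; k < want && next < all.size(); ++k)
                threads[i].subProcesses.push_back(all[next++]);
        }
    }

    int main() {
        // The managing thread (index 0) turned out twice as fast as the worker.
        std::vector<ThreadState> threads = {
            {{0, 1, 2, 3, 4}, 200.0},
            {{5, 6, 7, 8, 9}, 100.0},
        };
        rebalance(threads);
        for (std::size_t i = 0; i < threads.size(); ++i)
            std::printf("thread %zu now owns %zu sub-processes\n",
                        i, threads[i].subProcesses.size());
    }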

Multi-Thread Synchronization Method and Electronic Device
20220350602 · 2022-11-03 ·

A multi-thread synchronization method includes a first thread requesting to obtain a target lock. The first thread then checks the lock thread identifier field. When the lock thread identifier field identifies a valid thread that is not the first thread, the first thread checks the blocked thread quantity field. When the blocked thread quantity field is less than a first threshold, the first thread performs a spin wait. When the number of spin-wait iterations reaches a second threshold and the lock thread identifier field still identifies a valid thread that is not the first thread, the first thread adds 1 to the blocked thread quantity field and suspends, entering a blocked state.
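
A plausible software rendering of this lock protocol is sketched below; the threshold values, the wake-up path, and all identifiers are assumptions, since the abstract does not specify them.

    #include <atomic>
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    constexpr int kInvalidTid = -1;
    constexpr int kBlockedMax = 2;     // "first threshold" on blocked waiters
    constexpr int kSpinLimit  = 1000;  // "second threshold" on spin iterations

    struct TargetLock {
        std::atomic<int>        owner{kInvalidTid};  // lock thread identifier field
        std::atomic<int>        blocked{0};          // blocked thread quantity field
        std::mutex              m;
        std::condition_variable cv;

        void lock(int tid) {
            for (;;) {
                int expected = kInvalidTid;
                if (owner.compare_exchange_strong(expected, tid)) return;  // acquired

                // Owner is a valid thread other than us: spin only while few waiters are blocked.
                int spins = 0;
                while (blocked.load() < kBlockedMax && spins < kSpinLimit &&
                       owner.load() != kInvalidTid)
                    ++spins;

                if (owner.load() == kInvalidTid) continue;  // lock freed, retry the acquire

                // Spin budget exhausted while the lock is still held: count ourselves and block.
                blocked.fetch_add(1);
                {
                    std::unique_lock<std::mutex> lk(m);
                    cv.wait(lk, [this] { return owner.load() == kInvalidTid; });
                }
                blocked.fetch_sub(1);
            }
        }

        void unlock() {
            owner.store(kInvalidTid);
            std::lock_guard<std::mutex> lk(m);  // serialize with waiters about to block
            cv.notify_one();
        }
    };

    int main() {
        TargetLock lk;
        int counter = 0;
        auto work = [&](int tid) {
            for (int i = 0; i < 10000; ++i) { lk.lock(tid); ++counter; lk.unlock(); }
        };
        std::thread t1(work, 1), t2(work, 2), t3(work, 3);
        t1.join(); t2.join(); t3.join();
        std::printf("counter = %d\n", counter);   // 30000 if mutual exclusion held
    }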

Thread Creation on Local or Remote Compute Elements by a Multi-Threaded, Self-Scheduling Processor
20230091432 · 2023-03-23 ·

Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor that also provides additional functionality. Representative embodiments include a self-scheduling processor comprising: a processor core adapted to execute a received instruction; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In another embodiment, the core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, to reserve a predetermined amount of memory space in a thread control memory to store return arguments, and to generate one or more work descriptor data packets for transmission to another processor or hybrid threading fabric circuit for execution of a corresponding plurality of execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.
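
Loosely modeled in software, a work descriptor packet can be a small record naming the entry point, an argument, and a reserved return slot, with a core control loop that self-schedules each received packet; the sketch below is purely illustrative and invents all names.

    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <vector>

    struct WorkDescriptor {
        std::function<int(int)> entry;   // thread entry point
        int                     arg;     // packed argument
        int*                    retSlot; // reserved space for the return argument
    };

    struct Core {
        std::deque<WorkDescriptor> inbox;  // packets received from other cores

        // Core control loop: self-schedules each received descriptor.
        void run() {
            while (!inbox.empty()) {
                WorkDescriptor wd = inbox.front();
                inbox.pop_front();
                *wd.retSlot = wd.entry(wd.arg);   // execute the thread to completion
            }
        }
    };

    // "Fiber create": reserve return-argument storage and generate one work
    // descriptor packet per created thread, addressed to the target core.
    void fiberCreate(Core& target, std::function<int(int)> entry,
                     int count, std::vector<int>& returnArgs) {
        returnArgs.assign(count, 0);               // reserved "thread control memory"
        for (int i = 0; i < count; ++i)
            target.inbox.push_back({entry, i, &returnArgs[i]});
    }

    int main() {
        Core remote;
        std::vector<int> returnArgs;
        fiberCreate(remote, [](int x) { return x * x; }, 4, returnArgs);
        remote.run();
        for (int r : returnArgs) std::printf("%d ", r);   // 0 1 4 9
        std::printf("\n");
    }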

DATA PROCESSING SYSTEM, DATA TRANSFER DEVICE, AND CONTEXT SWITCHING METHOD
20230090585 · 2023-03-23 ·

A processing section executes processes concerning a plurality of applications in a time-division manner. A CSDMA engine detects a switching timing of the application to be executed in the processing section. On detecting the switching timing, the CSDMA engine saves the context of the application that is being executed, from the processing section to a main memory, and installs the context of the application to be executed next, from the main memory to the processing section, without going through a process by software managing the plurality of applications.
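
As an illustration of the save/restore step only, the CSDMA engine can be modeled as a copy routine that moves the outgoing context to main memory and the incoming context back into the processing section, with no call into managing software; the sizes and names below are assumptions.

    #include <array>
    #include <cstdio>
    #include <cstring>

    constexpr std::size_t kContextBytes = 64;

    struct ProcessingSection {
        std::array<unsigned char, kContextBytes> registers{};  // live context
    };

    struct MainMemory {
        // One saved context image per application.
        std::array<std::array<unsigned char, kContextBytes>, 4> savedContexts{};
    };

    // CSDMA engine: save the outgoing context and install the incoming one.
    void contextSwitchDma(ProcessingSection& ps, MainMemory& mem,
                          int outgoingApp, int incomingApp) {
        std::memcpy(mem.savedContexts[outgoingApp].data(), ps.registers.data(), kContextBytes);
        std::memcpy(ps.registers.data(), mem.savedContexts[incomingApp].data(), kContextBytes);
    }

    int main() {
        ProcessingSection ps;
        MainMemory mem;
        ps.registers[0] = 0xA1;             // state produced by app 0
        mem.savedContexts[1][0] = 0xB2;     // previously saved state of app 1

        contextSwitchDma(ps, mem, /*outgoingApp=*/0, /*incomingApp=*/1);
        std::printf("installed context byte: 0x%02X\n", ps.registers[0]);  // 0xB2
    }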

PARALLEL PROCESSING OF THREAD GROUPS

Apparatuses, systems, and techniques to facilitate parallel processing. In at least one embodiment, an application programming interface allows a user to define a plurality of cooperative thread groups and to launch multiple cooperative thread groups in parallel, provided sufficient processing resources are available.
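
A host-side toy model of such an API is sketched below: the caller defines several cooperative groups and the launch either runs them all concurrently or is rejected when resources are insufficient. This is an invented model for illustration, not the actual programming interface.

    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <vector>

    struct CooperativeGroup {
        int                      threadsNeeded;
        std::function<void(int)> body;          // receives the group index
    };

    // Launch every group in parallel, provided resources suffice for all of them.
    bool launchGroupsInParallel(const std::vector<CooperativeGroup>& groups,
                                int availableThreads) {
        int required = 0;
        for (const auto& g : groups) required += g.threadsNeeded;
        if (required > availableThreads) return false;   // not enough resources

        std::vector<std::thread> workers;
        for (std::size_t i = 0; i < groups.size(); ++i)
            workers.emplace_back(groups[i].body, static_cast<int>(i));
        for (auto& w : workers) w.join();
        return true;
    }

    int main() {
        std::vector<CooperativeGroup> groups = {
            {64, [](int g) { std::printf("group %d running\n", g); }},
            {64, [](int g) { std::printf("group %d running\n", g); }},
        };
        bool ok = launchGroupsInParallel(groups, /*availableThreads=*/256);
        std::printf("launch %s\n", ok ? "succeeded" : "rejected");
    }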