SYSTEM AND METHOD FOR HIGH THROUGHPUT IN MULTIPLE COMPUTATIONS
20210191729 · 2021-06-24
CPC classification
G06F13/4022; G06F2212/621; G06F8/441; G06T1/20; G06F12/0806 (all section G, PHYSICS)
Abstract
Device, circuit and method are configured to enhance throughput when processing vast amounts of data, such as a video stream. In some embodiments, frequently used data blocks are stored in a fast RAM of the processor. In other embodiments, a received stream of data is divided into a plurality of data portions that are streamed concurrently to the streaming multiprocessors of a graphics processing unit (GPU) and processed concurrently before the entire stream is loaded.
Claims
1. A method for enhancing graphical data throughput exchanged between graphical data source and a graphical processing unit (GPU) via a streaming multiprocessor unit that comprises a processing core unit (PCU), a register file unit, multiple cache units, shared memory unit, unified cache unit and interface cache unit, the method comprising: transferring stream of graphical data via interface cache unit and via the multiple cache units and via the unified cache unit to the register file unit; transferring a second stream of graphical data from the register file unit to the processing core unit (PCU); and storing and receiving frequently used portions of data in shared memory unit, via register file unit.
2. The method of claim 1 wherein the register file unit is configured to direct data processed by the PCU to the shared memory unit as long as it is capable of receiving more data, based on the level of frequent use of that data.
3. The method of claim 2 wherein the level of frequent use is determined by the PCU.
4. A streaming multiprocessor unit for enhancing graphical data throughput comprising: a processing core unit (PCU) configured to process graphical data; a register file unit, configured to provide graphical data from the PCU and to receive and temporarily store processed graphical data from the PCU; multiple cache units, configured to provide graphical data from the register file unit and to receive and temporarily store processed graphical data from the register file unit; shared memory unit configured to provide graphical data from the register file unit and to receive and temporarily store processed graphical data from the register file unit; unified cache unit configured to provide graphical data from the register file unit and to receive and temporarily store processed graphical data from the register file unit; and interface cache unit, configured to receive graphical data for graphical processing at high pace, to provide the graphical data to at least one of shared memory unit and unified cache unit, to receive processed graphical data from the unified cache unit, and to provide the processed graphical data to external processing units.
5. The streaming multiprocessor unit of claim 4 wherein at least some of the graphical data elements are stored, before and/or after processing by the PCU in the shared memory unit, based on a priority figure that is associated with the probability of their close call by the PCU.
6. The streaming multiprocessor unit of claim 5 wherein the priority figure is higher as the probability is higher.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
[0019]
[0020]
[0021]
[0022]
[0023] It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
DETAILED DESCRIPTION OF THE INVENTION
[0024] In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
[0025] The bottleneck of CPU-GPU mutual operation in known computing systems lies mostly in the data transfer channels that the CPU uses to direct graphics-related data to the GPU and to receive the processed graphical data back from the GPU. Typically, the CPU and the GPU processors operate and communicate in standard computing environments.
[0026] Reference is made to
[0027] GPU unit 150 typically comprises GPU DRAM unit 154, which interfaces data between unit 112 and the GPU processors; GPU cache units 156 (such as L2 cache units), which are adapted to cache data for the GPU processing units; and GPU processing units 158 (such as streaming multiprocessors, SMs).
[0028] The flow of graphical data that enters computing unit 100 and is intended to be processed by GPU 150 is described by data flow (DF) arrows. The first data flow, DF1, depicts the flow of data into computing unit 100. CPU 111 directs the flow (DF2) via peripheral controlling unit (PCU) 112 to DRAM 111A, and back from it (DF3) via PCU 112 (DF4) to GPU 150. At GPU 150 the data passes through DRAM unit 154 and through cache units 156 to the plurality of streaming multiprocessor (SM) units 158, where graphical processing takes place.
[0029] Methods and structures according to the present invention aim to eliminate as many data flow bottlenecks as possible.
[0030] Reference is made now to
[0031] One way of reducing data transfer time is to minimize redundant data transfers. For example, intermediate results calculated by core 210 may be stored in register file 220 instead of in the DRAM. Further, shared memory 240 may be used for storing data that is frequently used within SM 200, instead of circulating it outbound, as is commonly done. In some embodiments the level of frequency of use is determined by the PCU. Still further, constant memory units and/or cache memory units may be defined in SM 200.
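The idea of keeping frequently used data in shared memory can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the circuit described here: the class name `SharedMemorySketch`, the block-count `capacity`, the use of a simple per-block use counter as the "level of frequency of use", and the rule of evicting the least-used resident block are all hypothetical choices made for illustration.

```python
from collections import Counter


class SharedMemorySketch:
    """Toy model: keep the most frequently used blocks in a small fast store."""

    def __init__(self, capacity):
        self.capacity = capacity    # number of blocks the fast store holds (assumed)
        self.use_count = Counter()  # per-block use counter (assumed frequency measure)
        self.resident = set()       # block ids currently held in the fast store
        self.slow_fetches = 0       # accesses that had to go to slower memory

    def access(self, block_id):
        """Record one use of block_id and report where it was served from."""
        self.use_count[block_id] += 1
        if block_id in self.resident:
            return "shared"
        self.slow_fetches += 1
        if len(self.resident) < self.capacity:
            self.resident.add(block_id)
        else:
            # Replace the least-used resident block only if the new block
            # has been used strictly more often (assumed policy).
            coldest = min(self.resident, key=lambda b: self.use_count[b])
            if self.use_count[block_id] > self.use_count[coldest]:
                self.resident.remove(coldest)
                self.resident.add(block_id)
        return "slow"


def run_trace(trace, capacity):
    """Replay a sequence of block ids against a fresh fast store."""
    mem = SharedMemorySketch(capacity)
    for block_id in trace:
        mem.access(block_id)
    return mem
```

Replaying a trace such as `run_trace(list("abacabaccc"), 2)` shows the effect: blocks that keep recurring end up resident, and later uses of them no longer count as slow fetches.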
[0032] According to further embodiments of the present invention, the data flow bottleneck between the CPU computing environment and the GPU computing environment may be reduced or eliminated by replacing the CPU with a specifically structured computing unit for all handling of graphics-related data.
[0033] Reference is made now to
[0034] In an exemplary embodiment, UPDHU 300 comprises a Multi Streamer unit (MSU) 310 that may comprise a DSDU 304 comprising an array 304A of a plurality of first-in-first-out (FIFO) registers/storage units (the FIFO units are not shown separately), of which one FIFO unit may be assigned to each of the SMs 318 of GPU 320. In some embodiments the UPD stream received by DSDU 304 may be partitioned into multiple data units, which may be transferred to GPU 320 via FIFO units 304A and broadcast to the GPU over an interface unit, such as an AXI interface, such that the data unit in each FIFO 304A is transferred to the associated SM 318, thereby enabling, for example, single instruction, multiple data (SIMD) computing. Once an SM 318 of GPU 320 (even a single one) is loaded with its respective portion of the unprocessed data received from the associated FIFO 304A unit over the AXI interface, GPU 320 may start processing without having to wait until the entire UPD file is loaded.
[0035] MSU 310 may comprise unprocessed data interface unit 302, configured to receive long streams of graphical data. The large amount of unprocessed data received via interface unit 302 may be partitioned into a plurality of smaller data units, each to be transferred via an assigned FIFO unit in FIFO array 304A and then, over an AXI channel 315, via GPU AXI interface 316 to the assigned SM 318 of GPU 320.
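The partitioning of a received stream into data units and their assignment to one FIFO per SM can be sketched in Python as follows. This is only an illustrative software model: the function names, the fixed `unit_size`, and the round-robin assignment of units to FIFOs are assumptions, and the sequential loop in `drain` models only the order in which each queue is consumed, not true concurrent processing by the SMs.

```python
from collections import deque


def partition(stream, unit_size):
    """Split a byte stream into fixed-size data units (the last may be shorter)."""
    return [stream[i:i + unit_size] for i in range(0, len(stream), unit_size)]


def distribute(units, num_sms):
    """Assign data units round-robin to one FIFO per SM (assumed policy)."""
    fifos = [deque() for _ in range(num_sms)]
    for index, unit in enumerate(units):
        fifos[index % num_sms].append(unit)
    return fifos


def drain(fifos, process):
    """Consume the FIFOs in rounds; each SM takes the oldest unit in its own queue."""
    results = []
    while any(fifos):  # an empty deque is falsy
        for sm_id, fifo in enumerate(fifos):
            if fifo:
                results.append((sm_id, process(fifo.popleft())))
    return results
```

For example, `drain(distribute(partition(bytes(range(10)), 3), 2), sum)` hands the first unit to SM 0 and the second to SM 1, so each SM has work as soon as its own first unit arrives rather than after the whole stream has been received.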
[0036] Data units that were processed by the respective SM of SMs 318 may then be transferred back to the MSU over the AXI connection. As seen, the large overhead that is typical of CPU-GPU architectures is avoided in the embodiments described above.
[0037]
[0038] The above-described devices, structures and methods may accelerate the processing of large amounts of unprocessed data, compared to known architectures and methods. For example, in known embodiments the whole image must be transferred before the process/algorithm can start on the GPU. If the image size is 1 GB and the theoretical throughput of the PCI-E bus transferring data to the GPU is 32 GB/s, the latency is 1 GB/(32 GB/s) = 1/32 s = 31.25 ms. In contrast, with the FPGA according to an embodiment of the invention it is only necessary to fully load all SM units. For example, in the Tesla P100 GPU there are 56 SM units, and in each SM there are 64 cores that support 32 bit (in single precision mode) or 32 cores that support 64 bit (extended precision mode), thus the data size for a fully loaded GPU (same result for single or extended precision modes) is 56*32*64=114688 bits=14.336 Mbytes. The FPGA to GPU AXI stream theoretical throughput is 896 MB/s (for 56 lanes), so the latency is 14.336 MB/(896 MB/s) = 0.016 s = 16 ms, which is substantially half the latency.
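The latency comparison above can be reproduced with a short Python sketch. The helper name `transfer_latency_ms` and the decimal unit constants are assumptions made for illustration. A direct conversion of 114688 bits gives 14,336 bytes, so the payload size is kept as an explicit parameter; the calls below use the 14.336 MB and 896 MB/s figures exactly as stated in the text, and other sizes can be substituted.

```python
GB = 10**9  # decimal gigabyte (assumed unit convention)
MB = 10**6  # decimal megabyte (assumed unit convention)


def transfer_latency_ms(size_bytes, throughput_bytes_per_s):
    """Time to move size_bytes at a given throughput, in milliseconds."""
    return size_bytes / throughput_bytes_per_s * 1000.0


# Known approach: the whole 1 GB image crosses the PCI-E bus first.
pcie_ms = transfer_latency_ms(1 * GB, 32 * GB)

# Streaming approach, using the payload and throughput stated in the text.
axi_ms = transfer_latency_ms(14_336_000, 896 * MB)

# Bit count for a fully loaded GPU as given in the text: 56 SMs * 32 * 64.
payload_bits = 56 * 32 * 64
payload_bytes_from_bits = payload_bits // 8

print(f"PCI-E latency: {pcie_ms:.2f} ms")
print(f"AXI stream latency: {axi_ms:.2f} ms")
print(f"Ratio: {axi_ms / pcie_ms:.3f}")
```

With these inputs the ratio of the two latencies is about 0.51, which matches the "substantially half" comparison in the text.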
[0039] While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.