System and methods for the computation of the forward and inverse discrete periodic radon transform on GPUs and CPUs
10943322 ยท 2021-03-09
Inventors
- Marios Stephanou Pattichis (Albuquerque, NM)
- Cesar Carranza (Miami Beach, FL, US)
- Daniel Llamocca Obregon (Clawson, MI, US)
Cpc classification
G06T1/20
PHYSICS
G06F9/5066
PHYSICS
G06T11/006
PHYSICS
International classification
G06T1/20
PHYSICS
G06F9/30
PHYSICS
G06F9/50
PHYSICS
Abstract
Fast and a scalable algorithms and methods adaptable to available resources for computing (1) the DPRT on multicore CPUs by distributing the computation of the DPRT primary directions among the different cores, and (2) the DPRT on GPUs using parallel, distributed, and synchronized ray computations among the GPU cores with ray referring to one of the sums required for computing the DPRT or its inverse along a prime direction.
Claims
1. A method for computing a Discrete Periodic Radon Transform (DPRT) on a Graphics Processing Unit (GPU) comprising the steps of: distributing a computation of prime directions among multiprocessors (MPs), wherein the distributed prime directions are each partitioned into threads; synchronizing parallel threads so that each thread always reads from a same row of pixels at the same time; enforcing row-major ordering for read and write operations for each thread; and applying one or more optimizations to each thread, wherein the one or more optimizations comprise a pixel width, an optimal concurrency of warps, a fast address calculation and a modulo operation masking.
2. The method according to claim 1, wherein the computation for the Discrete Periodic Radon Transform (DPRT) using the GPU comprises the steps of: receiving an input image f of a size NN, wherein N is a prime number; partitioning a set of N+1 prime directions into sets of threads with each thread comprising consecutive prime directions; assigning each partitioned set of threads to different multiprocessors of the GPU, wherein each partitioned thread is a computation of a set of rays; executing in parallel by each multiprocessor the assigned partitioned set of threads to compute a DPRT along each ray for each multiprocessor, wherein rays that access the same row of pixels are synchronized; summing the DPRT computed by each multiprocessor; and outputting the summed DPRT.
3. The method according to claim 2, wherein a single ray of the DPRT is computed using:
4. The method according to claim 2 extended for the computation of the Inverse Discrete Periodic Radon Transform (iDPRT).
5. The method according to claim 4, wherein the computation for the Inverse Discrete Periodic Radon Transform (iDPRT) using the GPU comprises the steps of: receiving a DPRT size parameter N and a DPRT of size (N+1)N; partitioning a set of N prime directions into sets of threads with each thread comprising consecutive prime directions; computing a total sum of the inverse DPRT; assigning each partitioned set of threads to different multiprocessors of GPU, wherein each partitioned thread is a computation of a set of rays; executing in parallel by each multiprocessor the assigned partitioned set of threads to compute an inverse DPRT along each ray for each multiprocessor, wherein rays that access the same row of pixels are synchronized; reconstructing a final image using each inverse DPRT computed by each microprocessor and the total sum of the inverse DPRT; and outputting the final image.
6. The method according to claim 5, wherein a ray of the Inverse Discrete Periodic Radon Transform (iDPRT) is:
7. The method according to claim 4 wherein the computation of Discrete Periodic Radon Transform (DPRT) and the computation of Inverse Discrete Periodic Radon Transform (iDPRT) are implemented in parallel.
8. A method for computing a Discrete Periodic Radon Transform (DPRT) on a Graphics Processing Unit (GPU) comprising the steps of: distributing a computation of prime directions among multiprocessors (MPs), wherein the prime directions are partitioned into threads; receiving an input image f of a size NN, wherein N is a prime number; partitioning a set of N+1 prime directions into sets of threads with each thread comprising consecutive prime directions; assigning each partitioned set of threads to different multiprocessors of the GPU, wherein each partitioned thread is a computation of a set of rays; executing in parallel by each multiprocessor the assigned partitioned set of threads to compute a DPRT along each ray for each multiprocessor, wherein rays that access a same row of pixels are synchronized; summing the DPRT computed by each multiprocessor; and outputting the summed DPRT.
9. The method according to claim 8 further comprising the steps of: synchronizing parallel threads so that each thread always reads from the same row of pixels at the same time; enforcing row-major ordering for read and write operations; and applying one or more optimizations comprising pixel width, optimal concurrency of warps, fast address calculation and modulo operation masking.
10. The method according to claim 8, wherein a single ray of the DPRT is computed using:
11. The method according to claim 8 extended for the computation of the Inverse Discrete Periodic Radon Transform (iDPRT).
12. The method according to claim 11 wherein the computation of Discrete Periodic Radon Transform (DPRT) and the computation of Inverse Discrete Periodic Radon Transform (iDPRT) are implemented in parallel.
13. The method according to claim 11, wherein the computation for the Inverse Discrete Periodic Radon Transform (iDPRT) using the GPU comprises the steps of: receiving a DPRT size parameter N and a DPRT of size (N+1)N; partitioning a set of N prime directions into sets of threads with each thread comprising consecutive prime directions; computing a total sum of the inverse DPRT; assigning each partitioned set of threads to different multiprocessors of GPU, wherein each partitioned thread is a computation of a set of rays; executing in parallel by each multiprocessor the assigned partitioned set of threads to compute an inverse DPRT along each ray for each multiprocessor, wherein rays that access the same row of pixels are synchronized; reconstructing a final image using each inverse DPRT computed by each microprocessor and the total sum of the inverse DPRT; and outputting the final image.
14. The method according to claim 11, wherein a ray of the Inverse Discrete Periodic Radon Transform (iDPRT) is:
Description
DESCRIPTION OF THE DRAWING
(1) The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an implementation of the invention and, together with the description, serve to explain the advantages and principles of the invention:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
DETAILED DESCRIPTION OF THE INVENTION
(13) The invention is directed to fast and a scalable architectures and methods adaptable to available resources, that (1) computes DPRT on multicore CPUs by distributing the computation of the DPRT primary direction among the different cores, and (2) computes DPRT on GPUs using parallel, distributed, and synchronized ray computations among the GPU cores with ray referring to one of the sums required for computing the DPRT or its inverse along a prime direction.
(14) The invention provides the following basic notation. Let f denote an image of NN rows with N is a prime number. It is essential that N is prime to allow the computation of the forward and inverse DPRT using just N+1 prime directions as well as providing for a well-defined inverse DPRT. However, an arbitrarily sized image can be zero-padded to a prime number.
(15) The DPRT of f is defined as:
(16)
where R is the radon space of the DPRT, m=0, 1, . . . , N indexes a prime direction, d is the first pixel offset, and <>.sub.N is a positive remainder when integer division by N is performed.
(17) To compute the inverse DPRT, the total sum is computed as follows:
(18)
(19) Using the above f is reconstructed from its DPRT using:
(20)
(21)
(22) Execution is performed on the M.sub.C hardware cores following a MIMD model. Due to the hardware support for parallel execution of multiple threads on each hardware core, from the programming perspective, parallel threads are programmed for the logical cores. Each core can perform fixed-point or floating-point operations in a single clock cycle and is supported by the memory hierarchy (L1, L2, and L3 cache). In terms of speed, relatively fast access to L1 and L2 is assumed, and substantially slower access to L3.
(23) As shown in
(24) Programming the GPU requires special attention to the latencies associated with memory I/O. Within each core, register operations require a single clock cycle. Within each multiprocessor, access to the fast SRAM or cache L1 requires a few clock cycles (shown as T.sub.S in
(25) For programming the GPU, the Single Instruction Multiple Threads (SIMT) model is adopted. Within a multiprocessor, each core has access to a shared memory (fast SRAM). On the other hand, there is only one program memory from which a special control processor fetches and dispatches instructions. For each step, each core obtains from the control processor the same instruction and loads a separate data element through its private data access on which the instruction is performed. Thus, the instruction is synchronously applied in parallel by all cores to different data elements.
(26) The computation of the DPRT and its inverse using primary directions on multi-core CPUs is now discussed. The number of cores in multi-core CPUs tends to be relatively small compared to the number of primary directions. As a result, the computation of several directions are assigned to each logical core. More specifically, for multi-core CPUs each thread is assigned with the computation of a set of prime directions of the DPRT or the inverse DPRT. For the DPRT, a single direction is associated with the computation of R(m, d) using Equation (1) or Equation (2) for DPRT of f provided above with a fixed value of m (all d). Similarly, for the inverse DPRT, a single direction is associated with the computation of f(i,f) using Equation (4) above for reconstructing f from its DPRT with a fixed value of i (all j).
(27) Each thread is distributed to each logical core. For the DPRT, P=(N+1)/M.sub.L or P1 directions are assigned per logical core with ext=P.Math.M.sub.LN representing the extra directions that are not needed. Then, P1 directions are assigned to M.sub.L1 logical cores and the remaining logical core gets assigned to execute the remaining directions. Similarly, for the inverse DPRT, the N directions are divided up using P=N/M.sub.L.
(28) The multi-core CPU algorithms for computing the DPRT and the inverse DPRT are shown in
(29) In terms of computational complexity, it needs to be determined how long it takes for the slowest thread to complete. The nested loops given in lines 6-12 of FIG. For FastDirDPRT, nested loops are given in lines 6-12. 15-23 of
(30) Turning to the GPU, parallel algorithms are provided for computing the forward and inverse DPRT. Unlike multi-core CPUs, GPUs offer a large number of cores. In fact, the number of GPU cores (M.sub.P.Math.N.sub.P) is often in the order of N, the number of columns (or rows) in the image. As discussed above, for fast memory I/O, the goal is to process the image data using Fast SRAM/Cache L1 associated with each multiprocessor (see
(31) To increase the level of parallelism, each thread is assigned with the computation of a single ray of the DPRT or the inverse DPRT. Ray computation is defined as the computation of R(m, d) using Equation (1) and Equation (2) for fixed values of m and d. Similarly, for the inverse DPRT, ray computation is defined as the computation of f(i,j) using Equation (4) for fixed values of i and j. Furthermore, as for the case for multi-core CPUs, each multiprocessor is associated with the computation of a set of prime directions.
(32) Next, we discuss some basic concepts with respect to parallel ray computations on many cores. A simple example of N=7 is used to introduce the most important concepts. For computing the DPRT, 8 prime directions are given by m=0, 1, 2, . . . , 7. The simplest case comes from m=0:
R(0,d)=.sub.j=( ).sup.6f(d,j),d=0,1, . . . , 7Equation (5)
(33) Now, to compute R(0, d) by launching a parallel thread for each value of d (each ray). Similarly, for more complex directions, we still have that the threads are accessing a single row. To see this, consider the computation of R(1, d) using 7 parallel threads to compute the 7 rays given by:
(34)
(35) Here, note that each thread accesses a different column, but all of them access the same row when computing the i-th element of their corresponding ray. Thus, if f is stored in row-major format (as concatenated rows), each thread would access memory using a constant stride and if the threads are synchronized (all threads process the same i-th element of their corresponding ray) all threads access the same row memory as can be seen in
(36)
(37) Assuming f is stored in row-major format in img, as shown in
th0: img[0][0],th1: img[0][1], . . . , th6: img[0][6],Equation (6)
which represents the first row of pixels. Then, for i=1, the following is required:
th0: img[1][1], . . . , th5: img[1][6],th6: img[1][0],Equation (7)
where the last thread represents a wrap-around. It is important to note that the wrap-around forces the threads to stay on the same row when computing the i-th element. There is a potential for a cache-miss when accessing the data, especially for large images.
(38) First, since all the threads are accessing the same row and the data is in the same neighborhood, one cache miss will force the fetching of that data for all of the parallel threads. Second, if a cache-miss does occur, thread execution stalls. In such a case, the scheduler initiates execution of the next set of threads while the data of the previous set is being accessed. In row-major format, there is regularity in such memory accesses.
(39) More generally, for any direction m satisfying 0m<N, for each i, img[i][<0+mi>.sub.N] is required which accesses all of the elements in the row since N is prime. In fact, for any two consecutive directions, img [i][<0+mi>.sub.N] is simply a circularly-shifted version img [i][<0+(m+1)i>.sub.N] (compare Equation (6) and Equation (7)).
(40) Thus, row-by-row access is followed within any single direction, for 0m<N. As long as the image is stored row-by-row in the Fast SRAM/Cache L1 there will not be a cache-miss. Since the entire row is needed, even if a cache-miss occurs, the result will be the fetching of a row-block that supports the needs of future threads in the same direction. For much smaller images, there will be no issue because the computation of the first direction will result in bringing the entire image in the Fast SRAM/Cache L1, and the rest of the directions will simply find it there.
(41) The main components of the FastRayDPRT and FastRayInvDPRT include top level partition of the prime directions, launching of parallel threads for each MP and synchronize the threads.
(42)
(43) As shown by line 2 in
(44) For synchronized threads, for 0m<N, the synchronized rays access the same row of pixels (e.g., see
(45) On the other hand, synchronized execution for FastRayInvDPRT does not require any special treatment (see
(46) Then, for each prime direction, within each multiprocessor, N.sub.P rays are processed in parallel until the completion of computations for all of the N rays. Then, the next prime direction is processed and so on until all of the prime directions are computed associated with Thr.sub.p for the p-th multiprocessor. After a core computes one ray, the result is directly stored in the device memory (see
(47) The computation of the inverse DPRT is slightly different. First, the inverse DPRT only requires N prime directions (the horizontal one is not needed). Second, the final output per ray needs two additional additions and one division (line 11 in
(48) With the input image represented using B bits per pixel, exact additions require B+ log.sub.2 N for the DPRT outputs (for computing R[m][d]) due to the additions associated with each ray. Similarly, prior to the division (see line 11 in
(49) Now discussed are the performance bounds for FastRayDPRT and FastRayInvDPRT, specifically the requirements on the number of processors for achieving different levels of performance. Noting that the DPRT computation requires the computation of sums along (N+1).Math.N rays, the number of rays is reduced to N.sup.2 for the inverse DPRT. Discussing the requirements for the DPRT, the extension of the results to the inverse DPRT is straight-forward.
(50) Let p be the number of processors. For GPU implementations, p=M.sub.p.Math.N.sub.p where M.sub.p denotes the number of multiprocessors and N.sub.p denotes the number of cores per multiprocessor. Each processor is assigned to compute (N+1)N/p rays and possibly 1 less for some cores. Based on the basic model of computing a single ray per thread presented above, the following can be achieved: (1) linear execution time 0(N), (2) quadratic execution time 0(N.sup.2), and (3) cubic execution time 0(N.sup.3).
(51) To support a wide range of comparisons, the algorithms were implemented for 170 prime sizes that ranged from 55 to 10211021. For the GPU implementation, results are provided for video images of 14711471, which represents the largest image size for which real-time video can be processed at 30 frames per second. In each case, the input image is built using random, 8-bit integers. As a result, exact arithmetic was possible which allowed the inverse DPRT to perfectly reconstruct f from R (its DPRT) using 32-bit fixed point arithmetic.
(52) The implementation of FastDirDPRT and FastDirInvDPRT on the CPU (for example Xeon E5-2630) used 16 parallel threads that correspond to 8 physical cores with special support for running 2 threads in parallel. Each physical core has its own dedicated L1 and L2 caches. Collectively, all physical cores shared the L3 cache and an off-chip system memory. For parallel processing, both algorithms are implemented using POSIX threads (also called Pthreads) that were programmed in C. The image (or DPRT) is loaded in system memory and thus made available to each thread. The implementation used concurrent reads and no concurrent writes.
(53) FastRayDPRT and FastRayInvDPRT on the GPU (for example, Maxwell architecture (GM204) provide efficient implementations despite the limited amount of fast memory. Current GPUs provide limited amounts of fast SRAM. However, for the ray-based algorithms (FastRayDPRT and FastRayInvDPRT), only requires storing the current row in the FastRam or the L1 cache. Thus, the FastRam of 48K allows running much larger images (N=12K), or N=6K using the L1 cache. On the other hand, as discussed above, for exact DPRT computations, the maximum image size is restricted to 40934093. Furthermore, real-time performance of 30 frames per second can be maintained for images of size 14711471.
(54) The implementations also benefited from efficient memory I/O. First, sequential memory accesses are supported by memory transfers of 128 bytes that brought into the cache 32 pixels (4 bytes each) that could be used by the 32 threads in the current or later warps. Second, within each multiprocessor, 32 cores could perform block writes using row-major ordering with no conflicts (no concurrent writes). Third, the L1 cache provided is used by the GPU by compiling the code. These compile-time options force all reads to be cached provided that the input is in read-only mode, as is the case for the input image (or DPRT for inverse DPRT computation). Additionally, using cache L1 leverages the additional coding needed to use the Fast SRAM and provides the same performance.
(55) The implementations were also optimized for warps. First, the prime directions were partitioned into the 16 multiprocessor sets by the scheduler by using proper thread enumeration, sequential launching, and distribution of the threads (in blocks) to execution capacity. Then, each multiprocessor further partitioned the blocks into warps. Each warp supported the parallel execution of 32 threads. Within each warp, there were no bank conflicts between the threads because FastRayDPRT and FastRayInvDPRT required different addresses as described above. When a warp stalled (e.g., requiring device memory access), the warp scheduler switched to another warp that was ready to execute. Thus, delays due to memory stalls are minimized. Furthermore, by launching all threads at the beginning, additional delays are avoided due to the large number of warps that are ready for execution. As a result, each multiprocessor is always occupied with the execution of the maximum (up to 32) number of blocks. Furthermore, the modulo operations were executed in parallel during a memory accesses so as to minimize their impact on the main kernel loop.
(56) For the following discussion, it should be noted that p=2048 (total number of cores), f.sub.GPU=1367 MHz, and f.sub.CPU=3.2 GHz, with 8 hardware cores and 16 logical cores in the CPU. Additionally, the proposed algorithms are compared against FPGA implementations that process H=2 rows in parallel at a clock frequency of f.sub.CLK=100 MHz.
(57) Turning first to the analysis of the performance achieved by the FastDirDPRT, specific numbers for speedups and running times for several image sizes in
(58)
(59) From
(60)
(61) The performance of the GPU implementations also approximates highly-efficient FPGA implementations using H.Math.N=2N parallel cores. It is noted that certain FPGA implementations avoid all I/O issues by moving entire rows or columns in a single clock cycle. Furthermore, all FPGA computations are also completed in a single clock cycle (at a much lower clock frequency). As provided by the results, fast GPU implementations of the FastRayDPRT and FastRayInvDRPT can be achieved without requiring custom-built hardware architectures.
(62) With respect to real-time video processing applications, maximum image sizes for which the DPRT can be computed at 30 frames per second as shown in
(63) The invention provides new, parallel algorithms for computing the DPRT and its inverse on multi-core CPUs and GPUs. As provided above, the invention can differentiate between multi-core CPUs with a limited number of cores and access to reasonable memory resources as opposed to GPUs that exhibit a much larger number of cores with access to limited memory. For multi-core CPUs, FastDirDPRT and FastDirInvDPRT distribute the computation of the prime directions of the DPRT and its inverse using parallel threads over the logical cores. For GPUs, FastRayDPRT and FastRayInvDPRT minimize memory requirements to a single row of pixels and efficiently handle memory to compute the DPRT and its inverse using a large number of parallel threads that are distributed over the many cores.
(64) The described embodiments are to be considered in all respects only as illustrative and not restrictive, and the scope of the invention is not limited to the foregoing description. Those of skill in the art may recognize changes, substitutions, adaptations and other modifications that may nonetheless come within the scope of the invention and range of the invention.