System and methods for the computation of the forward and inverse discrete periodic radon transform on GPUs and CPUs

10943322 ยท 2021-03-09

    Inventors

    Cpc classification

    International classification

    Abstract

    Fast and a scalable algorithms and methods adaptable to available resources for computing (1) the DPRT on multicore CPUs by distributing the computation of the DPRT primary directions among the different cores, and (2) the DPRT on GPUs using parallel, distributed, and synchronized ray computations among the GPU cores with ray referring to one of the sums required for computing the DPRT or its inverse along a prime direction.

    Claims

    1. A method for computing a Discrete Periodic Radon Transform (DPRT) on a Graphics Processing Unit (GPU) comprising the steps of: distributing a computation of prime directions among multiprocessors (MPs), wherein the distributed prime directions are each partitioned into threads; synchronizing parallel threads so that each thread always reads from a same row of pixels at the same time; enforcing row-major ordering for read and write operations for each thread; and applying one or more optimizations to each thread, wherein the one or more optimizations comprise a pixel width, an optimal concurrency of warps, a fast address calculation and a modulo operation masking.

    2. The method according to claim 1, wherein the computation for the Discrete Periodic Radon Transform (DPRT) using the GPU comprises the steps of: receiving an input image f of a size NN, wherein N is a prime number; partitioning a set of N+1 prime directions into sets of threads with each thread comprising consecutive prime directions; assigning each partitioned set of threads to different multiprocessors of the GPU, wherein each partitioned thread is a computation of a set of rays; executing in parallel by each multiprocessor the assigned partitioned set of threads to compute a DPRT along each ray for each multiprocessor, wherein rays that access the same row of pixels are synchronized; summing the DPRT computed by each multiprocessor; and outputting the summed DPRT.

    3. The method according to claim 2, wherein a single ray of the DPRT is computed using: R ( m , d ) = { .Math. j = 0 N - 1 f ( d , j ) , m = N , .Math. i = 0 N - 1 f ( i , .Math. d + mi .Math. N ) , 0 m < N . where R is the radon space of the DPRT, m=0, 1, N indexes a prime direction, d is the first pixel offset, and <>.sub.N is a positive remainder when integer division by N is performed, S=.sup.N1 .sub.j=0.sup.N1 .sub.j=0f(i,j), and the pair (m, d) indexes the computation of a single ray as a sum of pixels for a particular direction and offset.

    4. The method according to claim 2 extended for the computation of the Inverse Discrete Periodic Radon Transform (iDPRT).

    5. The method according to claim 4, wherein the computation for the Inverse Discrete Periodic Radon Transform (iDPRT) using the GPU comprises the steps of: receiving a DPRT size parameter N and a DPRT of size (N+1)N; partitioning a set of N prime directions into sets of threads with each thread comprising consecutive prime directions; computing a total sum of the inverse DPRT; assigning each partitioned set of threads to different multiprocessors of GPU, wherein each partitioned thread is a computation of a set of rays; executing in parallel by each multiprocessor the assigned partitioned set of threads to compute an inverse DPRT along each ray for each multiprocessor, wherein rays that access the same row of pixels are synchronized; reconstructing a final image using each inverse DPRT computed by each microprocessor and the total sum of the inverse DPRT; and outputting the final image.

    6. The method according to claim 5, wherein a ray of the Inverse Discrete Periodic Radon Transform (iDPRT) is: f ( i , j ) = 1 N [ .Math. m = 0 N - 1 R ( m , .Math. j - mi .Math. N ) - S + R ( N , i ) ] where f(i,j) is the inverse input image, R is the radon space of the DPRT, m=0,1, . . . ,N indexes a prime direction, <>.sub.N is a positive remainder when integer division by N is performed, S=.sup.N1 .sub.j=0.sup.N1 .sub.j=0f(i,j), and (i,j) denotes the index of a single ray of computations.

    7. The method according to claim 4 wherein the computation of Discrete Periodic Radon Transform (DPRT) and the computation of Inverse Discrete Periodic Radon Transform (iDPRT) are implemented in parallel.

    8. A method for computing a Discrete Periodic Radon Transform (DPRT) on a Graphics Processing Unit (GPU) comprising the steps of: distributing a computation of prime directions among multiprocessors (MPs), wherein the prime directions are partitioned into threads; receiving an input image f of a size NN, wherein N is a prime number; partitioning a set of N+1 prime directions into sets of threads with each thread comprising consecutive prime directions; assigning each partitioned set of threads to different multiprocessors of the GPU, wherein each partitioned thread is a computation of a set of rays; executing in parallel by each multiprocessor the assigned partitioned set of threads to compute a DPRT along each ray for each multiprocessor, wherein rays that access a same row of pixels are synchronized; summing the DPRT computed by each multiprocessor; and outputting the summed DPRT.

    9. The method according to claim 8 further comprising the steps of: synchronizing parallel threads so that each thread always reads from the same row of pixels at the same time; enforcing row-major ordering for read and write operations; and applying one or more optimizations comprising pixel width, optimal concurrency of warps, fast address calculation and modulo operation masking.

    10. The method according to claim 8, wherein a single ray of the DPRT is computed using: R ( m , d ) = { .Math. j = 0 N - 1 f ( d , j ) , m = N , .Math. i = 0 N - 1 f ( i , .Math. d + mi .Math. N ) , 0 m < N . where R is the radon space of the DPRT, m=0, 1, . . . , N indexes a prime direction, d is the first pixel offset, and <>.sub.N is a positive remainder when integer division by N is performed, S=.sup.N1 .sub.j=0.sup.N1 .sub.j=0f(i,j), and the pair (m, d) indexes the computation of a single ray as a sum of pixels for a particular direction and offset.

    11. The method according to claim 8 extended for the computation of the Inverse Discrete Periodic Radon Transform (iDPRT).

    12. The method according to claim 11 wherein the computation of Discrete Periodic Radon Transform (DPRT) and the computation of Inverse Discrete Periodic Radon Transform (iDPRT) are implemented in parallel.

    13. The method according to claim 11, wherein the computation for the Inverse Discrete Periodic Radon Transform (iDPRT) using the GPU comprises the steps of: receiving a DPRT size parameter N and a DPRT of size (N+1)N; partitioning a set of N prime directions into sets of threads with each thread comprising consecutive prime directions; computing a total sum of the inverse DPRT; assigning each partitioned set of threads to different multiprocessors of GPU, wherein each partitioned thread is a computation of a set of rays; executing in parallel by each multiprocessor the assigned partitioned set of threads to compute an inverse DPRT along each ray for each multiprocessor, wherein rays that access the same row of pixels are synchronized; reconstructing a final image using each inverse DPRT computed by each microprocessor and the total sum of the inverse DPRT; and outputting the final image.

    14. The method according to claim 11, wherein a ray of the Inverse Discrete Periodic Radon Transform (iDPRT) is: f ( i , j ) = 1 N [ .Math. m = 0 N - 1 R ( m , .Math. j - mi .Math. N ) - S + R ( N , i ) ] where f(i, j) is the inverse input image, R is the radon space of the DPRT, m=0, 1, . . . , N indexes a prime direction, <>.sub.N is a positive remainder when integer division by N is performed, S=.sup.N1 .sub.j=0.sup.N1 .sub.j=0f(i, j), and (.sub.0) denotes the index of a single ray of computations.

    Description

    DESCRIPTION OF THE DRAWING

    (1) The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an implementation of the invention and, together with the description, serve to explain the advantages and principles of the invention:

    (2) FIG. 1 illustrates a block diagram of an architecture for parallel processing using a CPU and a GPU according to the invention.

    (3) FIG. 2 illustrates the multi-core CPU algorithm for computing the DPRT according to the invention.

    (4) FIG. 3 illustrates the multi-core CPU algorithm for computing the inverse DPRT according to the invention.

    (5) FIG. 4 illustrates row-major memory accesses for synchronized, parallel thread execution for ray computations according to the invention.

    (6) FIG. 5 illustrates the algorithm for FastRayDPRT according to the invention.

    (7) FIG. 6 illustrates the algorithm for FastRayInvDPRT according to the invention.

    (8) FIG. 7 illustrates the algorithm for the kernel for each core on the GPU that supports thread-level parallelism according to the invention.

    (9) FIG. 8 illustrates the algorithm for the kernel for each core on the GPU that computes one ray of the inverse DPRT according to the invention.

    (10) FIG. 9 illustrates a chart of performance and resource comparisons according to the invention.

    (11) FIG. 10 illustrates a chart of specific numbers for speedups and running times for several image sizes according to the invention.

    (12) FIG. 11 illustrates a chart of maximum image sizes that achieve real-time video implementations according to the invention.

    DETAILED DESCRIPTION OF THE INVENTION

    (13) The invention is directed to fast and a scalable architectures and methods adaptable to available resources, that (1) computes DPRT on multicore CPUs by distributing the computation of the DPRT primary direction among the different cores, and (2) computes DPRT on GPUs using parallel, distributed, and synchronized ray computations among the GPU cores with ray referring to one of the sums required for computing the DPRT or its inverse along a prime direction.

    (14) The invention provides the following basic notation. Let f denote an image of NN rows with N is a prime number. It is essential that N is prime to allow the computation of the forward and inverse DPRT using just N+1 prime directions as well as providing for a well-defined inverse DPRT. However, an arbitrarily sized image can be zero-padded to a prime number.

    (15) The DPRT of f is defined as:

    (16) R ( m , d ) = { .Math. j = 0 N - 1 f ( d , j ) , m = N , .Math. i = 0 N - 1 f ( i , .Math. d + mi .Math. N ) , 0 m < N . Equation ( 1 ) Equation ( 2 )
    where R is the radon space of the DPRT, m=0, 1, . . . , N indexes a prime direction, d is the first pixel offset, and <>.sub.N is a positive remainder when integer division by N is performed.

    (17) To compute the inverse DPRT, the total sum is computed as follows:

    (18) S = .Math. j = 0 N - 1 .Math. i = 0 N - 1 f ( i , j ) = .Math. d = 0 d = N - 1 R ( m , d ) for any m = .Math. d = 0 d = N - 1 R ( 0 , d ) Equation ( 3 )

    (19) Using the above f is reconstructed from its DPRT using:

    (20) f ( i , j ) = 1 N [ .Math. m = 0 N - 1 F ( m , .Math. j - mi .Math. N ) - S + R ( N , i ) ] Equation ( 4 )

    (21) FIG. 1 illustrates a block diagram of an architecture for parallel processing using a CPU and a GPU according to the invention. Computations in the host system are performed by a multicore CPU. To draw a distinction between logical and hardware cores associated with the CPU, let M.sub.C refer to the number of actual hardware cores. An actual hardware core can be provided with special hardware support to execute multiple threads concurrently. The number of logical cores is given by M.sub.L and refers to the product of the actual number of hardware cores with the number of supported, possible threads per core. As an example, for an 8-core Intel CPU that can concurrently execute two threads per core provides M.sub.C=8 hardware cores and a total of M.sub.L=2M.sub.C=16 logical cores.

    (22) Execution is performed on the M.sub.C hardware cores following a MIMD model. Due to the hardware support for parallel execution of multiple threads on each hardware core, from the programming perspective, parallel threads are programmed for the logical cores. Each core can perform fixed-point or floating-point operations in a single clock cycle and is supported by the memory hierarchy (L1, L2, and L3 cache). In terms of speed, relatively fast access to L1 and L2 is assumed, and substantially slower access to L3.

    (23) As shown in FIG. 1, a graphics processing unit (GPU) is connected to the CPU via the peripheral interconnect bus (PCI). The GPU consists of M.sub.P Multiprocessors (MP), that share a common Cache that is also connected to device memory. Inside each multiprocessor, N.sub.P processors (cores) are assumed, each with its own set of registers. Thus, there is a total of M.sub.P.Math.N.sub.P cores. Within each multiprocessor, all of the cores share a local Fast SRAM. Each multiprocessor has an instruction unit that can dispatch the same instruction for parallel execution by all of the N.sub.P cores.

    (24) Programming the GPU requires special attention to the latencies associated with memory I/O. Within each core, register operations require a single clock cycle. Within each multiprocessor, access to the fast SRAM or cache L1 requires a few clock cycles (shown as T.sub.S in FIG. 1). Outside the multiprocessor, cache L2 memory access requires tens of clock cycles (shown as T.sub.C in FIG. 1). A cache miss requires access to the device memory that can cost hundreds of clock cycles (shown as T.sub.M in FIG. 1). Thus, for fast implementations, the goal is to load and operate on the data in the fast SRAM and/or the cache L1.

    (25) For programming the GPU, the Single Instruction Multiple Threads (SIMT) model is adopted. Within a multiprocessor, each core has access to a shared memory (fast SRAM). On the other hand, there is only one program memory from which a special control processor fetches and dispatches instructions. For each step, each core obtains from the control processor the same instruction and loads a separate data element through its private data access on which the instruction is performed. Thus, the instruction is synchronously applied in parallel by all cores to different data elements.

    (26) The computation of the DPRT and its inverse using primary directions on multi-core CPUs is now discussed. The number of cores in multi-core CPUs tends to be relatively small compared to the number of primary directions. As a result, the computation of several directions are assigned to each logical core. More specifically, for multi-core CPUs each thread is assigned with the computation of a set of prime directions of the DPRT or the inverse DPRT. For the DPRT, a single direction is associated with the computation of R(m, d) using Equation (1) or Equation (2) for DPRT of f provided above with a fixed value of m (all d). Similarly, for the inverse DPRT, a single direction is associated with the computation of f(i,f) using Equation (4) above for reconstructing f from its DPRT with a fixed value of i (all j).

    (27) Each thread is distributed to each logical core. For the DPRT, P=(N+1)/M.sub.L or P1 directions are assigned per logical core with ext=P.Math.M.sub.LN representing the extra directions that are not needed. Then, P1 directions are assigned to M.sub.L1 logical cores and the remaining logical core gets assigned to execute the remaining directions. Similarly, for the inverse DPRT, the N directions are divided up using P=N/M.sub.L.

    (28) The multi-core CPU algorithms for computing the DPRT and the inverse DPRT are shown in FIG. 2 and FIG. 3, respectively. More specifically, the algorithm for FastDirDPRT is given in FIG. 2. The threads are assigned as described above. For each thread, there is a call to DPRT_CPU_Core ( ) to compute Equations (1)-(2) for each direction that belongs in the corresponding thread set Dirns.sub.i. Similarly, FastInvDirDPRT calls iDPRT_DPRT_Core ( ) to compute Equation (4) for each direction in its corresponding thread set Dirns.sub.i. (see FIG. 3).

    (29) In terms of computational complexity, it needs to be determined how long it takes for the slowest thread to complete. The nested loops given in lines 6-12 of FIG. For FastDirDPRT, nested loops are given in lines 6-12. 15-23 of FIG. 2. Since the largest number of directions in Dirns.sub.i is P, the computational complexity is 0(P.Math.N.sup.2). Similarly, for FastInvDirDPRT, nested for-loops in lines 6-14 of FIG. 3 give a computational complexity of 0(P.Math.N.sup.2).

    (30) Turning to the GPU, parallel algorithms are provided for computing the forward and inverse DPRT. Unlike multi-core CPUs, GPUs offer a large number of cores. In fact, the number of GPU cores (M.sub.P.Math.N.sub.P) is often in the order of N, the number of columns (or rows) in the image. As discussed above, for fast memory I/O, the goal is to process the image data using Fast SRAM/Cache L1 associated with each multiprocessor (see FIG. 1). Unfortunately, Fast SRAM/Cache L1 memory may be limited leaving the big challenge in developing effective algorithms for GPUs that increase the level of parallelism while maintaining low requirements on the required amount of memory.

    (31) To increase the level of parallelism, each thread is assigned with the computation of a single ray of the DPRT or the inverse DPRT. Ray computation is defined as the computation of R(m, d) using Equation (1) and Equation (2) for fixed values of m and d. Similarly, for the inverse DPRT, ray computation is defined as the computation of f(i,j) using Equation (4) for fixed values of i and j. Furthermore, as for the case for multi-core CPUs, each multiprocessor is associated with the computation of a set of prime directions.

    (32) Next, we discuss some basic concepts with respect to parallel ray computations on many cores. A simple example of N=7 is used to introduce the most important concepts. For computing the DPRT, 8 prime directions are given by m=0, 1, 2, . . . , 7. The simplest case comes from m=0:
    R(0,d)=.sub.j=( ).sup.6f(d,j),d=0,1, . . . , 7Equation (5)

    (33) Now, to compute R(0, d) by launching a parallel thread for each value of d (each ray). Similarly, for more complex directions, we still have that the threads are accessing a single row. To see this, consider the computation of R(1, d) using 7 parallel threads to compute the 7 rays given by:

    (34) m = 1 , th 0 : R ( 1 , 0 ) = .Math. i = 0 6 f ( i , < 0 + i > 7 ) m = 1 , th 1 : R ( 1 , 1 ) = .Math. i = 0 6 f ( i , < 1 + i > 7 ) .Math. m = 1 , th 6 : R ( 1 , 6 ) = .Math. i = 0 6 f ( i , < 6 + i > 7 )

    (35) Here, note that each thread accesses a different column, but all of them access the same row when computing the i-th element of their corresponding ray. Thus, if f is stored in row-major format (as concatenated rows), each thread would access memory using a constant stride and if the threads are synchronized (all threads process the same i-th element of their corresponding ray) all threads access the same row memory as can be seen in FIG. 4B.

    (36) FIG. 4 illustrates row-major memory accesses for synchronized, parallel thread execution for ray computations according to the invention. The input image is assumed to be of size 77. For each step, a bold square highlights the pixel being accessed. For each thread, distinct grayscale shade is used to identify the pixels that need to be accessed. Row by row access for m=0 by 7 parallel threads is shown in FIG. 4A. Initially, all threads are accessing the top row (left image). The middle image shows memory access for i=1. The right image shows memory access for the last row. FIG. 4B shows row by row access for m=1 by 7 parallel threads. After accessing the top row (left image), execution proceeds to the second row (middle image). The last row is accessed for the final computations.

    (37) Assuming f is stored in row-major format in img, as shown in FIG. 4B, the parallel thread begins with accessing:
    th0: img[0][0],th1: img[0][1], . . . , th6: img[0][6],Equation (6)
    which represents the first row of pixels. Then, for i=1, the following is required:
    th0: img[1][1], . . . , th5: img[1][6],th6: img[1][0],Equation (7)
    where the last thread represents a wrap-around. It is important to note that the wrap-around forces the threads to stay on the same row when computing the i-th element. There is a potential for a cache-miss when accessing the data, especially for large images.

    (38) First, since all the threads are accessing the same row and the data is in the same neighborhood, one cache miss will force the fetching of that data for all of the parallel threads. Second, if a cache-miss does occur, thread execution stalls. In such a case, the scheduler initiates execution of the next set of threads while the data of the previous set is being accessed. In row-major format, there is regularity in such memory accesses.

    (39) More generally, for any direction m satisfying 0m<N, for each i, img[i][<0+mi>.sub.N] is required which accesses all of the elements in the row since N is prime. In fact, for any two consecutive directions, img [i][<0+mi>.sub.N] is simply a circularly-shifted version img [i][<0+(m+1)i>.sub.N] (compare Equation (6) and Equation (7)).

    (40) Thus, row-by-row access is followed within any single direction, for 0m<N. As long as the image is stored row-by-row in the Fast SRAM/Cache L1 there will not be a cache-miss. Since the entire row is needed, even if a cache-miss occurs, the result will be the fetching of a row-block that supports the needs of future threads in the same direction. For much smaller images, there will be no issue because the computation of the first direction will result in bringing the entire image in the Fast SRAM/Cache L1, and the rest of the directions will simply find it there.

    (41) The main components of the FastRayDPRT and FastRayInvDPRT include top level partition of the prime directions, launching of parallel threads for each MP and synchronize the threads.

    (42) FIG. 5 illustrates the algorithm for FastRayDPRT and FIG. 6 illustrates the algorithm for FastRayInvDPRT. The top level partition of the prime direction is shown at line 1 of FIG. 5 and line 1 in FIG. 6: N+1 directions are subdivided among the M.sub.P MPs (with M.sub.L=M.sub.P). This approach balances the processing load among the MPs.

    (43) As shown by line 2 in FIG. 5 and line 3 in FIG. 6, each MP launches in parallel the full amount of threads to process the whole set of assigned directions. Thus, if the number of directions assigned to the i-th MP is P.sub.i, then the total number of threads launched is N.Math.P.sub.i. Then, the number of threads assigned to each MP is maximized, allowing room for the scheduler to switch between threads when a stall occurs.

    (44) For synchronized threads, for 0m<N, the synchronized rays access the same row of pixels (e.g., see FIG. 4). For the FastRayDPRT described in FIG. 5, execution of the if-statement can break synchronicity. After the if-statement, synchronization is restored by including a special statement in line 13 as shown in FIG. 7. More specifically, FIG. 7 illustrates the algorithm for the kernel for each core on the GPU that supports thread-level parallelism according to the invention.

    (45) On the other hand, synchronized execution for FastRayInvDPRT does not require any special treatment (see FIG. 8). This rule along with the row-major format for storage minimizes dramatically the number of stalls in the processing, i.e., access to FastSRAM or Cache L1 dominates the parallel processing. More specifically, FIG. 8 illustrates the algorithm for the kernel for each core on the GPU that computes one ray of the inverse DPRT according to the invention.

    (46) Then, for each prime direction, within each multiprocessor, N.sub.P rays are processed in parallel until the completion of computations for all of the N rays. Then, the next prime direction is processed and so on until all of the prime directions are computed associated with Thr.sub.p for the p-th multiprocessor. After a core computes one ray, the result is directly stored in the device memory (see FIG. 1). There is no need to hold the results in the fast SRAM because there is no further use of that result and there is no chance of concurrent writes. During the process, the image f[i][j] is assumed to be on the device memory and the result is stored in the same memory.

    (47) The computation of the inverse DPRT is slightly different. First, the inverse DPRT only requires N prime directions (the horizontal one is not needed). Second, the final output per ray needs two additional additions and one division (line 11 in FIG. 8).

    (48) With the input image represented using B bits per pixel, exact additions require B+ log.sub.2 N for the DPRT outputs (for computing R[m][d]) due to the additions associated with each ray. Similarly, prior to the division (see line 11 in FIG. 6), R[m][d] requires b+ log.sub.2 N bits. Clearly, after the division, the reconstructed image only needs B bits for perfect reconstruction (as for the input). Thus, for 8-bit input images, 32-bit arithmetic (fixed point) will produce perfect reconstruction for images up to N<2.sup.12=4096.

    (49) Now discussed are the performance bounds for FastRayDPRT and FastRayInvDPRT, specifically the requirements on the number of processors for achieving different levels of performance. Noting that the DPRT computation requires the computation of sums along (N+1).Math.N rays, the number of rays is reduced to N.sup.2 for the inverse DPRT. Discussing the requirements for the DPRT, the extension of the results to the inverse DPRT is straight-forward.

    (50) Let p be the number of processors. For GPU implementations, p=M.sub.p.Math.N.sub.p where M.sub.p denotes the number of multiprocessors and N.sub.p denotes the number of cores per multiprocessor. Each processor is assigned to compute (N+1)N/p rays and possibly 1 less for some cores. Based on the basic model of computing a single ray per thread presented above, the following can be achieved: (1) linear execution time 0(N), (2) quadratic execution time 0(N.sup.2), and (3) cubic execution time 0(N.sup.3).

    (51) To support a wide range of comparisons, the algorithms were implemented for 170 prime sizes that ranged from 55 to 10211021. For the GPU implementation, results are provided for video images of 14711471, which represents the largest image size for which real-time video can be processed at 30 frames per second. In each case, the input image is built using random, 8-bit integers. As a result, exact arithmetic was possible which allowed the inverse DPRT to perfectly reconstruct f from R (its DPRT) using 32-bit fixed point arithmetic.

    (52) The implementation of FastDirDPRT and FastDirInvDPRT on the CPU (for example Xeon E5-2630) used 16 parallel threads that correspond to 8 physical cores with special support for running 2 threads in parallel. Each physical core has its own dedicated L1 and L2 caches. Collectively, all physical cores shared the L3 cache and an off-chip system memory. For parallel processing, both algorithms are implemented using POSIX threads (also called Pthreads) that were programmed in C. The image (or DPRT) is loaded in system memory and thus made available to each thread. The implementation used concurrent reads and no concurrent writes.

    (53) FastRayDPRT and FastRayInvDPRT on the GPU (for example, Maxwell architecture (GM204) provide efficient implementations despite the limited amount of fast memory. Current GPUs provide limited amounts of fast SRAM. However, for the ray-based algorithms (FastRayDPRT and FastRayInvDPRT), only requires storing the current row in the FastRam or the L1 cache. Thus, the FastRam of 48K allows running much larger images (N=12K), or N=6K using the L1 cache. On the other hand, as discussed above, for exact DPRT computations, the maximum image size is restricted to 40934093. Furthermore, real-time performance of 30 frames per second can be maintained for images of size 14711471.

    (54) The implementations also benefited from efficient memory I/O. First, sequential memory accesses are supported by memory transfers of 128 bytes that brought into the cache 32 pixels (4 bytes each) that could be used by the 32 threads in the current or later warps. Second, within each multiprocessor, 32 cores could perform block writes using row-major ordering with no conflicts (no concurrent writes). Third, the L1 cache provided is used by the GPU by compiling the code. These compile-time options force all reads to be cached provided that the input is in read-only mode, as is the case for the input image (or DPRT for inverse DPRT computation). Additionally, using cache L1 leverages the additional coding needed to use the Fast SRAM and provides the same performance.

    (55) The implementations were also optimized for warps. First, the prime directions were partitioned into the 16 multiprocessor sets by the scheduler by using proper thread enumeration, sequential launching, and distribution of the threads (in blocks) to execution capacity. Then, each multiprocessor further partitioned the blocks into warps. Each warp supported the parallel execution of 32 threads. Within each warp, there were no bank conflicts between the threads because FastRayDPRT and FastRayInvDPRT required different addresses as described above. When a warp stalled (e.g., requiring device memory access), the warp scheduler switched to another warp that was ready to execute. Thus, delays due to memory stalls are minimized. Furthermore, by launching all threads at the beginning, additional delays are avoided due to the large number of warps that are ready for execution. As a result, each multiprocessor is always occupied with the execution of the maximum (up to 32) number of blocks. Furthermore, the modulo operations were executed in parallel during a memory accesses so as to minimize their impact on the main kernel loop.

    (56) For the following discussion, it should be noted that p=2048 (total number of cores), f.sub.GPU=1367 MHz, and f.sub.CPU=3.2 GHz, with 8 hardware cores and 16 logical cores in the CPU. Additionally, the proposed algorithms are compared against FPGA implementations that process H=2 rows in parallel at a clock frequency of f.sub.CLK=100 MHz.

    (57) Turning first to the analysis of the performance achieved by the FastDirDPRT, specific numbers for speedups and running times for several image sizes in FIG. 10. As can be seen from FIG. 10, the maximum speedup is 12.83 for FastDirDPRT and 12.53 for fastDirInvDPRT for images of sizes 701701.

    (58) FIG. 9 illustrates a chart of performance and resource comparisons between FastDirDPRT, FastDirInvDPRT, FastRayDPRT, FastRayInvDPRT, FastScaleDPRT and FastScaleInvDPRT. In particular, the algorithms were implemented for 170 prime number cases that ranged from 55 to 10211021. In each case, the input image was built using random, 8-bit integers with exact computations using 32-bit fixed-point arithmetic. More specifically, FIG. 9A illustrates comparative running times for the DPRT, FIG. 9B illustrates comparative running times for the inverse DPRT, FIG. 9C illustrates speedup comparisons for the DPRT and FIG. 9D illustrates speedup comparisons for the inverse DPRT.

    (59) From FIG. 9C it can be shown that the multi-core FastDirDPRT begins to improve over the single-core FastDirDPRT for N>47. For N=47 to N=137, there is a roughly constant speedup increase to 10. There is a constant speedup around 10 for all measurements, until the maximum value of N=1021. Since 10>8, the speedup of the multi-core FastDirDPRT implementation exceeds the number of hardware cores. This speedup is achieved due to the hyper-threading used in the multi-core implementation. Overall, a 10 speedup (up to 12.83) demonstrates that the FastDirDPRT approximates very well what is possible with the number of cores (8 hardware and 16 logical cores).

    (60) FIG. 10 and FIG. 9 provide the results of the performance analysis and trends for the GPU implementation of FastRayDPRT and FastRayInvDPRT. As described above, for very small N(N<47), p=2048>44.Math.43=1892 (43 is the next prime less than 47), which implies that there are more GPU cores than rays to compute. As a result, as a function of N, the execution time is linear and the effective speedup is bounded above by (f.sub.GPU/f.sub.CPU).Math.N.Math.(N+1)=0.42.Math.N.Math.(N+1) for the DPRT and 0.42N.sup.2 for the inverse DPRT. For N47, all of the GPU cores are assigned at-least one or more rays to compute. Then, the ideal upper bound on the speed-up becomes (f.sub.GPU/f.sub.CPU).Math.p which gives 860. For N>167, speedup is around 600870 that indicates that FastRayDPRT scales very well. Here, it is noted that for N>79, the local cache of 24K cannot hold the entire image in fast SRAM. Thus, for N>79, there is a need to access slower memory. Access to slower memory is handled very efficiently as described above. From FIG. 9, the GPU implementations achieved the maximum speedups (853 for FastRayDPRT and 873 for FastRayInvDPRT) for the largest image size of 10211021. Overall, both the FastRayDPRT and FastRayInvDPRT implementations approximate ideal speedups for a wide range of image sizes.

    (61) The performance of the GPU implementations also approximates highly-efficient FPGA implementations using H.Math.N=2N parallel cores. It is noted that certain FPGA implementations avoid all I/O issues by moving entire rows or columns in a single clock cycle. Furthermore, all FPGA computations are also completed in a single clock cycle (at a much lower clock frequency). As provided by the results, fast GPU implementations of the FastRayDPRT and FastRayInvDRPT can be achieved without requiring custom-built hardware architectures.

    (62) With respect to real-time video processing applications, maximum image sizes for which the DPRT can be computed at 30 frames per second as shown in FIG. 11. Smaller image sizes to allow for additional processing time are also considered. As an example, real-time video processing of 10211021 videos at 30 fps using the FastRayDPRT and the FastRayInvDPRT requires a total of 22 ms per frame and allows 11 ms per frame for additional computations.

    (63) The invention provides new, parallel algorithms for computing the DPRT and its inverse on multi-core CPUs and GPUs. As provided above, the invention can differentiate between multi-core CPUs with a limited number of cores and access to reasonable memory resources as opposed to GPUs that exhibit a much larger number of cores with access to limited memory. For multi-core CPUs, FastDirDPRT and FastDirInvDPRT distribute the computation of the prime directions of the DPRT and its inverse using parallel threads over the logical cores. For GPUs, FastRayDPRT and FastRayInvDPRT minimize memory requirements to a single row of pixels and efficiently handle memory to compute the DPRT and its inverse using a large number of parallel threads that are distributed over the many cores.

    (64) The described embodiments are to be considered in all respects only as illustrative and not restrictive, and the scope of the invention is not limited to the foregoing description. Those of skill in the art may recognize changes, substitutions, adaptations and other modifications that may nonetheless come within the scope of the invention and range of the invention.