System and methods for computing 2-D convolutions and cross-correlations

11593907 · 2023-02-28

    Inventors

    Cpc classification

    International classification

    Abstract

    Fast and scalable architectures and methods adaptable to available resources, that (1) compute 2-D convolutions using 1-D convolutions, (2) provide fast transposition and accumulation of results for computing fast cross-correlations or 2-D convolutions, and (3) provide parallel computations using pipelined 1-D convolvers. Additionally, fast and scalable architectures and methods that compute 2-D linear convolutions using Discrete Periodic Radon Transforms (DPRTs) including the use of scalable DPRT, Fast DPRT, and fast 1-D convolutions.

    Claims

    1. A method for fast and scalable architectures adaptable to available resources, that can be used to compute 2-D convolutions using 1-D convolutions comprising the steps of: providing an input image and a 2-D convolution kernel, wherein the input image comprises rows and columns of pixels and is partitioned into blocks and processed using an overlap-and-add approach or an overlap-and-save approach; decomposing the 2-D convolution kernel into a sum of separable 1-D kernels; applying the 1-D kernels along the rows and then the columns or vice-versa using pipelined 1-D convolvers that compute one output pixel per clock cycle; adding up results from the applying step; and producing a final image output based on the 2-D convolution kernel.

    2. The method according to claim 1, wherein the number of 1-D kernels is 2, one to be applied along the rows and another one to be applied along the columns, wherein the two 1-D kernels can be the same.

    3. The method according to claim 1, wherein the number of 1-D kernels is 4, two to be applied along the rows, and the other two to be applied along the columns, wherein some of the 1-D kernels can be the same.

    4. The method according to claim 1, wherein the 2-D decomposition of the 2-D convolution kernel and the number of separable 1-D kernels are computed using a combination of singular value decomposition (SVD) and lower-upper (LU) decompositions.

    5. The method according to claim 1 further comprising the steps of: allowing access, storage or accumulation of the results from a row or a column in a single clock cycle; and accessing all or a portion of the results using one or more rows or one or more columns.

    6. The method according to claim 1, wherein the number of 1-D kernels is 2, 4, 6, 8, up to maximum number needed to get a perfect reconstruction of the 2-D kernel using SVD, wherein some of the 1-D kernels can be the same.

    Description

    DESCRIPTION OF THE DRAWING

    (1) The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an implementation of the invention and, together with the description, serve to explain the advantages and principles of the invention:

    (2) FIG. 1 illustrates an architecture for computing the fast 1-D circular convolution according to the invention.

    (3) FIG. 2 illustrates the algorithm for computing the fast 1-D circular convolution according to the invention.

    (4) FIG. 3 illustrates the timing diagram for computing the fast 1-D circular convolution according to the invention.

    (5) FIG. 4A illustrates a partial view of a basic architecture for computing the fast and scalable 2-D linear convolutions and cross-correlations using the Discrete Periodic Radon Transform (DPRT), specifically FastConv and FastScaleConv according to the invention.

    (6) FIG. 4B illustrates a partial view of the basic architecture for computing the fast and scalable 2-D linear convolutions and cross-correlations using the Discrete Periodic Radon Transform (DPRT), specifically FastConv and FastScaleConv according to the invention.

    (7) FIG. 5 illustrates the algorithm for computing the fast and scalable 2-D linear convolutions and cross-correlations according to the invention.

    (8) FIG. 6 illustrates the timing diagram for computing circular convolutions using J parallel convolution blocks according to the invention.

    (9) FIG. 7 illustrates the timing diagram for computing (N+1) circular convolutions using the J parallel convolution blocks given in FIG. 6 according to the invention.

    (10) FIG. 8 illustrates the algorithm for computing the number of required adder tree resources as a function of the number of image samples (N) and bits per pixel (D) according to the invention. The algorithm gives the number of flip-flops in A.sub.ffb and the number of full adders in A.sub.FA.

    (11) FIG. 9 illustrates a comparison between the methods based on the invention: FastConv, FastXCorr (for cross-correlations), FastScaleConv, FastScaleXCorr (for cross-correlations), FastRankConv, and FastRankXCorr (for cross-correlations), and other methods. The input image and convolution blocks are of size P×P and the result image is of size N×N where N=2P−1 where P represents the input image and convolution kernel size. The parallel 1-D circular convolutions are denoted by n=┌log.sub.2N┐, p=┌log.sub.2P┐, J and H denotes the number of image rows that are processed in parallel by the DPRT. For S.sub.CAS.sub.YS, P needs to be a composite number and it is assumed to be given by P=P.sub.A.Math.P.sub.B. For FFT.sub.R2, D=2, 4 represents tehnubmer of 1-D FFT units running in parallel. It is defined that: (i) A.sub.ffb(a, b) is the number of required flip-flops inside the a-operand of b bits adder tree including input buffers, (ii) A.sub.ff( ) is the same number without accounting for input buffers, and (iii) A.sub.FA( ) is the equivalent number of 1-bit additions. As shown in FIG. 9, A.sub.ffb(.Math.), A.sub.ff(.Math.), A.sub.FA(.Math.) grow linearly as a function of N, and can be computed exactly using the algorithm of FIG. 8.

    (12) FIG. 10A illustrates a partial view of an architecture for computing fast and scalable 2-D convolutions using a sum of separable 1-D convolutions according to the invention (FastRankConv).

    (13) FIG. 10B illustrates a partial view of the architecture for computing fast and scalable 2-D convolutions using a sum of separable 1-D convolutions according to the invention (FastRankConv).

    (14) FIG. 11A illustrates a partial view of a custom SRAM architecture for fast transpositions and memory access according to the invention.

    (15) FIG. 11B illustrates a partial view of a custom SRAM architecture for fast transpositions and memory access according to the invention.

    (16) FIG. 12 illustrates a table of SRAM memory configurations for maximum accuracy according to the invention.

    (17) FIG. 13 illustrates a graph of a comparison of normalized execution times including FastRankConv, FastConv, and FastScaleConv according to the invention.

    (18) FIG. 14A illustrates a graph of multi-objective comparisons for 1-bit flip flops according to the invention for N=127 for 2-D convolutions between 64×64 blocks.

    (19) FIG. 14B illustrates a graph of multi-objective comparisons for 1-bit additions according to the invention for N=127 for 2-D convolutions between 64×64 blocks.

    (20) FIG. 14C illustrates a graph of multi-objective comparisons for multipliers according to the invention for N=127 for 2-D convolutions between 64×64 blocks.

    (21) FIG. 15 illustrates a chart of performance and resource comparisons according to the invention for N=127 for 2-D convolutions between 64×64 blocks.

    (22) FIG. 16 illustrates a chart of the full precision implementation of FastScaleConv and FastConv on FPGAs according to the invention. FastConv is shown for N=37. Each DSP represents a multiplier and each BRAM represents up to 36 Kbits of SRAM.

    (23) FIG. 17 illustrates a chart of the full precision and scalable implementation of FastRankConv (rank=2) on a Virtex-7 FPGA according to the invention.

    DETAILED DESCRIPTION OF THE INVENTION

    (24) The invention is directed to fast and scalable architectures and methods adaptable to available resources, that (1) compute 2-D convolutions using 1-D convolutions, (2) provide fast transposition and accumulation of results for computing fast cross-correlations or 2-D convolutions, (3) provide parallel computations using pipelined 1-D convolvers. Additionally, fast and scalable architectures and methods compute 2-D linear convolutions using a Discrete Periodic Radon Transforms (DPRTs) including the use of scalable DPRT, Fast DPRT, and fast 1-D convolutions.

    (25) The invention provides the following basic notation. Let g(i,j) denote an image of P.sub.1 rows with P.sub.2 pixels per row be of size P.sub.1×P.sub.2 with B bits per pixel. The image g(i,j) is indexed using 0≤i≤P.sub.1−1 and 0≤j≤P.sub.2−1. For the convolution kernels, the symbol h is used and assume a size of Q.sub.1×Q.sub.2 with C bits per pixel. f (i, j) is used for the output, with N.sub.1×N.sub.2 where N.sub.1=P.sub.1+Q.sub.1−1 and N.sub.2=P.sub.2+Q.sub.2−1. As described herein, N is simply used for the special case of N.sub.1=N.sub.2 and similarly, P is used for P.sub.1=P.sub.2.

    (26) To compute 1-D circular convolutions using circular shifts, F.sub.m(d), G.sub.m(d), H.sub.m(d) denote the DPRTs for f, g, h along the m-th prime direction. A special flip operation {hacek over (H)}.sub.m is defined by:
    {hacek over (H)}.sub.m(d)=H.sub.m(N−1−d), d≥0
    And the circular right-shift (CRS) by n using {hacek over (H)}.sub.m.sup.n that is defined by:
    {hacek over (H)}.sub.m.sup.n(d)=H.sub.m(custom characterd+ncustom character.sub.N)
    Then, the following shifted representation of the circular convolution is derived using:

    (27) F m ( d ) = .Math. k = 0 N - 1 G m ( k ) H m ( .Math. d + k .Math. N ) = .Math. k = 0 N - 1 G m ( k ) H m ( .Math. N - 1 - k + d + 1 .Math. N ) = .Math. k = 0 N - 1 G m ( k ) H m d + 1 ( N - 1 - k ) = .Math. k = 0 N - 1 G m ( k ) H m d + 1 ( k )
    As shown above, F.sub.m(d) can be expressed as the dot product between G.sub.m and a flipped and circular right-shifted by d+1 positions version of H.sub.m (denoted as {hacek over (H)}.sub.m.sup.d+1). A fast hardware implementation is derived. FIG. 1 illustrates an architecture for computing the fast 1-D circular convolution according to the invention. The 1-D circular convolution is expressed as F.sub.m=G.sub.m.Math.H.sub.m(d). As shown, B′ and C′ represent the number of input bits of H and G respectively.

    (28) FIG. 2 illustrates the algorithm for computing the fast 1-D circular convolution F.sub.m=G.sub.m.Math.H.sub.m(d). FIG. 3 illustrates the running time for the implementation of the fast architecture for computing one 1-D circular convolution. In this diagram, time increases to the right. The number of clock cycles for computing each term of F.sub.m(d) is shown on each strip. The strip on the right represents the total running time and n=┌log.sub.2N┐ represents the addition latency.

    (29) According to the fast computation of 1-D circular convolutions given in FIG. 2, parallel loads are used to transfer both of the DPRTs to the G and H registers in a single clock cycle. Note that flipping H.sub.m into {hacek over (H)}.sub.m is performed by simply wiring the inputs in reverse as shown in the upper register portion of FIG. 1. Starting with the last convolution output, there is a 3-step sequence of parallel multiplies, addition of the results, and a circular right-shift to prepare for the next output, as shown by lines 3-5 of FIG. 2. The multiplications are performed in parallel in a single cycle using the parallel fixed-point multipliers of FIG. 1 and added using a pipelined tree structure in just ┌log.sub.2(N)┐ clock cycles. The resulting outputs are left-shifted in, one output sample at a time, into the output F register shown in the lower-right portion of FIG. 1. A single cycle is also needed to perform the circular right-shift of H using the top-left register of FIG. 1. To derive the timing requirements, refer back to FIG. 3. Using a fully pipelined approach, the system and methods of the invention begin working on the next output sample after the parallel multiplies. It is easy to see that after the initial latency for the first sample, an output sample is computed at every cycle. After adding the latency for the initial loads, the adder latency, and the final left-shift, the total clock cycles is just N+┌log.sub.2(N)┐+2.

    (30) The invention provides fast and scalable 2-D linear convolutions and cross-correlations using the DPRT. Below is a discussion of the architectures, algorithms, bit requirements, and computational efficiency for 2-D convolutions and cross-correlations. Most importantly, the scalability of the proposed approach allows for the most efficient implementations based on available resources.

    (31) FIG. 4A and FIG. 4B illustrate an architecture for computing the fast and scalable 2-D linear convolutions and cross-correlations according to the invention. More specifically, the architecture shown provides a unifying architecture for implementing FastScaleConv, FastScaleXCorrr, FastConv, FastXCorr.

    (32) FIG. 5 illustrates the algorithm for computing the fast and scalable 2-D linear convolutions and cross-correlations between g(i,j) and h(i,j) using the architecture illustrated in FIG. 4A and FIG. 4B with L=┌(N+1)/J┐. In the most efficient implementation, the convolution kernel is available ahead of time. In this case, the DPRT of the kernel can be pre-computed and stored in memory as shown in the hardware architecture of FIG. 4A and FIG. 4B.

    (33) For adaptive filterbank applications, the DPRT of the zero-padded convolution kernel can be computed in real-time using the SFDPRT_System where the resulting DPRT is stored in (additional) memory. Alternatively, the SFDPRT_System component can be replicated for the kernel to avoid an increase of the running time. For computing cross-correlations, the vertical and horizontal flips associated with convolution need to be “undone” by flipping the kernel along rows and columns as described in FIG. 5. Here, the horizontal and vertical flips are performed by the SFDPRT_System component during the loading of the kernel.

    (34) An inverted MODE signal is used to control the SFDPRT_System to perform the needed flips. Vertical flips are implemented by switching the order of loading the rows. Thus, in a vertical flip, the last kernel row is loaded first and the first kernel row is loaded last. Horizontal flips are simply implemented by loading each row in reverse. Thus, in a horizontal flip, the last row element is loaded first and the first row element is loaded last. Overall, there is minimal overhead for implementing the horizontal and vertical flips.

    (35) Scalability is achieved by controlling (i) the number of 1-D circular convolutions that can be computed in parallel (denoted by J), and (ii) the number of image rows that can be processed in parallel in the DPRT blocks (denoted by H). Following the computation of the 1-D circular convolutions, an inverse DPRT is applied for computing the final result.

    (36) Bit requirements are also determined. Using the basic notation provided above, exact convolutions are computed to zero pad to a prime number requiring N=NextPrime(max(P.sub.1+Q.sub.1−1, P.sub.2+Q.sub.2−1). Therefore, the following is required (i) B+n bits for the DPRT of g, C+n bits for the DPRT of h where g uses B bits, h uses C bits, and n=┌log.sub.2N┐, (ii) B+C+3n bits for the convolutions, and (iii) B+C+4n bits just before the normalization step of the inverse DPRT, and B+C+x bits for the final result, where x represents the additional bits used for precision after the division.

    (37) In deriving the computational complexity of the approach, scalable DPRT computation requires ┌N/H┐(N+3H+3)+N+┌log.sub.2H┐+1 clock cycles that reduce to 2N+┌log.sub.2N┐+1 clock cycles for the fast DPRT implementation. For computing the number of cycles required for the circular convolutions, refer to the timing diagrams for computing circular convolutions as shown in FIG. 6 and FIG. 7.

    (38) As shown in FIG. 6, J+N+n+1 clock cycles are required to compute J convolutions in parallel where n=┌log.sub.2N┐ represents the initial addition latency. To compute outputs for all the N+1 required DPRT directions, all J parallel blocks of 1-D convolutions are used for L=┌(N+1)/J┐ times. Depending on N, increasing J may not always provide for better solutions. Therefore, optimal values may be determined as described below with respect to Pareto optimal architectures. Overall, a total of L.Math.(J+N)+n+1 clock cycles is required to compute all the 1-D convolutions.

    (39) A summary of the required resources for implementing the J 1-D parallel convolution blocks is now provided. Let the number of required flip-flops (including input buffers) that are needed for implementing the internal adders be denoted by A.sub.ffb. Also, let the equivalent number of 1-bit full additions denoted by A.sub.FA. The exact values for A.sub.ffb and A.sub.FA can be determined using the algorithm of FIG. 8.

    (40) FIG. 8 illustrates an algorithm of required tree resources as a function of the zero padded image (N), and the number of bits per pixel (D). To compute A.sub.ff for architectures that do not use input buffers, step 12 is simply removed from the algorithm. FIG. 9 provides for definitions of Aff.sub.b and A.sub.FA. FIG. 9 provides comparison of the performance of 2-D convolution and cross-correlations architectures as a function of computational resources. The result is of size of N×N, where N=2P−1, P represents the input image size and convolution kernel size, n=┌log.sub.2N┐, p=┌log.sub.2P┐, J denotes the number of parallel 1-D circular convolutions, and H denotes the number of image rows that are processed in parallel by the DPRT.

    (41) It is defined that: (i) A.sub.ffb(a,b) is the number of required flip-flops inside the a-operand of b bits adder tree including input buffers, (ii) A.sub.ff( ) is the same number without accounting for input buffers, and (iii) A.sub.FA( ) is the equivalent number of 1-bit additions. As shown in FIG. 9, A.sub.ffb(.Math.), A.sub.ff(.Math.), A.sub.FA(.Math.) grow linearly as a function of N, and can be computed exactly using the algorithm of FIG. 8.

    (42) Returning to the required resources for implementing the J 1-D parallel convolution blocks, the required flip-flops is J.Math.N.Math.(2B+2C+5n)+J.Math.A.sub.ffb(N,B+C+2n), 1-bit full additions is J.Math.A.sub.FA(N,B+C+2n) and required multipliers is J.Math.N. As noted above both A.sub.ffb(.Math.), A.sub.FA(.Math.) grow linearly.

    (43) Overall, based on the derived complexity, the invention provides the fastest running time using J=N+1 parallel 1-D convolutions at just 2N+n+2 clock cycles with resource usage of 0(N.sup.2) for flip-flops and full adders. For the minimum number of resources, only a J=1 1-D convolution block is used that require (N+1).sup.2+n+1 clock cycles with the lowest resource usage 0(N) for flip-flops and full adders.

    (44) Following the 1-D convolutions, the inverse DPRT is taken using the iSFDPRT_System component. Similar to the forward DPRT, scalability is controlled by H, the number of image rows processed in parallel. For this step, the input data uses B+C+3n bits per pixel. Depending on available resources, the inverse DPRT can be computed in just 2N+5n+B+C+2 for the fast inverse DPRT with 0(N.sup.2) resource usage (1-bit additions and flip-flops), or as slow as

    (45) .Math. N 2 .Math. ( N + 2 ) + 4 n + B + C + 4
    for H=2 for just 0(N) resource usage.

    (46) Turning to fast and scalable 2-D linear convolution using SVD-LU decompositions (FastRankConv), a collection of 1-D convolutions can be used, as described above, along the rows and columns to implement effective approximations to 2-D convolutions. Advantageously, the invention provides a fast and scalable system that eliminates the need for transpositions and allows the computation of convolutions in 0(N) to 0(N.sup.2) clock cycles.

    (47) As described above, scalability is achieved by controlling J, the number of linear convolutions that are computed in parallel. The linear convolution blocks are similar to the circular convolution blocks except that the complexity is a function of the size of the convolution kernel only (see FIG. 10A and FIG. 10B). Then, in order to operate as fast as possible, a custom memory system is designed that moves entire rows or columns to and from each linear convolver. Initially, all the rows are moved into the J convolution blocks, the convolution results stored in J SRAM memories so that the rows of the row-convolutions results correspond to the columns of the original image, and then row convolutions are performed and stored in J output SRAM memories. Thus, the need for the transpositions is completely avoided.

    (48) Then, for a single clock cycle, custom memories are used to (i) move entire rows and columns of pixels from memory to the convolution blocks and vice-versa, and (ii) allow direct access to J different SRAMs. The architecture is shown in FIG. 10A and FIG. 10B and the custom SRAM architecture shown in FIG. 11A and FIG. 11B. As shown in FIG. 10A and FIG. 10B, the fast and scalable 2-D convolutions are based on separable decomposition. Referring to the basic notation as provided above, the linear convolution blocks are very similar to the circular convolution blocks except that multiplications and additions are reduced to the size of the convolution kernel. Bus width is shown for maximum accuracy. Also, note that the implementation of FastRankCross is not considered here since cross-correlation is the same as convolution with a flipped kernel, and flipping can be computed during pre-processing (prior to SVD and LU).

    (49) The custom SRAM architecture shown in FIG. 11A and FIG. 11B provides for fast transposition and memory access. The architecture allows for full-row—or full-column, i.e., transpose—read/write in a single clock cycle (MODE=1) and individual access to up to J SRAMs in a single clock cycle (MODE=0). The SRAM stores M rows (or columns) of N B′-bit per pixels.

    (50) More specifically, the SRAM architecture is customized according to FIG. 12 for different purposes while the SRAM memory configurations allow for maximum accuracy. Orientation refers to each SRAM holding either a full row or column of the image. B denotes the number of bits of the input image, C denotes the number of bits used for the kernel coefficients. In particular, q.sub.1=┌log.sub.2Q┐ and q.sub.2=┌log.sub.2Q2┐.

    (51) Turning back to FIG. 10A and FIG. 10B, the following applies for a single clock cycle: (a) MEM_IN provides a full row of the image, (b) MEM_KER provides the entire row or column filter coefficients, (c) MEM_TEP stores the results of convolution along each row, provides access to a full column of the results, and (d) MEM_OUT, accumulates the final result, adds up to P2+Q2−1 values of the convolved image (in a single clock cycle), and also provides a full row.

    (52) Returning to the required resources are J.Math.(N2.Math.(B+C+q2)+Q2.Math.C+A.sub.ffb(Q2,B+2C+q2) flip-flops, J.Math.A.sub.FA(Q2,B+2C+q2) 1-bit full additions, and J×Q2 multipliers, were A.sub.ffb(.Math.), A.sub.FA(.Math.) grow linearly as mentioned above.

    (53) Next, a summary of performance-resource requirements is discussed. Without loss of generality, it is assumed that P2≥P1, Q2≥Q1, and consequently N2≥N1. Furthermore, for the purposes of the analysis, it is assumed full rank: r=Q1, and let L.sub.R=┌P1/J┐ and L.sub.C=┌(P2+Q2−1/J┐. The total running time is given as the sum of clock cycles required for: (i) row processing: r.Math.L.sub.R.Math.(J+P2+Q2−1), (ii) column processing: r.Math.L.sub.C.Math.(J+P1+Q1−1), and (iii) the latency of the adder tree ┌log.sub.2Q1┐+1. To simplify the derivation, let N=max{P2+Q2−1,P1+Q1−1}. Then, for J=1, the minimum resource usage grows as 0(N) with a running time of 0(N.sup.2). For J=N, the fastest running time 0(N) is provided with resource usage that grows as 0(N.sup.2). Again, FIG. 12 illustrates detailed resource usage of the memories. The architecture parameters may be further optimized as described below with respect to Pareto optimal architectures. Detailed comparisons are also discussed below.

    (54) Scalability may be achieved for large images using overlap-add. The convolution and cross-correlation kernels tend to be much smaller than the size of the input image. Thus, for much larger images, the best approach is to design the hardware architecture for the smaller kernels. The original image is subdivided into the smaller windows that are equal to the size of the kernel. Convolutions and cross-1-D correlations re computed for each block. Results from neighboring blocks must be added together. Furthermore, the final output is a concatenation of the results from each block. The basic approach is very well understood. Furthermore, the approach can also be parallelized to use multiple hardware blocks.

    (55) For purposes of the following discussion, it is assumed that both the image (block) and the convolution/cross-correlation kernel size are of the same size. Furthermore, the most common size is focused upon when both the image (block) and the kernels are square.

    (56) According to the invention, the architectures may be Pareto optimal. As mentioned above, it is possible to use J that is sub-optimal, or sub-optimal in the Pareto sense. Essentially, an architecture is Pareto-optimal if it provides the best possible performance for required resources. Thus, a Pareto optimal family of architectures always produces better running time for more resources. To derive the set of Pareto-optimal solutions, recall that the scalable families of architectures may contain less than J rows for the last block of 1-D convolutions. Thus, for FastScaleConv and FastScaleXcross, to fully utilize available hardware resources, the selected J values are required that would satisfy custom characterN+1custom character.sub.J=0. Similarly, for FastRankConv, the selected J values are required that simultaneously satisfy custom characterP1custom character.sub.J=0 and custom characterP2+Q2−1custom character.sub.J=0.

    (57) Following is a discussion of extensive comparisons with prior methods to demonstrate the promise of the proposed methods in implementing both convolutions and cross-correlations. The invention is compared to (i) serial systolic arrays (“S.sub.ERS.sub.YS”), (ii) scalable and parallel systolic arrays (“S.sub.CAS.sub.YS”), (iii) sliding windows (“SliWin”), and (iv) parallel and pipelined Fast Fourier Transform radix-2 (“FFT.sub.R2”).

    (58) As described earlier, the proposed architectures can compute both convolutions and cross-correlations. For FastRankConv, flipping the kernel can be done during pre-processing, prior to DVD and LU computations. Following is a presentation of results for FastConv, FastScaleConv, and FastRankConv. It is noted that FastXCorr, FastScaleXCorr, and FastRankXCorr are minor variations of FastConv, FastScaleConv, and FastRankConv.

    (59) Implementation setup and alternative methods are now discussed. Convolutions are considered with P×P kernels and image blocks where the output is of size N×N where N=2P−1. For multi-objective comparisons, it is assumed B=8 bits for the input image pixels and C=12 bits for the kernel coefficients. 12-bits for the outputs of the additions, multiplications, and the DPRT are used. For FPGA implementations, C=8 bits for the kernel coefficients and full-precision for the outputs is considered. For the “FFT.sub.R2”, the computations are performed using 32-bit floating point units. An extension of “FFT.sub.R2” is considered using point-to-point multiplications using D 1-D FFT cores. Then, in the fastest possible 2-D implementation, it is assumed that it would take N.sup.2/D additional clock cycles to implement the point to point complex multiplications.

    (60) As discussed above, for FastScaleConv, hardware scalability is achieved by varying H, the number of rows processed in parallel for the scalable DPRT, and J, which represents the number of 1-D convolutions computed in parallel. Here, for H=2, 3, . . . , N−1, set J=H for a balanced approach towards both. Then, for H=N, J=N+1 is sued to provide the optimal solution using FastConv. For FastRankConv, r is used to denote the rank of the approximation.

    (61) There are some special restrictions on N. For the DPRT based methods, N needs to be prime. For “FFT.sub.R2”, N is assumed to be a power of 2. For “S.sub.CAS.sub.YS”, P needs to be a composite number (N=2P−1), and it is assumed that P=P.sub.A.Math.P.sub.B. The results are focused on the cases when P.sub.A=2 (slowest) and P.sub.B=4 (fastest), with an input buffer and fully pipelined additions. It is noted that whenP.sub.B=2 the resource usage becomes prohibitive (0(N.sup.3)). For “S.sub.ERS.sub.YS”, “SliWin” and FastRankConv, there is no restriction for P. When the size needs to be changed, zero-padding is applied.

    (62) In addition to providing convolution and cross-correlation architectures that are both fast and scalable, the invention also provides architectures that are optimal in the multi-objective sense. The comprehensive summary in terms of performance and resources is shown in FIG. 9. FIG. 9 lists performance in clock cycles, number of flip-flops, number of 1-bit additions (equivalent full-adders), number and type of multipliers, and SRAM requirements. Additional details for FPGA implementations are discussed below.

    (63) FIG. 13 illustrates a graph of a comparison of normalized execution times according to the invention. As shown in FIG. 13, the required number of clock cycles is divided by N where N=2P−1. To illustrate the range of possibilities, architectures with quadratic and linear time complexity are considered. For quadratic time complexity (0(N.sup.2)), scalable implementations are derived by FastScaleConv for J=H=2, and FastRankConv with J=1, r=2. Alternatively, the following are considered: a scalable extension of “FFT.sub.R2” for D=2,4, a scalable implementation of “S.sub.CAS.sub.YS”, and the non-scalable implementations due to “SliWin” and “S.sub.ERS.sub.YS”. FastConv provides the fastest performance at just 6N+5n+17 clock cycles (n=log.sub.2(N)). For J=N+1, FastScaleConv achieves the same performance as FastConv. For rank=2 approximations to the convolution kernel (J=N), FastRankConv approximates the performance of FastConv. In terms of related research, for P.sub.B=4, “S.sub.CAS.sub.YS” achieves linear time-performance as well. On the other hand, from FIG. 9, for linear time performance, using P.sub.A=P/P.sub.B=P/4, it can be seen that the requirements for “S.sub.CAS.sub.YS” grow as P.sup.3 as opposed to P.sup.2 growth for FastConv and FastScaleConv.

    (64) It is noted that for very large kernels, as given in FIG. 9, the FFT based methods provide the best results since the N log.sub.2(N)) growth in floating point multipliers and additions likely cost less than the N.sup.2 growth of fixed-point multipliers and adders required by FastConv, FastScaleConv, and FastRankConv. As seen in the multi-objective comparisons, FastConv, FastScaleConv, and FastRankConv perform better than “FFT.sub.R2” in realistic convolution kernels (e.g., for N=127 and thus for lower N also). As discussed in further detail below, fast implementations of FastConv, FastScaleConv, and FastRankConv can fit in current FPGAs.

    (65) FIG. 14A, FIG. 14B, and FIG. 14C illustrate graphs of multi-objective comparisons according to the invention for N=127 (N=128 for FFT.sub.R2″). More specifically, the graph of FIG. 14A shows comparisons based for 1-bit flip-flops, the graph of FIG. 14B illustrates equivalent 1-bit additions, and equivalent 12-bit fixed point multipliers are shown in the graphof FIG. 14C (see FIG. 9 for memory usage). To interpret the plots, note that each curve, (termed a Pareto front), represents a family of optimal implementations. The best results come from the Pareto fronts located in the lower-left. Within each Pareto front, the upper left point represents the implementation that requires the largest number of cycles (slowest) with the lowest number of required resources. Then, the lower-right point represents the implementation that requires the smallest number of cycles (fastest) with the maximum number of required resources. To enable more direct comparisons, specific numbers for some of the implementations is listed in FIG. 15.

    (66) FIG. 15 illustrates a chart of performance and resource comparisons according to the invention. Convolutions are between 64×64 blocks. For linear time implementations FastConv is the fastest and serves as the reference design. The remaining implementations are normalized by the corresponding resources required by FastConv. Similarly, for quadratic-time implementations, FastScaleConv is used as the reference dsign. Memory requiremens refer to SRAM bits. Also, it is noted that the reported “FFT.sub.R2” resources for additions and multiplications refer to an approximation of the equivalent fixed-point resources as described above.

    (67) Since they are the fastest, FastConv implementations are always in the lowest right portion in each plot. From FIG. 15, it can be seen that FastConv only requires 25% of the multipliers and memory, and 56% of the addition resources required by “S.sub.CAS.sub.YS”, while requiring only 77% of the clock-cycles. In terms of scalable approaches, the Pareto front for FastRankConv (rank=2), provide the best performance with minimum resources. The limited resources required by FastRankConv are also clearly documented in FIG. 15. The full-ranked FastRankConv requires the maximum amounts of resources to deliver the same performance. Consistently, FastScaleConv provides the best scalable implementations without requiring low-rank. As seen from FIG. 15, for the linear case, FastScaleConv is slightly more expensive in resources than FastConv and substantially less expensive than “S.sub.CAS.sub.YS”. Overall, “S.sub.CAS.sub.YS” (PB=4) implementations achieve the speed of FastScaleConv but require significantly more multipliers and adders. “S.sub.ERS.sub.YS” and “SliWin” require significantly more resources and are also much slower than FastScaleConv. Returning to FIG. 15, for the quadratic case, FastScaleConv and FastRankConv (rank=2) are the fastest while requiring fewer adders and (equivalent) multipliers. Since only FastConv, FastScaleConv allow the kernel to change in running time, FastConv and FastScaleConv can also be used in cross-correlations with adaptive kernels, and adaptive filterbank applications.

    (68) Full-precision FPGA implementations are considered to understand what can be fitted in modern devices. In particular, full-precision implementations for 8-bit inputs and kernels are considered. According to certain embodiments, the invention was implemented using current FPGA technologies, for example, Virtex-7 and Zynq-SOC. For FastScaleConv and FastConv, for different N and J (the number of parallel 1-D convolvers), different implementations are shown in FIG. 16. For FastRankConv, by varying P and J, different implementations are presented in FIG. 17.

    (69) As shown, a collection of FastScaleConv architectures are successfully implemented for N=7 to N=127. For N=41, a high-level of parallelism is achieved by computing the DPRT and inverse DPRT by parallel-processing H=32 rows at a time through J=32 1-D full-precision, pipelined convolvers also operating in parallel. According to this example, the output images required 34 bits. For N=37, a full precision implementation of FastConv is provided that only requires 291 clock cycles by parallel processing 38 rows of the DPRT and inverse DPRT, and parallel computing 38 1-D convolutions.

    (70) As seen from FIG. 16, implementations are limited by the number of available look-up tables. Thus, the larger values of N can be implemented by reducing the precision requirements. As shown in FIG. 17, FastRankConv makes a very efficient use of the DSPs while not requiring significant LUT resources. As an example, for P=67, FastRankConv only requires 16205 LUTs (out of 712000 LUTs). In comparison, FastScaleConv requirements for N=127 (which approximates 2P−1), requires about 20 times more LUTs to deliver the full-accuracy results. Also for P=67, FastRankConv with rank r=2 requires 48903 clock cycles, compared to 33507 clock cycles for FastScaleConv with J=1 and H=2 without any rank restrictions. Thus, as mentioned above, for low-rank kernels, FastRankConv is a good alternative to FastScaleConv. For higher ranks and general-purpose implementations, FastScaleConv is preferable.

    (71) The described embodiments are to be considered in all respects only as illustrative and not restrictive, and the scope of the invention is not limited to the foregoing description. Those of skill in the art may recognize changes, substitutions, adaptations and other modifications that may nonetheless come within the scope of the invention and range of the invention.