System and methods for computing 2-D convolutions and cross-correlations
11593907 · 2023-02-28
Inventors
- Marios Stephanou Pattichis (Albuquerque, NM)
- Cesar Carranza (Miami Beach, FL, US)
- Daniel Llamocca Obregon (Clawson, MI, US)
Cpc classification
G06T1/20
PHYSICS
G06V10/50
PHYSICS
International classification
G06T1/20
PHYSICS
G06V10/50
PHYSICS
G06V10/42
PHYSICS
G06V10/94
PHYSICS
Abstract
Fast and scalable architectures and methods adaptable to available resources, that (1) compute 2-D convolutions using 1-D convolutions, (2) provide fast transposition and accumulation of results for computing fast cross-correlations or 2-D convolutions, and (3) provide parallel computations using pipelined 1-D convolvers. Additionally, fast and scalable architectures and methods that compute 2-D linear convolutions using Discrete Periodic Radon Transforms (DPRTs) including the use of scalable DPRT, Fast DPRT, and fast 1-D convolutions.
Claims
1. A method for fast and scalable architectures adaptable to available resources, that can be used to compute 2-D convolutions using 1-D convolutions comprising the steps of: providing an input image and a 2-D convolution kernel, wherein the input image comprises rows and columns of pixels and is partitioned into blocks and processed using an overlap-and-add approach or an overlap-and-save approach; decomposing the 2-D convolution kernel into a sum of separable 1-D kernels; applying the 1-D kernels along the rows and then the columns or vice-versa using pipelined 1-D convolvers that compute one output pixel per clock cycle; adding up results from the applying step; and producing a final image output based on the 2-D convolution kernel.
2. The method according to claim 1, wherein the number of 1-D kernels is 2, one to be applied along the rows and another one to be applied along the columns, wherein the two 1-D kernels can be the same.
3. The method according to claim 1, wherein the number of 1-D kernels is 4, two to be applied along the rows, and the other two to be applied along the columns, wherein some of the 1-D kernels can be the same.
4. The method according to claim 1, wherein the 2-D decomposition of the 2-D convolution kernel and the number of separable 1-D kernels are computed using a combination of singular value decomposition (SVD) and lower-upper (LU) decompositions.
5. The method according to claim 1 further comprising the steps of: allowing access, storage or accumulation of the results from a row or a column in a single clock cycle; and accessing all or a portion of the results using one or more rows or one or more columns.
6. The method according to claim 1, wherein the number of 1-D kernels is 2, 4, 6, 8, up to the maximum number needed to obtain a perfect reconstruction of the 2-D kernel using SVD, wherein some of the 1-D kernels can be the same.
Description
DESCRIPTION OF THE DRAWING
(1) The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an implementation of the invention and, together with the description, serve to explain the advantages and principles of the invention:
DETAILED DESCRIPTION OF THE INVENTION
(24) The invention is directed to fast and scalable architectures and methods adaptable to available resources, that (1) compute 2-D convolutions using 1-D convolutions, (2) provide fast transposition and accumulation of results for computing fast cross-correlations or 2-D convolutions, and (3) provide parallel computations using pipelined 1-D convolvers. Additionally, fast and scalable architectures and methods compute 2-D linear convolutions using Discrete Periodic Radon Transforms (DPRTs) including the use of scalable DPRT, Fast DPRT, and fast 1-D convolutions.
(25) The invention provides the following basic notation. Let g(i,j) denote an image of P.sub.1 rows with P.sub.2 pixels per row, i.e., of size P.sub.1×P.sub.2, with B bits per pixel. The image g(i,j) is indexed using 0≤i≤P.sub.1−1 and 0≤j≤P.sub.2−1. For the convolution kernels, the symbol h is used, assumed to be of size Q.sub.1×Q.sub.2 with C bits per pixel. f(i,j) is used for the output, of size N.sub.1×N.sub.2, where N.sub.1=P.sub.1+Q.sub.1−1 and N.sub.2=P.sub.2+Q.sub.2−1. As described herein, N is simply used for the special case of N.sub.1=N.sub.2 and similarly, P is used for P.sub.1=P.sub.2.
(26) To compute 1-D circular convolutions using circular shifts, F.sub.m(d), G.sub.m(d), H.sub.m(d) denote the DPRTs for f, g, h along the m-th prime direction. A special flip operation {hacek over (H)}.sub.m is defined by:
{hacek over (H)}.sub.m(d)=H.sub.m(N−1−d), d≥0
And the circular right-shift (CRS) by n, denoted {hacek over (H)}.sub.m.sup.n, is defined by:
{hacek over (H)}.sub.m.sup.n(d)={hacek over (H)}.sub.m(⟨d−n⟩.sub.N)
Then, the following shifted representation of the circular convolution is derived:
(27) F.sub.m(d)=Σ.sub.i=0.sup.N−1G.sub.m(i){hacek over (H)}.sub.m.sup.d+1(i)
As shown above, F.sub.m(d) can be expressed as the dot product between G.sub.m and a flipped and circular right-shifted by d+1 positions version of H.sub.m (denoted as {hacek over (H)}.sub.m.sup.d+1). A fast hardware implementation is derived.
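The shifted-flip formulation above can be checked numerically. The following sketch (NumPy; the helper names are mine, and it models only the arithmetic, not the patented hardware) verifies that the dot product of G.sub.m with the flipped, circularly right-shifted copy of H.sub.m reproduces the direct 1-D circular convolution:

```python
import numpy as np

N = 7                             # transform size (prime in the DPRT setting)
rng = np.random.default_rng(0)
G = rng.integers(0, 256, N)       # projection of the image (B-bit samples)
H = rng.integers(0, 16, N)        # projection of the kernel (C-bit samples)

H_flip = H[::-1]                  # flip: H_flip(d) = H(N-1-d)

def crs(x, n):
    """Circular right shift by n: y(d) = x(<d - n>_N)."""
    return np.roll(x, n)

# Direct 1-D circular convolution for reference.
F_ref = np.array([sum(G[i] * H[(d - i) % N] for i in range(N))
                  for d in range(N)])

# Shifted-flip formulation: F(d) = <G, CRS^{d+1}(H_flip)>.
F = np.array([G @ crs(H_flip, d + 1) for d in range(N)])

assert np.array_equal(F, F_ref)
```

Because only flips and circular shifts of the kernel projection are needed, each output sample reduces to a single dot product, which is what enables the pipelined hardware implementation.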
(29) According to the fast computation of 1-D circular convolutions given in
(30) The invention provides fast and scalable 2-D linear convolutions and cross-correlations using the DPRT. Below is a discussion of the architectures, algorithms, bit requirements, and computational efficiency for 2-D convolutions and cross-correlations. Most importantly, the scalability of the proposed approach allows for the most efficient implementations based on available resources.
(33) For adaptive filterbank applications, the DPRT of the zero-padded convolution kernel can be computed in real-time using the SFDPRT_System where the resulting DPRT is stored in (additional) memory. Alternatively, the SFDPRT_System component can be replicated for the kernel to avoid an increase of the running time. For computing cross-correlations, the vertical and horizontal flips associated with convolution need to be “undone” by flipping the kernel along rows and columns as described in
(34) An inverted MODE signal is used to control the SFDPRT_System to perform the needed flips. Vertical flips are implemented by switching the order of loading the rows. Thus, in a vertical flip, the last kernel row is loaded first and the first kernel row is loaded last. Horizontal flips are simply implemented by loading each row in reverse. Thus, in a horizontal flip, the last row element is loaded first and the first row element is loaded last. Overall, there is minimal overhead for implementing the horizontal and vertical flips.
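The identity behind these flips can be illustrated in a few lines. This is a hedged sketch of the mathematical relationship only (NumPy), not of the MODE-controlled loading hardware:

```python
import numpy as np

# Cross-correlation equals convolution with the kernel flipped, which the
# architecture implements purely through the loading order.
g = np.array([1., 2., 3., 4.])
h = np.array([1., 0., -1.])
assert np.array_equal(np.correlate(g, h, mode='full'),
                      np.convolve(g, h[::-1], mode='full'))

# In 2-D, the flip is along both rows and columns: last kernel row loaded
# first (vertical flip), each row loaded in reverse (horizontal flip).
k = np.arange(6).reshape(2, 3)
assert np.array_equal(k[::-1, ::-1], np.flipud(np.fliplr(k)))
```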
(35) Scalability is achieved by controlling (i) the number of 1-D circular convolutions that can be computed in parallel (denoted by J), and (ii) the number of image rows that can be processed in parallel in the DPRT blocks (denoted by H). Following the computation of the 1-D circular convolutions, an inverse DPRT is applied for computing the final result.
(36) Bit requirements are also determined. Using the basic notation provided above, exact convolutions are computed by zero padding to a prime number, requiring N=NextPrime(max(P.sub.1+Q.sub.1−1, P.sub.2+Q.sub.2−1)). Therefore, the following are required: (i) B+n bits for the DPRT of g and C+n bits for the DPRT of h, where g uses B bits, h uses C bits, and n=⌈log.sub.2N⌉, (ii) B+C+3n bits for the convolutions, and (iii) B+C+4n bits just before the normalization step of the inverse DPRT, and B+C+x bits for the final result, where x represents the additional bits used for precision after the division.
(37) In deriving the computational complexity of the approach, the scalable DPRT computation requires ⌈N/H⌉(N+3H+3)+N+⌈log.sub.2H⌉+1 clock cycles, which reduces to 2N+⌈log.sub.2N⌉+1 clock cycles for the fast DPRT implementation. For computing the number of cycles required for the circular convolutions, refer to the timing diagrams for computing circular convolutions as shown in
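As a rough illustration of the cycle-count formula above, a hypothetical helper (the function name and example sizes are mine) evaluates it for several values of H, showing the cycle count falling as more rows are processed in parallel:

```python
import math

def scalable_dprt_cycles(N, H):
    """Clock cycles for the scalable DPRT, per the formula quoted above:
    ceil(N/H)*(N+3H+3) + N + ceil(log2(H)) + 1."""
    return (math.ceil(N / H) * (N + 3 * H + 3)
            + N + math.ceil(math.log2(H)) + 1)

N = 127   # prime transform size (example value)
for H in (2, 8, 32, 127):
    print(H, scalable_dprt_cycles(N, H))
```

For comparison, the fully parallel fast DPRT quoted in the text needs 2N+⌈log.sub.2N⌉+1 = 262 cycles at N=127, close to the H=N evaluation of the scalable formula.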
(38) As shown in
(39) A summary of the required resources for implementing the J 1-D parallel convolution blocks is now provided. Let the number of required flip-flops (including input buffers) that are needed for implementing the internal adders be denoted by A.sub.ffb. Also, let the equivalent number of 1-bit full additions denoted by A.sub.FA. The exact values for A.sub.ffb and A.sub.FA can be determined using the algorithm of
(41) It is defined that: (i) A.sub.ffb(a,b) is the number of required flip-flops inside the a-operand of b bits adder tree including input buffers, (ii) A.sub.ff( ) is the same number without accounting for input buffers, and (iii) A.sub.FA( ) is the equivalent number of 1-bit additions. As shown in
(42) Returning to the required resources for implementing the J 1-D parallel convolution blocks, the required number of flip-flops is J·N·(2B+2C+5n)+J·A.sub.ffb(N,B+C+2n), the number of 1-bit full additions is J·A.sub.FA(N,B+C+2n), and the number of required multipliers is J·N. As noted above, both A.sub.ffb(·) and A.sub.FA(·) grow linearly.
(43) Overall, based on the derived complexity, the invention provides the fastest running time using J=N+1 parallel 1-D convolutions at just 2N+n+2 clock cycles with resource usage of O(N.sup.2) for flip-flops and full adders. For the minimum number of resources, only a J=1 1-D convolution block is used, which requires (N+1).sup.2+n+1 clock cycles with the lowest resource usage, O(N), for flip-flops and full adders.
(44) Following the 1-D convolutions, the inverse DPRT is taken using the iSFDPRT_System component. Similar to the forward DPRT, scalability is controlled by H, the number of image rows processed in parallel. For this step, the input data uses B+C+3n bits per pixel. Depending on available resources, the inverse DPRT can be computed in just 2N+5n+B+C+2 clock cycles for the fast inverse DPRT with O(N.sup.2) resource usage (1-bit additions and flip-flops), or as slow as
(45)
clock cycles for H=2 with just O(N) resource usage.
(46) Turning to fast and scalable 2-D linear convolution using SVD-LU decompositions (FastRankConv), a collection of 1-D convolutions can be used, as described above, along the rows and columns to implement effective approximations to 2-D convolutions. Advantageously, the invention provides a fast and scalable system that eliminates the need for transpositions and allows the computation of convolutions in O(N) to O(N.sup.2) clock cycles.
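The SVD-based separable decomposition underlying FastRankConv can be sketched as follows. This is an illustrative NumPy model (the helper names are mine, and it is not the hardware architecture): the 2-D kernel is expanded as h = Σ.sub.k s.sub.k·u.sub.k·v.sub.k.sup.T, and each rank-1 term is applied as a column convolution followed by a row convolution, with the results accumulated.

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.random((16, 16))
h = rng.random((5, 5))                # 2-D kernel (generically full rank)

U, s, Vt = np.linalg.svd(h)
r = int(np.sum(s > 1e-12 * s[0]))     # numerical rank of the kernel

def conv2d_full(x, k):
    """Reference 'full' 2-D linear convolution."""
    out = np.zeros((x.shape[0] + k.shape[0] - 1,
                    x.shape[1] + k.shape[1] - 1))
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out[i:i + x.shape[0], j:j + x.shape[1]] += k[i, j] * x
    return out

# Separable implementation: columns filtered by u_k, rows by s_k * v_k.
out = np.zeros((img.shape[0] + h.shape[0] - 1,
                img.shape[1] + h.shape[1] - 1))
for k in range(r):
    col = np.apply_along_axis(np.convolve, 0, img, U[:, k])        # columns
    out += np.apply_along_axis(np.convolve, 1, col, s[k] * Vt[k])  # rows

assert np.allclose(out, conv2d_full(img, h))
```

When the kernel rank r is small, only r pairs of 1-D convolutions are needed, which is the source of the efficiency gains claimed for FastRankConv.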
(47) As described above, scalability is achieved by controlling J, the number of linear convolutions that are computed in parallel. The linear convolution blocks are similar to the circular convolution blocks except that the complexity is a function of the size of the convolution kernel only (see
(48) Then, for a single clock cycle, custom memories are used to (i) move entire rows and columns of pixels from memory to the convolution blocks and vice-versa, and (ii) allow direct access to J different SRAMs. The architecture is shown in
(49) The custom SRAM architecture shown in
(50) More specifically, the SRAM architecture is customized according to
(51) Turning back to
(52) Returning to the required resources, these are J·(N2·(B+C+q2)+Q2·C+A.sub.ffb(Q2,B+2C+q2)) flip-flops, J·A.sub.FA(Q2,B+2C+q2) 1-bit full additions, and J×Q2 multipliers, where A.sub.ffb(·) and A.sub.FA(·) grow linearly as mentioned above.
(53) Next, a summary of performance-resource requirements is discussed. Without loss of generality, it is assumed that P2≥P1, Q2≥Q1, and consequently N2≥N1. Furthermore, for the purposes of the analysis, full rank is assumed: r=Q1, and let L.sub.R=⌈P1/J⌉ and L.sub.C=⌈(P2+Q2−1)/J⌉. The total running time is given as the sum of clock cycles required for: (i) row processing: r·L.sub.R·(J+P2+Q2−1), (ii) column processing: r·L.sub.C·(J+P1+Q1−1), and (iii) the latency of the adder tree, ⌈log.sub.2Q1⌉+1. To simplify the derivation, let N=max{P2+Q2−1, P1+Q1−1}. Then, for J=1, the minimum resource usage grows as O(N) with a running time of O(N.sup.2). For J=N, the fastest running time, O(N), is provided with resource usage that grows as O(N.sup.2). Again,
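The running-time terms above can be collected into a small, hypothetical calculator (the function name and the example sizes are mine):

```python
import math

def fastrankconv_cycles(P1, P2, Q1, Q2, J, r=None):
    """Total clock cycles per the summary above: row pass + column pass
    + adder-tree latency, assuming full rank r = Q1 unless given."""
    r = Q1 if r is None else r
    L_R = math.ceil(P1 / J)
    L_C = math.ceil((P2 + Q2 - 1) / J)
    rows = r * L_R * (J + P2 + Q2 - 1)     # row processing
    cols = r * L_C * (J + P1 + Q1 - 1)     # column processing
    return rows + cols + math.ceil(math.log2(Q1)) + 1

# Example: 64x64 block, 15x15 kernel, serial (J=1) vs. parallel (J=8).
print(fastrankconv_cycles(P1=64, P2=64, Q1=15, Q2=15, J=1))
print(fastrankconv_cycles(P1=64, P2=64, Q1=15, Q2=15, J=8))
```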
(54) Scalability may be achieved for large images using overlap-add. The convolution and cross-correlation kernels tend to be much smaller than the input image. Thus, for much larger images, the best approach is to design the hardware architecture for the smaller kernels. The original image is subdivided into smaller windows that are equal to the size of the kernel. Convolutions and cross-correlations are computed for each block. Results from neighboring blocks must be added together. Furthermore, the final output is a concatenation of the results from each block. The basic approach is very well understood. Furthermore, the approach can also be parallelized to use multiple hardware blocks.
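The overlap-add scheme can be sketched as follows. This is an illustrative NumPy model with assumed block and kernel sizes, not the hardware: each block is convolved independently, and overlapping partial results from neighboring blocks are accumulated.

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.random((12, 12))
h = rng.random((4, 4))
Bsz = 4                               # block size (example value)

def conv2d_full(x, k):
    """Reference 'full' 2-D linear convolution."""
    out = np.zeros((x.shape[0] + k.shape[0] - 1,
                    x.shape[1] + k.shape[1] - 1))
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out[i:i + x.shape[0], j:j + x.shape[1]] += k[i, j] * x
    return out

out = np.zeros((img.shape[0] + h.shape[0] - 1,
                img.shape[1] + h.shape[1] - 1))
for r in range(0, img.shape[0], Bsz):
    for c in range(0, img.shape[1], Bsz):
        blk = conv2d_full(img[r:r + Bsz, c:c + Bsz], h)
        out[r:r + blk.shape[0], c:c + blk.shape[1]] += blk  # overlap-add

assert np.allclose(out, conv2d_full(img, h))
```

By linearity of convolution, the block-wise result matches the direct full-image convolution exactly, so hardware sized for a small block handles arbitrarily large images.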
(55) For purposes of the following discussion, it is assumed that both the image (block) and the convolution/cross-correlation kernel size are of the same size. Furthermore, the most common size is focused upon when both the image (block) and the kernels are square.
(56) According to the invention, the architectures may be Pareto optimal. As mentioned above, it is possible to use a J that is sub-optimal in the Pareto sense. Essentially, an architecture is Pareto-optimal if it provides the best possible performance for the required resources. Thus, a Pareto-optimal family of architectures always produces better running time for more resources. To derive the set of Pareto-optimal solutions, recall that the scalable families of architectures may contain fewer than J rows for the last block of 1-D convolutions. Thus, for FastScaleConv and FastScaleXCorr, to fully utilize available hardware resources, the selected J values are required to satisfy ⟨N+1⟩.sub.J=0. Similarly, for FastRankConv, the selected J values are required to simultaneously satisfy ⟨P1⟩.sub.J=0 and ⟨P2+Q2−1⟩.sub.J=0.
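The divisibility conditions above can be expressed as a small, hypothetical selection helper (the function names are mine): keep only the parallelism factors J that divide the workload evenly, so the last block of 1-D convolutions is fully utilized.

```python
def pareto_J_fastscaleconv(N):
    """J values satisfying <N+1>_J = 0, i.e., (N+1) mod J == 0."""
    return [J for J in range(1, N + 2) if (N + 1) % J == 0]

def pareto_J_fastrankconv(P1, P2, Q2):
    """J values satisfying <P1>_J = 0 and <P2+Q2-1>_J = 0 simultaneously."""
    return [J for J in range(1, min(P1, P2 + Q2 - 1) + 1)
            if P1 % J == 0 and (P2 + Q2 - 1) % J == 0]

print(pareto_J_fastscaleconv(7))        # divisors of 8 -> [1, 2, 4, 8]
```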
(57) Following is a discussion of extensive comparisons with prior methods to demonstrate the promise of the proposed methods in implementing both convolutions and cross-correlations. The invention is compared to (i) serial systolic arrays (“S.sub.ERS.sub.YS”), (ii) scalable and parallel systolic arrays (“S.sub.CAS.sub.YS”), (iii) sliding windows (“SliWin”), and (iv) parallel and pipelined Fast Fourier Transform radix-2 (“FFT.sub.R2”).
(58) As described earlier, the proposed architectures can compute both convolutions and cross-correlations. For FastRankConv, flipping the kernel can be done during pre-processing, prior to the SVD and LU computations. Following is a presentation of results for FastConv, FastScaleConv, and FastRankConv. It is noted that FastXCorr, FastScaleXCorr, and FastRankXCorr are minor variations of FastConv, FastScaleConv, and FastRankConv.
(59) Implementation setup and alternative methods are now discussed. Convolutions are considered with P×P kernels and image blocks where the output is of size N×N with N=2P−1. For multi-objective comparisons, it is assumed B=8 bits for the input image pixels and C=12 bits for the kernel coefficients; 12 bits are used for the outputs of the additions, multiplications, and the DPRT. For FPGA implementations, C=8 bits for the kernel coefficients and full precision for the outputs are considered. For the “FFT.sub.R2”, the computations are performed using 32-bit floating point units. An extension of “FFT.sub.R2” is considered using point-to-point multiplications with D 1-D FFT cores. Then, in the fastest possible 2-D implementation, it is assumed that it would take N.sup.2/D additional clock cycles to implement the point-to-point complex multiplications.
(60) As discussed above, for FastScaleConv, hardware scalability is achieved by varying H, the number of rows processed in parallel for the scalable DPRT, and J, which represents the number of 1-D convolutions computed in parallel. Here, for H=2, 3, . . . , N−1, J=H is set for a balanced approach towards both. Then, for H=N, J=N+1 is used to provide the optimal solution using FastConv. For FastRankConv, r is used to denote the rank of the approximation.
(61) There are some special restrictions on N. For the DPRT based methods, N needs to be prime. For “FFT.sub.R2”, N is assumed to be a power of 2. For “S.sub.CAS.sub.YS”, P needs to be a composite number (N=2P−1), and it is assumed that P=P.sub.A·P.sub.B. The results are focused on the cases when P.sub.A=2 (slowest) and P.sub.B=4 (fastest), with an input buffer and fully pipelined additions. It is noted that when P.sub.B=2 the resource usage becomes prohibitive (O(N.sup.3)). For “S.sub.ERS.sub.YS”, “SliWin” and FastRankConv, there is no restriction on P. When the size needs to be changed, zero-padding is applied.
(62) In addition to providing convolution and cross-correlation architectures that are both fast and scalable, the invention also provides architectures that are optimal in the multi-objective sense. The comprehensive summary in terms of performance and resources is shown in
(64) It is noted that for very large kernels, as given in
(67) Since they are the fastest, FastConv implementations are always in the lowest right portion in each plot. From
(68) Full-precision FPGA implementations are considered to understand what can be fitted in modern devices. In particular, full-precision implementations for 8-bit inputs and kernels are considered. According to certain embodiments, the invention was implemented using current FPGA technologies, for example, Virtex-7 and Zynq-SOC. For FastScaleConv and FastConv, for different N and J (the number of parallel 1-D convolvers), different implementations are shown in
(69) As shown, a collection of FastScaleConv architectures are successfully implemented for N=7 to N=127. For N=41, a high-level of parallelism is achieved by computing the DPRT and inverse DPRT by parallel-processing H=32 rows at a time through J=32 1-D full-precision, pipelined convolvers also operating in parallel. According to this example, the output images required 34 bits. For N=37, a full precision implementation of FastConv is provided that only requires 291 clock cycles by parallel processing 38 rows of the DPRT and inverse DPRT, and parallel computing 38 1-D convolutions.
(70) As seen from
(71) The described embodiments are to be considered in all respects only as illustrative and not restrictive, and the scope of the invention is not limited to the foregoing description. Those of skill in the art may recognize changes, substitutions, adaptations and other modifications that may nonetheless come within the scope and range of the invention.