Cascaded architecture for disparity and motion prediction with block matching and convolutional neural network (CNN)
11694341 · 2023-07-04
Assignee
Inventors
CPC classification: H04N7/181; H04N13/239; H04N2013/0081 (Section H: Electricity)
International classification: H04N13/00 (Section H: Electricity)
Abstract
A CNN operates on the disparity or motion outputs of a block matching hardware module, such as a DMPAC module, to produce refined disparity or motion streams that improve results in images having ambiguous regions. Because the block matching hardware module performs most of the processing, the CNN can be small and thus able to operate in real time, in contrast to end-to-end CNNs that perform all of the processing. In one example, the CNN operation is performed only if the block matching hardware module's output confidence level is below a predetermined amount. The CNN can have a number of different configurations and still be sufficiently small to operate in real time on conventional platforms.
Claims
1. An image processing system comprising: a block matching hardware module having an input for receiving first and second image streams and an output for providing a displacement stream, wherein the block matching hardware module is configured to provide a confidence stream; a convolutional neural network (CNN) having an input coupled to the block matching hardware module for receiving the displacement stream and for receiving at least one of the first and second image streams, wherein the CNN is configured to selectively provide a refined displacement stream for disparity prediction or for motion prediction; and comparator logic configured to: responsive to a value of the confidence stream being greater than a threshold value, bypass the CNN and output the displacement stream; and responsive to the value of the confidence stream being less than the threshold value, provide and output the refined displacement stream.
2. The image processing system of claim 1, wherein the first and second image streams are left and right image streams and the displacement stream and refined displacement stream are each disparity streams, and wherein the CNN receives both of the first and second image streams.
3. The image processing system of claim 1, wherein the first and second image streams are left and right image streams and the displacement stream and refined displacement stream are each disparity streams, and wherein the CNN receives only one of the first and second image streams.
4. The image processing system of claim 1, wherein the first and second image streams are current and previous image streams and the displacement stream and refined displacement stream are each motion streams, and wherein the CNN receives both of the first and second image streams.
5. The image processing system of claim 1, wherein the first and second image streams are current and previous image streams and the displacement stream and refined displacement stream are each motion streams, and wherein the CNN receives only one of the first and second image streams.
6. The image processing system of claim 1, wherein the CNN is formed by a digital signal processor (DSP) executing software instructions.
7. The image processing system of claim 1, wherein the block matching hardware module comprises a semi-global block matching hardware module configured to implement a Lucas-Kanade method on the first and second image streams to provide the displacement stream.
8. A system comprising: a system on a chip (SoC) including: a plurality of processors; a memory controller coupled to the plurality of processors; onboard memory coupled to the memory controller; an external memory interface for connecting to external memory; a high-speed interconnect coupled to the plurality of processors, the memory controller and the external memory interface; an external communication interface coupled to the high-speed interconnect; and a block matching hardware module having an input for receiving first and second image streams and an output for providing a displacement stream for disparity prediction or for motion prediction; and an external memory coupled to the external memory interface and storing instructions for execution on a first processor of the plurality of processors to form, when executed on the first processor, a convolutional neural network (CNN) having an input coupled to the block matching hardware module for receiving the displacement stream and for receiving at least one of the first and second image streams, wherein the instructions are configured to be executable by the first processor for further causing the CNN to selectively provide a refined displacement stream.
9. The system of claim 8, wherein the first and second image streams are left and right image streams and the displacement stream and refined displacement stream are each disparity streams, and wherein the CNN receives both of the first and second image streams.
10. The system of claim 8, wherein the first and second image streams are left and right image streams and the displacement stream and refined displacement stream are each disparity streams, and wherein the CNN receives only one of the first and second image streams.
11. The system of claim 8, wherein the first and second image streams are current and previous image streams and the displacement stream and refined displacement stream are each motion streams, and wherein the CNN receives both of the first and second image streams.
12. The system of claim 8, wherein the first and second image streams are current and previous image streams and the displacement stream and refined displacement stream are each motion streams, and wherein the CNN receives only one of the first and second image streams.
13. The system of claim 8, wherein the block matching hardware module further provides a confidence stream, the SoC further comprising: comparator logic coupled to the block matching hardware module and the CNN and receiving the confidence stream, the displacement stream and the refined displacement stream and providing the displacement stream if a value of the confidence stream is greater than a threshold value and providing the refined displacement stream if the value of the confidence stream is less than the threshold value.
14. The system of claim 8, wherein the plurality of processors includes a digital signal processor (DSP), and wherein the DSP is the first processor used to execute the instructions and form the CNN.
15. The system of claim 8, wherein the block matching hardware module comprises a semi-global block matching hardware module configured to implement a Lucas-Kanade method on the first and second image streams to provide the displacement stream.
16. A method of image processing comprising: processing first and second image streams with a block matching hardware module to provide a displacement stream; generating a confidence stream based on the first and second image streams; processing the displacement stream and at least one of the first and second image streams with a convolutional neural network (CNN) to selectively provide a refined displacement stream for disparity prediction or for motion prediction; responsive to a value of the confidence stream being greater than a threshold value, bypassing the CNN and outputting the displacement stream; and responsive to the value of the confidence stream being less than the threshold value, providing and outputting the refined displacement stream.
17. The image processing method of claim 16, wherein the first and second image streams are left and right image streams and the displacement stream and refined displacement stream are each disparity streams, and wherein the method further comprises receiving, by the CNN, both of the first and second image streams.
18. The image processing method of claim 16, wherein the first and second image streams are left and right image streams and the displacement stream and refined displacement stream are each disparity streams, and wherein the method further comprises receiving, by the CNN, only one of the first and second image streams.
19. The image processing method of claim 16, wherein the first and second image streams are current and previous image streams and the displacement stream and refined displacement stream are each motion streams, and wherein the method further comprises receiving, by the CNN, both of the first and second image streams.
20. The image processing method of claim 16, wherein the first and second image streams are current and previous image streams and the displacement stream and refined displacement stream are each motion streams, and wherein the method further comprises receiving, by the CNN, only one of the first and second image streams.
Description
BRIEF DESCRIPTION OF THE FIGURES
(1) The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
(2) For a detailed description of various examples, reference will now be made to the accompanying drawings in which:
(3)-(17) [Brief descriptions of the individual figures; the drawing references are not reproduced in this text.]
DETAILED DESCRIPTION
(18) Referring now to
(19) Referring now to
(20)
(21) A graphics acceleration module 524 is connected to the high-speed interconnect 508. A display subsystem 526 is connected to the high-speed interconnect 508 and includes conversion logic 528 and output logic 530 to allow operation with and connection to various video monitors. A system services block 532, which includes items such as DMA controllers, memory management units, general-purpose I/Os, mailboxes and the like, is provided for normal SoC 500 operation. A serial connectivity module 534 is connected to the high-speed interconnect 508 and includes the serial interface modules typical of an SoC. A vehicle connectivity module 536 provides interconnects for external communication interfaces, such as a PCIe block 538, a USB block 540 and an Ethernet switch 542. A capture/MIPI module 544 includes a four-lane CSI-2 compliant transmit block 546 and a four-lane CSI-2 receive module and hub.
(22) An MCU island 560 is provided as a secondary subsystem and handles operation of the integrated SoC 500 when the other components are powered down to save energy. An MCU ARM processor 562, such as one or more ARM R5F cores, operates as a master and is coupled to the high-speed interconnect 508 through an isolation interface 561. An MCU general purpose I/O (GPIO) block 564 operates as a slave. MCU RAM 566 is provided to act as local memory for the MCU ARM processor 562. A CAN bus block 568, an additional external communication interface, is connected to allow operation with a conventional CAN bus environment in the vehicle 100. An Ethernet MAC (media access control) block 570 is provided for further connectivity in the vehicle 100. External memory, generally non-volatile memory (NVM), is connected to the MCU ARM processor 562 via an external memory interface 569 to store instructions loaded into the various other memories for execution by the various appropriate processors. The MCU ARM processor 562 operates as a safety processor, monitoring operations of the SoC 500 to ensure proper operation of the SoC 500.
(23) It is understood that this is one example of an SoC provided for explanation and many other SoC examples are possible, with varying numbers of processors, DSPs, accelerators and the like.
(24) The examples of
(25) In optical flow operation as in
(26) The use of the CNN 402 cascaded with the DMPAC module 522 provides improved disparity and motion stream outputs compared to the DMPAC module 522 alone. The CNN 402 is much smaller than the end-to-end CNNs discussed above because it uses far fewer layers and thus requires far fewer calculations, so the combination can provide real-time operation.
(27) Whereas previously the outputs of the DMPAC module 522 were used directly by the various functions, such as collision avoidance and autonomous operation, now the refined disparity and motion stream outputs of the CNN 402 are used by those functions.
(28) In the examples of
(29) The DMPAC module 522 is one example of a block matching system; other, more traditional block matching systems can be used instead, and the CNN 402 improves their results just as it improves the output of the DMPAC module 522.
(30) In one example the operation of the CNN 402 is a refine network (RefineNet) that has been taught to predict a residual correction value to combine with original disparity or motion values to provide a refined disparity or motion value. For disparity or stereo operation, mathematically this is stated as:
d_2 = d_1 + F_r(I_L, I_R, Ĩ_L, E_L, d_1)

where:
d_1 = initial disparity
d_2 = refined disparity
F_r = correction function
I_L = left image
I_R = right image
Ĩ_L = warped or reconstructed left image, computed from the right image and the disparity
E_L = error image, the displacement between I_L and Ĩ_L
(31) For optical flow or motion operation, I_L becomes I_t-1, I_R becomes I_t, Ĩ_L becomes Ĩ_t-1, E_L becomes E_t-1, d_1 becomes the pair (d_1x, d_1y), and d_2 becomes the pair (d_2x, d_2y).
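The warped image, error image, and residual correction above can be sketched in a few lines. This is an illustrative reading, not the patent's implementation: integer-pixel warping is assumed (a real system would interpolate), and the correction function standing in for F_r is a placeholder for the trained CNN.

```python
import numpy as np

def warp_left(right_img, disparity):
    """Reconstruct the left view: Ĩ_L[y, x] = I_R[y, x - d[y, x]].
    Integer-pixel sampling only, clipped at the image border."""
    h, w = right_img.shape
    xs = np.arange(w)
    warped = np.empty_like(right_img)
    for y in range(h):
        src = np.clip(xs - disparity[y].astype(int), 0, w - 1)
        warped[y] = right_img[y, src]
    return warped

def refine_disparity(i_left, i_right, d1, correction_fn):
    """d_2 = d_1 + F_r(I_L, I_R, Ĩ_L, E_L, d_1), with correction_fn as F_r."""
    warped = warp_left(i_right, d1)      # Ĩ_L
    error = i_left - warped              # E_L
    return d1 + correction_fn(i_left, i_right, warped, error, d1)
```

With a correction function that returns zero everywhere, the refined disparity is simply the initial disparity, which is the intended degenerate case.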
(32) In one example, illustrated in
(33) This comparator logic shown diagrammatically in
(34) The CNN 402 is developed by software instructions executing on the DSP 504; the comparator logic is shown in flowchart format in
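The threshold comparison in the comparator logic can be sketched as follows. This is a hedged per-pixel reading; the comparison could equally be made once per frame on a summary statistic of the confidence stream, and all names here are illustrative.

```python
import numpy as np

def gate_displacement(displacement, refined, confidence, threshold):
    """Comparator logic: where the block-matching confidence exceeds the
    threshold, keep the raw displacement value; elsewhere take the
    CNN-refined value."""
    return np.where(confidence > threshold, displacement, refined)
```

A per-frame variant would instead skip invoking the CNN entirely when, say, the mean confidence exceeds the threshold, saving the CNN's computation for ambiguous frames only.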
(35)
(36) The outputs of the first layer 604 are provided as inputs to a second layer 606, which has 16 output channels. The 16 output channels of the second layer 606 are the inputs to a third layer 608, which also has 16 output channels. The 16 output channels of the third layer 608 are the inputs to a fourth layer 610. The fourth layer 610 has 32 output channels, which are the inputs to a fifth layer 612. The fifth layer 612 has 32 output channels, which are the inputs to a sixth layer 614. The sixth layer 614 has 64 output channels, which are the inputs to a seventh layer 616. The seventh layer 616 has one output channel for disparity and two output channels for motion. A summer 618 combines the output streams from the seventh layer 616 with the disparity or motion streams from the DMPAC module 522 to produce the refined disparity or motion streams. In one example, the sequential refine network configuration 602 has only 33,000 parameters and a receptive field size of 15×15.
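The stated 15×15 receptive field follows directly from seven stride-1 3×3 convolutions, and the parameter budget can be estimated the same way. The sketch below assumes 3×3 kernels throughout, 16 output channels for the first layer 604, and an illustrative input plane count (the exact inputs depend on which streams feed the CNN), so the total is of the same order as, not identical to, the cited 33,000.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases for one k x k convolutional layer."""
    return c_in * c_out * k * k + c_out

# Output channels of layers 604..616 (disparity variant: one final channel).
channels = [16, 16, 16, 32, 32, 64, 1]
c_in = 4          # assumed input planes: images, warped image, disparity
total = 0
for c_out in channels:
    total += conv_params(c_in, c_out)
    c_in = c_out

# Receptive field of a stack of stride-1 convolutions: 1 + sum of (k - 1).
receptive = 1 + len(channels) * (3 - 1)
print(total, receptive)   # tens of thousands of parameters; 15 x 15 field
```

The parameter count confirms why the network can run in real time on a DSP: it is several orders of magnitude smaller than typical end-to-end disparity CNNs.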
(37)
(38) The 32 output channels from the first layer 704 are provided to a second layer 706. The second layer 706 is a 3×3 depthwise convolutional layer that has 32 output channels provided to a third layer 708. The third layer 708 is a 1×1 convolutional layer with ReLU6 and 16 output channels.
(39) The output channels from the third layer 708 are provided to a first block 710 of a series of blocks. The block composition is illustrated in
(40) A second block 712 receives the 24 output channels from the first block 710 and has an R value of 1 and a stride of 1, with 24 output channels. A third block 714 receives the 24 output channels from the second block 712 and has an R value of 1 and a stride of 2, with 32 output channels, providing a further factor of two downsampling. A fourth block 716 receives the 32 output channels from the third block 714 and has an R value of 1 and a stride of 1, with 32 output channels. A fifth block 718 receives the 32 output channels from the fourth block 716 and has an R value of 1 and a stride of 1, with 32 output channels. A sixth block 720 receives the 32 output channels from the fifth block 718 and has an R value of 1 and a stride of 1, with 64 output channels. A seventh block 722 receives the 64 output channels from the sixth block 720 and has an R value of 2 and a stride of 1, with 64 output channels. An eighth block 724 receives the 64 output channels from the seventh block 722 and has an R value of 2 and a stride of 1, with 64 output channels. A ninth block 726 receives the 64 output channels from the eighth block 724 and has an R value of 2 and a stride of 1, with 64 output channels.
(41) A tenth block 728 receives the 64 output channels from the ninth block 726 and has an R value of 2 and a stride of 1, with 96 output channels. An eleventh block 730 receives the 96 output channels from the tenth block 728 and has an R value of 2 and a stride of 1, with 96 output channels. A twelfth block 732 receives the 96 output channels from the eleventh block 730 and has an R value of 2 and a stride of 1, with 96 output channels. A thirteenth block 734 receives the 96 output channels from the twelfth block 732 and has an R value of 2 and a stride of 1, with 160 output channels. A fourteenth block 736 receives the 160 output channels from the thirteenth block 734 and has an R value of 2 and a stride of 1, with 160 output channels. A fifteenth block 738 receives the 160 output channels from the fourteenth block 736 and has an R value of 2 and a stride of 1, with 160 output channels. A sixteenth block 740 receives the 160 output channels from the fifteenth block 738 and has an R value of 2 and a stride of 1, with 320 output channels.
(42) The 320 output channels of the sixteenth block 740 are provided to a fourth layer 742, which is a 3×3 depthwise convolutional layer that has 256 output channels. The 256 output channels of the fourth layer 742 are provided to an average pooling layer 744 with 256 output channels. The 256 output channels of the fourth layer 742 and the 256 output channels of the average pooling layer 744 are provided to a concatenation element 746, which has 512 output channels. The concatenated 512 output channels are provided to a fifth layer 748, which is a 1×1 convolutional layer and has two output channels. The two output channels are provided to an upsampling element 750, which upsamples by a factor of eight to return to the original channel density and provides one output channel for disparity and two output channels for motion. The upsampled output channels are added by a summer 752 with the disparity or motion streams from the DMPAC module 522 to produce the refined disparity or motion streams. While the encoder-decoder structured refine network configuration 702 has many more stages than the sequential refine network configuration 602, in one example the receptive field size is greater at 374×374 and the computational complexity, the total number of multiplications and additions, is similar because of the simplicity of the MobileNetV2 configuration and the downsampling. The larger receptive size allows further improvements in the disparity by removing more noise on flat areas and repeated patterns.
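The claim that the much deeper encoder-decoder stays computationally comparable rests on two things: depthwise-separable convolutions and operating at reduced resolution. A quick back-of-the-envelope comparison, with illustrative sizes not taken from the text:

```python
def macs_standard(c_in, c_out, h, w, k=3):
    """Multiply-accumulates for a standard k x k convolution."""
    return c_in * c_out * k * k * h * w

def macs_separable(c_in, c_out, h, w, k=3):
    """Depthwise k x k followed by pointwise 1 x 1 (MobileNetV2 style)."""
    return (c_in * k * k + c_in * c_out) * h * w

# A 64->64 separable conv at 1/8 resolution versus a standard conv at
# full resolution with the same channel counts.
full = macs_standard(64, 64, 128, 256)
cheap = macs_separable(64, 64, 128 // 8, 256 // 8)
print(full / cheap)   # the separable, downsampled layer is far cheaper
```

The separable form saves roughly a factor of 1/c_out + 1/k² per layer, and the eight-times downsampling saves another factor of 64 per spatial position, which is how many more stages fit in a similar multiply-add budget.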
(43)
(44) The 24 output channels of the first block 710 are also provided to a ninth layer 775, which provides 48 output channels. The 48 output channels are provided to the second concatenation element 773. The 304 output channels of the second concatenation element 773 are provided to a tenth layer 776, a 3×3 depthwise convolutional layer that has 304 output channels. The 304 output channels are provided to an eleventh layer 778, a convolutional layer with 256 output channels. The 256 output channels are provided to a twelfth layer 780, a convolutional layer with 256 output channels. The 256 output channels are provided to a thirteenth layer 782, a convolutional layer with one output channel for disparity operation and two output channels for motion operation. The output of the thirteenth layer 782 is provided to a first summer 784.
(45) The disparity or motion outputs of the DMPAC module 522 are provided to a pooling layer 786, which downsamples the streams by a factor of four. The output of the pooling layer 786 is provided to the first summer 784. The output of the first summer 784 is provided to a second upsampling layer 788, which upsamples by a factor of two. The output of the second upsampling layer 788 is provided to a second summer 790.
(46) The output of the thirteenth layer 782 is also provided to a third upsampling layer 792, which upsamples by a factor of two. The output of the third upsampling layer 792 is provided to the second summer 790. The output of the second summer 790 is provided to a fourth upsampling layer 794, which upsamples by a factor of two, returning to the original channel density. The output of the fourth upsampling layer 794 is provided to a third summer 798.
(47) The output of the thirteenth layer 782 is also provided to a fifth upsampling layer 796, which upsamples by a factor of two. The output of the fifth upsampling layer 796 is provided to the third summer 798. The output of the third summer 798 is the refined disparity or motion streams.
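The multi-scale summation of paragraphs (45) through (47) can be sketched with nearest-neighbor resampling. This is a simplified reading: the upsampling factor applied to the residual before the final summer is chosen here so the scales align, a detail the text leaves implicit, and average pooling stands in for the pooling layer 786.

```python
import numpy as np

def downsample(x, f):
    """Average-pool a 2-D map by an integer factor f."""
    h, w = x.shape
    return x.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def upsample(x, f):
    """Nearest-neighbor upsampling of a 2-D map by an integer factor f."""
    return np.repeat(np.repeat(x, f, axis=0), f, axis=1)

def coarse_to_fine(dmpac, residual):
    """Add the quarter-resolution CNN residual to the DMPAC stream at
    three scales, doubling the resolution at each step."""
    s1 = downsample(dmpac, 4) + residual            # first summer, 1/4 res
    s2 = upsample(s1, 2) + upsample(residual, 2)    # second summer, 1/2 res
    s3 = upsample(s2, 2) + upsample(residual, 4)    # third summer, full res
    return s3
```

Applying the residual at several scales lets coarse corrections propagate over large flat regions while the final full-resolution sum preserves detail.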
(48)
(49) The output of the eighth layer 818 is provided to a first residuals module 822 and a tenth layer 820. The first residuals module 822 is a residuals module as shown in
(50) The disparity or motion stream of the DMPAC module 522 is provided to a first downsampling layer 821, which downsamples the disparity or motion stream to match the eight times downsampled outputs of the first residuals module 822 and has one output channel. The outputs of the first downsampling layer 821 and the first residuals module 822 are summed by a ninth layer 823, which has one output channel for disparity and two output channels for motion.
(51) The tenth layer 820 is an upsampling convolutional layer with 64 output channels and an upsampling of two. The output channels of the sixth layer 814, the tenth layer 820 and the ninth layer 823 are provided to an eleventh layer 826, which is a concatenating layer, so that the eleventh layer 826 has 129 or 130 input channels. In addition to concatenating, the eleventh layer 826 is a convolutional layer with 64 output channels.
(52) The output of the eleventh layer 826 is provided to a second residuals module 830 and a twelfth layer 828. The second residuals module 830 is a residuals module as shown in
(53) The twelfth layer 828 is an upsampling convolutional layer with 32 output channels and an upsampling of two. The output channels of the fourth layer 810, the twelfth layer 828 and the thirteenth layer 831 are provided to a fourteenth layer 834, which is a concatenating layer, so that the fourteenth layer 834 has 65 input channels. In addition to concatenating, the fourteenth layer 834 is a convolutional layer with 32 output channels.
(54) The output of the fourteenth layer 834 is provided to a third residuals module 838 and a fifteenth layer 836. The third residuals module 838 is a residuals module as shown in
(55) The fifteenth layer 836 is an upsampling convolutional layer with 16 output channels and an upsampling of two. The output of the fifteenth layer 836 is concatenated with the output of the first layer 804 and the output of the sixteenth layer 839 and provided to a fourth residuals module 842, so that the fourth residuals module 842 has 33 input channels. The fourth residuals module 842 is a residuals module as shown in
(56) A summer 844 combines the output of the fourth residuals module 842 and the disparity or motion stream of the DMPAC module 522 to provide the refined disparity or motion stream.
(57) These are four examples of CNN configurations to operate with a block matching hardware module such as a DMPAC module. These examples are small enough to operate in real time on standard SoCs. Many other CNN configurations can be developed based on the teachings provided by these examples and this description.
(58) In one example, training of the stereo configuration was done using the KITTI stereo 2015 dataset, available at www.cvlibs.net/datasets/kitti/index.php and referenced generally in Andreas Geiger, Philip Lenz, and Raquel Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” Proc. Computer Vision Pattern Recognition, 2012. The dataset was randomly divided into a training set (80%) and a test set (20%). During training, for each epoch, the training set was divided into a training part (90%) and a validation part (10%).
(59) In one example, the KITTI stereo 2012 dataset was used for training instead of the KITTI stereo 2015 dataset.
(60) In one example, training of the optical flow configuration was done using the virtual KITTI dataset, available at europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds. The dataset was divided into a training set (80%) and a test set (20%); because the virtual KITTI dataset contains five different driving scenes, the division was done according to driving scenario. During training, for each epoch, the training set was divided into a training part (90%) and a validation part (10%).
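The splitting procedure described in these examples is straightforward to reproduce. The function names and the fixed seed below are illustrative, not from the text:

```python
import random

def split_80_20(samples, seed=0):
    """Random 80/20 train/test split, as described for the KITTI data."""
    rng = random.Random(seed)
    s = list(samples)
    rng.shuffle(s)
    n_test = len(s) // 5
    return s[n_test:], s[:n_test]          # (train, test)

def epoch_split(train, rng):
    """Per-epoch 90/10 split of the training set into a training part
    and a validation part (reshuffled each epoch)."""
    s = list(train)
    rng.shuffle(s)
    n_val = len(s) // 10
    return s[n_val:], s[:n_val]            # (training part, validation part)
```

For the virtual KITTI case, the shuffle would be replaced by grouping samples by driving scene before assigning whole scenes to the train or test side.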
(61) In one example the training was done using the Adam optimizer, Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” 3rd International Conference for Learning Representations, San Diego, 2015, available at arxiv.org/abs/1412.6980. The initial learning rate R was 0.001. The validation loss was monitored to modify the learning rate. If the validation loss did not decrease for longer than N1=7 epochs, the learning rate was decreased by 50%. If the validation loss did not decrease for longer than N2=18 epochs, the training was stopped.
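The learning-rate rule can be replayed on a history of validation losses. One ambiguity is hedged here: the text does not say whether the plateau counter resets after a halving, so this sketch halves once when the counter reaches N1 and stops at N2.

```python
def run_schedule(val_losses, lr0=0.001, n1=7, n2=18):
    """Halve the learning rate after n1 epochs without a new best
    validation loss; stop training after n2 such epochs."""
    lr, best, since = lr0, float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, since = loss, 0
        else:
            since += 1
            if since == n1:
                lr *= 0.5          # 50% decrease
            if since == n2:
                break              # early stopping
    return lr
```

This mirrors the behavior of common reduce-on-plateau schedulers with factor 0.5 and patience 7, combined with early stopping at patience 18.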
(62) The results of using one example of the sequential configuration and one example of the hourglass configuration are shown in
(63)
(64) The above description is intended to be illustrative, and not restrictive. For example, the above-described examples may be used in combination with each other. Many other examples will be apparent upon reviewing the above description. The scope should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.”