Deep Learning Based Multi-Sensor Detection System for Executing a Method to Process Images from a Visual Sensor and from a Thermal Sensor for Detection of Objects in Said Images

20230237785 · 2023-07-27

    Abstract

    A Deep Learning based Multi-sensor Detection System for executing a method to process images from a visual sensor and from a thermal sensor for detection of objects in said images, wherein a first deep learning network for processing images from the visual sensor and a second deep learning network for processing images from the thermal sensor are jointly used and collaboratively trained for improving both networks' ability to accurately detect said objects in said images.

    Claims

    1. A Deep Learning based Multi-sensor Detection System for executing a method to process images from a visual sensor and from a thermal sensor for detection of objects in said images, wherein a first deep learning network for processing images from the visual sensor and a second deep learning network for processing images from the thermal sensor are jointly used and collaboratively trained for improving both networks' ability to accurately detect said objects in said images.

    2. The Deep Learning based Multi-sensor Detection System of claim 1, that learns from data from at least two different sensors by jointly and collaboratively training two deep learning networks, one on images from a visual camera sensor and another on thermal data from a thermal sensor to improve an object detector's performance across varying lighting and weather conditions.

    3. The Deep Learning based Multi-sensor Detection System of claim 1, wherein the first deep learning network for processing images from the visual sensor and the second deep learning network for processing images from the thermal sensor receive visual data and thermal data, respectively, that are derived from the same scene.

    4. The Deep Learning based Multi-sensor Detection System of claim 1, wherein a mimicry loss is determined between the first deep learning network for processing images from the visual sensor and the second deep learning network for processing images from the thermal sensor, and used for improving the accuracy of both said networks.

    5. The Deep Learning based Multi-sensor Detection System of claim 4, wherein the mimicry loss is used to align the feature spaces of both networks and helps in each network learning complementary knowledge of data from the other network, while a supervised loss helps in retaining the knowledge of a network's own data.

    6. The Deep Learning based Multi-sensor Detection System of claim 4, wherein an overall loss function for each of the first network and second network is determined which is represented by the sum of the mimicry loss and the supervised detection loss of the first network and second network, respectively.

    7. The Deep Learning based Multi-sensor Detection System of claim 1, wherein each of the first network and the second network comprises an encoder and a detection head for localization and classification of objects in the images, and that both the first network and the second network are provided with a decoder taking features from intermediate layers of the encoder to reconstruct the images.

    8. The Deep Learning based Multi-sensor Detection System of claim 7, wherein the decoder for the visual images takes features from the encoder for the visual images, and wherein the decoder for the thermal images takes features from the encoder for the thermal images.

    9. The Deep Learning based Multi-sensor Detection System of claim 7, wherein the decoder for the visual images takes features from the encoder for the thermal images, and wherein the decoder for the thermal images takes features from the encoder for the visual images.

    Description

    BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

    [0018] The invention will hereinafter be further elucidated with reference to the drawing of an exemplary embodiment of a MultiModal Framework according to the invention to combine data from different sensors to provide a reliable and comprehensive detection system that is not limiting as to the appended claims. The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more embodiments of the present invention and, together with the description, serve to explain the principles of the invention. The drawings are only for the purpose of illustrating one or more embodiments of the invention and are not to be construed as limiting the invention. In the drawing:

    [0019] FIG. 1 shows an example of visual images derived from a prior art detection system for objects in such images;

    [0020] FIG. 2 shows an example of images derived from a detection system according to an embodiment of the present invention for objects in such images;

    [0021] FIG. 3 shows a schematic representation of a multimodal framework according to an embodiment of the present invention;

    [0022] FIG. 4 shows a schematic representation of a multimodal framework according to an embodiment of the present invention completed with a regular reconstruction facility; and

    [0023] FIG. 5 shows a schematic representation of a multimodal framework according to an embodiment of the present invention completed with a cross reconstruction facility.

    [0024] Whenever in the figures the same references or reference numerals are applied, these references or reference numerals refer to the same parts.

    DETAILED DESCRIPTION OF THE INVENTION

    [0025] FIG. 1 shows that visual images derived from a prior art detection system for objects in such images suffer from the problem that pedestrians and vehicles masked by the headlight beam are not clearly visible (and hence not predicted) when using just visual images, although they are very clearly seen in the corresponding thermal images. The shown images are from the FLIR dataset, see: Teledyne FLIR, https://www.flir.eu/oem/adas/adas-dataset-form/, 2018. In FIG. 1, the pedestrians are obscured and missed in the RGB images but seen clearly in the thermal images.

    [0026] FIG. 2 shows an example of images derived from a detection system according to an embodiment of the present invention for objects in such images. The visual information is integrated with thermal information, which helps in detecting people and vehicles in difficult scenarios. Again, these images are taken from the above-mentioned FLIR dataset. The addition of thermal data according to the invention helps in detecting pedestrians and cars that are not clearly visible due to lighting conditions and headlight glare, as highlighted in yellow.

    [0027] FIG. 3 shows the scheme according to which a Deep Learning based Multi-sensor Detection System is set up for executing a method to process images from a visual sensor and from a thermal sensor for detection of objects in said images, wherein a first deep learning network for processing images from the visual sensor and a second deep learning network for processing images from the thermal sensor are jointly used and collaboratively trained for improving both networks' ability to accurately detect said objects in said images. FIG. 3 is a schematic of the MMC framework with the RGB network (red hue) and the thermal network (grey hue).

    [0028] With reference to FIG. 3, a MultiModal-Collaborative (MMC) framework is depicted with two networks that are trained in a collaborative manner. As an example, the data from the visual sensor are referred to as RGB data. The RGB network is shown on the upper part of the figure and receives the RGB images, while the thermal network, shown below the RGB network, receives the corresponding thermal images as input. The collaborative training framework provides flexibility for each network to learn complementary knowledge from the other modality without impeding its ability to learn from the modality it is predominantly trained on. Each network is trained with a supervised detection loss, and the Kullback-Leibler (KL) divergence is used for the mimicry loss.

    [0029] The overall loss function per network is the sum of the detection loss and the mimicry loss. The KL divergence $D_{KL}$ is applied on the soft logits $p_{rgb}$ and $p_{thm}$; $\lambda_{rgb}$ and $\lambda_{thm}$ are the balancing weights:

$$\mathcal{L}_{MMC\text{-}RGB} = \mathcal{L}_{Det} + \lambda_{rgb}\, D_{KL}(p_{rgb} \,\|\, p_{thm})$$

$$\mathcal{L}_{MMC\text{-}Thm} = \mathcal{L}_{Det} + \lambda_{thm}\, D_{KL}(p_{thm} \,\|\, p_{rgb})$$
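    By way of illustration, the mimicry term and the per-network overall loss above can be sketched as follows. This is a minimal NumPy sketch, not the disclosed implementation: the function names, the toy soft logits, and the scalar detection-loss and weight values are hypothetical stand-ins.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """D_KL(p || q) for discrete distributions (e.g. softened class scores)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mmc_loss(det_loss, p_own, p_other, lam):
    """Overall MMC loss for one network: its supervised detection loss
    plus the weighted mimicry (KL) term toward the other modality."""
    return det_loss + lam * kl_divergence(p_own, p_other)

# Hypothetical soft logits for the same detection from both networks
p_rgb = np.array([0.7, 0.2, 0.1])
p_thm = np.array([0.6, 0.3, 0.1])

loss_rgb = mmc_loss(det_loss=1.5, p_own=p_rgb, p_other=p_thm, lam=0.5)
loss_thm = mmc_loss(det_loss=1.2, p_own=p_thm, p_other=p_rgb, lam=0.5)
```

    Each network thus mimics the other's output distribution while still being anchored by its own supervised detection loss.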

    [0030] The detection loss is a weighted summation of classification and regression losses:

    [00001] $$\mathcal{L}_{Det} = \frac{1}{N}\left(\mathcal{L}_{Cls} + \lambda_{Reg}\, \mathcal{L}_{Reg}\right)$$
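    The weighted summation can be sketched in a few lines of Python; the interpretation of $N$ as the number of matched (positive) anchors is an assumption, as are the example loss values:

```python
def detection_loss(cls_loss, reg_loss, num_pos, lam_reg=1.0):
    """L_Det = (1/N) * (L_Cls + lambda_Reg * L_Reg); num_pos (N) is
    assumed to be the number of positive anchors used for normalization."""
    return (cls_loss + lam_reg * reg_loss) / num_pos

# Hypothetical per-batch loss values
example = detection_loss(cls_loss=4.0, reg_loss=2.0, num_pos=2, lam_reg=1.0)
```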

    [0031] To further encourage the method according to an embodiment of the present invention to explore the input feature space exhaustively and extract all the semantic information into the learned representations, an auxiliary task for reconstructing the inputs can be applied. The auxiliary task network takes in the features from the intermediate layers of encoders and aims to reconstruct the input image via the decoders. Hence, each of the first network and the second network comprises an encoder and a detection head for localization and classification of objects in the images, and both the first network and the second network are provided with a decoder taking features from intermediate layers of the encoder to reconstruct the images. There are two possible embodiments: [0032] MMC+Reconstruction [0033] MMC+Cross Reconstruction

    [0034] In the first embodiment, providing MMC+Reconstruction, the decoder for the visual images takes features from the encoder for the visual images, and the decoder for the thermal images takes features from the encoder for the thermal images. This is shown in FIG. 4, which is a schematic of MMC with reconstruction (decoders are shown in blue hue). The reconstruction loss for each network is shown below; $x_{rgb}$ and $x_{thm}$ are the inputs, and $Enc$ and $Dec$ denote the encoder and the decoder used for feature extraction and reconstruction, respectively.


$$\mathcal{L}_{Rec\text{-}RGB} = \sum\left(x_{rgb} - Dec_{rgb}(Enc_{rgb}(x_{rgb}))\right)^2$$

$$\mathcal{L}_{Rec\text{-}Thm} = \sum\left(x_{thm} - Dec_{thm}(Enc_{thm}(x_{thm}))\right)^2$$
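    The same-modality reconstruction loss can be illustrated with a toy sketch; the lambda stand-ins for $Enc$ and $Dec$ are hypothetical placeholders for the learned convolutional encoder and decoder, and the example input is synthetic:

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """Sum of squared errors between an input image and its reconstruction."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.sum((x - x_hat) ** 2))

# Toy stand-ins for Enc_rgb / Dec_rgb (real ones are learned networks)
enc_rgb = lambda x: 0.5 * x   # "features"
dec_rgb = lambda f: 2.0 * f   # reconstruction from features

x_rgb = np.array([1.0, 2.0, 3.0])
rec_loss = reconstruction_loss(x_rgb, dec_rgb(enc_rgb(x_rgb)))
```

    A perfect toy reconstruction yields zero loss; in training, minimizing this term pushes the encoder features to retain enough information to rebuild the input.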

    [0035] FIG. 5 shows an alternative embodiment, wherein the decoder for the visual images takes features from the encoder for the thermal images, and wherein the decoder for the thermal images takes features from the encoder for the visual images. FIG. 5 is a schematic of MMC with cross reconstruction. The encoder and decoder are thus of different modalities. This encourages the backbone to disentangle texture and semantic features and to learn to utilize the semantic features from a thermal image to reconstruct the corresponding RGB image. For the downstream task, the detection head selects the relevant semantic features, which helps in domain adaptation because the semantic features remain the same under different lighting conditions. The cross-reconstruction loss for each network in this embodiment is shown below.


$$\mathcal{L}_{CrossRec\text{-}RGB} = \sum\left(x_{rgb} - Dec_{rgb}(Enc_{thm}(x_{thm}))\right)^2$$

$$\mathcal{L}_{CrossRec\text{-}Thm} = \sum\left(x_{thm} - Dec_{thm}(Enc_{rgb}(x_{rgb}))\right)^2$$
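    The cross-wiring of encoders and decoders can be sketched analogously; again the encoder/decoder lambdas and the perfectly aligned toy inputs are hypothetical stand-ins, not the disclosed networks:

```python
import numpy as np

def cross_rec_loss(x_target, x_other, enc_other, dec_target):
    """Cross reconstruction: features are encoded from the *other*
    modality's input and decoded back into this modality's image
    space, then compared against this modality's input."""
    x_target = np.asarray(x_target, dtype=float)
    recon = dec_target(enc_other(np.asarray(x_other, dtype=float)))
    return float(np.sum((x_target - recon) ** 2))

# Toy stand-ins for Enc_thm / Dec_rgb (hypothetical)
enc_thm = lambda x: x + 1.0
dec_rgb = lambda f: f - 1.0

# Same scene in both modalities (idealized toy data)
x_rgb = np.array([0.0, 1.0])
x_thm = np.array([0.0, 1.0])
loss_cross_rgb = cross_rec_loss(x_rgb, x_thm, enc_thm, dec_rgb)
```

    Because reconstruction must succeed across modalities, the encoders are pushed toward modality-independent semantic features, which is what aids domain adaptation under changing lighting.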

    [0036] Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, BASIC, Java, Python, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code, and the software is preferably stored on one or more tangible non-transitory memory-storage devices.

    [0037] Although the invention has been discussed in the foregoing with reference to exemplary embodiments of the Deep Learning based Multi-sensor Detection System of the invention, the invention is not restricted to these particular embodiments which can be varied in many ways without departing from the invention. The discussed exemplary embodiments shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary the embodiments are merely intended to explain the wording of the appended claims without intent to limit the claims to these exemplary embodiments. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using these exemplary embodiments.

    [0038] Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other. Although the invention has been described in detail with particular reference to the disclosed embodiments, other embodiments can achieve the same results. Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguration of their relationships with one another.