METHOD FOR TRAINING A NEURAL NETWORK TO DELIVER THE VIEWPOINTS OF OBJECTS USING UNLABELED PAIRS OF IMAGES, AND THE CORRESPONDING SYSTEM
20220058484 · 2022-02-24
Assignee
- Toyota Jidosha Kabushiki Kaisha (Toyota-shi Aichi-ken, JP)
- The University Court Of The University Of Edinburgh (Edinburgh, GB)
CPC classification
- G06V10/7753 (Physics)
- G06F18/2155 (Physics)
Abstract
A system and a method for training a neural network to deliver the viewpoint of objects, the method comprising minimizing the distances between, for each training image of a first set of training images, the output of the neural network when this training image is inputted and the viewpoint of this training image, and, for each pair of a second set of training image pairs, the second image of the pair and the output of a decoder neural network obtained when the first image of the pair is inputted to an encoder neural network, the second image of the pair is inputted to the neural network to obtain a viewpoint, the obtained encoded image is rotated according to this viewpoint, and the rotated encoded image is decoded.
Claims
1. A method for training a neural network to deliver a viewpoint of a given object visible on an image when this image is inputted to this neural network, the method comprising: providing an encoder neural network configured to receive an image as input and to deliver an encoded image, providing a decoder neural network configured to receive an encoded image having the same dimensions as an encoded image delivered by the encoder neural network, and configured to output a decoded image, providing a first set of training images with, for each image, the viewpoint of an object belonging to a given category which is visible on the image, and providing a second set of training image pairs, wherein each pair of the second set of training image pairs comprises: a first image on which an object belonging to the given category is visible; and a second image on which the object of the first image is visible with a viewpoint which differs from the viewpoint in the first image, and wherein training the neural network comprises adapting parameters of the neural network, parameters of the encoder neural network, and parameters of the decoder neural network by minimizing the distances between: for each training image of the first set of training images, the output of the neural network when the training image is inputted to the neural network, with the viewpoint of the training image, and for each pair of the second set of training image pairs, the second image of each pair of the second set of training image pairs with the output of the decoder neural network when: the first image of this pair is inputted to the encoder neural network to obtain an encoded image, the second image of this pair is inputted to the neural network to obtain a viewpoint, the encoded image is rotated by a rotation corresponding to this viewpoint to obtain a rotated encoded image, and the rotated encoded image is inputted into the decoder neural network to obtain the output of the decoder neural network.
2. The method of claim 1, wherein the viewpoint of an object visible on an image comprises 3 values defining a vector expressed in a referential centered with respect to the object and oriented towards an image acquisition apparatus used to acquire the image.
3. The method of claim 1, wherein the encoded image is a vector having a resolution which is lower than the resolution of the image.
4. The method of claim 1, wherein the dimension of the encoded image is a multiple of three.
5. The method of claim 1, wherein training the neural network is performed using the following loss function:
6. The method of claim 5, wherein distances are calculated using perceptual loss.
7. The method of claim 1, wherein the neural network, and/or the encoder neural network, and/or the decoder neural network are convolutional neural networks.
8. A neural network trained by the method according to claim 1.
9. A system for training a neural network to deliver a viewpoint of a given object visible on an image when this image is inputted to this neural network, the system comprising: an encoder neural network configured to receive an image as input and to deliver an encoded image, a decoder neural network configured to receive an encoded image having the same dimensions as an encoded image delivered by the encoder neural network, and configured to output a decoded image, a first set of training images with, for each image, the viewpoint of an object belonging to a given category which is visible on the image, and a second set of training image pairs, wherein each pair of the second set of training image pairs comprises: a first image on which an object belonging to the given category is visible; and a second image on which the object of the first image is visible with a viewpoint which differs from the viewpoint in the first image, and a training module configured to adapt parameters of the neural network, parameters of the encoder neural network, and parameters of the decoder neural network by minimizing distances between: for each training image of the first set of training images, the output of the neural network when the training image is inputted to the neural network, with the viewpoint of the training image, and for each pair of the second set of training image pairs, the second image of each pair of the second set of training image pairs with the output of the decoder neural network when: the first image of this pair is inputted to the encoder neural network to obtain an encoded image, the second image of this pair is inputted to the neural network to obtain a viewpoint, the encoded image is rotated by a rotation corresponding to this viewpoint to obtain a rotated encoded image, and the rotated encoded image is inputted into the decoder neural network to obtain the output of the decoder neural network.
10. A system including the neural network according to claim 8.
11. A vehicle comprising the system according to claim 10.
12. A recording medium readable by a computer and having recorded thereon a computer program including instructions for executing the method according to claim 1.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0089] How the present disclosure may be put into effect will now be described by way of example with reference to the appended drawings.
DESCRIPTION OF THE EMBODIMENTS
[0094] An exemplary method for training a neural network to deliver the viewpoint of a given object visible on an image will now be described.
[0095] The viewpoint of an object is defined as the combination of the azimuth angle of the object with respect to a camera, the elevation of the object, and the in-plane rotation of the object.
[0097] In some embodiments, the vector v has a norm of 1 (the three coordinates define a point on a sphere of radius 1), as this facilitates expressing a rotation, as will be described hereinafter.
[0098] Also, the referential is associated with a given orientation of the object OBJ, for all the objects having the same category (for example car).
[0099] The methods of the disclosure relate to training a neural network so that it can output the three values a^1, a^2, and a^3.
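The norm-1 viewpoint vector described above can be illustrated with a short sketch. The parameterisation from azimuth and elevation angles below is an assumption consistent with the description (the in-plane rotation component is handled separately and is not encoded in this vector):

```python
import math

def viewpoint_vector(azimuth, elevation):
    """Unit vector (a^1, a^2, a^3) pointing from the object centre toward
    the camera, given azimuth and elevation in radians. By construction the
    three coordinates define a point on a sphere of radius 1."""
    a1 = math.cos(elevation) * math.cos(azimuth)
    a2 = math.cos(elevation) * math.sin(azimuth)
    a3 = math.sin(elevation)
    return (a1, a2, a3)
```

For example, azimuth 0 and elevation 0 gives the vector (1, 0, 0), and any choice of angles yields a vector of norm 1.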
[0100] As can be conceived by the person skilled in the art, this training will be directed to categories of objects. For example, the neural network will be trained to deliver the viewpoint of a car when a car is visible on the image. The disclosure is however not limited to the detection of the viewpoint of a car but can also concern other objects, including objects which can be observed on a road.
[0104] Consider a set of m labelled images with their ground-truth viewpoints with respect to a camera, defined as T = {(I_i, v_i)}_{i=1..m}, where I_i is an RGB image which belongs to I and v_i = (a^1, a^2, a^3) ∈ V is the three-dimensional vector of the ground-truth viewpoint of the object visible on the image. The neural network NN performs the function f_v : I → V such that f_v(I; θ_v) = v, where θ_v are the parameters of f_v. In a manner which is known in the art, it is possible to train this neural network by minimizing the following sum:

Σ_{i=1}^{m} d(f_v(I_i; θ_v), v_i),

where d denotes a distance between viewpoints.
[0105] This training will include adapting θ_v, for example by performing a stochastic gradient descent.
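The supervised objective above can be illustrated with a toy stand-in for f_v: here a linear map replaces the neural network, trained by stochastic gradient descent on synthetic (image, viewpoint) data. All shapes, the learning rate, and the squared-distance choice for d are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labelled set T: m flattened "images" X and viewpoints V,
# generated from a hidden linear map so the problem is learnable.
m, dim_in = 32, 8
X = rng.normal(size=(m, dim_in))
W_true = rng.normal(size=(dim_in, 3))
V = X @ W_true                              # ground-truth viewpoints v_i

theta_v = np.zeros((dim_in, 3))             # parameters of f_v (linear here)

def supervised_loss(theta):
    """Average squared distance between f_v(I_i) and v_i over T."""
    resid = X @ theta - V
    return (resid ** 2).sum() / m

for _ in range(200):
    i = rng.integers(m)                     # draw one labelled sample
    grad = 2.0 * np.outer(X[i], X[i] @ theta_v - V[i])
    theta_v -= 0.05 * grad                  # SGD update on theta_v

final = supervised_loss(theta_v)
```

After a few hundred updates the loss is far below its value at initialisation, mirroring the effect of the stochastic gradient descent mentioned in paragraph [0105].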
[0106] It should be noted that this training is often designated as supervised training.
[0107] In the present method, additional images are used to train the neural network. T is a first set of training images, and a second set U of training image pairs is also provided. The images of the second set can be unlabeled, which means that there is no a priori knowledge of the viewpoint of the objects visible on the images of this set.
[0108] The second set contains training image pairs, with each pair containing:
[0109] a first image on which an object belonging to the given category is visible; and
[0110] a second image on which the object of the first image is visible with a viewpoint which differs from the viewpoint in the first image.
[0111] Thus, the second set U is designated as U = {(I_i, I_i′)}, and each pair contains images of a same object, for example a same car or plane, captured at different viewpoints.
[0112] In order to use the second set U to train the neural network NN, an encoder neural network ENN is provided. This encoder neural network is configured to receive an image (I on the figure) as input, and to deliver an encoded image as output (EI on the figure).
[0113] For example, the encoder neural network is a convolutional neural network including five blocks, with each block comprising two convolutional layers with the second convolutional layer using stride in order to reduce spatial dimensions. The convolutions are 3×3 convolutions with a channel depth which starts at 32 and which doubles every block. These five blocks of the encoder neural network are further connected to a fully connected layer.
[0114] Because a fully connected layer is used, the output of the encoder neural network is a vector. In some embodiments, the dimension of this vector is lower than the resolution of the image I (image height times image width times 3 for RGB). Also, the dimension of this vector may be a multiple of three so as to facilitate a subsequent rotation.
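The dimensions flowing through the encoder described above can be traced with a short sketch. The 64×64 input resolution is an assumed example (the description does not fix it); the five blocks, stride-2 second convolution per block, and channel depth starting at 32 and doubling per block follow paragraph [0113]:

```python
def encoder_shapes(h=64, w=64, in_ch=3, blocks=5, base_ch=32):
    """Trace (channels, height, width) through the convolutional part of
    the encoder: each block has two 3x3 convolutions, the second using
    stride 2 (with padding 1), which halves the spatial dimensions."""
    shapes = [(in_ch, h, w)]
    ch = base_ch
    for _ in range(blocks):
        h, w = (h + 1) // 2, (w + 1) // 2   # stride-2 halving
        shapes.append((ch, h, w))
        ch *= 2                              # depth doubles every block
    return shapes
```

With a 64×64 RGB input, the feature map shrinks to 2×2 with 512 channels before the fully connected layer flattens it into the encoded vector.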
[0116] Also, there is provided a decoder neural network DNN configured to receive an encoded image as input having the same dimensions as the encoded images outputted by the encoder neural network ENN, and configured to output images which have the same dimensions as the images inputted to the encoder neural network ENN.
[0118] The structure of the decoder neural network is a mirrored version of the structure of the encoder neural network.
[0119] It appears that the encoder neural network and the decoder neural network form an auto-encoder.
[0120] The operation of the decoder neural network, for example when used in an auto-encoder operation, can be written as f_d(f_e(I; θ_e); θ_d), with θ_d being the parameters of the decoder neural network DNN which will be adapted during training.
[0121] While an auto-encoder trained in this manner can produce decoded images which correspond to the original images, the information regarding the viewpoint may not be clearly usable in the encoded image. Instead, the present method involves a conditional image generation technique.
[0122] In the present method, for a given pair of images (I_i, I_i′) that show a same object under different viewpoints, the viewpoint of the object visible on the second image I′ of a pair will be used to deduce a rotation ROT to be applied to the encoded image obtained from the first image I of this pair, before inputting the rotated encoded image to the decoder neural network. Consequently, the image delivered by the decoder neural network should correspond to the second image I′; at least, minimizing the distance between I′ and the output of the decoder neural network is the goal of the training.
[0123] If the viewpoint of image I′ is unknown (i.e. I′ is an unlabeled image), determining this viewpoint may be done using the neural network NN. The neural network NN outputs a viewpoint v from which a rotation matrix can be deduced to perform a rotation operation ROT which rotates the encoded image EI into a rotated encoded image REI. The rotation is performed as a multiplication between the rotation matrix and the encoded image EI, which is possible because the dimension of EI is a multiple of three.
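The rotation of the encoded image can be sketched as follows: the latent vector, whose length is a multiple of three, is viewed as a list of 3-D points, and each point is multiplied by the 3×3 rotation matrix. This is a minimal sketch of one natural reading of the multiplication described above, not the patent's exact implementation:

```python
import numpy as np

def rotate_encoded(encoded, R):
    """Rotate an encoded image EI into REI: reshape the latent vector
    (length a multiple of 3) into 3-D points, apply the rotation matrix
    R to each point, and flatten back to the original shape."""
    pts = np.asarray(encoded, dtype=float).reshape(-1, 3)
    return (pts @ R.T).reshape(-1)
```

Because R is orthonormal, the operation preserves the norm of the encoded vector; the identity matrix leaves it unchanged.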
[0124] By way of example, deducing this rotation matrix from the viewpoint v can be performed using the “look at” transformation, which is well known to the person skilled in the art. For example, this transformation is used in the library OpenGL in its version 2.1. An explanation of the operation of this transformation was available in August 2020 at the URL: https://www.khronos.org/registry/OpenGL-Refpages/gl2.1/xhtml/gluLookAt.xml. In the example described at this URL, “eye” is equivalent to the viewpoint, “center” is set at (0,0,0), and “up” at (0,0,1).
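A rotation matrix in the spirit of gluLookAt, with “center” at the origin and “up” at (0,0,1) as in the cited example, can be sketched as follows. This is a simplified version that returns only the 3×3 rotation part (gluLookAt itself builds a 4×4 view matrix including a translation):

```python
import numpy as np

def look_at_rotation(eye, center=(0.0, 0.0, 0.0), up=(0.0, 0.0, 1.0)):
    """3x3 rotation built as in gluLookAt: 'eye' plays the role of the
    viewpoint vector, 'center' the object origin. 'up' must not be
    collinear with the eye-to-center direction, or the cross product
    degenerates."""
    eye, center, up = (np.asarray(v, dtype=float) for v in (eye, center, up))
    f = center - eye
    f /= np.linalg.norm(f)                  # forward direction
    s = np.cross(f, up)
    s /= np.linalg.norm(s)                  # side direction
    u = np.cross(s, f)                      # recomputed up direction
    return np.stack([s, u, -f])             # rows form an orthonormal basis
```

The result is orthonormal with determinant 1, i.e. a proper rotation, and can be applied directly to the 3-D points of the encoded vector.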
[0125] This feature addresses the lack of ground-truth for I′, and extends the learning of the encoder/decoder neural networks to unlabeled images by allowing gradients originating from the decoder to be back-propagated to the neural network NN.
[0126] The above use of the neural network NN leads to a training which can be designated as unsupervised training.
[0127] It can be conceived that using the neural network NN to obtain the viewpoint is only relevant if the neural network NN is trained and accurate. In order to synergistically use the labeled images and the unlabeled images during the training so as to better train the neural network NN, it is proposed to combine in a single loss function a loss associated with the unlabeled images U and a loss associated with the labeled images T. Thus, the present method combines a supervised training and an unsupervised training.
[0128] In the present method, training the neural network NN comprises adapting the parameters of the neural network NN, the parameters of the encoder neural network ENN, and the parameters of the decoder neural network DNN (respectively θ_v, θ_e, θ_d) by minimizing the distances between:
[0129] for each training image of the first set T of training images, the output of the neural network when the training image is inputted to the neural network, with the viewpoint of this training image, and
[0130] for each pair of the second set U of training image pairs, the second image of each pair with the output of the decoder neural network when:
[0131] the first image I of this pair is inputted to the encoder neural network ENN to obtain an encoded image EI,
[0132] the second image I′ of this pair is inputted to the neural network NN to obtain a viewpoint v,
[0133] the encoded image EI is rotated by a rotation ROT corresponding to this viewpoint to obtain a rotated encoded image REI, and
[0134] the rotated encoded image REI is inputted into the decoder neural network DNN to obtain the output of the decoder neural network DNN.
[0135] In other words, the following loss function L is used:

L(θ_v, θ_e, θ_d) = Σ_{(I_i, v_i) ∈ T} d(f_v(I_i; θ_v), v_i) + λ Σ_{(I_i, I_i′) ∈ U} d(I_i′, f_d(rot(f_e(I_i; θ_e), f_v(I_i′; θ_v)); θ_d))

[0136] In the above equation, d denotes a distance, rot denotes the rotation of an encoded image according to a viewpoint, and λ is a hyperparameter having a value which will be set during a calibration step. This hyperparameter indicates a tradeoff between the unsupervised and supervised training.
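The combined loss described in paragraphs [0128] to [0136] can be sketched with abstract stand-ins for the networks, the rotation, and the distances. All the callables below are hypothetical placeholders supplied by the caller, not the patent's concrete networks:

```python
def total_loss(labeled, pairs, f_v, f_e, f_d, rotate, d_view, d_img, lam):
    """Combined loss: the supervised viewpoint term over the labelled set T
    plus lam times the unsupervised reconstruction term over the pairs U.

    labeled : iterable of (image, viewpoint) from T
    pairs   : iterable of (first_image, second_image) from U
    """
    supervised = sum(d_view(f_v(img), v) for img, v in labeled)
    unsupervised = sum(
        # decode the rotated encoding of the first image; compare to the second
        d_img(second, f_d(rotate(f_e(first), f_v(second))))
        for first, second in pairs
    )
    return supervised + lam * unsupervised
```

With identity stand-ins and absolute-difference distances, the function reduces to the weighted sum of the two error terms, which makes the tradeoff role of lam easy to see.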
[0137] While the above formula is directed to using the entirety of T and U, training may be performed iteratively, with each iteration comprising selecting a given number of individual images (for example 64) from T and U so as to use them in the above two sums for calculating a loss to be used in the back-propagation (for example using the stochastic gradient method or another method).
[0138] Thus, a batch-training is performed.
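The iterative batch selection described above can be sketched as follows. Sampling without replacement per iteration is an assumption beyond the example count of 64 given in paragraph [0137]:

```python
import random

def sample_batch(T, U, batch_size=64, seed=None):
    """Select one iteration's training batch: up to batch_size labelled
    images from T and up to batch_size pairs from U, to be used in the
    two sums of the loss for one back-propagation step."""
    rnd = random.Random(seed)
    labeled = rnd.sample(T, min(batch_size, len(T)))
    pairs = rnd.sample(U, min(batch_size, len(U)))
    return labeled, pairs
```

Each call yields the material for one loss evaluation and one gradient update, so iterating over calls implements the batch-training of paragraph [0138].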
[0140] This system 100 comprises a processor 101 and a non-volatile memory 102. The system 100 therefore has a computer system structure.
[0141] In the non-volatile memory 102, the neural network NN, the encoder neural network ENN, and the decoder neural network DNN are stored.
[0142] Additionally, the first set T and the second set U are stored in the non-volatile memory 102.
[0143] A training module TR is also stored in the non-volatile memory 102, and this module can consist of computer program instructions which, when executed by the processor 101, will perform the training and adapt the weights θ_v, θ_e, and θ_d.
[0145] The system 201 comprises a processor 203 and a non-volatile memory 204 in which the neural network NN is stored after the training described above.
[0146] The above-described training allows obtaining neural networks which have been observed to perform better at detecting viewpoints than neural networks trained using only a labelled set of training images (supervised training). Notably, it has been observed that accuracy gains can be obtained even when only a portion of the labelled dataset is used for training.