Localising a vehicle

12530032 · 2026-01-20

Assignee

Oxford University Innovation Limited (Oxford, GB)

Inventors

Cpc classification

International classification

Abstract

A computerised method of generating a first trainable transform arranged to be used in localisation of an entity, the transform being arranged to transform a first representation of an environment to a second, different, representation of the environment, the method comprising processing a plurality of first training representations of an environment using the first trainable transform to generate a transformed first training representation; performing at least one of: i)) running at least one known process on the first training representation and a modified version of the first training representation to generate an error signal where in the process is selected such that the first trainable transform is arranged to enhance features within the first training representation; and ii) running at least one known process on a second training representation, corresponding to the first training representation, but under a different lighting condition, and on the modified version of the first training representation to generate an error signal wherein the process is selected such that the first trainable transform is arranged to enhance features within the first training representation; and; c) using the error signal to train the first transform.

Claims

1. A computerised method of generating a first trainable transform arranged to be used in localisation of an entity, the first trainable transform being arranged to transform a first representation of an environment representing a scene captured under a first lighting or weather condition to a second, different, representation of the environment modeling the same scene under a second, different lighting or weather condition, the method comprising: a) i) generating a modified first training representation of an environment by processing a first training representation of the environment using the first trainable transform, the first training representation representing a scene captured under the first lighting or weather condition and the modified first training representation modelling the same scene under the second lighting or weather condition; ii) generating a synthetic first training representation by processing the modified first training representation using a second trainable transform, the second trainable transform arranged to transform a training representation of an environment representing a scene captured under the second lighting or weather condition to a representation of the environment modelling the same scene under the first lighting or weather condition; iii) running at least one known process on the first training representation and the synthetic first training representation to generate a first error signal encapsulating a difference between the first training representation and the synthetic first training representation, wherein the process is selected such that the first trainable transform is arranged to enhance features within the first training representation; and iv) training at least the first trainable transform based on the first error signal; and; b) i) generating a modified second training representation of an environment by processing a second training representation of the environment using the first trainable transform, the second training representation representing a scene captured under the first lighting or weather condition and the modified second training representation of the environment modelling the same scene under the second lighting or weather condition; ii) running at least one known process on the modified second training representation, and on a third training representation captured under the second lighting or weather condition and representing the same scene as the second training representation, to generate a second error signal encapsulating a difference between the second training representation and the third training representation; and iii) training the first trainable transform based on the second error signal.

2. A method according to claim 1 wherein the first error signal is also used to train the second trainable transform.

3. A method according to claim 2 wherein the first and/or second trainable transforms are provided by a neural network.

4. A method according to claim 3 wherein the known process generates feature descriptors.

5. A method according to claim 4 wherein the known process detects features within the first and second representations.

6. A method according to claim 5 in which weights of the trainable transform are initialized before training commences.

7. A method according to claim 6 which trains a discriminator to be able to discriminate whether a representation is a synthetic representation.

8. A method according to claim 7 which repeats training of the method using a set of second training representations being representations corresponding to representations from the first training representations but under different lighting or weather conditions.

9. Use of a trainable transform, trained according to the method of claim 8, and used in a vehicle to localise, or at least assist, in localising the vehicle in matching input representation against a library of stored representations.

10. A vehicle having a sensor arranged to take current representations of the vehicle surroundings, the current representations capturing the vehicle surroundings under a first lighting or weather condition and the vehicle further comprising a processing circuitry having access to a library of stored representations of the surroundings, the stored representations including representations of the surroundings under a second, different lighting or weather condition, wherein the processing circuitry is arranged to perform the following: a) at least one of: i) transforming the current representations using a first trainable transform to transform the current representations to transformed representations modeling the same vehicle surroundings under the second lighting or weather condition and searching the library of stored representations to match the transformed image wherein the transformation is arranged to enhance features within the transformed representation; and ii) transforming at least some of the stored representations from the library using a second transform arranged to transform the stored representations to transformed stored representations modeling the same vehicle surroundings under the first lighting or weather conditions, and searching the transformed stored representations to match the current representation, wherein the transformation is arranged to enhance features within the transformed representation, wherein the first and second transforms are trained by a Machine Learning algorithm; b) using representations from the library of stored representations that are located in the search as matches to localise the vehicle.

11. A system arranged to train a first trainable transform arranged to transform a first representation of an environment representing a scene captured under a first lighting or weather condition to a second, different, representation of the environment modeling the scene under a second, different lighting or weather condition, the system comprising processing circuitry being programmed to: a) i) generating a modified first training representation of an environment by processing a first training representation of the environment using the first trainable transform, the first training representation representing a scene captured under the first lighting or weather condition and the modified first training representation modelling the same scene under the second lighting or weather condition; ii) generate a synthetic first training representation by processing the modified first training representation using a second trainable transform, the second trainable transform arranged to transform a training representation of an environment representing a scene captured under the second lighting or weather condition to a representation of the environment modeling the same scene under the first lighting or weather condition iii) run at least one known process on the first training representation and the synthetic training representation to generate an error signal encapsulating a difference between the first training representation and the synthetic first training representation, wherein the process is selected such that the first trainable transform is arranged to enhance features within the first training representation; and iv) train at least the first trainable transform based on the first error signal and; b) i) generating a modified second training representation of an environment by processing a second training representation of the environment representing a scene captured under the first lighting or weather condition and the modified second training representation of the environment modeling the same scene under the second lighting or weather condition; ii) run at least one known process on the modified second training representation, and on a third training representation captured under the second lighting or weather condition and representing the same scene as the second training representation, to generate a second error signal encapsulating a difference between the second training representation and the third training representation; and iii) train the first trainable transform based on the second error.

12. A machine readable medium containing instructions which when read by a computer cause that machine to train a first trainable transform arranged to be used in localisation of an entity, the first trainable transform being arranged to transform a first representation of an environment representing a scene captured under a first lighting or weather condition to a second, different, representation of the environment modelling the same scene under a second, different lighting or weather condition, the first trainable transform trained by performing at least: a) i) generating a modified first training representation of an environment by processing a first training representation of the environment using the first trainable transform, the first training representation representing a scene captured under the first lighting or weather condition and the modified first training representation modelling the same scene under the second lighting or weather condition; ii) generating a synthetic first training representation by processing the modified first training representation using a second trainable transform, the second trainable transform arranged to transform a training representation of an environment representing a scene captured under the second lighting or weather condition to a representation of the environment modelling the same scene under the first lighting or weather condition; iii) running at least one known process on the first training representation and the synthetic first training representation to generate a first error signal encapsulating a difference between the first training representation and the synthetic first training representation, wherein the process is selected such that the first trainable transform is arranged to enhance features within the first training representation; and iv) training at least the first trainable transform based on the first error signal; and; b) i) generating a modified second training representation of an environment by processing a second training representation of the environment using the first trainable transform, the second training representation representing a scene captured under the first lighting or weather condition and the modified second training representation of the environment modelling the same scene under second lighting or weather condition; ii) running at least one known process on the modified second training representation, and on a third training representation captured under the second lighting or weather condition and representing the same scene as the second training representation, to generate a second error signal encapsulating a difference between the second training representation and the third training representation; and iii) training the first trainable transform based on the second error signal.

13. A method of localising a vehicle comprising using a sensor of the vehicle arranged to take current representations of the surroundings of the vehicle, the current representations representing a scene captured under a first lighting or weather condition, wherein the method comprises: a) performing at least one of: i) transforming the current representations using a first trainable transform to transform the current representations to transformed current representations modeling the same scene under a second, different lighting or weather condition, and searching a library of stored representations to match the transformed current representations wherein the transformation is arranged to enhance features within the transformed representation, the stored representations including representations of the surroundings under a second, different lighting or weather condition; and ii) transforming at least some of the stored representations from the library using a second transform arranged to transform to transformed stored representations modeling the same scene under the first lighting or weather condition and searching the transformed stored representations to match the current representation wherein the transformation is arranged to enhance features within the transformed representation, wherein the first and second transforms are trained by a Machine Learning algorithm; b) using representations from the library of stored representations that are located in the searches of step a) i) and/or step a) ii) to localise the vehicle.

14. A machine readable medium containing instructions which when read by a computer cause that computer on a vehicle to: a) use a sensor of the vehicle arranged to take current representations of the surroundings of the vehicle, the current representations representing the vehicle surroundings captured under a first lighting or weather condition; b) performing at least one of: i) transforming the current representations using a first trainable transform arranged to transform the current representations to transformed current representations modeling the same surroundings under a second, different lighting or weather condition, and searching a library of stored representations to match the transformed current representations wherein the transform is arranged to enhance features within the transformed current representations; and ii) transforming at least some of the stored representations from the library using a second transform arranged to transform the stored representation to transformed stored representations modeling the same surroundings under the first lighting or weather condition, and searching the transformed stored representations to match the current representation wherein the transform is arranged to enhance features within the transformed representation, wherein the first and second transforms are trained by a Machine Learning algorithm; and c) using representations from the library of stored representations that are located as matches in the search to localise the vehicle.

15. The computerized method of claim 1, wherein the first lighting or weather condition and the second, different lighting or weather condition are the first lighting condition and the second, different lighting condition.

16. The computerized method of claim 1, wherein the first lighting or weather condition and the second, different lighting or weather condition are the first weather condition and the second, different weather condition.

17. The vehicle of claim 10, wherein the first lighting or weather condition and the second, different lighting or weather condition are the first lighting condition and the second, different lighting condition.

18. The vehicle of claim 10, wherein the first lighting or weather condition and the second, different lighting or weather condition are the first weather condition and the second, different weather condition.

19. The method of claim 13, wherein the first lighting or weather condition and the second, different lighting or weather condition are the first lighting condition and the second, different lighting condition.

20. The method of claim 13, wherein the first lighting or weather condition and the second, different lighting or weather condition are the first weather condition and the second, different weather condition.

Description

BRIEF DESCRIPTION OF AN EMBODIMENT OF THE INVENTION

(1) FIG. 1 schematically shows an embodiment of an embodiment;

(2) FIG. 2 schematically shows the architecture used in stage 1 arranged to train a pair of generators to transfer appearance from source to target representations, and vice-versa, without requiring registration of the representations;

(3) FIG. 3 schematically shows the architecture used in stage 2 of the training process on a well-aligned subset of the training representations in order to minimise the difference between feature detector and descriptor layers between different conditions;

(4) FIG. 4 shows the Haar response stack;

(5) FIG. 5 schematically shows the internal architecture;

(6) FIG. 6 illustrates images generated (right hand images) from an input image (left hand image);

(7) FIG. 7 highlights localisation error and shows real day to real night localisation vs. real day to synthetic day localisation;

(8) FIG. 8 shows the inlier count for real day vs. real night localisation and real day vs. synthetic night as a function of distance travelled;

(9) FIG. 9 shows a graph giving probability of dead reckoning as a function of distance, when a localisation failure has occurred;

(10) FIG. 10 shows example feature-based localisation between different conditions using appearance transfer, where the top two images of each set of four images shows matching between real images and the bottom two images shows matching between the same real image and a synthetic image (where horizontal lines represent inlier matches; and

(11) FIG. 11 shows a flowchart explaining a method.

(12) FIG. 1 describes a vehicle arranged to a trainable transform that have been trained according to a later described embodiment. Thus, FIG. 1 describes a system that utilises the output, a trainable transform, of the later described training process.

(13) FIG. 1 shows a vehicle 100 on which there is a sensor 102. Here the sensor is a monocular camera but might be any other suitable sensor; for example a steroscopic camera pair, a LiDAR sensor, or the like.

(14) The sensor 102 is arranged to monitor its locale and generate data based upon the monitoring thereby providing data on a sensed scene around the vehicle.

(15) In the embodiment shown in FIG. 1, the vehicle 100 is travelling along a road 108 and the sensor 100 is imaging the locale (eg the building 110, road 108, etc.) as the vehicle 100 travels. In this embodiment, the vehicle 102 also comprise processing circuitry 112 arranged to capture data from the sensor and subsequently to process the data (in this case images, but could be other representations) generated by the sensor 102. Thus, the processing circuitry captures data from the sensor 102. In the embodiment being described, the processing circuitry 112 also comprises, or has access to, a storage device 114 on the vehicle.

(16) The vehicle may employ a localisation pipeline as described in reference [3] above. Paper [3] is hereby incorporated by reference and the skilled person is directed to read this paper, particularly with reference to the localisation pipeline.

(17) The lower portion of the Figure shows components that may be found in a typical processing circuitry 112. A processing unit 118 may be provided which may be an Intel 86 processor such as an 15, 17 processor or the like. The processing unit 118 is arranged to communicate, via a system bus 120, with an I/O subsystem 122 (and thereby with external networks, displays, and the like) and a memory 124.

(18) The skilled person will appreciate that memory 124 may be provided by a variety of components including a volatile memory, a hard drive, a non-volatile memory, any of the machine readable media described elsewhere, etc. Indeed, the memory 124 may comprise a plurality of components under the control of the processing unit 118.

(19) However, typically the memory 124 provides a program storage portion 126 arranged to store program code which when executed performs an action and a data storage portion 128 which can be used to store data either temporarily and/or permanently.

(20) In other embodiments at least a portion of the processing circuitry 112 may be provided remotely from the vehicle. As such, it is conceivable that processing of the data generated by the sensor 102 is performed off the vehicle 100 or a partially on and partially off the vehicle 100. In embodiments in which the processing circuitry is provided both on and off the vehicle then a network connection (such as a 3G UMTS (Universal Mobile Telecommunication System), 4G (such as LTELong Term Evolution), WiFi (IEEE 802.11), WiMAX, or like).

(21) It is convenient to refer to a vehicle 100 travelling along a road but the skilled person will appreciate that embodiments of the invention need not be limited to land vehicles and could water borne vessels such as ships, boats or the like or indeed air borne vessels such as airplanes, or the like. Indeed, it may be that the method is performed by entities other than vehicles, such as robots, or mobile devices carried by users, or the like.

(22) Likewise, it is convenient in the following description to refer to image data generated by the sensor 100 but other embodiments of the invention may generate other types of the data. As such, the embodiment being described utilises images, ie pictures of the environment. However, it is conceivable that other types of representations of the environment may be suitable. For example, LiDAR scans, may be used instead of images. Therefore, reference to image below should be taken to cover other types of data.

(23) The embodiment being described trains a neural network (NN) to transform images. The NN provides an example of a trainable transform. The trained NN can then be used to generate images which can then be used, as described hereinafter, to aid localisation of the vehicle, or the like.

(24) The embodiment being described uses a feature detection and matching pipeline using the SURF feature [22] H. Bay, T. Tuytelaars, and L. Van Gool, Surf: Speeded up robust features, Computer vision-ECCV 2006, pp. 404-417, 2006, and employs a 2-stage training strategy. Other embodiments may not use both stages of the embodiment being described. It is possible that other embodiments will use just the first stage, or just the second stage. However, use of both stages together has been found to prove advantageous in the quality of the synthetic images generated by the embodiment being described.

(25) In the first stage a cycle-consistency architecture, similar to [7], is used to train a generator to transform an input source image into a synthetic image with a target condition. The generator may be thought of as a trainable transform, since it is trained during the training phase and is arranged to transform an image (or other representation) input thereto. The synthetic image, generated by the first generator, is subsequently transformed by a second generator (which again may be thought of as a trainable transform) back into a synthetic image that has the initial condition, with the process being repeated in the reverse direction.

(26) In the second stage, the image generators are fine tuned independently using a well-aligned subset of the dataset.

(27) In the first stage, shown in FIG. 2, two (ie first and second trainable transforms) generators, a first GAB, which transforms condition A into condition B, and a second G.sub.BA, which transforms condition B into condition A, are trained using a collection of unpaired source and target images. G.sub.BA is arranged/trained so that it learns to invert the effect of the G.sub.AB. G.sub.AB and G.sub.BA may each be thought of as being trainable transforms. A discriminator loss is applied on the synthetic images, and an L1 loss is applied between the synthetic images and the input images. Additionally, SURF detector response maps are computed (ie the output of a known process) on the synthetic and input images and apply an L1 loss between them, and similarly compute dense per-pixel SURF descriptor maps (ie the output of the known process) on the synthetic and input images and apply an L1 loss between them; these methods are further described in III-A and III-B below.

(28) Thus, it can be seen that the first stage takes a first training representation 200 (step 1100) and transforms it using the first trainable transform (here GAB). The output of GAB may be thought of as being a modified version 202 of the first training representation.

(29) Then, the modified version 202 of the first training representation is input to the second trainable transform (here GBA) and a synthetic version 204 of the first training representation is generated.

(30) Then, in the embodiment being described, both a descriptor map a detector response map is calculated (ie a known process is performed) for each of the first training 206 image and the synthetic version 208 and used to generate an error signal 210.

(31) Here a source image may be thought of as being a first training representation and a target image may be thought of as being a second training representation. In the first stage described here, the first training representation (source image) is unpaired with the second training representation (target image), but the second training representation corresponds to a similar representation to that of the first representation.

(32) In the second stage, G.sub.AB and G.sub.BA (ie trainable transforms) are separately trained using a small dataset of aligned day and night images (ie first and second representations, with a second well aligned representation being provided for each first representation). The use of pixel-aligned images allows the generators to learn certain feature transformations that might have been uncaptured by the unsupervised method used in the first stage, which only learns to align the image distributions without any explicit pixel-wise mapping. This time, the L1 loss is applied between SURF detector response maps (ie the detector response map is the output of a known process) computed on aligned target images and synthetic images, and between dense descriptor response maps (ie the descriptor response map) computed on aligned target and synthetic images. The architecture of the second stage is shown in FIG. 3.

(33) Thus, the second, fine tuning stage, may be thought of as taking a second training representation, corresponding to the first training representation. Here the first and second training representations are well aligned.

(34) The trainable transform (each of G.sub.AB and G.sub.BA in turn) are then trained by transforming the first training representation to generate a modified training representation. Subsequently a surf detector map and a descriptor map are generated on both the modified first training representation and the second training representation; ie a known process is performed on each of the modified first training representation and the second training representation. An error signal is then generated by comparing the descriptor and/or detector maps to train the trainable transforms.

(35) In the embodiment being described the first stage is followed by the second, fine tuning stage. However, it is possible that some embodiments may perform just the first stage or just the second stage.

(36) In the above description, the generation of descriptor maps and detector maps are used in examples of a known process to run on the images. Other embodiments may use other known process such as perceptual loss where a first training image and synthetic image are input to an image classification network and comparing the activations in one of the layer(s).

(37) The generator architecture is based on UResNet [23] R. Guerrero, C. Qin, O. Oktay, C. Bowles, L. Chen, R. Joules, R. Wolz, M. Valdes-Hernandez, D. Dickie, J. Wardlaw, et al., White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks, arXiv preprint arXiv:1706.00935, 2017, which combines a UNet [24] O. Ronneberger, P. Fischer, and T. Brox, U-net: Convolutional networks for biomedical image segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234-241 with residual (ResNet) [25] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016 modules. The internal architecture of the generator is shown in FIG. 5.

(38) In the embodiment being described, the discriminator architecture is a CNN with 5 layers. The first 4 layers are comprised of a convolution operation followed by instance normalisation and leaky ReLu (Rectified Linear Unit), and the last layer is a convolution operation which outputs a H/8W/8 map classifying the receptive field in the image space as real or fake, where H and W represent the height and width of the input image.

(39) More specifically, the architecture employs 3 down-convolutional layers 500, 502, 504 with stride 2, 9 ResNet 518 blocks and 3 up-convolutional layers 506, 508, 510 with fractional stride , with skip connections 512, 514, 516 between corresponding down and up-convolutional layers. Each convolutional layer consists of a convolution operation, followed by instance normalisation and leaky ReLU as depicted by the shading of the layer (506-516). Each ResNet block 518 consists of a convolution, followed by instance normalisation, leaky ReLU, a second convolution, instance normalisation and addition of the original block input to the resulting output.

(40) Examples of generator results for both G.sub.AB and G.sub.BA are shown in FIG. 10 for a range of different condition pairs.

(41) A. SURF Detector Response Map

(42) The SURF detector response map is obtained using a convolutional version of the original method of approximating the determinants of Hessians described in [22] above. For each scale we generate three box filters to approximate the second-order derivative of the Gaussian

(43) $\frac{^{2}}{x^{2}} g (), \frac{^{2}}{y^{2}} g (), \frac{^{2}}{xy} g ()$
on the X,Y and diagonal directions respectively. We convolve these filters with the image I, yielding the response maps L.sub.xx(G), L.sub.yy() and L.sub.xy().

(44) Using the Hadamard product, the matrix of approximations of the determinant of Hessians is:
det( custom character .sub.approx)=L.sub.xx.Math.L.sub.yy0.81*L.sub.xy.Math.L.sub.xy(1)
Dense SURF Descriptors

(45) The methodology used in OpenSURF ([26] C. Evans, Notes on the OpenSURF Library, University of Bristol Tech Rep CSTR09001 January, no. 1, p. 25, 2009) is adapted and a fast, convolutional method for building dense perpixel SURF descriptors is employed, through which gradients can be passed. For each scale out of the N chosen scales, we precompute: a look-up table for the 81 relative offsets of pixel neighbours that are used to build SURF descriptors; an N81-matrix for the scale-specific Gaussian weights of the 81 offsets; a 16-length column vector for the Gaussian weights of the 16 neighbourhoods; and HAAR-like box filters for both the X and Y directions;

(46) The input image is then convolved with the HAAR box filters and the wavelet responses are stored. For each chosen scale we stack 81 copies of the wavelet responses and multiply them with the scale-specific Gaussian weight.

(47) Then, for each of the 16 pixel neighbourhoods that make up a SURF descriptor we: offset the stacked copies along the X and Y directions according to the offset look-up table (see for example offsets at 400a, 400b, etc); multiply by the neighbourhood-specific Gaussian weight; sum along the stack direction both the raw and absolute values for the X and Y-directions respectively, yielding 4 matrices; element-wise multiply each matrix with its specific Gaussian neighbourhood weight LUT; stack the 4 resulting matrices

(48) Finally, each column of the resulting 64 layer stack of HW size matrices is normalized, where H and W are the height and width of the input images. This stack represents the dense per-pixel SURF descriptor, for each scale. The stacking and summing operation is depicted in FIG. 4.

(49) b) Descriptor Loss:

(50) Thus, the embodiment being described utilises a descriptor loss. Such a descriptor loss L.sub.Desc which may be thought of as guiding the training of the trainable transform (ie a generator) such that descriptors of regions or subregion components of transformed first representations obtained from input first representations depicting a particular scene under an initial condition match as closely as possible the descriptors of regions or subregion components of second representations depicting the particular scene under a target condition. During the training phase of the trainable transform, the first and second representation are typically provided by representations from a training set. During runtime (such as when utilised on the vehicle 100) the first representations are typically provided by representations from the sensor 102.

(51) Alternatively, or additionally, the distribution of descriptors of regions or subregion components of transformed representations obtained from input representations depicting a particular scene under an initial condition matches as closely as possible the distribution of descriptors of regions or subregion components of images depicting the particular scene under a target condition.

(52) Here, descriptors may represent intensities of regions or subregion components, a linear transformation of intensities of regions or subregion components, a non-linear transformation of intensities of regions or subregion components.

(53) a) Detector Loss:

(54) Further, the embodiment also being described utilises a detector loss. Such a detector loss may be thought of as guiding the training of the trainable transform such that the locations of regions or subregion components of interest of transformed images obtained from input images depicting a particular scene under an initial condition match as closely as possible the locations of regions or subregion components of interest of images depicting the particular scene under a target condition.

(55) Alternatively, or additionally, the detectors are such that the distribution of locations of regions or subregion components of interest of transformed images obtained from input images depicting a particular scene under an initial condition matches as closely as possible the distribution of locations of regions or subregion components of interest of images depicting the particular scene under a target condition.

(56) Here regions or subregion components of interest may be categorised according to their difference in intensity/amplitude across the region or variance or information content quantifiable using a common measure.

(57) Here, transformed images includes the outputs of the trainable transforms, such as the modified and/or synthetic images.

(58) Feature detectors and descriptors for day-night matching are evaluated in [27] (H. Zhou, T. Sattler, and D. W. Jacobs, Evaluating local features for day-night matching, in Computer VisionECCV 2016 WorkshopsAmsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III, 2016, pp. 724-736), showing that most features are detected at small scales (<10). Following experimentation, the embodiment being described computes SURF loss terms for the first 5 scales in order to speed-up the training process and it was found that this did not cause significant performance loss. An explanation for this could be that inside smaller pixel neighbourhoods, appearance changes between images with different conditions can be more uniform compared to larger neighbourhoods. However, the skilled person will appreciate that other embodiments may compute loss terms for more scales which may increase the accuracy further but generally at the penalty of increased processing times. Further, other embodiments may compute fewer than 5 scales.

(59) B. Losses

(60) Similar to [7], the embodiment being described applies an adversarial loss through a discriminator on the output of each generator: discriminator D.sub.B on the output of generator G.sub.AB, and discriminator D.sub.A on the output of generator G.sub.BA. This loss is formulated as:
custom character B.sub.adv=(D.sub.B(G.sub.AB(I.sub.A))1).sup.2(2)
A.sub.adv=(D.sub.A(G.sub.BA(I.sub.B))1).sup.2(3)

(61) The adversarial objective L.sub.adv becomes:
custom character .sub.adv=B.sub.adv+A.sub.adv(4)

(62) The discriminators are trained to minimize the following loss:
custom character B.sub.adv=(D.sub.B(I.sub.B)1).sup.2+(D.sub.B(G.sub.AB(I.sub.A))).sup.2(5)
A.sub.disc=(D.sub.A(I.sub.A)1).sup.2+(D.sub.A(G.sub.BA(I.sub.B))).sup.2(6)

(63) The discriminator objective L.sub.disc becomes:
custom character .sub.disc=B.sub.disc+A.sub.disc(7)

(64) The cycle consistency loss [7] is applied between the input and synthetic image, and between the SURF detector Det(.Math.) and dense descriptor Desc(.Math.) maps computed from these two images:
custom character .sub.rec=I.sub.inputI.sub.reconstructed.sub.1(8)
.sub.LoG=Det(I.sub.input)Det(I.sub.reconstructed).sub.1(9)
.sub.desc=Desc(I.sub.input)Desc(I.sub.reconstructed).sub.1(10)

(65) The complete generator objective L.sub.gen becomes:
custom character .sub.gen=.sub.rec*.sub.rec+.sub.LoG*+.sub.desc*.sub.desc+.sub.adv*.sub.adv(11)

(66) Each term is a hyperparameter that weights the influence of each loss component. For the fine-tuning stage, where the target image is aligned with the input and synthetic images, the losses become:
custom character F.sub.LoG=Det(I.sub.target)Det(I.sub.synthetic).sub.1(12)
F.sub.desc=Desc(I.sub.target)Desc(I.sub.synthetic).sub.1(13)

(67) The fine-tuning objective L.sub.finetune becomes:
custom character .sub.finetune=.sub.LoG*F.sub.LoG+.sub.desc*F.sub.desc(14)

(68) The embodiment being described computes the generator functions G.sub.AB, G.sub.BA such that:

(69) $\begin{matrix} G_{AB}, G_{BA} = \underset{G_{AB}, G_{BA}, D_{B}, D_{A}}{\arg \min}_{gen} +_{disc} +_{finetune} & (15) \end{matrix}$

(70) The embodiment being described is arranged to minimise the above losses as follows.

(71) Data was used from 6 traversals from the Oxford RobotCar Dataset [11] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, 1 Year, 1000 km: The Oxford RobotCar Dataset, The International Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3-15, 2017, collected up to 1 year apart, yielding 5 condition pairs: day-night, day-snow, day-dawn, day-sun and day-rain. For each of the traversals, the RTK-GPS ground truth was filtered and any data points with more than 25 cm of translation standard deviation were discarded.

(72) Training datasets for each condition pair were created from the entirety of the day traversal and a portion representing approximately 20% of the paired condition, to simulate a case where reasonable amounts of mapping data cannot be acquired. The remaining 80% of the paired condition was used to benchmark the performance of the synthetic images.

(73) The well-aligned datasets used in the second training stages were created by selecting image pairs between which none or only a small viewpoint rotation existed. Image pairs with no translation or rotation misalignment were used as-is, and for those with small rotational differences the target images were affine-warped into the frame of the source images using the known poses provided by the RTK-GPS ground truth.

(74) A. Training

(75) For the cycle-consistency stage (ie the first stage), a network training regimen similar to [7] is employed. For each iteration the discriminator, is trained on a real target domain image and a synthetic image from a previous iteration with the goal of minimizing L.sub.disc, and then the generators are trained on input images to minimize L.sub.gen. In particular, the embodiment being described used the Adam solver ([28] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, CoRR, vol. abs/1412.6980, 2014) with an initial learning rate set at 0.0002, a batch size of 1, .sub.rec=8, .sub.det=2, .sub.desc=2 and .sub.adv=1. The skilled person will appreciate that other solvers are possible.

(76) For the fine-tuning stage (ie the second stage), a small well-aligned subset of the dataset was used for training, and arranged to minimize L.sub.finetune, with the same learning parameters.

(77) B. Localisation

(78) Once the parameters have been learnt by the methods described above, it is possible to use the parameters for localisation of a vehicle, or the like.

(79) One embodiment, now described, used the trained generators G.sub.AB to transform the day map frames into target-condition frames, and G.sub.BA to transform the 5 types of target-condition frames into day-condition frames.

(80) To benchmark the synthetic images in the context of localisation, embodiments used the experience-based navigation system of [3] which implements a feature-based topological localiser ([29] M. Cummins and P. Newman, Appearance-only slam at large scale with fab-map 2.0, The International Journal of Robotics Research, vol. 30, no. 9, pp. 1100-1123, 2011) followed by a geometric verification stage using RANSAC ([30] M. A. Fischler and R. C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM, vol. 24, no. 6, pp. 381-395, jun 1981) and a nonlinear optimisation to minimise inlier reprojection errors.

(81) In contrast to adding the synthetic frames as a separate map, feature correspondences were accumulated from real to real image matching and synthetic to real image matching, and it was found that this lead to more robust and accurate solutions.

(82) In the embodiment being described, the generator runs at approximately 1 Hz for images with a resolution of 1280960, and at approximately 3 Hz for images with a resolution of 640480 on an Nvidia Titan X GPU. The skilled person will appreciate that these frequencies will likely change as different processors (ie GPU's) are used.

RESULTS

(83) A. Quantitative Results

(84) Below, results are shown taking into consideration both the frequency and the quality of localisation.

(85) TABLE-US-00001 TABLE I METRIC LOCALISATION FOR DIFFERENT NETWORK ARCHITECTURES SURF + REFERENCE RGB - only SURF finetune RMSE RMSE RMSE RMSE RMSE RMSE RMSE RMSE Map vs. Traversal % (M) (o) % (M) (o) % (M) (o) % (M) (o) Real Day vs. Real 19.5 12.74 24.06 Night Real Day vs. 25 3.59 11.38 51 2.43 4.06 60 1.99 3.32 Synthetic Day

(86) Table I compares the root mean squared translation (RMSE (m)) and rotation errors (RMSE (O)), with respect to the RTK-GPS ground truth, and the cumulative successfully localized portion as a percentage of the distance travelled in the case of day-night localisation. Results are shown for raw images, images obtained with an RGB-only implementation of [7], and images obtained using stages 1 and 2 of the embodiment described above. The results show an increase in the accuracy of localisation using synthetic images generated from the stage-1 model, and a further increase in accuracy from the stage-2 fine-tuned model.

(87) TABLE-US-00002 TABLE II METRIC LOCALISATION PERFORMANCE BETWEEN CONDITIONS REFERENCE Synthetic RMSE RMSE RMSE RMSE Map vs. Traversal % (M) (o) % (M) (o) Day vs. Night 19.5 12.74 24.06 51.0 2.43 4.06 Day vs. Snow 94.6 9.02 8.07 98.6 3.38 8.19 Day vs. Dawn 31.9 11.59 7.04 66.8 2.55 6.30 Day vs. Sun 33.0 32.5 82.04 71.0 9.40 24.52 Day vs. Rain 27.2 11.28 9.85 58.6 2.91 7.84

(88) Table II presents localisation results for a wide range of conditions transformed into day using stage-1 trained models, illustrating the performance of the method when localising relative to a single condition. In all cases the localisation rate is improved (often by a factor of 2) and the metric errors are reduced.

(89) FIG. 7 shows two histograms giving the distribution of translation and rotation errors with respect to ground truth, each histogram showing the raw image matching and for our best solution, in the case of day-night localisation. Overall a large improvement in localisation accuracy was observed, compared to raw images and images produced by an RGB-only implementation of [7].

(90) To generate the histograms, translation outliers larger than 5 meters in absolute have been accumulated in the 5 and +5 meter bins. Rotation outliers larger than 30 degrees in absolute have been accumulated in the 30 and +30 degree bins.

(91) FIG. 8 shows the number of match inliers as a function of the distance travelled, for the raw images and for our best solution, in the case of day-night localisation. A significant increase in the number of inliers for real-to synthetic matching, compared to real-to-real image matching was observed.

(92) FIG. 9 shows the probability of travelling in (Visual Odometry) VO-based open-loop when a localisation failure occurs, as a function of distance travelled. Significant improvements, were observed, when using synthetic images generated using the embodiment described above. It is noted that, perhaps surprisingly, images generated using an RGB-only implementation of [7] did not bring a large improvement in robustness.

(93) From FIG. 9, it will be noted that there is observed a large increase in the robustness of day-night localisation when using synthetic images. To generate FIG. 9, the map is from daytime, the input images are at night.

(94) B. Qualitative Results

(95) FIG. 10 presents qualitative results in a series of locations throughout Oxford where matching between raw images failed or produced a very small number of inliers. Matching within the image pairs is represented by a horizontal line between points that have been matched within the image pairs. The figure presents matches between real images (top) and between real and synthetic images (bottom). Note how the learned image transformation (by construction) does a qualitatively good job reconstructing details that are described by the feature detector and descriptor, such as window frames.

CONCLUSIONS

(96) A system is presented that yields robust localisation under adverse conditions. The system may be thought of as taking input images which are transformed in such a way as to enhance point wise matching to a stored image (for example on an image library to which a vehicle or other entity has access). In the embodiment being described, trainable transformation is learnt using a cyclic GAN while explicitly accounting for the attributes feature detection and description stages. The embodiment described utilises feature detectors and descriptor responses.

(97) Using modest target training data, which emulates scenarios where mapping is expensive, time consuming or difficult, the embodiment described generated resulting synthetic images which consistently improved place recognition and metric localisation compared to baselines. Thus, such embodiments may not only reduce, perhaps drastically, the cost and inconvenience of mapping under diverse conditions, but also improving the efficacy of the maps produced when used in conjunction with our method.

(98) Further the described embodiment is typically arranged to processes the image streams outside of the localisation pipeline, either offline or online, and hence may be used as a front end to many existing systems.

Localising a vehicle

Assignee

Inventors

Cpc classification

Classification Explorer

A01B1/00

HUMAN NECESSITIES

International classification

Classification Explorer

G05D1/00

PHYSICS

Abstract

Claims

Description