DEVICE AND SYSTEM FOR AUTONOMOUS VEHICLE CONTROL
20230368544 · 2023-11-16
CPC classification
B60W2555/00
PERFORMING OPERATIONS; TRANSPORTING
B60W60/001
PERFORMING OPERATIONS; TRANSPORTING
G06V10/7715
PHYSICS
G06V20/58
PHYSICS
G06V20/588
PHYSICS
International classification
G06V20/58
PHYSICS
G06V20/56
PHYSICS
G06V10/77
PHYSICS
Abstract
A computer device and system for controlling an autonomous vehicle are provided. The computer device comprises a memory and a processor, the computer device configured to be fitted to a vehicle and to communicate with a camera or sensor, the processor being configured to: pre-process an original image from the camera or sensor data from the sensor to produce an input image; present the input image to a neural network stored in the memory; wherein the neural network is trained to classify a feature in an image presented to it, the neural network having an input layer, a hidden layer and an output layer, the output layer including three outputs: a first feedback output for selecting pixels from the input image to input at the input layer at each iteration of the neural network; a second feedback output for selecting a colour channel of the selected pixels to input at the input layer at each iteration; and a third output for outputting an output value indicative of a classification result from the neural network; the processor further configured to obtain the output value from the neural network; and post-process the output value from the neural network to identify a feature of the environment of a vehicle.
Claims
1. A computer device comprising a memory and a processor, the computer device configured to be fitted to a vehicle and to communicate with a camera or sensor, the processor being configured to: pre-process an image received from the camera or sensor data from the sensor to produce an input image; present the input image to a neural network stored in the memory of the computer device; wherein the neural network is trained to classify a feature in an image presented to it, the neural network having an input layer, a hidden layer and an output layer, the output layer including three outputs: a first feedback output for selecting pixels from the input image to input at the input layer at each iteration of the neural network; a second feedback output for selecting a colour channel of the selected pixels to input at the input layer at each iteration; and a third output for outputting an output value indicative of a classification result from the neural network; the processor further configured to obtain the output value from the neural network; and post-process the output value from the neural network to identify a feature of the environment of a vehicle.
2. The computer device of claim 1 wherein the computer device is a system on a chip (SoC).
3. The computer device of any preceding claim wherein the computer device is further configured to communicate with a control computer of the vehicle.
4. The computer device of any preceding claim wherein the neural network stored in the memory is configured to perform one or more specific tasks including image classification, object detection and road segmentation.
5. The computer device of any preceding claim wherein the memory includes multiple neural networks, the processor being configured to present the input image to the multiple neural networks and to further process the output value of each of the multiple neural networks to identify the feature of the environment of the vehicle.
6. The computer device of any preceding claim, wherein the computer device is configured to perform pre-processing, post-processing and presenting to a neural network locally at the computer device, such that connection to an external network outside of the vehicle is not necessary to identify the feature of the environment.
7. A vehicle control system for fitting in or on a vehicle, the system comprising: a sensor or camera; a control computer; and the computer device of any of claims 1 to 6; wherein the computer device is configured to receive sensor data or an original image from the sensor or camera, and send information related to the feature of the environment of a vehicle to the control computer; the control computer being configured to control one or more components of a vehicle based on the information received from the computer device.
8. The vehicle control system of claim 7, wherein the control computer is configured to autonomously control the vehicle.
9. The vehicle control system of claim 7 or 8, further comprising a plurality of computer devices according to any of claims 1 to 6, each of the plurality of computer devices being configured to perform a different specific task.
10. A vehicle comprising the control system of any of claims 7 to 9.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0082] Examples of the invention will now be described in more detail, by way of example, and with reference to the drawings.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0099] The invention described here is a computer device and corresponding system for controlling an autonomous vehicle. The computer device includes a trained neural network. The neural network is suitable for identifying a feature in the environment of a vehicle, and is used in a perception software stack by the computer device. The perception software stack is effectively built around the neural network to use the function of the neural network to perform one or more of several specific tasks, such as image segmentation and object detection. Because the neural network is low-resolution, it can process data and produce results much more quickly and with less computational power than conventional neural networks. The feedback capabilities of the neural network allow it to select its own inputs from the input image. This adaptive approach effectively mimics active vision in nature, selecting and analysing small parts of an image to obtain information about the image on-the-fly rather than studying an entire image in a pixel-by-pixel brute force approach. These traits of the neural network allow it to perform efficiently whilst still providing accurate results. The invention will now be described in more detail with reference to the accompanying figures.
[0101] The training software module 102 is configured to provide to the computer device 104 a set of trained weights for a low resolution recurrent active vision neural network (LRRAVNN) that forms part of the perception software stack 108. The training software module 102 generates the trained weights, for performing a specific task with the neural network, using an artificial evolution algorithm. The specific task that the weights are trained for includes one or more of image classification, image segmentation, and object detection. The specific tasks are thus computer vision tasks, the results of which are used to inform control of an autonomous vehicle. Once the process of training the weights for one or more of the specific tasks is completed by the artificial evolution algorithm, the trained weights are uploaded to the computer device 104 via a data transfer for use in the neural network in the perception software stack 108.
[0102] The computer device 104 includes the perception software stack 108 or communicates with a computer readable medium having the perception software stack 108 stored thereon. The computer device includes a processor 110 and a memory 112. The perception software stack 108 is preferably implemented as computer-readable instructions stored on the memory 112 and executable by the processor 110. The computer device 104 is configured to be fitted into a vehicle, and includes an input for connecting to a sensor 106 and an output for outputting the results of the specific task to inform control of the vehicle. The computer device may be an integrated circuit.
[0103] The final weights trained by the training software module 102 are stored on the memory 112 when they are provided to the computer device 104. The memory 112 is also configured to store a configuration file, whereby the configuration file includes the operating parameters of the perception software stack 108. The operating parameters of the perception software stack 108 are different for each specific task. As such, the configuration file stored in the memory 112 is different depending on which specific task is intended to be performed by the computer device 104.
[0104] The sensor 106 is configured to provide to the computer device 104 sensor data describing the environment of the sensor 106. Throughout the following description, the sensor is referred to as a camera that produces a visual image. However, it is to be understood that the sensor 106 can be a camera, infrared sensor, LIDAR sensor or the like, attached to the vehicle to which the computer device 104 is fitted. The sensor 106 is configured to send the sensor data to the computer device 104 regularly as is required by an autonomous driving system.
[0105] The sensor data provided to the computer device 104 from the sensor 106 is manipulated by the perception software stack 108. The perception software stack 108 includes a plurality of layers, including a network layer comprising the neural network. When sensor data is provided to the computer device 104, the processor 110 runs the perception software stack 108 on the sensor data based on the operating parameters in the configuration file and the trained weights stored in the memory 112. The sensor data is passed through each layer of the perception software stack 108 in order, to obtain a result for the specific task being performed by the computer device 104. Once a result for the specific task is obtained, the computer device 104, using the processor 110, is configured to communicate with systems in the vehicle to aid control of the vehicle based on the result of the specific task.
[0106] Each of the training module 102, the computer device 104 and the perception software stack 108 will now be described in more detail, with reference to their physical implementations and their associated methods of use.
[0107] Firstly, the perception software stack 108 and its layers are discussed here.
[0108] The first layer 202 is responsible for pre-processing the sensor data. This pre-processing is common to all specific tasks and includes a predetermined colour scheme transformation.
[0109] The second layer 204 is configured to perform further pre-processing of the sensor data, whereby the further pre-processing is dependent on the specific task being performed by the perception software stack 108.
[0110] The third layer 206 is a network layer and includes the neural network that uses the weights trained by the training software module 102 for the specific task. The third layer 206 is split into three sub-layers, one for each of the specific tasks.
[0111] The fourth layer 208 is configured to post-process the results outputted by the neural network for each specific task. The fourth layer 208 is similarly split into three sub-layers to illustrate that there are three different types of post-processing that can occur, one for each of the specific tasks of image classification, image segmentation and object detection.
[0115] At step 302, a sensor/camera input is received. The input is an image such as a frame from a video.
[0116] At step 304, the image received from the sensor/camera is pre-processed at the first layer 202. The pre-processing of the image includes performing a colour conversion from the native colour scheme of the image, for instance RGB, to a HSa* colour scheme. The HSa* colour scheme includes a hue channel (H) 224, a saturation channel (S) 226, and a green/magenta channel (a*) 228. The conversion of the colour scheme aids the performance of the LRRAVNN with respect to the image when performing the specific task. Alternatively, other colour schemes may also be used. One option is to use an edge filter on the received image and use the gradient filtered output as a colour channel itself. Different types of edge filter may be used to form three different colour channels in this way.
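The exact HSa* transform is not reproduced in this text, so the following is only a minimal sketch, assuming OpenCV's HSV and CIELAB conversions are acceptable stand-ins for the H, S and a* channels described above (the function name to_hsa is illustrative):

```python
import cv2
import numpy as np

def to_hsa(bgr: np.ndarray) -> np.ndarray:
    """Split a BGR camera frame into H, S and a* channels.

    H and S are taken from an HSV conversion; a* (the green/magenta
    opponent axis) is taken from a CIELAB conversion.
    """
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    h, s = hsv[:, :, 0], hsv[:, :, 1]
    a_star = lab[:, :, 1]
    return np.dstack([h, s, a_star])
```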
[0117] At step 306, the image undergoes further pre-processing at the second layer 204. This involves dimensionality reduction and resolution adjustment to generate an input image for presenting to the neural network. Exactly how the image is reshaped and scaled is dependent on the specific task to be performed by the neural network. In general, the original image undergoes dimensionality reduction to produce a one-dimensional array for each colour channel. The further pre-processing in this step also includes either reducing the resolution and thus the size of the image, or splitting the image into multiple smaller images called 'patches', such that the size of the input image or images produced in step 306 conform with the input size requirements of the neural network.
[0118] At step 308, the input image generated in step 306 is presented to the neural network in the third layer 206. The neural network processes the input image, or images if the image was split into patches in step 306, using the weights trained for the specific task that is being performed, for a maximum number of iterations T. The neural network selects pixels from the input image using two image selection output neurons and a colour channel selection output neuron.
[0119] For each input image, an output score is produced by two further output neurons of the neural network and outputted for post-processing.
[0120] At step 310, the outputted result from the neural network in step 308 undergoes post-processing at the fourth layer 208. The post-processing step is different depending on the specific task being performed by the neural network. Examples of post-processing are provided below with reference to the specific tasks.
[0121] The method 300 described above refers to a single neural network. However, a plurality of k neural networks, where k is a positive integer, can be used to perform a specific task. When k neural networks are used to perform the same specific task, the output scores produced by the k neural networks are combined and post-processed together in step 310 as will be discussed in more detail below with reference to the specific tasks. Having k neural networks produces more reliable results.
[0122] Before the specific tasks are described, the architecture of the neural network included in the third layer 206 of the perception software stack 108 will be described here. For the input layer 402, the value y_i of each neuron can be described by equation Eq. 1 below, where g is an input gain and I_i is the i-th input value presented to the network:

y_i = g × I_i   (Eq. 1)
[0123] For the hidden layer 404, the value of each neuron is described by equation Eq. 2, where i is in the range of 1 to the total number of hidden neurons numhidden, δ is a decay constant, and y_i^cp is the cell potential of the i-th hidden layer neuron. For the output layer 406, the value y_i of each neuron is described by equation Eq. 3, where i is in the range of 1 to the total number of output neurons numoutput.
The neural network 400 has 32 input neurons in the input layer 402, 15 hidden neurons in the hidden layer 404, and 5 output neurons in the output layer 406. However, the neural network 400 may have more or fewer neurons at each layer. The neural network 400 has a maximum of 150 input neurons, to ensure that computational load is maintained at a low level and that the low-resolution aspect of the neural network 400 is maintained.
[0124] The neural network 400 is configured to iteratively process an input image 412. The input image 412 is the colour channel set of one-dimensional arrays that are produced by feeding a camera-captured image or other sensor data through the first layer 202 and second layer 204 of the perception software stack 108. The neural network 400 processes the input image 412 for a number of iterations up to a maximum iteration value T. At each iteration, pixel values from the input image 412 are processed by the neural network 400. The 5 output neurons include two image selection output neurons 414 and 416 for selecting co-ordinates of pixels in the image 412 to process with the neural network 400 at each iteration, a colour channel selection output neuron 418 for selecting one of three colour channels 424, 426 and 428 of the image 412 to process with the neural network 400 at each iteration, and two output prediction neurons, 420 and 422. The image selection output neurons 414 and 416 and the colour channel selection output neuron 418 are thus feedback outputs that are configured to modify the input to the input neurons at the input layer 402. This represents the active vision mechanism of the neural network 400. The output prediction neurons 420 and 422 provide an output score relating to the specific task that the neural network 400 is configured to run. Each of the output neurons 414, 416, 418, 420 and 422 outputs a value between 0 and 1.
[0125] The specific tasks will now be described with reference to the perception software stack 108, the method 300 and the neural network 400.
[0126] The specific task of image classification is described here.
[0129] At each iteration, the neural network 400 selects the pixels to process by calculating a start position Start_pos according to equation Eq. 4:

Start_pos = (OUT1 × IN_MULT) × (OUT2 × numinput) − c   (Eq. 4)

where OUT1 is the value outputted by a first of the two image selection output neurons 414, OUT2 is the value outputted by a second of the two image selection output neurons 416 and c is the number of input neurons (and thus the number of pixels processed by the neural network 400 at each iteration). OUT1 and OUT2 are between 0 and 1. This equation is limited at the low end by applying the conditional equation:

if Start_pos < 0 then Start_pos = 0   (Eq. 5)
[0130] The neural network 400 then selects c pixels starting from the pixel index nearest to the numerical value Start_pos. In the above example, the neural network 400 selects 32 pixels in this manner.
[0131] The colour channel selection output neuron 418 outputs a value OUT3 responsible for selecting the colour channel of the input image 412. The value OUT3 is between 0 and 1. The specific colour channel is selected according to the following logic:

if OUT3 < 0.33: select H channel;
if 0.33 < OUT3 < 0.66: select S channel;
if OUT3 > 0.66: select a* channel   (Eq. 6)
[0132] Once the pixels from the one-dimensional arrays of pixels and the colour channel have been selected, the selected pixels are processed by the neural network 400. The output neurons 420 and 422 each output an output score, OUT4 and OUT5 respectively, between 0 and 1. For each iteration the neural network 400 runs, an iteration output prediction value it_pred_val is stored, where:

it_pred_val = OUT4 − OUT5   (Eq. 7)
[0133] Once the number of iterations is equal to T, the maximum number of iterations, a final prediction value final_pred_value is calculated by averaging the stored iteration output prediction values it_pred_val across all the iterations. This gives a final prediction value final_pred_value between −1 and +1.
[0134] Preferably, in the above calculation of final_pred_value, the first ten iterations run by the neural network 400 are discounted to allow the network to settle, such that the it_pred_val values are averaged over the iterations after the first ten iterations. This means that the total number of it_pred_val values used in the calculation of final_pred_value is equal to T−10.
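A sketch of this iteration loop follows, under several assumptions not fixed by the text above: the network is modelled as a callable returning the five output values, IN_MULT is passed in as a parameter, numinput is taken to be the length of the one-dimensional pixel array, and the feedback outputs start at a neutral 0.5 before the first iteration:

```python
import numpy as np

SETTLE = 10  # iterations discarded while the network settles ([0134])

def run_network(network, channels, T, c, in_mult):
    """Active-vision loop over one input image.

    network(pixels) -> (out1, out2, out3, out4, out5), each in [0, 1].
    channels maps 'H', 'S' and 'a*' to 1-D pixel arrays; c is the
    number of input neurons; in_mult stands in for IN_MULT in Eq. 4.
    """
    preds = []
    out1 = out2 = out3 = 0.5                 # neutral feedback at iteration 0
    for t in range(T):
        # Eq. 6: colour channel selection from OUT3
        key = 'H' if out3 < 0.33 else ('S' if out3 < 0.66 else 'a*')
        arr = channels[key]
        # Eq. 4/5: start index of the c-pixel window, clamped to the array
        start = (out1 * in_mult) * (out2 * len(arr)) - c
        start = max(0, min(int(round(start)), len(arr) - c))
        out1, out2, out3, out4, out5 = network(arr[start:start + c])
        if t >= SETTLE:
            preds.append(out4 - out5)        # Eq. 7: it_pred_val
    return float(np.mean(preds))             # final_pred_value in [-1, +1]
```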
[0135] The final prediction value final_pred_value then undergoes post-processing in the fourth layer 208 of the perception software stack 108. At this stage, two variables are calculated. These include a discrete predicted outcome, PRED, and a numerical confidence measure DIST that defines the distance from one of two confidence level thresholds, UP_LIMIT and LOW_LIMIT. The two confidence level thresholds UP_LIMIT and LOW_LIMIT may be set to any value between −1 and 1. For example, the UP_LIMIT and LOW_LIMIT may be +0.2 and −0.2 respectively. For classification tasks, PRED denotes the class that the processed image is predicted to belong to. DIST is a measure used to determine the overall class where more than one neural network 400 is used to process the input image or patches. PRED and DIST are calculated according to the following logic in this instance:
if final_pred_value > UP_LIMIT: PRED = class 1; DIST = |final_pred_value − UP_LIMIT|;
if final_pred_value < LOW_LIMIT: PRED = class 2; DIST = |final_pred_value − LOW_LIMIT|;
else: PRED = Neutral; DIST = |final_pred_value − UP_LIMIT|   (Eq. 8)
[0136] In other words, if the final_pred_value from the neural network 400 is greater than the upper threshold, it is determined in step 310 that the processed image belongs to class 1. If final_pred_value is lower than the lower threshold, it is determined that the processed image belongs to class 2. If final_pred_value is somewhere between the upper and lower thresholds, then the class is labelled as neutral, meaning the image neither definitively belongs to class 1 nor to class 2.
[0137] The variable DIST is used when there are k neural networks performing the classification task. When k networks are performing classification, the PRED values for each network are accumulated. For example, if there are 20 networks, there may be 14 instances of PRED = class 1 and 6 instances of PRED = class 2. This equates to 70% of the 20 networks producing a PRED = class 1 result and 30% of the 20 networks producing a PRED = class 2 result. These percentages are calculated and compared to a class threshold value, class_thresh. If the percentage associated with a particular class is higher than class_thresh, it is determined that the processed image belongs to that particular class. For example, if class_thresh is 60%, then a determination is made that the image belongs to class 1, because 70% of the 20 networks produced a PRED = class 1 result, and 70% is greater than the threshold of 60%. However, if no percentage associated with a class exceeds class_thresh, the class of the image is not immediately apparent and the DIST variable is used instead. In this case, the class of the image is determined based on a value FIN_DIST, wherein FIN_DIST is calculated for each class using:

FIN_DIST(class) = Σ(PRED = class) + z × Σ(DIST where PRED = class)   (Eq. 9)

where Σ(PRED = class) is the number of networks whose prediction is the class, Σ(DIST where PRED = class) is the sum of the DIST values of those networks, and z is a scaling factor that is a positive real number. For example, assume that there are six networks such that k = 6, where the PRED and DIST values for each network are as provided in Table 1 below:
TABLE 1

Network     PRED     DIST
Network 1   Class 1  0.8
Network 2   Class 1  0.2
Network 3   Class 1  0.1
Network 4   Class 2  0.8
Network 5   Class 2  0.1
Network 6   Class 2  0.4
[0138] FIN_DIST for class 1 is calculated using the sum of instances of PRED = class 1, added to the scaling factor multiplied by the sum of DIST values when PRED = class 1. As such, the value of FIN_DIST(class 1) is equal to (1+1+1) + z(0.8+0.2+0.1), which is equal to 3 + 1.1z. FIN_DIST for class 2 is calculated using the sum of instances of PRED = class 2, added to the scaling factor multiplied by the sum of DIST values when PRED = class 2. As such, the value of FIN_DIST(class 2) is equal to (1+1+1) + z(0.8+0.1+0.4), which is equal to 3 + 1.3z. The class with the largest FIN_DIST value is determined to be the class to which the image belongs. In the above example, this is class 2. This classification result is output by the specific task of classification to inform the control of an autonomous vehicle. A confidence value is also outputted, whereby the confidence value is proportional to the DIST term in equation Eq. 9. Classification can be used in autonomous driving to identify the current environment, such as an urban road, a residential road, or a high street for example. Classification can also aid in identifying landmarks in the visual field, such as a bank or supermarket building that is present in the input image. Furthermore, classification can aid in identifying visible junctions and intersections. Thus the neural network can help classify whether the current input image requires a right turn, left turn or straight-ahead motion from the vehicle.
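A sketch of this ensemble post-processing, using the Eq. 8 and Eq. 9 logic above and reproducing the Table 1 worked example (the function names and the default threshold and scaling values are illustrative):

```python
def classify(final_pred_value, up=0.2, low=-0.2):
    """Eq. 8: map final_pred_value to (PRED, DIST)."""
    if final_pred_value > up:
        return 'class1', abs(final_pred_value - up)
    if final_pred_value < low:
        return 'class2', abs(final_pred_value - low)
    return 'neutral', abs(final_pred_value - up)

def ensemble_class(results, class_thresh=0.6, z=1.0):
    """results: list of (PRED, DIST) pairs from k networks."""
    k = len(results)
    for cls in ('class1', 'class2'):
        if sum(p == cls for p, _ in results) / k > class_thresh:
            return cls                        # clear majority of networks
    # Eq. 9: FIN_DIST = vote count + z * summed DIST, per class
    fin = {cls: sum(p == cls for p, _ in results)
                + z * sum(d for p, d in results if p == cls)
           for cls in ('class1', 'class2')}
    return max(fin, key=fin.get)

# Table 1 example: class 2 wins with FIN_DIST = 3 + 1.3z vs 3 + 1.1z
nets = [('class1', 0.8), ('class1', 0.2), ('class1', 0.1),
        ('class2', 0.8), ('class2', 0.1), ('class2', 0.4)]
assert ensemble_class(nets) == 'class2'
```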
[0139] The specific task of image segmentation, and in particular road segmentation, is described here.
[0141] At the second layer 204 of the perception software stack 108, the further pre-processing differs for image segmentation when compared to classification, in that the colour channel images 702, 704 and 706 are each divided into a plurality of patches 708a to 708n. The patches 708a to 708n have a configurable size, such as 64×40 pixels for example, and stride, depending on the original image size and the input size requirements for the neural network. The patches 708a to 708n are extracted from the original image using the following logic, considering a patch of width P_w, a height of P_h, horizontal stride St_h and vertical stride St_v, where P_num_h and P_num_v are the total number of patches in the horizontal and vertical directions respectively. The first patch is extracted from the top left corner of the image plane, from the 0th row and 0th column of the rows and columns of pixels in each of the colour channel images 702, 704 and 706. The second patch is extracted from the 0th row and the 0th column + St_h. The third patch is extracted from the 0th row and the 0th column + 2St_h. This process repeats until the rightmost image boundary is reached or until P_num_h is exceeded. In other words, patches are taken along the first row of pixels of the colour channel images 702, 704 and 706 from left to right, incrementing by the horizontal stride St_h until the rightmost boundary of the colour channel images 702, 704 and 706 is reached. Once patches have been extracted from the 0th row, extraction is shifted to the 0th row + St_v, wherein the process repeats, extracting patches from left to right until the rightmost boundary of the colour channel images 702, 704 and 706 is reached or P_num_h is exceeded. This process continues to the 0th row + 2St_v and onwards until P_num_v is exceeded, or the bottom-right corner boundary of the colour channel images 702, 704 and 706 is reached. As an example, P_w may be 64, P_h 40, St_h 30, St_v 13, P_num_h 20 and P_num_v 20, giving 400 patches for a colour channel image of size 640×300. Different values for these variables can result in spaces between consecutive patches or overlapping consecutive patches. Each patch is further reduced to a one-dimensional array of pixels.
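A sketch of this patch-extraction scheme, using the example values from the paragraph above (the function name is illustrative):

```python
import numpy as np

def extract_patches(channel: np.ndarray, p_w=64, p_h=40,
                    st_h=30, st_v=13, p_num_h=20, p_num_v=20):
    """Slide a P_w x P_h window over one colour-channel image,
    stepping St_h horizontally and St_v vertically from the top-left
    corner, and flatten each patch to a 1-D pixel array."""
    img_h, img_w = channel.shape
    patches, positions = [], []
    for r in range(p_num_v):
        row = r * st_v
        if row + p_h > img_h:
            break                      # bottom boundary reached
        for c in range(p_num_h):
            col = c * st_h
            if col + p_w > img_w:
                break                  # rightmost boundary reached
            patches.append(channel[row:row + p_h, col:col + p_w].ravel())
            positions.append((row, col))
    return patches, positions

# A 640x300 channel with the example strides yields the 400 patches above
patches, positions = extract_patches(np.zeros((300, 640), dtype=np.uint8))
```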
[0144] Post-processing occurs in the fourth layer 208 of the perception software stack 108. The post-processing in road segmentation can be performed in a similar way to image classification, in which the discrete predicted outcome, PRED, and the numerical confidence measure DIST are calculated according to equation Eq. 8 for each image patch. In the road segmentation task, class 1 and class 2 refer to road/non-road classes.
[0145] The variable DIST is used when there are k neural networks performing the road segmentation task. When k networks are performing classification, the PRED values for each network are accumulated as in the classification task, and FIN_DIST is calculated for each class using equation Eq.9. The class with the largest FIN_DIST value is determined to be the class to which the first patch belongs. This process of classifying an individual patch is then repeated for all patches.
[0146] More preferably, once the final prediction value final_pred_value is calculated for each patch, it is normalized between 0 and 1, and preferably multiplied by 255, to form a heat map pixel value. For example, when the final_pred_value is −0.5, it is normalized between 0 and 1 to become 0.25, and may then be multiplied by 255. As such, each patch is assigned a heat map pixel value between 0 and 255 that is proportional to its final_pred_value. The patches 708a to 708n are then reassembled on the image plane of the original image according to their respective positions in the original image, whereby all of the pixels in each respective patch are assigned the same value equal to the heat map pixel value of that respective patch. If the patches are generated in the second layer 204 such that they overlap each other in the image plane of the original image, the patches are divided further into sub-patches. The sub-patches are sized such that they do not overlap neighbouring sub-patches. For example, for an original image of size 640×300, each 64×40 patch is divided into six smaller sub-patches of size 32×13. The sub-patches are then stored in a 21×22 array to provide a heat map image 708a that resembles the image plane of the original image.
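A sketch of this heat-map construction for the simple non-overlapping case, assuming one final_pred_value per patch position as returned by the extract_patches sketch above (the overlap and separation handling via sub-patches is omitted):

```python
import numpy as np

def heat_map(pred_vals, positions, image_shape, p_w=64, p_h=40):
    """Paint each patch area of the original image plane with a single
    heat value derived from that patch's final_pred_value in [-1, +1],
    normalised to [0, 1] and scaled to [0, 255] as described above."""
    heat = np.zeros(image_shape, dtype=np.uint8)
    for val, (row, col) in zip(pred_vals, positions):
        pixel = int(round((val + 1.0) / 2.0 * 255))  # -0.5 -> 0.25 -> ~64
        heat[row:row + p_h, col:col + p_w] = pixel
    return heat
```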
[0147] Similarly, if during step 306 neighbouring patches are generated such that they are physically separated from each other in the image plane of the original image, the patches are divided into sub-patches. Sub-patches are also generated between the neighbouring patches, and are then designated a heat map pixel value that is dependent on the heat map pixel values of the neighbouring patches.
[0148] Once the patches have been reassembled on the image plane of the original image, or where there is overlap or separation of the patches on the image plane and sub-patches have consequently been generated, a heat map image 708a is produced. The further processing of the heat map image 708a is explained now with reference to an example.
[0150] The post-processing in the fourth layer 208 of the perception software stack 108 continues by applying segmentation or fitting algorithms to the heat map image 904. Applying a segmentation algorithm results in extracting a grid-based shape from the heat map image 904. In an example, Otsu's thresholding method is firstly applied to make the heat map image 904 a binary image. A shape is then extracted from the binary image using a structural analysis algorithm such as the algorithm disclosed here:
https://www.semanticscholar.org/paper/Topological-structural-analysis-of-digitized-binay-Suzuki-Abe/cf021db5e811fd5b67ee3aa4db0a6a0351d276d2
[0151] This example algorithm works on connected component analysis principles, by trying to find an outer border within a binary digitized image. All connected border shapes are first extracted. In a second pass, all 'holes' within the image planes are assigned scores based on their proximity to borders and filled pixels. The final pass attempts to fill in 'holes' depending on their scores and adds them to existing shapes. The outermost final border is considered as the connected shape structure output.
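OpenCV's findContours implements the Suzuki-Abe border-following analysis cited above, so a minimal sketch of this segmentation step might look like the following; taking the largest contour as the road shape is an assumption, not something the text specifies:

```python
import cv2
import numpy as np

def segment_road(heat_map: np.ndarray) -> np.ndarray:
    """Binarise the heat map with Otsu's method and return a filled
    mask of the largest connected outer border."""
    _, binary = cv2.threshold(heat_map, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(heat_map)
    if contours:
        road = max(contours, key=cv2.contourArea)  # outermost largest shape
        cv2.drawContours(mask, [road], -1, 255, thickness=cv2.FILLED)
    return mask
```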
[0152] The result of this example segmentation for one neural network 400 is a connected shape extracted from the heat map image 904.
[0153] It is to be understood that k neural networks can be used concurrently to produce a plurality of heat map images 708a to 708k.
[0154] Alternatively, a fitting algorithm is applied to the heat map image 904 to produce a shape such as a triangle, whereby the area of the triangle indicates the existence of road. The triangle can be overlaid on the original image 902 to form a hybrid image 922.
[0155] It is to be understood that the fitting algorithm may contain thresholds for acceptable error, such that a boundary pixel is not identified until at least 1-10 consecutive pixels do not have a ‘road’ pixel value.
[0156] The specific task of object detection is described here.
[0157] As with image segmentation, the process of object detection includes generating a heat map image 708a of patches or sub-patches that are each assigned a heat map pixel value according to a normalised final_pred_value calculated for each patch. In object detection, the neural network 400 is configured to classify, for example, patches that belong to an object such as a car. Therefore, the normalised final_pred_value calculated for each patch is an indication of whether or not the patch belongs to a car in the original image.
[0158] The heat map pixel values H of the heat map image 1004 are thresholded to produce new values H_new according to the following logic:

if H ≤ l_1: H_new = 0;
if l_1 < H ≤ l_2: H_new = 0.15;
else if H > l_2: H_new = 1   (Eq. 10)
[0159] The variables l_1 and l_2 are user-configurable, and may be values such as 0.25 and 0.5 respectively. It is to be understood that equation Eq. 10 is exemplified by the case where the heat map pixel values are normalised between 0 and 1, however they may be in the range of 0 to 255 as described above with respect to image segmentation. The thresholding performed by equation Eq. 10 reduces the heat map image 1004 to a reduced heat map image 1006, wherein the heat patches have heat map pixel values of 0, 0.15 or 1. Patches/sub-patches with a heat map pixel value of 0 are referred to as low patches, patches/sub-patches with a heat map pixel value of 0.15 are referred to as medium patches, and patches/sub-patches with a heat map pixel value of 1 are referred to as high patches. The reduced heat map image 1006 is formed using the same image plane as the original image 1002. The reduced heat map image 1006 then undergoes further processing to produce bounding boxes 1008.
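A sketch of this three-level thresholding, for the case where the heat values are normalised to [0, 1] and using the example limits l_1 = 0.25 and l_2 = 0.5:

```python
import numpy as np

def reduce_heat_map(heat: np.ndarray, l1=0.25, l2=0.5) -> np.ndarray:
    """Eq. 10: quantise heat values to low (0), medium (0.15)
    and high (1) patches."""
    reduced = np.full_like(heat, 0.15, dtype=float)  # medium by default
    reduced[heat <= l1] = 0.0                        # low patches
    reduced[heat > l2] = 1.0                         # high patches
    return reduced
```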
[0160] Firstly, all connected shapes of low and medium patches in the reduced heat map image 1006 are identified. A connected shape comprises two or more patches/sub-patches, such that individual low or medium patches are not identified as a connected shape. Of the identified connected shapes, any connected shape with no low patches, or in other words, any connected shape consisting solely of medium patches, is disregarded. Next, the boundaries of each separate connected shape are determined as co-ordinates in the upwards, downwards, left and right directions in the reduced heat map image 1006, by determining the last connected low or medium patch in each of these directions. These co-ordinates in the reduced heat map image 1006 are then used to draw horizontal lines, from the upper and lower co-ordinates, and vertical lines, from the left and right co-ordinates, to form the bounding boxes 1008. Preferably, for each bounding box, the number of low, medium and high patches contained within the bounding box is calculated to provide a confidence value for the respective bounding box. The confidence value Confidence for each bounding box is calculated according to equation Eq. 11, where p_low, p_mid and p_high are the number of low, medium and high patches respectively. Low patches are given a weighting of 2 in equation Eq. 11. Due to this, Confidence may theoretically exceed 1. To prevent this from happening, Confidence is limited between 0 and 1.
[0161] Once the bounding boxes 1008 have been formed and Confidence calculated, the specific task of object detection outputs the original image 1002 overlaid with the bounding boxes according to their position on the reduced heat map image 1006, for use in controlling the autonomous vehicle. This is shown as output image 1010.
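A sketch of the bounding-box formation, using scipy's connected-component labelling as a stand-in for the connected-shape identification described above; the confidence weighting of equation Eq. 11 is omitted here since its exact form is not reproduced in this text:

```python
import numpy as np
from scipy import ndimage

def bounding_boxes(reduced: np.ndarray):
    """Boxes around connected shapes of low/medium patches.

    reduced holds the Eq. 10 values 0 (low), 0.15 (medium), 1 (high)
    on the patch grid. Returns (top, bottom, left, right) per shape.
    """
    labels, n = ndimage.label(reduced < 1.0)      # low/medium components
    boxes = []
    for lab, region in enumerate(ndimage.find_objects(labels), start=1):
        mask = labels[region] == lab
        if mask.sum() < 2:                        # single patches ignored
            continue
        if not (reduced[region][mask] == 0.0).any():
            continue                              # needs at least one low patch
        rows, cols = region
        boxes.append((rows.start, rows.stop - 1, cols.start, cols.stop - 1))
    return boxes
```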
[0162] It is to be understood that k neural networks 400 may run the specific task of object detection concurrently, such that a plurality of heat map images 708a to 708k and 1004 and reduced heat map images 1006 are produced in the fourth layer 208 of the perception software stack 108. In this case, bounding boxes 1008 are formed for each of the plurality of reduced heat map images 1006 and corresponding confidence values calculated according to equation Eq. 11. To form the output image 1010, the bounding boxes 1008 of each reduced heat map image 1006 are combined. When bounding boxes 1008 intersect, their confidence values are averaged. Preferably, the output image 1010 is subject to further thresholding to only display bounding boxes 1008 above a certain confidence value.
[0163] Once the specific tasks of image classification, segmentation, and/or object detection are completed, the output from each specific task is used to inform the control of an autonomous vehicle. The specific tasks help to identify features of the environment of the vehicle, such as the road, pedestrians, road signs, objects, buildings, other road users, junctions and intersections and the like. Controlling an autonomous vehicle ultimately depends upon defining a 'freespace'. Freespace is the area detected as the road by the specific task of road segmentation, minus the areas within the detected road which are occupied by an object such as a car, a pedestrian or the like. The freespace is thus a shape formed by combining the outputs of road segmentation and object detection. Once the freespace is known, the vehicle can be controlled to navigate the freespace using standard kinematics algorithms. In particular, co-ordinate transformations are performed between the image plane showing the freespace and the three-dimensional real-world environment such that the vehicle can be controlled using standard control systems.
[0165] The array 1102 is split into rows as shown in block 1104, so that the centroid C1 of the freespace shape can be calculated. Initially, the centroids AC1 to AC4 of each row are identified.
[0166] It is to be understood that other methods of calculating the centroid of the freespace may also be used, including graphical methods, such as using angular bisectors on the triangle 924 in the hybrid image 922 to form the image 1108. Once the co-ordinates of the centroid C1 of the freespace shape are calculated, various aspects of control of an autonomous vehicle can be informed using the freespace shape corresponding to an original image and other freespace shapes relating to previously processed images. For example, aspects of the autonomous vehicle relating to movement, such as speed and direction, may be informed by the location of the centroid C1 derived from consecutive image frames. Where C_x and C_y are the co-ordinates of the centroid C1, C_x−1 and C_y−1 are the co-ordinates of the centroid derived from the immediately previously captured original image, x_mid is the x-co-ordinate of the middle of the image plane of the original image, y_threshold is a predetermined row in the image plane which serves as a cut-off point for non-linear speed control, and P1, D1, P2, D2 are scalar hyperparameters:
direction = P1 × (x_mid − C_x) + D1 × (C_x − C_x−1)   (Eq. 13)

speed = P2 × (y_threshold − C_y) + D2 × (C_y − C_y−1)   (Eq. 14)
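A sketch of these two control laws with the symbols of Eq. 13 and Eq. 14 as parameters; the default hyperparameter values are illustrative only:

```python
def steer(c_x, c_y, prev_c_x, prev_c_y, x_mid, y_threshold,
          p1=1.0, d1=0.5, p2=1.0, d2=0.5):
    """PD-style direction and speed commands from the freespace centroid.

    The proportional terms pull the vehicle toward the current centroid;
    the derivative terms damp the response using the previous centroid.
    """
    direction = p1 * (x_mid - c_x) + d1 * (c_x - prev_c_x)      # Eq. 13
    speed = p2 * (y_threshold - c_y) + d2 * (c_y - prev_c_y)    # Eq. 14
    return direction, speed
```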
[0167] It is to be understood that other methods of using the calculated freespace to provide driving commands to a vehicle or computer system within the vehicle may be applied. When there are k networks which each provide their own outcome of a specific task, and thus form their own freespace shape, the method of controlling an autonomous vehicle includes using combination techniques and may further include using particle swarm optimization techniques to find the optimal outcome from the k networks. For example, combination may include averaging the individual freespace centroids from each of the k networks. The centroids may be weighted differently from each other when calculating the average. Alternatively, an algorithm based on the Reynolds model of coordinated collective behaviour may be used, where alignment, cohesion and separation of the outputs of the specific task for k different networks are calculated to find the optimal outcome for the k networks. The alignment, cohesion and separation values in this swarm optimization algorithm are vectors from the position of the autonomous vehicle to the centroid of the freespace shape for each of the k networks.
[0168] Whilst the specific tasks of road segmentation and object detection have been described above in detail, it is to be understood that the general method 300 can be employed in any similar computer vision task in an autonomous vehicle, such as collision detection, road-sign detection and object tracking. In each of these tasks, a feature of the environment of the vehicle is identified, detected, determined or segmented from the rest of the environment. Each of these actions relies on the action of the neural network, which fundamentally classifies an input image. The different layers of the perception software stack are modified to the requirements of each task and the training of the neural network is different based on the task. As such, the neural network is trained to classify different features dependent on the task for which it is to be run.
[0169] Furthermore, the application of the method 300 and the perception software stack 108 is not limited to autonomous vehicles, but can also be used in any vehicle or machine where computer vision is used. For example, the method 300 and perception software stack 108 may be used in the fields of robotics, and in neighbouring fields such as industrial manufacture, medicine, hazardous area exploration and the like. 'Any vehicle' refers to a vehicle where vision is required or is otherwise useful to aid the control of the vehicle. As such, vehicles include road vehicles such as cars, trucks and motorbikes; marine vehicles such as boats and submarines; aerial vehicles such as drones, aeroplanes and helicopters; and other specialist vehicles such as space vehicles.
[0170] It is thus to be understood that the environment in which the method 300 and the perception software stack 108 is to be used can vary. The environment may be on land, at sea, in the air or in space. Each of these environments has unique features that define the freespace area in which the vehicle is safe to navigate. On land, the features may include roads, pedestrians, hazards, objects, signage and buildings, for example. At sea and in the air, the features may include weather formations, standard shipping and air lanes, and hazards, for example.
[0171] It is further to be understood that each of these different environments may require specialist or different sensors 106 in order to acquire sensor data that describes the environment. As such, the sensor 106 may be a radar sensor, a LIDAR sensor, a camera, a charge-coupled device, an ultrasonic sensor, an infrared sensor or the like. The sensor data received from such sensors is manipulated as explained above with reference to the 'original image'. If the sensor provides data in three dimensions, such as the LIDAR sensor, the pre-processing steps further include dimensionality reduction to reduce the three-dimensional sensor data to the one-dimensional arrays before presenting said one-dimensional arrays to the neural network or networks.
[0172] It is to be understood that the method 300 and the perception software stack 108 may be implemented on any computer device or integrated circuit. Furthermore, the method 300 and the software stack 108 may be written to memory as computer-readable instructions, which, when executed by a processor, cause the processor to perform the method 300 and implement the function of the software stack 108.
[0173] The method 300 and perception software stack 108 are adapted for each specific task through a training process, performed by the training software module 102. The training process will now be described here in more detail.
[0174] The purpose of the training process is to train the neural network to perform a specific task. The CTRNN architecture of the neural network does not change between the specific tasks. Instead, the weights w_ji in the weighted connections of the neural network are given values determined by the training process. These trained weights alter the calculations and thus the decision-making of the neural network so that it is adapted to perform the specific task. The general training process involves using a genetic algorithm to artificially evolve random initial weights such that, after a number of generations, they are effective at adapting the neural network to perform the specific task accurately.
[0176] At step 1202, an initial population of chromosomes for the neural network is generated from a pseudo-random number generator function. The initial population is represented by a floating point array of N_pop chromosomes. Each chromosome has a number of variables equal to the number of weights for the neural network, N_weights. The weights may include a tau or decay constant and layer bias, such that they are not strictly synaptic weights from node to node. Each chromosome is an encoded/non-encoded representation of a set of weight values corresponding to the weights for the neural network. Due to the use of random number generation, each chromosome has a random initial value for each of the weights in N_weights.
[0177] At step 1204, each chromosome is inputted into the architecture of the neural network, such that the weight values contained in a particular chromosome are applied to the real weighted connections in the neural network. Training data such as a series of example images are then presented to the input layer of the neural network and the outputs are recorded. This occurs for each chromosome in the initial population, preferably in parallel and concurrently. The performance of the initial population of chromosomes is then evaluated by applying a fitness function and recording a fitness score for each chromosome. The fitness function relates to the example images and the particular specific task that is being trained for. The fitness score provides a numerical indication of each chromosome's effectiveness at performing the specific task. As noted above, the specific tasks include image classification, object detection and road segmentation. In terms of the process performed by the neural network 400, in the specific task of image classification, the whole input image is classified, and in object detection and road segmentation, patches of the input image are classified separately. The neural network 400 therefore performs a very similar classification method for each of the specific tasks. The differences between the specific tasks are more prevalent in the post-processing steps 310 performed by the fourth layer 208 of the perception software stack 108, as discussed above. In an example of a classification fitness function, the fitness score is accumulated over the set of example images as follows.
[0178] For when the true class is Class 1:

if final_pred_value > thresh_upper: fitness = fitness + 1   (Eq. 15)

[0179] For when the true class is Class 2:

if final_pred_value < thresh_lower: fitness = fitness + 1   (Eq. 16)
Where thresh_upper and thresh_lower are an upper and a lower threshold respectively, such as 0.01 and −0.01. Different values of these variables affect the outcome of the training process. For further classes, such as a third class, further thresholds may be introduced. According to equations Eq. 15 and Eq. 16, the higher the fitness score, the better the neural network is at correctly classifying the set of example images. The example images may be different for training each specific task. For example, for training road segmentation, example images of roads may be provided in the training process 1200, but for object detection, example images of objects such as pedestrians, bicycles and vehicles may be provided. Furthermore, if the specific task being trained for is image classification, the example images may be scaled-down images, whereas if the specific task being trained for is road segmentation or object detection, the example images may be a series of pre-defined patches.
[0180] At step 1206, the genetic algorithm is run and the next generation is created. Following the initial population, a second population of chromosomes is generated using the initial population of chromosomes and their associated fitness scores evaluated in step 1204. This involves running a genetic algorithm on the chromosomes based on their fitness scores. At least one of four operations is performed on the initial population of chromosomes to generate the second population of chromosomes. These operations include elitism, truncation, mutation and recombination. When elitism is performed, a selection of the chromosomes with the best fitness scores are replicated onto the second population without alteration. The chromosomes are thus ranked after the evaluation in step 1204 according to their fitness scores, and when elitism is applied, the chromosomes with the best fitness scores are selected. When truncation is performed, a selection of the chromosomes with the worst fitness scores are removed such that they do not form part of the second population of chromosomes. When recombination (or crossover) is performed, a new chromosome is generated for the second population by combining two or more chromosomes from the initial population. The two or more chromosomes from the initial population used to generate the new chromosome for the second generation are selected using a roulette wheel selection technique, which means that chromosomes with better fitness scores have a higher probability of being selected for recombination. The two chromosomes selected for recombination are recombined according to an operation between the two chromosomes. This may be a single-point, two-point, or k-point crossover, where k is a positive integer less than N_weights. Other crossover operations may be used for the process of recombination. When mutation is performed, one or more of the floating point numbers in a chromosome, representing a weight, is modified by the addition, subtraction, multiplication or division of a random number. Preferably, the total number of chromosomes in the second population is equal to the number of chromosomes in the initial population, such that the number of chromosomes discarded via truncation equals the number of chromosomes introduced to the population via recombination. A sketch of one such generation step is provided below.
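The following sketch combines elitism, truncation, roulette-wheel recombination and mutation as described above; the elite/truncation counts, the two-point crossover and the Gaussian additive mutation are illustrative choices, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng()

def next_generation(pop, fitness, n_elite=2, n_trunc=2, mut_rate=0.05):
    """pop: (N_pop, N_weights) float array; fitness: (N_pop,) scores."""
    order = np.argsort(fitness)[::-1]                 # best first
    survivors = pop[order][:len(pop) - n_trunc]       # truncation
    new_pop = [survivors[i].copy() for i in range(n_elite)]  # elitism
    # roulette-wheel probabilities from (shifted) fitness scores
    f = fitness[order][:len(survivors)]
    probs = f - f.min() + 1e-9
    probs /= probs.sum()
    while len(new_pop) < len(pop):
        a, b = rng.choice(len(survivors), size=2, p=probs)
        cut1, cut2 = sorted(rng.integers(1, pop.shape[1], size=2))
        child = np.concatenate([survivors[a][:cut1],
                                survivors[b][cut1:cut2],
                                survivors[a][cut2:]])  # two-point crossover
        mask = rng.random(child.shape) < mut_rate      # mutation
        child[mask] += rng.normal(0.0, 0.1, mask.sum())
        new_pop.append(child)
    return np.stack(new_pop)
```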
[0181] At step 1208, steps 1204 and 1206 are repeated with respect to the second population of chromosomes and a new third population of chromosomes. The fitness scores are evaluated for the second population, and these are then used to generate the third population. The above process repeats, forming a new generation of chromosomes at the end of each evaluation step. This starts from the initial population and ends with the nth population, where n is a positive integer representing the training epoch, which signifies the maximum number of generations of populations.
[0182] At step 1210, the final weights are output for use in the neural network 400. It is to be understood that, whilst the above description of the training software module 102 and the method 1200 discuss one neural network, it is preferable that multiple k networks are trained using the training software module 102 and the method 1200. In this case, the initial population includes a set of k floating point arrays that are randomly generated, whereby each floating point array is configured to train one of the k neural networks.
[0183] To train the network or networks efficiently, the training software module 102 is implemented in a specific arrangement of hardware. In general, the hardware includes a primary module and a secondary module. The primary module is configured to perform the method 1200 up to and including the generation of the initial population 1202. The primary module thus defines the parameters of the training method 1200, including the number of chromosomes to be generated, the training epoch number n and the operations to be performed in the formulation of the next generation of chromosomes 1206. Once the initial population is formed in the primary module, it is sent to the secondary module. The secondary module is configured to evaluate the performance 1204 of each chromosome in the initial population. Preferably, the secondary module is configured to evaluate each chromosome in the initial population concurrently. Once evaluation of all chromosomes in the initial population is complete, a fitness score for each chromosome is returned to the primary module. At the primary module the next population of chromosomes is generated 1206 as a result of the genetic algorithm being run. The next population is then fed back into the secondary module and the process repeats until the nth generation 1208. When this generation is reached, final weights are deduced by selecting the best performing chromosomes and decoding them to determine weight values. These are then saved to a memory for transfer to the perception software stack 108.
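A sketch of this primary/secondary split, using Python's multiprocessing pool as a stand-in for the GPU or CPU-cluster back end described below; the evaluate body is a placeholder (a real implementation would run the network and accumulate fitness per Eq. 15/16), and next_generation is reused from the sketch above:

```python
import numpy as np
from multiprocessing import Pool

def evaluate(chromosome: np.ndarray) -> float:
    """Secondary-module work: apply one chromosome's weights to the
    network, present the example images, and return a fitness score.
    A placeholder score stands in for the Eq. 15/16 accumulation."""
    return float(-np.abs(chromosome).sum())   # stand-in fitness

def train(population: np.ndarray, generations: int, workers: int):
    """Primary module: farm evaluation out to workers, then breed
    the next generation from the returned fitness scores."""
    with Pool(workers) as pool:               # the 'secondary module'
        for _ in range(generations):
            fitness = np.array(pool.map(evaluate, population))
            population = next_generation(population, fitness)
    return population
```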
[0184] Alternatively, when selecting the chromosomes to be saved to the memory for transfer to the perception software stack 108, re-evaluation and validation may firstly occur to ensure that the trained weight values are accurate. Re-evaluation involves, after the training process has been completed, selecting all chromosomes across all generations that have a fitness score above a specified cut-off threshold. These selected chromosomes are then re-evaluated for a different set of example images or image patches. This second example set of images is known as a validation set and ensures the accuracy of the selected chromosomes. Based on the re-evaluation, the best performing chromosomes and thus the best performing network(s) can be selected and stored.
[0185] Implementations of the general configuration will now be discussed here.
[0186] The CPU 1302 is firstly configured to prepare data 1302a for the training process 1200 by setting the parameters of the training algorithm such as the size of each population N_pop, the number of generations n, and the operations to be used in forming each new generation as discussed above. These parameters may be read from a training configuration file. Each of object detection, image segmentation and classification has a different training configuration file.
[0187] Next, the CPU 1302 is configured to generate the initial population of chromosomes 1302b. As discussed above, initially each chromosome is a set of randomly generated weights for the N_weights. The initial population of chromosomes is then sent from the CPU 1302 to the GPU 1304 to be evaluated. Evaluation of each of the chromosomes is done concurrently, in parallel within the GPU 1304. The GPU 1304 evaluates each chromosome in a separate parallel computing block 1304a to 1304n. The number of blocks 1304a to 1304n is preferably equal to the number of chromosomes in the initial population N_pop, such that each block 1304a to 1304n is configured to evaluate one chromosome, corresponding to one set of weights for the neural network. Each block 1304a to 1304n is implemented using CUDA® from NVIDIA® for example. Each block comprises a plurality of threads, whereby the number of threads is equal to the number of input neurons num_input in the neural network.
[0189] The primary CPU 1402 is firstly configured to prepare data 1402a for the training process 1200 by setting the parameters of the training algorithm such as the size of each population N_pop, the number of generations n, and the operations to be used in forming each new generation as discussed above. These parameters may be read from a training configuration file. Each of object detection, image segmentation and classification has a different training configuration file. Next, the primary CPU 1402 is configured to generate the initial population of chromosomes 1402b. Initially, each chromosome is a set of randomly generated weights for the N_weights. Following the generation of the initial population, the primary CPU 1402 is configured to broadcast 1402c the initial population of chromosomes to the cluster of secondary CPUs 1404a to 1404n. The primary CPU 1402 is thus communicatively coupled to the cluster of secondary CPUs 1404a to 1404n. Each of the secondary CPUs may be on the same server as each other and as the primary CPU 1402, or may be located across multiple servers. Preferably, the number of secondary CPUs 1404a to 1404n is equal to the number of chromosomes in the population, N_pop, so that each secondary CPU 1404a to 1404n can concurrently evaluate a chromosome from the initial population. The number of secondary CPUs 1404a to 1404n can however be less than N_pop. In this case, some or each of the secondary CPUs 1404a to 1404n may be required to evaluate more than one chromosome from the population. Evaluation of each of the chromosomes is thus done concurrently or partially concurrently, in parallel by each of the secondary CPUs 1404a to 1404n. The evaluation by each secondary CPU 1404a to 1404n returns a fitness score for each chromosome. Each fitness score or scores from each secondary CPU 1404a to 1404n are then sent back to the primary CPU 1402 where they are received 1402d. An array of fitness scores may thus be formed from the fitness scores received at the primary CPU 1402. The chromosomes and their corresponding fitness scores are then run through the genetic algorithm 1402e, meaning step 1206 of the method 1200 is performed as discussed above.
[0190] It is to be understood that the determination of the final weights to be used in the perception software stack 108 may be done according to factors other than the fitness scores and ranking of chromosomes. For example, a particular chromosome may classify specific objects, such as bicycles, very effectively but other objects, such as cars, less effectively. The weights from this chromosome may still be selected as the final weights if, for instance, multiple k networks are being used, whereby a network that effectively identifies bicycles is useful. In other words, the final weights may be determined based on the intended function of the neural network. Furthermore, more than one set of weights from more than one chromosome may be selected, so that more than one network can be selected using the same training process 1200.
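To illustrate, the short sketch below selects final weights by a per-class criterion rather than by overall fitness alone; the scores and class names are invented for illustration.

```python
# Illustrative only: each entry pairs a chromosome's overall fitness with
# invented per-class scores. A bicycle-specialist chromosome may be kept
# even though its overall fitness is lower.
results = [
    {"overall": 0.81, "bicycle": 0.62, "car": 0.90},  # chromosome 0
    {"overall": 0.77, "bicycle": 0.93, "car": 0.70},  # chromosome 1
]

best_overall = max(range(len(results)), key=lambda i: results[i]["overall"])
best_bicycle = max(range(len(results)), key=lambda i: results[i]["bicycle"])

# Where multiple networks are used, both sets of weights may be retained.
selected_chromosomes = {best_overall, best_bicycle}
```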
[0191] The examples illustrated in the accompanying figures are not intended to be limiting.
[0192] It is to be understood that the training process and the training system may be implemented in any computer system, including a distributed computing system such as a cloud or server based computer system. The primary module discussed above is configured to perform all the steps of the training process apart from evaluation of the sets of weights or chromosomes. The evaluation is performed by the secondary module which has parallel computing capabilities. In a distributed system, the secondary module may communicate with the primary module via a server and/or over the internet.
[0193] The hardware aspects of the computer device 104 according to the invention will now be discussed with reference to the accompanying figures.
[0194] An example of the computer device 104 is described in detail below in the form of an apparatus 1500, the components of which include a memory chip 1514 and an input/output 1516.
[0195] It is to be understood that the apparatus 1500 is configured to perform the method 300 for one or more of the specific tasks of image classification, object detection and image segmentation. Some of the components of the apparatus 1500 may be removed or substituted with similar components, as will be understood by the skilled person. In use, the sensor 106 provides input sensor data to the apparatus 1500 via the input/output 1516. The sensor data is then manipulated according to the aforementioned methods. The apparatus 1500 may communicate with the sensor using any suitable communication means, such as Ethernet, Universal Serial Bus (USB), serial, Bluetooth, wireless networking (Wi-Fi) and the like. The apparatus 1500 may be a SoC and may take the form of a computer, smartphone, tablet or the like. A SoC has the advantage that task-specific computer programs can be written specifically for the SoC, reducing loading time and improving execution speed. The apparatus 1500 may, however, be a traditional computer installed on a single motherboard.
[0196] In an example, the computer device 104 is modular, meaning each computer device 104 is responsible for performing one of several specific tasks. There may then be a module for each of image classification, object detection and image segmentation, formed of individual computer devices 104. Each of these devices may communicate with the others via wired or wireless connection methods and may also connect to the same or different sensors 106.
[0197] Each of the one or more computer devices 104 may connect to a network that is external to the vehicle. This allows the software stored thereon to be updated. For example, the weights stored in the memory 112 may be updated via communication with the external network. However, each of the computer devices 104 is configured to function, or be capable of functioning, without communication with an external network. The low resolution of the neural network allows the specific tasks to be performed on the computer device 104 without external computing aid.
[0198] Once the apparatus 1500 or computer device 104 has run the method 300 to obtain an outcome for a specific task, it is configured to send information relating to the outcome to a controller computer. The controller computer can be any computer which requires the outcome, such as a vehicle's engine control unit or another SoC. The apparatus 1500 may also store historic outcomes from specific tasks on its own memory chip 1514.
[0199] In an example arrangement, a vehicle 1600 is fitted with a camera/sensor 1602 which provides data to one or more apparatuses 1604, the or each apparatus 1604 being connected to a controller computer 1606.
[0200] It is to be understood that any conventional computing components may be used to implement the computer device 104 and the components shown in the accompanying figures.
[0201] The SoC also has the advantage of not needing to connect to an external network. Each of the specific tasks of image classification, object detection and image segmentation can be performed by the SoC locally in the vehicle. Furthermore, due to the low-resolution aspects of the neural networks, multiple networks can be stored in the memory of the computer device 104 or SoC. This means that swarm optimisation or other collective behaviour algorithms or techniques can be applied to input data from a camera or sensor locally at the computer device 104 or SoC, without having to communicate with an external network. This saves valuable time, which can improve the responsiveness and thus the safety of a vehicle, such as an autonomous vehicle, which includes the computer device 104 or SoC.
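As one possible illustration of pooling several locally stored networks, the sketch below uses simple majority voting; this stands in for the swarm optimisation or collective behaviour techniques mentioned above, and the labels are invented.

```python
# Illustrative only: combine the classifications of several low-resolution
# networks stored on the same device by majority vote, with no external
# network connection required.
from collections import Counter

network_outputs = ["car", "car", "bicycle"]  # one label per stored network
label, votes = Counter(network_outputs).most_common(1)[0]
print(label, votes)  # -> car 2
```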
[0202] Where multiple SoCs are used, each for a different specific task, the local nature of the calculations and functioning of the SoCs allows them to easily communicate and pool their outputs together. For example, where a first SoC is configured to perform the function of object detection, and a second SoC is configured to perform the function of road segmentation on an input image, the SoCs may communicate with each other to determine the available free space on the road (the segmented road minus any objects detected on the road). Alternatively, this function may be performed by the controller computer 1606 when it is connected to multiple apparatuses 1604.
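The free-space computation described above can be sketched as a simple mask subtraction. The three-by-three masks below are invented for illustration; a real system would combine full-resolution segmentation and detection outputs.

```python
# Illustrative only: available free space is the segmented road with any
# detected objects removed.
import numpy as np

road_mask = np.array([[1, 1, 1],
                      [1, 1, 1],
                      [0, 1, 0]], dtype=bool)    # from the road-segmentation SoC
object_mask = np.array([[0, 1, 0],
                        [0, 0, 0],
                        [0, 0, 0]], dtype=bool)  # from the object-detection SoC

freespace = road_mask & ~object_mask             # road minus objects
```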
[0203] It is not necessary that the camera/sensor 106 and 1602 be fitted to the vehicle 1600 or computer device 104. Instead, the camera/sensor 106 and 1602 may be physically separate from the computer device 104 and vehicle 1600, such that the camera/sensor 106 and 1602 is fitted to a structure at an external location rather than to the vehicle. The external location may be, for example, at a junction on a road, on a traffic sign or on a lamppost. In this example, the camera/sensor 106 and 1602 communicates with the computer device 104 or vehicle 1600, and thus the apparatus 1604, via a network. The camera/sensor 106 and 1602 and the computer device 104 or apparatus 1604 comprise, or are locally connected to, network connection hardware configured to connect to a network. The network connection hardware may include any one or more of a Wi-Fi module, a cellular module, a mobile-network transmitter and receiver, an antenna, a Bluetooth module and the like. In this example, the output of the neural network in performing a specific task may be shared from the computer device 104 or apparatus 1604 to other computers or vehicles directly, or sent to a central computer on a server or network for distribution to other vehicles.
[0204] Although the description above relates to the specific example of a vehicle, and in particular an autonomous vehicle, it is noted that the vehicle 1600 including the computer device 104 can alternatively be any machine where visual sensory data is gathered, manipulated or used to perform an action. As such, the machine including the computer device 104 may be a robot, a CCTV system, or a smart device for a smart home, such as a smart speaker, a smartphone or a smart appliance.
[0205] Similarly, although the description above relates to performing specific tasks related to vehicles, such as object detection, road segmentation and image classification, it is to be understood that the principles of using active vision in a LRRAVNN according to the method 300 can be applied to any computer vision task. As such, other tasks may be performed by the computer device 104 using the method 300. For example, a CCTV system using the method 300 may perform facial recognition, whilst a robot using the method 300 in a manufacturing environment may perform object classification and quality checking. Further tasks related to autonomous driving may also be performed, such as traffic sign recognition, road-marking recognition and pot-hole detection.