DEVICE AND SYSTEM FOR AUTONOMOUS VEHICLE CONTROL
20230368544 · 2023-11-16
CPC classification
B60W2555/00
PERFORMING OPERATIONS; TRANSPORTING
B60W60/001
PERFORMING OPERATIONS; TRANSPORTING
G06V10/7715
PHYSICS
G06V20/58
PHYSICS
G06V20/588
PHYSICS
International classification
G06V20/58
PHYSICS
G06V20/56
PHYSICS
G06V10/77
PHYSICS
Abstract
A computer device and system for controlling an autonomous vehicle are provided. The computer device comprises a memory and a processor, the computer device configured to be fitted to a vehicle and to communicate with a camera or sensor, the processor being configured to: pre-process an original image from the camera or sensor data from the sensor to produce an input image; present the input image to a neural network stored in the memory; wherein the neural network is trained to classify a feature in an image presented to it, the neural network having an input layer, a hidden layer and an output layer, the output layer including three outputs: a first feedback output for selecting pixels from the input image to input at the input layer at each iteration of the neural network; a second feedback output for selecting a colour channel of the selected pixels to input at the input layer at each iteration; and a third output for outputting an output value indicative of a classification result from the neural network; the processor further configured to obtain the output value from the neural network; and post-process the output value from the neural network to identify a feature of the environment of a vehicle.
Claims
1. A computer device comprising a memory and a processor, the computer device configured to be fitted to a vehicle and to communicate with a camera or sensor, the processor being configured to: pre-process an image received from the camera or sensor data from the sensor to produce an input image; present the input image to a neural network stored in the memory of the computer device; wherein the neural network is trained to classify a feature in an image presented to it, the neural network having an input layer, a hidden layer and an output layer, the output layer including three outputs: a first feedback output for selecting pixels from the input image to input at the input layer at each iteration of the neural network; a second feedback output for selecting a colour channel of the selected pixels to input at the input layer at each iteration; and a third output for outputting an output value indicative of a classification result from the neural network; the processor further configured to obtain the output value from the neural network; and post-process the output value from the neural network to identify a feature of the environment of a vehicle.
2. The computer device of claim 1 wherein the computer device is a system on a chip (SoC).
3. The computer device of any preceding claim wherein the computer device is further configured to communicate with a control computer of the vehicle.
4. The computer device of any preceding claim wherein the neural network stored in the memory is configured to perform one or more specific tasks including image classification, object detection and road segmentation.
5. The computer device of any preceding claim wherein the memory includes multiple neural networks, the processor being configured to present the input image to the multiple neural networks and to further process the output value of each of the multiple neural networks to identify the feature of the environment of the vehicle.
6. The computer device of any preceding claim, wherein the computer device is configured to perform pre-processing, post-processing and presenting to a neural network locally at the computer device, such that connection to an external network outside of the vehicle is not necessary to identify the feature of the environment.
7. A vehicle control system for fitting in or on a vehicle, the system comprising: a sensor or camera; a control computer; and the computer device of any of claims 1 to 6; wherein the computer device is configured to receive sensor data or an original image from the sensor or camera, and send information related to the feature of the environment of a vehicle to the control computer; the control computer being configured to control one or more components of a vehicle based on the information received from the computer device.
8. The vehicle control system of claim 7, wherein the control computer is configured to autonomously control the vehicle.
9. The vehicle control system of claim 7 or 8, further comprising a plurality of computer devices according to any of claims 1 to 6, each of the plurality of computer devices being configured to perform a different specific task.
10. A vehicle comprising the control system of any of claims 7 to 9.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0082] Examples of the invention will now be described in more detail, by way of example, and with reference to the drawings.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0099] The invention described here is a computer device and corresponding system for controlling an autonomous vehicle. The computer device includes a trained neural network. The neural network is suitable for identifying a feature in the environment of a vehicle, and is used in a perception software stack by the computer device. The perception software stack is effectively built around the neural network to use the function of the neural network to perform one or more of several specific tasks, such as image segmentation and object detection. Because the neural network is low-resolution, it can process data and produce results much more quickly and with less computational power than conventional neural networks. The feedback capabilities of the neural network allow it to select its own inputs from the input image. This adaptive approach effectively mimics active vision in nature, selecting and analysing small parts of an image to obtain information about the image on-the-fly rather than studying an entire image in a pixel-by-pixel brute force approach. These traits of the neural network allow it to perform efficiently whilst still providing accurate results. The invention will now be described in more detail with reference to the accompanying figures.
[0101] The training software module 102 is configured to provide to the computer device 104 a set of trained weights for a low resolution recurrent active vision neural network (LRRAVNN) that forms part of the perception software stack 108. The training software module 102 generates the trained weights, for performing a specific task with the neural network, using an artificial evolution algorithm. The specific task that the weights are trained for includes one or more of image classification, image segmentation, and object detection. The specific tasks are thus computer vision tasks, the results of which are used to inform control of an autonomous vehicle. Once the process of training the weights for one or more of the specific tasks is completed by the artificial evolution algorithm, the trained weights are uploaded to the computer device 104 via a data transfer for use in the neural network in the perception software stack 108.
[0102] The computer device 104 includes the perception software stack 108 or communicates with a computer readable medium having the perception software stack 108 stored thereon. The computer device includes a processor 110 and a memory 112. The perception software stack 108 is preferably implemented as computer-readable instructions stored on the memory 112 and executable by the processor 110. The computer device 104 is configured to be fitted into a vehicle, and includes an input for connecting to a sensor 106 and an output for outputting the results of the specific task to inform control of the vehicle. The computer device may be an integrated circuit.
[0103] The final weights trained by the training software module 102 are stored on the memory 112 when they are provided to the computer device 104. The memory 112 is also configured to store a configuration file, whereby the configuration file includes the operating parameters of the perception software stack 108. The operating parameters of the perception software stack 108 are different for each specific task. As such, the configuration file stored in the memory 112 is different depending on which specific task is intended to be performed by the computer device 104.
[0104] The sensor 106 is configured to provide to the computer device 104 sensor data describing the environment of the sensor 106. Throughout the following description, the sensor is referred to as a camera that produces a visual image. However, it is to be understood that the sensor 106 can be a camera, infrared sensor, LIDAR sensor or the like, attached to the vehicle to which the computer device 104 is fitted. The sensor 106 is configured to send the sensor data to the computer device 104 regularly as is required by an autonomous driving system.
[0105] The sensor data provided to the computer device 104 from the sensor 106 is manipulated by the perception software stack 108. The perception software stack 108 includes a plurality of layers, including a network layer comprising the neural network. When sensor data is provided to the computer device 104, the processor 110 runs the perception software stack 108 on the sensor data based on the operating parameters in the configuration file and the trained weights stored in the memory 112. The sensor data is passed through each layer of the perception software stack 108 in order, to obtain a result for the specific task being performed by the computer device 104. Once a result for the specific task is obtained, the computer device 104, using the processor 110, is configured to communicate with systems in the vehicle to aid control of the vehicle based on the result of the specific task.
[0106] Each of the training module 102, the computer device 104 and the perception software stack 108 will now be described in more detail, with reference to their physical implementations and their associated methods of use.
[0107] Firstly, the perception software stack 108 and its layers are discussed here.
[0108] The first layer 202 is responsible for pre-processing the sensor data. This pre-processing is common to all specific tasks and includes a predetermined colour scheme transformation.
[0109] The second layer 204 is configured to perform further pre-processing of the sensor data, whereby the further pre-processing is dependent on the specific task being performed by the perception software stack 108.
[0110] The third layer 206 is a network layer and includes the neural network that uses the weights trained by the training software module 102 for the specific task. The third layer 206 is split into three sub-layers, one for each of the specific tasks.
[0111] The fourth layer 208 is configured to post-process the results outputted by the neural network for each specific task. The fourth layer 208 is similarly split into three sub-layers to illustrate that there are three different types of post-processing that can occur, one for each of the specific tasks of image classification, image segmentation and object detection.
[0115] At step 302, a sensor/camera input is received. The input is an image such as a frame from a video.
[0116] At step 304, the image received from the sensor/camera is pre-processed at the first layer 202. The pre-processing of the image includes performing a colour conversion from the native colour scheme of the image, for instance RGB, to a HSa* colour scheme. The HSa* colour scheme includes a hue channel (H) 224, a saturation channel (S) 226, and a green/magenta channel (a*) 228. The conversion of the colour scheme aids the performance of the LRRAVNN with respect to the image when performing the specific task. Alternatively, other colour schemes may also be used. One option is to use an edge filter on the received image and use the gradient filtered output as a colour channel itself. Different types of edge filter may be used to form three different colour channels in this way.
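The exact HSa* transform is not reproduced in this text, so the following is only a minimal sketch, assuming OpenCV's HSV and CIELAB conversions are acceptable stand-ins for the H, S and a* channels described above (the function name to_hsa is illustrative):

```python
import cv2
import numpy as np

def to_hsa(bgr: np.ndarray) -> np.ndarray:
    """Split a BGR camera frame into H, S and a* channels.

    H and S are taken from an HSV conversion; a* (the green/magenta
    opponent axis) is taken from a CIELAB conversion.
    """
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    h, s = hsv[:, :, 0], hsv[:, :, 1]
    a_star = lab[:, :, 1]
    return np.dstack([h, s, a_star])
```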
[0117] At step 306, the image undergoes further pre-processing at the second layer 204. This involves dimensionality reduction and resolution adjustment to generate an input image for presenting to the neural network. Exactly how the image is reshaped and scaled is dependent on the specific task to be performed by the neural network. In general, the original image undergoes dimensionality reduction to produce a one-dimensional array for each colour channel. The further pre-processing in this step also includes either reducing the resolution and thus the size of the image, or splitting the image into multiple smaller images called 'patches', such that the size of the input image or images produced in step 306 conform with the input size requirements of the neural network.
[0118] At step 308, the input image generated in step 306 is presented to the neural network in the third layer 206. The neural network processes the input image, or images if the image was split into patches in step 306, using the weights trained for the specific task that is being performed, for a maximum number of iterations T. The neural network selects pixels from the input image using two image selection output neurons and a colour channel selection output neuron.
[0119] For each input image, an output score is produced by two further output neurons of the neural network and outputted for post-processing.
[0120] At step 310, the outputted result from the neural network in step 308 undergoes post-processing at the fourth layer 208. The post-processing step is different depending on the specific task being performed by the neural network. Examples of post-processing are provided below with reference to the specific tasks.
[0121] The method 300 described above refers to a single neural network. However, a plurality of k neural networks, where k is a positive integer, can be used to perform a specific task. When k neural networks are used to perform the same specific task, the output scores produced by the k neural networks are combined and post-processed together in step 310 as will be discussed in more detail below with reference to the specific tasks. Having k neural networks produces more reliable results.
[0122] Before the specific tasks are described, the architecture of the neural network included in the third layer 206 of the perception software stack 108 will be described here. For the input layer 402, the value y_i of each neuron can be described by equation Eq. 1 below, where g is an input gain and I_i is the i-th input value presented to the network:

y_i = g × I_i   (Eq. 1)
[0123] For the hidden layer 404, the value of each neuron is described by equation Eq. 2, where i is in the range of 1 to the total number of hidden neurons numhidden, δ is a decay constant, and y_i^cp is the cell potential of the i-th hidden layer neuron. For the output layer 406, the value y_i of each neuron is described by equation Eq. 3, where i is in the range of 1 to the total number of output neurons numoutput.
The neural network 400 has 32 input neurons in the input layer 402, 15 hidden neurons in the hidden layer 404, and 5 output neurons in the output layer 406. However, the neural network 400 may have more or fewer neurons at each layer. The neural network 400 has a maximum of 150 input neurons, to ensure that computational load is maintained at a low level and that the low-resolution aspect of the neural network 400 is maintained.
[0124] The neural network 400 is configured to iteratively process an input image 412. The input image 412 is the colour channel set of one-dimensional arrays that are produced by feeding a camera-captured image or other sensor data through the first layer 202 and second layer 204 of the perception software stack 108. The neural network 400 processes the input image 412 for a number of iterations up to a maximum iteration value T. At each iteration, pixel values from the input image 412 are processed by the neural network 400. The 5 output neurons include two image selection output neurons 414 and 416 for selecting co-ordinates of pixels in the image 412 to process with the neural network 400 at each iteration, a colour channel selection output neuron 418 for selecting one of three colour channels 424, 426 and 428 of the image 412 to process with the neural network 400 at each iteration, and two output prediction neurons, 420 and 422. The image selection output neurons 414 and 416 and the colour channel selection output neuron 418 are thus feedback outputs that are configured to modify the input to the input neurons at the input layer 402. This represents the active vision mechanism of the neural network 400. The output prediction neurons 420 and 422 provide an output score relating to the specific task that the neural network 400 is configured to run. Each of the output neurons 414, 416, 418, 420 and 422 outputs a value between 0 and 1.
[0125] The specific tasks will now be described with reference to the perception software stack 108, the method 300 and the neural network 400.
[0126] The specific task of image classification is described here.
[0129] At each iteration, the neural network 400 selects the pixels to process by calculating a start position Start_pos according to equation Eq. 4:

Start_pos = (OUT1 × IN_MULT) × (OUT2 × numinput) − c   (Eq. 4)

where OUT1 is the value outputted by a first of the two image selection output neurons 414, OUT2 is the value outputted by a second of the two image selection output neurons 416 and c is the number of input neurons (and thus the number of pixels processed by the neural network 400 at each iteration). OUT1 and OUT2 are between 0 and 1. This equation is limited at the low end by applying the conditional equation:

if Start_pos < 0 then Start_pos = 0   (Eq. 5)
[0130] The neural network 400 then selects c pixels starting from the pixel index nearest to the numerical value Start_pos. In the above example, the neural network 400 selects 32 pixels in this manner.
[0131] The colour channel selection output neuron 418 outputs a value OUT3 responsible for selecting the colour channel of the input image 412. The value OUT3 is between 0 and 1. The specific colour channel is selected according to the following logic:

if OUT3 < 0.33: select H channel;
if 0.33 < OUT3 < 0.66: select S channel;
if OUT3 > 0.66: select a* channel   (Eq. 6)
[0132] Once the pixels from the one-dimensional arrays of pixels and the colour channel have been selected, the selected pixels are processed by the neural network 400. The output neurons 420 and 422 each output an output score, OUT4 and OUT5 respectively, between 0 and 1. For each iteration the neural network 400 runs, an iteration output prediction value it_pred_val is stored, where:

it_pred_val = OUT4 − OUT5   (Eq. 7)
[0133] Once the number of iterations is equal to T, the maximum number of iterations, a final prediction value final_pred_value is calculated by averaging the stored iteration output prediction values it_pred_val across all the iterations. This gives a final prediction value final_pred_value between −1 and +1.
[0134] Preferably, in the above calculation of final_pred_value, the first ten iterations run by the neural network 400 are discounted to allow the network to settle, such that the it_pred_val values are averaged over the iterations after the first ten iterations. This means that the total number of it_pred_val values used in the calculation of final_pred_value is equal to T−10.
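A sketch of this iteration loop follows, under several assumptions not fixed by the text above: the network is modelled as a callable returning the five output values, IN_MULT is passed in as a parameter, numinput is taken to be the length of the one-dimensional pixel array, and the feedback outputs start at a neutral 0.5 before the first iteration:

```python
import numpy as np

SETTLE = 10  # iterations discarded while the network settles ([0134])

def run_network(network, channels, T, c, in_mult):
    """Active-vision loop over one input image.

    network(pixels) -> (out1, out2, out3, out4, out5), each in [0, 1].
    channels maps 'H', 'S' and 'a*' to 1-D pixel arrays; c is the
    number of input neurons; in_mult stands in for IN_MULT in Eq. 4.
    """
    preds = []
    out1 = out2 = out3 = 0.5                 # neutral feedback at iteration 0
    for t in range(T):
        # Eq. 6: colour channel selection from OUT3
        key = 'H' if out3 < 0.33 else ('S' if out3 < 0.66 else 'a*')
        arr = channels[key]
        # Eq. 4/5: start index of the c-pixel window, clamped to the array
        start = (out1 * in_mult) * (out2 * len(arr)) - c
        start = max(0, min(int(round(start)), len(arr) - c))
        out1, out2, out3, out4, out5 = network(arr[start:start + c])
        if t >= SETTLE:
            preds.append(out4 - out5)        # Eq. 7: it_pred_val
    return float(np.mean(preds))             # final_pred_value in [-1, +1]
```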
[0135] The final prediction value final_pred_value then undergoes post-processing in the fourth layer 208 of the perception software stack 108. At this stage, two variables are calculated. These include a discrete predicted outcome, PRED, and a numerical confidence measure DIST that defines the distance from one of two confidence level thresholds, UP_LIMIT and LOW_LIMIT. The two confidence level thresholds UP_LIMIT and LOW_LIMIT may be set to any value between −1 and 1. For example, the UP_LIMIT and LOW_LIMIT may be +0.2 and −0.2 respectively. For classification tasks, PRED denotes the class that the processed image is predicted to belong to. DIST is a measure used to determine the overall class where more than one neural network 400 is used to process the input image or patches. PRED and DIST are calculated according to the following logic in this instance:
if final_pred_value > UP_LIMIT: PRED = class 1; DIST = |final_pred_value − UP_LIMIT|;
if final_pred_value < LOW_LIMIT: PRED = class 2; DIST = |final_pred_value − LOW_LIMIT|;
else: PRED = Neutral; DIST = |final_pred_value − UP_LIMIT|   (Eq. 8)
[0136] In other words, if the final_pred_value from the neural network 400 is greater than the upper threshold, it is determined in step 310 that the processed image belongs to class 1. If final_pred_value is lower than the lower threshold, it is determined that the processed image belongs to class 2. If final_pred_value is somewhere between the upper and lower thresholds, then the class is labelled as neutral, meaning the image neither definitively belongs to class 1 nor to class 2.
[0137] The variable DIST is used when there are k neural networks performing the classification task. When k networks are performing classification, the PRED values for each network are accumulated. For example, if there are 20 networks, there may be 14 instances of PRED = class 1 and 6 instances of PRED = class 2. This equates to 70% of the 20 networks producing a PRED = class 1 result and 30% of the 20 networks producing a PRED = class 2 result. These percentages are calculated and compared to a class threshold value, class_thresh. If the percentage associated with a particular class is higher than class_thresh, it is determined that the processed image belongs to that particular class. For example, if class_thresh is 60%, then a determination is made that the image belongs to class 1, because 70% of the 20 networks produced a PRED = class 1 result, and 70% is greater than the threshold of 60%. However, if no percentage associated with a class exceeds class_thresh, the class of the image is not immediately apparent and the DIST variable is used instead. In this case, the class of the image is determined based on a value FIN_DIST, wherein FIN_DIST is calculated for each class using:

FIN_DIST(class) = Σ(PRED = class) + z × Σ(DIST where PRED = class)   (Eq. 9)

where Σ(PRED = class) is the number of networks whose prediction is the class, Σ(DIST where PRED = class) is the sum of the DIST values of those networks, and z is a scaling factor that is a positive real number. For example, assume that there are six networks such that k = 6, where the PRED and DIST values for each network are as provided in Table 1 below:
TABLE 1

Network     PRED     DIST
Network 1   Class 1  0.8
Network 2   Class 1  0.2
Network 3   Class 1  0.1
Network 4   Class 2  0.8
Network 5   Class 2  0.1
Network 6   Class 2  0.4
[0138] FIN_DIST for class 1 is calculated using the sum of instances of PRED = class 1, added to the scaling factor multiplied by the sum of DIST values when PRED = class 1. As such, the value of FIN_DIST(class 1) is equal to (1+1+1) + z(0.8+0.2+0.1), which is equal to 3 + 1.1z. FIN_DIST for class 2 is calculated using the sum of instances of PRED = class 2, added to the scaling factor multiplied by the sum of DIST values when PRED = class 2. As such, the value of FIN_DIST(class 2) is equal to (1+1+1) + z(0.8+0.1+0.4), which is equal to 3 + 1.3z. The class with the largest FIN_DIST value is determined to be the class to which the image belongs. In the above example, this is class 2. This classification result is output by the specific task of classification to inform the control of an autonomous vehicle. A confidence value is also outputted, whereby the confidence value is proportional to the DIST term in equation Eq. 9. Classification can be used in autonomous driving to identify the current environment, such as an urban road, a residential road, or a high street for example. Classification can also aid in identifying landmarks in the visual field, such as a bank or supermarket building that is present in the input image. Furthermore, classification can aid in identifying visible junctions and intersections. Thus the neural network can help classify whether the current input image requires a right turn, left turn or straight-ahead motion from the vehicle.
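A sketch of this ensemble post-processing, using the Eq. 8 and Eq. 9 logic above and reproducing the Table 1 worked example (the function names and the default threshold and scaling values are illustrative):

```python
def classify(final_pred_value, up=0.2, low=-0.2):
    """Eq. 8: map final_pred_value to (PRED, DIST)."""
    if final_pred_value > up:
        return 'class1', abs(final_pred_value - up)
    if final_pred_value < low:
        return 'class2', abs(final_pred_value - low)
    return 'neutral', abs(final_pred_value - up)

def ensemble_class(results, class_thresh=0.6, z=1.0):
    """results: list of (PRED, DIST) pairs from k networks."""
    k = len(results)
    for cls in ('class1', 'class2'):
        if sum(p == cls for p, _ in results) / k > class_thresh:
            return cls                        # clear majority of networks
    # Eq. 9: FIN_DIST = vote count + z * summed DIST, per class
    fin = {cls: sum(p == cls for p, _ in results)
                + z * sum(d for p, d in results if p == cls)
           for cls in ('class1', 'class2')}
    return max(fin, key=fin.get)

# Table 1 example: class 2 wins with FIN_DIST = 3 + 1.3z vs 3 + 1.1z
nets = [('class1', 0.8), ('class1', 0.2), ('class1', 0.1),
        ('class2', 0.8), ('class2', 0.1), ('class2', 0.4)]
assert ensemble_class(nets) == 'class2'
```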
[0139] The specific task of image segmentation, and in particular road segmentation, is described here.
[0141] At the second layer 204 of the perception software stack 108, the further pre-processing differs for image segmentation when compared to classification, in that the colour channel images 702, 704 and 706 are each divided into a plurality of patches 708a to 708n. The patches 708a to 708n have a configurable size, such as 64×40 pixels for example, and stride, depending on the original image size and the input size requirements for the neural network. The patches 708a to 708n are extracted from the original image using the following logic, considering a patch of width P_w, a height of P_h, horizontal stride St_h and vertical stride St_v, where P_num_h and P_num_v are the total number of patches in the horizontal and vertical directions respectively. The first patch is extracted from the top left corner of the image plane, from the 0th row and 0th column of the rows and columns of pixels in each of the colour channel images 702, 704 and 706. The second patch is extracted from the 0th row and the 0th column + St_h. The third patch is extracted from the 0th row and the 0th column + 2St_h. This process repeats until the rightmost image boundary is reached or until P_num_h is exceeded. In other words, patches are taken along the first row of pixels of the colour channel images 702, 704 and 706 from left to right, incrementing by the horizontal stride St_h until the rightmost boundary of the colour channel images 702, 704 and 706 is reached. Once patches have been extracted from the 0th row, extraction is shifted to the 0th row + St_v, wherein the process repeats, extracting patches from left to right until the rightmost boundary of the colour channel images 702, 704 and 706 is reached or P_num_h is exceeded. This process continues to the 0th row + 2St_v and onwards until P_num_v is exceeded, or the bottom-right corner boundary of the colour channel images 702, 704 and 706 is reached. As an example, P_w may be 64, P_h 40, St_h 30, St_v 13, P_num_h 20 and P_num_v 20, giving 400 patches for a colour channel image of size 640×300. Different values for these variables can result in spaces between consecutive patches or overlapping consecutive patches. Each patch is further reduced to a one-dimensional array of pixels.
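A sketch of this patch-extraction scheme, using the example values from the paragraph above (the function name is illustrative):

```python
import numpy as np

def extract_patches(channel: np.ndarray, p_w=64, p_h=40,
                    st_h=30, st_v=13, p_num_h=20, p_num_v=20):
    """Slide a P_w x P_h window over one colour-channel image,
    stepping St_h horizontally and St_v vertically from the top-left
    corner, and flatten each patch to a 1-D pixel array."""
    img_h, img_w = channel.shape
    patches, positions = [], []
    for r in range(p_num_v):
        row = r * st_v
        if row + p_h > img_h:
            break                      # bottom boundary reached
        for c in range(p_num_h):
            col = c * st_h
            if col + p_w > img_w:
                break                  # rightmost boundary reached
            patches.append(channel[row:row + p_h, col:col + p_w].ravel())
            positions.append((row, col))
    return patches, positions

# A 640x300 channel with the example strides yields the 400 patches above
patches, positions = extract_patches(np.zeros((300, 640), dtype=np.uint8))
```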
[0144] Post-processing occurs in the fourth layer 208 of the perception software stack 108. The post-processing in road segmentation can be performed in a similar way to image classification, in which the discrete predicted outcome, PRED, and the numerical confidence measure DIST are calculated according to equation Eq. 8 for each image patch. In the road segmentation task, class 1 and class 2 refer to road/non-road classes.
[0145] The variable DIST is used when there are k neural networks performing the road segmentation task. When k networks are performing classification, the PRED values for each network are accumulated as in the classification task, and FIN_DIST is calculated for each class using equation Eq.9. The class with the largest FIN_DIST value is determined to be the class to which the first patch belongs. This process of classifying an individual patch is then repeated for all patches.
[0146] More preferably, once the final prediction value final_pred_value is calculated for each patch, it is normalized between 0 and 1, and preferably multiplied by 255, to form a heat map pixel value. For example, when the final_pred_value is −0.5, it is normalized between 0 and 1 to become 0.25, and may then be multiplied by 255. As such, each patch is assigned a heat map pixel value between 0 and 255 that is proportional to its final_pred_value. The patches 708a to 708n are then reassembled on the image plane of the original image according to their respective positions in the original image, whereby all of the pixels in each respective patch are assigned the same value equal to the heat map pixel value of that respective patch. If the patches are generated in the second layer 204 such that they overlap each other in the image plane of the original image, the patches are divided further into sub-patches. The sub-patches are sized such that they do not overlap neighbouring sub-patches. For example, for an original image of size 640×300, each 64×40 patch is divided into six smaller sub-patches of size 32×13. The sub-patches are then stored in a 21×22 array to provide a heat map image 708a that resembles the image plane of the original image.
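A sketch of this heat-map construction for the simple non-overlapping case, assuming one final_pred_value per patch position as returned by the extract_patches sketch above (the overlap and separation handling via sub-patches is omitted):

```python
import numpy as np

def heat_map(pred_vals, positions, image_shape, p_w=64, p_h=40):
    """Paint each patch area of the original image plane with a single
    heat value derived from that patch's final_pred_value in [-1, +1],
    normalised to [0, 1] and scaled to [0, 255] as described above."""
    heat = np.zeros(image_shape, dtype=np.uint8)
    for val, (row, col) in zip(pred_vals, positions):
        pixel = int(round((val + 1.0) / 2.0 * 255))  # -0.5 -> 0.25 -> ~64
        heat[row:row + p_h, col:col + p_w] = pixel
    return heat
```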
[0147] Similarly, if during step 306 neighbouring patches are generated such that they are physically separated from each other in the image plane of the original image, the patches are divided into sub-patches. Sub-patches are also generated between the neighbouring patches, and are then designated a heat map pixel value that is dependent on the heat map pixel values of the neighbouring patches.
[0148] Once the patches have been reassembled on the image plane of the original image, or where there is overlap or separation of the patches on the image plane and sub-patches have consequently been generated, a heat map image 708a is produced. The further processing of the heat map image 708a is explained now with reference to an example.
[0150] The post-processing in the fourth layer 208 of the perception software stack 108 continues by applying segmentation or fitting algorithms to the heat map image 904. Applying a segmentation algorithm results in extracting a grid-based shape from the heat map image 904. In an example, Otsu's thresholding method is firstly applied to make the heat map image 904 a binary image. A shape is then extracted from the binary image using a structural analysis algorithm such as the algorithm disclosed here:
https://www.semanticscholar.org/paper/Topological-structural-analysis-of-digitized-binay-Suzuki-Abe/cf021db5e811fd5b67ee3aa4db0a6a0351d276d2
[0151] This example algorithm works on connected component analysis principles, by trying to find an outer border within a binary digitized image. All connected border shapes are first extracted. In a second pass, all 'holes' within the image planes are assigned scores based on their proximity to borders and filled pixels. The final pass attempts to fill in 'holes' depending on their scores and adds them to existing shapes. The outermost final border is considered as the connected shape structure output.
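OpenCV's findContours implements the Suzuki-Abe border-following analysis cited above, so a minimal sketch of this segmentation step might look like the following; taking the largest contour as the road shape is an assumption, not something the text specifies:

```python
import cv2
import numpy as np

def segment_road(heat_map: np.ndarray) -> np.ndarray:
    """Binarise the heat map with Otsu's method and return a filled
    mask of the largest connected outer border."""
    _, binary = cv2.threshold(heat_map, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(heat_map)
    if contours:
        road = max(contours, key=cv2.contourArea)  # outermost largest shape
        cv2.drawContours(mask, [road], -1, 255, thickness=cv2.FILLED)
    return mask
```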
[0152] The result of this example segmentation for one neural network 400 is a connected shape extracted from the heat map image 904.
[0153] It is to be understood that k neural networks can be used concurrently to produce a plurality of heat map images 708a to 708k.
[0154] Alternatively, a fitting algorithm is applied to the heat map image 904 to produce a shape such as a triangle, whereby the area of the triangle indicates the existence of road. The triangle can be overlaid on the original image 902 to form a hybrid image 922.
[0155] It is to be understood that the fitting algorithm may contain thresholds for acceptable error, such that a boundary pixel is not identified until at least 1-10 consecutive pixels do not have a ‘road’ pixel value.
[0156] The specific task of object detection is described here.
[0157] As with image segmentation, the process of object detection includes generating a heat map image 708a of patches or sub-patches that are each assigned a heat map pixel value according to a normalised final_pred_value calculated for each patch. In object detection, the neural network 400 is configured to classify, for example, patches that belong to an object such as a car. Therefore, the normalised final_pred_value calculated for each patch is an indication of whether or not the patch belongs to a car in the original image.
[0158] The heat map pixel values H of the heat map image 1004 are thresholded to produce new values H_new according to the following logic:

if H ≤ l_1: H_new = 0;
if l_1 < H ≤ l_2: H_new = 0.15;
else if H > l_2: H_new = 1   (Eq. 10)
[0159] The variables l_1 and l_2 are user-configurable, and may be values such as 0.25 and 0.5 respectively. It is to be understood that equation Eq. 10 is exemplified by the case where the heat map pixel values are normalised between 0 and 1, however they may be in the range of 0 to 255 as described above with respect to image segmentation. The thresholding performed by equation Eq. 10 reduces the heat map image 1004 to a reduced heat map image 1006, wherein the heat patches have heat map pixel values of 0, 0.15 or 1. Patches/sub-patches with a heat map pixel value of 0 are referred to as low patches, patches/sub-patches with a heat map pixel value of 0.15 are referred to as medium patches, and patches/sub-patches with a heat map pixel value of 1 are referred to as high patches. The reduced heat map image 1006 is formed using the same image plane as the original image 1002. The reduced heat map image 1006 then undergoes further processing to produce bounding boxes 1008.
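A sketch of this three-level thresholding, for the case where the heat values are normalised to [0, 1] and using the example limits l_1 = 0.25 and l_2 = 0.5:

```python
import numpy as np

def reduce_heat_map(heat: np.ndarray, l1=0.25, l2=0.5) -> np.ndarray:
    """Eq. 10: quantise heat values to low (0), medium (0.15)
    and high (1) patches."""
    reduced = np.full_like(heat, 0.15, dtype=float)  # medium by default
    reduced[heat <= l1] = 0.0                        # low patches
    reduced[heat > l2] = 1.0                         # high patches
    return reduced
```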
[0160] Firstly, all connected shapes of low and medium patches in the reduced heat map image 1006 are identified. A connected shape comprises two or more patches/sub-patches, such that individual low or medium patches are not identified as a connected shape. Of the identified connected shapes, any connected shape with no low patches, or in other words, any connected shape consisting solely of medium patches, is disregarded. Next, the boundaries of each separate connected shape are determined as co-ordinates in the upwards, downwards, left and right directions in the reduced heat map image 1006, by determining the last connected low or medium patch in each of these directions. These co-ordinates in the reduced heat map image 1006 are then used to draw horizontal lines, from the upper and lower co-ordinates, and vertical lines, from the left and right co-ordinates, to form the bounding boxes 1008. Preferably, for each bounding box, the number of low, medium and high patches contained within the bounding box is calculated to provide a confidence value for the respective bounding box. The confidence value Confidence for each bounding box is calculated according to equation Eq. 11, where p_low, p_mid and p_high are the number of low, medium and high patches respectively. Low patches are given a weighting of 2 in equation Eq. 11. Due to this, Confidence may theoretically exceed 1. To prevent this from happening, Confidence is limited between 0 and 1.
[0161] Once the bounding boxes 1008 have been formed and Confidence calculated, the specific task of object detection outputs the original image 1002 overlaid with the bounding boxes according to their position on the reduced heat map image 1006, for use in controlling the autonomous vehicle. This is shown as output image 1010.
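A sketch of the bounding-box formation, using scipy's connected-component labelling as a stand-in for the connected-shape identification described above; the confidence weighting of equation Eq. 11 is omitted here since its exact form is not reproduced in this text:

```python
import numpy as np
from scipy import ndimage

def bounding_boxes(reduced: np.ndarray):
    """Boxes around connected shapes of low/medium patches.

    reduced holds the Eq. 10 values 0 (low), 0.15 (medium), 1 (high)
    on the patch grid. Returns (top, bottom, left, right) per shape.
    """
    labels, n = ndimage.label(reduced < 1.0)      # low/medium components
    boxes = []
    for lab, region in enumerate(ndimage.find_objects(labels), start=1):
        mask = labels[region] == lab
        if mask.sum() < 2:                        # single patches ignored
            continue
        if not (reduced[region][mask] == 0.0).any():
            continue                              # needs at least one low patch
        rows, cols = region
        boxes.append((rows.start, rows.stop - 1, cols.start, cols.stop - 1))
    return boxes
```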
[0162] It is to be understood that k neural networks 400 may run the specific task of object detection concurrently, such that a plurality of heat map images 708a to 708k and 1004 and reduced heat map images 1006 are produced in the fourth layer 208 of the perception software stack 108. In this case, bounding boxes 1008 are formed for each of the plurality of reduced heat map images 1006 and corresponding confidence values calculated according to equation Eq. 11. To form the output image 1010, the bounding boxes 1008 of each reduced heat map image 1006 are combined. When bounding boxes 1008 intersect, their confidence values are averaged. Preferably, the output image 1010 is subject to further thresholding to only display bounding boxes 1008 above a certain confidence value.
[0163] Once the specific tasks of image classification, segmentation, and/or object detection are completed, the output from each specific task is used to inform the control of an autonomous vehicle. The specific tasks help to identify features of the environment of the vehicle, such as the road, pedestrians, road signs, objects, buildings, other road users, junctions and intersections and the like. Controlling an autonomous vehicle ultimately depends upon defining a 'freespace'. Freespace is the area detected as the road by the specific task of road segmentation, minus the areas within the detected road which are occupied by an object such as a car, a pedestrian or the like. The freespace is thus a shape formed by combining the outputs of road segmentation and object detection. Once the freespace is known, the vehicle can be controlled to navigate the freespace using standard kinematics algorithms. In particular, co-ordinate transformations are performed between the image plane showing the freespace and the three-dimensional real-world environment such that the vehicle can be controlled using standard control systems.
[0165] The array 1102 is split into rows as shown in block 1104, so that the centroid C1 of the freespace shape can be calculated. Initially, the centroids AC1 to AC4 of each row are identified.
[0166] It is to be understood that other methods of calculating the centroid of the freespace may also be used, including graphical methods, such as using angular bisectors on the triangle 924 in the hybrid image 922 to form the image 1108. Once the co-ordinates of the centroid C1 of the freespace shape are calculated, various aspects of control of an autonomous vehicle can be informed using the freespace shape corresponding to an original image and other freespace shapes relating to previously processed images. For example, aspects of the autonomous vehicle relating to movement, such as speed and direction, may be informed by the location of the centroid C1 derived from consecutive image frames. Where C_x and C_y are the co-ordinates of the centroid C1, C_x−1 and C_y−1 are the co-ordinates of the centroid derived from the immediately previously captured original image, x_mid is the x-co-ordinate of the middle of the image plane of the original image, y_threshold is a predetermined row in the image plane which serves as a cut-off point for non-linear speed control, and P1, D1, P2, D2 are scalar hyperparameters:
direction = P1 × (x_mid − C_x) + D1 × (C_x − C_x−1)   (Eq. 13)

speed = P2 × (y_threshold − C_y) + D2 × (C_y − C_y−1)   (Eq. 14)
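A sketch of these two control laws with the symbols of Eq. 13 and Eq. 14 as parameters; the default hyperparameter values are illustrative only:

```python
def steer(c_x, c_y, prev_c_x, prev_c_y, x_mid, y_threshold,
          p1=1.0, d1=0.5, p2=1.0, d2=0.5):
    """PD-style direction and speed commands from the freespace centroid.

    The proportional terms pull the vehicle toward the current centroid;
    the derivative terms damp the response using the previous centroid.
    """
    direction = p1 * (x_mid - c_x) + d1 * (c_x - prev_c_x)      # Eq. 13
    speed = p2 * (y_threshold - c_y) + d2 * (c_y - prev_c_y)    # Eq. 14
    return direction, speed
```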
[0167] It is to be understood that other methods of using the calculated freespace to provide driving commands to a vehicle or computer system within the vehicle may be applied. When there are k networks which each provide their own outcome of a specific task, and thus form their own freespace shape, the method of controlling an autonomous vehicle includes using combination techniques and may further include using particle swarm optimization techniques to find the optimal outcome from the k networks. For example, combination may include averaging the individual freespace centroids from each of the k networks. The centroids may be weighted differently from each other when calculating the average. Alternatively, an algorithm based on the Reynolds model of coordinated collective behaviour may be used, where alignment, cohesion and separation of the outputs of the specific task for k different networks are calculated to find the optimal outcome for the k networks. The alignment, cohesion and separation values in this swarm optimization algorithm are vectors from the position of the autonomous vehicle to the centroid of the freespace shape for each of the k networks.
[0168] Whilst the specific tasks of road segmentation and object detection have been described above in detail, it is to be understood that the general method 300 can be employed in any similar computer vision task in an autonomous vehicle, such as collision detection, road-sign detection and object tracking. In each of these tasks, a feature of the environment of the vehicle is identified, detected, determined or segmented from the rest of the environment. Each of these actions relies on the action of the neural network, which fundamentally classifies an input image. The different layers of the perception software stack are modified to the requirements of each task and the training of the neural network is different based on the task. As such, the neural network is trained to classify different features dependent on the task for which it is to be run.
[0169] Furthermore, the application of the method 300 and the perception software stack 108 is not limited to autonomous vehicles, but can also be used in any vehicle or machine where computer vision is used. For example, the method 300 and perception software stack 108 may be used in the fields of robotics, and in neighbouring fields such as industrial manufacture, medicine, hazardous area exploration and the like. 'Any vehicle' refers to a vehicle where vision is required or is otherwise useful to aid the control of the vehicle. As such, vehicles include road vehicles such as cars, trucks and motorbikes; marine vehicles such as boats and submarines; aerial vehicles such as drones, aeroplanes and helicopters; and other specialist vehicles such as space vehicles.
[0170] It is thus to be understood that the environment in which the method 300 and the perception software stack 108 is to be used can vary. The environment may be on land, at sea, in the air or in space. Each of these environments has unique features that define the freespace area in which the vehicle is safe to navigate. On land, the features may include roads, pedestrians, hazards, objects, signage and buildings, for example. At sea and in the air, the features may include weather formations, standard shipping and air lanes, and hazards, for example.
[0171] It is further to be understood that each of these different environments may require specialist or different sensors 106 in order to acquire sensor data that describes the environment. As such, the sensor 106 may be a radar sensor, a LIDAR sensor, a camera, a charge-coupled device, an ultrasonic sensor, an infrared sensor or the like. The sensor data received from such sensors is manipulated as explained above with reference to the 'original image'. If the sensor provides data in three dimensions, such as the LIDAR sensor, the pre-processing steps further include dimensionality reduction to reduce the three-dimensional sensor data to the one-dimensional arrays before presenting said one-dimensional arrays to the neural network or networks.
[0172] It is to be understood that the method 300 and the perception software stack 108 may be implemented on any computer device or integrated circuit. Furthermore, the method 300 and the software stack 108 may be written to memory as computer-readable instructions, which, when executed by a processor, cause the processor to perform the method 300 and implement the function of the software stack 108.
[0173] The method 300 and perception software stack 108 are adapted for each specific task through a training process, performed by the training software module 102. The training process will now be described here in more detail.
[0174] The purpose of the training process is to train the neural network to perform a specific task. The CTRNN architecture of the neural network does not change between the specific tasks. Instead, the weights w_ji in the weighted connections of the neural network are given values determined by the training process. These trained weights alter the calculations and thus the decision-making of the neural network so that it is adapted to perform the specific task. The general training process involves using a genetic algorithm to artificially evolve random initial weights such that, after a number of generations, they are effective at adapting the neural network to perform the specific task accurately.
[0176] At step 1202, an initial population of chromosomes for the neural network is generated from a pseudo-random number generator function. The initial population is represented by a floating point array of N_pop chromosomes. Each chromosome has a number of variables equal to the number of weights for the neural network, N_weights. The weights may include a tau or decay constant and layer bias, such that they are not strictly synaptic weights from node to node. Each chromosome is an encoded/non-encoded representation of a set of weight values corresponding to the weights for the neural network. Due to the use of random number generation, each chromosome has a random initial value for each of the weights in N_weights.
[0177] At step 1204, each chromosome is inputted into the architecture of the neural network, such that the weight values contained in a particular chromosome are applied to the real weighted connections in the neural network. Training data such as a series of example images are then presented to the input layer of the neural network and the outputs are recorded. This occurs for each chromosome in the initial population, preferably in parallel and concurrently. The performance of the initial population of chromosomes is then evaluated by applying a fitness function and recording a fitness score for each chromosome. The fitness function relates to the example images and the particular specific task that is being trained for. The fitness score provides a numerical indication of each chromosome's effectiveness at performing the specific task. As noted above, the specific tasks include image classification, object detection and road segmentation. In terms of the process performed by the neural network 400, in the specific task of image classification, the whole input image is classified, and in object detection and road segmentation, patches of the input image are classified separately. The neural network 400 therefore performs a very similar classification method for each of the specific tasks. The differences between the specific tasks are more prevalent in the post-processing steps 310 performed by the fourth layer 208 of the perception software stack 108, as discussed above. In an example of a classification fitness function, the fitness score is accumulated over the set of example images as follows.
[0178] For when the true class is Class 1:

if final_pred_value > thresh_upper: fitness = fitness + 1   (Eq. 15)

[0179] For when the true class is Class 2:

if final_pred_value < thresh_lower: fitness = fitness + 1   (Eq. 16)
Where thresh_upper and thresh_lower are an upper and a lower threshold respectively, such as 0.01 and −0.01. Different values of these variables affect the outcome of the training process. For further classes, such as a third class, further thresholds may be introduced. According to equations Eq. 15 and Eq. 16, the higher the fitness score, the better the neural network is at correctly classifying the set of example images. The example images may be different for training each specific task. For example, for training road segmentation, example images of roads may be provided in the training process 1200, but for object detection, example images of objects such as pedestrians, bicycles and vehicles may be provided. Furthermore, if the specific task being trained for is image classification, the example images may be scaled-down images, whereas if the specific task being trained for is road segmentation or object detection, the example images may be a series of pre-defined patches.
[0180] At step 1206, the genetic algorithm is run and the next generation is created. Following the initial population, a second population of chromosomes is generated using the initial population of chromosomes and their associated fitness scores evaluated in step 1204. This involves running a genetic algorithm on the chromosomes based on their fitness scores. At least one of four operations is performed on the initial population of chromosomes to generate the second population of chromosomes. These operations include elitism, truncation, mutation and recombination. When elitism is performed, a selection of the chromosomes with the best fitness scores are replicated onto the second population without alteration. The chromosomes are thus ranked after the evaluation in step 1204 according to their fitness scores, and when elitism is applied, the chromosomes with the best fitness scores are selected. When truncation is performed, a selection of the chromosomes with the worst fitness scores are removed such that they do not form part of the second population of chromosomes. When recombination (or crossover) is performed, a new chromosome is generated for the second population by combining two or more chromosomes from the initial population. The two or more chromosomes from the initial population used to generate the new chromosome for the second generation are selected using a roulette wheel selection technique, which means that chromosomes with better fitness scores have a higher probability of being selected for recombination. The two chromosomes selected for recombination are recombined according to an operation between the two chromosomes. This may be a single-point, two-point, or k-point crossover, where k is a positive integer less than N_weights. Other crossover operations may be used for the process of recombination. When mutation is performed, one or more of the floating point numbers in a chromosome, representing a weight, is modified by the addition, subtraction, multiplication or division of a random number. Preferably, the total number of chromosomes in the second population is equal to the number of chromosomes in the initial population, such that the number of chromosomes discarded via truncation equals the number of chromosomes introduced to the population via recombination. A sketch of one such generation step is provided below.
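The following sketch combines elitism, truncation, roulette-wheel recombination and mutation as described above; the elite/truncation counts, the two-point crossover and the Gaussian additive mutation are illustrative choices, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng()

def next_generation(pop, fitness, n_elite=2, n_trunc=2, mut_rate=0.05):
    """pop: (N_pop, N_weights) float array; fitness: (N_pop,) scores."""
    order = np.argsort(fitness)[::-1]                 # best first
    survivors = pop[order][:len(pop) - n_trunc]       # truncation
    new_pop = [survivors[i].copy() for i in range(n_elite)]  # elitism
    # roulette-wheel probabilities from (shifted) fitness scores
    f = fitness[order][:len(survivors)]
    probs = f - f.min() + 1e-9
    probs /= probs.sum()
    while len(new_pop) < len(pop):
        a, b = rng.choice(len(survivors), size=2, p=probs)
        cut1, cut2 = sorted(rng.integers(1, pop.shape[1], size=2))
        child = np.concatenate([survivors[a][:cut1],
                                survivors[b][cut1:cut2],
                                survivors[a][cut2:]])  # two-point crossover
        mask = rng.random(child.shape) < mut_rate      # mutation
        child[mask] += rng.normal(0.0, 0.1, mask.sum())
        new_pop.append(child)
    return np.stack(new_pop)
```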
[0181] At step 1208, steps 1204 and 1206 are repeated with respect to the second population of chromosomes and a new third population of chromosomes. The fitness scores are evaluated for the second population, and these are then used to generate the third population. The above process repeats, forming a new generation of chromosomes at the end of each evaluation step. This starts from the initial population and ends with the nth population, where n is a positive integer representing the training epoch, which signifies the maximum number of generations of populations.
[0182] At step 1210, the final weights are output for use in the neural network 400. It is to be understood that, whilst the above description of the training software module 102 and the method 1200 discuss one neural network, it is preferable that multiple k networks are trained using the training software module 102 and the method 1200. In this case, the initial population includes a set of k floating point arrays that are randomly generated, whereby each floating point array is configured to train one of the k neural networks.
[0183] To train the network or networks efficiently, the training software module 102 is implemented in a specific arrangement of hardware. In general, the hardware includes a primary module and a secondary module. The primary module is configured to perform the method 1200 up to and including the generation of the initial population 1202. The primary module thus defines the parameters of the training method 1200, including the number of chromosomes to be generated, the training epoch number n and the operations to be performed in the formulation of the next generation of chromosomes 1206. Once the initial population is formed in the primary module, it is sent to the secondary module. The secondary module is configured to evaluate the performance 1204 of each chromosome in the initial population. Preferably, the secondary module is configured to evaluate each chromosome in the initial population concurrently. Once evaluation of all chromosomes in the initial population is complete, a fitness score for each chromosome is returned to the primary module. At the primary module the next population of chromosomes is generated 1206 as a result of the genetic algorithm being run. The next population is then fed back into the secondary module and the process repeats until the nth generation 1208. When this generation is reached, final weights are deduced by selecting the best performing chromosomes and decoding them to determine weight values. These are then saved to a memory for transfer to the perception software stack 108.
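A sketch of this primary/secondary split, using Python's multiprocessing pool as a stand-in for the GPU or CPU-cluster back end described below; the evaluate body is a placeholder (a real implementation would run the network and accumulate fitness per Eq. 15/16), and next_generation is reused from the sketch above:

```python
import numpy as np
from multiprocessing import Pool

def evaluate(chromosome: np.ndarray) -> float:
    """Secondary-module work: apply one chromosome's weights to the
    network, present the example images, and return a fitness score.
    A placeholder score stands in for the Eq. 15/16 accumulation."""
    return float(-np.abs(chromosome).sum())   # stand-in fitness

def train(population: np.ndarray, generations: int, workers: int):
    """Primary module: farm evaluation out to workers, then breed
    the next generation from the returned fitness scores."""
    with Pool(workers) as pool:               # the 'secondary module'
        for _ in range(generations):
            fitness = np.array(pool.map(evaluate, population))
            population = next_generation(population, fitness)
    return population
```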
[0184] Alternatively, when selecting the chromosomes to be saved to the memory for transfer to the perception software stack 108, re-evaluation and validation may firstly occur to ensure that the trained weight values are accurate. Re-evaluation involves, after the training process has been completed, selecting all chromosomes across all generations that have a fitness score above a specified cut-off threshold. These selected chromosomes are then re-evaluated for a different set of example images or image patches. This second example set of images is known as a validation set and ensures the accuracy of the selected chromosomes. Based on the re-evaluation, the best performing chromosomes and thus the best performing network(s) can be selected and stored.
[0185] Implementations of the general configuration will now be discussed here.
[0186] The CPU 1302 is firstly configured to prepare data 1302a for the training process 1200 by setting the parameters of the training algorithm such as the size of each population N_pop, the number of generations n, and the operations to be used in forming each new generation as discussed above. These parameters may be read from a training configuration file. Each of object detection, image segmentation and classification has a different training configuration file.
[0187] Next, the CPU 1302 is configured to generate the initial population of chromosomes 1302b. As discussed above, initially each chromosome is a set of randomly generated weights for the N_weights. The initial population of chromosomes is then sent from the CPU 1302 to the GPU 1304 to be evaluated. Evaluation of each of the chromosomes is done concurrently, in parallel within the GPU 1304. The GPU 1304 evaluates each chromosome in a separate parallel computing block 1304a to 1304n. The number of blocks 1304a to 1304n is preferably equal to the number of chromosomes in the initial population N_pop, such that each block 1304a to 1304n is configured to evaluate one chromosome, corresponding to one set of weights for the neural network. Each block 1304a to 1304n is implemented using CUDA® from NVIDIA® for example. Each block comprises a plurality of threads, whereby the number of threads is equal to the number of input neurons num_input in the neural network.
[0189] The primary CPU 1402 is firstly configured to prepare data 1402a for the training process 1200 by setting the parameters of the training algorithm such as the size of each population N_pop, the number of generations n, and the operations to be used in forming each new generation as discussed above. These parameters may be read from a training configuration file. Each of object detection, image segmentation and classification has a different training configuration file. Next, the primary CPU 1402 is configured to generate the initial population of chromosomes 1402b. Initially, each chromosome is a set of randomly generated weights for the N_weights. Following the generation of the initial population, the primary CPU 1402 is configured to broadcast 1402c the initial population of chromosomes to the cluster of secondary CPUs 1404a to 1404n. The primary CPU 1402 is thus communicatively coupled to the cluster of secondary CPUs 1404a to 1404n. Each of the secondary CPUs may be on the same server as each other and as the primary CPU 1402, or may be located across multiple servers. Preferably, the number of secondary CPUs 1404a to 1404n is equal to the number of chromosomes in the population, N_pop, so that each secondary CPU 1404a to 1404n can concurrently evaluate a chromosome from the initial population. The number of secondary CPUs 1404a to 1404n can however be less than N_pop. In this case, some or each of the secondary CPUs 1404a to 1404n may be required to evaluate more than one chromosome from the population. Evaluation of each of the chromosomes is thus done concurrently or partially concurrently, in parallel by each of the secondary CPUs 1404a to 1404n. The evaluation by each secondary CPU 1404a to 1404n returns a fitness score for each chromosome. Each fitness score or scores from each secondary CPU 1404a to 1404n are then sent back to the primary CPU 1402 where they are received 1402d. An array of fitness scores may thus be formed from the fitness scores received at the primary CPU 1402. The chromosomes and their corresponding fitness scores are then run through the genetic algorithm 1402e, meaning step 1206 of the method 1200 is performed as discussed above.
[0190] It is to be understood that the determination of the final weights to be used in the perception software stack 108 may be done according to factors other than the fitness scores and ranking of chromosomes. For example, a particular chromosome may classify specific objects, such as bicycles, very effectively but other objects, such as cars, less effectively. The weights from this chromosome may still be selected as the final weights if, for instance, multiple k networks are being used, whereby a network that effectively identifies bicycles is useful. In other words, the final weights may be determined based on the intended function of the neural network. Furthermore, more than one set of weights from more than one chromosome may be selected, so that more than one network can be selected using the same training process 1200.
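To illustrate, the short sketch below selects final weights by a per-class criterion rather than by overall fitness alone; the scores and class names are invented for illustration.

```python
# Illustrative only: each entry pairs a chromosome's overall fitness with
# invented per-class scores. A bicycle-specialist chromosome may be kept
# even though its overall fitness is lower.
results = [
    {"overall": 0.81, "bicycle": 0.62, "car": 0.90},  # chromosome 0
    {"overall": 0.77, "bicycle": 0.93, "car": 0.70},  # chromosome 1
]

best_overall = max(range(len(results)), key=lambda i: results[i]["overall"])
best_bicycle = max(range(len(results)), key=lambda i: results[i]["bicycle"])

# Where multiple networks are used, both sets of weights may be retained.
selected_chromosomes = {best_overall, best_bicycle}
```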
[0191] The examples illustrated in the accompanying figures are not intended to be limiting.
[0192] It is to be understood that the training process and the training system may be implemented in any computer system, including a distributed computing system such as a cloud or server based computer system. The primary module discussed above is configured to perform all the steps of the training process apart from evaluation of the sets of weights or chromosomes. The evaluation is performed by the secondary module which has parallel computing capabilities. In a distributed system, the secondary module may communicate with the primary module via a server and/or over the internet.
[0193] The hardware aspects of the computer device 104 according to the invention will now be discussed with reference to the accompanying figures.
[0194] An example of the computer device 104 is described in detail below in the form of an apparatus 1500, the components of which include a memory chip 1514 and an input/output 1516.
[0195] It is to be understood that the apparatus 1500 is configured to perform the method 300 for one or more of the specific tasks of image classification, object detection and image segmentation. Some of the components of the apparatus 1500 may be removed or substituted with similar components, as will be understood by the skilled person. In use, the sensor 106 provides input sensor data to the apparatus 1500 via the input/output 1516. The sensor data is then manipulated according to the aforementioned methods. The apparatus 1500 may communicate with the sensor using any suitable communication means, such as Ethernet, Universal Serial Bus (USB), serial, Bluetooth, wireless networking (Wi-Fi) and the like. The apparatus 1500 may be a SoC and may take the form of a computer, smartphone, tablet or the like. A SoC has the advantage that task-specific computer programs can be written specifically for the SoC, reducing loading time and improving execution speed. The apparatus 1500 may, however, be a traditional computer installed on a single motherboard.
[0196] In an example, the computer device 104 is modular, meaning each computer device 104 is responsible for performing one of several specific tasks. There may then be a module for each of image classification, object detection and image segmentation, formed of individual computer devices 104. Each of these devices may communicate with the others via wired or wireless connection methods and may also connect to the same or different sensors 106.
[0197] Each of the one or more computer devices 104 may connect to a network that is external to the vehicle. This allows the software stored thereon to be updated. For example, the weights stored in the memory 112 may be updated via communication with the external network. However, each of the computer devices 104 is configured to function, or be capable of functioning, without communication with an external network. The low resolution of the neural network allows the specific tasks to be performed on the computer device 104 without external computing aid.
[0198] Once the apparatus 1500 or computer device 104 has run the method 300 to obtain an outcome for a specific task, it is configured to send information relating to the outcome to a controller computer. The controller computer can be any computer which requires the outcome, such as a vehicle's engine control unit or another SoC. The apparatus 1500 may also store historic outcomes from specific tasks on its own memory chip 1514.
[0199] In an example arrangement, a vehicle 1600 is fitted with a camera/sensor 1602 which provides data to one or more apparatuses 1604, the or each apparatus 1604 being connected to a controller computer 1606.
[0200] It is to be understood that any conventional computing components may be used to implement the computer device 104 and the components shown in the accompanying figures.
[0201] The SoC also has the advantage of not needing to connect to an external network. Each of the specific tasks of image classification, object detection and image segmentation can be performed by the SoC locally in the vehicle. Furthermore, due to the low-resolution aspects of the neural networks, multiple networks can be stored in the memory of the computer device 104 or SoC. This means that swarm optimisation or other collective behaviour algorithms or techniques can be applied to input data from a camera or sensor locally at the computer device 104 or SoC, without having to communicate with an external network. This saves valuable time, which can improve the responsiveness and thus the safety of a vehicle, such as an autonomous vehicle, which includes the computer device 104 or SoC.
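As one possible illustration of pooling several locally stored networks, the sketch below uses simple majority voting; this stands in for the swarm optimisation or collective behaviour techniques mentioned above, and the labels are invented.

```python
# Illustrative only: combine the classifications of several low-resolution
# networks stored on the same device by majority vote, with no external
# network connection required.
from collections import Counter

network_outputs = ["car", "car", "bicycle"]  # one label per stored network
label, votes = Counter(network_outputs).most_common(1)[0]
print(label, votes)  # -> car 2
```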
[0202] Where multiple SoCs are used, each for a different specific task, the local nature of the calculations and functioning of the SoCs allows them to easily communicate and pool their outputs together. For example, where a first SoC is configured to perform the function of object detection, and a second SoC is configured to perform the function of road segmentation on an input image, the SoCs may communicate with each other to determine the available free space on the road (the segmented road minus any objects detected on the road). Alternatively, this function may be performed by the controller computer 1606 when it is connected to multiple apparatuses 1604.
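The free-space computation described above can be sketched as a simple mask subtraction. The three-by-three masks below are invented for illustration; a real system would combine full-resolution segmentation and detection outputs.

```python
# Illustrative only: available free space is the segmented road with any
# detected objects removed.
import numpy as np

road_mask = np.array([[1, 1, 1],
                      [1, 1, 1],
                      [0, 1, 0]], dtype=bool)    # from the road-segmentation SoC
object_mask = np.array([[0, 1, 0],
                        [0, 0, 0],
                        [0, 0, 0]], dtype=bool)  # from the object-detection SoC

freespace = road_mask & ~object_mask             # road minus objects
```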
[0203] It is not necessary that the camera/sensor 106 and 1602 be fitted to the vehicle 1600 or computer device 104. Instead, the camera/sensor 106 and 1602 may be physically separate from the computer device 104 and vehicle 1600, such that the camera/sensor 106 and 1602 is fitted to a structure at an external location rather than to the vehicle. The external location may be, for example, at a junction on a road, on a traffic sign or on a lamppost. In this example, the camera/sensor 106 and 1602 communicates with the computer device 104 or vehicle 1600, and thus the apparatus 1604, via a network. The camera/sensor 106 and 1602 and the computer device 104 or apparatus 1604 comprise, or are locally connected to, network connection hardware configured to connect to a network. The network connection hardware may include any one or more of a Wi-Fi module, a cellular module, a mobile-network transmitter and receiver, an antenna, a Bluetooth module and the like. In this example, the output of the neural network in performing a specific task may be shared from the computer device 104 or apparatus 1604 to other computers or vehicles directly, or sent to a central computer on a server or network for distribution to other vehicles.
[0204] Although the description above relates to the specific example of a vehicle, and in particular an autonomous vehicle, it is noted that the vehicle 1600 including the computer device 104 can alternatively be any machine where visual sensory data is gathered, manipulated or used to perform an action. As such, the machine including the computer device 104 may be a robot, a CCTV system, or a smart device for a smart home, such as a smart speaker, a smartphone or a smart appliance.
[0205] Similarly, although the description above relates to performing specific tasks related to vehicles, such as object detection, road segmentation and image classification, it is to be understood that the principles of using active vision in a LRRAVNN according to the method 300 can be applied to any computer vision task. As such, other tasks may be performed by the computer device 104 using the method 300. For example, a CCTV system using the method 300 may perform facial recognition, whilst a robot using the method 300 in a manufacturing environment may perform object classification and quality checking. Further tasks related to autonomous driving may also be performed, such as traffic sign recognition, road-marking recognition and pot-hole detection.