Method of image processing using a neural network

11302009 · 2022-04-12

Abstract

A method of generating landmark locations for an image crop comprises: processing the crop through an encoder-decoder to provide a plurality of N output maps of comparable spatial resolution to the crop, each output map corresponding to a respective landmark of an object appearing in the image crop; and processing an output map from the encoder through a plurality of feed forward layers to provide a feature vector comprising N elements, each element including an (x,y) location for a respective landmark. Any landmark locations from the feature vector having an x or a y location outside a range for a respective row or column of the crop are selected for a final set of landmark locations, with remaining landmark locations tending to be selected from the N (x,y) landmark locations from the plurality of N output maps.

Claims

1. A method comprising: identifying an object within an image; generating a crop comprising at least a portion of said object; processing said crop by one or more convolutional layers to provide an output map of lower spatial resolution than said crop; processing said output map by one or more de-convolutional layers to provide N output maps of comparable spatial resolution to said crop, each N output map of said N output maps corresponding to a respective landmark of said object; obtaining N landmark locations from said N output maps output by the one or more de-convolutional layers; processing said output map by one or more layers different from the one or more de-convolutional layers to provide a feature vector comprising (x,y) locations for multiple landmarks; selecting a first set of landmark locations from the multiple landmarks of said feature vector, at least some of the first set of landmark locations being outside a boundary of said crop; and selecting a second set of landmark locations from said N landmark locations associated with the N output maps.

2. The method according to claim 1, wherein the second set of landmark locations represents locations comprising distortion relative to the first set of landmark locations.

3. The method according to claim 1, wherein the first set of landmark locations selected from said feature vector do not comprise distortion relative to said crop.

4. The method according to claim 1, further comprising processing said output map by one or more feed forward layers to provide a classification of at least one of: pitch, yaw, or roll of said object within said crop.

5. The method according to claim 1, wherein said one or more convolutional layers and said one or more de-convolutional layers are associated with a single stage encoder-decoder.

6. The method according to claim 1, wherein said output map from a first convolutional layer is aggregated with an output map of said N output maps from a first de-convolutional layer to provide an input map for a second convolutional layer.

7. The method according to claim 1, wherein said object comprises a face.

8. The method according to claim 1, wherein said crop comprises a range of 64×64 pixels.

9. The method according to claim 1, wherein processing said crop or processing said output map comprises execution by at least one of: a general-purpose processor; a multi-core processor; a dedicated neural network processing engine; or a multi-core neural network processing engine.

10. The method of claim 1, wherein the output map comprises a plurality of channels.

11. The method of claim 1, further comprising feeding said N output maps and said feature vector through a set of neural network layers to provide said first set of landmark locations or said second set of landmark locations.

12. A system comprising: one or more processors; and one or more non-transitory computer-readable media storing instructions executable by the one or more processors, wherein the instructions, when executed, cause the system to perform operations comprising: identifying an object within an image; generating a crop comprising at least a portion of said object; processing said crop by one or more convolutional layers to provide an output map of lower spatial resolution than said crop; processing said output map by one or more de-convolutional layers to provide N output maps having substantially similar resolution to said crop, at least one N output map of said N output maps corresponding to a landmark of said object; obtaining N landmark locations from said N output maps output by the one or more de-convolutional layers; processing said output map by one or more layers different from the one or more de-convolutional layers to provide a feature vector comprising (x, y) locations for multiple landmarks; selecting a first set of landmark locations from the multiple landmarks of said feature vector, at least some of the first set of landmark locations being outside a boundary of said crop; and selecting a second set of the landmark locations from said N landmark locations associated with the N output maps.

13. The system of claim 12, the operations further comprising: training a neural network based at least in part on the first set of landmark locations and the second set of landmark locations.

14. The system of claim 12, the operations further comprising: generating, by a neural network, an additional crop associated with an additional image based at least in part on the first set of landmark locations and the second set of landmark locations.

15. The system of claim 12, wherein the second set of landmark locations represents locations comprising distortion relative to the first set of landmark locations.

16. The system of claim 12, wherein the first set of landmark locations selected from said feature vector do not comprise distortion relative to said crop.

17. The system of claim 12, wherein said object comprises a face.

18. The system of claim 12, wherein said one or more convolutional layers and said one or more de-convolutional layers are associated with a single stage encoder-decoder.

19. The method of claim 1, further comprising: training a neural network based at least in part on the first set of landmark locations and the second set of landmark locations.

20. The method of claim 1, further comprising: generating, by a neural network, an additional crop associated with an additional image based at least in part on the first set of landmark locations and the second set of landmark locations.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) An embodiment of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:

(2) FIG. 1 shows a conventional arrangement for locating landmarks within an image;

(3) FIG. 2 shows a conventional convolutional neural network (CNN) architecture;

(4) FIG. 3 shows some exemplary images processed according to an embodiment of the present invention;

(5) FIG. 4 shows an architecture for processing an image to provide a set of face landmarks according to an embodiment of the present invention; and

(6) FIG. 5 shows a system for executing the architecture of FIG. 4.

DESCRIPTION OF THE EMBODIMENT

(7) Referring now to FIG. 3, there is shown a pair of duplicated images 10A (top), 10B (bottom) of the type which are to be processed according to an embodiment of the present invention. In the present example, the images include a face; however, it will be appreciated that in variants of the embodiment, images including objects other than faces can be processed.

(8) A face detector 16 such as referred to in relation to FIG. 1 provides respective face crops 18A, 18B within images 10A, 10B. Face detection within acquired images has been well known since at least US 2002/0102024, Viola-Jones, with many optimisations and improvements made in such systems since then. Thus, the face detector 16 can be a dedicated hardware module such as the engine disclosed in PCT Application WO 2017/108222 (Ref: FN-470-PCT), the disclosure of which is incorporated by reference, or the face detector can be implemented in general purpose software executing on a system CPU, or indeed the face detector 16 could be implemented using one or more convolutional neural networks (CNN) and executed on a dedicated CNN engine such as described in PCT Application WO 2017/129325 (Ref: FN-481-PCT), and U.S. Application No. 62/592,665 (Ref: FN-618-US), the disclosures of which are incorporated herein by reference. Indeed, U.S. Application No. 62/592,665 (Ref: FN-618-US) discloses a system including multiple neural network processing cores which can be configured to process multiple neural networks performing different tasks on the same or different images or image portions in parallel. The face detector 16 may also provide a pose for a detected object, for example, front facing, left profile, right profile, etc., for a face. The region(s) 18 identified by the detector 16 need not necessarily be rectangular, but if not, as described in PCT Application WO 2017/032468 (Ref: FN-469-PCT), any identified region may be scaled and rotated to provide a normalised image crop 18 comprising a rectangular array of known dimensions, for example, 64×64 pixels.

(9) The face crops 18A, 18B will be provided to a neural network for processing according to an embodiment of the present invention, described in more detail below. Such networks are typically designed to operate on fixed-size input images, so any image crop needs to be sized to match the required input image size for the network. Input images are preferably kept as small as possible to maintain processing speed. However, if an image crop has to be down-sampled more than necessary, the precision provided for landmark locations by the neural network will be limited. For this reason, the face crops tend to be framed as tightly as possible around a face to minimize any required down-sampling of the crop.
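
The trade-off between crop framing and precision can be illustrated with a little arithmetic. The following is a minimal sketch, assuming a square detected region that is scaled uniformly to a square network input; the function name `landmark_precision` and the example region sizes are illustrative and are not taken from the embodiment.

```python
def landmark_precision(region_size_px: int, input_size_px: int = 64) -> float:
    """Return how many original-image pixels one network-input pixel spans.

    Assumes a square region scaled uniformly to a square network input.
    A value of 1.0 means no down-sampling; larger values mean a landmark
    located to the nearest input pixel is known only to within roughly
    that many pixels of the original image.
    """
    if region_size_px <= 0 or input_size_px <= 0:
        raise ValueError("sizes must be positive")
    return max(1.0, region_size_px / input_size_px)


# A tightly framed 128-pixel face region versus a loosely framed 256-pixel one
# (both sizes are hypothetical).
tight = landmark_precision(128)
loose = landmark_precision(256)
```

Under these assumptions, doubling the side of the detected region doubles the number of original-image pixels represented by each input pixel, which is why tighter framing preserves precision.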

(10) In the present example, the image crop provided to the neural network comprises 64×64 pixels and so when a face fitting within such a square format is detected, maximal precision can be gained. However, if a face changes proportion, such as when a subject yawns, as in image 10B, the face detector 16 may provide a crop which does not incorporate some landmarks such as those on the subject's chin.

(11) Referring now to FIG. 4, there is shown a neural network architecture 200 according to an embodiment of the present invention for landmark detection within an image.

(12) The network 200 comprises a first set of layers 210 providing a single-stage encoder-decoder producing a respective heatmap 220 for each of N landmarks in an input image 10B. The encoder-decoder can be of a conventional design such as referred to above. Each landmark is extracted from its heatmap to provide a set of N (x,y) landmark locations 230 for further processing. As discussed, the x,y values for landmark locations 230 are limited to the range of the input map, in this case, 0 . . . 63.
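
The text does not specify how a landmark is extracted from its heatmap. One common choice, assumed here purely for illustration, is to take the position of the strongest response. The sketch below uses plain nested lists for a heatmap, and the names `heatmap_to_location` and `extract_landmarks` are hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows of equal length, indexed [y][x]


def heatmap_to_location(heatmap: Heatmap) -> Tuple[int, int]:
    """Return the (x, y) position of the strongest response in a heatmap."""
    best_x, best_y, best_val = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_x, best_y, best_val = x, y, val
    return best_x, best_y


def extract_landmarks(heatmaps: List[Heatmap]) -> List[Tuple[int, int]]:
    """Return one (x, y) location per heatmap, i.e. per landmark."""
    return [heatmap_to_location(h) for h in heatmaps]


# A 64x64 heatmap whose peak sits on the bottom row of the crop.
size = 64
example = [[0.0] * size for _ in range(size)]
example[63][10] = 1.0
locations = extract_landmarks([example])
```

Whatever the extraction method, a location read from a 64×64 map in this way can never fall outside 0 . . . 63, which is the limitation noted above.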

(13) In the encoder-decoder, a first set of encoding layers E.sub.1 . . . E.sub.3 comprising convolutional and pooling layers produce respective output maps M.sub.1 . . . M.sub.3 with successively decreasing spatial resolution and increasing depth, whereas a subsequent set of decoding layers D.sub.1 . . . D.sub.3 comprising de-convolution and un-pooling layers produce respective output maps M.sub.4 . . . M.sub.6 with successively increasing spatial resolution so that the last output map M.sub.6 has a resolution corresponding with the input image crop 18B. Note that while referred to as output maps, each of output maps M.sub.1 . . . M.sub.6 may have multiple channels. As mentioned, output map M.sub.6 comprises a channel (or map) for each landmark of the set of landmarks to be located. In such encoder-decoders, it is known to provide forward-skip connections F.sub.1 . . . F.sub.3 between encoder and decoder layers to aggregate output maps of the encoder with the same-resolution counterpart inputs of the decoder layers, typically through concatenation, to improve the ability of the network to maintain the context and resolution of features extracted by the encoder layers within the subsequent decoder layers. As will be appreciated, such encoder-decoders may also comprise activation functions and batch normalisation layers; however, these are not discussed in detail here.
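
The progression of spatial resolutions through the encoder-decoder can be traced with a short sketch. The text does not give the scaling applied by each layer, so a factor of 2 per encoding or decoding layer is assumed here only for illustration; the function name `trace_resolutions` is likewise hypothetical.

```python
from typing import List, Tuple


def trace_resolutions(
    input_size: int = 64, num_stages: int = 3, factor: int = 2
) -> List[Tuple[str, int]]:
    """List the side length of each output map M1..M(2*num_stages).

    Assumes every encoding layer divides the resolution by `factor` and
    every decoding layer multiplies it by `factor` (an assumption, not a
    detail stated in the text).
    """
    maps = []
    size = input_size
    # Encoding layers: successively decreasing spatial resolution.
    for i in range(1, num_stages + 1):
        size //= factor
        maps.append((f"M{i}", size))
    # Decoding layers: successively increasing spatial resolution.
    for i in range(num_stages + 1, 2 * num_stages + 1):
        size *= factor
        maps.append((f"M{i}", size))
    return maps


resolutions = trace_resolutions()
```

With these assumed factors, the last map returns to the 64×64 resolution of the input crop, while the lowest-resolution map is the third one.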

(14) Note that in variants of the illustrated example, fewer or more encoder/decoder layers can be used and it is also possible to employ a multi-stage encoder where a decoder output from one stage is provided as input to an encoder of a subsequent stage.

(15) In any case, as explained, in the embodiment, it is desirable for the input to comprise a small crop so that network processing speed can be maintained and for this reason framing of the object, in this case a face, should be as tight as possible to maintain precision.

(16) Referring back to FIG. 3, the bottom right image shows the landmarks 230 detected for image crop 18B using the encoder-decoder 210. As will be seen, the accuracy of detection for points on the mouth and eyes is quite good. However, for landmarks such as those indicated at 26B″ which were in fact located outside the image crop 18B, the results provided by the encoder-decoder significantly misrepresent the contour of the face. On the other hand, the top right image shows the landmarks detected for image 10A, where all of the landmarks are in fact inside the image crop, using the encoder-decoder 210 and these are in general quite accurate.

(17) Turning back to FIG. 4, a second branch of the network 200 comprises a number of fully connected (FC) layers 240 using as their input an output map M.sub.3 (potentially comprising multiple channels) produced by the encoder layers E.sub.1 . . . E.sub.3 of the encoder-decoder 210. In the embodiment, the lowest resolution output map M.sub.3 is chosen as the input for the FC layers 240; however, it will be appreciated that in variants of this embodiment, other output maps could be used. It will also be appreciated that if a multi-stage encoder-decoder were used, then the input map could be taken from any of the stages.

(18) The FC layers 240 produce an output feature vector 250 where each of the N elements of the vector comprises an (x,y) location for a respective landmark. Note that as discussed, the x,y values for landmark locations 250 are not limited to the range of the input map, in this case, 0 . . . 63.
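
Because a fully connected layer computes an unconstrained weighted sum, nothing ties its outputs to the crop dimensions. The following minimal sketch shows a single dense layer applied to a flattened feature map. The interleaved (x, y) layout of the output vector, the function names, and the tiny weights are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def dense(
    inputs: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """One fully connected layer without activation.

    out[j] = biases[j] + sum_i weights[j][i] * inputs[i]
    """
    return [
        b + sum(w * v for w, v in zip(row, inputs))
        for row, b in zip(weights, biases)
    ]


def regress_landmarks(
    flat_features: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[Tuple[float, float]]:
    """Map flattened features to (x, y) pairs, assuming x0, y0, x1, y1, ... order."""
    out = dense(flat_features, weights, biases)
    if len(out) % 2 != 0:
        raise ValueError("expected an even number of outputs")
    return [(out[i], out[i + 1]) for i in range(0, len(out), 2)]


# Hypothetical two-element feature vector and weights for a single landmark.
features = [1.0, 2.0]
weights = [[10.0, 0.0], [0.0, 35.0]]
biases = [0.0, 0.0]
predicted = regress_landmarks(features, weights, biases)
```

In this toy case the predicted y value is 70, beyond the last row (63) of a 64×64 crop, which a heatmap of the same size could not express.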

(19) It will be appreciated that the additional processing required for the FC layers 240, by comparison to using an encoder-decoder 210 alone, is minimal, and so the resource overhead required to implement the network 200 is not significant.

(20) It will also be seen that an existing pre-trained encoder-decoder 210 could be employed with the weights for this branch of the network locked when training the additional FC layers 240. Alternatively, if starting with an existing regression network comprising encoding layers E.sub.1 . . . E.sub.3 and FC layers 240, their weights could be locked when training the decoder layers D.sub.1 . . . D.sub.3. Alternatively, the entire network could be trained end-to-end.

(21) In any case, referring back to FIG. 3, the bottom left image shows the landmark locations 250 produced by the FC layers 240 for image crop 18B. Here, it will be seen that the predictions for landmarks 26B located outside the image crop 18B are quite good, whereas the accuracy for landmarks located around the lower mouth and some eye features in particular is not as good as for the set of landmark locations 230 produced by the encoder-decoder 210 for the image 10B. It can also be seen from the top left image that the accuracy for landmarks located around the subject's left eye in image 10A produced by the FC layers 240 is not as good as for the corresponding landmark locations produced by the encoder-decoder 210 for image 10A.

(22) In embodiments of the present invention, the sets of landmark locations 230 and 250 produced by the encoder-decoder 210 and FC layers 240 respectively from a given image crop 18 are combined to provide a final set of landmark locations 260 for the object.

(23) In one embodiment, where the landmark location for a landmark produced by the FC layers 240 includes an x or a y value outside the range of the image crop, this location is chosen for the final set of landmark locations 260.

(24) In some embodiments, all of the remaining landmark locations can be chosen from the landmark locations 230 generated by the encoder-decoder.

(25) However, in some embodiments, it may be recognised that the FC layers 240 produce more accurate results for some landmarks that appear within the image crop 18. These typically tend to be landmarks which are less prone to distortion, for example, face contour landmarks 47-61 from FIG. 1.
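
The selection described in paragraphs (23) to (25) can be sketched as follows. This is one possible reading under stated assumptions: the function and parameter names are hypothetical, landmarks are indexed from 0 in list order, and the optional `prefer_fc` set is an illustrative way of expressing that certain in-crop landmarks may be taken from the FC layers.

```python
from typing import AbstractSet, List, Sequence, Tuple

Point = Tuple[float, float]


def outside_crop(p: Point, crop_size: int = 64) -> bool:
    """True if x or y lies outside the range 0 .. crop_size - 1."""
    x, y = p
    last = crop_size - 1
    return not (0 <= x <= last and 0 <= y <= last)


def select_landmarks(
    heatmap_locs: Sequence[Point],
    fc_locs: Sequence[Point],
    crop_size: int = 64,
    prefer_fc: AbstractSet[int] = frozenset(),
) -> List[Point]:
    """Combine the two sets of landmark locations into a final set.

    An FC location with x or y outside the crop is always chosen.
    Landmarks whose index is in `prefer_fc` (an assumed, caller-supplied
    set) are also taken from the FC branch. All others come from the
    heatmap branch.
    """
    if len(heatmap_locs) != len(fc_locs):
        raise ValueError("both branches must provide the same N landmarks")
    final = []
    for i, (hm, fc) in enumerate(zip(heatmap_locs, fc_locs)):
        if outside_crop(fc, crop_size) or i in prefer_fc:
            final.append(fc)
        else:
            final.append(hm)
    return final


# Hypothetical outputs for three landmarks; the second FC location is below the crop.
heatmap_locs = [(30, 40), (20, 63), (5, 5)]
fc_locs = [(31.5, 41.0), (19.0, 70.2), (6.0, 4.0)]
final_default = select_landmarks(heatmap_locs, fc_locs)
final_with_preference = select_landmarks(heatmap_locs, fc_locs, prefer_fc={2})
```

In the first call only the out-of-crop landmark is taken from the FC branch; in the second, the third landmark is also taken from it because its index was listed as preferred.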

(26) While in the above embodiment, choosing landmark locations from either the landmark locations 230 or 250 is performed algorithmically, it will be appreciated that the output maps 220 or just the landmark locations 230 and the feature vector comprising the landmark locations 250 could also be provided to further neural network layers (not shown) for fusing these locations into the final set of landmark locations 260, and where the network layers would be trained so as to favour landmark locations produced by the FC layers 240 with an x or a y value outside the range of the image crop 18 and to favour landmark locations produced by the encoder-decoder 210 otherwise. This could be particularly useful for locations around an image crop boundary or where the accuracy of each approach varies for landmark locations within the image crop 18, so that the landmark locations generated in the final set of landmark locations 260 could be a fusion of the information from landmark locations 230 and 250.
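
To make the idea of fusing the two sets of locations concrete, the sketch below uses a fixed, hand-written blend. It is not the trained fusion layers described in this paragraph, only a stand-in showing the intended behaviour: the FC location dominates at or beyond the crop boundary and the heatmap location dominates well inside it. The `margin` parameter, the linear weighting, and the function names are all assumptions.

```python
from typing import Tuple

Point = Tuple[float, float]


def fc_weight(fc: Point, crop_size: int = 64, margin: float = 8.0) -> float:
    """Weight in [0, 1] given to the FC location.

    1.0 when the FC location is on or outside the crop boundary, falling
    linearly to 0.0 once it is at least `margin` pixels inside the crop.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    x, y = fc
    last = crop_size - 1
    # Signed distance to the nearest crop edge; negative when outside.
    d = min(x, y, last - x, last - y)
    return max(0.0, min(1.0, 1.0 - d / margin))


def fuse(hm: Point, fc: Point, crop_size: int = 64, margin: float = 8.0) -> Point:
    """Blend a heatmap location and an FC location for one landmark."""
    w = fc_weight(fc, crop_size, margin)
    return (w * fc[0] + (1.0 - w) * hm[0], w * fc[1] + (1.0 - w) * hm[1])


inside = fuse((30, 31), (32.0, 32.0))       # well inside: heatmap location wins
near_edge = fuse((12, 63), (10.0, 61.0))    # near the boundary: a mixture
outside = fuse((12, 63), (10.0, 70.0))      # beyond the boundary: FC location wins
```

A trained fusion network could learn a per-landmark weighting of this kind rather than relying on a single hand-chosen margin.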

(27) It will also be appreciated that because the additional cost of adding FC layers to an encoder-decoder network is relatively low, the network architecture 200 can be extended to perform other tasks. So, for example, as shown in FIG. 4, a further set of FC layers 270 can use an output layer from the encoding layers E.sub.1 . . . E.sub.3 to generate indicators of pitch, yaw and roll for the object within the image crop. This meta-information can be useful for subsequent processing.

(28) While the embodiment above has been described in terms of providing fully connected layers 240, 270, it will be appreciated that these layers need not exclusively comprise fully connected layers and may for example include some convolutional or other layers forming what may be more generally described as a feed forward network.

(29) It will be appreciated that the neural network architecture 200 of FIG. 4 can be executed on any suitable processor. So referring to FIG. 5, an image acquired by a camera image sensor and typically pre-processed by an image processing pipeline (not shown) is written to main memory 40 across a system bus 42. The weights and network configuration for the network 200 may also be stored in main memory 40. Thus, the network 200 can be executed by general purpose software executing on a system CPU 50, or indeed the network could be implemented using one or more convolutional neural networks (CNN) and executed on a dedicated CNN engine 30 such as described in PCT Application WO 2017/129325 (Ref: FN-481-PCT), and U.S. Application No. 62/592,665 (Ref: FN-618-US), the disclosures of which are incorporated herein by reference. Indeed, PCT Application WO2019/042703 (Ref: FN-618-PCT) discloses a system including multiple neural network processing cores which can be configured to process multiple neural networks performing different tasks on the same or different images or image portions in parallel. In these cases, the weights and network configuration can be pre-loaded within the CNN engine 30. In any case, with a multi-processor core, the decoding layers D.sub.1 . . . D.sub.3 could be executed at the same time as the FC layers 240 (and possibly FC layers 270) to provide results with minimal latency.
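
The concurrent execution of the decoder and FC branches can be illustrated with a small dispatch sketch. It uses Python's standard `concurrent.futures` thread pool purely to show the two branches being launched from the same encoder output; it does not model a multi-core CNN engine, and the branch functions passed in are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple, TypeVar

M = TypeVar("M")  # encoder output map
A = TypeVar("A")  # decoder branch result
B = TypeVar("B")  # FC branch result


def run_branches_in_parallel(
    encoder_output: M,
    decoder_branch: Callable[[M], A],
    fc_branch: Callable[[M], B],
) -> Tuple[A, B]:
    """Submit both branches at once and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        decoder_future = pool.submit(decoder_branch, encoder_output)
        fc_future = pool.submit(fc_branch, encoder_output)
        return decoder_future.result(), fc_future.result()


# Stand-in branches operating on a toy "output map".
toy_map = [1, 2, 3]
decoded, regressed = run_branches_in_parallel(
    toy_map,
    decoder_branch=lambda m: [v * 2 for v in m],
    fc_branch=lambda m: sum(m),
)
```

Both branches depend only on the encoder output, so neither has to wait for the other; the combined latency is then governed by the slower of the two rather than by their sum.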