System for processing an image, method for processing the image, method for training a neural network for processing the image, and recording medium for executing the method

Abstract

The disclosure relates to system for processing an image of at least one camera. The camera has predetermined camera parameters including a lens distortion and a camera pose with respect to a predefined reference frame. The system comprises: a trained neural network with a predefined architecture, the neural network being configured to receive the image of the camera as input and to predict in response at least one characteristic, wherein the neural network architecture comprises at least one static feature map configured to encode the predetermined camera parameters including the lens distortion and/or the camera pose.

Claims

1. A system for processing an image of at least one camera, the camera having predetermined camera parameters including a lens distortion and a camera pose with respect to a predefined reference frame, the system comprising: a trained neural network with a predefined architecture, the neural network being configured to receive the image of the camera as input and to predict in response at least one characteristic, wherein the architecture of the neural network comprises at least one static feature map configured to encode the predetermined camera parameters including the lens distortion and/or the camera pose; wherein the camera pose is defined by camera rotation and/or camera translation with respect to the reference frame, and/or the predetermined camera parameters comprise optical camera parameters, and/or the camera comprises a lens which defines the lens distortion.

2. The system according to claim 1, wherein the architecture of the neural network comprises at least one first static feature map configured to: encode a predetermined optical camera parameter and the lens distortion for considering a viewing angle in 3D space for each image pixel when predicting the at least one characteristic, and/or encode the camera rotation for considering the camera rotation with respect to the reference frame when predicting the at least one characteristic.

3. The system according to claim 2, wherein the predetermined optical camera parameter comprises at least one of a camera resolution and a focal length of the camera.

4. The system according to claim 2, wherein the first static feature map comprises for each image pixel or for a group of neighboring image pixels a 3D normal vector representing the viewing angle and/or the camera rotation.

5. The system according to claim 1, wherein the architecture of the neural network further comprises at least one second static feature map configured to encode the camera translation for considering the camera translation with respect to the reference frame when predicting the at least one characteristic.

6. The system according to claim 1, wherein the architecture of the neural network further comprises a third feature map configured to encode depth information for each pixel.

7. The system according to claim 1, wherein the reference frame is defined as an external reference frame external to the system and/or the camera, the external reference frame being in particular in a pre-defined position and orientation with regard to the system and/or the camera, or the reference frame is defined based on a pose of another camera of the system.

8. The system according to claim 1, wherein the at least one static feature map is predefined and/or configured to remain unchanged during neural network training.

9. The system according to claim 1, wherein the neural network comprises a predefined number of layers, each layer comprising at least one channel, wherein the at least one static feature map is added in addition to a predefined channel in at least one layer or replacing the predefined channel.

10. The system according to claim 9, further comprising one or a plurality of digital cameras, and/or a data storage to store the trained neural network, and/or a processor to process the image using the neural network.

11. A computer implemented method for training a neural network for processing an image of a camera, the method comprising steps of: providing the neural network with a predefined architecture, the neural network being configured to receive the image of the camera as input and to predict in response at least one characteristic, providing a training set of training images of one or a plurality of cameras for training the neural network, providing at least one static feature map for the camera for training the neural network or, in case of the plurality of cameras, at least one static feature map for each camera, respectively, training the neural network based on the training images by using for each training image the static feature map; wherein the at least one static feature map is configured to encode predetermined camera parameters including a lens distortion and/or a camera pose with respect to a pre-defined reference frame of the respective camera.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, and serve to explain the principles thereof.

(2) FIG. 1 shows a block diagram of a system according to embodiments of the present disclosure;

(3) FIG. 2 shows a schematic overview of the information that can be encoded into the static feature map(s) according to embodiments of the present disclosure;

(4) FIG. 3 shows an exemplary camera reference frame in a 3D coordination system according to embodiments of the present disclosure;

(5) FIG. 4 shows an exemplary pinhole camera model showing the relation between world coordinates and camera coordinates according to embodiments of the present disclosure, and

(6) FIG. 5 shows an example of neural network layers with static feature maps according to embodiments of the present disclosure.

DESCRIPTION OF THE EMBODIMENTS

(7) Reference will now be made in detail to exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.

(8) FIG. 1 shows a block diagram of a system 10 according to embodiments of the present disclosure. The system is configured to carry out the method according the present disclosure. In particular, the system may be configured to carry out computer instructions, e.g. as given by a trained neural network.

(9) For example the system may comprise a processor 11 (e.g. at least one CPU and/or GPU) and a memory 13 for executing said instructions. The memory 13 may be a non-volatile memory and it may comprise said instructions (i.e. the trained neural network and/or a computer program), e.g. of the method according the present disclosure. Said method can be executed by the processor 11 for processing an image (e.g. for semantic segmentation or any other image processing task). In particular, the memory may store a trained artificial neural network (ANN), which can be executed by the processor 11 to perform the method described below.

(10) The system 10 may further comprise and/or be connected to an acquisition module 12 configured to acquire images (e.g. one or several cameras, in particular only monocular camera(s) for obtaining monocular surround view images of the environment of the system). For example, the system may comprise a plurality of cameras which together obtain a panoramic (e.g. 360?) image of the system environment, in particular without any depth information. Alternatively it may comprise only one rotating camera.

(11) The acquisition module 12 (i.e. the camera(s)) has predetermined camera parameters including a lens distortion and a specific camera pose with respect to the system and with respect to a predefined reference frame (e.g. given by the system or any other external object).

(12) The trained neural network, e.g. being a Convolutional Neural Network (CNN), has a predefined architecture and is configured to receive the image of the acquisition module as input and to predict in response at least one characteristic (e.g. semantic image segments).

(13) Furthermore the architecture of the neural network comprises at least one static feature map configured to encode the predetermined camera parameters including the lens distortion and/or the camera pose. Said static feature map(s) will be described in more detail in the following in context of FIGS. 2 to 4.

(14) The system may be part of a robotic system or a vehicle 30. In other words, the system, in particular its acquisition module 12 may be configured to autonomously move. In this scenario, when the system comprises a plurality of cameras, it is desired to calibrate their produced images to each other. In this scenario, the reference frame is desirably defined based on the pose of another camera of the system. For example, a single neural network can be trained for all camera viewpoints, by inserting their respective global pose parameters as additional channels.

(15) Alternatively, the reference frame may be defined as a reference frame external to the system and/or the camera. Said external reference frame may be in particular in a pre-defined position and orientation with regard to the system and/or the camera. For example, the camera may have a locally fixed position (e.g. on a tower or otherwise in a predetermined height with regard to the ground level) and reference frame may be defined by the ground plane. Accordingly, the static feature map (i.e. global pose feature maps) may also be used to encode the known camera position with respect to a relevant global reference frame (e.g. the ground plane).

(16) FIG. 2 shows a schematic overview of the information that can be encoded into the static feature map(s) according to embodiments of the present disclosure.

(17) As shown in FIG. 2, it is proposed to explicitly encoding spatial information into the neural network. This is done by adding static features maps encoding the camera intrinsic parameters and viewpoint. In this context static is used to indicate they are fixed parameters, not affected by (re)training the network. A general overview of the system is shown in FIG. 2. The intrinsic parameters are represented by maps containing each pixels' viewing angle with respect to the camera viewing direction.

(18) FIG. 3 shows an exemplary camera reference frame in a 3D coordination system according to embodiments of the present disclosure.

(19) Every camera reference frame XYZ can be transformed to a global reference XYZ using a rigid transform P.sub.i. Here, it is referred to the pose of camera i with respect to a chosen global reference frame as P.sub.i=[R.sub.i t.sub.i] with R.sub.i?SO(3) the rotation component and t.sub.i? custom character .sup.3 the translation component (meaning P.sub.i is a general rigid transformation). Such a pose/transform is illustrated in FIG. 2.

(20) Camera Model

(21) It is proposed to introduce a standard camera model (for clarity and to introduce symbols and notation). The parameters in this camera model will be used to generate the proposed fixed feature maps.

(22) $\begin{matrix} K = [\begin{matrix} f_{x} & s & c_{x} (1) \\ 0 & f_{y} & c_{y} (2) \\ 0 & 0 & 1 \end{matrix}] & (1) \end{matrix}$

(23) In this context FIG. 4 shows an exemplary pinhole camera model showing the relation between world coordinates and camera coordinates according to embodiments of the present disclosure.

(24) Calibration matrix K with focal lengths f.sub.x and f.sub.y, s the skew factor between the sensor axes, and (c.sub.x, c.sub.y) the camera optical center (in pixel coordinates), cf. also FIG. 4.

(25) $\begin{matrix} P = [\begin{matrix} K & 0 (4) \\ 0^{T} & 1 \end{matrix}] [\begin{matrix} R & t (5) \\ 0^{T} & 1 \end{matrix}] & (2) \end{matrix}$

(26) Camera matrix P containing both the intrinsic camera parameters from the calibration matrix K and the extrinsic parameters: the euclidean, rigid body transformation [R t].

(27) Camera matrix P allows to map 3D world coordinates p.sub.w=(x.sub.w,y.sub.w,z.sub.w,1) to image coordinates p.sub.i=(x.sub.i,y.sub.i,1, d).
p.sub.i?PP.sub.w(3)

(28) For every pixel in the image space (u,v) desirably also the lens distortion should be take into account:

(29) $\begin{matrix} {\begin{matrix} x^{} = \frac{u - c_{x}}{f_{x}} (8) \\ y^{} = \frac{v - c_{y}}{f_{y}} \end{matrix} & (4) \end{matrix}$

(30) $\begin{matrix} {\begin{matrix} x^{} = x^{} \frac{1 + k_{1} r^{2} + k_{2} r^{4} + k_{3} r^{6}}{1 + k_{4} r^{2} + k_{5} r^{4} + k_{6} r^{6}} + 2 p_{1} x^{} y^{} + p_{2} (r^{2} + 2 x^{2}) (1 0) \\ y^{} = y^{} \frac{1 + k_{1} r^{2} + k_{2} r^{4} + k_{3} r^{6}}{1 + k_{4} r^{2} + k_{5} r^{4} + k_{6} r^{6}} + p_{1} (r^{2} + 2 y^{2}) + 2 p_{2} x^{} y^{} \end{matrix} & (5) \end{matrix}$ with r.sup.2=x.sup.2 y.sup.2, p.sub.1 and p.sub.2 the tangential distortion parameters and k.sub.1, k.sub.2, k.sub.3, k.sub.4, k.sub.5, k.sub.6 the radial distortion coefficients.
Pixel Viewing Angles

(31) From the camera calibration data all pixels may be mapped back to 3D viewing angle vectors. This requires inverting the above mapping (now going from distorted camera coordinates to undistorted viewing vectors).

(32) $\begin{matrix} \begin{matrix} x^{} ? (u - c_{x}) / f_{?} \\ y^{} ? (v - c_{y}) / f_{y} \\ (x^{}, y^{}) = undistort (x^{}, y^{}, distCoeffs) \\ {[x y z]}^{T} ? R * {[x^{} y^{} 1]}^{T} \\ {[x y z]}^{T} ? {[x y z]}^{T} / .Math. xyz .Math. \end{matrix} & (6) \end{matrix}$ with undistort an approximate iterative algorithm estimating the normalized original point coordinates from the normalized distorted point coordinates (cf. e.g. Mallon, J., Whelan, P. F.: Precise radial un-distortion of images. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004. Volume 1., IEEE (2004) 18-21) and distCoeffs the distortion parameters, (and as above (u,v) the image space pixel coordinates, R the camera rotation with respect to the global reference frame).

(33) While in this example the viewing angle is presented using 3D normalized vectors, also other representations could be used (e.g. Euler Angles, quaternions, axisangle representation, . . . ).

(34) Concatenating Fixed Feature Maps

(35) These per pixel viewing angles (represented by 3D unit vectors) may be added (e.g. concatenated along the channels dimension) into the neural network architecture as fixed feature map channels (note that they can be re-sampled for layers having dimensions different from the original image resolution). See FIG. 5 for an example, which shows neural network layers with static feature maps according to embodiments of the present disclosure.

(36) For any layer L in the network, the proposed static global pose feature maps may be added, in addition to (or replacing some of) the existing, dynamic channels. These channels allow the L+1 layer to use the encoded global pose information. The information may be (implicitly) available to any L+x layer.

(37) Any (combination) of the following additional fixed feature maps can be used: Local per pixel viewing angles (from camera calibration matrix K and the distortion parameters) Global per pixel viewing angles (from camera calibration matrix K+camera relative rotation matrix R) Relative camera 3D location (X,Y,Z), represented by camera relative translation vector t Estimate of background scene depth (e.g. obtained from a LIDAR scan and motion based background segmentation), represented, for example, by the distance between the camera center and the scene background.

(38) For camera calibration purposes both the intrinsic and extrinsic parameters discussed in the section above regarding concatenating fixed feature maps can be obtained using standard computer vision camera calibration techniques.

(39) Throughout the description, including the claims, the term comprising a should be understood as being synonymous with comprising at least one unless otherwise stated. In addition, any range set forth in the description, including the claims should be understood as including its end value(s) unless otherwise stated. Specific values for described elements should be understood to be within accepted manufacturing or industry tolerances known to one of skill in the art, and any use of the terms substantially and/or approximately and/or generally should be understood to mean falling within such accepted tolerances.

(40) Although the present disclosure herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present disclosure.

(41) It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims.

System for processing an image, method for processing the image, method for training a neural network for processing the image, and recording medium for executing the method

Assignee

Inventors

Cpc classification

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06V2201/11

PHYSICS

Classification Explorer

G06V2201/10

PHYSICS

Classification Explorer

G06V20/20

PHYSICS

Classification Explorer

G06T2207/30244

PHYSICS

Classification Explorer

G06T7/80

PHYSICS

Classification Explorer

G06N3/08

PHYSICS

Classification Explorer

G06T2207/20084

PHYSICS

Classification Explorer

G06T7/70

PHYSICS

Classification Explorer

G06T2207/20081

PHYSICS

Classification Explorer

G06V10/764

PHYSICS

Classification Explorer

G06V20/56

PHYSICS

International classification

Classification Explorer

G06V10/764

PHYSICS

Classification Explorer

G06N3/08

PHYSICS

Classification Explorer

G06T7/70

PHYSICS

Classification Explorer

G06T7/80

PHYSICS

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06V20/20

PHYSICS

Classification Explorer

G06V20/56

PHYSICS

Abstract

Claims

Description