Detection and validation of objects from sequential images of a camera by using homographies
11087150 · 2021-08-10
Cpc classification
B60R11/04
PERFORMING OPERATIONS; TRANSPORTING
G06V20/58
PHYSICS
Abstract
A method and a device identify objects from camera images, e.g. for vehicle driver assistance systems. The method involves: capturing a series of camera images, determining a plurality of planes in space by associating adjacent corresponding features in at least two consecutive camera images with a given one of the planes, determining a relative translation vector of a plane, and identifying dynamic objects in the camera images based on the relative translation vector of an associated plane.
Claims
1. A method of detecting dynamic objects, comprising the steps: a) with a camera of a vehicle, capturing a series of images including a first image at a first time and a second image at a second time after the first time; b) determining a plurality of corresponding feature pairs, wherein each one of the corresponding feature pairs respectively consists of corresponding first and second features in the first and second images; c) establishing a plurality of spatial planes including at least one back plane extending normal to a longitudinal direction of the vehicle, at least one ground plane extending horizontally, and at least one side plane extending vertically and along the longitudinal direction of the vehicle, wherein each respective one of the spatial planes is established by associating, with the respective spatial plane, a respective plurality of adjacent feature pairs among the corresponding feature pairs; d) detecting an object in the series of images based on the adjacent feature pairs associated with a selected one of the spatial planes that have been established; e) determining a respective relative translation vector having orthogonal vector components t.sub.x, t.sub.y and t.sub.z for each respective one of the spatial planes as a measure of a motion of the respective spatial plane; and f) identifying the object as a dynamic object based on the relative translation vector of the selected spatial plane; wherein the associating of the adjacent feature pairs with the respective spatial planes comprises computing homographies and using the homographies to associate the adjacent feature pairs with the respective spatial planes; wherein the homographies describe correspondences of points or features in a respective selected image region in the first image at the first time with same or corresponding points or features in the respective selected image region in the second image at the second time; wherein each one of the homographies has only three degrees of freedom respectively relating to the orthogonal vector components t.sub.x, t.sub.y and t.sub.z; and wherein the determining of the respective relative translation vector for each one of the spatial planes involves inverting a 3×3 matrix to solve for the three degrees of freedom of the respective homography associated with the respective plane; and wherein the respective relative translation vector indicates the orientation of the selected image region.
2. The method according to claim 1, wherein the relative translation vector is defined as t/d, wherein t represents a translation of the camera of the vehicle, and d represents a distance from the selected spatial plane.
3. The method according to claim 2, wherein t/d=(t.sub.x, t.sub.y, t.sub.z).
4. The method according to claim 1, wherein the associating of the adjacent feature pairs with the respective spatial planes comprises: computing respective ones of the homographies respectively for the ground plane, the back plane and the side plane, projecting the first feature from the first image as a respective projected feature respectively onto the ground plane, the back plane and the side plane using respective applicable ones of the homographies, determining respective projection errors as respective differences between the second feature in the second image and the respective projected feature respectively for the ground plane, the back plane and the side plane, and selecting, as the spatial plane to which the adjacent feature pairs are associated, the one of the ground plane, the back plane and the side plane for which the respective projection error is the smallest among the projection errors.
5. The method according to claim 1, wherein a respective one of the homographies is computed for the back plane in accordance with:
6. The method according to claim 1, wherein a respective one of the homographies is computed for the ground plane in accordance with:
7. The method according to claim 1, wherein a respective one of the homographies is computed for the side plane in accordance with:
8. The method according to claim 1, further comprising: determining that a particular one of the spatial planes is a static plane when the motion thereof is zero as indicated by the respective relative translation vector thereof having a vector length of zero, and determining a self-rotation of the camera between the first image and the second image based on the adjacent feature pairs associated with the at least one static plane.
9. The method according to claim 8, further comprising predicting image coordinates of an epipole x.sub.s,t0 of the second image at the second time t.sub.0 according to x.sub.s,t0=H.sub.st*x.sub.s,t1, wherein H.sub.st represents a homography of the at least one static plane and x.sub.s,t1 represents image coordinates of an epipole of the first image at the first time t.sub.1.
10. The method according to claim 9, further comprising establishing a pitch rate of the camera as atan((x.sub.s0−x.sub.s1)/f) wherein f is a focal length of the camera with respect to one pixel of the camera.
11. The method according to claim 9, further comprising establishing a yaw rate of the camera as atan((y.sub.s0−y.sub.s1)/f) wherein f is a focal length of the camera with respect to one pixel of the camera.
12. The method according to claim 1, further comprising: determining and compensating a self-rotation of the camera, and identifying the dynamic object as an overtaking vehicle when the relative translation vector of the selected spatial plane to which the dynamic object has been associated includes the vector component t.sub.z having a negative value, wherein the vector component t.sub.z extends in the longitudinal direction of the vehicle.
13. The method according to claim 1, further comprising: determining and compensating a self-rotation of the camera, and identifying the dynamic object as an approaching potential collision object when the relative translation vector of the selected spatial plane to which the dynamic object has been associated includes the vector component t.sub.z having a positive value, wherein the vector component t.sub.z extends in the longitudinal direction of the vehicle.
14. The method according to claim 1, further comprising: determining and compensating a self-rotation of the camera, and identifying the dynamic object as an other vehicle driving in a curve when the relative translation vector of the selected spatial plane to which the dynamic object has been associated includes the vector component t.sub.x having a non-zero value, wherein the vector component t.sub.x extends horizontally and transversely to the longitudinal direction of the vehicle.
15. The method according to claim 1, further comprising labeling the dynamic object as a moving object, and tracking the dynamic object in successive images of the series of images.
16. The method according to claim 1, further comprising: assigning to the dynamic object a fixed width, a fixed height and/or a fixed length in space, and performing geometric tracking of the dynamic object.
17. The method according to claim 1, further comprising operating a driver assistance system of the vehicle in response to and dependent on the dynamic object.
18. A device for detecting dynamic objects, comprising a camera controller and evaluation electronics, wherein: the camera controller is configured to capture, with a camera of a vehicle, a series of images including a first image at a first time and a second image at a second time after the first time; and the evaluation electronics are configured: to determine a plurality of corresponding feature pairs, wherein each one of the corresponding feature pairs respectively consists of corresponding first and second features in the first and second images; to establish a plurality of spatial planes including at least one back plane extending normal to a longitudinal direction of the vehicle, at least one ground plane extending horizontally, and at least one side plane extending vertically and along the longitudinal direction of the vehicle, wherein each respective one of the spatial planes is established by associating, with the respective spatial plane, a respective plurality of adjacent feature pairs among the corresponding feature pairs; to detect an object in the series of images based on the adjacent feature pairs associated with a selected one of the spatial planes that have been established; to determine a respective relative translation vector having orthogonal vector components t.sub.x, t.sub.y and t.sub.z for each respective one of the spatial planes as a measure of a motion of the respective spatial plane; and to identify the object as a dynamic object based on the relative translation vector of the selected spatial plane; wherein the associating of the adjacent feature pairs with the respective spatial planes comprises computing homographies and using the homographies to associate the adjacent feature pairs with the respective spatial planes; wherein the homographies describe correspondences of points or features in a respective selected image region in the first image at the first time with same or corresponding points or features in the respective selected image region in the second image at the second time; wherein each one of the homographies has only three degrees of freedom respectively relating to the orthogonal vector components t.sub.x, t.sub.y and t.sub.z; and wherein the determining of the respective relative translation vector for each one of the spatial planes involves inverting a 3×3 matrix to solve for the three degrees of freedom of the respective homography associated with the respective plane; and wherein the respective relative translation vector indicates the orientation of the selected image region.
19. The device according to claim 18, wherein the camera is a single monocular camera and the images are each mono-images.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Further features, advantages and effects of the invention are set out by the following description of preferred embodiment examples of the invention, wherein:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS OF THE INVENTION
(10) Parts corresponding to one another are, as a general rule, provided with the same reference numerals in all of the figures.
(11)
(12)
(13)
(14) Alternatively,
(15) If, instead of individual correspondences, multiple adjacent correspondences are observed, objects can be segmented due to different speeds, scalings and deformation.
(16) If it is assumed that the world consists of planes, these can be described by homographs i.e. homographies and can be separated or distinguished from one another as shown below by means of their distance, speed and orientation. A homograph describes the correspondence of points on one plane between two camera positions or the correspondence of two points in two consecutive frames:
(17)
(18) In this case, the vector x.sub.t0 describes the 3D correspondence at the point in time t−0 of the vector x.sub.t1 at the point in time t−1. A homograph can be computed, in an image-based manner, by knowledge of four point correspondences (cf. Tutorial: Multiple View Geometry, Hartley, R. and Zisserman, A., CVPR June 1999: https://de.scribd.com/document/96810936/Hartley-Tut-4up accessed on 26 Sep. 2016). The relationships indicated at the top left (slide 21) of page 6 of the tutorial can be formulated as follows in the notation of equation 1:
(19)
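The correspondence described by a homograph, i.e. the mapping of a point x.sub.t1 at the point in time t−1 onto its counterpart x.sub.t0 at the point in time t−0, can be illustrated by a minimal sketch. The function name apply_homography, the nested-list matrix layout and the example numbers are assumptions made for illustration only.

```python
def apply_homography(H, point):
    """Map an image point (x1, y1) at t-1 to its predicted position at t-0."""
    x1, y1 = point
    # Lift to homogeneous coordinates and multiply: x_t0 ~ H * x_t1.
    u = H[0][0] * x1 + H[0][1] * y1 + H[0][2]
    v = H[1][0] * x1 + H[1][1] * y1 + H[1][2]
    w = H[2][0] * x1 + H[2][1] * y1 + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Normalize so that the third homogeneous coordinate equals 1.
    return (u / w, v / w)


# Hypothetical example: a homography that scales the image about the origin.
H_zoom = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 0.5]]
predicted = apply_homography(H_zoom, (0.25, -0.5))
```

The division by the third coordinate is the standardizing of the homogeneous coordinates that the later derivation relies on.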
(20) Alternatively, knowing the camera translation t, the rotation R and the distance d along the normal vector n of the plane, the homograph can be computed in accordance with equation 3. Equation 3 illustrates that, at a nonzero inverse TTC t/d, planes having a different orientation n can be modelled and that planes having an identical orientation n can be separated by means of their inverse TTC.
(21)
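A minimal sketch of composing a plane homograph from the rotation R, the inverse TTC t/d and the plane normal n is given below. It assumes the form H = R − (t/d)·n′, which is inferred from equations 7 and 8 (it reproduces them for n′=[0 1 0]); the sign convention and the normals chosen for the back plane and the side plane are therefore assumptions, as are all names.

```python
def plane_homography(R, t_over_d, n):
    """Return H = R - (t/d) * n' as a 3x3 nested list (assumed convention)."""
    return [[R[r][c] - t_over_d[r] * n[c] for c in range(3)] for r in range(3)]


IDENTITY = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Ground plane normal as given in the text; the other two are assumed
# from the plane orientations (normal to / along the longitudinal axis).
GROUND_NORMAL = (0.0, 1.0, 0.0)
BACK_NORMAL = (0.0, 0.0, 1.0)
SIDE_NORMAL = (1.0, 0.0, 0.0)

# Hypothetical inverse TTC components (t_x, t_y, t_z).
H_ground = plane_homography(IDENTITY, (0.25, 0.5, 0.125), GROUND_NORMAL)
```

With t/d equal to zero the function returns R unchanged, which matches the remark that such segments describe the rotation only.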
(22) A homograph can be theoretically decomposed into the normal vector n, the rotation matrix R and the inverse TTC t/d. Unfortunately, this decomposition is numerically extremely unstable and sensitive to measuring errors.
(23) If a scene is described by planes, it can be segmented as indicated below.
(24)
(25) If a cell comprises only one object (cells B3, D3), a homograph or homography will describe this cell very well. If, however, a cell contains more than one object (cell C3), the homograph will not describe either of the two objects well. If the point correspondences (black dot or black cross or x) are associated with the clusters (or segment) of the adjacent cells (B3 or D3) by means of their back projection error, the black dot is associated with the segment of the cell B3 and the black cross is associated with the segment of the cell D3, because the homograph for the cell C3 does not describe either the foreground or the background well.
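The grouping of point correspondences by cell can be sketched as follows. The pixel coordinate convention (origin at the top left, 0 ≤ x < width), the use of the position at the point in time t−0 to pick the cell, and all names are assumptions for illustration.

```python
def cell_id(x, y, width, height, n_cols, n_rows):
    """Return a unique integer ID of the grid cell that contains pixel (x, y)."""
    col = min(max(int(x * n_cols / width), 0), n_cols - 1)
    row = min(max(int(y * n_rows / height), 0), n_rows - 1)
    return row * n_cols + col


def group_by_cell(correspondences, width, height, n_cols, n_rows):
    """Group (p_t1, p_t0) correspondences by the cell of their t-0 position."""
    cells = {}
    for p_t1, p_t0 in correspondences:
        cid = cell_id(p_t0[0], p_t0[1], width, height, n_cols, n_rows)
        cells.setdefault(cid, []).append((p_t1, p_t0))
    return cells
```

A homograph would then be estimated per cell from the correspondences stored under each ID.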
(26) If prior knowledge of a scene exists, the segment sizes can be adjusted to the scene in that e.g. larger regions in the close region of the vehicle or in regions having a positive classification answer can be generated. A dedicated back plane, ground plane and side plane homograph i.e. homography is computed for each segment, as shown in equations 5 to 10.
(27) The computation of the back plane, ground plane and side plane homograph i.e. homography increases the selectivity because a homography with fewer degrees of freedom can only poorly model regions which contain different planes and, consequently, corresponding points will have a higher back projection error, see
e.sub.i=x.sub.t0−H.sub.ix.sub.t1. (4)
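The use of the back projection error of equation 4 for choosing between candidate planes can be sketched as follows. Taking the Euclidean length of the difference vector and summing it over the correspondences of a cell are assumptions, as are the function names and the toy homographs.

```python
import math


def project(H, p):
    """Apply homography H to point p and normalize the result."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def back_projection_error(H, p_t1, p_t0):
    """Length of e = x_t0 - H * x_t1 for one correspondence."""
    px, py = project(H, p_t1)
    return math.hypot(p_t0[0] - px, p_t0[1] - py)


def best_plane(homographies, correspondences):
    """Return the key of the homography with the smallest summed error."""
    totals = {
        name: sum(back_projection_error(H, p1, p0) for p1, p0 in correspondences)
        for name, H in homographies.items()
    }
    return min(totals, key=totals.get), totals


# Hypothetical candidates and correspondences given as (p_t1, p_t0).
candidates = {
    "identity": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "zoom": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.5]],
}
pairs = [((1.0, 1.0), (2.0, 2.0)), ((-1.0, 0.5), (-2.0, 1.0))]
winner, totals = best_plane(candidates, pairs)
```

In the method, the candidates would be the back plane, ground plane and side plane homographs of a segment.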
(28) If the static installation position of the camera and camera rotation are assumed in two different views (e.g. due to knowledge of the camera calibration and due to the computation of the fundamental matrix in a monocular system or due to rotation values of a rotation rate sensor cluster), the inverse TTC t/d can be computed by means of the flux vectors compensated for by the static camera rotation, as is shown below by way of example for a ground plane n′=[0 1 0]. If the rotation is not known, it can be approximately replaced by a unit matrix.
(29) If the quotient t/d is substituted by the inverse time to collision
(30)
in equation 3, it follows that:
(31)
(32) By introducing the constants a, b, c, wherein
(33)
equation 5 produces the simplified form:
(34)
(35) The result of standardizing the homogeneous coordinates is:
x.sub.0(c−t.sub.zy.sub.1)=a−t.sub.xy.sub.1 (7)
y.sub.0(c−t.sub.zy.sub.1)=b−t.sub.yy.sub.1 (8)
(36) For more than one measurement, an equation system of the form Mx=v with a vector x to be established, a matrix M and a vector v (see equation 9) is produced, which can be solved for at least three image correspondences as sampling points by e.g. a singular value decomposition or a least squares method:
(37)
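Rearranging equations 7 and 8 so that the unknowns t.sub.x, t.sub.y and t.sub.z appear on one side gives one consistent way of writing the system Mx=v for the ground plane. The form below is a reconstruction from equations 7 and 8 and is not quoted from the original; each correspondence contributes the two rows shown, and the rows of several correspondences are stacked.

```latex
\begin{bmatrix} x_0 c - a \\ y_0 c - b \end{bmatrix}
=
\begin{bmatrix}
  -y_1 & 0    & y_1 x_0 \\
  0    & -y_1 & y_1 y_0
\end{bmatrix}
\begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}
```

Expanding the first row gives x_0 c − a = −y_1 t_x + y_1 x_0 t_z, which is equation 7 with the terms regrouped; the second row corresponds to equation 8 in the same way.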
(38) The back and side plane homographs i.e. homographies can be deduced similarly and respectively produce:
(39)
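A least squares solution for the three components of t/d can be sketched for all three plane types as follows. The sketch assumes that the back plane and the side plane systems differ from the ground plane system only in the factor n′·x.sub.t1 (1 for the back plane, x.sub.1 for the side plane, y.sub.1 for the ground plane), which follows from the assumed form H = R − (t/d)·n′. It solves the 3×3 normal equations by Cramer's rule instead of a singular value decomposition; all names are hypothetical.

```python
def solve_3x3(A, b):
    """Solve A x = b for a 3x3 system with Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det(A)
    if abs(d) < 1e-12:
        raise ValueError("degenerate configuration: normal matrix is singular")
    solution = []
    for col in range(3):
        replaced = [[b[r] if c == col else A[r][c] for c in range(3)]
                    for r in range(3)]
        solution.append(det(replaced) / d)
    return solution


def estimate_translation(correspondences, plane, R=None):
    """Least squares estimate of (t_x, t_y, t_z) = t/d for one plane type.

    correspondences: list of ((x1, y1), (x0, y0)) pairs, t-1 first.
    plane: "ground", "back" or "side".
    R: rotation matrix; the unit matrix is used if the rotation is unknown.
    """
    R = R or [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    MtM = [[0.0] * 3 for _ in range(3)]
    Mtv = [0.0] * 3
    for (x1, y1), (x0, y0) in correspondences:
        # Constants a, b, c are the rotated point R * [x1, y1, 1]'.
        a, b, c = (R[r][0] * x1 + R[r][1] * y1 + R[r][2] for r in range(3))
        s = {"ground": y1, "back": 1.0, "side": x1}[plane]  # s = n' * x_t1
        rows = [([-s, 0.0, s * x0], x0 * c - a),
                ([0.0, -s, s * y0], y0 * c - b)]
        # Accumulate the normal equations M'M t = M'v.
        for row, rhs in rows:
            for i in range(3):
                Mtv[i] += row[i] * rhs
                for j in range(3):
                    MtM[i][j] += row[i] * row[j]
    return solve_3x3(MtM, Mtv)
```

Because only three unknowns are estimated, the system stays small regardless of the number of correspondences, and a singular normal matrix signals a cell whose points do not constrain the plane.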
(40) In order to segment larger objects consisting of multiple cells, adjacent cells can be combined in a further step, in that the back projection or reprojection errors Σx.sub.t0.sup.i−H.sub.jx.sub.t1.sup.i or Σx.sub.t0.sup.j−H.sub.ix.sub.t1.sup.j are computed by means of sampling points (see point 1 below: RANSAC) of the adjacent segments j and i and their homographs i.e. homographies. Two adjacent clusters are combined, if Σx.sub.t0.sup.i−H.sub.jx.sub.t1.sup.i is less than Σx.sub.t0.sup.i−H.sub.ix.sub.t1.sup.i or e.g. the back projection error standardized for the predicted flux length is below an adjustable threshold. In particular, two adjacent clusters can be combined, if Σx.sub.t0.sup.i−H.sub.jx.sub.t1.sup.i is less than Σx.sub.t0.sup.i−H.sub.ix.sub.t1.sup.i and the two back projection errors Σx.sub.t0.sup.i−H.sub.jx.sub.t1.sup.i and Σx.sub.t0.sup.i−H.sub.ix.sub.t1.sup.i fall below a threshold standardized for the flux length. Alternatively, back projection errors can be used as potentials in a graph and a global solution can be computed. The compactness of the clusters can, in this case, be established via the edge potentials in the graph.
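The test for combining two adjacent cells i and j can be sketched as follows. The sketch implements the first variant (the error under the neighboring homograph is smaller than under the cell's own homograph, or the error standardized for the predicted flux length is below a threshold). The Euclidean norm, the default threshold value and all names are assumptions.

```python
import math


def project(H, p):
    """Apply homography H to point p and normalize the result."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def summed_error(H, samples):
    """Sum of back projection error lengths over (p_t1, p_t0) samples."""
    total = 0.0
    for p_t1, p_t0 in samples:
        px, py = project(H, p_t1)
        total += math.hypot(p_t0[0] - px, p_t0[1] - py)
    return total


def predicted_flow_length(H, samples):
    """Sum of the flow lengths that H predicts for the samples."""
    total = 0.0
    for p_t1, _ in samples:
        px, py = project(H, p_t1)
        total += math.hypot(px - p_t1[0], py - p_t1[1])
    return total


def should_merge(H_i, H_j, samples_i, rel_threshold=0.1):
    """Decide whether cell i may be combined with its neighbor j."""
    own = summed_error(H_i, samples_i)
    cross = summed_error(H_j, samples_i)
    flow = predicted_flow_length(H_j, samples_i)
    if flow > 0.0:
        normalized = cross / flow
    else:
        normalized = 0.0 if cross == 0.0 else float("inf")
    return cross < own or normalized < rel_threshold
```

The stricter variant named in the text would replace the final `or` by a condition that requires both errors to fall below the standardized threshold.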
(41) If the segments have been combined, the homographs i.e. homographies are computed again and the point correspondences are associated with the clusters having the smallest back projection error. If only directly neighboring clusters are observed, very compact objects can be generated. If the minimum error exceeds an adjustable threshold, new (cluster/object) IDs are assigned to the correspondences, in order to be able to identify partially concealed objects or objects having a slightly different TTC. By adjusting the threshold, the resolution of (slightly) different objects can be adjusted.
(42) The back projection errors can be provided with a bias which reduces the costs for related regions or a bias which increases the costs for an ID change, if point correspondences were to have the same ID affiliation over a longer period of time.
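The association of one point correspondence with a cluster, including the threshold for new IDs and a bias against ID changes, can be sketched as follows. Applying the threshold to the unbiased error of the chosen cluster and using a constant penalty for an ID change are assumptions; the text leaves both open. All names are hypothetical.

```python
def assign_segment(errors, new_id, max_error, previous_id=None, switch_penalty=0.0):
    """Pick a cluster ID for one point correspondence.

    errors: dict mapping cluster ID to back projection error.
    new_id: ID to return if no cluster describes the point well enough.
    max_error: adjustable threshold on the error of the chosen cluster.
    previous_id: ID the correspondence carried so far, if any.
    switch_penalty: bias added to the cost of every other ID.
    """
    costs = {}
    for seg_id, err in errors.items():
        changes_id = previous_id is not None and seg_id != previous_id
        costs[seg_id] = err + (switch_penalty if changes_id else 0.0)
    best = min(costs, key=costs.get)
    if errors[best] > max_error:
        return new_id
    return best
```

A larger switch_penalty keeps long-lived affiliations stable, while a smaller max_error splits slightly different objects into separate IDs.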
(43)
(44)
(45) This scene can be segmented in a similar way to the method described by means of
(46)
(47)
(48) A further segment can be identified in the middle of the image, in the original it is pink. It therefore has high red values in
(49) The segmenting result shown was determined without prior knowledge of the scene in only three iteration steps. This illustrates the speed and performance that an embodiment of the invention achieves through temporal integration.
(50)
(51)
(52)
(53)
(54) As illustrated in
(55)
(56)
(57)
(58)
(59) If the natural rotation is considered in the correspondences prior to the computation of the homograph i.e. homography, or if the natural rotation is considered in the rotation matrix R, overtaking vehicles can be identified by their negative t.sub.z component, and swerving vehicles or vehicles driving in a curve can be identified by a nonzero lateral t.sub.x component. If the dynamic segments are predicted by means of their homographs (see “consolidation of the optical flux based on homographs” below), a dynamic map can be constructed over time.
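The interpretation of the components of the relative translation vector can be sketched as a simple rule set. The sketch assumes that the rotation has already been compensated; the tolerance eps, the label strings and the function name are assumptions, and the rules for a positive t.sub.z component and for a vector of length zero follow the claims.

```python
import math


def classify_segment(t_over_d, eps=1e-3):
    """Return labels for a segment from its translation vector (t_x, t_y, t_z)."""
    tx, ty, tz = t_over_d
    if math.sqrt(tx * tx + ty * ty + tz * tz) <= eps:
        return ["static"]  # vector length of (approximately) zero
    labels = []
    if tz < -eps:
        labels.append("overtaking")  # negative longitudinal component
    elif tz > eps:
        labels.append("approaching")  # potential collision object
    if abs(tx) > eps:
        labels.append("swerving_or_curve")  # nonzero lateral component
    return labels or ["dynamic"]
```

A segment can carry more than one label, e.g. an approaching vehicle that is also moving laterally.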
(60) If equation 3 is observed, it can be seen that segments having an inverse TTC equal to zero describe the rotation matrix, which can be established by computing a homograph with all degrees of freedom (equation 2) from segments with t/d equal to zero. If it is assumed that the translatory components are negligible in the vicinity of the epipole, the pitch rate and yaw rate can also be established by predicting the coordinates of the epipole (x.sub.e, y.sub.e) through the homograph of static segments and computing atan((x.sub.e0−x.sub.e1)/f) or atan((y.sub.e0−y.sub.e1)/f) with the focal length f based on one pixel.
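The prediction of the epipole and the two arctangent expressions can be sketched as follows. Pixel coordinates, angles in radians per frame interval and all names are assumptions. The sketch returns both angles without labeling them; the claims associate the x expression with the pitch rate and the y expression with the yaw rate.

```python
import math


def predict_epipole(H_static, epipole_t1):
    """Predict the epipole at t-0 from the epipole at t-1 via a static homography."""
    x, y = epipole_t1
    w = H_static[2][0] * x + H_static[2][1] * y + H_static[2][2]
    return ((H_static[0][0] * x + H_static[0][1] * y + H_static[0][2]) / w,
            (H_static[1][0] * x + H_static[1][1] * y + H_static[1][2]) / w)


def epipole_rotation_angles(epipole_t0, epipole_t1, focal_length_px):
    """Return (atan(dx / f), atan(dy / f)) for the epipole displacement."""
    dx = epipole_t0[0] - epipole_t1[0]
    dy = epipole_t0[1] - epipole_t1[1]
    return (math.atan(dx / focal_length_px), math.atan(dy / focal_length_px))


# Hypothetical example: the static homography shifts the epipole by 10 pixels in x.
H_shift = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
e1 = (100.0, 50.0)
e0 = predict_epipole(H_shift, e1)
angle_x, angle_y = epipole_rotation_angles(e0, e1, 1000.0)
```

Dividing the angles by the frame interval would give rates per second.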
(61) If a homograph is computed with all degrees of freedom for each cluster, this can also be used to reconstruct the 3D surroundings in that, instead of the measured position x.sub.t0, the predicted position H*x.sub.t1 is used for triangulation. This not only reduces the effect of measuring errors, but also makes it possible to reconstruct objects close to the epipole.
(62) One embodiment example for consolidating the optical flux based on homographs is described below.
(63) If the segmentation is known at the point in time t−1, it can be used both to predict the objects and to generate a dense flux field. Signature-based flux methods produce signatures and attempt to associate these unambiguously in consecutive frames. The signatures are mostly computed from a patch (image section or image region) of a defined size. If, however, the size and form of a patch change, it is no longer possible to find a correspondence with a fixed template (i.e. a model or specimen, e.g. an image section of an image of the series of images which corresponds to an object, for example a vehicle template). If e.g. one is approaching a back plane, the size of a patch changes. If one is moving over a ground plane or parallel to a side plane, both the size and the form of a patch change, see
(64) Alternatively, the current frame can be transformed at the point in time t−0 to the point in time t−1, in order to compensate for changes in scale and form.
(65)
(66)
(67)
(68)
(69)
(70) In
(71) In order to generate a dense flux field, the current image can thus be warped onto the previous image for each segment, in order to rediscover already existing correspondences which have changed in their scale or form, or in order to establish new correspondences by means of congruent templates.
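Warping the current frame onto the previous frame can be sketched with nearest neighbor sampling. Since the homograph maps points at t−1 onto points at t−0, each pixel of the t−1 grid can be filled by looking up its predicted position in the current frame, so no matrix inversion is needed. A homograph expressed in pixel coordinates, images as nested lists and the fill value are assumptions; a per-segment mask is omitted for brevity.

```python
def warp_to_previous(current, H, fill=0):
    """Resample the current frame (t-0) on the pixel grid of the previous frame (t-1)."""
    height = len(current)
    width = len(current[0])
    warped = [[fill] * width for _ in range(height)]
    for y1 in range(height):
        for x1 in range(width):
            # Predicted position of pixel (x1, y1) in the current frame.
            u = H[0][0] * x1 + H[0][1] * y1 + H[0][2]
            v = H[1][0] * x1 + H[1][1] * y1 + H[1][2]
            w = H[2][0] * x1 + H[2][1] * y1 + H[2][2]
            if abs(w) < 1e-12:
                continue
            x0 = round(u / w)
            y0 = round(v / w)
            if 0 <= x0 < width and 0 <= y0 < height:
                warped[y1][x1] = current[y0][x0]
    return warped
```

After the warp, patches of the same segment have the scale and form they had at t−1, so a fixed template can be matched again.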
(72) If not enough flux vectors for computing a homograph again are present in a current frame, the homograph from the last frame can be approximately used to make the correspondence finding more robust to changes in form and scale.
(73) The following configuration forms or aspects are advantageous and can be provided individually or in combination:
1. The image is subdivided into N×M cells and a unique cell ID is assigned to the point correspondences of a cell. The back plane, ground plane and side plane homographs i.e. homographies (equations 9, 10 and 11) are computed by means of RANSAC from the correspondences with the same IDs, and both the homography having the lowest back projection error and the sampling points used to calculate the homography are stored. In the case of the RANSAC (RAndom SAmple Consensus) method, a minimum number of randomly selected correspondences is usually used for each iteration, in order to form a hypothesis. A value, which describes whether the corresponding feature supports the hypothesis, is subsequently computed for each corresponding feature. If the hypothesis attains sufficient support through the corresponding features, then the non-supporting corresponding features can be rejected as outliers. Otherwise, a minimum number of correspondences is selected again at random.
2. The back projection errors Σx.sub.t0.sup.i−H.sub.jx.sub.t1.sup.i or Σx.sub.t0.sup.j−H.sub.ix.sub.t1.sup.j are computed by means of the sampling points of the adjacent homograph i.e. homography for adjacent cells i, j. If the back projection error Σx.sub.t0.sup.i−H.sub.jx.sub.t1.sup.i is less than Σx.sub.t0.sup.i−H.sub.ix.sub.t1.sup.i or if the errors fall below a threshold standardized for the flux length, the IDs are combined and the homographies are computed again. In particular, two adjacent cells can be clustered as belonging to the same plane (or to the same segment or to the same object), if the back projection error Σ.sub.i(x.sub.t0.sup.i−H.sub.jx.sub.t1.sup.i) is less than Σ.sub.i(x.sub.t0.sup.i−H.sub.ix.sub.t1.sup.i) and if both back projection errors Σ.sub.i(x.sub.t0.sup.i−H.sub.jx.sub.t1.sup.i) and Σ.sub.i(x.sub.t0.sup.i−H.sub.ix.sub.t1.sup.i) fall below a threshold standardized for the flux length.
3. The back projection errors x.sub.t0−H.sub.ix.sub.t1 of all of the point correspondences are computed for the adjacent segments and a point correspondence is associated with the segment having the lowest back projection error. If the minimum error exceeds a threshold, the correspondences are provided with a new object ID in order to also be able to identify smaller or partially concealed objects.
4. The homographs of the segments extracted at the point in time t−1 are computed again at the start of a new frame (t−0) by means of the image correspondences already found and the already existing segment IDs in the current frame are predicted. If not enough flux vectors are available to compute a homograph again in the current frame, the homographs from the last frame can be approximately used.
5. In order to generate a dense flux field, the current frame (t−0) is warped onto the last frame (t−1) for each segment, in order to rediscover already existing correspondences which have changed in their scale or form, or in order to establish new correspondences.
6. The back projection errors of the back plane, ground plane and side plane can be used to validate elevated targets, see
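The RANSAC loop described in aspect 1 can be sketched generically as follows. The fit and residual callbacks would be the plane-specific homograph estimator and the back projection error; the mean-flow model used here is only a stand-in so that the sketch stays self-contained. Returning the best hypothesis found when no hypothesis attains sufficient support, as well as all names and default values, are assumptions.

```python
import math
import random


def ransac(data, fit, residual, sample_size, inlier_threshold, min_support,
           max_iterations=200, seed=0):
    """Return (model, inliers) of the first hypothesis with sufficient support."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(max_iterations):
        # Form a hypothesis from a minimum number of random correspondences.
        sample = rng.sample(data, sample_size)
        model = fit(sample)
        # Collect the correspondences that support the hypothesis.
        inliers = [d for d in data if residual(model, d) <= inlier_threshold]
        if len(inliers) >= min_support:
            return model, inliers  # non-supporting points are the outliers
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


def fit_mean_flow(sample):
    """Stand-in model: mean flow vector of the sampled correspondences."""
    n = len(sample)
    dx = sum(p0[0] - p1[0] for p1, p0 in sample) / n
    dy = sum(p0[1] - p1[1] for p1, p0 in sample) / n
    return (dx, dy)


def flow_residual(model, correspondence):
    """Distance between the measured point and the point shifted by the model."""
    (x1, y1), (x0, y0) = correspondence
    return math.hypot(x0 - (x1 + model[0]), y0 - (y1 + model[1]))
```

In the method, the stored sampling points of the winning hypothesis would later be reused for the comparison with adjacent cells in aspect 2.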