Method for labeling images of street scenes

Abstract

A method labels an image of a street view by first extracting, for each pixel, an appearance feature for inferring a semantic label, a depth feature for inferring a depth label. Then, a column-wise labeling procedure is applied to the features to jointly determine the semantic label and the depth label for each pixel using the appearance feature and the depth feature, wherein the column-wise labeling procedure is according to a model of the street view, and wherein each column of pixels in the images includes at most four layers.

Claims

1. A method for labeling an image of a street view, wherein the image includes a set of columns of pixels, comprising the steps of: employing at least one processor executing computer executable instructions stored on at least one computer readable memory to facilitate performing the steps of: receiving the image having image pixels including respective two-dimensional features; receiving image data points corresponding to the image, the image data points including respective three-dimensional image features; extracting, for each pixel, an appearance feature from the image pixels, wherein the appearance features are determined using a deep neural network learned from a labeled dataset; extracting, for each pixel, a depth feature from the image data points; and applying a column-wise labeling procedure to jointly determine a semantic label and a depth label for each pixel for each column of pixels from the set of columns of pixels of the image using both the appearance features and the depth features from their respective column of pixels, and wherein the column-wise labeling procedure is according to a model of the street view, wherein each column of pixels of the set of columns of pixels includes at most four ordered layers from a top layer or a fourth layer to a first layer or a bottom layer, such that the at most four ordered layers are obtained using an inference procedure that jointly estimates the sematic labels and the depth labels for each image column; processing, the column-wise labeling procedure subject to the at most four ordered layers to produce the labeled image, and the steps are performed in the processor.

2. The method of claim 1, wherein the first layer of the street view model represents drivable areas, a second layer represents dynamic objects common in the street, a third layer represents static objects, and the fourth layer represents sky, and wherein the depths in the layers are ordering from top to bottom.

3. The method of claim 2, wherein the drivable areas in the first layer include ground, grass, sidewalk.

4. The method of claim 2, wherein the dynamic objects in the second layer include vehicles, pedestrians, bicyclists, motorcyclists, and animals.

5. The method of claim 2, wherein the static objects in the third layer include buildings, bridges, and trees.

6. The method of claim 1, wherein the appearance features are determined using the deep neural network learned from the labeled dataset, such that the deep neural network is used to extract image features for sematic classes from the images.

7. The method of claim 1, wherein the depth feature is determined from disparity matching cost obtained from a stereo image.

8. The method of claim 1, wherein the depth label for a pixel in the top layer of a column of pixels is greater than the depth label of a pixel in the remaining lower layers of the column of pixels.

9. The method of claim 1, wherein the depth label for a pixel in the bottom layer of a column of pixels is determined by a ground plane constraint.

10. The method of claim 1, wherein the image is acquired by a camera device or an image acquiring device, and the labeled image is an output on a display, such that the display and the processor are arranged in a vehicle.

11. A method for labeling an image of a street view, wherein the image includes a set of columns of pixels, comprising the steps of: employing at least one processor executing computer executable instructions stored on at least one computer readable memory to facilitate performing the steps of: receiving the image having image pixels including respective two-dimensional features; receiving image data points corresponding to the image, the image data points including respective three-dimensional image features; extracting, for each pixel, an appearance feature from the image pixels, wherein the appearance features are determined using a deep multi-scale convolutional network from a labeled dataset used to extract image features for sematic classes from the images; extracting, for each pixel, a depth feature from the image data points; and applying a column-wise labeling procedure to jointly determine a semantic label and a depth label for each pixel for each column of pixels from the set of columns of pixels of the image using both the appearance features and the depth features from their respective column of pixels, and wherein the column-wise labeling procedure is according to a model of the street view, wherein each column of pixels of the set of columns of pixels includes at most four ordered layers from a top layer or a fourth layer to a first layer or a bottom layer, such that the at most four ordered layers are obtained using an inference procedure that jointly estimates the sematic labels and depth labels for each image column; processing, the column-wise labeling procedure subject to the at most four ordered layers to produce the labeled image, and the steps are performed in the processor.

12. The method of claim 1, wherein the inference procedure is based on a dynamic algorithm, that includes variables, h.sub.i1, h.sub.i2, h.sub.i3, and h.sub.i4, to denote y coordinates of top pixels of the at most four ordered layers, such that l.sub.i1, l.sub.i2, l.sub.i3, and l.sub.i4 are semantic labels for the at most four ordered layers, and d.sub.i1, d.sub.i2, d.sub.i3, and d.sub.i4 are depths of the at most four ordered layers, wherein only unknown variables are determined, for example, if the unknown variables included h.sub.1, h.sub.2, h.sub.3, l.sub.2, d.sub.3, then the unknown variables are determined by $R_{i} (h_{2}, h_{3}, d_{3}) \equiv {.Math.}_{y = h_{4}}^{h_{3} - 1} E^{A} (i, y, ��) + E^{D} (i, y, \infty) + {.Math.}_{y = h_{3}}^{h_{2} - 1} E^{A} (i, y, ��) + E^{D} (i, y, d_{3}) .$

13. The method of claim 11, wherein the inference procedure is based on a dynamic algorithm, that includes variables, h.sub.i1, h.sub.i2, h.sub.i3, and h.sub.i4, to denote y coordinates of top pixels of the at most four ordered layers, such that l.sub.i1, l.sub.i2, l.sub.i3, and l.sub.i4 are semantic labels for the at most four ordered layers, and d.sub.i1, d.sub.i2, d.sub.i3, and d.sub.i4 are depths of the at most four ordered layers, wherein only unknown variables are determined, for example, if the unknown variables included h.sub.1, h.sub.2, h.sub.3, l.sub.2, d.sub.3, then the unknown variables are determined by $R_{i} (h_{2}, h_{3}, d_{3}) \equiv {.Math.}_{y = h_{4}}^{h_{3} - 1} E^{A} (i, y, ��) + E^{D} (i, y, \infty) + {.Math.}_{y = h_{3}}^{h_{2} - 1} E^{A} (i, y, ��) + E^{D} (i, y, d_{3}) .$

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic of a method for labeling components in a pair of stereo images according to embodiments of the invention;

(2) FIG. 2 is a flow diagram of the method of FIG. 1 according to embodiments of the invention;

(3) FIG. 3 is an image of a street scene with four layers according to embodiments of the invention; and

(4) FIG. 4 is a schematic of labeled layers according to embodiments of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

(5) As shown in FIGS. 1 and 2, the embodiments of our invention provide a four-layered model for a street scene and a method 200 for labeling components in a pair of stereo images 101, such as ground, e.g., a road, objects such as pedestrians and vehicles, buildings, and the sky. The images can be acquired by a camera 201. A labeled image 102 can be output on a display 202. The steps of the method can be performed in a processor 203 connected to memory, input/output interfaces by buses as known in the art. In one embodiment, the display and the processor can be arranged in a vehicle. This way, a driver of the vehicle can be apprised of an enhanced image of the scene.

(6) A deep neural network, e.g., a convolutional neural network (CNN) 210, is used to extract image features 211 for the semantic classes from the stereo images 101. A stereo cost function 220 is used to extract depth features 221. The image and depth features are processed by a column-wise image labeling procedure 230, subject to a layer ordering constraint 231 to produce the labeled image 102.

(7) Layered Street Scene

(8) As shown in FIG. 3, it is an objective of the invention to jointly estimate a semantic label and depth for each pixel in the street view image using appearance and three-dimensional information. We use a layered image interpretation. An image is horizontally partitioned into one to four layers of different semantic and depth components. Layer-1 is the ground plane, e.g., a road. Layer-2 can include pedestrians (peds) and other dynamic objects such as vehicles and motorcycles. Layer-3 contains buildings, and layer-4 is sky.

(9) The layers are ordered from the bottom to the top. In each image column, pixels belonging to the same layer have the same semantic label and approximate depth. The only exception is that the depths of the pixels in the ground layer can vary according to their vertical image coordinates, which is determined by the ground plane. We enforce a depth-order constraint, i.e., the depth of a lower layer is always smaller than the depth of a higher layer in each image column.

(10) For example, the first layer can only have ground, e.g., a drivable area. The second layer can include objects, e.g., pedestrians or vehicles. The third layer can have only buildings. The fourth layer can only have sky. We do not enforce that each image column has exactly four layers. A column can have one to four layers, in other words a column does not have to include all four layers. This implies that boundaries separating the layers need not be smooth and continuous. The four-layer model provides a rich and organized interpretation of street scenes. We provide a flexible method for enforcing geometry and semantics in the scene as described below.

(11) Notations

(12) We use custom character ={1, 2, . . . , W} and ={1, 2, . . . , H} to refer the sets that contain horizontal x and vertical y coordinates respectively, where W=|| and H=|| are the image width and height. We use five different semantic classes: namely, ground, vehicle, pedestrian, building, and sky denoted by the symbols custom character and, respectively. The set of the semantic class labels is denoted by ={}. We use for the set of disparities (depths). Herein, disparity and depth are used interchangeably. It is understood that a one-to-one conversion can be obtained by using camera calibration parameters. The cardinality of the semantic label space and disparity values are denoted by L=| custom character | and || respectively.

(13) Street Scene Model

(14) As shown in FIG. 4, we formulate the layered street view problem as a constrained energy minimization problem. The constraints encode the order of the semantic object class labels and depth values in each column. The constraints limit the solution space of the variables associated with each image column. We solve the constrained energy minimization problem efficiently using an inference algorithm based on dynamic programming.

(15) We use the variables, h.sub.i1, h.sub.i2, h.sub.i3, and h.sub.i4, to denote the y coordinates of the top pixels of the four layers in the image column 400. Let l.sub.i1, l.sub.i2, l.sub.i3, and l.sub.i4 be the semantic object class labels for the four layers, and let d.sub.i1, d.sub.i2, d.sub.i3, and d.sub.i4 be the depths of the four layers in the image column i. The ordering constraint and the knowledge of the ground planes allow us to know some of these parameters. The actual number of unknowns is only 5 given by x.sub.i=[h.sub.i1, h.sub.i2, h.sub.i3, l.sub.i2, d.sub.i3]. Hence, the label assignment for the entire image is given by x.sub.l=[x.sub.1, x.sub.2, . . . , x.sub.W]. The number of possible assignments for an image column is in the order O(H.sup.3LD), since h.sub.i1, h.sub.i2, h.sub.i3ε custom character , l.sub.i2ε, and d.sub.i3ε.

(16) Label Likelihood Ranking

(17) To rank the likelihood of the label assignment, we use evidence from image appearance features 211 and stereo disparity matching features 221. We aggregate evidence from all the pixels in a column to compute the evidence. Let U.sub.i.sup.A(x.sub.i) and U.sub.i.sup.D(x.sub.i) be the data terms representing the semantic and depth label cost, respectively, incurred as assigning x.sub.i to the image column i. The two terms are summed to yield the data term U.sub.i(x.sub.i)=U.sub.i.sup.A(x.sub.i)+U.sub.i.sup.D(x.sub.i) denoting the cost for assigning x.sub.i to the image column i. Instead of working on the standard 2D Markov Random Field (MRF) space where each pixel can have a depth value and a semantic label as independent variables, we reduce the problem to a constrained energy minimization problem given by

(18) $\min_{x_{i}} {.Math.}_{i = 1}^{W} U_{i} (x_{i})$ $s . t . h_{i 1} > h_{i 2} > h_{i 3} > h_{i 4} = 1, d_{i 1} < d_{i 2} < d_{i 3} < d_{i 4}, l_{i 1} = ��, l_{i 2} \in {��, ℙ}, l_{i 3} = ��, l_{i 4} = ��, \forall i$
where the first constraint gives the layer structure, the second constraint enforces the depth order, and the third constraint limits the semantic variation. The variable d.sub.i2 is a function of h.sub.i1, i.e., the top pixel location of the ground layer, because we assume the dynamic object is vertical with respect to the horizontal ground surface at the h.sub.i1th row of the image. The energy function Σ.sub.i=1.sup.W U.sub.i(x.sub.i) models the relation of pixels in the same column, but not pixels in the same row. The relation of pixels in the same row are implicitly encoded in the feature functions. We use image patches centering around a pixel as reception fields for the feature computation at the pixel location. Neighboring pixels have similar reception fields, and hence, have similar feature representation and class label assignments.

(19) The data term U.sub.i(x.sub.i) is the cost of assigning label x.sub.i to the column i. The data term is the sum of the pixel-wise data terms given by

(20) $U_{i} (x_{i}) = {.Math.}_{y = h_{i 4}}^{h_{i 3} - 1} E^{A} (i, y, ��) + E^{D} (i, y, \infty)$ ${.Math.}_{y = h_{i 3}}^{h_{i 2} - 1} E^{A} (i, y, ��) + E^{D} (i, y, d_{i 3})$ ${.Math.}_{y = h_{i 2}}^{h_{i 1} - 1} E^{A} (i, y, l_{i 2}) + E^{D} (i, y, d_{i 2} (h_{i 1}))$ ${.Math.}_{y = h_{i 1}}^{H} E^{A} (i, y, ��) + E^{D} (i, y, d_{i 1}),$
where the per pixel appearance data term E.sup.A(x, y, l) is the cost of assigning label l to the pixel (x, y) and the per pixel depth data term E.sup.D(x, y, d) is the cost of assigning depth d to the pixel (x, y). We use a deep convolutional neural network for obtaining the per pixel appearance data term E.sup.A(x, y, l), and use a conventional disparity cost for obtaining the per pixel depth data term E.sup.D(x, y, d).

(21) Depth Features

(22) We use the smoothed absolute intensity difference for the per pixel depth data term, which is commonly used in the stereo reconstruction algorithms. We first compute the pixel wise absolute intensity difference for each disparity value in D, which renders a cost volume representation. A box filter is then applied to smooth the cost volume. The per pixel depth data term is given by

(23) $E^{D} (x, y, d) = \frac{1}{N} {.Math.}_{(\tilde{x}, \tilde{y}) \in P_{(x, y)}} .Math. I_{L} (\tilde{x}, \tilde{y} - d) - I_{R} (\tilde{x}, \tilde{y}) .Math.,$
where I.sub.L and I.sub.R refer to the intensity values of the left and right images, P.sub.(x,y) is an image patch centered at (x, y), and N=|P.sub.(x,y)| denotes the cardinality of the patch, serving as a normalization constant. The patch size is fixed to, e.g., 11-by-11 pixels.

(24) Appearance Features

(25) We determine the per pixel appearance data term using a deep multi-scale convolutional network (CNN). For example, the network has three convolutional layers: (1) 16 filters of size 8×8 followed by rectified linear (ReLU) non-linearity and 2×2 max-pooling; and (2) 64 filters of size 7×7×16 followed by ReLU and 2×2 non-overlapping max-pooling; (3) 256 filters of size 7×7×64 followed by ReLU.

(26) In one embodiment, we use grayscale images, scaled between 0 to 1 and centered by subtracting 0.5 as input for the CNN. Filters, shared across the scales, are applied separately to the three scales of a Gaussian pyramid of the input images.

(27) Note that the feature from lower scales are upsampled to the original image size. The feature from the three scales are concatenated to obtain 3×256=768 features per pixel. To train the network, we connect the feature layer to a fully connected layer having five neurons (classes) followed by a softmax or normalized exponential layer. We can also use dropout in a fully connected layer. The network is trained by minimizing the cross-entropy error via stochastic gradient descent with momentum. We use a negative logarithm of the softmax scores of the classes as the per-pixel appearance data terms.

(28) Inference Procedure

(29) We decompose the energy minimization problem in Equation (2) into W sub-problems where the ith sub-problem is given by

(30) $\min_{x_{i}} U_{i} (x_{i})$ $s . t . h_{i 1} > h_{i 2} > h_{i 3} > h_{i 4} = 1, d_{i 1} < d_{i 2} < d_{i 3} < d_{i 4}, l_{i 1} = ��, l_{i 2} \in {��, ℙ}, l_{i 3} = ��, l_{i 4} = �� .$

(31) We solve each of the sub-problems optimally and combine their solutions to construct the full semantic and depth labeling of the image. For simplicity in the notations, we will drop the subscript i in the description below.

(32) Each of the sub-problems can be mapped to a 1D chain labeling problem. The chain has 4 nodes where the first node contains the variables of (h.sub.1, d.sub.1, l.sub.1), the second node contains the variables of (h.sub.2, d.sub.2, l.sub.2), the third node contain the variables of (h.sub.3, d.sub.3, l.sub.3), and the fourth node contain the variables of (h.sub.4, d.sub.4, l.sub.4). Utilizing the recursion in the label cost evaluation, conventional dynamic programming algorithm can solve the 1D chain labeling problem with a complexity of O((HDL.Math.H).sup.2), where the product HDL represents the size of label space at each node and the second H is the complexity required for the label cost evaluation at each node. Unfortunately, the complexity is too high for real-time applications.

(33) We describe a variant of the dynamic programming algorithm to reduce the complexity of solving the sub-problem in (9) to O(H.sup.2D) and achieve real-time performance. We first note that some of the variables are known from our street view setup as described in the problem formulation section. We only need to search the values for h.sub.1, h.sub.2, h.sub.3, l.sub.2, d.sub.3. For any combination of h.sub.1 h.sub.2 and l.sub.2, we determine the best combination of d.sub.3 and h.sub.3. In the following, we show that pre-computing the best combination of d.sub.3 and h.sub.3 for any h.sub.1 h.sub.2 and l.sub.2 can be achieved in O(H.sup.2D) time using recursion.

(34) We first observe that the problem in (9) can then be written as

(35) $\min_{h_{1}, h_{2}, l_{2}} {.Math.}_{y = h_{2}}^{h_{1} - 1} E^{A} (i, y, l_{2}) + E^{D} (i, y, d_{2} (h_{1})) + {.Math.}_{y = h_{1}}^{H} E^{A} (i, y, ��) + E^{D} (i, y, d_{1}) + Q_{i} (h_{2})$
where Q.sub.i is an intermediate cost table given by

(36) $Q_{i} (h_{1}, h_{2}) \equiv \min_{\underset{d_{3} > d_{2} (h_{1})}{h_{3}, d_{3}}} {.Math.}_{y = h_{4}}^{h_{3} - 1} E^{A} (i, y, ��) + E^{D} (i, y, \infty) + {.Math.}_{y = h_{3}}^{h_{2} - 1} E^{A} (i, y, ��) + E^{D} (i, y, d_{3}) .$

(37) Note that depth of the second layer object d.sub.2 is a function of h.sub.1 because d.sub.2 can be uniquely determined from h.sub.1 and the ground plane equation. As a result, Q.sub.i depends on both h.sub.1 and h.sub.2.

(38) By integrating E.sup.A(i, y, custom character )+E.sup.D(i, y, d.sub.3) and E.sup.A(i, y, )+E.sup.D(i, y, ∞) along the y direction, the sum given by

(39) $R_{i} (h_{2}, h_{3}, d_{3}) \equiv {.Math.}_{y = h_{4}}^{h_{3} - 1} E^{A} (i, y, ��) + E^{D} (i, y, \infty) + {.Math.}_{y = h_{3}}^{h_{2} - 1} E^{A} (i, y, ��) + E^{D} (i, y, d_{3})$

(40) for all combination of h.sub.2 and h.sub.3 can be computed in O(H.sup.2) time for fixed d.sub.3. We further note that Q.sub.i can be computed via a recursive update rule given by

(41) $Q_{i} (h_{1} + 1, h_{2}) = \min (Q_{i} (h_{1}, h_{2}), \min_{h_{3}} R_{i} (h_{2}, h_{3}, \tilde{d})),$
where {tilde over (d)} is an integer satisfying d.sub.2(h.sub.1)≦{tilde over (d)}<d.sub.2(h.sub.1+1) and is used to ensure that the depth ordering constraint between the second and third layers is met.

(42) Intuitively, we are computing a running min structure along the decreasing depth of the building layer. The recursive update rule allows us to compute Q.sub.i for any h.sub.1 and h.sub.2 in O(H.sup.2D) time. As a result, the complexity of finding the best configuration for a partition can be reduced to O(H.sup.2L+H.sup.2D)=O(H.sup.2D) where H.sup.2L is the time required for searching combinations of h.sub.1 h.sub.2 and l.sub.2. We perform the 1D labeling algorithm to each image column. The overall complexity of the labeling algorithm is O(W H.sup.2 D).

(43) Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

Method for labeling images of street scenes

Assignee

Inventors

Cpc classification

Classification Explorer

G06V40/53

PHYSICS

Classification Explorer

G06V20/56

PHYSICS

International classification

Classification Explorer

G06K9/00

PHYSICS

Classification Explorer

G06K9/46

PHYSICS

Abstract

Claims

Description