Urban remote sensing image scene classification method in consideration of spatial relationships

11710307 · 2023-07-25

Abstract

An urban remote sensing image scene classification method in consideration of spatial relationships is provided and includes the following steps: cutting a remote sensing image into sub-images in an even and non-overlapping manner; performing a visual information coding on each of the sub-images to obtain a feature image Fv; inputting the feature image Fv into a crossing transfer unit to obtain hierarchical spatial characteristics; performing convolution of dimensionality reduction on the hierarchical spatial characteristics to obtain dimensionality-reduced hierarchical spatial characteristics; and performing a softmax model based classification on the dimensionality-reduced hierarchical spatial characteristics to obtain a classification result. The method comprehensively considers the role in classification of two kinds of spatial relationships, namely the regional spatial relationship and the long-range spatial relationship, and designs three paths in a crossing transfer unit for fusing these relationships, thereby obtaining a better urban remote sensing image scene classification result.

Claims

1. An urban remote sensing image scene classification method in consideration of spatial relationships, comprising: step 1, cutting a remote sensing image into sub-images in an even and non-overlapping manner; step 2, performing a visual information coding on each of the sub-images to obtain a feature image Fv; step 3, inputting the feature image Fv into a crossing transfer unit to obtain hierarchical spatial characteristics; step 4, performing convolution of dimensionality reduction on the hierarchical spatial characteristics to obtain dimensionality-reduced hierarchical spatial characteristics; and step 5, performing a softmax model based classification on the dimensionality-reduced hierarchical spatial characteristics to obtain a classification result; wherein the step 1 of cutting a remote sensing image into sub-images in an even and non-overlapping manner comprises that: a large-scale remote sensing image I with a size of M×N is sliding cut into m×n sub-images in the even and non-overlapping manner, each of the sub-images P.sub.i,j is with a size of M/m × N/n, row and column numbers (i, j) of the P.sub.i,j in the I are stored as spatial information, where M, N, m and n are positive integers, 1≤i≤m, and 1≤j≤n; wherein the step 2 of performing a visual information coding on each of the sub-images to obtain a feature image Fv comprises that: a pre-trained deep convolution model is used to perform the visual information coding on each of the sub-images P.sub.i,j to convert the P.sub.i,j into a vector fv.sub.i,j, and thereby the large-scale remote sensing image I is converted into the feature image Fv:
$$Fv = \begin{pmatrix} fv_{1,1} & \cdots & fv_{1,n} \\ \vdots & & \vdots \\ fv_{m,1} & \cdots & fv_{m,n} \end{pmatrix};$$
wherein the crossing transfer unit is used for extraction and fusion of regional spatial relationship and long-range spatial relationship, an extraction formula of the regional spatial relationship is Fr=Conv(Fv)=Fv*W+b, where Fr represents spatial relationship as extracted for analysis, Conv( ) represents a convolution function, W represents a convolution kernel, b represents an offset, and * represents a convolution operation; and the long-range spatial relationship is extracted by a ReNet module based on a recurrent neural network.

2. The urban remote sensing image scene classification method as claimed in claim 1, wherein an input of the crossing transfer unit is the feature image Fv, and an output of the crossing transfer unit is the hierarchical spatial characteristics F.sub.E; the crossing transfer unit uses three paths to extract relationships for analysis and transfer relationships, a first one of the three paths first extracts the regional spatial relationship of the Fv and then extracts the long-range spatial relationship, a second one of the three paths first extracts the long-range spatial relationship of the Fv and then extracts the regional spatial relationship, and a third one of the three paths is a shortcut to transfer the Fv directly to a tail end of the crossing transfer unit without additional processing; and the hierarchical spatial characteristics F.sub.E as output is expressed to be that:
F.sub.E=tanh(ReNet.sup.2(Conv.sup.1(Fv))+Conv.sup.2(ReNet.sup.1(Fv))+Fv) where tanh represents a hyperbolic tangent function, ReNet.sup.1 and ReNet.sup.2 represent two different ReNet modules, and Conv.sup.1 and Conv.sup.2 represent two different convolution modules.

3. The urban remote sensing image scene classification method as claimed in claim 2, wherein in the step 3, the feature image passes through three crossing transfer units in series to obtain the hierarchical spatial characteristics F.sub.M; in the step 4, a convolutional layer conv.sub.1×1 with a size of 1×1 is used for the convolution of dimensionality reduction; in the step 5, a softmax model is used for the classification, and the classification result C.sub.i,j for the P.sub.i,j is expressed as that:
C.sub.i,j=argmax(softmax(conv.sub.1×1(F.sub.M).sub.i,j)) where argmax(x) represents a dimension corresponding to a maximum component of a vector x.

4. The urban remote sensing image scene classification method as claimed in claim 1, wherein in the step 3, the feature image passes through three crossing transfer units in series to obtain hierarchical spatial characteristics F.sub.M; in the step 4, a convolutional layer conv.sub.1×1 with a size of 1×1 is used for the convolution of dimensionality reduction; in the step 5, a softmax model is used for the classification, and the classification result C.sub.i,j for the P.sub.i,j is expressed as that:
C.sub.i,j=argmax(softmax(conv.sub.1×1(F.sub.M).sub.i,j)) where argmax(x) represents a dimension corresponding to a maximum component of a vector x.

5. The urban remote sensing image scene classification method as claimed in claim 1, wherein the ReNet module is used for extracting the long-range spatial relationship from four directions of up, down, left and right along rows and columns of pixels for analysis.

6. The urban remote sensing image scene classification method as claimed in claim 1, wherein in a training process of the softmax model, a loss function is cross-entropy loss, and a back-propagation method is used to optimize parameters of the model.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic flowchart of the method of the invention.

(2) FIG. 2 is a schematic structural diagram of a ReNet module according to an embodiment of the invention.

(3) FIG. 3 is a schematic structural diagram of a crossing transfer unit according to an embodiment of the invention.

(4) FIG. 4 is a schematic flowchart of a data processing according to an embodiment of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

(5) The invention will be further described below in conjunction with embodiments and the drawings, but will not be limited in any way. Any modifications or substitutions made based on the teachings of the invention shall fall within the protection scope of the invention.

(6) To address the inability of conventional remote sensing image analysis methods to analyze the spatial relationships among images, a model that can extract and analyze the spatial relationships among different images is designed. The model can be roughly divided into two parts: a visual information extraction and coding part, and a spatial relationship fusion part. The method of the invention can achieve better scene classification of remote sensing images, such as the distinction of commercial, industrial, residential and institutional lands in the remote sensing images.

(7) Referring to FIG. 1, an embodiment of the invention includes the following steps of:

(8) step 1, cutting a remote sensing image into sub-images in an even and non-overlapping manner;

(9) step 2, performing a visual information coding on each of the sub-images to obtain a feature image Fv;

(10) step 3, inputting the feature image Fv into a crossing transfer unit to obtain hierarchical spatial characteristics;

(11) step 4, performing convolution of dimensionality reduction on the hierarchical spatial characteristics to obtain dimensionality-reduced hierarchical spatial characteristics; and

(12) step 5, performing a softmax model based classification on the dimensionality-reduced hierarchical spatial characteristics to obtain a classification result.

(13) In the step 1, as to the illustrated embodiment, in order to retain spatial relationships in the remote sensing image, a large-scale remote sensing image I with a size of M×N is sliding cut into m×n sub-images in the even and non-overlapping manner. Each sub-image P.sub.i,j has a size of M/m × N/n, and the row and column numbers (i, j) of the P.sub.i,j in the I are stored as spatial information, where M, N, m and n are positive integers, 1≤i≤m, and 1≤j≤n.
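The even, non-overlapping cut of step 1 can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the embodiment: the function name `cut_into_subimages`, the nested-list image representation and the toy 4×6 image are assumptions made only for this example.

```python
def cut_into_subimages(image, m, n):
    """Split an M x N image (a list of rows) into m x n equal, non-overlapping tiles.

    Returns a dict that maps the 1-based grid position (i, j) to its tile, so
    the row and column numbers are kept as spatial information.
    """
    M, N = len(image), len(image[0])
    if M % m or N % n:
        raise ValueError("M and N must be divisible by m and n for an even cut")
    tile_h, tile_w = M // m, N // n  # each tile has size (M/m) x (N/n)
    tiles = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            rows = image[(i - 1) * tile_h : i * tile_h]
            tiles[(i, j)] = [row[(j - 1) * tile_w : j * tile_w] for row in rows]
    return tiles


# Toy 4 x 6 image whose pixel value encodes its own row and column (hypothetical data).
image = [[10 * r + c for c in range(6)] for r in range(4)]
tiles = cut_into_subimages(image, m=2, n=3)
print(tiles[(2, 3)])  # bottom-right tile: [[24, 25], [34, 35]]
```

Keeping the tiles in a dictionary keyed by (i, j) is one simple way to preserve the position of each sub-image for the later spatial modeling.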

(14) In the step 2, for each sub-image P.sub.i,j a visual information coding operation is applied thereto by using a pre-trained deep convolution model, so that the P.sub.i,j is converted into a vector fv.sub.i,j and finally the I is converted into a feature image Fv:

(15) $$Fv = \begin{pmatrix} fv_{1,1} & \cdots & fv_{1,n} \\ \vdots & & \vdots \\ fv_{m,1} & \cdots & fv_{m,n} \end{pmatrix}, \quad fv_{i,j} \in \mathbb{R}^{c}$$  formula (1)

(16) After the Fv is obtained, each fv.sub.i,j may be treated as a pixel, so that the classification problem of P.sub.i,j in the I is transformed into a semantic segmentation problem of Fv. Considering that scenes are distributed either over areas (airport, residential area, etc.) or along lines (road, river, etc.), the illustrated embodiment mainly considers two kinds of spatial relationships, i.e., the regional spatial relationship and the long-range spatial relationship. The modeling of spatial relationships includes the following three aspects.

(17) Aspect 1, regional spatial relationship modeling

(18) For the Fv, the regional spatial relationship may be understood as a relationship between fv.sub.i,j and the vectors in a certain neighborhood area thereof. A convolutional neural network model can extract and fuse relationships in a certain neighborhood area through the convolution operation, so as to achieve the purpose of regional spatial relationship modeling. Therefore, the method of the invention adopts the convolution model in the analysis of the regional spatial relationship. Assuming that W represents a convolution kernel, b represents an offset, and Fr represents the spatial relationship as extracted for analysis, a one-layer convolution model can be expressed as:
Fr=Conv(Fv)=Fv*W+b  formula (2)
where the asterisk (*) indicates a convolution operation.
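As an illustration of the one-layer convolution model, the sketch below applies a zero-padded 3×3 kernel to a single-channel grid so that the output keeps the input size. It is a simplified example under stated assumptions: the real Fv has many channels per cell, the helper name `conv2d_same` and the averaging kernel are hypothetical, and the operation is written in the cross-correlation form commonly used in deep learning libraries.

```python
def conv2d_same(fv, kernel, bias=0.0):
    """Zero-padded 2D convolution (cross-correlation form) that keeps the grid size.

    fv is a list of rows of floats; kernel is a square list of lists with odd side.
    """
    rows, cols = len(fv), len(fv[0])
    radius = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = bias
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ii, jj = i + di, j + dj
                    # Cells outside the grid count as zero (zero padding).
                    if 0 <= ii < rows and 0 <= jj < cols:
                        acc += fv[ii][jj] * kernel[di + radius][dj + radius]
            out[i][j] = acc
    return out


fv = [[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0],
      [7.0, 8.0, 9.0]]
mean_kernel = [[1.0 / 9] * 3 for _ in range(3)]  # hypothetical averaging kernel W
fr = conv2d_same(fv, mean_kernel, bias=0.0)
```

Each output cell only depends on its 3×3 neighborhood, which is why this operation models the regional relationship and not the long-range one.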

(19) Aspect 2, long-range spatial relationship modeling

(20) A structural diagram of a ReNet module is shown in FIG. 2. For the Fv, the long-range spatial relationship can be understood as a relationship between fv.sub.i,j and the vectors in the same row and column. A recurrent neural network is widely used in sequence models, and its staged information processing structure can comprehensively analyze context information. Since the fv in the same row or in the same column can be treated as sequential data, the illustrated embodiment introduces the ReNet module based on a recurrent neural network. The ReNet module can extract and analyze the long-range spatial relationship from four directions (up, down, left and right) along the rows and columns of pixels. Experiments show that its performance on some public data sets can reach the level of convolutional neural networks (Reference document: VISIN F, KASTNER K, CHO K, et al., ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks [J], arXiv preprint arXiv:1505.00393, 2015).
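The four-direction sweep can be illustrated with a deliberately simplified sketch. The assumptions are significant: each cell holds a single scalar, one pair of hypothetical weights (`w_in`, `w_rec`) is shared by all directions, and the forward and backward states are summed so that the grid shape is preserved. The ReNet module of the cited reference uses vector-valued recurrent units with separate parameters per direction, so this is only a toy model of the idea.

```python
import math


def sweep(seq, w_in, w_rec):
    """Run a one-unit tanh recurrence over a sequence and return all hidden states."""
    h, states = 0.0, []
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional(seq, w_in, w_rec):
    """Sum of a forward sweep and a backward sweep over the same sequence."""
    forward = sweep(seq, w_in, w_rec)
    backward = sweep(seq[::-1], w_in, w_rec)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def renet_layer(grid, w_in=0.5, w_rec=0.5):
    """Sweep left/right along every row, then up/down along every column."""
    rows = [bidirectional(row, w_in, w_rec) for row in grid]
    cols = [bidirectional(list(col), w_in, w_rec) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major order


# A single non-zero cell in the top-left corner of a 4 x 4 grid.
grid = [[0.0] * 4 for _ in range(4)]
grid[0][0] = 1.0
out = renet_layer(grid)
```

In this toy run the bottom-right cell `out[3][3]` becomes non-zero although it is far outside any 3×3 neighborhood of the top-left cell, which is the kind of long-range dependency the row and column sweeps are meant to capture.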

(21) Aspect 3, spatial relationship fusion modeling

(22) A structural diagram of a crossing transfer unit is shown in FIG. 3, where ⊕ represents addition. The illustrated embodiment designs the crossing transfer unit (CTU) to realize a fusion of the regional spatial relationship with the long-range spatial relationship. The CTU uses feature images as input (Fv) and output (F.sub.E), and adopts three paths for relationship extraction (for analysis) and transfer. A first path first extracts the regional spatial relationship of Fv and then extracts the long-range spatial relationship; a second path is reversed, i.e., it first extracts the long-range spatial relationship of Fv and then extracts the regional spatial relationship; and a third path is a shortcut that transfers Fv directly to a tail end of the CTU without additional processing. Experiments show that adding a direct transfer path can speed up the convergence of the model (Reference document: He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C] Proceedings of the IEEE conference on computer vision and pattern recognition, 2016:770-778). A final output result F.sub.E may be expressed as:
F.sub.E=tanh(ReNet.sup.2(Conv.sup.1(Fv))+Conv.sup.2(ReNet.sup.1(Fv))+Fv)  formula (3)
where tanh is a hyperbolic tangent function, ReNet.sup.1 and ReNet.sup.2 represent two ReNet modules with different parameters, Conv.sup.1 and Conv.sup.2 represent two convolution modules with different parameters.
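The three-path structure of the CTU can be sketched as a plain composition of callables. This is an illustrative sketch only: the function name `crossing_transfer_unit` is hypothetical, the grid is single-channel, and identity functions stand in for the trained convolution and ReNet modules so that the wiring of the paths is easy to check.

```python
import math


def add(a, b):
    """Element-wise sum of two grids of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def crossing_transfer_unit(fv, conv1, conv2, renet1, renet2):
    path1 = renet2(conv1(fv))  # first path: regional, then long-range
    path2 = conv2(renet1(fv))  # second path: long-range, then regional
    fused = add(add(path1, path2), fv)  # third path: shortcut adds Fv unchanged
    return [[math.tanh(v) for v in row] for row in fused]


def identity(grid):
    """Stand-in module that returns a copy of its input (assumption for the demo)."""
    return [row[:] for row in grid]


fv = [[0.1, 0.2], [0.3, 0.4]]
fe = crossing_transfer_unit(fv, identity, identity, identity, identity)
# With identity stand-ins every path returns Fv, so fe equals tanh(3 * Fv) element-wise.
```

Because all three paths are summed before the tanh, every module must return a grid with the same shape as Fv.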

(23) After passing through three CTUs in series, a result of the modeling of spatial relationships is recorded as F.sub.M. The illustrated embodiment uses a convolutional layer conv.sub.1×1 with a size of 1×1 to perform convolution of dimensionality reduction on F.sub.M, and uses a softmax model to perform classification. Finally, a classification result C.sub.i,j for P.sub.i,j can be expressed as:
C.sub.i,j=argmax(softmax(conv.sub.1×1(F.sub.M).sub.i,j))  formula (4)
where argmax(x) represents a dimension corresponding to a maximum component of a vector x.
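The per-cell decision rule can be illustrated with a short sketch. It assumes that the 1×1 convolution has already produced one score vector per sub-image; the helper names `softmax` and `classify` and the toy scores are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of scores."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(score_grid):
    """score_grid[i][j] is the class-score vector for sub-image (i, j)."""
    labels = []
    for row in score_grid:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            # argmax: index of the largest probability
            label_row.append(max(range(len(probs)), key=probs.__getitem__))
        labels.append(label_row)
    return labels


scores = [[[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]]  # toy 1 x 2 grid with 3 classes
labels = classify(scores)
print(labels)  # [[0, 2]]
```

Since softmax is monotonic, the argmax of the probabilities equals the argmax of the raw scores; the softmax matters mainly for the training loss.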

(24) In a training process of the softmax model, a loss function is cross-entropy loss, and a back-propagation method is used to optimize parameters of the model. A basic flowchart of data processing is shown in FIG. 4.
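For reference, the cross-entropy loss over a grid of sub-images can be written as in the following sketch. The function names and the averaging over all cells are assumptions of this example; the source does not state how the per-cell losses are aggregated.

```python
import math


def cross_entropy(scores, target):
    """Negative log-probability of class index `target` under softmax(scores)."""
    peak = max(scores)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in scores))
    return log_sum_exp - scores[target]


def mean_loss(score_grid, label_grid):
    """Average cross-entropy over all cells (the averaging is an assumption)."""
    losses = [
        cross_entropy(scores, label)
        for score_row, label_row in zip(score_grid, label_grid)
        for scores, label in zip(score_row, label_row)
    ]
    return sum(losses) / len(losses)


uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=1)    # equals log(4)
confident = cross_entropy([0.0, 8.0, 0.0, 0.0], target=1)  # close to zero
grid_loss = mean_loss([[[0.0, 0.0], [0.0, 0.0]]], [[0, 1]])  # equals log(2)
```

The loss falls as the score of the correct class rises relative to the others, which is what back-propagation exploits when it updates the parameters.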

(25) A data set used in an experiment is a CSU-RESISC10 data set, and a distribution of training and testing samples of the data set after preprocessing is shown in Table 1.

(26) TABLE 1

| Scene Classes | Road | Commercial Area | Industrial Area | Residential Area | Construction Land | Institutional Land | Port | Waters | Public Place | Airport |
|---|---|---|---|---|---|---|---|---|---|---|
| Test set | 17129 | 6768 | 1588 | 39806 | 530 | 1948 | 5331 | 12304 | 11587 | 3009 |
| Validation set | 2480 | 512 | 506 | 5728 | 22 | 386 | 665 | 1765 | 2642 | 494 |

(27) Each 2000×2000 remote sensing image I in the CSU-RESISC10 data set is first cut into 20×20 sub-images P.sub.i,j in an even and non-overlapping manner, and each sub-image has a size of 100×100.

(28) For each sub-image P.sub.i,j, an Xception model pre-trained on the CSU-RESISC10 is used to perform the visual information coding, so that the P.sub.i,j is converted into a 2048-dimensional vector fv.sub.i,j. Finally, the I is converted into a feature image Fv∈R.sup.20×20×2048. In order to reduce the amount of calculation, before proceeding to the next calculation, a convolution with a size of 1×1 is used to reduce the Fv to 512 dimensions.
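A 1×1 convolution acts as the same linear map applied independently to the feature vector of every cell, which is why it can reduce the channel count without touching the grid size. The sketch below shows this with toy sizes (4 channels reduced to 2) in place of 2048 and 512; the helper name `conv1x1` and the weights are hypothetical.

```python
def conv1x1(grid, weights, bias):
    """Apply one linear map to the feature vector of every cell.

    grid[i][j] is a list of c_in floats, weights is c_out x c_in,
    and bias has c_out entries.
    """
    def project(vec):
        return [
            sum(w * x for w, x in zip(w_row, vec)) + b
            for w_row, b in zip(weights, bias)
        ]

    return [[project(vec) for vec in row] for row in grid]


# Toy 2 x 2 grid with 4 channels per cell, reduced to 2 channels.
grid = [[[1.0, 2.0, 3.0, 4.0] for _ in range(2)] for _ in range(2)]
weights = [[1.0, 0.0, 0.0, 0.0],       # keeps the first channel
           [0.25, 0.25, 0.25, 0.25]]   # averages all four channels
bias = [0.0, 0.0]
reduced = conv1x1(grid, weights, bias)
print(reduced[0][0])  # [1.0, 2.5]
```

The spatial layout (2×2 here, 20×20 in the embodiment) is unchanged; only the length of each cell vector shrinks.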

(29) The pre-training is carried out with 50 batches, with a learning rate of 10.sup.−5 and an attenuation rate of 0.98. The result of the pre-training is given in the first data row of Table 2 below.

(30) During the modeling of the spatial relationships, the embodiment of the invention keeps the sizes of all output feature images the same as those of the input feature images by adding edge compensation and controlling convolution kernel compensation. In order to fully extract the spatial relationship of fv.sub.i,j, the illustrated embodiment of the invention uses three CTU modules to progressively extract hierarchical spatial characteristics. A final output of the spatial relationship modeling is F.sub.M∈R.sup.20×20×512.

(31) The illustrated embodiment finally carries out classification as per the above formula (4).

(32) The cross-entropy is used as the loss function in the model training. The model of the illustrated embodiment of the invention is trained with 100 batches, with a learning rate of 10.sup.−5 and an attenuation rate of 0.98. The model converges after about 15 batches of training.
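The source does not say how the attenuation rate is applied. One common reading, assumed here purely for illustration, is an exponential decay in which the learning rate is multiplied by 0.98 once per training step; the function name `decayed_learning_rate` is hypothetical.

```python
def decayed_learning_rate(step, base_lr=1e-5, decay=0.98):
    """Exponential decay (an assumption): multiply the rate by `decay` once per step."""
    return base_lr * decay ** step


# Learning rate for each of the 100 training steps under this assumption.
schedule = [decayed_learning_rate(k) for k in range(100)]
ratio_at_15 = schedule[15] / schedule[0]  # fraction of the initial rate left at step 15
```

Under this assumption the rate at step 15, around where the model is reported to converge, is still roughly 74% of its initial value.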

(33) In order to verify the effectiveness of the invention, in addition to the illustrated embodiment, SPP-Net+MKL, Discriminative CNNs and a traditional natural image classification model Xception (Reference document: Chollet F, Xception: Deep learning with depthwise separable convolutions [C] Proceedings of the IEEE conference on computer vision and pattern recognition, 2017: 1251-1258) are additionally selected for comparison. Classification experiments are carried out on the CSU-RESISC10 data set, and the F1 score and the Kappa coefficient (κ) are selected as evaluation metrics.
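For readers unfamiliar with the two metrics, the sketch below computes the per-class F1 score and Cohen's Kappa coefficient from a confusion matrix using their standard definitions. The two-class confusion matrix is made-up toy data and is unrelated to the experiments reported here.

```python
def f1_per_class(cm):
    """cm[t][p] = number of samples of true class t predicted as class p."""
    k = len(cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[t][c] for t in range(k)) - tp  # predicted c, actually another class
        fn = sum(cm[c]) - tp                       # actually c, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def cohen_kappa(cm):
    """Agreement between prediction and truth, corrected for chance agreement."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[c][c] for c in range(k)) / total
    expected = sum(
        sum(cm[c]) * sum(cm[t][c] for t in range(k)) for c in range(k)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


cm = [[40, 10],
      [5, 45]]  # toy confusion matrix (hypothetical)
f1 = f1_per_class(cm)    # [80/95, 90/105]
kappa = cohen_kappa(cm)  # 0.7 up to floating-point rounding
```

Kappa discounts the agreement expected by chance, which makes it more informative than raw accuracy when class sizes are as unbalanced as in Table 1.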

(34) TABLE 2: F1 score per scene class and Kappa coefficient (κ)

| Methods | Road | Commercial Area | Industrial Area | Residential Area | Construction Land | Institutional Land | Port | Waters | Public Place | Airport | κ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Xception | 0.8131 | 0.3922 | 0.3541 | 0.8640 | 0.3793 | 0.2838 | 0.8615 | 0.9380 | 0.8340 | 0.8421 | 0.7638 |
| SPP-Net+MKL | 0.8133 | 0.4293 | 0.4680 | 0.8734 | 0.3750 | 0.1746 | 0.8265 | 0.9109 | 0.8260 | 0.8566 | 0.7624 |
| Discriminative CNNs | 0.8434 | 0.3723 | 0.4912 | 0.8802 | 0.4000 | 0.2639 | 0.8239 | 0.9273 | 0.8422 | 0.8057 | 0.7731 |
| the invention | 0.8329 | 0.6030 | 0.7643 | 0.9014 | 0.4400 | 0.6218 | 0.9239 | 0.9598 | 0.8841 | 0.9648 | 0.8410 |

(35) The experimental results show that, due to the complexity of scenes, commercial, industrial, residential and institutional lands cannot be distinguished well from a single remote sensing image alone. For the three methods used for comparison, κ is less than 0.78. Because the spatial relationships among images are taken into consideration, the method of the embodiment of the invention improves Kappa by 10.1%, 10.3% and 8.8% in relative terms over the three comparative methods respectively.
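The relative improvements follow directly from the κ column of Table 2, computed as (κ of the invention − κ of the baseline) / κ of the baseline. The short check below reproduces the arithmetic; the variable names are chosen only for this example.

```python
kappa_invention = 0.8410
kappa_baselines = {
    "Xception": 0.7638,
    "SPP-Net+MKL": 0.7624,
    "Discriminative CNNs": 0.7731,
}

# Relative improvement in percent, rounded to one decimal place.
improvements = {
    name: round(100 * (kappa_invention - kappa) / kappa, 1)
    for name, kappa in kappa_baselines.items()
}
print(improvements)
# {'Xception': 10.1, 'SPP-Net+MKL': 10.3, 'Discriminative CNNs': 8.8}
```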