REGULARIZED MULTI-METRIC ACTIVE LEARNING SYSTEM FOR IMAGE CLASSIFICATION

20200151518 · 2020-05-14

Abstract

A regularized multi-metric active learning (AL) image classification system which includes three main parts. First, a regularized multi-metric learning process is utilized to jointly learn distinct metrics for different types of image features from remotely sensed image data. The regularizer incorporates the unlabeled data based on the neighborhood relationship, which helps avoid overfitting at early stages of AL, when the quantity of training data is particularly small. Then, as AL proceeds, the regularizer is also updated through similarity propagation, thus taking advantage of informative labeled samples. Finally, multiple features are projected into a common feature space, in which a batch-mode AL strategy combining uncertainty and diversity is utilized in conjunction with k-nearest neighbor (kNN) classification to enrich the set of labeled samples.

Claims

1. A method of processing remotely sensed digital images, comprising: receiving remotely sensed input image data, the input image data comprising a plurality of image features; regularizing the input image data by applying a regularized heterogeneous multi-metric learning (HMML) active learning process which incorporates unlabeled data in the input data based on neighborhood relationships within the input data; updating similarity matrices in the HMML active learning process by incorporating supervised information in the dataset and iterating said regularizing to again regularize the input image data; and further processing the image data to output an image classification map, wherein a set of unlabeled samples having a maximum degree of uncertainty are first considered based on an uncertainty criterion, after which a diversity criterion is applied to select the most informative samples from a resulting contention pool.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] The above and other objects, features, and advantages of various examples will become more apparent when taken in conjunction with the following description and drawings wherein identical reference numerals have been used, where possible, to designate identical features that are common to the figures, and wherein:

[0007] FIG. 1 is a flow diagram illustrating a process for processing remotely sensed image data according to one embodiment.

[0008] FIG. 2 is a summarized description of the steps performed in the diagram of FIG. 1.

[0009] FIG. 3 illustrates a system for processing remotely sensed image data using the process of FIG. 1 according to one embodiment.

DETAILED DESCRIPTION

[0010] The term "drawings" as used herein refers to drawings attached herewith and to sketches, drawings, illustrations, photographs, or other visual representations found in this disclosure. The terms "I," "we," "our," and the like throughout this disclosure do not refer to any specific individual or group of individuals.

[0011] The present disclosure provides a system and method for processing remotely sensed input image data to produce high quality output image maps. The sensed input image data is typically high dimensional data, such as multi/hyperspectral, LiDAR, or RGB+Texture data. The method comprises three main parts, which are illustrated in the flow diagram of FIG. 1. As shown, multi-feature input image data 102 is received and directed to a regularized heterogeneous multi-metric learning (HMML) processing block 104. After being processed by the block 104, the data is directed to a kNN classifier block 106. Block 106 refines the regularizer, applying a kNN classifier with an updated set of similarity matrices that incorporates supervised information from the data set to reflect the real similarity between data pairs. The data is then processed by block 108, which applies an active learning (AL) process wherein a set of unlabeled samples having a maximum degree of uncertainty are first considered based on an uncertainty criterion, after which a diversity criterion is applied to select the most informative samples from a resulting contention pool. The output of block 108 is directed to block 110, where training samples are incorporated and directed again to blocks 104 and 106, in addition to block 112, which updates the regularizer as shown and discussed further below.
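The iteration among blocks 104-112 can be sketched as a simple control loop. The following is a minimal, hypothetical Python sketch, not the disclosed implementation: the metric-learning step (block 104) is omitted, a small committee of kNN classifiers stands in for blocks 106-108, and `y_true` plays the role of the user supplying labels in block 110; all function and variable names are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_lab, y_lab, x, k):
    # plain Euclidean kNN vote on a single sample
    nn = np.argsort(np.sum((X_lab - x) ** 2, axis=1))[:k]
    return Counter(y_lab[nn].tolist()).most_common(1)[0][0]

def al_loop(X, y_true, init_idx, n_iters=3, batch=1, ks=(1, 3)):
    """Control-flow sketch of FIG. 1: each iteration scores unlabeled
    samples by committee disagreement (block 108) and moves the most
    uncertain ones into the training set (block 110)."""
    lab = set(init_idx)
    for _ in range(n_iters):
        unlab = [i for i in range(len(X)) if i not in lab]
        if not unlab:
            break
        X_lab = X[sorted(lab)]
        y_lab = y_true[sorted(lab)]
        # uncertainty = number of distinct committee predictions
        scores = {i: len({knn_predict(X_lab, y_lab, X[i], k) for k in ks})
                  for i in unlab}
        picked = sorted(unlab, key=lambda i: -scores[i])[:batch]
        lab.update(picked)            # the user labels these samples
    return sorted(lab)
```

In a full implementation, the metric learning and regularizer updates (blocks 104, 106, and 112) would run inside the loop before each query step.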

[0012] Let $X=\{[x_i^1, x_i^2, \ldots, x_i^Q]\}_{i=1}^{n}$ denote a set of $n$ samples with $Q$ different types of features, where $x_i^q \in \mathbb{R}^{d^q}$ represents the $i$th sample from the $q$th feature type and $d^q$ is the dimensionality of the corresponding feature space. Similarly, let $L=\{([x_i^1, x_i^2, \ldots, x_i^Q], y_i)\}_{i=1}^{l}$ be a set of training samples, constructed by selecting $l$ samples from the set $X$ with corresponding class labels, and let $U=\{[x_i^1, x_i^2, \ldots, x_i^Q]\}_{i=l+1}^{l+u}$ be the set of the remaining unlabeled samples. To deal with the high dimensional input data $X$, LMNN, a single-feature-type strategy, is adapted to a multi-type feature setting and referred to as HMML. The distance between two training samples is defined by considering all the features:

[00001]
$$d(x_i, x_j) = \sum_{q=1}^{Q} d\!\left(x_i^q, x_j^q\right) = \sum_{q=1}^{Q} \left\| U^q \left( x_i^q - x_j^q \right) \right\|^2 \qquad (1)$$

where $U^q \in \mathbb{R}^{r^q \times d^q}$ corresponds to a transformation matrix for the $q$th feature type, and $d^q$ and $r^q$ are the input and output dimensionalities of the $q$th feature type, respectively. Also, for a labeled sample $(x_i, y_i)$, we denote $(x_j, y_j)$ as one of the kNNs of $x_i$ with label $y_j = y_i$, and $(x_l, y_l)$ as any sample with label $y_l \ne y_i$. The two-term loss function can then be formulated as

[00002]
$$\mathcal{E} = (1-\mu)\,\mathcal{E}_{\mathrm{pull}} + \mu\,\mathcal{E}_{\mathrm{push}} \qquad (2)$$

$$\mathcal{E}_{\mathrm{pull}} = \sum_{i,j} \sum_{q=1}^{Q} \left\| U^q \left( x_i^q - x_j^q \right) \right\|^2, \qquad \mathcal{E}_{\mathrm{push}} = \sum_{i,j,l} \left[ 1 + \sum_{q=1}^{Q} \left\| U^q \left( x_i^q - x_j^q \right) \right\|^2 - \sum_{q=1}^{Q} \left\| U^q \left( x_i^q - x_l^q \right) \right\|^2 \right]_+ \qquad (3)$$

[0013] where $[\,\cdot\,]_+ = \max(\cdot, 0)$ is the hinge loss. The term $\mathcal{E}_{\mathrm{pull}}$ acts to pull neighboring samples with the same label closer together, while the term $\mathcal{E}_{\mathrm{push}}$ pushes differently labeled samples further apart. The two terms are combined using a weighting parameter $\mu$. The multiple metrics are coupled via the hinge loss and learned jointly from the training data, thus allowing the information from multiple features to be fused. Note that HMML degenerates to LMNN when $Q=1$. HMML exploits only the labeled information for feature reduction, which is likely to overfit with a small training set. Incorporating the abundant unlabeled samples into the learning process is important, since they can provide information on the underlying data distribution and thus help avoid overfitting. For multi-metric learning, an unsupervised regularizer is constructed as follows:

[00003]
$$\mathcal{E}_{\mathrm{reg}} = \frac{1}{2} \sum_{q=1}^{Q} \sum_{i,j=1}^{n} W_{ij}^q \left\| U^q \left( x_i^q - x_j^q \right) \right\|^2 = \sum_{q=1}^{Q} \mathrm{tr}\!\left( X^q L^q X^{qT} U^{qT} U^q \right) \qquad (4)$$

$$W_{ij}^q = \begin{cases} 1, & x_j^q \in N(x_i^q) \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

[0014] For the $q$th type of feature in equation (4), tr refers to the trace operator; $X^q = [x_1^q, \ldots, x_n^q] \in \mathbb{R}^{d^q \times n}$ represents the sample matrix; $L^q = D^q - W^q$ is a Laplacian matrix; and $D^q$ is a diagonal matrix whose diagonal elements are computed by

[00004]
$$D_{ii}^q = \sum_{j=1}^{n} W_{ij}^q; \qquad W^q = \left[ W_{ij}^q \right]_{i,j=1}^{n}$$

is the kNN graph matrix, which represents the similarities between all sample pairs, and $N(x_i^q)$ denotes the neighborhood of data point $x_i^q$ based on the Euclidean metric. Note that $W^q$ is in general asymmetric, since $W_{ij}^q$ need not equal $W_{ji}^q$. The loss function in equation (2) is augmented by including the proposed regularizer into HMML, and the objective function of RegHMML is then defined as:


$$\mathcal{E}_{\mathrm{obj}} = (1-\mu)\,\mathcal{E}_{\mathrm{pull}} + \mu\,\mathcal{E}_{\mathrm{push}} + \gamma\,\mathcal{E}_{\mathrm{reg}} \qquad (6)$$

where $\gamma$ is a tradeoff parameter between the loss function and the regularizer.
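To make the pieces concrete, the RegHMML objective of equations (2)-(6) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the disclosed implementation: the target pairs, impostor triplets, and kNN graph matrices are taken as given, and `mu` and `gamma` correspond to the weighting and tradeoff parameters above; all function names are illustrative.

```python
import numpy as np

def dist(Us, feats, i, j):
    # summed squared projected distance of eq. (1) over all feature types
    return sum(float(np.sum((U @ (X[:, i] - X[:, j])) ** 2))
               for U, X in zip(Us, feats))

def reg_hmml_objective(Us, feats, targets, triplets, Ws, mu=0.5, gamma=0.1):
    """E_obj of eq. (6): (1-mu)*E_pull + mu*E_push + gamma*E_reg.

    Us      : Q projection matrices U^q (r_q x d_q)
    feats   : Q sample matrices X^q (d_q x n), one column per sample
    targets : pairs (i, j) with j a same-label kNN of i
    triplets: triplets (i, j, l) with y_j = y_i and y_l != y_i
    Ws      : Q kNN graph matrices W^q of eq. (5)
    """
    pull = sum(dist(Us, feats, i, j) for i, j in targets)      # eq. (3), pull
    push = sum(max(0.0, 1.0 + dist(Us, feats, i, j)            # eq. (3), push
                        - dist(Us, feats, i, l))
               for i, j, l in triplets)
    reg = 0.0                                                  # eq. (4)
    for U, X, W in zip(Us, feats, Ws):
        L = np.diag(W.sum(axis=1)) - W        # Laplacian L^q = D^q - W^q
        reg += float(np.trace(X @ L @ X.T @ U.T @ U))
    return (1.0 - mu) * pull + mu * push + gamma * reg
```

With all $U^q$ set to zero, the pull and regularizer terms vanish and each impostor triplet contributes a hinge value of one, which is a convenient sanity check.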

[0015] In a metric-active learning framework, an important step is to obtain a reduced feature space at each AL iteration. For this purpose, equation (6) should be minimized with respect to $\{U^q\}_{q=1}^{Q}$, and the projection matrix $U^q$ should be constrained to be rectangular of size $r^q \times d^q$ with $r^q \ll d^q$. At each AL iteration, a gradient descent approach is used to solve this optimization problem. The gradient with respect to $U^q$ is

[00005]
$$\frac{\partial \mathcal{E}_{\mathrm{obj}}}{\partial U^q} = 2(1-\mu)\, U^q \sum_{i,j} C_{ij}^q \;+\; 2\mu\, U^q \!\!\sum_{(i,j,l) \in N_{\mathrm{tri}}}\!\! \left( C_{ij}^q - C_{il}^q \right) \;+\; 2\gamma\, U^q X^q L^q X^{qT} \qquad (7)$$

where $C_{ij}^q = (x_i^q - x_j^q)(x_i^q - x_j^q)^T$, and $N_{\mathrm{tri}}$ represents the set of triplets $(i, j, l)$ that trigger the hinge loss in equation (3). After learning the projection matrix set $\{U^q\}_{q=1}^{Q}$, kNN classification is performed based on the distance metric defined in equation (1). Having obtained the projection matrices, a sample with different types of features can be represented in a lower dimensional feature space as $\hat{x}_i = U x_i$, where $U$ is a block diagonal matrix with $\{U^q\}_{q=1}^{Q}$ as the block entries. The resulting feature space, in which the AL query is applied, is then $\hat{x}_i = [U^1 x_i^1, U^2 x_i^2, \ldots, U^Q x_i^Q]$.
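As a hedged sketch of this step, the per-feature gradient of equation (7) and the common-space projection can be written as follows; the set of triggered triplets is assumed precomputed, and a plain descent step `Uq -= lr * grad_Uq(...)` with an illustrative learning rate would complete the update.

```python
import numpy as np

def grad_Uq(Uq, Xq, Lq, targets, triplets, mu=0.5, gamma=0.1):
    """Gradient of eq. (7) for one feature type q; the outer products
    C^q(i, j) = (x_i - x_j)(x_i - x_j)^T are formed on the fly."""
    def C(i, j):
        d = (Xq[:, i] - Xq[:, j])[:, None]
        return d @ d.T
    A = np.zeros((Xq.shape[0], Xq.shape[0]))
    for i, j in targets:                       # pull part
        A += 2.0 * (1.0 - mu) * C(i, j)
    for i, j, l in triplets:                   # triggered hinge part
        A += 2.0 * mu * (C(i, j) - C(i, l))
    return Uq @ A + 2.0 * gamma * Uq @ Xq @ Lq @ Xq.T

def project(Us, xs):
    """x_hat = [U^1 x^1; ...; U^Q x^Q]: the block-diagonal U applied to a
    multi-feature sample, giving the common reduced space for the AL query."""
    return np.concatenate([U @ x for U, x in zip(Us, xs)])
```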

[0016] Next, the regularizer is refined via similarity propagation as follows. In the regularizer, kNN graph based similarities $\{W^q\}_{q=1}^{Q}$ for all feature types are constructed to provide a smoothness measure for data neighborhoods and help avoid overfitting. However, in an AL framework, the fixed unsupervised similarities may not be suitable for the classification task. This is because 1) they may not connect the actually similar sample pairs, e.g., samples within the same class; and 2) unsupervised information becomes less important as more labeled samples are iteratively added into the training set. Therefore, instead of using the fixed similarities $\{W^q\}_{q=1}^{Q}$, the system learns a new set of similarity matrices $\{\tilde{W}^q\}_{q=1}^{Q}$ which can reflect the real similarity between data pairs by incorporating supervised information. Supervised information is information which has been selected by a user as representative of suitable training data. A strong similarity matrix, constructed based on the labeled information, is defined as $S^{(0)} \in \mathbb{R}^{n \times n}$, where $S_{ii}^{(0)} = 1$ for any $i$, $S_{ij}^{(0)} = 1$ for samples within the same class, and all other elements are zero. Therefore, we have the same $S^{(0)}$ for all types of features. Then, for each feature type, we regard the 1-elements as original positive energies and propagate these energies to the 0-elements in $S^{(0)}$, following the paths built in the feature-specific weak similarity matrices $\{W^q\}_{q=1}^{Q}$. For the $q$th feature type, for example, the similarity propagation can be formulated as

[00006]
$$S_i^{q(t+1)} = (1-\alpha)\, S_i^{(0)} + \alpha\, \frac{\sum_{j=1}^{n} W_{ij}^q S_j^{q(t)}}{\sum_{j=1}^{n} W_{ij}^q} \qquad (8)$$

where $S_i^{q(t)}$ denotes the $i$th row of matrix $S^q$ at the $t$th time stamp, and $\alpha$, restricted by $0 < \alpha < 1$, is a parameter indicating the relative amount of the information a sample receives from its neighbors versus its supervised information. Equation (8) can be written in matrix form as

[00007]
$$S^{q(t+1)} = (1-\alpha)\, S^{(0)} + \alpha\, P^q S^{q(t)} \qquad (9)$$

where $P^q = (D^q)^{-1} W^q$ is the transition probability matrix widely used in Markov random walk models. Since $0 < \alpha < 1$ and the eigenvalues of $P^q$ are in $[-1, 1]$, $S^{q(t)}$ converges and its limit can be directly calculated as

[00008]
$$S^{q+} = \lim_{t \to \infty} S^{q(t)} = (1-\alpha)\left( I - \alpha P^q \right)^{-1} S^{(0)} \qquad (10)$$

Then, the new similarity matrix for the $q$th feature type can be built by exploiting symmetry in the converged similarity matrix $S^{q+}$ and removing small values (entries whose absolute values are smaller than a pre-defined threshold $\epsilon$). The resulting similarity matrix is

[00009]
$$\tilde{W}^q = \left[ \frac{S^{q+} + \left( S^{q+} \right)^{T}}{2} \right]_{\epsilon} \qquad (11)$$

where $[\,\cdot\,]_{\epsilon}$ sets to zero the entries whose absolute values are smaller than $\epsilon$.

However, since the computational overhead of the matrix inversion is $O(n^3)$, it is very time consuming to calculate $(I - \alpha P)^{-1}$ with direct methods for large scale images. Considering

[00010]
$$\left( I - \alpha P \right)^{-1} = I + \sum_{k=1}^{\infty} \alpha^k P^k,$$

we approximate the matrix inverse using only the first order term, which gives $(I - \alpha P)^{-1} \approx I + \alpha P$ with $O(n^2)$ computational complexity.
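Equations (9)-(11) can be illustrated with a short NumPy sketch; the function names are hypothetical, and the exact path uses a linear solve rather than forming the inverse explicitly.

```python
import numpy as np

def propagate(W, S0, alpha=0.5, exact=True):
    """Similarity propagation, eqs. (9)-(10). exact=True evaluates the
    closed form (1-alpha)(I - alpha*P)^{-1} S0 via an O(n^3) linear solve;
    exact=False uses the first-order Neumann surrogate I + alpha*P,
    the O(n^2) approximation described above."""
    d = W.sum(axis=1)
    P = W / np.where(d == 0.0, 1.0, d)[:, None]   # P = D^{-1} W, row-stochastic
    n = W.shape[0]
    if exact:
        return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * P, S0)
    return (1 - alpha) * (np.eye(n) + alpha * P) @ S0

def refine(S_plus, eps=1e-3):
    """Eq. (11): symmetrize the converged matrix and zero out entries
    whose absolute value falls below the threshold eps."""
    W_new = 0.5 * (S_plus + S_plus.T)
    W_new[np.abs(W_new) < eps] = 0.0
    return W_new
```

The exact limit satisfies the fixed-point relation $S = (1-\alpha)S^{(0)} + \alpha P S$, which provides a direct correctness check.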

[0017] Finally, the regularizer in equation (4) is refined by updating $\{L^q\}_{q=1}^{Q}$ based on the new set of similarity matrices $\{\tilde{W}^q\}_{q=1}^{Q}$ at every update_step iterations as AL proceeds (when update_step=1, the regularizer is updated at every iteration), which exploits the increasing labeled information provided by the user.

[0018] After learning a low dimensional feature space, an active sampling strategy is needed to enrich the training set iteratively. In a batch-mode AL, both uncertainty and diversity need to be considered. The system quantifies the uncertainty of a pixel by considering a committee of classifiers, and the samples that exhibit the maximum disagreement between different models are selected. An uncertainty criterion is applied, in which the unlabeled samples are predicted using a committee of kNN classifiers, and each member is characterized by a different number of nearest neighbors, k.
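The committee-based uncertainty measure can be sketched as follows. This is a minimal illustration assuming Euclidean kNN and majority voting; the function name and return convention are not from the disclosure.

```python
import numpy as np
from collections import Counter

def committee_disagreement(X_lab, y_lab, X_unlab, ks=(1, 3, 5)):
    """Paragraph [0018] uncertainty: each committee member is a kNN
    classifier with a different k; a sample's uncertainty is the number
    of distinct labels predicted across the committee, so samples with
    maximum disagreement score highest."""
    preds = []
    for k in ks:
        votes = []
        for x in X_unlab:
            d = np.sum((X_lab - x) ** 2, axis=1)      # squared Euclidean
            nn = np.argsort(d)[:k]
            votes.append(Counter(y_lab[nn].tolist()).most_common(1)[0][0])
        preds.append(votes)
    preds = np.array(preds)                            # (len(ks), n_unlab)
    unc = np.array([len(set(preds[:, i])) for i in range(preds.shape[1])])
    return unc, preds
```

A sample deep inside one class gets a disagreement of 1 (all members agree), while samples near class boundaries score higher and enter the contention pool.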

[0019] The system also applies a diversity criterion to reduce the redundancy among the newly queried samples. At a given AL iteration, we consider the following restrictions for sample selection: 1) samples that have completely identical label predictions from all the committee members with any already selected sample in the batch cannot be queried; and 2) any class cannot have more than S samples, where S is a user-defined parameter, and the class label is decided using majority voting based on the committee predictions. A schematic illustration of the proposed diversity criterion is shown in Table I. In this example, assume that the committee classifiers are defined as k={1, 3, 5}, the class labels are A, B, and C, and S is set to 2. Suppose that, in the current iteration, candidate samples $\{x_j\}_{j=1}^{6}$ have the same degree of uncertainty. After selecting samples $x_1$, $x_2$, and $x_3$, samples $x_4$ and $x_5$ cannot be selected, since 1) $x_4$ has the same label predictions as sample $x_2$ from all three classifiers; and 2) $x_5$ has the same majority voting label (label A) as $x_1$ and $x_3$. Sample $x_6$ can still be selected, as it does not conflict with either of the two restrictions. Note that this criterion does not require clustering or other complicated techniques, but is simply based on the outputs of the committee kNN classifiers, which can be accessed directly. Therefore, the final AL strategy includes two steps: uncertainty and diversity. A set of candidate unlabeled samples with the maximum degree of uncertainty are first considered based on the uncertainty criterion. Then, the diversity criterion is applied to select the most informative samples from the resulting contention pool.

TABLE I. Example of the diversity criterion (committee k = {1, 3, 5}; S = 2). [The per-classifier label symbols, one glyph each for classes A, B, and C, are not reproducible here; the query decisions follow the example above.]

  Candidate sample   k = 1   k = 3   k = 5   Query decision
  x_1                  -       -       -     query
  x_2                  -       -       -     query
  x_3                  -       -       -     query
  x_4                  -       -       -     not queried
  x_5                  -       -       -     not queried
  x_6                  -       -       -     query
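The two restrictions of the diversity criterion map naturally onto a small filtering routine. The sketch below is illustrative, and the test data are hypothetical prediction vectors chosen to be consistent with the Table I example (x_4 duplicating x_2's vector, x_5 sharing majority label A with x_1 and x_3).

```python
from collections import Counter

def diversity_filter(candidates, committee_preds, S_max=2):
    """Paragraph [0019] diversity criterion: reject a candidate if
    (1) its committee prediction vector duplicates that of an already
    selected sample in the batch, or (2) its majority-vote class already
    has S_max selected samples.

    candidates      : sample indices, ordered by decreasing uncertainty
    committee_preds : mapping index -> tuple of labels, one per member
    """
    selected, per_class, seen_vecs = [], Counter(), set()
    for i in candidates:
        vec = tuple(committee_preds[i])
        cls = Counter(vec).most_common(1)[0][0]   # majority-vote label
        if vec in seen_vecs or per_class[cls] >= S_max:
            continue                               # restriction 1 or 2
        selected.append(i)
        seen_vecs.add(vec)
        per_class[cls] += 1
    return selected
```

As in the table, only the committee outputs are needed; no clustering step is involved.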

[0020] Throughout this description, some aspects are described in terms that would ordinarily be implemented as software programs. Those skilled in the art will readily recognize that the equivalent of such software can also be constructed in hardware, firmware, or micro-code. Because data-manipulation algorithms and systems are well known, the present description is directed in particular to algorithms and systems forming part of, or cooperating more directly with, systems and methods described herein. Other aspects of such algorithms and systems, and hardware or software for producing and otherwise processing the signals or data involved therewith, not specifically shown or described herein, are selected from such systems, algorithms, components, and elements known in the art. Given the systems and methods as described herein, software not specifically shown, suggested, or described herein that is useful for implementation of any aspect is conventional and within the ordinary skill in such arts.

[0021] FIG. 3 is a high-level diagram showing the components of the exemplary system 1000 for analyzing the image data and performing other analyses described herein, and related components. The system 1000 includes a processor 1086, a peripheral system 1020, a user interface system 1030, and a data storage system 1040. The peripheral system 1020, the user interface system 1030 and the data storage system 1040 are communicatively connected to the processor 1086. Processor 1086 can be communicatively connected to network 1050 (shown in phantom), e.g., the Internet or a leased line, as discussed below. The image data may be received using image sensor 202 (via electrodes 204) and/or displayed using display units (included in user interface system 1030) which can each include one or more of systems 1086, 1020, 1030, 1040, and can each connect to one or more network(s) 1050. Image sensor 202 may comprise a digital imaging device, such as a digital camera, or the like. Processor 1086, and other processing devices described herein, can each include one or more microprocessors, microcontrollers, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), programmable logic devices (PLDs), programmable logic arrays (PLAs), programmable array logic devices (PALs), or digital signal processors (DSPs).

[0022] Processor 1086 can implement processes of various aspects described herein. Processor 1086 can be or include one or more device(s) for automatically operating on data, e.g., a central processing unit (CPU), microcontroller (MCU), desktop computer, laptop computer, mainframe computer, personal digital assistant, digital camera, cellular phone, smartphone, or any other device for processing data, managing data, or handling data, whether implemented with electrical, magnetic, optical, biological components, or otherwise. Processor 1086 can include Harvard-architecture components, modified-Harvard-architecture components, or Von-Neumann-architecture components.

[0023] The phrase communicatively connected includes any type of connection, wired or wireless, for communicating data between devices or processors. These devices or processors can be located in physical proximity or not. For example, subsystems such as peripheral system 1020, user interface system 1030, and data storage system 1040 are shown separately from the data processing system 1086 but can be stored completely or partially within the data processing system 1086.

[0024] The peripheral system 1020 can include one or more devices configured to provide digital content records to the processor 1086. For example, the peripheral system 1020 can include digital still cameras, digital video cameras, cellular phones, or other data processors. The processor 1086, upon receipt of digital content records from a device in the peripheral system 1020, can store such digital content records in the data storage system 1040.

[0025] The user interface system 1030 can include a mouse, a keyboard, another computer (connected, e.g., via a network or a null-modem cable), or any device or combination of devices from which data is input to the processor 1086. The user interface system 1030 also can include a display device, a processor-accessible memory, or any device or combination of devices to which data is output by the processor 1086. The user interface system 1030 and the data storage system 1040 can share a processor-accessible memory.

[0026] In various aspects, processor 1086 includes or is connected to communication interface 1015 that is coupled via network link 1016 (shown in phantom) to network 1050. For example, communication interface 1015 can include an integrated services digital network (ISDN) terminal adapter or a modem to communicate data via a telephone line; a network interface to communicate data via a local-area network (LAN), e.g., an Ethernet LAN, or wide-area network (WAN); or a radio to communicate data via a wireless link, e.g., WiFi or GSM. Communication interface 1015 sends and receives electrical, electromagnetic or optical signals that carry digital or analog data streams representing various types of information across network link 1016 to network 1050. Network link 1016 can be connected to network 1050 via a switch, gateway, hub, router, or other networking device.

[0027] Processor 1086 can send messages and receive data, including program code, through network 1050, network link 1016 and communication interface 1015. For example, a server can store requested code for an application program (e.g., a JAVA applet) on a tangible non-volatile computer-readable storage medium to which it is connected. The server can retrieve the code from the medium and transmit it through network 1050 to communication interface 1015. The received code can be executed by processor 1086 as it is received, or stored in data storage system 1040 for later execution.

[0028] Data storage system 1040 can include or be communicatively connected with one or more processor-accessible memories configured to store information. The memories can be, e.g., within a chassis or as parts of a distributed system. The phrase processor-accessible memory is intended to include any data storage device to or from which processor 1086 can transfer data (using appropriate components of peripheral system 1020), whether volatile or nonvolatile; removable or fixed; electronic, magnetic, optical, chemical, mechanical, or otherwise. Exemplary processor-accessible memories include but are not limited to: registers, floppy disks, hard disks, tapes, bar codes, Compact Discs, DVDs, read-only memories (ROM), erasable programmable read-only memories (EPROM, EEPROM, or Flash), and random-access memories (RAMs). One of the processor-accessible memories in the data storage system 1040 can be a tangible non-transitory computer-readable storage medium, i.e., a non-transitory device or article of manufacture that participates in storing instructions that can be provided to processor 1086 for execution.

[0029] In an example, data storage system 1040 includes code memory 1041, e.g., a RAM, and disk 1043, e.g., a tangible computer-readable rotational storage device such as a hard drive. Computer program instructions are read into code memory 1041 from disk 1043. Processor 1086 then executes one or more sequences of the computer program instructions loaded into code memory 1041, as a result performing process steps described herein. In this way, processor 1086 carries out a computer implemented process. For example, steps of methods described herein, blocks of the flowchart illustrations or block diagrams herein, and combinations of those, can be implemented by computer program instructions. Code memory 1041 can also store data, or can store only code.

[0030] Various aspects described herein may be embodied as systems or methods. Accordingly, various aspects herein may take the form of an entirely hardware aspect, an entirely software aspect (including firmware, resident software, micro-code, etc.), or an aspect combining software and hardware aspects. These aspects can all generally be referred to herein as a "service," "circuit," "circuitry," "module," or "system."

[0031] Furthermore, various aspects herein may be embodied as computer program products including computer readable program code stored on a tangible non-transitory computer readable medium. Such a medium can be manufactured as is conventional for such articles, e.g., by pressing a CD-ROM. The program code includes computer program instructions that can be loaded into processor 1086 (and possibly also other processors), to cause functions, acts, or operational steps of various aspects herein to be performed by the processor 1086 (or other processor). Computer program code for carrying out operations for various aspects described herein may be written in any combination of one or more programming language(s), and can be loaded from disk 1043 into code memory 1041 for execution. The program code may execute, e.g., entirely on processor 1086, partly on processor 1086 and partly on a remote computer connected to network 1050, or entirely on the remote computer.

[0032] Those skilled in the art will recognize that numerous modifications can be made to the specific implementations described above. The implementations should not be limited to the particular limitations described. Other implementations may be possible.