Domain Aware Medical Image Classifier Interpretation by Counterfactual Impact Analysis

20230177683 · 2023-06-08

Abstract

A neural network, trained for the task of deriving the attribution of image regions that significantly influence classification in a tool for pathology classification, comprising (i) a contracting branch, (ii) an attenuation module, (iii) an interconnected upsampling branch, and (iv) a final mapping module.

Claims

1. (canceled)

2. A neural network, trained for the task of deriving the attribution of image regions that significantly influence classification in a tool for pathology classification, in which a medical image is consecutively processed through (i) a contracting branch, (ii) an attenuation module, (iii) an interconnected up-sampling branch, and (iv) a final mapping module, wherein (i) the contracting branch is fed with said medical image, and derives the same features relating to a pathology of interest at different progressively decreasing resolution scales as said classification tool, (ii) the attenuation module is coupled to the contracting branch and comprises convolutional layers, weighs the final output of the contracting branch and thereby selectively damps the forwarding of a subgroup of said features relating to a pathology of interest, (iii) the up-sampling branch refines this initial localization of the attenuation module, by processing the result of the attenuation module repeatedly through (a) up-sampling, (b) merging with weighted feature-maps of the corresponding resolution-scale of the contraction path and (c) convolutional layers, so as to refine the localization of said features of said sub-group, (iv) the final mapping module projects the features onto a two-dimensional grid of same size as said medical image and derives the attribution through a combination of thresholding and smoothing applied to the result of the projection, wherein said neural network is trained by (a) feeding a plethora of learning data, structured in an assortment of batches of at least one single medical image, and (b) assessing said network's derived final output and modifying weights of said neural network such that repeating said assessment yields an improved output and repeating steps (a) and (b) until no further improvement is obtained, wherein said learning data comprise a plurality of medical images, representative of an anatomical structure of interest, adhering to prerequisites 
and restrictions of said pathology classification tool, and said assessment of the network's derived final output comprises subsequently marginalizing attributed image regions' contribution towards pathology classification by said classification tool for each image of a batch by altering for each image of a batch attributed pixels such that they do not affect the outcome of the classification, re-classifying an altered batch using said classification tool and quantifying the result of the re-classification.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0038] FIG. 1 is an overview of the marginalization performance: the first two images depict marginalization of mass tissue in a mammography image; the four images in the middle demonstrate blurring, averaging and inpainting (the method of this invention) applied to the patch coming from the first image. The last image displays the ROC curves of the mammography classifier vs. healthy-pixel inpainting only in healthy/pathological structures,

[0039] FIG. 2 shows the architecture of the proposed attribution network,

[0040] FIG. 3 shows resulting attribution heatmaps for mammography and chest X-ray: (a) original image overlaid with annotation contours (and arrows for missing GT), (b) the attribution framework according to this invention, (c) GradCAM, (d) Saliency.

[0041] FIG. 4 is a table showing, on top, the Hausdorff distance H and weak localization results L relating the maps {circumflex over (M)} and GT; and, at the bottom, the area ratios relating the maps {circumflex over (M)} to the organ resp. the image-size.

DETAILED DESCRIPTION OF THE INVENTION

[0042] Given a pathology classifier's prediction about an input-image, it is a goal of this invention to estimate its cause by attributing the specific pixel-regions that substantially influenced the predictor's outcome.

[0043] Informally, we search for the image-area that, if changed, results in a sufficiently healthy image able to fool the classifier. The resulting attribution-map needs to be informative for the user, and faithful to its underpinning classifier. While we can quantitatively test for the latter, the former is an ill-posed problem.

[0044] We therefore formalize as follows:

[0045] Let I denote an image of a domain I with pixels on a discrete grid m1×m2, c a fixed pathology-class, and f a classifier capable of estimating p(c|I), the probability of c for I.

[0046] Also let M denote the attribution-map for image I and class c, hence M∈M^(m1×m2)({0,1}).

[0047] Furthermore, assume a function π(M) proficient in marginalizing all pixel regions attributed by M in I such that the result of the operation is still within the domain of f. Hence, π(M) yields a new image similar to I, but where we know all regions attributed by M to be healthy per definition.

[0048] Therefore, assuming I depicting a pathological case, M attributing only pathology pixel representations, π(M) is a healthy counterfactual image to I. In any case p(c|π(M)) is well defined.

[0049] Using this notation, we can formalize what an informative map {circumflex over (M)} means, hence give it an a priori, testable semantic meaning.

[0050] We define it as:

[00001] {circumflex over (M)} := argmin_M d(M), subject to p(c|π(M)) ≤ θ, d(M) ≤ δ, M ∈ S,

[0051] θ is the classification-threshold, d is a metric measuring the attributed area, δ a constant limiting the map's attributed area, and S the set of smooth, i.e. compact and connected, masks. Note that any map of M^(m1×m2)({0,1}) can be differentiably mapped into S by taking the smoothed maximum of a convolution with a Gaussian kernel. In this form {circumflex over (M)} is clearly defined and can be intuitively understood by end-users.
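The mapping into S can be sketched as follows; this is a minimal illustration that substitutes a steep sigmoid for the smoothed maximum as the differentiable squashing step. Function name and parameter values (σ, sharpness, radius) are illustrative assumptions, not those of the embodiment:

```python
import numpy as np

def smooth_mask(mask, sigma=1.5, sharpness=30.0, radius=5):
    # Convolve the binary map with a (separable) Gaussian kernel, then push
    # the result through a steep sigmoid so it stays near-binary while
    # remaining differentiable with respect to the input.
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    p = np.pad(mask.astype(float), radius)            # zero padding
    b = np.array([np.convolve(row, k, mode="valid") for row in p])
    b = np.array([np.convolve(col, k, mode="valid") for col in b.T]).T
    return 1.0 / (1.0 + np.exp(-sharpness * (b - 0.5)))

# A hard-edged square attribution becomes a smooth, connected blob:
m = np.zeros((16, 16))
m[5:11, 5:11] = 1.0
sm = smooth_mask(m)
```

The sharpness parameter trades differentiability against how closely the result approximates a {0,1} mask.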

[0052] Solving for {circumflex over (M)} requires choosing (i) an appropriate measure d (e.g. the map area in pixels), (ii) an appropriate size-limit δ (e.g. n times the average mass-size for mammography), and (iii) a fitting marginalization technique π(·). In the following we describe how we solve for {circumflex over (M)} through an ANN, and how we overcome the out-of-domain obstacles by partial convolution for marginalization.

[0053] Architecture

[0054] Iteratively finding solutions for {circumflex over (M)} is typically time-consuming.

[0055] Therefore, we develop a dedicated ANN, capable of finding the desired attribution in a single forward pass. For this, the network learns on multiple resolutions, to focus on and combine relevant classifier-extracted features (cf. FIG. 2).

[0056] We build on a U-Net architecture, where the down-sampling, encoding branch consists of the trained classifier sans its classification layers. These features, x_(i,j,l), are subsequently passed through a feature-filter, performing


x_(i,j,l)·σ(W_m ρ(W_r x_(i,j,l)+b_l)+b_m)

where ρ is an element-wise nonlinearity (namely a rectified linear unit), σ a normalization function (the sigmoid function), and W_. resp. b_. denote linear transformation parameters.
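The feature-filter above can be sketched in NumPy. The shapes, the hidden width, and all variable names are illustrative assumptions; the embodiment applies this gate per resolution scale inside the network:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_filter(x, W_r, b_l, W_m, b_m):
    # x: feature maps of shape (H, W, C). The gate sigma(W_m rho(W_r x + b_l) + b_m)
    # is computed per spatial location and multiplies the features element-wise,
    # selectively damping a sub-group of channels.
    gate = sigmoid(relu(x @ W_r.T + b_l) @ W_m.T + b_m)
    return x * gate

rng = np.random.default_rng(0)
H, W, C, Ch = 4, 4, 8, 16          # Ch: hidden width, an illustrative choice
x = rng.normal(size=(H, W, C))
W_r = rng.normal(size=(Ch, C)); b_l = rng.normal(size=Ch)
W_m = rng.normal(size=(C, Ch));  b_m = rng.normal(size=C)
y = feature_filter(x, W_r, b_l, W_m, b_m)
```

Since the gate lies in (0, 1), the filtered features can only shrink in magnitude, never grow, which is the attenuation behaviour described above.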

[0057] This is similar to additive attention which, compared to multiplicative attention, has shown better performance on high-dimensional input-features. The upsampling branch consists of four consecutive blocks of: upsampling by a factor of two, followed by convolution and merging with attention-gate-weighted features from the classifier at the corresponding resolution scale. After final upsampling back to input-resolution, we apply a 1×1 convolution of depth two, resulting in two channels c_1, c_2. The final attribution-map {circumflex over (M)} is derived through thresholding

[00002] |c_1|/(|c_1|+|c_2|)
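The projection-and-threshold step can be sketched directly from this expression; the 0.55 threshold follows the value reported in the experimental set-up, the function name and epsilon are ours:

```python
import numpy as np

def attribution_map(c1, c2, threshold=0.55):
    # Normalized channel magnitude |c1| / (|c1| + |c2|), thresholded to a
    # binary map; eps guards against division by zero where both vanish.
    eps = 1e-8
    score = np.abs(c1) / (np.abs(c1) + np.abs(c2) + eps)
    return (score > threshold).astype(np.uint8)

c1 = np.array([[3.0, 0.1], [0.0, 2.0]])
c2 = np.array([[1.0, 0.9], [1.0, 0.5]])
M = attribution_map(c1, c2)
```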

[0058] Intuitively, the network attenuates the classifier's final features, generating an initial localization. This coarse map is subsequently refined by additional weighting and information from higher-resolution features (cf. FIG. 2). We train the network by minimizing


L(M) = φ(M) + ψ(M) + λ·R(M), s.t. d(M) ≤ δ,

[0059] where φ(M) := −1·log(p(c|π(M))) and ψ(M) := log(odds(I)) − log(odds(π(M))), hence weighing the probability of the marginalized image, enforcing p(c|π(M)) ≤ θ.

[0060] In this specific embodiment an additional regularization-term was introduced: a weighted version of total variation, which experimentally greatly improved convergence. All terms were normalized by mapping through a generalized logistic function. The inequality constraint was enforced by the method proposed in the state-of-the-art method of Kervadec, H., Dolz, J., Tang, M., Granger, E., Boykov, Y., Ayed, I.: Constrained-CNN losses for weakly supervised segmentation (2018), http://arxiv.org/abs/1805.04628.
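The regularization term can be illustrated with a plain (unweighted) anisotropic total variation; the embodiment uses a weighted variant, and the function name is ours:

```python
import numpy as np

def total_variation(m):
    # Anisotropic TV: sum of absolute differences between vertically and
    # horizontally adjacent pixels. Compact, connected maps score low;
    # scattered attributions score high.
    return np.abs(np.diff(m, axis=0)).sum() + np.abs(np.diff(m, axis=1)).sum()

compact = np.zeros((8, 8)); compact[3:5, 3:5] = 1.0      # one 2x2 blob
scattered = np.zeros((8, 8))
scattered[1, 1] = scattered[3, 5] = scattered[6, 2] = scattered[5, 6] = 1.0
```

Both maps attribute four pixels, but the scattered map has twice the perimeter and hence twice the TV penalty, which is why such a term favours the compact masks of the set S.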

[0061] After mapping into S, any solution to L will also estimate {circumflex over (M)}, thereby yielding our desired attribution-map. The parametrization is task/classifier-dependent and is described in the following sections.

[0062] Marginalization

[0063] The goal is to marginalize image regions marked by the network of the invention during its training process. Therefore, we aim for an image-inpainting method to replace pathological tissue by healthy appearance. It has to handle arbitrary maps introduced during training. The inpainting result should resemble valid global anatomical appearance with high-quality local texture. To address the above-mentioned criteria we apply the U-Net-like architecture with partial-convolution blocks and automated layer-by-layer mask updating of Liu, G., Reda, F., Shih, K., Wang, T. C., Tao, A., Catanzaro, B.: Image inpainting for irregular holes using partial convolutions. In: Proceedings of ECCV. pp. 85-100 (2018).

[0064] This convolution considers only unmasked inputs in the current sliding window. This architecture setup and its loss function were used as introduced in the article cited above (Image inpainting for irregular holes using partial convolutions).
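A minimal single-channel sketch of this partial convolution follows, assuming the valid-pixel re-normalization and mask-update rule of Liu et al.; the actual network stacks such blocks with learned multi-channel kernels inside a U-Net:

```python
import numpy as np

def partial_conv2d(img, mask, kernel):
    # Sliding-window convolution that ignores masked (mask == 0) pixels,
    # re-normalizes by the number of valid pixels under the window, and
    # updates the mask: a location becomes valid if its window contained
    # at least one valid input pixel.
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    ip = np.pad(img * mask, ((ph, ph), (pw, pw)))
    mp = np.pad(mask, ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape)
    new_mask = np.zeros(mask.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            win = ip[i:i + kh, j:j + kw]
            valid = mp[i:i + kh, j:j + kw].sum()
            if valid > 0:
                out[i, j] = (win * kernel).sum() * (kernel.size / valid)
                new_mask[i, j] = 1.0
    return out, new_mask

img = np.ones((4, 4))
mask = np.ones((4, 4)); mask[1:3, 1:3] = 0.0       # a 2x2 "hole"
kernel = np.full((3, 3), 1.0 / 9.0)                # mean filter
out, new_mask = partial_conv2d(img, mask, kernel)
```

On a constant image the hole is filled exactly from its valid neighbours, and after one pass the whole mask is valid, illustrating the layer-by-layer mask updating.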

[0065] The loss function for training the inpainting network focuses both on the per-pixel reconstruction quality of the masked regions and on the overall appearance of the image. A perceptual and a style loss are introduced, which match images in a mapped feature space to enhance the overall appearance, and a total-variation term is used to ensure a smooth transition between hole regions and present image regions in the final result.

[0066] Experimental Set-Up

[0067] Datasets: We evaluated our framework on two different datasets, on mammography scans and on chest X-ray images. For mammography, the two sets DDSM [Heath, M., Bowyer, K., Kopans, D., Moore, R., Kegelmeyer, W.: The digital database for screening mammography. In: Proceedings of IWDM. pp. 212-218 (2000)] and CBIS-DDSM [Lee, R., Gimenez, F., Hoogi, A., Rubin, D.: Curated breast imaging subset of DDSM. The Cancer Imaging Archive 8 (2016). https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY] were used, downsampled to a resolution of 576×448 pixels. Data was split into 1231/2000 pathological/healthy samples for training, and into 334/778 scans for testing. Pixel-wise ground-truth annotation (GT) is available.

[0069] We demonstrate generalization on a private collection of healthy and tuberculotic (TBC) frontal chest X-ray images, at a downsampled resolution of 256×256. We split healthy images into sets of 1700/135 for training resp. validation set, and TBC cases into 700/70. The test set contains 52 healthy and 52 TBC samples. No pixel-wise GT information was provided for this data.

[0070] Classifiers: The backbone of our mammography attribution network is a MobileNet classifier for categorization into healthy samples and scans with masses. [Howard, A., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)].

[0071] The network was trained using the Adam optimizer with a batch size of 4 and a learning rate of 1e-5 for 250 epochs with early stopping. The network was pretrained with 50 k patches of 224×224 pixels from the training data for the same task. The TBC attribution utilized a DenseNet-121 classifier for the binary classification task of healthy or TBC case. [0072] [Huang, G., Liu, Z., van der Maaten, L., Weinberger, K. Q.: Densely connected convolutional networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)].

[0073] It was trained using the SGD momentum optimizer with a batch size of 32 and a learning rate of 1e-5 for 2000 epochs. The network was pretrained on the CheXpert dataset. [0074] [Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R. L., Shpanskaya, K. S., Seekins, J., Mong, D. A., Halabi, S. S., Sandberg, J. K., Jones, R., Larson, D. B., Langlotz, C. P., Patel, B. N., Lungren, M. P., Ng, A. Y.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. AAAI (2019)].

[0075] Marginalization: Both inpainter networks were trained on the healthy training samples with a batch size of 1 for mammography and 5 for chest X-ray. The training was done in two phases, the first phase with batch normalization (BN) after each convolution layer and the second with BN only in the decoder part. The network for the mass classification task was trained with learning rates of 1e-5/1e-6 and for the TBC classification task of 2e-4/1e-5 for the two phases. For each image irregular masks were generated which mimic possible configurations during the attribution network training.

[0076] Attribution: We used the last four resolution-scales of each classifier, and in all cases the features immediately after the activation function following the convolution. The weights of the pre-trained ANNs were kept fixed during the complete process. Filter-depths of the up-sampling convolution blocks correspond to the equivalent down-sampling filters; the filter-size is fixed to 1×1.

[0077] Up-sampling itself is done via neighbourhood up-sampling. We used standard gradient descent and a cyclic learning rate, varying between 1e-6 and 1e-4, and trained for up to 5000 epochs with early stopping. We thresholded the masks at 0.55, and used a Gaussian RBF with σ=5e-2 and a smoothing parameter of 30. All trainable weights were initialized from a random normal distribution.

[0078] Results

[0079] Marginalization: To evaluate the inpainter network we assessed how much the classification score of an image changes when pathological tissue is replaced.

[0080] Thus, we computed ROC curves using the classifier on all test samples (i) without any inpainting as reference, and, for comparison, with randomly sampled inpainting (ii) only in healthy respectively (iii) pathological scans, over 10 runs.

[0084] The clear distance between the ROC curves of the image classifiers without any inpainting and with inpainting in pathological regions shows that the classifier is sensitive to changes around pathological regions of the image. Moreover, it is visible that the ROC curves of inpainting in healthy tissues follow closely the unaffected classifier ROC curve. Accordingly, the AUCs for both classifiers are 0.89, the mean AUCs of the mammo/chest healthy inpainters are 0.89/0.88 and the mean AUC of the pathological mammo inpainter is 0.86.
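The reported AUCs can be obtained from classifier scores via the rank-based (Mann-Whitney) identity; this is an illustrative helper of ours, not the evaluation code of the embodiment:

```python
import numpy as np

def auc(scores_pathological, scores_healthy):
    # AUC equals the probability that a randomly chosen pathological sample
    # scores higher than a randomly chosen healthy one (ties count one half).
    p = np.asarray(scores_pathological)[:, None]
    h = np.asarray(scores_healthy)[None, :]
    return float((p > h).mean() + 0.5 * (p == h).mean())

perfect = auc([0.9, 0.8, 0.7], [0.1, 0.2])   # fully separated scores
chance = auc([0.5], [0.5])                    # indistinguishable scores
```

An inpainter that leaves healthy regions classifier-neutral should leave this value essentially unchanged, which is what the closely matching 0.89/0.88 AUCs above express.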

[0085] Attribution: We compared our attribution network against the gradient-explanation saliency map (SAL), and the network/gradient-derived Grad-CAM visualizations. We limited our comparisons to these direct approaches, as they are widely used within medical imaging and inherently valid. Popular reference-based approaches either utilize blurring, noise or some other heuristic, or were not available, and therefore could not be considered. Quantitatively, we relate the result-maps (i) to both organ and ground-truth (GT) annotations, and (ii) to each other.

[0086] In particular, for (i) we studied the Hausdorff distances H between GT and {circumflex over (M)}, indicating location proximity. Lower values demonstrate better localization with respect to the pathology.

[0087] Further we performed a weak localization experiment: per image, we derived bounding boxes (BB) for each connected component of GT and {circumflex over (M)} attributions.

[0088] A GT BB counts as found if any {circumflex over (M)} BB has an IoU≥0.125. We chose this threshold as a proficient classifier presumably focuses on the mass's boundaries and neighborhoods, thereby limiting the possible BB-overlap. We report the average localization L. For (ii) we derived the area ratio A between {circumflex over (M)} and the organ-mask (breast-area) or the whole image (chest X-ray). Again, lower values indicate a smaller, thereby clearer, map. Note that for (i) we could only derive values for mammography. All measurements were performed on binary masks; hence GradCAM and SAL had to be thresholded. We chose the 50, 75 and 90 percentiles, i.e. compared 50, 25 and 10 per cent of the map-points. Where multiple pathologies or mapping results occurred, we used the median for a robust estimation per image.
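The weak-localization criterion can be sketched as follows; box representation and function names are our illustrative choices:

```python
def iou(a, b):
    # Axis-aligned boxes as (x0, y0, x1, y1); intersection-over-union.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def gt_found(gt_bb, pred_bbs, threshold=0.125):
    # A ground-truth box counts as found if any predicted box
    # reaches the IoU threshold of the text.
    return any(iou(gt_bb, p) >= threshold for p in pred_bbs)
```

The deliberately low 0.125 threshold admits boxes that merely border the mass, matching the reasoning above about classifiers focusing on boundaries.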

[0089] Statistically significant differences between all resulting findings were assessed using Wilcoxon signed-rank tests, for α<0.05. Additionally, we tested our network with randomised parameterization (labels have no effect in our case).

[0090] As seen in Table 1, our framework achieves significantly lower H than either GradCAM or SAL at all threshold levels. Moreover, we report significantly better weak localization (L), which underlines the higher accuracy of our approach. Qualitatively, our attribution-maps are more tightly focused and enclose the masses. The former is also expressed by the lower overlap values A. All p-values were significantly below 1e-2, thereby strengthening our results. Randomization of the ANN's weights yields pure noise maps.