End-to-End Attention Pooling-Based Classification Method for Histopathology Images

20220188573 · 2022-06-16

    Abstract

    The present disclosure provides an end-to-end attention pooling-based classification method for histopathological images. The method specifically includes the following steps: S1, cutting the histopathology image into patches of a specified size, removing the patches with too much background area and packaging the remaining patches into a bag; S2, training a deep learning network by taking the bag obtained in S1 as an input using a standard multi-instance learning method; S3, scoring all the patches by using the trained deep learning network, and selecting m patches with highest and lowest scores for each whole slide image to form a new bag; S4, building a deep learning network including an attention pooling module, and training the network by using the new bag obtained in S3; and S5, after the histopathology image to be classified is processed in S1 and S3, performing classification by using the model obtained in S4. The present disclosure can obtain a better classification effect under the current situation of only a small number of samples, provide an auxiliary diagnosis mechanism for doctors, and alleviate the problem of shortage of medical resources.

    Claims

    1. An end-to-end attention pooling-based classification method for histopathology images, comprising the following steps: S1, cutting a histopathological whole slide image into square patches with a side length L, pre-processing to filter patches, and packaging the remaining patches into a bag; S2, modifying the last fully connected layer in a pre-trained Resnet50 network, recording the modified model as A, and training the model A by a standard multi-instance learning method; S3, scoring all the patches by using the trained model A, and selecting m patches with highest and lowest scores for each image to form a new bag; S4, building a deep learning model B comprising an attention pooling module, and training the model B by using the new bag obtained in S3; and S5, after a bag containing 2m patches is obtained by processing the histopathology image to be classified in S1 and S3, classifying the bag by using the model B trained in S4, wherein the classification result is the final classification result of the image to be classified.

    2. The end-to-end attention pooling-based classification method for histopathology images according to claim 1, wherein preprocessing to filter patches in S1 refers to removing the patches with the background area ratio exceeding a certain range, and is specifically implemented as follows: firstly, calculating a threshold of the background area and foreground area for a whole slide image at a low resolution by using an Otsu method, wherein a specific algorithm of the Otsu method is to find a threshold t to minimize a weighted sum of variances within the foreground area and the background area, and a calculation formula is as follows:
    $$\sigma^2(t) = w_0(t)\,\sigma_0^2(t) + w_1(t)\,\sigma_1^2(t),$$
    wherein $\sigma_0^2(t)$ refers to a variance of a gray value within the background area when t is taken as the threshold, and $\sigma_1^2(t)$ refers to a variance within the foreground area when t is taken as the threshold; and it is assumed that when t is taken as the threshold, a gray value of pixels belonging to the background area is recorded as $B(i)$, a gray value of pixels belonging to the foreground area is recorded as $F(i)$, a number of the pixels belonging to the foreground area is recorded as $N_F$, and a number of the pixels belonging to the background area is recorded as $N_B$, and calculation methods of the variances are as follows:
    $$\sigma_0^2(t) = \frac{1}{N_B}\sum_{i}\Bigl(B(i) - \frac{1}{N_B}\sum_{j} B(j)\Bigr)^2, \quad \sigma_1^2(t) = \frac{1}{N_F}\sum_{i}\Bigl(F(i) - \frac{1}{N_F}\sum_{j} F(j)\Bigr)^2,$$
    and $w_0(t)$ and $w_1(t)$ are proportions of the background area and the foreground area when t is taken as the threshold; and after an optimal threshold is obtained by using the above algorithm, a proportion of the background area in a whole patch area under the optimal threshold is calculated, and if the proportion exceeds a certain value, the patch is discarded.

    3. The end-to-end attention pooling-based classification method for histopathology images according to claim 1, wherein in S2, an output dimension of the last fully connected layer of the Resnet50 network is modified to 128, and a fully connected layer with an input dimension of 128 and an output dimension of 2 is added after that last fully connected layer.

    4. The end-to-end attention pooling-based classification method for histopathology images according to claim 1, wherein training the model A by the standard multi-instance learning method in S2 is as follows: an assumption of multi-instance learning is that at least one patch in a positive bag is positive, and all patches in a negative bag are negative; before each epoch of training, firstly all the patches are scored by using the model A, k patches with the highest score in each image are selected, and the selected patches are labeled the same as the whole slide image; and all the selected patch-label pairs constitute a dataset required for training, the model A is trained, and the above process is repeated until an accuracy of the model A on a validation set no longer improves.

    5. The end-to-end attention pooling-based classification method for histopathology images according to claim 4, wherein training the model A until the accuracy of the model A on the validation set no longer improves is specifically implemented as follows: for each patch $x_i$ in the image, an output of the model A is $[o_{i1}, o_{i2}]$, and then $[p_{i1}, p_{i2}]$ is obtained by using a softmax function, wherein
    $$p_{i2} = \frac{e^{o_{i2}}}{e^{o_{i1}} + e^{o_{i2}}};$$
    outputs of all the patches are collected to obtain $[p_{12}, p_{22}, \ldots, p_{l2}]$, wherein $l$ is a number of the patches, a maximum value is recorded as $p_i$, and then a predicted value $y_p$ of the image is:
    $$y_p = \begin{cases} 0, & p_i \le 0.5 \\ 1, & p_i > 0.5 \end{cases}$$
    and when the predicted value is 0, it means that the image is predicted to be negative, and when the predicted value is 1, it means that the image is predicted to be positive; and according to a prediction result, the classification accuracy of the model A is calculated, and if the accuracy of the model A on the validation set does not improve after a specified number of epochs, training is stopped and a model with the highest accuracy on the validation set is saved.

    6. The end-to-end attention pooling-based classification method for histopathology images according to claim 1, wherein scoring the patches by using the trained model A in S3 is specifically as follows: for each patch, a two-dimensional output $[o_1, o_2]$ is obtained for the last fully connected layer of the model A, and the output is normalized by using a softmax function to obtain $[p_1, p_2]$, wherein
    $$p_i = \frac{e^{o_i}}{\sum_{j=1}^{2} e^{o_j}},$$
    and then $p_2$ is recorded as the score of the model A on the patch.

    7. The end-to-end attention pooling-based classification method for histopathology images according to claim 1, wherein the model B comprising the attention pooling module in S4 has a structure specifically as follows: a feature extraction module, followed by a fully connected layer with a tanh activation layer and a fully connected layer with a sigmoid activation layer, and a classification module; the feature extraction module is used to extract a feature $h_i$ corresponding to each patch, then the two fully connected layers with different activation layers are used to calculate attention weights according to the feature, and a specific calculation process is as follows:
    $$a_{ti} = \tanh(w^T h_i), \quad a_{si} = \mathrm{sigmoid}(v^T h_i), \quad a_i = \frac{e^{a_{ti} a_{si}}}{\sum_{j=1}^{2m} e^{a_{tj} a_{sj}}},$$
    wherein $w$ and $v$ are parameters of the two fully connected layers, $j$ has no actual meaning and is an index used for summation, $a_i$ is the attention weight of an i-th feature $h_i$, and a weighted sum of the weights and the features is performed to obtain a feature representation of the image:
    $$h = \sum_{i=1}^{2m} a_i h_i,$$
    and then the feature is classified by the classification module composed of two fully connected layers.

    8. The end-to-end attention pooling-based classification method for histopathology images according to claim 7, wherein the feature extraction module uses a convolutional neural network (CNN), a basic unit of the CNN is a bottleneck module, each bottleneck module is composed of three convolutional layers, three batch normalization layers and a ReLU activation layer, and a CNN skeleton is divided into four parts comprising 3, 4, 6 and 3 bottleneck modules respectively; then a global pooling layer and two fully connected layers are connected; and an output dimension of the last fully connected layer is 128.

    9. The end-to-end attention pooling-based classification method for histopathology images according to claim 1, wherein the model A and the model B are both trained by using a stochastic gradient descent method and an Adam optimizer, an initial learning rate is 0.00001, and a weight decay coefficient is 0.0001; and in a process of training the model A, a value of the batch size is 32, and in a process of training the model B, a value of the batch size is 1.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0034] FIG. 1 is a working flowchart of the present disclosure (taking a breast histopathology image as an example);

    [0035] FIG. 2 is a partial area of a histopathological whole slide image used in an experiment of the present disclosure;

    [0036] FIG. 3 is a flowchart of training a model A by a standard multi-instance learning method in the present disclosure;

    [0037] FIG. 4 is a structural diagram of an attention network used in the present disclosure;

    [0038] FIG. 5 is a working mechanism of an existing attention pooling-based method; and

    [0039] FIG. 6 is a working mechanism of an end-to-end attention pooling-based method provided by the present disclosure.

    DETAILED DESCRIPTION OF THE EMBODIMENTS

    [0040] The technical solutions of the present disclosure are described below clearly and completely with reference to the appended drawings, taking a breast histopathology image as an example. The described embodiments are merely some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the protection scope of the present disclosure.

    [0041] Referring to FIG. 1 to FIG. 6, the present disclosure provides the following technical solution: an automatic classification method for histopathological whole slide images includes the following steps.

    [0042] S1, A histopathological whole slide image is cut into square patches with a side length L. In the present disclosure, a value of L is 224. Patches whose background area accounts for more than 50% of the patch area are removed, and the remaining patches are packaged into a bag.
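
    The tiling in S1 can be illustrated with a minimal sketch. The function name `tile_origins` is hypothetical, and two choices are assumptions that the disclosure does not specify: patches do not overlap, and partial patches at the right and bottom edges are dropped.

```python
def tile_origins(width, height, side=224):
    """Top-left (x, y) corners of side x side patches covering an image.

    Assumptions (not stated in the disclosure): patches are non-overlapping
    and incomplete patches at the right/bottom border are discarded.
    """
    return [
        (x, y)
        for y in range(0, height - side + 1, side)
        for x in range(0, width - side + 1, side)
    ]


# A 1000 x 500 image yields 4 columns and 2 rows of 224 x 224 patches.
origins = tile_origins(width=1000, height=500)
print(len(origins))  # 8
```

    Each returned corner identifies one candidate patch; the background filter described next decides which candidates enter the bag.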

    [0043] A method for removing the patches with the background area accounting for more than 50% is as follows: firstly, a threshold separating the background area and the foreground area is calculated for a whole slide image at a low resolution by using an Otsu method. The Otsu method finds a threshold t that minimizes a weighted sum of variances within the foreground area and the background area, that is,

    $$\sigma^2(t) = w_0(t)\,\sigma_0^2(t) + w_1(t)\,\sigma_1^2(t).$$

    [0044] $\sigma_0^2(t)$ refers to a variance of a gray value within the background area when t is taken as the threshold, and $\sigma_1^2(t)$ refers to a variance within the foreground area when t is taken as the threshold. It is assumed that when t is taken as the threshold, a gray value of pixels belonging to the background area is recorded as $B(i)$, a gray value of pixels belonging to the foreground area is recorded as $F(i)$, a number of the pixels belonging to the foreground area is recorded as $N_F$, and a number of the pixels belonging to the background area is recorded as $N_B$, and calculation methods of the variances are as follows:

    [00007] $$\sigma_0^2(t) = \frac{1}{N_B}\sum_{i}\Bigl(B(i) - \frac{1}{N_B}\sum_{j} B(j)\Bigr)^2, \quad \sigma_1^2(t) = \frac{1}{N_F}\sum_{i}\Bigl(F(i) - \frac{1}{N_F}\sum_{j} F(j)\Bigr)^2.$$

    [0045] $w_0(t)$ and $w_1(t)$ are proportions of the background area and the foreground area when t is taken as the threshold. After an optimal threshold is obtained by using the above algorithm, a proportion of the background area in a whole patch area under the optimal threshold is calculated, and if the proportion exceeds a certain value (preferably, in the present disclosure, the value is 50%), the patch is discarded.
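
    The threshold search and the patch filter described above can be sketched as follows. This is an illustrative brute-force version under stated assumptions, not the implementation of the disclosure: gray values are taken to be 8-bit, and background is assumed to be the brighter side of the threshold (pixels with gray value at or above t), which the disclosure does not state. The names `otsu_threshold` and `keep_patch` are hypothetical.

```python
import numpy as np


def otsu_threshold(gray):
    """Return t minimizing w0*var0 + w1*var1 over 8-bit gray values.

    Returns None if the image has a single gray level (no valid split).
    """
    values = np.asarray(gray, dtype=np.float64).ravel()
    best_t, best_score = None, np.inf
    for t in range(1, 256):
        dark = values[values < t]     # assumed foreground (tissue)
        bright = values[values >= t]  # assumed background
        if dark.size == 0 or bright.size == 0:
            continue
        w_dark = dark.size / values.size
        w_bright = bright.size / values.size
        # np.var is the population variance (1/N), matching the formulas.
        score = w_dark * dark.var() + w_bright * bright.var()
        if score < best_score:
            best_t, best_score = t, score
    return best_t


def keep_patch(patch_gray, t, max_background=0.5):
    """Keep a patch unless its background proportion exceeds max_background."""
    background_ratio = np.mean(np.asarray(patch_gray) >= t)
    return bool(background_ratio <= max_background)


# Synthetic low-resolution slide: half dark tissue (50), half bright glass (200).
thumbnail = np.array([[50] * 4 + [200] * 4] * 4)
t = otsu_threshold(thumbnail)
print(keep_patch(np.full((224, 224), 50), t))   # True: no background
print(keep_patch(np.full((224, 224), 200), t))  # False: all background
```

    The threshold is computed once per slide on the low-resolution image and then applied to every full-resolution patch, which follows the order of operations in paragraphs [0043] to [0045].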

    [0046] S2, The output dimension of the last fully connected layer in a pre-trained Resnet50 network is modified from 1,000 to 128, a fully connected layer with an input dimension of 128 and an output dimension of 2 is added at the end, and the modified model is recorded as A. The model A is trained by a standard multi-instance learning method.

    [0047] Training the model A by the standard multi-instance learning method is as follows: an assumption of multi-instance learning is that at least one patch in a positive bag is positive, and all patches in a negative bag are negative. Before each epoch of training, firstly all the patches are scored by using the model A, k patches with the highest score in each image are selected, and the selected patches are labeled the same as the whole image. All the selected patch-label pairs constitute a dataset required for training, and the model A is trained. The above process is repeated until an accuracy of the model A on a validation set no longer improves. When the accuracy on the validation set is calculated, the classification result that the model A gives for the highest-scoring patch of an image is taken as the classification result of the model A for the whole image.
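
    The per-epoch dataset construction described above can be sketched as follows. The sketch covers only the selection step and assumes that the patch scores have already been produced by the model A; the function name `select_top_k` and the triple layout are hypothetical.

```python
import numpy as np


def select_top_k(scores_per_image, image_labels, k):
    """Build the patch-level training set for one epoch.

    scores_per_image: list of 1-D arrays, one positive-class score per patch.
    image_labels:     list of 0/1 labels, one per whole slide image.
    Returns (image_index, patch_index, label) triples; every selected patch
    inherits the label of its image.
    """
    dataset = []
    for img_idx, (scores, label) in enumerate(zip(scores_per_image, image_labels)):
        order = np.argsort(np.asarray(scores))[::-1]  # highest score first
        for patch_idx in order[:k]:
            dataset.append((img_idx, int(patch_idx), int(label)))
    return dataset


scores = [np.array([0.1, 0.9, 0.5]), np.array([0.2, 0.3])]
labels = [1, 0]
print(select_top_k(scores, labels, k=2))
# [(0, 1, 1), (0, 2, 1), (1, 1, 0), (1, 0, 0)]
```

    Because the scores change as the model A is updated, this selection would be recomputed before every epoch.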

    [0048] Training the model A until the accuracy of the model A on the validation set no longer improves is specifically implemented as follows.

    [0049] For each patch $x_i$ in the image, an output of the model A is $[o_{i1}, o_{i2}]$, and then $[p_{i1}, p_{i2}]$ is obtained by using a softmax function, where

    [00008] $$p_{i2} = \frac{e^{o_{i2}}}{e^{o_{i1}} + e^{o_{i2}}}.$$

    [0050] Outputs of all the patches are collected to obtain $[p_{12}, p_{22}, \ldots, p_{l2}]$, where $l$ is the number of patches. The maximum value is recorded as $p_i$, and the predicted value $y_p$ of the image is:

    [00009] $$y_p = \begin{cases} 0, & p_i \le 0.5 \\ 1, & p_i > 0.5. \end{cases}$$

    [0051] When the predicted value is 0, it means that the image is predicted to be negative, and when the predicted value is 1, it means that the image is predicted to be positive. According to a prediction result, the classification accuracy of the model A is calculated, and if the accuracy of the model A on the validation set does not improve after a specified number of epochs, training is stopped and a model with the highest accuracy on the validation set is saved.
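
    The image-level prediction and the stopping rule can be sketched as follows. The sketch assumes that the second output of the model A corresponds to the positive class, consistent with the use of the second softmax component as the score. The names `predict_image`, `should_stop` and `patience` are hypothetical, and `patience` stands for the specified number of epochs, whose value the disclosure does not give.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    logits = np.asarray(logits, dtype=np.float64)
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def predict_image(patch_logits):
    """patch_logits: (l, 2) outputs of model A for the l patches of one image.

    Returns 1 (positive) if the largest positive-class probability exceeds
    0.5, otherwise 0 (negative).
    """
    p_positive = softmax(patch_logits)[:, 1]
    return int(p_positive.max() > 0.5)


def should_stop(val_accuracies, patience):
    """True once the best validation accuracy is at least `patience` epochs old."""
    best_epoch = int(np.argmax(val_accuracies))  # first epoch reaching the best
    return len(val_accuracies) - 1 - best_epoch >= patience


print(predict_image([[2.0, 0.0], [0.0, 1.0]]))  # 1: second patch is positive
print(predict_image([[2.0, 0.0], [1.0, 0.0]]))  # 0: no patch exceeds 0.5
print(should_stop([0.60, 0.70, 0.70, 0.69], patience=2))  # True
```

    Taking the maximum over patches reflects the multi-instance assumption that a single positive patch is enough to make the whole image positive.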

    [0052] S3, All the patches are scored by using the trained model A. For each image, the m patches with the highest scores and the m patches with the lowest scores are selected to form a new bag of 2m patches. Preferably, in the present disclosure, a value of m is 40.

    [0053] Scoring the patches by using the trained model A is specifically as follows: for each patch, a two-dimensional output $[o_1, o_2]$ is obtained for the last fully connected layer of the model A, and the output is normalized by using a softmax function to obtain $[p_1, p_2]$, where

    [00010] $$p_i = \frac{e^{o_i}}{\sum_{j=1}^{2} e^{o_j}}.$$

    [0054] Then $p_2$ is recorded as the score of the model A on the patch ($p_1$ is not used in the scoring process and is discarded).
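
    The construction of the new bag from the patch scores can be sketched as follows. The function name `build_new_bag` is hypothetical. The disclosure does not say how an image with fewer than 2m patches is handled, so the sketch simply raises an error in that case, which is an assumption.

```python
import numpy as np


def build_new_bag(scores, m=40):
    """Indices of the m highest-scoring and m lowest-scoring patches.

    scores: 1-D array of model A scores (p_2) for all patches of one image.
    Returns 2m patch indices: the m highest first (best first), then the
    m lowest (lowest first).
    """
    scores = np.asarray(scores)
    if scores.size < 2 * m:
        # Assumption: the disclosure does not cover this case.
        raise ValueError("image has fewer than 2m patches")
    order = np.argsort(scores)  # ascending
    lowest = order[:m]
    highest = order[-m:][::-1]
    return np.concatenate([highest, lowest])


scores = np.array([0.1, 0.9, 0.5, 0.3, 0.7])
print(build_new_bag(scores, m=2))  # [1 4 0 3]
```

    Keeping both ends of the score range gives the model B the patches that the model A considers most and least likely to be positive.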

    [0055] S4, A deep learning model B including an attention pooling module is built, and the model B is trained by using the new bag obtained in S3.

    [0056] The model B including the attention pooling module in S4 has a structure specifically as follows: a feature extraction module, followed by a fully connected layer with a tanh activation layer and a fully connected layer with a sigmoid activation layer, and a classification module.

    [0057] The deep learning model B including the attention pooling module has a mechanism as follows: firstly the feature extraction module is used to extract a feature $h_i$ corresponding to each patch, and then the two fully connected layers with different activation layers are used to calculate attention weights according to the feature. A specific calculation process is as follows:

    [00011] $$a_{ti} = \tanh(w^T h_i), \quad a_{si} = \mathrm{sigmoid}(v^T h_i), \quad a_i = \frac{e^{a_{ti} a_{si}}}{\sum_{j=1}^{2m} e^{a_{tj} a_{sj}}}.$$

    [0058] $w$ and $v$ are parameters of the two fully connected layers, and $a_i$ is the attention weight of an i-th feature $h_i$. A weighted sum of the weights and the features is performed to obtain a feature representation of the image:

    [00012] $$h = \sum_{i=1}^{2m} a_i h_i.$$

    [0059] Then the feature is classified by a classifier composed of two fully connected layers.
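
    The attention pooling computation can be sketched in NumPy as follows. The sketch covers only the pooling step and takes the patch features as given. It assumes that the two fully connected layers each map a feature to a single scalar without a bias term, as the formulas suggest; the function name `attention_pool` and the random inputs are illustrative only.

```python
import numpy as np


def attention_pool(H, w, v):
    """Gated attention pooling over patch features.

    H: (n, d) array, one d-dimensional feature h_i per patch (n = 2m).
    w, v: (d,) parameter vectors of the tanh and sigmoid branches.
    Returns (a, h): attention weights of shape (n,) and pooled feature (d,).
    """
    H = np.asarray(H, dtype=np.float64)
    a_t = np.tanh(H @ w)                    # a_ti = tanh(w^T h_i)
    a_s = 1.0 / (1.0 + np.exp(-(H @ v)))    # a_si = sigmoid(v^T h_i)
    gate = a_t * a_s
    exp = np.exp(gate - gate.max())         # shift for numerical stability
    a = exp / exp.sum()                     # softmax over the 2m patches
    return a, a @ H                         # h = sum_i a_i h_i


rng = np.random.default_rng(0)
H = rng.normal(size=(80, 128))              # 2m = 80 patches, 128-d features
w = rng.normal(size=128)
v = rng.normal(size=128)
a, h = attention_pool(H, w, v)
print(a.shape, h.shape)                     # (80,) (128,)
```

    Because the weights are a softmax over the bag, they are positive and sum to one, so the pooled feature is a convex combination of the patch features. With `w` and `v` set to zero the weights become uniform and the pooled feature reduces to the mean of the patch features.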

    [0060] The feature extraction module uses a CNN whose basic unit is a bottleneck module. Each bottleneck module is composed of three convolutional layers, three batch normalization layers and a ReLU activation layer, and the CNN skeleton is divided into four parts including 3, 4, 6 and 3 bottleneck modules respectively. A global pooling layer and two fully connected layers follow. An output dimension of the last fully connected layer is 128.

    [0061] S5, After a bag containing 2m patches is obtained by processing the histopathology image to be classified through S1 and S3, the bag is classified by using the model B trained in S4. The classification result is the final classification result of the original image.

    [0062] The model A and the model B are both trained by using a stochastic gradient descent method and an Adam optimizer, an initial learning rate is 0.00001, and a weight decay coefficient is 0.0001. In a process of training the model A, a value of the batch size is 32. In a process of training the model B, a value of the batch size is 1.
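
    A single Adam update with the stated learning rate and weight decay coefficient can be sketched as follows. Several details are assumptions that the disclosure does not give: the moment coefficients and epsilon are the commonly used Adam defaults, and the weight decay is applied as an L2 term added to the gradient rather than in decoupled form. The function name `adam_step` is hypothetical.

```python
import numpy as np


def adam_step(param, grad, m, v, t,
              lr=1e-5, weight_decay=1e-4,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count.

    lr and weight_decay follow the disclosure; beta1, beta2 and eps are
    assumed defaults. Weight decay is assumed to be L2 (added to the gradient).
    """
    grad = grad + weight_decay * param
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


p, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
p, m, v = adam_step(p, grad=np.array([1.0]), m=m, v=v, t=1)
print(p)  # moves by roughly lr = 1e-5 on the first step
```

    With a batch size of 1 for the model B, each such update would be driven by the gradient from a single bag of 2m patches.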

    [0063] Automatic classification of histopathological whole slide images is still largely at a theoretical stage at present. Most current algorithms for histopathology diagnosis build a two-stage model based on maximum pooling or other pooling. Because these models lack an effective feedback mechanism for the quality of feature extraction, their accuracy is generally not high. In 2019, a classification index on the CAMELYON16 dataset used in the experiment was only 0.725. The present disclosure improves the original deep learning method by sampling the patches according to the scores given by the deep learning model and then classifying with the end-to-end attention pooling-based network, such that the diagnosis accuracy of the original method is effectively improved and reaches 0.790 on the CAMELYON16 dataset used in the experiment. The method can thus provide effective help for doctors.

    [0064] The specific embodiments described herein are merely illustrative of the spirit of the present disclosure. A person skilled in the art can make various modifications or supplements to the specific embodiments described, or replace them in a similar manner, without departing from the spirit of the present disclosure or the scope defined by the appended claims.