METHOD FOR ESTABLISHING 3D MEDICAL IMAGE SEGMENTATION MODEL BASED ON MASKED MODELING AND APPLICATION THEREOF

Abstract

Disclosed is a method for establishing a 3D medical image segmentation model based on masked modeling and application thereof includes: establishing a semi-supervised learning network, wherein a student network includes an encoding module for extracting latent features and a segmentation decoder that predicts segmentation results, a teacher network includes an encoding module and a segmentation decoder that are structurally consistent with the student network; training the semi-supervised learning network, wherein during training, two random masking operations are performed on each image, and the image is input to the two networks respectively; optimizing and updating the weight of the student network, and transferring the updated weight to the teacher network, wherein the training loss function includes prototype representation loss, which is used to characterize the difference between the prototypes extracted and generated by the two networks; the student network may further include a reconstruction decoder and an auxiliary segmentation decoder.

Claims

1. A method for establishing a 3D medical image segmentation model based on masked modeling, comprising: (S1) establishing a semi-supervised learning network comprising a student network and a teacher network; wherein the student network comprises: a first encoding module and a decoding module; the first encoding module comprises an encoder for extracting different size features of a 3D input image to obtain a latent feature; the decoding module comprises a first segmentation decoder, and the first segmentation decoder is disposed for performing feature extraction and up-sampling on the latent feature to obtain segmentation results; the teacher network comprising: a second encoding module structurally consistent with the first encoding module, and a second segmentation decoder structurally consistent with the first segmentation decoder; (S2) using a 3D medical image segmentation dataset comprising labeled images and unlabeled images to train the semi-supervised learning network, wherein the training method is as follows: fixing weights of the teacher network, and performing two random masking operations on each of the images, and the image is input to the student network and teacher network respectively, optimizing and updating a weight of the student network according to a preset training loss function, and transferring the updated weight to the teacher network; the training loss function comprises a prototype representation loss L.sub.p1, which is disposed to characterize a difference between features in a corresponding area of a segmentation target in the latent features V.sup.s and V.sup.t extracted by the student network and the teacher network; (S3) extracting the first encoding module and connecting the first encoding module to a first decoder to form the 3D medical image segmentation model.

2. The method for establishing the 3D medical image segmentation model based on masked modeling according to claim 1, wherein for the student network or teacher network, a feature p.sub.fg in the corresponding area of the segmentation target in the latent feature is calculated as follows: $p_{fg} = \frac{1}{C} {.Math.}_{j = 1}^{C} Up (V_{j}) .Math. P_{j}$ wherein V represents the latent feature, P represents the segmentation result; C represents a number of channels of the latent feature, V.sub.j represents a j-th channel of V, P.sub.j represents a j-th channel of P; and UP( ) represents an up-sampling operation; and an expression of the prototype representation loss L.sub.p1 is as follows: $L_{p 1} = \frac{1}{N + M} {.Math.}_{i = 1}^{N + M} L_{mse} (p_{fg}^{s}, p_{fg}^{t})$ wherein p.sub.fg.sup.s and p.sub.fg.sup.t represent characteristics of the corresponding area of the segmentation target in the latent features V.sup.s and V.sup.t respectively, L.sub.mse represents a root mean square error, N and M respectively represent a number of the labeled images and the unlabeled images in the 3D medical image segmentation dataset.

3. The method for establishing the 3D medical image segmentation model based on masked modeling according to claim 1, wherein the training loss function further comprises: a latent feature loss L.sub.fea; the latent feature loss L.sub.fea is disposed to characterize a difference between the latent features extracted by the student network and the teacher network, and is expressed as follows: $L_{fea} = \frac{1}{N + M} {.Math.}_{i = 1}^{N + M} L_{mse} (V_{i}^{s}, V_{i}^{t})$ wherein L.sub.mse represents a root mean square error, N and M respectively represent a number of the labeled images and the unlabeled images in the 3D medical image segmentation dataset; V.sub.i.sup.s and V.sub.i.sup.t respectively represent the latent features extracted by the student network and the teacher network after an i-th image X.sub.i in the 3D medical image segmentation dataset is input.

4. The method for establishing the 3D medical image segmentation model based on masked modeling according to claim 3, wherein in the student network, the decoding module further comprises K auxiliary segmentation decoders; the auxiliary segmentation decoder is disposed to extract and up-sample the latent features to obtain the segmentation results; the up-sampling methods of the K auxiliary segmentation decoders are different from each other, and are different from that of the first segmentation decoder; and the training loss function further comprises: a segmentation consistency loss L.sub.mc; a segmentation consistency constraint is disposed to characterize a difference between segmentation results of the K auxiliary segmentation decoders and the first segmentation decoder, and is expressed as follows: $L_{mc} = \frac{1}{N + M} {.Math.}_{i = 1}^{N + M} \underset{m, n = 1 & m n}{.Math.} L_{mse} (P_{i . m}^{s}, P_{i, n}^{s_sharp})$ wherein K is a positive integer, A=C.sub.K+1.sup.2; P.sub.i.m.sup.s and P.sub.i.n.sup.s represent segmentation results predicted by an m-th segmentation decoder and an n-th segmentation decoder respectively after the image X.sub.i is input, P.sub.i,n.sup.s_sharp represents a result after sharpening P.sub.i,n.sup.s; the segmentation decoder is an auxiliary segmentation decoder or the first segmentation decoder.

5. The method for establishing the 3D medical image segmentation model based on masked modeling according to claim 4, wherein in the student network, the decoding module further comprises: a reconstruction decoder; wherein the reconstruction decoder is disposed to extract and up-sample the latent features to restore an original image information and obtain a reconstructed image; the training loss function further comprises: a reconstruction loss L.sub.sup1; the reconstruction loss is disposed to characterize a difference between the reconstructed image reconstructed by the student network and the original image, and is expressed as follows: ${\begin{matrix} L_{\sup 1} = \frac{}{N + M} {.Math.}_{i = 1}^{N + M} L_{rec} (Q_{i}^{s}, X_{i}) \\ L_{rec} = \frac{1}{N + M} {.Math.}_{l = 1}^{N + M} L_{mse} (Q_{i}^{s}, X_{i}) \end{matrix}$ wherein Q.sub.i.sup.s represents the reconstructed image reconstructed by the student network after the image X.sub.i is input, and represents a balance parameter.

6. The method for establishing the 3D medical image segmentation model based on masked modeling according to claim 5, wherein in the first encoding module, F Hybridformer modules connected successively are also comprised following the encoder; in the second encoding module, F Hybridformer modules connected successively are also comprised following the encoder; the Hybridformer module is disposed to calculate self-attention in a pixel space and a sample dimension; and the latent feature extracted by the student network is a feature image extracted by the encoder in the first encoding module and processed by the F HybridFormer modules, and the latent feature extracted by the teacher network is a feature image extracted by the encoder in the second encoding module and processed by the F HybridFormer modules; wherein F is a positive integer.

7. The method for establishing the 3D medical image segmentation model based on masked modeling according to claim 1, wherein the training loss function further comprises: a segmentation loss L.sub.sup2, which is disposed to characterize a difference between a segmentation result predicted by the first segmentation decoder and a gold standard, and is expressed as follows: $L_{\sup 2} = \frac{1}{N} {.Math.}_{i = 1}^{N} L_{seg} (P_{i}^{s}, Y_{i})$ wherein N represents a number of the labeled images in the 3D medical image segmentation dataset, Y.sub.i represents a gold standard for a segmented image corresponding to the i-th image X.sub.i in the 3D medical image segmentation dataset, P.sub.i.sup.s represents the segmentation result predicted by the first segmentation decoder after the image X.sub.i is input; and L.sub.seg represents a sum of a DICE loss and a cross-entropy loss.

8. The method for establishing the 3D medical image segmentation model based on masked modeling according to claim 1, wherein the random masking operation comprises: dividing the 3D medical images into non-overlapping cubes of equal size, randomly selecting a proportion of cubs, and setting pixels in the corresponding area to zero.

9. A 3D medical image segmentation method, comprising: a 3D medical image to be segmented being input into the 3D medical image segmentation model established by the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 1, and a segmentation result is obtained from an output of the 3D medical image segmentation model.

10. A computer-readable storage medium, comprising a stored computer program, wherein when the computer program is executed by a processor, a device where the computer-readable storage medium is located is controlled to execute the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 1.

11. A 3D medical image segmentation method,-comprising: a 3D medical image to be segmented being input into the 3D medical image segmentation model established by the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 2, and a segmentation result is obtained from an output of the 3D medical image segmentation model.

12. A 3D medical image segmentation method,-comprising: a 3D medical image to be segmented being input into the 3D medical image segmentation model established by the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 3, and a segmentation result is obtained from an output of the 3D medical image segmentation model.

13. A 3D medical image segmentation method,-comprising: a 3D medical image to be segmented being input into the 3D medical image segmentation model established by the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 4, and a segmentation result is obtained from an output of the 3D medical image segmentation model.

14. A 3D medical image segmentation method,-comprising: a 3D medical image to be segmented being input into the 3D medical image segmentation model established by the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 5, and a segmentation result is obtained from an output of the 3D medical image segmentation model.

15. A 3D medical image segmentation method,-comprising: a 3D medical image to be segmented being input into the 3D medical image segmentation model established by the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 6, and a segmentation result is obtained from an output of the 3D medical image segmentation model.

16. A 3D medical image segmentation method,-comprising: a 3D medical image to be segmented being input into the 3D medical image segmentation model established by the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 7, and a segmentation result is obtained from an output of the 3D medical image segmentation model.

17. A 3D medical image segmentation method,-comprising: a 3D medical image to be segmented being input into the 3D medical image segmentation model established by the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 8, and a segmentation result is obtained from an output of the 3D medical image segmentation model.

18. A computer-readable storage medium, comprising a stored computer program, wherein when the computer program is executed by a processor, a device where the computer-readable storage medium is located is controlled to execute the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 2.

19. A computer-readable storage medium, comprising a stored computer program, wherein when the computer program is executed by a processor, a device where the computer-readable storage medium is located is controlled to execute the method for establishing the 3D medical image segmentation model based on masked modeling according to claim 3.

20. A computer-readable storage medium, comprising a stored computer program, wherein when the computer program is executed by a processor, a device where the computer-readable storage medium is located is controlled to execute the 3D medical image segmentation method according to claim 9.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0042] FIG. 1 is a schematic framework diagram of a semi-supervised learning model provided by an embodiment of the present invention.

[0043] FIG. 2 is a schematic diagram of a method for establishing a 3D medical image segmentation model based on masked modeling provided by an embodiment of the present invention.

[0044] FIG. 3 is a schematic diagram comparing the segmentation results on the same left atrium dataset obtained by the 3D medical image segmentation method provided by an embodiment of the present invention and the existing methods; wherein (a) shows the gold standard for segmentation, (b) shows the segmentation result obtained by the existing MT method, (c) shows the segmentation result obtained by the existing UA-MT method, (d) shows the segmentation result obtained by the existing SSASNet method, (e) shows the segmentation result obtained by the existing DTC method, (f) shows the segmentation result obtained by the existing URPC method, (g) shows the segmentation result obtained by the existing MCNet+ method, and (h) shows the segmentation result obtained by the 3D medical image segmentation method provided by an embodiment of the present invention.

DESCRIPTION OF EMBODIMENTS

[0045] In order to make the purpose, technical solutions and advantages of the present invention more clearly, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

[0046] In the present invention, the terms first, second, etc. (if present) in the present invention and the accompanying drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

[0047] Due to sparse sample size, there are problems of poor robustness, poor generalization and low segmentation accuracy when applying existing SSL method adopting a dual-model architecture to 3D medical image segmentation. In view of the above technical problem, the present invention provides a method for establishing a 3D medical image segmentation model based on masked modeling and application thereof. The whole idea is to generate two different masked images by performing random masking operations on the input original images, and input the masked images into the student network and teacher network respectively, so that the input images of the two models are all incomplete, but they jointly contain the overall information. Due to the randomness of the masking strategy, there is also vast diversity in the image segmentation tasks of the two networks, so that the two networks may robustly learn related yet complementary features, which, in turn, allows the consistency constraint at the feature level to provide effective unsupervised guidance throughout the training, thus improving the robustness, generalization and accuracy of segmentation.

[0048] Based on the above masking strategy, in order to further improve the performance of 3D medical image segmentation, the invention provides two specially designed learning fashions, namely diverse joint-task learning (DJL) and decoupled inter-student learning (DIL), and instantiated them as an enhanced teacher-student architecture to realize robust semi-supervised 3D medical image segmentation. In various joint task learning, the student network not only completes the 3D image segmentation task, but also completes the task of restoring the original image information. These two tasks will share the same encoder structure and generate different masked images based on the masking strategy. The student and teacher models learn to jointly segment the same targets while restoring different image contents. The joint task also has vast diversity due to the randomness of the masking strategy. To facilitate DJL, the student model in DIL is attached with one or more auxiliary decoding branches for segmentation, which may be viewed as other students via different up-sampling designs. Pairwise consistency constraints are optimized at the output level, allowing the two branches to learn mutually with the original student model while their own weights remain detached from the teacher-student synchronization. The student, therefore, benefits from more informative unsupervised guidance from the extra decoupled knowledge, and it is possible to avoid errors in the knowledge learned from the original branch, which in turn affects the ability of the teacher network to learn the correct knowledge. In this way, it is possible to gain the ability to suspect, monitor and correct the teacher's possible errors.

[0049] Examples are provided below.

Example 1

[0050] A method for establishing a 3D medical image segmentation model based on masked modeling, as shown in FIG. 1 and FIG. 2, including:

[0051] (S1) Establishing a semi-supervised learning network including a student network and a teacher network, wherein the structure of the semi-supervised learning network is shown in FIG. 1:

[0052] Referring to FIG. 1, the student network includes: a first encoding module and a decoding module; the first encoding module includes an encoder, which is used to extract different size features of the 3D input image to obtain latent features; the decoding module includes a first segmentation decoder, and the first segmentation decoder is used to extract and up-sample latent features to obtain segmentation results.

[0053] Optionally, in this embodiment, the encoder in the student network is composed of multiple convolutional layers and down-sampling layers; the first segmentation decoder is correspondingly composed of multiple convolutional layers and up-sampling layers, and the up-sampling operation is implemented through transposed convolution.

[0054] In this embodiment, the input student network and teacher network are images processed through random masking operations, there will be some missing information in the images. In response to this situation, in this embodiment, in the first encoding module, F Hybridformer modules connected successively are also included following the encoder, and F is a positive integer; the HybridFormer module is used to calculate self-attention in pixel space and sample dimension. Moreover, the latent features extracted by the student network are the feature images extracted by the encoder in the first encoding module and processed by F HybridFormer modules.

[0055] Optionally, in the HybridFormer module, the calculation of self-attention in pixel space and self-attention in sample dimension are completed by two parts, which include multi-layer convolution layers and down-sampling layers respectively. In practical applications, the number of F HybridFormer modules may be flexibly set according to actual needs.

[0056] The latent features output by the first encoding module are input into the first segmentation decoder, and the low-level output feature map and the feature map extracted by the corresponding upper-level large-scale convolution kernel are concatenated along the channel direction through the spanning connection, and 33 convolution is adopted to further extracts large-size features, and finally obtains a segmentation prediction map.

[0057] Referring to FIG. 1, the teacher network includes: a second encoding module and a second segmentation decoder. The second encoding module is structurally consistent with the first encoding module in the student network, including the encoder and F HybridFormer modules following the encoder, and is used to extract latent features of the input masked image. The second segmentation decoder is structurally consistent with the first segmentation decoder in the student network, and is used to extract and up-sample latent features to obtain segmentation results.

[0058] In the student network and teacher network, the introduction of the HybridFormer module effectively improves the feature representation ability in the latent space.

[0059] Referring to FIG. 1, in order to implement diverse joint task learning (DJL) to better learn complementary information in masked image pairs, the decoding module in the student network includes two types of branches, one is the reconstruction decoder and the other one is the segmentation decoder. The reconstruction decoder is used to extract and up-sample latent features to restore the original image information and obtain the reconstructed image; the segmentation decoder predicts the segmentation results. In the meantime, in order to reduce the prediction uncertainty in segmentation results, the embodiment specifically includes three segmentation decoders, one of which is the first segmentation decoder, and the other two are auxiliary segmentation decoders. These two auxiliary segmentation decoders are also used for performing feature extraction and up-sampling on latent features to obtain segmentation results. However, the up-sampling methods adopted by these two auxiliary segmentation decoders are different from each other and are different from the up-sampling method of the first segmentation decoder, involving linear interpolation and nearest neighbor interpolation respectively. The auxiliary segmentation decoder cooperates with the first segmentation decoder to achieve decoupled inter-student learning (DIL).

[0060] It should be noted that the number of auxiliary segmentation decoders and the up-sampling method adopted here are only exemplary descriptions and should not be understood as the only limitation of the present invention. In practical applications, they may be flexibly adjusted according to actual needs.

[0061] For the established semi-supervised learning model, this embodiment provides a training method for the model accordingly, which is specifically as follows:

[0062] (S2) Using the 3D medical image segmentation dataset including labeled images and unlabeled images to train the semi-supervised learning network.

[0063] The training method is as follows: fixing the weights of the teacher network, and performing two random masking operations on each image, and the image is input to the student network and teacher network respectively, optimizing and updating the weight of the student network according to the preset training loss function, and transferring the updated weight to the teacher network. Optionally, in this embodiment, the student network is transferred to the teacher network, which is specifically accomplished through exponential moving average (EMA).

[0064] In this embodiment, X.sub.i custom-character and Y.sub.i{0,1}.sup.HWD are adopted to represent the input image and segmentation labels, wherein H, W and D represent the height, width and depth of the 3D image. The labeled dataset and the unlabeled dataset are expressed as Da.sub.L={X.sub.i,Y.sub.i |i=1, . . . N} and Da.sub.U={X.sub.i|i=N+1, . . . N+M} respectively, wherein N and M represent the number of labeled images and unlabeled images respectively. P(Y.sub.pred|X;) is a probability map of the input image, and is used to represent the probability that each pixel belongs to various segmentation targets.

[0065] Optionally, in this embodiment, specific methods of masking 3D images include the following:

[0066] The 3D medical image is divided into, for example, 444 non-overlapping cubes with equal size, and then a proportion of cubes are randomly sampled from even distribution for masking, and the pixels in the covered area are set to 0. Through the masking operation, the two masked images input to the student network and the teacher network have some information missing. Due to the randomness of the random masking operation, the masked areas of the two masked images are different, and features will be corrupted by random masks in the latent space, which is crucial for medical image segmentation tasks. In the meantime, the two masked images jointly contain complete image information.

[0067] Based on the semi-supervised learning network established in step (S1), the masked image input to the student network is processed by its encoding module to obtain the latent feature Vs, and then the first segmentation decoder predicts the segmentation result P.sup.s_s_Tconv. Meanwhile, the segmentation results P.sup.s_s_Tri and P.sup.s_s_Nea are obtained through two auxiliary segmentation decoders. After the original image information is restored by the reconstruction decoder, the reconstructed image X.sub.s.sup.r is obtained. After the masked image input to the teacher network is processed by its encoding module, the latent feature V.sup.t is obtained, and then the segmentation result P.sup.s_t is obtained through predicting performed by the first segmentation decoder.

[0068] In the existing dual-model architecture, pixel space information is only extracted by enforcing the probability map of the student network and the probability map of the teacher network, while ignoring the constraints of the latent feature space and the mining of information in unlabeled data. Although consistency learning plays a very important role in semi-supervised segmentation tasks, the masking operation adopted in this embodiment reduces redundant information and creates a more challenging feature representation task than noise or transformation. The same image with different random masks may lead to different predictions, especially when the target area to be segmented is masked. Directly aligning the prediction results of the teacher-student network is too strict and may lead to collapse of predictions. Inspired by prototype learning, this embodiment uses the features and prediction maps of the latent space to extract the prototype representation of its foreground to explore the connection between the feature map and the probability map. Correspondingly, in this embodiment, the training loss function includes the prototype representation loss L.sub.p1, which is used to characterize the difference between the features of the corresponding area of the segmentation target in the latent features V.sup.s and V.sup.t extracted by the student network and the teacher network. The prototype, namely the characteristics of the corresponding area of the segmentation target in the latent features, is specifically determined jointly through the latent features extracted by the encoding module and the segmentation result predicted by the segmentation decoder. Therefore, in this embodiment, the design of the training loss function may be used to constraint the feature space and the prototype to train the established model and obtain better segmentation performance.

[0069] In order to extract multi-channel feature information, this embodiment averages the features along the channel dimension when calculating the prototype. Correspondingly, for the student network or teacher network, the features in the corresponding area of the segmentation target in the latent features are calculated as follows:

[00007] $p_{fg} = \frac{1}{C} {.Math.}_{j = 1}^{C} Up (V_{j}) .Math. P_{j}$

[0070] In the formula, V represents the latent feature, P represents the segmentation result; C represents the number of channels of the latent feature, V.sub.j represents the j-th channel of V, P.sub.j represents the j-th channel of P; UP(represents the up-sampling operation; and P.sub.fg[0,1].sup.HWD represents the foreground prototype. Based on the generation method of the prototype, the prototype representation loss L.sub.p1 is expressed as follows:

[00008] $L_{p 1} = \frac{1}{N + M} {.Math.}_{i = 1}^{N + M} L_{mse} (p_{fg}^{s}, p_{fg}^{t})$

[0071] In the formula, p.sub.fg.sup.s and p.sub.fg.sup.t represent the characteristics of the corresponding area of the segmentation target in the latent features V.sup.s and V.sup.t respectively, L.sub.mse represents the root mean square error, N and M respectively represent the number of labeled images and unlabeled images in the 3D medical image segmentation dataset.

[0072] This embodiment further takes into consideration that the masks of the student network and the teacher network are independent from each other, so the training loss function is further improved to enhance the similarity of the latent features (V.sup.s, V.sup.t) extracted by the two to achieve information complementation. Correspondingly, in this embodiment, the training loss function also includes: latent feature loss L.sub.fea; latent feature loss L.sub.fea is used to characterize the difference between the latent features extracted by the student network and the teacher network, and the expression thereof is as follows:

[00009] $L_{fea} = \frac{1}{N + M} {.Math.}_{i = 1}^{N + M} L_{mse} (V_{i}^{s}, V_{i}^{t})$

[0073] In the formula, L.sub.mse represents the root mean square error, N and M respectively represent the number of labeled images and unlabeled images in the 3D medical image segmentation dataset; V.sub.i.sup.s and V.sub.i.sup.t respectively represent latent features extracted by student network and teacher network after the i-th image X.sub.i in the 3D medical image segmentation dataset is input.

[0074] In order to effectively avoid the impact of prediction uncertainty on the robustness, generalization and accuracy of the network, in this embodiment, based on the introduction of the auxiliary segmentation decoder, the auxiliary segmentation decoder together with the original first segmentation decoder will be enforced to pass consistency constraints in pairs, and the sharpening function is adopted to attenuate the influence of pixels that are easily misclassified. For any segmentation decoder, P.sub.i.sup.s is adopted to represent the segmentation result predicted thereby after the i-th image X.sub.i is input, and the prediction result after processing by the sharpening function is expressed as follows:

[00010] $P_{i}^{s_sharp} = \frac{{(P_{i}^{s})}^{1 / T}}{{(P_{i}^{s})}^{1 / T} + {(1 - P_{i}^{s})}^{1 / T}}$

[0075] In the formula, T represents the hyperparameter used to control the degree of sharpening; P.sub.i.sup.s_sharp is the result processed by the sharpening function, and is used as a pseudo label in the consistency constraint between segmentation decoders, thus enabling the segmentation results predicted by the decoder to be close to the results after sharpening.

[0076] In this embodiment, the consistency constraint between segmentation decoders is characterized by the segmentation consistency loss L.sub.mc. Correspondingly, the training loss function also includes: segmentation consistency loss L.sub.mc; the segmentation consistency constraint is used to characterize the difference between the segmentation results of the two auxiliary segmentation decoders and the first segmentation decoder, and is expressed as follows:

[00011] $L_{mc} = \frac{1}{N + M} {.Math.}_{i = 1}^{N + M} \underset{m, n = 1 & m n}{.Math.} L_{mse} (P_{i . m}^{s}, P_{i, n}^{s_sharp})$

[0077] In the formula, P.sub.i.m.sup.s and P.sub.i.n.sup.s respectively represent the segmentation results predicted by the m-th segmentation decoder and the n-th segmentation decoder after the image X.sub.i is input, and P.sub.i,n.sup.s_sharp represents the result after sharpening processing P.sub.i.n.sup.s; the segmentation decoder is the auxiliary segmentation decoder or the first segmentation decoder. In this embodiment, by introducing a segmentation consistency constraint into the training loss function, the segmentation results of each segmentation decoder may be constrained to be consistent.

[0078] Since this embodiment implements joint task learning, the joint task includes not only the segmentation task, but also the reconstruction task, and the two tasks share the same encoding structure. When the information of masked image is accurate, the original and uncorrupted voxel information may be reconstructed through the reconstruction task, so that the encoder extracts features more effectively. In order to ensure the accuracy of the reconstruction task, in this embodiment, the training loss function also includes: reconstruction loss L.sub.sup1; the reconstruction loss is used to characterize the difference between the reconstructed image reconstructed by the student network and the original image, and is expressed as follows:

[00012] ${\begin{matrix} L_{\sup 1} = \frac{}{N + M} {.Math.}_{l = 1}^{N + M} L_{rec} (Q_{i}^{s}, X_{i}) \\ L_{rec} = \frac{1}{N + M} {.Math.}_{l = 1}^{N + M} L_{mse} (Q_{i}^{s}, X_{i}) \end{matrix}$

[0079] In the formula, Q.sub.i.sup.s represents the reconstructed image reconstructed by the student network after the image X.sub.i is input, and a represents the balance parameter.

[0080] In the meantime, in order to ensure the accuracy of the segmentation task, in this embodiment, the training loss function also includes: segmentation loss L.sub.sup2, which is used to characterize the difference between the segmentation result predicted by the first segmentation decoder and the gold standard, and is expressed as follows:

[00013] $L_{\sup 2} = \frac{1}{N} {.Math.}_{i = 1}^{N} L_{seg} (P_{i}^{s}, Y_{i})$

[0081] In the formula, N represents the number of labeled images in the 3D medical image segmentation dataset, Y.sub.i represents the gold standard for the segmented image corresponding to the i-th image X.sub.i in the 3D medical image segmentation dataset, P.sub.i.sup.s represents the segmentation result predicted by the first segmentation decoder after the image X.sub.i is input; and L.sub.seg represents the sum of DICE loss and cross-entropy loss.

[0082] Based on the above analysis, in this embodiment, the overall training loss function may be expressed as follows:

[00014] $L_{total} = L_{p 1} + L_{fea} + L_{mc} + L_{\sup 1} + L_{\sup 2} .$

[0083] On basis of random masking and based on the above semi-supervised learning network structure and the corresponding training loss function, a student network with excellent segmentation performance may be obtained in the embodiment.

[0084] On basis of the above steps (S1) and (S2), this embodiment also includes: (S3) extracting the first encoding module and connecting the first encoding module to the first decoder to form a 3D medical image segmentation model.

[0085] In general, this embodiment ensures task diversity through random masking strategies. On basis of the above, through various joint task learning (DJL) and decoupled inter-student learning (DIL), the student network and teacher network in the semi-supervised learning network may robustly learn related yet complementary features, which, in turn, allows the consistency constraint at the feature level to provide effective unsupervised guidance throughout the training, and also allows the student network to receive decoupled knowledge so as to obtain more informative unsupervised guidance, and provides the teacher network with abilities to perform error suspicion monitoring and correction, thus ultimately improving the robustness, generalization and higher accuracy of 3D medical image segmentation models established by the present embodiment.

Example 2

[0086] A 3D medical image segmentation method, including:

[0087] The 3D medical image to be segmented is input into the 3D medical image segmentation model established by the method for establishing a 3D medical image segmentation model based on masked modeling provided in the above-mentioned Example 1, and the segmentation result is obtained from the output of the 3D medical image segmentation model.

[0088] Since the 3D medical image segmentation model established in Example 1 has better robustness and generalization, as well as higher segmentation accuracy, based on the 3D medical image segmentation model, this embodiment may obtain segmentation results with high accuracy in various 3D medical image segmentation scenarios.

Example 3

[0089] A computer-readable storage medium includes a stored computer program. When the computer program is executed by a processor, the device where the computer-readable storage medium is located is controlled to execute the method for establishing a 3D medical image segmentation model based on masked modeling provided in the above-mentioned Example 1, and/or, the 3D medical image segmentation method provided in the above-mentioned Example 2.

[0090] The following uses the segmentation results on the 3D GE-MRI dataset from the left atrial segmentation challenge to further verify the advantageous effects of the method provided by the present invention. For the left atrial dataset, it is obtained using the clinical whole-body MRI scanner and the resolution of the data is 0.6250.6250.625 mm.sup.3. Totally, 154 scans have expert annotations, among which 123 scans and 31 scans are randomly selected to train and test our method, respectively. Before model training, we implement the pre-processing on all the scans by normalizing pixel intensities to unit variance and zero mean, and randomly cropping the samples to 11211280 mm.sup.3. For this dataset, the image segmentation visualization results of the 3D medical image segmentation method provided by the present invention and other existing segmentation methods on the left atrium dataset are shown in FIG. 3. In FIG. 3, (a) is the gold standard for segmentation, (b) to (g) are the segmentation results of the existing MT, UA-MT, SSASNet, DTC, URPC, and MCNet+ methods respectively, and (h) is the segmentation results of the 3D medical image segmentation method provided in the embodiment of the present invention. Table 1 lists segmentation quantitative results using 10% and 20% labeled data. Dice, Jaccard similarity coefficient (Jaccard), 95% Hausdorff distance (95HD) and average surface distance (ASD) are used as the evaluation metrics. The greater Dice and Jaccard value, the better the network performance; the smaller 95HD and ASD values, the better the network performance. In Table 1, Ours represents the segmentation method provided by the present invention.

[0091] As shown in Table 1, the segmentation results with supervised V-Net with 10% and 20% annotations will serve as the baselines. It is shown in Table 1 that all semi-supervised approaches may provide more productive guidance on segmented results than the supervised baselines with 10% annotations, which reveals the usefulness of diverse and substantial information contained in the unlabeled data for model training. In particular, as shown in FIG. 3, supervised V-Net with 100% labeled data outperforms all semi-supervised methods. Although the supervised V-Net with 100% labeled data surpasses the compared semisupervised approaches, there only exists the slight difference between it and our method. Besides, the above quantitative results comprehensively demonstrate that our approach may yield the substantial segmentation performance improvements over other semi-supervised methods.

[0092] Specifically, the mainstream semi-supervised methods achieved greater improvements than the corresponding supervised training V-Net on both 20% and 10% labeled data. In particular, the method of the present invention outperforms the compared methods in four quantitative metrics. Among the compared methods, MC-Net+ performs slightly worse than the method of the present invention. Segmentation results for all semi-supervised methods improve as labeled data increases. With extremely limited labeled data, compared with other state-of-the-art semi-supervised algorithms, the method of the present invention may still provide significantly improved segmentation metrics. In order to visually reveal the advantage of the method of the present invention, FIG. 3 visually shows the 2D and 3D views of segmented results generated by all evaluated methods under 10% and 20% labeled data setting. Clearly, the method of the present invention preserves more fine details and sharpens most isolated areas. Although the method of the present invention does not achieve the optimal results in terms of number of parameters and calculations, the difference from other evaluated algorithms is not significant, so the method of the present invention will not bring too much computational burden in view of the powerful computing power of modern computers.

TABLE-US-00001 TABLE 1 QUANTITATIVE COMPARISONS OF ALL EVALUATED APPROACHES ON THE LEFT ATRIUM SEGMENTATION TASK Metrics Complexity Scans used Dice(%) Jaccard(%) 95HD(voxel) ASD(voxel) Para(M) MACs(G) Methods Labeled Unlabeled V-Net 25 0 88.98 80.25 13.87 3.39 9.18 46.85 MT(2017) 25(20%) 98(80%) 90.17 82.18 9.52 2.48 9.18 46.85 UA-MT(2019) 25(20%) 98(80%) 90.67 82.99 7.86 2.35 9.18 46.85 SASSNet(2020) 25(20%) 98(80%) 90.57 82.83 7.27 2.24 9.44 46.88 DTC(2021) 25(20%) 98(80%) 90.64 82.93 9.39 2.65 9.44 46.88 URPC(2021) 25(20%) 98(80%) 90.33 82.36 9.70 1.66 5.85 69.36 MC-Net + (2022) 25(20%) 98(80%) 91.19 83.86 6.18 1.48 9.44 46.88 Ours 25(20%) 98(80%) 91.92 85.10 5.19 1.43 11.52 47.37 V-Net 12 0 87.07 77.27 15.13 4.29 9.18 46.85 MT(2017) 12(10%) 111(90%) 88.58 79.63 11.47 2.71 9.18 46.85 UA-MT(2019) 12(10%) 111(90%) 89.38 80.92 12.36 3.56 9.18 46.85 SASSNet(2020) 12(10%) 111(90%) 89.60 81.25 7.89 2.24 9.44 46.88 DTC(2021) 12(10%) 111(90%) 90.09 82.03 8.36 1.86 9.44 46.88 URPC(2021) 12(10%) 111(90%) 89.72 82.06 7.07 2.47 5.85 69.36 MC-Net + (2022) 12(10%) 111(90%) 90.60 82.89 9.05 2.42 9.44 46.88 Ours 12(10%) 111(90%) 91.02 83.57 6.31 1.44 11.52 47.37

[0093] It is easy for those skilled in the art to understand that the above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present invention should all be included in the protection scope of the present invention.

METHOD FOR ESTABLISHING 3D MEDICAL IMAGE SEGMENTATION MODEL BASED ON MASKED MODELING AND APPLICATION THEREOF

Assignee

Inventors

Cpc classification

Classification Explorer

G06V10/7792

PHYSICS

Classification Explorer

G06V20/64

PHYSICS

Classification Explorer

G06V10/44

PHYSICS

Classification Explorer

G06V10/776

PHYSICS

Classification Explorer

G06V10/7753

PHYSICS

Classification Explorer

G06V10/26

PHYSICS

Classification Explorer

Y02T10/40

GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS

Classification Explorer

G06V2201/03

PHYSICS

International classification

Classification Explorer

G06V10/778

PHYSICS

Classification Explorer

G06V10/44

PHYSICS

Classification Explorer

G06V10/774

PHYSICS

Classification Explorer

G06V10/776

PHYSICS

Classification Explorer

G06V10/26

PHYSICS

Classification Explorer

G06V20/64

PHYSICS

Abstract

Claims

Description