IMAGE SEGMENTATION SYSTEM AND METHOD
20230267616 · 2023-08-24
Inventors
- Dheo Arokhim Yusufi CAHYO (Singapore, SG)
- Han Nian Marcus ANG (Singapore, SG)
- Leopold SCHMETTERER (Singapore, SG)
- Ai Ping YOW (Singapore, SG)
- Wing Kee Damon WONG (Singapore, SG)
Cpc classification
G06T3/40
PHYSICS
G06T2207/10101
PHYSICS
International classification
G06T3/40
PHYSICS
A61B3/10
HUMAN NECESSITIES
Abstract
Disclosed herein is a method of segmenting a volumetric image comprising a plurality of slices, the method comprising: inputting a target slice of the volumetric image to a deep neural network (DNN) having a multi-task learning architecture, the multi-task learning architecture comprising: a segmentation DNN that is configured to output a segmentation of the target slice; and a reconstruction DNN that is configured to: receive a plurality of adjacent slices to the target slice; and output a reconstruction of the target slice based on the plurality of adjacent slices; wherein the reconstruction DNN is further configured to share spatial information with the segmentation DNN, the spatial information being indicative of correlations between the adjacent slices and the target slice.
Claims
1-17. (canceled)
18. A method of segmenting a volumetric image comprising a plurality of slices, the method comprising: inputting a target slice of the volumetric image to a deep neural network (DNN) having a multi-task learning architecture, the multi-task learning architecture comprising: a segmentation DNN that is configured to output a segmentation of the target slice; and a reconstruction DNN that is configured to: receive a plurality of adjacent slices to the target slice; and output a reconstruction of the target slice based on the plurality of adjacent slices; wherein the reconstruction DNN is further configured to share spatial information with the segmentation DNN, the spatial information being indicative of correlations between the adjacent slices and the target slice.
19. A method according to claim 18, wherein the reconstruction DNN comprises a convolutional feature extractor for generating first feature data from the adjacent slices, and a reconstruction downsampler for generating first reduced-dimension feature data from the first feature data at one or more scales.
20. A method according to claim 19, wherein the reconstruction DNN comprises a reconstruction upsampler for transforming the first reduced-dimension feature data to first upsampled data having the same dimensions as the first feature data.
21. A method according to claim 19, wherein the reconstruction DNN comprises one or more dimension reduction layers for applying a dimension reduction mechanism to the first feature data and/or to the first reduced-dimension feature data.
22. A method according to claim 21, wherein the dimension reduction mechanism comprises: inputting the first feature data and/or the first reduced-dimension feature data to a 3D convolution layer; applying an aggregation of features between adjacent slices; applying batch normalization to the output of the 3D convolution layer; and applying a ReLU activation function to the output of the batch normalization.
23. A method according to claim 21, wherein layers of the reconstruction downsampler are connected to layers of the reconstruction upsampler via respective ones of the dimension reduction layers by concatenation.
24. A method according to claim 18, wherein the segmentation DNN comprises a convolutional feature extractor for generating second feature data from the target slice, and a segmentation downsampler for generating second reduced-dimension feature data from the second feature data at one or more scales.
25. A method according to claim 24, wherein the segmentation DNN comprises a segmentation upsampler for transforming the second reduced-dimension feature data to second upsampled data having the same dimensions as the second feature data.
26. A method according to claim 24, wherein layers of the segmentation downsampler are connected to layers of the segmentation upsampler.
27. A method according to claim 24, wherein the reconstruction DNN is configured to share spatial information with the segmentation DNN by element-wise addition of output of layers of the reconstruction upsampler to output of layers of the segmentation upsampler.
28. A method according to claim 18, wherein the loss function of the segmentation DNN is the 2D Intersection over Union (IoU) loss function.
29. A method according to claim 18, wherein the volumetric image is a 3D medical image.
30. A method according to claim 29, wherein the 3D medical image is a 3D optical coherence tomography (OCT) image.
31. A method according to claim 30, wherein the 3D OCT image is a retinal image, and wherein the target slice corresponds to a layer of the choroid.
32. A method according to claim 31, wherein the method is repeated for a plurality of target slices, and wherein the method further comprises generating a choroidal thickness map from segmentation of the plurality of target slices.
33. A system for segmentation of a volumetric image comprising a plurality of slices, comprising: at least one processor; and computer-readable storage having stored thereon instructions for causing the at least one processor to carry out a method according to claim 18.
34. Non-transitory computer-readable storage having instructions stored thereon for causing at least one processor to carry out a method according to claim 18.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0036] Embodiments will now be described, by way of non-limiting example, with reference to the drawings in which:
[0037]
[0038]
[0039]
[0040]
[0041]
[0042]
[0043]
[0044]
DETAILED DESCRIPTION
[0045] The present disclosure relates to a computationally efficient and accurate segmentation approach, which is robust to interstitial variations, for the segmentation of volumetric medical images. A novel multi-task learning architecture that is capable of fully automated three-dimensional segmentation of volumetric medical image data is proposed. The proposed architecture incorporates both reconstruction and segmentation tasks. Simultaneous reconstruction and segmentation extracts intra-slice features which are directly used for segmentation. In particular, the multi-task learning architecture aggregates the spatial context in adjacent cross-sectional slices to reconstruct a central slice, learning the spatial information between the adjacent slices in the process. Soft parameter sharing between the reconstruction and segmentation tasks may be used to channel the spatial information. Said soft parameter sharing aggregates the spatial features more explicitly by directly learning the correlation between adjacent slices and the slice that will be segmented.
[0046] Spatial context learnt by the proposed reconstruction mechanism may be fused using a U-Net-based architecture. In the present disclosure, the proposed U-Net-based architecture is referred to as Spatial Aggregated Networks (SA-Net) due to its aggregation of spatial information. SA-Net learns the spatial information between adjacent cross-sections to reconstruct a selected cross-section. Said SA-Net is a convolutional neural network that is based on a fully convolutional network and its architecture can be modified and extended to work with fewer training images and to yield more precise segmentations. The main idea of the proposed U-Net-based architecture is to supplement a contracting network by successive layers, where pooling operations are replaced by upsampling operators. Hence these layers increase the resolution of the output. Further, a successive convolutional layer can then learn to assemble a precise output based on this information. In the proposed SA-Net, there are a large number of feature channels in the upsampling part, which allows the network to propagate context information to higher resolution layers. It will be appreciated that incorporating spatial information from corresponding adjacent slices enables the proposed SA-Net architecture to explicitly integrate spatial correspondences. In general, the present disclosure does not require the whole volumetric image to be considered, thus avoiding costly computation and extensive memory requirements. At the same time, the proposed approach is not computationally heavy and is also not prone to memory leakage problems which are common in recurrent networks.
[0047]
[0054] As shown in
[0055] Detailed connections between the segmentation DNN 108 and reconstruction DNN 112 are shown in
[0056] In the reconstruction DNN 112, explicit spatial information from the adjacent slices 114 may be extracted, in particular by using a series of 3D convolutions. It will be appreciated that the reconstruction DNN 112 can be divided into downsampling and upsampling parts. In some embodiments, the reconstruction DNN 112 comprises a convolutional feature extractor 202 for generating first feature data from the adjacent slices 114. The adjacent slices 114 are then downsampled and the convolutions are repeated to extract multi-scale representations of spatial context. During the downsampling process, rich spatial information is exploited from the adjacent slices 114 by using 3D convolution and max pooling layers. In one embodiment as shown in
[0057] After the downsampling stage, convolutional upsampling is performed at different levels to ensure consistent representation of information from different scales, and is concatenated with the residuals at the same scale. In some embodiments, the reconstruction DNN 112 comprises a reconstruction upsampler 206 for transforming the first reduced-dimension feature data generated by the reconstruction downsampler 204 to first upsampled data having the same dimensions as the first feature data at one or more scales. After upsampling, a final 2D convolution is performed and the loss between the output and the ground truth (i.e., the $I_i$ slice) is calculated. Embodiments of the present disclosure use mean squared error to calculate the similarity distance between the predicted output $y^{pred}$ and the ground truth $y^{true}$.
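The mean squared error expression itself is not reproduced in this text. Assuming the conventional pixel-wise definition, with N (a symbol introduced here for illustration) denoting the number of pixels in the reconstructed slice and n indexing those pixels, the loss would read:

```latex
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{n=1}^{N} \left( y^{\mathrm{true}}_{n} - y^{\mathrm{pred}}_{n} \right)^{2}
```

The loss is zero when the reconstruction equals the target slice and grows quadratically with the per-pixel error.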
[0058] Other similarity or dissimilarity measurement such as SSIM (structural similarity) index may also be used.
[0059] In some embodiments, the reconstruction DNN 112 may further comprise one or more dimension reduction layers for applying a dimension reduction mechanism (DRM) 210 to the first feature data generated by the convolutional feature extractor 202. The reconstruction DNN 112 may also comprise one or more dimension reduction layers for applying another DRM 212 to the first reduced-dimension feature data generated by the reconstruction downsampler 204. In the present disclosure, said DRM 210 and 212 are used for a more efficient representation of the information. In particular, to reduce the number of parameters given by the 3D convolution layers, the present disclosure incorporates said DRM 210 and 212 in the bottleneck block to convert 3D information into two-dimensional (2D) information. In some embodiments, the converted information generated by the DRMs 212 and 210 is then upsampled using 2D convolution layers in reconstruction upsamplers 206 and 208, respectively.
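By way of illustration only, the conversion of 3D information into 2D information might be sketched as follows. The sketch assumes that the reduction is a weighted sum over the slice axis (equivalent to a 3D convolution whose kernel spans the full depth with size 1 in height and width), followed by a simplified normalization and a ReLU. The function name `reduce_depth`, the fixed `weights` and the single-map normalization are hypothetical simplifications and are not taken from the disclosure, in which the weights would be learned.

```python
from math import sqrt


def reduce_depth(volume, weights, eps=1e-5):
    """Collapse a [depth][height][width] feature volume to [height][width].

    Assumed steps (illustrative, not the disclosed implementation):
    1. weighted sum over the slice axis (a depth x 1 x 1 convolution);
    2. normalization to zero mean and unit variance over the map, standing
       in for batch normalization without learned scale and shift;
    3. ReLU activation.
    """
    depth = len(volume)
    if len(weights) != depth:
        raise ValueError("one weight per slice is required")

    # 1. Aggregate features between adjacent slices.
    fused = [
        [sum(weights[d] * volume[d][r][c] for d in range(depth))
         for c in range(len(volume[0][0]))]
        for r in range(len(volume[0]))
    ]

    # 2. Normalize the aggregated map.
    flat = [v for row in fused for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    scale = sqrt(var + eps)

    # 3. ReLU: keep positive responses only.
    return [[max(0.0, (v - mean) / scale) for v in row] for row in fused]
```

The output has no slice axis, so subsequent layers can operate with 2D convolutions only.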
[0060]
[0061]
[0062] As shown in
[0063] In some embodiments, as illustrated in
[0064] In the segmentation DNN 108, explicit spatial information from the target slice 104 may be extracted (see
[0065] After the downsampling stage, convolutional upsampling is performed at different levels to ensure consistent representation of information from different scales, and is concatenated with the residuals at the same scale. In some embodiments, the segmentation DNN 108 comprises segmentation upsamplers 218 and 220 for transforming the second reduced-dimension feature data to second upsampled data having the same dimensions as the second feature data. In the upsampling part, high-resolution features from the downsampling path are concatenated with low-resolution features. At the end of each upsampling block, which consists of one 2D upsampling layer and two 2D convolution layers, the knowledge of the inter-slice features from the reconstruction branch is fused. The high-resolution 2D volumetric features are added element-wise with 2D intra-slice extracted features to incorporate the inter-correlation features between slices.
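The element-wise addition described above may be sketched as follows. This is a minimal illustration for a single feature channel. The function name `fuse_features` is hypothetical, and the requirement that both branches produce maps of identical dimensions at each scale is an assumption implied by element-wise addition rather than a detail stated in this passage.

```python
def fuse_features(seg_features, rec_features):
    """Add reconstruction-branch features to segmentation-branch features.

    Both arguments are [height][width] maps for one channel at one scale.
    Element-wise addition is only defined when the two maps have identical
    dimensions, so a mismatch is reported as an error.
    """
    same_shape = len(seg_features) == len(rec_features) and all(
        len(s) == len(r) for s, r in zip(seg_features, rec_features)
    )
    if not same_shape:
        raise ValueError("feature maps must have identical dimensions")
    return [
        [s + r for s, r in zip(seg_row, rec_row)]
        for seg_row, rec_row in zip(seg_features, rec_features)
    ]
```

Unlike concatenation, addition leaves the number of channels unchanged, so the fused output can be passed to the next segmentation layer without altering its input size.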
[0066] As shown in
[0067] As shown in
[0068] In embodiments as illustrated in
[0069] The loss function of the segmentation DNN may be the 2D Intersection over Union (IoU) loss function. The upsampling DNN is ended with a one-by-one 2D convolution and a sigmoid activation function. The present disclosure uses 2D IoU loss function to maximize the intersection region between the prediction and the ground truth. This 2D IoU loss function is defined as
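The defining expression is not reproduced in this text. A commonly used soft formulation, assumed here rather than taken from the disclosure, is one minus the ratio of the soft intersection to the soft union of the sigmoid output and the binary ground truth. The name `soft_iou_loss` and the smoothing constant `eps` are hypothetical.

```python
def soft_iou_loss(pred, truth, eps=1e-7):
    """Return 1 - IoU for a 2D prediction and a 2D binary ground truth.

    pred holds sigmoid outputs in [0, 1]; truth holds 0 or 1.
    eps (an assumed smoothing term) avoids division by zero when both
    maps are empty.
    """
    p = [v for row in pred for v in row]
    g = [v for row in truth for v in row]
    if len(p) != len(g):
        raise ValueError("prediction and ground truth must match in size")
    intersection = sum(a * b for a, b in zip(p, g))
    union = sum(p) + sum(g) - intersection
    return 1.0 - (intersection + eps) / (union + eps)
```

Minimizing this quantity maximizes the overlap between the prediction and the ground truth: the loss approaches 0 for a perfect match and approaches 1 when the two regions are disjoint.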
[0070] In some embodiments, the volumetric image is a 3D medical image. In particular, the proposed SA-Net could potentially be applied for the segmentation and detection of structures in medical imaging modalities that acquire 3D volumetric data, which include but are not limited to Optical Coherence Tomography, Computed Tomography and Magnetic Resonance Imaging. The 3D medical image may be a 3D optical coherence tomography (OCT) image. Said OCT refers to a relatively recent medical imaging approach which enables high resolution depth-resolved imaging of structures below the surface of the retina. This allows visualization of sub-retinal changes which were not observable using fundus photography. The utility of OCT imaging has led to its widespread adoption in many clinical practices and has even replaced fundus photography as the main form of ophthalmic imaging for some practices. In the present disclosure, the 3D OCT image may be a retinal image, and the target slice may correspond to a layer of the choroid.
[0071] Also disclosed herein is a system for segmentation of a volumetric image comprising a plurality of slices, comprising at least one processor; and computer-readable storage having stored thereon instructions for causing the at least one processor to carry out the disclosed method.
Experiment
[0072] After pre-processing, embodiments of the present disclosure use a five-fold cross-validation strategy to train and evaluate the proposed model. To avoid training bias and the risk of overfitting, all images from the same eye are placed in the same fold. This avoids a scenario where the testing and training partitions could contain different images from the same eye. The overall experimental result is then obtained by averaging over the validation sets of all folds. The architecture is developed using Python version 3.7.4 and TensorFlow version 2.0. Experiments were conducted using a workstation with an NVIDIA RTX 2080 Ti GPU and 64 GB of RAM.
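An eye-grouped five-fold split of this kind might be sketched as follows. The greedy balancing rule and the names `eye_grouped_folds` and `train_test_splits` are assumptions made for illustration, since the text does not state how eyes were assigned to folds.

```python
from collections import defaultdict


def eye_grouped_folds(eye_ids, n_folds=5):
    """Assign sample indices to folds so that no eye spans two folds.

    eye_ids[i] is the eye identifier of sample i. Eyes are placed, largest
    first, into whichever fold currently holds the fewest samples (an
    assumed balancing heuristic).
    """
    by_eye = defaultdict(list)
    for index, eye in enumerate(eye_ids):
        by_eye[eye].append(index)

    folds = [[] for _ in range(n_folds)]
    for eye in sorted(by_eye, key=lambda e: (-len(by_eye[e]), str(e))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_eye[eye])
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

Because whole eyes are assigned to folds, every image of a given eye appears either in the training partition or in the testing partition of a split, never in both.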
[0073]
[0074] The feature extractor 202/204, for example as illustrated in
[0075] The proposed SA-Net for volumetric segmentation of the choroid was evaluated. The choroid is clinically of interest as the vascular layer of the eye, providing upwards of 60% of the blood supply to the retina. Variations in the choroid have been linked to many ocular conditions, including age-related macular degeneration and diabetic retinopathy. Until recently, OCT imaging of the choroid has been challenging, as it is obscured by the highly scattering retinal pigment epithelium and visibility of the choroid was highly limited using spectral domain OCT systems operating at the 800 nm range. However, the adoption of swept-source lasers operating at 1000 nm into OCT systems has provided a window of opportunity for choroidal analysis due to reduced scattering.
[0076] The proposed SA-Net was evaluated on two OCT datasets. The first data set is composed of 40 high myopia eyes acquired using a commercial swept-source OCT (SS-OCT) system, DRI OCT Triton (Topcon Corp., Japan) with a 1050 nm wavelength, scanning speed of 100,000 A-scans/sec and 7 mm × 7 mm scanning protocol, centred at the macula. Each eye volume in the Triton data set contains 256 slices with dimensions 256 × 128. Another separate data set is obtained by acquiring scans from nine normal eyes using the PLEX Elite 9000 SS-OCT system (Carl Zeiss Meditec, Jena, Germany) operating at a wavelength range between 1040 nm and 1060 nm, with a scanning speed of 100,000 A-scans/sec and 15 mm × 9 mm scanning protocol. Each eye volume in the PLEX data set contains 834 slices with dimension 512 × 500. Prior pre-processing is performed to limit the field of view of the acquired scans to the macula region and to resize the dimensions to 256 × 128. The network receives the target slice for segmentation together with the adjacent slices as inputs for reconstruction. Slices from the ends of the volume are padded by averaging the target slice with the available adjacent slices.
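One possible reading of the padding rule is sketched below: where a neighbour of the target slice falls outside the volume, it is replaced by the pixel-wise mean of the target slice and those neighbours that do exist. This interpretation, the parameter `half_window` (the number of neighbours taken on each side, which is not stated in this passage) and the function names are assumptions.

```python
def mean_slice(slices):
    """Pixel-wise mean of equally sized 2D slices."""
    count = len(slices)
    return [
        [sum(s[r][c] for s in slices) / count
         for c in range(len(slices[0][0]))]
        for r in range(len(slices[0]))
    ]


def adjacent_slices(volume, i, half_window=1):
    """Return the 2 * half_window neighbours of slice i (target excluded).

    Neighbours that fall outside the volume are replaced by the mean of the
    target slice and the available neighbours (assumed padding rule).
    """
    offsets = [o for o in range(-half_window, half_window + 1) if o != 0]
    available = [volume[i + o] for o in offsets if 0 <= i + o < len(volume)]
    filler = mean_slice([volume[i]] + available)
    return [
        volume[i + o] if 0 <= i + o < len(volume) else filler
        for o in offsets
    ]
```

With this rule the reconstruction input always contains the same number of slices, so the first and last slices of a volume can be segmented with the same network as interior slices.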
[0077] The segmentation result was evaluated volumetrically by calculating the IoU, Dice score and accuracy over a volume, with respect to the ground truth segmentation. The inter-slice correlation was assessed by measuring the quality of the choroidal thickness map generated from the choroidal segmentation. The method was repeated for a plurality of target slices, and further comprised generating a choroidal thickness map from segmentation of the plurality of target slices. In particular, the choroidal thickness map was obtained by stacking the choroidal thickness obtained from each slice. The generated map was evaluated by calculating the structural similarity (SSIM) index, which assesses the similarity of the predicted thickness map and the ground truth thickness map. Given two images x and y with the same dimensions, the SSIM formula is given by
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
where $\mu_x$, $\mu_y$, $\sigma_x^2$, $\sigma_y^2$ and $\sigma_{xy}$ are the average of x, the average of y, the variance of x, the variance of y, and the covariance of x and y respectively, and where $c_1 = (0.001\,DR)^2$ and $c_2 = (0.003\,DR)^2$. In the present disclosure, DR or dynamic range is given by:
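The expression for DR is not reproduced in this text, so the sketch below takes the dynamic range as a caller-supplied parameter. It illustrates the evaluation described above under several assumptions: rows of each slice mask are taken to be the depth axis, thickness is counted in pixels (no pixel spacing is given), and SSIM is computed over a single global window rather than with the sliding window used by many library implementations. The function names are hypothetical.

```python
def thickness_map(masks):
    """Stack per-slice choroidal thickness into a 2D map.

    masks[s][r][c] is 1 inside the segmented choroid and 0 elsewhere.
    Returns a map[s][c] holding the number of choroid pixels in column c
    of slice s (thickness in pixels).
    """
    return [
        [sum(row[c] for row in mask) for c in range(len(mask[0]))]
        for mask in masks
    ]


def ssim(x, y, dynamic_range, k1=0.001, k2=0.003):
    """Global SSIM between two equally sized 2D maps.

    k1 and k2 follow the constants quoted in the text; dynamic_range must
    be supplied by the caller.
    """
    a = [v for row in x for v in row]
    b = [v for row in y for v in row]
    n = len(a)
    mu_x, mu_y = sum(a) / n, sum(b) / n
    var_x = sum((v - mu_x) * (v - mu_x) for v in a) / n
    var_y = sum((v - mu_y) * (v - mu_y) for v in b) / n
    cov_xy = sum((p - mu_x) * (q - mu_y) for p, q in zip(a, b)) / n
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

A predicted map identical to the ground truth map yields an SSIM of 1, and the value decreases as the spatial pattern of thickness diverges.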
[0078] Table 1 compares the proposed SA-Net with other segmentation approaches (2D U-Net, 3D U-Net, BC U-Net and GGPF-Net) on the Triton data set. The results show that the SA-Net architecture outperformed the other architectures for volumetric segmentation on every reported metric. This demonstrates that learning the adjacent spatial features explicitly from reconstruction enabled more precise 3D volumetric segmentation.
TABLE 1
Method     IoU     Dice    Acc     SSIM
2D U-Net   0.9144  0.9551  0.9923  0.6047
3D U-Net   0.9011  0.9477  0.9911  0.5798
BC U-Net   0.9204  0.9583  0.9929  0.6344
GGPF-Net   0.9166  0.9560  0.9926  0.6362
SA-Net     0.9221  0.9592  0.9930  0.6379
[0079] Table 2 shows the results for the PLEX data set, where the proposed architecture achieved results close to those of the best-performing BC U-Net on IoU, Dice score and accuracy. It is also important to note that the network complexity and computational power needed for the present architecture are much less than those needed for BC U-Net, resulting in faster learning and inference times.
TABLE 2
Method     IoU     Dice    Acc     SSIM
2D U-Net   0.7793  0.8744  0.9793  0.3028
3D U-Net   0.7836  0.8767  0.9802  0.3149
BC U-Net   0.7988  0.8865  0.9817  0.3126
GGPF-Net   0.7560  0.8585  0.9773  0.3005
SA-Net     0.7927  0.8829  0.9811  0.3091
[0080]
[0081] Table 3 shows a detailed comparison between the proposed SA-Net and the state-of-the-art networks. Spatial information can provide useful context for volumetric segmentation. Incorporating spatial information from corresponding adjacent slices enabled the proposed SA-Net architecture to explicitly integrate spatial correspondences. SA-Net was compared with other recent approaches to segment the choroid in volumetric OCT images from two different commercial devices, and it was demonstrated that SA-Net outperformed the other approaches in segmentation accuracy and quality of the generated choroidal thickness map, with lower computational requirements. The results show that SA-Net could be used for efficient and accurate segmentation of OCT data as well as potentially other volumetric medical images.
TABLE 3
                         SA-Net  2D U-Net  3D U-Net  BC U-Net  GGPF-Net  Bio-Net
Automation               Full    Full      Full      Full      Full      No
Computational Resources  Light   Light     Heavy     Heavy     Light     Light
Robustness to Noise      Good    Poor      Poor      Good      Moderate  Moderate
[0082] Also disclosed herein is a non-transitory computer-readable storage having instructions stored thereon for causing at least one processor to carry out the disclosed method.
[0083]
[0084] As shown, the mobile computer device 700 includes the following components in electronic communication via a bus 706:
[0085] (a) a display 302;
[0086] (b) non-volatile (non-transitory) memory 704;
[0087] (c) random access memory (“RAM”) 708;
[0088] (d) N processing components 710;
[0089] (e) a transceiver component 712 that includes N transceivers; and
[0090] (f) user controls 714.
[0091] Although the components depicted in
[0092] The display 302 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).
[0093] In general, the non-volatile data storage 704 (also referred to as non-volatile memory) functions to store (e.g., persistently store) data and executable code. The system architecture may be implemented in memory 704, or by instructions stored in memory 704.
[0094] In some embodiments, for example, the non-volatile memory 704 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation of components that are well known to those of ordinary skill in the art and which are neither depicted nor described, for simplicity.
[0095] In many implementations, the non-volatile memory 704 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 704, the executable code in the non-volatile memory 704 is typically loaded into RAM 708 and executed by one or more of the N processing components 710.
[0096] The N processing components 710 in connection with RAM 708 generally operate to execute the instructions stored in non-volatile memory 704. As one of ordinary skill in the art will appreciate, the N processing components 710 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.
[0097] The transceiver component 712 includes N transceiver chains, which may be used for communicating with external devices via wireless networks. Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme. For example, each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS network), and other types of communication networks.
[0098] It should be recognized that
[0099] It will be appreciated that embodiments of the present disclosure provide a novel segmentation architecture that is capable of fully automated three-dimensional segmentation of volumetric medical image data. This architecture encompasses the following key novel aspects. First, soft parameter sharing aggregates the spatial features more explicitly by directly learning the correlation between adjacent slices and the slice that will be segmented. In addition, simultaneous reconstruction and segmentation extracts intra-slice features which are directly used for segmentation. Further, automated generation of a volumetric choroidal representation enables 3D visualization of the choroid. Last but not least, generation of full-field choroidal thickness maps enables en face analysis of thickness variations in the choroid across the retina.
[0100] It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
[0101] Throughout this specification and the claims which follow, unless the context requires otherwise, the word “comprise”, and variations such as “comprises” and “comprising”, will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
[0102] The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.