World Scientific

ICA-Unet: An improved U-net network for brown adipose tissue segmentation

    https://doi.org/10.1142/S1793545822500183 · Cited by: 7 (Source: Crossref)

    Abstract

    Brown adipose tissue (BAT) is a kind of adipose tissue engaging in thermoregulatory thermogenesis, metaboloregulatory thermogenesis, and secretory function. Current studies have revealed that BAT activity is negatively correlated with adult body weight, and BAT is considered a target tissue for the treatment of obesity and other metabolism-related diseases. Additionally, BAT activity presents certain differences between ages and genders. Clinically, BAT segmentation based on PET/CT data is a reliable method for brown fat research. However, most current BAT segmentation methods rely on the experience of doctors. In this paper, an improved U-net network, ICA-Unet, is proposed to achieve automatic and precise segmentation of BAT. First, the traditional 2D convolution layer in the encoder is replaced with a depth-wise over-parameterized convolutional (Do-Conv) layer. Second, a channel attention block is introduced between the double-layer convolutions. Finally, an image information entropy (IIE) block is added in the skip connections to strengthen the edge features. The performance of this method is evaluated on a dataset of PET/CT images from 368 patients. The results demonstrate strong agreement between the automatic segmentation of BAT and manual annotation by experts. The average Dice similarity coefficient (DSC) is 0.9057, and the average Hausdorff distance is 7.2810. The experimental results suggest that the proposed method achieves efficient and accurate automatic BAT segmentation and satisfies the clinical requirements of BAT segmentation.

    1. Introduction

    Brown adipose tissue (BAT) is a special type of adipose tissue. From one perspective, it is similar to white adipose tissue (WAT) and serves as an energy storage tissue in the human body. From another perspective, BAT undergoes extremely active metabolism owing to the considerable number of mitochondria in BAT cells. Studies have revealed three main physiological purposes of BAT, namely, thermoregulatory thermogenesis, metaboloregulatory thermogenesis, and secretory function.1,2 Therefore, BAT has essential physiological significance for human body temperature regulation, resistance to cold, prevention of obesity, regulation of energy balance, and resistance to infection.3,4,5 BAT activity differs significantly with age and gender; its activity in infants and women is higher than that in adults and men. The latest research has reported that BAT activity is negatively correlated with body weight, and BAT may be adopted as a target tissue for the treatment of obesity.6 Therefore, BAT-related research is associated with a wide range of clinical applications. With the vigorous development of modern medicine, detection methods for BAT are constantly improving. Positron emission tomography/computed tomography (PET/CT) has become the mainstream detection technology for BAT, and 18F-fluoro-2-deoxyglucose (18F-FDG) is broadly used as the PET/CT tracer.7 The mechanism is that glucose is metabolized vigorously and accumulates in BAT owing to the different metabolic states of different tissues of the human body.8,9 These characteristics are reflected in PET images for detection and analysis, as shown in Fig. 1.

    Fig. 1.

    Fig. 1. PET/CT image and anatomical location of BAT.

    Regarding the segmentation of BAT, existing studies are primarily based on the experience of radiologists and nuclear medicine physicians. Researchers have demonstrated that thresholding and clustering are well suited to the segmentation of BAT, since BAT exists in specific anatomical locations and PET images have high contrast.10,11,12,13 Generally, simple thresholding is used for segmenting BAT. First, the individual's standard uptake value (SUV) is calculated based on the PET data.14,15 Second, the images are manually divided with medical imaging software. BAT is considered present if the diameter of the tissue area is greater than 5 mm, the CT density is restricted to between −190 HU and −30 HU,16 and the SUV is more than 2 g/ml or 3 g/ml in the corresponding 18F-FDG PET image.17,18 Finally, BAT is distinguished from lymph nodes, blood vessels, bones, thyroid, and other tissues according to anatomical knowledge.19 However, this type of method has the following shortcomings: (1) the traditional BAT segmentation procedure requires extensive imaging and clinical knowledge, and the conclusions are mostly the subjective judgment of clinicians; (2) the traditional BAT segmentation method depends on standard threshold divisions, while standard threshold segmentation is difficult in complex differentiation problems because of the specificity of the tissues and organs of different patients. Therefore, it is urgent to develop an automatic BAT segmentation method to tackle the above problems.

    Recent success in deep learning, especially the use of deep convolutional neural networks,20 has accelerated the development of automatic image segmentation. Among deep learning algorithms, the U-net network structure21 is widely used for medical image segmentation.22,23,24,25,26 The U-net network consists of an encoder and a decoder, as well as skip connections between them. The image is first convolutionally down-sampled by the encoder several times to obtain feature maps of different scales, so that features at different scales are learned. Then, the bottom feature map is up-sampled in the decoder and merged, via the skip connections, with the encoder feature maps of the corresponding scales. As a result, feature maps of different scales are fused, and the network combines low-resolution and high-resolution information. The low-resolution information provides the location and category information of the segmentation target, while the high-resolution information is required for edge segmentation. With the combination of both, the medical image segmentation task can be well completed by U-net.

    Accordingly, a BAT segmentation method is designed in this paper based on improved U-net, ICA-Unet. The U-net network is improved based on channel attention block, a depth-wise over-parameterized convolutional (Do-Conv) layer, and image information entropy (IIE) block, which will be described in detail in Sec. 2.2.

    2. Materials and Methods

    2.1. Materials

    This retrospective analysis was approved by the Ethics Committee of Xijing Hospital (Approval No. KY201730081), and informed consent was waived. The PET/CT images involved in this paper were obtained from the Department of Nuclear Medicine, Xijing Hospital, Fourth Military Medical University. The dataset includes 368 sets of DICOM image sequences, each containing CT images with a size of 512×512 and corresponding PET images with a size of 274×274.

    The images need to be preprocessed before the dataset is segmented. First, the SUV value is calculated from the original PET data according to formula (1) (hereinafter, the SUV calculated from the original PET data is referred to as the PET data). In calculating the SUV value, the calculation rule based on the DICOM tags is adopted.27 Let X_PET be the three-dimensional matrix of PET data and Y_SUV the three-dimensional matrix of SUV data. The other variables are the acquisition time (T_A), patient weight (W_P), radiopharmaceutical start time (T_RS), radionuclide total dose (D_RT), radionuclide half-life (L_RH), rescale intercept (I_R), and rescale slope (S_R). The calculation formula is:

    Y_SUV = (I_R + S_R · X_PET) · exp(ln2 · (T_A − T_RS) / L_RH) / (D_RT / W_P).    (1)
    Second, the original CT data were denoised with the median filtering method, which is widely used for noise reduction in medical images. Finally, the PET data are up-sampled with nearest-neighbor interpolation and enlarged to the same size as the CT data. Besides, the related spatial parameters of the PET data are matched with the corresponding CT parameters to handle the inconsistent sizes and image parameters of PET and CT.
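As a compact sketch of the SUV conversion in Eq. (1), the mapping might be implemented as follows; the function name, argument handling, and toy values are illustrative assumptions (not the authors' code), and the units must match the scanner's DICOM tags:

```python
import numpy as np

def pet_to_suv(x_pet, I_R, S_R, T_A, T_RS, L_RH, D_RT, W_P):
    """Convert raw PET values to SUV following Eq. (1) (sketch)."""
    activity = I_R + S_R * np.asarray(x_pet, dtype=float)  # rescale raw values
    decay = np.exp(np.log(2) * (T_A - T_RS) / L_RH)        # decay correction
    return activity * decay / (D_RT / W_P)                 # normalize by dose per weight

# Toy example: no decay (T_A == T_RS) and identity rescale.
suv = pet_to_suv(100.0, I_R=0.0, S_R=1.0, T_A=0.0, T_RS=0.0,
                 L_RH=6586.2, D_RT=1000.0, W_P=10.0)
```

With a dose-per-weight of 100, the toy input of 100 maps to an SUV of 1.0.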

    Since BAT is broadly distributed in the neck region of the human body, it is necessary to obtain the neck ROI region before BAT segmentation to avoid interference from other non-BAT tissues with high SUV values. Additionally, the body cannot be guaranteed to occupy a fixed proportion of the image in the process of acquiring PET/CT images. Thus, Faster RCNN28,29 has been employed to obtain the ROI area of the neck to overcome this complication.

    The ROI area is obtained through the network. On this basis, the corresponding upper and lower boundary slice interval (x, x+h) along the axial direction of the PET/CT data is acquired. This interval is used to obtain several PET/CT images and construct a bimodal dataset.

    After the data processing described above, about 4500 sets of bimodal PET/CT images can be obtained. Then, annotations for each PET/CT image are created. Consequently, the annotated data are randomly divided into 70% training data and 30% test data for training and testing.

    2.2. Methods

    2.2.1. Network architecture

    The architecture of the proposed ICA-Unet network is illustrated in Fig. 2. The network is based on the classic U-net structure,21 with the introduction of image information entropy,30 channel attention,31 and the Do-Conv layer.32 Specifically, our study follows the classic structure of the U-net. In the encoder module, the PET/CT bimodal image is first input into the network as two channels. Second, the 2D convolution module in the U-net structure is replaced with the Do-Conv module. After a layer of 3×3×3 convolution with a stride of 1 and zero padding, the rectified linear unit (ReLU) activation and batch normalization (BN) are applied. Then, the channel weights are calculated by the channel attention module and multiplied with the feature map. Finally, it passes through a block of Do-Conv, ReLU, and BN. Additionally, successive 2×2×2 max pooling with a stride of 2 is performed to enlarge receptive fields after the double-layer convolution.

    Fig. 2.

    Fig. 2. ICA-Unet network structure diagram.

    Symmetrically with the encoder, the feature maps in the decoder are up-sampled four times with de-convolutions to restore spatial details. Specifically, a 2×2 de-convolution with a stride of 2 is performed, followed by the same convolution operations as in the encoding module. Furthermore, in the skip connections, the IIE module is introduced to calculate the IIE of the down-sampled feature map, so as to strengthen the edge information. This feature map is then fused with the feature map of the same level obtained from the decoding module, so that the global context information complements the spatial details. Finally, the segmentation result is output after a 1×1 single-layer convolution.

    2.2.2. Image information entropy

    In 1948, C. E. Shannon, the father of information theory, published the paper "A Mathematical Theory of Communication", pointing out that any information contains redundancy, whose size is related to the probability or degree of confusion of each symbol in the information.33 Shannon borrowed the concept of entropy from thermodynamics and called the average amount of information remaining after eliminating redundancy "information entropy".

    In information theory, information entropy is defined as the expectation of a random variable I(X) over the set (X, q(X)),

    H(X) = Σ_{x∈X} q(x) I(x) = −Σ_{x∈X} q(x) log q(x),    (2)

    where H(X) denotes the information entropy of X, which describes the degree of confusion and uncertainty of the elements in X.

    The pixel is the basic unit of a digital image; in a computer, image data are essentially a matrix of pixels. The difference between images is essentially that pixels of different gray levels are distributed over different spatial regions with different probabilities. Therefore, for an image with k-level grayscale, k is set to 255, and the probability of the i-th (i = 0, 1, …, k) gray level is denoted p_i. The entropy of a single gray level is then as follows:

    H(p_i) = p_i log2(1/p_i),    (3)

    where 0 ≤ i ≤ k. The accumulation of the information entropy of the different gray levels is defined as the image information entropy, so the IIE of the entire image is as follows:

    H = Σ_{i=0}^{k} H(p_i) = −Σ_{i=0}^{k} p_i log2(p_i),  k = 255,    (4)

    where p_i indicates the probability of gray level i over the entire image; when p_i = 0, p_i log2(p_i) is taken to be 0. p_i is calculated from the grayscale histogram, that is, as the quotient of the number of pixels of gray level i and the total number of pixels in the image.
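Equation (4) is the Shannon entropy of the grayscale histogram. A minimal sketch (the function name and the toy inputs are illustrative, assuming 8-bit images):

```python
import numpy as np

def image_information_entropy(img, k=255):
    """IIE of a k-level grayscale image, Eq. (4)."""
    hist, _ = np.histogram(img, bins=k + 1, range=(0, k + 1))
    p = hist / hist.sum()            # probability of each gray level
    p = p[p > 0]                     # convention: p*log2(p) = 0 when p = 0
    return float(-np.sum(p * np.log2(p)))

uniform = np.arange(256, dtype=np.uint8).reshape(16, 16)  # every gray level once
flat = np.zeros((16, 16), dtype=np.uint8)                 # a single gray level
```

A uniform histogram gives the maximum entropy of 8 bits, while a constant image gives 0; this is why high-IIE regions of a feature map tend to coincide with edges and texture.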

    In the U-net network, the encoder involves four repetitions of a double convolutional layer followed by a pooling layer. There are multiple feature maps of different levels, scales, and aspects in the output of each pooling layer. By calculating the IIE based on the feature map, the edges of objects and the rapidly changing pixel information in the image can be captured through the retention of the detailed texture structure of the original image. Meanwhile, the edge features of objects are enhanced, making the generated image features more expressive. Furthermore, the enhancement of edge information contributes to better contour integrity and coherence in the final segmentation results of the network.

    2.2.3. Channel attention

    In 2017, SENet31,34 won the championship in the image classification task of the ImageNet competition. It applies an attention mechanism in the channel dimension, significantly improving network performance.

    The channel attention mechanism consists of three operations: Squeeze, Excitation, and Scale. The Squeeze operation compresses the two-dimensional features of each channel into a real number through global pooling, which is equivalent to having a global receptive field. Assuming that there are C channels in total, a 1×1×C feature will eventually be obtained. The purpose of the Excitation operation is to generate a weight for each channel. Specifically, the dimension of the feature is first reduced to 1/r of the original through a fully connected layer. Then, the ReLU is applied, and the original dimension C is restored through another fully connected layer. Finally, the sigmoid function is applied for normalization. The Scale operation multiplies the normalized weight coefficient with the feature map of each channel.
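The three operations can be sketched in numpy as below; the weights are random stand-ins for the learned fully connected layers, and all names and shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_attention(feat, W1, W2):
    """Squeeze-Excitation over a (C, H, W) feature map (sketch)."""
    z = feat.mean(axis=(1, 2))                 # Squeeze: global average pool -> (C,)
    h = np.maximum(W1 @ z, 0.0)                # Excitation: FC reduction + ReLU
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))        # FC expansion + sigmoid -> weights in (0, 1)
    return feat * s[:, None, None]             # Scale: reweight each channel

C, r = 8, 4
feat = rng.standard_normal((C, 16, 16))
W1 = rng.standard_normal((C // r, C))          # stand-in for learned reduction weights
W2 = rng.standard_normal((C, C // r))          # stand-in for learned expansion weights
out = channel_attention(feat, W1, W2)
```

Since each sigmoid weight lies in (0, 1), every channel is attenuated in proportion to its (learned) importance.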

    In this paper, bimodal data of PET/CT are introduced. The channel attention module can assist the network in judging the importance of different channels. In other words, the network can better judge the information importance of CT and PET data after convolution. Therefore, the introduction of the channel attention module is conducive to learning crucial information in the bimodal data and better extracting image features.

    2.2.4. Depth-wise over-parameterized convolution

    Li et al. proposed depth-wise over-parameterized convolution (Do-Conv),32 which can replace the traditional convolution to accelerate the network convergence and improve the performance of the network. Do-Conv is a combination of traditional convolution and depth-wise convolution.

    Assume that the number of channels of the input feature map is C_in, the size of the convolution kernel is M×N, and the number of output channels is C_out. The convolution kernel W can then be expressed as W ∈ R^{C_out×(M×N)×C_in}. With ∗ representing the conventional convolution operation, O = W ∗ P:

    O_{c_out} = Σ_{i=1}^{(M×N)×C_in} W_{c_out, i} · P_i.    (5)

    Different from the conventional convolution operation, in depth-wise convolution a channel of the output feature depends only on a specific channel of the input feature, not on the other input channels. Assuming that there are D_mul convolution kernels of size M×N and the number of input feature channels is C_in, the depth-wise kernel D can be expressed as D ∈ R^{(M×N)×D_mul×C_in}, and the number of output channels is D_mul×C_in. With ∘ indicating depth-wise convolution, O = D ∘ P:

    O_{d_mul, c_in} = Σ_{i=1}^{M×N} D_{i, d_mul, c_in} · P_{i, c_in}.    (6)

    Do-Conv first performs depth-wise convolution on the input feature and then applies a conventional convolution. It can be written as follows:

    O = W ∗ (D ∘ P) = (D^T ∘ W) ∗ P.    (7)
    Since the introduction of the IIE and channel attention modules affects the convergence speed of the network, Do-Conv is used in the encoder and decoder to replace the traditional 2D convolution, so as to accelerate convergence and improve network performance.
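The kernel-folding identity in Eq. (7) can be checked numerically on a single flattened patch; this is a sketch with illustrative shapes and names, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
MN, Dmul, Cin, Cout = 9, 4, 3, 5   # 3x3 kernel flattened; sizes are arbitrary

D = rng.standard_normal((MN, Dmul, Cin))    # depth-wise kernel
W = rng.standard_normal((Cout, Dmul, Cin))  # conventional kernel over composed features
P = rng.standard_normal((MN, Cin))          # one input patch

# Path 1: depth-wise convolution, then conventional convolution.
DP = np.einsum('mdc,mc->dc', D, P)          # D ∘ P, shape (Dmul, Cin)
O1 = np.einsum('odc,dc->o', W, DP)          # W ∗ (D ∘ P), shape (Cout,)

# Path 2: fold the two kernels once, then a single conventional convolution.
Wf = np.einsum('mdc,odc->omc', D, W)        # D^T ∘ W, shape (Cout, MN, Cin)
O2 = np.einsum('omc,mc->o', Wf, P)          # (D^T ∘ W) ∗ P
```

Because the two kernels collapse into one, Do-Conv adds trainable parameters during training without any extra cost at inference time.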

    2.2.5. Implementation

    The proposed method was implemented in Python with the PyTorch package35 on a workstation with a single graphics processing unit (NVIDIA GeForce GTX TITAN V). The loss function of the network was composed of Sigmoid and BCELoss. Assuming there are N batches and each batch predicts n labels, the loss function can be defined as follows:

    loss = {l_1, …, l_N},  l_n = −[y_n log(σ(x_n)) + (1 − y_n) log(1 − σ(x_n))],    (8)

    where σ(x_n) indicates the Sigmoid function, which maps x to the interval (0, 1):

    σ(x) = 1 / (1 + exp(−x)).    (9)
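A minimal numpy sketch of Eqs. (8) and (9); the natural logarithm is assumed, as in PyTorch's BCELoss, and the function names and test values are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # Eq. (9)

def bce_loss(logits, targets):
    """Element-wise binary cross-entropy after a sigmoid, Eq. (8)."""
    p = sigmoid(np.asarray(logits, dtype=float))
    return -(targets * np.log(p) + (1 - targets) * np.log(1 - p))

losses = bce_loss([0.0, 2.0, -2.0], np.array([1.0, 1.0, 0.0]))
```

A logit of 0 gives σ = 0.5, so the loss against a positive label is log 2 ≈ 0.6931; confident correct predictions yield losses near 0.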
    The network was trained by an RMSProp optimizer36 with rho of 0.9 and eps of 0.0001. The initial learning rate was set to 0.00003. The training contained 100 epochs, and the batch size was set to 4. In training and testing, the 512×512 PET and CT images were combined to form a double-channel 2×512×512 matrix and input into the network. After inference, the segmentation of each image was produced. Moreover, the proposed ICA-Unet was run three times and the average value was taken as the final result to alleviate the impact of random initialization in training. As shown in Fig. 3, the network was fully trained and the loss converged.

    Fig. 3.

    Fig. 3. The loss curve in the process of training.

    2.2.6. Evaluation metrics

    With expert manual annotations as ground truth, the segmentation performance of ICA-Unet was quantitatively evaluated with the following six metrics37: (1) mIoU, (2) Sensitivity (SEN), (3) Specificity (SPE), (4) Dice Similarity Coefficient (DSC), (5) accuracy (ACC), and (6) Hausdorff Distance (HD)38 :

    mIoU = TP / (TP + FN + FP),    (10)
    SEN = TP / (TP + FN),    (11)
    SPE = TN / (TN + FP),    (12)
    DSC = 2TP / (2TP + FN + FP),    (13)
    ACC = (TP + TN) / (TP + TN + FN + FP),    (14)
    HD = max{d_HD(A, B), d_HD(B, A)},    (15)
    d_HD(A, B) = max_{x∈A} { min_{y∈B} { d(x, y) } },    (16)
    where TP and FP denote the numbers of true positives and false positives, respectively; TN and FN refer to the numbers of true negatives and false negatives, respectively; HD indicates the maximum distance between two pixel sets; d_HD(A, B) designates the directed Hausdorff distance between the ground truth and the predicted value; and d(x, y) stands for the Euclidean distance between two pixels.
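The count-based metrics of Eqs. (10)-(14) can be computed directly from a pair of binary masks; the function name and toy masks below are illustrative assumptions, and HD (Eqs. (15)-(16)) is omitted because it operates on pixel coordinates rather than counts:

```python
import numpy as np

def seg_metrics(pred, gt):
    """Count-based segmentation metrics, Eqs. (10)-(14) (sketch)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    return {
        "mIoU": tp / (tp + fn + fp),
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "DSC": 2 * tp / (2 * tp + fn + fp),
        "ACC": (tp + tn) / (tp + tn + fn + fp),
    }

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
m = seg_metrics(pred, gt)
```

For these toy masks, TP = 2, FP = 1, FN = 1, and TN = 2, giving DSC = 2/3 and mIoU = 1/2.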

    3. Results and Discussion

    3.1. Comparison to state-of-the-art methods

    The segmentation results on the PET/CT dataset built above were obtained with the proposed ICA-Unet network. Results for seven other methods — CT threshold, PET threshold, CT and PET threshold intersection, K-means, U-net,21 U-net++,39 and SegNet40 — are presented in Fig. 4 and Table 1. The visualization results show that, in these examples, the automatic segmentation results obtained by our method are more accurate and consistent with the ground truth. In particular, the segmentation boundary is clearer, more complete, and more coherent.

    Fig. 4.

    Fig. 4. Comparison of experimental results. (a) Comparison diagram (ground truth); (b) CT threshold; (c) PET threshold; (d) intersection of CT and PET thresholds; (e) K-means; (f) U-net; (g) U-net++; (h) SegNet; (i) Proposed.

    Table 1. Evaluation index of BAT segmentation results by different methods.

    Methods             mIoU     SEN      SPE      DSC      ACC      HD
    CT_range            0.0567   0.6075   0.8911   0.1026   0.8876   112.8274
    PET_range           0.4163   0.6256   0.9981   0.6628   0.9934   22.2598
    CT_and_PET_range    0.4650   0.4399   0.9995   0.5486   0.9925   23.4087
    K-means             0.1941   0.4348   0.9205   0.2595   0.9152   76.7140
    Unet                0.4862   0.6190   0.9970   0.6390   0.9929   70.9661
    Unet++              0.6557   0.6789   0.9975   0.7066   0.9930   18.8122
    SegNet              0.5774   0.5898   0.9970   0.6848   0.9922   30.6596
    Proposed            0.8322   0.8400   0.9999   0.9057   0.9982   7.2810

    It can be observed from Fig. 4 that the segmentation results obtained from the CT threshold method (Fig. 4(b)), in which areas with a CT density between −190 HU and −30 HU are regarded as BAT, contain the area of BAT. However, there are significant segmentation errors, and a considerable amount of WAT is mistakenly treated as BAT; the results of the CT threshold method are therefore unreasonable. The segmentation results from the PET threshold method (Fig. 4(c)), in which areas with SUV values greater than 2 g/ml or 3 g/ml are regarded as BAT, exclude a large amount of WAT and contain a more accurate area of BAT. However, the results are both under-segmented and over-segmented, as well as discontinuous and noisy, and the edges are not smooth, reflecting the poor segmentation of the PET threshold method. Segmentation results from the CT and PET threshold intersection method (Fig. 4(d)) show that the accuracy of the segmented area is significantly improved compared with the PET threshold and CT threshold methods. Nevertheless, the results are still under-segmented and the region is discontinuous, resulting in mediocre segmentation. The segmentation results obtained from the K-means method (Fig. 4(e)) contain the area of BAT, but there is significant over-segmentation, and a considerable amount of surrounding tissue is mistakenly treated as BAT, so the K-means results are also unreasonable. Segmentation results from U-net (Fig. 4(f)) contain most BAT areas, but the regions are hollow, discontinuous, and noisy, and the edges are not smooth, which is inconsistent with the ground truth. Compared with U-net, the accuracy of the segmentation results from U-net++ (Fig. 4(g)) is enhanced; nonetheless, there is still a certain amount of under-segmentation. Segmentation results from SegNet (Fig. 4(h)) are similar to those from U-net++, but with more under-segmentation and over-segmentation. The method proposed in this paper (Fig. 4(i)) obtains segmentation results closest to the ground truth among all the compared methods, with more delicate and smoother boundaries, and overcomes the under-segmentation and discontinuity problems of the other methods.

    Table 1 suggests that the proposed ICA-Unet had the highest Dice coefficient (DSC) and mIoU compared with the traditional threshold methods, as well as the highest DSC and mIoU and the lowest Hausdorff distance compared with the mainstream medical image segmentation networks.

    The data in Table 1 reveal that the proposed method has significant advantages in performance and accuracy and avoids the problems of under-segmentation, discontinuity, and individual threshold differences caused by static threshold segmentation methods. Objectively, ICA-Unet can realize the automatic segmentation of BAT.

    3.2. Evaluation of network architectures

    In this study, the possible combinations of the U-net baseline with the IIE, Do-Conv, and CAT modules were evaluated to assess the effectiveness of our network (ICA-Unet) architecture. Accuracy was evaluated with mIoU, together with the loss convergence speed. The quantitative results are provided in Table 2 and Fig. 5. The following conclusions can be drawn: (1) our architecture (U-net+IIE+Do+CAT) had the highest accuracy compared with the other architectures; (2) the loss converges relatively quickly when our architecture is trained.

    Table 2. Ablation experiment results.

    Methods          mIoU     SEN      SPE      ACC      HD        Loss convergence epoch
    Unet             0.4862   0.6190   0.9970   0.9929   70.9661   12
    Unet+IIE         0.5523   0.6726   0.9970   0.9931   68.2250   32
    Unet+Do          0.4483   0.6134   0.9930   0.9895   76.7479   10
    Unet+CAT         0.4840   0.4037   0.9670   0.9583   97.4203   28
    Unet+IIE+Do      0.3977   0.3447   0.9943   0.9845   96.0314   37
    Unet+IIE+CAT     0.6068   0.6641   0.9998   0.9939   16.2010   50
    Unet+Do+CAT      0.5391   0.5750   0.9982   0.9929   23.0305   16
    Proposed         0.8322   0.8400   0.9999   0.9982   7.2810    28
    Fig. 5.

    Fig. 5. Histogram of ablation experiment results.

    Based on Table 2 and Fig. 5, further analysis was conducted on the IIE module. In summary, the segmentation accuracy of the network improved with the introduction of the IIE module: U-net+IIE exhibited an improvement of about 0.07 in mIoU over U-net. However, the loss convergence of the network slowed down significantly (convergence is reached at epoch = 32).

    The analysis of Do-Conv implied that the loss convergence speed can be significantly improved by replacing the 2D convolutional layers in the U-net with Do-Conv layers. Do-Conv can also effectively offset the decrease in the network convergence rate caused by the introduction of the IIE module and the channel attention module (CAT).

    The analysis of the CAT module demonstrated that its introduction can effectively improve the segmentation accuracy of the network, and the accuracy improves dramatically when CAT is used together with IIE. At the same time, the loss convergence of the network slows down (convergence is reached at epoch = 50).

    In summary, the IIE and CAT modules can effectively strengthen the segmentation accuracy of the network, and the Do-Conv layer can greatly accelerate the loss convergence of the network. Therefore, the three are combined in the U-net network architecture. The experimental results verified that the best segmentation result can be obtained while retaining a reasonable loss convergence speed.

    4. Conclusions

    In this study, an improved U-net network (ICA-Unet) for BAT segmentation has been proposed based on the classic U-net architecture and the image information entropy, channel attention, and Do-Conv modules. The network learns the PET/CT channel weights through the channel attention modules and enhances the edge features of the feature maps through the IIE modules. The decrease in loss convergence speed caused by these modules is mitigated by the Do-Conv layers. The proposed network architecture achieves an mIoU score of 0.8322. Compared with other methods, ICA-Unet has significant advantages and avoids the problems of under-segmentation, discontinuity, and threshold differences caused by static threshold segmentation methods, realizing automatic BAT segmentation. The proposed method can assist radiologists and nuclear medicine physicians in efficiently segmenting BAT and facilitate related research on BAT by clinicians and researchers.

    Conflicts of Interest

    The authors declare that there are no conflicts of interest relevant to this paper.

    Acknowledgments

    This work was supported in part by the National Natural Science Foundation of China (61701403, 82122033, 81871379); National Key Research and Development Program of China (2016YFC0103804, 2019YFC1521103, 2020YFC1523301, 2019YFC1521102); Key R&D Projects in Shaanxi Province (2019ZDLSF07-02, 2019ZDLGY10-01); Key R&D Projects in Qinghai Province (2020-SF-143); China Post-doctoral Science Foundation (2018M643719); Young Talent Support Program of the Shaanxi Association for Science and Technology (20190107).