Adrenal Tumor Segmentation on U-Net: A Study About Effect of Different Parameters in Deep Learning

    https://doi.org/10.1142/S2196888823500161

    Abstract

    Adrenal lesions refer to abnormalities or growths that occur in the adrenal glands, which are located on top of each kidney. These lesions can be benign or malignant and can affect the function of the adrenal glands. This paper presents a study on adrenal tumor segmentation using a modified U-Net model with various parameter selection strategies. The study investigates the effect of fine-tuning parameters, including k-fold values and batch sizes, on segmentation performance. Additionally, the study evaluates the effectiveness of different preprocessing techniques, such as Discrete Wavelet Transform (DWT), Contrast Limited Adaptive Histogram Equalization (CLAHE), and Image Fusion, in enhancing segmentation accuracy. The results show that the proposed model outperforms the original U-Net model, achieving the highest Dice, Jaccard, sensitivity, and specificity scores of 0.631, 0.533, 0.579, and 0.998, respectively, on the T1-weighted dataset with DWT applied. These results highlight the importance of parameter selection and preprocessing techniques in improving the accuracy of adrenal tumor segmentation using deep learning.

    1. Introduction

    Adrenal lesions are growths in the adrenal glands that can be either benign or malignant and are often discovered incidentally during imaging tests. While symptoms may include abdominal pain, high blood pressure, rapid heartbeat, sweating, and anxiety, many adrenal tumors are asymptomatic. The most common type of adrenal tumor is the benign adenoma, which can cause excess hormone production, leading to conditions such as Cushing’s syndrome or primary aldosteronism. Malignant adrenal tumors, such as adrenal cortical carcinoma, are rare but can be aggressive and difficult to treat.1

    Segmentation of adrenal lesions from medical images can be a difficult task due to several factors. First, the variability in size and shape of adrenal lesions makes it challenging to develop a single segmentation method that can accurately detect and describe all types of lesions. Second, the varying levels of contrast enhancement in adrenal lesions can impact the accuracy of their detection and segmentation. Additionally, various imaging artifacts, such as motion artifacts, noise, or image distortion caused by patient movement during scanning, can affect the quality of medical images and consequently, the accuracy of adrenal lesion segmentation. Moreover, adrenal lesion segmentation is a subjective process that is dependent on the interpretation and expertise of the radiologist or clinician, which can lead to intra-observer and inter-observer variability. Limited availability of medical image datasets for the development and validation of adrenal lesion segmentation algorithms can also affect their accuracy and generalizability. Lastly, the time-consuming and labor-intensive nature of adrenal lesion segmentation can limit its feasibility in clinical practice. Despite these challenges, the use of advanced machine learning techniques, such as deep learning algorithms, has the potential to improve the accuracy and efficiency of adrenal lesion segmentation.

    U-Net is a convolutional neural network architecture that has become popular for its application in medical image segmentation.2 The U-Net architecture comprises a contracting path and a symmetric expanding path, which facilitate contextual and precise localization, respectively. The contracting path is composed of convolutional layers followed by max-pooling operations that gradually reduce spatial resolution while increasing the feature map depth. Conversely, the expanding path comprises up-convolutional layers that increase spatial resolution and reduce feature map depth. The incorporation of skip connections between corresponding layers in the contracting and expanding paths allows the network to concatenate feature maps from different resolutions and retain high-resolution information throughout the segmentation process. U-Net has demonstrated exceptional performance in numerous medical image segmentation tasks due to its ability to capture contextual information and perform precise segmentation.3,4,5,6

    In segmentation studies using images where it is difficult to distinguish organs, such as abdominal images, image preprocessing methods and parameter optimization are required to improve the performance of the widely used U-Net model. Image preprocessing is a crucial step that involves the use of techniques to enhance, clean up, or modify digital images before they are analyzed or used for further processing.7 These methods aim to improve the quality of images by removing noise, correcting distortions, and increasing the contrast of images. However, the choice of appropriate parameters is also critical to improving the performance of deep learning models.8 This is because the selection of parameters can significantly affect the accuracy, convergence rate, and generalization ability of the model. Due to the high dimensionality and complexity of deep learning models, the number of parameters involved can be vast, making it challenging to find the optimum configuration. Incorrect selection of parameters can lead to problems such as over or underfitting, which can reduce the performance of the model. Therefore, it is essential to adjust the parameters for specific datasets and tasks to achieve the best results. To select the optimum parameters, a thorough understanding of the model’s architecture, data characteristics, and optimization algorithm used is necessary. The selection of optimum parameters is a crucial step in deep learning model development, which can improve the performance of the model and make it more effective in practical applications.

    2. Related Works

    Adrenal tumor segmentation studies lack a generally accepted dataset, leading to variations in the datasets used in different studies. Hence, the evaluation of these studies is based on the methodology and performance metrics obtained. Previous adrenal tumor segmentation studies have been performed on abdominal Computed Tomography (CT) and Magnetic Resonance (MR) images. For instance, Tang et al. conducted a segmentation study of pheochromocytoma on 10 3D CT images and achieved 92.7% Dice Similarity Coefficient (DSC) in the arterial phase and 92.9% DSC in the venous phase.9 Chai et al. proposed a computer-aided system consisting of a segmentation algorithm, classification algorithm, and feature extraction for the detection and classification of adrenal tumors on CT images. The study achieved an accuracy of 90% on 436 CT images.10 Koyuncu et al. developed an Adrenal Tumor Segmentation (ATUS) pipeline using a hybrid system of multiple methods for adrenal tumor segmentation in CT images. The study achieved a DSC of 83.06%, Jaccard of 71.44%, sensitivity of 86.44%, specificity of 99.66%, accuracy of 99.43%, and structural similarity index measure (SSIM) of 98.51% using 32 images.11 Zhao et al. utilized a 3D U-Net model to segment adrenal tumor vessels on 49 3D CT images, achieving a DSC of 94.69% and a mean intersection over union (MIoU) of 90.22%.12 Bi et al. used a fully convolutional network (FCN) to detect adrenal tumors on low-contrast CT images and achieved precision, recall, and F-score metrics of 69.82%, 76.29%, and 72.91%, respectively, using 38 low-contrast CT images.13 Barstugan et al. proposed a hybrid method including several image processing methods for the segmentation of adrenal tumors on MR images and achieved a sensitivity, specificity, accuracy, precision, DSC, Jaccard, and SSIM of 60.07%, 99.99%, 97.69%, 83.68%, 63.14%, 51.14%, and 97.07%, respectively, using 113 MR images.14

    The current literature on adrenal lesion segmentation is mainly based on CT imaging. However, a recent study is the only known study in the literature to perform adrenal lesion segmentation on MR images.14 Despite both imaging techniques providing detailed images of internal structures, MR offers several advantages over CT for abdominal imaging. MR’s ability to provide superior soft tissue contrast allows for more detailed imaging of abdominal organs and structures, such as the liver, pancreas, and blood vessels, making it particularly useful for detecting and characterizing tumors and other abnormalities, as well as assessing the extent of disease. MR also has the advantage over CT in that it does not use ionizing radiation for abdominal imaging, a significant benefit for patient safety. Abdominal CT studies often require the use of iodine-containing contrast agents, which may be harmful for patients with kidney disease or iodine allergy. In contrast, MR studies typically use safer gadolinium-based contrast agents, which are less harmful for patients with these conditions. Furthermore, MR provides better visualization of certain abdominal structures that may be challenging to image with CT, such as the bile ducts and pancreas. Specifically, MR is the preferred imaging modality for evaluating pancreatic diseases, including tumors and inflammation, due to its ability to provide detailed images of the pancreas and adjacent structures. Finally, MR can provide images in multiple planes, including axial, coronal, and sagittal planes, offering a more comprehensive view of the abdomen. Conversely, CT is mainly limited to providing images in the axial plane, which can complicate the assessment of certain structures within the abdomen.

    The literature on this topic reveals several limitations and challenges, including the following:

    • Data variability: Each dataset used for research is unique, making it difficult to compare studies that use different datasets. Therefore, it is difficult to generalize findings across different datasets.

    • Limited imaging modalities: Most research in this area has focused on CT imaging. However, the use of CT imaging is limited due to the harmful radiation exposure to patients and the inability to image in multiple planes.

    • Different thresholds: Studies have employed different thresholds for different types of lesions, which can make it difficult to generalize findings.

    • Specific target selection: The use of a specific target selection for each image is one of the most significant challenges that limits the generalization of the results to other datasets and populations.

    This paper presents a study on the segmentation of adrenal lesions in T1-T2-weighted MR images of the abdomen, using U-Net-based deep learning models as a segmentation approach. In order to enhance the performance of the models, various preprocessing techniques were applied to the images to generate different datasets. Cross-validation was employed during the training phase to avoid potential bias towards a particular subset of the data, which can occur when using a single test set. The most suitable hyperparameter values were determined through cross-validation and batch size selection. The results of this study were compared with state-of-the-art models and different loss functions to reveal the advantages of the proposed method. The significant contributions of the study are as follows:

    (1) The first study in the literature to perform adrenal lesion segmentation on a previously used dataset, facilitating comparison.

    (2) Segmentation of adrenal tumors achieved using deep learning models on T1-T2 weighted abdominal MR images.

    (3) Improved segmentation performance and determination of the appropriate image enhancement technique for abdominal images by applying different image enhancement techniques.

    (4) The proposed model achieved better results than the original model after modifying the U-Net layers.

    (5) Hyperparameters were tuned to obtain optimal values instead of fixed values by varying cross-validation and batch size values.

    (6) Robustness was demonstrated when compared with state-of-the-art models and models using different loss functions.

    3. Materials and Methods

    3.1. Dataset

    In this study, a dataset of T1-T2 weighted abdominal MR images of 107 patients, acquired with a 2013 model SIEMENS AREA 1.5 T MR unit at the Department of Radiology, Faculty of Medicine, Selcuk University, was used. Use of the data was approved by the Non-Interventional Clinical Research Ethics Committee of the Faculty of Medicine, Selcuk University, at its meeting of 26 July 2017 (decision number 2017/216). The original images were in DICOM format and were converted to JPEG format with a resolution of 800×584 pixels, 8-bit depth, and a DPI of 96. Adrenal lesions in the MR images were identified by expert radiologists from the Department of Radiology, Faculty of Medicine, Selcuk University and are represented as white on the ground truth images, while the background is black. Regions of interest (ROIs) were extracted from the images to improve the accuracy and efficiency of data analysis and to reduce the computational load. Figure 1 illustrates the extracted ROI regions from T1-T2 weighted images and examples of ground truth masks corresponding to adrenal lesions in these images.
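    The paper does not describe the tooling used for the DICOM-to-JPEG conversion and ROI extraction; the following is only a minimal sketch of how such a preparation step could look in Python, assuming pydicom and Pillow are available. The file names and ROI coordinates are hypothetical placeholders.

```python
# Sketch only: convert a DICOM slice to an 8-bit JPEG and crop a fixed ROI.
# File names and ROI coordinates are illustrative, not the authors' values.
import numpy as np
import pydicom
from PIL import Image

ds = pydicom.dcmread("abdomen_t1_slice.dcm")        # read one DICOM slice
pixels = ds.pixel_array.astype(np.float32)

# Rescale intensities to 8-bit depth, as in the JPEG dataset described above.
rng = pixels.max() - pixels.min()
pixels = (pixels - pixels.min()) / (rng + 1e-8) * 255.0
img = Image.fromarray(pixels.astype(np.uint8))

# Crop a region of interest around the adrenal gland (coordinates are made up).
roi = img.crop((300, 200, 428, 328))                 # (left, upper, right, lower)
roi.save("abdomen_t1_slice_roi.jpg", dpi=(96, 96))
```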

    Fig. 1.

    Fig. 1. Images used in the study (a) T1-weighted ROI image, (b) Ground truth mask of T1-weighted ROI image, (c) T2-weighted ROI image, and (d) Ground truth mask of T2-weighted ROI image.

    3.2. Image enhancement

    In medical imaging, image enhancement techniques play a crucial role in improving the quality and clarity of images to aid in accurate diagnosis and treatment planning. Medical images are often susceptible to various factors, including noise, distortion, and artifacts, which can obscure significant details and make it difficult to identify pathological features. By reducing these effects and improving contrast, sharpness, and overall image quality, image enhancement techniques can improve the visual interpretability of medical images.

    This study applies different image enhancement methods, such as Discrete Wavelet Transform (DWT), Contrast Limited Adaptive Histogram Equalization (CLAHE), and Image Fusion (Max Fusion and Average Fusion), to the datasets used and compares the effectiveness of each method.

    3.2.1. Discrete wavelet transform

    Discrete wavelet transform (DWT) is widely used in medical imaging to enhance image quality, by improving contrast, sharpness, and overall image appearance. DWT-based image enhancement techniques have the potential to reduce image noise, remove artifacts, and enhance fine details, thus facilitating more accurate diagnostic interpretation.15

    DWT is a mathematical technique that decomposes an image into different frequency components called wavelets. Figure 2 provides an example of a 2-level DWT, where an input signal (X[n]) is subjected to a low pass filter (LPF), which produces an approximation component (A), while a high pass filter (HPF) produces a detail component (D). This decomposition process is then repeated, generating further frequency resolution approximation components, which are once again down-sampled by passing through the LPF and HPF, ultimately resulting in 2-level DWT components. Figure 3 shows examples obtained by applying this process to the images used in this study.
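    The paper does not state which implementation was used for this decomposition; the sketch below illustrates a 2-level 2D DWT of a grayscale image with the PyWavelets package, which is an assumption rather than the authors' tooling. Reconstructing from the approximation coefficients alone is shown merely as one possible enhancement step.

```python
# Sketch: 2-level 2D discrete wavelet transform of a grayscale image.
# The db2 wavelet matches the one mentioned for the fusion step; its use for
# single-image enhancement here is an assumption.
import numpy as np
import pywt
import cv2

image = cv2.imread("t1_roi.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# wavedec2 returns [A2, (H2, V2, D2), (H1, V1, D1)]:
# the level-2 approximation plus detail coefficients for each level.
approx2, details2, details1 = pywt.wavedec2(image, wavelet="db2", level=2)

# A simple enhancement: zero the detail bands and reconstruct from the
# level-2 approximation only, suppressing high-frequency noise.
details2 = tuple(np.zeros_like(d) for d in details2)
details1 = tuple(np.zeros_like(d) for d in details1)
smoothed = pywt.waverec2([approx2, details2, details1], wavelet="db2")
```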

    Fig. 2.

    Fig. 2. 2-Level discrete wavelet transform.

    Fig. 3.

    Fig. 3. (a) T1-weighted image and corresponding image after 2-level DWT, and (b) T2-weighted image and corresponding image after 2-level DWT.

    3.2.2. Contrast limited adaptive histogram equalization

    Contrast limited adaptive histogram equalization (CLAHE) is a common technique for medical image enhancement. This method improves the contrast of an image by redistributing the pixel values of the image through a histogram equalization algorithm.15 Unlike traditional histogram equalization, CLAHE adjusts the contrast enhancement locally to different regions of the image. This adaptive technique enables it to process images more effectively with varying contrast levels, such as medical images with uneven lighting or contrast. To achieve this, the image is divided into small regions, and the histogram of each region is calculated and equalized individually.16 The block diagram illustrating the CLAHE process steps is presented in Fig. 4, and examples of T1-T2 weighted abdominal images obtained by applying CLAHE to the dataset used in the study are shown in Fig. 5.
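    OpenCV ships a ready-made implementation of this procedure; the short sketch below applies CLAHE to a grayscale abdominal slice. The clip limit and tile grid size are illustrative values, not the settings used in the paper.

```python
# Sketch: CLAHE applied to a grayscale MR slice with OpenCV.
# clipLimit and tileGridSize are illustrative values only.
import cv2

image = cv2.imread("t1_roi.jpg", cv2.IMREAD_GRAYSCALE)

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(image)        # local histogram equalization per tile

cv2.imwrite("t1_roi_clahe.jpg", enhanced)
```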

    Fig. 4.

    Fig. 4. CLAHE overall block diagram.

    Fig. 5.

    Fig. 5. (a) T1-weighted image, (b) CLAHE result of T1-weighted image, (c) T2-weighted image, and (d) CLAHE result of T2-weighted image.

    3.2.3. Image fusion

    In medical imaging, image fusion is a technique used to integrate multiple images of the same scene acquired from different sensors or modalities into a single composite image that contains more comprehensive information than the individual images. The goal of image fusion is to improve the visual quality of the images, increase useful information, and facilitate better diagnosis or treatment planning.7

    Various methods are available for image fusion in medical imaging, including maximum fusion, mean fusion, and principal component analysis (PCA). The maximum fusion approach is a straightforward method that selects the maximum intensity value from each corresponding pixel in the input images. On the other hand, the mean fusion method calculates the mean intensity value. PCA is a more complex approach that employs a mathematical transformation to maximize the variance of the resulting image.

    In this study, a wavelet transform-based fusion approach is utilized. The fusion process is executed using the MATLAB wavelet analyzer toolbox, and both maximum fusion and mean fusion techniques are implemented, with the type of wavelet specified. After performing a 2-level db2 wavelet transform on the input images, fusion datasets are generated via both techniques. Figure 6 illustrates the series of steps involved in the image fusion process and shows the resulting images obtained through this process.
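    The fusion itself was carried out in MATLAB; as a rough Python equivalent for illustration only, the sketch below fuses two registered grayscale images in the db2 wavelet domain using either a maximum or a mean rule. The coefficient-wise fusion rule applied here is an assumption about how the toolbox combines the decompositions.

```python
# Sketch: wavelet-domain fusion of two registered grayscale images.
# Coefficients from a 2-level db2 decomposition are combined with a
# maximum or mean rule and the fused image is reconstructed.
import numpy as np
import pywt
import cv2

def wavelet_fusion(img_a, img_b, rule="max", wavelet="db2", level=2):
    ca = pywt.wavedec2(img_a.astype(np.float32), wavelet, level=level)
    cb = pywt.wavedec2(img_b.astype(np.float32), wavelet, level=level)
    fused = []
    for coef_a, coef_b in zip(ca, cb):
        if isinstance(coef_a, tuple):          # detail coefficients (H, V, D)
            fused.append(tuple(
                np.maximum(a, b) if rule == "max" else (a + b) / 2.0
                for a, b in zip(coef_a, coef_b)))
        else:                                   # approximation coefficients
            fused.append(np.maximum(coef_a, coef_b) if rule == "max"
                         else (coef_a + coef_b) / 2.0)
    return pywt.waverec2(fused, wavelet)

t1 = cv2.imread("t1_roi.jpg", cv2.IMREAD_GRAYSCALE)
t2 = cv2.imread("t2_roi.jpg", cv2.IMREAD_GRAYSCALE)
max_fused = wavelet_fusion(t1, t2, rule="max")
mean_fused = wavelet_fusion(t1, t2, rule="mean")
```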

    Fig. 6.

    Fig. 6. Max and mean fusion images from T1-T2 weighted images.

    3.3. Segmentation models

    3.3.1. U-Net

    The U-Net model is a widely used neural network architecture for medical image segmentation. The model employs an encoder-decoder structure with skip connections between the encoder and the corresponding decoder layers to address the challenge of preserving spatial information during image segmentation.2

    The encoder consists of multiple convolutional blocks, each comprising two 3×3 convolutional layers with rectified linear unit (ReLU) activation, followed by a 2×2 max pooling layer. The number of filters in each block doubles after each max pooling layer to increase the receptive field of the network and capture more complex features.

    The decoder has the same number of blocks as the encoder, but each block is replaced by an up-sampling layer followed by concatenation with the corresponding feature map from the encoder. The concatenated feature maps are then passed through two 3×3 convolutional layers with ReLU activation. Finally, a 1×1 convolution is applied to map the feature vector to the required number of classes.

    The U-Net model uses binary cross-entropy loss, which is commonly used for binary segmentation tasks. The number of filters in the first convolutional layer is typically set to 64 and doubles after each pooling layer. However, this can be adjusted according to the size and complexity of the input images.
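    As a concrete illustration of this encoder-decoder structure, the following Keras sketch builds a shallow U-Net-style network. The depth, filter counts, and input size are reduced for brevity and are illustrative rather than the configuration used in the paper.

```python
# Sketch: a compact U-Net-style model in Keras (reduced depth for brevity).
# Filter counts, input size, and depth are illustrative values only.
from tensorflow.keras import layers, models

def conv_block(x, filters):
    # Two 3x3 convolutions with ReLU, as in the original U-Net block.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(128, 128, 1)):
    inputs = layers.Input(input_shape)

    # Contracting path: conv block + 2x2 max pooling, doubling the filters.
    c1 = conv_block(inputs, 32)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck.
    b = conv_block(p2, 128)

    # Expanding path: upsample, concatenate the skip connection, conv block.
    u2 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(b)
    c3 = conv_block(layers.concatenate([u2, c2]), 64)
    u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c3)
    c4 = conv_block(layers.concatenate([u1, c1]), 32)

    # 1x1 convolution maps the features to a single-channel lesion mask.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return models.Model(inputs, outputs)

model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy")
```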

    Figure 7 depicts the block diagram of the original U-Net model, illustrating the encoder, decoder, and skip connections between them.

    Fig. 7.

    Fig. 7. U-Net block diagram.

    3.3.2. Modified U-Net

    The proposed modified U-Net model differs from the original U-Net model in two main aspects. First, the Parametric Rectified Linear Unit (PReLU) is used as the activation function in the layers instead of the ReLU used in the original model. PReLU allows the activation function to learn the negative slope of the function, unlike ReLU, which has a fixed slope of zero. This feature enables PReLU to adapt to the input data and leads to better performance in tasks such as image classification and object detection. Additionally, PReLU addresses the problem of “dying ReLU” by preventing the gradients in the network from going to zero for certain inputs, thus improving the learning capacity of the network.

    Second, the Dice loss function is used instead of the binary cross-entropy loss function used in the original model. Dice loss is based on the Dice similarity coefficient, a statistical measure that quantifies the similarity of two datasets. The Dice similarity coefficient is defined as twice the intersection of two sets divided by the sum of their sizes. Dice loss has several advantages over binary cross-entropy loss in medical image segmentation tasks. First, it is better suited for imbalanced datasets because it penalizes false negatives and false positives equally, which helps when the lesion occupies only a small fraction of the image. Second, it provides a smoother optimization landscape and is less likely to get stuck in a local minimum, leading to better results. Finally, by weighting false negatives and false positives equally, Dice loss encourages more accurate segmentation and better preservation of small structures.

    The proposed U-Net has a similar architecture to the model shown in Fig. 7. Nevertheless, it distinguishes itself from the original model in terms of the activation and loss functions employed. The efficacy of these modifications will be evaluated through the results obtained from the study.
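    A sketch of how these two modifications could be expressed on top of the block shown earlier is given below; the PReLU slope sharing and the exact placement of the activations are assumptions, not the authors' exact configuration.

```python
# Sketch: the two modifications described above, expressed in Keras.
# (a) PReLU activations with learnable negative slopes instead of ReLU;
# (b) Dice loss instead of binary cross-entropy.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block_prelu(x, filters):
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.PReLU(shared_axes=[1, 2])(x)   # one learnable slope per channel
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.PReLU(shared_axes=[1, 2])(x)
    return x

def dice_loss(y_true, y_pred, smooth=1.0):
    # Dice loss = 1 - Dice similarity coefficient (see Eq. (6)).
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return 1.0 - dice

# Usage: build the U-Net with conv_block_prelu in place of conv_block and
# compile with the Dice loss, e.g. model.compile(optimizer="adam", loss=dice_loss)
```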

    3.4. Evaluation metrics

    In the field of machine learning, evaluation metrics play a crucial role in objectively measuring the performance of models or algorithms in solving specific problems. These metrics help to evaluate the accuracy, efficiency, and robustness of the model or algorithm. In the context of medical image segmentation, various evaluation metrics are used to assess different aspects of the study. For example, Dice Similarity Coefficient (DSC), Jaccard Index (JI), Sensitivity (Sens), and Specificity (Spec) are some of the commonly used metrics to evaluate medical image segmentation studies. These metrics are defined using parameters such as True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), as expressed in Eqs. (1)–(4)17:

    DSC = \frac{2 \times TP}{(TP + FP) + (TP + FN)},   (1)
    JI = \frac{TP}{TP + FP + FN},   (2)
    Sensitivity = \frac{TP}{TP + FN},   (3)
    Specificity = \frac{TN}{TN + FP}.   (4)
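    A direct NumPy implementation of Eqs. (1)–(4) for a pair of binary masks is sketched below; it assumes the masks are 0/1 arrays of the same shape and adds a small epsilon to avoid division by zero.

```python
# Sketch: DSC, Jaccard, sensitivity and specificity (Eqs. (1)-(4))
# computed from a predicted binary mask and a ground-truth binary mask.
import numpy as np

def segmentation_metrics(pred, truth, eps=1e-8):
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    dsc = 2 * tp / ((tp + fp) + (tp + fn) + eps)
    jaccard = tp / (tp + fp + fn + eps)
    sensitivity = tp / (tp + fn + eps)
    specificity = tn / (tn + fp + eps)
    return dsc, jaccard, sensitivity, specificity
```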

    In the later stages of the study, an additional evaluation metric, Hausdorff Distance (HD), was included to compare the performance of the proposed model with the state-of-the-art models and different loss functions. HD is a metric that measures the maximum distance between any point on the predicted mask and the closest point on the ground truth mask. It is particularly useful for evaluating the segmentation accuracy of complex structures. The mathematical formulation of HD is presented in Eq. (5), and a visual representation of the HD metric is provided in Fig. 8.18

    hd(X, Y) = \max_{x \in X} \min_{y \in Y} \| x - y \|_2,
    hd(Y, X) = \max_{y \in Y} \min_{x \in X} \| x - y \|_2,
    HD(X, Y) = \max\big( hd(X, Y), hd(Y, X) \big).   (5)
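    SciPy's directed Hausdorff routine implements the two directed terms of Eq. (5); the short sketch below computes the symmetric distance for two point sets. How the boundary points are extracted from the masks is left open and is an assumption of this example.

```python
# Sketch: symmetric Hausdorff distance (Eq. (5)) between two point sets,
# e.g. boundary pixel coordinates of the predicted and ground-truth masks.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(points_x, points_y):
    hd_xy = directed_hausdorff(points_x, points_y)[0]   # hd(X, Y)
    hd_yx = directed_hausdorff(points_y, points_x)[0]   # hd(Y, X)
    return max(hd_xy, hd_yx)

# Example with N x 2 arrays of (row, col) boundary coordinates:
x = np.array([[0, 0], [0, 1], [1, 0]])
y = np.array([[0, 0], [2, 2]])
print(hausdorff_distance(x, y))
```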

    Fig. 8.

    Fig. 8. (Color online) Components of the calculation of the Hausdorff Distance between the green line X and the blue line Y.19

    3.5. Loss functions

    Loss functions are essential in medical image segmentation to measure the difference between the predicted segmentation mask and the ground truth mask. The appropriate selection of the loss function can significantly impact the accuracy and convergence rate of the segmentation model. Several commonly used loss functions in medical image segmentation include binary cross-entropy loss, Dice loss, focal loss, Tversky loss, and Focal Tversky loss. Binary cross-entropy (BCE) loss is effective for binary segmentation tasks, while Dice loss is suitable for datasets with imbalanced foreground and background pixels. Focal loss emphasizes hard-to-classify samples to address the issue of class imbalance. Tversky loss, a variant of Dice loss, includes two additional hyperparameters to control the weight given to false positives and false negatives.20 Finally, Focal Tversky loss integrates the concepts of both Focal and Tversky loss to reduce the contribution of easy pixels and place more emphasis on misclassified hard pixels.21 The choice of loss function depends on the particular medical image segmentation task at hand. The mathematical expressions of the loss functions are given in Eqs. (6)–(10):

    \text{Dice Loss} = 1 - \frac{2 \times TP}{(TP + FP) + (TP + FN)},   (6)
    BCE = -\big( y \log(p) + (1 - y) \log(1 - p) \big),   (7)
    \text{Tversky Index (TI)} = \frac{TP}{TP + \alpha FN + \beta FP},   (8)
    \text{Tversky Loss (TL)} = 1 - TI,   (9)
    \text{Focal Tversky Loss (FTL)} = \sum_{c} (1 - TI_c)^{\gamma}.   (10)

    The proposed model in this study employs the Dice loss function. Subsequently, models with alternative loss functions were trained using the best-performing parameters, and the results were subsequently compared.
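    The alternative losses in Eqs. (8)–(10) can be written compactly in TensorFlow; a sketch for binary masks is given below. The values of α, β, and γ are illustrative, since the paper does not report the ones used.

```python
# Sketch: Tversky and Focal Tversky losses (Eqs. (8)-(10)) for binary masks.
# alpha, beta and gamma are illustrative hyperparameter values.
import tensorflow as tf

def tversky_index(y_true, y_pred, alpha=0.7, beta=0.3, smooth=1.0):
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    tp = tf.reduce_sum(y_true * y_pred)
    fn = tf.reduce_sum(y_true * (1.0 - y_pred))
    fp = tf.reduce_sum((1.0 - y_true) * y_pred)
    return (tp + smooth) / (tp + alpha * fn + beta * fp + smooth)   # Eq. (8)

def tversky_loss(y_true, y_pred):
    return 1.0 - tversky_index(y_true, y_pred)                      # Eq. (9)

def focal_tversky_loss(y_true, y_pred, gamma=0.75):
    return tf.pow(1.0 - tversky_index(y_true, y_pred), gamma)       # Eq. (10)
```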

    4. Experiments

    The experiments were conducted on a computer system equipped with an Intel(R) Xeon(R) E-2236 CPU running at a clock speed of 3.40 GHz, 32 GB of RAM, and an NVIDIA GeForce GTX 1660 Ti graphics card. All experiments were performed on the Windows 10 operating system using the TensorFlow framework (version 1.14) with the Adam optimizer. The learning rate was reduced by 10% when there was no improvement for 3 epochs, with the learning rate varying from 10⁻³ to 10⁻⁵. The study was conducted according to the following flow:

    • The images in the dataset were pre-processed and different datasets were created, including raw and pre-processed datasets.

    • The dataset was divided into three sets, namely, training, validation, and test sets, based on the cross-validation values determined for the training processes. Data augmentation was applied to the training data to prevent overfitting. Specifically, 10% of the data allocated to the training set was reserved for validation.

    • The training was carried out using both the original U-Net model and the proposed model. The training was repeated for different batch sizes to determine the optimal batch size.

    • After training, the trained models were tested on the test images, and their performance was evaluated using the chosen evaluation metrics.

    Figure 9 shows example images of the data augmentation while Fig. 10 shows the workflow of the study. The values of the parameters used to train the models are shown in Table 1.

    Fig. 9.

    Fig. 9. (a) Height shift, (b) Horizontal flip, (c) Rotation, (d) Shear, (e) Width shift, and (f) Zoom.

    Fig. 10.

    Fig. 10. Flowchart of the study.

    Table 1. Hyperparameters of models.

    Loss Functions         BCE Loss, Dice Loss
    Learning Rate          1e−4 to 1e−5
    Optimizer              Adam
    Activation Function    ReLU, PReLU
    Epochs                 1000
    Cross-validation       2, 3, 4
    Batch Size             1, 2, 3, 4, 5
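    Putting the hyperparameters in Table 1 together with the workflow above, a minimal training-loop sketch is shown below. It reuses the build_unet and dice_loss sketches from Sec. 3.3; the augmentation ranges, the learning-rate reduction factor, and the dataset file names are assumptions made for illustration only.

```python
# Sketch: a training loop reflecting Table 1 and the workflow above.
# Augmentation ranges, LR-reduction factor, and file names are illustrative.
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.preprocessing.image import ImageDataGenerator

images = np.load("images.npy")   # hypothetical preprocessed dataset (N, H, W, 1)
masks = np.load("masks.npy")     # corresponding binary ground-truth masks

augmenter = ImageDataGenerator(rotation_range=10, width_shift_range=0.1,
                               height_shift_range=0.1, shear_range=0.1,
                               zoom_range=0.1, horizontal_flip=True)

# Reduce the learning rate when validation loss stops improving for 3 epochs.
lr_schedule = ReduceLROnPlateau(monitor="val_loss", factor=0.1,
                                patience=3, min_lr=1e-5)

for train_idx, test_idx in KFold(n_splits=3, shuffle=True).split(images):
    x_train, y_train = images[train_idx], masks[train_idx]

    # Reserve 10% of the training fold for validation, as described above.
    n_val = int(0.1 * len(x_train))
    x_val, y_val = x_train[:n_val], y_train[:n_val]
    x_train, y_train = x_train[n_val:], y_train[n_val:]

    model = build_unet()                     # e.g. the sketch in Sec. 3.3.1
    model.compile(optimizer="adam", loss=dice_loss)
    model.fit(augmenter.flow(x_train, y_train, batch_size=1),
              validation_data=(x_val, y_val),
              epochs=1000, callbacks=[lr_schedule])

    model.evaluate(images[test_idx], masks[test_idx])   # held-out test fold
```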

    5. Results

    This section presents the quantitative results obtained from training the original U-Net and the proposed model, which are categorized under different subheadings based on the type of dataset used, including the T1- and T2-weighted datasets (raw and preprocessed) and the fusion datasets. The performance metrics evaluated include DSC, JI, sensitivity (Sens), and specificity (Spec), using the original (RAW), DWT, and CLAHE applied datasets with three different cross-validation values (k=2, 3, 4) and five different batch sizes (1, 2, 3, 4, 5). Finally, an overall evaluation of all studies is presented, and the findings are discussed.

    5.1. Quantitative results of T1-weighted images

    This study evaluated the performance of the original U-Net model trained on T1-weighted abdominal images and preprocessed datasets, using varying batch sizes and k-fold values, and the average test results are presented in Table 2. The findings revealed that when the RAW dataset was employed, the performance decreased with increasing batch size across all k-fold values for each metric. Notably, the best results for all metrics were observed when k=3 was selected for the RAW dataset.

    Table 2. Mean quantitative test results of T1-weighted images with U-Net.

                         RAW                     DWT                     CLAHE
Batch  Metrics   k=2     k=3     k=4     k=2     k=3     k=4     k=2     k=3     k=4
1      DSC       0.456   0.583   0.504   0.517   0.600   0.524   0.398   0.500   0.431
       JI        0.362   0.476   0.403   0.419   0.516   0.432   0.311   0.397   0.335
       Sens      0.379   0.545   0.415   0.461   0.516   0.385   0.306   0.470   0.300
       Spec      0.379   0.998   0.997   0.997   0.997   0.997   0.998   0.998   0.998
2      DSC       0.422   0.514   0.497   0.451   0.608   0.543   0.391   0.482   0.441
       JI        0.329   0.412   0.399   0.361   0.509   0.448   0.306   0.388   0.346
       Sens      0.300   0.487   0.381   0.397   0.529   0.420   0.280   0.440   0.309
       Spec      0.998   0.998   0.998   0.996   0.997   0.996   0.998   0.997   0.998
3      DSC       0.393   0.537   0.472   0.442   0.611   0.472   0.354   0.520   0.446
       JI        0.302   0.446   0.380   0.355   0.507   0.383   0.268   0.414   0.343
       Sens      0.288   0.502   0.332   0.460   0.531   0.336   0.278   0.468   0.311
       Spec      0.990   0.998   0.998   0.991   0.998   0.998   0.981   0.998   0.997
4      DSC       0.397   0.513   0.470   0.475   0.566   0.508   0.352   0.469   0.428
       JI        0.309   0.421   0.372   0.373   0.472   0.416   0.267   0.376   0.331
       Sens      0.318   0.478   0.352   0.567   0.504   0.367   0.319   0.443   0.308
       Spec      0.965   0.997   0.994   0.963   0.998   0.998   0.978   0.998   0.992
5      DSC       0.383   0.474   0.425   0.455   0.565   0.481   0.358   0.432   0.428
       JI        0.289   0.378   0.333   0.364   0.464   0.387   0.268   0.331   0.330
       Sens      0.281   0.438   0.341   0.508   0.518   0.445   0.306   0.483   0.382
       Spec      0.997   0.998   0.949   0.983   0.994   0.996   0.971   0.974   0.961

    Interestingly, for the models trained on the DWT applied data, the best results were obtained with different batch sizes for each k-fold value. Specifically, the optimal results for all metrics were obtained with the combination of k=3 and batch size 3. Moreover, the models trained on the CLAHE applied data demonstrated an increase in performance as the k value increased. Additionally, it was observed that the optimal results for this dataset were obtained with the same combination as DWT.

    Overall, the results suggest that among the different preprocessed datasets, the DWT applied dataset demonstrated superior performance in terms of all the evaluation metrics considered.

    Table 3 presents the average test results obtained from the training of the proposed U-Net model on T1-weighted abdominal datasets. The results indicate that a decrease in metrics is observed when a high batch size (4 and 5) is selected for all k-fold values.

    Table 3. Mean quantitative test results of T1-weighted images with modified U-Net.

                         RAW                     DWT                     CLAHE
Batch  Metrics   k=2     k=3     k=4     k=2     k=3     k=4     k=2     k=3     k=4
1      DSC       0.471   0.554   0.491   0.485   0.631   0.523   0.426   0.497   0.463
       JI        0.373   0.454   0.398   0.393   0.533   0.430   0.332   0.393   0.364
       Sens      0.349   0.530   0.364   0.474   0.579   0.375   0.312   0.465   0.327
       Spec      0.998   0.998   0.995   0.996   0.998   0.998   0.998   0.997   0.998
2      DSC       0.455   0.573   0.491   0.461   0.631   0.524   0.416   0.513   0.531
       JI        0.356   0.466   0.395   0.375   0.525   0.433   0.321   0.409   0.434
       Sens      0.353   0.538   0.331   0.453   0.574   0.364   0.327   0.455   0.366
       Spec      0.997   0.998   0.998   0.996   0.997   0.998   0.998   0.998   0.997
3      DSC       0.429   0.533   0.468   0.482   0.612   0.537   0.394   0.495   0.438
       JI        0.337   0.439   0.376   0.382   0.508   0.437   0.302   0.398   0.340
       Sens      0.339   0.512   0.335   0.486   0.552   0.357   0.282   0.458   0.336
       Spec      0.998   0.998   0.998   0.992   0.998   0.998   0.997   0.998   0.996
4      DSC       0.408   0.476   0.463   0.452   0.599   0.502   0.371   0.507   0.443
       JI        0.317   0.385   0.369   0.364   0.492   0.411   0.282   0.408   0.347
       Sens      0.433   0.451   0.317   0.435   0.536   0.417   0.326   0.461   0.306
       Spec      0.988   0.997   0.996   0.995   0.997   0.997   0.947   0.998   0.998
5      DSC       0.369   0.488   0.443   0.449   0.597   0.503   0.344   0.470   0.414
       JI        0.281   0.393   0.351   0.356   0.492   0.408   0.259   0.368   0.321
       Sens      0.482   0.465   0.332   0.469   0.536   0.354   0.386   0.475   0.348
       Spec      0.913   0.993   0.968   0.982   0.995   0.998   0.941   0.966   0.959

    Notably, the results suggest that the best results are obtained when 3-fold cross-validation is used in datasets other than the CLAHE applied dataset. Specifically, the combination of 3-fold cross-validation and batch 1 yielded the optimal performance for the DWT applied dataset, as presented in Table 3. Moreover, in terms of performance, the DWT applied dataset demonstrated the best performance, followed by the RAW and CLAHE datasets, respectively.

    This part aimed to evaluate the impact of various preprocessing techniques on the performance of segmentation models trained on T1-weighted images. The findings indicate that the DWT applied dataset yielded the best results, followed by the RAW and CLAHE applied datasets, respectively. Thus, it can be inferred that DWT application has a positive effect on T1-weighted images, while CLAHE application has a negative effect.

    Interestingly, the optimal performance for both models was achieved when using 3-fold cross-validation, although with different batch sizes. Furthermore, the proposed segmentation model outperformed the original model. Specifically, when 3-fold cross-validation and batch 1 were selected with the proposed model trained on the DWT dataset, the DSC, JI, sensitivity, and specificity metrics yielded values of 0.631, 0.533, 0.579, and 0.998, respectively, representing the best results.

    5.2. Quantitative results of T2-weighted images

    In this section, the results of the model training conducted on T2-weighted abdominal images are presented. Initially, the original U-Net model was trained on the datasets, and the results are shown in Table 4. For the RAW dataset, the best performance was obtained when batch 1 was selected for all k values, and 3-fold cross-validation was observed to be the most effective choice.

    Table 4. Mean quantitative test results of T2-weighted images with U-Net.

                         RAW                     DWT                     CLAHE
Batch  Metrics   k=2     k=3     k=4     k=2     k=3     k=4     k=2     k=3     k=4
1      DSC       0.412   0.482   0.470   0.318   0.417   0.391   0.425   0.398   0.475
       JI        0.320   0.392   0.375   0.245   0.332   0.310   0.331   0.306   0.373
       Sens      0.321   0.383   0.307   0.188   0.280   0.276   0.349   0.248   0.276
       Spec      0.997   0.999   0.999   0.999   0.999   0.999   0.999   0.999   0.998
2      DSC       0.378   0.450   0.454   0.326   0.393   0.354   0.362   0.436   0.468
       JI        0.292   0.359   0.359   0.252   0.302   0.280   0.280   0.342   0.374
       Sens      0.344   0.356   0.238   0.195   0.291   0.212   0.330   0.273   0.265
       Spec      0.997   0.999   0.998   0.999   0.998   0.999   0.999   0.999   0.999
3      DSC       0.300   0.443   0.440   0.284   0.384   0.367   0.329   0.340   0.427
       JI        0.230   0.355   0.347   0.217   0.296   0.293   0.258   0.262   0.338
       Sens      0.413   0.433   0.378   0.218   0.267   0.232   0.286   0.214   0.213
       Spec      0.872   0.998   0.995   0.996   0.999   0.999   0.995   0.999   0.999
4      DSC       0.374   0.435   0.400   0.314   0.377   0.371   0.337   0.431   0.415
       JI        0.284   0.343   0.317   0.236   0.296   0.291   0.260   0.334   0.328
       Sens      0.353   0.388   0.282   0.167   0.254   0.264   0.181   0.274   0.293
       Spec      0.998   0.998   0.998   0.998   0.999   0.998   0.999   0.999   0.998
5      DSC       0.300   0.373   0.389   0.257   0.334   0.341   0.305   0.328   0.409
       JI        0.224   0.291   0.310   0.187   0.252   0.262   0.232   0.250   0.320
       Sens      0.338   0.534   0.331   0.275   0.235   0.332   0.259   0.294   0.246
       Spec      0.978   0.895   0.998   0.961   0.998   0.986   0.979   0.981   0.999

    In the case of the DWT applied dataset, batch 2 for k=2 and batch 1 for the other k values demonstrated the best performance. For the CLAHE applied dataset, batch 2 for 3-fold cross-validation and batch 1 for the remaining k values provided the optimal results.

    In summary, the results in Table 4 suggest that the optimal performance, as indicated by all the metrics, was achieved by using 3-fold cross-validation and batch 1 with RAW dataset.

    Table 5 shows the performance results attained by training with the proposed model. The highest performance scores were observed when batch 1 was chosen for all cross-validation values in both the RAW and DWT applied datasets. For the CLAHE dataset, batch 4 exhibited the best results for 3-fold cross-validation, while batch 1 yielded optimal performance for the other cross-validation values. Notably, the optimal performance across all evaluated datasets was achieved with 3-fold cross-validation and batch 1 on the RAW dataset.

    Table 5. Mean quantitative test results of T2-weighted images with modified U-Net.

                         RAW                     DWT                     CLAHE
Batch  Metrics   k=2     k=3     k=4     k=2     k=3     k=4     k=2     k=3     k=4
1      DSC       0.430   0.503   0.470   0.376   0.434   0.399   0.399   0.395   0.486
       JI        0.333   0.405   0.375   0.291   0.339   0.317   0.311   0.308   0.391
       Sens      0.360   0.366   0.280   0.291   0.293   0.268   0.276   0.243   0.257
       Spec      0.997   0.998   0.998   0.998   0.999   0.999   0.999   0.999   0.999
2      DSC       0.412   0.476   0.462   0.338   0.378   0.370   0.376   0.407   0.433
       JI        0.319   0.378   0.370   0.263   0.300   0.297   0.290   0.315   0.342
       Sens      0.343   0.405   0.344   0.236   0.264   0.220   0.285   0.258   0.258
       Spec      0.998   0.999   0.998   0.998   0.999   0.999   0.998   0.999   0.998
3      DSC       0.364   0.447   0.417   0.355   0.376   0.372   0.343   0.397   0.449
       JI        0.281   0.356   0.332   0.276   0.290   0.296   0.265   0.312   0.353
       Sens      0.384   0.372   0.328   0.193   0.248   0.233   0.260   0.227   0.284
       Spec      0.993   0.999   0.998   0.998   0.998   0.999   0.999   0.999   0.999
4      DSC       0.383   0.447   0.430   0.305   0.371   0.382   0.371   0.427   0.405
       JI        0.297   0.351   0.345   0.229   0.290   0.300   0.282   0.326   0.325
       Sens      0.356   0.365   0.357   0.170   0.252   0.273   0.326   0.276   0.217
       Spec      0.999   0.999   0.998   0.999   0.999   0.998   0.947   0.999   0.999
5      DSC       0.293   0.394   0.436   0.307   0.371   0.365   0.329   0.363   0.426
       JI        0.218   0.301   0.349   0.233   0.288   0.288   0.253   0.278   0.331
       Sens      0.488   0.333   0.372   0.174   0.262   0.232   0.193   0.234   0.247
       Spec      0.857   0.980   0.998   0.992   0.999   0.999   0.999   0.998   0.998

    The aim of this section was to compare the performance of two segmentation models on T2-weighted images with different datasets, including RAW, CLAHE, and DWT applied datasets. The obtained results demonstrated that the RAW dataset outperformed the other preprocessed datasets for both models, followed by the CLAHE and DWT applied datasets, respectively. These results indicate that the preprocessing techniques did not have a positive effect on the T2-weighted images.

    Furthermore, the optimal combination for both models was 3-fold cross-validation and batch 1. It is noteworthy that the proposed segmentation model exhibited superior performance compared to the original model. Specifically, when the RAW dataset was used with the proposed model, the best results were achieved, with a DSC of 0.503, JI of 0.405, sensitivity of 0.366, and specificity of 0.998.

    To summarize, the results of this investigation suggest that the RAW dataset with the proposed segmentation model and the combination of 3-fold cross-validation and batch 1 could provide the most favorable results for segmenting adrenal lesions on T2-weighted images.

    5.3. Quantitative results of fusion images

    This section presents the results of the segmentation models trained on the datasets obtained from the fusion processes conducted on T1 and T2-weighted MR images. The study employed both maximum and mean fusion techniques, and the outcomes are reported in Table 6 for both models.

    Table 6. Mean quantitative test results of fusion images with U-Net.

                                U-Net                                         Modified U-Net
                   Max fusion            Mean fusion           Max fusion            Mean fusion
Batch  Metrics   k=2    k=3    k=4    k=2    k=3    k=4    k=2    k=3    k=4    k=2    k=3    k=4
1      DSC       0.373  0.473  0.429  0.317  0.484  0.447  0.362  0.426  0.444  0.294  0.348  0.404
       JI        0.290  0.384  0.340  0.238  0.391  0.356  0.268  0.337  0.341  0.209  0.261  0.305
       Sens      0.265  0.373  0.239  0.285  0.435  0.323  0.276  0.363  0.251  0.234  0.206  0.354
       Spec      0.998  0.999  0.998  0.996  0.999  0.997  0.997  0.999  0.998  0.996  0.999  0.997
2      DSC       0.337  0.479  0.448  0.317  0.469  0.364  0.342  0.367  0.410  0.285  0.443  0.406
       JI        0.265  0.371  0.355  0.240  0.378  0.287  0.256  0.272  0.314  0.206  0.339  0.312
       Sens      0.247  0.404  0.239  0.296  0.402  0.312  0.264  0.285  0.234  0.306  0.345  0.329
       Spec      0.997  0.998  0.998  0.995  0.999  0.997  0.997  0.998  0.999  0.997  0.999  0.997
3      DSC       0.306  0.426  0.390  0.310  0.417  0.414  0.229  0.403  0.390  0.277  0.422  0.390
       JI        0.234  0.345  0.311  0.230  0.330  0.323  0.163  0.317  0.299  0.199  0.325  0.298
       Sens      0.373  0.344  0.221  0.346  0.314  0.335  0.201  0.351  0.224  0.347  0.338  0.329
       Spec      0.953  0.999  0.998  0.956  0.999  0.997  0.961  0.999  0.997  0.981  0.999  0.998
4      DSC       0.285  0.430  0.424  0.297  0.450  0.368  0.264  0.392  0.380  0.274  0.427  0.373
       JI        0.212  0.348  0.335  0.228  0.357  0.290  0.198  0.307  0.292  0.200  0.335  0.279
       Sens      0.177  0.344  0.218  0.333  0.371  0.326  0.198  0.319  0.210  0.326  0.379  0.300
       Spec      0.995  0.998  0.998  0.995  0.999  0.997  0.997  0.999  0.999  0.996  0.999  0.998
5      DSC       0.269  0.374  0.375  0.268  0.414  0.379  0.292  0.324  0.363  0.281  0.402  0.344
       JI        0.200  0.294  0.293  0.199  0.321  0.293  0.208  0.247  0.282  0.199  0.308  0.256
       Sens      0.234  0.384  0.292  0.397  0.411  0.343  0.307  0.255  0.231  0.352  0.454  0.333
       Spec      0.992  0.973  0.965  0.953  0.994  0.995  0.961  0.995  0.978  0.986  0.986  0.998

    Upon evaluating the results on a model basis, it was observed that for the original model on the maximum fusion dataset, batch 1 provided the best results for 2-fold cross-validation, while batch 2 yielded optimal performance for the other cross-validation values. Conversely, for the mean fusion dataset, batch 2 yielded the best results for 2-fold cross-validation, while batch 1 was found to be optimal for the other cross-validation values.

    Additionally, the results indicate that the best performance was achieved when k=3 was selected for both datasets in this model. Specifically, the DSC, JI, sensitivity, and specificity metrics yielded the best results for the mean fusion applied dataset with a DSC value of 0.484, JI value of 0.391, sensitivity value of 0.435, and specificity value of 0.999, when 3-fold cross-validation and batch 1 were utilized.

    Based on the results obtained with the proposed model, batch 1 is the most effective choice for all cross-validation values in the maximum fusion dataset, whereas for the mean fusion dataset batch 1 is optimal with 2-fold cross-validation and batch 2 is preferred for the other cross-validation values. Regarding the datasets, the best performance is achieved when 4-fold cross-validation is applied to the maximum fusion dataset and 3-fold cross-validation is used for the mean fusion dataset. Overall, the most successful results with the proposed model are obtained with 4-fold cross-validation and batch 1, with DSC, JI, sensitivity, and specificity values of 0.444, 0.341, 0.251, and 0.998, respectively.

    When comparing the training performed on the fusion datasets with both models, it was noted that the original model outperformed the proposed model. However, the performance obtained on the fusion datasets was inferior to that obtained with the other preprocessing techniques, namely DWT and CLAHE.

    5.4. Evaluation of quantitative results

    The study aimed to investigate the effects of different models, preprocessing techniques, k-fold values, and batch sizes on adrenal lesion segmentation performance. The results showed that the original U-Net model outperformed the proposed model on the fusion datasets, while the proposed model performed better on the other datasets. Among the datasets used, the DWT applied T1-weighted images, the RAW T2-weighted images, and the mean fusion images showed the best results. The best performance was achieved when 3-fold cross-validation and batch 1 were selected for all datasets, which is a critical criterion for parameter determination.

    Table 7 presents the best results obtained for each dataset, including the addition of the Hausdorff Distance (HD) metric. Remarkably, the pair of cross-validation and batch size that yielded the best results was common to all datasets. When comparing performance across datasets, the T1-weighted dataset outperformed the T2-weighted and mean fusion images, respectively. Furthermore, it was observed that the proposed model outperformed the original model by a significant margin. The best results were achieved on the T1-weighted dataset with DWT applied, with DSC, JI, sensitivity, specificity, and HD metrics yielding values of 0.631, 0.533, 0.579, 0.998, and 2.95mm, respectively, when 3-fold cross-validation and batch 1 were selected in the proposed model.

    Table 7. Best mean quantitative test results of datasets.

Dataset       Parameters*  Model                    DSC    JI     Sens   Spec   HD (mm)
T1 (DWT)      k:3-B:1      Modified (PReLU) U-Net   0.631  0.533  0.579  0.998  2.950
T2 (RAW)      k:3-B:1      Modified (PReLU) U-Net   0.503  0.405  0.366  0.998  2.854
Mean Fusion   k:3-B:1      U-Net                    0.484  0.391  0.435  0.999  2.716

    * k = k-fold value, B = batch size.

    6. Discussion

    This section aims to demonstrate the robustness of the proposed model by comparing it with state-of-the-art models and different loss functions in U-Net. For this purpose, all models were trained on the DWT dataset, which yielded the most successful results in this study, while 3-fold cross-validation and batch 1 were selected. State-of-the-art models including DeepLabV3 Plus,22 SegNet,23 and U-Net++5 were chosen for comparison. Furthermore, BCE, Tversky, and Focal-Tversky loss functions were used as alternative loss functions for the proposed model.

    In Table 8,24 the results of the conducted training using state-of-the-art models and different loss functions are presented. The proposed methods with distinct loss functions, such as BCE loss in model A, Tversky loss in model B, and Focal-Tversky loss in model C, are demonstrated. In addition, the study by Barstugan et al.,14 which also conducted segmentation on the same dataset as this study, is included for comparison purposes. The table provides evaluation metrics along with the number of parameters and training times for the models.

    Table 8. Comparison of mean quantitative test results.

Model                        DSC    JI     Sens   Spec   HD (mm)  Training parameters (millions)  Training time
DeepLab V3 (Ref. 22)         0.557  0.457  0.477  0.997  2.427    17.83                           15:16:21
SegNet (Ref. 23)             0.576  0.459  0.522  0.995  2.013    2.94                            13:13:06
U-Net++ (Ref. 5)             0.566  0.474  0.518  0.997  3.060    8.193                           13:58:29
Barstugan et al. (Ref. 14)   0.631  0.511  0.600  0.998
Proposed Model*              0.631  0.533  0.579  0.998  2.950    36.535                          12:45:01
Model A*                     0.416  0.327  0.320  0.995  3.167    36.535                          12:25:05
Model B*                     0.500  0.402  0.382  0.995  3.030    36.535                          12:16:35
Model C*                     0.499  0.404  0.383  0.995  3.166    36.535                          12:17:01

    * In this study.

    After analysis of the findings in Table 8 based on the metrics used, the following conclusions can be drawn. In terms of the DSC metric, the highest performance is achieved by Barstugan et al. and the proposed model, followed by SegNet and U-Net++. Models B and C have moderate DSC scores, while Model A has the lowest DSC scores. For the JI metric, the proposed model has the highest score, followed by Barstugan et al., and Model A has the lowest JI score. The sensitivity and specificity measures indicate the superior ability of the proposed model to detect the presence or absence of objects in the image. The HD measures the maximum distance between the points of the two sets, with lower HD scores indicating better performance. According to the HD metric, SegNet has the best performance, followed by the proposed model, while Models A, B, and C have relatively high HD scores. The proposed model has the highest number of parameters, while SegNet has the lowest number of parameters, indicating its ability to achieve comparable performance with fewer parameters. Finally, the time metric shows the time required to train each model, with the proposed model having the highest training time, followed by Model A, and SegNet having the lowest training time.

    The comparison of quantitative results demonstrates the superiority of the proposed model over state-of-the-art models, different loss functions, and related studies. This alone confirms the robustness of the proposed model. To further develop the analysis and improve the reliability of the comparison, a qualitative comparison is also included in this section. For this purpose, the image outputs of the trained models and the proposed model on the test images are presented, and the DSC of each predicted image is calculated for comparison to remove subjectivity and ensure objective evaluation. The qualitative comparison results, presented in Table 9, show that the proposed model outperforms all other models both visually and mathematically for all images.

    The findings obtained as a result of the conducted studies have been compared both with existing studies in the literature and with different state-of-the-art models. Tables 8 and 9 provide quantitative and qualitative evaluations, respectively. In both tables, it is evident that the proposed model outperforms other methods.

    Table 9. Comparison of mean qualitative test results.

Predictions (DSC per test image; the test images, ground-truth masks, and predicted masks are shown in the original figure)

Test image   Modified U-Net  Modified U-Net  Modified U-Net   Modified U-Net         DeepLab V3  U-Net++  SegNet
             (Dice Loss)     (BCE Loss)      (Tversky Loss)   (Focal-Tversky Loss)
1            0.945           0.822           0.940            0.835                  0.831       0.776    0.854
2            0.974           0.956           0.958            0.950                  0.851       0.973    0.868
3            0.937           0.895           0.876            0.909                  0.903       0.833    0.716
4            0.809           0.704           0.723            0.776                  0.699       0.533    0.657

    7. Conclusion

    The aim of this study was to perform adrenal lesion segmentation on T1-T2-weighted MR images of the abdomen by proposing a new model and testing different preprocessing techniques, k-fold cross-validation values, and batch sizes to identify the optimal framework for achieving the best performance.

    The preprocessing techniques employed included DWT, CLAHE, and max-mean fusion processes. The study found that DWT preprocessing gave the best results on T1-weighted images, the raw dataset performed best for T2-weighted images, and the fusion datasets produced the least successful results. Among all datasets, the T1-weighted dataset with DWT applied yielded the best results. In terms of k-fold cross-validation, 3-fold cross-validation gave the best results, followed by 4-fold and 2-fold cross-validation, respectively. The study observed a significant decrease in performance at high batch sizes (4, 5), and the best results were generally obtained with batch 1. The proposed model gave by far the best performance, as shown by both quantitative and qualitative comparisons with various state-of-the-art models and loss functions.

    The study also revealed hardware-based limitations that prevented trying higher values for k-fold cross-validation and batch size. Future studies are planned to improve the proposed model and perform different analyses.

    ORCID

    Ahmet Solak  https://orcid.org/0000-0002-5494-1987

    Rahime Ceylan  https://orcid.org/0000-0003-0294-5692

    Mustafa Alper Bozkurt  https://orcid.org/0000-0001-5171-3295

    Hakan Cebeci  https://orcid.org/0000-0002-2017-3166

    Mustafa Koplay  https://orcid.org/0000-0001-7513-4968