World Scientific

Enhancing Robustness of Medical Image Segmentation Model with Neural Memory Ordinary Differential Equation

    https://doi.org/10.1142/S0129065723500600
    Cited by: 13 (Source: Crossref)

    Abstract

    Deep neural networks (DNNs) have emerged as a prominent model in medical image segmentation, achieving remarkable advancements in clinical practice. Despite the promising results reported in the literature, the effectiveness of DNNs necessitates substantial quantities of high-quality annotated training data. During experiments, we observe a significant decline in the performance of DNNs on the test set when there exists disruption in the labels of the training dataset, revealing inherent limitations in the robustness of DNNs. In this paper, we find that the neural memory ordinary differential equation (nmODE), a recently proposed model based on ordinary differential equations (ODEs), not only addresses the robustness limitation but also enhances performance when trained by the clean training dataset. However, it is acknowledged that the ODE-based model tends to be less computationally efficient compared to the conventional discrete models due to the multiple function evaluations required by the ODE solver. Recognizing the efficiency limitation of the ODE-based model, we propose a novel approach called the nmODE-based knowledge distillation (nmODE-KD). The proposed method aims to transfer knowledge from the continuous nmODE to a discrete layer, simultaneously enhancing the model’s robustness and efficiency. The core concept of nmODE-KD revolves around enforcing the discrete layer to mimic the continuous nmODE by minimizing the KL divergence between them. Experimental results on 18 organs-at-risk segmentation tasks demonstrate that nmODE-KD exhibits improved robustness compared to ODE-based models while also mitigating the efficiency limitation.

    1. Introduction

    Segmentation plays a crucial role in clinical practice, e.g. computed tomography (CT)-based organs-at-risk (OARs) and clinical target volume (CTV) segmentation,1,2 optical coherence tomography (OCT)-based macular edema segmentation,3 and endoscopy image-based colon gland segmentation.4 Precisely segmenting the targets contributes to quantitative diagnosis and treatment, thus bringing significant impact on clinical trials. Owing to their powerful learning capability, deep neural networks (DNNs) have been successfully applied in various tasks.5,6,7,8,9,10,11,12,13,14 For the medical image segmentation task, both fully convolutional networks (FCNs)15 and transformer-based networks16 are commonly used. Among these methods, perhaps one of the most well-known medical image segmentation models is the U-Net,17 a succinct segmentation architecture that is composed of an encoder, a decoder, and multiple shortcut connections between them. The encoder's objective is to extract abstract features from the inputs, while the decoder endeavors to recognize the target using the extracted features. Additionally, shortcut connections facilitate information flow within the network. Numerous variants based on the vanilla U-Net architecture have been proposed, showcasing promising segmentation performance across various tasks. Furthermore, the recently introduced no-new-Net (nnU-Net)18 has achieved state-of-the-art results on multiple benchmark tasks, further demonstrating the remarkable capabilities of the U-Net architecture.

    Despite the numerous encouraging results reported in the literature, it is widely acknowledged that the successful application of DNNs relies heavily on large-scale, high-quality datasets. This requirement poses challenges for medical segmentation models, because obtaining accurate segmentation annotations is difficult due to the intricate characteristics of the target and the wide range of expertise among annotators. It is known that many targets (e.g. CTV of the tumor or temporal lobes of the brain) lack clear boundaries, leading to high inter- and intra-annotator variations. Given the uncertainty in the annotations, it is crucial for the segmentation model to exhibit robustness against potential label noise in the training dataset.

    To assess the robustness of the widely used U-Net model, we deliberately disrupt the labels of the training dataset. Specifically, we erase the masks from 25% of the slices near the top and 25% of the slices near the bottom, so that 50% of the slices are masked in total, as depicted in Fig. 1. It is important to note that the labels of the test dataset remain intact since our objective is to evaluate the performance of U-Net using the clean and noisy training datasets separately. For the clean training dataset, U-Net achieves a DSC value of 0.9023 on the test dataset. However, when noise is injected into the labels of the training dataset, the DSC value drops to 0.8102. This decline in performance is expected given the powerful learning capacity of DNNs: previous research has demonstrated that DNNs can memorize the training dataset, even when the labels are shuffled.19 This decline leads us to the first question: How can we enhance the robustness of DNNs when confronted with a noisy training dataset?

    Fig. 1. (Color online) Comparison of coronal view between the whole and masked region of the spinal cord. The red region indicates the slices containing the spinal cord, and the dotted region represents the slices with empty label (i.e. the mask of the spinal cord is erased). The table below the image compares the dice similarity coefficient (DSC) value of the test dataset between the vanilla U-Net and U-Net with nmODE.

    Fortunately, we have discovered a recently proposed ordinary differential equations (ODEs)-based model called neural memory ODE (nmODE),20 demonstrating remarkable robustness against noise. Unlike conventional layer-wise discrete models, the nmODE is a continuous model that incorporates separate learning and memory neurons, resulting in clear dynamic properties. The detailed explanation of the nmODE is given in Sec. 3.1. In our experiments, we regard the nmODE as a special module and insert it into the penultimate layer of the U-Net. The experimental results are presented in the second row of the table shown in Fig. 1. It can be observed that the U-Net with nmODE performs comparably with the vanilla U-Net on the clean training dataset while exhibiting a significant advantage over the vanilla U-Net on the noisy training dataset. The superiority of the U-Net with nmODE can be attributed to the global attractor property inherent in the nmODE, which serves as an efficient memory mechanism within the model. By separating the learning from memory, the nmODE could learn to correct the parameters through the dynamical system, enhancing the robustness of the segmentation model to noise in the training dataset.

    Although the U-Net with nmODE demonstrates impressive performance against perturbations in the training dataset, it has to be acknowledged that the continuous ODE model is less computationally efficient than the conventional discrete layer-wise model. This inefficiency arises from the iterative computation process in the ODE solver, which evaluates the ODE multiple times to accomplish the integration. The limitation in efficiency hinders the application of the U-Net with nmODE in scenarios that require timely segmentation results. This leads us to the second question: How can we enhance the segmentation model's robustness while maintaining its computational efficiency? Our solution to this question is straightforward: let the discrete model mimic the behavior of the continuous model. Suppose the input is denoted as $x$, and the outputs of the U-Net with nmODE and the vanilla U-Net are represented as $\tilde{F}(x)$ and $F(x)$, respectively. During the training phase, our objective is to minimize the distance between $\tilde{F}(x)$ and $F(x)$. During inference, only $F(x)$ is used for prediction, eliminating the computational cost of the nmODE. Specifically, by utilizing the Kullback–Leibler (KL) divergence to measure the distance between $\tilde{F}(x)$ and $F(x)$, we obtain a knowledge distillation (KD) framework. We refer to this proposed method as nmODE-KD, an architecture that distills knowledge from nmODE to the discrete model, thereby achieving both robustness and efficiency simultaneously. Overall, the contributions of the paper can be summarized as follows:

    (i)

    This paper empirically demonstrates that the nmODE method not only enhances the accuracy of the segmentation network but also promotes its robustness against label noise in the training dataset.

    (ii)

    A novel architecture called nmODE-KD is proposed, which aims to transfer knowledge from the nmODE to the discrete layer-wise segmentation network.

    (iii)

    Rigorous experiments on 18 OARs segmentation tasks in the head-neck region demonstrate the effectiveness of the proposed nmODE-KD method.

    2. Related Works

    2.1. Neural ordinary differential equations

    The introduction of the neural ordinary differential equation (NODE)21 has revolutionized the field of DNNs, changing the architecture of networks from a finite stack of discrete layers to a continuous-depth paradigm. Unlike conventional discrete DNNs that have a fixed architecture, the NODE is a specialized model that implicitly maps the input to the output. The NODE has higher non-linearity, clearer dynamic behavior, and stronger fitting capacity when compared with the discrete model. Easy-to-use libraries such as torchdiffeq implement multiple ODE solvers, making it possible to use NODE for practical applications. Following the spread of NODE, many variants have been proposed. Dupont et al.22 pointed out that the NODE preserves the topology of the input space, which constrains its approximation capabilities. The authors proposed augmenting the NODE with additional dimensions to learn complex mappings. Gholami et al.23 observed that the adjoint method21 is numerically unstable for specific activation functions and that the gradient is inaccurate when the time steps are small. They address the observed problems by incorporating a checkpointing method while keeping the same computational cost as the NODE.

    Solving the NODE requires evaluating the differential equation multiple times; the count is referred to as the number of function evaluations (NFE). Reducing the NFE without a significant decline in accuracy is desirable since it can enhance the efficiency of NODE in applications. Based on this consideration, Kelly et al.24 introduced a differentiable regularization term based on the Kth-order derivative of the state with respect to time, leading to simpler trajectories that are easy to solve. Kidger et al.25 replaced the commonly used L2 norm with a seminorm to judge whether the adaptive step in the ODE solver is accepted or rejected. Experiments on multiple tasks show that the proposed improvement reduces the NFE by 40%. Ghosh et al.26 proposed a regularization method that randomly samples the end time in the integration of the ODE solver. The regularization method contributes to decreased training time and increased accuracy compared with the baseline approaches.

    One of the most significant benefits of NODE lies in its robustness, which is a crucial inherent limitation of conventional discrete DNNs. The literature reports that the NODE is robust to random perturbations and adversarial attacks. For example, Hanshu et al.27 empirically verified the robustness of NODE on multiple benchmark datasets. The authors further analyzed the robustness and proposed a time-invariant steady neural ODE (TisODE) model to improve it. Cui et al.28 proposed an activation function named half-Swish to enhance the stability of NODE. Experimental results indicate that the proposed activation function outperforms the basic ones in robustness. Unlike prior work that addresses only efficiency or only robustness, this paper strives for a balance between efficiency and robustness to broaden the applications of NODE.

    2.2. Medical image segmentation

    Image segmentation plays a vital role in various fields.29,30,31 Medical image segmentation,32,33,34 the segmentation task with medical images as input, is a common requirement during clinical practice. As a biologically inspired model,35,36,37,38,39 DNNs have been ubiquitously used in medical image segmentation tasks. The well-known U-Net architecture provides a strong baseline among the numerous types of DNNs. Plenty of U-Net variants deliver promising performance in medical image segmentation tasks. For example, Li et al.40 proposed the hybrid densely connected U-Net (H-DenseUNet) that uses 2D DenseUNet41 to extract the intra-slice feature and 3D counterpart module to build the inter-slice relationship. The H-DenseUNet achieves promising results on the OARs and tumor segmentation tasks. Oktay et al.42 incorporated the attention module43 into the U-Net to enhance the feature extracted by the encoder, obtaining superior performance on multiple OARs segmentation tasks over the U-Net. The nnU-Net18 leverages a self-configuring pipeline to achieve state-of-the-art records on multiple public medical segmentation datasets without manual intervention. Yu et al.44 designed a self-supervised method that leverages the distance between two slices as the pretext task. The pre-trained model is later transferred into the downstream OARs segmentation tasks to increase the accuracy. To balance the efficiency and accuracy between 2D and 3D convolutions in medical image segmentation tasks, MixConvNet45 leverages a mixture of 2D convolutions from different views to replace the 3D convolution. Besides fully supervised learning, semi-supervised learning that uses a large number of unannotated datasets has also been considered in medical image segmentation.46 Despite the performance improvement reported in the literature, few studies have paid attention to the robustness of the segmentation model, which plays an important role during clinical practice. This limitation motivates us to leverage a more advanced approach to enhance the robustness of the segmentation model.

    Currently, some research attempts to use the NODE to strengthen the performance of the segmentation model. For instance, Pinckaers and Litjens47 incorporated the NODE into the U-Net to better exploit the semantic features of the colon glands. Cheng et al.48 proposed a second-order NODE-based model and achieved promising results on six benchmark segmentation datasets. Despite the progress achieved by the NODE in the medical image segmentation tasks, only a few works have paid attention to the segmentation efficiency, which determines the usability of the NODE-based segmentation model in clinical practice. This paper empirically verifies the robustness of the ODE-based model for medical image segmentation tasks. Moreover, the ODE-based model’s efficiency is also considered by leveraging the knowledge distillation paradigm.

    3. Methodology

    In this section, we begin with a brief overview of the NODE to clarify the relationship between nmODE and conventional discrete DNNs. Suppose the output of layer $l$ is denoted as $a_l$. Then the forward computation between layers $l$ and $l+1$ in discrete DNNs can be represented as $a_{l+1} = f(a_l; W)$, where $f$ typically consists of a linear transformation, normalization, and a nonlinear activation function. One of the most well-known structures in DNNs is the residual connection, which is the foundation of residual networks (ResNets).49 The residual connection is defined as $a_{l+1} = a_l + f(a_l; W)$, where the shortcut connection $a_l$ enables the construction of a network with significantly increased depth. By considering layer $l$ as time $t$ and transforming feature $a_l$ into representation $y(t)$, we obtain the equation $y(t+1) - y(t) = f(y(t); W)$. The left side is a finite difference with unit step size; taking the limit as the step size approaches zero, we derive the NODE, which can be formulated as follows:

    $$\dot{y}(t) = f(y(t); W). \tag{1}$$

    By providing the initial internal input $y(0)$ and employing the ODE solver to solve the above differential equation, we can subsequently acquire the solution $y(T)$. The NODE can be seamlessly incorporated into the architecture of DNNs as a specialized layer, where the feature $a_l$ is commonly considered as the initial internal input $y(0)$. The computation principle of the NODE is illustrated in Fig. 2(a). The optimization of the NODE can be accomplished using the backpropagation or the adjoint21 approaches, which are known as discretize-then-optimize and optimize-then-discretize50 methods, respectively. The former method can compute precise gradients and has the advantage of speed, but it consumes 𝒪(T) memory since it needs to store all the intermediate variables for backpropagation. On the other hand, the latter method has a constant memory cost of 𝒪(1), but it is slower compared to the former and introduces numerical discretization errors.
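    The link between the residual update and the ODE can be illustrated with a small sketch. It assumes a toy scalar dynamics function `f` (a hypothetical stand-in for a learned layer) and a fixed-step Euler solver rather than the adaptive solver used in the experiments. With step size 1 the Euler solver reproduces the residual stack, and the counter shows how the number of function evaluations grows as the step shrinks.

```python
import math


def f(y, w):
    """Toy scalar dynamics standing in for a learned layer f(y; W) (hypothetical)."""
    return math.tanh(w * y)


def residual_stack(a0, w, num_layers):
    """Discrete residual update: a_{l+1} = a_l + f(a_l; W)."""
    a = a0
    for _ in range(num_layers):
        a = a + f(a, w)
    return a


def euler_solve(y0, w, t_end, num_steps):
    """Fixed-step Euler integration of dy/dt = f(y; W) from t = 0 to t = t_end.

    Returns the final state y(t_end) and the number of function evaluations.
    """
    h = t_end / num_steps
    y, nfe = y0, 0
    for _ in range(num_steps):
        y = y + h * f(y, w)
        nfe += 1
    return y, nfe


# Step size h = 1: Euler integration coincides with a 4-layer residual stack.
y_res = residual_stack(0.5, -0.8, num_layers=4)
y_coarse, nfe_coarse = euler_solve(0.5, -0.8, t_end=4.0, num_steps=4)

# Smaller steps approximate the continuous trajectory at a higher NFE.
y_fine, nfe_fine = euler_solve(0.5, -0.8, t_end=4.0, num_steps=400)
```

    The fine-step run costs 100 times more evaluations of `f` than the residual stack, which is the efficiency trade-off that motivates the distillation approach later in the paper.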

    Fig. 2. Comparison between the NODE and nmODE.

    3.1. Neural memory ODE

    The NODE is more powerful than the conventional discrete model, which can be attributed to the embedded dynamic system. However, certain literature suggests that the NODE may preserve the input space's topology, leading to functions that the NODE is incapable of representing.22 An example is a one-dimensional function $g$ with $g(-1) = 1$ and $g(1) = -1$. The reason lies in the fact that the trajectories of an ODE cannot cross each other. Nevertheless, this limitation can be addressed by rearranging the order of inputs, i.e. regarding the data as external inputs while keeping the initial internal input fixed.20 The separation of external inputs from internal inputs implies the existence of two distinct types of neurons: learning neurons and memory neurons. Learning only happens in the learning neurons, whereas the memory neurons endeavor to capture the feature's characteristics through the ODE. This motivates the architecture of nmODE, which can be formulated as follows:

    $$\begin{cases} \dot{y}(t) = f(y(t), t, \gamma) \\ \gamma = g(a_l; W). \end{cases} \tag{2}$$
    Here γ represents the external input, and the initial internal input $y(0)$ can be set arbitrarily (e.g. $y(0) = 0$). The computation principle of nmODE is demonstrated in Fig. 2(b). Given the initial internal input $y(0)$ and external input γ, the nmODE outputs the solution $y(T)$. By comparing NODE with nmODE, it is evident that both models can be regarded as specialized layers to be integrated within DNNs. The distinction lies in the learning mechanism. For the NODE, learning is integrated into the ODE, where the output from the previous layer $a_l$ serves as the initial value $y(0)$. In contrast, the nmODE separates the learning process from the ODE by converting the learning into the transformation $g(a_l; W)$, which generates the external input γ for the ODE. The nmODE framework offers a versatile architecture for implementing nonlinear mapping. In practical implementations, Zhang20 introduced a novel implementation of Eq. (2) shown as follows:
    $$\dot{y}(t) = -y(t) + \sin^2(y(t) + \gamma). \tag{3}$$
    Given the initial value $y(0)$ and the external input γ, the dynamical system described in Eq. (3) converges to a single global attractor.20 The existence of a global attractor ensures that the model possesses enhanced memory properties. It is known that the knowledge of a neural network is stored in its learnable parameters, i.e. the connection weights. The model described by Eq. (3) separates the learning (γ) from memory, thus endowing the model with the capability to learn to correct the connection weights through the dynamical system, which also improves the resilience to noise in the training dataset.
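    The global attractor property can be checked numerically with a short sketch. It assumes the scalar form of Eq. (3), dy/dt = -y + sin^2(y + γ), an arbitrary illustrative value γ = 0.5, and a fixed-step fourth-order Runge–Kutta integrator in place of the adaptive solver used in the experiments. Trajectories started from different initial values should end at the same point.

```python
import math


def nmode_rhs(y, gamma):
    """Right-hand side of the scalar nmODE: dy/dt = -y + sin^2(y + gamma)."""
    return -y + math.sin(y + gamma) ** 2


def rk4_solve(y0, gamma, t_end=40.0, num_steps=4000):
    """Integrate the nmODE with classical fixed-step RK4 and return y(t_end)."""
    h = t_end / num_steps
    y = y0
    for _ in range(num_steps):
        k1 = nmode_rhs(y, gamma)
        k2 = nmode_rhs(y + 0.5 * h * k1, gamma)
        k3 = nmode_rhs(y + 0.5 * h * k2, gamma)
        k4 = nmode_rhs(y + h * k3, gamma)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y


gamma = 0.5  # illustrative external input; in the paper it comes from g(a_l; W)
finals = [rk4_solve(y0, gamma) for y0 in (-3.0, 0.0, 5.0)]

# Spread of the end points across initial values, and residual of the
# fixed-point condition y* = sin^2(y* + gamma) at the first end point.
spread = max(finals) - min(finals)
residual = abs(nmode_rhs(finals[0], gamma))
```

    Because sin^2 is bounded in [0, 1], the attractor lies in that interval regardless of γ, so the output of the memory neurons stays bounded even when the external input is perturbed.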

    3.2. Transferring knowledge from nmODE to segmentation model

    Despite the superior memory properties of the nmODE compared to the conventional discrete model, one potential limitation lies in its computational efficiency. The nmODE is solved using an ODE solver, which iteratively evaluates the ODE using either a first-order (Euler) or a higher-order (Runge–Kutta) method. It is also observed that the NFE increases as training progresses.21,22 Relaxing the solver's error tolerance could reduce the NFE, but it also leads to decreased accuracy. The limitation of computational efficiency poses challenges for the real-time application of nmODE. It is desirable to simultaneously possess the advantages of nmODE in memory and the strengths of the discrete model in terms of speed.

    Taking the aforementioned problem into consideration, we propose an architecture specifically designed to transfer the knowledge from the nmODE to the discrete model, as depicted in Fig. 3. The learning process is composed of two phases. In the first phase, we train a segmentation model with nmODE, where the nmODE is employed in the penultimate layer designed to process the abstract feature before making predictions. During the implementation, the U-Net17 is used as the segmentation model. The internal input y(0) and external input γ of nmODE depicted in Eq. (3) are set to 0 and the abstract feature from the convolutional layer in U-Net, respectively. The output of nmODE is subsequently passed to a 1×1 convolutional decision layer to obtain the final prediction. This prediction is regarded as the target for the discrete segmentation model to approximate. In the second phase, we proceed to train a discrete model whose architecture is almost the same as that of the first phase, except the nmODE is removed. For the discrete model, the supervision is composed of two parts. First is the distance between the prediction and the label measured by the cross entropy loss, which is designed to utilize the information contained in the label. The second supervision is derived from the prediction of the segmentation model with nmODE, which aims to transfer the knowledge from the nmODE to the discrete model. The second distance is estimated by the KL divergence between the predictions from nmODE and the discrete model. The two distances are designed to learn the information contained in the label and nmODE. In summary, the complete objective function for the discrete model can be formulated as follows :

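    $$\mathcal{L} = -(1-\lambda) \log p^s_y - \lambda \sum_{c=1}^{C} p^t_c \log p^s_c, \tag{4}$$
    where the two terms represent the cross entropy and the KL divergence, balanced by the hyper-parameter λ. $p^s_y$ denotes the prediction of the discrete model for the target class $y$. $p^t_c$ and $p^s_c$ represent the predictions of the nmODE and the discrete model for class $c$, respectively. The second term is written as the cross entropy between the two predictions, which equals the KL divergence up to the entropy of $p^t$, a constant with respect to the discrete model. By optimizing the objective shown in Eq. (4), the discrete model attempts to simultaneously approximate the label and the prediction from the segmentation model integrated with nmODE.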
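    A minimal per-pixel sketch of this objective is given below. It assumes the loss has the form -(1 - λ) log p^s_y - λ Σ_c p^t_c log p^s_c, with student and teacher probabilities obtained by a softmax over logits; the function names and the three-class logits are hypothetical and chosen only for illustration. The helper `kl_divergence` is included to show that the distillation term differs from the KL divergence only by the teacher entropy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nmode_kd_loss(student_logits, teacher_logits, target, lam=0.1):
    """Weighted sum of label cross entropy and teacher-student cross entropy."""
    p_s = softmax(student_logits)   # discrete model (student)
    p_t = softmax(teacher_logits)   # model with nmODE (teacher)
    ce = -math.log(p_s[target])
    distill = -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
    return (1.0 - lam) * ce + lam * distill


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one pixel with three classes; class 0 is the label.
student = [2.0, 0.5, -1.0]
teacher = [1.5, 1.0, -0.5]
loss = nmode_kd_loss(student, teacher, target=0, lam=0.1)
```

    In the two-phase procedure, the teacher logits would come from the frozen phase-one model, so only the student receives gradients from this loss.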

    Fig. 3. Learning process of the proposed two-phase method.

    In fact, the proposed architecture adheres to the knowledge distillation paradigm, which is specifically designed to transfer knowledge from the teacher to the student. Therefore, the proposed method is named nmODE-KD, representing a model that effectively transfers knowledge from the nmODE to the discrete model. A clear distinction between conventional KD and the proposed nmODE-KD lies in the design of the teacher. In the nmODE-KD approach, the discrete model (student) not only benefits from KD, but also demonstrates robustness against potential noise present in the training dataset. We thoroughly validate the effectiveness of nmODE-KD in Sec. 4.

    4. Experiments

    In this section, we first describe the used medical image segmentation datasets and experimental setup. Then we present the results obtained from the clean training dataset, the noisy training dataset, and the visualization of features in nmODE.

    4.1. Datasets and experimental setup

    The UaNet,1 an open source dataset for head-neck OARs segmentation, is used in the experiments. The UaNet contains various soft and bone tissues that are taken into account during head-neck radiotherapy. These tissues can be found in Table 1. Because most samples contain only a single foreground slice for the optic chiasm and hypophysis, these two OARs are excluded from the experiments. The dataset is split into training, validation, and test sets with a ratio of 0.70:0.15:0.15, i.e. approximately 90, 20, and 20 samples, respectively. The OARs that are categorized into left and right types (e.g. left eye and right eye) are consolidated into a single class to eliminate the influence of location.

    Table 1. Comparison of models trained by the clean training dataset.

    ROI                 DSC of U-Net    DSC of convODE    DSC of nmODE
    Brachial plexus     0.6636          0.6624            0.6703
    Brain stem          0.9150          0.9097            0.9184
    Constrictor naris   0.7723          0.7712            0.7751
    Ear                 0.8055          0.8074            0.8081
    Eye                 0.9181          0.9252            0.9245
    Larynx              0.9205          0.9206            0.9231
    Lens                0.7717          0.7683            0.7745
    Mandible            0.8682          0.8644            0.8720
    Optic nerve         0.7159          0.7137            0.7218
    Oral cavity         0.8678          0.8722            0.8731
    Parotid             0.8480          0.8509            0.8513
    SMG                 0.8125          0.8105            0.8172
    Spinal cord         0.9023          0.9064            0.9049
    Sublingual gland    0.6009          0.6044            0.6148
    Temporal lobe       0.8865          0.8769            0.8975
    Thyroid             0.8169          0.8250            0.8226
    TMJ                 0.8470          0.8453            0.8505
    Trachea             0.9241          0.9262            0.9277
    Average             0.8254          0.8256            0.8304

    The U-Net17 serves as the baseline to validate the proposed method. It is possible to substitute U-Net with a more advanced network to further enhance performance. However, our primary focus lies in assessing the robustness of the proposed method rather than breaking state-of-the-art records. The Adam51 method is used as the optimizer with a learning rate of $10^{-4}$. The ODE is solved by the discretize-then-optimize approach with the dopri5 solver in the torchdiffeq package.21 The hyper-parameter λ in Eq. (4) is set to 0.1. The model is trained for 200 epochs. The DSC is used as the metric to evaluate the performance, which is defined as $\mathrm{DSC} = \frac{2|V_p \cap V_g|}{|V_p| + |V_g|}$, where $V_p$ and $V_g$ represent the volume of prediction and ground truth, respectively. The DSC is designed to quantify the degree of overlap between the prediction and label, with values ranging from 0 to 1. A value of 0 indicates no overlap between the prediction and the label, while a value of 1 signifies a perfect match between the prediction and the label. The experimental results are reported on the test dataset using the model that achieved the highest DSC on the validation dataset.
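    The DSC can be computed directly from two binary masks, as in the following sketch. It assumes flattened masks given as sequences of 0/1 values and uses the definition DSC = 2|Vp ∩ Vg| / (|Vp| + |Vg|). Returning 1.0 when both masks are empty is a convention chosen here for illustration, not something stated in the paper.

```python
def dice_similarity(pred, truth):
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # |Vp ∩ Vg|: voxels labeled foreground in both masks.
    intersection = sum(1 for p, g in zip(pred, truth) if p and g)
    # |Vp| + |Vg|: total foreground voxels in prediction and ground truth.
    size_sum = sum(1 for p in pred if p) + sum(1 for g in truth if g)
    if size_sum == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * intersection / size_sum


perfect = dice_similarity([1, 1, 0, 0], [1, 1, 0, 0])
disjoint = dice_similarity([1, 1, 0, 0], [0, 0, 1, 1])
partial = dice_similarity([1, 1, 1, 0], [0, 1, 1, 1])  # 2 * 2 / (3 + 3)
```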

    4.2. Experiments on clean training dataset

    We first carry out experiments by using the clean training dataset to evaluate the impact of ODE on the segmentation tasks. The comparison of the vanilla U-Net,17 convODE,47 and nmODE20 is shown in Table 1. The convODE is the first attempt that incorporates the ODE depicted in Fig. 2(a) into the colon gland segmentation model. It is evident that nmODE consistently achieves higher DSC scores than the vanilla U-Net for all OARs. Comparing convODE to nmODE, we observe that nmODE outperforms convODE for the majority of OARs, except for the eye, spinal cord, and thyroid. The performance of nmODE summarized in Table 1 suggests that it can serve as a potentially useful module to enhance segmentation accuracy. However, this improvement comes at the expense of significantly increased computational costs. The training and inference time of U-Net, convODE, and nmODE are summarized in Table 2. The substantially increased inference time for convODE and nmODE hampers the practical deployment of ODE-based models. But after transferring the knowledge from ODE to the discrete layer, the inference time is the same as that of the U-Net student model, which can increase the efficacy significantly. Additionally, the presence of potential noise in the training dataset poses challenges to the segmentation model. In the next subsection, we further conduct experiments by using the noisy training dataset.

    Table 2. The training and inference time (seconds) per batch of each model.

                 U-Net    convODE    nmODE
    Training     0.52     1.43       1.58
    Inference    0.33     1.08       1.27

    4.3. Experiments on noisy training dataset

    The masked training dataset, in which the masks of 25% of the slices near the top and 25% of the slices near the bottom are erased, is utilized to evaluate the performance of the models. The experimental results are presented in Table 3. By comparing the vanilla U-Net, convODE, and nmODE, it becomes evident that the nmODE exhibits a significant advantage among the three methods. The nmODE outperforms the vanilla U-Net on all the OARs, with the most notable improvement observed in the optic nerve, where the DSC increases from 0.4529 to 0.5128. Furthermore, the nmODE demonstrates superiority over the convODE on all OARs except the lens. This improvement indicates that the nmODE is more robust than the convODE when dealing with noise in the training dataset. We also compare the ODE-based models with the ODE-KD models, namely convODE-KD and nmODE-KD, as shown in the right part of Table 3. It is observed that the nmODE-KD further enhances the performance compared to the nmODE. For instance, the DSC values of the spinal cord for the U-Net, nmODE, and nmODE-KD are 0.8102, 0.8611, and 0.8665, respectively. Moreover, the nmODE-KD consistently achieves a higher DSC than the convODE-KD for all the OARs. Paired t-tests between the vanilla U-Net and nmODE, as well as between the nmODE and nmODE-KD, were also conducted. Both p-values are very small ($< 10^{-4}$), indicating that the differences are statistically significant.
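    The label corruption can be sketched as follows. This is one possible reading of the procedure, assuming that the erased slices are chosen among the slices that contain the organ, that the same fraction is erased at each end, and that a label volume is a list of 2D slices of 0/1 values. The function name and the data layout are hypothetical.

```python
def erase_end_slices(label_volume, ratio_per_end=0.25):
    """Return a copy of the label volume with masks erased near both ends.

    label_volume: list of slices, each slice a list of rows of 0/1 labels.
    ratio_per_end: fraction of foreground slices erased at the top and,
                   separately, at the bottom (0.25 each gives 50% in total).
    """
    # Indices of slices that contain at least one foreground voxel.
    foreground = [i for i, s in enumerate(label_volume)
                  if any(any(row) for row in s)]
    n_erase = int(len(foreground) * ratio_per_end)
    erased = set(foreground[:n_erase])
    erased.update(foreground[len(foreground) - n_erase:])
    return [
        [[0] * len(row) for row in s] if i in erased else [list(row) for row in s]
        for i, s in enumerate(label_volume)
    ]


# Toy volume: 10 slices of size 2x2, organ present in slices 1 to 8.
volume = [[[1, 1], [0, 0]] if 1 <= i <= 8 else [[0, 0], [0, 0]] for i in range(10)]
noisy = erase_end_slices(volume, ratio_per_end=0.25)
remaining = [i for i, s in enumerate(noisy) if any(any(row) for row in s)]
```

    Only the training labels would be passed through such a function; the test labels stay intact so that the DSC reflects the true segmentation quality.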

    Table 3. Comparison of models trained by the masked training dataset.

    ROI                 DSC of U-Net    DSC of convODE    DSC of nmODE    DSC of convODE-KD    DSC of nmODE-KD
    Brachial plexus     0.5937          0.5785            0.5985          0.5811               0.6256
    Brain stem          0.8693          0.8734            0.8783          0.8770               0.8813
    Constrictor naris   0.6848          0.6843            0.7023          0.6910               0.7264
    Ear                 0.7180          0.7074            0.7331          0.7508               0.7623
    Eye                 0.8958          0.8884            0.9068          0.8875               0.9102
    Larynx              0.8061          0.8005            0.8158          0.8347               0.8672
    Lens                0.5301          0.6003            0.5814          0.5972               0.6611
    Mandible            0.7652          0.7923            0.7970          0.7851               0.8054
    Optic nerve         0.4529          0.4893            0.5128          0.4746               0.5325
    Oral cavity         0.8048          0.7993            0.8161          0.8006               0.8212
    Parotid             0.8098          0.8068            0.8196          0.8110               0.8274
    SMG                 0.7597          0.7446            0.7818          0.7750               0.8058
    Spinal cord         0.8102          0.8260            0.8611          0.8443               0.8665
    Sublingual gland    0.4148          0.4001            0.4239          0.4654               0.4690
    Temporal lobe       0.8191          0.8249            0.8350          0.8138               0.8594
    Thyroid             0.7443          0.7193            0.7624          0.7448               0.7945
    TMJ                 0.7582          0.7935            0.8094          0.8020               0.8112
    Trachea             0.8274          0.8348            0.8391          0.8456               0.8589
    Average             0.7258          0.7313            0.7486          0.7434               0.7714

    4.4. Visualization of the feature in nmODE

    To qualitatively assess the impact of nmODE in the segmentation model, we further compare the external inputs and outputs of nmODE, as shown in Fig. 4. The external inputs and outputs of nmODE are γ and $y(T)$ shown in Fig. 3, respectively. Taking the brain stem as an example, the nmODE clearly helps to better localize the target, since $y(T)$ is closer to the label than γ. A similar phenomenon can also be observed in the eye and spinal cord. The segmentation of the larynx shows an interesting result where the nmODE contributes to rectifying the contour of the target. Moreover, $y(T)$ is more confident than γ as shown in Fig. 4(d). These visualization results indicate that the nmODE can serve as an effective plug-in module to adjust the feature, resulting in a higher DSC score compared to the vanilla discrete model.

    Fig. 4. Visualization of features in nmODE in terms of the brain stem, eye, larynx, and spinal cord. The four images in each part represent the CT slice, external input of nmODE, output of nmODE, and label, respectively.

    4.5. Ablation studies

    Ablation studies are also conducted to validate the generalization of nmODE-KD, covering varied mask ratios and different model architectures. All experiments in this section use the spinal cord dataset. We first inspect the impact of the mask ratio, as shown in Table 4. The DSC of the U-Net decreases as the mask ratio increases: it is 0.8941 at a mask ratio of 5% and drops to 0.6332 at 75%. Nonetheless, the DSC of nmODE is consistently higher than that of the U-Net. Moreover, the gap between nmODE and U-Net widens overall as the mask ratio increases, from 0.0048 at 5% to 0.1113 at 75%, indicating the robustness of nmODE against label noise. The proposed nmODE-KD achieves the highest DSC among the three methods at every mask ratio, implying the effectiveness of the knowledge transfer.

    Table 4. Results of the spinal cord with varied mask ratios.

    Mask ratio | DSC of U-Net | DSC of nmODE | DSC of nmODE-KD
    5% | 0.8941 | 0.8989 | 0.9009
    15% | 0.8906 | 0.8983 | 0.8991
    25% | 0.8797 | 0.8872 | 0.8895
    50% | 0.8102 | 0.8611 | 0.8665
    75% | 0.6332 | 0.7445 | 0.7538
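
The trend in Table 4 can be checked with a few lines of arithmetic. The sketch below uses only the values transcribed from the table and computes, for each mask ratio, the DSC gap between nmODE and U-Net and whether nmODE-KD has the highest DSC. The variable names are hypothetical.

```python
# (U-Net, nmODE, nmODE-KD) DSC per mask ratio, transcribed from Table 4.
TABLE4 = {
    "5%":  (0.8941, 0.8989, 0.9009),
    "15%": (0.8906, 0.8983, 0.8991),
    "25%": (0.8797, 0.8872, 0.8895),
    "50%": (0.8102, 0.8611, 0.8665),
    "75%": (0.6332, 0.7445, 0.7538),
}

# Gap between nmODE and the U-Net baseline at each mask ratio.
gaps = {ratio: nmode - unet for ratio, (unet, nmode, _) in TABLE4.items()}

# Whether nmODE-KD has the highest DSC of the three methods.
kd_is_best = {ratio: kd == max(row) for ratio, row in TABLE4.items() for kd in [row[2]]}

for ratio in TABLE4:
    print(f"{ratio:>4}: nmODE - U-Net = {gaps[ratio]:.4f}, nmODE-KD best: {kd_is_best[ratio]}")
```

The gap grows from 0.0048 at 5% to 0.1113 at 75%. It is not strictly monotonic (0.0077 at 15% versus 0.0075 at 25%), but that difference is small relative to the jump at the 50% and 75% mask ratios.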

    We further conduct experiments on clean and noisy datasets by applying nmODE to different network architectures; the results are summarized in Table 5. Three well-known U-Net variants are considered: the Attention U-Net,42 U-Net++,52 and UNETR.53 The Attention U-Net42 integrates the attention mechanism into the U-Net, the U-Net++52 contains a series of skip connections, and the UNETR53 combines the U-Net with the transformer.16 The results on the clean dataset, shown in the left part of Table 5, indicate that the nmODE increases the DSC regardless of the model's architecture. A similar increase can be found on the noisy dataset. These experimental results show that the nmODE can be used as a plug-and-play module that improves the model's performance on both clean and noisy datasets.

    Table 5. Results of the spinal cord by using models with and without nmODE. The mask ratio of the noisy dataset is 50%.

    Model | Clean dataset, without nmODE | Clean dataset, with nmODE | Noisy dataset, without nmODE | Noisy dataset, with nmODE
    U-Net | 0.9023 | 0.9049 | 0.8102 | 0.8611
    Attention U-Net | 0.9025 | 0.9051 | 0.8110 | 0.8182
    U-Net++ | 0.9033 | 0.9047 | 0.8068 | 0.8134
    UNETR | 0.8822 | 0.8872 | 0.7663 | 0.7844
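
Table 5 can also be read in terms of how much each model degrades when moving from the clean to the noisy dataset. The sketch below, using only the values transcribed from the table, computes the DSC gain from adding nmODE on each dataset and the clean-to-noisy drop with and without nmODE. The variable names are hypothetical.

```python
# (clean without, clean with, noisy without, noisy with) nmODE, from Table 5.
TABLE5 = {
    "U-Net":           (0.9023, 0.9049, 0.8102, 0.8611),
    "Attention U-Net": (0.9025, 0.9051, 0.8110, 0.8182),
    "U-Net++":         (0.9033, 0.9047, 0.8068, 0.8134),
    "UNETR":           (0.8822, 0.8872, 0.7663, 0.7844),
}

summary = {}
for model, (clean_wo, clean_w, noisy_wo, noisy_w) in TABLE5.items():
    summary[model] = {
        "gain_clean": clean_w - clean_wo,     # DSC gained by adding nmODE, clean data
        "gain_noisy": noisy_w - noisy_wo,     # DSC gained by adding nmODE, noisy data
        "drop_without": clean_wo - noisy_wo,  # clean-to-noisy degradation, no nmODE
        "drop_with": clean_w - noisy_w,       # clean-to-noisy degradation, with nmODE
    }

for model, stats in summary.items():
    print(model, {key: round(value, 4) for key, value in stats.items()})
```

For all four architectures the gains are positive on both datasets, and the clean-to-noisy drop is smaller with nmODE than without it, although the size of the effect varies considerably across architectures.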

    5. Conclusion

    This paper empirically verifies the effectiveness of nmODE in CT-based OAR segmentation tasks. The nmODE is a new continuous ODE-based model that, in theory, has only one global attractor. However, its application and efficacy in medical image analysis tasks have been unclear, which is the main motivation of this work. We demonstrate the robustness of nmODE against label noise in the training dataset. In addition, nmODE-KD is proposed to transfer knowledge from the nmODE to a discrete layer, so that the model also benefits from the computational efficiency of the discrete layer. Experimental results show that nmODE-KD further improves the segmentation accuracy. Visualizing and comparing the inputs and outputs of nmODE shows that the nmODE helps rectify the features extracted by the U-Net.
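
The single-global-attractor property can be illustrated numerically on a scalar example. The sketch below assumes the nmODE dynamics take the form dy/dt = -y + sin²(y + γ), as in the original nmODE formulation, and integrates them with a plain forward Euler step. The step size, time horizon, value of γ, and initial states are illustrative choices, and a fixed-step Euler loop is used here only to keep the example dependency-free; it is not the solver setup used in the experiments.

```python
import math


def nmode_rhs(y, gamma):
    """Assumed scalar nmODE right-hand side: dy/dt = -y + sin^2(y + gamma)."""
    return -y + math.sin(y + gamma) ** 2


def integrate_nmode(gamma, y0=0.0, t_end=40.0, dt=0.01):
    """Forward Euler integration from y(0) = y0 up to t = t_end."""
    y = y0
    for _ in range(int(round(t_end / dt))):
        y += dt * nmode_rhs(y, gamma)
    return y


gamma = 0.5  # illustrative external input
initial_states = (-2.0, 0.0, 3.0)
finals = [integrate_nmode(gamma, y0=y0) for y0 in initial_states]

for y0, y_end in zip(initial_states, finals):
    print(f"y(0) = {y0:+.1f} -> y(T) = {y_end:.6f}")
```

Because the derivative of the right-hand side with respect to y is -1 + sin(2(y + γ)), which is never positive, the scalar system has a single equilibrium for a given γ, and all three trajectories end at the same value regardless of where they start.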

    The method could also be extended to other modalities (e.g. magnetic resonance imaging) and to 3D analysis tasks.54,55 Besides the false negative noise introduced in the experiments, it is also desirable to verify its robustness against potential false positive noise. Currently, prompt-guided general segmentation models are prevalent in computer vision and medical image analysis. In future work, we will attempt to leverage the proposed nmODE-KD to develop a general segmentation model, increasing its robustness and applicability in medical image analysis tasks. Moreover, it is worthwhile to integrate the nmODE into more advanced algorithms, such as the neural dynamic classification algorithm,56,57 ensemble learning,58 and self-supervised learning.59

    Acknowledgments

    This work was supported by the National Major Science and Technology Projects of China under Grant 2018AAA0100201, National Natural Science Foundation of China under Grant 62106162, China Postdoctoral Science Foundation under Grant 2021M692269, and Sichuan University Postdoctoral Science Foundation under Grant 2022SCU12080.

    ORCID

    Junjie Hu  https://orcid.org/0000-0002-5750-0511

    Chengrong Yu  https://orcid.org/0009-0004-1238-7414

    Zhang Yi  https://orcid.org/0000-0002-5867-9322

    Haixian Zhang  https://orcid.org/0000-0002-9821-508X

    Notes

    a https://github.com/rtqichen/torchdiffeq.