Radiation therapy is a fundamental cancer treatment in the clinic. However, to satisfy the clinical requirements, radiologists have to iteratively adjust the radiotherapy plan based on experience, causing it extremely subjective and time-consuming to obtain a clinically acceptable plan. To this end, we introduce a transformer-embedded multi-task dose prediction (TransMTDP) network to automatically predict the dose distribution in radiotherapy. Specifically, to achieve more stable and accurate dose predictions, three highly correlated tasks are included in our TransMTDP network, i.e. a main dose prediction task to provide each pixel with a fine-grained dose value, an auxiliary isodose lines prediction task to produce coarse-grained dose ranges, and an auxiliary gradient prediction task to learn subtle gradient information such as radiation patterns and edges in the dose maps. The three correlated tasks are integrated through a shared encoder, following the multi-task learning strategy. To strengthen the connection of the output layers for different tasks, we further use two additional constraints, i.e. isodose consistency loss and gradient consistency loss, to reinforce the match between the dose distribution features generated by the auxiliary tasks and the main task. Additionally, considering many organs in the human body are symmetrical and the dose maps present abundant global features, we embed the transformer into our framework to capture the long-range dependencies of the dose maps. Evaluated on an in-house rectum cancer dataset and a public head and neck cancer dataset, our method gains superior performance compared with the state-of-the-art ones. Code is available at https://github.com/luuuwen/TransMTDP.

Keywords:

1. Introduction

Making radiotherapy plans based on the medical images of cancer patients is a fundamental yet challenging step in cancer treatment. With the advancement of inventive systems,^1,2,3 the effect of treatment planning has been significantly enhanced in recent years. Nevertheless, due to the specific dose distribution requirements for different tissues, i.e. distributing sufficient dose to the planning target volume (PTV) while diminishing the dose deposition on the organs at risk (OARs), making high-quality radiotherapy planning a complex work. In the clinic, the planning is always adjusted by radiologists iteratively, causing it extremely laborious and tedious.⁴ In addition, the quality of treatment planning might be highly variable among radiologists according to their different expertise and experience, resulting in uneven or even sub-optimal quality of plans.⁵ Consequently, developing a productive and stable methodology that can predict the dose distributions for cancer patients automatically is essential to lessen the burden on radiologists by providing them with a better initial point to start the treatment planning, thus facilitating accelerated and precise radiotherapy.

Recently, as deep learning (DL) blooms,^6,7,8,9 several works have focused on automatic dose prediction^{10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25} based on the popular frameworks, i.e. U-net²⁶ and Generative Adversarial Network (GAN).²⁷ Although the current DL-based methods have fulfilled impressive promotions in the formulation of radiotherapy plans to a certain degree, there are still some unignorable limitations. First, the existing DL models always strive to construct the network structure of a single dose prediction task from the perspective of feature utilization, such as feature reuse and multi-scale feature fusion, neglecting the influential characteristics in the dose map, i.e. the isodose lines and gradient information. Actually, these characteristics are highly related to the dose prediction task. We display several instances in Fig. 1. Specifically, (a) is the original CT image while (b) and (c) represent the PTV mask of the rectum cancer patient and the corresponding dose map designed by the radiologists. During the radiotherapy, the radiation beams targeting the PTV penetrate into the human body, some of which inevitably traverse the surrounding healthy tissues and cause damage. Radiologists have to put the prescribed dose on the PTV while diminishing the dose deposition on OARs as much as possible, so the dose level drops quickly from the PTV to its surroundings, resulting in an obvious decline trend as well as gradient properties in the dose map. More exactly, by discretizing the continuous dose distribution values in Fig. 1(c), the dose map is separated into a few adjacent regions and forms an isodose lines map which can provide every pixel with a coarse-grained dose range, as displayed in Fig. 1(d). The regions with different colors mean different dose values distributed in the corresponding ranges and the redder represents a higher dose value. Therefore, the isodose lines map is usually manifested as higher dose distribution around the PTV and lower toward the surrounding areas. Additionally, we intuitively see the gradient properties, such as the radiation patterns and the clear edges of the dose map in Fig. 1(c). To excavate such gradient information in the dose map, we use the traditional Sobel operator to gain the gradient map, as displayed in Fig. 1(e). We believe that these intrinsic and specific characteristics profoundly affect the prediction quality of the final dose maps.

Fig. 1. Instances of a rectum cancer patient. (a) CT image, (b) planning target volume (PTV), (c) dose map, (d) isodose lines map, (e) gradient map. Note that (d) and (e) are obtained from (c) by isodose line calculation and Sobel operation.

Another limitation of the current DL-based methods for dose prediction is that, due to the limited size of the convolution filters, the existing ones derived from the convolutional neural network (CNN) simply consider the local neighbors in the spatial domain and cannot fully extract the global features.^28,29 Actually, many organs in the human body are symmetrical and the dose maps present abundant global features. For instance, the left femur and the right femur, as two typical OARs for rectum cancer, share similar anatomical properties. Such long-range information, however, can hardly be captured by traditional convolutional filters. Therefore, it is crucial to utilize both local and global information to improve the prediction performance.

In this work, inspired by the success of multi-task learning (MTL) and also motivated to conquer the limitations of existing dose prediction methods, we innovatively introduce an end-to-end architecture, namely transformer-embedded multi-task dose prediction (TransMTDP) network, to fulfill the automatic prediction of dose distribution for clinical radiotherapy. Concretely, in contrast with the prior single-task models for dose prediction, we adopt the multi-task learning strategy and incorporate two auxiliary correlated tasks, i.e. an isodose lines prediction task and a gradient map prediction task. For the isodose lines map, it not only reflects the decline trend of the dose distribution but also provides every pixel with a coarse-grained dose range. Integrating the fine-grained dose value produced by the main dose prediction task, we can obtain the multi-granularity information in the dose distribution. In addition, the auxiliary gradient prediction task is designed to extract the gradient properties like radiation patterns or edge information. A shared encoder integrates the three tasks. Over and above, the conventional MTL methods lack the efficiency of correlations among the output layers, as they generally shared the shallow layers to integrate various tasks. To address this issue, we employ two extra constraints, i.e. isodose consistency loss and gradient consistency loss, to enhance the matching among the outputs of three correlated tasks. The main contributions can be listed as follows: (1) We harness the spirit of multi-task learning by introducing two extra auxiliary tasks, i.e. an isodose lines prediction task and a gradient map prediction task, so the main dose prediction task could benefit from the enhanced isodose lines properties and gradient information by mining the correlations among the three tasks. (2) To fully capture the relations of the deep output layers for different tasks, we propose two additional constraints, i.e. isodose consistency loss and gradient consistency loss, to match the features in the dose distribution generated by these two auxiliary tasks with those predicted by the main task, bringing a more explicit dose prediction with the nonexistence of additional parameters. (3) Considering many organs in the human body are symmetrical and the dose maps present abundant global features, we embed the transformer in our architecture. The transformer avails an attention mechanism to measure the relationships among all feature patches, thus capturing contextual features in long-range dependencies. (4) The experiments verify that the proposed outperforms all state-of-the-arts (SOTAs) on an in-house rectum cancer dataset and a public head and neck (H&N) cancer dataset.

Notably, the rudimentary version of our work was published at the 2021 International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2021).²⁵ We extend the conference paper to the current one from these aspects: (1) Introduction: We presented a more detailed research background to convincingly state the motivation of this work; (2) Related work: We reviewed several relevant works; (3) Methodology: We utilized the CT image, PTV and OAR segmentation masks as the original inputs and further embedded transformer into our network to learn the long-range dependencies of dose maps; (4) Experiments and results: We further designed comparison experiments on a public H&N cancer dataset to verify the good generalization of our method. In addition, more detailed experimental results were clearly displayed and described; (5) Discussion: In addition to analyzing and discussing the results of both comparison and ablation experiments, we further pointed out the disadvantages of our work and proclaimed the future research direction.

2. Related Work

2.1. Dose distribution prediction methods

Recently, numerous researches have been conducted for predicting dose distribution in radiotherapy which can be generally divided into two categories: knowledge-based planning (KBP) methods and DL-based methods. Specifically, for the KBP-based methods, a common solution is to perform dosimetric predictions for new patients by taking the treatment plans of historical patients as references. Shiraishi et al.³⁰ estimated the target of the dose–volume histogram (DVH) and optimized it by inverse programming and prior knowledge of achievable DVHs. Besides, some other handcrafted features were chosen for predicting the dose distribution, such as overlapping volume histograms (OVHs), the number of delivery fields and structure shapes.^31,32,33,34 However, these KBP approaches depending on the manual features cannot guarantee the selected hand-crafted features have the greatest impact on accurate dose distribution estimation. Moreover, those features are one- or two-dimensional statistical data,¹⁶ compressed from the 3D dose information, causing the deficiency of spatial information.

Nowadays, DL methods have obtained impressive progress in medical image processing,^{35,36,37,38,39} especially in the radiotherapy dose prediction task.^{10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25} Specifically, encouraged by the idea of U-net,²⁶ a type of CNN, Nguyen et al.¹⁰ introduced a seven-level hierarchy U-net model to make an automatic prediction to the dose map for prostate cancer patients. Then, Wang et al.¹³ proposed progressive refinement U-net (PRUNet) with rank loss to generate more realistic dose distributions of prostate cancer. Nguyen et al.¹⁴ combined the U-net²⁶ and DenseNet⁴⁰ to obtain dosimetric results for H&N cancer patients. In addition, Song et al.¹⁵ used the DeepLab v3+⁴¹ to predict the dose maps of rectum cancer.

Besides, Mahmood et al.¹⁶ applied the traditional 3D GAN to the dose distribution prediction for OPC and outperformed almost U-net models in several clinical criteria. Besides, Kearney et al.¹⁷ utilized a gating mechanism to GAN to emphasize the relevant anatomy features, thus facilitating more realistic predictions of prostate cancer. In addition, to ensure the predicted dose distribution to match the dosimetric specificity of PTV and OARs, Zhan et al.²² introduced an adaptive multi-organ loss into GAN to produce more accurate dose maps.

2.2. Multi-task learning

For the past decade, the MTL methods have drawn wide concern for their promising performance in various vision tasks.^{23,24,25,42,43,44,45} As for medical image processing, Wang et al.⁴² designed a MTL architecture to combine the main segmentation task with an auxiliary regression task where it learns image-level properties to regularize the segmentation subnetwork. Moreover, following the MTL strategy, Zhang et al.⁴³ designed an auxiliary classification task to enhance the performance of automatic gastric tumor segmentation in the main cancer segmentation task. More related to our work, Jiao et al.²⁴ designed a multi-task GAN network to generate accurate dose maps for rectum cancer. This method constructed an additional segmentation task to provide the main dose prediction task with additional anatomy information, in which these two tasks were learned jointly by using a shared encoder and respective decoders. Despite the boost on diverse tasks by MTL, a drawback is that these MTL methods always share the shallow layers to integrate different tasks, but ignore the association among the deep output layers for different tasks. So, we explore two consistency constraints, i.e. isodose consistency loss and gradient consistency loss, to force the dose prediction results of two auxiliary to match well with those produced by the main task.

2.3. Transformer

The transformer is first introduced by Vaswani et al.⁴⁶ as a sequence transduction model based on attention mechanisms and now cut a conspicuous figure in medical image processing.^47,48,49,50 Explicitly, Luo et al.⁵⁰ introduced a GAN network embedded with a transformer encoder to reconstruct the high-quality positron emission tomography from the low-quality one. This method used a modified transformer encoder to capture the global representations in flattened feature sequences generated from the CNN encoder. Meanwhile, Chen et al.⁵⁰ presented a transformer image segmentation framework, named TransUNet, which merits both U-net and transformer to enable precise localization and extract global contexts, gaining superior performance in multi-organ segmentation. In light of this, we argue that embedding transformer into CNN is an executable solution to capture the global information of dose maps.

3. Methodology

In this section, we illustrate the detailed design of our TransMTDP network which jointly implements three relevant tasks. As shown in Fig. 2, three decoders with a shared encoder are respectively designed for three different tasks, including a main dose prediction task to produce the final dose map, an auxiliary isodose lines prediction task to retain features related to isodose lines, and an auxiliary gradient prediction task to capture the gradient properties. In contrast to the single-task network, the two auxiliary tasks in our multi-task network can provide stronger supervision in learning and capture richer feature representations to enhance the prediction performance. The whole network utilizes the CT images, the PTV, and OAR segmentation masks as input, and finally outputs the predicted dose distribution maps, isodose lines maps, and gradient maps, respectively. Notably, all three tasks share the same encoder while the parameters in the three task-specific decoders are different to fit different tasks. In this way, the encoder is compelled to extract the features correlated to the isodose lines and gradient properties in the dose map, causing a lower difference between the output and ground truth. Different from vanilla MTL methods, we further employ two additional constraints, i.e. isodose consistency loss and gradient consistency loss, for feature refinement in the decoder. Besides, transformer modules are embedded in the network for long-range dependencies modeling and global feature extraction. Specifically, a transformer encoder (TransEncoder) is inserted after the CNN to form the entire shared encoder while three transformer decoders (TransDecoder) are respectively embedded into the three decoders to manage different tasks.

3.1. Auxiliary tasks

Isodose Lines Prediction Task: As displayed in Fig. 1(c), the dose distribution presents a decline trend from the PTV to its surroundings. Motivated by the geography concept of iso-height lines, we provide each dose map with an isodose lines map and innovatively design an isodose lines prediction task to learn such features.

Denoting N isodose lines as ${d_{1}, d_{2}, \dots, d_{n}}$ , we separate the whole dose distribution map into $N - 1$ adjacent regions as ${A_{1, 2}, A_{2, 3}, \dots, A_{n - 1, n}}$ . Specifically, if the dose distribution value of one pixel is in the interval $(d_{i}, d_{i + 1})$ , it will be assigned the label $A_{i, i + 1}$ . In this way, the isodose lines maps are easily calculated from the original dose maps. As opposed to the regression task for the dose prediction, the isodose lines prediction task is actually an ( $N - 1$ )-label classification task. For the sake of simplicity, we just utilize the same decoder architecture as the one used in our dose prediction task except for the last convolution layer. It outputs an ( $N - 1$ )-channel classification results and each pixel in channel i represents the probability it belongs to $A_{i, i + 1}$ .

The contributions of this task lie in two points: (1) The isodose lines prediction task constrains the main task to concentrate on the decline trend of the dose value, forcing the dose distribution map to keep similar to the ground truth. (2) The isodose lines prediction task provides each pixel with a coarse-grained dose range. Combining it with the fine-grained dose value obtained by the main dose prediction task, multi-granularity information is gained to get more accurate dose distributions.

Gradient Prediction Task: Besides the isodose-related features, we notice that there are some subtle gradient features shown in Fig. 1(e) which are generally neglected by the current dose prediction methods. Such subtle features reflect the radiation patterns and edge information in the dose map, which are advantageous to improve the prediction quality of dose distribution. Therefore, we integrate a gradient prediction task in the network to learn such subtle gradient properties and output the corresponding gradient maps.

We utilize the widely used Sobel operator to process the manually optimized dose map and generate the corresponding gradient map, due to its simplicity and differentiability. The gained gradient maps are then used as the gradient labels to train the gradient prediction subnetwork with the shared encoder and a gradient prediction decoder. In this manner, the shared encoder is constrained to capture the information related to gradient properties, thus producing a more precise prediction.

3.2. Architecture

Because the resolutions of the dose distribution map, isodose lines map, and gradient map are the same as that of the input CT image, we harness three transformer-embedded U-net-like encoder–decoder subnetworks for these three tasks, in which the encoder is shared but the decoders are task-specific. Following the rationale of U-net,²⁶ we also utilize the skip connections for feature reuse and multi-level feature aggregation between the task-specific decoders and shared encoder.

Shared Encoder: The shared encoder contains two main components, i.e. a CNN encoder and a TransEncoder to learn the task-shared features for the three tasks. Particularly, the CNN encoder contains three down-sampling blocks, and every block is composed of two convolutional layers with a batch normalization (BN) layer, a $3 \times 3$ convolution kernel, and a rectified linear unit (ReLU) activation function followed by a max-pooling layer except for the last one.

The feature maps $X \in ℝ^{C \times W \times H}$ with channel size C, width W, and height H from the CNN encoder are further flattened into feature sequences and added with a position encoding, forming the feature patch sequences $X_{e} = ℝ^{C \times P^{2}}$ as the input of TransEncoder, where $P^{2} = W \times H$ denotes the pixel number in an image patch. As illustrated in Fig. 2, the TransEncoder has a Multi-Head Self-Attention (MHSA) and a Feed Forward Network (FFN), respectively, followed by a layer normalization.⁴⁶ The core component of TransEncoder is MHSA which employs multiple single Self-Attentions (SSAs) to model the relationships among all feature patches. The calculation of SSA is formulated as follows :

S S A (Q, K, V) = SoftMax (Q \cdot K T / \sqrt d) V, <math display="block" altimg="eq-00013.gif"><mi>S</mi><mi>S</mi><mi>A</mi><mo stretchy="false">(</mo><mi>Q</mi><mo>,</mo><mi>K</mi><mo>,</mo><mi>V</mi><mo stretchy="false">)</mo><mo>=</mo><mstyle><mtext mathvariant="normal">SoftMax</mtext></mstyle><mo stretchy="false">(</mo><mi>Q</mi><mo>\cdot</mo><msup><mrow><mi>K</mi></mrow><mrow><mi>T</mi></mrow></msup><mo stretchy="false">/</mo><msqrt><mrow><mi>d</mi></mrow></msqrt><mo stretchy="false">)</mo><mi>V</mi><mo>,</mo></math> (1)

where Q, K, and V denote the weight matrices Keys, Queries, and Value transformed from the input patch and d denotes dimension. In this way, the shared encoder can further learn the long-range feature representations, which is beneficial to the following prediction tasks.

Task-specific Decoders: Consistent with the encoder, each task-specific decoder also contains a TransDecoder and a CNN decoder. The TransDecoder also follows the original design in Ref. 46 and the detailed process is similar to the TransEncoder.

As for the CNN decoder, each up-sampling block contains three $3 \times 3$ convolutional layers, and every two convolutional layers are followed by a BN and ReLU. The outputs of the three decoders are different and task-dependent. With the input image $I^{H \times W \times (2 + c)}$ where c denotes the number of OARs, the main dose prediction and auxiliary gradient prediction tasks, respectively, produce the dose distribution map $D^{H \times W \times 1}$ and the corresponding gradient map $G^{H \times W \times 1}$ , while the isodose lines prediction task generates $N -$ 1 probability maps $P^{H \times W \times (N - 1)}$ . Finally, a $1 \times 1$ convolutional layer is added at the end of each decoder, and the sigmoid functions are also utilized for the decoder of gradient prediction task and dose prediction task to obtain the final outputs with the target size.

3.3. Objective functions

Multi-task Learning Loss Functions: In the proposed TransMTDP network, every subtask has a specially designed loss function. First, we employ an L1 loss as follows to train the main dose prediction network

Lossdose=1W×HW×H∑i=1∥Decdose(F)i−yi∥1,<math display="block" altimg="eq-00021.gif"><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">dose</mtext></mstyle></mrow></msub><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mi>W</mi><mo>×</mo><mi>H</mi></mrow></mfrac><munderover accentunder="true" accent="true"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>W</mi><mo>×</mo><mi>H</mi></mrow></munderover><mo>∥</mo><msup><mrow><msub><mrow><mstyle><mtext mathvariant="normal">Dec</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">dose</mtext></mstyle></mrow></msub><mo stretchy="false">(</mo><mi>F</mi><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi></mrow></msup><mo>−</mo><msup><mrow><mi>y</mi></mrow><mrow><mi>i</mi></mrow></msup><msub><mrow><mo>∥</mo></mrow><mrow><mn>1</mn></mrow></msub><mo>,</mo></math>(2)

where F denotes the feature maps produced by the shared encoder from the original CT slice and corresponding segmentation masks. y represents the dose map (ground truth) manually optimized by radiologists. i represents a single pixel inside the slice.

${Dec}_{dose} (\cdot)$ denotes the dose decoder. W and

$H$ denote the width and height of the original input, respectively.

Considering the isodose lines prediction subtask is actually a ( $N -$ 1)-label classification problem, we use the following cross entropy (CE) loss to guide model training.

Lossiso=−1W×HW×H∑i=1N−1∑cAi,c⋅log(Deciso(F)i,c),<math display="block" altimg="eq-00025.gif"><mtable displaystyle="true" equalrows="false" equalcolumns="false"><mtr><mtd columnalign="left"><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">iso</mtext></mstyle></mrow></msub><mo>=</mo><mo>−</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mi>W</mi><mo>×</mo><mi>H</mi></mrow></mfrac><munderover accentunder="true" accent="true"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>W</mi><mo>×</mo><mi>H</mi></mrow></munderover><munderover accentunder="true" accent="true"><mrow><mo>∑</mo></mrow><mrow><mi>c</mi></mrow><mrow><mi>N</mi><mo>−</mo><mn>1</mn></mrow></munderover><msup><mrow><mi>A</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>c</mi></mrow></msup></mtd></mtr><mtr><mtd columnalign="left"><mspace width="4pc"></mspace><mo>⋅</mo><mo>log</mo><mo stretchy="false">(</mo><msub><mrow><mstyle><mtext mathvariant="normal">Dec</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">iso</mtext></mstyle></mrow></msub><msup><mrow><mo stretchy="false">(</mo><mi>F</mi><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi><mo>,</mo><mi>c</mi></mrow></msup><mo stretchy="false">)</mo><mo>,</mo></mtd></mtr></mtable></math>(3)

where

${Dec}_{iso} (\cdot)$ denotes the isoline decoder to predict the isodose lines.

$c$ refers to the category derived from the ground truth

$y$ by isodose lines calculating as described in Sec. 3.1.

$A^{i, c}$ represents the classification label that indicates whether pixel i pertains to category c. Specifically,

$A^{i, c}$ equals to 1 when pixel i pertains to category c, otherwise equals to 0.

${Dec}_{iso} {(F)}^{i, c}$ denotes the probability that pixel i pertains to category c.

In addition, for the gradient prediction task, we employ L2 loss as follows to constrain the model learning :

Lossgra=1W×HW×H∑i=1∥Decgra(F)i−S(y)i∥2,<math display="block" altimg="eq-00032.gif"><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">gra</mtext></mstyle></mrow></msub><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mi>W</mi><mo>×</mo><mi>H</mi></mrow></mfrac><munderover accentunder="true" accent="true"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>W</mi><mo>×</mo><mi>H</mi></mrow></munderover><msup><mrow><mo>∥</mo><msub><mrow><mstyle><mtext mathvariant="normal">Dec</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">gra</mtext></mstyle></mrow></msub><mo stretchy="false">(</mo><mi>F</mi><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi></mrow></msup><mo>−</mo><msup><mrow><mi>S</mi><mo stretchy="false">(</mo><mi>y</mi><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi></mrow></msup><msub><mrow><mo>∥</mo></mrow><mrow><mn>2</mn></mrow></msub><mo>,</mo></math>(4)

where

${Dec}_{gra} (\cdot)$ represents the gradient decoder to predict the gradient properties while

$S (\cdot)$ means the Sobel operator.

As can be seen, all the losses are functions with respect to the feature maps outputted by the shared encoder. As a consequence, these three tasks can force the encoder to concentrate on the features correlated to the dose values, isodose lines, and gradients in the dose distribution map, thus enhancing the expressivity of the extracted features.

Consistency Loss Functions: Traditional MTL methods are confined to learning a stronger encoder, yet they rarely pay attention to the learning of the decoders. However, the decoder is also significant for task-specific feature extraction and task accomplishment. In this work, we introduce two consistency loss functions, such as the isodose consistency loss ${Loss}_{IC}$ and gradient consistency loss ${Loss}_{GC}$ , to actualize the dose distribution features from these two auxiliary tasks, which are well-matched with those predicted by the main prediction task.

For isodose lines features, whereas the pixel i pertains to the interval $(d_{i}, d_{i + 1})$ , we use (5) to impel the dose value of every pixel to be located in the correct region.

LossIC=1W×HW×H∑i=1×{∥Decdose(F)i−dim+1∥2,Decdose(F)i>dim+1,0,dim≤Decdose(F)i≤dim+1,∥Decdose(F)i−dim∥2,Decdose(F)i<dim,<math display="block" altimg="eq-00038.gif"><mrow><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mtext>IC</mtext></mrow></msub><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mi>W</mi><mo>×</mo><mi>H</mi></mrow></mfrac><munderover accentunder="true" accent="true"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>W</mi><mo>×</mo><mi>H</mi></mrow></munderover></mrow><mo>×</mo><mspace width=".17em"></mspace><mfenced separators="" open="{" close=""><mrow><mtable displaystyle="true" equalrows="false" equalcolumns="false"><mtr><mtd columnalign="left"><mo>∥</mo><msub><mrow><mstyle><mtext mathvariant="normal">Dec</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">dose</mtext></mstyle></mrow></msub><msup><mrow><mo stretchy="false">(</mo><mi>F</mi><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi></mrow></msup><mo>−</mo><msubsup><mrow><mi>d</mi></mrow><mrow><mi>m</mi><mo>+</mo><mn>1</mn></mrow><mrow><mi>i</mi></mrow></msubsup><msub><mrow><mo>∥</mo></mrow><mrow><mn>2</mn></mrow></msub><mo>,</mo><msub><mrow><mstyle><mtext mathvariant="normal">Dec</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">dose</mtext></mstyle></mrow></msub><msup><mrow><mo stretchy="false">(</mo><mi>F</mi><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi></mrow></msup><mo>&gt;</mo><msubsup><mrow><mi>d</mi></mrow><mrow><mi>m</mi><mo>+</mo><mn>1</mn></mrow><mrow><mi>i</mi></mrow></msubsup><mo>,</mo></mtd></mtr><mtr><mtd columnalign="left"><mn>0</mn><mo>,</mo><mspace width="1em"></mspace><msup><mrow><msub><mrow><msubsup><mrow><mi>d</mi></mrow><mrow><mi>m</mi></mrow><mrow><mi>i</mi></mrow></msubsup><mo>≤</mo><mstyle><mtext mathvariant="normal">Dec</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">dose</mtext></mstyle></mrow></msub><mo stretchy="false">(</mo><mi>F</mi><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi></mrow></msup><mo>≤</mo><msubsup><mrow><mi>d</mi></mrow><mrow><mi>m</mi><mo>+</mo><mn>1</mn></mrow><mrow><mi>i</mi></mrow></msubsup><mo>,</mo></mtd></mtr><mtr><mtd columnalign="left"><mo>∥</mo><msup><mrow><msub><mrow><mstyle><mtext mathvariant="normal">Dec</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">dose</mtext></mstyle></mrow></msub><mo stretchy="false">(</mo><mi>F</mi><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi></mrow></msup><mo>−</mo><msubsup><mrow><mi>d</mi></mrow><mrow><mi>m</mi></mrow><mrow><mi>i</mi></mrow></msubsup><msub><mrow><mo>∥</mo></mrow><mrow><mn>2</mn></mrow></msub><mo>,</mo><msup><mrow><msub><mrow><mstyle><mtext mathvariant="normal">Dec</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">dose</mtext></mstyle></mrow></msub><mo stretchy="false">(</mo><mi>F</mi><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi></mrow></msup><mo>&lt;</mo><msubsup><mrow><mi>d</mi></mrow><mrow><mi>m</mi></mrow><mrow><mi>i</mi></mrow></msubsup><mo>,</mo></mtd></mtr></mtable></mrow></mfenced></math>(5)

For gradient-related properties, we process the predicted dose map with the Sobel operator $S (\cdot)$ and generate a pseudo gradient map. Afterward, to ensure that the output of the main task possesses the gradient properties close to those of the ground truth, we employ an L2 loss as follows to measure the disparity between the pseudo gradient map and the true one gained from the manually optimized dose distribution map.

LossGC=1W×HW×H∑i=1∥S(Decdose(F))i−S(y)i∥2.<math display="block" altimg="eq-00040.gif"><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">GC</mtext></mstyle></mrow></msub><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mi>W</mi><mo>×</mo><mi>H</mi></mrow></mfrac><munderover accentunder="true" accent="true"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>W</mi><mo>×</mo><mi>H</mi></mrow></munderover><mo>∥</mo><mi>S</mi><msup><mrow><mo stretchy="false">(</mo><msub><mrow><mstyle><mtext mathvariant="normal">Dec</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">dose</mtext></mstyle></mrow></msub><mo stretchy="false">(</mo><mi>F</mi><mo stretchy="false">)</mo><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi></mrow></msup><mo>−</mo><mi>S</mi><msup><mrow><mo stretchy="false">(</mo><mi>y</mi><mo stretchy="false">)</mo></mrow><mrow><mi>i</mi></mrow></msup><msub><mrow><mo>∥</mo></mrow><mrow><mn>2</mn></mrow></msub><mo>.</mo></math>(6)

Additionally, since the $S {(y)}^{i}$ has been calculated in (4) and these two losses are also easy to calculate, the consistency loss functions can be considered cost-free.

The whole loss function of the training phase is expressed as the weighted sum of the five above-mentioned losses.

Loss total = Loss dose + λ 1 Loss iso + λ 2 Loss gra + λ 3 Loss IC + λ 4 Loss GC, <math display="block" altimg="eq-00042.gif"><mtable columnalign="left" displaystyle="true"><mtr><mtd><mrow><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">total</mtext></mstyle></mrow></msub><mo>=</mo><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">dose</mtext></mstyle></mrow></msub><mo>+</mo><msub><mrow><mi>λ</mi></mrow><mrow><mn>1</mn></mrow></msub><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">iso</mtext></mstyle></mrow></msub><mo>+</mo><msub><mrow><mi>λ</mi></mrow><mrow><mn>2</mn></mrow></msub><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mstyle><mtext mathvariant="normal">gra</mtext></mstyle></mrow></msub></mrow></mtd></mtr><mtr><mtd><mspace width="3.5pc"></mspace><mo>+</mo><mspace width=".17em"></mspace><msub><mrow><mi>λ</mi></mrow><mrow><mn>3</mn></mrow></msub><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mtext>IC</mtext></mrow></msub><mo>+</mo><msub><mrow><mi>λ</mi></mrow><mrow><mn>4</mn></mrow></msub><msub><mrow><mstyle><mtext mathvariant="normal">Loss</mtext></mstyle></mrow><mrow><mtext>GC</mtext></mrow></msub><mo>,</mo></mtd></mtr></mtable></math> (7)

where

$λ$ s are the hyper-parameters to balance these five terms.

4. Experiments and Results

4.1. Datasets and implementation details

Datasets: To validate the effectiveness of our method, we involve an in-house rectum cancer dataset and a public H&N cancer dataset from the OpenKBP-2020 AAPM Grand Challenge. Specifically, the rectum cancer dataset contains 110 rectum cancer patients who are treated with chemo-radiotherapy or radiotherapy at West China Hospital and each subject is composed of a CT image, a PTV segmentation mask, a dose distribution map, and four OAR segmentation masks, i.e. bladder, small intestine (ST), right femoral head (FHR), and left femoral head (FHL). Specifically, the images of 88 patients are picked for model training, and the remaining 22 patients are utilized for model testing. The H&N cancer dataset contains 340 patients with contoured CT images and corresponding dose distribution maps. Following Ref. 26, 270 instances are selected for training, while the remaining 70 instances serve as testing samples.

Training Details: The proposed TransMTDP network is realized by the PyTorch on an 11GB NVIDIA GeForce GTX 2080T. We train the model for 200 epochs with Adam optimizer. The initial learning rates of the shared encoder, dose decoder, isodose decoder, and gradient decoder are set as 1e−5, 1e−5, 4e−4, and 3e−6, respectively. Then, the learning rates progressively decline by 1% for the last 100 epochs. Founded on our experimental studies, we set the number of regions i.e. $N -$ 1, as 7, $λ_{1}$ , $λ_{3}$ , and $λ_{4}$ in (7) all equal to 0.5. $λ_{3}$ is set as 2 for the first 30 epochs for the difficulty of gradient learning and then gradually decreases to 0.5 for the following 15 epochs and stays unchanged for the remaining 155 epochs.

Evaluation Metrics: We measure the results of the proposed network with several evaluation metrics, i.e. conformity index $(CI)$ ⁵¹ and heterogeneity index ( $HI)$ .⁵² Given the target volume $TV$ and prescription isodose volume $PIV$ , $CI$ is defined as (8).

CI=(TV∩PIV)2TV×PIV,<math display="block" altimg="eq-00056.gif"><mstyle><mtext mathvariant="normal">CI</mtext></mstyle><mo>=</mo><mfrac><mrow><msup><mrow><mo stretchy="false">(</mo><mstyle><mtext mathvariant="normal">TV</mtext></mstyle><mo>∩</mo><mstyle><mtext mathvariant="normal">PIV</mtext></mstyle><mo stretchy="false">)</mo></mrow><mrow><mn>2</mn></mrow></msup></mrow><mrow><mstyle><mtext mathvariant="normal">TV</mtext></mstyle><mo>×</mo><mstyle><mtext mathvariant="normal">PIV</mtext></mstyle></mrow></mfrac><mo>,</mo></math>(8)

HI is expressed as follows :

HI=D2−D98D50,<math display="block" altimg="eq-00057.gif"><mstyle><mtext mathvariant="normal">HI</mtext></mstyle><mo>=</mo><mfrac><mrow><msub><mrow><mi>D</mi></mrow><mrow><mn>2</mn></mrow></msub><mo>−</mo><msub><mrow><mi>D</mi></mrow><mrow><mn>9</mn><mn>8</mn></mrow></msub></mrow><mrow><msub><mrow><mi>D</mi></mrow><mrow><mn>5</mn><mn>0</mn></mrow></msub></mrow></mfrac><mo>,</mo></math>(9)

where

$D_{m}$ is the minimal absorbed dose covering m% percentage volume of PTV. For both CI and HI, the smaller disparity between the predicted value and the ground truth, the more accurate prediction for the dose distribution. Besides, the accuracy of dose distribution on PTV can also be evaluated by

$D_{m}$ , where

$D_{mean}$ and

$D_{max}$ represent the mean and maximum dose absorbed. Also, we calculate the average prediction error (APE),

$| Δ | = \frac{| y - ỹ |}{y}$ , to measure the disparity between the metrics

$ỹ$ and ground truths. Moreover, we make a paired t-test and calculate the p-value of the proposed over the previous ones. If the p-value is less than 0.05, the enhancement of our proposed is significant.

Dose volume histogram (DVH)⁵³ is also a prominent metric of dose prediction performance. If the DVH curves of a predicted plan are closer to that of the ground truth than other prediction methods, it indicates that this method is of better performance.

4.2. Comparison with the state-of-the-art methods

Comparison on the Rectum Cancer Dataset: In order to investigate the superior performance of our network, we take six state-of-the-art (SOTA) methods for comparison, including U-net (2017),¹¹ GAN (2018),¹⁶ DoseNet (2018),¹² DeepLab v3+ (2020),¹⁵ PRUNet (2022),¹³ and MTDP (2021).²⁵ The quantitative results are listed in Table 1. As we can see, compared with the previous SOTAs, the proposed method demonstrates superior performance for all clinical metrics. Although the MTDP performs better than other SOTAs with the second-best results, the results of our model are even 0.45 and 0.76 lower than it in terms of $D_{mean}$ and $D_{max}$ . Additionally, our model notably reduces the prediction error $Δ$ of $HI$ and $D_{max}$ to 1.80% and 1.45%. Notably, the p-values between the proposed and other SOTAs are almost all smaller than 0.05, indicating significant enhancements of our method. Besides, the parameter number in our model is 25.93M which is also feasible in clinical applications.

**Table 1. Quantitative comparison with state-of-the-art methods in terms of $CI$ , $HI$ , $D_{mean}$ , $D_{max}$ , $D_{95}$ and $D_{50}$ on the rectum cancer dataset.**
Methods	$CI$	$HI$	$D_{mean} (G_{y})$	$D_{max} (G_{y})$	$D_{95} (G_{y})$	$D_{50} (G_{y})$
U-net¹¹	0.835 (2.68)*	0.0682 (18.02)*	46.27 (9.41)*	49.32 (10.04)*	44.37 (6.94)*	47.54 (7.27)*
GAN¹⁶	0.841 (1.98)*	0.0714 (14.18)*	50.34 (1.44)	52.76 (3.77)*	46.86 (1.71)	49.21 (4.01)*
DoseNet¹²	0.839 (2.21)*	0.0731 (12.13)*	49.68 (2.74)*	51.34 (6.36)*	45.97 (3.58)*	49.5 (4.33)*
DeepLab v3+¹⁵	0.847 (1.28)*	0.0701 (15.74)*	49.93 (2.25)*	52.89 (3.37)*	46.32 (2.85)*	50.36 (1.77)*
PRUNet¹³	0.844 (3.57)*	0.0695 (9.42)*	49.51 (1.05)*	50.69 (0.55)*	46.29 (1.27)*	49.72 (1.08)*
MTDP²⁵	0.854 (0.47)	0.0813 (2.28)	50.41 (1.31)	53.27 (2.84)	46.98 (1.46)	50.75 (1.01)
Proposed	0.856 (0.23)	0.0817 (1.80)	50.86 (0.43)	54.03 (1.45)	47.12 (1.17)	50.93 (0.66)
Ground truth	0.858	0.0832	51.08	54.83	47.68	51.27

Note: * means our method is significantly better than the compared method with $p < 0.05$ via paired t-test.

To verify whether the proposed can effectively extract the gradient information and isodose lines, visualizations of these two types of attributes are shown in Fig. 3. The outstanding improvement obtained by our method is especially marked by the red boxes and arrows. As observed, the gradients of the SOTAs are inevitably ambiguous while our method can preeminently reserve such information. Furthermore, the isodose lines generated by the proposed method better match the ground truth while the results of other previous SOTAs are over-smoothed and show large variants in detail.

Fig. 3. Visualization of the gradient maps (first row) and isodose lines maps (second row) of the SOTAs and the proposed method on the rectum cancer dataset.

Moreover, we present the DVH curves of different methods (dashed lines) and ground truth (solid lines) in Fig. 4. We can see that, compared to other SOTAs, the DVH curves of MTDP and DeepLab v3+ have less disparity to the ground truth, indicating a more accurate prediction. However, compared to the proposed, they all suffer from nonnegligible discrepancies in ST. On the whole, the DVH curves of the proposed method achieve minimal disparity from the ground truth, affirming the best quality of dose prediction.

Furthermore, Fig. 5 illustrates the visualization results of different methods. The first row shows the ground truth manually planned by doctors, the second row shows the predicted dose maps of different methods, and the last row illustrates the corresponding difference map. Obviously, the difference between the results of the proposed model and the ground truth is minimal, showing the exemplary performance of our method.

Comparison on the H&N Cancer Dataset: To verify the generalization of the TransMTDP, we further conduct the comparison experiments on the H&N cancer dataset.

Table 2 displays the quantitative results. As can be observed, our proposed TransMTDP shows superior performance over other methods in all metrics. Particularly, all other SOTAs achieve relatively poor performance in $HI$ with larger prediction errors while the TransMTDP reaches the best accuracy (0.101) with only 1.94% error. Even though the MTDP performs fairly well and obtains the second-best results, the proposed TransMTDP further boosts its accuracy and especially diminishes the prediction of $D_{mean}$ by 1.09. In short, the proposed exhibits superior performance compared with the previous SOTAs quantitatively.

**Table 2. Quantitative comparison with state-of-the-art methods in terms of CI, HI, $D_{mean}$ , $D_{max}$ on the H&N cancer dataset.**
Methods	$CI$	$HI$	$D_{mean} (G_{y})$	$D_{max} (G_{y})$
U-net¹⁴	0.654 (9.66)*	0.078 (24.27)*	27.20 (18.56)*	62.21 (9.08)*
GAN¹⁹	0.697 (3.72)	0.084 (18.44)	28.19 (15.59)*	65.40 (4.42)*
DoseNet¹⁵	0.693 (4.28)	0.087 (15.53)*	28.27 (15.35)*	66.79 (2.39)*
DeepLab v3+¹⁸	0.703 (2.90)	0.091 (11.65)	31.17 (6.67)	64.33 (5.99)*
PRUNet¹⁶	0.665 (14.47)	0.084 (6.89)*	31.42 (4.61)	64.75 (4.17)
MTDP²⁸	0.712 (1.65)	0.093 (9.70)	32.21 (3.56)	67.17 (1.84)
Proposed	0.719 (0.69)	0.101 (1.94)	33.30 (0.29)	67.43 (1.48)
Ground truth	0.724	0.103	33.40	68.43

Note: * means our method is significantly better than the compared method with $p < 0.05$ via paired t-test.

4.3. Ablation study

To demonstrate the efficiency of each component of the proposed model, we progressively make ablation studies on the rectum cancer dataset. To be specific, the experiment arrangement can be concluded as: (A) U-net as the main dose prediction task (U-DPT), (B) Transformer-based DPT (Trans-DPT), (C) Trans-DPT+gradient prediction task (GPT), (D) Trans-DPT+GPT+gradient consistency loss ( ${Loss}_{GC}$ ), (E) $Trans - DPT + GPT + {Loss}_{GC} + isodose$ lines prediction task (IPT), (F) $Trans - DPT + GPT + {Loss}_{GC} + IPT + isodose$ consistency loss ( ${Loss}_{IC}$ ) (proposed).

Effectiveness of the auxiliary tasks: To demonstrate the effectiveness of the two auxiliary tasks, we can simply observe the prediction performance by utilizing (B) and (C) for the gradient prediction task, and utilizing (D) and (E) for the isodose lines prediction task. In Fig. 6, we can see that the $| Δ |$ s of CI and HI decline by 0.35% and 2.80%, respectively, after integrating the gradient prediction task. As to the isodose lines prediction task, the CI value of (E) enhances 0.005 over (D).

Fig. 6. Quantitative results of ablation experiments in terms of CI and HI on the rectum cancer dataset. * means our method is significantly better than other variations with $p < 0.05$ via paired t-test.

To further intuitively illustrate the accuracy promotion, we also plot the DVH curves in Fig. 7, where the dashed lines of (C) and (E) show much less gap to the solid lines compared to (B) and (D). For qualitative comparisons, we further provide the dose distribution maps generated by different methods followed by their difference maps in Fig. 8. Compared to (B), we can see that the disparity in the difference map of (C) is slightly less. Additionally, compared with (D), the promotion of (E) is remarkable, decreasing the differences on the whole.

Fig. 8. Visualization results of ablation experiments on rectum cancer dataset. From top to bottom: the ground truth, predicted dose distribution map and the difference map.

In summary, the experimental results confirm the efficacy of the two proposed auxiliary tasks.

Effectiveness of consistency loss functions: To investigate the effectiveness of the proposed two consistency constraints, ${Loss}_{GC}$ and ${Loss}_{IC}$ , we can directly compare the results between (C) and (D), (E), and (F). As displayed in Fig. 6, the isodose consistency loss and the gradient consistency loss, respectively, achieve a 0.0023 and 0.0052 improvement on the HI metric. Furthermore, as illustrated in Fig. 7, the solid lines of (D) and (F) in the DVHs are much closer to the dashed lines than (C) and (E). In addition, we can intuitively observe that the pixel-wise differences in Figs. 8(D) and 8(F) are much smaller. It is also notable that the predicted dose distribution map of (F) is extremely similar to the ground truth with almost negligible differences, verifying the accuracy of the proposed method. Note that, these improvements are nearly cost-free since the consistency losses are easy to calculate, without introducing additional parameters to the network.

Contributions of the Transformer: The results of (A) and (B) clarify the contributions of the transformer employed in this work. For quantitative comparison, as demonstrated in Fig. 6, the $| Δ |$ s of CI and HI of (B) drop by 0.82% and 3.62% over (A). As for qualitative comparison, we can find the predicted DVH curves of (B) are the closest to the manually optimized dose plan, as shown in Fig. 7. Moreover, the predicted dose distribution map is also more similar to the ground truth, as shown by a lighter color in the pixel-wise difference map, as shown in Fig. 8. Therefore, the effectiveness of the transformer is clearly demonstrated.

Selection of hyper-parameters: We further make a set of experiments on the rectum cancer dataset to select the optimal hyper-parameters, i.e. $λ_{1}$ , $λ_{2}$ , $λ_{3}$ , and $λ_{4}$ in Eq. (7), and verify the effect of different parameters selection on the final accuracy. Concretely, the default values of $λ_{1}$ , $λ_{2}$ , $λ_{3}$ , and $λ_{4}$ are all set as 1. In the following experiments, we fix three parameters and then study the influence of using different values for the remaining parameter on the prediction accuracy. We set the unfixed hyper-parameter as 0.1, 0.5, 1, 1.5, and 2, respectively, to conduct experiments, and the parameter with the best accuracy is utilized for the selection of the next one until all parameters are explored. Corresponding results are given in Fig. 9. The different values of the hyper-parameters are shown on the X-axis while the Y-axis indicates CI values or HI values. The curves with different colors stand for the results of different hyper-parameters while the light blue curve represents the ground truth. As observed in Fig. 9, when the hyper-parameters $λ_{1}$ , $λ_{3}$ , and $λ_{4}$ are set as 0.5, the model gains relatively high performance. As for $λ_{2}$ , to avoid the negative effects brought by the insufficient quality of the gradient map in the initial training stage, we set it as 2 for the initial 30 epochs with the hope of enhancing the learning ability of gradient information. Then we gradually decrease it to 0.5 in the following 15 epochs and keep it fixed in the next 155 epochs, to enforce better performance of the auxiliary gradient prediction task and more positive guidance to the main dose prediction task.

Fig. 9. The effect of different values of hyper-parameters on CI and HI metrics on the rectum cancer dataset.

5. Discussion

In this work, we design a transformer-embedded multi-task network for dose prediction named TransMTDP as a solution to automatic and accurate radiotherapy. Taking advantage of the multi-task learning, our model obtains superior performance over the single-task models by using three highly correlated tasks, i.e. a main dose prediction task, an isodose lines prediction task, and a gradient prediction task. Particularly, the auxiliary isodose lines prediction task is innovatively proposed to capture the usually neglected information, i.e. isodose lines, which reflects a coarse-grained dose range of each pixel inside the dose map (as displayed in Fig. 1(d)). Moreover, the auxiliary gradient prediction task is introduced to learn the subtle gradient properties, such as the radiation patterns and the edge information in the dose map, as shown in Fig. 1(e). We jointly trained the main dose prediction task and two auxiliary tasks in an end-to-end manner. From the visualization in Fig. 3, it is obvious that the proposed can learn subtle gradient properties more efficiently and preserve more coarse-grained dose information than other SOTAs. Quantitative results in Fig. 6 also exhibit the effectiveness of the two additional auxiliary tasks. As far as we know, this work is the first attempt to utilize the extra isodose lines information and gradient properties to accurately predict the dose distribution map.

Moreover, different from the traditional ways to perform multi-task strategy by sharing the shallow layers to integrate different tasks, we argue that the relationship between the deep output layers of the main task and auxiliary tasks is another key to further enhancing the results. In this paper, we use two additional loss functions, i.e. isodose consistency loss and gradient consistency loss, to strengthen the association of the main task and two auxiliary tasks. Experimental results in Figs. 6 and 7 verify the better performance of the proposed with two additional constraints than the model without the constraints. Besides, considering the dose map has many global features, the transformer modules are embedded in the model to capture such long-range dependencies. Experimental results in Figs. 6 and 7 indicate the effectiveness of the transformer modules.

Despite the progress achieved, our work has the following limitations. First, due to the privacy politics and costly manual annotation, only an in-house rectum cancer dataset and a public H&N cancer dataset are employed in our experiments. Although the proposed model shows superior accuracy over the SOTAs, other cancer datasets may be further involved to enhance the model generalization. Second, the proposed network still needs the PTV and OAR contours of the cancer patients as input which require a lot delineation time. Therefore, how to design a network that only uses CT images as the input will be our future research direction.

6. Conclusion

In this paper, we introduce a transformer-embedded multi-task dose prediction network (TransMTDP) to fulfill the automatic prediction of clinically acceptable dose distribution in radiotherapy. Assuming that isodose lines and gradient information are helpful to improve the prediction quality, we try to enhance the performance of the main dose prediction task with the help of an isodose lines prediction task and a gradient prediction task and integrate the three tasks by a shared encoder following the multi-task learning strategy. Aiming to strengthen the relationship of the outputs of different tasks, we design two additional constraints, i.e. isodose consistency loss and gradient consistency loss, to further enhance the performance. Then, transformer modules are embedded in the proposed to learn the long-range dependencies and global features. Experimental results on an in-house rectum cancer dataset and a public H&N cancer dataset verify the superiority of the TransMTDP compared to other SOTAs. In the future, we will make more efforts to further enhance its efficiency and model generalization.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC 62071314), Sichuan Science and Technology Program 2023YFG0263, 2023NSFSC0497, 22YYJCYJ0086, and Opening Foundation of Agile and Intelligent Computing Key Laboratory of Sichuan Province.

References

1. S. Tribius and C. Bergelt , Intensity-modulated radiotherapy versus conventional and 3D conformal radiotherapy in patients with head and neck cancer: Is there a worthwhile quality of life gain, Cancer Treat. Rev. 37(7) (2011) 511–519. Crossref, Medline, Web of Science, Google Scholar
2. I. J. Das, V. Moskvin and P. A. Johnstone , Analysis of treatment planning time among systems and planners for intensity-modulated radiation therapy, J. Am. Coll. Radiol. 6(7) (2009) 514–517. Crossref, Medline, Google Scholar
3. W. M. Mendenhall, R. J. Amdur and J. R. Palta , Intensity-modulated radiotherapy in the standard management of head and neck cancer: Promises and pitfalls, J. Clin. Oncol. 24(17) (2006) 2618–2623. Crossref, Medline, Web of Science, Google Scholar
4. Y. Murakami, M. Nakano, M. Yoshida, H. Hirashima, F. Nakamura, J. Fukunaga, M. Hirose, Y. Yoshioka, M. Oguchi and H. Hirata , Possibility of chest wall dose reduction using volumetric-modulated arc therapy (VMAT) in radiation-induced rib fracture cases: Comparison with stereotactic body radiation therapy (SBRT), J. Radiat. Res. 59(3) (2018) 327–332. Crossref, Medline, Web of Science, Google Scholar
5. B. E. Nelms, G. Robinson, J. Markham, K. Velasco, S. Boyd, S. Narayan, J. Wheeler and M. L. Sobczak , Variation in external beam treatment plan quality: An inter-institutional study of planners and planning systems, Pract. Radiat. Oncol. 2(4) (2012) 296–305. Crossref, Medline, Google Scholar
6. Y. Xue, H. Zhu and F. Neri , A self-adaptive multi-objective feature selection approach for classification problems, Integr. Comput. Aided Eng. 29(1) (2022) 3–21. Crossref, Web of Science, Google Scholar
7. G. Zhang, X. Zhang, H. Rong, P. Paul, M. Zhu, F. Neri and Y. S. Ong , A layered spiking neural system for classification problems, Int. J. Neural Syst. 32(8) (2022) 2250023. Link, Web of Science, Google Scholar
8. H. S. Nogay and H. Adeli , Machine learning (ML) for the diagnosis of autism spectrum disorder (ASD) using brain imaging, Rev. Neurosci. 31(8) (2020) 825–841. Crossref, Web of Science, Google Scholar
9. M. H. Rafiei, L. V. Gauthier, H. Adeli and D. Takabi , Self-Supervised learning for electroencephalography, IEEE Trans. Neural Netw. Learn. Syst. 1 (2022) 1–15. Google Scholar
10. D. Nguyen, T. Long, X. Jia, W. Lu, X. Gu, Z. Iqbal and S. Jiang, Dose prediction with U-net: A feasibility study for predicting dose distributions from contours using deep learning on prostate IMRT patients, preprint, 17 (2017) arXiv:1709.09233. Google Scholar
11. D. Nguyen, T. Long, X. Jia, W. Lu, X. Gu, Z. Iqbal and S. Jiang , A feasibility study for predicting optimal radiation therapy dose distributions of prostate cancer patients from patient anatomy using deep learning, Sci. Rep. 9(1) (2019) 1–10. Medline, Web of Science, Google Scholar
12. V. Kearney, J. W. Chan, S. Haaf, M. Descovich and T. D. Solberg , DoseNet: A volumetric dose prediction algorithm using 3D fully-convolutional neural networks, Phys. Med. Biol. 63(23) (2018) 235022. Crossref, Medline, Web of Science, Google Scholar
13. J. Wang, J. Hu, Y. Song, Q. Wang, X. Zhang, S. Bai and Z. Yi , VMAT dose prediction in radiotherapy by using progressive refinement UNet, Neurocomputing 488 (2022) 528–539. Crossref, Web of Science, Google Scholar
14. D. Nguyen, X. Jia, D. Sher, M. H. Lin, Z. Iqbal, H. Liu and S. Jiang , 3D radiotherapy dose prediction on head and neck cancer patients with a hierarchically densely connected U-net deep learning architecture, Phys. Med. Biol. 64(6) (2019) 065020. Crossref, Medline, Web of Science, Google Scholar
15. Y. Song, J. Hu, Y. Liu, H. Hu, Y. Huang, S. Bai and Z. Yi , Dose prediction using a deep neural network for accelerated planning of rectal cancer radiotherapy, Radiot. Oncol. 149 (2020) 111–116. Crossref, Medline, Web of Science, Google Scholar
16. R. Mahmood, A. Babier, A. McNiven, A. Diamant and T. C. Chan , Automated treatment planning in radiation therapy using generative adversarial networks, in Machine Learning for Healthcare Conf. (PMLR, 2018), pp. 484–499. Google Scholar
17. V. Kearney, J. W. Chan, T. Wang, A. Perry, M. Descovich, O. Morin, S. S. Yom and T. D. Solberg , DoseGAN: A generative adversarial network for synthetic dose prediction using attention-gated discrimination and generation, Sci. Rep. 10(1) (2020) 1–8. Crossref, Medline, Web of Science, Google Scholar
18. Y. Murakami, T. Magome, K. Matsumoto, T. Sato, Y. Yoshioka and M. Oguchi , Fully automated dose prediction using generative adversarial networks in prostate cancer patients, PLoS One 15(5) (2020) e0232697. Crossref, Medline, Web of Science, Google Scholar
19. A. Babier, R. Mahmood, A. L. McNiven, A. Diamant and T. C. Chan , Knowledge-based automated planning with three-dimensional generative adversarial networks, Med. Phys. 47(2) (2020) 297–306. Crossref, Medline, Web of Science, Google Scholar
20. L. Wen, J. Xiao, J. Zeng, C. Zu, X. Wu, J. Zhou, X. Peng and Y. Wang , Multi-level progressive transfer learning for cervical cancer dose prediction, Pattern Recognit. 141 (2023) 109606. Crossref, Web of Science, Google Scholar
21. J. Hu, Y. Song, Q. Wang, S. Bai and Z. Yi , Incorporating historical sub-optimal deep neural networks for dose prediction in radiotherapy, Med. Image Anal. 67 (2021) 101886. Crossref, Medline, Web of Science, Google Scholar
22. B. Zhan, J. Xiao, C. Cao, X. Peng, C. Zu, J. Zhou and Y. Wang , Multi-constraint generative adversarial network for dose prediction in radiotherapy, Med. Image Anal. 77 (2022) 102339. Crossref, Medline, Web of Science, Google Scholar
23. H. Li, X. Peng, J. Zeng, J. Xiao, D. Nie, C. Zu, X. Wu, J. Zhou and Y. Wang , Explainable attention guided adversarial deep network for 3D radiotherapy dose distribution prediction, Knowl. Based Syst. 241 (2022) 108324. Crossref, Web of Science, Google Scholar
24. Z. Jiao, X. Peng, J. Xiao, X. Wu, J. Zhou and Y. Wang , Mask-free radiotherapy dose prediction via multi-task learning, in 2022 IEEE 19th Int. Symp. Biomedical Imaging (ISBI) ( IEEE, 2021), pp. 1–5. Google Scholar
25. S. Tan, P. Tang, X. Peng, J. Xiao, C. Zu, X. Wu, J. Zhou and Y. Wang , Incorporating isodose lines and gradient information via multi-task learning for dose prediction in radiotherapy, in Int. Conf. Medical Image Computing and Computer-Assisted Intervention ( Springer, 2021), pp. 753–763. Crossref, Google Scholar
26. O. Ronneberger, P. Fischer and T. Brox , U-net: Convolutional networks for biomedical image segmentation, in Int. Conf. Medical Image Computing and Computer-assisted Intervention ( Springer, 2015), pp. 234–241. Crossref, Google Scholar
27. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio , Generative adversarial nets, Adv. Neural Inf. Process. Syst. (2014) 27. Google Scholar
28. X. Wang, R. Girshick, A. Gupta and K. He , Non-local neural networks, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (2018) pp. 7794–7803. Crossref, Google Scholar
29. H. Zhang, I. Goodfellow, D. Metaxas and A. Odena , Self-attention generative adversarial networks, in Int. Conf. Mach. Learning (2019), pp. 7354–7363. Google Scholar
30. S. Shiraishi, J. Tan, L. A. Olsen and K. L. Moore , Knowledge-based prediction of plan quality metrics in intracranial stereotactic radiosurgery, Med. Phys. 42(2) (2015) 908–917. Crossref, Medline, Web of Science, Google Scholar
31. B. Wu, F. Ricchetti, G. Sanguineti, M. Kazhdan, P. Simari, R. Jacques, R. Jacques, R. Taylor and T. McNutt , Data-driven approach to generating achievable dose–volume histogram objectives in intensity-modulated radiotherapy planning, Int. J. Radiat. Oncol. Biol. Phys. 79(4) (2011) 1241–1247. Crossref, Medline, Web of Science, Google Scholar
32. A. Tran, K. Woods, D. Nguyen, V. Y. Yu, T. Niu, M. Cao, … and K. Sheng , Predicting liver SBRT eligibility and plan quality for VMAT and 4 $π$ plans, Radiat. Oncol. 12(1) (2017) 1–9. Crossref, Medline, Web of Science, Google Scholar
33. L. Yuan, Y. Ge, W. R. Lee, F. F. Yin, J. P. Kirkpatrick and Q. J. Wu , Quantitative analysis of the factors which affect the interpatient organ-at-risk dose sparing variation in IMRT plans, Med. Phys. 39(11) (2012) 6868–6878. Crossref, Medline, Web of Science, Google Scholar
34. J. Lian, L. Yuan, Y. Ge, B. S. Chera, D. P. Yoo, S. Chang, F. Yin and Q. J. Wu , Modeling the dosimetry of organ-at-risk in head and neck IMRT planning: An intertechnique and interinstitutional study, Med. Phys. 40(12) (2013) 121704. Crossref, Medline, Web of Science, Google Scholar
35. Y. Luo, L. Zhou, B. Zhan, Y. Fei, J. Zhou, Y. Wang and D. Shen , Adaptive rectification based adversarial network with spectrum constraint for high-quality PET image synthesis, Med. Image Anal. 77 (2022) 102335. Crossref, Medline, Web of Science, Google Scholar
36. L. Hu, J. Li, X. Peng, J. Xiao, B. Zhan, C. Zu, X. Wu, J. Zhou and Y. Wang , Semi-supervised NPC segmentation with uncertainty and attention guided consistency, Knowl. Based Syst. 239 (2022) 108021. Crossref, Web of Science, Google Scholar
37. P. Tang, P. Yang, D. Nie, X. Wu, J. Zhou and Y. Wang , Unified medical image segmentation by learning from uncertainty in an end-to-end manner, Knowl. Based Syst. 241 (2022) 108215. Crossref, Web of Science, Google Scholar
38. Y. Shi, C. Zu, M. Hong, L. Zhou, L. Wang, X. Wu, J. Zhou, D. Zhang and Y. Wang , ASMFS: Adaptive-similarity-based multi-modality feature selection for classification of Alzheimer’s disease, Pattern Recognit. 126 (2022) 108566. Crossref, Web of Science, Google Scholar
39. Y. Sun, H. Yang, J. Zhou and Y. Wang , ISSMF: Integrated semantic and spatial information of multi-level features for automatic segmentation in prenatal ultrasound images, Artif. Intell. Med. 125 (2022) 102254. Crossref, Medline, Web of Science, Google Scholar
40. G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger , Densely connected convolutional networks, in Proc. IEEE Conf. Computer Vision and Pattern Recognition ( IEEE, 2017), pp. 4700–4708. Crossref, Google Scholar
41. L. C. Chen, Y. Zhu, G. Papandreou, F. Schroff and H. Adam , Encoder–decoder with atrous separable convolution for semantic image segmentation, in Proc. European Conf. Computer Vision ( Springer, 2018), pp. 801–818. Google Scholar
42. K. Wang, B. Zhan, Y. Luo, J. Zhou, X. Wu and Y. Wang , Multi-task curriculum learning for semi-supervised medical image segmentation, in 2021 IEEE 18th Int. Symp. Biomedical Imaging (ISBI) ( IEEE, 2021), pp. 925–928. Crossref, Google Scholar
43. Y. Zhang, H. Li, J. Du, J. Qin, T. Wang, Y. Chen, B. Liu, W. Gao, G. Ma and B. Lei , 3D multi-attention guided multi-task learning network for automatic gastric tumor segmentation and lymph node classification, IEEE Trans. Med. Imaging 40(6) (2021) 1618–1631. Crossref, Medline, Web of Science, Google Scholar
44. A. Olamat, P. Ozel and S. Atasever , Deep learning methods for multi-channel EEG-based emotion recognition, Int. J. Neural Syst. 32(5) (2022) 2250021. Link, Web of Science, Google Scholar
45. H. S. Nogay and H. Adeli , Machine learning (ML) for the diagnosis of autism spectrum disorder (ASD) using brain imaging, Rev. Neurosci. 31(8) (2020) 825–841. Crossref, Web of Science, Google Scholar
46. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin , Attention is all you need, Adv. Neural Inf. Process. Syst. 30 (2017) 5999–6009. Google Scholar
47. A. Bhattacharya, T. Baweja and S. P. K. Karri , Epileptic seizure prediction using deep transformer model, Int. J. Neural Syst. 32(02) (2022) 2150058. Link, Web of Science, Google Scholar
48. Y. Luo, Y. Wang, C. Zu, B. Zhan, X. Wu, J. Zhou, D. Shen and L. Zhou , 3D transformer-GAN for high-quality PET reconstruction, Int. Conf. Medical Image Computing and Computer-Assisted Intervention ( Springer, 2021), pp. 276–285. Crossref, Google Scholar
49. W. Y. Peh, P. Thangavel, Y. Yao, J. Thomas, Y. L. Tan and J. Dauwels , Six-center assessment of CNN-transformer with belief matching loss for patient-independent seizure detection in EEG, Int. J. Neural Syst. 33(3) (2023) 2350012. Link, Web of Science, Google Scholar
50. J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille and Y. Zhou, Transunet: Transformers make strong encoders for medical image segmentation, preprint, arXiv (2021) 2102.04306. Google Scholar
51. A. Van’t Riet, A. C. Mak, M. A. Moerland, L. H. Elders and W. Van Der Zee , A conformation number to quantify the degree of conformality in brachytherapy and external beam irradiation: Application to the prostate, Int. J. Radiat. Oncol. Biol. Phys. 37(3) (1997) 731–736. Crossref, Medline, Web of Science, Google Scholar
52. A. Helal and A. Omar , Homogeneity index: Effective tool for evaluation of 3DCRT, Pan Arab. J. Oncol. 8(2) (2015) 20–23. Google Scholar
53. M. V. Graham, J. A. Purdy, B. Emami, W. Harms, W. Bosch, M. A. Lockett and C. A. Perez , Clinical dose–volume histogram analysis for pneumonitis after 3D treatment for non-small cell lung cancer (NSCLC), Int. J. Radiat. Oncol. Biol. Phys. 45(2) (1999) 323–329. Crossref, Medline, Web of Science, Google Scholar

Vol. 33, No. 08

Metrics

Downloaded 1,687 times

History

Accepted 1 June 2023

Published: 7 July 2023

Information

This is an Open Access article published by World Scientific Publishing Company. It is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND) License which permits use, distribution and reproduction, provided that the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.

Keywords

PDF download