Streamlined photoacoustic image processing with foundation models: A training-free solution
Abstract
Foundation models (FMs) have rapidly evolved and achieved significant accomplishments in computer vision tasks. In particular, the prompt mechanism conveniently allows users to integrate image prior information into the model, making it possible to apply models without any training. Therefore, we proposed a training-free workflow based on foundation models for photoacoustic (PA) image processing tasks. We employed the Segment Anything Model (SAM) by setting simple prompts and integrating the model's outputs with prior knowledge of the imaged objects to accomplish various tasks, including: (1) removing the skin signal in three-dimensional PA image rendering; (2) dual speed-of-sound reconstruction; and (3) segmentation of finger blood vessels. Through these demonstrations, we conclude that FMs can be applied directly in PA imaging without network design or training. This potentially offers a hands-on, convenient route to efficient and accurate segmentation of PA images. This paper serves as a comprehensive tutorial, facilitating the mastery of the technique through the provision of code and sample datasets.
1. Introduction
Foundation models (FMs) have flourished, with their parameters increasing to hundreds of billions or even trillions.1,2 Owing to the substantial growth in parameters, data, and computational power, FMs have demonstrated extraordinary capabilities in numerous tasks. Notably, natural language processing FMs, represented by ChatGPT, have shown astonishing abilities in language understanding, generation, inference, and various code-related tasks.3 They have been widely applied in fields such as office software, chatbots, translation, text generation, and even assisting medical diagnosis.4 The development of FMs in computer vision is following closely.5 The Vision Transformer (ViT) applied the Transformer architecture to image recognition tasks, significantly increasing the parameter size of vision models.6 Contrastive Language-Image Pretraining (CLIP) trained vision models using text as prompts, achieving zero-shot classification.7 Beyond text prompts, some research efforts are also dedicated to visual prompts. Recently, Meta's Segment Anything Model (SAM) has demonstrated robust generalization in segmenting natural images8 by jointly processing images and visual prompts (such as boxes, points, or masks). Moreover, this FM does not require high computational power and can be deployed on ordinary consumer-grade GPUs, offering good prospects for practical applications. Overall, visual FMs have three characteristics: (1) strong generalization, allowing a single model to complete tasks in various scenarios; (2) the ability to introduce image prior information through "prompts", simplifying or even avoiding the cumbersome training process; and (3) modest computational requirements, so they can be conveniently integrated into imaging hardware.
Prominently, the introduction of prompts enables image prior information to be incorporated into deep learning (DL) models, marking a departure from the traditional approach of designing, improving, and training networks for specific tasks. This paradigm shift significantly lowers the technical barriers to applying DL models. In this new paradigm, formulating precise and efficient prompts becomes a research-worthy challenge in its own right, recognized as prompt engineering.9
Image segmentation is a common image processing task in photoacoustic imaging (PAI),10 with applications spanning vascular segmentation,11,12,13,14,15 tissue boundary delineation,16,17,18 outer contour segmentation of imaged objects,19,20,21 and the identification of surgical instruments.22 In the above-mentioned applications, commonly used segmentation methods include manual segmentation, graphics methods, and DL methods. Among them, DL has emerged as the mainstream approach, yet, it still exhibits certain limitations, with two major ones identified below:
(1) The implementation of traditional DL models involves designing networks, constructing datasets, training, and fine-tuning, which require considerable time and effort.23 Some researchers may resort to manual image segmentation due to a lack of resources for building datasets.
(2) DL networks previously developed for PA image processing are highly specialized and lack the generalizability to be widely implemented across various imaging scenarios.
In response to these challenges, we report a method, abbreviated as SAMPA (SAM-assisted PA image processing), for zero-training PA image processing based on the SAM FM. Prior knowledge of the imaged object can be conveniently integrated into the model through prompts and utilized in downstream processing of the segmentation results. The outstanding generalizability of SAMPA is validated through three demonstrations, wherein the imaging systems and objects are deliberately selected to be highly diverse: (1) Demonstration 1: Removing the skin signal in three-dimensional (3D) PA image rendering. In 3D human hand imaging, SAMPA is used to delineate human tissue boundary and remove signals from the skin, thereby effectively exposing deeper vascular features. (2) Demonstration 2: Dual speed-of-sound (SoS) reconstruction. In two-dimensional (2D) mouse imaging, SAMPA identifies the boundary between the animal and the coupling medium to facilitate dual SoS reconstruction. (3) Demonstration 3: Human finger’s blood vessel segmentation. In the segmentation task, SAMPA robustly identifies major blood vessels by refining SAM’s output through the incorporation of prior information into a simple algorithm.
In all three tasks, we do not prepare datasets or perform any model training. Instead, we directly deploy the SAM model and combine it with prior information to achieve good results, demonstrating the exceptional simplicity and generalizability of SAMPA. This paves the way for the application of DL in PA image processing and establishes a new standard against which different DL methods can be compared. The objective of this paper is to offer a tutorial, with publicly available code and exemplifying data files, for swiftly implementing SAMPA.
2. Materials and Methods
In this section, we introduce the basic workflow of SAMPA, aiming to give readers a fundamental understanding of the approach. Here, we mainly emphasize the simplicity of the method. Detailed module and tool definitions, as well as implementation details, are provided in the GitHub repository (https://github.com/Adi-Deng/photoacoustic-SAM). Even readers with no background in DL can quickly replicate and expand upon this work.
2.1. Algorithm workflow introduction
The method workflow is illustrated in Fig. 1 and consists of two main steps: (1) utilize the prompt to introduce prior information to SAM for image segmentation; and (2) process the PA image based on SAM’s result and the information from the image.

Fig. 1. Schematic diagram of SAMPA. The SAM model is outlined with a black dashed box. This method utilizes the segmentation results from the SAM model to achieve more accurate reconstruction or processing of PA data, thereby obtaining better imaging results.
In the first step, the primary task is to segment the image using the SAM model based on the image and the prompt information. The prompt information is conveyed to the FM through marked points on the image together with the category of the area where each marked point is located. For different tasks, we can set various prompt points to convey different prior information. For example, in the task of mouse PA image segmentation, we mark multiple points around the edges of the image, whereas for human tissue boundary segmentation, we only need to mark a single point at the top of the image. This will be elaborated in detail later. Additionally, we can use feature engineering approaches to modify the input images to better align with the preferences of SAM; adjusting the style of PA images is a task PAI researchers are well versed in. The SAM model then outputs binary (or multi-valued) boundary information. Specifically, in Demonstrations 1 and 2, segmentation involves the division between the imaging object and the coupling medium, so SAMPA outputs a binary image. In Demonstration 3, the output for blood vessel segmentation is a multi-valued mask that highlights different regions with distinct colors corresponding to their respective categories. Additionally, in Demonstration 2, we explore how simple pre-processing of the input images can enhance SAM's segmentation performance under under-sampling conditions, which exemplifies the feature engineering approach mentioned earlier. In short, prior information can be incorporated into the model via prompts, and the model's generalization can be improved through feature engineering. This approach, as opposed to traditional fine-tuning, significantly simplifies the application of the model and provides a new perspective on applying DL techniques. A minimal code sketch of this prompting step is given below.
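As a concrete illustration of this prompting step, the sketch below passes a single labeled point to SAM through the official segment_anything package. It is a minimal sketch, not the verbatim repository code: the image path, the prompt coordinates, and the device choice are illustrative assumptions, and the per-task prompt settings are detailed in the GitHub repository.

```python
# A minimal sketch of step (1): prompting SAM with labeled points, assuming
# the official "segment_anything" package and the ViT-L checkpoint
# ("sam_vit_l_0b3195.pth") referenced in Sec. 2.2. The image path and the
# prompt coordinates below are hypothetical placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")
sam.to("cuda")  # a consumer-grade GPU suffices (see Sec. 2.2)
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("pa_slice.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # the image embedding is computed once per image

# Prior information enters here: one foreground point (label 1) placed on the
# imaged object; more points can be appended for harder cases (cf. Fig. 2).
point_coords = np.array([[250, 30]])
point_labels = np.array([1])
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,
)
binary_mask = masks[0]  # boolean HxW mask separating object and background
```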
The second step involves "custom processing", which aims to enhance image quality or improve segmentation accuracy based on SAM's output. In this step, specific prior information relevant to the imaging task is incorporated again, and various processing methods can be customized. In Demonstration 1, the veins on the back of the human hand are predominantly located close to the surface. Therefore, we generate a mask to filter out the skin's signal and to isolate deeper image features, which are primarily contaminated by artifacts. The mask is delineated from the upper boundary of the imaged object to a depth of 1 cm below this boundary; within the masked region the value is set to 1, and outside it to 0. This process effectively suppresses skin signals and reflection artifacts. In Demonstration 2, the reconstruction process requires determining the time of flight (ToF) of the PA signal, and dual-SoS reconstruction requires an analytical expression of the body's outline. Based on SAM's segmentation result, it is straightforward to determine the best elliptical fit to the body's profile. In Demonstration 3, SAM is initially used to automatically identify blood vessels (without directly specified prompts). However, the segmentation results also include nonvessel features. In the second step, we therefore refine the results with a program that calculates the area of each segmented region and the average signal intensity within it. That is to say, for segmentation tasks where SAM's performance is suboptimal, our workflow re-combines SAM's results with the image's prior information to achieve the final goal. Based on the above three demonstrations, our workflow emphasizes the flexible, multi-layered integration of FMs with prior information of PA images to effectively accomplish PA image processing tasks.
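To make this step concrete, the snippet below gives one hedged implementation per demonstration. The helper names, the skin-offset parameter, and the area and intensity thresholds are illustrative assumptions; the exact processing routines are provided in the GitHub repository.

```python
# Possible "custom processing" steps for the three demonstrations; all
# thresholds and the skin offset below are illustrative assumptions.
import cv2
import numpy as np

def skin_removal_mask(body_mask, px_per_cm, skin_px=10):
    """Demo 1: keep a band from just below the upper tissue boundary down to
    1 cm depth (1 inside the band, 0 elsewhere); skin_px is a hypothetical
    offset that excludes the skin layer itself."""
    mask = np.zeros(body_mask.shape, dtype=np.uint8)
    for col in range(body_mask.shape[1]):
        rows = np.flatnonzero(body_mask[:, col])  # tissue pixels in column
        if rows.size:
            top = rows[0]
            mask[top + skin_px:top + px_per_cm, col] = 1
    return mask

def fit_body_ellipse(body_mask):
    """Demo 2: analytical elliptical fit to the body outline, used to assign
    tissue and water SoS regions in dual-SoS reconstruction."""
    contours, _ = cv2.findContours(body_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    largest = max(contours, key=cv2.contourArea)  # needs >= 5 contour points
    return cv2.fitEllipse(largest)  # ((cx, cy), (major, minor), angle)

def refine_vessels(pa_image, sam_regions, max_area=2000, min_mean=0.3):
    """Demo 3: keep only regions whose area and mean intensity are consistent
    with vessels; sam_regions is the list of mask dicts returned by SAM's
    automatic mask generator."""
    return [r["segmentation"] for r in sam_regions
            if r["segmentation"].sum() < max_area
            and pa_image[r["segmentation"]].mean() > min_mean]
```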
In the sample code, depending on the type of the output mask (binary or multi-valued), two sets of code are developed. Demonstrations 1 and 2 share one set of code, while Demonstration 3 uses the other set.
2.2. Algorithm workflow deployment
The computer used in this study consists of a 13th Gen Intel(R) Core(TM) i7-13700K CPU, a GIGABYTE GeForce GTX 1660 Super 6G graphics card, and 32 GB of Kingston DDR4 2666 RAM. Running the lightweight version of the SAM model on Windows takes approximately 0.07 s to perform binary segmentation of a 500×500-pixel image; the "sam_vit_l" checkpoint is used for all demonstrations. All image segmentation experiments are conducted with the aforementioned hardware and software, whose modest requirements make the workflow very accessible. The deployment of the model, along with a basic explanation of each component, is detailed in the "How to start" section of our GitHub repository. The process for using the method is described in the "Readme" section, with some code and procedural explanations referencing the official SAM repository.
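Continuing the sketch from Sec. 2.1, the runtime quoted above can be checked with a simple timer around the prediction call. This is an illustrative snippet: the measured figure is hardware-dependent, and here the one-time image encoding is deliberately excluded from the timed section.

```python
# Illustrative timing of the prompt-to-mask step; predictor, image, and the
# prompt arrays are the ones defined in the earlier sketch.
import time

predictor.set_image(image)  # one-time encoding per image, not timed here
t0 = time.perf_counter()
masks, _, _ = predictor.predict(point_coords=point_coords,
                                point_labels=point_labels,
                                multimask_output=False)
print(f"Prompt-to-mask time: {time.perf_counter() - t0:.3f} s")
```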
2.3. Algorithm workflow test
To validate the effectiveness of SAMPA, we conduct three different types of tests: (1) Testing prompt functionality and feature engineering effectiveness. For Demonstrations 1 and 2, we determine the appropriate prompt method and verify the segmentation capability of SAM after integrating prior information; meanwhile, we test the impact of feature engineering. For Demonstration 3, we test the initial results of automatic blood vessel segmentation. (2) Testing workflow generalization performance. Based on the prompt methods determined in (1), we test the segmentation capability of SAM on different data, including data collected by the imaging systems developed by our group and PA datasets publicly available online. (3) Verifying the overall workflow functionality. Because the original data files are not available from the online resources, we demonstrate the entire workflow using only the data generated in our lab.
The following subsections will briefly describe the imaging experiments conducted with our equipment and the publicly available data used.
2.3.1. Imaging experiments
We collect PA images of the hand and forearm of a healthy volunteer with a clinical PAI platform (CPIIP, TsingPAI Co., Ltd.). The ultrasound probe (256 elements, 5-MHz center frequency, and 60% receive bandwidth) has a 180° angular coverage providing 2D cross-sectional images. An optical parametric oscillator (OPO) provides excitation pulses at 850 nm. The scanning step is 0.1 mm, and a 3D image is reconstructed by stacking the 2D cross-sectional images. Small animal imaging is performed with a custom-made ring-array PA computed tomography (PACT) system with 256 transducer elements, 5-MHz center frequency, and 70% receive bandwidth. Rotating the array by 0.7° yields an equivalent acquisition of 512 channels. The excitation wavelength is 850 nm. Human finger imaging is performed using either the aforementioned ring-array PACT system (without rotating the array) or the CPIIP. Additionally, for the ring-array system, only half of the ring-array data are utilized for image reconstruction, resulting in a 128-channel half-ring acquisition. We intentionally employ this limited-angle acquisition to induce limited-view artifacts, thereby increasing the complexity of the segmentation task. The excitation wavelength is 800 nm. In all experiments, we maintain a laser repetition rate of 10 Hz and ensure that the per-pulse energy density remains below 15 mJ/cm² to comply with the limits set by the American National Standards Institute (ANSI). The images are reconstructed by a standard delay-and-sum (DAS) algorithm. The hand, finger, and mouse are submerged in distilled water to facilitate ultrasound coupling.
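For readers unfamiliar with DAS, a minimal numpy sketch for a ring geometry is given below. It assumes a single SoS and ideal point detectors, ignores the laser trigger delay, and all parameter values in the usage comment are placeholders rather than the exact system settings.

```python
# Minimal delay-and-sum (DAS) sketch for a ring array; single SoS and ideal
# point detectors are simplifying assumptions.
import numpy as np

def das_ring(sinogram, fs, sos, ring_radius, xs, ys, angles):
    """sinogram: (n_elements, n_samples) PA data; xs, ys: pixel coordinate
    axes in meters; angles: element angular positions in radians."""
    X, Y = np.meshgrid(xs, ys)
    image = np.zeros_like(X)
    for k, th in enumerate(angles):
        ex, ey = ring_radius * np.cos(th), ring_radius * np.sin(th)
        tof = np.hypot(X - ex, Y - ey) / sos           # time of flight
        idx = np.clip((tof * fs).astype(int), 0, sinogram.shape[1] - 1)
        image += sinogram[k, idx]                      # delay and sum
    return image

# Example call with placeholder parameters (256 elements, 40 MS/s, 1500 m/s):
# img = das_ring(data, fs=40e6, sos=1500.0, ring_radius=0.05,
#                xs=np.linspace(-0.02, 0.02, 500),
#                ys=np.linspace(-0.02, 0.02, 500),
#                angles=np.linspace(0, 2 * np.pi, 256, endpoint=False))
```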
For Demonstrations 1–3, we collect data from three volunteers, three mice, and another set of three volunteers, respectively. The animal study was conducted in accordance with the National Institutes of Health Guidelines on the Care and Use of Laboratory Animals and was reviewed and approved by Beijing Vital River Laboratory Animal Technology Co., Ltd. The human experiments have been approved by the Ethics Committee of Tsinghua University (Project No. 20220121).
2.3.2. Public data collection from published works
To validate the universality of our workflow across different imaging systems, we directly download the reconstructed data from supplementary materials in published articles to test the segmentation performance of SAM within our workflow. This includes images of mice,20,23 human body,24,25 and human finger blood vessels.26,27 The selected imaging systems are representative and demonstrate the potential wide applicability of the workflow.
3. Results
In this section, we will first elaborate on setting the prompts. Then, we will demonstrate the effectiveness of SAMPA.
3.1. Prompt functionality and feature engineering effectiveness testing
In Demonstration 1, given that the coupling medium is located above the hand, we marked the position of the coupling medium in all images acquired at different scanning positions. Since the position of the hand relative to the surrounding medium remained consistent throughout the scanning procedure, the same prompt point was used for every scanning position.
In the mouse imaging experiment, to challenge the segmentation task, we deliberately selected a 2D layer where signals from the internal organs were significantly stronger than those from the skin, making it difficult to distinguish the skin boundary.
Figure 2 shows the segmentation results of the human hand and the mouse body under different prompts. It can be observed that satisfactory segmentation is achieved with only a few prompt points, without any fine-tuning. In the hand segmentation task shown in Figs. 2(a) and 2(e), an accurate segmentation result is obtained with only one prompt point, indicating good generalization capability. The segmentation quality for the mouse body is generally good. As shown in Figs. 2(f) and 2(g), when the number of prompts is two or fewer, a small region at the top portion of the body is incorrectly segmented, as indicated by the red arrows. With four prompts, this segmentation error is corrected, as shown in Fig. 2(h). We therefore conclude that moderately increasing the number of prompts can effectively improve segmentation accuracy.

Fig. 2. Results of human hand and mouse body segmentation under different prompts (represented by green stars). Panels (a) and (e) show the original PA image of the hand and the segmentation result, respectively; a single prompt yields good results. Panels (b) and (f) display the original mouse image and its segmentation result with a single prompt. Panels (c), (g) and (d), (h) illustrate the updated results when two and four prompts are used, respectively. The red arrows point to areas with incorrect segmentation.
Both the hand image and the mouse image measure 500×500 pixels. The runtime for the hand image with one prompt point is 0.069 s; the runtimes for the mouse image with one, two, and four prompt points are 0.072 s, 0.071 s, and 0.070 s, respectively. The overall runtime of the model does not change significantly with the number of prompts. These findings indicate that SAM can segment typical PA images within 0.1 s, rendering it suitable for deployment within conventional imaging apparatus or seamless integration into PA image processing software.
We also tested the segmentation performance in the relatively complex scenarios of Demonstrations 1 and 2, including: (1) human arm images with limited-view artifacts [Figs. 3(a) and 3(f)] and (2) mouse cross-sectional images reconstructed from under-sampled data, simulating a cost-effective ring array with only 64 channels. To mitigate streak artifacts, a simple method is employed to expand the data to 256 channels before reconstruction, as illustrated in Figs. 3(c) and 3(h): the first channel of the original data is duplicated into the first through fourth channels of the new data, the second channel into the fifth through eighth channels, and so forth. Figures 3(d) and 3(i) depict the image reconstructed from fully sampled data (512 channels) and its segmentation result. This nearest-neighbor channel expansion is sketched below.
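A minimal sketch, assuming the sinogram is stored channels-first; sino64 is a hypothetical placeholder for the measured 64-channel data.

```python
# Nearest-neighbor channel expansion (64 -> 256 channels) applied before DAS
# reconstruction; each original channel is simply duplicated four times.
import numpy as np

sino64 = np.random.randn(64, 2048)      # placeholder sinogram
sino256 = np.repeat(sino64, 4, axis=0)  # duplicated 4x -> shape (256, 2048)
```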

Fig. 3. Image segmentation results in relatively complex tasks. Panels (a) and (f) show the segmentation results under limited-view artifacts. Panels (b) and (g) are the segmentation results under sparse sampling (64 channels). Panels (c) and (h) are the segmentation results of PA images reconstructed by interpolated data (256 channels). Compared with the fully sampled images shown in panels (d) and (i) (highlighted with yellow boxes), it is evident that panel (h) is segmented with much improved accuracy after interpolation. Panels (e) and (j) display the multi-valued vessel segmentation results, revealing some apparent misclassifications (labeled by red hexagrams) that necessitate further refinement of the segmentation results.
In Fig. 3(a), strong artifacts are produced due to the limited acceptance angle of the transducer; however, these artifacts have minimal influence on the segmentation accuracy, as evidenced by Fig. 3(f). In contrast, Fig. 3(b) exhibits poor image quality due to under-sampling (64 channels), resulting in a vague boundary of the animal, and Fig. 3(g) shows that SAM produces an incorrect segmentation. However, by using the simple nearest-neighbor interpolation described above to expand the sinogram from 64 to 256 channels before reconstruction [Fig. 3(h)], the segmentation result becomes comparable to that of the fully sampled image, as demonstrated in Fig. 3(i). The main reason is that interpolation effectively suppresses the under-sampling artifacts, which are uncommon in SAM's training data (i.e., natural images), thereby making the input images more suitable for SAM. These findings suggest that, by directly using SAM with appropriate prompts or with the assistance of simple feature engineering, the designed workflow can robustly achieve accurate segmentation even under limited-view and under-sampling conditions. Figures 3(e) and 3(j) display the preliminary blood vessel segmentation results of the human finger. After automatic segmentation using SAM, all vessels are successfully identified; however, several features that are clearly imaging artifacts are incorrectly recognized as blood vessels, as indicated by the red hexagrams. The similarity between blood vessels and artifacts causes SAM to treat the artifacts as real features during segmentation. Utilizing more advanced system hardware and reconstruction algorithms can reduce artifacts, thereby improving SAM's segmentation accuracy. Moreover, in situations where the artifacts have lower intensity, reintroducing prior information can enhance segmentation accuracy, as we elaborate in Sec. 4.
3.2. Workflow generalization performance testing
To verify the generalization performance of our workflow, we validated SAM's segmentation effectiveness on both self-collected and publicly available data, using the prompt settings established earlier. The results are shown in Fig. 4; in each subpanel, the upper part displays the original image and the lower part shows the segmentation result. Figures 4(a)–4(d) depict the contour segmentation results of mice, where panels (a) and (b) correspond to images captured by our system as discussed earlier, and panels (c) and (d) are images obtained using systems developed by other groups.20,23 Figures 4(e)–4(h) show the boundary segmentation results of human tissues, with panels (e) and (f) representing images collected by our system, and panels (g) and (h) images published by other groups.24,25 Figures 4(i)–4(l) display the results of human blood vessel segmentation: Figs. 4(i) and 4(j) show images from our group, while Figs. 4(k) and 4(l) show results from another group.26,27 Comparing Figs. 3(e), 3(j), and 4(i)–4(l), we see that SAM successfully segments all the blood vessels but also misclassifies some artifacts in Figs. 4(i)–4(l). As before, these artifacts have lower intensity and can be removed by reintroducing prior information to correct the segmentation errors. Figure 4 demonstrates that, with the right prompts, SAM achieves excellent segmentation results for images of various objects captured by different systems, further verifying the generalizability of the workflow.

Fig. 4. SAM’s segmentation results in different scenarios. The upper and lower parts of each subfigure are the raw image and its segmentation result, respectively. The first and second rows are the segmentation results of mice and human outer profile, respectively. The last row shows human vessel segmentation results.
3.3. Overall workflow functionality verification
Figure 5 displays the final reconstruction results of Demonstrations 1–3. Figures 5(a) and 5(e) show the maximum intensity projection (MIP) images of the 3D blood vessel reconstruction of the human hand, before and after the removal of the skin signals. It is evident that the removal of skin signals, assisted by the aforementioned segmentation, better reveals deeper blood vessels, as pinpointed by the white arrows. Figures 5(b) and 5(f) show the single- and dual-SoS reconstructions of the mouse trunk, respectively, and Figs. 5(c) and 5(g) display zoomed-in images of the corresponding areas, with the regions of effectively improved image quality indicated by white circles. While deep features remain invariant, superficial features become more in-focus after the dual-SoS reconstruction based on SAM's segmentation result. Figures 5(d) and 5(h) depict the cross-sectional PA image of a human finger and the result of blood vessel segmentation. As shown in Fig. 5(h), a refined segmentation is obtained using a customized program operating on the original segmentation result from SAM [Fig. 3(j)]. It is evident that, by integrating simple prior knowledge of the PA image, the segmentation becomes more accurate. Overall, our workflow achieves satisfactory results across the three demonstrations.

Fig. 5. Imaging results of Demonstrations 1–3. Panels (a) and (e) show the 3D imaging results (maximum intensity projection) before and after the removal of surface signals. White arrows label the vessels that are better exposed. Panels (b) and (f) display the results of single- and dual-SoS reconstructions, respectively. Panels (c) and (g) are the magnified images of panels (b) and (f), showing that both external features (outlined by green dashed boxes) and internal features (outlined by yellow dashed boxes) can be well reconstructed in the dual-SoS images. White circles in panel (g) indicate areas with improved image quality. Panels (d) and (h) are the photoacoustic images of finger vessels and their segmentation results. Scale bars: 10 mm.
4. Discussion
This paper reports a workflow that applies FMs for segmenting PA images and performing reconstruction, effectively achieving good results across multiple tasks. By implementing SAMPA in a convenient, training-free manner, we demonstrated the usefulness of the method across three imaging scenarios. In human hand imaging, the segmentation of the skin signal and its subsequent removal effectively revealed internal blood vessels. In animal cross-sectional imaging, auto-segmentation of the body profile facilitated dual-SoS reconstruction, thereby enhancing image quality. In human finger imaging, blood vessel segmentation was achieved, potentially aiding in medical diagnosis. Currently, FMs achieve good results in simple segmentation tasks of two-dimensional PA images. Although the underlying reasons for the effectiveness of FMs on PA images are unclear, empirical evidence indicates their remarkable performance. This surprising result motivated the writing of this paper. A possible explanation is the inherent similarity between PA and natural images.
This paper highlights several advantages of visual FMs: (1) Training-free: The method of integrating prior information through prompts allows for the direct application of DL models without pre-training. (2) Robustness: The model achieves good segmentation results even in the presence of artifacts, greatly enhancing its applicability in complex real-world scenarios. (3) Efficiency: Compared to language FMs, vision FMs have lower computational requirements, greatly facilitating deployment. In traditional DL practices, designing networks and preparing datasets are time-consuming tasks, and the quality of the training set critically determines the model’s performance. In contrast, FMs remove these technical barriers, enabling researchers to implement DL models quickly and conveniently. Some FMs, such as SAM, provide online demos, making the model verification and deployment readily accessible. Freed from the tasks of dataset preparation and network design, the only remaining task for the user is to design appropriate prompts to apply FMs effectively. We envision that in the future, directly implementing FMs, or performing simple fine-tuning on the basis of FMs, is a promising solution to applying DL in PAI.
Certainly, there are areas where current FMs can be improved. First, natural images are predominantly two-dimensional, whereas medical images, including those generated by PAI, are often three-dimensional, yet FMs specifically designed for three-dimensional images are lacking. While it is possible to decompose three-dimensional images into two-dimensional slices, this approach inherently loses information across slices. Therefore, FMs capable of handling three-dimensional images are highly anticipated.28 Second, the vascular network structure, which is common in PAI, is relatively sparse in natural images. Figure 6 shows a comparison of vessel extraction results obtained using FMs and dedicated deep networks, based on the data from Ref. 29. Current FMs exhibit poor capability in extracting vascular networks compared with dedicated networks. We anticipate that FMs tailored for vascular network segmentation or extraction across different medical imaging modalities will be developed in the future. Additionally, current FMs face significant limitations in highly domain-specific tasks such as image reconstruction, another area where dedicated networks outperform FMs.

Fig. 6. The difference in blood vessel identifications between a dedicated network and SAM. (a) The raw photoacoustic image. (b) The vessel identification result of the dedicated network. (c) The result of SAM.
In summary, applying FMs has proven to be a promising solution for implementing DL in PAI. It is worth mentioning that SAM’s online trial offers convenient hands-on practice for quickly mastering the program.
5. Conclusions
We proposed a workflow named SAMPA for processing PA images with FMs and validated its effectiveness through three demonstrations. To facilitate readers in replicating our results and validating our findings, we have uploaded all of our code and provided detailed documentation at https://github.com/Adi-Deng/photoacoustic-SAM.
Acknowledgments
We would like to acknowledge the financial support from Strategic Project of Precision Surgery, Tsinghua University; Initiative Scientific Research Program, Institute for Intelligent Healthcare, Tsinghua University; Tsinghua-Foshan Institute of Advanced Manufacturing; National Natural Science Foundation of China (61735016); Beijing Nova Program (20230484308); Young Elite Scientists Sponsorship Program by CAST (2023QNRC001); Youth Elite Program of Beijing Friendship Hospital (YYQCJH2022-9); and Science and Technology Program of Beijing Tongzhou District (KJ2023CX012).
We thank Xiaojun Wang, Yuwen Chen, Wubing Fu, Naiyue Zhang, Wenjie Guo, and Jianpan Gao at TsingPAI Technology Co., Ltd. for helpful discussions.
Conflict of Interest
Cheng Ma had a financial interest in TsingPAI Technology Co., Ltd., which provided the clinical imaging system (CPIIP) used in this work.
ORCID
Handi Deng https://orcid.org/0000-0002-7296-6432
Yucheng Zhou https://orcid.org/0009-0008-1936-7415
Jiaxuan Xiang https://orcid.org/0009-0004-8596-6177
Liujie Gu https://orcid.org/0000-0002-8836-3150
Yan Luo https://orcid.org/0000-0002-5919-1329
Hai Feng https://orcid.org/0000-0001-9508-0696
Mingyuan Liu https://orcid.org/0000-0002-6449-4885