Streamlined photoacoustic image processing with foundation models: A training-free solution
Abstract
Foundation models (FMs) have rapidly evolved and achieved significant accomplishments in computer vision tasks. In particular, the prompt mechanism conveniently allows users to integrate image prior information into the model, making it possible to apply models without any training. Therefore, we proposed a training-free workflow based on foundation models for photoacoustic (PA) image processing tasks. We employed the Segment Anything Model (SAM) by setting simple prompts and integrating the model's outputs with prior knowledge of the imaged objects to accomplish various tasks, including: (1) removing the skin signal in three-dimensional PA image rendering; (2) dual speed-of-sound reconstruction; and (3) segmentation of finger blood vessels. Through these demonstrations, we conclude that FMs can be applied directly in PA imaging without network design or training. This potentially offers a hands-on, convenient route to efficient and accurate segmentation of PA images. This paper serves as a comprehensive tutorial, facilitating the mastery of the technique through the provision of code and sample datasets.
1. Introduction
Foundation models (FMs) have flourished, with their parameters increasing to hundreds of billions or even trillions.1,2 Owing to the substantial growth in parameters, data, and computational power, FMs have demonstrated extraordinary capabilities in numerous tasks. Notably, natural language processing FMs, represented by ChatGPT, have shown astonishing abilities in language understanding, generation, inference, and various code-related tasks.3 They have been widely applied in fields such as office software, chatbots, translation, text generation, and even assisting medical diagnosis.4 The development of FMs in computer vision is following closely.5 The Vision Transformer (ViT) applied the Transformer architecture to image recognition tasks, significantly increasing the parameter size of vision models.6 Contrastive Language-Image Pretraining (CLIP) trained vision models using text as prompts, achieving zero-shot classification.7 Beyond text prompts, some research efforts are also dedicated to visual prompts. Recently, Meta's Segment Anything Model (SAM) has demonstrated robust generalization in segmenting natural images8 by jointly processing images and visual prompts (such as boxes, points, or masks). Moreover, this FM does not require high computational power and can be deployed on ordinary consumer-grade GPUs, offering good prospects for practical applications. Overall, visual FMs have three characteristics: (1) strong generalization, allowing a single model to complete tasks in various scenarios; (2) the ability to introduce image prior information through "prompts", simplifying or even avoiding the cumbersome training process; and (3) modest computational requirements, so they can be conveniently integrated into imaging hardware.
Prominently, the introduction of prompts enables image prior information to be incorporated into deep learning (DL) models, marking a departure from the traditional approach of designing, improving, and training networks for specific tasks. This paradigm shift significantly lowers the technical barriers to applying DL models. In this new paradigm, formulating precise and efficient prompts becomes a research-worthy challenge in its own right, recognized as prompt engineering.9
Image segmentation is a common image processing task in photoacoustic imaging (PAI),10 with applications spanning vascular segmentation,11,12,13,14,15 tissue boundary delineation,16,17,18 outer contour segmentation of imaged objects,19,20,21 and the identification of surgical instruments.22 In the above-mentioned applications, commonly used segmentation methods include manual segmentation, graphics methods, and DL methods. Among them, DL has emerged as the mainstream approach, yet, it still exhibits certain limitations, with two major ones identified below:
(1) The implementation of traditional DL models involves designing networks, constructing datasets, training, and fine-tuning, which require considerable time and effort.23 Some researchers may resort to manual image segmentation due to a lack of resources for building datasets.
(2) DL networks previously developed for PA image processing are highly specialized and lack the generalizability to be widely implemented across various imaging scenarios.
In response to these challenges, we report a method, abbreviated as SAMPA (SAM-assisted PA image processing), for zero-training PA image processing based on the SAM FM. Prior knowledge of the imaged object can be conveniently integrated into the model through prompts and utilized in downstream processing of the segmentation results. The outstanding generalizability of SAMPA is validated through three demonstrations, wherein the imaging systems and objects are deliberately selected to be highly diverse: (1) Demonstration 1: Removing the skin signal in three-dimensional (3D) PA image rendering. In 3D human hand imaging, SAMPA is used to delineate human tissue boundary and remove signals from the skin, thereby effectively exposing deeper vascular features. (2) Demonstration 2: Dual speed-of-sound (SoS) reconstruction. In two-dimensional (2D) mouse imaging, SAMPA identifies the boundary between the animal and the coupling medium to facilitate dual SoS reconstruction. (3) Demonstration 3: Human finger’s blood vessel segmentation. In the segmentation task, SAMPA robustly identifies major blood vessels by refining SAM’s output through the incorporation of prior information into a simple algorithm.
In all three tasks, we do not prepare datasets or perform any model training. Instead, we directly deploy the SAM model and combine it with prior information to achieve good results, demonstrating the exceptional simplicity and generalizability of SAMPA. This paves the way for the application of DL in PA image processing and establishes a new standard against which different DL methods can be compared. The objective of this paper is to offer a tutorial, with publicly available code and exemplifying data files, for swiftly implementing SAMPA.
2. Materials and Methods
In this section, we introduce the basic workflow of SAMPA, aiming to give readers a fundamental understanding of the approach. Here, we mainly emphasize the simplicity of the method. Detailed module and tool definitions, as well as implementation details, are provided in the GitHub repository (https://github.com/Adi-Deng/photoacoustic-SAM). Even readers with no background in DL can quickly replicate and expand upon this work.
2.1. Algorithm workflow introduction
The method workflow is illustrated in Fig. 1 and consists of two main steps: (1) utilize the prompt to introduce prior information to SAM for image segmentation; and (2) process the PA image based on SAM’s result and the information from the image.

Fig. 1. Schematic diagram of SAMPA. The SAM model is outlined with a black dashed box. This method utilizes the segmentation results from the SAM model to achieve more accurate reconstruction or processing of PA data, thereby obtaining better imaging results.
In the first step, the primary task is to segment the image using the SAM model based on the image and the prompt information. The prompt information is conveyed to the FM through marked points on the image together with the category of the area where each marked point is located. For different tasks, we can set various prompt points to convey different prior information. For example, in the task of mouse PA image segmentation, we mark multiple points around the edges of the image, whereas for human tissue boundary segmentation, we only need to mark a single point at the top of the image. This will be elaborated in detail later. Additionally, we can use feature engineering approaches to modify the input images to better align with the preferences of SAM; adjusting the style of PA images is a task PAI researchers are well versed in. The SAM model then outputs binary (or multi-valued) boundary information. Specifically, in Demonstrations 1 and 2, segmentation involves the division between the imaging object and the coupling medium, so SAMPA outputs a binary image. In Demonstration 3, the output for blood vessel segmentation is a multi-valued mask that highlights different regions with distinct colors corresponding to their respective categories. Additionally, in Demonstration 2, we explore how simple pre-processing of the input images can enhance SAM's segmentation performance under under-sampling conditions, which exemplifies the feature engineering approach mentioned earlier. In short, prior information can be incorporated into the model via prompts, and the model's generalization can be improved through feature engineering. This approach, as opposed to traditional fine-tuning, significantly simplifies the application of the model and provides a new perspective on applying DL techniques. A minimal code sketch of this prompting step is given below.
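As a concrete illustration of this prompting step, the sketch below passes a single labeled point to SAM through the official segment_anything package. It is a minimal sketch, not the verbatim repository code: the image path, the prompt coordinates, and the device choice are illustrative assumptions, and the per-task prompt settings are detailed in the GitHub repository.

```python
# A minimal sketch of step (1): prompting SAM with labeled points, assuming
# the official "segment_anything" package and the ViT-L checkpoint
# ("sam_vit_l_0b3195.pth") referenced in Sec. 2.2. The image path and the
# prompt coordinates below are hypothetical placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")
sam.to("cuda")  # a consumer-grade GPU suffices (see Sec. 2.2)
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("pa_slice.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # the image embedding is computed once per image

# Prior information enters here: one foreground point (label 1) placed on the
# imaged object; more points can be appended for harder cases (cf. Fig. 2).
point_coords = np.array([[250, 30]])
point_labels = np.array([1])
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,
)
binary_mask = masks[0]  # boolean HxW mask separating object and background
```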
The second step involves "custom processing", which aims to enhance image quality or improve segmentation accuracy based on SAM's output. In this step, specific prior information relevant to the imaging task is incorporated again, and various processing methods can be customized. In Demonstration 1, the veins on the back of the human hand are predominantly located close to the surface. Therefore, we generate a mask to filter out the skin's signal and to isolate deeper image features, which are primarily contaminated by artifacts. The mask is delineated from the upper boundary of the imaged object to a depth of 1 cm below this boundary; within the masked region the value is set to 1, and outside it to 0. This process effectively suppresses skin signals and reflection artifacts. In Demonstration 2, the reconstruction process requires determining the time of flight (ToF) of the PA signal, and dual-SoS reconstruction requires an analytical expression of the body's outline. Based on SAM's segmentation result, it is straightforward to determine the best elliptical fit to the body's profile. In Demonstration 3, SAM is initially used to automatically identify blood vessels (without directly specified prompts). However, the segmentation results also include nonvessel features. In the second step, we therefore refine the results with a program that calculates the area of each segmented region and the average signal intensity within it. That is to say, for segmentation tasks where SAM's performance is suboptimal, our workflow re-combines SAM's results with the image's prior information to achieve the final goal. Based on the above three demonstrations, our workflow emphasizes the flexible, multi-layered integration of FMs with prior information of PA images to effectively accomplish PA image processing tasks.
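To make this step concrete, the snippet below gives one hedged implementation per demonstration. The helper names, the skin-offset parameter, and the area and intensity thresholds are illustrative assumptions; the exact processing routines are provided in the GitHub repository.

```python
# Possible "custom processing" steps for the three demonstrations; all
# thresholds and the skin offset below are illustrative assumptions.
import cv2
import numpy as np

def skin_removal_mask(body_mask, px_per_cm, skin_px=10):
    """Demo 1: keep a band from just below the upper tissue boundary down to
    1 cm depth (1 inside the band, 0 elsewhere); skin_px is a hypothetical
    offset that excludes the skin layer itself."""
    mask = np.zeros(body_mask.shape, dtype=np.uint8)
    for col in range(body_mask.shape[1]):
        rows = np.flatnonzero(body_mask[:, col])  # tissue pixels in column
        if rows.size:
            top = rows[0]
            mask[top + skin_px:top + px_per_cm, col] = 1
    return mask

def fit_body_ellipse(body_mask):
    """Demo 2: analytical elliptical fit to the body outline, used to assign
    tissue and water SoS regions in dual-SoS reconstruction."""
    contours, _ = cv2.findContours(body_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    largest = max(contours, key=cv2.contourArea)  # needs >= 5 contour points
    return cv2.fitEllipse(largest)  # ((cx, cy), (major, minor), angle)

def refine_vessels(pa_image, sam_regions, max_area=2000, min_mean=0.3):
    """Demo 3: keep only regions whose area and mean intensity are consistent
    with vessels; sam_regions is the list of mask dicts returned by SAM's
    automatic mask generator."""
    return [r["segmentation"] for r in sam_regions
            if r["segmentation"].sum() < max_area
            and pa_image[r["segmentation"]].mean() > min_mean]
```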
In the sample code, depending on the type of the output mask (binary or multi-valued), two sets of code are developed. Demonstrations 1 and 2 share one set of code, while Demonstration 3 uses the other set.
2.2. Algorithm workflow deployment
The computer used in this study consists of a 13th Gen Intel(R) Core(TM) i7-13700K CPU, a GIGABYTE GeForce GTX 1660 Super 6G graphics card, and 32 GB of Kingston DDR4 2666 RAM. Running the lightweight version of the SAM model on Windows takes approximately 0.07 s to perform binary segmentation of a 500×500-pixel image; the "sam_vit_l" checkpoint is used for all demonstrations. All image segmentation experiments are conducted with the aforementioned hardware and software, whose modest requirements make the workflow very accessible. The deployment of the model, along with a basic explanation of each component, is detailed in the "How to start" section of our GitHub repository. The process for using the method is described in the "Readme" section, with some code and procedural explanations referencing the official SAM repository.
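Continuing the sketch from Sec. 2.1, the runtime quoted above can be checked with a simple timer around the prediction call. This is an illustrative snippet: the measured figure is hardware-dependent, and here the one-time image encoding is deliberately excluded from the timed section.

```python
# Illustrative timing of the prompt-to-mask step; predictor, image, and the
# prompt arrays are the ones defined in the earlier sketch.
import time

predictor.set_image(image)  # one-time encoding per image, not timed here
t0 = time.perf_counter()
masks, _, _ = predictor.predict(point_coords=point_coords,
                                point_labels=point_labels,
                                multimask_output=False)
print(f"Prompt-to-mask time: {time.perf_counter() - t0:.3f} s")
```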
2.3. Algorithm workflow test
To validate the effectiveness of SAMPA, we conduct three different types of tests: (1) Testing prompt functionality and feature engineering effectiveness. For Demonstrations 1 and 2, we determine the appropriate prompt method and verify the segmentation capability of SAM after integrating prior information; meanwhile, we test the impact of feature engineering. For Demonstration 3, we test the initial results of automatic blood vessel segmentation. (2) Testing workflow generalization performance. Based on the prompt methods determined in (1), we test the segmentation capability of SAM on different data, including data collected by the imaging systems developed by our group and PA datasets publicly available online. (3) Verifying the overall workflow functionality. Because the original data files are not available from the online resources, we demonstrate the entire workflow using only the data generated in our lab.
The following subsections will briefly describe the imaging experiments conducted with our equipment and the publicly available data used.
2.3.1. Imaging experiments
We collect PA images of the hand and forearm of a healthy volunteer with a clinical PAI platform (CPIIP, TsingPAI Co., Ltd.). The ultrasound probe (256 elements, 5-MHz center frequency, and 60% receive bandwidth) has a 180° angular coverage providing 2D cross-sectional images. An optical parametric oscillator (OPO) provides excitation pulses at 850 nm. The scanning step is 0.1 mm, and a 3D image is reconstructed by stacking the 2D cross-sectional images. Small animal imaging is performed with a custom-made ring-array PA computed tomography (PACT) system with 256 transducer elements, 5-MHz center frequency, and 70% receive bandwidth. Rotating the array by 0.7° yields an equivalent acquisition of 512 channels. The excitation wavelength is 850 nm. Human finger imaging is performed using either the aforementioned ring-array PACT system (without rotating the array) or the CPIIP. Additionally, for the ring-array system, only half of the ring-array data are utilized for image reconstruction, resulting in a 128-channel half-ring acquisition. We intentionally employ this limited-angle acquisition to induce limited-view artifacts, thereby increasing the complexity of the segmentation task. The excitation wavelength is 800 nm. In all experiments, we maintain a laser repetition rate of 10 Hz and ensure that the per-pulse energy density remains below 15 mJ/cm² to comply with the limits set by the American National Standards Institute (ANSI). The images are reconstructed by a standard delay-and-sum (DAS) algorithm. The hand, finger, and mouse are submerged in distilled water to facilitate ultrasound coupling.
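For readers unfamiliar with DAS, a minimal numpy sketch for a ring geometry is given below. It assumes a single SoS and ideal point detectors, ignores the laser trigger delay, and all parameter values in the usage comment are placeholders rather than the exact system settings.

```python
# Minimal delay-and-sum (DAS) sketch for a ring array; single SoS and ideal
# point detectors are simplifying assumptions.
import numpy as np

def das_ring(sinogram, fs, sos, ring_radius, xs, ys, angles):
    """sinogram: (n_elements, n_samples) PA data; xs, ys: pixel coordinate
    axes in meters; angles: element angular positions in radians."""
    X, Y = np.meshgrid(xs, ys)
    image = np.zeros_like(X)
    for k, th in enumerate(angles):
        ex, ey = ring_radius * np.cos(th), ring_radius * np.sin(th)
        tof = np.hypot(X - ex, Y - ey) / sos           # time of flight
        idx = np.clip((tof * fs).astype(int), 0, sinogram.shape[1] - 1)
        image += sinogram[k, idx]                      # delay and sum
    return image

# Example call with placeholder parameters (256 elements, 40 MS/s, 1500 m/s):
# img = das_ring(data, fs=40e6, sos=1500.0, ring_radius=0.05,
#                xs=np.linspace(-0.02, 0.02, 500),
#                ys=np.linspace(-0.02, 0.02, 500),
#                angles=np.linspace(0, 2 * np.pi, 256, endpoint=False))
```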
For Demonstrations 1–3, we collect data from three volunteers, three mice, and another set of three volunteers, respectively. The animal study was conducted in accordance with the National Institutes of Health Guidelines on the Care and Use of Laboratory Animals and was reviewed and approved by Beijing Vital River Laboratory Animal Technology Co., Ltd. The human experiments have been approved by the Ethics Committee of Tsinghua University (Project No. 20220121).
2.3.2. Public data collection from published works
To validate the universality of our workflow across different imaging systems, we directly download the reconstructed data from supplementary materials in published articles to test the segmentation performance of SAM within our workflow. This includes images of mice,20,23 human body,24,25 and human finger blood vessels.26,27 The selected imaging systems are representative and demonstrate the potential wide applicability of the workflow.
3. Results
In this section, we will first elaborate on setting the prompts. Then, we will demonstrate the effectiveness of SAMPA.
3.1. Prompt functionality and feature engineering effectiveness testing
In Demonstration 1, given that the coupling medium is located above the hand, we marked the position of the coupling medium in all images acquired at different scanning positions. Since the position of the hand relative to the surrounding medium remained consistent throughout the scanning procedure, the same prompt point was used for every scanning position.
In the mouse imaging experiment, to challenge the segmentation task, we deliberately selected a 2D layer where signals from the internal organs were significantly stronger than those from the skin, making it difficult to distinguish the skin boundary.
Figure 2 shows the segmentation results of the human hand and the mouse body under different prompts. It can be observed that satisfactory segmentation is achieved with only a few prompt points, without any fine-tuning. In the hand segmentation task shown in Figs. 2(a) and 2(e), an accurate segmentation result is obtained with only one prompt point, indicating good generalization capability. The segmentation quality for the mouse body is generally good. As shown in Figs. 2(f) and 2(g), when the number of prompts is two or fewer, a small region at the top portion of the body is incorrectly segmented, as indicated by the red arrows. With four prompts, this segmentation error is corrected, as shown in Fig. 2(h). We therefore conclude that moderately increasing the number of prompts can effectively improve segmentation accuracy.

Fig. 2. Results of human hand and mouse body segmentation under different prompts (represented by green stars). Panels (a) and (e) show the original PA image of the hand and the segmentation result, respectively; a single prompt yields good results. Panels (b) and (f) display the original mouse image and its segmentation result with a single prompt. Panels (c), (g) and (d), (h) illustrate the updated results when two and four prompts are used, respectively. The red arrows point to areas with incorrect segmentation.
Both the hand image and the mouse image measure 500×500 pixels. The runtime for the hand image with one prompt point is 0.069 s; the runtimes for the mouse image with one, two, and four prompt points are 0.072 s, 0.071 s, and 0.070 s, respectively. The overall runtime of the model does not change significantly with the number of prompts. These findings indicate that SAM can segment typical PA images within 0.1 s, rendering it suitable for deployment within conventional imaging apparatus or seamless integration into PA image processing software.
We also tested the segmentation performance in the relatively complex scenarios of Demonstrations 1 and 2, including: (1) human arm images with limited-view artifacts [Figs. 3(a) and 3(f)] and (2) mouse cross-sectional images reconstructed from under-sampled data, simulating a cost-effective ring array with only 64 channels. To mitigate streak artifacts, a simple method is employed to expand the data to 256 channels before reconstruction, as illustrated in Figs. 3(c) and 3(h): the first channel of the original data is duplicated into the first through fourth channels of the new data, the second channel into the fifth through eighth channels, and so forth. Figures 3(d) and 3(i) depict the image reconstructed from fully sampled data (512 channels) and its segmentation result. This nearest-neighbor channel expansion is sketched below.
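A minimal sketch, assuming the sinogram is stored channels-first; sino64 is a hypothetical placeholder for the measured 64-channel data.

```python
# Nearest-neighbor channel expansion (64 -> 256 channels) applied before DAS
# reconstruction; each original channel is simply duplicated four times.
import numpy as np

sino64 = np.random.randn(64, 2048)      # placeholder sinogram
sino256 = np.repeat(sino64, 4, axis=0)  # duplicated 4x -> shape (256, 2048)
```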

Fig. 3. Image segmentation results in relatively complex tasks. Panels (a) and (f) show the segmentation results under limited-view artifacts. Panels (b) and (g) are the segmentation results under sparse sampling (64 channels). Panels (c) and (h) are the segmentation results of PA images reconstructed by interpolated data (256 channels). Compared with the fully sampled images shown in panels (d) and (i) (highlighted with yellow boxes), it is evident that panel (h) is segmented with much improved accuracy after interpolation. Panels (e) and (j) display the multi-valued vessel segmentation results, revealing some apparent misclassifications (labeled by red hexagrams) that necessitate further refinement of the segmentation results.
In Fig. 3(a), strong artifacts are produced due to the limited acceptance angle of the transducer; however, these artifacts have minimal influence on the segmentation accuracy, as evidenced by Fig. 3(f). In contrast, Fig. 3(b) exhibits poor image quality due to under-sampling (64 channels), resulting in a vague boundary of the animal, and Fig. 3(g) shows that SAM produces an incorrect segmentation. However, by using the simple nearest-neighbor interpolation described above to expand the sinogram from 64 to 256 channels before reconstruction [Fig. 3(h)], the segmentation result becomes comparable to that of the fully sampled image, as demonstrated in Fig. 3(i). The main reason is that interpolation effectively suppresses the under-sampling artifacts, which are uncommon in SAM's training data (i.e., natural images), thereby making the input images more suitable for SAM. These findings suggest that, by directly using SAM with appropriate prompts or with the assistance of simple feature engineering, the designed workflow can robustly achieve accurate segmentation even under limited-view and under-sampling conditions. Figures 3(e) and 3(j) display the preliminary blood vessel segmentation results of the human finger. After automatic segmentation using SAM, all vessels are successfully identified; however, several features that are clearly imaging artifacts are incorrectly recognized as blood vessels, as indicated by the red hexagrams. The similarity between blood vessels and artifacts causes SAM to treat the artifacts as real features during segmentation. Utilizing more advanced system hardware and reconstruction algorithms can reduce artifacts, thereby improving SAM's segmentation accuracy. Moreover, in situations where the artifacts have lower intensity, reintroducing prior information can enhance segmentation accuracy, as we elaborate in Sec. 4.
3.2. Workflow generalization performance testing
To verify the generalization performance of our workflow, we validated SAM's segmentation effectiveness on both self-collected and publicly available data, using the prompt settings established earlier. The results are shown in Fig. 4; in each subpanel, the upper part displays the original image and the lower part shows the segmentation result. Figures 4(a)–4(d) depict the contour segmentation results of mice, where panels (a) and (b) correspond to images captured by our system as discussed earlier, and panels (c) and (d) are images obtained using systems developed by other groups.20,23 Figures 4(e)–4(h) show the boundary segmentation results of human tissues, with panels (e) and (f) representing images collected by our system, and panels (g) and (h) images published by other groups.24,25 Figures 4(i)–4(l) display the results of human blood vessel segmentation: Figs. 4(i) and 4(j) show images from our group, while Figs. 4(k) and 4(l) show results from another group.26,27 Comparing Figs. 3(e), 3(j), and 4(i)–4(l), we see that SAM successfully segments all the blood vessels but also misclassifies some artifacts in Figs. 4(i)–4(l). As before, these artifacts have lower intensity and can be removed by reintroducing prior information to correct the segmentation errors. Figure 4 demonstrates that, with the right prompts, SAM achieves excellent segmentation results for images of various objects captured by different systems, further verifying the generalizability of the workflow.

Fig. 4. SAM’s segmentation results in different scenarios. The upper and lower parts of each subfigure are the raw image and its segmentation result, respectively. The first and second rows are the segmentation results of mice and human outer profile, respectively. The last row shows human vessel segmentation results.
3.3. Overall workflow functionality verification
Figure 5 displays the final reconstruction results of Demonstrations 1–3. Figures 5(a) and 5(e) show the maximum intensity projection (MIP) images of the 3D blood vessel reconstruction of the human hand, before and after the removal of the skin signals. It is evident that the removal of skin signals, assisted by the aforementioned segmentation, better reveals deeper blood vessels, as pinpointed by the white arrows. Figures 5(b) and 5(f) show the single- and dual-SoS reconstructions of the mouse trunk, respectively, and Figs. 5(c) and 5(g) display zoomed-in images of the corresponding areas, with the regions of effectively improved image quality indicated by white circles. While deep features remain invariant, superficial features become more in-focus after the dual-SoS reconstruction based on SAM's segmentation result. Figures 5(d) and 5(h) depict the cross-sectional PA image of a human finger and the result of blood vessel segmentation. As shown in Fig. 5(h), a refined segmentation is obtained using a customized program operating on the original segmentation result from SAM [Fig. 3(j)]. It is evident that, by integrating simple prior knowledge of the PA image, the segmentation becomes more accurate. Overall, our workflow achieves satisfactory results across the three demonstrations.

Fig. 5. Imaging results of Demonstrations 1–3. Panels (a) and (e) show the 3D imaging results (maximum intensity projection) before and after the removal of surface signals. White arrows label the vessels that are better exposed. Panels (b) and (f) display the results of single- and dual-SoS reconstructions, respectively. Panels (c) and (g) are the magnified images of panels (b) and (f), showing that both external features (outlined by green dashed boxes) and internal features (outlined by yellow dashed boxes) can be well reconstructed in the dual-SoS images. White circles in panel (g) indicate areas with improved image quality. Panels (d) and (h) are the photoacoustic images of finger vessels and their segmentation results. Scale bars: 10 mm.
4. Discussion
This paper reports a workflow that applies FMs for segmenting PA images and performing reconstruction, effectively achieving good results across multiple tasks. By implementing SAMPA in a convenient, training-free manner, we demonstrated the usefulness of the method across three imaging scenarios. In human hand imaging, the segmentation of the skin signal and its subsequent removal effectively revealed internal blood vessels. In animal cross-sectional imaging, auto-segmentation of the body profile facilitated dual-SoS reconstruction, thereby enhancing image quality. In human finger imaging, blood vessel segmentation was achieved, potentially aiding in medical diagnosis. Currently, FMs achieve good results in simple segmentation tasks of two-dimensional PA images. Although the underlying reasons for the effectiveness of FMs on PA images are unclear, empirical evidence indicates their remarkable performance. This surprising result motivated the writing of this paper. A possible explanation is the inherent similarity between PA and natural images.
This paper highlights several advantages of visual FMs: (1) Training-free: The method of integrating prior information through prompts allows for the direct application of DL models without pre-training. (2) Robustness: The model achieves good segmentation results even in the presence of artifacts, greatly enhancing its applicability in complex real-world scenarios. (3) Efficiency: Compared to language FMs, vision FMs have lower computational requirements, greatly facilitating deployment. In traditional DL practices, designing networks and preparing datasets are time-consuming tasks, and the quality of the training set critically determines the model’s performance. In contrast, FMs remove these technical barriers, enabling researchers to implement DL models quickly and conveniently. Some FMs, such as SAM, provide online demos, making the model verification and deployment readily accessible. Freed from the tasks of dataset preparation and network design, the only remaining task for the user is to design appropriate prompts to apply FMs effectively. We envision that in the future, directly implementing FMs, or performing simple fine-tuning on the basis of FMs, is a promising solution to applying DL in PAI.
Certainly, there are areas where current FMs can be improved. First, natural images are predominantly two-dimensional, whereas medical images, including those generated by PAI, are often three-dimensional, yet FMs specifically designed for three-dimensional images are lacking. While it is possible to decompose three-dimensional images into two-dimensional slices, this approach inherently loses information across slices. Therefore, FMs capable of handling three-dimensional images are highly anticipated.28 Second, the vascular network structure, which is common in PAI, is relatively sparse in natural images. Figure 6 shows a comparison of vessel extraction results obtained using FMs and dedicated deep networks, based on the data from Ref. 29. Current FMs exhibit poor capability in extracting vascular networks compared with dedicated networks. We anticipate that FMs tailored for vascular network segmentation or extraction across different medical imaging modalities will be developed in the future. Additionally, current FMs face significant limitations in highly domain-specific tasks such as image reconstruction, another area where dedicated networks outperform FMs.

Fig. 6. The difference in blood vessel identifications between a dedicated network and SAM. (a) The raw photoacoustic image. (b) The vessel identification result of the dedicated network. (c) The result of SAM.
In summary, applying FMs has proven to be a promising solution for implementing DL in PAI. It is worth mentioning that SAM’s online trial offers convenient hands-on practice for quickly mastering the program.
5. Conclusions
We proposed a workflow named SAMPA for processing PA images with FMs and validated its effectiveness through three demonstrations. To facilitate readers in replicating our results and validating our findings, we have uploaded all of our code and provided detailed documentation at https://github.com/Adi-Deng/photoacoustic-SAM.
Acknowledgments
We would like to acknowledge the financial support from Strategic Project of Precision Surgery, Tsinghua University; Initiative Scientific Research Program, Institute for Intelligent Healthcare, Tsinghua University; Tsinghua-Foshan Institute of Advanced Manufacturing; National Natural Science Foundation of China (61735016); Beijing Nova Program (20230484308); Young Elite Scientists Sponsorship Program by CAST (2023QNRC001); Youth Elite Program of Beijing Friendship Hospital (YYQCJH2022-9); and Science and Technology Program of Beijing Tongzhou District (KJ2023CX012).
We thank Xiaojun Wang, Yuwen Chen, Wubing Fu, Naiyue Zhang, Wenjie Guo, and Jianpan Gao at TsingPAI Technology Co., Ltd. for helpful discussions.
Conflict of Interest
Cheng Ma had a financial interest in TsingPAI Technology Co., Ltd., which provided the clinical imaging system (CPIIP) used in this work.
ORCID
Handi Deng https://orcid.org/0000-0002-7296-6432
Yucheng Zhou https://orcid.org/0009-0008-1936-7415
Jiaxuan Xiang https://orcid.org/0009-0004-8596-6177
Liujie Gu https://orcid.org/0000-0002-8836-3150
Yan Luo https://orcid.org/0000-0002-5919-1329
Hai Feng https://orcid.org/0000-0001-9508-0696
Mingyuan Liu https://orcid.org/0000-0002-6449-4885