Automated Quality Assessment of Medical Images in Echocardiography Using Neural Networks with Adaptive Ranking and Structure-Aware Learning
Abstract
The quality of medical images is crucial for accurately diagnosing and treating various diseases. However, current automated quality assessment methods based on neural networks often focus solely on pixel distortion and overlook the visibility of the complex anatomical structures within the images. This study introduces a neural network model designed explicitly for automated image quality assessment that addresses both pixel and semantic distortion. For pixel distortion assessment, the model introduces an adaptive ranking mechanism enhanced with contrast sensitivity weighting to refine the detection of minor differences between similar images. More significantly, the model integrates a structure-aware learning module employing graph neural networks, which captures the intricate relationship between an image’s semantic structure and its quality. Evaluated on a real-world cardiac ultrasound dataset and two public ultrasound datasets, the proposed method outperforms existing leading models. It also integrates seamlessly into clinical workflows, enabling real-time image quality assessment that supports precise disease diagnosis and treatment.
1. Introduction
Deep learning has significantly improved image processing, especially in computer vision tasks such as image recognition and classification. With its efficient feature extraction and pattern recognition capabilities, deep learning technology has found widespread applications in medical image processing. By utilizing deep learning algorithms, medical image processing can achieve precise lesion detection, organ segmentation, and diagnostic support, thereby greatly enhancing the accuracy and efficiency of medical diagnosis. However, the quality of medical images directly affects the performance of these algorithms, and evaluating medical image quality is a crucial step in ensuring diagnostic reliability.
High-quality medical images are indispensable for precise diagnoses and effective treatments across various diseases.1 To ensure the reliable performance of medical images in clinical applications, automated evaluation methods have been proposed for various image types, including those from echocardiography (Echo),2 skin ultrasound,3 craniomaxillofacial,4 and pelvic lymph node detection.5
Prevailing methodologies often focus on assessing image quality through pixel distortion while sidelining the evaluation of an image’s semantic structure visibility. An obvious example is the quality assessment of Echo. Echo is an extensively utilized and cost-effective diagnostic tool for detecting cardiac diseases.6 Among the various views in Echo, the apical four-chamber (A4C) view holds paramount importance in diagnosing fetal heart disease7 and congenital heart disease.8 However, obtaining an accurate A4C view for diagnostic purposes is complex, as shown in Fig. 1. In this illustration, (a) presents a high-resolution image with abundant pixels, but the cardiac structure is not fully visible, whereas (b) is a lower-resolution image that compromises on pixel count but offers a more comprehensive view of the cardiac structure. Sonographers typically prefer (b) for diagnosis, highlighting the importance of considering semantic structure visibility alongside pixel-level assessment when evaluating medical image quality.

Fig. 1. A typical A4C echocardiogram. (a) A high-resolution image in which the cardiac structure is not fully visible; (b) a lower-resolution image that nevertheless provides a complete view of the structure.
Obtaining the best diagnostic views requires extensive clinical experience and a deep understanding of anatomy, presenting a significant challenge, especially for novice doctors.9 In current clinical practice, evaluating medical imaging quality by novices is often subjective and time-consuming, necessitating the support of experienced physicians.10 Consequently, there is an urgent need for an objective and automated image quality assessment (IQA) method. Such a method should swiftly identify standard views, facilitate accurate diagnoses, and support sonographers and doctors in their clinical tasks, particularly in cardiac ultrasound examinations.
In pixel distortion evaluation methodologies, the task is commonly approached as either a regression or classification problem, with convolutional neural networks (CNNs) being the prevailing underlying architecture.11,12 In regression-based methods, the objective is to map image quality to a specific numerical score. For example, Zhang et al.13 employed transfer learning to assess ultrasound image quality by training a model on optical image quality data. Lin et al.14 concentrated on standard plane acquisition for fetal head ultrasound, while Czajkowska et al.3 explored quality assessment methods for high-frequency ultrasound datasets. Conversely, in classification-based approaches, images are classified into different quality grades based on clarity, and a classification model is trained for quality grade categorization. For instance, Abdi et al.15 and Chen et al.16 utilized CNNs to classify cardiac ultrasound images into different quality levels. Additionally, Dong et al.17 and Zhang et al.18 developed quality scoring models specifically for the A4C view by integrating multiple networks and employing data augmentation techniques such as image gain and scaling. This classification approach establishes a systematic framework for categorizing images, facilitating the identification of varying levels of image quality based on predefined criteria such as clarity and resolution. However, accurately assessing the quality differences between similar images remains a challenge for pixel-based evaluation methods, as such images often share similar visual features.19 This similarity hinders the capture of the distinguishing attributes that determine image quality.
Despite advancements in image evaluation methods focusing on pixel distortion, these approaches often neglect the critical assessment of the clarity of basic anatomical structures, which is vital for accurate diagnoses. Our study introduces a novel neural network model to comprehensively evaluate both pixel and semantic distortions. Pixel distortion measurement concentrates on the clarity and sharpness of images, while semantic distortion measurement assesses how accurately the basic anatomical structures are captured. To evaluate pixel distortion, the model employs an adaptive ranking mechanism that learns to assign quality scores to similar images based on contrast sensitivity weighting factors, offering a nuanced understanding of image clarity. To assess semantic distortion, the model incorporates a structure-aware learning approach that leverages a graph neural network (GNN) to capture the intricate relationship between image structures and image quality, ensuring a more holistic and accurate evaluation.
In summary, this paper has developed a comprehensive IQA model for ultrasound by combining multiple deep neural networks and IQA methods. Additionally, we have created an intelligent system based on this model. We have made our implementation publicly available to support reproducibility and further research. The code can be accessed at https://github.com/gaden168/MIQA_A4C.
Our main contributions are summarized as follows:
• Developed a method that integrates a dynamic ranking module with contrast sensitivity weighting to enhance the learning of distinguishable features in similar images, leading to a more accurate assessment of pixel distortions.
• Proposed using GNNs for structure-aware learning, demonstrating their effectiveness in understanding the relationship between image quality and semantic structures. This facilitates a precise assessment of semantic distortions in images.
• Introduced a multi-task learning framework that concurrently measures pixel and semantic distortions for IQA. Validation on a real-world cardiac ultrasound dataset and two public datasets demonstrated competitive performance compared to state-of-the-art methodologies.
• Utilized the proposed method to design and implement an intelligent automatic evaluation system for cardiac ultrasound. This novel approach shows excellent potential for future integration into clinical practice.
The structure of this paper is outlined as follows. Section 2 reviews related work, and Secs. 3 and 4 present the proposed method and experimental results in detail. The system design and implementation are discussed in Sec. 5. Finally, conclusions are summarized in Sec. 6.
2. Related Work
Over the past decade, deep learning has rapidly become one of the core technologies in artificial intelligence, demonstrating exceptional performance across various applications. Particularly in medical image processing, its efficient feature extraction and pattern recognition capabilities have significantly enhanced the accuracy and efficiency of medical diagnostics. As an essential aspect of ensuring diagnostic reliability, medical IQA has received extensive attention and research in recent years.
In the field of medical IQA, numerous methods have been proposed over the past decades to ensure the reliable performance of images in clinical tasks. These IQA methods can be broadly categorized as traditional and deep learning methods.
Traditional methods for IQA: Traditional IQA methods encompass techniques that rely on metrics like peak signal-to-noise ratio (PSNR)20 and mean square error (MSE).21 These techniques have been widely utilized in the initial phases for the automated assessment of the quality of medical images.22 While traditional methods yield reproducible IQA measurements, they necessitate manual feature engineering, which can be time-consuming and resource-intensive.23
Deep learning methods for IQA: With the advancements in deep learning techniques, CNNs have emerged as the main methods in IQA tasks.24 These approaches treat quality prediction as a regression or classification problem and aim to automatically learn features for quantifying IQA from labeled data, eliminating the need for manual feature design. Kang et al.25 pioneered this direction by integrating CNNs into the IQA task. Since then, several algorithms have followed: Yang et al.26 introduced SGDNet, an end-to-end saliency-guided deep neural network that enhances assessment accuracy without reference images; Chen et al.27 proposed hyperIQA, improving the evaluation of light field sub-aperture images; and Cheon et al.28 developed a perceptual IQA method using transformers, leveraging their feature extraction and attention mechanisms for robust IQA.
Improving the quality of medical images is crucial in maintaining the accuracy of clinical diagnoses.29 Recently, deep learning has also been applied to assess the quality of medical images. For instance, Abdi et al.15 developed a CNN-based A4C view quality assessment model to classify ultrasound images into different quality levels. Zhang et al.18 specifically designed a quality scoring model for A4C views, treating the quality assessment task as a regression problem; compared with conventional Lasso regression30 and Elastic Net31 methods, their approach achieved a lower absolute distance error (ABE). Hossain et al.32 proposed an automated fetal ultrasound IQA scheme using CNNs to aid ultrasound image quality control in clinical obstetric examinations. Additionally, several works have adapted general-purpose IQA techniques to ultrasound images: Saeed et al.23 used meta-reinforcement learning for adaptive quality assessment tailored to machine learning tasks; Huh et al.33 introduced tunable quality control for three-dimensional (3D) ultrasound using switchable CycleGAN, enabling dynamic adjustments; Saeed et al.34 focused on enhancing IQA through task-amenable data selection; and Golestaneh et al.11 developed a no-reference IQA method with transformers, relative ranking, and self-consistency.
Semantic structure analysis: Furthermore, in medical IQA, diagnosis must consider not only image clarity but also whether the semantic structure in the image is fully visible. Liu et al.35 delved into a multi-task-based pulmonary nodule analysis approach to detect anatomical structures within the image and obtain a standardized slice for diagnostic purposes. They employed a method called MF R-CNN to detect structures in ultrasound images, considering each anatomical structure as an individual target for detection and disregarding the correlation between anatomical structures. However, it is worth noting that multiple anatomical structures often appear simultaneously in a single ultrasound image. Consequently, accurately capturing the correlation between these different structures becomes crucial.
To evaluate the quality of medical images comprehensively, it is imperative to consider the image’s clarity and whether the basic anatomical structures are visible. This holistic assessment approach ensures a more comprehensive understanding of image quality in the context of medical image analysis.
3. Methodology
This paper presents a novel comprehensive IQA method focusing on pixel and semantic distortion measurement. The approach incorporates an adaptive ranking mechanism that utilizes contrast-sensitive weighting factors to assign quality scores to similar images, ensuring accurate pixel distortion measurement. Additionally, the method integrates a structure-aware learning model with a GNN to capture the relationship between image structure and quality, facilitating precise semantic distortion measurement. Furthermore, this study introduces a multi-task framework to handle both tasks concurrently. Each of these components is detailed in the following three subsections.
3.1. Multi-task learning
The proposed methodology adopts a multi-task learning framework to comprehensively evaluate image quality by simultaneously considering pixel and semantic distortion measurements. As depicted in Fig. 2, the framework consists of two branches. The first branch employs a shallow regression model that maps image features to quality scores. It incorporates an adaptive ranking mechanism, utilizing a contrast sensitivity weighting factor to assign quality scores to similar images. This mechanism enhances the exploitation of heterogeneity among similar images, thereby improving the accuracy of pixel distortion estimation. The second branch utilizes structure-aware learning with a GNN to perform semantic structure classification of images. In this branch, a graph convolutional network (GCN) captures the intricate relationship between image structures and image quality, enabling precise semantic distortion evaluation. The final IQA score is determined by aggregating the quality scores from both branches.

Fig. 2. Schematic diagram of the proposed network. The proposed network is a multi-task learning framework with three main modules: feature extraction, pixel distortion measurement, and semantic distortion measurement. The feature extraction module includes a residual network backbone and n conv-blocks. The pixel distortion measurement branch learns a shallow regression model and introduces adaptive ranking learning to assign quality scores to similar images according to a contrast sensitivity weighting factor. The semantic distortion measurement branch uses a structure-aware learning model that incorporates a GNN to capture the relationship between heart structure and image quality. The proposed method combines these two branches to comprehensively evaluate cardiac ultrasound A4C image quality.
This paper uses $S_{\mathrm{pix}}$ and $S_{\mathrm{sem}}$ to denote the quality scores of the pixel and semantic distortion measurement branches, respectively. The final image quality score $S_{\mathrm{total}}$ is obtained as a weighted combination of the two branch scores,
$$S_{\mathrm{total}} = \gamma\, S_{\mathrm{pix}} + (1-\gamma)\, S_{\mathrm{sem}},$$
which reduces to the arithmetic mean of the two scores for the default setting γ = 0.5 (see Sec. 4.2).
The feature extraction module in our model uses ResNet5036 as the backbone. ResNet50 is an efficient and commonly used network for feature extraction in computer vision.37 In this paper, we removed the last global pooling and fully connected layers and added conv-blocks to reduce the feature dimension. The exact number of conv-blocks to be added is uncertain, so we experiment with two different architectures: B1 and B2. B1 represents the model with one additional conv-block, while B2 represents the model with two additional conv-blocks.
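To make the architecture concrete, the following is a minimal PyTorch sketch of the feature extraction module described above: a ResNet50 backbone with the final global pooling and fully connected layers removed, followed by n dimension-reducing conv-blocks (n = 1 for the B1 variant, n = 2 for B2). The channel widths and kernel sizes are illustrative assumptions, as the paper does not specify them.

```python
import torch
import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):
    """ResNet50 backbone (pooling/FC removed) plus n dimension-reducing conv-blocks.

    n = 1 corresponds to the B1 variant, n = 2 to B2. Channel widths are
    illustrative assumptions; the paper does not specify them.
    """
    def __init__(self, n_conv_blocks: int = 1):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Keep everything up to the last residual stage: output is 2048 x H x W.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

        blocks, in_ch = [], 2048
        for _ in range(n_conv_blocks):
            out_ch = in_ch // 2
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ))
            in_ch = out_ch
        self.conv_blocks = nn.Sequential(*blocks)

    def forward(self, x):
        return self.conv_blocks(self.backbone(x))

# Example: B1 variant on a 256 x 256 input, matching the input size used in the experiments.
feats = FeatureExtractor(n_conv_blocks=1)(torch.randn(2, 3, 256, 256))
print(feats.shape)  # torch.Size([2, 1024, 8, 8])
```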
3.2. Pixel distortion measurement
Accurately assessing the quality of similar images presents a challenge due to their shared visual characteristics, hindering the capture of unique properties that define individual image quality. Learning to rank emerges as a promising approach to address this, enabling quality inference through partial order relations.35 However, traditional ranking-based methods often struggle to effectively prioritize similar images.
In response, this paper proposes an adaptive rank method that specifically targets similar images, aiming to enhance their contrast sensitivity and improve the accuracy of evaluating their quality. Our method begins by taking a two-dimensional (2D) image as input and extracting features using a dedicated feature extraction layer. These features traverse two fully connected layers, utilizing MSE loss to train a shallow regression model mapping image features to a specific quality score.
Simultaneously, the method incorporates an adaptive rank (a-rank) loss, depicted in the light blue section of Fig. 2. This loss function is designed to capture the heterogeneity between similar images, enhancing contrast sensitivity and improving quality prediction performance. The IQA based on pixel distortion measurement is achieved by combining MSE loss and a-rank loss into a joint loss. To simplify the model and avoid the need for additional hyperparameters, both losses are weighted equally.
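Written out, this equal-weighting choice amounts to summing the two terms (a sketch of the combination described above; the paper's exact formulation is not reproduced here):
$$\mathcal{L}_{\mathrm{pix}} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\text{a-rank}}.$$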
Adaptive ranking learning: Learning image quality ranking can capture and exploit the heterogeneity of similar images, thereby improving the accuracy of the shallow regression model. In particular, we are given a dataset of N training samples $\{x_i, t_i\}_{i=0}^{N-1}$, where $x_i$ represents the $i$th image and $t_i$ denotes the corresponding ground-truth quality score.
We employ the following loss function to learn the rank relationship between images:
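The original equation is not reproduced here. For illustration, a standard pairwise margin formulation of such a ranking loss (an assumed form, not necessarily the exact loss used in the paper) is
$$\mathcal{L}_{\mathrm{rank}} = \frac{1}{|\mathcal{P}|}\sum_{(i,j)\in\mathcal{P}} \max\Bigl(0,\; \varepsilon - \operatorname{sign}(t_i - t_j)\bigl(f(x_i) - f(x_j)\bigr)\Bigr),$$
where $\mathcal{P}$ is the set of image pairs in a mini-batch, $f(\cdot)$ is the quality score predicted by the regression head, and $\varepsilon$ is a small margin; the loss vanishes when every pair is ordered consistently with its ground-truth scores.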
The incorporation of learning to rank undoubtedly enhances model performance. However, to further guide the model’s attention toward similar images and enhance their contrast sensitivity, we propose adaptive ranking learning. This approach introduces a contrast-sensitive weighting factor into the rank learning, enabling the model to prioritize and emphasize the distinguishing characteristics of similar images. The contrast-sensitive weighting factors are defined as follows:
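The exact weighting definition is not reproduced here. As one plausible realization of the idea, the sketch below weights each image pair by how close the ground-truth scores are, so that near-identical images contribute more to the ranking loss; the exponential weighting form and the hyperparameters margin and sigma are illustrative assumptions rather than the paper's definitions.

```python
import torch

def adaptive_rank_loss(pred, target, margin: float = 0.1, sigma: float = 0.5):
    """Pairwise margin ranking loss with a contrast-sensitive weight (illustrative).

    Pairs whose ground-truth scores are close (i.e., visually similar images)
    receive a larger weight, pushing the model to separate them more clearly.
    """
    diff_t = target.unsqueeze(1) - target.unsqueeze(0)   # t_i - t_j for all pairs
    diff_p = pred.unsqueeze(1) - pred.unsqueeze(0)       # f(x_i) - f(x_j)
    mask = diff_t.abs() > 1e-6                           # ignore exact ties

    # Contrast-sensitive weight: largest when ground-truth scores are similar.
    w = torch.exp(-diff_t.abs() / sigma)

    # Hinge on the correctly signed predicted score difference.
    hinge = torch.clamp(margin - torch.sign(diff_t) * diff_p, min=0.0)
    return (w * hinge)[mask].mean()

# Example usage on a mini-batch of predicted and ground-truth quality scores.
pred = torch.tensor([2.1, 3.8, 2.0, 4.2])
target = torch.tensor([2.0, 4.0, 2.5, 4.5])
print(adaptive_rank_loss(pred, target))
```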
3.3. Semantic distortion measurement
In clinical settings, the precision of diagnoses relies on the perceptual clarity of medical images and the unambiguous visibility of semantic structures inherent to the visual data. Unfortunately, contemporary medical IQA methodologies often neglect the critical evaluation of semantic distortion within images. To address this issue, this paper introduces an innovative structure-aware learning model seamlessly incorporating a GNN. This advanced model facilitates exploring the relationships between image structures and quality, enabling a more accurate measurement of semantic distortion. More importantly, the ensemble of GCN models can capture the correlation between different structures in a single image, thereby improving the model’s ability to examine image structures and enabling a comprehensive assessment of image quality.
Specifically, the mutual relationships between structures within the image are modeled as a graph. Nodes in the graph correspond to the structural labels of the A4C images obtained through Echo, and the edges between nodes are constructed from a label co-occurrence matrix. The features of each node (label) are represented by word embeddings of the structural labels, generated using GloVe38 with a vector dimension of 300. GCNs are employed to propagate information across the labels and to learn classifiers that capture the interdependencies among the image labels. These classifiers are then fused with a shallow multi-label classifier to predict the semantic structural labels. Finally, the structural classification results are transformed into quality scores, providing an assessment of image quality based on semantic distortion.
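The following is a minimal PyTorch sketch of this structure-aware branch under the description above: a two-layer GCN over the nine-node label graph, with 300-d GloVe vectors as node features and a normalized co-occurrence matrix as the adjacency, whose output acts as a per-label classifier applied to the image feature vector. The layer sizes, the image feature dimension, and the omitted fusion with the shallow classifier are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LabelGCN(nn.Module):
    """Two-layer GCN over the label co-occurrence graph (an illustrative sketch).

    Node features are 300-d GloVe embeddings of the 9 structural labels; the
    GCN output acts as a set of label classifiers that are applied to the
    image feature vector via an inner product.
    """
    def __init__(self, adj, word_emb, img_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.register_buffer("adj", adj)        # (9, 9) normalized co-occurrence matrix
        self.register_buffer("emb", word_emb)   # (9, 300) GloVe vectors
        self.fc1 = nn.Linear(300, hidden)
        self.fc2 = nn.Linear(hidden, img_dim)
        self.relu = nn.LeakyReLU(0.2)

    def forward(self, img_feat):                          # img_feat: (B, img_dim)
        h = self.relu(self.adj @ self.fc1(self.emb))      # graph propagation, layer 1
        classifiers = self.adj @ self.fc2(h)              # (9, img_dim), one classifier per label
        return img_feat @ classifiers.t()                 # (B, 9) structure logits

# Toy example with random adjacency/embeddings for the 9 cardiac structures.
adj = torch.softmax(torch.rand(9, 9), dim=1)
emb = torch.randn(9, 300)
logits = LabelGCN(adj, emb)(torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 9])
```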
Structure-aware learning: In this branch, we tackle a multi-label classification task. The loss function is formulated as follows:
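The equation is not reproduced here. Assuming the standard multi-label objective, the classification loss takes the form of a binary cross-entropy averaged over the K = 9 structural labels:
$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{K}\sum_{k=1}^{K}\Bigl[y_k \log \sigma(\hat{y}_k) + (1 - y_k)\log\bigl(1 - \sigma(\hat{y}_k)\bigr)\Bigr],$$
where $y_k \in \{0, 1\}$ indicates whether the $k$th structure is visible in the image, $\hat{y}_k$ is the corresponding logit, and $\sigma(\cdot)$ is the sigmoid function.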
The quality score for the semantic distortion measurement branch is defined as follows:
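The original definition is not reproduced here. One simple realization consistent with the description above, given purely as an illustrative assumption, maps the fraction of structures predicted as visible onto the score range:
$$S_{\mathrm{sem}} = \frac{S_{\max}}{K}\sum_{k=1}^{K}\mathbb{1}\bigl[\sigma(\hat{y}_k) > \tau\bigr],$$
where $K = 9$ is the number of annotated structures, $S_{\max} = 4.5$ is the maximum quality score in the Echo dataset, and $\tau$ (e.g., 0.5) is a visibility threshold; $S_{\max}$, $\tau$, and the counting rule are assumptions for illustration rather than the paper's exact formula.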
4. Experiments
4.1. Dataset
Echo dataset: In this study, we utilized Echo data collected from 91 patients by cardiologists at the Dazhou Central Hospital in Sichuan Province, China. The data were acquired using a specially developed video recording tool. The heart was imaged from at least seven standard views, including the parasternal long and short axes, apical two-, three-, and four-chamber, subcostal, and suprasternal views. This paper focuses on the A4C view, for which 1170 images were collected. To align our research with clinical practice, four professional ultrasound doctors were enlisted to annotate each image meticulously; their annotations include a quality score and information about the structural characteristics depicted in the image. An additional group of four experts then carefully reviewed these annotations. Each image was annotated with nine categories of structural information (left and right atria, left and right ventricles, tricuspid and mitral valves, interventricular and interatrial septa, and left ventricular posterior wall) and received a score between 0 and 4.5, with a higher score indicating better image quality and more apparent structural features. All data used in this research were anonymized, de-identified, and stored securely using encryption.
4.2. Experiment settings
We implement our method on two NVIDIA GTX 1080Ti GPUs using PyTorch as the backend. Pre-processed images of size 256×256 are fed into the proposed network. The Adam stochastic optimization algorithm39 is employed, starting with an initial learning rate of $1\times10^{-3}$ and a decay factor of 0.1 applied every 10 epochs. We set the mini-batch size to 32 and the number of epochs to 100. The default value of the parameter η is set to 0.5, and the parameter γ is set to 0.5.
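For reproducibility, the optimizer and schedule above translate directly into the following PyTorch setup; the model and data here are stand-in placeholders rather than the actual network or the Echo dataset.

```python
import torch
import torch.nn as nn

# Adam with initial lr 1e-3, decayed by 0.1 every 10 epochs; batch size 32,
# 100 epochs, 256x256 inputs -- matching the settings reported above.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256 * 256, 1))   # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(100):
    images = torch.randn(32, 3, 256, 256)        # stand-in mini-batch
    scores = torch.rand(32, 1) * 4.5             # stand-in quality labels in [0, 4.5]
    loss = nn.functional.mse_loss(model(images), scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```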
To assess the overall performance of the proposed method, we report three commonly used evaluation metrics: the Pearson Linear Correlation Coefficient (PLCC), the Spearman Rank-Order Correlation Coefficient (SRCC), and the ABE. PLCC quantifies the linear correlation between predicted outcomes and ground truths and is computed as follows:
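Using the standard definition, with $s_i$ and $t_i$ the predicted and ground-truth quality scores of the $i$th test image, $\bar{s}$ and $\bar{t}$ their means, and $M$ the number of test images:
$$\mathrm{PLCC} = \frac{\sum_{i=1}^{M}(s_i - \bar{s})(t_i - \bar{t})}{\sqrt{\sum_{i=1}^{M}(s_i - \bar{s})^{2}}\,\sqrt{\sum_{i=1}^{M}(t_i - \bar{t})^{2}}}.$$
SRCC is the same correlation computed on the rank values of $s_i$ and $t_i$, and ABE is the mean absolute difference between predicted and ground-truth scores, $\frac{1}{M}\sum_{i=1}^{M}|s_i - t_i|$. Higher PLCC/SRCC and lower ABE indicate better performance.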
4.3. Ablation experiment
Within this section, we conduct ablative studies to meticulously examine the impact of crucial components in our proposed model.
Analysis of the pixel distortion measurement: To evaluate the performance of our method’s pixel distortion measurement branch, we conducted comparative experiments against two classic regression models, namely Lasso regression30 and Elastic Net.31 We also compared it with two state-of-the-art methods, a CNN-based ultrasound IQA model (CNN_MIQA)13 and a transformer-based medical image processing method (MedViT),43 to assess the effectiveness of our approach. For each model, we calculated PLCC, SRCC, and ABE on the test set of the Echo dataset.
Table 1 shows that our method achieves the highest PLCC, SRCC, and lowest ABE. Compared to the two classic regression methods, LASSO and Elastic Net, our approach reduces the ABE by 0.274 and 0.257, respectively. Additionally, compared to the state-of-the-art CNN_MIQA method and the MedViT method, our approach reduces the ABE by 0.124 and 0.148, respectively. Our findings underscore the efficacy of our method in the domain of pixel distortion assessment. We further conducted two ablation studies to compare the effectiveness of using MSE and a-rank losses individually and in combination. The results indicate that the joint loss, which combines MSE and a-rank, achieves superior results compared to using either loss alone. This demonstrates that introducing an adaptive ranking mechanism in the quality regression task can enhance the model’s ability to predict quality. However, it should be noted that the predictive performance of using only a-rank loss is lower than that of using only MSE loss. This further validates the notion that combining both losses can yield better results. These findings strongly support the feasibility and accuracy of our proposed model.
Methods | SRCC↑ | PLCC↑ | ABE↓ |
---|---|---|---|
LASSO30 | 0.403 | 0.452 | 0.663 |
EN31 | 0.393 | 0.384 | 0.655 |
CNN_MIQA13 | 0.432 | 0.563 | 0.522 |
MedViT43 | 0.412 | 0.534 | 0.546 |
Ours (ℒMSE) | 0.466 | 0.563 | 0.438 |
Ours (ℒa-rank) | 0.404 | 0.506 | 0.445 |
Ours (ℒMSE+a-rank) | 0.778 | 0.824 | 0.398 |
Analysis of the semantic distortion measurement: The essence of semantic distortion measurement is a multi-label classification task aimed at detecting the different underlying cardiac structures in ultrasound images. Therefore, our method is compared with several widely used baselines commonly employed in computer vision tasks: VGG-19,44 ResNet50,36 and MedViT.43 We also evaluate our method against two modern multi-label classification approaches: CNN–RNN40 and Multi-Evidence.41 Furthermore, we analyzed the performance of our proposed B1 and B2 architectures relative to the baseline methods, providing insights into the superiority of our architectures. Finally, we performed an ablation analysis to validate the effectiveness of introducing GCNs for capturing the correlation between heart structures and image quality in semantic distortion assessment, excluding the GCN module from the semantic distortion measurement branch to observe its effect on performance.
According to Table 2, our proposed method outperforms existing methods in overall F1 (OF1). Specifically, compared to two representative CNN backbones (VGG-19 and ResNet50) and a transformer backbone (MedViT), the model shows an increase in average OF1 of 1.1%, 0.9%, and 0.6%, respectively. Similarly, compared to the state-of-the-art multi-label classification methods CNN–RNN and Multi-Evidence, the model shows an increase in average OF1 of 13.1% and 4.4%, respectively. Furthermore, adding one or two conv-blocks helps further improve the model’s overall performance, with the best average OF1 reaching 93.9%. Additionally, we report the average precision (AP) for each category of our proposed framework and several other baselines, as shown in Fig. 3.

Fig. 3. The AP (in %) for each category is compared between our proposed framework and the several other baselines. “LV” and “RV” denote left and right ventricles, “LA” and “RA” denote left and right atria, “TV” and “MV” denote tricuspid and mitral valve, “IVS” and “IAS” denote interventricular and interatrial septum, and “LVPW” denotes left ventricular posterior wall.
Methods | OP↑ | OR↑ | OF1↑ |
---|---|---|---|
VGG-1944 | 0.935 | 0.922 | 0.928 |
ResNet5036 | 0.932 | 0.928 | 0.930 |
MedViT43 | 0.930 | 0.936 | 0.933 |
CNN–RNN40 | 0.882 | 0.867 | 0.808 |
Multi-Evidence41 | 0.889 | 0.892 | 0.895 |
Ours (with B1) | 0.938 | 0.940 | 0.939 |
Ours (with B2) | 0.941 | 0.925 | 0.933 |
Ours (ℒcls and without GCN) | 0.928 | 0.937 | 0.932 |
Ours (ℒcls+GCN) | 0.938 | 0.940 | 0.939 |
Moreover, the quantitative analysis presented in Table 2 demonstrates that our method achieves a 0.7% improvement in the average OF1 compared to the model without the GCN module. This result highlights the significant contribution of the GCN module in enabling the model to learn the correlation between anatomical structures within the image. By leveraging this learned correlation, our model performs better in structure classification tasks.
Analysis of the multi-task framework: The ultimate quality score for the A4C image is established through a multi-task framework, encompassing the assessment of two facets: pixel distortion and semantic distortion. To analyze the individual contributions of these branches to the overall quality assessment performance, we conducted ablation studies on the Echo dataset. For clarity and convenience, we refer to the pixel distortion measurement branch as “PIX” and the semantic distortion measurement branch as “SEM”.
As shown in Table 3, combining both branches clearly leads to the best performance of the model. Notably, the experimental results reveal that introducing the semantic distortion branch enhances the overall quality assessment of A4C images compared to relying solely on pixel distortion. Specifically, there is an improvement of 0.02 in SRCC, 0.111 in PLCC, and a reduction of 0.034 in ABE. These findings provide strong evidence that considering the semantic structure features of the image is beneficial in enhancing the overall performance of A4C IQA.
Methods | SRCC↑ | PLCC↑ | ABE↓ |
---|---|---|---|
PIX | 0.778 | 0.824 | 0.398 |
SEM | 0.793 | 0.783 | 0.403 |
PIX+SEM | 0.798 | 0.935 | 0.364 |
As shown in Table 4, we conducted a detailed analysis of the weight sensitivity between the two branches in the multi-task framework proposed in this paper. The results indicate that the model performs optimally when the γ value is set to 0.5. This finding suggests that both branches exhibit high robustness and stability within the range of weight adjustments, indicating that the model is not highly sensitive to changes in weight.
(γ) | (1−γ) | SRCC↑ | PLCC↑ | ABE↓ |
---|---|---|---|---|
0.1 | 0.9 | 0.787 | 0.802 | 0.455 |
0.2 | 0.8 | 0.789 | 0.852 | 0.425 |
0.3 | 0.7 | 0.792 | 0.893 | 0.416 |
0.4 | 0.6 | 0.799 | 0.906 | 0.396 |
0.5 | 0.5 | 0.798 | 0.935 | 0.364 |
4.4. Comparisons with state-of-the-art methods
Qualitative analysis: Figure 4 visually presents the qualitative analysis results, comparing our proposed model with three established state-of-the-art models for ultrasound IQA. Encouragingly, our method achieves the best results, especially for images of similar quality (second column), clearly outperforming the other models. This demonstrates the effectiveness of the introduced adaptive ranking learning in accurately predicting the quality scores of similar images. In addition, the third column shows that the prediction of quality scores for low-resolution images is better than that of the other models. This result highlights the importance of identifying image structures for predicting ultrasound image quality, which is consistent with the evaluation criteria used by clinicians but has received little attention in current research.

Fig. 4. (Color online) Quality assessment results of our proposed method on A4C images compared to state-of-the-art methods. The best results are highlighted in red, and the second-best results are highlighted in blue. The labels indicate the actual quality scores.
Quantitative analysis: Table 5 shows the quantitative analysis results of our proposed model and the existing state-of-the-art ultrasound IQA models. The outcomes indicate that our proposed model attained the highest SRCC and PLCC and the lowest ABE. These findings provide evidence of its effectiveness in assessing the quality of ultrasound images.
Methods | SRCC↑ | PLCC↑ | ABE↓ |
---|---|---|---|
MF R-CNN14 | 0.770 | 0.867 | 0.386 |
MUIQA16 | 0.786 | 0.878 | 0.367 |
ARVBNet17 | 0.749 | 0.780 | 0.379 |
Ours | 0.798 | 0.935 | 0.364 |
Table 6 presents the average inference time and the number of parameters for various models using the Echo dataset. The MUIQA model demonstrates a slight advantage over other methods in parameters and inference time. This efficiency can be attributed to its design as a single-task framework, which directly predicts quality scores. Our benchmark indicates that the proposed method has a marginally lower inference time and fewer parameters compared to two multi-task models, MF R-CNN and ARVBNet. Overall, despite the complexity level of the proposed model being comparable to several models, it surpasses them in quality assessment performance.
Methods | Backbone | Time (s) | Params (M) |
---|---|---|---|
MF R-CNN | ResNet101 | 0.647 | 54.09 |
MUIQA | ResNet18 | 0.451 | 30.85 |
ARVBNet | VGG | 0.632 | 43.87 |
Ours | ResNet50 | 0.547 | 35.58 |
4.5. Generalization performance
To further investigate the scalability of our adaptive ranking learning in other ultrasound IQA tasks, we conducted further experiments on the HFUS dataset3 and the UltraSound dataset.45 It is important to note that the HFUS and UltraSound datasets do not contain information on image structure. The parameter γ in Eq. (2) is set to 1.
HFUS dataset: The HFUS dataset contains 17,425 high-frequency ultrasound images of facial skin taken from 44 patients. Three physicians reviewed each image and judged its quality based on the presence of artifacts, noise, or whether it was taken while the ultrasound probe was not in contact with the patient’s skin. If all three physicians agreed that an image was of high quality, it was labeled as such; likewise, if all three agreed that an image was of low quality, it was labeled accordingly. Images without unanimous agreement were labeled as “blurred”. To ensure a fair comparison, images marked as “high quality” were assigned a score of 3, “blurred” images a score of 2, and “low quality” images a score of 1. Following Ref. 3, we compare against the three methods that achieve the best results on the HFUS dataset.
The numerical outcomes presented in Table 7 demonstrate that our approach achieves leading performance compared with prior studies; specifically, the F1 score improves by 8.6% over the HFUS method. These results indicate that our adaptive ranking learning method is highly scalable and can be applied to other ultrasound quality assessment datasets. It is worth noting that the Recall of the HFUS method is slightly higher than that of our method. This is because the HFUS dataset contains significantly more images labeled as “blurred” than the other two categories, leading to some class imbalance. Our method achieves a higher F1 score by slightly sacrificing Recall.
Methods | Accuracy↑ | Precision↑ | Recall↑ | F1↑ |
---|---|---|---|---|
DenseNet46 | 0.796 | 0.733 | 0.909 | 0.811 |
VGG1647 | 0.804 | 0.738 | 0.919 | 0.818 |
HFUS3 | 0.828 | 0.762 | 0.936 | 0.840 |
Ours | 0.906 | 0.921 | 0.921 | 0.926 |
UltraSound dataset: The UltraSound dataset is an ultrasound quality assessment dataset in which images are classified into four categories according to their quality (normal, noisy, blurry, and distorted). Each category contains 650 ultrasound images, for a total of 2600 images. For a fair comparison, we map the normal, noisy, blurry, and distorted categories to quality scores of 3, 2, 1, and 0, respectively. We compare the quality assessment performance with two state-of-the-art methods on the UltraSound dataset: the Quantitative Feature Extraction Machine (QFEM) and VGG-19.
The quantitative results in Table 8 show a clear improvement of the proposed method over the other methods on the UltraSound dataset, further demonstrating the scalability of the adaptive ranking learning method to other ultrasound quality assessment data.
Methods | Accuracy↑ | Precision↑ | Recall↑ | F1↑ |
---|---|---|---|---|
QFEM32 | 0.816 | 0.819 | 0.816 | 0.817 |
VGG-1948 | 0.962 | 0.963 | 0.962 | 0.964 |
Ours | 0.997 | 0.996 | 0.995 | 0.995 |
5. An Intelligent Ultrasonic Image Quality Evaluation System
This paper presents an intelligent cardiac ultrasound IQA system based on the proposed method to assist sonographers in real-time echocardiogram acquisition. The system’s overall architecture is shown in Fig. 5. The system utilizes a deep neural network quality assessment model to guide sonographers in obtaining views quickly for diagnosis. The system comprises a server and a client, with the server handling data management and model calculations. The client interface is illustrated in Fig. 6. Upon system activation, the sonographer can examine the patient using the ultrasound machine as usual without any additional steps. The system automatically captures and uploads the ultrasound image data to the server. Subsequently, the server employs the trained deep neural network model to assess image quality and returns the assessment results to the client. The client promptly displays the evaluation results through a quality meter, enabling the doctor to locate the standard view and save the highest quality image as necessary. The system comprises the following specific modules.

Fig. 5. Functional architecture of intelligent ultrasonic image quality evaluation system.

Fig. 6. System main interface.
Quality meter: The quality meter is the core function. Sonographers can make decisions or optimize the quality based on the predicted quality calculated and fed back by the model in real time.
Auto-capture: The auto-capture feature triggers an auto-capture clip when the image quality is predicted to be diagnostic. It simulates how a sonographer knows when an image is good enough for a diagnosis and documents it.
Save best clip: This feature retrospectively records the highest quality clips obtained to date, allowing the sonographer to select among them when preparing a diagnostic report.
System management: The system management module handles the system interface and log management.
6. Conclusion
This paper proposes a novel neural network model for automated quality assessment of the A4C view. Our approach employs a multi-task learning framework to comprehensively evaluate the quality of cardiac ultrasound A4C images from two aspects: pixel distortion and semantic distortion measurement. To evaluate pixel distortion, we use an adaptive ranking mechanism that learns to assign quality scores to similar images based on contrast sensitivity weighting factors. To assess semantic distortion, we use a structure-aware learning model incorporating a GNN to capture the relationship between the heart structures and image quality. Extensive experiments on real-world cardiac ultrasound data and benchmark datasets demonstrate the efficacy of the proposed approach, achieving performance competitive with, and in several cases exceeding, the state-of-the-art. Moreover, we have designed and implemented an intelligent automatic quality assessment system for cardiac ultrasound based on the proposed method, which is expected to be applied clinically in the future. Moving forward, we plan to extend the method to other tasks in 3D engineering applications49,50 and to apply our model to IQA experiments in other domains, such as the LIVE IQA database.
ORCID
Gadeng Luosang https://orcid.org/0009-0009-1873-3812
Zhihua Wang https://orcid.org/0000-0003-0355-903X
Jian Liu https://orcid.org/0000-0001-5148-5069
Fanxin Zeng https://orcid.org/0000-0002-7337-4463
Zhang Yi https://orcid.org/0000-0002-5867-9322
Jianyong Wang https://orcid.org/0000-0003-1689-2384