Performance Evaluation of Deep, Shallow and Ensemble Machine Learning Methods for the Automated Classification of Alzheimer’s Disease
Abstract
Artificial intelligence (AI)-based approaches are crucial in computer-aided diagnosis (CAD) for various medical applications. Their ability to quickly and accurately learn from complex data is remarkable. Deep learning (DL) models have shown promising results in accurately classifying Alzheimer’s disease (AD) and its related cognitive states, Early Mild Cognitive Impairment (EMCI) and Late Mild Cognitive Impairment (LMCI), along with the healthy conditions known as Cognitively Normal (CN). This offers valuable insights into disease progression and diagnosis. However, certain traditional machine learning (ML) classifiers perform equally well or even better than DL models, requiring less training data. This is particularly valuable in CAD in situations with limited labeled datasets. In this paper, we propose an ensemble classifier based on ML models for magnetic resonance imaging (MRI) data, which achieved an impressive accuracy of 96.52%. This represents a 3–5% improvement over the best individual classifier. We evaluated popular ML classifiers for AD classification under both data-scarce and data-rich conditions using the Alzheimer’s Disease Neuroimaging Initiative and Open Access Series of Imaging Studies datasets. By comparing the results to state-of-the-art CNN-centric DL algorithms, we gain insights into the strengths and weaknesses of each approach. This work will help users to select the most suitable algorithm for AD classification based on data availability.
1. Introduction
Alzheimer’s disease (AD) is an irreversible neurodegenerative condition that permanently alters one’s quality of life.1 AD is characterized by the progressive accumulation of abnormal protein deposits in the brain, known as plaques and tangles.2 The resulting disruption of communication between brain cells leads to a considerable reduction in cognitive abilities, which can have devastating effects on an individual’s personal and social life.1,3 Patients diagnosed with Mild Cognitive Impairment (MCI) are in a stage of transition from a Cognitively Normal (CN) condition to a dementia state, also known as a major neurocognitive disorder.1 This transition has a 10% conversion rate to AD.1 An estimated 55 million persons live with AD worldwide, and many more instances go unreported due to a general lack of knowledge about the condition.1 According to the data, AD is the seventh leading cause of death worldwide.1
People affected by this condition may experience a wide variety of discomforts, including problems with short-term and long-term memory, behavioral disturbances and a variety of other physical concerns such as impaired eyesight and limited mobility.4 The lack of awareness of AD among the general population is the primary barrier to early diagnosis of this condition. As a result, increasing cognitive decline and the related behavioral changes are frequently attributed to the natural aging process or suspected to be other psychiatric problems.5 Furthermore, patients’ suffering is compounded by geographical isolation, lack of qualified medical personnel, limited access to specialists and inadequate diagnostic resources.1 These factors exacerbate suffering to the point that a person’s independence in day-to-day and social life is compromised. Therefore, it is crucial to identify AD early to lessen the burden on the patient and care-taking family members.
The AD continuum has several levels or stages. Mild Cognitive Impairment is an intermediate stage between the cognitive decline of normal aging and the more pronounced decline of dementia.6 The associated problems are noticeable to other people and show up on tests, but they do not interfere with daily life activities. Stable MCI (sMCI) refers to cases in which the impairment does not worsen over time, whereas in progressive MCI (pMCI) the deterioration of cognitive faculties becomes noticeable over time.7,8 MCI is also categorized as Early MCI (EMCI) and Late MCI (LMCI): EMCI refers to cognitive changes that do not yet significantly impact daily life, while LMCI refers to noticeable cognitive difficulties that affect daily activities.9,10 Published classification works on AD usually aim at detecting the onset of the disease or assessing the stage of cognitive impairment.11,12,13
Diagnosis of AD is based primarily on observing patient symptoms, and it can take years for an apparent presence of the disease to be observed. However, owing to developments in diagnostic research, various techniques [e.g. magnetic resonance imaging (MRI), positron emission tomography (PET), computed tomography (CT), blood tests, etc.] have emerged to help in early AD prediction.14,15,16 Employing artificial intelligence (AI) methods on these imaging techniques can improve clinical decision-making and the quality of patient treatment. Machine learning (ML) and deep learning (DL) are subsets of AI that involve training algorithms to learn from data, as shown in Fig. 1. ML algorithms are designed to identify patterns in data and make predictions based on those patterns; they can be supervised, unsupervised or semi-supervised. In contrast, DL algorithms are based on neural networks capable of learning and representing complex patterns in data through multiple layers of processing, and they have achieved state-of-the-art (SOTA) performance on many challenging AI tasks.16 Recently, researchers have shown a strong propensity to use DL models (with or without ensemble methods),16,17 but this may not always be the best choice, because a simple ML classifier can achieve the same or even better results.18 Classifiers based on ML have been widely used in the healthcare industry and have proven helpful in identifying AD cases.11,19,20

Fig. 1. Typical machine learning and deep learning pipeline in AD classification.
Recent research in ML and DL for AD classification spans several directions, such as the use of multiple modalities (MRI, PET and CT scans) for a more comprehensive diagnosis.21,22,23 Several studies have incorporated clinical information such as demographic data, cognitive scores and genetics into DL models using Convolutional Neural Network (CNN)-based classifiers for more accurate AD classification.24,25 For instance, Kundaram and Pathak26 designed a three-way (NC, MCI and AD) CNN-based classifier trained with 9540 images extracted from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset and reported a classification accuracy of 98.57%. In another work, Basaia et al.27 used a 3D CNN consisting of 12 convolutional layers trained with 3D T1-weighted images from the ADNI dataset and from an independent dataset (the Milan dataset), and reported an accuracy of 98.2%. Many studies over the past few years have also predominantly used the Transfer Learning (TL) approach for AD classification. TL reuses a CNN model pretrained on a task with a large dataset to perform a similar task on a smaller dataset. The study of Jain et al.28 used VGG-16 as a feature extractor applied to the ADNI data of 150 subjects, consisting of T1-weighted sMRI, demographic features and MMSE scores. The last fully connected layers of the VGG-16 model were replaced by two dense layers and a dropout layer. The authors reported classification accuracies of 95.73% for CN versus MCI versus AD, 99.14% for AD versus CN and 99.30% for AD versus MCI. In another study, Wu et al.29 compared the classification performance of AlexNet and GoogLeNet using MRI images of normal control, stable MCI and converted MCI (cMCI) subjects; in their experiments, AlexNet outperformed GoogLeNet in all classifications. In another study,30 popular TL architectures (VGG-16 and Inception) were employed for AD classification using the Open Access Series of Imaging Studies (OASIS) dataset, and the authors achieved competitive accuracy despite the smaller dataset through optimal training of the network.
Traditional ML classifiers such as Logistic Regression (LR), Random Forest (RF), K-Means (KM) and Support Vector Machine (SVM) have also been used successfully for AD classification.11 The study of Khedher et al.31 used independent component analysis (ICA) for feature extraction on segmented MRI images from the ADNI dataset and passed the features to an SVM classifier to perform binary classifications. Their model achieved accuracies of 89% for CN versus AD, 79% for CN versus MCI and 85% for MCI versus AD. Several features extracted from MRI images were tested by Acharya et al.32 using K-Nearest Neighbors (KNN); the Shearlet Transform feature-extraction technique produced the best performance, with an average accuracy of 94.54% and precision, sensitivity and specificity of 88.33%, 96.30% and 93.64%, respectively. Many researchers have applied RF for AD prediction and prognosis; it is characterized by its robustness to overfitting and outliers and its ability to handle nonlinear data.33 Alickovic and Subasi34 used a histogram to represent useful features extracted from 267 ADNI brain images. Several classifiers [i.e. SVM, RF, LR, Multilayer Perceptron (MLP), KNN and Decision Tree (DT)] were applied to the produced histogram; the authors reported that RF achieved the best classification accuracy of 85%.
Likewise, ample studies in the literature employ ML and DL algorithms for AD classification. However, in most studies, either the data used are not publicly available or a different subset of data (for instance, from the ADNI dataset) was used, which hinders benchmarking and obscures the true efficiency of these classifiers. In addition to proposing an ML-based ensemble classifier for AD classification, our study conducts an objective evaluation of the efficacy of existing ML and DL classifiers for AD classification using MRI data. In this investigation, our focus on DL classifiers predominantly refers to CNN-centric models.
The study also helps to ascertain these classifiers’ basic limitations and capabilities, which are useful for enhancing the understanding of their performance in real-world applications and guiding further refinements in ML models for MRI data modality.
This paper builds on our previously published study on the four-way classification of AD using the OASIS dataset.35 This update explores testing performance on the popular ADNI dataset in addition to the OASIS dataset. The following contributions have been included in this version:
(1) Investigation of the use of SOTA ML and DL models for the classification of AD, providing substantiating analytical results.
(2) An ensemble classification approach proposed for AD classification using the best-performing models.
(3) Data and source code used in this work made publicly available for the research community.
The rest of the paper is organized as follows: Sections 2 and 3, respectively, present the proposed approach along with experimental setups. Experimental results and analysis are presented in Sec. 4. Current challenges and future avenues are presented in Sec. 5. Conclusions are drawn in Sec. 6.
2. Proposed Methodology
This section presents the proposed pipeline to classify AD using traditional ML classifiers and an ensemble classification approach. We empirically chose the pretrained VGG-16 architecture for feature extraction: our decision process involved comprehensive experiments with various pretrained feature extractors trained on the ImageNet dataset, employing five-fold cross-validation with average accuracy as the performance indicator, and the results consistently demonstrated the superior performance of VGG-16 over the other feature extractors.36,37 By using a pretrained DL model like VGG-16, we leverage its ability to automatically learn hierarchical and discriminative features from images. This facilitates a more sophisticated representation of the data compared to using raw pixel values, which can be limited in capturing complex patterns and structures.
A visual representation of the overall pipeline for ML-based classification of AD is shown in Fig. 2, where features from MRI are extracted using the VGG-16 architecture. The top of the model, which consists of fully connected layers, is removed for feature extraction. An input image of dimensions 176×176×3 (the size used in the OASIS dataset) is converted into a feature block of dimensions 5×5×512 as it passes through the VGG-16 feature-extraction network. This feature block is reshaped into a one-dimensional vector of length 12,800 before any ML classifier is applied.
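To make this step concrete, the following is a minimal sketch of the extraction, assuming the slices arrive as a NumPy array of shape (M, 176, 176, 3); the loading, batching and scaling details of our actual pipeline are omitted:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# VGG-16 pretrained on ImageNet, with the fully connected top layers removed.
# A 176x176x3 input emerges from the last pooling layer as a 5x5x512 block.
extractor = VGG16(weights="imagenet", include_top=False, input_shape=(176, 176, 3))

def extract_features(images: np.ndarray) -> np.ndarray:
    """images: (M, 176, 176, 3) array with pixel values in [0, 255]."""
    blocks = extractor.predict(preprocess_input(images.astype("float32")))
    return blocks.reshape(len(images), -1)  # (M, 5 * 5 * 512) = (M, 12800)
```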

Fig. 2. Block diagram of the proposed pipeline outlining the different stages of the classification method.
2.1. Proposed ensemble classifier
The proposed ensemble approach employs distinct ML algorithms. Prior to applying the ML classifiers, the 3D MRI images are fed to the VGG-16 network to extract a 1D feature vector. This feature vector is then passed through the individual ML classifiers, and a voting process is applied to their outputs, resulting in the final classification label. The overall process involved in the proposed ensemble classification is depicted in Fig. 3.

Fig. 3. The ensemble of machine learning classifiers.
In this work, we study the performance of ensemble classification using two basic techniques: hard-voting and soft-voting. The hard-voting ensemble predicts the final label by taking the mode of the labels predicted by the individual classifiers. The soft-voting ensemble, on the other hand, predicts the final label by summing the probabilities predicted by the individual classifiers and taking the class label with the largest sum. Let $K$ be the number of individual classifiers and $C$ be the number of classes (four in our case).
2.1.1. Hard-voting or max-voting approach
Let $e_j=(e_{j1},e_{j2},\ldots,e_{jK})$ be the vector of classification labels from each of the $K$ classifiers for the $j$th test sample, where $e_{jk}$ represents the label obtained by the $k$th classifier. The final classification label $l_j$ is determined as follows:
$$l_j = \operatorname{mode}\{e_{j1}, e_{j2}, \ldots, e_{jK}\}.$$
2.1.2. Soft-voting or probability-based approach
Let the probabilities assigned by the individual classifiers to the $C$ classes for the $j$th sample be
$$p_{jk} = (p_{jk1}, p_{jk2}, \ldots, p_{jkC}), \quad k = 1, \ldots, K,$$
where $p_{jkc}$ is the probability assigned to class $c$ by the $k$th classifier. Once the probability outputs for the $C$ classes by all $K$ classifiers have been calculated for the $j$th sample, the final prediction label $l_j$ is determined by the probability-based fusion method as follows:
$$l_j = \arg\max_{c \in \{1,\ldots,C\}} \sum_{k=1}^{K} p_{jkc}.$$
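As a concrete illustration of both fusion rules, the sketch below implements them directly over classifier outputs and shows the scikit-learn equivalent; the estimator choices and settings are placeholders rather than our tuned configurations, and the mode computation assumes SciPy ≥ 1.9:

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def hard_vote(labels: np.ndarray) -> np.ndarray:
    """labels: (n_samples, K) matrix of per-classifier predictions e_jk.
    Returns the mode of the K labels for each sample."""
    return stats.mode(labels, axis=1, keepdims=False).mode

def soft_vote(probas: np.ndarray) -> np.ndarray:
    """probas: (K, n_samples, C) stack of per-classifier class probabilities.
    Sums over the K classifiers and returns the argmax class per sample."""
    return probas.sum(axis=0).argmax(axis=1)

# scikit-learn equivalent: voting="hard" or "soft". SVC needs probability=True
# so that predict_proba is available for soft-voting.
ensemble = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("svm", SVC(probability=True))],
    voting="soft")
```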
3. Experimental Setup
3.1. The MRI data
The data used for the experiments in this study were collected from the ADNI (https://adni.loni.usc.edu/)38 and OASIS (https://www.oasis-brains.org)39 datasets. The ADNI is designed to develop clinical, imaging, genetic and biochemical biomarkers for the early detection and tracking of AD. We downloaded T1-weighted MRI images acquired with Magnetization Prepared RApid Gradient Echo (MPRAGE) from subjects of either gender aged between 50 and 65 years. The MPRAGE technique is used in MRI scanners to enhance anatomical image quality and the contrast between gray and white matter.40 The dataset includes 1056 MRI images from the axial plane in four categories (AD: 223, EMCI: 475, LMCI: 262 and CN: 96). Train and test samples were split in a 90:10 proportion, with 950 training and 106 test samples. This dataset represents the data-scarce situation in our study. All images were taken from the ADNI-1, ADNI-2 and ADNI-GO cohorts and downloaded in NIfTI format. The dimensions of the ADNI MRI images used in our experimentation were 218×192.
The OASIS dataset was divided into four categories based on the Clinical Dementia Rating (CDR) score: a CDR score of 0 denotes the absence of dementia, 0.5 very mild dementia, 1 mild dementia and 2 moderate AD. There were 3200, 2240, 896 and 64 images in the CDR-0, CDR-0.5, CDR-1.0 and CDR-2.0 classes, respectively, each with dimensions of 176×176. The Synthetic Minority Oversampling Technique (SMOTE) was applied to generate synthetic samples for the minority classes, resulting in a dataset with comparable numbers of samples in each class: specifically, 2704, 2674, 2708 and 2666 samples in the CDR-0, CDR-0.5, CDR-1.0 and CDR-2.0 classes, respectively. The dataset was further split into a 75:25 train–test split.
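A minimal sketch of this balancing and splitting step is shown below; the feature matrix and label vector are synthetic placeholders that mimic the class imbalance at a smaller scale, and the random seeds are illustrative:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the flattened OASIS images; the class
# counts roughly mimic the 3200/2240/896/64 imbalance at one-tenth scale.
rng = np.random.default_rng(0)
X = rng.random((640, 64))
y = np.repeat([0, 1, 2, 3], [320, 224, 90, 6])

# SMOTE synthesizes minority-class samples by interpolating between a sample
# and its nearest same-class neighbours; k_neighbors must be smaller than the
# size of the rarest class.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)

# 75:25 train-test split, as used in the OASIS experiments.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.25, stratify=y_res, random_state=42)
```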
Testing algorithms on two different datasets, one balanced (OASIS) and one imbalanced (ADNI), has several advantages. In many real-world scenarios, data are imbalanced; testing an algorithm on an imbalanced dataset can reveal whether it performs poorly on specific classes, whereas testing on a balanced dataset gives an idea of overall accuracy. Together, testing on both balanced and imbalanced datasets provides a more comprehensive evaluation of a model’s effectiveness and accuracy, which ultimately helps determine the best-performing model for real-world scenarios.
3.2. Preprocessing of MRI data
The MRI images were preprocessed using the FMRIB Software Library (FSL) toolset,41 which consists of various analytical tools for MRI data.
The MRI images underwent four preprocessing steps: (i) reorientation, (ii) registration, (iii) skull-stripping and (iv) histogram equalization. In the first step, the images were reoriented to the standard brain-atlas (MNI) space. The standard orientation is a left-to-right orientation in which the anterior side of the brain faces upward and the superior side faces forward. Reorientation ensures consistency across all images and the accuracy and reliability of subsequent processing, and it allows better interpretation and visualization of the images. In the second step, the reoriented MRI images were normalized to the MNI standard space by registering them to a standard brain template (the MNI template). The registration process aligns the images to a common coordinate system, allowing accurate comparisons between multiple scans of the same subject. After reorientation and registration, the MRI images had uniform voxel dimensions (90×108×90). Skull-stripping is the next phase of preprocessing. In this phase, also known as brain extraction or whole-brain segmentation, nonbrain tissues such as the skull, eyeballs and skin are removed from the brain MRI images; skull-stripping is mainly done to improve analytical accuracy and the interpretation of the images. The last step increases the contrast and visibility of features in the MRI images using histogram equalization. A sample output of these steps is shown in Fig. 2.
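A minimal sketch of these four steps is given below, assuming FSL is installed and on the PATH; the file names, the FLIRT options and the use of scikit-image for histogram equalization are illustrative choices, not a verbatim transcription of our pipeline:

```python
import subprocess
import nibabel as nib
from skimage import exposure

def preprocess(in_file: str, mni_template: str) -> None:
    # (i) Reorient to the standard (MNI) orientation.
    subprocess.run(["fslreorient2std", in_file, "reoriented.nii.gz"], check=True)
    # (ii) Register to the MNI template with FLIRT (linear registration).
    subprocess.run(["flirt", "-in", "reoriented.nii.gz", "-ref", mni_template,
                    "-out", "registered.nii.gz"], check=True)
    # (iii) Skull-stripping with BET (Brain Extraction Tool).
    subprocess.run(["bet", "registered.nii.gz", "brain.nii.gz"], check=True)
    # (iv) Histogram equalization of the voxel intensities.
    img = nib.load("brain.nii.gz")
    equalized = exposure.equalize_hist(img.get_fdata())
    nib.save(nib.Nifti1Image(equalized, img.affine), "preprocessed.nii.gz")
```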
3.3. Metrics
Accuracy, sensitivity, specificity, area under the ROC curve (AUC) and standard deviation were used as performance metrics in the evaluation of the ML and DL models.35 The standard deviation serves as a measure of the variability or spread of the performance metrics across different runs or folds of the datasets: a lower standard deviation indicates more stable and consistent performance, while a higher standard deviation suggests greater variability. All reported values are averages obtained with the one-versus-all strategy.
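For clarity, the sketch below shows how the one-versus-all sensitivity and specificity can be derived from the multiclass confusion matrix; macro-averaging over the four classes is our assumption about how the averages are formed:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def one_vs_all_metrics(y_true, y_pred, n_classes=4):
    """Macro-averaged sensitivity and specificity with one-versus-all."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    sens, spec = [], []
    for c in range(n_classes):
        tp = cm[c, c]
        fn = cm[c, :].sum() - tp      # class-c samples predicted elsewhere
        fp = cm[:, c].sum() - tp      # other samples predicted as class c
        tn = cm.sum() - tp - fn - fp
        sens.append(tp / (tp + fn))   # sensitivity = 1 - FNR
        spec.append(tn / (tn + fp))   # specificity = 1 - FPR
    return float(np.mean(sens)), float(np.mean(spec))
```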
3.4. Implementation
The ML models were implemented with the Python library scikit-learn,42 and the Keras package from TensorFlow was used for the DL models. The overall performance evaluation was done on a standalone local machine running Ubuntu Linux with a 3.2-GHz CPU and 32 GB of RAM, further enhanced with an NVIDIA GeForce RTX 3060 GPU.
4. Experimental Results
This section demonstrates the performance of 10 conventional ML techniques, used to assess the top-performing models before creating an ensemble classifier. Five-fold cross-validation with a grid-search approach was used to determine the optimal hyperparameter values for each ML method; the reported results for each algorithm thus reflect its optimal hyperparameter values.
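A minimal sketch of this tuning step is given below; the data are synthetic placeholders standing in for the VGG-16 feature vectors, and the SVM grid shown is illustrative rather than the exact grid tuned per classifier:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the VGG-16 feature vectors and labels.
rng = np.random.default_rng(0)
X_train = rng.random((200, 128))
y_train = rng.integers(0, 4, 200)

# Illustrative SVM hyperparameter grid; actual grids differ per classifier.
param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(probability=True), param_grid,
                      cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))
```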
4.1. Experiment 1: Results based on ML classifiers
The initial set of experiments was performed using the VGG-16 feature extractor on the imbalanced ADNI and balanced OASIS datasets, and the results are, respectively, tabulated in Tables 1 and 2. From the given scores, we can observe that some models performed much better than others. For example, on the ADNI data, SVM, XGBoost, KNN and LR all achieved accuracy scores above 0.98, with GBoost close behind (0.9714). On the other hand, K-Means clustering, AdaBoost and DT achieved very low accuracies.
Table 1. Performance of ML classifiers with VGG-16 features on the imbalanced ADNI dataset.

| Method | Accuracy | Specificity | Sensitivity | FNR | FPR | AUC | Std. dev. |
|---|---|---|---|---|---|---|---|
| AdaBoost | 0.7714 | 0.9088 | 0.7618 | 0.2381 | 0.0912 | 0.7700 | 0.0219 |
| GBoost | 0.9714 | 0.9893 | 0.9798 | 0.0202 | 0.0106 | 0.9934 | 0.0134 |
| XGBoost | 0.9809 | 0.9925 | 0.9803 | 0.0196 | 0.0074 | 0.9954 | 0.0223 |
| RF | 0.8917 | 0.9650 | 0.8965 | 0.1034 | 0.0346 | 0.9814 | 0.0132 |
| DT | 0.7142 | 0.8872 | 0.6869 | 0.3130 | 0.1127 | 0.7870 | 0.0257 |
| SVM | 0.9904 | 0.9967 | 0.9895 | 0.0104 | 0.0032 | 0.9987 | 0.0004 |
| KNN | 0.9809 | 0.9931 | 0.9803 | 0.0196 | 0.0068 | 0.9878 | 0.0039 |
| K-Means | 0.4840 | 0.7791 | 0.3306 | 0.6693 | 0.2208 | — | 0.0134 |
| LR | 0.9809 | 0.9925 | 0.9803 | 0.0196 | 0.0074 | 0.9974 | 0.0039 |
| Naive Bayes (NB) | 0.7809 | 0.9183 | 0.7728 | 0.2271 | 0.0816 | 0.8456 | 0.0337 |
Table 2. Performance of ML classifiers with VGG-16 features on the balanced OASIS dataset.

| Method | Accuracy | Specificity | Sensitivity | FNR | FPR | AUC | Std. dev. |
|---|---|---|---|---|---|---|---|
| AdaBoost | 0.6648 | 0.8883 | 0.6650 | 0.3349 | 0.1116 | 0.8146 | 0.0137 |
| GBoost | 0.8593 | 0.9531 | 0.8599 | 0.1400 | 0.0468 | 0.9683 | 0.0266 |
| XGBoost | 0.8902 | 0.9633 | 0.8905 | 0.1094 | 0.0366 | 0.9904 | 0.0198 |
| RF | 0.8917 | 0.9653 | 0.8919 | 0.1080 | 0.0361 | 0.8917 | 0.0132 |
| DT | 0.6898 | 0.8965 | 0.6904 | 0.3095 | 0.1034 | 0.7935 | 0.0903 |
| SVM | 0.9132 | 0.9107 | 0.9135 | 0.0864 | 0.0893 | 0.9913 | 0.0042 |
| KNN | 0.9371 | 0.9790 | 0.9377 | 0.0622 | 0.0209 | 0.9919 | 0.0028 |
| K-Means | 0.3000 | 0.7664 | 0.2991 | 0.7008 | 0.2335 | — | 0.0158 |
| LR | 0.9070 | 0.9690 | 0.9074 | 0.0925 | 0.0309 | 0.9830 | 0.0057 |
| NB | 0.6765 | 0.8919 | 0.6745 | 0.3250 | 0.1080 | 0.7982 | 0.0065 |
Some critical observations from this independent testing are as follows:
(1) SVM is well suited for handling high-dimensional data when the number of observations is small compared to the number of features. SVM can employ kernel functions to transform nonlinear data into a higher-dimensional space, where a linear boundary can separate the classes in the transformed domain.43 SVM also generalizes well to new and unseen data.
(2) Based on the accuracy scores in Table 1, KNN achieves the same accuracy (0.9809) as XGBoost and LR, which is higher than that of most other models. On the OASIS dataset, KNN outperforms all other models (0.9371). KNN makes no assumptions about the data distribution and can thus adapt to complex interactions between input features and target variables. KNN also captures local patterns by assigning a new data point to the majority class among its K nearest neighbors in the training set.44
(3) K-Means clustering performed poorly on both the ADNI and OASIS datasets. One reason could be that outliers affect the calculation of cluster centroids and lead to incorrect grouping of data points. Furthermore, K-Means clustering is only optimal when the data are linearly separable, which is not the case for AD classification.
(4) RF performed relatively poorly (0.8917) on the ADNI dataset. This dataset, with its limited samples, may not have enough variance to establish meaningful splits.45 In addition, class imbalance biases predictions toward the majority class.
(5) Table 1 shows that LR and KNN have the same accuracy score. KNN has a slightly higher specificity score than LR, suggesting that it can better identify negative cases.44
(6) Logistic Regression performed well on ADNI compared to OASIS. The model may have favored the majority class of the imbalanced ADNI dataset, resulting in high accuracy. On the larger, balanced OASIS dataset, the model may have performed better across all classes but with slightly lower overall accuracy.
(7) There could be several reasons for the lack of improvement in accuracy for the RF algorithm between the smaller ADNI and larger OASIS datasets. The larger dataset might contain noisy or irrelevant data that offsets the benefits of having more samples; if the data quality is poor, the Random Forest model may not be able to extract any additional useful information. Furthermore, if the data distribution in the larger dataset is similar to that of the smaller one, the RF model will behave similarly on both in terms of overall accuracy. The larger dataset may need to be more diverse, containing new or different examples, to help the model generalize better and improve its accuracy.
(8) XGBoost consistently obtained high accuracy on both datasets. This could be because XGBoost has built-in regularization techniques, such as L1- and L2-regularization, which help prevent overfitting and improve generalization performance.46 These regularization techniques can be essential on smaller datasets, where overfitting is a common problem. Also, XGBoost is designed to be scalable and can handle large datasets efficiently.46
(9) We used the DT model as AdaBoost’s base classifier with the same hyperparameter values as in the standalone Decision Tree classifier. This approach can lead to a more accurate classifier than a single decision tree, because the AdaBoost model learns by assigning greater weights to samples that are difficult to classify correctly, which helps the subsequent decision trees focus on these samples and improve the overall accuracy.
(10) The Naive Bayes (NB) algorithm was the next worst-performing algorithm after the K-Means classifier. The NB algorithm assumes that all features are independent of each other, which may not hold in reality.47 Also, the NB classifier assumes linear decision boundaries, which is not the case for AD classification.
(11) The KNN and SVM algorithms exhibited low standard deviations on both the ADNI and OASIS datasets, showing consistent and predictable performance, which is important for medical data research.
The performance measures of the five best-performing algorithms, separately using VGG-16 features (top) and direct raw-pixel data (bottom), are shown in Fig. 4. By comparing the performance of classifiers trained on raw pixels versus VGG-16 features, we can assess a model’s suitability for real-time deployment, which involves factors such as speed, scalability, robustness and overall performance. This experiment was conducted on the OASIS dataset only, as the top classifiers already achieve near-perfect accuracy on ADNI. The raw images in OASIS measure 176×176 pixels. When using direct pixel values as features, each training sample is vectorized and stored in a large matrix of size M×30,976, where M is the number of training samples. In contrast, the pretrained VGG-16 encoder transforms the images into an (M×12,800)-dimensional feature space.
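The sketch below illustrates the two feature spaces; the channel replication used to feed grayscale slices to the three-channel VGG-16 input is our assumption about the conversion, and `extract_features` refers to the sketch in Sec. 2:

```python
import numpy as np

# Hypothetical grayscale OASIS slices of shape (M, 176, 176).
M = 32
slices = np.random.rand(M, 176, 176).astype("float32")

raw_features = slices.reshape(M, -1)                  # (M, 30976) raw-pixel vectors

# VGG-16 expects three channels, so each grayscale slice is replicated
# channel-wise before feature extraction.
rgb = np.repeat(slices[..., np.newaxis], 3, axis=-1)  # (M, 176, 176, 3)
vgg_features = extract_features(rgb)                  # (M, 12800)
```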

Fig. 4. VGG16 vs Raw Pixels: Different metrics for four top-performing algorithms on the OASIS dataset.
It is clearly evident that the best-performing KNN algorithm in the transformed VGG-16 space (0.9371) suffered badly when direct raw-pixel values were used (0.8757). This could be due to multiple reasons. As the dimensionality of the feature space increases, the sparsity of the data points also increases. This can make it harder for the KNN algorithm to accurately classify the data points, as the distances between any two points in the high-dimensional space become more and more similar. It is also possible that the higher-dimensional feature space contains redundant or irrelevant features, which may decrease the KNN algorithm’s accuracy. On the other hand, the features extracted from the VGG-16 network, pretrained on a large dataset, have already learnt to recognize high-level structures such as edges, shapes and textures. Using these features, which have nearly 60% fewer dimensions, as input to a classifier can reduce the computational burden and improve classification accuracy.
Performance trends of the other algorithms (e.g. XGBoost, SVM, LR and RF) are comparable in both domains across all metrics. It can be observed that although the accuracy of XGBoost is lower, its specificity and AUC are higher. It is important to note that in multiclass classification problems, specificity is calculated for each class separately; a high specificity value for one class does not necessarily mean the model performs well for the other classes. Furthermore, while accuracy, specificity and sensitivity fluctuate across SVM, KNN, LR and XGBoost, the AUC is comparable for these algorithms. This is because the AUC measures the classifier’s overall performance across all classes, calibrated with optimal hyperparameter values; it considers the ranking of positive and negative instances across all classes and thresholds. Metrics such as accuracy, specificity and sensitivity, on the other hand, may vary widely across classes due to differences in class distribution or classification threshold.
Considering the high dimensionality of raw-pixel features and the associated computational burden, the VGG-16 feature extractor is the better choice. Hence, the rest of our experiments involving ML classifiers utilize the VGG-16 feature space.
4.2. Experiment 2: Results based on ML-based ensemble classifiers
For this experiment, we considered the four top-performing algorithms from the previous experiment: KNN, SVM, LR and XGBoost. Both probability-fusion and max-voting ensembling were tested on the OASIS dataset, since these top-performing methods individually already attained the highest accuracies on the ADNI dataset. We tried different combinations of two, three and four classifiers, and the results are tabulated in Table 3.
Table 3. Performance of ML-based ensemble classifiers with hard- and soft-voting on the OASIS dataset.

| No. of models | Ensemble | Type | Accuracy | Specificity | Sensitivity | FNR | FPR |
|---|---|---|---|---|---|---|---|
| 2 | KNN, LR | Hard | 0.9121 | 0.9706 | 0.9118 | 0.0881 | 0.0293 |
| 2 | KNN, LR | Soft | 0.9589 | 0.9863 | 0.9593 | 0.0406 | 0.0136 |
| 2 | KNN, XGBoost | Hard | 0.9046 | 0.9681 | 0.9043 | 0.0956 | 0.0318 |
| 2 | KNN, XGBoost | Soft | 0.9574 | 0.9858 | 0.9578 | 0.0421 | 0.0141 |
| 2 | KNN, SVM | Hard | 0.9089 | 0.9695 | 0.9087 | 0.0912 | 0.0304 |
| 2 | KNN, SVM | Soft | 0.9652 | 0.9884 | 0.9655 | 0.0344 | 0.0115 |
| 2 | LR, XGBoost | Hard | 0.8933 | 0.9643 | 0.8927 | 0.1072 | 0.0356 |
| 2 | LR, XGBoost | Soft | 0.9179 | 0.9726 | 0.9184 | 0.0815 | 0.0273 |
| 2 | LR, SVM | Hard | 0.9054 | 0.9684 | 0.9053 | 0.0946 | 0.0315 |
| 2 | LR, SVM | Soft | 0.9125 | 0.9708 | 0.9128 | 0.0871 | 0.0291 |
| 2 | XGBoost, SVM | Hard | 0.8933 | 0.9643 | 0.8927 | 0.1072 | 0.0356 |
| 2 | XGBoost, SVM | Soft | 0.9281 | 0.9760 | 0.9284 | 0.0715 | 0.0239 |
| 3 | KNN, SVM, LR | Hard | 0.9250 | 0.9750 | 0.9254 | 0.0745 | 0.0249 |
| 3 | KNN, SVM, LR | Soft | 0.9468 | 0.9822 | 0.9472 | 0.0527 | 0.0177 |
| 3 | KNN, SVM, XGBoost | Hard | 0.9367 | 0.9789 | 0.9371 | 0.0628 | 0.0210 |
| 3 | KNN, SVM, XGBoost | Soft | 0.9613 | 0.9870 | 0.9615 | 0.0384 | 0.0129 |
| 3 | KNN, XGBoost, LR | Hard | 0.9332 | 0.9777 | 0.9335 | 0.0664 | 0.0222 |
| 3 | KNN, XGBoost, LR | Soft | 0.9554 | 0.9851 | 0.9557 | 0.0442 | 0.0148 |
| 3 | SVM, XGBoost, LR | Hard | 0.9183 | 0.9727 | 0.9187 | 0.0812 | 0.0272 |
| 3 | SVM, XGBoost, LR | Soft | 0.9226 | 0.9742 | 0.9230 | 0.0769 | 0.0257 |
| 4 | KNN, SVM, XGBoost, LR | Hard | 0.9285 | 0.9761 | 0.9284 | 0.0715 | 0.0238 |
| 4 | KNN, SVM, XGBoost, LR | Soft | 0.9464 | 0.9821 | 0.9468 | 0.0531 | 0.0178 |
The ensemble of KNN and SVM resulted in an accuracy of 0.9652, a nearly 3% increase over the best-performing KNN model (0.9371). This improvement could be because KNN and SVM are fundamentally different algorithms with different strengths and weaknesses. KNN is a nonparametric algorithm that makes predictions based on the closest neighbors in the training data, while the SVM is a parametric algorithm that finds the best hyperplane to separate the classes. When the predictions of KNN and SVM are combined, their strengths complement each other, and their weaknesses are mitigated. For example, KNN may better identify local patterns in the data, while SVM may better handle high-dimensional data or data with complex decision boundaries. The ensemble of KNN and SVM can correct errors made by each model as these algorithms are diverse. Diversity is a critical factor in the success of ensembles because it allows the models to capture different aspects of the data and reduce the risk of overfitting.
It can also be noted from Table 3 that increasing the number of classifiers did not increase the overall accuracy. Ensemble methods can help improve the performance of ML models, but there are limits to how much improvement can be gained by adding more classifiers to the ensemble. Ensemble methods work best when the individual models are diverse and provide complementary information about the data; if the additional classifiers are similar to the existing models or provide conflicting information, they may not add much value. In general, we can conclude that adding more classifiers to an ensemble is beneficial only up to a certain point, beyond which the marginal benefits are outweighed by the costs of increased model complexity and possible overfitting.
It can also be observed that the probability-based ensembling is consistently performing better than the max-voting-based ensemble. In a probability-based ensemble, the final prediction is based on the confidence levels assigned by each classifier. On the other hand, in a max-voting-based ensemble, the final prediction is based on the mode of the individual classifier predictions.
Soft-voting can be helpful in cases where some classifiers are more confident in their predictions than others: the probability-based ensemble can improve accuracy by weighting the predictions according to their confidence levels. Further, such ensembles can be more robust under uncertainty in the data, such as ambiguity or difficulty in the classification; a soft-voting ensemble can take this uncertainty into account and make a more informed prediction based on the probabilities assigned by the individual classifiers. The rest of our experiments report the accuracy of ensembles using the probability-based (soft-voting) approach.
The confusion matrices of the five top-performing algorithms are shown in Fig. 5, along with that of the ensemble. The ensemble approach showed robustness in correctly classifying the MCI (CDR-1.0) and AD (CDR-2.0) cases, with accuracies of 0.92 and 0.94, respectively, a distinctive capability compared to the individual classifiers: the SVM achieved accuracies of 0.84 and 0.83 in distinguishing MCI and AD cases, while the KNN classifier obtained accuracies of 0.88 and 0.86 on the same classes. This ability of the ensemble classifier to distinguish between MCI and AD cases is vital, as early detection of MCI provides an opportunity for early intervention to prevent or delay the onset of AD.

Fig. 5. Confusion matrices: XGBoost (0.8902), KNN (0.9371), SVM (0.9132), LR (0.907), RF (0.8917) and soft ensemble classifier (0.9652).
4.3. Experiment 3: Results based on DL classifiers
This study also investigates the performance of several pretrained neural network models from the ImageNet challenge, namely ResNet50, ResNet101, ResNet152, InceptionV3, InceptionResNetV2 and EfficientNetB0, on the task of classifying MRI datasets obtained from OASIS and ADNI. The input samples were converted to 3D space for architectural compatibility. We removed the top layer and retained the ImageNet weights during our experiments with pretrained models. The fine-tuning layer can be expressed as follows: DO(0.5)–Flatten–BN–2048N–BN–DO(0.5)–1024N–BN–DO(0.5)–4N, where DO(x) represents a dropout layer with a probability of x, BN indicates Batch Normalization and fN implies a fully connected layer with f neurons. The final output layer comprises four softmax-activated neurons for four AD classes.
In this experiment, we evaluate both pretrained models and a CNN trained from scratch, and compare their performance to provide insights into the effectiveness of these approaches for our task. The architecture of the non-pretrained CNN can be represented as: 16C2–16C2–MP2–32C2–32C2–32C2–MP2–64C2–16C1–Flatten–4N. The Adam optimizer with a learning rate of 0.0001 was used with a batch size of 128. The number of epochs was controlled by the Keras EarlyStopping callback. The results obtained for the ADNI and OASIS datasets are shown in Tables 4 and 5, respectively.
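As an illustration, a sketch of the described fine-tuning head attached to one of the evaluated backbones (ResNet50 here) is given below; the ReLU activations in the dense layers, the loss function and the EarlyStopping settings are assumptions not specified in the text:

```python
from tensorflow.keras import Sequential, layers, optimizers, callbacks
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False, input_shape=(176, 176, 3))

# DO(0.5)-Flatten-BN-2048N-BN-DO(0.5)-1024N-BN-DO(0.5)-4N head from the text.
model = Sequential([
    base,
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.BatchNormalization(),
    layers.Dense(2048, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(1024, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),  # four AD classes
])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=128, epochs=200, callbacks=[early_stop])
```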
Table 4. Performance of DL classifiers on the ADNI dataset.

| Method | Test acc. | Train acc. | Val. acc. | Specificity | Sensitivity | FNR | FPR | AUC |
|---|---|---|---|---|---|---|---|---|
| ResNet50 | 0.9890 | 0.9950 | 0.9989 | 0.9926 | 0.9803 | 0.0197 | 0.0074 | 0.9864 |
| ResNet101 | 0.9620 | 0.9910 | 0.9911 | 0.9863 | 0.9695 | 0.0305 | 0.0137 | 0.9922 |
| ResNet152 | 0.9809 | 0.9920 | 0.9924 | 0.9926 | 0.9803 | 0.0197 | 0.0074 | 0.9899 |
| InceptionV3 | 0.9050 | 0.9990 | 0.9850 | 0.9682 | 0.9215 | 0.0785 | 0.0318 | 0.9912 |
| InceptionResNetV2 | 0.9620 | 0.9980 | 0.9960 | 0.9841 | 0.9398 | 0.0602 | 0.0159 | 0.9917 |
| EfficientNetB0 | 0.4380 | 0.6830 | 0.6907 | 0.7500 | 0.2500 | 0.7500 | 0.2500 | 0.5468 |
| Non-pretrained CNN | 0.9809 | 1.0000 | 0.9905 | 0.9925 | 0.9791 | 0.0208 | 0.0074 | 0.9999 |
Table 5. Performance of DL classifiers on the OASIS dataset.

| Method | Test acc. | Train acc. | Val. acc. | Specificity | Sensitivity | FNR | FPR | AUC |
|---|---|---|---|---|---|---|---|---|
| ResNet50 | 0.8310 | 0.9690 | 0.9775 | 0.9433 | 0.8299 | 0.1701 | 0.0567 | 0.9641 |
| ResNet101 | 0.8730 | 0.9830 | 0.9805 | 0.9575 | 0.8729 | 0.1271 | 0.0425 | 0.9734 |
| ResNet152 | 0.8380 | 0.9720 | 0.9718 | 0.9461 | 0.8386 | 0.1664 | 0.0539 | 0.9628 |
| InceptionV3 | 0.9140 | 0.9980 | 0.9899 | 0.9713 | 0.9139 | 0.0861 | 0.0287 | 0.9871 |
| InceptionResNetV2 | 0.9000 | 0.9920 | 0.9869 | 0.9666 | 0.9003 | 0.0997 | 0.0334 | 0.9832 |
| EfficientNetB0 | 0.2590 | 0.5060 | 0.5077 | 0.7500 | 0.2500 | 0.7500 | 0.2500 | 0.4826 |
| Non-pretrained CNN | 0.9429 | 1.0000 | 0.9429 | 0.9810 | 0.9434 | 0.0565 | 0.0189 | 0.9817 |
The observations based on this experiment are discussed below:
(1) The Transfer Learning models converged at an earlier stage on the ADNI dataset than on the OASIS dataset, where the models required 200 epochs to complete training. There could be several reasons for this early convergence. Training on OASIS used 8064 samples (75% of the 10,752 total samples), whereas training on ADNI used 950 samples (90% of the 1056 total samples). On the smaller ADNI dataset, the validation loss may stop falling and begin to rise as the number of epochs grows, indicating overfitting and triggering early stopping. In addition, the ADNI dataset may have more noise in the training data, making it harder for the model to learn patterns and generalize to new data, again causing increased validation loss and early termination.
(2) The EfficientNetB0 model performed poorly compared to the ResNet models on the MRI data. EfficientNetB0’s complex architecture with many parameters may have caused overfitting, especially with a limited dataset size. In contrast, the simpler architecture of the ResNet models facilitated better generalization on the MRI data. The pretrained weights used in transfer learning also played a role, with the ResNet models capturing relevant features better than EfficientNetB0.
(3) InceptionV3 demonstrates superior accuracy on the OASIS dataset but not on the ADNI dataset. This discrepancy can be attributed to several factors. InceptionV3 is a deep and complex model with numerous parameters; when trained on the smaller ADNI dataset, there is a higher risk of overfitting, wherein the model memorizes the training set instead of generalizing to new data.
(4) The difference between the testing and training accuracies, i.e. the generalization gap, is smaller on the ADNI dataset than on the OASIS dataset. A larger dataset contains more diverse samples, which can make it harder for the model to generalize to new samples. Overall, dataset size is a pivotal factor affecting the generalization gap.
(5) ResNet101 performed better on the larger OASIS dataset but worse on the smaller ADNI dataset compared to the other ResNet models used in this study. Deeper and more complex models like ResNet101 are typically more effective with larger datasets, while on smaller datasets they may overfit due to limited data. In such cases, a ResNet50 model with fewer layers might prevent overfitting and capture the key properties more effectively.
(6) InceptionResNetV2 performs well on OASIS but not on ADNI. Like InceptionV3, InceptionResNetV2 has many parameters and can overfit on smaller datasets. On the ADNI dataset, data imbalance and lack of diversity might also impair accuracy.
(7) The CNN trained from scratch performs strongly on both the larger OASIS and the smaller ADNI datasets. In pretrained models, the features learnt from other datasets may not be relevant to the current dataset, leading to poorer performance.
4.4. Experiment 4: Results based on ensemble DL classifiers
This subsection reports the performance of ensemble DL classifiers built from four CNN architectures, including both Transfer Learning models and the CNN trained from scratch (see Table 6 for details). Combining a Transfer Learning architecture with the CNN trained from scratch can improve the ensemble classifier’s accuracy. However, adding more DL classifiers did not significantly improve accuracy and may not be worth the computational cost. The highest accuracy achieved with four DL classifiers was similar to that of an ML ensemble with only two classifiers (see Table 3). It is important to highlight that our results resonate with the widely recognized fact that DL algorithms tend to require large datasets for optimal performance (see Table 5).30,48 Notably, our findings remain consistent even when implementing transfer learning with popular pretrained models.49
Table 6. Performance of ensemble DL classifiers.

| No. of models | Models | Accuracy | Specificity | Sensitivity | FNR | FPR |
|---|---|---|---|---|---|---|
| 2 | CNN, InceptionV3 | 0.9535 | 0.9845 | 0.9539 | 0.0154 | 0.0460 |
| 2 | CNN, InceptionResNetV2 | 0.9476 | 0.9825 | 0.9481 | 0.0174 | 0.0518 |
| 2 | CNN, ResNet101 | 0.9460 | 0.9820 | 0.9464 | 0.0179 | 0.0535 |
| 2 | InceptionV3, InceptionResNetV2 | 0.9363 | 0.9787 | 0.9363 | 0.0217 | 0.0636 |
| 2 | InceptionV3, ResNet101 | 0.9281 | 0.9760 | 0.9280 | 0.0239 | 0.0719 |
| 2 | InceptionResNetV2, ResNet101 | 0.9234 | 0.9744 | 0.9234 | 0.0255 | 0.0765 |
| 3 | CNN, InceptionV3, InceptionResNetV2 | 0.9644 | 0.9881 | 0.9646 | 0.0118 | 0.0353 |
| 3 | CNN, InceptionV3, ResNet101 | 0.9574 | 0.9857 | 0.9574 | 0.0142 | 0.0425 |
| 3 | CNN, InceptionResNetV2, ResNet101 | 0.9539 | 0.9846 | 0.9540 | 0.0153 | 0.0459 |
| 3 | InceptionV3, InceptionResNetV2, ResNet101 | 0.9492 | 0.9830 | 0.9491 | 0.0169 | 0.0500 |
| 4 | CNN, InceptionV3, InceptionResNetV2, ResNet101 | 0.9656 | 0.9657 | 0.9657 | 0.0114 | 0.0342 |
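For reference, a minimal sketch of the soft-voting fusion used for the DL ensembles is shown below; averaging the softmax outputs is equivalent to summing them for the purpose of the argmax:

```python
import numpy as np

def dl_soft_ensemble(models, X):
    """Average the softmax outputs of the trained Keras models and pick the
    class with the largest fused probability."""
    fused = np.mean([m.predict(X, verbose=0) for m in models], axis=0)  # (M, 4)
    return fused.argmax(axis=1)
```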
In conclusion, our findings indicate that a small ensemble of diverse ML classifiers can achieve superior accuracy compared to a larger ensemble of DL classifiers (see Tables 3 and 6). Based on these experiments, we can safely conclude that ML classifiers, whether used individually or as ensembles, are the preferred choice for data-scarce scenarios, which are commonly encountered in the medical field. The final choice between ML and DL can therefore depend on the specific details of the problem and the available resources.
5. Current Challenges and Future Avenues
While our study has made a significant contribution to the existing knowledge base on AD classification, there are still several limitations that must be addressed. In this section, we will examine these limitations and discuss potential avenues for future research.
(1) MRI preprocessing: Validating MRI preprocessing is crucial for ML and DL tasks, as it directly impacts model training and testing. Preprocessing enhances data quality by improving image resolution and contrast and reducing noise. Neglecting to validate the preprocessed images can result in inaccurate or irrelevant features, leading to decreased model performance. Comparing original and preprocessed images helps identify inconsistencies, allowing assessment of preprocessing’s impact on model accuracy. Hence, ensuring dependable preprocessing can lead to improved model outcomes.
(2) MRI images from the axial plane only: Relying solely on axial-plane MRI images for AD prediction can negatively impact ML and DL training and testing in several ways. First, it may exclude relevant information from other planes, leading to inaccurate analysis. Since AD can affect various brain regions, considering images only from the axial plane limits a comprehensive understanding of the disease. By considering MRI images from multiple planes, researchers can achieve more accurate and generalizable results in AD analysis.
(3) Use of neuroimaging data only: Using solely neuroimaging data for ML and DL processing of AD can limit model performance. The complexity of the disease extends beyond the brain, affecting various aspects of a person’s lifestyle and overall health, and ignoring these factors can lead to partial or incorrect diagnoses or predictions. Hence, it is important to incorporate additional biomarkers that provide a more comprehensive picture of the patient, such as clinical and demographic data. By combining multiple data sources, models can capture complex interactions between different factors, improving prediction accuracy.
(4) Single data modality: Neurodegenerative diseases like AD can affect the brain in complex and varied ways, and different data modalities capture different aspects of the disease. Using only a single data modality might limit the model’s ability to capture the full extent of the disease’s impact on the brain. For instance, combining structural MRI, functional MRI and the molecular changes associated with AD can provide a more complete and accurate picture for ML and DL processing.
(5) Black box models: The use of black box models in AD-related ML and DL processing poses challenges for interpreting results and understanding the underlying reasons.50,51 Explainable AI (XAI) facilitates understanding and visualization of the decision-making process of black box models, enhancing interpretability and bias detection.
(6) Different subsets of ADNI: Varied demographics, imaging protocols and image qualities across subsets contribute to dataset heterogeneity and can affect model performance. Additionally, model instability arises as the model learns from different subsets, reducing reliability and interpretability. Addressing these issues involves carefully selecting subsets that represent the overall population with consistent demographics.
(7) VGG-16 feature space: Relying solely on the VGG-16 feature space for ML processing of AD has several implications for training and testing; in particular, it can lead to the loss of other important features relevant to the diagnosis of AD, resulting in decreased performance. In future work, one can study the impact of different Transfer Learning-based ConvNet feature extractors on the AD classification task.
(8) Model overfitting: DL models have a large hyperparameter space. For example, CNNs contain hyperparameters for the number of convolutional layers, the size and number of filters, the size of the pooling layer, the learning rate, the regularization parameters and many other variables. Fine-tuning DL models with optimized hyperparameters is a challenging task, but it will likely yield the best results.
6. Conclusion
In recent years, there has been a surge of interest in deep learning methods, which are often portrayed as the solution to all complex classification problems. Our study aimed to provide insights into the performance of machine learning relative to deep learning methods in the Alzheimer’s disease classification task, where data are often limited and complex. Specifically, by proposing an ML-based ensemble classifier that showed an improvement of 3–7% in accuracy over individual classifier methods, we demonstrated the potential of ML approaches for improving performance on complex AD data. Furthermore, our study compared the performance of ML and DL methods by implementing 10 widely used ML algorithms and evaluating them against DL models.
In summary, our results demonstrated that a small ensemble of diverse ML classifiers outperforms a larger ensemble of deep learning classifiers in terms of accuracy. These experiments led us to the confident conclusion that ML classifiers, whether employed singly or in ensembles, represent the preferred option for situations characterized by limited data, a frequent occurrence in medical contexts. The ultimate decision between ML and DL should depend on the specific complexities of the problem at hand and the available resources.
While improvements in classification accuracy seem encouraging from an academic perspective, the practical significance of clinical and epidemiological implications should also be considered. In a clinical context, even slight enhancements in AD diagnosis accuracy hold the potential to usher in earlier interventions and personalized treatment strategies, which can significantly improve the quality of life for patients and their families. From an epidemiological perspective, our research contributes to the ongoing discourse surrounding AD diagnosis, offering insights that can inform population-level studies and the broader understanding of AD progression. These findings underscore the value of our work beyond the realm of ML, reinforcing its relevance and potential to impact clinical practice and public health.
Acknowledgments
This work is funded by the Ministry of Higher Education, Research and Innovation (MoHERI) of the Sultanate of Oman under the Block Funding Program (Grant No. MoHERI/BFP/UoTAS/01/2021) and by the UKRI through the Horizon Europe Guarantee Scheme (project number: 10078953) for the European Commission funded PHASE IV AI project (Grant Agreement No. 101095384) under the Horizon Europe Programme. Vimbi Viswan is supported by the Internal Research Grant (No.IRG/2024/Call-7/8) of UTAS. The source code of this work can be found at: https://github.com/snoushath/AII2022.git.
ORCID
Noushath Shaffi https://orcid.org/0000-0001-9243-8402
Karthikeyan Subramanian https://orcid.org/0000-0002-5086-1170
Viswan Vimbi https://orcid.org/0009-0005-4065-4492
Faizal Hajamohideen https://orcid.org/0000-0003-4402-8294
Abdelhamid Abdesselam https://orcid.org/0000-0003-2950-2875
Mufti Mahmud https://orcid.org/0000-0002-2037-8348