Performance Evaluation of Deep, Shallow and Ensemble Machine Learning Methods for the Automated Classification of Alzheimer’s Disease
Abstract
Artificial intelligence (AI)-based approaches are crucial in computer-aided diagnosis (CAD) for various medical applications. Their ability to quickly and accurately learn from complex data is remarkable. Deep learning (DL) models have shown promising results in accurately classifying Alzheimer’s disease (AD) and its related cognitive states, Early Mild Cognitive Impairment (EMCI) and Late Mild Cognitive Impairment (LMCI), along with the healthy conditions known as Cognitively Normal (CN). This offers valuable insights into disease progression and diagnosis. However, certain traditional machine learning (ML) classifiers perform equally well or even better than DL models, requiring less training data. This is particularly valuable in CAD in situations with limited labeled datasets. In this paper, we propose an ensemble classifier based on ML models for magnetic resonance imaging (MRI) data, which achieved an impressive accuracy of 96.52%. This represents a 3–5% improvement over the best individual classifier. We evaluated popular ML classifiers for AD classification under both data-scarce and data-rich conditions using the Alzheimer’s Disease Neuroimaging Initiative and Open Access Series of Imaging Studies datasets. By comparing the results to state-of-the-art CNN-centric DL algorithms, we gain insights into the strengths and weaknesses of each approach. This work will help users to select the most suitable algorithm for AD classification based on data availability.
1. Introduction
Alzheimer’s disease (AD) is an irreversible neurodegenerative condition that permanently alters one’s quality of life.1 AD is characterized by the progressive accumulation of abnormal protein deposits in the brain, known as plaques and tangles.2 The resulting disruption of communication between brain cells leads to a considerable reduction in cognitive abilities, which can have devastating effects on an individual’s personal and social life.1,3 Patients diagnosed with Mild Cognitive Impairment (MCI) are in a stage of transition from a Cognitively Normal (CN) condition to a dementia state, also known as a major neurocognitive disorder.1 This transition has a 10% conversion rate to AD.1 An estimated 55 million persons live with AD worldwide, and many more instances go unreported due to a general lack of knowledge about the condition.1 According to the data, AD is the seventh leading cause of death worldwide.1
People affected by this condition may experience a wide variety of discomforts, including problems with short-term and long-term memory, behavioral disturbances and a variety of other physical concerns such as impaired eyesight and limited mobility.4 The lack of awareness of AD among the general population is the primary barrier to early diagnosis of this condition. As a result, increasing cognitive decline and the related behavioral changes are frequently attributed to the natural aging process or suspected to be other psychiatric problems.5 Furthermore, patients’ suffering is compounded by geographical isolation, lack of qualified medical personnel, limited access to specialists and inadequate diagnostic resources.1 These factors exacerbate suffering to the point that a person’s independence in day-to-day and social life is compromised. Therefore, it is crucial to identify AD early to lessen the burden on the patient and care-taking family members.
The AD continuum has several levels or stages. Mild Cognitive Impairment is an intermediate stage between the cognitive decline of normal aging and the more pronounced decline of dementia.6 The associated problems are noticeable to other people and show up on tests, but they do not interfere with daily life activities. Stable MCI (sMCI) refers to cases in which the impairment does not worsen over time, whereas in progressive MCI (pMCI) the deterioration of cognitive faculties becomes noticeable over time.7,8 MCI is also categorized as Early MCI (EMCI) and Late MCI (LMCI): EMCI refers to cognitive changes that do not yet significantly impact daily life, while LMCI refers to noticeable cognitive difficulties that affect daily activities.9,10 Published classification works on AD usually aim at detecting the onset of the disease or assessing the stage of cognitive impairment.11,12,13
Diagnosis of AD is based primarily on observing patient symptoms, and it can take years for an apparent presence of the disease to be observed. However, owing to developments in diagnostic research, various techniques [e.g. magnetic resonance imaging (MRI), positron emission tomography (PET), computed tomography (CT), blood tests, etc.] have emerged to help in early AD prediction.14,15,16 Employing artificial intelligence (AI) methods on these imaging techniques can improve clinical decision-making and the quality of patient treatment. Machine learning (ML) and deep learning (DL) are subsets of AI that involve training algorithms to learn from data, as shown in Fig. 1. ML algorithms are designed to identify patterns in data and make predictions based on those patterns; they can be supervised, unsupervised or semi-supervised. In contrast, DL algorithms are based on neural networks capable of learning and representing complex patterns in data through multiple layers of processing, and they have achieved state-of-the-art (SOTA) performance on many challenging AI tasks.16 Recently, researchers have shown a strong propensity to use DL models (with or without ensemble methods),16,17 but this may not always be the best choice, because a simple ML classifier can achieve the same or even better results.18 Classifiers based on ML have been widely used in the healthcare industry and have proven helpful in identifying AD cases.11,19,20

Fig. 1. Typical machine learning and deep learning pipeline in AD classification.
Recent research in ML and DL for AD classification spans several directions, such as the use of multiple modalities (MRI, PET and CT scans) for a more comprehensive diagnosis.21,22,23 Several studies have incorporated clinical information such as demographic data, cognitive scores and genetics into DL models using Convolutional Neural Network (CNN)-based classifiers for more accurate AD classification.24,25 For instance, Kundaram and Pathak26 designed a three-way (NC, MCI and AD) CNN-based classifier trained with 9540 images extracted from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset and reported a classification accuracy of 98.57%. In another work, Basaia et al.27 used a 3D CNN consisting of 12 convolutional layers trained with 3D T1-weighted images from the ADNI dataset and from an independent dataset (the Milan dataset), and reported an accuracy of 98.2%. Many studies over the past few years have also predominantly used the Transfer Learning (TL) approach for AD classification. TL reuses a CNN model pretrained on a task with a large dataset to perform a similar task on a smaller dataset. The study of Jain et al.28 used VGG-16 as a feature extractor applied to the ADNI data of 150 subjects, consisting of T1-weighted sMRI, demographic features and MMSE scores. The last fully connected layers of the VGG-16 model were replaced by two dense layers and a dropout layer. The authors reported classification accuracies of 95.73% for CN versus MCI versus AD, 99.14% for AD versus CN and 99.30% for AD versus MCI. In another study, Wu et al.29 compared the classification performance of AlexNet and GoogLeNet using MRI images of normal control, stable MCI and converted MCI (cMCI) subjects; in their experiments, AlexNet outperformed GoogLeNet in all classifications. In another study,30 popular TL architectures (VGG-16 and Inception) were employed for AD classification using the Open Access Series of Imaging Studies (OASIS) dataset, and the authors achieved competitive accuracy despite the smaller dataset through optimal training of the network.
Traditional ML classifiers such as Logistic Regression (LR), Random Forest (RF), K-Means (KM) and Support Vector Machine (SVM) have also been used successfully for AD classification.11 The study of Khedher et al.31 used independent component analysis (ICA) for feature extraction on segmented MRI images from the ADNI dataset and passed the features to an SVM classifier to perform binary classifications. Their model achieved accuracies of 89% for CN versus AD, 79% for CN versus MCI and 85% for MCI versus AD. Several features extracted from MRI images were tested by Acharya et al.32 using K-Nearest Neighbors (KNN); the Shearlet Transform feature-extraction technique produced the best performance, with an average accuracy of 94.54% and precision, sensitivity and specificity of 88.33%, 96.30% and 93.64%, respectively. Many researchers have applied RF for AD prediction and prognosis; it is characterized by its robustness to overfitting and outliers and its ability to handle nonlinear data.33 Alickovic and Subasi34 used a histogram to represent useful features extracted from 267 ADNI brain images. Several classifiers [i.e. SVM, RF, LR, Multilayer Perceptron (MLP), KNN and Decision Tree (DT)] were applied to the produced histogram; the authors reported that RF achieved the best classification accuracy of 85%.
Likewise, ample studies in the literature employ ML and DL algorithms for AD classification. However, in most studies, either the data used are not publicly available or a different subset of data (for instance, from the ADNI dataset) was used, which hinders benchmarking and obscures the true efficiency of these classifiers. In addition to proposing an ML-based ensemble classifier for AD classification, our study conducts an objective evaluation of the efficacy of existing ML and DL classifiers for AD classification using MRI data. In this investigation, our focus on DL classifiers predominantly refers to CNN-centric models.
The study also helps to ascertain these classifiers’ basic limitations and capabilities, which are useful for enhancing the understanding of their performance in real-world applications and guiding further refinements in ML models for MRI data modality.
This paper builds on our previously published study on the four-way classification of AD using the OASIS dataset.35 This update explores testing performance on the popular ADNI dataset in addition to the OASIS dataset. The following contributions have been included in this version:
(1) Investigation of the use of SOTA ML and DL models for the classification of AD, providing substantiating analytical results.
(2) An ensemble classification approach proposed for AD classification using the best-performing models.
(3) Data and source code used in this work made publicly available for the research community.
The rest of the paper is organized as follows: Sections 2 and 3, respectively, present the proposed approach along with experimental setups. Experimental results and analysis are presented in Sec. 4. Current challenges and future avenues are presented in Sec. 5. Conclusions are drawn in Sec. 6.
2. Proposed Methodology
This section presents the proposed pipeline to classify AD using traditional ML classifiers and an ensemble classification approach. We empirically chose the pretrained VGG-16 architecture for feature extraction: our decision process involved comprehensive experiments with various pretrained feature extractors trained on the ImageNet dataset, employing five-fold cross-validation with average accuracy as the performance indicator, and the results consistently demonstrated the superior performance of VGG-16 over the other feature extractors.36,37 By using a pretrained DL model like VGG-16, we leverage its ability to automatically learn hierarchical and discriminative features from images. This facilitates a more sophisticated representation of the data compared to using raw pixel values, which can be limited in capturing complex patterns and structures.
A visual representation of the overall pipeline for ML-based classification of AD is shown in Fig. 2, where features from MRI are extracted using the VGG-16 architecture. The top of the model, which consists of fully connected layers, is removed for feature extraction. An input image of dimensions 176×176×3 (the size used in the OASIS dataset) is converted into a feature block of dimensions 5×5×512 as it passes through the VGG-16 feature-extraction network. This feature block is reshaped into a one-dimensional vector of length 12,800 before any ML classifier is applied.
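To make this step concrete, the following is a minimal sketch of the extraction, assuming the slices arrive as a NumPy array of shape (M, 176, 176, 3); the loading, batching and scaling details of our actual pipeline are omitted:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# VGG-16 pretrained on ImageNet, with the fully connected top layers removed.
# A 176x176x3 input emerges from the last pooling layer as a 5x5x512 block.
extractor = VGG16(weights="imagenet", include_top=False, input_shape=(176, 176, 3))

def extract_features(images: np.ndarray) -> np.ndarray:
    """images: (M, 176, 176, 3) array with pixel values in [0, 255]."""
    blocks = extractor.predict(preprocess_input(images.astype("float32")))
    return blocks.reshape(len(images), -1)  # (M, 5 * 5 * 512) = (M, 12800)
```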

Fig. 2. Block diagram of the proposed pipeline outlining the different stages of the classification method.
2.1. Proposed ensemble classifier
The proposed ensemble approach employs distinct ML algorithms. Prior to applying the ML classifiers, the 3D MRI images are fed to the VGG-16 network to extract a 1D feature vector. This feature vector is then passed through the individual ML classifiers, and a voting process is applied to their outputs, resulting in the final classification label. The overall process involved in the proposed ensemble classification is depicted in Fig. 3.

Fig. 3. The ensemble of machine learning classifiers.
In this work, we study the performance of ensemble classification using two basic techniques: hard-voting and soft-voting. The hard-voting ensemble predicts the final label by taking the mode of the labels predicted by the individual classifiers. The soft-voting ensemble, on the other hand, predicts the final label by summing the probabilities predicted by the individual classifiers and taking the class label with the largest sum. Let $K$ be the number of individual classifiers and $C$ be the number of classes (four in our case).
2.1.1. Hard-voting or max-voting approach
Let $e_j=(e_{j1},e_{j2},\ldots,e_{jK})$ be the vector of classification labels from each of the $K$ classifiers for the $j$th test sample, where $e_{jk}$ represents the label obtained by the $k$th classifier. The final classification label $l_j$ is determined as follows:
$$l_j = \operatorname{mode}\{e_{j1}, e_{j2}, \ldots, e_{jK}\}.$$
2.1.2. Soft-voting or probability-based approach
Let the probabilities assigned by the individual classifiers to the $C$ classes for the $j$th sample be
$$p_{jk} = (p_{jk1}, p_{jk2}, \ldots, p_{jkC}), \quad k = 1, \ldots, K,$$
where $p_{jkc}$ is the probability assigned to class $c$ by the $k$th classifier. Once the probability outputs for the $C$ classes by all $K$ classifiers have been calculated for the $j$th sample, the final prediction label $l_j$ is determined by the probability-based fusion method as follows:
$$l_j = \arg\max_{c \in \{1,\ldots,C\}} \sum_{k=1}^{K} p_{jkc}.$$
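As a concrete illustration of both fusion rules, the sketch below implements them directly over classifier outputs and shows the scikit-learn equivalent; the estimator choices and settings are placeholders rather than our tuned configurations, and the mode computation assumes SciPy ≥ 1.9:

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def hard_vote(labels: np.ndarray) -> np.ndarray:
    """labels: (n_samples, K) matrix of per-classifier predictions e_jk.
    Returns the mode of the K labels for each sample."""
    return stats.mode(labels, axis=1, keepdims=False).mode

def soft_vote(probas: np.ndarray) -> np.ndarray:
    """probas: (K, n_samples, C) stack of per-classifier class probabilities.
    Sums over the K classifiers and returns the argmax class per sample."""
    return probas.sum(axis=0).argmax(axis=1)

# scikit-learn equivalent: voting="hard" or "soft". SVC needs probability=True
# so that predict_proba is available for soft-voting.
ensemble = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("svm", SVC(probability=True))],
    voting="soft")
```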
3. Experimental Setup
3.1. The MRI data
The data used for the experiments in this study were collected from the ADNI (https://adni.loni.usc.edu/)38 and OASIS (https://www.oasis-brains.org)39 datasets. The ADNI is designed to develop clinical, imaging, genetic and biochemical biomarkers for the early detection and tracking of AD. We downloaded T1-weighted MRI images acquired with Magnetization Prepared RApid Gradient Echo (MPRAGE) from subjects of either gender aged between 50 and 65 years. The MPRAGE technique is used in MRI scanners to enhance anatomical image quality and the contrast between gray and white matter.40 The dataset includes 1056 MRI images from the axial plane in four categories (AD: 223, EMCI: 475, LMCI: 262 and CN: 96). Train and test samples were split in a 90:10 proportion, with 950 training and 106 test samples. This dataset represents the data-scarce situation in our study. All images were taken from the ADNI-1, ADNI-2 and ADNI-GO cohorts and downloaded in NIfTI format. The dimensions of the ADNI MRI images used in our experimentation were 218×192.
The OASIS dataset was divided into four categories based on the Clinical Dementia Rating (CDR) score: a CDR score of 0 denotes the absence of dementia, 0.5 very mild dementia, 1 mild dementia and 2 moderate AD. There were 3200, 2240, 896 and 64 images in the CDR-0, CDR-0.5, CDR-1.0 and CDR-2.0 classes, respectively, each with dimensions of 176×176. The Synthetic Minority Oversampling Technique (SMOTE) was applied to generate synthetic samples for the minority classes, resulting in a dataset with comparable numbers of samples in each class: specifically, 2704, 2674, 2708 and 2666 samples in the CDR-0, CDR-0.5, CDR-1.0 and CDR-2.0 classes, respectively. The dataset was further split into a 75:25 train–test split.
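A minimal sketch of this balancing and splitting step is shown below; the feature matrix and label vector are synthetic placeholders that mimic the class imbalance at a smaller scale, and the random seeds are illustrative:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the flattened OASIS images; the class
# counts roughly mimic the 3200/2240/896/64 imbalance at one-tenth scale.
rng = np.random.default_rng(0)
X = rng.random((640, 64))
y = np.repeat([0, 1, 2, 3], [320, 224, 90, 6])

# SMOTE synthesizes minority-class samples by interpolating between a sample
# and its nearest same-class neighbours; k_neighbors must be smaller than the
# size of the rarest class.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)

# 75:25 train-test split, as used in the OASIS experiments.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.25, stratify=y_res, random_state=42)
```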
Testing algorithms on two different datasets, one balanced (OASIS) and one imbalanced (ADNI), has several advantages. In many real-world scenarios, data are imbalanced; testing an algorithm on an imbalanced dataset can reveal whether it performs poorly on specific classes, whereas testing on a balanced dataset gives an idea of overall accuracy. Together, testing on both balanced and imbalanced datasets provides a more comprehensive evaluation of a model’s effectiveness and accuracy, which ultimately helps determine the best-performing model for real-world scenarios.
3.2. Preprocessing of MRI data
The MRI images were preprocessed using the FMRIB Software Library (FSL) toolset,41 which consists of various analytical tools for MRI data.
The MRI images underwent four preprocessing steps: (i) reorientation, (ii) registration, (iii) skull-stripping and (iv) histogram equalization. In the first step, the images were reoriented to the standard brain-atlas (MNI) space. The standard orientation is a left-to-right orientation in which the anterior side of the brain faces upward and the superior side faces forward. Reorientation ensures consistency across all images and the accuracy and reliability of subsequent processing, and it allows better interpretation and visualization of the images. In the second step, the reoriented MRI images were normalized to the MNI standard space by registering them to a standard brain template (the MNI template). The registration process aligns the images to a common coordinate system, allowing accurate comparisons between multiple scans of the same subject. After reorientation and registration, the MRI images had uniform voxel dimensions (90×108×90). Skull-stripping is the next phase of preprocessing. In this phase, also known as brain extraction or whole-brain segmentation, nonbrain tissues such as the skull, eyeballs and skin are removed from the brain MRI images; skull-stripping is mainly done to improve analytical accuracy and the interpretation of the images. The last step increases the contrast and visibility of features in the MRI images using histogram equalization. A sample output of these steps is shown in Fig. 2.
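A minimal sketch of these four steps is given below, assuming FSL is installed and on the PATH; the file names, the FLIRT options and the use of scikit-image for histogram equalization are illustrative choices, not a verbatim transcription of our pipeline:

```python
import subprocess
import nibabel as nib
from skimage import exposure

def preprocess(in_file: str, mni_template: str) -> None:
    # (i) Reorient to the standard (MNI) orientation.
    subprocess.run(["fslreorient2std", in_file, "reoriented.nii.gz"], check=True)
    # (ii) Register to the MNI template with FLIRT (linear registration).
    subprocess.run(["flirt", "-in", "reoriented.nii.gz", "-ref", mni_template,
                    "-out", "registered.nii.gz"], check=True)
    # (iii) Skull-stripping with BET (Brain Extraction Tool).
    subprocess.run(["bet", "registered.nii.gz", "brain.nii.gz"], check=True)
    # (iv) Histogram equalization of the voxel intensities.
    img = nib.load("brain.nii.gz")
    equalized = exposure.equalize_hist(img.get_fdata())
    nib.save(nib.Nifti1Image(equalized, img.affine), "preprocessed.nii.gz")
```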
3.3. Metrics
Accuracy, sensitivity, specificity, area under the ROC curve (AUC) and standard deviation were used as performance metrics in the evaluation of the ML and DL models.35 The standard deviation serves as a measure of the variability or spread of the performance metrics across different runs or folds of the datasets: a lower standard deviation indicates more stable and consistent performance, while a higher standard deviation suggests greater variability. All reported values are averages obtained with the one-versus-all strategy.
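For clarity, the sketch below shows how the one-versus-all sensitivity and specificity can be derived from the multiclass confusion matrix; macro-averaging over the four classes is our assumption about how the averages are formed:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def one_vs_all_metrics(y_true, y_pred, n_classes=4):
    """Macro-averaged sensitivity and specificity with one-versus-all."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    sens, spec = [], []
    for c in range(n_classes):
        tp = cm[c, c]
        fn = cm[c, :].sum() - tp      # class-c samples predicted elsewhere
        fp = cm[:, c].sum() - tp      # other samples predicted as class c
        tn = cm.sum() - tp - fn - fp
        sens.append(tp / (tp + fn))   # sensitivity = 1 - FNR
        spec.append(tn / (tn + fp))   # specificity = 1 - FPR
    return float(np.mean(sens)), float(np.mean(spec))
```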
3.4. Implementation
The ML models were implemented with the Python library scikit-learn,42 and the Keras package from TensorFlow was used for the DL models. The overall performance evaluation was done on a standalone local machine running Ubuntu Linux with a 3.2-GHz CPU and 32 GB of RAM, further enhanced with an NVIDIA GeForce RTX 3060 GPU.
4. Experimental Results
This section demonstrates the performance of 10 conventional ML techniques, used to assess the top-performing models before creating an ensemble classifier. Five-fold cross-validation with a grid-search approach was used to determine the optimal hyperparameter values for each ML method; the reported results for each algorithm thus reflect its optimal hyperparameter values.
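A minimal sketch of this tuning step is given below; the data are synthetic placeholders standing in for the VGG-16 feature vectors, and the SVM grid shown is illustrative rather than the exact grid tuned per classifier:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the VGG-16 feature vectors and labels.
rng = np.random.default_rng(0)
X_train = rng.random((200, 128))
y_train = rng.integers(0, 4, 200)

# Illustrative SVM hyperparameter grid; actual grids differ per classifier.
param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(probability=True), param_grid,
                      cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))
```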
4.1. Experiment 1: Results based on ML classifiers
The initial set of experiments was performed using the VGG-16 feature extractor on the imbalanced ADNI and balanced OASIS datasets, and the results are, respectively, tabulated in Tables 1 and 2. From the given scores, we can observe that some models performed much better than others. For example, on the ADNI data, SVM, XGBoost, KNN and LR all achieved accuracy scores above 0.98, with GBoost close behind (0.9714). On the other hand, K-Means clustering, AdaBoost and DT achieved very low accuracies.
Table 1. Performance of ML classifiers with VGG-16 features on the imbalanced ADNI dataset.

| Method | Accuracy | Specificity | Sensitivity | FNR | FPR | AUC | Std. dev. |
|---|---|---|---|---|---|---|---|
| AdaBoost | 0.7714 | 0.9088 | 0.7618 | 0.2381 | 0.0912 | 0.7700 | 0.0219 |
| GBoost | 0.9714 | 0.9893 | 0.9798 | 0.0202 | 0.0106 | 0.9934 | 0.0134 |
| XGBoost | 0.9809 | 0.9925 | 0.9803 | 0.0196 | 0.0074 | 0.9954 | 0.0223 |
| RF | 0.8917 | 0.9650 | 0.8965 | 0.1034 | 0.0346 | 0.9814 | 0.0132 |
| DT | 0.7142 | 0.8872 | 0.6869 | 0.3130 | 0.1127 | 0.7870 | 0.0257 |
| SVM | 0.9904 | 0.9967 | 0.9895 | 0.0104 | 0.0032 | 0.9987 | 0.0004 |
| KNN | 0.9809 | 0.9931 | 0.9803 | 0.0196 | 0.0068 | 0.9878 | 0.0039 |
| K-Means | 0.4840 | 0.7791 | 0.3306 | 0.6693 | 0.2208 | — | 0.0134 |
| LR | 0.9809 | 0.9925 | 0.9803 | 0.0196 | 0.0074 | 0.9974 | 0.0039 |
| Naive Bayes (NB) | 0.7809 | 0.9183 | 0.7728 | 0.2271 | 0.0816 | 0.8456 | 0.0337 |
Table 2. Performance of ML classifiers with VGG-16 features on the balanced OASIS dataset.

| Method | Accuracy | Specificity | Sensitivity | FNR | FPR | AUC | Std. dev. |
|---|---|---|---|---|---|---|---|
| AdaBoost | 0.6648 | 0.8883 | 0.6650 | 0.3349 | 0.1116 | 0.8146 | 0.0137 |
| GBoost | 0.8593 | 0.9531 | 0.8599 | 0.1400 | 0.0468 | 0.9683 | 0.0266 |
| XGBoost | 0.8902 | 0.9633 | 0.8905 | 0.1094 | 0.0366 | 0.9904 | 0.0198 |
| RF | 0.8917 | 0.9653 | 0.8919 | 0.1080 | 0.0361 | 0.8917 | 0.0132 |
| DT | 0.6898 | 0.8965 | 0.6904 | 0.3095 | 0.1034 | 0.7935 | 0.0903 |
| SVM | 0.9132 | 0.9107 | 0.9135 | 0.0864 | 0.0893 | 0.9913 | 0.0042 |
| KNN | 0.9371 | 0.9790 | 0.9377 | 0.0622 | 0.0209 | 0.9919 | 0.0028 |
| K-Means | 0.3000 | 0.7664 | 0.2991 | 0.7008 | 0.2335 | — | 0.0158 |
| LR | 0.9070 | 0.9690 | 0.9074 | 0.0925 | 0.0309 | 0.9830 | 0.0057 |
| NB | 0.6765 | 0.8919 | 0.6745 | 0.3250 | 0.1080 | 0.7982 | 0.0065 |
Some critical observations from this independent testing are as follows:
(1) SVM is well suited for handling high-dimensional data when the number of observations is small compared to the number of features. SVM can employ kernel functions to transform nonlinear data into a higher-dimensional space, where a linear boundary can separate the classes in the transformed domain.43 SVM also generalizes well to new and unseen data.
(2) Based on the accuracy scores in Table 1, KNN achieves the same accuracy (0.9809) as XGBoost and LR, which is higher than that of most other models. On the OASIS dataset, KNN outperforms all other models (0.9371). KNN makes no assumptions about the data distribution and can thus adapt to complex interactions between input features and target variables. KNN also captures local patterns by assigning a new data point to the majority class among its K nearest neighbors in the training set.44
(3) K-Means clustering performed poorly on both the ADNI and OASIS datasets. One reason could be that outliers affect the calculation of cluster centroids and lead to incorrect grouping of data points. Furthermore, K-Means clustering is only optimal when the data are linearly separable, which is not the case for AD classification.
(4) RF performed relatively poorly (0.8917) on the ADNI dataset. This dataset, with its limited samples, may not have enough variance to establish meaningful splits.45 In addition, class imbalance biases predictions toward the majority class.
(5) Table 1 shows that LR and KNN have the same accuracy score. KNN has a slightly higher specificity score than LR, suggesting that it can better identify negative cases.44
(6) Logistic Regression performed well on ADNI compared to OASIS. The model may have favored the majority class of the imbalanced ADNI dataset, resulting in high accuracy. On the larger, balanced OASIS dataset, the model may have performed better across all classes but with slightly lower overall accuracy.
(7) There could be several reasons for the lack of improvement in accuracy for the RF algorithm between the smaller ADNI and larger OASIS datasets. The larger dataset might contain noisy or irrelevant data that offsets the benefits of having more samples; if the data quality is poor, the Random Forest model may not be able to extract any additional useful information. Furthermore, if the data distribution in the larger dataset is similar to that of the smaller one, the RF model will behave similarly on both in terms of overall accuracy. The larger dataset may need to be more diverse, containing new or different examples, to help the model generalize better and improve its accuracy.
(8) XGBoost consistently obtained high accuracy on both datasets. This could be because XGBoost has built-in regularization techniques, such as L1- and L2-regularization, which help prevent overfitting and improve generalization performance.46 These regularization techniques can be essential on smaller datasets, where overfitting is a common problem. Also, XGBoost is designed to be scalable and can handle large datasets efficiently.46
(9) We used the DT model as AdaBoost’s base classifier with the same hyperparameter values as in the standalone Decision Tree classifier. This approach can lead to a more accurate classifier than a single decision tree, because the AdaBoost model learns by assigning greater weights to samples that are difficult to classify correctly, which helps the subsequent decision trees focus on these samples and improve the overall accuracy.
(10) The Naive Bayes (NB) algorithm was the next worst-performing algorithm after the K-Means classifier. The NB algorithm assumes that all features are independent of each other, which may not hold in reality.47 Also, the NB classifier assumes linear decision boundaries, which is not the case for AD classification.
(11) The KNN and SVM algorithms exhibited low standard deviations on both the ADNI and OASIS datasets, showing consistent and predictable performance, which is important for medical data research.
The performance measures of the five best-performing algorithms, separately using VGG-16 features (top) and direct raw-pixel data (bottom), are shown in Fig. 4. By comparing the performance of classifiers trained on raw pixels versus VGG-16 features, we can assess a model’s suitability for real-time deployment, which involves factors such as speed, scalability, robustness and overall performance. This experiment was conducted on the OASIS dataset only, as the top classifiers already achieve near-perfect accuracy on ADNI. The raw images in OASIS measure 176×176 pixels. When using direct pixel values as features, each training sample is vectorized and stored in a large matrix of size M×30,976, where M is the number of training samples. In contrast, the pretrained VGG-16 encoder transforms the images into an (M×12,800)-dimensional feature space.
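The sketch below illustrates the two feature spaces; the channel replication used to feed grayscale slices to the three-channel VGG-16 input is our assumption about the conversion, and `extract_features` refers to the sketch in Sec. 2:

```python
import numpy as np

# Hypothetical grayscale OASIS slices of shape (M, 176, 176).
M = 32
slices = np.random.rand(M, 176, 176).astype("float32")

raw_features = slices.reshape(M, -1)                  # (M, 30976) raw-pixel vectors

# VGG-16 expects three channels, so each grayscale slice is replicated
# channel-wise before feature extraction.
rgb = np.repeat(slices[..., np.newaxis], 3, axis=-1)  # (M, 176, 176, 3)
vgg_features = extract_features(rgb)                  # (M, 12800)
```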

Fig. 4. VGG16 vs Raw Pixels: Different metrics for four top-performing algorithms on the OASIS dataset.
It is clearly evident that the best-performing KNN algorithm in the transformed VGG-16 space (0.9371) suffered badly when direct raw-pixel values were used (0.8757). This could be due to multiple reasons. As the dimensionality of the feature space increases, the sparsity of the data points also increases. This can make it harder for the KNN algorithm to accurately classify the data points, as the distances between any two points in the high-dimensional space become more and more similar. It is also possible that the higher-dimensional feature space contains redundant or irrelevant features, which may decrease the KNN algorithm’s accuracy. On the other hand, the features extracted from the VGG-16 network, pretrained on a large dataset, have already learnt to recognize high-level structures such as edges, shapes and textures. Using these features, which have nearly 60% fewer dimensions, as input to a classifier can reduce the computational burden and improve classification accuracy.
Performance trends of the other algorithms (e.g. XGBoost, SVM, LR and RF) are comparable in both domains across all metrics. It can be observed that although the accuracy of XGBoost is lower, its specificity and AUC are higher. It is important to note that in multiclass classification problems, specificity is calculated for each class separately; a high specificity value for one class does not necessarily mean the model performs well for the other classes. Furthermore, while accuracy, specificity and sensitivity fluctuate across SVM, KNN, LR and XGBoost, the AUC is comparable for these algorithms. This is because the AUC measures the classifier’s overall performance across all classes, calibrated with optimal hyperparameter values; it considers the ranking of positive and negative instances across all classes and thresholds. Metrics such as accuracy, specificity and sensitivity, on the other hand, may vary widely across classes due to differences in class distribution or classification threshold.
Considering the high dimensionality of raw-pixel features and the associated computational burden, the VGG-16 feature extractor is the better choice. Hence, the rest of our experiments involving ML classifiers utilize the VGG-16 feature space.
4.2. Experiment 2: Results based on ML-based ensemble classifiers
For this experiment, we considered the four top-performing algorithms from the previous experiment: KNN, SVM, LR and XGBoost. Both probability-fusion and max-voting ensembling were tested on the OASIS dataset, since these top-performing methods individually already attained the highest accuracies on the ADNI dataset. We tried different combinations of two, three and four classifiers, and the results are tabulated in Table 3.
Table 3. Performance of ML-based ensemble classifiers with hard- and soft-voting on the OASIS dataset.

| No. of models | Ensemble | Type | Accuracy | Specificity | Sensitivity | FNR | FPR |
|---|---|---|---|---|---|---|---|
| 2 | KNN, LR | Hard | 0.9121 | 0.9706 | 0.9118 | 0.0881 | 0.0293 |
| 2 | KNN, LR | Soft | 0.9589 | 0.9863 | 0.9593 | 0.0406 | 0.0136 |
| 2 | KNN, XGBoost | Hard | 0.9046 | 0.9681 | 0.9043 | 0.0956 | 0.0318 |
| 2 | KNN, XGBoost | Soft | 0.9574 | 0.9858 | 0.9578 | 0.0421 | 0.0141 |
| 2 | KNN, SVM | Hard | 0.9089 | 0.9695 | 0.9087 | 0.0912 | 0.0304 |
| 2 | KNN, SVM | Soft | 0.9652 | 0.9884 | 0.9655 | 0.0344 | 0.0115 |
| 2 | LR, XGBoost | Hard | 0.8933 | 0.9643 | 0.8927 | 0.1072 | 0.0356 |
| 2 | LR, XGBoost | Soft | 0.9179 | 0.9726 | 0.9184 | 0.0815 | 0.0273 |
| 2 | LR, SVM | Hard | 0.9054 | 0.9684 | 0.9053 | 0.0946 | 0.0315 |
| 2 | LR, SVM | Soft | 0.9125 | 0.9708 | 0.9128 | 0.0871 | 0.0291 |
| 2 | XGBoost, SVM | Hard | 0.8933 | 0.9643 | 0.8927 | 0.1072 | 0.0356 |
| 2 | XGBoost, SVM | Soft | 0.9281 | 0.9760 | 0.9284 | 0.0715 | 0.0239 |
| 3 | KNN, SVM, LR | Hard | 0.9250 | 0.9750 | 0.9254 | 0.0745 | 0.0249 |
| 3 | KNN, SVM, LR | Soft | 0.9468 | 0.9822 | 0.9472 | 0.0527 | 0.0177 |
| 3 | KNN, SVM, XGBoost | Hard | 0.9367 | 0.9789 | 0.9371 | 0.0628 | 0.0210 |
| 3 | KNN, SVM, XGBoost | Soft | 0.9613 | 0.9870 | 0.9615 | 0.0384 | 0.0129 |
| 3 | KNN, XGBoost, LR | Hard | 0.9332 | 0.9777 | 0.9335 | 0.0664 | 0.0222 |
| 3 | KNN, XGBoost, LR | Soft | 0.9554 | 0.9851 | 0.9557 | 0.0442 | 0.0148 |
| 3 | SVM, XGBoost, LR | Hard | 0.9183 | 0.9727 | 0.9187 | 0.0812 | 0.0272 |
| 3 | SVM, XGBoost, LR | Soft | 0.9226 | 0.9742 | 0.9230 | 0.0769 | 0.0257 |
| 4 | KNN, SVM, XGBoost, LR | Hard | 0.9285 | 0.9761 | 0.9284 | 0.0715 | 0.0238 |
| 4 | KNN, SVM, XGBoost, LR | Soft | 0.9464 | 0.9821 | 0.9468 | 0.0531 | 0.0178 |
The ensemble of KNN and SVM resulted in an accuracy of 0.9652, a nearly 3% increase over the best-performing KNN model (0.9371). This improvement could be because KNN and SVM are fundamentally different algorithms with different strengths and weaknesses. KNN is a nonparametric algorithm that makes predictions based on the closest neighbors in the training data, while the SVM is a parametric algorithm that finds the best hyperplane to separate the classes. When the predictions of KNN and SVM are combined, their strengths complement each other, and their weaknesses are mitigated. For example, KNN may better identify local patterns in the data, while SVM may better handle high-dimensional data or data with complex decision boundaries. The ensemble of KNN and SVM can correct errors made by each model as these algorithms are diverse. Diversity is a critical factor in the success of ensembles because it allows the models to capture different aspects of the data and reduce the risk of overfitting.
It can also be noted from Table 3 that increasing the number of classifiers did not increase the overall accuracy. Ensemble methods can help improve the performance of ML models, but there are limits to how much improvement can be gained by adding more classifiers to the ensemble. Ensemble methods work best when the individual models are diverse and provide complementary information about the data; if the additional classifiers are similar to the existing models or provide conflicting information, they may not add much value. In general, we can conclude that adding more classifiers to an ensemble is beneficial only up to a certain point, beyond which the marginal benefits are outweighed by the costs of increased model complexity and possible overfitting.
It can also be observed that the probability-based ensembling is consistently performing better than the max-voting-based ensemble. In a probability-based ensemble, the final prediction is based on the confidence levels assigned by each classifier. On the other hand, in a max-voting-based ensemble, the final prediction is based on the mode of the individual classifier predictions.
Soft-voting can be helpful in cases where some classifiers are more confident in their predictions than others: the probability-based ensemble can improve accuracy by weighting the predictions according to their confidence levels. Further, such ensembles can be more robust under uncertainty in the data, such as ambiguity or difficulty in the classification; a soft-voting ensemble can take this uncertainty into account and make a more informed prediction based on the probabilities assigned by the individual classifiers. The rest of our experiments report the accuracy of ensembles using the probability-based (soft-voting) approach.
The confusion matrices of the five top-performing algorithms are shown in Fig. 5, along with that of the ensemble. The ensemble approach showed robustness in correctly classifying the MCI (CDR-1.0) and AD (CDR-2.0) cases, with accuracies of 0.92 and 0.94, respectively, a distinctive capability compared to the individual classifiers: the SVM achieved accuracies of 0.84 and 0.83 in distinguishing MCI and AD cases, while the KNN classifier obtained accuracies of 0.88 and 0.86 on the same classes. This ability of the ensemble classifier to distinguish between MCI and AD cases is vital, as early detection of MCI provides an opportunity for early intervention to prevent or delay the onset of AD.

Fig. 5. Confusion matrices: XGBoost (0.8902), KNN (0.9371), SVM (0.9132), LR (0.907), RF (0.8917) and soft ensemble classifier (0.9652).
4.3. Experiment 3: Results based on DL classifiers
This study also investigates the performance of several pretrained neural network models from the ImageNet challenge, namely ResNet50, ResNet101, ResNet152, InceptionV3, InceptionResNetV2 and EfficientNetB0, on the task of classifying MRI datasets obtained from OASIS and ADNI. The input samples were converted to 3D space for architectural compatibility. We removed the top layer and retained the ImageNet weights during our experiments with pretrained models. The fine-tuning layer can be expressed as follows: DO(0.5)–Flatten–BN–2048N–BN–DO(0.5)–1024N–BN–DO(0.5)–4N, where DO(x) represents a dropout layer with a probability of x, BN indicates Batch Normalization and fN implies a fully connected layer with f neurons. The final output layer comprises four softmax-activated neurons for four AD classes.
In this experiment, we evaluate both pretrained models and a CNN trained from scratch, and compare their performance to provide insights into the effectiveness of these approaches for our task. The architecture of the non-pretrained CNN can be represented as: 16C2–16C2–MP2–32C2–32C2–32C2–MP2–64C2–16C1–Flatten–4N. The Adam optimizer with a learning rate of 0.0001 was used with a batch size of 128. The number of epochs was controlled by the Keras EarlyStopping callback. The results obtained for the ADNI and OASIS datasets are shown in Tables 4 and 5, respectively.
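As an illustration, a sketch of the described fine-tuning head attached to one of the evaluated backbones (ResNet50 here) is given below; the ReLU activations in the dense layers, the loss function and the EarlyStopping settings are assumptions not specified in the text:

```python
from tensorflow.keras import Sequential, layers, optimizers, callbacks
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False, input_shape=(176, 176, 3))

# DO(0.5)-Flatten-BN-2048N-BN-DO(0.5)-1024N-BN-DO(0.5)-4N head from the text.
model = Sequential([
    base,
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.BatchNormalization(),
    layers.Dense(2048, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(1024, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),  # four AD classes
])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=128, epochs=200, callbacks=[early_stop])
```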
Table 4. Performance of DL classifiers on the ADNI dataset.

| Method | Test acc. | Train acc. | Val. acc. | Specificity | Sensitivity | FNR | FPR | AUC |
|---|---|---|---|---|---|---|---|---|
| ResNet50 | 0.9890 | 0.9950 | 0.9989 | 0.9926 | 0.9803 | 0.0197 | 0.0074 | 0.9864 |
| ResNet101 | 0.9620 | 0.9910 | 0.9911 | 0.9863 | 0.9695 | 0.0305 | 0.0137 | 0.9922 |
| ResNet152 | 0.9809 | 0.9920 | 0.9924 | 0.9926 | 0.9803 | 0.0197 | 0.0074 | 0.9899 |
| InceptionV3 | 0.9050 | 0.9990 | 0.9850 | 0.9682 | 0.9215 | 0.0785 | 0.0318 | 0.9912 |
| InceptionResNetV2 | 0.9620 | 0.9980 | 0.9960 | 0.9841 | 0.9398 | 0.0602 | 0.0159 | 0.9917 |
| EfficientNetB0 | 0.4380 | 0.6830 | 0.6907 | 0.7500 | 0.2500 | 0.7500 | 0.2500 | 0.5468 |
| Non-pretrained CNN | 0.9809 | 1.0000 | 0.9905 | 0.9925 | 0.9791 | 0.0208 | 0.0074 | 0.9999 |
Table 5. Performance of DL classifiers on the OASIS dataset.

| Method | Test acc. | Train acc. | Val. acc. | Specificity | Sensitivity | FNR | FPR | AUC |
|---|---|---|---|---|---|---|---|---|
| ResNet50 | 0.8310 | 0.9690 | 0.9775 | 0.9433 | 0.8299 | 0.1701 | 0.0567 | 0.9641 |
| ResNet101 | 0.8730 | 0.9830 | 0.9805 | 0.9575 | 0.8729 | 0.1271 | 0.0425 | 0.9734 |
| ResNet152 | 0.8380 | 0.9720 | 0.9718 | 0.9461 | 0.8386 | 0.1664 | 0.0539 | 0.9628 |
| InceptionV3 | 0.9140 | 0.9980 | 0.9899 | 0.9713 | 0.9139 | 0.0861 | 0.0287 | 0.9871 |
| InceptionResNetV2 | 0.9000 | 0.9920 | 0.9869 | 0.9666 | 0.9003 | 0.0997 | 0.0334 | 0.9832 |
| EfficientNetB0 | 0.2590 | 0.5060 | 0.5077 | 0.7500 | 0.2500 | 0.7500 | 0.2500 | 0.4826 |
| Non-pretrained CNN | 0.9429 | 1.0000 | 0.9429 | 0.9810 | 0.9434 | 0.0565 | 0.0189 | 0.9817 |
The observations based on this experiment are discussed below:
(1) The Transfer Learning models converged at an earlier stage on the ADNI dataset than on the OASIS dataset, where the models required 200 epochs to complete training. There could be several reasons for this early convergence. Training on OASIS used 8064 samples (75% of the 10,752 total samples), whereas training on ADNI used 950 samples (90% of the 1056 total samples). On the smaller ADNI dataset, the validation loss may stop falling and begin to rise as the number of epochs grows, indicating overfitting and triggering early stopping. In addition, the ADNI dataset may have more noise in the training data, making it harder for the model to learn patterns and generalize to new data, again causing increased validation loss and early termination.
(2) The EfficientNetB0 model performed poorly compared to the ResNet models on the MRI data. EfficientNetB0’s complex architecture with many parameters may have caused overfitting, especially with a limited dataset size. In contrast, the simpler architecture of the ResNet models facilitated better generalization on the MRI data. The pretrained weights used in transfer learning also played a role, with the ResNet models capturing relevant features better than EfficientNetB0.
(3) InceptionV3 demonstrates superior accuracy on the OASIS dataset but not on the ADNI dataset. This discrepancy can be attributed to several factors. InceptionV3 is a deep and complex model with numerous parameters; when trained on the smaller ADNI dataset, there is a higher risk of overfitting, wherein the model memorizes the training set instead of generalizing to new data.
(4) The difference between the testing and training accuracies, i.e. the generalization gap, is smaller on the ADNI dataset than on the OASIS dataset. A larger dataset contains more diverse samples, which can make it harder for the model to generalize to new samples. Overall, dataset size is a pivotal factor affecting the generalization gap.
(5) ResNet101 performed better on the larger OASIS dataset but worse on the smaller ADNI dataset compared to the other ResNet models used in this study. Deeper and more complex models like ResNet101 are typically more effective with larger datasets, while on smaller datasets they may overfit due to limited data. In such cases, a ResNet50 model with fewer layers might prevent overfitting and capture the key properties more effectively.
(6) InceptionResNetV2 performs well on OASIS but not on ADNI. Like InceptionV3, InceptionResNetV2 has many parameters and can overfit on smaller datasets. On the ADNI dataset, data imbalance and lack of diversity might also impair accuracy.
(7) The CNN trained from scratch performs strongly on both the larger OASIS and the smaller ADNI datasets. In pretrained models, the features learnt from other datasets may not be relevant to the current dataset, leading to poorer performance.
4.4. Experiment 4: Results based on ensemble DL classifiers
This subsection reports the performance of ensemble DL classifiers built from four CNN architectures, including both Transfer Learning models and the CNN trained from scratch (see Table 6 for details). Combining a Transfer Learning architecture with the CNN trained from scratch can improve the ensemble classifier’s accuracy. However, adding more DL classifiers did not significantly improve accuracy and may not be worth the computational cost. The highest accuracy achieved with four DL classifiers was similar to that of an ML ensemble with only two classifiers (see Table 3). It is important to highlight that our results resonate with the widely recognized fact that DL algorithms tend to require large datasets for optimal performance (see Table 5).30,48 Notably, our findings remain consistent even when implementing transfer learning with popular pretrained models.49
Table 6. Performance of ensemble DL classifiers.

| No. of models | Models | Accuracy | Specificity | Sensitivity | FNR | FPR |
|---|---|---|---|---|---|---|
| 2 | CNN, InceptionV3 | 0.9535 | 0.9845 | 0.9539 | 0.0154 | 0.0460 |
| 2 | CNN, InceptionResNetV2 | 0.9476 | 0.9825 | 0.9481 | 0.0174 | 0.0518 |
| 2 | CNN, ResNet101 | 0.9460 | 0.9820 | 0.9464 | 0.0179 | 0.0535 |
| 2 | InceptionV3, InceptionResNetV2 | 0.9363 | 0.9787 | 0.9363 | 0.0217 | 0.0636 |
| 2 | InceptionV3, ResNet101 | 0.9281 | 0.9760 | 0.9280 | 0.0239 | 0.0719 |
| 2 | InceptionResNetV2, ResNet101 | 0.9234 | 0.9744 | 0.9234 | 0.0255 | 0.0765 |
| 3 | CNN, InceptionV3, InceptionResNetV2 | 0.9644 | 0.9881 | 0.9646 | 0.0118 | 0.0353 |
| 3 | CNN, InceptionV3, ResNet101 | 0.9574 | 0.9857 | 0.9574 | 0.0142 | 0.0425 |
| 3 | CNN, InceptionResNetV2, ResNet101 | 0.9539 | 0.9846 | 0.9540 | 0.0153 | 0.0459 |
| 3 | InceptionV3, InceptionResNetV2, ResNet101 | 0.9492 | 0.9830 | 0.9491 | 0.0169 | 0.0500 |
| 4 | CNN, InceptionV3, InceptionResNetV2, ResNet101 | 0.9656 | 0.9657 | 0.9657 | 0.0114 | 0.0342 |
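For reference, a minimal sketch of the soft-voting fusion used for the DL ensembles is shown below; averaging the softmax outputs is equivalent to summing them for the purpose of the argmax:

```python
import numpy as np

def dl_soft_ensemble(models, X):
    """Average the softmax outputs of the trained Keras models and pick the
    class with the largest fused probability."""
    fused = np.mean([m.predict(X, verbose=0) for m in models], axis=0)  # (M, 4)
    return fused.argmax(axis=1)
```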
In conclusion, our findings indicate that a small ensemble of diverse ML classifiers can achieve superior accuracy compared to a larger ensemble of DL classifiers (see Tables 3 and 6). Based on these experiments, we can safely conclude that ML classifiers, whether used individually or as ensembles, are the preferred choice for data-scarce scenarios, which are commonly encountered in the medical field. The final choice between ML and DL can therefore depend on the specific details of the problem and the available resources.
5. Current Challenges and Future Avenues
While our study has made a significant contribution to the existing knowledge base on AD classification, there are still several limitations that must be addressed. In this section, we will examine these limitations and discuss potential avenues for future research.
(1) MRI preprocessing: Validating MRI preprocessing is crucial for ML and DL tasks, as it directly impacts model training and testing. Preprocessing enhances data quality by improving image resolution and contrast and reducing noise. Neglecting to validate the preprocessed images can result in inaccurate or irrelevant features, leading to decreased model performance. Comparing original and preprocessed images helps identify inconsistencies, allowing assessment of preprocessing’s impact on model accuracy. Hence, ensuring dependable preprocessing can lead to improved model outcomes.
(2) MRI images from the axial plane only: Relying solely on axial-plane MRI images for AD prediction can negatively impact ML and DL training and testing in several ways. First, it may exclude relevant information from other planes, leading to inaccurate analysis. Since AD can affect various brain regions, considering images only from the axial plane limits a comprehensive understanding of the disease. By considering MRI images from multiple planes, researchers can achieve more accurate and generalizable results in AD analysis.
(3) Use of neuroimaging data only: Using solely neuroimaging data for ML and DL processing of AD can limit model performance. The complexity of the disease extends beyond the brain, affecting various aspects of a person’s lifestyle and overall health, and ignoring these factors can lead to partial or incorrect diagnoses or predictions. Hence, it is important to incorporate additional biomarkers that provide a more comprehensive picture of the patient, such as clinical and demographic data. By combining multiple data sources, models can capture complex interactions between different factors, improving prediction accuracy.
(4) Single data modality: Neurodegenerative diseases like AD can affect the brain in complex and varied ways, and different data modalities capture different aspects of the disease. Using only a single data modality might limit the model’s ability to capture the full extent of the disease’s impact on the brain. For instance, combining structural MRI, functional MRI and the molecular changes associated with AD can provide a more complete and accurate picture for ML and DL processing.
(5) Black box models: The use of black box models in AD-related ML and DL processing poses challenges for interpreting results and understanding the underlying reasons.50,51 Explainable AI (XAI) facilitates understanding and visualization of the decision-making process of black box models, enhancing interpretability and bias detection.
(6) Different subsets of ADNI: Varied demographics, imaging protocols and image qualities across subsets contribute to dataset heterogeneity and can affect model performance. Additionally, model instability arises as the model learns from different subsets, reducing reliability and interpretability. Addressing these issues involves carefully selecting subsets that represent the overall population with consistent demographics.
(7) VGG-16 feature space: Relying solely on the VGG-16 feature space for ML processing of AD has several implications for training and testing; in particular, it can lead to the loss of other important features relevant to the diagnosis of AD, resulting in decreased performance. In future work, one can study the impact of different Transfer Learning-based ConvNet feature extractors on the AD classification task.
(8) Model overfitting: DL models have a large hyperparameter space. For example, CNNs contain hyperparameters for the number of convolutional layers, the size and number of filters, the size of the pooling layer, the learning rate, the regularization parameters and many other variables. Fine-tuning DL models with optimized hyperparameters is a challenging task, but it will likely yield the best results.
6. Conclusion
In recent years, there has been a surge of interest in deep learning methods, which are often portrayed as the solution to all complex classification problems. Our study aimed to provide insights into the performance of machine learning relative to deep learning methods in the Alzheimer’s disease classification task, where data are often limited and complex. Specifically, by proposing an ML-based ensemble classifier that showed an improvement of 3–7% in accuracy over individual classifier methods, we demonstrated the potential of ML approaches for improving performance on complex AD data. Furthermore, our study compared the performance of ML and DL methods by implementing 10 widely used ML algorithms and evaluating them against DL models.
In summary, our results demonstrated that a small ensemble of diverse ML classifiers outperforms a larger ensemble of deep learning classifiers in terms of accuracy. These experiments led us to the confident conclusion that ML classifiers, whether employed singly or in ensembles, represent the preferred option for situations characterized by limited data, a frequent occurrence in medical contexts. The ultimate decision between ML and DL should depend on the specific complexities of the problem at hand and the available resources.
While improvements in classification accuracy seem encouraging from an academic perspective, the practical significance of clinical and epidemiological implications should also be considered. In a clinical context, even slight enhancements in AD diagnosis accuracy hold the potential to usher in earlier interventions and personalized treatment strategies, which can significantly improve the quality of life for patients and their families. From an epidemiological perspective, our research contributes to the ongoing discourse surrounding AD diagnosis, offering insights that can inform population-level studies and the broader understanding of AD progression. These findings underscore the value of our work beyond the realm of ML, reinforcing its relevance and potential to impact clinical practice and public health.
Acknowledgments
This work is funded by the Ministry of Higher Education, Research and Innovation (MoHERI) of the Sultanate of Oman under the Block Funding Program (Grant No. MoHERI/BFP/UoTAS/01/2021) and by the UKRI through the Horizon Europe Guarantee Scheme (project number: 10078953) for the European Commission funded PHASE IV AI project (Grant Agreement No. 101095384) under the Horizon Europe Programme. Vimbi Viswan is supported by the Internal Research Grant (No.IRG/2024/Call-7/8) of UTAS. The source code of this work can be found at: https://github.com/snoushath/AII2022.git.
ORCID
Noushath Shaffi https://orcid.org/0000-0001-9243-8402
Karthikeyan Subramanian https://orcid.org/0000-0002-5086-1170
Viswan Vimbi https://orcid.org/0009-0005-4065-4492
Faizal Hajamohideen https://orcid.org/0000-0003-4402-8294
Abdelhamid Abdesselam https://orcid.org/0000-0003-2950-2875
Mufti Mahmud https://orcid.org/0000-0002-2037-8348