The marine predator algorithm (MPA) is a recent metaheuristic algorithm proposed in 2020. It has an outstanding optimum-seeking capability but still suffers from slow convergence and is prone to falling into local optima. To tackle these problems, this paper proposes the flexible adaptive MPA: building on the MPA, a flexible adaptive model is introduced and applied to each of the three stages of population iteration. Experiments on nine benchmark test functions of varying dimensionality show that the flexible adaptive MPA achieves faster convergence, higher convergence accuracy, and excellent robustness. Finally, the flexible adaptive MPA is applied to feature selection. Experimental results on 10 commonly used UCI high-dimensional datasets and three wind turbine (WT) fault datasets show that the flexible adaptive MPA can effectively extract the key features of high-dimensional datasets, reduce data dimensionality, and improve the effectiveness of machine learning algorithms for WT fault diagnosis (FD).
The Internet of Medical Things (IoMT) refers to interconnected medical systems and devices that gather and transfer healthcare information for several medical applications. Smart healthcare leverages IoMT technology to improve patient diagnosis, monitoring, and treatment, providing efficient and personalized healthcare services. Privacy-preserving Federated Learning (PPFL) is a privacy-enhancing method that allows collaborative model training across distributed data sources while ensuring privacy protection and keeping the data decentralized. In the field of smart healthcare, PPFL enables healthcare professionals to train machine learning algorithms jointly on their respective datasets without sharing sensitive data, thereby maintaining confidentiality. Within this framework, anomaly detection involves detecting unusual events or patterns in healthcare data, such as unexpected changes or irregular vital signs in patient behavior, that can indicate security breaches or potential health issues in the IoMT system. Smart healthcare systems could enhance patient care while protecting data confidentiality and individual privacy by incorporating PPFL with anomaly detection techniques. Therefore, this study develops a Privacy-preserving Federated Learning with Blockchain-based Smart Healthcare System (PPFL-BCSHS) technique in the IoMT environment. The purpose of the PPFL-BCSHS technique is to secure IoMT devices via the detection of abnormal activities using FL concepts. Besides, BC technology is applied for the secure transmission of medical data among the IoMT devices. The PPFL-BCSHS technique employs FL to train the model for the identification of abnormal patterns. For anomaly detection, the PPFL-BCSHS technique follows three major processes, namely Mountain Gazelle Optimization (MGO)-based feature selection, Bidirectional Gated Recurrent Unit (BiGRU)-based classification, and Sand Cat Swarm Optimization (SCSO)-based hyperparameter tuning.
A series of simulations was conducted to examine the performance of the PPFL-BCSHS method. The empirical analysis highlighted that the PPFL-BCSHS method achieves improved security over other approaches under various measures.
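The abstract does not detail the aggregation scheme used for the federated training. A minimal sketch of federated averaging (FedAvg), a common choice for this kind of collaborative training, illustrates the key privacy property: clients share only locally updated model weights, never their raw IoMT data. The toy least-squares model and all names here are illustrative assumptions, not part of the PPFL-BCSHS technique itself.

```python
def local_update(weights, data, lr=0.1):
    # One epoch of gradient descent on a client's private data
    # (least-squares loss on (x, y) pairs; a stand-in for the real model).
    w = list(weights)
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def fed_avg(client_datasets, n_rounds=20, dim=2):
    # Server keeps the global model; each round, clients train locally
    # and the server averages their weights, weighted by dataset size.
    global_w = [0.0] * dim
    for _ in range(n_rounds):
        updates = [local_update(global_w, d) for d in client_datasets]
        sizes = [len(d) for d in client_datasets]
        total = sum(sizes)
        global_w = [
            sum(u[i] * s for u, s in zip(updates, sizes)) / total
            for i in range(dim)
        ]
    return global_w
```

In a PPFL setting this basic scheme is further hardened, e.g. with secure aggregation or differential privacy, so that individual client updates are also protected.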
In the presence of premature atrial contraction (PAC), premature ventricular contraction (PVC), or other ectopic beats, RR intervals (RRIs) may be disturbed, which can cause other types of heart disease to be misdiagnosed as atrial fibrillation (AF). In this study, a low-complexity AF detection method based on short ECG recordings is proposed, which includes RRI modification and feature selection. The extracted RRIs are used to determine whether potential RRI interference exists and to correct it. Next, based on the modified RRIs, features are evaluated and selected using the correlation criterion, the Fisher criterion, and the minimum redundancy maximum relevance criterion. Finally, the selected features are classified by an artificial neural network (ANN). The algorithm is validated on a test set comprising 2332 AF, 313 normal (NOR), 239 atrioventricular block (IAVB), 81 left bundle branch block (LBBB), 624 right bundle branch block (RBBB), 426 PAC, and 564 PVC recordings. Compared with previous RRI-based AF detection methods, the proposed method achieved an overall sensitivity of 94.04% and an overall specificity of 86.74%. The specificity on the subset containing only AF and NOR reaches 99.04%. Meanwhile, the overall false-positive rate (FPR) for PAC and PVC is reduced by 9.19%. While maintaining accuracy, this method effectively reduces the probability of PVC and PAC being misdiagnosed as AF, making it an automatic AF detection method suitable for inter-patient clinical short-term ECG.
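Of the three feature-evaluation criteria mentioned, the Fisher criterion is the easiest to make concrete: it scores a feature by the scatter of class means relative to the scatter within classes. This is the textbook Fisher score, not the paper's exact implementation; the data in the usage example are illustrative.

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    # Fisher criterion for one feature: between-class scatter of the
    # class means divided by the within-class scatter (class variances),
    # both weighted by class size. Higher = more discriminative.
    classes = sorted(set(labels))
    overall = mean(values)
    between = 0.0
    within = 0.0
    for c in classes:
        vc = [v for v, l in zip(values, labels) if l == c]
        between += len(vc) * (mean(vc) - overall) ** 2
        within += len(vc) * pvariance(vc)
    return between / within if within > 0 else float("inf")
```

A feature whose values cluster tightly per class and far apart between classes scores high; a feature whose class distributions overlap scores near zero.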
This study presents a methodology for detecting Parkinson’s disease using a neuro-fuzzy system (NFS) with feature selection. From the full set of 22 features, the five most discriminative features were selected using neural networks with weighted fuzzy membership functions (NEWFMs) supporting the nonoverlapping region method (NORM). NORM eliminates the worst features, yielding a minimized feature set in which each fuzzy membership function remains interpretable. With all 22 features as input, the NEWFM achieved a sensitivity, specificity, and accuracy of 87.43%, 96.43%, and 88.72%, respectively. With only the five selected features, it achieved a sensitivity, specificity, and accuracy of 95.24%, 85.42%, and 92.82%, respectively.
Cancer is a complex disease that cannot be diagnosed reliably using single-gene expression analysis alone. Gene-set analysis on high-throughput gene expression profiling, controlled for various environmental factors, is a technique commonly adopted by the cancer research community. This work develops a comprehensive gene expression analysis tool (gene-set activity toolbox, GAT) that is implemented with a data retriever, traditional data pre-processing, several gene-set analysis methods, network visualization, and data mining tools. The gene-set analysis methods are used to identify subsets of phenotype-relevant genes that are then used to build a classification model. To evaluate GAT's performance, we performed a cross-dataset validation study on three common cancers, namely colorectal, breast, and lung cancer. The results show that GAT can be used to build a reasonable disease diagnostic model and that the predicted markers have biological relevance. GAT can be accessed at http://gat.sit.kmutt.ac.th, where GAT’s Java library for gene-set analysis, simple classification tools, and a database with three cancer benchmark datasets can be downloaded.
Detection of somatic mutations in whole-exome sequencing data can help elucidate the mechanism of tumor progression. Most computational approaches require exome sequencing of both tumor and normal samples. However, it is more common to sequence exomes of tumor samples only, without paired normal samples. To include these types of data in extensive studies of tumorigenesis, it is necessary to develop an approach for identifying somatic mutations using tumor exome sequencing data alone. In this study, we designed a machine learning approach using a Deep Neural Network (DNN) and XGBoost to identify somatic mutations in tumor-only exome sequencing data, and we integrated it into a pipeline called DNN-Boost. The XGBoost algorithm is used to extract features from the results of variant callers, and these features are then fed into the DNN model as input. XGBoost also handles missing values and mitigates overfitting. We evaluated the proposed model and compared its performance with existing benchmark methods, noting that the DNN-Boost classification model outperformed the benchmark methods in classifying somatic mutations from both paired tumor-normal and tumor-only exome data.
Software Fault Prediction (SFP) is one of the most active research areas of software engineering. SFP carried out within the same software project is known as within-project fault prediction. However, local data repositories are often insufficient to build a within-project fault prediction model. The idea of cross-project fault prediction (CPFP) has been suggested in recent years, which aims to train a prediction model on one project and use it to predict faults in another. However, CPFP requires that both the training and testing datasets use the same set of metrics, so traditional CPFP approaches are difficult to apply to projects with diverse metric sets. A special case of CPFP is Heterogeneous Fault Prediction (HFP), which allows faults to be predicted across projects with diverse metrics. The proposed framework aims to build an HFP model by applying feature selection to both the source and target datasets and training an efficient prediction model using supervised machine learning techniques. Our approach is applied to two open-source projects, Linux and MySQL, and prediction is evaluated using the Area Under Curve (AUC) performance measure. The key results of the proposed approach are as follows: it yields significantly better prediction performance for heterogeneous projects than cross-project prediction, and it demonstrates that feature selection combined with feature mapping has a significant effect on HFP models. Non-parametric statistical analyses, namely the Friedman and Nemenyi post-hoc tests, show that Logistic Regression performed significantly better than the other supervised learning algorithms in HFP models.
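The abstract does not specify how features are mapped between heterogeneous metric sets. One approach common in the HFP literature is to pair source and target metrics whose value distributions look similar; the sketch below follows that assumption using the two-sample Kolmogorov–Smirnov statistic. All names, the greedy pairing strategy, and the threshold are illustrative, not the paper's exact method.

```python
def ks_stat(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    # the two empirical CDFs; small values mean similar distributions.
    a, b = sorted(a), sorted(b)
    all_vals = sorted(set(a + b))
    def ecdf(s, x):
        return sum(v <= x for v in s) / len(s)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in all_vals)

def match_metrics(source, target, max_ks=0.3):
    # Greedy feature mapping: pair each source metric with the unused
    # target metric whose distribution it most resembles, keeping only
    # pairs whose KS distance is below a threshold (illustrative only).
    pairs = []
    used = set()
    for s_name, s_vals in source.items():
        best = min(
            ((ks_stat(s_vals, t_vals), t_name)
             for t_name, t_vals in target.items() if t_name not in used),
            default=(1.0, None))
        if best[1] is not None and best[0] <= max_ks:
            pairs.append((s_name, best[1]))
            used.add(best[1])
    return pairs
```

Once metrics are paired this way, a model trained on the source project's selected features can be applied to the target project's mapped features despite the differing metric sets.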
Feature selection has become a powerful dimensionality reduction strategy and an effective tool for handling high-dimensional data. It aims to reduce the dimension of the feature space, thereby speeding up the learning model and reducing its cost, by selecting the feature subset most relevant to data mining and machine learning tasks. The selection of an optimal feature subset is an optimization problem that has been proven NP-hard. Metaheuristics are traditionally used to deal with NP-hard problems, since they are well known for solving complex, real-world problems in a reasonable amount of time. The genetic algorithm (GA) is one of the most popular metaheuristics and has proven effective for accurate feature selection. However, in the last few decades, data have become progressively larger in both number of instances and number of features, a paradigm popularly termed Big Data. With the tremendous growth of dataset sizes, most current feature selection algorithms, and especially GA, become unscalable. To improve the scalability of a feature selection algorithm on big data, distributed computing strategies such as the MapReduce model and the Hadoop system are typically adopted. In this paper, we first review the most recent works on applying parallel genetic algorithms to large datasets. We then propose a new parallel genetic algorithm based on the coarse-grained parallelization model (island model). The parallelization of the process and the partitioning of the data are performed using the Hadoop system on an Amazon cluster. The performance and scalability of the proposed method were theoretically and empirically compared to existing feature selection methods on large-scale datasets, and the results confirm the effectiveness of our proposed method.
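The coarse-grained (island) model can be sketched without the Hadoop distribution layer: islands evolve independently and periodically exchange their best individuals in a ring. This is an illustrative single-process simulation, not the paper's MapReduce implementation; the binary-mask encoding, operators, parameters, and fitness function are all assumptions.

```python
import random

def evolve_island(pop, fitness, n_gens=10, mut_rate=0.1):
    # Standard GA loop on one island: tournament selection,
    # uniform crossover, and bit-flip mutation on binary feature masks.
    for _ in range(n_gens):
        new_pop = []
        for _ in range(len(pop)):
            a, b = random.sample(pop, 2)
            parent1 = max(a, b, key=fitness)
            c, d = random.sample(pop, 2)
            parent2 = max(c, d, key=fitness)
            child = [p1 if random.random() < 0.5 else p2
                     for p1, p2 in zip(parent1, parent2)]
            child = [1 - g if random.random() < mut_rate else g
                     for g in child]
            new_pop.append(child)
        pop = new_pop
    return pop

def island_ga(fitness, n_islands=4, pop_size=10, n_feats=8,
              n_epochs=5, n_migrants=2):
    # Coarse-grained (island) model: islands evolve independently and
    # periodically send their best individuals to the next island,
    # replacing its worst members (ring migration topology).
    random.seed(0)
    islands = [[[random.randint(0, 1) for _ in range(n_feats)]
                for _ in range(pop_size)] for _ in range(n_islands)]
    for _ in range(n_epochs):
        islands = [evolve_island(pop, fitness) for pop in islands]
        for i, pop in enumerate(islands):
            best = sorted(pop, key=fitness, reverse=True)[:n_migrants]
            nxt = islands[(i + 1) % n_islands]
            nxt.sort(key=fitness)
            nxt[:n_migrants] = [list(b) for b in best]
    return max((ind for pop in islands for ind in pop), key=fitness)
```

In the distributed setting each island maps naturally onto one worker node, with migration implemented as message passing between nodes, which is what makes this model attractive for Hadoop-style clusters.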
Many efforts have already been devoted to education mining, and many techniques and models have been developed to predict and identify students’ performance, learning behavior, status, and education level. However, there is no single best solution for student result prediction, since it is affected by the level of the student, the field of study, the location of data collection, and the size and nature of the data. Prior research shows that there can be up to a 10% difference in accuracy between results obtained with and without a feature selection process. Thus, this work designs a better model for student result prediction using feature selection and deep learning techniques. The proposed dissertation compares and analyzes the Correlation-based Feature Selection (CFS), Chi-Square (χ2), Genetic Algorithm (GA), Information Gain (IG), Maximum Relevance Minimum Redundancy (mRMR), ReliefF, and Recursive Feature Elimination (RFE) feature selection techniques with the Classification and Regression Tree (CART), Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) machine learning algorithms. In the proposed model, feature selection using CFS and prediction using a CNN is recommended. The recommended model (CFS–CNN) is tested on a primary dataset collected from bachelor-level students and provides improved performance compared to existing techniques. The major contribution of this dissertation is the design of a better model for predicting students’ results using demographic data and past examination results.
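Among the compared techniques, CFS is the one the model recommends. Its merit function rewards features that correlate with the class while penalizing correlation among the features themselves. Below is a minimal sketch of the standard CFS merit (Hall's formulation) over Pearson correlations; the data layout and names are illustrative, not taken from the dissertation.

```python
from math import sqrt

def pearson(xs, ys):
    # Pearson correlation coefficient between two numeric vectors.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def cfs_merit(subset, features, target):
    # CFS merit of a feature subset of size k:
    #   k * avg(|feature-class corr|) /
    #   sqrt(k + k*(k-1) * avg(|feature-feature corr|))
    # High merit = features relevant to the class and non-redundant.
    k = len(subset)
    rcf = sum(abs(pearson(features[i], target)) for i in subset) / k
    if k == 1:
        return rcf
    pairs = [(i, j) for i in subset for j in subset if i < j]
    rff = sum(abs(pearson(features[i], features[j]))
              for i, j in pairs) / len(pairs)
    return k * rcf / sqrt(k + k * (k - 1) * rff)
```

A search strategy (e.g. greedy forward selection) then picks the subset with the highest merit, which is the subset fed to the downstream classifier.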
The accurate prediction of a cancer patient’s risk of progression or death can guide clinicians in the selection of treatment and help patients in planning personal affairs. Predictive models based on patient-level data represent a tool for determining risk. Ideally, predictive models will use multiple sources of data (e.g., clinical, demographic, molecular, etc.). However, there are many challenges associated with data integration, such as overfitting and redundant features. In this paper we aim to address those challenges through the development of a novel feature selection and feature reduction framework that can handle correlated data. Our method begins by computing a survival distance score for gene expression, which, in combination with a score for clinical independence, results in the selection of highly predictive genes that are non-redundant with clinical features. The survival distance score is a measure of variation of gene expression over time, weighted by the variance of the gene expression over all patients. Selected genes, in combination with clinical data, are used to build a predictive model for survival. We benchmark our approach against commonly used methods, namely lasso- and ridge-penalized Cox proportional hazards models, using three publicly available cancer data sets: kidney cancer (521 samples), lung cancer (454 samples), and bladder cancer (335 samples). Across all data sets, our approach, built on the training set, outperformed models using clinical data alone on the test set in terms of predictive power, with a c.Index of 0.773 vs 0.755 for kidney cancer, 0.695 vs 0.664 for lung cancer, and 0.648 vs 0.636 for bladder cancer.
Further, we were able to show increased predictive performance of our method compared to lasso-penalized models fit to both gene expression and clinical data, which had a c.Index of 0.767, 0.677, and 0.645, as well as increased or comparable predictive power compared to ridge models, which had a c.Index of 0.773, 0.668 and 0.650 for the kidney, lung, and bladder cancer data sets, respectively. Therefore, our score for clinical independence improves prognostic performance as compared to modeling approaches that do not consider combining non-redundant data. Future work will concentrate on optimizing the survival distance score in order to achieve improved results for all types of cancer.
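The c.Index reported above is Harrell's concordance index, the standard discrimination measure for survival models. The sketch below shows how it is computed from observed times, event indicators, and predicted risk scores; this is the standard definition, not code from the paper, and the tie-handling convention (half credit) is one common choice.

```python
def concordance_index(times, events, risks):
    # Harrell's c-index: among comparable patient pairs, the fraction
    # where the patient with the higher predicted risk actually had the
    # shorter survival time. times: observed follow-up times;
    # events: 1 = event observed, 0 = censored;
    # risks: predicted risk scores (higher = worse prognosis).
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if patient i's event is
            # observed and occurs before patient j's recorded time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties get half credit
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported improvements (e.g. 0.773 vs 0.755) in context.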
Autism Spectrum Disorder (ASD) is a complex neuropsychiatric condition with a highly heterogeneous phenotype. Following the work of Duda et al., which uses a reduced feature set from the Social Responsiveness Scale, Second Edition (SRS) to distinguish ASD from ADHD, we performed item-level question selection on answers to the SRS to determine whether ASD can be distinguished from non-ASD using a similarly small subset of questions. To explore feature redundancies between the SRS questions, we performed filter, wrapper, and embedded feature selection analyses. To explore the linearity of the SRS-related ASD phenotype, we then compressed the 65-question SRS into low-dimension representations using PCA, t-SNE, and a denoising autoencoder. We measured the performance of a multilayer perceptron (MLP) classifier with the top-ranking questions as input. Classification using only the top-rated question resulted in an AUC of over 92% for SRS-derived diagnoses and an AUC of over 83% for dataset-specific diagnoses. The high redundancy among features has implications for replacing the social behaviors targeted in behavioral diagnostics and interventions, where digital quantification of certain features may be obfuscated due to privacy concerns. We similarly evaluated the performance of an MLP classifier trained on the low-dimension representations of the SRS, finding that the denoising autoencoder achieved slightly higher performance than the PCA and t-SNE representations.
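Of the three compression methods compared, PCA is the simplest to sketch: a first-principal-component projection via power iteration on the covariance matrix, standing in for the low-dimensional SRS representations. This pure-Python sketch is illustrative (real use would rely on a linear-algebra library, and would keep more than one component); the data and names are assumptions.

```python
import random

def pca_compress(X, n_iters=200):
    # First principal component via power iteration on the covariance
    # matrix of the centered data, then projection of each row onto it
    # (a 1-D compression of the feature vectors).
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    cov = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n
            for b in range(d)] for a in range(d)]
    random.seed(1)
    v = [random.random() for _ in range(d)]  # random start vector
    for _ in range(n_iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # converges to the top eigenvector
    scores = [sum(Xc[i][j] * v[j] for j in range(d)) for i in range(n)]
    return scores, v
```

t-SNE and the denoising autoencoder replace this linear projection with nonlinear maps, which is why the autoencoder can capture structure that PCA misses.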
Genome-Wide Association Studies (GWAS) aim to find Single Nucleotide Polymorphisms (SNPs) that are associated with a phenotype of interest. GWAS are known to suffer from the large dimensionality of the data relative to the number of available samples. Other limiting factors include the dependency between SNPs, due to linkage disequilibrium (LD), and the need to account for population structure, that is, confounding due to genetic ancestry.
We propose an efficient approach for the multivariate analysis of multi-population GWAS data based on a multitask group Lasso formulation. Each task corresponds to a subpopulation of the data, and each group to an LD-block. This formulation alleviates the curse of dimensionality, and makes it possible to identify disease LD-blocks shared across populations/tasks, as well as some that are specific to one population/task. In addition, we use stability selection to increase the robustness of our approach. Finally, gap safe screening rules speed up computations enough that our method can run at a genome-wide scale.
To our knowledge, this is the first framework for GWAS on diverse populations combining feature selection at the level of LD-groups, a multitask approach to address population structure, stability selection, and safe screening rules. We show that our approach outperforms state-of-the-art methods on both simulated and real-world cancer datasets.
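One standard way to write a multitask group Lasso of the kind described, with tasks indexed by t (one per subpopulation) and groups by g (one per LD-block), is the following. This is a generic formulation given for orientation, not necessarily the authors' exact objective:

```latex
\min_{W = (w^{(1)}, \dots, w^{(T)})}
\; \sum_{t=1}^{T} \frac{1}{2 n_t} \left\| y^{(t)} - X^{(t)} w^{(t)} \right\|_2^2
\; + \; \lambda \sum_{g \in \mathcal{G}} \sqrt{\sum_{t=1}^{T} \left\| w^{(t)}_g \right\|_2^2}
```

Here X^(t), y^(t), and n_t are the genotypes, phenotypes, and sample size of subpopulation t, and w^(t)_g denotes the weights of task t restricted to LD-block g. The group norm couples each LD-block across tasks, so a block tends to be selected or discarded jointly for all populations, while the per-task weights within a selected block can still differ, capturing population-specific effects.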
DNA methylation has emerged as a promising epigenetic marker for disease diagnosis. Both the differential mean (DM) and differential variability (DV) of methylation have been shown to contribute to transcriptional aberration and disease pathogenesis. The presence of confounding factors in large-scale EWAS may affect the methylation values and hamper accurate marker discovery. In this paper, we propose a flexible framework called methylDMV which allows for confounding-factor adjustment and enables simultaneous characterization and identification of CpGs exhibiting DM only, DV only, and both DM and DV. The proposed framework also allows for prioritization and selection of candidate features to be included in the prediction algorithm. We illustrate the utility of methylDMV on several TCGA datasets. An R package methylDMV implementing the proposed method is available at http://www.ams.sunysb.edu/~pfkuan/softwares.html#methylDMV.
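The DM/DV distinction can be made concrete with a toy classifier for a single CpG site: compare group means for DM and the ratio of group variances for DV. The thresholds and the simple variance-ratio test here are illustrative simplifications, not the statistics used by methylDMV, and no confounder adjustment is shown.

```python
from statistics import mean, variance

def dm_dv(case_vals, ctrl_vals, mean_thr=0.1, var_ratio_thr=2.0):
    # Classify one CpG site by differential mean (DM) and differential
    # variability (DV) between case and control methylation values.
    # Thresholds are illustrative, not those of methylDMV.
    dm = abs(mean(case_vals) - mean(ctrl_vals)) > mean_thr
    v1, v2 = variance(case_vals), variance(ctrl_vals)
    dv = max(v1, v2) / min(v1, v2) > var_ratio_thr
    if dm and dv:
        return "DM+DV"
    if dm:
        return "DM only"
    if dv:
        return "DV only"
    return "neither"
```

A site whose case group is shifted but equally spread is "DM only"; one whose case group is centered the same but much more dispersed is "DV only", which is exactly the second signal methylDMV is designed to pick up.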