Android applications have recently come to play a vital part in everyday life, as many services are offered via mobile applications. Owing to its market dominance, Android is increasingly exposed to malicious software, and this threat continues to grow. The exponential growth of malicious Android apps has made it essential to develop cutting-edge methods for identifying them. Despite the prevalence of security-based approaches in the literature, feature selection (FS) methods for Android malware detection still need to be developed. In this research, we present an intelligent hyperparameter-tuned deep learning-based malware detection (IHPT-DLMD) method for distinguishing malicious Android apps from legitimate ones. Feature extraction and preliminary data processing are the main functions of the IHPT-DLMD method. The proposed technique first determines the significant permissions and API calls using a binary coyote optimization algorithm (BCOA)-based FS technique, which helps remove unnecessary features. A bidirectional long short-term memory (Bi-LSTM) model is then employed for the detection and classification of Android malware. Finally, the glowworm swarm optimization (GSO) algorithm is applied to optimize the hyperparameters of the Bi-LSTM model to produce effective outcomes for Android application classification. The IHPT-DLMD method is validated on a benchmark dataset and evaluated in several ways. The test results demonstrated overall higher performance of the IHPT-DLMD methodology in comparison to contemporary methods currently in use.
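To make the wrapper side of such a metaheuristic FS step concrete, the short Python sketch below scores a candidate binary feature mask by cross-validated accuracy plus a small sparsity bonus; the classifier, weighting, and fitness form are illustrative assumptions, not the IHPT-DLMD implementation.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def fitness(mask, X, y, alpha=0.99):
        """Score a binary feature mask: favor high accuracy and few retained features."""
        mask = np.asarray(mask, dtype=bool)
        if not mask.any():                      # an empty subset is invalid
            return 0.0
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        acc = cross_val_score(clf, X[:, mask], y, cv=5).mean()
        sparsity = 1.0 - mask.sum() / mask.size
        return alpha * acc + (1.0 - alpha) * sparsity

A binary optimizer such as BCOA would call this fitness for every candidate mask it generates and keep the best-scoring subset.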
The marine predator algorithm (MPA) is a recent metaheuristic algorithm, proposed in 2020, with an outstanding optimum-seeking capability, but it still suffers from slow convergence and is prone to local optima. To tackle these problems, this paper proposes the flexible adaptive MPA. Building on the MPA, a flexible adaptive model is proposed and applied to each of the three stages of population iteration. Using nine benchmark test functions with varying dimensions, the experimental results show that the flexible adaptive MPA has a faster convergence speed, more accurate convergence, and excellent robustness. Finally, the flexible adaptive MPA is applied to feature selection experiments. Experimental results on 10 commonly used UCI high-dimensional datasets and three wind turbine (WT) fault datasets show that the flexible adaptive MPA can effectively extract the key features of high-dimensional datasets, reduce data dimensionality, and improve the effectiveness of machine learning algorithms for WT fault diagnosis (FD).
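One detail such continuous optimizers need when applied to feature selection is a transfer function that turns real-valued predator positions into feature masks. The sketch below uses the common S-shaped (sigmoid) transfer rule as an assumption for illustration; the paper's exact binarization may differ.

    import numpy as np

    rng = np.random.default_rng(0)

    def to_mask(position):
        """S-shaped transfer: the larger a coordinate, the likelier its feature is kept."""
        prob = 1.0 / (1.0 + np.exp(-position))          # sigmoid per dimension
        return (rng.random(position.shape) < prob).astype(int)

    mask = to_mask(np.array([2.0, -1.5, 0.3, -3.0]))    # stochastic, e.g. [1, 0, 1, 0]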
The Internet of Medical Things (IoMT) refers to interconnected medical systems and devices that gather and transfer healthcare information for several medical applications. Smart healthcare leverages IoMT technology to improve patient diagnosis, monitoring, and treatment, providing efficient and personalized healthcare services. Privacy-preserving Federated Learning (PPFL) is a privacy-enhancing method that allows collaborative model training across distributed data sources while ensuring privacy protection and keeping the data decentralized. In the field of smart healthcare, PPFL enables healthcare professionals to train machine learning algorithms jointly on their respective datasets without sharing sensitive data, thereby maintaining confidentiality. Within this framework, anomaly detection involves detecting unusual events or patterns in healthcare data, such as unexpected changes or irregular vital signs in patient behavior, that can indicate security breaches or potential health issues in the IoMT system. Smart healthcare systems can enhance patient care while protecting data confidentiality and individual privacy by combining PPFL with anomaly detection techniques. Therefore, this study develops a Privacy-preserving Federated Learning with Blockchain-based Smart Healthcare System (PPFL-BCSHS) technique in the IoMT environment. The purpose of the PPFL-BCSHS technique is to secure IoMT devices through the detection of abnormal activities together with FL concepts. In addition, blockchain (BC) technology is applied for the secure transmission of medical data among the IoMT devices. The PPFL-BCSHS technique employs FL to train the model for the identification of abnormal patterns. For anomaly detection, the PPFL-BCSHS technique follows three major processes, namely Mountain Gazelle Optimization (MGO)-based feature selection, Bidirectional Gated Recurrent Unit (BiGRU)-based classification, and Sandcat Swarm Optimization (SCSO)-based hyperparameter tuning. A series of simulations was carried out to examine the performance of the PPFL-BCSHS method. The empirical analysis highlighted that the PPFL-BCSHS method obtains improved security over other approaches under various measures.
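The federated training step rests on aggregating locally trained models without moving raw patient data. A minimal sketch in the spirit of federated averaging is given below; the client weight lists, sample counts, and two-layer model are hypothetical, and the aggregation rule actually used by PPFL-BCSHS may differ.

    import numpy as np

    def federated_average(client_weights, client_sizes):
        """Aggregate per-layer weights, weighting each client by its local sample count."""
        total = float(sum(client_sizes))
        n_layers = len(client_weights[0])
        return [
            sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
            for layer in range(n_layers)
        ]

    # Two hypothetical clients, each holding two weight arrays (e.g. a BiGRU layer and a dense layer)
    c1 = [np.ones((3, 3)), np.zeros(3)]
    c2 = [np.zeros((3, 3)), np.ones(3)]
    global_weights = federated_average([c1, c2], client_sizes=[80, 20])

Each IoMT site would train locally on its own records, send only the weights to the aggregator, and receive the averaged global model back.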
Multimodal Sentiment Analysis (MSA) is a growing area of affective computing that involves analyzing data from three different modalities. Gathering data from Multimodal Sentiment analysis in Car Reviews (MuSe-CaR) is challenging due to data imbalance across modalities. To address this, an effective data augmentation approach is proposed that combines dynamic synthetic minority oversampling with a multimodal elicitation conditional generative adversarial network for emotion recognition using audio, text, and visual data. The balanced data are then fed into a granular elastic-net regression with a hybrid feature selection method based on dandelion Fick's law optimization to analyze sentiments. The selected features are input into a multilabel wavelet convolutional neural network to classify emotion states accurately. The proposed approach, implemented in Python, outperforms existing methods in terms of trustworthiness (0.695), arousal (0.723), and valence (0.6245) on the car review dataset. Additionally, the feature selection method achieves high accuracy (99.65%), recall (99.45%), and precision (99.66%). This demonstrates the effectiveness of the proposed MSA approach, even with three modalities of data.
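For the oversampling half of the augmentation step, the snippet below uses plain SMOTE from imbalanced-learn as a simplified stand-in; the paper's dynamic synthetic minority oversampling combined with a conditional GAN is more elaborate, so this only illustrates, on toy data, how a minority emotion class can be rebalanced before fusion.

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Toy imbalanced data standing in for per-modality feature vectors
    X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0)
    print(Counter(y))                          # heavily skewed class counts

    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_bal))                      # classes now equal in size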
In the presence of premature atrial contractions (PAC), premature ventricular contractions (PVC), or other ectopic beats, RR intervals (RRIs) may be disturbed, which results in other types of heart disease being misdiagnosed as atrial fibrillation (AF). In this study, a low-complexity AF detection method based on short ECG recordings is proposed, which includes RRI modification and feature selection. The extracted RRIs are used to determine whether potential RRI interference exists and to correct it. Next, based on the modified RRIs, features are evaluated and selected using the correlation criterion, the Fisher criterion, and the minimum redundancy maximum relevance criterion. Finally, the filtered features are classified by an artificial neural network (ANN). The algorithm is validated on a test set including 2332 AF, 313 normal (NOR), 239 atrioventricular block (IAVB), 81 left bundle branch block (LBBB), 624 right bundle branch block (RBBB), 426 PAC, and 564 PVC records. Compared with previous RRI-based AF detection methods, the proposed method achieved an overall sensitivity of 94.04% and an overall specificity of 86.74%. The specificity on the test set containing only AF and NOR is up to 99.04%. Meanwhile, the overall false-positive rate (FPR) for PAC and PVC is reduced by 9.19%. While ensuring accuracy, this method effectively reduces the probability of misdiagnosing PVC and PAC as AF. It is an automatic AF detection method suitable for inter-patient clinical short-term ECG.
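Of the three filter criteria listed, the Fisher criterion is the simplest to write down; the sketch below computes a generic two-class Fisher score on hypothetical RRI-derived features and ranks them, which illustrates the criterion rather than the paper's full selection procedure.

    import numpy as np

    def fisher_scores(X, y):
        """Fisher score per feature for a two-class problem (higher = more discriminative)."""
        X0, X1 = X[y == 0], X[y == 1]
        num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
        den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
        return num / den

    # Hypothetical RRI-derived features: rows = short ECG records, columns = features
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 8))
    y = rng.integers(0, 2, size=100)
    ranking = np.argsort(fisher_scores(X, y))[::-1]    # best features first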
This study presents a methodology for detecting Parkinson's disease using a neuro-fuzzy system (NFS) with feature selection. From all 22 features, the five most effective features were selected using neural networks with weighted fuzzy membership functions (NEWFMs), supported by the non-overlapping region method (NORM). NORM eliminates the worst features and can select a minimal feature set in which each feature constitutes an interpretable fuzzy membership function. With all 22 features as input, the NEWFM achieved a sensitivity, specificity, and accuracy of 87.43%, 96.43%, and 88.72%, respectively. With only the five selected features, the NEWFM achieved a sensitivity, specificity, and accuracy of 95.24%, 85.42%, and 92.82%, respectively.
Partially linear additive models (PLAMs) have attracted much attention in the statistical machine learning community due to their interpretability and flexibility in data-driven prediction and inference. Since the performance of PLAMs is closely related to the structural information of the linear and nonlinear components, several approaches have been proposed for regression estimation and data-driven structure discovery. However, the existing automatic discovery strategies are limited to the mean regression framework and are usually sensitive to non-Gaussian noise, e.g., skewed and heavy-tailed noise. To further improve the robustness of PLAMs, this paper proposes Robust Partially Linear Trend Filtering (RPLTF) for regression estimation and structure discovery by integrating a mode-induced error metric and trend filtering-based nonlinear approximation into regularized PLAMs. In addition to the computational algorithm for RPLTF, we establish a theoretical upper bound on its generalization error. Empirical examples are provided to validate the effectiveness of the proposed method.
Cancer is a complex disease that cannot be diagnosed reliably using single-gene expression analysis alone. Gene-set analysis of high-throughput gene expression profiles controlled by various environmental factors is a commonly adopted technique in the cancer research community. This work develops a comprehensive gene expression analysis tool, the gene-set activity toolbox (GAT), which is implemented with a data retriever, traditional data pre-processing, several gene-set analysis methods, network visualization, and data mining tools. The gene-set analysis methods are used to identify subsets of phenotype-relevant genes that are then used to build a classification model. To evaluate GAT's performance, we performed a cross-dataset validation study on three common cancers, namely colorectal, breast, and lung cancer. The results show that GAT can be used to build a reasonable disease diagnostic model and that the predicted markers have biological relevance. GAT can be accessed at http://gat.sit.kmutt.ac.th, where GAT's Java library for gene-set analysis, simple classification tools, and a database with three cancer benchmark datasets can be downloaded.
Detection of somatic mutations in whole-exome sequencing data can help elucidate the mechanisms of tumor progression. Most computational approaches require exome sequencing of both tumor and normal samples. However, it is more common to sequence exomes for tumor samples only, without the paired normal samples. To include these data in extensive studies of tumorigenesis, it is necessary to develop an approach for identifying somatic mutations using tumor-only exome sequencing data. In this study, we designed a machine learning approach using a Deep Neural Network (DNN) and XGBoost to identify somatic mutations in tumor-only exome sequencing data, and we integrated this into a pipeline called DNN-Boost. The XGBoost algorithm is used to extract features from the results of variant callers, and these features are then fed into the DNN model as input. The XGBoost algorithm also mitigates issues of missing values and overfitting. We evaluated our proposed model and compared its performance with existing benchmark methods. The DNN-Boost classification model outperformed the benchmark methods in classifying somatic mutations from both paired tumor-normal exome data and tumor-only exome data.
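The abstract leaves the XGBoost-to-DNN hand-off at a high level; one plausible reading, sketched below on hypothetical caller-derived features, is to keep the columns XGBoost ranks as most important and feed only those into a neural network. This is an assumption made for illustration, not the DNN-Boost implementation.

    import numpy as np
    from xgboost import XGBClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 40))                 # hypothetical variant-caller features
    y = (X[:, 0] + X[:, 3] > 0).astype(int)         # hypothetical somatic / non-somatic labels
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    booster = XGBClassifier(n_estimators=200).fit(X_tr, y_tr)
    top = np.argsort(booster.feature_importances_)[::-1][:15]   # 15 most important columns

    dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
    dnn.fit(X_tr[:, top], y_tr)
    print(dnn.score(X_te[:, top], y_te))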
Identifying valuable features from complex omics data is of great significance for disease diagnosis studies. This paper proposes a new feature selection algorithm based on sample networks (FS-SN) to mine important information from omics data. The sample network is constructed according to the sample neighbor relationships at the molecular (feature) expression level, and the discriminating ability of a feature is evaluated based on the topology of its sample network. A sample network built on a feature with strong discriminating ability tends to have many edges between samples of the same group and few edges between samples of different groups. At the same time, FS-SN removes redundant features according to the gravitational interaction between features. To validate FS-SN, it was compared on ten public datasets with ERGS, mRMR, ReliefF, ATSD-DN, and INDEED, which are efficient methods for omics data analysis. Experimental results show that FS-SN performed better than the compared methods in accuracy, sensitivity, and specificity in most cases. Hence, FS-SN, which makes use of the topology of the sample network, is effective for analyzing omics data: it can identify key features that reflect the occurrence and development of diseases and reveal the underlying biological mechanisms.
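To make the sample-network idea concrete, the sketch below builds a k-nearest-neighbour graph over samples from a single feature and scores that feature by the fraction of edges joining same-class samples; the neighbourhood rule and score are assumptions chosen for illustration, and FS-SN's exact construction and redundancy step may differ.

    import numpy as np
    from sklearn.neighbors import kneighbors_graph

    def feature_score(values, y, k=5):
        """Fraction of kNN-graph edges (built on one feature) that join same-class samples."""
        graph = kneighbors_graph(values.reshape(-1, 1), n_neighbors=k, mode="connectivity")
        rows, cols = graph.nonzero()
        return (y[rows] == y[cols]).mean()

    rng = np.random.default_rng(0)
    y = np.repeat([0, 1], 50)
    informative = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
    noisy = rng.normal(0, 1, 100)
    print(feature_score(informative, y), feature_score(noisy, y))   # informative scores higher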
Effectively reducing the dimensionality of big data while retaining its key information has been a research challenge. As an important step in data pre-processing, feature selection plays a critical role in reducing data size and increasing the overall value of the data. Many previous studies have focused on single-label feature selection; however, with the increasing variety of data types, the need for feature selection on multi-label data has also arisen. Unlike single-label data, multi-label data, with more combinations of classes, place higher demands on the capabilities of feature selection algorithms. In this paper, we propose a filter-based Multi-Objective Equilibrium Optimizer algorithm (MOEO-Smp) to solve the feature selection problem for both single-label and multi-label data. MOEO-Smp rates the optimization results of solutions and features based on four pairs of optimization principles, and builds three equilibrium pools to guide exploration and exploitation based on the total scores of solutions and features and on the ranking of objective fitness values, respectively. Seven UCI single-label datasets, two Mulan multi-label datasets, and one COVID-19 multi-label dataset are used to test the feature selection capability of MOEO-Smp, and the feature selection results are compared with those of 10 other state-of-the-art algorithms and evaluated using three and seven different metrics, respectively. Feature selection experiments and comparisons with results in the literature show that MOEO-Smp not only achieves the highest classification accuracy and excellent dimensionality reduction on single-label data, but also performs better on multi-label data in terms of Hamming loss, accuracy, dimensionality reduction, and so on.
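For readers unfamiliar with the multi-label metrics quoted above, the snippet below shows how Hamming loss and (subset) accuracy are computed for a plain one-vs-rest baseline on synthetic multi-label data; the data and classifier are illustrative and unrelated to MOEO-Smp itself.

    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, hamming_loss
    from sklearn.model_selection import train_test_split
    from sklearn.multioutput import MultiOutputClassifier

    X, Y = make_multilabel_classification(n_samples=400, n_features=30, n_classes=5, random_state=0)
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

    clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
    Y_pred = clf.predict(X_te)
    print(hamming_loss(Y_te, Y_pred))       # fraction of wrongly predicted labels
    print(accuracy_score(Y_te, Y_pred))     # subset accuracy: all labels of a sample must match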
Current healthcare applications commonly incorporate Internet of Things (IoT) and cloud computing concepts. IoT devices produce massive amounts of patient data in the healthcare industry. These data, stored in the cloud, are analyzed using mobile devices' built-in storage and processing power. The Internet of Medical Healthcare Things (IoMHT) integrates health monitoring components, including sensors and medical equipment, to remotely monitor patient records and thus provide more intelligent and sophisticated healthcare services. In this research, we address one of the deadliest illnesses with a high fatality rate worldwide, chronic kidney disease (CKD), and aim to provide the best possible healthcare services to users of e-health and m-health applications by presenting IoT- and cloud-based services built on a healthcare delivery system for the prediction and monitoring of CKD and its severity level. The suggested architecture gathers patient data from linked IoT devices and saves it in the cloud alongside real-time data, pertinent medical records collected from the UCI Machine Learning Repository, and relevant medical documents. We further use a Deep Neural Network (DNN) classifier to predict CKD and its severity. To boost the effectiveness of the DNN classifier, a Particle Swarm Optimization (PSO)-based feature selection technique is also applied. We compare the performance of the proposed model against different classifiers using several classification measures. A Quick Flower Pollination Algorithm (QFPA) and DNN-based IoT and cloud CKD diagnosis model is presented in this paper. The CKD diagnosis steps in the QFPA-DNN model involve data gathering, preparation, feature selection, and classification stages.
Software Fault Prediction (SFP) is an active research area of software engineering. Software fault prediction carried out within the same software project is known as within-project fault prediction. However, local data repositories are often insufficient to build within-project fault prediction models. The idea of cross-project fault prediction (CPFP) has been suggested in recent years, which aims to construct a prediction model on one project and use that model to predict faults in another project. However, CPFP requires that both the training and testing datasets use the same set of metrics. As a consequence, traditional CPFP approaches are challenging to apply to projects with diverse metric sets. A specific case of CPFP is Heterogeneous Fault Prediction (HFP), which allows faults to be predicted among projects with diverse metrics. The proposed framework aims to build an HFP model by applying feature selection to both the source and target datasets and constructing an efficient prediction model using supervised machine learning techniques. Our approach is applied to two open-source projects, Linux and MySQL, and prediction is evaluated using the Area Under Curve (AUC) performance measure. The key results of the proposed approach are as follows: it gives significantly better prediction performance for heterogeneous projects compared with cross projects, and it demonstrates that feature selection with feature mapping has a significant effect on HFP models. Non-parametric statistical analyses, namely the Friedman and Nemenyi post-hoc tests, are applied, demonstrating that Logistic Regression performed significantly better than other supervised learning algorithms in HFP models.
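As a toy illustration of the source/target setup, the sketch below selects the same number of top-ranked metrics from a hypothetical source project and a hypothetical target project with different metric sets, trains Logistic Regression on the source, and reports AUC on the target. With random synthetic data the AUC stays near chance; the selection rule and rank-based alignment are assumptions for illustration, not the paper's HFP framework.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.preprocessing import StandardScaler

    # Hypothetical source and target projects with different metric sets (20 vs. 35 metrics)
    Xs, ys = make_classification(n_samples=800, n_features=20, random_state=1)
    Xt, yt = make_classification(n_samples=400, n_features=35, random_state=2)

    k = 8                                                    # keep the same number of ranked metrics on both sides
    Xs_k = SelectKBest(f_classif, k=k).fit_transform(Xs, ys)
    Xt_k = SelectKBest(f_classif, k=k).fit_transform(Xt, yt)

    clf = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(Xs_k), ys)
    scores = clf.predict_proba(StandardScaler().fit_transform(Xt_k))[:, 1]
    print(roc_auc_score(yt, scores))                         # ~0.5 here because the toy datasets are unrelated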
Feature selection has become a powerful dimensionality reduction strategy and an effective tool for handling high-dimensional data. Feature selection aims to reduce the dimension of the feature space and to speed up and reduce the cost of the learning model by selecting the most relevant feature subset for data mining and machine learning tasks. The selection of an optimal feature subset is an optimization problem that has been proven to be NP-hard. Metaheuristics are traditionally used to deal with NP-hard problems, since they are well known for solving complex, real-world problems in a reasonable period of time. The genetic algorithm (GA) is one of the most popular metaheuristics and has proved effective for accurate feature selection. However, in the last few decades, data have become progressively larger in both the number of instances and the number of features; this paradigm is popularly termed Big Data. With the tremendous growth of dataset sizes, most current feature selection algorithms, and GA in particular, become unscalable. To improve the scalability of a feature selection algorithm on big data, a distributed computing strategy is usually adopted, such as the MapReduce model and the Hadoop system. In this paper, we first present a review of the most recent works that apply parallel genetic algorithms to large datasets. We then propose a new parallel genetic algorithm based on the coarse-grained parallelization (island) model. The parallelization of the process and the distribution of the data partitions are performed using the Hadoop system on an Amazon cluster. The performance and scalability of the proposed method were theoretically and empirically compared to existing feature selection methods when handling large-scale datasets, and the results confirm the effectiveness of our proposed method.
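As a toy analogue of the distributed evaluation that the Hadoop/island setup provides, the sketch below fans a GA population's fitness evaluations out across local worker processes in a map/reduce style; the placeholder fitness, bit-string encoding, and multiprocessing backend are illustrative assumptions, not the Hadoop/Amazon deployment described above.

    import numpy as np
    from multiprocessing import Pool

    def fitness(mask):
        """Placeholder fitness: reward small subsets (a real run would score a classifier)."""
        mask = np.asarray(mask)
        return 1.0 - mask.sum() / mask.size

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        population = [rng.integers(0, 2, size=1000) for _ in range(64)]   # 64 candidate feature masks
        with Pool(processes=4) as pool:                                   # "map": score masks in parallel
            scores = pool.map(fitness, population)
        best = population[int(np.argmax(scores))]                         # "reduce": keep the best mask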
Papillary thyroid carcinoma (PTC) is typically an indolent cancer, yet a minority of cases develop lymph node metastasis. Because the mechanisms of lymph node metastasis are unclear, a considerable number of patients undergo unnecessary surgeries. Currently, the identification of key genetic biomarkers in high-dimensional data presents a significant challenge, thereby limiting research progress in this area. Here, we propose a hybrid filter-wrapper feature selection strategy for core factor detection and develop MethyAE, a metastasis prediction model based on DNA methylation that utilizes an end-to-end learning auto-encoder. Forty-six methylated CpG sites were identified as crucial biomarkers for lymph node metastasis. Leveraging 447 PTC samples from The Cancer Genome Atlas (221 with metastasis, 226 without), the MethyAE model achieves 88.9% accuracy and a recall of 88.6% in predicting lymph node metastasis, outperforming commonly used machine learning methods such as logistic regression and random forest. Furthermore, the MethyAE model exhibits favorable performance on DNA methylation data from colon cancer, bladder cancer, and breast cancer. To the best of our knowledge, this is the first attempt to predict PTC lymph node metastasis from DNA methylation, offering pivotal decision-making criteria for avoiding unnecessary surgeries and selecting appropriate treatment plans for a substantial cohort of PTC patients.
Many efforts have already been made in education mining in the past. Many techniques and models have been developed to predict and identify students' performance, learning behavior, status, and education level. However, there is no single solution to student result prediction, since it is affected by the level of the student, the field of study, the location of the data collection, the different sizes and nature of the data, and so on. Different studies show that there can be up to a 10% difference in the accuracy of results with and without a feature selection process. Thus, the proposed work designs a better model for student result prediction using feature selection and deep learning techniques. The proposed work compares and analyzes the Correlation-based Feature Selection (CFS), Chi-Square (χ²), Genetic Algorithm (GA), Information Gain (IG), Maximum Relevance Minimum Redundancy (mRMR), ReliefF, and Recursive Feature Elimination (RFE) feature selection techniques with the Classification and Regression Tree (CART), Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) machine learning algorithms. In the proposed model, feature selection using CFS and prediction using a CNN are recommended. The recommended model (CFS-CNN) is tested on a primary dataset collected from bachelor-level students and provides improved performance compared to earlier techniques. The major contribution of this work is the design of a better model for the prediction of students' results using demographic data and past examination results.
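For the filter step in pipelines like the one above, the snippet below uses scikit-learn's chi-square selector on non-negative tabular features as a simplified stand-in (CFS itself is correlation-based and not bundled with scikit-learn, so the chi-square criterion here is an assumption chosen for brevity).

    from sklearn.datasets import load_digits
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_digits(return_X_y=True)          # chi2 requires non-negative feature values
    selector = SelectKBest(chi2, k=20).fit(X, y)
    X_reduced = selector.transform(X)            # keep the 20 highest-scoring columns
    print(X.shape, "->", X_reduced.shape)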
The accurate prediction of a cancer patient's risk of progression or death can guide clinicians in the selection of treatment and help patients in planning personal affairs. Predictive models based on patient-level data represent a tool for determining risk. Ideally, predictive models will use multiple sources of data (e.g., clinical, demographic, and molecular). However, there are many challenges associated with data integration, such as overfitting and redundant features. In this paper we aim to address those challenges through the development of a novel feature selection and feature reduction framework that can handle correlated data. Our method begins by computing a survival distance score for gene expression, which, in combination with a score for clinical independence, results in the selection of highly predictive genes that are non-redundant with clinical features. The survival distance score is a measure of variation of gene expression over time, weighted by the variance of the gene expression over all patients. Selected genes, in combination with clinical data, are used to build a predictive model for survival. We benchmark our approach against commonly used methods, namely lasso- and ridge-penalized Cox proportional hazards models, using three publicly available cancer data sets: kidney cancer (521 samples), lung cancer (454 samples), and bladder cancer (335 samples). Across all data sets, our approach built on the training set outperformed the clinical data alone in the test set in terms of predictive power, with a C-index of 0.773 vs 0.755 for kidney cancer, 0.695 vs 0.664 for lung cancer, and 0.648 vs 0.636 for bladder cancer. Further, we showed increased predictive performance of our method compared to lasso-penalized models fit to both gene expression and clinical data, which had C-indices of 0.767, 0.677, and 0.645, as well as increased or comparable predictive power compared to ridge models, which had C-indices of 0.773, 0.668, and 0.650 for the kidney, lung, and bladder cancer data sets, respectively. Therefore, our score for clinical independence improves prognostic performance as compared to modeling approaches that do not consider combining non-redundant data. Future work will concentrate on optimizing the survival distance score in order to achieve improved results for all types of cancer.
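The concordance index used above for benchmarking can be computed directly from predicted risk scores; the sketch below uses the lifelines utility on hypothetical survival times, event indicators, and risk scores, and is not the authors' pipeline.

    import numpy as np
    from lifelines.utils import concordance_index

    rng = np.random.default_rng(0)
    times = rng.exponential(scale=365, size=200)       # hypothetical survival times (days)
    events = rng.integers(0, 2, size=200)              # 1 = event observed, 0 = censored
    risk = -times + rng.normal(0, 50, size=200)        # hypothetical model risk scores

    # concordance_index expects scores that increase with survival time,
    # so the negative of the risk score is passed.
    print(concordance_index(times, -risk, events))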
Autism Spectrum Disorder (ASD) is a complex neuropsychiatric condition with a highly heterogeneous phenotype. Following the work of Duda et al., which uses a reduced feature set from the Social Responsiveness Scale, Second Edition (SRS) to distinguish ASD from ADHD, we performed item-level question selection on answers to the SRS to determine whether ASD can be distinguished from non-ASD using a similarly small subset of questions. To explore feature redundancies between the SRS questions, we performed filter, wrapper, and embedded feature selection analyses. To explore the linearity of the SRS-related ASD phenotype, we then compressed the 65-question SRS into low-dimension representations using PCA, t-SNE, and a denoising autoencoder. We measured the performance of a multilayer perceptron (MLP) classifier with the top-ranking questions as input. Classification using only the top-rated question resulted in an AUC of over 92% for SRS-derived diagnoses and an AUC of over 83% for dataset-specific diagnoses. The high redundancy of features has implications for replacing the social behaviors that are targeted in behavioral diagnostics and interventions, where digital quantification of certain features may be obfuscated due to privacy concerns. We similarly evaluated the performance of an MLP classifier trained on the low-dimension representations of the SRS, finding that the denoising autoencoder achieved slightly higher performance than the PCA and t-SNE representations.
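The low-dimension classification experiment can be mimicked with off-the-shelf components; the sketch below compresses hypothetical 65-item questionnaire answers with PCA and cross-validates an MLP on the compressed representation (the denoising autoencoder and the real SRS data are not reproduced here, and the labels are synthetic).

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.integers(0, 4, size=(600, 65)).astype(float)   # hypothetical 65-item answers (0-3)
    y = (X[:, :5].sum(axis=1) > 7).astype(int)             # hypothetical diagnosis labels

    model = make_pipeline(PCA(n_components=10),
                          MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0))
    print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())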
Genome-Wide Association Studies, or GWAS, aim at finding Single Nucleotide Polymorphisms (SNPs) that are associated with a phenotype of interest. GWAS are known to suffer from the large dimensionality of the data with respect to the number of available samples. Other limiting factors include the dependency between SNPs, due to linkage disequilibrium (LD), and the need to account for population structure, that is to say, confounding due to genetic ancestry.
We propose an efficient approach for the multivariate analysis of multi-population GWAS data based on a multitask group Lasso formulation. Each task corresponds to a subpopulation of the data, and each group to an LD-block. This formulation alleviates the curse of dimensionality, and makes it possible to identify disease LD-blocks shared across populations/tasks, as well as some that are specific to one population/task. In addition, we use stability selection to increase the robustness of our approach. Finally, gap safe screening rules speed up computations enough that our method can run at a genome-wide scale.
To our knowledge, this is the first framework for GWAS on diverse populations that combines feature selection at the LD-group level, a multitask approach to address population structure, stability selection, and safe screening rules. We show that our approach outperforms state-of-the-art methods on both a simulated dataset and a real-world cancer dataset.
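For reference, one standard way to write a multitask group Lasso of the kind described above, with tasks indexed by subpopulations and groups by LD-blocks, is the following; the exact loss, group weights, and penalty used in the paper may differ.

    \min_{W \in \mathbb{R}^{p \times T}} \; \sum_{t=1}^{T} \frac{1}{2 n_t} \left\| y_t - X_t w_t \right\|_2^2 \; + \; \lambda \sum_{g \in \mathcal{G}} \sqrt{|g|} \, \left\| W_{g} \right\|_F

Here X_t and y_t are the genotypes and phenotypes of subpopulation (task) t, w_t is the t-th column of W, each group g collects the SNPs of one LD-block, and the Frobenius norm over the block W_g (the rows of W indexed by g) couples the tasks so that an LD-block tends to be selected jointly across populations while still allowing task-specific effect sizes.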
DNA methylation has emerged as a promising epigenetic marker for disease diagnosis. Both the differential mean (DM) and differential variability (DV) in methylation have been shown to contribute to transcriptional aberration and disease pathogenesis. The presence of confounding factors in large-scale EWAS may affect the methylation values and hamper accurate marker discovery. In this paper, we propose a flexible framework called methylDMV which allows for the adjustment of confounding factors and enables simultaneous characterization and identification of CpGs exhibiting DM only, DV only, and both DM and DV. The proposed framework also allows for prioritization and selection of candidate features to be included in the prediction algorithm. We illustrate the utility of methylDMV on several TCGA datasets. An R package methylDMV implementing our proposed method is available at http://www.ams.sunysb.edu/~pfkuan/softwares.html#methylDMV.
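For intuition on what DM-only versus DV-only CpGs look like, the snippet below applies a plain Welch t-test (difference in means) and Levene's test (difference in variances) to hypothetical methylation values of one CpG in two groups; methylDMV's joint, confounder-adjusted model is considerably more sophisticated than this.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    tumor = rng.normal(loc=0.6, scale=0.15, size=80)    # hypothetical beta values: shifted mean and variance
    normal = rng.normal(loc=0.4, scale=0.05, size=80)

    t_stat, p_dm = stats.ttest_ind(tumor, normal, equal_var=False)   # differential mean (DM)
    w_stat, p_dv = stats.levene(tumor, normal)                       # differential variability (DV)
    print(p_dm, p_dv)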