Feature saliency estimation and feature selection are important tasks in machine learning applications. Filters, such as distance measures, are commonly used as an efficient means of estimating the saliency of individual features. However, feature rankings derived from different distance measures are frequently inconsistent, which raises reliability issues when the rankings are used for feature selection. This paper presents two novel consensus approaches to creating a more robust ranking. Our experimental results show that the consensus approaches can improve reliability over a range of feature parameterizations and various seabed texture classification tasks in sidescan sonar mosaic imagery.
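As a concrete illustration (not the paper's specific consensus schemes, which are not detailed in this abstract), a minimal mean-rank consensus over several filter score vectors can be sketched as follows; the filter scores here are synthetic placeholders.

```python
# A minimal sketch of rank aggregation for feature saliency (assumed mean-rank
# consensus; the paper's specific consensus schemes are not reproduced here).
import numpy as np

def rank_features(scores):
    """Return the rank of each feature (0 = most salient) from a score vector."""
    order = np.argsort(-scores)           # indices sorted by descending score
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks

def consensus_ranking(score_lists):
    """Aggregate several per-filter score vectors into one consensus ranking."""
    rank_matrix = np.vstack([rank_features(s) for s in score_lists])
    mean_ranks = rank_matrix.mean(axis=0)  # lower mean rank = more salient overall
    return np.argsort(mean_ranks)          # feature indices, best first

# toy example: three filters scoring five features
rng = np.random.default_rng(0)
filters = [rng.random(5) for _ in range(3)]
print(consensus_ranking(filters))
```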
The goal of feature selection is to find the optimal feature subset with respect to an evaluation function. Exhaustively searching all possible feature subsets incurs a high computational cost, while the alternative suboptimal methods are more efficient and practical but cannot guarantee globally optimal results. We propose a new feature selection algorithm based on distance discriminant and distribution overlapping (HFSDD) for continuous features, which overcomes the drawbacks of both the exhaustive search approaches and the suboptimal methods. The proposed method is able to find the optimal feature subset without exhaustive search or a branch-and-bound algorithm. The most difficult problem in optimal feature selection, the search problem, is converted into a feature ranking problem, supported by a rigorous theoretical proof, so that the computational complexity can be greatly reduced. Since the degree of distribution overlap between every pair of classes provides useful information for feature selection, HFSDD also takes it into account, using a new approach to estimate the overlapping degrees. In this sense, HFSDD is a distance discriminant and distribution overlapping based solution. HFSDD was compared with ReliefF and mrmrMID on ten data sets; the experimental results show that HFSDD outperforms the other methods.
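The exact HFSDD criterion is not reproduced in this abstract; the sketch below only illustrates the underlying idea of combining a distance-discriminant term with a histogram-based estimate of distribution overlap for a single continuous feature in a two-class problem.

```python
# Illustrative per-feature score combining class-mean separation and an
# estimate of distribution overlap (a sketch of the idea, not the exact
# HFSDD criterion).
import numpy as np

def feature_score(x, y, bins=20):
    """Score one continuous feature for a two-class problem."""
    x0, x1 = x[y == 0], x[y == 1]
    # distance-discriminant part: standardized distance between class means
    dist = abs(x0.mean() - x1.mean()) / (x0.std() + x1.std() + 1e-12)
    # overlap part: shared area of the two normalized class histograms
    edges = np.histogram_bin_edges(x, bins=bins)
    h0, _ = np.histogram(x0, bins=edges)
    h1, _ = np.histogram(x1, bins=edges)
    p0, p1 = h0 / h0.sum(), h1 / h1.sum()
    overlap = np.minimum(p0, p1).sum()
    return dist - overlap  # larger = more discriminative, less overlapping

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 100)
good = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
weak = rng.normal(0, 1, 200)
print(feature_score(good, y), feature_score(weak, y))
```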
Feature ranking is widely employed to deal with high dimensionality in text classification. The main advantage of feature ranking methods is their low cost and simple algorithms. However, they suffer from drawbacks that lead to lower performance than wrapper-based feature selection methods. In this paper, three major drawbacks of feature ranking methods are discussed. First, we show that feature ranking methods are highly problem dependent: designing an effective feature ranking method and choosing an appropriate ranking threshold require background knowledge of the data set characteristics as well as the classifier to be used. Second, feature ranking methods are univariate functions, while the nature of text classification is multivariate, which means that correlations between terms are ignored. Finally, they fail in multiclass problems with unbalanced class distributions because they pay more attention to the simpler and larger classes. In this paper, these drawbacks, especially the last two, are investigated through extensive numerical experiments with several data sets and feature scoring measures.
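The univariate limitation can be illustrated with a small sketch (synthetic 0/1 term indicators, not the paper's data): because each term is scored in isolation, a fully redundant duplicate of an informative term receives exactly the same chi-squared score as the original, so a ranking-based threshold would keep both.

```python
# Small illustration of the univariate limitation: a duplicated (fully
# redundant) term gets the same chi-squared score as the original, since
# each term is scored without regard to the others.
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 200)
term = ((y + rng.integers(0, 2, 200)) > 1).astype(float)  # informative term indicator
noise = rng.integers(0, 2, 200).astype(float)             # uninformative term
X = np.column_stack([term, term, noise])                   # column 1 duplicates column 0
scores, _ = chi2(X, y)
print(scores)  # the two identical columns receive identical (high) scores
```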
Feature ranking is a fundamental preprocessing step for feature selection before performing any data mining task. Essentially, when a problem has too many features, dimensionality reduction by discarding weak features is highly desirable. In this paper, we develop an efficient feature ranking algorithm for selecting the more relevant features prior to deriving classification predictors. Unlike ranking criteria that rely on the training error of a predictor built on a feature, our approach is distance-based, employing only the statistical distribution of classes in each feature. It uses a scoring function as the ranking criterion to evaluate the correlation between each feature and the classes. This function comprises three measures for each class: the statistical between-class distance, the interclass overlap measure, and an estimate of class impurity. To compute the statistical parameters used in these measures, a normalized histogram obtained for each class is employed as its a priori probability density. Since the proposed algorithm examines each feature individually, it provides a fast and cost-effective method for feature ranking. We have tested the effectiveness of our approach on several high-dimensional benchmark data sets: some top-ranked features are selected and used in rule-based classifiers as the target data mining task. Compared with some popular feature ranking methods, the experimental results show that our approach performs better, as it identifies the more relevant features and leads to lower classification error.
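As an illustrative fragment (the full criterion also combines between-class distance and overlap terms, which are omitted here), the class-impurity component of such a histogram-based ranking can be sketched as follows: each feature is binned, normalized per-class histograms act as a priori densities, and the expected Gini impurity over bins gives a lower-is-better score.

```python
# Sketch of a histogram-based class-impurity score for one feature; this is
# only one of the three measures described above, not the full criterion.
import numpy as np

def histogram_impurity(x, y, bins=16):
    edges = np.histogram_bin_edges(x, bins=bins)
    classes = np.unique(y)
    # per-class counts in each bin (normalized histograms as class densities)
    counts = np.vstack([np.histogram(x[y == c], bins=edges)[0] for c in classes])
    bin_totals = counts.sum(axis=0)
    weights = bin_totals / bin_totals.sum()           # P(bin)
    probs = counts / np.maximum(bin_totals, 1)         # P(class | bin)
    gini = 1.0 - (probs ** 2).sum(axis=0)               # impurity of each bin
    return float((weights * gini).sum())                 # expected impurity (lower is better)

rng = np.random.default_rng(3)
y = np.repeat([0, 1], 150)
separable = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])
mixed = rng.normal(0, 1, 300)
print(histogram_impurity(separable, y), histogram_impurity(mixed, y))
```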
Dimensionality reduction is a necessary task in data mining when working with high-dimensional data, and feature selection is one type of dimensionality reduction. Feature selection based on feature ranking has received much attention from researchers, mainly because of its scalability, ease of use, and fast computation. Feature ranking methods can be divided into different categories and may use different measures for ranking features. Recently, ensemble methods have entered the field of feature ranking and achieved higher accuracy than individual methods. Accordingly, this paper proposes a heterogeneous ensemble-based algorithm for feature ranking. The base ranking methods in this ensemble structure are chosen from different categories, such as information-theoretic, distance-based, and statistical methods. The results of the base ranking methods are then fused into a final feature subset by means of a genetic algorithm. The diversity of the base methods improves the quality of the genetic algorithm's initial population and thus reduces its convergence time. In most ranking methods, it is the user's task to determine the threshold for choosing an appropriate subset of features, which may force the user to try many different values before finding a good one. The proposed algorithm reduces this difficulty of determining a proper threshold. Its performance is evaluated on four different text datasets, and the experimental results show that the proposed method outperforms the five other feature ranking methods used for comparison. One advantage of the proposed method is that it is independent of the classifier used.
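A minimal sketch of the fusion idea follows, assuming three off-the-shelf base rankers (mutual information, ANOVA F, and chi-squared) and a simple genetic algorithm whose initial population is seeded from their top-ranked features and whose fitness is cross-validated accuracy; the paper's actual base methods, GA operators, and fitness are not reproduced.

```python
# Sketch: seed a GA's population from heterogeneous base rankings, then evolve
# a feature subset whose fitness is cross-validated accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, f_classif, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n_features, pop_size, top_k, generations = X.shape[1], 20, 10, 15

# heterogeneous base rankings (information-theoretic, statistical, chi-squared)
rankings = [np.argsort(-mutual_info_classif(X, y, random_state=0)),
            np.argsort(-f_classif(X, y)[0]),
            np.argsort(-chi2(X - X.min(axis=0), y)[0])]

def seed_individual():
    mask = np.zeros(n_features, dtype=bool)
    mask[rankings[rng.integers(len(rankings))][:top_k]] = True  # top-k of a random ranker
    return mask

def fitness(mask):
    return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean() if mask.any() else 0.0

population = [seed_individual() for _ in range(pop_size)]
for _ in range(generations):
    scores = np.array([fitness(m) for m in population])
    parents = [population[i] for i in np.argsort(-scores)[:pop_size // 2]]
    children = []
    while len(children) < pop_size - len(parents):
        a, b = rng.choice(len(parents), 2, replace=False)
        cut = rng.integers(1, n_features)                   # one-point crossover
        child = np.concatenate([parents[a][:cut], parents[b][cut:]])
        children.append(np.logical_xor(child, rng.random(n_features) < 0.02))  # mutation
    population = parents + children

best = max(population, key=fitness)
print("selected features:", np.flatnonzero(best))
```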
Software quality prediction models are useful tools for creating high quality software products. The general process is that practitioners use software metrics and defect data along with various data mining techniques to build classification models for identifying potentially faulty program modules, thereby enabling effective project resource allocation. The predictive accuracy of these classification models is often affected by the quality of input data. Two main problems which can affect the quality of input data are high dimensionality (too many independent attributes in a dataset) and class imbalance (many more members of one class than the other class in a binary classification problem). To resolve both of these problems, we present an iterative feature selection approach which repeatedly applies data sampling (to overcome class imbalance) followed by feature selection (to overcome high dimensionality), and finally combines the ranked feature lists from the separate iterations of sampling. After feature selection, models are built either using a plain learner or by using a boosting algorithm which incorporates sampling. In order to assess the impact of various balancing, filter, and learning techniques in the feature selection and model-building process on software quality prediction, we employ two sampling techniques, random undersampling (RUS) and synthetic minority oversampling technique (SMOTE), and two ensemble boosting approaches, RUSBoost and SMOTEBoost (in which RUS and SMOTE, respectively, are integrated into a boosting technique), as well as six feature ranking techniques. We apply the proposed techniques to several groups of datasets from two real-world software systems and use two learners to build classification models. The experimental results demonstrate that RUS results in better prediction than SMOTE, and also that boosting is more effective in improving classification performance than not using boosting. In addition, some feature ranking techniques, like chi-squared and information gain, exhibit better and more stable classification behavior than other rankers.
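A hedged sketch of the "sample, then rank, then aggregate" idea follows, using random undersampling (RUS) from the imbalanced-learn package and a chi-squared ranker, with per-iteration ranks averaged at the end; the full study also uses SMOTE, boosting variants, several rankers, and different learners, none of which are reproduced here.

```python
# Sketch: repeatedly undersample the majority class, rank features on each
# balanced sample, then aggregate the per-iteration ranks by averaging.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler  # assumes imbalanced-learn is installed
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X = X - X.min(axis=0)                                     # chi2 needs non-negative values

n_iterations, all_ranks = 10, []
for i in range(n_iterations):
    X_bal, y_bal = RandomUnderSampler(random_state=i).fit_resample(X, y)
    scores, _ = chi2(X_bal, y_bal)
    order = np.argsort(-scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    all_ranks.append(ranks)

mean_rank = np.mean(all_ranks, axis=0)                     # aggregate the ranked lists
print("top features:", np.argsort(mean_rank)[:5])
```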
In the process of software defect prediction, a classification model is first built using software metrics and fault data gathered from a past software development project; that model is then applied to data from a similar project or a new release of the same project to predict new program modules as either fault-prone (fp) or not-fault-prone (nfp). The benefit of such a model is to facilitate the optimal use of limited financial and human resources for software testing and inspection. The predictive power of a classification model constructed from a given data set is affected by many factors. In this paper, we are most interested in two problems that often arise in software measurement data: high dimensionality and unequal example set sizes of the two types of modules (e.g., many more nfp modules than fp modules in a data set). These directly result in longer learning times and a decline in the predictive performance of classification models. We consider using data sampling followed by feature selection (FS) to deal with these problems. Six data sampling strategies (three sampling techniques, each with two post-sampling proportion ratios) and six commonly used feature ranking approaches are employed in this study. We evaluate the FS techniques in two ways: (1) a general method, i.e., assessing the classification performance after the training data is modified, and (2) studying the stability of an FS method, specifically to understand the effect of data sampling techniques on the stability of FS when using the sampled data. The experiments were performed on nine data sets from a real-world software project. The results demonstrate that the FS techniques that most enhance the models' classification performance do not also show the best stability, and vice versa. In addition, classification performance is affected more by the sampling techniques themselves than by the post-sampling proportions, whereas the opposite holds for stability.
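One simple way to quantify the stability aspect discussed above is sketched below: top-k chi-squared feature subsets are computed on several bootstrap samples and their average pairwise Jaccard similarity is reported. This is an illustrative stability measure on synthetic data; the study's actual sampling strategies and stability metric may differ.

```python
# Sketch: measure feature-selection stability as the mean pairwise Jaccard
# similarity of top-k chi-squared subsets over bootstrap samples.
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2

X, y = make_classification(n_samples=500, n_features=30, n_informative=6,
                           weights=[0.85, 0.15], random_state=0)
X = X - X.min(axis=0)                                      # chi2 needs non-negative values

rng = np.random.default_rng(0)
k, subsets = 8, []
for _ in range(10):
    idx = rng.choice(len(y), size=len(y), replace=True)    # bootstrap sample
    scores, _ = chi2(X[idx], y[idx])
    subsets.append(set(np.argsort(-scores)[:k]))

jaccards = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
print("mean pairwise Jaccard stability:", np.mean(jaccards))
```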
Atrial Fibrillation (A-Fib), Atrial Flutter (AFL) and Ventricular Fibrillation (V-Fib) are fatal cardiac abnormalities that commonly affect people of advanced age and are indicative of life-threatening conditions. The Electrocardiogram (ECG) signal is the most common clinical tool for detecting these abnormal rhythms. Concealed non-linearities in the ECG signal can be clearly unraveled using the Recurrence Quantification Analysis (RQA) technique. In this paper, RQA features are applied to classify four classes of ECG beats, namely Normal Sinus Rhythm (NSR), A-Fib, AFL and V-Fib, using ensemble classifiers. The clinically significant (p<0.05) features are ranked and fed independently to three classifiers, namely the Decision Tree (DT), Random Forest (RAF) and Rotation Forest (ROF) ensemble methods, to select the best classifier. Training and testing of the feature set are accomplished using a 10-fold cross-validation strategy. The RQA coefficients using ROF provided an overall accuracy of 98.37%, against 96.29% and 94.14% for RAF and DT, respectively. The results achieved clearly confirm the superiority of the ROF ensemble classifier in the diagnosis of A-Fib, AFL and V-Fib. The precision of the four classes is measured using class-specific accuracy (%), and the reliability of the performance is assessed using Cohen's kappa statistic (κ). The developed approach can be used in therapeutic devices and help physicians automatically monitor fatal tachycardia rhythms.
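Only the classification stage can be sketched generically here: a stand-in feature matrix (in place of the RQA features, which are not computed in this sketch) is evaluated with 10-fold cross-validation using a Random Forest. Rotation Forest is not available in scikit-learn, so only a RAF-style ensemble is shown.

```python
# Sketch of the 10-fold cross-validated ensemble classification stage on a
# synthetic four-class feature matrix standing in for RQA features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# stand-in for an RQA feature matrix over four rhythm classes
X, y = make_classification(n_samples=800, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                      X, y, cv=cv, scoring="accuracy")
print("10-fold accuracy: %.3f +/- %.3f" % (acc.mean(), acc.std()))
```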
Molecular big data are highly correlated, and many of the genes are irrelevant. The performance of classification methods relies mainly on the selection of significant genes. Sparse regularized regression (SRR) models using the least absolute shrinkage and selection operator (lasso) and adaptive lasso (alasso) are popular for gene selection and classification; nevertheless, they become challenging to apply when the genes are highly correlated. Here, we propose a modified adaptive lasso whose weights come from ranking-based feature selection (RFS) methods, capable of dealing with highly correlated gene expression data. First, RFS methods such as Fisher's score (FS), Chi-square (CS), and information gain (IG) are employed to discard unimportant genes, and the top significant genes are chosen via the sure independence screening (SIS) criterion. The scores of the ranked genes are normalized and assigned as weights to the alasso method to obtain the most significant genes, which were shown to be biologically related to the cancer type and helped attain higher classification performance. Using synthetic data and a real application to microarray data, we demonstrate that the proposed alasso with RFS methods is a better approach than other known methods, such as alasso with ridge or marginal maximum likelihood estimation (MMLE) filtering, and lasso and alasso without filtering. Accuracy, area under the receiver operating characteristic curve (AUROC), and geometric mean (G-mean) are used to evaluate the performance of the models.
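A minimal sketch of using ranking-based scores as adaptive-lasso weights is given below; the mapping from normalized Fisher scores to penalty weights (w_j = 1/(score_j + eps)) and the L1-regularized logistic model are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: ranking-based (Fisher) scores become per-feature adaptive-lasso
# weights, implemented via the usual column-rescaling reparameterization.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=100, n_informative=8,
                           random_state=0)
X = StandardScaler().fit_transform(X)

# Fisher score of each feature (between-class separation over within-class spread)
x0, x1 = X[y == 0], X[y == 1]
fisher = (x0.mean(0) - x1.mean(0)) ** 2 / (x0.var(0) + x1.var(0) + 1e-12)
score = fisher / fisher.max()                        # normalized ranking score in (0, 1]

# adaptive-lasso reparameterization: penalize feature j by w_j = 1/(score_j + eps),
# implemented by rescaling column j and fitting a plain L1 model
w = 1.0 / (score + 1e-3)
X_scaled = X / w                                      # column j divided by its weight
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_scaled, y)
coef = model.coef_.ravel() / w                         # map back to the original scale
print("selected genes:", np.flatnonzero(coef != 0))
```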
The World Health Organization (WHO) has coined the word "infodemic" to refer to the dissemination of fake news during the pandemic, which is considered to be as harmful as the virus itself. Verifying the information available on the internet is a prerequisite to maintaining a healthy information ecosystem, and this is the driving force behind this work. The primary goal of this study is to address the time-consuming problem of automatically detecting fake news in voluminous data and to consider the uncertainty of data arising from causal relations using a rich feature set. This research achieves significant feature reduction, for reduced execution time and improved accuracy, by selecting significant features using recursive feature elimination (RFE). The features retained by the RFE algorithm are also compared against a standard statistical measure, Pearson's correlation, to ensure no information is lost while reducing features. The suggested methodology also defines appropriate class output assurance levels and quantifies prediction ambiguity for the fake news identification task. A comparative analysis with existing feature selection methods is performed. The experimental results show a 6% increase in precision and a 97% reduction in execution time.
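The two feature-reduction routes mentioned above can be sketched side by side: recursive feature elimination wrapped around a linear classifier, and a ranking by absolute Pearson correlation with the label, with the overlap between the two selections reported. The actual fake-news feature set and models are not reproduced here.

```python
# Sketch: compare RFE-selected features against the top features by absolute
# Pearson correlation with the label on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           random_state=0)
k = 10

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y)
rfe_features = set(np.flatnonzero(rfe.support_))

pearson = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
corr_features = set(np.argsort(-pearson)[:k])

print("RFE selection:    ", sorted(rfe_features))
print("Pearson selection:", sorted(corr_features))
print("overlap:", len(rfe_features & corr_features), "of", k)
```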
Feature selection is very important for the success of any automated pattern recognition system. Removal of redundant features improves the efficiency of a classifier as well as cutting down the cost of feature extraction. Recently, artificial neural networks (ANNs) have become popular for solving pattern classification problems, but the choice of a proper architecture and model from the various alternatives remains an open research problem. In this work, a new architecture, a modified version of the multilayer feed-forward neural network proposed earlier for pattern classification, is used as a tool for feature selection. An algorithm for feature subset selection is proposed in which the features are evaluated according to their contribution to the network's classification rate on unknown samples. The proposed algorithm has been simulated on two types of data sets, and the results seem promising.
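The paper's modified feed-forward architecture cannot be reconstructed from this abstract; the sketch below only captures the underlying idea with a standard MLPClassifier, estimating each feature's contribution as the drop in held-out accuracy when that feature is neutralized (replaced by its training mean).

```python
# Sketch: evaluate each feature by its contribution to a feed-forward
# network's classification rate on held-out samples.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0,
                                          stratify=y)

net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)
base_acc = net.score(X_te, y_te)

contributions = []
for j in range(X.shape[1]):
    X_mod = X_te.copy()
    X_mod[:, j] = X_tr[:, j].mean()                  # neutralize feature j
    contributions.append(base_acc - net.score(X_mod, y_te))

print("baseline accuracy:", base_acc)
print("per-feature contribution to accuracy:", np.round(contributions, 3))
```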