Imbalanced datasets affect how classifiers learn, and this imbalance problem is almost ubiquitous in biological datasets. Resampling is one of the common methods for dealing with it. In this study, we explore how the learning performance of different machine learning algorithms changes as the balancing ratio of the training dataset is varied; the dataset consists of the peptides observed and the peptides absent in a mass spectrometry experiment. The ideal balancing ratio yielded better performance than the imbalanced dataset, but it was not the best when compared with some intermediate ratios. By applying the Synthetic Minority Oversampling Technique (SMOTE) at different balancing ratios, we obtained the best results with a boosted random forest algorithm: sensitivity of 92.1%, specificity of 94.7%, overall accuracy of 93.4%, MCC of 0.869, and AUC of 0.982. This study also identifies the most discriminating features by applying a feature ranking algorithm. The results of these experiments suggest that the performance of machine learning algorithms on classification tasks can be enhanced by selecting an optimally balanced training dataset, which can be obtained by suitably modifying the class distribution.
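As a rough illustration of oversampling to a chosen balancing ratio, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours, which is the core idea of SMOTE. It is not the study's implementation: the function name `smote_like_oversample`, the parameters `ratio` and `k`, and the toy data are all assumptions made for illustration.

```python
import random


def smote_like_oversample(minority, n_majority, ratio, k=3, seed=0):
    """Return synthetic minority points (hypothetical helper).

    After adding the returned points, the minority class holds roughly
    ratio * n_majority samples. Each synthetic point lies on the segment
    between a randomly chosen minority sample and one of its k nearest
    minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    target = int(round(ratio * n_majority))
    n_new = max(0, target - len(minority))
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy data: 4 minority points, 20 majority points, target ratio 0.5.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority, n_majority=20, ratio=0.5)
print(len(minority) + len(new_points))  # minority size after oversampling
```

Sweeping `ratio` over several values and retraining the classifier at each one is the kind of experiment the abstract describes; a maintained SMOTE implementation would normally replace this helper in practice.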
Identifying the significant, or dominant, features is important for revealing cause-and-effect relations in many pattern recognition applications, such as medical diagnosis, gene analysis, cyber security, and finance and insurance fraud detection. Samples that are sparsely populated and binary-valued in highly imbalanced datasets pose a challenge to the identification of these features. This paper explores an approach based on the confusion matrix measurement of the feature values with respect to their potential classification outcomes. The approach computes the Discriminative Significances of the features and ranks the features in a way that is unbiased with respect to the imbalance ratios of the datasets. Experimental results on real-world and experimental datasets show that the approach evaluated the features consistently and identified the most significant ones on the sparse, binary-valued samples of the class-imbalanced datasets.
This paper presents a quantitative analysis of the nonlinearities of the positive predictive value (PPV) and their effect on evaluating two-class pattern classification models with imbalanced datasets. The analysis expresses the PPV as a function of two other classification ratios that are invariant to the data imbalance, the true positive rate (TPR) and the false positive rate (FPR), and of σ, the imbalance ratio (IR) of the dataset, such that PPV = σTPR/(σTPR + FPR). The curvatures of PPV in the three-dimensional TPR–FPR–σ space are studied using the Hessian matrix, which reveals a saddle-shaped 3D surface in that space. This paper explores the nonlinear behaviors of PPV around the critical points, identified at FPR = σTPR on the saddle surface, along with its scaling and sensitivity issues as a performance measurement in model evaluation. The effect of the nonlinearities of PPV on the F1 and MCC metrics for imbalanced datasets is also studied. The results warn that evaluations of classification models can be misleading without an awareness and understanding of the nonlinearities associated with the PPV and its related metrics on imbalanced datasets.
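The PPV expression above can be evaluated directly to see how strongly it depends on σ when TPR and FPR are held fixed. The sketch below assumes σ is the ratio of positive to negative samples, the reading under which the stated formula follows from PPV = TP/(TP + FP); the function name `ppv` and the example values are illustrative only.

```python
def ppv(tpr, fpr, sigma):
    """Positive predictive value from TPR, FPR and imbalance ratio sigma.

    Assumes sigma = (number of positives) / (number of negatives).
    """
    denom = sigma * tpr + fpr
    if denom == 0:
        raise ValueError("PPV is undefined when sigma * TPR + FPR is zero")
    return sigma * tpr / denom


# Same classifier quality (TPR = 0.9, FPR = 0.1) at three imbalance ratios.
for sigma in (1.0, 0.1, 0.01):
    print(f"sigma={sigma:<5} PPV={ppv(0.9, 0.1, sigma):.4f}")
```

With TPR and FPR unchanged, the PPV falls from 0.9 on a balanced dataset to roughly 0.47 at σ = 0.1 and roughly 0.08 at σ = 0.01. At FPR = σTPR the expression equals exactly 0.5.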
Machine learning is widely applied to molecular tumor classification based on gene expression profiles, but the sample imbalance problem is often overlooked. This paper proposes a subclass-weighted neighborhood classifier to address the imbalanced sample set problem and a novel neighborhood rough set model to select informative genes and improve classification performance. Experiments on three publicly available tumor datasets demonstrated that the proposed method is effective for imbalanced datasets with an obscure boundary between two subtypes and for informative gene selection, and that it can achieve higher cross-validation accuracy with far fewer tumor-related genes.
Assisted Reproductive Technology (ART) is a set of medical procedures primarily used to address infertility. The success rate of ART is very low because it is affected by a large number of variables. Machine learning techniques are now applied to predict ART outcomes and to find strategies to improve the success rate, so determining the best-performing classifier for ART is very important. Previously, some classifiers have been applied to ART with static data. In reality, however, the datasets are dynamic and require a dynamic setup, which can be achieved with incremental classifiers. Because of the low success rate, the ART dataset contains few records with positive results, which makes the dataset imbalanced. This research work first finds the best evaluation metric for classification on an imbalanced dataset, then balances the dataset using three balancing techniques (undersampling, oversampling and the Synthetic Minority Oversampling Technique (SMOTE)), and applies five incremental classifiers, namely Stochastic Gradient Descent (SGD), Stochastic Primal Estimated sub-GrAdient SOlver for Support vector machine (SPegasos), Naïve Bayes Updatable, Instance Based (IBk) and Averaged One Dependence Estimators (A1DE) Updatable, to find the best balancing technique and the most suitable classifier for ART outcome prediction. The results show that, for an imbalanced dataset, the Receiver Operating Characteristics (ROC) area may be taken as the metric instead of accuracy. SMOTE is found to be the best method for balancing the ART dataset, and the IB1 classifier performs well on the balanced data, with a high prediction rate of 92.3 for ROC. Finally, various feature selection methods are applied to the three best-performing classifiers, and a suitable feature selection method is identified for each classifier.
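A small example can show why ROC area is a safer metric than accuracy on imbalanced data. The sketch below uses made-up labels (5 positives, 95 negatives) and a degenerate classifier that scores every record identically; the helper names `accuracy` and `roc_auc` are assumptions for illustration, and the AUC is computed as the fraction of positive-negative pairs ranked correctly, with ties counted as half.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC area as the probability that a random positive outscores a
    random negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy imbalanced labels: 5 positive outcomes, 95 negative outcomes.
y_true = [1] * 5 + [0] * 95
# A classifier that gives every record the same score and predicts "negative".
scores = [0.0] * 100
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # high, despite finding no positives
print(roc_auc(y_true, scores))   # no better than chance
```

The degenerate classifier reaches 95% accuracy simply by following the majority class, while its ROC area of 0.5 shows that it has no ability to separate the two outcomes.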
Today’s datasets are usually very large with many features, and analysing such datasets is a tedious task. In classification especially, selecting the attributes that are salient for the process is challenging. It is more difficult when the target class attribute has many class labels, and many researchers have therefore introduced feature selection methods for classification on multi-class attributes. The process becomes more tedious still when the attribute values are imbalanced, a problem for which researchers have also contributed many methods. However, there is not sufficient research on handling extreme imbalance and feature selection together, and this paper aims to bridge this gap. Here, Particle Swarm Optimization (PSO), an efficient evolutionary algorithm, is used to handle the imbalanced dataset, and the feature selection process is also enhanced with the required functionalities. First, Multi-objective Particle Swarm Optimization is used to transform the imbalanced datasets into balanced ones, and then another version of Multi-objective Particle Swarm Optimization is used to select the significant features. The proposed methodology is applied to eight extremely imbalanced multi-class datasets, and the experimental results are better than those of other existing methods in terms of classification accuracy, G-mean and F-measure. The results, validated using the Friedman test, also confirm that the proposed methodology effectively balances the dataset with fewer features than other methods.
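To illustrate why G-mean is reported alongside accuracy for extremely imbalanced multi-class data, the sketch below computes both on a toy three-class problem. The abstract does not define its G-mean; the definition used here, the geometric mean of per-class recalls, is a common convention and is an assumption, as are the function names and the data.

```python
import math


def per_class_recall(y_true, y_pred):
    """Recall of each class that occurs in y_true."""
    recalls = {}
    for label in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / support
    return recalls


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed multi-class definition)."""
    recalls = per_class_recall(y_true, y_pred)
    return math.prod(recalls.values()) ** (1.0 / len(recalls))


def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: class "a" dominates, classes "b" and "c" have one sample each.
y_true = ["a"] * 8 + ["b"] + ["c"]
majority_only = ["a"] * 10  # classifier that ignores the rare classes

print(accuracy(y_true, majority_only))  # looks acceptable
print(g_mean(y_true, majority_only))    # collapses to zero
```

Because the geometric mean is zero whenever any class has zero recall, a classifier that ignores the rare classes cannot score well on G-mean, however high its accuracy is.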
O-glycosylation (Oglyc) plays an important role in various biological processes. The key to understanding the mechanisms of Oglyc is identifying the corresponding glycosylation sites. Two critical steps, feature selection and classifier design, greatly affect the accuracy of computational methods for predicting Oglyc sites. Based on an efficient feature selection algorithm and a classifier capable of handling imbalanced datasets, a new computational method, ChiMIC-based balanced decision table for O-glycosylation (CBDT-Oglyc), is proposed to predict Oglyc sites in proteins. Sequence characterization is performed by combining amino acid composition (AAC), undirected composition of k-spaced amino acid pairs (undirected-CKSAAP) and pseudo-position-specific scoring matrix (PsePSSM). The Chi-MIC-share algorithm is used for feature selection, which simplifies the model and improves predictive accuracy. For imbalanced classification, a backtracking method based on a local chi-square test is designed, and cost-sensitive learning is then incorporated to construct a novel classifier named ChiMIC-based balanced decision table (CBDT). On a 1:49 (positives:negatives) training set, the CBDT classifier achieves significantly better prediction performance than traditional classifiers. Moreover, independent test results on separate human and mouse glycoproteins show that CBDT-Oglyc outperforms previous methods in global accuracy. CBDT-Oglyc shows great promise in predicting Oglyc sites and is expected to facilitate further experimental studies on protein glycosylation.