Imbalanced datasets impair the learning of classifiers, and this problem is almost ubiquitous in biological datasets. Resampling is one of the common methods for dealing with it. In this study, we explore how learning performance changes across different machine learning algorithms as the balancing ratio of the training dataset is varied. The dataset consists of peptides that were observed and peptides that were absent in a mass spectrometry experiment. The ideal balancing ratio yielded better performance than the imbalanced dataset, but some intermediate ratio performed better still. Using the Synthetic Minority Oversampling Technique (SMOTE) at different balancing ratios, we obtained the best results with a boosted random forest algorithm: sensitivity of 92.1%, specificity of 94.7%, overall accuracy of 93.4%, MCC of 0.869, and AUC of 0.982. This study also identifies the most discriminating features by applying a feature ranking algorithm. The results suggest that the performance of machine learning algorithms on classification tasks can be enhanced by selecting an optimally balanced training dataset, which can be obtained by suitably modifying the class distribution.
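As a rough illustration of the quantities named in this abstract, the sketch below computes sensitivity, specificity, accuracy and MCC from a binary confusion matrix, and the number of synthetic minority samples needed to reach a target balancing ratio. It assumes the balancing ratio means minority count divided by majority count, which the abstract does not state. The function names and all counts are hypothetical and are not taken from the study.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard binary metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}

def synthetic_needed(n_minority, n_majority, ratio):
    """Synthetic minority samples needed so that minority/majority == ratio.

    Assumption: 'balancing ratio' is minority count over majority count.
    """
    return max(0, round(ratio * n_majority) - n_minority)

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fn=10, tn=95, fp=5)
for ratio in (0.25, 0.5, 0.75, 1.0):
    print(ratio, synthetic_needed(100, 1000, ratio))
```

Sweeping the ratio in this way, retraining at each value and comparing the metrics is one plausible way to look for the intermediate optimum the abstract reports.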
Earthquake prediction based on extremely imbalanced precursor data is a challenging task for standard algorithms. Even in an earthquake-prone zone, days with earthquakes make up only a small minority of each year. The general approach is to generate more artificial data for the minority class, that is, the earthquake occurrence data. However, the most popular oversampling methods generate synthetic samples along line segments that join minority class instances, which is not suitable for earthquake precursor data. In this paper, we propose the Safe Zone Synthetic Minority Oversampling Technique (SZ-SMOTE), an oversampling method that enhances the SMOTE data generation mechanism. SZ-SMOTE generates synthetic samples with a concentration mechanism in the hyper-sphere area around each selected minority instance. The performance of SZ-SMOTE is compared against no oversampling, SMOTE, and two popular modifications of SMOTE, adaptive synthetic sampling (ADASYN) and borderline SMOTE (B-SMOTE), on six different classifiers. The experimental results show that earthquake prediction using SZ-SMOTE as the oversampling algorithm significantly outperforms prediction using the other oversampling algorithms.
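The contrast between the two generation mechanisms can be sketched as follows. The first function is the usual SMOTE step, interpolating along the segment between a minority instance and one of its minority neighbours. The second draws a point uniformly inside a hyper-sphere around the instance. It is only a stand-in for the idea: the abstract does not specify SZ-SMOTE's concentration mechanism or how the radius is chosen, so both function names, the `radius` parameter and the uniform sampling are assumptions, not the authors' algorithm.

```python
import math
import random

def smote_sample(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment from x to neighbor."""
    u = rng.random()
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

def hypersphere_sample(x, radius, rng):
    """Random point inside a ball of the given radius centred on x.

    Assumption: uniform sampling in the ball; SZ-SMOTE's actual
    concentration mechanism is not described in the abstract.
    """
    d = len(x)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in direction)) or 1.0
    r = radius * rng.random() ** (1.0 / d)   # uniform in volume
    return [xi + r * c / norm for xi, c in zip(x, direction)]

rng = random.Random(0)
x, neighbor = [0.0, 0.0], [2.0, 4.0]
on_segment = smote_sample(x, neighbor, rng)
in_ball = hypersphere_sample(x, 0.5, rng)
print(on_segment, in_ball)
```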
This paper aims to verify that cost-sensitive learning is a competitive approach for learning fuzzy rules in certain imbalanced classification problems. We show that there exist cost matrices whose use in combination with a suitable classifier improves on the results of some popular data-level techniques. The well-known FURIA algorithm is extended to take advantage of this definition. A numerical study on 64 different imbalanced datasets compares the proposed cost-sensitive FURIA to other state-of-the-art classification algorithms, both those based on fuzzy rules and other classical machine learning methods.
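How a cost matrix can change a classifier's decision may be sketched with the generic minimum-expected-cost rule. This is not the cost-sensitive FURIA described above; the function name, the probabilities and the cost values are hypothetical, and `cost[t][p]` is assumed to hold the cost of predicting class `p` when the true class is `t`.

```python
def min_expected_cost_class(probs, cost):
    """Return the class index with the lowest expected misclassification cost.

    probs[t]   : estimated probability that the true class is t
    cost[t][p] : cost of predicting p when the truth is t (assumed layout)
    """
    n = len(probs)
    expected = [sum(probs[t] * cost[t][p] for t in range(n)) for p in range(n)]
    return min(range(n), key=expected.__getitem__)

# Class 0 = majority, class 1 = minority (hypothetical values).
probs = [0.8, 0.2]
uniform_cost = [[0, 1], [1, 0]]     # all errors cost the same
skewed_cost = [[0, 1], [10, 0]]     # missing a minority case costs 10x

print(min_expected_cost_class(probs, uniform_cost))  # majority wins
print(min_expected_cost_class(probs, skewed_cost))   # minority wins
```

With uniform costs the rule reduces to picking the most probable class. Raising the cost of missing the minority class shifts the decision without resampling the data.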
Imbalanced data classification is a challenging issue in many domains, including intelligent medical diagnosis and fraudulent transaction analysis. The performance of conventional classifiers degrades when the class distribution of the training data set is imbalanced. Recently, machine learning and deep learning techniques have been used for imbalanced data classification. Data preprocessing approaches are also suitable for handling the class imbalance problem, and data augmentation is one of the preprocessing techniques used to handle a skewed class distribution. The Synthetic Minority Oversampling Technique (SMOTE) is a promising class balancing approach, but it generates noise while creating synthetic samples. In this paper, an AutoEncoder is used to reduce the noise generated by SMOTE, and a deep one-dimensional Convolutional Neural Network is then used for classification. The performance of the proposed method is evaluated and compared with existing approaches using Precision, Recall, Accuracy, Area Under the Curve and Geometric Mean. Ten data sets, with imbalance ratios ranging from 1.17 to 577.87 and sizes ranging from 303 to 284807 instances, are used in the experiments: Heart-Disease, Mammography, Pima Indian diabetes, Adult, Oil-Spill, Phoneme, Creditcard, BankNoteAuthentication, Balance scale weight & distance database and Yeast. On these data sets, the proposed method achieves accuracies of 96.1%, 96.5%, 87.7%, 87.3%, 95%, 92.4%, 98.4%, 86.1%, 94% and 95.9%, respectively. The results suggest that this method outperforms other deep learning and machine learning methods with respect to G-mean and other performance metrics.
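The imbalance ratio and the geometric mean mentioned above can be sketched as follows, assuming the usual definitions: imbalance ratio as majority count over minority count, and G-mean as the square root of sensitivity times specificity. The class counts are an assumption: they are those of the widely used public credit card fraud dataset (492 positives among 284807 instances), which may or may not be the Creditcard set used in the paper.

```python
import math

def imbalance_ratio(n_majority, n_minority):
    """Majority-to-minority ratio (assumed definition)."""
    return n_majority / n_minority

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Assumed class counts (public credit card fraud dataset).
n_pos, n_total = 492, 284807
n_neg = n_total - n_pos
ir = imbalance_ratio(n_neg, n_pos)

# A trivial classifier that always predicts the majority class:
trivial_accuracy = n_neg / n_total
trivial_g_mean = g_mean(tp=0, fn=n_pos, tn=n_neg, fp=0)
print(ir, trivial_accuracy, trivial_g_mean)
```

The trivial classifier scores very high accuracy but a G-mean of zero, which is why G-mean is a more informative headline metric than accuracy at high imbalance ratios.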
Arrhythmia classification is useful in heart disease diagnosis. Although well established for intra-patient diagnosis, arrhythmia classification remains difficult in the inter-patient setting. Most previous work has focused on the intra-patient condition and has not followed the Association for the Advancement of Medical Instrumentation (AAMI) standards. Here, we propose a novel system for arrhythmia classification based on multi-lead electrocardiogram (ECG) signals. The core of the design is to fuse two types of deep learning features with some common traditional features and to select discriminating features using a binary particle swarm optimization (BPSO) algorithm. The resulting feature vector is then classified using a weighted support vector machine (SVM) classifier. To improve generalization of the model and to draw fair comparisons, we carried out inter-patient experiments and followed the AAMI standards. We found that, on common multi-class metrics with either macro- or micro-averaging, our system outperforms most other state-of-the-art methods.
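The difference between macro- and micro-averaging can be sketched with F1 on a small multi-class confusion matrix. The matrix below is hypothetical (rows are true classes, columns are predictions) and is not taken from the paper; the class labels in the comment are placeholders.

```python
def per_class_counts(cm):
    """(tp, fp, fn) for each class of a square confusion matrix."""
    n = len(cm)
    counts = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # rest of row k
        fp = sum(cm[i][k] for i in range(n)) - tp   # rest of column k
        counts.append((tp, fp, fn))
    return counts

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(cm):
    counts = per_class_counts(cm)
    macro = sum(f1(*c) for c in counts) / len(counts)   # mean of class scores
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    micro = f1(tp, fp, fn)                              # pooled counts
    return macro, micro

# Hypothetical 3-class matrix: one frequent class, two rare ones.
cm = [[90, 5, 5],
      [8, 2, 0],
      [2, 0, 8]]
macro, micro = macro_micro_f1(cm)
print(round(macro, 4), round(micro, 4))
```

Micro-averaging is dominated by the frequent class, while macro-averaging weights every class equally, so poor performance on rare classes lowers the macro score much more.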