  • An Empirical Study on Software Defect Prediction Using Over-Sampling by SMOTE

    Software defect prediction suffers from class imbalance, and addressing this imbalance is important for improving prediction performance. SMOTE is a widely used over-sampling method for this purpose. In this paper, we study several problems that arise when the SMOTE algorithm is applied to software defect prediction. We perform experiments to investigate how two SMOTE parameters, the percentage of appended minority-class samples and the number of nearest neighbors, influence prediction performance, and we compare the performance of different classifiers, using paired t-tests to assess the statistical significance of the results. We also introduce the notions of effective and ineffective over-sampling, along with criteria for judging whether an over-sampling is effective, and we evaluate our results against these criteria.

    The results show that both the percentage of appended minority-class samples and the number of nearest neighbors influence prediction performance, and that over-sampling by SMOTE is effective for several classifiers.
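    The two SMOTE parameters studied in this abstract can be made concrete with a minimal sketch (plain NumPy, not the authors' code): `n_new` controls how many minority-class samples are appended, and `k` is the number of nearest neighbours considered for interpolation.

    ```python
    import numpy as np

    def smote(minority, n_new, k=5, seed=0):
        """Minimal SMOTE sketch: synthesise n_new minority samples by
        interpolating between a random minority sample and one of its
        k nearest minority-class neighbours."""
        rng = np.random.default_rng(seed)
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(minority))
            x = minority[i]
            d = np.linalg.norm(minority - x, axis=1)
            neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
            j = rng.choice(neighbours)
            gap = rng.random()                    # interpolation factor in [0, 1)
            synthetic.append(x + gap * (minority[j] - x))
        return np.array(synthetic)

    minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.5, 1.5]])
    new = smote(minority, n_new=8, k=2)
    print(new.shape)  # (8, 2)
    ```

    Because every synthetic point is a convex combination of two minority samples, all generated points stay inside the bounding region of the original minority class.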

  • Data Imbalance in Autism Pre-Diagnosis Classification Systems: An Experimental Study

    Machine learning (ML) is a branch of computer science that is rapidly gaining popularity in the healthcare arena due to its ability to explore large datasets and discover useful patterns that can be interpreted for decision-making and prediction. ML techniques are used to analyse clinical parameters and their combinations for prognosis, therapy planning and support, and patient management and wellbeing. In this research, we investigate a crucial problem in medical applications such as autism spectrum disorder (ASD) diagnosis: data imbalance, in which one class far outnumbers the other. In autism diagnosis data, the number of instances linked with the no-ASD class is larger than that of the ASD class, which may cause performance issues such as models favouring the majority class and undermining the minority class. This research experimentally measures the impact of the class imbalance issue on the performance of different classifiers on real autism datasets when various data balancing approaches are applied in the pre-processing phase. We employ oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE), and undersampling, with different classifiers including Naive Bayes, RIPPER, C4.5 and Random Forest, measuring their impact on the derived models in terms of area under the curve and other metrics. The results indicate that oversampling techniques are superior to undersampling techniques, at least for the toddlers' autism dataset considered here, and suggest that further work should combine sampling techniques with feature selection to generate models that do not overfit the dataset.
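    The undersampling side of the comparison is the simpler of the two; a minimal sketch (plain NumPy, not the study's code) discards majority-class rows at random until every class matches the size of the rarest one:

    ```python
    import numpy as np

    def random_undersample(X, y, seed=0):
        """Balance a dataset by randomly discarding samples from every
        class until each class matches the size of the rarest one."""
        rng = np.random.default_rng(seed)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        keep = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
            for c in classes
        ])
        keep.sort()
        return X[keep], y[keep]

    X = np.arange(20).reshape(10, 2)
    y = np.array([0] * 8 + [1] * 2)   # e.g. 8 "no ASD" controls vs 2 "ASD" cases
    Xb, yb = random_undersample(X, y)
    print(np.bincount(yb))  # [2 2]
    ```

    The trade-off the abstract's results point to is visible here: undersampling throws away six of the ten rows, whereas oversampling would keep all of them.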

  • Classification of Imbalanced Data Using SMOTE and AutoEncoder Based Deep Convolutional Neural Network

    Imbalanced data classification is a challenging issue in many domains, including medical intelligent diagnosis and fraudulent transaction analysis. The performance of conventional classifiers degrades when the class distribution of the training data set is imbalanced. Recently, machine learning and deep learning techniques have been used for imbalanced data classification, and data preprocessing approaches are also suitable for handling the class imbalance problem. Data augmentation is one such preprocessing technique used to handle skewed class distributions. The Synthetic Minority Oversampling Technique (SMOTE) is a promising class-balancing approach, but it generates noise during the creation of synthetic samples. In this paper, an AutoEncoder is used as a noise reduction technique to reduce the noise generated by SMOTE, and a deep one-dimensional Convolutional Neural Network is then used for classification. The performance of the proposed method is evaluated and compared with existing approaches using metrics such as Precision, Recall, Accuracy, Area Under the Curve and Geometric Mean. Ten data sets with imbalance ratios ranging from 1.17 to 577.87 and sizes ranging from 303 to 284,807 instances are used in the experiments: Heart-Disease, Mammography, Pima Indian diabetes, Adult, Oil-Spill, Phoneme, Creditcard, BankNoteAuthentication, Balance scale weight & distance database and Yeast. The proposed method achieves accuracies of 96.1%, 96.5%, 87.7%, 87.3%, 95%, 92.4%, 98.4%, 86.1%, 94% and 95.9%, respectively. The results suggest that this method outperforms other deep learning and machine learning methods with respect to G-mean and other performance metrics.
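    The paper pairs SMOTE with an AutoEncoder for noise reduction. As a self-contained stand-in (not the authors' network), the sketch below uses a PCA projection, which is what a linear autoencoder with tied weights computes, to pull synthetic samples back toward the principal subspace of the real minority data:

    ```python
    import numpy as np

    def linear_denoise(real, synthetic, n_components=1):
        """Linear stand-in for an AutoEncoder denoising step: project
        SMOTE-style synthetic samples onto the top principal directions
        of the real minority data and reconstruct them, discarding the
        off-manifold component of the noise."""
        mu = real.mean(axis=0)
        _, _, vt = np.linalg.svd(real - mu, full_matrices=False)
        w = vt[:n_components]                 # (n_components, n_features)
        return (synthetic - mu) @ w.T @ w + mu

    # real minority samples lie on the line y = x; noisy points drift off it
    real = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
    noisy = np.array([[1.0, 1.4], [2.2, 1.8]])
    clean = linear_denoise(real, noisy)       # both points land back on y = x
    ```

    A real (nonlinear, deep) AutoEncoder generalises this idea to curved data manifolds; the linear version is only meant to make the encode-then-reconstruct mechanism explicit.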

  • A Novel Two-Lead Arrhythmia Classification System Based on CNN and LSTM

    Arrhythmia classification is useful during heart disease diagnosis. Although well established for intra-patient diagnoses, inter-patient arrhythmia classification remains difficult. Most previous work has focused on the intra-patient condition and has not followed the Association for the Advancement of Medical Instrumentation (AAMI) standards. Here, we propose a novel system for arrhythmia classification based on multi-lead electrocardiogram (ECG) signals. The core of the design is that we fuse two types of deep learning features with several common traditional features and select discriminating features using a binary particle swarm optimization (BPSO) algorithm. The resulting feature vector is then classified using a weighted support vector machine (SVM) classifier. For better generalization of the model and fair comparisons, we carried out inter-patient experiments and followed the AAMI standards. We found that, on common multi-classification metrics under either macro- or micro-averaging, our system outperforms most other state-of-the-art methods.
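    The BPSO feature-selection step can be sketched in miniature (this is a generic toy, not the paper's implementation; in the paper the fitness function would be classifier performance on the fused feature vector, whereas here it is a stand-in that rewards a known target mask):

    ```python
    import numpy as np

    def bpso_select(fitness, n_features, n_particles=12, iters=40, seed=0):
        """Toy binary PSO for feature selection: each particle is a 0/1
        mask over features; velocities are squashed through a sigmoid
        to give per-bit probabilities of selecting each feature."""
        rng = np.random.default_rng(seed)
        pos = rng.integers(0, 2, (n_particles, n_features))
        vel = rng.normal(0.0, 0.1, (n_particles, n_features))
        pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
        gbest = pbest[pbest_f.argmax()].copy()
        for _ in range(iters):
            r1 = rng.random((n_particles, n_features))
            r2 = rng.random((n_particles, n_features))
            vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
            prob = 1.0 / (1.0 + np.exp(-vel))          # bit-selection probability
            pos = (rng.random((n_particles, n_features)) < prob).astype(int)
            f = np.array([fitness(p) for p in pos])
            better = f > pbest_f
            pbest[better], pbest_f[better] = pos[better], f[better]
            gbest = pbest[pbest_f.argmax()].copy()
        return gbest

    # hypothetical fitness: reward masks close to a known "informative" subset
    target = np.array([1, 1, 1, 0, 0, 0, 0, 0])
    mask = bpso_select(lambda m: -np.abs(m - target).sum(), n_features=8)
    ```

    The sigmoid step is the standard trick that adapts continuous PSO to a binary search space.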

  • SMOTE and Feature Selection for More Effective Bug Severity Prediction

    Severity is one of the essential features of software bug reports and a crucial factor for developers in deciding which bugs should be fixed immediately and which can be deferred to a later release. Severity assignment is a manual process whose accuracy depends on the experience of the assignee. Prior research has proposed several models to automate this process, based on textual preprocessing of historical bug reports and classification techniques. Although bug repositories suffer from severity class imbalance, none of the prior studies investigated the impact of a class rebalancing technique on the accuracy of their models. In this paper, we propose a framework for predicting fine-grained severity levels that uses the over-sampling technique SMOTE to balance the severity classes and a feature selection scheme to reduce the data scale and select the most informative features for training a K-nearest neighbor (KNN) classifier. The KNN classifier uses a distance-weighted voting scheme to predict the proper severity level of a newly reported bug. We investigated the effectiveness of our approach on two large bug repositories, Eclipse and Mozilla, and the experimental results show that it outperforms cutting-edge studies in predicting the minority severity classes.
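    The distance-weighted voting scheme mentioned above is a standard KNN variant and can be sketched in a few lines (a generic illustration, not the paper's code): each of the k nearest neighbours votes for its class with weight inversely proportional to its distance.

    ```python
    import numpy as np

    def weighted_knn_predict(X_train, y_train, x, k=3):
        """Distance-weighted k-NN vote: each of the k nearest neighbours
        adds 1/(distance + eps) to the score of its class."""
        d = np.linalg.norm(X_train - x, axis=1)
        scores = {}
        for i in np.argsort(d)[:k]:
            scores[y_train[i]] = scores.get(y_train[i], 0.0) + 1.0 / (d[i] + 1e-9)
        return max(scores, key=scores.get)

    # toy feature vectors with hypothetical severity labels
    X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
    y = np.array([0, 0, 1])               # e.g. 0 = "minor", 1 = "critical"
    print(weighted_knn_predict(X, y, np.array([0.2, 0.2])))  # 0
    ```

    Compared with an unweighted majority vote, the inverse-distance weights let a single very close neighbour outvote several distant ones, which helps on sparse minority classes.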

  • A Safe Zone SMOTE Oversampling Algorithm Used in Earthquake Prediction Based on Extreme Imbalanced Precursor Data

    Earthquake prediction based on extremely imbalanced precursor data is a challenging task for standard algorithms, since even in an earthquake-prone zone the proportion of days with earthquakes per year is small. The general remedy is to generate more artificial data for the minority class, i.e. the earthquake occurrence data, but the most popular oversampling methods generate synthetic samples along line segments joining minority class instances, which is not suitable for earthquake precursor data. In this paper, we propose a Safe Zone Synthetic Minority Oversampling Technique (SZ-SMOTE) as an enhancement of the SMOTE data generation mechanism. SZ-SMOTE generates synthetic samples with a concentration mechanism in the hyper-sphere area around each selected minority instance. The performance of SZ-SMOTE is compared against no oversampling, SMOTE, and two popular SMOTE modifications, adaptive synthetic sampling (ADASYN) and borderline SMOTE (B-SMOTE), using six different classifiers. The experimental results show that the quality of earthquake prediction with SZ-SMOTE as the oversampling algorithm significantly outperforms that of the other oversampling algorithms.
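    The abstract does not specify SZ-SMOTE's exact safe-zone rule, but one plausible reading of "a hyper-sphere around each selected minority instance" can be sketched as follows (a hypothetical illustration, not the authors' algorithm): cap the sphere's radius at a fraction of the distance to the nearest majority-class sample, so synthetic points never drift into majority territory.

    ```python
    import numpy as np

    def safe_zone_oversample(minority, majority, n_new, shrink=0.5, seed=0):
        """Hypothetical safe-zone oversampler: draw each synthetic sample
        inside a hypersphere centred on a minority instance, with the
        radius capped at `shrink` times the distance to the nearest
        majority-class sample (the "safe zone")."""
        rng = np.random.default_rng(seed)
        out = []
        for _ in range(n_new):
            x = minority[rng.integers(len(minority))]
            radius = shrink * np.linalg.norm(majority - x, axis=1).min()
            direction = rng.normal(size=x.shape)
            direction /= np.linalg.norm(direction)    # random unit vector
            out.append(x + rng.random() * radius * direction)
        return np.array(out)

    minority = np.array([[0.0, 0.0], [1.0, 0.0]])
    majority = np.array([[10.0, 10.0], [12.0, 9.0]])
    new = safe_zone_oversample(minority, majority, n_new=6)
    ```

    Unlike plain SMOTE, which interpolates along segments between minority points, this scheme concentrates synthetic samples near each minority instance, matching the "concentration mechanism" described above.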

  • Enhanced Prediction for Observed Peptide Count in Protein Mass Spectrometry Data by Optimally Balancing the Training Dataset

    An imbalanced dataset hampers the learning of classifiers, a problem that is almost ubiquitous in biological datasets. Resampling is one of the common methods of dealing with it. In this study, we explore learning performance while varying the balancing ratio of training datasets, consisting of the observed and absent peptides in Mass Spectrometry experiments, across different machine learning algorithms. We observed that a balanced training set yields better performance than the imbalanced one, but the best performance is obtained at an intermediate balancing ratio. By experimenting with the Synthetic Minority Oversampling Technique (SMOTE) at different balancing ratios, we obtained the best results with a boosted random forest algorithm, achieving a sensitivity of 92.1%, specificity of 94.7%, overall accuracy of 93.4%, MCC of 0.869, and AUC of 0.982. This study also identifies the most discriminating features by applying a feature ranking algorithm. From these experiments, it can be inferred that the performance of machine learning algorithms on classification tasks can be enhanced by selecting an optimally balanced training dataset, which can be obtained by suitably modifying the class distribution.
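    Sweeping balancing ratios as described above reduces to a small bookkeeping computation: for each candidate ratio, how many synthetic minority samples must SMOTE add? A sketch with made-up class counts:

    ```python
    def synthetic_needed(n_min, n_maj, ratio):
        """Number of synthetic minority samples to add so that the
        minority:majority ratio of the training set reaches `ratio`."""
        return max(0, round(ratio * n_maj) - n_min)

    # hypothetical training set: 120 observed vs 1000 absent peptides
    for r in (0.25, 0.5, 0.75, 1.0):
        print(r, synthetic_needed(120, 1000, r))
    # 0.25 130
    # 0.5 380
    # 0.75 630
    # 1.0 880
    ```

    Training a classifier at each ratio and plotting the metrics against `r` is then enough to locate the intermediate optimum the study reports.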

  • Cost-Sensitive Learning of Fuzzy Rules for Imbalanced Classification Problems Using FURIA

    This paper verifies that cost-sensitive learning is a competitive approach for learning fuzzy rules in certain imbalanced classification problems. It is shown that there exist cost matrices whose use, in combination with a suitable classifier, improves on the results of some popular data-level techniques. The well-known FURIA algorithm is extended to take advantage of such cost matrices. A numerical study compares the proposed cost-sensitive FURIA to other state-of-the-art classification algorithms, based on fuzzy rules and on classical machine learning methods, across 64 different imbalanced datasets.
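    The core of any cost-sensitive approach is the decision rule induced by the cost matrix: predict the class with the lowest expected misclassification cost rather than the most probable class. A minimal generic sketch (the cost values are made up, and this is not FURIA itself):

    ```python
    import numpy as np

    def min_cost_class(probs, cost):
        """Pick the class with the lowest expected misclassification cost.
        probs[j] = estimated P(true class == j); cost[i, j] = cost of
        predicting class i when the true class is j."""
        return int(np.argmin(cost @ probs))

    # hypothetical costs: missing the rare positive class is 10x a false alarm
    cost = np.array([[0.0, 10.0],    # row 0: predict negative
                     [1.0,  0.0]])   # row 1: predict positive
    print(min_cost_class(np.array([0.8, 0.2]), cost))   # 1
    print(min_cost_class(np.array([0.95, 0.05]), cost)) # 0
    ```

    Note the first call: the positive class is predicted even at only 20% probability, because the cost matrix shifts the decision threshold, which is exactly how cost-sensitive learning compensates for imbalance without resampling the data.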

  • Performance Analysis of Two-Stage Iterative Ensemble Method over Random Oversampling Methods on Multiclass Imbalanced Datasets

    Data imbalance among multiclass datasets is very common in real-world applications. Existing studies reveal various past attempts to overcome the multiclass imbalance problem, a severe issue for typical supervised machine learning methods such as classification and regression. However, the problem still needs to be handled efficiently, since datasets include both safe and unsafe minority samples. Widely used oversampling techniques such as SMOTE and its variants face challenges in replicating or generating new data instances to balance multiple classes, particularly when the imbalance is high and rare samples are very few, which leads classifiers to misclassify instances. To lessen this problem, we propose a new data balancing method, a two-stage iterative ensemble method, to tackle imbalance in a multiclass environment. The proposed approach focuses on the influence of rare minority samples on learning from imbalanced datasets; its main idea is to balance the data before training, without changing the class distribution of the original dataset, so as to improve the learner's learning process. The approach is compared against two widely used oversampling techniques, and the results reveal a significant improvement in learning on multiclass imbalanced data.
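    The two-stage method is only described at a high level in this abstract. As a loose, hypothetical illustration of ensemble-style balancing that leaves the original dataset's class distribution untouched (simplified to the binary case, and not the authors' algorithm), each committee member can be trained on a balanced "bag":

    ```python
    import numpy as np

    def balanced_bags(X, y, n_bags=5, seed=0):
        """Hypothetical ensemble-balancing sketch: each bag keeps every
        minority sample and pairs it with an equal-sized bootstrap of
        the majority class, so each committee member trains on balanced
        data while the original dataset is left unchanged."""
        rng = np.random.default_rng(seed)
        minority = np.flatnonzero(y == 1)
        majority = np.flatnonzero(y == 0)
        bags = []
        for _ in range(n_bags):
            maj = rng.choice(majority, size=len(minority), replace=True)
            idx = np.concatenate([minority, maj])
            bags.append((X[idx], y[idx]))
        return bags

    X = np.arange(24).reshape(12, 2)
    y = np.array([0] * 9 + [1] * 3)       # 9 majority vs 3 minority samples
    bags = balanced_bags(X, y, n_bags=4)  # each bag holds 3 + 3 samples
    ```

    A final prediction would then be a vote over the members, letting the ensemble see balanced training data without synthesising any instances.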