
  • article (No Access)

    Enhanced Prediction for Observed Peptide Count in Protein Mass Spectrometry Data by Optimally Balancing the Training Dataset

    An imbalanced dataset hampers the learning of classifiers, and this imbalance problem is almost ubiquitous in biological datasets. Resampling is one of the common methods for dealing with it. In this study, we explore learning performance under varying balancing ratios of the training dataset, consisting of the peptides observed and absent in mass spectrometry experiments, across different machine learning algorithms. We observed that the fully balanced ratio yields better performance than the imbalanced dataset, but is not the best compared with some intermediate ratios. By applying the Synthetic Minority Oversampling Technique (SMOTE) at different balancing ratios, we obtained the best results with a boosted random forest algorithm: sensitivity of 92.1%, specificity of 94.7%, overall accuracy of 93.4%, MCC of 0.869, and AUC of 0.982. This study also identifies the most discriminating features by applying a feature-ranking algorithm. From these experiments, it can be inferred that the performance of machine learning algorithms on classification tasks can be enhanced by selecting an optimally balanced training dataset, obtained by suitably modifying the class distribution.
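The balancing-ratio experiment described above can be sketched with a minimal numpy-only SMOTE. This is an illustrative reconstruction, not the study's code; the function name, toy data, and ratio grid are assumptions.

```python
import numpy as np

def smote(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE: each synthetic point lies on the segment between a
    minority seed point and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest minority neighbours
    seeds = rng.integers(0, len(X_min), n_new)  # which minority point to start from
    picks = nn[seeds, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return X_min[seeds] + gap * (X_min[picks] - X_min[seeds])

# toy data: 100 majority (absent peptides) vs 10 minority (observed peptides)
rng = np.random.default_rng(1)
X_maj, X_min = rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (10, 4))

# sweep minority:majority balancing ratios, as in the study
for ratio in (0.25, 0.5, 0.75, 1.0):
    n_new = int(ratio * len(X_maj)) - len(X_min)
    X_bal = np.vstack([X_min, smote(X_min, n_new)])
    # X_bal would now be combined with X_maj to train each classifier
    print(ratio, len(X_bal), len(X_maj))
```

Because each synthetic sample is a convex combination of two real minority samples, the oversampled class never leaves the region spanned by the original minority instances.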

  • article (No Access)

    A Safe Zone SMOTE Oversampling Algorithm Used in Earthquake Prediction Based on Extreme Imbalanced Precursor Data

    Earthquake prediction from extremely imbalanced precursor data is a challenging task for standard algorithms: even in an earthquake-prone zone, days with earthquakes make up only a small minority of each year. The general approach is to generate artificial data for the minority class, i.e., the earthquake-occurrence data. However, the most popular oversampling methods generate synthetic samples along line segments joining minority-class instances, which is not well suited to earthquake precursor data. In this paper, we propose the Safe Zone Synthetic Minority Oversampling Technique (SZ-SMOTE) as an enhancement of the SMOTE data-generation mechanism. SZ-SMOTE generates synthetic samples with a concentration mechanism in the hyper-sphere around each selected minority instance. The performance of SZ-SMOTE is compared against no oversampling, SMOTE, and its popular modifications adaptive synthetic sampling (ADASYN) and borderline SMOTE (B-SMOTE) on six different classifiers. The experimental results show that earthquake prediction using SZ-SMOTE as the oversampling algorithm significantly outperforms prediction using the other oversampling algorithms.
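The hyper-sphere idea can be contrasted with line-segment SMOTE in a short numpy sketch. The abstract does not specify the concentration mechanism, so the radius choice below (a shrunken distance to the nearest minority neighbour) is purely an assumption for illustration:

```python
import numpy as np

def sz_smote_sketch(X_min, per_point=5, shrink=0.5, seed=0):
    """Sketch of the SZ-SMOTE idea: draw synthetic points inside a small
    hyper-sphere centred on each minority instance, rather than on line
    segments between instances. The radius (shrink * distance to the
    nearest minority neighbour) is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    radii = shrink * d.min(axis=1)        # per-point "safe zone" radius
    out = []
    for x, r in zip(X_min, radii):
        v = rng.normal(size=(per_point, X_min.shape[1]))
        v /= np.linalg.norm(v, axis=1, keepdims=True)        # random directions
        out.append(x + v * (r * rng.random((per_point, 1))))  # random radii < r
    return np.vstack(out)

X_min = np.random.default_rng(2).normal(0, 1, (8, 3))
synth = sz_smote_sketch(X_min)
print(synth.shape)   # (40, 3)
```

Keeping each synthetic point close to a real minority instance avoids placing samples deep between two distant instances, which is the failure mode the paper attributes to segment-based oversampling on precursor data.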

  • article (No Access)

    An Empirical Study on Software Defect Prediction Using Over-Sampling by SMOTE

    Software defect prediction suffers from class imbalance, and addressing it is important for improving prediction performance. SMOTE is a useful over-sampling method for this purpose. In this paper, we study some problems faced in software defect prediction with the SMOTE algorithm. We perform experiments to investigate how two factors, the percentage of appended minority-class samples and the number of nearest neighbors, influence prediction performance, and we compare the performance of several classifiers, using paired t-tests to assess the statistical significance of the results. We also introduce the notions of effective and ineffective over-sampling, together with evaluation criteria for judging whether an over-sampling is effective, and apply these criteria to evaluate our results.

    The results show that both factors, the percentage of appended minority-class samples and the number of nearest neighbors, influence prediction performance, and that over-sampling by SMOTE is effective for several classifiers.
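The two factors studied map directly onto the parameters of the original SMOTE formulation (the percentage N of appended minority samples and the neighbourhood size k). A minimal numpy sweep over both, with illustrative names rather than the paper's code, looks like:

```python
import numpy as np

def smote(X_min, n_pct, k, seed=0):
    """Append n_pct% synthetic minority samples using k nearest neighbours,
    following the interpolation mechanism of the original SMOTE."""
    rng = np.random.default_rng(seed)
    n_new = int(len(X_min) * n_pct / 100)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    s = rng.integers(0, len(X_min), n_new)
    p = nn[s, rng.integers(0, k, n_new)]
    return X_min[s] + rng.random((n_new, 1)) * (X_min[p] - X_min[s])

X_min = np.random.default_rng(0).normal(0, 1, (20, 5))
for n_pct in (100, 200, 300):   # percentage of appended minority class
    for k in (3, 5):            # number of nearest neighbours
        X_aug = np.vstack([X_min, smote(X_min, n_pct, k)])
        # each (n_pct, k) setting would be scored per classifier and the
        # score differences compared with a paired t-test
        print(n_pct, k, len(X_aug))
```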

  • article (No Access)

    SMOTE and Feature Selection for More Effective Bug Severity Prediction

    “Severity” is one of the essential features of software bug reports and a crucial factor for developers in deciding which bugs should be fixed immediately and which can be deferred to a later release. Severity assignment is a manual process whose accuracy depends on the experience of the assignee. Prior research has proposed several models to automate this process, based on textual preprocessing of historical bug reports and classification techniques. Although bug repositories suffer from severity class imbalance, none of the prior studies investigated the impact of a class-rebalancing technique on the accuracy of their models. In this paper, we propose a framework for predicting fine-grained severity levels that employs an over-sampling technique, SMOTE, to balance the severity classes, and a feature selection scheme to reduce the data scale and select the most informative features for training a K-nearest neighbor (KNN) classifier. The KNN classifier uses a distance-weighted voting scheme to predict the severity level of a newly reported bug. We evaluated the effectiveness of our approach on two large bug repositories, Eclipse and Mozilla, and the experimental results showed that it outperforms cutting-edge studies in predicting the minority severity classes.
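The distance-weighted voting used by the KNN classifier can be sketched in a few lines of numpy. The helper name and the toy severity levels are illustrative, not from the paper:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, eps=1e-12):
    """Distance-weighted KNN vote: each of the k nearest neighbours votes
    for its class with weight 1/distance, so closer bug reports count more
    than a plain majority vote would allow."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + eps)                     # inverse-distance weights
    classes = np.unique(y_train[idx])
    scores = {c: w[y_train[idx] == c].sum() for c in classes}
    return max(scores, key=scores.get)

# toy feature vectors; severity levels: 0 = minor, 1 = major, 2 = critical
X = np.array([[0.0, 0], [0.1, 0], [1, 1], [1.1, 1], [2, 2]])
y = np.array([0, 0, 1, 1, 2])
print(weighted_knn_predict(X, y, np.array([0.05, 0.0]), k=3))  # -> 0
```

The inverse-distance weighting helps the minority severity classes: a single very close minority neighbour can outvote two farther majority neighbours.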

  • article (Open Access)

    Fault Detection of Wind Turbine Pitch Connection Bolts Based on TSDAS-SMOTE with XGBoost

    Fractals, 01 Jan 2023

    To address the class imbalance in the operation monitoring data of wind turbine (WT) pitch connection bolts, an improved Borderline-SMOTE oversampling method based on a “two-step decision” with adaptive selection of synthetic instances (TSDAS-SMOTE) is proposed. TSDAS-SMOTE is then combined with XGBoost to construct a WT pitch connection bolt fault detection model. TSDAS-SMOTE generates new samples via “two-step decision making” to avoid the blurring of class boundaries that Borderline-SMOTE tends to cause when oversampling. First, each fault-class sample perceives the characteristics of its nearest neighbor samples. If a fault-class sample's characteristics differ from those of all its nearest neighbors, the sample is identified as interference and filtered out. Second, the fault-class samples in the boundary zone are used as synthetic instances to generate new samples adaptively. Finally, the normal-class samples in the boundary zone identify unqualified newly generated samples there, based on minimum Euclidean distance, and these unqualified samples are eliminated. In the second decision step, since the first step removes some of the newly generated samples, the remaining fault-class samples, free of interference and boundary-zone samples, serve as synthetic instances to continue adaptively generating new samples. The result is a balanced data set with a clear boundary zone between classes, which is then used to train a WT pitch connection bolt fault detection model based on the XGBoost algorithm.
    The experimental results show that, compared with six popular oversampling methods such as Borderline-SMOTE, Cluster-SMOTE, and K-means-SMOTE, the fault detection model built with the proposed oversampling method achieves lower missed alarm rate (MAR) and false alarm rate (FAR) than the compared models. It is therefore well suited to fault detection of large WT pitch connection bolts.
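The first-step interference filter can be sketched on its own: a fault-class (minority) sample whose k nearest neighbours are all normal-class is treated as noise and dropped, much like the noise category in Borderline-SMOTE. The rest of the two-step mechanism is not reproduced here; names and the toy data are illustrative.

```python
import numpy as np

def filter_interference(X_min, X_maj, k=3):
    """First-step decision (sketch): drop minority (fault-class) samples whose
    k nearest neighbours in the full data set are all majority-class, i.e.
    samples whose characteristics differ from all their neighbours."""
    X_all = np.vstack([X_min, X_maj])
    is_min = np.arange(len(X_all)) < len(X_min)
    keep = []
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        d[i] = np.inf                        # ignore the point itself
        nn = np.argsort(d)[:k]
        if is_min[nn].any():                 # at least one fault-class neighbour
            keep.append(i)
    return X_min[keep]

rng = np.random.default_rng(3)
X_maj = rng.normal(0, 1, (50, 2))                 # normal-class samples
X_min = np.vstack([rng.normal(4, 0.3, (6, 2)),    # a compact fault cluster
                   np.zeros((1, 2))])             # one isolated fault point
print(len(filter_interference(X_min, X_maj)))     # the isolated point is dropped
```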

  • article (No Access)

    Cost-Sensitive Learning of Fuzzy Rules for Imbalanced Classification Problems Using FURIA

    This paper verifies that cost-sensitive learning is a competitive approach for learning fuzzy rules in certain imbalanced classification problems. It is shown that there exist cost matrices whose use, in combination with a suitable classifier, improves on the results of some popular data-level techniques. The well-known FURIA algorithm is extended to take advantage of such cost matrices. A numerical study compares the proposed cost-sensitive FURIA to other state-of-the-art classification algorithms, based on fuzzy rules and on classical machine learning methods, on 64 different imbalanced datasets.
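The core of any cost-matrix approach is the decision rule: predict the class that minimizes expected cost rather than the most probable class. A minimal sketch, with illustrative cost values not taken from the paper:

```python
import numpy as np

# Cost matrix C[i, j] = cost of predicting class j when the truth is class i.
# Values below are illustrative: missing a minority instance costs 10x a
# false alarm, the typical asymmetry in imbalanced problems.
C = np.array([[0.0, 1.0],     # truth = majority: false alarm costs 1
              [10.0, 0.0]])   # truth = minority: a miss costs 10

def min_expected_cost(p):
    """p = vector of P(class i | x); the expected cost of predicting j is
    sum_i p[i] * C[i, j], so we pick the column of p @ C with minimum cost."""
    return int(np.argmin(p @ C))

# even at only 20% minority probability, the high miss cost flips the call
print(min_expected_cost(np.array([0.8, 0.2])))  # -> 1
```

This is exactly why a well-chosen cost matrix can substitute for data-level rebalancing: it shifts the decision threshold without altering the training distribution.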

  • article (No Access)

    Classification of Imbalanced Data Using SMOTE and AutoEncoder Based Deep Convolutional Neural Network

    Imbalanced data classification is a challenging issue in many domains, including medical intelligent diagnosis and fraudulent transaction analysis. The performance of a conventional classifier degrades under the imbalanced class distribution of the training data set. Recently, machine learning and deep learning techniques have been used for imbalanced data classification, and data preprocessing approaches, such as data augmentation, are also suitable for handling skewed class distributions. The Synthetic Minority Oversampling Technique (SMOTE) is a promising class-balancing approach, but it generates noise while creating synthetic samples. In this paper, an AutoEncoder is used to reduce the noise generated by SMOTE, and a deep one-dimensional Convolutional Neural Network is then used for classification. The performance of the proposed method is evaluated and compared with existing approaches using metrics such as Precision, Recall, Accuracy, Area Under the Curve, and Geometric Mean. Ten data sets with imbalance ratios ranging from 1.17 to 577.87 and sizes ranging from 303 to 284,807 instances are used in the experiments: Heart-Disease, Mammography, Pima Indian diabetes, Adult, Oil-Spill, Phoneme, Creditcard, BankNoteAuthentication, Balance scale weight & distance database, and Yeast. The proposed method achieves accuracies of 96.1%, 96.5%, 87.7%, 87.3%, 95%, 92.4%, 98.4%, 86.1%, 94%, and 95.9% on these data sets, respectively. The results suggest that this method outperforms other deep learning and machine learning methods with respect to G-mean and other performance metrics.
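The evaluation metrics the comparison relies on, in particular the geometric mean, can be computed directly from the binary confusion counts. A self-contained sketch (the function name is illustrative):

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Precision, recall, accuracy and geometric mean from a binary
    confusion matrix. G-mean = sqrt(sensitivity * specificity) stays low
    whenever either class is badly classified, unlike plain accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    recall = tp / (tp + fn)                 # sensitivity on the minority class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / len(y_true)
    g_mean = np.sqrt(recall * specificity)
    return precision, recall, accuracy, g_mean

# toy imbalanced evaluation: 3 minority (1) vs 7 majority (0) instances
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(imbalance_metrics(y_true, y_pred))
```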

  • article (No Access)

    Performance Analysis of Two-Stage Iterative Ensemble Method over Random Oversampling Methods on Multiclass Imbalanced Datasets

    Data imbalance among multiclass datasets is very common in real-world applications. Existing studies reveal various past attempts to overcome this multiclass imbalance problem, a severe issue for typical supervised machine learning methods such as classification and regression. However, there is still a need to handle the imbalance efficiently, as datasets include both safe and unsafe minority samples. Widely used oversampling techniques like SMOTE and its variants face challenges in replicating or generating new data instances to balance multiple classes, particularly when the imbalance is high and the number of rare samples is very small, leading the classifier to misclassify data instances. To lessen this problem, we propose a new data-balancing method, a two-stage iterative ensemble, to tackle imbalance in the multiclass setting. The approach focuses on the influence of rare minority samples on learning from imbalanced datasets; its main idea is to balance the data without any change in class distribution before it is passed to the learner, so as to improve the learning process. The proposed approach is compared against two widely used oversampling techniques, and the results reveal a significant improvement in learning on multiclass imbalanced data.
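One of the baselines such a method is typically compared against, random oversampling of every class up to the majority count, is simple to sketch in numpy (names and toy labels are illustrative):

```python
import numpy as np

def balance_multiclass(X, y, seed=0):
    """Random oversampling for multiclass data: replicate instances of each
    class (with replacement) until every class matches the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    Xs, ys = [], []
    for c in classes:
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, target - len(idx), replace=True)
        take = np.concatenate([idx, extra])
        Xs.append(X[take])
        ys.append(y[take])
    return np.vstack(Xs), np.concatenate(ys)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = np.repeat([0, 1, 2], [70, 20, 10])          # skewed multiclass labels
Xb, yb = balance_multiclass(X, y)
print(np.unique(yb, return_counts=True)[1])     # -> [70 70 70]
```

Because replication only duplicates existing rare samples, it cannot add information about unsafe minority regions, which is the limitation the two-stage iterative ensemble is aimed at.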

  • article (No Access)

    A Novel Two-Lead Arrhythmia Classification System Based on CNN and LSTM

    Arrhythmia classification is useful in heart disease diagnosis. Although intra-patient classification is well established, inter-patient arrhythmia classification remains difficult. Most previous work has focused on the intra-patient setting and has not followed the Association for the Advancement of Medical Instrumentation (AAMI) standards. Here, we propose a novel system for arrhythmia classification based on multi-lead electrocardiogram (ECG) signals. The core of the design is that we fuse two types of deep learning features with some common traditional features and select discriminating features using a binary particle swarm optimization (BPSO) algorithm. The feature vector is then classified using a weighted support vector machine (SVM) classifier. For better model generalization and fair comparisons, we carried out inter-patient experiments following the AAMI standards. Using common multi-classification metrics, both macro- and micro-averaged, our system outperforms most other state-of-the-art methods.
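The BPSO feature-selection step can be sketched on a toy fitness function. This is a generic binary PSO with sigmoid transfer, not the paper's configuration; the hyperparameters, fitness, and relevance values are assumptions (in the paper the fitness would come from classifier performance with the SVM).

```python
import numpy as np

def bpso_select(fitness, n_feats, n_particles=10, iters=30, seed=0):
    """Minimal binary PSO for feature selection: each particle is a 0/1 mask
    over features; velocities pass through a sigmoid giving the probability
    of each bit being 1. Inertia/acceleration constants are illustrative."""
    rng = np.random.default_rng(seed)
    pos = rng.integers(0, 2, (n_particles, n_feats))
    vel = np.zeros((n_particles, n_feats))
    pbest, pfit = pos.copy(), np.array([fitness(p) for p in pos])
    g = pbest[pfit.argmax()].copy()                   # global best mask
    for _ in range(iters):
        r1, r2 = rng.random(vel.shape), rng.random(vel.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (g - pos)
        pos = (rng.random(vel.shape) < 1 / (1 + np.exp(-vel))).astype(int)
        fit = np.array([fitness(p) for p in pos])
        improved = fit > pfit
        pbest[improved], pfit[improved] = pos[improved], fit[improved]
        g = pbest[pfit.argmax()].copy()
    return g

# toy fitness: reward two 'informative' features, penalise mask size
relevance = np.array([0.9, 0.8, 0.1, 0.05, 0.02])
fitness = lambda m: relevance @ m - 0.2 * m.sum()
mask = bpso_select(fitness, 5)
print(mask)
```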

  • article (No Access)

    Data Imbalance in Autism Pre-Diagnosis Classification Systems: An Experimental Study

    Machine learning (ML) is a branch of computer science that is rapidly gaining popularity in healthcare due to its ability to explore large datasets and discover useful patterns that can be interpreted for decision-making and prediction. ML techniques are used to analyse clinical parameters and their combinations for prognosis, therapy planning and support, and patient management and wellbeing. In this research, we investigate a crucial problem in medical applications such as autism spectrum disorder (ASD) diagnosis: data imbalance, in which controls far outnumber cases in the dataset. In autism diagnosis data, the instances are skewed toward one class, i.e. the no-ASD class is larger than the ASD class, which may cause models to favour the majority class and undermine the minority class. This research experimentally measures the impact of the class imbalance issue on the performance of different classifiers on real autism datasets when various data-imbalance approaches are utilised in the pre-processing phase. We employ oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE), and undersampling, with classifiers including Naive Bayes, RIPPER, C4.5 and Random Forest, measuring the impact on model performance in terms of area under the curve and other metrics. The results indicate that oversampling techniques are superior to undersampling techniques, at least for the toddlers’ autism dataset that we consider, and suggest that further work should incorporate sampling techniques with feature selection to generate models that do not overfit the dataset.
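The undersampling side of the comparison is the simpler one to sketch: discard majority-class instances until the classes match. A minimal numpy version, with illustrative names and toy labels:

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Random undersampling: keep only as many instances of each class as the
    smallest class has. Unlike oversampling, this throws information away,
    which is one reason it can compare unfavourably with SMOTE-style methods."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate([rng.choice(np.flatnonzero(y == c), n, replace=False)
                           for c in classes])
    return X[keep], y[keep]

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 4))
y = np.array([0] * 100 + [1] * 20)            # 'no ASD' heavily outnumbers 'ASD'
Xu, yu = random_undersample(X, y)
print(np.unique(yu, return_counts=True)[1])   # -> [20 20]
```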

  • article (No Access)

    SMOTE-Based Homogeneous Prediction for Aging-Related Bugs in Cloud-Oriented Software

    Software aging is the process caused by Aging-Related Bugs (ARBs), which leads to resource depletion and performance degradation in the long run. ARBs are few in number and difficult to find and reproduce, so predicting them is necessary to save cost and time in the testing phase. Because ARBs occur in low proportion compared with non-ARBs, the resulting class imbalance leaves insufficient training data for prediction models. In this study, the Synthetic Minority Oversampling Technique (SMOTE) is applied together with homogeneous cross-project ARB prediction to reduce the effect of the imbalance problem. SMOTE synthetically oversamples the minority instances to balance the dataset and improve the capability of defect prediction models. Homogeneous cross-project prediction is implemented where the datasets differ but the distributions of the metric sets of the training and testing datasets are similar. The experiment is conducted on five cloud-oriented software systems: Cassandra, Hive, Storm, Hadoop HDFS and Hadoop MapReduce. The novelty of this study is the combination of SMOTE and homogeneous cross-project defect prediction for ARBs in cloud-oriented software. A comparative analysis is also conducted to contrast SMOTE and non-SMOTE results across machine learning classifiers. The results show that SMOTE is an efficient method for addressing the class imbalance problem in ARB prediction.
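Homogeneous cross-project prediction presupposes that the source and target projects have similar metric distributions. One crude way to check this (an illustrative sketch, not the paper's procedure; the threshold and project names are assumptions) is a per-metric standardised mean difference:

```python
import numpy as np

def distribution_similarity(X_src, X_tgt):
    """Per-metric standardised mean difference between a source (training)
    project and a target (testing) project. Small values suggest the metric
    distributions are similar enough for homogeneous cross-project training;
    any cut-off applied to these scores would be an assumption."""
    mu_s, mu_t = X_src.mean(0), X_tgt.mean(0)
    sd = np.sqrt((X_src.var(0) + X_tgt.var(0)) / 2) + 1e-12
    return np.abs(mu_s - mu_t) / sd       # one score per software metric

rng = np.random.default_rng(6)
proj_a = rng.normal(0.0, 1.0, (200, 3))   # e.g. metric values from Hive
proj_b = rng.normal(0.1, 1.0, (200, 3))   # e.g. Cassandra, similar distribution
print(distribution_similarity(proj_a, proj_b))
```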

  • chapter (No Access)

    Study of Traffic Incident Detection with Machine Learning Methods

    Traffic incidents occur frequently, leading to inefficient road operations and serious harm to society and individuals. An effective model is built for traffic incident detection, combining the Synthetic Minority Oversampling Technique (SMOTE) as an oversampling technique, Tomek links as a data-cleaning technique, and Random Forest as a classifier. The experimental results indicate that the proposed method significantly improves overall performance over other methods for automatic traffic incident detection on unbalanced datasets.
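The Tomek-link cleaning step pairs naturally with SMOTE: after oversampling, borderline pairs from opposite classes are located and the majority member removed. A minimal numpy sketch (function name and the 1-D toy data are illustrative):

```python
import numpy as np

def remove_tomek_links(X, y):
    """Tomek-link cleaning: a pair (i, j) from opposite classes that are each
    other's nearest neighbour forms a Tomek link; dropping the majority-class
    member of each link sharpens the class boundary after oversampling."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                     # nearest neighbour of each point
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:       # mutual NNs, opposite classes
            drop.add(i if y[i] == 0 else j)   # class 0 = majority here
    keep = np.array([i for i in range(len(X)) if i not in drop])
    return X[keep], y[keep]

X = np.array([[0.0], [0.2], [0.25], [5.0], [5.1]])
y = np.array([0, 0, 1, 0, 0])      # the pair (0.2, 0.25) forms a Tomek link
Xc, yc = remove_tomek_links(X, y)
print(len(Xc))   # -> 4: the majority point at 0.2 is removed
```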