
  Bestsellers

  • Article (No Access)

    FEEDFORWARD NEURAL NETWORK MODELS FOR HANDLING CLASS OVERLAP AND CLASS IMBALANCE

    This paper proposes a framework for training feedforward neural network models capable of handling class overlap and imbalance by minimizing an error function that compensates for such imperfections of the training set. A special case of the proposed error function can be used for training variance-controlled neural networks (VCNNs), which are developed to handle class overlap by minimizing an error function involving the class-specific variance (CSV) computed at their outputs. Another special case of the proposed error function can be used for training class-balancing neural networks (CBNNs), which are developed to handle class imbalance by relying on class-specific correction (CSC). VCNNs and CBNNs are compared with conventional feedforward neural networks (FFNNs), quantum neural networks (QNNs), and resampling techniques. The properties of VCNNs and CBNNs are illustrated by experiments on artificial data. Various experiments involving real-world data reveal the advantages offered by VCNNs and CBNNs in the presence of class overlap and class imbalance.
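
    The error-function idea above lends itself to a compact illustration. Below is a minimal sketch, assuming the paper's error function can be approximated as a task loss plus a weighted class-specific variance (CSV) penalty on the network outputs; the weighting scheme, the use of MSE as the task loss, and the lambda value are assumptions, not the authors' exact formulation.

      import torch
      import torch.nn.functional as F

      def csv_penalty(outputs, labels):
          # Mean variance of the network outputs, computed separately per class.
          penalty = outputs.new_tensor(0.0)
          classes = labels.unique()
          for c in classes:
              class_out = outputs[labels == c]
              if class_out.shape[0] > 1:               # variance needs >1 sample
                  penalty = penalty + class_out.var(dim=0).mean()
          return penalty / len(classes)

      def vcnn_loss(outputs, targets, labels, lam=0.1):
          # Task loss plus the weighted CSV term (lam is an assumed weight).
          return F.mse_loss(outputs, targets) + lam * csv_penalty(outputs, labels)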

  • Article (No Access)

    COST-SENSITIVE NEURAL NETWORK CLASSIFIERS FOR POSTCODE RECOGNITION

    Most traditional postcode recognition systems implicitly assume that the distribution of the 10 numerals (0–9) is balanced. However, this is far from a reasonable assumption, because the distribution of 0–9 in the postcodes of a country or city is generally imbalanced: some numerals appear in many more postcodes than others. In this paper, we study cost-sensitive neural network classifiers to address the class imbalance problem in postcode recognition. Four methods, namely cost-sampling, cost-convergence, rate-adapting, and threshold-moving, are considered in training neural networks. Cost-sampling adjusts the distribution of the training data such that the costs of classes are conveyed explicitly by the appearances of their instances. Cost-convergence and rate-adapting are carried out in the training phase by modifying the training algorithm of the neural network. Threshold-moving increases the probability estimates of expensive classes to prevent samples with higher costs from being misclassified. Experiments are conducted on 10,702 postcode images using five cost matrices based on the distribution of numerals in postcodes. The results suggest that cost-sensitive learning is indeed effective for class-imbalanced postcode analysis and recognition, and that cost-sampling with a proper cost matrix outperforms the other methods in this application.
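
    Threshold-moving is the simplest of the four methods to illustrate, since it needs no retraining. A minimal sketch, assuming costs enter as per-class multipliers on the posterior estimates; the example cost vector is illustrative, not one of the paper's five matrices.

      import numpy as np

      def threshold_moving(probs, class_costs):
          # probs: (n_samples, n_classes) posteriors from an already-trained net;
          # class_costs: cost of misclassifying each class.
          adjusted = probs * class_costs                  # boost expensive classes
          adjusted /= adjusted.sum(axis=1, keepdims=True)
          return adjusted.argmax(axis=1)

      print(threshold_moving(np.array([[0.7, 0.3], [0.6, 0.4]]),
                             np.array([1.0, 2.0])))       # -> [0 1]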

  • Article (No Access)

    A Safe Zone SMOTE Oversampling Algorithm Used in Earthquake Prediction Based on Extreme Imbalanced Precursor Data

    Earthquake prediction based on extremely imbalanced precursor data is a challenging task for standard algorithms, since even in an earthquake-prone zone the days with earthquakes constitute only a small fraction of each year. The usual remedy is to generate artificial data for the minority class, i.e., the earthquake-occurrence data. However, the most popular oversampling methods generate synthetic samples along line segments joining minority class instances, which is not suitable for earthquake precursor data. In this paper, we propose the Safe Zone Synthetic Minority Oversampling Technique (SZ-SMOTE) as an enhancement of the SMOTE data generation mechanism. SZ-SMOTE generates synthetic samples with a concentration mechanism in the hyper-sphere area around each selected minority instance. The performance of SZ-SMOTE is compared against no oversampling, SMOTE, and its popular modifications adaptive synthetic sampling (ADASYN) and borderline SMOTE (B-SMOTE) on six different classifiers. The experimental results show that earthquake prediction using SZ-SMOTE as the oversampling algorithm significantly outperforms prediction using the other oversampling algorithms.
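
    A hedged sketch of hypersphere-based generation in the spirit of SZ-SMOTE: synthetic points are drawn inside a small "safe zone" ball around each minority instance rather than on line segments between instances. The radius rule and the squaring used to concentrate samples near the seed are assumptions; the paper's concentration mechanism may differ.

      import numpy as np

      def safe_zone_oversample(X_min, n_new, radius=0.1, seed=0):
          # Draw each synthetic point inside a ball around a random minority seed.
          rng = np.random.default_rng(seed)
          seeds = X_min[rng.integers(0, len(X_min), size=n_new)]
          directions = rng.normal(size=(n_new, X_min.shape[1]))
          directions /= np.linalg.norm(directions, axis=1, keepdims=True)
          radii = radius * rng.random(n_new) ** 2        # squaring concentrates
          return seeds + radii[:, None] * directions     # points near the seed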

  • Article (No Access)

    Multi-Content Merging Network Based on Focal Loss and Convolutional Block Attention in Hyperspectral Image Classification

    Simultaneous extraction of spectral and spatial features and their fusion is currently a popular approach in hyperspectral image (HSI) classification and has achieved satisfactory results in several studies. Because objects in an HSI often appear at different scales, it is necessary to extract multi-scale features; however, many spectral-spatial feature fusion methods do not take this into account and therefore cannot capture sufficient features across widely differing scales. The Multi-Content Merging Network (MCMN) proposed in this paper uses a multi-branch fusion structure with multiple dilated convolution kernels to extract multi-scale spatial features. To counter interference from surrounding heterogeneous objects, useful information from different directions is also fused to merge multiple regional features. MCMN introduces a convolutional block attention mechanism that extracts attention features in both the spatial and spectral directions, so that the network can focus on the most useful parts, which effectively improves model performance. In addition, since the number of samples per class is often imbalanced, which hampers training, we apply the focal loss function to mitigate this effect. Experimental results on three datasets show that MCMN clearly outperforms the comparison models, highlighting the role of the MCMN structure.
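
    The focal loss mentioned above is standard and can be sketched directly. The gamma value and the optional per-class alpha weights below are common defaults, not necessarily the paper's settings.

      import torch
      import torch.nn.functional as F

      def focal_loss(logits, targets, gamma=2.0, alpha=None):
          # Down-weights well-classified (easy) samples by (1 - p_t)^gamma.
          log_p = F.log_softmax(logits, dim=1)
          log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
          pt = log_pt.exp()
          loss = -((1.0 - pt) ** gamma) * log_pt
          if alpha is not None:                    # optional per-class weights
              loss = alpha[targets] * loss
          return loss.mean()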

  • Article (No Access)

    FAULT DETECTION FOR THE CLASS IMBALANCE PROBLEM IN SEMICONDUCTOR MANUFACTURING PROCESSES

    In the semiconductor manufacturing process, fault detection, which aims at constructing a decision tool to maintain high process yields, is a major step of process control. Unfortunately, the class imbalance typical of the modern semiconductor industry makes feature selection for fault detection quite challenging, yet this characteristic has usually been ignored in the open literature. This paper analyzes the challenge and attributes it partly to dataset shift, small sample sizes, and the class overlap caused by class imbalance. To cope with these problems, a new feature selection approach is proposed that combines global and local resampling, named ensemble manifold sensitive margin Fisher analysis (EMSMFA). Our approach consists of three key components: (1) at the global level, a bagging-based ensemble model is used to overcome the overfitting caused by dataset shift; (2) at the local level, a manifold-based oversampling method named the weighted synthetic minority oversampling technique (WSMOTE) is proposed to solve the small-sample problem in the minority class; and (3) sensitive margin Fisher analysis (SMFA) is used to address the class overlap. The proposed fault detection method is demonstrated through its application to the semiconductor wafer fabrication process, and the experimental results confirm that EMSMFA improves fault detection performance.
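
    Only the global level of EMSMFA is easy to sketch generically: a bagging ensemble whose members are trained on different resampled bags. Standard SMOTE stands in for the paper's WSMOTE, the SMFA component is omitted, and the base learner and bag count are assumptions.

      import numpy as np
      from imblearn.over_sampling import SMOTE
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.utils import resample

      def bagged_oversampled_models(X, y, n_estimators=10):
          models = []
          for i in range(n_estimators):
              Xb, yb = resample(X, y, random_state=i)           # bootstrap bag
              Xr, yr = SMOTE(random_state=i).fit_resample(Xb, yb)
              models.append(DecisionTreeClassifier(random_state=i).fit(Xr, yr))
          return models

      def majority_vote(models, X):
          votes = np.stack([m.predict(X) for m in models])      # int labels assumed
          return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)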

  • Article (No Access)

    Discriminative Feature Selection Based on Imbalance SVDD for Fault Detection of Semiconductor Manufacturing Processes

    Feature selection has become a key step in fault detection. Unfortunately, the class imbalance in the modern semiconductor industry makes feature selection quite challenging. This paper analyzes the challenges and indicates the limitations of traditional supervised and unsupervised feature selection methods. To cope with these limitations, a new feature selection method named imbalanced support vector data description-radius-recursive feature elimination (ISVDD-radius-RFE) is proposed. When selecting features, ISVDD-radius-RFE has three advantages: (1) it is designed to find the most representative features by capturing the true shape of the normal samples; (2) it can represent the shape of the normal samples more accurately by introducing discriminant information from the fault samples; and (3) it is optimized for fault detection, where imbalanced data are common. The kernel version of ISVDD-radius-RFE is also described in this paper. The proposed method is demonstrated through its application to the banana set and the SECOM dataset, and the experimental results confirm that ISVDD-radius-RFE and kernel ISVDD-radius-RFE improve fault detection performance.
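
    A heavily simplified sketch of radius-guided recursive feature elimination: a crude data-description radius (the maximum distance to the centroid of the normal samples) stands in for the SVDD radius, and the discriminant information from fault samples used by ISVDD is not reproduced here.

      import numpy as np

      def description_radius(X):
          # Max distance to the centroid: a crude stand-in for the SVDD radius.
          return np.linalg.norm(X - X.mean(axis=0), axis=1).max()

      def radius_rfe(X_normal, n_keep):
          feats = list(range(X_normal.shape[1]))
          while len(feats) > n_keep:
              base = description_radius(X_normal[:, feats])
              # Drop the feature whose removal perturbs the radius the least.
              deltas = [abs(base - description_radius(
                            X_normal[:, [f for f in feats if f != g]]))
                        for g in feats]
              feats.pop(int(np.argmin(deltas)))
          return feats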

  • Article (No Access)

    An Improved TCN Considering Data Augmentation in Enabling Load Classification

    In modern power systems, advances in data collection technologies enable intensive, high-dimensional load data to be gathered. Using load classification to reveal the patterns and behaviors hidden in these load datasets is therefore of great significance for improving the service quality and user experience of the power system. However, issues such as missing data and class imbalance are frequently reported in present load datasets and deteriorate the performance of classification algorithms. Moreover, special characteristics of load data, such as its time-series nature, periodicity, and fluctuation, cause traditional classification algorithms to underperform. This paper therefore presents a data augmentation based enhanced temporal convolutional network (TCN) algorithm for load classification. In the data augmentation phase, an LRTC-TSVD algorithm is first presented to complete missing data, and a WGAN-based class balancing approach is then presented to solve the class imbalance issue. In the enhanced TCN phase, an improved TCN (ITCN) incorporating weight normalization (WeightNorm), the exponential linear unit (ELU) activation function, residual connections, and bidirectional feature fusion is presented to carry out accurate load data classification. Combining the data augmentation and enhanced TCN phases yields the final ITCN algorithm. The performance of ITCN is evaluated on benchmark load datasets; the experimental results show that the presented data augmentation improves dataset quality and that the classification algorithm achieves satisfactory accuracy.
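
    One ITCN-style residual block can be sketched in PyTorch. Channel sizes, kernel size, and the omission of the bidirectional fusion and any dropout are assumptions; the sketch only shows the combination of weight normalization, ELU, a dilated causal convolution, and a residual connection named above.

      import torch.nn as nn
      import torch.nn.functional as F
      from torch.nn.utils import weight_norm

      class ITCNBlock(nn.Module):
          def __init__(self, channels, kernel_size=3, dilation=1):
              super().__init__()
              self.pad = (kernel_size - 1) * dilation        # causal left-pad
              self.conv = weight_norm(nn.Conv1d(channels, channels, kernel_size,
                                                dilation=dilation))
              self.act = nn.ELU()

          def forward(self, x):                              # x: (batch, C, T)
              out = F.pad(x, (self.pad, 0))                  # pad the past only
              out = self.act(self.conv(out))
              return out + x                                 # residual connection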

  • Article (No Access)

    AN EMPIRICAL EVALUATION OF REPETITIVE UNDERSAMPLING TECHNIQUES

    Class imbalance is a fundamental problem in data mining and knowledge discovery that is encountered in a wide array of application domains. Random undersampling has been widely used to alleviate the harmful effects of imbalance; however, this technique often leads to a substantial amount of information loss. Repetitive undersampling techniques, which generate an ensemble of models, each trained on a different undersampled subset of the training data, have been proposed to alleviate this difficulty. This work reviews three repetitive undersampling methods currently used to handle imbalance and presents a detailed and comprehensive empirical study using four different learners, four performance metrics, and 15 datasets from various application domains. To our knowledge, this work is the most thorough study of repetitive undersampling techniques.
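
    A minimal sketch of the repetitive-undersampling idea: each ensemble member trains on a different balanced undersample of the majority class, so the ensemble collectively sees most majority instances while no single model absorbs the full skew. The base learner and ensemble size are illustrative.

      import numpy as np
      from imblearn.under_sampling import RandomUnderSampler
      from sklearn.linear_model import LogisticRegression

      def repetitive_undersampling_ensemble(X, y, n_models=10):
          models = []
          for i in range(n_models):
              Xs, ys = RandomUnderSampler(random_state=i).fit_resample(X, y)
              models.append(LogisticRegression(max_iter=1000).fit(Xs, ys))
          return models

      def minority_score(models, X):
          # Average the per-model minority-class probabilities.
          return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)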

  • Article (No Access)

    Aggregating Data Sampling with Feature Subset Selection to Address Skewed Software Defect Data

    Defect prediction is an important process activity frequently used for improving the quality and reliability of software products. Defect prediction results provide a list of fault-prone modules, which helps project managers better allocate valuable project resources. In the software quality modeling process, high dimensionality and class imbalance are two potential problems that may exist in data repositories. In this study, we investigate three data preprocessing approaches, in which feature selection is combined with data sampling, to overcome these problems in the context of software quality estimation. These three approaches are: Approach 1 — sampling performed prior to feature selection, but retaining the unsampled data instances; Approach 2 — sampling performed prior to feature selection, retaining the sampled data instances; and Approach 3 — sampling performed after feature selection. A comparative investigation is presented for evaluating the three approaches. In the experiments, we employed three sampling methods (random undersampling, random oversampling, and synthetic minority oversampling), each combined with a filter-based feature subset selection technique called correlation-based feature selection. We built the defect prediction models using five common classification algorithms. The case study was based on software metrics and defect data collected from multiple releases of a real-world software system. The results demonstrated that the type of sampling method used in data preprocessing significantly affected the performance of the combination approaches. It was found that when the random undersampling technique was used, Approach 1 performed better than the other two approaches. However, when the feature selection technique was used in conjunction with an oversampling method (random oversampling or synthetic minority oversampling), we strongly recommend Approach 3.
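
    The three orderings can be sketched as small pipelines. SelectKBest with the ANOVA F-statistic stands in for correlation-based feature selection, which scikit-learn does not provide, and random undersampling stands in for the three sampling methods.

      from imblearn.under_sampling import RandomUnderSampler
      from sklearn.feature_selection import SelectKBest, f_classif

      def approach_1(X, y, k=10):
          Xs, ys = RandomUnderSampler(random_state=0).fit_resample(X, y)
          sel = SelectKBest(f_classif, k=k).fit(Xs, ys)   # rank on sampled data
          return sel.transform(X), y                      # keep unsampled instances

      def approach_2(X, y, k=10):
          Xs, ys = RandomUnderSampler(random_state=0).fit_resample(X, y)
          sel = SelectKBest(f_classif, k=k).fit(Xs, ys)
          return sel.transform(Xs), ys                    # keep sampled instances

      def approach_3(X, y, k=10):
          sel = SelectKBest(f_classif, k=k).fit(X, y)     # select features first
          return RandomUnderSampler(random_state=0).fit_resample(sel.transform(X), y)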

  • Article (No Access)

    Exploiting Correlation Subspace to Predict Heterogeneous Cross-Project Defects

    Cross-project defect prediction trains a prediction model using historical data from source projects and applies the model to target projects. Most previous efforts assumed that the cross-project data share the same metric set, i.e., that both the metrics used and the size of the metric set are identical. However, this assumption may not hold in practical scenarios. In addition, software defect datasets suffer from the class-imbalance problem, which increases the difficulty for the learner to predict defects. In this paper, we advance canonical correlation analysis by deriving a joint feature space for associating cross-project data. We also propose a novel support vector machine algorithm that incorporates the correlation transfer information into classifier design for cross-project prediction. Moreover, we take different misclassification costs into consideration to incline the classifier toward labeling a module as defective, alleviating the impact of imbalanced data. The experimental results show that our method is more effective than state-of-the-art methods.
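
    A hedged approximation of the mechanics only: scikit-learn's CCA projects two differently sized metric sets into a shared space, and a class-weighted SVM stands in for the paper's cost-incorporating SVM. Pairing rows by truncation is a strong simplification made purely for illustration; the paper derives the joint space differently.

      from sklearn.cross_decomposition import CCA
      from sklearn.svm import SVC

      def cross_project_predict(X_src, y_src, X_tgt, n_components=5):
          # CCA needs paired views, so pair rows by truncating to the shorter
          # project -- a simplification for illustration only.
          n = min(len(X_src), len(X_tgt))
          cca = CCA(n_components=n_components).fit(X_src[:n], X_tgt[:n])
          Z_src, Z_tgt = cca.transform(X_src, X_tgt)
          # Heavier penalty for missing defective modules (label 1 assumed).
          clf = SVC(class_weight={0: 1, 1: 5}).fit(Z_src, y_src)
          return clf.predict(Z_tgt)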

  • Article (No Access)

    Learning from Highly Imbalanced Big Data with Label Noise

    This study explores the effects of class label noise on detecting fraud within three highly imbalanced healthcare fraud data sets containing millions of claims and minority class sizes as small as 0.1%. For each data set, 29 noise distributions are simulated by varying the level of class noise and the distribution of noise between the fraudulent and non-fraudulent classes. Four popular machine learning algorithms are evaluated on each noise distribution using six rounds of five-fold cross-validation. Performance is measured using the area under the precision-recall curve (AUPRC), true positive rate (TPR), and true negative rate (TNR) in order to understand the effect of the noise level, noise distribution, and their interactions. AUPRC results show that negative class noise, i.e. fraudulent samples incorrectly labeled as non-fraudulent, is the most detrimental to model performance. TPR and TNR results show that there are significant trade-offs in class-wise performance as noise transitions between the positive and the negative class. Finally, results reveal how overfitting negatively impacts the classification performance of some learners, and how simple regularization can be used to combat this overfitting and improve classification performance across all noise distributions.
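
    The noise-injection protocol can be sketched directly: flip a chosen fraction of labels, split between the two classes in a chosen ratio. The exact noise levels and distributions used in the study may differ.

      import numpy as np

      def inject_label_noise(y, noise_level=0.1, frac_pos_to_neg=0.5, seed=0):
          # Flip noise_level of all labels; frac_pos_to_neg of the flips turn
          # positives (fraud) into negatives, the rest go the other way.
          rng = np.random.default_rng(seed)
          y = y.copy()
          n_flip = int(noise_level * len(y))
          n_pos = min(int(frac_pos_to_neg * n_flip), int((y == 1).sum()))
          pos_idx = rng.choice(np.where(y == 1)[0], size=n_pos, replace=False)
          neg_idx = rng.choice(np.where(y == 0)[0], size=n_flip - n_pos,
                               replace=False)
          y[pos_idx] = 0
          y[neg_idx] = 1
          return y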

  • Article (No Access)

    THE USE OF UNDER- AND OVERSAMPLING WITHIN ENSEMBLE FEATURE SELECTION AND CLASSIFICATION FOR SOFTWARE QUALITY PREDICTION

    Software quality prediction models are useful tools for creating high quality software products. The general process is that practitioners use software metrics and defect data along with various data mining techniques to build classification models for identifying potentially faulty program modules, thereby enabling effective project resource allocation. The predictive accuracy of these classification models is often affected by the quality of input data. Two main problems which can affect the quality of input data are high dimensionality (too many independent attributes in a dataset) and class imbalance (many more members of one class than the other class in a binary classification problem). To resolve both of these problems, we present an iterative feature selection approach which repeatedly applies data sampling (to overcome class imbalance) followed by feature selection (to overcome high dimensionality), and finally combines the ranked feature lists from the separate iterations of sampling. After feature selection, models are built either using a plain learner or by using a boosting algorithm which incorporates sampling. In order to assess the impact of various balancing, filter, and learning techniques in the feature selection and model-building process on software quality prediction, we employ two sampling techniques, random undersampling (RUS) and synthetic minority oversampling technique (SMOTE), and two ensemble boosting approaches, RUSBoost and SMOTEBoost (in which RUS and SMOTE, respectively, are integrated into a boosting technique), as well as six feature ranking techniques. We apply the proposed techniques to several groups of datasets from two real-world software systems and use two learners to build classification models. The experimental results demonstrate that RUS results in better prediction than SMOTE, and also that boosting is more effective in improving classification performance than not using boosting. In addition, some feature ranking techniques, like chi-squared and information gain, exhibit better and more stable classification behavior than other rankers.
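
    The iterative scheme can be sketched as follows: each iteration undersamples the data, ranks all features, and the per-iteration ranks are averaged into one aggregate list. Mutual information stands in for the six rankers studied, and the iteration count is illustrative.

      import numpy as np
      from imblearn.under_sampling import RandomUnderSampler
      from sklearn.feature_selection import mutual_info_classif

      def aggregate_ranking(X, y, n_iterations=10):
          ranks = []
          for i in range(n_iterations):
              Xs, ys = RandomUnderSampler(random_state=i).fit_resample(X, y)
              scores = mutual_info_classif(Xs, ys, random_state=i)
              ranks.append(scores.argsort().argsort())   # rank position per feature
          return np.mean(ranks, axis=0).argsort()[::-1]  # best features first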

  • Article (No Access)

    Assessments of Feature Selection Techniques with Respect to Data Sampling for Highly Imbalanced Software Measurement Data

    In the process of software defect prediction, a classification model is first built using software metrics and fault data gathered from a past software development project; that model is then applied to data from a similar project or a new release of the same project to predict new program modules as either fault-prone (fp) or not-fault-prone (nfp). The benefit of such a model is to facilitate the optimal use of limited financial and human resources for software testing and inspection. The predictive power of a classification model constructed from a given data set is affected by many factors. In this paper, we are particularly interested in two problems that often arise in software measurement data: high dimensionality and unequal example set sizes for the two types of modules (e.g., many more nfp modules than fp modules in a data set). These directly extend learning time and degrade the predictive performance of classification models. We consider using data sampling followed by feature selection (FS) to deal with these problems. Six data sampling strategies (three sampling techniques, each paired with two post-sampling proportion ratios) and six commonly used feature ranking approaches are employed in this study. We evaluate the FS techniques by means of: (1) a general method, i.e., assessing the classification performance after the training data is modified, and (2) studying the stability of an FS method, specifically with the goal of understanding the effect of data sampling techniques on FS stability when using the sampled data. The experiments were performed on nine data sets from a real-world software project. The results demonstrate that the FS techniques that most enhance the models' classification performance do not also show the best stability, and vice versa. In addition, classification performance is affected more by the sampling techniques themselves than by the post-sampling proportions, whereas the opposite holds for stability.
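
    One common way to quantify FS stability, possibly differing from the paper's metric, is the average pairwise Jaccard similarity between the feature subsets selected from different sampled versions of the data:

      from itertools import combinations

      def stability(feature_subsets):
          # feature_subsets: list of sets of selected feature indices, one per
          # sampled version of the training data.
          pairs = list(combinations(feature_subsets, 2))
          return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

      print(stability([{0, 1, 2}, {0, 1, 3}, {0, 2, 3}]))   # -> 0.5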

  • Article (No Access)

    An Empirical Investigation of Combining Filter-Based Feature Subset Selection and Data Sampling for Software Defect Prediction

    The main goal of software quality engineering is to produce a high-quality software product through the use of various techniques and processes. Classification models are effective tools for software quality prediction, helping practitioners to detect potentially problematic modules and eventually improve the software product. However, two potential problems, high dimensionality and class imbalance, may affect the classifiers' performance. In this study, we propose a data pre-processing approach, in which feature selection is combined with data sampling, to overcome these problems. We investigate two filter-based feature subset selection techniques, i.e., correlation-based and consistency-based subset evaluation methods, and three data sampling methods, i.e., random undersampling, random oversampling, and synthetic minority oversampling. We are interested in exploring the effect of the various feature selection techniques, sampling methods, and their interactions on the performance of classification models. The empirical studies were carried out on 13 datasets from two real-world software systems. The results demonstrate that the correlation-based subset evaluation technique outperformed the consistency-based method when they were used along with a random sampling method and when the training data had a high degree of class imbalance; however, when synthetic minority oversampling was employed or when the training dataset was less imbalanced, the consistency-based technique had better performance than the correlation-based approach.

  • Article (No Access)

    Analysis of Hybridized Techniques with Class Imbalance Learning for Predicting Software Maintainability

    Software maintainability is a vital concern for organizations that develop and maintain large software products. Models that assess the maintainability of software systems at the initial development stages play a significant role. In Software Maintainability Prediction (SMP), a prevalent issue that must be addressed is the imbalanced data problem, which arises when the software classes that require high maintenance effort are fewer in number than the classes that require low maintenance effort. In this paper, we deal with the imbalanced data problem through data resampling, since with imbalanced data even efficient machine learning algorithms are unable to predict the data points of both classes competently. We also examine the effectiveness of hybridized (HYB) techniques, which aid in finding an optimal solution to a problem by judging the goodness of multiple solutions. According to the results of the study, the adaptive synthetic minority oversampling technique (ADASYN) and the safe-level synthetic minority oversampling technique (SafeSMOTE) are the best techniques for imbalanced data. Among the investigated HYB techniques, Fuzzy LogitBoost (GFS-LB) and Particle Swarm Optimization with Linear Discriminant Analysis (PSOLDA) emerged as the best techniques for predicting maintainability.

  • Article (Free Access)

    Using Area Under the Precision Recall Curve to Assess the Effect of Random Undersampling in the Classification of Imbalanced Medicare Big Data

    In this paper, we investigate the impact of Random Undersampling (RUS) on a supervised Machine Learning task involving highly imbalanced Big Data. We present the results of experiments in Medicare Fraud detection. To the best of our knowledge, these experiments are conducted with the largest insurance claims datasets ever used for Medicare Fraud detection. We obtain two datasets from two Big Data repositories provided by the United States government’s Centers for Medicare and Medicaid Services. The larger of the two datasets contains nearly 174 million instances, with a minority to majority class ratio of approximately 0.0039. Our contribution is to show that RUS has a detrimental effect on a Medicare Fraud detection task when performed on large-scale, imbalanced data. The effect of RUS is apparent in the Area Under the Precision Recall Curve (AUPRC) scores recorded from experimental outcomes. We use four popular, open-source classifiers in our experiments to confirm the negative impact of RUS on their AUPRC scores.
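
    The evaluation pattern is easy to sketch: compute AUPRC (via average precision) for a classifier trained on the full data and on a RUS-balanced version, which is how a detrimental effect of undersampling would surface. The classifier and the 1:1 post-sampling ratio are assumptions, not the paper's exact setup.

      from imblearn.under_sampling import RandomUnderSampler
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import average_precision_score

      def auprc_with_and_without_rus(X_tr, y_tr, X_te, y_te):
          full = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
          Xs, ys = RandomUnderSampler(sampling_strategy=1.0,
                                      random_state=0).fit_resample(X_tr, y_tr)
          rus = RandomForestClassifier(random_state=0).fit(Xs, ys)
          return (average_precision_score(y_te, full.predict_proba(X_te)[:, 1]),
                  average_precision_score(y_te, rus.predict_proba(X_te)[:, 1]))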

  • Article (No Access)

    Improving Credit Card Fraud Detection with Data Reduction Approaches

    Detecting fraudulent activities in credit card transactions can be challenging due to issues like high dimensionality and class imbalance that are often present in the datasets. To address these challenges, data reduction techniques such as data sampling and feature selection have become essential. In this study, we compare four approaches for data reduction: using data sampling alone, employing feature selection alone, applying data sampling followed by feature selection, and using feature selection followed by data sampling. Additionally, we include results using all features. We build classification models using five Decision Tree-based classifiers and Logistic Regression, and evaluate their performance using two performance metrics: the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area under the Precision–Recall Curve (AUPRC). In this work, we adopt ensemble supervised feature selection (SFS) techniques and Random Undersampling (RUS) for data reduction. The experimental results demonstrate that all four data reduction techniques have the potential to improve the performance of classifiers. These results are valuable since the classifiers available are dependent upon application domains, computing environments, and licensing agreements. However, these techniques can be applied independently of all these dependencies. We recommend utilizing the ensemble SFS followed by RUS (SFS–RUS) approach as the preferred data reduction method due to its ability to run feature selection and data sampling in parallel. Additionally, we find that XGBoost and CatBoost outperform other classifiers.

  • Article (Open Access)

    SEMI-SUPERVISED SPARSE REPRESENTATION CLASSIFICATION FOR SLEEP EEG RECOGNITION WITH IMBALANCED SAMPLE SETS

    Sleep staging with supervised learning requires a large amount of labeled data, which is time-consuming and expensive to collect. Semi-supervised learning is widely used to improve classification performance by combining a small amount of labeled data with a large amount of unlabeled data, but the accuracy of the pseudo-labels may influence the performance of the classifier. Building on semi-supervised sparse representation classification, this study proposes an improved sparse concentration index to estimate the confidence of pseudo-labeled data for sleep EEG recognition, considering both interclass differences and intraclass concentration. In view of the class imbalance in sleep EEG data, the synthetic minority oversampling technique was also improved to remove mixed samples at the boundary between the minority and majority classes. The results showed that the proposed method achieved better classification performance, with classification accuracy after class balancing clearly higher than before. The findings of this study will be beneficial for applications in sleep monitoring devices and the study of sleep-related diseases.
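
    For context, a sketch of the standard sparse concentration index (SCI) that the paper improves upon; values near 1 mean the sparse coefficients concentrate in a single class, i.e., a confident pseudo-label. The paper's improved index additionally accounts for intraclass concentration and is not reproduced here.

      import numpy as np

      def sci(coeffs, class_of_atom, n_classes):
          # coeffs: sparse representation coefficients over a labeled dictionary;
          # class_of_atom: class index of each dictionary atom.
          total = np.abs(coeffs).sum()
          per_class = np.array([np.abs(coeffs[class_of_atom == c]).sum()
                                for c in range(n_classes)])
          return (n_classes * per_class.max() / total - 1) / (n_classes - 1)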

  • Article (No Access)

    Analyzing the Role of Class Rebalancing Techniques in Software Defect Prediction

    Predicting software defects is an important task during the software testing phase, especially for allocating appropriate resources and prioritizing testing tasks. Typically, classification algorithms are used to accomplish this task using previously collected datasets. However, these datasets suffer from an imbalanced label distribution in which clean modules outnumber defective modules. Traditional classification algorithms cannot handle this characteristic of defect datasets because they assume the datasets are balanced; if the problem is not addressed, the classification algorithm will produce predictions biased towards the majority label. In the literature, several techniques have been designed to address this problem, most of which focus on data re-balancing. Recently, ensemble class imbalance techniques have emerged as an alternative to data re-balancing approaches. For software defect prediction, however, there are no studies examining the performance of ensemble class imbalance learning against data re-balancing approaches. This paper investigates the efficiency of ensemble class imbalance learning for software defect prediction. We conducted a comprehensive experiment involving 12 datasets, six classifiers, nine class imbalance techniques, and 10 evaluation metrics. The experiments showed that ensemble approaches, particularly the UnderBagging technique, outperform traditional data re-balancing approaches, especially when dealing with datasets that have high defect ratios.

  • Article (No Access)

    Evaluating the Impact of Data Quality on Sampling

    Learning from imbalanced training data can be a difficult endeavour, and the task is made even more challenging if the data is of low quality or the size of the training dataset is small. Data sampling is a commonly used method for improving learner performance when data is imbalanced. However, little effort has been put forth to investigate the performance of data sampling techniques when data is both noisy and imbalanced. In this work, we present a comprehensive empirical investigation of the impact of changes in four training dataset characteristics — dataset size, class distribution, noise level and noise distribution — on data sampling techniques. We present the performance of four common data sampling techniques using 11 learning algorithms. The results, which are based on an extensive suite of experiments for which over 15 million models were trained and evaluated, show that: (1) even for relatively clean datasets, class imbalance can still hurt learner performance, (2) data sampling, however, may not improve performance for relatively clean but imbalanced datasets, (3) data sampling can be very effective at dealing with the combined problems of noise and imbalance, (4) both the level and distribution of class noise among the classes are important, as either factor alone does not cause a significant impact, (5) when sampling does improve the learners (i.e. for noisy and imbalanced datasets), RUS and SMOTE are the most effective at improving the AUC, while SMOTE performed well relative to the F-measure, (6) there are significant differences in the empirical results depending on the performance measure used, and hence it is important to consider multiple metrics in this type of analysis, and (7) data sampling rarely hurt the AUC, but only significantly improved performance when data was at least moderately skewed or noisy, while for the F-measure, data sampling often resulted in significantly worse performance when applied to slightly skewed or noisy datasets, but did improve performance when data was either severely noisy or skewed, or contained moderate levels of both noise and imbalance.