World Scientific

THE USE OF UNDER- AND OVERSAMPLING WITHIN ENSEMBLE FEATURE SELECTION AND CLASSIFICATION FOR SOFTWARE QUALITY PREDICTION

DOI: https://doi.org/10.1142/S0218539314500041
Cited by: 9 (Source: Crossref)

Software quality prediction models are useful tools for creating high-quality software products. The general process is that practitioners use software metrics and defect data along with various data mining techniques to build classification models for identifying potentially faulty program modules, thereby enabling effective project resource allocation. The predictive accuracy of these classification models is often affected by the quality of input data. Two main problems which can affect the quality of input data are high dimensionality (too many independent attributes in a dataset) and class imbalance (many more members of one class than the other class in a binary classification problem).

To resolve both of these problems, we present an iterative feature selection approach which repeatedly applies data sampling (to overcome class imbalance) followed by feature selection (to overcome high dimensionality), and finally combines the ranked feature lists from the separate iterations of sampling. After feature selection, models are built either using a plain learner or by using a boosting algorithm which incorporates sampling. In order to assess the impact of various balancing, filter, and learning techniques in the feature selection and model-building process on software quality prediction, we employ two sampling techniques, random undersampling (RUS) and synthetic minority oversampling technique (SMOTE), and two ensemble boosting approaches, RUSBoost and SMOTEBoost (in which RUS and SMOTE, respectively, are integrated into a boosting technique), as well as six feature ranking techniques.

We apply the proposed techniques to several groups of datasets from two real-world software systems and use two learners to build classification models. The experimental results demonstrate that RUS results in better prediction than SMOTE, and that boosting improves classification performance over not using boosting. In addition, some feature ranking techniques, like chi-squared and information gain, exhibit better and more stable classification behavior than other rankers.
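The iterative feature selection loop described in the abstract (repeatedly sample, rank features on the sampled data, then aggregate the per-iteration rankings) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the feature scorer here is a simple class-mean-difference filter standing in for rankers such as chi-squared or information gain, the rank aggregation uses mean rank position, and the dataset layout (list of feature rows plus 0/1 labels, with 1 as the minority "faulty" class) is an assumption.

```python
import random
from collections import defaultdict

def random_undersample(X, y, seed):
    """RUS: randomly drop majority-class (label 0) instances until the
    two classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]  # minority (faulty)
    neg = [i for i, label in enumerate(y) if label == 0]  # majority (fault-free)
    if len(neg) > len(pos):
        neg = rng.sample(neg, len(pos))
    keep = pos + neg
    return [X[i] for i in keep], [y[i] for i in keep]

def rank_features(X, y):
    """Toy filter ranker: score each feature by the absolute difference
    between its class means (a stand-in for chi-squared or information
    gain); return feature indices ordered best-first."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos_vals = [row[j] for row, label in zip(X, y) if label == 1]
        neg_vals = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos_vals) / len(pos_vals)
                          - sum(neg_vals) / len(neg_vals)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def iterative_feature_selection(X, y, n_iterations=10, n_selected=3):
    """Repeat (sample -> rank) n_iterations times, aggregate the rankings
    by mean rank position, and keep the n_selected best features."""
    rank_sums = defaultdict(float)
    for it in range(n_iterations):
        Xs, ys = random_undersample(X, y, seed=it)
        for position, feature in enumerate(rank_features(Xs, ys)):
            rank_sums[feature] += position
    aggregated = sorted(rank_sums, key=lambda f: rank_sums[f])
    return aggregated[:n_selected]
```

After this step, the reduced dataset (only the selected feature columns) would be passed to a plain learner, or to a sampling-aware booster such as RUSBoost, to build the final defect-prediction model.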