World Scientific

THE USE OF UNDER- AND OVERSAMPLING WITHIN ENSEMBLE FEATURE SELECTION AND CLASSIFICATION FOR SOFTWARE QUALITY PREDICTION

DOI: https://doi.org/10.1142/S0218539314500041
Cited by: 9 (Source: Crossref)

Software quality prediction models are useful tools for creating high-quality software products. The general process is that practitioners use software metrics and defect data along with various data mining techniques to build classification models for identifying potentially faulty program modules, thereby enabling effective project resource allocation. The predictive accuracy of these classification models is often affected by the quality of input data. Two main problems which can affect the quality of input data are high dimensionality (too many independent attributes in a dataset) and class imbalance (many more members of one class than the other class in a binary classification problem).

To resolve both of these problems, we present an iterative feature selection approach which repeatedly applies data sampling (to overcome class imbalance) followed by feature selection (to overcome high dimensionality), and finally combines the ranked feature lists from the separate iterations of sampling. After feature selection, models are built either using a plain learner or by using a boosting algorithm which incorporates sampling. In order to assess the impact of various balancing, filter, and learning techniques in the feature selection and model-building process on software quality prediction, we employ two sampling techniques, random undersampling (RUS) and synthetic minority oversampling technique (SMOTE), and two ensemble boosting approaches, RUSBoost and SMOTEBoost (in which RUS and SMOTE, respectively, are integrated into a boosting technique), as well as six feature ranking techniques.

We apply the proposed techniques to several groups of datasets from two real-world software systems and use two learners to build classification models. The experimental results demonstrate that RUS results in better prediction than SMOTE, and that boosting improves classification performance over not using boosting. In addition, some feature ranking techniques, like chi-squared and information gain, exhibit better and more stable classification behavior than other rankers.
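The iterative feature selection loop described in the abstract (repeatedly sample, rank features on the sampled data, then aggregate the per-iteration rankings) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the feature scorer here is a simple class-mean-difference filter standing in for rankers such as chi-squared or information gain, the rank aggregation uses mean rank position, and the dataset layout (list of feature rows plus 0/1 labels, with 1 as the minority "faulty" class) is an assumption.

```python
import random
from collections import defaultdict

def random_undersample(X, y, seed):
    """RUS: randomly drop majority-class (label 0) instances until the
    two classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]  # minority (faulty)
    neg = [i for i, label in enumerate(y) if label == 0]  # majority (fault-free)
    if len(neg) > len(pos):
        neg = rng.sample(neg, len(pos))
    keep = pos + neg
    return [X[i] for i in keep], [y[i] for i in keep]

def rank_features(X, y):
    """Toy filter ranker: score each feature by the absolute difference
    between its class means (a stand-in for chi-squared or information
    gain); return feature indices ordered best-first."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos_vals = [row[j] for row, label in zip(X, y) if label == 1]
        neg_vals = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos_vals) / len(pos_vals)
                          - sum(neg_vals) / len(neg_vals)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def iterative_feature_selection(X, y, n_iterations=10, n_selected=3):
    """Repeat (sample -> rank) n_iterations times, aggregate the rankings
    by mean rank position, and keep the n_selected best features."""
    rank_sums = defaultdict(float)
    for it in range(n_iterations):
        Xs, ys = random_undersample(X, y, seed=it)
        for position, feature in enumerate(rank_features(Xs, ys)):
            rank_sums[feature] += position
    aggregated = sorted(rank_sums, key=lambda f: rank_sums[f])
    return aggregated[:n_selected]
```

After this step, the reduced dataset (only the selected feature columns) would be passed to a plain learner, or to a sampling-aware booster such as RUSBoost, to build the final defect-prediction model.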