Featured Topic Issue: Best Papers from SEKE 2015; Guest Editor: Haiping Xu

Aggregating Data Sampling with Feature Subset Selection to Address Skewed Software Defect Data

Department of Mathematics and Computer Science, Eastern Connecticut State University, Willimantic, CT 06226, USA

Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA

E-mail Address: khoshgof@fau.edu

Amri Napolitano

Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA

E-mail Address: amrifau@gmail.com

https://doi.org/10.1142/S0218194015400318

Abstract

Defect prediction is an important process activity frequently used to improve the quality and reliability of software products. Defect prediction results provide a list of fault-prone modules, which helps project managers make better use of valuable project resources. In the software quality modeling process, high dimensionality and class imbalance are two problems that may exist in data repositories. In this study, we investigate three data preprocessing approaches, in which feature selection is combined with data sampling, to overcome these problems in the context of software quality estimation. The three approaches are: Approach 1, sampling performed prior to feature selection, but retaining the unsampled data instances; Approach 2, sampling performed prior to feature selection, retaining the sampled data instances; and Approach 3, sampling performed after feature selection. A comparative investigation is presented for evaluating the three approaches. In the experiments, we employed three sampling methods (random undersampling, random oversampling, and synthetic minority oversampling), each combined with a filter-based feature subset selection technique called correlation-based feature selection. We built the defect prediction models using five common classification algorithms. The case study was based on software metrics and defect data collected from multiple releases of a real-world software system. The results demonstrated that the type of sampling method used in data preprocessing significantly affected the performance of the combination approaches. When the random undersampling technique was used, Approach 1 performed better than the other two approaches. However, when the feature selection technique was used in conjunction with an oversampling method (random oversampling or synthetic minority oversampling), we strongly recommend Approach 3.
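The difference between the three approaches is only the order of the two preprocessing steps and which instances end up in the training set. The Python sketch below illustrates that ordering under stated assumptions; it is not the authors' implementation. All function names are hypothetical, only random undersampling is shown, and `cfs_like_select` is a simplified stand-in for correlation-based feature selection: it uses the CFS merit formula with absolute Pearson correlations and a greedy forward search, which may differ from the correlation measure and search strategy of a full CFS implementation. The reading of Approach 1 (features chosen on the sampled data, then applied to the original instances) is an interpretation of the abstract.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def random_undersample(X, y, rng):
    """Randomly drop majority-class rows until both classes have equal size."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]


def cfs_like_select(X, y):
    """Simplified stand-in for CFS (assumption, not the full technique).

    Greedily adds the feature that most improves
    merit = k * mean|r_cf| / sqrt(k + k * (k - 1) * mean|r_ff|)
    and stops when no feature improves the merit.
    """
    cols = list(zip(*X))  # column-wise view of the data
    r_cf = [abs(pearson(c, y)) for c in cols]
    selected, best_merit = [], 0.0
    while True:
        best_j = None
        for j in range(len(cols)):
            if j in selected:
                continue
            cand = selected + [j]
            k = len(cand)
            mean_cf = sum(r_cf[i] for i in cand) / k
            pairs = [(a, b) for ai, a in enumerate(cand) for b in cand[ai + 1:]]
            mean_ff = (
                sum(abs(pearson(cols[a], cols[b])) for a, b in pairs) / len(pairs)
                if pairs else 0.0
            )
            merit = k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)
            if merit > best_merit + 1e-12:
                best_merit, best_j = merit, j
        if best_j is None:
            return sorted(selected)
        selected.append(best_j)


def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]


def approach_1(X, y, rng):
    """Sample, select features on the sample, train on the UNSAMPLED data."""
    Xs, ys = random_undersample(X, y, rng)
    cols = cfs_like_select(Xs, ys)
    return project(X, cols), y, cols


def approach_2(X, y, rng):
    """Sample, select features on the sample, train on the SAMPLED data."""
    Xs, ys = random_undersample(X, y, rng)
    cols = cfs_like_select(Xs, ys)
    return project(Xs, cols), ys, cols


def approach_3(X, y, rng):
    """Select features on the original data, then sample the reduced data."""
    cols = cfs_like_select(X, y)
    Xs, ys = random_undersample(project(X, cols), y, rng)
    return Xs, ys, cols
```

Each function returns the training features, the training labels, and the indices of the selected features, so the output can be passed to any classifier. Swapping `random_undersample` for an oversampling routine would give the oversampling variants discussed in the abstract.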
