Artificial Intelligence

An Investigation of Imbalanced Ensemble Learning Methods for Cross-Project Defect Prediction

Shaojian Qiu

http://orcid.org/0000-0001-7152-108X

School of Computer Science and Engineering, South China University of Technology, Guangzhou 510000, P. R. China

Lu Lu

School of Computer Science and Engineering, South China University of Technology, Guangzhou 510000, P. R. China

E-mail Address: lul@scut.edu.cn

Corresponding author.

Siyu Jiang

School of Software Engineering, South China University of Technology, Guangzhou 510000, P. R. China

Yang Guo

School of Computer Science and Engineering, South China University of Technology, Guangzhou 510000, P. R. China

https://doi.org/10.1142/S0218001419590377

Cited by: 27 (Source: Crossref)

Abstract

Machine-learning-based software defect prediction (SDP) methods are receiving considerable attention from researchers in intelligent software engineering. Most existing SDP methods operate in a within-project setting. However, a new SDP task usually has little or no within-project training data from which to learn a usable supervised prediction model. Cross-project defect prediction (CPDP), which uses labeled data from source projects to learn a defect predictor for a target project, was therefore proposed as a practical SDP solution. In real CPDP tasks, the class imbalance problem is ubiquitous and strongly affects the performance of CPDP models. Unlike previous studies, which focus on subsampling and individual methods, this study investigates 15 imbalanced learning methods for CPDP tasks, with particular attention to the effectiveness of imbalanced ensemble learning (IEL) methods. We evaluated the 15 methods through extensive experiments on 31 open-source projects drawn from five datasets. By analyzing a total of 37,504 results, we found that, in most cases, the IEL method that combines under-sampling and bagging is more effective than the other investigated methods.
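As an illustration of the general idea of combining under-sampling with bagging in a cross-project setting, the following is a minimal sketch and not the authors' implementation. The abstract does not name the base classifier, the number of ensemble members, or the sampling ratios, so everything here is an assumption: the nearest-centroid base learner is a stand-in for any classifier, the function names are hypothetical, and the two-metric data are made up. Each ensemble member is trained on all-balanced data from a labeled source project (majority class under-sampled, minority class bootstrapped), and the members then vote on modules from a target project.

```python
import random
from collections import Counter


def centroid(rows):
    """Mean feature vector of a non-empty list of rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_nearest_centroid(X, y):
    """Toy base learner (assumption): one centroid per class."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in by_class.items()}


def predict_one(model, row):
    """Return the label whose centroid is closest to the row."""
    return min(model, key=lambda label: sq_dist(model[label], row))


def under_bagging_fit(X, y, n_estimators=11, seed=0):
    """Train an ensemble on balanced resamples of imbalanced source data."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority_label = min(counts, key=counts.get)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    models = []
    for _ in range(n_estimators):
        # Under-sample the majority class down to the minority size,
        # then bootstrap the minority class (sampling with replacement).
        maj_idx = rng.sample(majority, len(minority))
        min_idx = [rng.choice(minority) for _ in minority]
        idx = maj_idx + min_idx
        models.append(
            train_nearest_centroid([X[i] for i in idx], [y[i] for i in idx])
        )
    return models


def under_bagging_predict(models, row):
    """Majority vote over the ensemble members."""
    votes = Counter(predict_one(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Made-up source project: 20 clean modules (label 0), 4 defective (label 1).
source_X = [[1.0 + 0.1 * i, 2.0 + 0.05 * i] for i in range(20)]
source_X += [[8.0 + 0.2 * i, 9.0 + 0.1 * i] for i in range(4)]
source_y = [0] * 20 + [1] * 4

models = under_bagging_fit(source_X, source_y)

# Made-up target-project modules, unseen during training.
print(under_bagging_predict(models, [1.5, 2.2]))
print(under_bagging_predict(models, [8.3, 9.1]))
```

Because every member sees equally many clean and defective modules, no single member is dominated by the majority class, while the vote across differently sampled members recovers some of the information discarded by under-sampling.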
