Open Access

Heterogeneous Fault Prediction Using Feature Selection and Supervised Learning Algorithms

University School of Information and Communication Technology (U.S.I.C.T), Guru Gobind Singh Indraprastha University, New Delhi, India

E-mail Address: arora.rashmi@gmail.com

Corresponding author.


and

Arvinder Kaur

University School of Information and Communication Technology (U.S.I.C.T), Guru Gobind Singh Indraprastha University, New Delhi, India


https://doi.org/10.1142/S2196888822500142

Abstract

Software Fault Prediction (SFP) is one of the most prominent research areas in software engineering. Fault prediction carried out within a single software project is known as within-project fault prediction. However, local data repositories often do not hold enough data to build a within-project prediction model. Cross-project fault prediction (CPFP), proposed in recent years, builds a prediction model on one project and uses it to predict faults in another. However, CPFP requires the training and testing datasets to use the same set of metrics, so traditional CPFP approaches are difficult to apply across projects with different metric sets. Heterogeneous Fault Prediction (HFP) is a special case of CPFP that allows faults to be predicted across projects with different metrics. The proposed framework builds an HFP model by applying feature selection to both the source and target datasets and then training supervised machine learning techniques on the result. The approach is applied to two open-source projects, Linux and MySQL, and prediction performance is evaluated using the Area Under Curve (AUC) measure. The key results are as follows. First, the approach gives significantly better prediction performance for heterogeneous projects than for cross projects. Second, feature selection combined with feature mapping has a significant effect on HFP models. Non-parametric statistical analyses, namely the Friedman test and the Nemenyi post-hoc test, show that Logistic Regression performed significantly better than the other supervised learning algorithms in HFP models.
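The pipeline outlined in the abstract (feature selection on both datasets, feature mapping between differing metric sets, a supervised learner, AUC evaluation) can be illustrated with a minimal sketch. The abstract does not name the selection criterion or the mapping rule, so everything specific here is an assumption made for illustration: features are ranked by absolute correlation with the fault label, metrics are paired greedily by the two-sample Kolmogorov-Smirnov statistic after z-scoring, the learner is a plain gradient-descent logistic regression, and the data is synthetic rather than the Linux and MySQL datasets. All function names are hypothetical.

```python
import bisect
import math
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation between one metric column and the labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0

def select_top_k(columns, labels, k):
    """Assumed selection rule: keep the k columns most correlated with the label."""
    return sorted(columns, key=lambda c: abs_corr(c, labels), reverse=True)[:k]

def zscore(col):
    """Standardize a column so metrics on different scales become comparable."""
    m = sum(col) / len(col)
    s = math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
    return [(v - m) / s for v in col]

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(abs(bisect.bisect_right(sa, x) / len(sa)
                   - bisect.bisect_right(sb, x) / len(sb)) for x in sa + sb)

def match_features(src_cols, tgt_cols):
    """Assumed mapping rule: greedy one-to-one pairing by smallest KS statistic."""
    pairs = sorted((ks_statistic(s, t), i, j)
                   for i, s in enumerate(src_cols) for j, t in enumerate(tgt_cols))
    used_s, used_t, matched = set(), set(), []
    for _, i, j in pairs:
        if i not in used_s and j not in used_t:
            matched.append((i, j))
            used_s.add(i)
            used_t.add(j)
    return matched

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def train_logreg(rows, labels, epochs=300, lr=0.1):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    w, b, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            gb += err
            for k, xk in enumerate(x):
                gw[k] += err * xk
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def auc(scores, labels):
    """AUC as the probability that a faulty module outscores a clean one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def make_project(n, specs, seed):
    """Synthetic project: each (scale, signal) pair defines one metric column."""
    rng = random.Random(seed)
    cols, labels = [[] for _ in specs], []
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0
        labels.append(y)
        for col, (scale, signal) in zip(cols, specs):
            col.append(scale * (signal * y + rng.gauss(0, 1)))
    return cols, labels

def run_demo():
    # Source has 4 metrics and target has 3, on unrelated scales (heterogeneous).
    src_cols, src_y = make_project(200, [(1, 2.0), (10, 1.5), (5, 0.0), (2, 0.0)], seed=1)
    tgt_cols, tgt_y = make_project(200, [(100, 1.8), (3, 0.0), (0.5, 1.6)], seed=2)
    src_sel = [zscore(c) for c in select_top_k(src_cols, src_y, 2)]
    tgt_sel = [zscore(c) for c in select_top_k(tgt_cols, tgt_y, 2)]
    matched = match_features(src_sel, tgt_sel)
    train_rows = [[src_sel[i][r] for i, _ in matched] for r in range(len(src_y))]
    test_rows = [[tgt_sel[j][r] for _, j in matched] for r in range(len(tgt_y))]
    w, b = train_logreg(train_rows, src_y)
    scores = [sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) for x in test_rows]
    return auc(scores, tgt_y), matched

target_auc, matched_pairs = run_demo()
print(f"matched (source, target) feature pairs: {matched_pairs}")
print(f"AUC on target project: {target_auc:.3f}")
```

Selecting features on the target with its labels is only possible in an experimental setting where target labels are known. In a deployment scenario a label-free criterion would have to replace it on the target side.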



Vol. 09, No. 03


History

Received 1 July 2021

Revised 9 December 2021

Accepted 9 December 2021

Published: 24 January 2022

Information

This is an Open Access article published by World Scientific Publishing Company. It is distributed under the terms of the Creative Commons Attribution 4.0 (CC BY) License which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
