Comparing Feature Selection Techniques for Software Quality Estimation Using Data-Sampling-Based Boosting Algorithms

https://doi.org/10.1142/S0218539315500138 · Cited by: 3 (Source: Crossref)

Software defect prediction is a classification technique that uses software metrics and fault data collected during the software development process to identify fault-prone modules before the testing phase. It aims to optimize project resource allocation and ultimately improve the quality of software products. However, two factors, high dimensionality and class imbalance, may result in low-quality training data and subsequently degrade classification models. Feature (software metric) selection and data sampling are frequently used to overcome these problems. Feature selection (FS) is the process of choosing a subset of relevant features so that the quality of prediction models can be maintained or improved. Data sampling alters the dataset to change its balance level, thereby alleviating the tendency of traditional classification models to be biased toward the overrepresented (majority) class. A recent study shows that another method, boosting (building multiple models, with each model tuned to work better on instances misclassified by previous models), is also effective for addressing the class imbalance problem. In this paper, we present a technique that applies FS followed by a boosting algorithm in the context of software quality estimation. We investigate four FS approaches (individual FS, repetitive sampled FS, sampled ensemble FS, and repetitive sampled ensemble FS) and study their impact on the quality of the prediction models. Ten base feature ranking techniques are examined in the case study. We also employ the boosting algorithm to construct classification models without FS and use the results as the baseline for comparison. The empirical results demonstrate that (1) FS is important and necessary prior to the learning process; (2) the repetitive sampled FS method generally performs similarly to the individual FS technique; and (3) the ensemble filters (the sampled ensemble filter and the repetitive sampled ensemble filter) perform better than or similarly to the average of their corresponding individual base rankers.
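To make the pipeline concrete, the sketch below is a minimal illustration, not the authors' exact procedure: it combines repetitive sampled feature selection with a boosting learner using scikit-learn. The chi-squared ranker, the subset size k, the number of sampling repetitions, and the synthetic imbalanced dataset are all illustrative assumptions, and AdaBoost stands in for the data-sampling-based boosting algorithm studied in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced toy data standing in for software metrics and fault labels
# (class 1 = fault-prone minority); this is an assumption for illustration.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X = X - X.min(axis=0)  # chi2 requires non-negative feature values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

def balanced_sample(X, y, seed):
    """Random undersampling: keep every minority instance and draw an
    equal-sized sample (without replacement) from the majority class."""
    minority, majority = X[y == 1], X[y == 0]
    maj_sample = resample(majority, n_samples=len(minority),
                          replace=False, random_state=seed)
    Xs = np.vstack([minority, maj_sample])
    ys = np.concatenate([np.ones(len(minority)), np.zeros(len(minority))])
    return Xs, ys

# "Repetitive sampled FS": rank features on several balanced samples and
# accumulate the ranking scores before choosing the top-k subset.
n_repeats, k = 10, 8
scores = np.zeros(X_train.shape[1])
for seed in range(n_repeats):
    Xs, ys = balanced_sample(X_train, y_train, seed)
    scores += chi2(Xs, ys)[0]
selected = np.argsort(scores)[::-1][:k]

# Boosting on the reduced feature set (AdaBoost as a stand-in for the
# paper's data-sampling-based boosting algorithm).
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_train[:, selected], y_train)
print("test accuracy:", clf.score(X_test[:, selected], y_test))
```

In the paper's ensemble variants, several base rankers would replace the single chi-squared ranker here, with their rankings combined before the feature subset is chosen.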