
Learning from Highly Imbalanced Big Data with Label Noise

Justin M. Johnson

https://orcid.org/0000-0003-3511-0624

College of Engineering and Computer Science, Florida Atlantic University, Boca Raton, Florida 33431, United States

E-mail Address: jjohn273@fau.edu


Robert K. L. Kennedy

College of Engineering and Computer Science, Florida Atlantic University, Boca Raton, Florida 33431, United States

E-mail Address: rkennedy@fau.edu


Taghi M. Khoshgoftaar

College of Engineering and Computer Science, Florida Atlantic University, Boca Raton, Florida 33431, United States

E-mail Address: khoshgof@fau.edu


https://doi.org/10.1142/S0218213023600035

This article is part of the issue:

Special Issue on Selected Papers from the 33rd Annual IEEE International Conference on Tools with Artificial Intelligence (ICTAI-2021)
Guest Editors: George A. Tsihrintzis, Maria Virvou and Ioannis Hatzilygeroudis

Abstract

This study explores the effects of class label noise on detecting fraud within three highly imbalanced healthcare fraud data sets containing millions of claims and minority class sizes as small as 0.1%. For each data set, 29 noise distributions are simulated by varying the level of class noise and the distribution of noise between the fraudulent and non-fraudulent classes. Four popular machine learning algorithms are evaluated on each noise distribution using six rounds of five-fold cross-validation. Performance is measured using the area under the precision-recall curve (AUPRC), true positive rate (TPR), and true negative rate (TNR) in order to understand the effect of the noise level, noise distribution, and their interactions. AUPRC results show that negative class noise, i.e. fraudulent samples incorrectly labeled as non-fraudulent, is the most detrimental to model performance. TPR and TNR results show that there are significant trade-offs in class-wise performance as noise transitions between the positive and the negative class. Finally, results reveal how overfitting negatively impacts the classification performance of some learners, and how simple regularization can be used to combat this overfitting and improve classification performance across all noise distributions.
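The noise simulation and class-wise metrics described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the two parameters `noise_level` (total flipped labels as a fraction of the fraud class size) and `neg_noise_share` (share of flips that relabel fraudulent samples as non-fraudulent) are hypothetical choices made for illustration, and the toy data only mimics a 1% minority class.

```python
import numpy as np


def inject_class_noise(y, noise_level, neg_noise_share, seed=0):
    """Return a copy of binary labels y (1 = fraud) with simulated class noise.

    Both parameters are illustrative assumptions, not the paper's definitions:
    noise_level      -- total number of flips as a fraction of the fraud class size
    neg_noise_share  -- share of flips that relabel fraud (1) as non-fraud (0)
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    noisy = y.copy()
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)

    n_flips = int(round(noise_level * len(pos_idx)))
    n_pos_to_neg = min(int(round(neg_noise_share * n_flips)), len(pos_idx))
    n_neg_to_pos = min(n_flips - n_pos_to_neg, len(neg_idx))

    # Fraudulent samples mislabeled as non-fraudulent (negative class noise).
    noisy[rng.choice(pos_idx, size=n_pos_to_neg, replace=False)] = 0
    # Non-fraudulent samples mislabeled as fraudulent (positive class noise).
    noisy[rng.choice(neg_idx, size=n_neg_to_pos, replace=False)] = 1
    return noisy


def tpr_tnr(y_true, y_pred):
    """True positive rate and true negative rate for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels: 10,000 samples with a 1% fraud class.
y = np.zeros(10_000, dtype=int)
y[:100] = 1

# 40 flipped labels in total, 30 of them fraud relabeled as non-fraud.
noisy = inject_class_noise(y, noise_level=0.4, neg_noise_share=0.75)
print(int(np.sum(noisy != y)), int(noisy.sum()))
```

Sweeping `noise_level` and `neg_noise_share` over a grid would give a family of noise distributions in the spirit of the 29 the study simulates. In a setup like this, the noisy labels would be used only for training, with TPR and TNR computed against clean labels on held-out folds. For repeated cross-validation, scikit-learn's `RepeatedStratifiedKFold(n_splits=5, n_repeats=6)` is one possible choice, though the abstract does not say whether the folds were stratified.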

