World Scientific
Skip main navigation

Cookies Notification

We use cookies on this site to enhance your user experience. By continuing to browse the site, you consent to the use of our cookies. Learn More
×

System Upgrade on Tue, May 28th, 2024 at 2am (EDT)

Existing users will be able to log into the site and access content. However, E-commerce and registration of new users may not be available for up to 12 hours.
For online purchase, please visit us again. Contact us at customercare@wspc.com for any enquiries.

Learning from Highly Imbalanced Big Data with Label Noise

    https://doi.org/10.1142/S0218213023600035Cited by:1 (Source: Crossref)
    This article is part of the issue:

    This study explores the effects of class label noise on detecting fraud within three highly imbalanced healthcare fraud data sets containing millions of claims and minority class sizes as small as 0.1%. For each data set, 29 noise distributions are simulated by varying the level of class noise and the distribution of noise between the fraudulent and non-fraudulent classes. Four popular machine learning algorithms are evaluated on each noise distribution using six rounds of five-fold cross-validation. Performance is measured using the area under the precision-recall curve (AUPRC), true positive rate (TPR), and true negative rate (TNR) in order to understand the effect of the noise level, noise distribution, and their interactions. AUPRC results show that negative class noise, i.e. fraudulent samples incorrectly labeled as non-fraudulent, is the most detrimental to model performance. TPR and TNR results show that there are significant trade-offs in class-wise performance as noise transitions between the positive and the negative class. Finally, results reveal how overfitting negatively impacts the classification performance of some learners, and how simple regularization can be used to combat this overfitting and improve classification performance across all noise distributions.