Results: 1 - 14 of 14

Article (No Access)
SHAP as a Data Reduction Technique for Highly Imbalanced Big Data
International Journal on Artificial Intelligence Tools, 19 Feb 2025
Fraud detection through the classification of highly imbalanced Big Data is an exciting area of Machine Learning research. On the one hand, in certain fraud detection application domains, the use of One-Class classifiers is an overlooked opportunity. On the other hand, for researchers faced with the task of building Machine Learning models for identifying fraud, when only legitimate transaction data is available, One-Class Classifiers are indispensable. We investigate the efficacy of SHapley Additive exPlanations (SHAP) as a feature selection technique for One-Class classification tasks. In this study we utilize authentic data from the Credit Card fraud and Medicare insurance fraud application domains. Our contribution is to show that researchers can use SHAP in conjunction with One-Class Classifiers to do feature selection on highly imbalanced datasets, and then build models, with the selected features, that yield performance similar to, or better than, models built using all features. Our results in Big Medicare data fraud detection show that an over 90% data reduction through feature selection can nevertheless coincide with the best performance in terms of Area under the Precision Recall Curve.
Article (No Access)
Improved Overlap-based Undersampling for Imbalanced Dataset Classification with Application to Epilepsy and Parkinson’s Disease
- Pattaramon Vuttipittayamongkol
- Eyad Elyan
International Journal of Neural Systems, 17 Jul 2020
Classification of imbalanced datasets has attracted substantial research interest over the past decades. Imbalanced datasets are common in several domains such as health, finance, security and others. A wide range of solutions to handle imbalanced datasets focus mainly on the class distribution problem and aim at providing more balanced datasets by means of resampling. However, existing literature shows that class overlap has a higher negative impact on the learning process than class distribution. In this paper, we propose overlap-based undersampling methods for maximizing the visibility of the minority class instances in the overlapping region. This is achieved by the use of soft clustering and the elimination threshold that is adaptable to the overlap degree to identify and eliminate negative instances in the overlapping region. For more accurate clustering and detection of overlapped negative instances, the presence of the minority class at the borderline areas is emphasized by means of oversampling. Extensive experiments using simulated and real-world datasets covering a wide range of imbalance and overlap scenarios including extreme cases were carried out. Results show significant improvement in sensitivity and competitive performance with well-established and state-of-the-art methods.
Article (No Access)
Response to Discussion on “Improved Overlap-Based Undersampling for Imbalanced Dataset Classification with Application to Epilepsy and Parkinson’s Disease,”
- Pattaramon Vuttipittayamongkol
- Eyad Elyan
International Journal of Neural Systems, 12 Aug 2020
In the paper Improved Overlap-Based Undersampling for Imbalanced Dataset Classification with Application to Epilepsy and Parkinson’s Disease, the authors introduced two new methods that address the class overlap problem in imbalanced datasets. The methods involve identification and removal of potentially overlapped majority class instances. Extensive evaluations were carried out using 136 datasets and compared against several state-of-the-art methods. Results showed competitive performance with those methods, and statistical tests proved significant improvement in classification results. The discussion on the paper related to the behavioral analysis of class overlap and method validation was raised by Fernández. In this article, the response to the discussion is delivered. Detailed clarification and supporting evidence to answer all the points raised are provided.
Article (No Access)
FEATURE SELECTION FOR DATASETS WITH IMBALANCED CLASS DISTRIBUTIONS
- ABU H. M. KAMAL
- XINGQUAN ZHU
- ABHIJIT PANDYA
- SAM HSU
- RAMASWAMY NARAYANAN
International Journal of Software Engineering and Knowledge Engineering, 01 Mar 2010
Feature selection for supervised learning concerns the problem of selecting a number of important features (w.r.t. the class labels) for the purposes of training accurate prediction models. Traditional feature selection methods, however, fail to take the sample distributions into consideration which may lead to poor prediction for minority class examples. Due to the sophistication and the cost involved in the data collection process, many applications, such as biomedical research, commonly face biased data collections with one class of examples (e.g., diseased samples) significantly less than other classes (e.g., normal samples). For these applications, the minority class examples, such as disease samples, credit card frauds, and network intrusions, are only a small portion of the data but deserve full attention for accurate prediction. In this paper, we propose three filtering techniques, Higher Weight (HW), Differential Minority Repeat (DMR) and Balanced Minority Repeat (BMR), to identify important features from datasets with biased sample distribution. Experimental comparisons with the ReliefF method on five datasets demonstrate the effectiveness of the proposed methods in selecting informative features for accurate prediction of minority class examples.
Article (No Access)
Feature Selection Method Based on Weighted Mutual Information for Imbalanced Data
- Kewen Li
- Mingxiao Yu
- Lu Liu
- Timing Li
- Jiannan Zhai
International Journal of Software Engineering and Knowledge Engineering, 01 Aug 2018
The class imbalance problem has negative effects on the performance of feature selection in imbalanced data. Traditional feature selection algorithms always study on the balanced class distribution of the data and improve the overall classification accuracy for the optimization goal, which tends to be overwhelmed by the large classes, ignoring the small ones. This paper proposes a novel feature selection method based on the weighted mutual information (WMI) for the imbalanced data, defined as WMI algorithm. The WMI algorithm assigns different weights to the samples based on the fuzzy c-means (FCM) clustering algorithm and then calculates the mutual information based on the weight of each sample. This paper used the AUC as the evaluation criterion of the selected feature. At last, four unbalanced datasets from NASA software defect datasets are used to validate the proposed approach. Experimental results show that the proposed method achieves higher prediction accuracy of both minority class and majority class.
Article (No Access)
An Empirical Analysis of Evolved Radial Basis Function Networks and Support Vector Machines with Mixture of Kernels
International Journal on Artificial Intelligence Tools, 01 Aug 2015
Classification is one of the most fundamental and formidable tasks in many domains including biomedical. In biomedical domain, the distributions of data in most of the datasets into predefined number of classes is significantly different (i.e., the classes are distributed unevenly). Many mathematical, statistical, and machine learning approaches have been developed for classification of biomedical datasets with a varying degree of success. This paper attempts to analyze the empirical performance of two forefront machine learning algorithms particularly designed for classification problem by adding some novelty to address the problem of imbalanced dataset. The evolved radial basis function network with novel kernel and support vector machine with mixture of kernels are suitably designed for the purpose of classification of imbalanced dataset. The experimental outcome shows that both algorithms are promising compared to simple radial basis function neural networks and support vector machine, respectively. However, on an average, support vector machine with mixture kernels is better than evolved radial basis function neural networks.
Article (Free Access)
Bio-Inspired Algorithm Based Undersampling Approach and Ensemble Learning for Twitter Spam Detection
- K. Kiruthika Devi
- G. A. Sathish Kumar
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 01 Jan 2024
Currently, social media networks such as Facebook and Twitter have evolved into valuable platforms for global communication. However, due to their extensive user bases, Twitter is often misused by illegitimate users engaging in illicit activities. While there are numerous research papers available that delve into combating illegitimate users on Twitter, a common shortcoming in most of these works is the failure to address the issue of class imbalance, which significantly impacts the effectiveness of spam detection. Few other research works that have addressed class imbalance have not yet applied bio-inspired algorithms to balance the dataset. Therefore, we introduce PSOB-U, a particle swarm optimization-based undersampling technique designed to balance the Twitter dataset. In PSOB-U, various classifiers and metrics are employed to select majority samples and rank them. Furthermore, an ensemble learning approach is implemented to combine the base classifiers in three stages. During the training phase of the base classifiers, undersampling techniques and a cost-sensitive random forest (CS-RF) are utilized to address the imbalanced data at both the data and algorithmic levels. In the first stage, imbalanced datasets are balanced using random undersampling, particle swarm optimization-based undersampling, and random oversampling. In the second stage, a classifier is constructed for each of the balanced datasets obtained through these sampling techniques. In the third stage, a majority voting method is introduced to aggregate the predicted outputs from the three classifiers. The evaluation results demonstrate that our proposed method significantly enhances the detection of illegitimate users in the imbalanced Twitter dataset. Additionally, we compare our proposed work with existing models, and the predicted results highlight the superiority of our spam detection model over state-of-the-art spam detection models that address the class imbalance problem. 
The combination of particle swarm optimization-based undersampling and the ensemble learning approach using majority voting results in more accurate spam detection.
Article (No Access)
ACTIVITY MINING: FROM ACTIVITIES TO ACTIONS
- LONGBING CAO
- YANCHANG ZHAO
- CHENGQI ZHANG
- HUAIFENG ZHANG
International Journal of Information Technology & Decision Making, 01 Jun 2008
Activity data accumulated in real life, such as terrorist activities and governmental customer contacts, present special structural and semantic complexities. Activity data may lead to or be associated with significant business impacts, and result in important actions and decision making leading to business advantage. For instance, a series of terrorist activities may trigger a disaster to society, and large amounts of fraudulent activities in social security programs may result in huge government customer debt. Uncovering these activities or activity sequences can greatly evidence and/or enhance corresponding actions in business decisions. However, mining such data challenges the existing KDD research in aspects such as unbalanced data distribution and impact-targeted pattern mining. This paper investigates the characteristics and challenges of activity data, and the methodologies and tasks of activity mining based on case-study experience in the area of social security. Activity mining aims to discover high impact activity patterns in huge volumes of unbalanced activity transactions. Activity patterns identified can be used to prevent disastrous events or improve business decision making and processes. We illustrate the above issues and prospects in mining governmental customer contacts data to recover customer debt.
Article (No Access)
A Hybrid Approach for Binary Classification of Imbalanced Data
- Hsinhan Tsai
- Ta-Wei Yang
- Wai-Man Wong
- Han-Yi Kao
- Cheng-Fu Chou
International Journal of Computational Intelligence and Applications, 25 Apr 2024
Binary classification with an imbalanced dataset is challenging. Models tend to consider all samples as belonging to the majority class. Although existing solutions such as sampling methods, cost-sensitive methods, and ensemble learning methods improve the poor accuracy of the minority class, these methods are limited by overfitting or cost parameters that are difficult to decide. This paper proposes a hybrid approach with dimension reduction that consists of data block construction, dimensionality reduction, and ensemble learning with deep neural network classifiers. The performance is evaluated on eight imbalanced public datasets in terms of recall, G-mean, AUC, F-measure, and balanced accuracy. The results show that the proposed model outperforms state-of-the-art methods.
Article (No Access)
Automatic Video Event Detection for Imbalance Data Using Enhanced Ensemble Deep Learning
- Samira Pouyanfar
- Shu-Ching Chen
International Journal of Semantic Computing, 01 Mar 2017
With the explosion of multimedia data, semantic event detection from videos has become a demanding and challenging topic. In addition, when the data has a skewed data distribution, interesting event detection also needs to address the data imbalance problem. The recent proliferation of deep learning has made it an essential part of many Artificial Intelligence (AI) systems. Till now, various deep learning architectures have been proposed for numerous applications such as Natural Language Processing (NLP) and image processing. Nonetheless, it is still impracticable for a single model to work well for different applications. Hence, in this paper, a new ensemble deep learning framework is proposed which can be utilized in various scenarios and datasets. The proposed framework is able to handle the over-fitting issue as well as the information losses caused by single models. Moreover, it alleviates the imbalanced data problem in real-world multimedia data. The whole framework includes a suite of deep learning feature extractors integrated with an enhanced ensemble algorithm based on the performance metrics for the imbalanced data. The Support Vector Machine (SVM) classifier is utilized as the last layer of each deep learning component and also as the weak learners in the ensemble module. The framework is evaluated on two large-scale and imbalanced video datasets (namely, disaster and TRECVID). The extensive experimental results illustrate the advantage and effectiveness of the proposed framework. It also demonstrates that the proposed framework outperforms several well-known deep learning methods, as well as the conventional features integrated with different classifiers.
Article (No Access)
Correlation-Assisted Imbalance Multimedia Concept Mining and Retrieval
- Yilin Yan
- Mei-Ling Shyu
International Journal of Semantic Computing, 01 Jun 2017
In the past decades, we have witnessed an explosion of multimedia data, especially with the development of social media websites and blooming popularity of smart devices. As a result, multimedia semantic concept mining and retrieval whose objective is to mine useful information from the large amount of multimedia data including texts, images, and videos has become more and more important. The huge amount of multimedia data and the semantic gap between low-level features and high-level semantic concepts have made it even more challenging. To address these challenges, the correlations among the classes can provide important context cues to help bridge the semantic gap. Meanwhile, many real-world datasets do not have uniform class distributions while the minority instances actually represent the concept of interests, like frauds in transactions, intrusions in network security, and unusual events in surveillance. Despite extensive research efforts, imbalanced concept retrieval remains one of the most challenging research problems in multimedia data mining. Different from existing frameworks regarding concept correlations among labels, this paper presents a novel concept correlation analysis model using the correlation between the retrieval scores and labels. Experimental results on the TRECVID benchmark datasets demonstrate that the proposed framework can enhance imbalanced concept mining and retrieval even with trivial scores from the minority class.
Article (No Access)
A Parkinson’s disease diagnosis approach for nonequilibrium gait data
- Sha Wang
International Journal of Modeling, Simulation, and Scientific Computing, 09 Jul 2022
An important feature of Parkinson’s disease (PD) patients is dyskinesia, and gait signal analysis can provide a strong basis for the diagnosis and rehabilitation of Parkinson’s disease. Traditional machine learning methods are not suitable for directly classifying imbalanced data. In order to accurately distinguish healthy people from Parkinson’s disease patients, a cost-sensitive support vector machine (CS-SVM) method is designed in this paper, which is used to construct the model for classification of gait signals between Parkinson’s disease patients and healthy individuals. Gait data for the entire subject was extracted from a real U-shaped electronic walkway. The extracted features are converted to a dimensionless form so that the classification performance can be improved. The experimental results show that the prediction accuracy and F-measure obtained by the CS-SVM method are 94.16% and 87.08%, respectively.
Article (Free Access)
Graph-Based Bidirectional Transformer Decision Threshold Adjustment Algorithm for Class-Imbalanced Molecular Data
- Nicole Hayes
- Ekaterina Merkurjev
- Guo-Wei Wei
Journal of Computational Biophysics and Chemistry, 19 Sep 2024
Data sets with imbalanced class sizes, where one class size is much smaller than that of others, occur exceedingly often in many applications, including those with biological foundations, such as disease diagnosis and drug discovery. Therefore, it is extremely important to be able to identify data elements of classes of various sizes, as a failure to do so can result in heavy costs. Nonetheless, many data classification procedures do not perform well on imbalanced data sets as they often fail to detect elements belonging to underrepresented classes. In this work, we propose the BTDT-MBO algorithm, incorporating Merriman–Bence–Osher (MBO) approaches and a bidirectional transformer, as well as distance correlation and decision threshold adjustments, for data classification tasks on highly imbalanced molecular data sets, where the sizes of the classes vary greatly. The proposed technique not only integrates adjustments in the classification threshold for the MBO algorithm in order to help deal with the class imbalance, but also uses a bidirectional transformer procedure based on an attention mechanism for self-supervised learning. In addition, the model implements distance correlation as a weight function for the similarity graph-based framework on which the adjusted MBO algorithm operates. The proposed method is validated using six molecular data sets and compared to other related techniques. The computational experiments show that the proposed technique is superior to competing approaches even in the case of a high class imbalance ratio.
Chapter (No Access)
A FILTERING APPROACH TO SPLICE SITE PREDICTIONS IN HUMAN GENES
- KIHOON YOON
- STEPHEN KWEK
Advances in Bioinformatics and Its Applications, 01 May 2005
Developing a reliable gene finding system is a very important task in gaining valuable information from non-annotated genome DNA sequence. The success of such a system depends heavily on the accuracy of splice site predictions. Despite extensive research on gene finding, current splice site prediction systems do not yield reliable accuracy. Due to the extremely high imbalance ratio of splice sites to non-splice sites, they tend to have either high false positive rate and/or low overall accuracy. In this paper, we propose several new features that are more indicative of the presence of splice sites. More importantly, we also present a simple but effective filtering approach for dealing with the imbalance problem. We show that our approach, coupled with the new features, significantly outperforms three existing popular techniques on the human genome sequences in terms of overall accuracy as well as false positive rate.
