Detecting anomalous patterns in data is a relevant task in many practical applications, such as defective-item detection in industrial inspection systems, cancer identification in medical images, or attacker detection in network intrusion detection systems. This paper focuses on the detection of anomalous images, that is, images that visually deviate from a reference set of regular data. While anomaly detection has been widely studied in the context of classical machine learning, the application of modern deep learning techniques in this field is still limited. We propose a capsule-based network for anomaly detection in an extremely imbalanced, fully supervised context: we assume that anomaly samples are available, but that their amount is limited compared to regular data. By using a variant of the standard CapsNet architecture, we achieved state-of-the-art results on the MNIST, F-MNIST and K-MNIST datasets.
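The abstract above does not specify how the network's loss handles the extreme imbalance; a standard ingredient in such fully supervised settings is to weight each class's loss inversely to its frequency. The sketch below (the function name `inverse_frequency_weights` is hypothetical, not from the paper) shows this common balanced-weighting scheme, not the paper's actual training procedure:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class loss weights inversely proportional to class frequency,
    so the rare anomaly class is not drowned out by regular data.
    Uses the 'balanced' heuristic: w_c = total / (n_classes * count_c)."""
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}

# Example: 990 regular images (class 0) vs. 10 anomalies (class 1)
labels = [0] * 990 + [1] * 10
w = inverse_frequency_weights(labels)
# the anomaly class receives a far larger weight than the regular class
```

With these weights, a misclassified anomaly contributes roughly 100 times more to the loss than a misclassified regular sample, counteracting the 99:1 class ratio.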
Class-imbalance learning is one of the most significant research topics in data mining and machine learning. The imbalance problem arises when one class has far more samples than the others. To address low classification accuracy and high time complexity, this paper proposes a novel imbalanced-data classification algorithm based on clustering and SVM. The algorithm under-samples the majority class based on the distribution characteristics of the minority class. First, specific clusters are detected by cluster analysis on the minority class. Second, a cluster-boundary strategy is proposed to eliminate the adverse influence of noise samples. To construct a balanced dataset, this paper proposes three principles for under-sampling the majority class according to the characteristics of the samples in each cluster. Finally, the optimal classification model is obtained from a linear combination of hybrid-kernel SVMs. Experiments on datasets from the UCI and KEEL repositories show that our algorithm effectively decreases the interference of noise samples. Compared with SMOTE and Fast-CBUS, the proposed algorithm not only reduces the feature dimension but also generally improves precision on the minority classes across different labeled-sample rates.
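The abstract leaves the three under-sampling principles unspecified, so the following is only a minimal sketch of the general idea of minority-guided under-sampling, not the paper's algorithm: majority points that fall inside a minority cluster's neighbourhood are discarded as likely overlap, and the remainder is subsampled to balance the classes. The function name `cluster_guided_undersample` and the fixed-radius rule are assumptions for illustration:

```python
import math
import random

def cluster_guided_undersample(majority, minority_centroids, radius,
                               target_size, seed=0):
    """Sketch of minority-cluster-guided under-sampling:
    drop majority points inside any minority cluster's neighbourhood
    (treated as overlap/noise), then randomly keep `target_size`
    of the remainder to balance the classes."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    kept = [p for p in majority
            if all(dist(p, c) > radius for c in minority_centroids)]
    random.Random(seed).shuffle(kept)
    return kept[:target_size]

# Toy example: majority points along the x-axis, one minority centroid
majority = [(float(i), 0.0) for i in range(10)]
kept = cluster_guided_undersample(majority, [(0.0, 0.0)],
                                  radius=2.5, target_size=4)
```

In the toy run, the three majority points within distance 2.5 of the minority centroid are removed first, and four of the remaining seven are kept.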
When working with real-world applications we often find imbalanced datasets: those with a majority class of normal data and a minority class of abnormal or important data. In this work, we present an overview of the class imbalance problem; we review its consequences, possible causes, and existing strategies to cope with the difficulties associated with it. As a contribution toward solving this problem, we propose a new rule induction algorithm named Rule Extraction for MEdical Diagnosis (REMED), a symbolic one-class learning approach. To evaluate the proposed method, we use different medical diagnosis datasets, taking into account quantitative metrics, comprehensibility, and reliability. We compared REMED against C4.5 and RIPPER combined with over-sampling and cost-sensitive strategies. This empirical analysis showed REMED to be quantitatively competitive with C4.5 and RIPPER in terms of the area under the Receiver Operating Characteristic curve (AUC) and the geometric mean, while surpassing them in comprehensibility and reliability. Our experiments show that REMED generated rule systems with a greater degree of abstraction and patterns closer to well-known abnormal values associated with each medical dataset considered.
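The geometric mean used in the comparison above is a standard imbalance-aware metric: the square root of the product of sensitivity and specificity, which is high only when both classes are classified well. A minimal implementation from a binary confusion matrix:

```python
import math

def geometric_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall on positives) and
    specificity (recall on negatives). Unlike plain accuracy, it
    collapses to 0 if either class is entirely misclassified."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# 8/10 positives and 90/100 negatives correct
gm = geometric_mean(tp=8, fn=2, tn=90, fp=10)
```

A classifier that predicts the majority class for everything scores high accuracy on imbalanced data but a geometric mean of exactly zero, which is why this metric (alongside AUC) is preferred here.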
Imbalanced class distributions are common in many real-world applications. As many classifiers tend to degrade in performance on the minority class, several approaches have been proposed to deal with this problem. In this paper, we propose two new cluster-based oversampling methods, SOI-C and SOI-CJ. The proposed methods create clusters from the minority class instances and generate synthetic instances inside those clusters. In contrast with other oversampling methods, the proposed approaches avoid creating new instances in majority class regions, and they are more robust to noisy examples, since the number of new instances generated per cluster is proportional to the cluster's size. The clusters are generated automatically, so our methods need no tuning parameters, and they can deal with both numerical and nominal attributes. The two methods were tested on twenty artificial datasets and twenty-three datasets from the UCI Machine Learning repository. For our experiments, we used six classifiers, and results were evaluated with recall, precision, F-measure, and AUC, which are more suitable for class-imbalanced datasets. We performed ANOVA and paired t-tests to show that the proposed methods are competitive and in many cases significantly better than the other oversampling methods used in the comparison.
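The key property claimed above, that synthetic instances stay inside minority clusters rather than spilling into majority regions, can be illustrated with a simple interpolation sketch (assuming numerical attributes and a cluster already found; `oversample_within_cluster` is a hypothetical name, not the SOI-C/SOI-CJ code):

```python
import random

def oversample_within_cluster(cluster, n_new, seed=0):
    """Generate synthetic minority points by interpolating between
    two members of the SAME cluster, so every new instance lies on a
    segment inside the cluster's convex hull and cannot land in a
    majority-class region outside it."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(cluster, 2)     # two distinct cluster members
        t = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Triangle-shaped minority cluster
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_pts = oversample_within_cluster(cluster, n_new=5)
```

Because every synthetic point interpolates two vertices of the triangle, all generated points satisfy x >= 0, y >= 0, x + y <= 1, i.e. they remain inside the cluster's hull.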
Coronary Artery Disease (CAD) is one of the most common major cardiovascular diseases, and several studies have been conducted with different features, including data collected from patients, for the timely diagnosis of CAD.
In this study, a dataset with 21 features has been used, and a risk-score prediction system has been proposed. The patients were divided into four groups. To determine the effective features of the CAD dataset, the t-test and Relief-f feature selection methods were used with Logistic Regression Analysis (LRA), and Relief-f with a Neural Network (NN).
Sampling methods were used to mitigate the imbalance of the four-class dataset, and their effects were evaluated. Using the NN with oversampling and Relief-f feature selection, accuracy was 72.3% before the preprocessing operations; after them, 84.1% accuracy was achieved, with 0.84 sensitivity and 0.94 specificity. Detailed analysis shows these are the best results obtained for the CAD dataset in this study.
Using the feature selection and sampling methods with the NN substantially improves the prediction accuracy as well as the other metrics. This suggests that these preprocessing methods and the NN may be used together to construct predictive models for four-class imbalanced medical datasets.
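The Relief-f selection used above generalizes the original binary Relief idea: a feature gains weight when it separates a sample from its nearest neighbour of another class (nearest miss) and loses weight when it differs from the nearest neighbour of the same class (nearest hit). The sketch below implements only the basic binary Relief form for illustration, not the multi-class Relief-f variant or this study's pipeline:

```python
import math

def relief_weights(X, y):
    """Minimal binary Relief: for each sample, reward features that
    differ from the nearest miss and penalise features that differ
    from the nearest hit. Higher weight = more discriminative feature."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    n_feat = len(X[0])
    w = [0.0] * n_feat
    for i in range(len(X)):
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        h = min(hits, key=lambda j: dist(X[i], X[j]))    # nearest hit
        m = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest miss
        for f in range(n_feat):
            w[f] += abs(X[i][f] - X[m][f]) - abs(X[i][f] - X[h][f])
    return w

# Feature 0 separates the classes; feature 1 is constant (uninformative)
X = [(0.0, 5.0), (0.1, 5.0), (1.0, 5.0), (0.9, 5.0)]
y = [0, 0, 1, 1]
w = relief_weights(X, y)
```

On the toy data, the discriminative feature 0 accumulates a positive weight while the constant feature 1 scores zero, which is exactly the ranking a Relief-style selector would use to drop features.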
This paper aims to verify that cost-sensitive learning is a competitive approach for learning fuzzy rules in certain imbalanced classification problems. It is shown that there exist cost matrices whose use, in combination with a suitable classifier, improves on the results of some popular data-level techniques. The well-known FURIA algorithm is extended to take advantage of such cost matrices. A numerical study compares the proposed cost-sensitive FURIA to other state-of-the-art classification algorithms, based on fuzzy rules and on other classical machine learning methods, over 64 different imbalanced datasets.
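The core mechanism of cost-sensitive classification, as opposed to data-level resampling, is to predict the class that minimises expected cost under a cost matrix rather than the most probable class. A generic sketch of that decision rule (not the FURIA extension itself; `min_expected_cost_class` is a hypothetical name):

```python
def min_expected_cost_class(probs, cost_matrix):
    """Cost-sensitive decision rule: given class probabilities and a
    cost matrix where cost_matrix[true][pred] is the cost of predicting
    `pred` when the true class is `true` (zero diagonal), return the
    class with minimum expected misclassification cost."""
    n = len(cost_matrix)
    expected = [sum(probs[t] * cost_matrix[t][pred] for t in range(n))
                for pred in range(n)]
    return min(range(n), key=expected.__getitem__)

# Missing the minority class (1) costs 10x a false alarm
probs = [0.7, 0.3]
asymmetric = [[0, 1], [10, 0]]
pred = min_expected_cost_class(probs, asymmetric)
```

With these costs the rule predicts the minority class even though it has only 0.3 probability (expected cost 0.7 versus 3.0), whereas under a symmetric cost matrix the same probabilities yield the majority class. This is how a cost matrix shifts the decision boundary without touching the training data.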
In imbalanced learning, most supervised learning algorithms fail to account for the data distribution and learn models biased towards the majority, leading to unfavorable classification performance, particularly on minority class samples. To tackle this problem, the ADASYN algorithm adaptively allocates weights to the minority class examples. A larger weight increases the probability that a minority class sample serves as a seed in the synthetic sample generation process. However, ADASYN does not account for noisy examples. Thus, this paper presents a modified version of ADASYN (M-ADASYN) for learning from imbalanced datasets with noisy samples. M-ADASYN considers the distribution of the minority class and creates noise-free minority examples by eliminating noisy samples based on their proximity to the original minority and majority class samples. The experimental outcomes confirm that the predictive performance of M-ADASYN is better than that of the KernelADASYN, ADASYN, and SMOTE algorithms.
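The abstract describes the noise-elimination step only as proximity-based; one plausible reading is a k-nearest-neighbour filter in which a minority sample surrounded entirely by majority points is treated as noise and excluded before synthetic generation. The sketch below implements that reading as an assumption, not the published M-ADASYN procedure:

```python
import math

def filter_noisy_minority(minority, majority, k=3):
    """Proximity-based noise filter: drop any minority sample whose k
    nearest neighbours (among all other samples) are ALL majority
    points, so isolated minority outliers never seed synthetic data."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    clean = []
    for x in minority:
        # label 0 = minority neighbour, 1 = majority neighbour
        pool = ([(dist(x, m), 0) for m in minority if m is not x]
                + [(dist(x, m), 1) for m in majority])
        pool.sort(key=lambda t: t[0])
        if any(label == 0 for _, label in pool[:k]):
            clean.append(x)
    return clean

# Tight minority cluster near the origin plus one outlier inside
# a majority region around (10, 10)
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0)]
majority = [(9.0, 9.0), (10.0, 9.0), (9.0, 10.0), (11.0, 10.0)]
clean = filter_noisy_minority(minority, majority, k=3)
```

The outlier at (10, 10) has only majority points among its three nearest neighbours and is removed, while the clustered minority samples survive; an ADASYN-style generator would then run on the filtered set only.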
Enterprise data present several difficulties when used in data mining projects. Apart from being heterogeneous, noisy, and disparate, they may also be characterized by major imbalances between the different classes. Predictive classification on imbalanced data requires methodologies adequate for such data, particularly for training the algorithms and evaluating the resulting classifiers. This chapter suggests experimenting with several class distributions in the training sets and a variety of performance measures, especially those known to better expose the strengths and weaknesses of classification models. Combining classifiers into schemes suitable for the specific business domain may improve predictions. However, the final evaluation of the classifiers must always be based on the impact of the results on the enterprise, which can take the form of a cost model reflecting the requirements of existing knowledge. Taking a telecommunications company as an example, we provide a framework for handling enterprise data during the initial phases of a project, as well as for generating and evaluating predictive classifiers. We also provide the design of a decision support system that integrates the above process into the daily routine of such a company.