This paper presents a novel distributed one-class classification approach based on an extension of the ν-SVM method, permitting its application to big data sets. Our method considers several one-class classifiers, each determined using a given local data partition on a processor, and the goal is to find a global model. The cornerstone of the method is a novel mathematical formulation that makes the optimization problem separable while excluding from the final solution data points considered outliers. This is particularly important because the decision region generated by the method is unaffected by the position of the outliers and therefore fits the shape of the data more precisely. Another interesting property is that, although built in parallel, the classifiers exchange data during learning in order to improve their individual specialization. Experimental results on different datasets demonstrate that the decision regions of the proposed method are as accurate as those of other well-known classifiers, while training time is reduced thanks to the method's distributed nature.
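A minimal sketch of the distributed idea, assuming scikit-learn's OneClassSVM as the local ν-SVM-style learner; the partition count, ν value, and the strategy of exchanging only support vectors are illustrative assumptions, not the authors' separable formulation:

import numpy as np
from sklearn.svm import OneClassSVM

def distributed_one_class(X, n_workers=4, nu=0.1):
    # One chunk per processor; each worker fits a local one-class model.
    partitions = np.array_split(X, n_workers)
    support = []
    for part in partitions:
        local = OneClassSVM(kernel="rbf", nu=nu).fit(part)
        support.append(part[local.support_])   # exchange only support vectors
    # Global model refit on the pooled support vectors.
    return OneClassSVM(kernel="rbf", nu=nu).fit(np.vstack(support))

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
model = distributed_one_class(X)
labels = model.predict(X)   # +1 = inside the decision region, -1 = outlier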
The common spatial patterns (CSP) algorithm is one of the most frequently used and effective spatial filtering methods for extracting relevant features in motor imagery brain–computer interfaces (MI-BCIs). However, the traditional CSP algorithm is highly sensitive to potential outliers, which adversely affects its performance in practical applications. In this work, we propose a novel feature optimization and outlier detection method for the CSP algorithm. Specifically, we use the minimum covariance determinant (MCD) to detect and remove outliers in the dataset, and then use the Fisher score to evaluate and select features. In addition, to prevent the emergence of new outliers, we propose an iterative minimum covariance determinant (IMCD) algorithm. We evaluate the proposed algorithm in terms of iteration count, classification accuracy, and feature distribution on two BCI competition datasets. The experimental results show that the average classification performance of the proposed method is 12% and 22.9% higher than that of the traditional CSP method on the two datasets (p < 0.05), and that it outperforms other competing methods. These results show that our method improves the performance of MI-BCI systems.
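A sketch of the two building blocks, assuming scikit-learn's MinCovDet for the MCD step; the chi-square cutoff and the re-inclusion rule are illustrative choices in the spirit of IMCD, and the CSP feature extraction itself is omitted:

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def imcd_filter(X, alpha=0.975, max_iter=10):
    # Iterate MCD: drop points whose robust Mahalanobis distance exceeds
    # a chi-square cutoff until the inlier set stops changing.
    keep = np.ones(len(X), dtype=bool)
    cutoff = chi2.ppf(alpha, df=X.shape[1])
    for _ in range(max_iter):
        mcd = MinCovDet().fit(X[keep])
        d2 = mcd.mahalanobis(X)          # squared robust distances
        new_keep = d2 <= cutoff
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return keep

def fisher_scores(X, y):
    # Fisher score per feature: between-class over within-class variance.
    mu = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / within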
A self-organizing neural network model that computes the smallest circle (also called the minimum spanning circle) enclosing a finite set of given points was proposed by Datta [3]. In that article, the algorithm is stated and it is demonstrated by simulation that the center of the smallest circle can be reached with a given level of accuracy; however, no rigorous proof was given in support of the simulation results. In this paper, we make a rigorous analysis of the model and mathematically prove that it converges to the desired center of the minimum spanning circle. A suitable neural network architecture is also designed for parallel implementation of the model, and the time complexity of the algorithm is worked out under this architecture. Extension of the model to higher dimensions is discussed and demonstrated with some applications.
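For intuition, a simple iterative scheme that converges to the smallest-circle center: repeatedly move the current center toward the farthest point with a decaying step (this is the Bădoiu–Clarkson update, shown here for illustration, not Datta's exact neural update rule):

import numpy as np

def enclosing_circle_center(points, n_iter=2000):
    # Start at the centroid; step 1/(k+1) toward the farthest point.
    c = points.mean(axis=0)
    for k in range(1, n_iter + 1):
        far = points[np.argmax(((points - c) ** 2).sum(axis=1))]
        c = c + (far - c) / (k + 1)
    radius = np.sqrt(((points - c) ** 2).sum(axis=1).max())
    return c, radius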
This paper describes experiences and results from applying support vector machines (SVMs) to a computer intrusion detection (CID) dataset. First, issues in supervised classification are discussed; then anomaly detection is incorporated to enhance the modeling and prediction of cyber-attacks. SVM methods are found to be competitive with benchmark methods and other studies, and are used as a standard for the anomaly detection investigation. The anomaly detection approaches compare one-class SVMs with a thresholded Mahalanobis distance to define support regions. Results compare the performance of the methods and investigate the joint performance of classification and anomaly detection. The dataset used is the publicly available DARPA/KDD-99 dataset of features from network packets, classified into non-attack and four attack categories.
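A minimal sketch of the two anomaly detection approaches being compared, with synthetic data standing in for the DARPA/KDD-99 features; the ν value and chi-square quantile are illustrative:

import numpy as np
from scipy.stats import chi2
from sklearn.svm import OneClassSVM
from sklearn.covariance import EmpiricalCovariance

rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 10))                    # "non-attack" traffic
test = np.vstack([rng.normal(size=(100, 10)),
                  rng.normal(loc=4.0, size=(100, 10))])  # half anomalous

# Support region 1: one-class SVM.
ocsvm = OneClassSVM(kernel="rbf", nu=0.01).fit(train)
svm_flags = ocsvm.predict(test) == -1

# Support region 2: thresholded (squared) Mahalanobis distance.
cov = EmpiricalCovariance().fit(train)
maha_flags = cov.mahalanobis(test) > chi2.ppf(0.99, df=train.shape[1])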
Most traditional classifiers implicitly assume that every data sample belongs to one of several predefined classes. However, not all data patterns may be known at the time of data collection, and new patterns can emerge over time. In this paper, a new method is presented for monitoring changes in class distribution and detecting an emerging class. First, a statistical significance test is designed to signal a change in class distribution. When an alarm for new class generation is raised, retrieval of new class members is performed using density estimation and entropy-based thresholding. Our experimental results demonstrate the competitive performance of the proposed method.
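A sketch of the retrieval step, assuming kernel density estimation on the old data and a Kapur-style maximum-entropy threshold as a stand-in for the paper's entropy-based thresholding; the significance test that raises the alarm is omitted:

import numpy as np
from sklearn.neighbors import KernelDensity

def kapur_threshold(scores, bins=64):
    # Entropy-based threshold: pick the histogram cut maximizing the
    # sum of entropies of the two resulting partitions.
    hist, edges = np.histogram(scores, bins=bins)
    p = hist / hist.sum()
    best_t, best_h = edges[1], -np.inf
    for k in range(1, bins):
        p0, p1 = p[:k].sum(), p[k:].sum()
        if p0 == 0 or p1 == 0:
            continue
        q0 = p[:k][p[:k] > 0] / p0
        q1 = p[k:][p[k:] > 0] / p1
        h = -(q0 * np.log(q0)).sum() - (q1 * np.log(q1)).sum()
        if h > best_h:
            best_h, best_t = h, edges[k]
    return best_t

def emerging_class_members(X_old, X_new, bandwidth=0.5):
    # Low density under the old-data model marks new-class candidates.
    kde = KernelDensity(bandwidth=bandwidth).fit(X_old)
    logp = kde.score_samples(X_new)
    return logp < kapur_threshold(logp)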
We propose an effective outlier detection algorithm for high-dimensional data. We consider manifold models of data as is typically assumed in dimensionality reduction/manifold learning. Namely, we consider a noisy data set sampled from a low-dimensional manifold in a high-dimensional data space. Our algorithm uses local geometric structure to determine inliers, from which the outliers are identified. The algorithm is applicable to both linear and nonlinear models of data. We also discuss various implementation issues and we present several examples to demonstrate the effectiveness of the new approach.
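One simple instantiation of the idea, for illustration: fit a local tangent plane to each point's neighborhood and score the point by its residual distance from that plane (neighborhood size k and intrinsic dimension d are assumptions, not the paper's settings):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_manifold_scores(X, k=15, d=2):
    # Large residuals from the local d-dimensional tangent plane
    # suggest points off the manifold, i.e. outliers.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    scores = np.empty(len(X))
    for i, neigh in enumerate(idx):
        nbrs = X[neigh[1:]]                    # exclude the point itself
        mu = nbrs.mean(axis=0)
        _, _, Vt = np.linalg.svd(nbrs - mu, full_matrices=False)
        r = X[i] - mu
        proj = Vt[:d].T @ (Vt[:d] @ r)         # projection onto the plane
        scores[i] = np.linalg.norm(r - proj)
    return scores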
A spatial outlier is a spatially referenced object whose non-spatial attribute values differ significantly from those of its neighborhood. Identification of spatial outliers can lead to the discovery of unexpected, interesting, and useful spatial patterns for further analysis. Previous work on spatial outlier detection focuses on detecting spatial outliers with a single attribute. In this paper, we propose two approaches to discovering spatial outliers with multiple attributes. We formulate the multi-attribute spatial outlier detection problem in a general way, provide two effective detection algorithms, and analyze their computational complexity. In addition, using real-world census data, we demonstrate that our approaches can effectively identify local abnormalities in large spatial data sets.
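A sketch of one common multi-attribute formulation: score each object by the Mahalanobis distance of the difference between its attribute vector and its spatial neighborhood's average (the neighborhood size is an illustrative parameter):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def spatial_outlier_scores(coords, attrs, k=8):
    # Difference between each object's attributes and the mean of its
    # k spatial neighbors (self excluded), scored by Mahalanobis distance.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    _, idx = nn.kneighbors(coords)
    diff = attrs - attrs[idx[:, 1:]].mean(axis=1)
    cov_inv = np.linalg.pinv(np.cov(diff, rowvar=False))
    centered = diff - diff.mean(axis=0)
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)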
This paper introduces a stochastic graph-based algorithm, called OutRank, for detecting outliers in data. We consider two approaches for constructing a graph representation of the data, based on the object similarity and number of shared neighbors between objects. The heart of this approach is the Markov chain model that is built upon this graph, which assigns an outlier score to each object. Using this framework, we show that our algorithm is more robust than the existing outlier detection schemes and can effectively address the inherent problems of such schemes. Empirical studies conducted on both real and synthetic data sets show that significant improvements in detection rate and false alarm rate are achieved using the proposed framework.
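A minimal sketch of the graph-and-Markov-chain idea, assuming a Gaussian similarity graph (one of the two constructions mentioned); the stationary distribution obtained by power iteration serves as a normality score, with low values suggesting outliers:

import numpy as np

def outrank_style_scores(X, gamma=1.0, n_iter=100):
    # Similarity graph -> row-stochastic transition matrix -> stationary
    # distribution. Weakly connected objects receive little mass.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * gamma ** 2))
    np.fill_diagonal(S, 0.0)
    P = S / S.sum(axis=1, keepdims=True)
    pi = np.full(len(X), 1.0 / len(X))
    for _ in range(n_iter):       # power iteration
        pi = pi @ P
    return pi                     # low values = likely outliers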
In this paper, we propose an approach that combines different outlier detection algorithms to achieve improved effectiveness. To this end, we first estimate an outlier score vector for each data object; each element of the vector corresponds to the outlier score produced by a specific detection algorithm. We then use a multivariate beta mixture model to cluster the outlier score vectors into several components so that the component corresponding to the outliers can be identified. A notable feature of the proposed approach is the automatic identification of outliers, whereas most existing methods either return only a ranked list of points, expecting the outliers to come first, or require empirical threshold estimation. Experimental results on both synthetic and real data sets show that our approach substantially enhances the accuracy of the base detectors considered in the combination and overcomes their drawbacks.
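A sketch of the combination scheme, with scikit-learn's IsolationForest and LOF as base detectors and a Gaussian mixture standing in for the paper's multivariate beta mixture model (scores rescaled to [0, 1]):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.mixture import GaussianMixture

def combined_outlier_flags(X):
    # Build a per-object score vector, cluster the vectors, and take
    # the component with the highest mean scores as the outlier one.
    s1 = -IsolationForest(random_state=0).fit(X).score_samples(X)
    s2 = -LocalOutlierFactor().fit(X).negative_outlier_factor_
    V = np.column_stack([s1, s2])
    V = (V - V.min(0)) / (V.max(0) - V.min(0) + 1e-12)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(V)
    outlier_comp = np.argmax(gmm.means_.sum(axis=1))
    return gmm.predict(V) == outlier_comp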
The task of semi-supervised outlier detection is to find instances that deviate from the rest of the data, using some labeled examples. This issue is especially important in applications such as fraud detection and intrusion detection. Most existing techniques are unsupervised, while semi-supervised approaches use both negative and positive instances to detect outliers. However, in many real-world applications very few positive labeled examples are available. This paper proposes an approach to address this problem. The proposed method works as follows: first, some reliable negative instances are extracted by a kNN-based algorithm; then, fuzzy clustering using both negative and positive examples is applied to detect outliers. Experimental results on real data sets demonstrate that the proposed approach outperforms previous unsupervised state-of-the-art methods in detecting outliers.
Outlier detection is a difficult problem because its time complexity is quadratic or cubic in most cases, which makes it necessary to develop acceleration algorithms. Since the main acceleration algorithms rely on index structures (e.g., the R-tree), those approaches deteriorate as the dimensionality increases. In this paper, an approach named VBOD (vibration-based outlier detection) is proposed, whose main variants assess vibration. Since the basic model and the approximation algorithm FASTVBOD do not need to compute an index structure, their performance is less sensitive to increasing dimensionality than traditional approaches. The basic model has only quadratic time complexity, and the accelerated algorithms reduce this to O(n log n). Another advantage is that the approach does not rely on any parameter selection. FASTVBOD was compared with state-of-the-art algorithms and performed substantially better, especially on high-dimensional data.
We introduce a novel hypothesis in the field of outlier detection: normal data tend to lie in regions where the density changes smoothly or only slightly, whereas abnormal data tend to lie in regions characterized by abrupt changes in density. Building on this hypothesis, we develop a density-based unsupervised outlier detection method referred to as quantum clustering (QC). The method processes unlabeled data and employs a potential function to identify cluster centroids and outliers effectively. Experimental results demonstrate that the potential function can accurately detect hidden outliers among the data points, and that by adjusting the parameter σ, QC can identify more subtle outliers. Our method is also evaluated on several benchmarks from diverse research areas, confirming its broad applicability.
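A sketch of one standard form of the quantum-clustering potential (in the style of Horn and Gottlieb), evaluated at the data points; high potential marks regions of sharply changing density, and σ is the tunable scale:

import numpy as np

def qc_potential(X, sigma=1.0):
    # Parzen "wavefunction" psi(x) = sum_i exp(-||x - x_i||^2 / 2 sigma^2).
    # Up to an additive constant, the quantum potential at a point is a
    # density-weighted mean of squared distances to the data.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    g = np.exp(-d2 / (2 * sigma ** 2))
    return (d2 * g).sum(axis=1) / (2 * sigma ** 2 * g.sum(axis=1))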
Outlier detection is one of the most important data analytics tasks and is used in numerous applications and domains. The goal of outlier detection is to find abnormal entities that are significantly different from the remaining data. Often, the underlying data is distributed across different organizations. If outlier detection is done locally, the results obtained are not as accurate as when outlier detection is done collaboratively over the combined data. However, the data cannot be easily integrated into a single database due to privacy and legal concerns. In this paper, we address precisely this problem. We first define privacy in the context of collaborative outlier detection. We then develop a novel method to find outliers from both horizontally partitioned and vertically partitioned categorical data in a privacy-preserving manner. Our method is based on a scalable outlier detection technique that uses attribute value frequencies. We provide an end-to-end privacy guarantee by using the differential privacy model and secure multiparty computation techniques. Experiments on real data show that our proposed technique is both effective and efficient.
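A sketch of the underlying attribute value frequency (AVF) scoring on categorical data: records whose attribute values are rare receive low scores. The differential privacy and secure multiparty computation layers of the paper are omitted here:

import numpy as np

def avf_scores(X_cat):
    # X_cat: (n, m) array of categorical values. Score each record by
    # the mean frequency of its attribute values; low = likely outlier.
    n, m = X_cat.shape
    scores = np.zeros(n)
    for j in range(m):
        vals, counts = np.unique(X_cat[:, j], return_counts=True)
        freq = dict(zip(vals, counts))
        scores += np.array([freq[v] for v in X_cat[:, j]])
    return scores / m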
We present a new method for the automatic detection of circular objects in images: we detect an osculating circle to an elliptic arc using a Hough transform, iteratively deforming it into an ellipse, removing outlier pixels, and searching for a separate edge. The voting space for the Hough transform is restricted to one and two dimensions for efficiency, and special weighting schemes are introduced to enhance the accuracy. We demonstrate the effectiveness of our method using real images. Finally, we apply our method to the calibration of a turntable for 3D object shape reconstruction.
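For comparison, OpenCV's general-purpose circular Hough transform can supply initial circle candidates; the paper's method then restricts the voting space and deforms each candidate into an ellipse. The file name and parameter values below are illustrative:

import cv2
import numpy as np

gray = cv2.imread("turntable.png", cv2.IMREAD_GRAYSCALE)
gray = cv2.medianBlur(gray, 5)
circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=30,
                           param1=100, param2=30, minRadius=5, maxRadius=0)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        print(f"candidate circle: center=({x}, {y}), radius={r}")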
Outlier detection targets exceptional data whose patterns are rare and which lie in low-density regions. In this paper, under the assumption of complete spatial randomness inside clusters, we propose the MDV (multi-scale deviation of the volume) approach to identifying outliers. In addition to assigning an outlier score to each object, it directly outputs a crisp outlier set. It also offers a plot showing the data structure in every object's vicinity, which is useful in explaining why the object may be outlying. Finally, the effectiveness of MDV is demonstrated on both artificial and real datasets.
Peculiarity-oriented mining (POM) is a data mining method aimed at discovering peculiarity rules hidden in a dataset. The peculiarity factor (PF) is one of the most important concepts in POM. In this paper, it is proved that the PF can accurately characterize the peculiarity of data sampled from a normal distribution, but that this property does not hold for a general one-dimensional distribution. A local version of the PF, called the local peculiarity factor (LPF), is proposed to resolve this difficulty; the LPF can effectively describe the peculiarity of data sampled from any continuous one-dimensional distribution. Based on the LPF, a framework for local peculiarity-oriented mining is presented, consisting of two steps: peculiar data identification and peculiar data analysis. Two algorithms for peculiar data identification and a case study of peculiar data analysis are given to make the framework practical. Experiments on several benchmark datasets show their good performance.
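A sketch of the PF and its local version for one-dimensional data, assuming the common exponent of 0.5 in the distance sum:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def pf(x):
    # Peculiarity factor: PF(x_i) = sum_j |x_i - x_j|^0.5 over all points.
    return np.sqrt(np.abs(x[:, None] - x[None, :])).sum(axis=1)

def lpf(x, k=20):
    # Local peculiarity factor: the same sum restricted to the k
    # nearest neighbors of each point.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x.reshape(-1, 1))
    dist, _ = nn.kneighbors(x.reshape(-1, 1))
    return np.sqrt(dist[:, 1:]).sum(axis=1)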
High-dimensional data poses unique challenges for the outlier detection process. Most existing algorithms fail to properly address the issues stemming from a large number of features; in particular, outlier detection algorithms perform poorly on small datasets with a large number of features. In this paper, we propose a novel outlier detection algorithm based on principal component analysis and kernel density estimation. The proposed method addresses the challenges of high-dimensional data by projecting the original data onto a smaller space and using the innate structure of the data to calculate an anomaly score for each point. Numerical experiments on synthetic and real-life data show that our method performs well on high-dimensional data; in particular, it outperforms the benchmark methods as measured by F1-score, while achieving better-than-average execution times.
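A minimal instantiation of the PCA-plus-KDE idea (the component count and bandwidth are illustrative, and the paper's exact scoring may differ):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

def pca_kde_scores(X, n_components=5, bandwidth=0.5):
    # Project onto a low-dimensional principal subspace, estimate the
    # density there, and use negative log-density as the anomaly score.
    Z = PCA(n_components=n_components).fit_transform(X)
    kde = KernelDensity(bandwidth=bandwidth).fit(Z)
    return -kde.score_samples(Z)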
Detecting outliers in high-dimensional datasets is a difficult data mining task. Mining outliers in subspaces is a promising solution, because outliers may be embedded in interesting subspaces. Owing to the many irrelevant dimensions in high-dimensional datasets, it is important to eliminate irrelevant or unimportant dimensions and to identify outliers in interesting subspaces with strong correlation. Normally, the correlation among dimensions can be determined by traditional feature selection techniques and subspace-based clustering methods: dimension-reduction subspace clustering techniques find interesting subspaces in relatively low-dimensional spaces, while dimension-growth approaches aim to find maximal cliques in high-dimensional datasets. This paper presents a novel approach that identifies outliers in correlated subspaces. The degree of correlation among dimensions is measured in terms of the mean squared residue, and frequent pattern algorithms are employed to find the correlated subspaces. Based on the correlated subspaces obtained, outliers are distinguished in the projected subspaces by classical outlier detection techniques. Empirical studies show that the proposed approach can identify outliers effectively in high-dimensional datasets.
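A sketch of the pipeline, with exhaustive enumeration of feature pairs replacing the paper's frequent-pattern subspace search, and LOF as the classical detector:

import numpy as np
from itertools import combinations
from sklearn.neighbors import LocalOutlierFactor

def mean_squared_residue(A):
    # Cheng-Church mean squared residue: small values indicate a
    # strongly correlated set of dimensions.
    r = A.mean(axis=1, keepdims=True)
    c = A.mean(axis=0, keepdims=True)
    return ((A - r - c + A.mean()) ** 2).mean()

def subspace_outliers(X, dim=2, n_subspaces=3, contamination=0.05):
    # Rank candidate subspaces by residue, then run LOF inside the
    # most correlated ones and take the union of flagged points.
    subs = sorted(combinations(range(X.shape[1]), dim),
                  key=lambda s: mean_squared_residue(X[:, list(s)]))
    flags = np.zeros(len(X), dtype=bool)
    for s in subs[:n_subspaces]:
        lof = LocalOutlierFactor(contamination=contamination)
        flags |= lof.fit_predict(X[:, list(s)]) == -1
    return flags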
In linear regression, outliers have a serious effect on the estimation of model parameters and on the prediction of final results, so outlier detection is a key step in data analysis. In this paper, we use a mean shift model and apply a penalty function to the mean shift parameters, which yields a sparse parameter vector. We choose sorted L1 regularization (SLOPE), which provides a convex loss function and shows good statistical properties in parameter selection. We apply an iterative process that performs gradient descent and parameter selection at each step; the algorithm has high computational efficiency since the calculation of an inverse matrix is avoided. Finally, we use cross-validation (CV) and the Bayesian information criterion (BIC) to tune the parameters, which helps identify outliers and obtain more robust regression coefficients. Experimental results show that, compared with other methods, our approach performs well in all aspects of outlier detection.
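A sketch of the mean shift model with a SLOPE penalty, solved by proximal gradient descent (which avoids matrix inversion); the decreasing weight sequence is an illustrative choice rather than the paper's CV/BIC-tuned one:

import numpy as np

def prox_sorted_l1(v, lam):
    # Prox of the sorted-L1 (SLOPE) penalty: subtract the decreasing
    # weights from the sorted |v|, project onto non-increasing sequences
    # via pool-adjacent-violators, clip at zero, restore order and signs.
    sign, u = np.sign(v), np.abs(v)
    order = np.argsort(-u)
    w = u[order] - lam
    vals, sizes = [], []
    for x in w:
        vals.append(x); sizes.append(1)
        while len(vals) > 1 and vals[-2] <= vals[-1]:
            tot = sizes[-1] + sizes[-2]
            vals[-2] = (vals[-1] * sizes[-1] + vals[-2] * sizes[-2]) / tot
            sizes[-2] = tot
            vals.pop(); sizes.pop()
    w = np.maximum(np.repeat(vals, sizes), 0.0)
    out = np.empty_like(u)
    out[order] = w
    return sign * out

def mean_shift_slope(X, y, lam_scale=1.0, n_iter=500):
    # Proximal gradient on 0.5*||y - X beta - gamma||^2 + J_lam(gamma).
    # Nonzero entries of gamma flag outlying observations.
    n, p = X.shape
    beta, gamma = np.zeros(p), np.zeros(n)
    lam = lam_scale * np.linspace(2.0, 1.0, n)     # decreasing weights
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1.0)  # safe step size
    for _ in range(n_iter):
        r = y - X @ beta - gamma
        beta = beta + step * (X.T @ r)
        gamma = prox_sorted_l1(gamma + step * r, step * lam)
    return beta, gamma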
Affymetrix GeneChip® oligonucleotide arrays are dedicated to analyzing gene expression differences across distinct experimental conditions. Data production for such arrays is an elaborate process with many potential sources of variability unrelated to biologically relevant gene expression variation; therefore, rigorous data quality assessment throughout the process is fundamental for biologically meaningful downstream analyses. We have developed AffyGCQC, a bioinformatics tool designed to perform Affymetrix GeneChip quality control. The program implements a graphical representation of the QC metrics recommended by Affymetrix for GeneChip oligonucleotide array technology. Most importantly, it performs extreme studentized deviate statistical tests for the set of arrays being compared in a given experiment, thus providing an objective measure for outlier detection. AffyGCQC has been designed as an easy-to-use Web-based interface (online supplementary information: ; contact: affygcqc@biologie.ens.fr).
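A sketch of the standard generalized extreme studentized deviate (ESD) test, which is the statistical core described; this is not necessarily AffyGCQC's exact implementation:

import numpy as np
from scipy.stats import t as t_dist

def generalized_esd(x, max_outliers, alpha=0.05):
    # Rosner's generalized ESD test: returns indices of detected outliers.
    x = np.asarray(x, dtype=float)
    idx = np.arange(len(x))
    removed, n_signif = [], 0
    for i in range(max_outliers):
        n = len(x)                              # current sample size
        dev = np.abs(x - x.mean())
        j = int(np.argmax(dev))
        R = dev[j] / x.std(ddof=1)              # test statistic
        p = 1 - alpha / (2 * n)
        tval = t_dist.ppf(p, n - 2)
        lam = (n - 1) * tval / np.sqrt((n - 2 + tval ** 2) * n)
        removed.append(int(idx[j]))
        if R > lam:
            n_signif = i + 1                    # largest i with R_i > lam_i
        x, idx = np.delete(x, j), np.delete(idx, j)
    return removed[:n_signif]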