
  Bestsellers

  • Article (No Access)

    Approximation Algorithms for Spherical k-Means Problem with Penalties Using Local Search Techniques

    In this paper, we consider the spherical k-means problem with penalties, a robust model of spherical clustering that requires identifying outliers during clustering to improve the quality of the solution. Each outlier incurs a specified penalty cost. In this problem, one must detect the outliers and propose a k-clustering of the given data set so as to minimize the sum of the clustering and penalty costs. As our main contribution, we present a (16 + 8√3)-approximation via single-swap local search and an (8 + 2√7 + ε)-approximation via multi-swap local search.
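    The single-swap procedure mentioned in the abstract can be sketched as follows. This is a minimal illustration in the Euclidean setting (the paper works with spherical dissimilarity on the unit sphere): candidate centers are restricted to the input points, and each point pays either its squared distance to the nearest open center or its penalty, whichever is smaller. All names and parameters here are illustrative, not the authors'.

```python
import numpy as np

def cost(points, centers, penalty):
    # each point pays min(squared distance to nearest center, penalty);
    # points for which the penalty is cheaper are the detected outliers
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.minimum(d.min(axis=1), penalty).sum()

def single_swap_local_search(points, k, penalty, seed=0):
    # repeatedly try swapping one open center for one closed candidate,
    # accepting any swap that lowers the total cost
    rng = np.random.default_rng(seed)
    n = len(points)
    S = list(rng.choice(n, size=k, replace=False))  # indices of open centers
    best = cost(points, points[S], penalty)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for j in range(n):
                if j in S:
                    continue
                T = S.copy()
                T[i] = j
                c = cost(points, points[T], penalty)
                if c < best - 1e-12:
                    S, best, improved = T, c, True
    return points[S], best
```

    On a toy instance with two tight clusters and one distant point, the distant point ends up paying its penalty rather than dragging a center toward it.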

  • Article (No Access)

    A quick method based on SIMPLISMA-KPLS for simultaneously selecting outlier samples and informative samples for model standardization in near infrared spectroscopy

    A novel method based on SIMPLe-to-use Interactive Self-modeling Mixture Analysis (SIMPLISMA) and Kernel Partial Least Squares (KPLS), named SIMPLISMA-KPLS, is proposed in this paper for selecting outlier samples and informative samples simultaneously. It is a quick algorithm for model standardization (also called model transfer) in near infrared (NIR) spectroscopy. NIR data of corn samples, analyzed for protein content, are used to evaluate the proposed method. Piecewise direct standardization (PDS) is employed for model transfer, and SIMPLISMA-PDS-KPLS is compared with KS-PDS-KPLS in terms of the prediction accuracy of protein content and the computation speed of each algorithm. The results show that SIMPLISMA-KPLS can serve as an alternative sample selection method for model transfer. Although its accuracy is similar to that of Kennard–Stone (KS), it differs from KS in that it employs concentration information in the selection procedure. This ensures that analyte information is involved in the analysis and that the spectra (X) of the selected samples are correlated with the concentrations (y). It can also be used for outlier sample elimination simultaneously, through calibration validation. Running-time statistics show that the sample selection process is faster when using KPLS. The speed of SIMPLISMA-KPLS is beneficial for online measurement using NIR spectroscopy.

  • Article (No Access)

    Efficiency Improvement of Classification Model Based on Altered K-Means Using PCA and Outlier

    With the generation and analysis of Big Data following the development of various information devices, older data processing and management techniques reveal their hardware and software limitations. The hardware limitations can be mitigated by CPU and GPU advances, but addressing the software limitations still depends on hardware progress. This study therefore addresses the rising cost of analyzing dense Big Data from a software perspective rather than relying on hardware. An altered K-means algorithm using ideal points is proposed to address this cost. The proposed algorithm finds an optimal clustering by applying Principal Component Analysis (PCA) to the multi-dimensional structure of dense Big Data and categorizes the data using the predicted ideal points as the initial cluster centers. Its clustering validity index and F-measure results were compared with those of existing algorithms to verify its effectiveness, and it performed comparably. It was also compared with data classification techniques from previous studies and achieved an improvement of about 3–6% in analysis cost.
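    A rough sketch of seeding K-means with PCA-derived points follows. The abstract does not spell out how the "ideal points" are constructed, so as an illustrative assumption the initial centers are placed at evenly spaced quantiles along the first principal component, followed by standard Lloyd iterations.

```python
import numpy as np

def pca_init_kmeans(X, k, iters=50):
    # project onto the first principal component and seed the centers at
    # evenly spaced quantiles -- a stand-in for the paper's "ideal points"
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]
    qs = np.quantile(scores, np.linspace(0.1, 0.9, k))
    centers = np.array([X[np.argmin(np.abs(scores - q))] for q in qs], float)
    for _ in range(iters):  # standard Lloyd iterations
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```

    Because the seeds already lie in the dense regions found by PCA, Lloyd's algorithm typically converges in few iterations on well-separated data.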

  • Article (No Access)

    SVR+RVR: A ROBUST SPARSE KERNEL METHOD FOR REGRESSION

    Support vector machine (SVM) and relevance vector machine (RVM) are two state-of-the-art kernel learning methods, but both have drawbacks: although SVM is very robust against outliers, it makes unnecessarily liberal use of basis functions, since the number of support vectors required typically grows linearly with the size of the training set; the solution of RVM, on the other hand, is astonishingly sparse, but its performance deteriorates significantly when the observations are contaminated by outliers. In this paper, we present a combination of SVM and RVM for regression problems in which the two methods are concatenated: first, we train a support vector regression (SVR) machine on the full training set; then a relevance vector regression (RVR) machine is trained only on the subset consisting of the support vectors, with their target values replaced by the SVR predictions. This combination overcomes the drawbacks of both SVR and RVR. Experiments demonstrate that SVR+RVR is both very sparse and robust.
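    The two-stage cascade can be sketched as follows. Since scikit-learn ships no relevance vector implementation, an `ARDRegression` over RBF basis functions centred on the support vectors stands in for the RVR stage; this mirrors the RVM construction (sparse Bayesian linear regression over kernel features) but is an illustrative substitute, not the authors' implementation, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

def svr_plus_rvr(X, y, C=10.0, epsilon=0.1, gamma=1.0):
    # stage 1: robust fit on the full set; outliers have bounded influence
    svr = SVR(C=C, epsilon=epsilon, gamma=gamma).fit(X, y)
    sv = svr.support_                  # indices of the support vectors
    X_sv = X[sv]
    y_sv = svr.predict(X_sv)           # SVR predictions replace the targets
    # stage 2: sparse Bayesian fit over RBF basis functions centred on the
    # support vectors (ARDRegression as an RVM-style stand-in)
    Phi = rbf_kernel(X_sv, X_sv, gamma=gamma)
    rvr = ARDRegression().fit(Phi, y_sv)

    def predict(X_new):
        return rvr.predict(rbf_kernel(X_new, X_sv, gamma=gamma))
    return predict
```

    Training the second stage on SVR predictions, rather than the raw targets, is what shields the sparse model from the contaminated observations.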

  • Article (No Access)

    FUZZY ROBUST REGRESSION ANALYSIS BASED ON THE RANKING OF FUZZY SETS

    Since fuzzy linear regression was introduced by Tanaka et al., fuzzy regression analysis has been widely studied and applied in various areas. Diamond proposed the fuzzy least squares method to eliminate disadvantages of the Tanaka et al. method. In this paper, we propose a modified fuzzy least squares regression analysis for the case where the independent variables are crisp, the dependent variable is a fuzzy number, and outliers are present in the data set. In the proposed method, the residuals are ranked by comparing fuzzy sets, and the weight matrix is defined by the membership function of the residuals. To illustrate how the proposed method is applied, two examples are discussed and compared with methods from the literature. The numerical examples show that the proposed method yields good solutions.
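    The weighting idea, residuals mapped through a membership function so that outlying observations carry little weight, can be sketched for the crisp special case. The fuzzy ranking of residuals is replaced here by a Gaussian membership of crisp residuals scaled by the MAD; this is a simplification for illustration, not the paper's method.

```python
import numpy as np

def robust_weighted_ls(X, y, iters=5, c=2.0):
    # iteratively reweighted least squares with a membership-style weight:
    # residuals far from the bulk (scaled by the MAD) get weight near zero,
    # so outliers barely influence the fit
    A = np.column_stack([np.ones(len(X)), X])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    for _ in range(iters):
        r = y - A @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        w = np.exp(-(r / (c * s)) ** 2)      # Gaussian membership weight
        W = np.sqrt(w)[:, None]
        beta = np.linalg.lstsq(A * W, y * np.sqrt(w), rcond=None)[0]
    return beta
```

    On a line contaminated by one gross outlier, the weighted fit recovers the true coefficients while ordinary least squares does not.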

  • Article (No Access)

    Possibilistic Clustering Methods for Interval-Valued Data

    Outliers may arise from many anomalous causes, for example, credit card fraud, cyberintrusion, or the breakdown of a system. Several research areas and application domains have investigated this problem. The popular fuzzy c-means algorithm is sensitive to noise and outlying data, whereas possibilistic partitioning methods can address these and related problems. The goal of this paper is to introduce clustering algorithms for partitioning a set of symbolic interval-type data using the possibilistic approach. In addition, a new way of measuring the membership value according to each feature is proposed. Experiments with artificial and real symbolic interval-type data sets are used to evaluate the methods. The proposed methods outperform traditional soft clustering ones.
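    For orientation, here is the classical possibilistic c-means update of Krishnapuram and Keller on ordinary point data (the paper extends such methods to interval-valued data): the typicality of a point to a cluster depends only on its distance to that cluster's centre, so far-away outliers receive low typicality everywhere. The per-iteration bandwidth estimate below is a simplification; PCM usually estimates it from a preliminary fuzzy c-means run.

```python
import numpy as np

def possibilistic_cmeans(X, k, m=2.0, iters=100, seed=0):
    # typicality U[i, j] of point i to cluster j is decoupled across
    # clusters, which is what makes the method tolerant to outliers
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((X[:, None] - centers[None]) ** 2).sum(-1) + 1e-12
        eta = d2.mean(0)          # simplified per-cluster bandwidth
        U = 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1)))
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(0)[:, None]
    return centers, U
```

    Unlike fuzzy c-means, the typicalities of a point need not sum to one across clusters, so a distant outlier can be atypical of every cluster at once.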

  • Article (No Access)

    Robust Enhanced Ridge-Type Estimation for the Poisson Regression Models: Application to English League Football Data

    This study delves into the intricacies of addressing multicollinearity and outliers in Poisson regression modeling. Our examination encompasses a comprehensive survey of well-established methodologies, such as the ridge estimator and the MT-estimator, as well as recent innovations like the enhanced ridge-type estimator. These approaches are combined to develop a robust estimation method for Poisson regression. Through a series of numerical studies, including simulations and a meticulous analysis of English League Football data, we empirically validate the efficacy of these methods. This research contributes to the evolving toolkit for effectively handling count data in the dynamic field of statistics.
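    A ridge-penalised Poisson fit can be sketched via iteratively reweighted least squares. This shows only the multicollinearity-handling ridge part; the MT-type robustness weighting and the enhanced ridge-type estimator of the paper are not reproduced here.

```python
import numpy as np

def poisson_ridge_irls(X, y, lam=1.0, iters=50):
    # ridge-penalised Poisson regression fitted by iteratively reweighted
    # least squares (IRLS); `lam` shrinks the coefficients, which is what
    # tames multicollinearity
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        eta = np.clip(X @ beta, -20, 20)
        mu = np.exp(eta)                            # Poisson mean
        z = eta + (y - mu) / np.maximum(mu, 1e-8)   # working response
        A = X.T @ (mu[:, None] * X) + lam * np.eye(p)
        beta_new = np.linalg.solve(A, X.T @ (mu * z))
        if np.max(np.abs(beta_new - beta)) < 1e-8:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

    A robust variant would additionally downweight observations with large Pearson residuals inside each IRLS step.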

  • Article (No Access)

    FILTERING AND DENOISING IN LINEAR REGRESSION ANALYSIS

    In this paper we examine the effect of outliers/leverage points on accuracy measures in linear regression models. We use the coefficient of determination, a measure of model adequacy, to compare the effect of the filtering approach on the least squares estimates. We also compare the performance of the filter-based approach with several resistant methods in situations where there are several outliers in the data sets. Specifically, we examine the sensitivity of the resistant methods and the proposed approach when there are several leverage points in the data sets. To better understand the effect of filtering and to evaluate the performance of the proposed approach, we consider real data and simulation studies with several sample sizes, different percentages of outliers, and various noise levels.

  • Article (No Access)

    A Preliminary Investigation into the Effect of Outlier(s) on Singular Spectrum Analysis

    The aim of this paper is to study the effect of outliers on different parts of singular spectrum analysis (SSA) from both theoretical and practical points of view. The rank of the trajectory matrix, the magnitude of the eigenvalues, and the reconstruction and forecasting results are evaluated using simulated and real data sets. The performance of both recurrent and vector forecasting procedures is assessed in the presence of outliers. We find that the existence of outliers affects the rank of the trajectory matrix and increases the linear recurrent dimension, whilst also having a significant impact on SSA reconstruction and forecasting. There is also evidence to suggest that, in the presence of outliers, vector SSA forecasts are more robust than recurrent SSA forecasts. These results indicate that identifying and removing outliers is essential for optimal SSA decomposition and forecasting results.
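    Basic SSA decomposition and reconstruction, the object whose sensitivity to outliers the paper studies, can be sketched as follows: embed the series into a trajectory (Hankel) matrix, take its SVD, keep a group of eigentriples, and average the anti-diagonals back into a series.

```python
import numpy as np

def ssa_reconstruct(x, L, groups):
    # embed the series into an L x K trajectory (Hankel) matrix, take its
    # SVD, and reconstruct from the selected eigentriples by averaging
    # the anti-diagonals back into a series
    N = len(x)
    K = N - L + 1
    X = np.column_stack([x[i:i + L] for i in range(K)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xr = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in groups)
    rec = np.zeros(N)
    cnt = np.zeros(N)
    for i in range(L):
        for j in range(K):
            rec[i + j] += Xr[i, j]
            cnt[i + j] += 1
    return rec / cnt
```

    A pure linear trend gives a rank-2 trajectory matrix, so the first two eigentriples reconstruct it exactly; a single outlying spike raises that rank, which is one of the effects the paper quantifies.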

  • Article (No Access)

    An Approach for Data Labelling and Concept Drift Detection Based on Entropy Model in Rough Sets for Clustering Categorical Data

    Clustering is an important technique in data mining, but clustering a large data set is difficult and time consuming. An approach called data labelling has been suggested for clustering large databases, using a sampling technique to improve the efficiency of clustering. A data sample is selected randomly for initial clustering, and the remaining unclustered data points are then either given a cluster label or marked as outliers using various data labelling techniques. Data labelling is an easy task in the numerical domain because it is performed based on the distance between a cluster and an unlabelled data point. In the categorical domain, however, distances between data points, or between a data point and a cluster, are not properly defined, so data labelling is a difficult task for categorical data. This paper proposes a method for data labelling using an entropy model in rough sets for categorical data. The concept of entropy, introduced by Shannon with particular reference to information theory, is a powerful mechanism for measuring uncertain information. In this method, data labelling is performed by integrating entropy with rough sets. The method is also applied to drift detection, to establish whether concept drift has occurred when clustering categorical data. Cluster purity is also discussed, using rough entropy for data labelling and outlier detection. Experimental results show that the efficiency and clustering quality of this algorithm are better than those of previous algorithms.
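    The entropy-based labelling idea can be sketched for plain categorical tuples: a cluster's entropy is the sum of its per-attribute Shannon entropies, and an unlabelled point goes to the cluster whose entropy grows least, or is flagged as an outlier if even the best assignment raises entropy too much. The threshold and details below are illustrative choices, not the paper's rough-set formulation.

```python
import math
from collections import Counter

def cluster_entropy(rows):
    # sum of per-attribute Shannon entropies of the value frequencies
    h = 0.0
    for col in zip(*rows):
        n = len(col)
        for c in Counter(col).values():
            p = c / n
            h -= p * math.log2(p)
    return h

def label_point(clusters, point, outlier_gain=1.0):
    # assign `point` to the cluster whose entropy grows least; if even the
    # best assignment raises entropy above `outlier_gain`, flag it as an
    # outlier (the threshold is illustrative, not from the paper)
    gains = [cluster_entropy(c + [point]) - cluster_entropy(c)
             for c in clusters]
    best = min(range(len(clusters)), key=gains.__getitem__)
    if gains[best] > outlier_gain:
        return ('outlier', None)
    return ('cluster', best)
```

    A point matching a cluster's dominant attribute values can even lower that cluster's entropy, while a point matching no cluster raises entropy everywhere and is flagged.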

  • Article (No Access)

    Forecasting energy data using Singular Spectrum Analysis in the presence of outlier(s)

    The aim of this paper is to present a comparative study of the performance of the two forecasting approaches of singular spectrum analysis (SSA) in the presence of outliers. We examine this issue from different points of view. As our real data set, we consider the well-known WTI Spot Price series. The effect of outliers in different parts of a time series on the forecasting process is evaluated. Based on this study, we find evidence suggesting that the existence of outliers affects SSA reconstruction and forecasting results, and that vector SSA (VSSA) forecasting performs better than recurrent SSA (RSSA) in terms of the accuracy and robustness of the forecasts.