  • OCEAN: A Non-Conventional Parameter Free Clustering Algorithm Using Relative Densities of Categories

    In this paper, we propose a fully autonomous density-based clustering algorithm named ‘Ocean’, inspired by the oceanic landscape and the phenomena that occur in it. Ocean improves on conventional algorithms in both its distance metric and its clustering mechanism. Ocean defines the distance between two categories as the difference in their relative densities. Unlike existing approaches, Ocean neither assigns the same distance to all pairs of categories nor assigns arbitrary weights to matches and mismatches between categories, which can lead to clustering errors. Ocean uses density ratios of adjacent regions in multidimensional space to detect the edges of clusters, and it is robust against clusters of identical patterns. Unlike conventional approaches, Ocean makes no assumption about the data distribution within clusters and requires no tuning of free parameters. Empirical evaluations demonstrate improved performance of Ocean over existing approaches.
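    As a rough illustration of the distance idea in this abstract (a minimal sketch, not the published Ocean algorithm), the Python snippet below treats the relative density of a category as its frequency within a single attribute and measures the distance between two categories as the absolute difference of those densities. The attribute values and helper names are invented for the example.

        from collections import Counter

        def relative_densities(values):
            # Relative density (frequency) of each category within one attribute.
            counts = Counter(values)
            total = len(values)
            return {cat: n / total for cat, n in counts.items()}

        def category_distance(cat_a, cat_b, densities):
            # Distance between two categories: absolute difference of their relative densities.
            return abs(densities[cat_a] - densities[cat_b])

        colour = ["red", "red", "red", "blue", "blue", "green"]  # toy categorical attribute
        dens = relative_densities(colour)
        print(dens)                                     # {'red': 0.5, 'blue': 0.33..., 'green': 0.16...}
        print(category_distance("red", "green", dens))  # 0.333...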

  • GEOMETRIC ALGORITHMS FOR DENSITY-BASED DATA CLUSTERING

    Data clustering is a fundamental problem arising in many practical applications. In this paper, we present new geometric approximation and exact algorithms for the density-based data clustering problem in d-dimensional space ℝ^d (for any constant integer d ≥ 2). Previously known algorithms for this problem are efficient only when the specified range around each input point, called the δ-neighborhood, contains on average a constant number of input points. Different distributions of the input data points have significant impact on the efficiency of these algorithms. In the worst case, when the data points are highly clustered, these algorithms run in quadratic time, although such situations might not occur very frequently on real data. By using computational geometry techniques, we develop faster approximation and exact algorithms for the density-based data clustering problem in ℝ^d. In particular, our approximation algorithm based on the ∊-fuzzy distance function takes O(n log n) time for any given fixed value ∊ > 0, and our exact algorithms take sub-quadratic time. The running times and output quality of our algorithms do not depend on any particular data distribution. We believe that our fast approximation algorithm is of considerable practical importance, while our sub-quadratic exact algorithms are more of theoretical interest. We implemented our approximation algorithm and the experimental results show that our approximation algorithm is efficient on arbitrary input point sets.
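    For context on the δ-neighborhood notion used above, the sketch below (an illustration only, not the paper's algorithm) shows the naive quadratic neighborhood count whose worst-case cost the geometric techniques are designed to avoid; the point set and δ value are made up.

        import math

        def delta_neighborhood_sizes(points, delta):
            # Naive O(n^2) count of neighbours within distance delta of each point;
            # this is the baseline behaviour that faster geometric algorithms improve on.
            sizes = []
            for p in points:
                count = sum(1 for q in points if q is not p and math.dist(p, q) <= delta)
                sizes.append(count)
            return sizes

        pts = [(0.0, 0.0), (0.1, 0.0), (0.05, 0.05), (2.0, 2.0)]
        print(delta_neighborhood_sizes(pts, delta=0.2))  # [2, 2, 2, 0]: a dense trio and an isolated point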

  • Faster DBSCAN and HDBSCAN in Low-Dimensional Euclidean Spaces

    We present a new algorithm for the widely used density-based clustering method dbscan. For a set of n points in the plane ℝ^2, our algorithm computes the dbscan-clustering in O(n log n) time, irrespective of the scale parameter 𝜀 (and assuming the second parameter MinPts is set to a fixed constant, as is the case in practice). Experiments show that the new algorithm is not only fast in theory, but that a slightly simplified version is competitive in practice and much less sensitive to the choice of 𝜀 than the original dbscan algorithm. We also present an O(n log n) randomized algorithm for hdbscan in the plane (hdbscan is a recently introduced hierarchical version of dbscan), and we show how to compute an approximate version of hdbscan in near-linear time in any fixed dimension.
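    As background for the 𝜀 and MinPts parameters discussed above, here is a minimal, textbook-style O(n^2) dbscan sketch in Python; the paper's contribution is precisely to avoid this quadratic behaviour, so the code below is not the authors' algorithm, and the toy point set and parameter values are invented.

        import math

        def dbscan(points, eps, min_pts):
            # Textbook O(n^2) DBSCAN: a point is a core point if its eps-neighbourhood
            # (including itself) holds at least min_pts points; clusters grow from core points.
            UNVISITED, NOISE = -2, -1
            labels = [UNVISITED] * len(points)

            def region(i):
                return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

            cluster = 0
            for i in range(len(points)):
                if labels[i] != UNVISITED:
                    continue
                neighbours = region(i)
                if len(neighbours) < min_pts:
                    labels[i] = NOISE
                    continue
                labels[i] = cluster
                queue = list(neighbours)
                while queue:
                    j = queue.pop()
                    if labels[j] == NOISE:        # noise reachable from a core point becomes a border point
                        labels[j] = cluster
                    if labels[j] != UNVISITED:
                        continue
                    labels[j] = cluster
                    nb = region(j)
                    if len(nb) >= min_pts:        # only core points expand the cluster further
                        queue.extend(nb)
                cluster += 1
            return labels

        pts = [(0, 0), (0.1, 0.1), (0.2, 0), (5, 5), (5.1, 5.0), (9, 9)]
        print(dbscan(pts, eps=0.5, min_pts=2))    # [0, 0, 0, 1, 1, -1]: two clusters and one noise point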

  • Effectiveness of Heuristic Based Approach on the Performance of Indexing and Clustering of High Dimensional Data

    Data in practical applications (e.g., images, molecular biology, etc.) is mostly characterised by high dimensionality and a huge number of data instances. Although feature reduction techniques have been successful in reducing the dimensionality for certain applications, dealing with high dimensional data is still an area that receives considerable attention in the research community. Indexing and clustering of high dimensional data are two of the most challenging techniques with a wide range of applications; however, they suffer from performance issues as the dimensionality and size of the processed data increase. To tackle this problem, this paper demonstrates a general optimisation technique applicable to indexing and clustering algorithms that need to calculate distances and check them against some minimum distance condition. The optimisation technique is a simple calculation that finds the minimum possible (lower-bound) distance between two points, reusing already computed values, and checks it against the minimum distance condition, thereby reducing the need to repeatedly compute the more complicated full distance function. The effectiveness and usefulness of the proposed optimisation technique have been demonstrated by applying it, with successful results, to clustering and indexing techniques. We utilised a number of clustering techniques, including the agglomerative hierarchical clustering, k-means clustering, and DBSCAN algorithms. Runtime for all three algorithms was reduced with this optimisation, and the clusters they returned were verified to remain the same as those of the original algorithms. The optimisation technique also shows potential for reducing runtime by a substantial amount when indexing large databases using the NAQ-tree, and for continuing to do so as databases grow in both dimensionality and size.
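    One plausible way to realise the kind of check described above is a cheap lower bound on the Euclidean distance that is compared against the threshold before the full distance is computed; the paper's exact calculation may differ, so the Python sketch below is only an illustration with made-up points and threshold.

        import math

        def within_threshold(p, q, threshold):
            # Cheap lower bound first: the difference along any single coordinate can never
            # exceed the Euclidean distance, so a pair can often be rejected without
            # summing over all dimensions.
            for a, b in zip(p, q):
                if abs(a - b) > threshold:
                    return False                       # pruned by the lower bound
            return math.dist(p, q) <= threshold        # full distance only when needed

        print(within_threshold((0.0, 0.0, 0.0), (10.0, 0.1, 0.2), 1.0))   # False, rejected on the first coordinate
        print(within_threshold((0.0, 0.0, 0.0), (0.1, 0.2, 0.3), 1.0))    # True, full distance computed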

  • A CLUSTERING-BASED NICHING FRAMEWORK FOR THE APPROXIMATION OF EQUIVALENT PARETO-SUBSETS

    In many practical optimization problems, multiple objectives have to be optimized at the same time. Some multi-objective problems are characterized by multiple connected Pareto-sets located at different parts of decision space, also called equivalent Pareto-subsets. We assume that the practitioner wants to approximate all Pareto-subsets in order to choose among various solutions with different characteristics. In this work, we propose a clustering-based niching framework for multi-objective population-based approaches that makes it possible to approximate equivalent Pareto-subsets. Iteratively, the clustering process assigns the population to niches, and the multi-objective optimization process concentrates on each niche independently. Two exemplary hybridizations, rake selection with DBSCAN and SMS-EMOA with kernel density clustering, demonstrate that the niching framework maintains enough diversity to detect and approximate equivalent Pareto-subsets.
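    The iterative loop described above can be sketched in a few lines of Python; the cluster and optimize_niche callables below are placeholders (for example, DBSCAN in decision space and one generation of SMS-EMOA restricted to a niche), and the toy population and clustering rule are invented for the example.

        def niching_step(population, cluster, optimize_niche):
            # One iteration of the clustering-based niching loop: partition the population
            # into niches, then let the multi-objective optimizer work on each niche
            # independently before merging the results back into one population.
            next_population = []
            for niche in cluster(population):
                next_population.extend(optimize_niche(niche))
            return next_population

        # Toy demo: "cluster" by the sign of the first decision variable and
        # "optimize" by returning each niche unchanged.
        pop = [(-1.0, 0.2), (-0.8, 0.1), (0.9, 0.3), (1.1, 0.4)]
        step = niching_step(
            pop,
            cluster=lambda P: [[x for x in P if x[0] < 0], [x for x in P if x[0] >= 0]],
            optimize_niche=lambda niche: list(niche),
        )
        print(step)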

  • The Modeling and Simulation of Data Clustering Algorithms in Data Mining with Big Data

    Big Data is a popular, cutting-edge technology, and its techniques and algorithms are spreading into areas including engineering, biomedicine, and business. Due to the high volume and complexity of Big Data, it is necessary to apply data pre-processing methods before data mining. These pre-processing methods include data cleaning, data integration, data reduction, and data transformation. Data clustering is the most important step of data reduction: with data clustering, mining on the reduced data set can be more efficient while still producing quality analytical results. This paper presents the different data clustering methods and related algorithms for data mining with Big Data. Data clustering can increase the efficiency and accuracy of data mining.
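    As a minimal illustration of clustering used as a data-reduction step (assuming NumPy and scikit-learn are available; the data and cluster count are made up), the sketch below replaces a raw point set with a handful of cluster centroids on which later mining would run.

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        raw = np.vstack([rng.normal(0.0, 0.2, size=(500, 2)),
                         rng.normal(3.0, 0.2, size=(500, 2))])   # 1000 raw records in two groups

        # Reduce 1000 records to 2 representatives (the cluster centroids).
        reduced = KMeans(n_clusters=2, n_init=10, random_state=0).fit(raw).cluster_centers_
        print(reduced.shape)   # (2, 2)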