  • Article (No Access)

    k-Means Clustering with Optimal Centroid: An Optimization Insisted Model for Removing Outliers

    In data cleaning, detecting and correcting corrupt, inaccurate or irrelevant records in a record set is a tedious task. In particular, outlier detection plays a significant role in data cleaning, removing the outliers that exist in the data. Traditionally, much effort has gone into removing outliers, and one of the promising approaches is customizing clustering models. Accordingly, this paper proposes a new outlier detection model via enhanced k-means with outlier removal (E-KMOR), which naturally assigns all outliers to a separate group during the clustering process. To decide which points are outliers, a new intra-cluster distance evaluation is employed. The main contribution of this paper is the optimal selection of cluster centroids through a newly proposed hybrid optimization algorithm termed the particle updated lion algorithm (PU-LA), which hybridizes the lion algorithm (LA) with particle swarm optimization (PSO). The proposed work is therefore named E-KMOR-PU-LA. Finally, the efficacy of the proposed E-KMOR-PU-LA model is demonstrated through a comparative analysis against conventional models in terms of runtime and accuracy.
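
    As a rough illustration of the core idea, the following Python sketch implements KMOR-style clustering: points whose distance to every centroid exceeds a multiple of the average nearest-centroid distance are diverted into a separate outlier group. The threshold rule and the plain k-means centroid update are illustrative stand-ins; the paper's intra-cluster criterion and PU-LA centroid optimization are not reproduced here.

    ```python
    import numpy as np

    def kmor(X, k, gamma=3.0, n_iter=50, seed=0):
        """Illustrative KMOR-style clustering: points farther from every
        centroid than gamma times the mean nearest-centroid distance are
        assigned to a separate outlier group (label -1)."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iter):
            # Distance from every point to every centroid: shape (n, k).
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            nearest, d_min = d.argmin(axis=1), d.min(axis=1)
            # Intra-cluster distance threshold separating inliers from outliers.
            labels = np.where(d_min > gamma * d_min.mean(), -1, nearest)
            # Update each centroid from its inlier members only.
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    centroids[j] = members.mean(axis=0)
        return labels, centroids

    # Toy usage: two Gaussian blobs plus a few far-away outliers.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)),
                   rng.normal(8, 1, (50, 2)),
                   rng.uniform(-40, 40, (5, 2))])
    labels, _ = kmor(X, k=2)
    print("points flagged as outliers:", int((labels == -1).sum()))
    ```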

  • Article (No Access)

    Detection and Correction of Abnormal Data with Optimized Dirty Data: A New Data Cleaning Model

    Every business enterprise requires noise-free, clean data. Dirty data tends to accumulate as a data warehouse continuously loads and refreshes large quantities of data from various sources. Hence, to avoid wrong conclusions, data cleaning becomes vital in data-related projects. This paper introduces a novel data cleaning technique for the effective removal of dirty data. The process involves two stages: (i) dirty data detection and (ii) dirty data cleaning. Dirty data detection comprises the following steps: data normalization, hashing, clustering, and finding the suspected data. In the clustering step, the optimal selection of centroids is crucial and is carried out using an optimization algorithm. Once dirty data detection is complete, the dirty data cleaning stage begins. The cleaning stage likewise comprises several steps, namely the leveling process, Huffman coding, and cleaning of the suspected data, with the cleaning of suspected data again driven by optimization. To solve all of these optimization problems, a new hybrid algorithm called the Firefly Update Enabled Rider Optimization Algorithm (FU-ROA) is proposed, which hybridizes the Rider Optimization Algorithm (ROA) and the Firefly (FF) algorithm. Finally, the performance of the implemented data cleaning method is analyzed against traditional methods such as Particle Swarm Optimization (PSO), FF, Grey Wolf Optimizer (GWO), and ROA in terms of positive and negative measures. The results show that at iteration 12, the proposed FU-ROA model for test case 1 was 0.013%, 0.7%, 0.64%, and 0.29% better than the extant PSO, FF, GWO, and ROA models, respectively.
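
    As a hedged sketch of the detection stage only (normalization, hashing, and flagging suspects), the Python below substitutes a coarse rounding-hash for the paper's clustering step and omits the FU-ROA centroid optimization entirely; all function names and thresholds are illustrative.

    ```python
    import hashlib
    from collections import Counter

    import numpy as np

    def normalize(X):
        # Min-max scale each numeric column into [0, 1].
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

    def bucket_key(row, decimals=1):
        # Hash a coarsely rounded record so similar records share a bucket.
        key = ",".join(f"{v:.{decimals}f}" for v in row)
        return hashlib.md5(key.encode()).hexdigest()

    def find_suspects(X, min_bucket=2):
        """Flag records that land in rare hash buckets as suspected dirty data."""
        Xn = normalize(np.asarray(X, dtype=float))
        keys = [bucket_key(row) for row in Xn]
        counts = Counter(keys)
        return np.array([counts[key] < min_bucket for key in keys])

    # Usage: twenty consistent records and one corrupted row.
    data = np.array([[0.5, 0.5]] * 20 + [[99.0, -3.0]])
    print(find_suspects(data))  # only the last entry is True
    ```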

  • Chapter (Free Access)

    A FRAMEWORK FOR DETERMINING OUTLYING MICROARRAY EXPERIMENTS

    Microarrays are high-throughput technologies whose data are known to be noisy. In this work, we propose a graph-based method that first identifies the extent to which a single microarray experiment is noisy and then applies an error function to clean individual expression levels. These two steps are unified within a framework based on a graph representation of a separate data set from a repository. We demonstrate the utility of our method by applying both our technique and standard statistical methods to simulated microarray data and comparing the results. Our results are encouraging and indicate one potential use of microarray data from past experiments.
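
    The following Python sketch captures the two-step shape of the approach under simplifying assumptions: a per-experiment noise score computed against a reference set of past experiments, followed by an error function that shrinks extreme expression levels toward the reference mean. The z-score-based scoring here is a generic stand-in for the chapter's graph-based score, which is not reproduced.

    ```python
    import numpy as np

    def noise_score(experiment, reference):
        """Step 1: estimate how noisy one experiment is, as the mean
        squared z-score of its expression levels against a reference set
        (a simple stand-in for the graph-based score)."""
        mu = reference.mean(axis=0)
        sigma = reference.std(axis=0) + 1e-9
        return float((((experiment - mu) / sigma) ** 2).mean())

    def clean(experiment, reference, score, z_cut=3.0):
        """Step 2: apply an error function -- shrink extreme expression
        levels toward the reference mean, more aggressively the noisier
        the experiment appears overall."""
        mu = reference.mean(axis=0)
        sigma = reference.std(axis=0) + 1e-9
        z = (experiment - mu) / sigma
        w = score / (score + 1.0)  # shrinkage weight grows with the noise score
        return np.where(np.abs(z) > z_cut,
                        (1 - w) * experiment + w * mu,
                        experiment)

    rng = np.random.default_rng(0)
    reference = rng.normal(10, 1, (30, 100))   # 30 past experiments, 100 genes
    noisy = rng.normal(10, 1, 100)
    noisy[[3, 7]] = [40.0, -20.0]              # two corrupted expression levels
    s = noise_score(noisy, reference)
    print(f"noise score: {s:.2f}")
    print("corrected genes:", np.flatnonzero(clean(noisy, reference, s) != noisy))
    ```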

  • Chapter (No Access)

    STUDY ON DATA-ORIENTED IT AUDIT USED IN CHINA

    The application of information technologies (IT) to auditing is worth studying. This paper examines data-oriented IT audit as practiced in China. First, the state of IT audit in China and overseas is analyzed. Then, the main steps of data-oriented IT audit as used in China are presented: data acquisition, data cleaning, and data processing. Data processing methods based on outlier detection and on visualization are proposed. The popular data-oriented IT audit software packages developed in China are then surveyed. Finally, the use of data-oriented IT audit in Chinese online auditing is introduced, and the differences in data acquisition and data processing between online audit and single-computer audit are analyzed.
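
    As a small illustration of the outlier-detection step in audit data processing, the sketch below flags transaction amounts outside Tukey's interquartile-range fence, a standard rule in audit analytics. The abstract does not specify which detection methods the surveyed software actually uses, so this rule is an assumption.

    ```python
    import numpy as np

    def iqr_outliers(amounts, k=1.5):
        """Flag audit records outside Tukey's IQR fence -- a common
        outlier-detection step in data-oriented audit processing."""
        q1, q3 = np.percentile(amounts, [25, 75])
        iqr = q3 - q1
        lo, hi = q1 - k * iqr, q3 + k * iqr
        return (amounts < lo) | (amounts > hi)

    # Usage: routine transaction amounts plus one suspicious entry.
    amounts = np.array([120.0, 98.5, 110.0, 132.0, 101.5, 125.0, 9800.0])
    print(iqr_outliers(amounts))  # only the last entry is flagged
    ```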

  • Chapter (No Access)

    A NOVEL ONTOLOGY TOOL FOR DATA CLEANING

    Researchers have developed data extractors that can effectively extract data from web sources, tabulate it, and use it for further processing. However, not all data are correctly extracted: an extractor may miss certain valuable information or capture additional, unnecessary information. In the case of unnecessary information, researchers apply a cleaning method to remove it so that the extracted data are free of errors. Removing such data is important, as unnecessary information may affect the accuracy of subsequent extraction tools and eventually prevent them from performing their tasks efficiently. In this research proposal, we develop a data cleaning tool that uses ontologies to clean data based on its semantics. Experimental results show that our tool is highly efficient in data cleaning and outperforms existing state-of-the-art tools.
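
    A minimal sketch of semantics-driven cleaning, assuming the ontology can be reduced to per-field type and validity constraints (the chapter's actual ontology tooling is not described in the abstract): extracted records that violate the constraints are flagged for review rather than silently kept. The fields and predicates below are illustrative.

    ```python
    # A tiny "ontology" mapping each field to its expected type and a
    # validity predicate. These constraints are illustrative only.
    ontology = {
        "title": {"type": str,   "valid": lambda v: len(v.strip()) > 0},
        "year":  {"type": int,   "valid": lambda v: 1900 <= v <= 2100},
        "price": {"type": float, "valid": lambda v: v >= 0},
    }

    def clean_records(records, ontology):
        """Keep only records whose fields satisfy the ontology's type
        and semantic constraints; flag the rest for review."""
        clean, dirty = [], []
        for rec in records:
            ok = all(
                field in rec
                and isinstance(rec[field], spec["type"])
                and spec["valid"](rec[field])
                for field, spec in ontology.items()
            )
            (clean if ok else dirty).append(rec)
        return clean, dirty

    records = [
        {"title": "Data Cleaning", "year": 2003,  "price": 59.0},
        {"title": "",              "year": 2003,  "price": 59.0},  # empty title
        {"title": "Web Tables",    "year": 20030, "price": 59.0},  # bad year
    ]
    clean, dirty = clean_records(records, ontology)
    print(len(clean), "clean,", len(dirty), "flagged")
    ```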

  • Chapter (No Access)

    DeepDetect: An Extensible System for Detecting Attribute Outliers & Duplicates in XML

    XML, the eXtensible Markup Language, is fast evolving into the new standard for data representation and exchange on the WWW. This has resulted in a growing number of data cleaning techniques to locate "dirty" data (artifacts). In this paper, we present DEEPDETECT – an extensible system that detects attribute outliers and duplicates in XML documents. Attribute outlier detection finds objects that contain deviating values with respect to a relevant group of objects. This entails utilizing the correlation among element values in a given XML document. Duplicate detection in XML requires the identification of subtrees that correspond to real world objects. Our system architecture enables sharing of common operations that prepare XML data for the various artifact detection techniques. DEEPDETECT also provides an intuitive visual interface for the user to specify various parameters for preprocessing and detection, as well as to view results.
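
    To illustrate one of the two detection tasks, the sketch below finds duplicate candidates by hashing an order-insensitive canonical form of each subtree, so subtrees describing the same real-world object collide in the same bucket. This is a generic technique, not DEEPDETECT's actual architecture, and the attribute-outlier detector is omitted.

    ```python
    import hashlib
    import xml.etree.ElementTree as ET

    def canonical(elem):
        """Serialize a subtree into an order-insensitive canonical string
        so equivalent subtrees hash identically."""
        children = sorted(canonical(c) for c in elem)
        text = (elem.text or "").strip().lower()
        return f"{elem.tag}({text};{','.join(children)})"

    def find_duplicates(root, tag):
        """Group subtrees with the given tag by canonical hash; any bucket
        with more than one member is a duplicate candidate."""
        buckets = {}
        for node in root.iter(tag):
            h = hashlib.md5(canonical(node).encode()).hexdigest()
            buckets.setdefault(h, []).append(node)
        return [nodes for nodes in buckets.values() if len(nodes) > 1]

    xml = """<library>
      <book><title>Data Cleaning</title><year>2003</year></book>
      <book><year>2003</year><title>Data Cleaning</title></book>
      <book><title>XML Basics</title><year>2001</year></book>
    </library>"""
    root = ET.fromstring(xml)
    print(len(find_duplicates(root, "book")), "duplicate group(s) found")
    ```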