[Keyword: Data Quality] AND [All Categories: Innovation / Technology / Knowledge / Information Management] : Search


Article
Evaluating the Impact of Data Quality on Sampling
Journal of Information & Knowledge Management, 01 Sep 2011
Learning from imbalanced training data can be a difficult endeavour, and the task is made even more challenging if the data is of low quality or the size of the training dataset is small. Data sampling is a commonly used method for improving learner performance when data is imbalanced. However, little effort has been put forth to investigate the performance of data sampling techniques when data is both noisy and imbalanced. In this work, we present a comprehensive empirical investigation of the impact of changes in four training dataset characteristics (dataset size, class distribution, noise level and noise distribution) on data sampling techniques. We present the performance of four common data sampling techniques using 11 learning algorithms. The results, which are based on an extensive suite of experiments for which over 15 million models were trained and evaluated, show that: (1) even for relatively clean datasets, class imbalance can still hurt learner performance, (2) data sampling, however, may not improve performance for relatively clean but imbalanced datasets, (3) data sampling can be very effective at dealing with the combined problems of noise and imbalance, (4) both the level and distribution of class noise among the classes are important, as either factor alone does not cause a significant impact, (5) when sampling does improve the learners (i.e. for noisy and imbalanced datasets), RUS and SMOTE are the most effective at improving the AUC, while SMOTE performed well relative to the F-measure, (6) there are significant differences in the empirical results depending on the performance measure used, and hence it is important to consider multiple metrics in this type of analysis, and (7) data sampling rarely hurt the AUC, but only significantly improved performance when data was at least moderately skewed or noisy, while for the F-measure, data sampling often resulted in significantly worse performance when applied to slightly skewed or noisy datasets, but did improve performance when data was either severely noisy or skewed, or contained moderate levels of both noise and imbalance.
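As an illustration of one of the sampling techniques named above: RUS commonly stands for random undersampling, which discards majority-class examples at random until the classes are balanced. The sketch below is a minimal standard-library version, not the authors' experimental code; the function name `random_undersample`, the fixed seed, and the toy data are all assumptions made for this example.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop examples from the larger classes until every class
    has as many examples as the smallest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    # Size of the smallest class sets the target size for every class.
    target = min(len(xs) for xs in by_class.values())

    out_x, out_y = [], []
    for y, xs in by_class.items():
        kept = rng.sample(xs, target)  # sample without replacement
        out_x.extend(kept)
        out_y.extend([y] * target)
    return out_x, out_y


# Hypothetical toy data: 8 negative examples and 2 positive examples.
X = list(range(10))
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_undersample(X, y)
```

After the call, both classes contain two examples each. SMOTE, by contrast, adds synthetic minority examples instead of removing majority ones, and is not shown here.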
Chapter
DeepDetect: An Extensible System for Detecting Attribute Outliers & Duplicates in XML
- Qiangfeng Peter Lau,
- Wynne Hsu,
- Judice L. Y. Koh, and
- Mong Li Lee
Data Quality and High-Dimensional Data Analysis, 01 Feb 2009
XML, the eXtensible Markup Language, is fast evolving into the new standard for data representation and exchange on the WWW. This has resulted in a growing number of data cleaning techniques to locate "dirty" data (artifacts). In this paper, we present DEEPDETECT – an extensible system that detects attribute outliers and duplicates in XML documents. Attribute outlier detection finds objects that contain deviating values with respect to a relevant group of objects. This entails utilizing the correlation among element values in a given XML document. Duplicate detection in XML requires the identification of subtrees that correspond to real world objects. Our system architecture enables sharing of common operations that prepare XML data for the various artifact detection techniques. DEEPDETECT also provides an intuitive visual interface for the user to specify various parameters for preprocessing and detection, as well as to view results.
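The idea of an attribute outlier, a value that deviates with respect to a relevant group of objects, can be sketched as follows. This is not DEEPDETECT's algorithm, which the abstract does not specify; it is a generic robust-statistics illustration using the median and the median absolute deviation (MAD) within each group. The function name, the 3.5 threshold, and the book records are assumptions for this example.

```python
from collections import defaultdict
from statistics import median


def attribute_outliers(records, group_key, attr, threshold=3.5):
    """Flag records whose `attr` value deviates strongly from the other
    records that share the same `group_key` value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    flagged = []
    for members in groups.values():
        values = [m[attr] for m in members]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad == 0:
            continue  # no spread in this group, so nothing to compare against
        for m in members:
            # Modified z-score; 0.6745 rescales the MAD to be comparable
            # with a standard deviation under a normal distribution.
            score = 0.6745 * abs(m[attr] - med) / mad
            if score > threshold:
                flagged.append(m)
    return flagged


# Hypothetical records, as might be extracted from XML <book> elements.
books = [
    {"category": "fiction", "price": 10},
    {"category": "fiction", "price": 12},
    {"category": "fiction", "price": 11},
    {"category": "fiction", "price": 13},
    {"category": "fiction", "price": 95},
    {"category": "reference", "price": 80},
    {"category": "reference", "price": 85},
    {"category": "reference", "price": 90},
    {"category": "reference", "price": 95},
]
outliers = attribute_outliers(books, "category", "price")
```

Only the fiction book priced at 95 is flagged. The same price in the reference group is unremarkable, which is why the choice of comparison group matters.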
Chapter
CHARIOT: A Comprehensive Data Integration and Quality Assurance Model for Agro-Meteorological Data
- Mark Anthony F. Mateo and
- Carson Kai-Sang Leung
Data Quality and High-Dimensional Data Analysis, 01 Feb 2009
In this paper, we propose a comprehensive data integration and quality assurance model, called CHARIOT, for agro-meteorological data. This model comprises two modules: an intermediary module and a data quality control module. The intermediary provides users with reliable and continuous access to heterogeneous weather databases from various sources; it also solves various compatibility issues in meteorological time series data. The data quality control tool consists of a multi-layer system spanning internal, temporal and spatial data checks. Together, these two modules provide users with clean and error-free inputs for weather-driven agricultural management decisions. When applied to weather data for a real-life agricultural application, CHARIOT is shown to be effective in controlling and improving data quality, which in turn leads to better and more accurate agricultural management decisions.
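A layered check of the kind described above (internal, temporal and spatial) might look like the sketch below. It is not CHARIOT's implementation, which the abstract does not detail. The function name `quality_flags`, all thresholds, and the temperature readings are hypothetical values chosen for illustration.

```python
def quality_flags(series, neighbour_means, low=-60.0, high=60.0,
                  max_step=10.0, max_diff=8.0):
    """Return, for each reading, the list of checks it fails.

    internal: the value lies outside a plausible physical range.
    temporal: the value jumps too far from the previous reading.
    spatial:  the value differs too much from the mean of nearby stations.
    """
    flags = []
    for i, value in enumerate(series):
        failed = []
        if not (low <= value <= high):
            failed.append("internal")
        # Naive step test against the raw previous reading, so the reading
        # that follows a spike is flagged as well.
        if i > 0 and abs(value - series[i - 1]) > max_step:
            failed.append("temporal")
        if abs(value - neighbour_means[i]) > max_diff:
            failed.append("spatial")
        flags.append(failed)
    return flags


# Hypothetical air temperatures (degrees Celsius) at one station, and the
# mean of neighbouring stations at the same time steps.
station = [12.0, 12.5, 99.0, 13.0, 25.0]
neighbours = [12.2, 12.4, 12.9, 13.1, 13.5]
flags = quality_flags(station, neighbours)
```

Here the 99.0 reading fails all three checks, the 13.0 reading after it fails only the naive temporal check, and the 25.0 reading is within range but fails both the temporal and spatial checks. A production system would likely compare against the last accepted value instead of the raw predecessor.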
Chapter
Data Quality for Decision Support – The Indian Banking Scenario
- Hemalatha Diwakar and
- Alka Vaidya
Data Quality and High-Dimensional Data Analysis, 01 Feb 2009
To face the challenges posed by new techno-savvy market players, the Public Sector Banks (PSB) and the old private banks in India have introduced Core Banking Solutions (CBS) to replace disparate branch automation systems. CBS provides a centralized online banking operational database that can be exploited for building Decision Support Systems (DSS) in key areas. While promptness of data is ensured, other data quality needs are to be appraised before implementing any such DSS. Hence an assessment of data quality in two key areas – Customer Relationship Management and Borrower Behaviour – was carried out for a sample bank, covering data profiling, inter-field consistency, attribute value dependent constraints, and domain constraints. The study has identified critical areas for data quality improvement both for legacy data that has been migrated and new data being captured by the CBS. Measures for data cleaning and implementation of additional constraints at the database or application level are proposed for improvement of data quality for implementing these DSS.
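The three constraint types mentioned above can be illustrated with a small application-level validator. This is a sketch under stated assumptions, not the checks used in the study: the field names (`account_type`, `opened_on`, `closed_on`, `sanctioned_amount`), the allowed account types, and the sample records are all hypothetical.

```python
from datetime import date

# Hypothetical domain of allowed values for the account_type field.
ACCOUNT_TYPES = {"savings", "current", "loan"}


def validate(record):
    """Return a list of data quality problems found in one record."""
    problems = []

    # Domain constraint: the value must come from a fixed set.
    if record["account_type"] not in ACCOUNT_TYPES:
        problems.append("domain: unknown account_type")

    # Inter-field consistency: two fields must agree with each other.
    closed = record.get("closed_on")
    if closed is not None and closed < record["opened_on"]:
        problems.append("inter-field: closed_on precedes opened_on")

    # Attribute value dependent constraint: a rule that applies only
    # when another attribute takes a particular value.
    if record["account_type"] == "loan" and not record.get("sanctioned_amount"):
        problems.append("dependent: loan without sanctioned_amount")

    return problems


good = {"account_type": "savings", "opened_on": date(2005, 4, 1)}
bad = {
    "account_type": "loan",
    "opened_on": date(2006, 1, 10),
    "closed_on": date(2005, 12, 31),
}
good_problems = validate(good)
bad_problems = validate(bad)
```

The first record passes, while the second is reported for both the date inconsistency and the missing loan amount. The same rules could equally be expressed as database-level constraints.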