C 2018

Data Quality Problems in TPC-DI Based Data Integration Processes

YANG, Qishan, Mouzhi GE and Markus HELFERT

Basic information

Original name

Data Quality Problems in TPC-DI Based Data Integration Processes

Authors

YANG, Qishan, Mouzhi GE (156 China, guarantor, belonging to the institution) and Markus HELFERT (276 Germany)

Edition

Germany, Enterprise Information Systems, p. 57-73, 17 pp. 321, 2018

Publisher

Springer Lecture Notes in Business Information Processing

Other information

Language

English

Type of outcome

Chapter(s) of a specialized book

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Germany

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

RIV identification code

RIV/00216224:14330/18:00103077

Organization unit

Faculty of Informatics

ISBN

978-3-319-93374-0

Keywords in English

Data quality; Data integration; TPC-DI Benchmark; ETL

Tags

International impact, Reviewed
Changed: 31/5/2022 14:20, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Many data-driven organisations need to integrate data from multiple, distributed and heterogeneous resources for advanced data analysis. A data integration system is an essential component for collecting data into a data warehouse or other data analytics systems. Various data integration systems exist, either built in-house or provided by vendors. Hence, an organisation needs to compare and benchmark them when choosing one that meets its requirements. TPC-DI was recently proposed as the first industrial benchmark for evaluating data integration systems. When using this benchmark, we found some typical data quality problems in the TPC-DI data source, such as multi-meaning attributes and inconsistent data schemas, which could delay the data integration process or even cause it to fail. This paper explains the processes of this benchmark and summarises the typical data quality problems identified in the TPC-DI data source. Furthermore, to prevent data quality problems and manage data quality proactively, we propose a set of practical guidelines for researchers and practitioners conducting data quality management when using the TPC-DI benchmark.