YANG, Qishan, Mouzhi GE and Markus HELFERT. Data Quality Problems in TPC-DI Based Data Integration Processes. Online. In Enterprise Information Systems. Germany: Springer Lecture Notes in Business Information Processing, 2018, p. 57-73. 321. ISBN 978-3-319-93374-0. Available from: https://dx.doi.org/10.1007/978-3-319-93375-7_4.
Basic information
Original name Data Quality Problems in TPC-DI Based Data Integration Processes
Authors YANG, Qishan, Mouzhi GE (156 China, guarantor, belonging to the institution) and Markus HELFERT (276 Germany).
Edition Germany, Enterprise Information Systems, p. 57-73, 17 pp. 321, 2018.
Publisher Springer Lecture Notes in Business Information Processing
Other information
Original language English
Type of outcome Chapter(s) of a specialized book
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Germany
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
RIV identification code RIV/00216224:14330/18:00103077
Organization unit Faculty of Informatics
ISBN 978-3-319-93374-0
Doi http://dx.doi.org/10.1007/978-3-319-93375-7_4
Keywords in English Data quality;Data integration;TPC-DI Benchmark;ETL
Tags topvydavatel
Tags International impact, Reviewed
Changed by RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 31/5/2022 14:20.
Abstract
Many data-driven organisations need to integrate data from multiple, distributed and heterogeneous resources for advanced data analysis. A data integration system is an essential component for collecting data into a data warehouse or other data analytics systems. Various data integration systems are available, either built in-house or provided by vendors. Hence, an organisation needs to compare and benchmark them when choosing one that meets its requirements. Recently, TPC-DI was proposed as the first industrial benchmark for evaluating data integration systems. When using this benchmark, we find some typical data quality problems in the TPC-DI data source, such as multi-meaning attributes and inconsistent data schemas, which could delay the data integration process or even cause it to fail. This paper explains the processes of this benchmark and summarises typical data quality problems identified in the TPC-DI data source. Furthermore, in order to prevent data quality problems and proactively manage data quality, we propose a set of practical guidelines for researchers and practitioners to conduct data quality management when using the TPC-DI benchmark.