Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects

Robert Wrembel

doi:10.1007/978-3-031-21047-1_1

Scientific Information System of the Poznań University of Technology

Chapter

Title

Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects

Authors

Robert Wrembel (WIiT) [1] [2.3] [P]

[1] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [P] employee

Scientific discipline (Law 2.0)

[2.3] Information and communication technology

Year of publication

2022

Chapter type

chapter in monograph / paper

Publication language

English

Keywords

EN

Data integration
Data warehouse
Data lake
Big data
Extract transform load
Data processing workflow
Data processing pipeline
Data quality
Data deduplication
ETL performance optimization

Abstract

EN In business applications, data integration is typically implemented as a data warehouse architecture. In this architecture, heterogeneous and distributed data sources are accessed and integrated by means of Extract-Transform-Load (ETL) processes. Designing these processes is challenging due to the heterogeneity of data models and formats, data errors and missing values, and multiple data pieces representing the same real-world objects. As a consequence, ETL processes are very complex, which results in high development and maintenance costs as well as long runtimes. To ease the development of ETL processes, various research and technological solutions were developed. They include, among others: (1) ETL design methods, (2) data cleaning pipelines, (3) data deduplication pipelines, and (4) performance optimization techniques. Although these solutions were included in commercial (and some open license) ETL design environments and ETL engines, multiple open issues remain and the existing solutions still need to advance. In this paper (and its accompanying talk), I will provoke a discussion on what problems one can encounter while implementing ETL pipelines in real business (industrial) projects. The presented findings are based on my experience from research and commercial data integration projects in the financial, healthcare, and software development sectors. In particular, I will focus on a few issues, namely: (1) performance optimization of ETL processes, (2) cleaning and deduplicating large row-like data sets, and (3) integrating medical data.

Date of online publication

20.11.2022

Pages (from - to)

3 - 17

DOI

10.1007/978-3-031-21047-1_1

URL

https://link.springer.com/chapter/10.1007/978-3-031-21047-1_1

Book

Information Integration and Web Intelligence : 24th International Conference, iiWAS 2022, Virtual Event, November 28–30, 2022 : Proceedings

Presented on

24th International Conference on Information Integration and Web iiWAS 2022, 28-30.11.2022, Poznań, Poland