Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects

Robert Wrembel

doi:10.1007/978-3-031-21047-1_1

Scientific Information System of Poznań University of Technology

Chapter

Title

Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects

Authors

Robert Wrembel (WIiT) ^{[ 1 ][ 2.3 ][ P ]}

^{[ 1 ]} Institute of Computing Science, Faculty of Computing and Telecommunications, Poznan University of Technology | ^{[ P ]} employee

Scientific discipline (Act 2.0)

[2.3] Information and communication technology

Year of publication

2022

Chapter type

chapter in a scientific monograph / conference paper

Publication language

English

Keywords

EN

Data integration
Data warehouse
Data lake
Big data
Extract transform load
Data processing workflow
Data processing pipeline
Data quality
Data deduplication
ETL performance optimization

Abstract

EN In business applications, data integration is typically implemented as a data warehouse architecture. In this architecture, heterogeneous and distributed data sources are accessed and integrated by means of Extract-Transform-Load (ETL) processes. Designing these processes is challenging due to the heterogeneity of data models and formats, data errors and missing values, and multiple data pieces representing the same real-world objects. As a consequence, ETL processes are very complex, which results in high development and maintenance costs as well as long runtimes. To ease the development of ETL processes, various research and technological solutions have been developed. They include, among others: (1) ETL design methods, (2) data cleaning pipelines, (3) data deduplication pipelines, and (4) performance optimization techniques. Although these solutions have been included in commercial (and some open-license) ETL design environments and ETL engines, multiple open issues remain and the existing solutions still need to advance. In this paper (and its accompanying talk), I will provoke a discussion on what problems one can encounter while implementing ETL pipelines in real business (industrial) projects. The presented findings are based on my experience from research and commercial data integration projects in the financial, healthcare, and software development sectors. In particular, I will focus on a few issues, namely: (1) performance optimization of ETL processes, (2) cleaning and deduplicating large row-like data sets, and (3) integrating medical data.

Date made available online

20.11.2022

Pages (from-to)

3 - 17

DOI

10.1007/978-3-031-21047-1_1

URL

https://link.springer.com/chapter/10.1007/978-3-031-21047-1_1

Book

Information Integration and Web Intelligence : 24th International Conference, iiWAS 2022, Virtual Event, November 28–30, 2022 : Proceedings

Presented at

24th International Conference on Information Integration and Web iiWAS 2022, 28-30.11.2022, Poznań, Poland