Chapter

Title

Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects

Authors

[ 1 ] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ P ] employee

Scientific discipline (Law 2.0)

[2.3] Information and communication technology

Year of publication

2022

Chapter type

chapter in monograph / paper

Publication language

English

Keywords
EN
  • Data integration
  • Data warehouse
  • Data lake
  • Big data
  • Extract transform load
  • Data processing workflow
  • Data processing pipeline
  • Data quality
  • Data deduplication
  • ETL performance optimization
Abstract

EN In business applications, data integration is typically implemented as a data warehouse architecture. In this architecture, heterogeneous and distributed data sources are accessed and integrated by means of Extract-Transform-Load (ETL) processes. Designing these processes is challenging due to the heterogeneity of data models and formats, data errors and missing values, and multiple data pieces representing the same real-world objects. As a consequence, ETL processes are very complex, which results in high development and maintenance costs as well as long runtimes. To ease the development of ETL processes, various research and technological solutions were developed. They include, among others: (1) ETL design methods, (2) data cleaning pipelines, (3) data deduplication pipelines, and (4) performance optimization techniques. Although these solutions were included in commercial (and some open license) ETL design environments and ETL engines, multiple open issues remain and the existing solutions still need to advance. In this paper (and its accompanying talk), I will provoke a discussion on what problems one can encounter while implementing ETL pipelines in real business (industrial) projects. The presented findings are based on my experience from research and commercial data integration projects in the financial, healthcare, and software development sectors. In particular, I will focus on a few issues, namely: (1) performance optimization of ETL processes, (2) cleaning and deduplicating large row-like data sets, and (3) integrating medical data.

Date of online publication

20.11.2022

Pages (from - to)

3 - 17

DOI

10.1007/978-3-031-21047-1_1

URL

https://link.springer.com/chapter/10.1007/978-3-031-21047-1_1

Book

Information Integration and Web Intelligence : 24th International Conference, iiWAS 2022, Virtual Event, November 28–30, 2022 : Proceedings

Presented on

24th International Conference on Information Integration and Web iiWAS 2022, 28-30.11.2022, Poznań, Poland

Ministry points / chapter

20

Ministry points / conference (CORE)

20
