Chapter

Title

Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects

Authors

[ 1 ] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ P ] employee

Scientific discipline (Law 2.0)

[2.3] Information and communication technology

Year of publication

2022

Chapter type

chapter in a scientific monograph / conference paper

Publication language

English

Keywords
EN
  • Data integration
  • Data warehouse
  • Data lake
  • Big data
  • Extract transform load
  • Data processing workflow
  • Data processing pipeline
  • Data quality
  • Data deduplication
  • ETL performance optimization
Abstract

EN In business applications, data integration is typically implemented as a data warehouse architecture. In this architecture, heterogeneous and distributed data sources are accessed and integrated by means of Extract-Transform-Load (ETL) processes. Designing these processes is challenging due to the heterogeneity of data models and formats, data errors and missing values, and multiple data pieces representing the same real-world objects. As a consequence, ETL processes are very complex, which results in high development and maintenance costs as well as long runtimes. To ease the development of ETL processes, various research and technological solutions were developed. They include, among others: (1) ETL design methods, (2) data cleaning pipelines, (3) data deduplication pipelines, and (4) performance optimization techniques. Although these solutions were included in commercial (and some open license) ETL design environments and ETL engines, multiple open issues remain and the existing solutions still need to advance. In this paper (and its accompanying talk), I will provoke a discussion on what problems one can encounter while implementing ETL pipelines in real business (industrial) projects. The presented findings are based on my experience from research and commercial data integration projects in the financial, healthcare, and software development sectors. In particular, I will focus on a few issues, namely: (1) performance optimization of ETL processes, (2) cleaning and deduplicating large row-like data sets, and (3) integrating medical data.

Date of online availability

20.11.2022

Pages (from-to)

3 - 17

DOI

10.1007/978-3-031-21047-1_1

URL

https://link.springer.com/chapter/10.1007/978-3-031-21047-1_1

Book

Information Integration and Web Intelligence : 24th International Conference, iiWAS 2022, Virtual Event, November 28–30, 2022 : Proceedings

Presented at

24th International Conference on Information Integration and Web iiWAS 2022, 28-30.11.2022, Poznań, Poland

Ministry points / chapter

20

Ministry points / conference (CORE)

20
