Chapter

Title

Framework to Optimize Data Processing Pipelines Using Performance Metrics

Authors

[1] Politechnika Poznańska | [2] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [D] PhD student | [P] employee

Scientific discipline (Law 2.0)

[2.3] Information and communication technology

Year of publication

2020

Chapter type

chapter in monograph / paper

Publication language

English

Keywords
EN
  • ETL workflow
  • ML workflow
  • workflow optimization
  • cost model
  • parallelization
Abstract

EN Optimizing Data Processing Pipelines (DPPs) is challenging in the context of both data warehouse architectures and data science architectures. Few approaches to this problem have been proposed so far. The most challenging issue is to build a cost model of the whole DPP, especially if user-defined functions (UDFs) are used. In this paper we addressed the problem of the optimization of UDFs in data-intensive workflows and presented our approach to construct a cost model to determine the degree of parallelism for parallelizable UDFs.
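
The abstract refers to a cost model used to choose the degree of parallelism for parallelizable UDFs. As a rough illustration of that general idea (a minimal sketch, not the cost model from the paper), one might estimate the cost of a UDF for each candidate degree of parallelism and pick the cheapest, assuming a hypothetical per-row processing cost that divides across workers and a fixed coordination overhead per additional worker:

    # Minimal, hypothetical sketch: assumes cost(dop) = processing/dop + dop * overhead.
    # All names and numbers below are illustrative assumptions, not taken from the paper.
    from dataclasses import dataclass

    @dataclass
    class UdfCostModel:
        rows: int                   # number of input rows processed by the UDF
        cost_per_row: float         # estimated processing cost per row (ms)
        overhead_per_worker: float  # fixed coordination cost per parallel worker (ms)

        def estimated_cost(self, dop: int) -> float:
            # Processing work is split across `dop` workers; coordination grows with `dop`.
            return (self.rows * self.cost_per_row) / dop + dop * self.overhead_per_worker

        def best_degree_of_parallelism(self, max_dop: int) -> int:
            # Choose the degree of parallelism with the lowest estimated total cost.
            return min(range(1, max_dop + 1), key=self.estimated_cost)

    if __name__ == "__main__":
        model = UdfCostModel(rows=1_000_000, cost_per_row=0.002, overhead_per_worker=150.0)
        dop = model.best_degree_of_parallelism(max_dop=32)
        print(f"chosen degree of parallelism: {dop}, "
              f"estimated cost: {model.estimated_cost(dop):.1f} ms")

With the assumed numbers this picks a degree of parallelism of 4, where the saving from splitting the work balances the per-worker overhead.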

Date of online publication

11.09.2020

Pages (from - to)

131 - 140

DOI

10.1007/978-3-030-59065-9_11

URL

https://link.springer.com/chapter/10.1007/978-3-030-59065-9_11

Book

Big Data Analytics and Knowledge Discovery : 22nd International Conference, DaWaK 2020, Bratislava, Slovakia, September 14–17, 2020 : Proceedings

Presented on

22nd International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2020, 14-17.09.2020, Bratislava, Slovak Republic

Ministry points / chapter

20

Ministry points / conference (CORE)

70
