Chapter

Title

Framework to Optimize Data Processing Pipelines Using Performance Metrics

Authors

[1] Politechnika Poznańska | [2] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [D] PhD student | [P] employee

Scientific discipline (Law 2.0)

[2.3] Information and communication technology

Year of publication

2020

Chapter type

chapter in monograph / paper

Publication language

English

Keywords
EN
  • ETL workflow
  • ML workflow
  • workflow optimization
  • cost model
  • parallelization
Abstract

EN Optimizing Data Processing Pipelines (DPPs) is challenging in the context of both data warehouse architectures and data science architectures. Few approaches to this problem have been proposed so far. The most challenging issue is to build a cost model of the whole DPP, especially if user-defined functions (UDFs) are used. In this paper we addressed the problem of the optimization of UDFs in data-intensive workflows and presented our approach to construct a cost model to determine the degree of parallelism for parallelizable UDFs.
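
The abstract refers to a cost model used to choose the degree of parallelism for parallelizable UDFs. As a rough illustration of that general idea (a minimal sketch, not the cost model from the paper), one might estimate the cost of a UDF for each candidate degree of parallelism and pick the cheapest, assuming a hypothetical per-row processing cost that divides across workers and a fixed coordination overhead per additional worker:

    # Minimal, hypothetical sketch: assumes cost(dop) = processing/dop + dop * overhead.
    # All names and numbers below are illustrative assumptions, not taken from the paper.
    from dataclasses import dataclass

    @dataclass
    class UdfCostModel:
        rows: int                   # number of input rows processed by the UDF
        cost_per_row: float         # estimated processing cost per row (ms)
        overhead_per_worker: float  # fixed coordination cost per parallel worker (ms)

        def estimated_cost(self, dop: int) -> float:
            # Processing work is split across `dop` workers; coordination grows with `dop`.
            return (self.rows * self.cost_per_row) / dop + dop * self.overhead_per_worker

        def best_degree_of_parallelism(self, max_dop: int) -> int:
            # Choose the degree of parallelism with the lowest estimated total cost.
            return min(range(1, max_dop + 1), key=self.estimated_cost)

    if __name__ == "__main__":
        model = UdfCostModel(rows=1_000_000, cost_per_row=0.002, overhead_per_worker=150.0)
        dop = model.best_degree_of_parallelism(max_dop=32)
        print(f"chosen degree of parallelism: {dop}, "
              f"estimated cost: {model.estimated_cost(dop):.1f} ms")

With the assumed numbers this picks a degree of parallelism of 4, where the saving from splitting the work balances the per-worker overhead.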

Date of online publication

11.09.2020

Pages (from - to)

131 - 140

DOI

10.1007/978-3-030-59065-9_11

URL

https://link.springer.com/chapter/10.1007/978-3-030-59065-9_11

Book

Big Data Analytics and Knowledge Discovery : 22nd International Conference, DaWaK 2020, Bratislava, Slovakia, September 14–17, 2020 : Proceedings

Presented on

22nd International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2020, 14-17.09.2020, Bratislava, Slovak Republic

Ministry points / chapter

20

Ministry points / conference (CORE)

70
