Text Similarity Measures in a Data Deduplication Pipeline for Customers Records

Witold Andrzejewski; Bartosz Bębel; Paweł Boiński; Mariusz Sienkiewicz; Robert Wrembel

Scientific Information System of the Poznań University of Technology

Chapter

Title

Text Similarity Measures in a Data Deduplication Pipeline for Customers Records

Authors

Witold Andrzejewski (WIiT) ^{[ 1 ][ 2.3 ][ P ]}
Bartosz Bębel (WIiT) ^{[ 1 ][ P ]}
Paweł Boiński (WIiT) ^{[ 1 ][ 2.3 ][ P ]}
Mariusz Sienkiewicz (WIiT) ^{[ 2 ][ 2.3 ][ SzD ]}
Robert Wrembel (WIiT) ^{[ 1 ][ 2.3 ][ P ]}

^{[ 1 ]} Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | ^{[ 2 ]} Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | ^{[ P ]} employee | ^{[ SzD ]} doctoral school student

Scientific discipline (Law 2.0)

[2.3] Information and communication technology

Year of publication

2023

Chapter type

chapter in monograph / paper

Publication language

English

Keywords

EN

data quality
entity resolution
data deduplication
data deduplication pipeline
customers records deduplication
text similarity measures
customer data
Python packages
experimental evaluation

Abstract

EN Data stored in information systems are often erroneous. The most typical errors include inconsistent, missing, and outdated values, typos, and duplicates. To handle data of poor quality, data cleaning (a.k.a. curation) and deduplication (a.k.a. entity resolution) methods are used in research and industry projects. Data deduplication is particularly challenging because of its computational complexity and the difficulty of finding the most adequate method for comparing records and computing their similarities. The similarity value of two records is a compound value, computed from the similarities of individual attribute values. To compute these similarities, multiple similarity measures have been proposed in the research literature and implemented in various libraries (widely available in Python). For a given deduplication problem, a challenging task is to decide which similarity measures are most adequate for the attributes being compared, since some measures perform better than others for given characteristics of the data. In this paper, we report an experimental evaluation of 45 similarity measures for text values. The need to assess the measures came from a project conducted for a large financial institution in Poland. The measures were compared on five different real data sets, each of which had different characteristics (e.g., text length, the number of words). The similarity measures were assessed (1) based on the similarity values they produced for the values being compared and (2) based on their execution time. To the best of our knowledge, this is the first report that includes such a broad evaluation of a large selection of different similarity measures on different real data sets.
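
The idea of a compound record similarity built from per-attribute text similarities can be illustrated with a minimal Python sketch. It is not the authors' pipeline and does not reproduce any of the 45 measures evaluated in the paper: it assumes a single measure (the ratio from the standard-library `difflib.SequenceMatcher`), a weighted average as the combination rule, and hypothetical attribute names and weights (`first_name`, `last_name`, `city`) chosen only for illustration.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] for two text values.

    Assumption: difflib's matching-blocks ratio stands in for whichever
    text similarity measure a real pipeline would select per attribute.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Combine attribute similarities into one record-level value.

    Assumption: a weighted average is used as the compound function;
    the attribute names and weights are hypothetical.
    """
    total = sum(weights.values())
    weighted = sum(
        w * attribute_similarity(r1[attr], r2[attr])
        for attr, w in weights.items()
    )
    return weighted / total


# Hypothetical customer records and weights, for illustration only.
WEIGHTS = {"first_name": 0.3, "last_name": 0.5, "city": 0.2}
rec_a = {"first_name": "Katarzyna", "last_name": "Nowak", "city": "Poznań"}
rec_b = {"first_name": "Katarzyna", "last_name": "Nowak", "city": "Poznań"}
rec_c = {"first_name": "Jan", "last_name": "Kowalski", "city": "Gdańsk"}

if __name__ == "__main__":
    print(record_similarity(rec_a, rec_b, WEIGHTS))
    print(record_similarity(rec_a, rec_c, WEIGHTS))
```

In such a design the per-attribute measure is the part that would be swapped out, which is why the choice of measure for each attribute (and its execution time) matters for the pipeline as a whole.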

Pages (from - to)

32 - 42

URL

https://ceur-ws.org/Vol-3369/paper3.pdf

Book

Proceedings of the 25th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP) : co-located with the 26th International Conference on Extending Database Technology and the 26th International Conference on Database Theory (EDBT/ICDT 2023), Ioannina, Greece, March 28, 2023

Presented on

25th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP 2023), 28.03.2023, Ioannina, Greece

License type

CC BY (attribution alone)

Open Access Mode

publisher's website