Chapter

Title

Text Similarity Measures in a Data Deduplication Pipeline for Customers Records

Authors

[ 1 ] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ 2 ] Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ P ] employee | [ SzD ] doctoral school student

Scientific discipline (Law 2.0)

[2.3] Information and communication technology

Year of publication

2023

Chapter type

chapter in monograph / paper

Publication language

English

Keywords
EN
  • data quality
  • entity resolution
  • data deduplication
  • data deduplication pipeline
  • customers records deduplication
  • text similarity measures
  • customer data
  • Python packages
  • experimental evaluation
Abstract

EN Data stored in information systems are often erroneous. The most typical errors include: inconsistent, missing, and outdated values, typos as well as duplicates. To handle data of poor quality, data cleaning (a.k.a. curation) and deduplication (a.k.a. entity resolution) methods are used in projects realized by research and industry. Data deduplication is of particular challenge due to its computational complexity and the complexity of finding the most adequate method for comparing records and computing similarities of these records. The similarity value of two records is a compound value, whose computation is based on similarities of individual attribute values. To compute these similarities, multiple similarity measures were proposed in research literature and were implemented in various libraries (widely available in Python). For a given deduplication problem, a challenging task is to decide which similarity measures are the most adequate to given attributes being compared, since some similarity measures perform better than others for given characteristics of data being compared. In this paper, we report the experimental evaluation of 45 similarity measures for text values. The need to assess the measures came from a project conducted for a large financial institution in Poland. The measures were compared on five different real data sets, each of which had a different characteristic (e.g., text length, the number of words). The similarity measures were assessed (1) based on similarity values they produced for given values being compared and (2) based on their execution time. To the best of our knowledge, it is the first report that includes such a broad evaluation of a large selection of different similarity measures, on different real data sets.
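
The compound record similarity mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that the record similarity is a weighted average of per-attribute similarities, and it uses the standard library's difflib.SequenceMatcher as a stand-in for one text similarity measure. The function names, attribute names, and weights are hypothetical.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] of two text values, case-insensitive.

    Uses difflib's matching-block ratio as a stand-in measure (assumption).
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of attribute similarities over the keys in `weights`.

    The weighting scheme is an illustrative assumption, not taken from the paper.
    """
    total = sum(weights.values())
    weighted = sum(
        w * attribute_similarity(r1[key], r2[key]) for key, w in weights.items()
    )
    return weighted / total


# Hypothetical customer records and attribute weights.
weights = {"name": 2, "city": 1}
a = {"name": "Jan Kowalski", "city": "Poznań"}
b = {"name": "Jan Kowalsky", "city": "Poznan"}
c = {"name": "Anna Nowak", "city": "Kraków"}

print(record_similarity(a, b, weights))  # likely duplicate: high score
print(record_similarity(a, c, weights))  # different customer: lower score
```

Swapping attribute_similarity for another measure changes both the scores and the run time, which are the two aspects the abstract says were evaluated.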

Pages (from - to)

32 - 42

URL

https://ceur-ws.org/Vol-3369/paper3.pdf

Book

Proceedings of the 25th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP) : co-located with the 26th International Conference on Extending Database Technology and the 26th International Conference on Database Theory (EDBT/ICDT 2023), Ioannina, Greece, March 28, 2023

Presented on

25th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP 2023), 28.03.2023, Ioannina, Greece

License type

CC BY (attribution alone)

Open Access Mode

publisher's website

Open Access Text Version

final published version

Date of Open Access to the publication

at the time of publication

Ministry points / chapter

5

Ministry points / conference (CORE)

70
