Text Similarity Measures in a Data Deduplication Pipeline for Customers Records
[ 1 ] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ 2 ] Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ P ] employee | [ SzD ] doctoral student at the Doctoral School
2023
chapter in a scientific monograph / conference paper
English
- data quality
- entity resolution
- data deduplication
- data deduplication pipeline
- customers records deduplication
- text similarity measures
- customer data
- Python packages
- experimental evaluation
EN Data stored in information systems are often erroneous. The most typical errors include inconsistent, missing, and outdated values, typos, and duplicates. To handle data of poor quality, data cleaning (a.k.a. curation) and deduplication (a.k.a. entity resolution) methods are used in both research and industry projects. Data deduplication is particularly challenging because of its computational complexity and the difficulty of finding the most adequate method for comparing records and computing their similarities. The similarity of two records is a compound value, computed from the similarities of individual attribute values. To compute these similarities, multiple similarity measures have been proposed in the research literature and implemented in various libraries (widely available in Python). For a given deduplication problem, it is challenging to decide which similarity measures are most adequate for the attributes being compared, since some measures perform better than others depending on the characteristics of the data. In this paper, we report an experimental evaluation of 45 similarity measures for text values. The need to assess the measures came from a project conducted for a large financial institution in Poland. The measures were compared on five different real data sets, each with different characteristics (e.g., text length, the number of words). The similarity measures were assessed based on (1) the similarity values they produced for the values being compared and (2) their execution time. To the best of our knowledge, this is the first report that includes such a broad evaluation of a large selection of different similarity measures on different real data sets.
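The idea of a compound record similarity built from per-attribute similarities, assessed by both score and execution time, might be sketched as follows. This is only an illustration under stated assumptions, not the pipeline or the measures evaluated in the paper: the two measures (difflib's `SequenceMatcher` ratio and a word-level Jaccard index), the attribute names, the weights, and the sample records are all hypothetical, and only the Python standard library is used.

```python
from difflib import SequenceMatcher
import timeit


def ratio_similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard_similarity(a: str, b: str) -> float:
    """Token-based Jaccard similarity in [0, 1] over lowercase word sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty values are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical assignment: attribute -> (similarity measure, weight).
MEASURES = {
    "name": (ratio_similarity, 0.6),
    "address": (jaccard_similarity, 0.4),
}


def record_similarity(r1: dict, r2: dict) -> float:
    """Compound similarity: weighted average of per-attribute similarities."""
    total_weight = sum(weight for _, weight in MEASURES.values())
    score = sum(
        weight * measure(r1[attr], r2[attr])
        for attr, (measure, weight) in MEASURES.items()
    )
    return score / total_weight


# Made-up sample records differing by one typo in the name.
r1 = {"name": "Jan Kowalski", "address": "ul. Polna 5 Poznan"}
r2 = {"name": "Jan Kowalsky", "address": "ul. Polna 5 Poznan"}

similarity = record_similarity(r1, r2)
# Execution time of 1000 record comparisons, the second assessment criterion.
elapsed = timeit.timeit(lambda: record_similarity(r1, r2), number=1000)
print(f"similarity={similarity:.3f}, 1000 comparisons took {elapsed:.4f}s")
```

Swapping a different measure into `MEASURES` for an attribute and rerunning the timing is one simple way to compare candidates on the two criteria the abstract names.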
32 - 42
CC BY (attribution)
publisher's website
final published version
at the time of publication
5
70