Chapter

Title

Text Similarity Measures in a Data Deduplication Pipeline for Customers Records

Authors

[ 1 ] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ 2 ] Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ P ] employee | [ SzD ] doctoral student at the Doctoral School

Scientific discipline (Law 2.0)

[2.3] Information and communication technology

Year of publication

2023

Chapter type

chapter in a scientific monograph / conference paper

Publication language

English

Keywords
EN
  • data quality
  • entity resolution
  • data deduplication
  • data deduplication pipeline
  • customers records deduplication
  • text similarity measures
  • customer data
  • Python packages
  • experimental evaluation
Abstract

EN Data stored in information systems are often erroneous. The most typical errors include: inconsistent, missing, and outdated values, typos as well as duplicates. To handle data of poor quality, data cleaning (a.k.a. curation) and deduplication (a.k.a. entity resolution) methods are used in projects realized by research and industry. Data deduplication is of particular challenge due to its computational complexity and the complexity of finding the most adequate method for comparing records and computing similarities of these records. The similarity value of two records is a compound value, whose computation is based on similarities of individual attribute values. To compute these similarities, multiple similarity measures were proposed in research literature and were implemented in various libraries (widely available in Python). For a given deduplication problem, a challenging task is to decide which similarity measures are the most adequate to given attributes being compared, since some similarity measures perform better than others for given characteristics of data being compared. In this paper, we report the experimental evaluation of 45 similarity measures for text values. The need to assess the measures came from a project conducted for a large financial institution in Poland. The measures were compared on five different real data sets, each of which had a different characteristic (e.g., text length, the number of words). The similarity measures were assessed (1) based on similarity values they produced for given values being compared and (2) based on their execution time. To the best of our knowledge, it is the first report that includes such a broad evaluation of a large selection of different similarity measures, on different real data sets.
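
The abstract describes the similarity of two records as a compound value built from the similarities of individual attribute values. The sketch below illustrates that idea only; it is not the method from the paper. It assumes a weighted average as the aggregation and uses `difflib.SequenceMatcher` from the Python standard library as a stand-in text similarity measure, which may or may not be among the 45 measures evaluated. The record fields, sample values, and weights are hypothetical.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] for two text values (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Combine per-attribute similarities into one value.

    Assumption: the compound value is a weighted average; the paper may
    aggregate attribute similarities differently.
    """
    total = sum(weights.values())
    score = sum(
        w * attribute_similarity(rec_a[attr], rec_b[attr])
        for attr, w in weights.items()
    )
    return score / total


# Hypothetical customer records and weights, for illustration only.
r1 = {"name": "Jan Kowalski", "city": "Poznan"}
r2 = {"name": "Jan Kowalsky", "city": "Poznan"}
weights = {"name": 0.7, "city": 0.3}

similarity = record_similarity(r1, r2, weights)
```

In a deduplication pipeline, a pair of records would typically be flagged as a candidate duplicate when this value exceeds a threshold; choosing the measure per attribute is exactly the decision the evaluation in the paper is meant to inform.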

Pages (from-to)

32 - 42

URL

https://ceur-ws.org/Vol-3369/paper3.pdf

Book

Proceedings of the 25th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP) : co-located with the 26th International Conference on Extending Database Technology and the 26th International Conference on Database Theory (EDBT/ICDT 2023), Ioannina, Greece, March 28, 2023

Presented at

25th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP 2023), 28.03.2023, Ioannina, Greece

License type

CC BY (attribution)

Open access mode

publisher's website

Open access text version

final published version

Time of open access release

at the time of publication

Ministry points / chapter

5

Ministry points / conference (CORE)

70
