On evaluating text similarity measures for customer data deduplication

Paweł Boiński; Mariusz Sienkiewicz; Robert Wrembel; Bartosz Bębel; Witold Andrzejewski

doi:10.1145/3555776.3578724

Chapter

Title

On evaluating text similarity measures for customer data deduplication

Authors

Paweł Boiński (WIiT) ^{[ 1 ][ 2.3 ][ P ]}
Mariusz Sienkiewicz (WIiT) ^{[ 2 ][ 2.3 ][ SzD ]}
Robert Wrembel (WIiT) ^{[ 1 ][ 2.3 ][ P ]}
Bartosz Bębel (WIiT) ^{[ 1 ][ P ]}
Witold Andrzejewski (WIiT) ^{[ 1 ][ 2.3 ][ P ]}

^{[ 1 ]} Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | ^{[ 2 ]} Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | ^{[ P ]} employee | ^{[ SzD ]} doctoral student of the Doctoral School

Scientific discipline (Act 2.0)

[2.3] Information and communication technology

Publication year

2023

Chapter type

chapter in a scientific monograph / conference paper

Publication language

English

Keywords

EN

data quality
entity resolution
data deduplication
text similarity measures

Abstract

EN In this paper, we summarize the results obtained while evaluating 44 similarity measures for text values, which represent real institutional customers data. These data come from a project conducted for a large financial institution in Poland. The similarity measures were assessed based on similarity values they returned and based on their execution times. To the best of our knowledge, it is the first report that evaluates such a large selection of different similarity measures.

Pages (from-to)

297 - 300

DOI

10.1145/3555776.3578724

Book

Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing SAC '23, March 27 - March 31, 2023, Tallinn, Estonia

Presented at

38th ACM/SIGAPP Symposium on Applied Computing (SAC '23), 27-31.03.2023, Tallinn, Estonia

Ministry points / chapter

20

Ministry points / conference (CORE)

20
