On evaluating text similarity measures for customer data deduplication

Paweł Boiński; Mariusz Sienkiewicz; Robert Wrembel; Bartosz Bębel; Witold Andrzejewski

doi:10.1145/3555776.3578724

Scientific Information System of the Poznań University of Technology

Chapter

Title

On evaluating text similarity measures for customer data deduplication

Authors

Paweł Boiński (WIiT) ^{[ 1 ][ 2.3 ][ P ]}
Mariusz Sienkiewicz (WIiT) ^{[ 2 ][ 2.3 ][ SzD ]}
Robert Wrembel (WIiT) ^{[ 1 ][ 2.3 ][ P ]}
Bartosz Bębel (WIiT) ^{[ 1 ][ P ]}
Witold Andrzejewski (WIiT) ^{[ 1 ][ 2.3 ][ P ]}

^{[ 1 ]} Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | ^{[ 2 ]} Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | ^{[ P ]} employee | ^{[ SzD ]} doctoral school student

Scientific discipline (Law 2.0)

[2.3] Information and communication technology

Year of publication

2023

Chapter type

chapter in monograph / paper

Publication language

English

Keywords

data quality
entity resolution
data deduplication
text similarity measures

Abstract

In this paper, we summarize the results of evaluating 44 similarity measures for text values that represent real institutional customer data. The data come from a project conducted for a large financial institution in Poland. The similarity measures were assessed based on the similarity values they returned and on their execution times. To the best of our knowledge, this is the first report that evaluates such a large selection of different similarity measures.

Pages (from - to)

297 - 300

DOI

10.1145/3555776.3578724

Book

Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing SAC '23, March 27 - March 31, 2023, Tallinn, Estonia

Presented on

38th ACM/SIGAPP Symposium on Applied Computing (SAC '23), 27-31.03.2023, Tallinn, Estonia

Ministry points / chapter

20

Ministry points / conference (CORE)

20
