ReproHum #0043-4: Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility
[ 1 ] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ P ] employee
2024
chapter in monograph / paper
english
- human evaluation
- reproduction
- reproducibility
- dialogue summarization
- summarization
EN In this paper, we describe several reproductions of a human evaluation experiment measuring the quality of automatic dialogue summarization (Feng et al., 2021). We investigate the impact of the annotators' highest level of education, field of study, and native language on their evaluation of summary informativeness. We find that the evaluation is relatively consistent regardless of these factors; the largest impact appears to come from a prior specific background in natural language processing (as opposed to, e.g., a general background in computer science). We also find that the experimental setup (asking about a single criterion vs. multiple criteria) may have an impact on the results.
Pages: 229-237
License: CC BY (Attribution)
Access: final published version, available on the publisher's website at the time of publication
5
140