Biomedical Semantic Textual Similarity: Evaluation of Sentence Representations Enhanced with Principal Component Reduction and Word Frequency Weighting

Klaudia Kantor; Mikołaj Morzy

doi:10.1007/978-3-031-09342-5_39

Chapter

Title

Biomedical Semantic Textual Similarity: Evaluation of Sentence Representations Enhanced with Principal Component Reduction and Word Frequency Weighting

Authors

Klaudia Kantor (WIiT) ^{[ 1 ][ 2.3 ][ SzD ]}
Mikołaj Morzy (WIiT) ^{[ 2 ][ 2.3 ][ P ]}

^{[ 1 ]} Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | ^{[ 2 ]} Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | ^{[ SzD ]} doctoral student at the Doctoral School | ^{[ P ]} employee

Scientific discipline (Act 2.0)

[2.3] Information and communication technology

Year of publication

2022

Chapter type

chapter in a scientific monograph / conference paper

Publication language

English

Abstract

Biomedical texts encode semantics in domain vocabulary, extensive use of acronyms, proper nouns, named entities, and numerical values with implied meaning. This information is absent from the surface text form, making semantic textual similarity challenging for models trained on the general English language. This paper evaluates different techniques of sentence embedding in semantic textual similarity search in the biomedical domain. We compare static embeddings, transformer-based representations (focusing on models fine-tuned to the biomedical domain), and sentence transformers. We also introduce two auxiliary techniques: principal component reduction and word frequency embedding weighting. To gain better insights into the latent properties of sentence embeddings, we perform directional expectation tests. We conduct our experiments on two benchmark datasets: the BIOSSES and the Clinical Outcomes. We find that sentence transformers are surprisingly effective, outperforming fine-tuned transformer-based models. Initial experiments also suggest the efficacy of principal component reduction and embedding weighting by word frequency.
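The abstract names the two auxiliary techniques but does not give their formulas. As a rough illustration only, the sketch below assumes the widely used "smooth inverse frequency" recipe: each word vector is weighted by a / (a + p(w)), where p(w) is the word's relative frequency, and the projection onto the first principal direction of the sentence-embedding matrix is then subtracted. The function names, the smoothing constant `a`, and the use of an uncentered SVD are all assumptions made for this example and may differ from what the chapter actually does.

```python
import numpy as np


def weighted_sentence_embedding(tokens, word_vectors, word_prob, a=1e-3):
    """Average word vectors, down-weighting frequent words by a / (a + p(w)).

    Assumption: this weighting scheme is illustrative, not taken from the chapter.
    Tokens without a vector are skipped; unseen frequencies default to 0.
    """
    dim = len(next(iter(word_vectors.values())))
    vecs, weights = [], []
    for tok in tokens:
        if tok in word_vectors:
            vecs.append(word_vectors[tok])
            weights.append(a / (a + word_prob.get(tok, 0.0)))
    if not vecs:
        return np.zeros(dim)
    return np.average(np.asarray(vecs, dtype=float), axis=0, weights=weights)


def remove_first_principal_component(embeddings):
    """Subtract each row's projection onto the first right singular vector.

    Assumption: the SVD is taken on the uncentered matrix.
    """
    X = np.asarray(embeddings, dtype=float)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    pc = vt[0]  # unit-length dominant direction
    return X - np.outer(X @ pc, pc)


def cosine_similarity(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0
```

In use, one would embed every sentence of a dataset with `weighted_sentence_embedding`, stack the results into a matrix, pass it through `remove_first_principal_component`, and score sentence pairs with `cosine_similarity`. The intuition is that frequent words and the dominant shared direction carry little sentence-specific meaning, so suppressing them may make similarity scores more discriminative.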

Date made available online

9 July 2022

Pages (from-to)

393 - 403

DOI

10.1007/978-3-031-09342-5_39

URL

https://link.springer.com/chapter/10.1007/978-3-031-09342-5_39

Book

Artificial Intelligence in Medicine : 20th International Conference on Artificial Intelligence in Medicine, AIME 2022, Halifax, NS, Canada, June 14–17, 2022 : Proceedings

Presented at

20th International Conference on Artificial Intelligence in Medicine AIME 2022, 14-17 June 2022, Halifax, Canada