Biomedical Semantic Textual Similarity: Evaluation of Sentence Representations Enhanced with Principal Component Reduction and Word Frequency Weighting
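[ 1 ] Faculty of Computing and Telecommunications, Poznan University of Technology (Wydział Informatyki i Telekomunikacji, Politechnika Poznańska) | [ 2 ] Institute of Computing Science, Faculty of Computing and Telecommunications, Poznan University of Technology (Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska) | [ SzD ] doctoral school student | [ P ] employee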
chapter in monograph / paper
EN Biomedical texts encode semantics in domain vocabulary, frequent acronyms, proper nouns, named entities, and numerical values with implied meaning. Because this information is absent from the surface form of the text, semantic textual similarity is challenging for models trained on general English. This paper evaluates sentence embedding techniques for semantic textual similarity search in the biomedical domain. We compare static embeddings, transformer-based representations (focusing on models fine-tuned to the biomedical domain), and sentence transformers. We also introduce two auxiliary techniques: principal component reduction and weighting of embeddings by word frequency. To gain better insight into the latent properties of sentence embeddings, we perform directional expectation tests. We conduct our experiments on two benchmark datasets: BIOSSES and Clinical Outcomes. We find that sentence transformers are surprisingly effective, outperforming fine-tuned transformer-based models. Initial experiments also suggest the efficacy of principal component reduction and embedding weighting by word frequency.
393 - 403