Automating Opinion Extraction from Semi-Structured Webpages: Leveraging Language Models and Instruction Finetuning on Synthetic Data

Dawid Adam Plaskowski; Szymon Skwarek; Dominika Grajewska; Maciej Niemir; Agnieszka Ławrynowicz

Scientific Information System of Poznan University of Technology

Chapter

Title

Automating Opinion Extraction from Semi-Structured Webpages: Leveraging Language Models and Instruction Finetuning on Synthetic Data

Authors

Dawid Adam Plaskowski [S]
Szymon Skwarek [S]
Dominika Grajewska
Maciej Niemir (WIZ) [1] [6.6] [SzD]
Agnieszka Ławrynowicz (WIiT) [2] [2.3] [P]

[1] Faculty of Engineering Management, Poznan University of Technology | [2] Institute of Computing Science, Faculty of Computing and Telecommunications, Poznan University of Technology | [S] student | [SzD] doctoral student at the Doctoral School | [P] employee

Scientific discipline (Act 2.0)

[2.3] Information and communication technology
[6.6] Management and quality studies

Publication year

2024

Chapter type

chapter in a scholarly monograph / conference paper

Publication language

English

Keywords

Language Models
Information Extraction
Opinion Mining

Abstract

To address the challenge of extracting opinions from semi-structured webpages such as blog posts and product rankings, we employ encoder-decoder transformer models. We enhance the models’ performance by generating synthetic data with large language models such as GPT-3.5 and GPT-4, diversified through prompts featuring various text styles, personas, and product characteristics. We experiment with different fine-tuning strategies: training with and without domain-adapted instructions, as well as training on synthetic customer reviews, targeting tasks such as extracting product names, pros, cons, and opinion sentences. Our evaluation shows a significant improvement in the models’ performance on both product characteristic and opinion extraction tasks, which validates the effectiveness of synthetic data for fine-tuning and signals the potential of pretrained language models to automate web scraping across diverse web sources.
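
The prompt diversification and instruction formatting described in the abstract might look roughly like the sketch below. It is only an illustration under stated assumptions: the attribute pools, the prompt wording, the instruction text, and the function names `build_generation_prompts` and `to_instruction_example` are hypothetical and are not taken from the paper. The sketch builds prompts only; sending them to a language model is left out.

```python
import itertools
import json
import random

# Hypothetical attribute pools; the paper's actual prompt vocabulary is not listed here.
TEXT_STYLES = ["casual blog post", "ranked product list", "formal review"]
PERSONAS = ["budget-conscious student", "professional photographer"]
PRODUCT_TRAITS = ["battery life", "build quality", "price"]


def build_generation_prompts(product, n, seed=0):
    """Return up to n distinct prompts asking an LLM for a synthetic text about `product`."""
    rng = random.Random(seed)  # seeded so the selection is reproducible
    combos = list(itertools.product(TEXT_STYLES, PERSONAS, PRODUCT_TRAITS))
    rng.shuffle(combos)
    return [
        f"Write a {style} about the {product} from the perspective of a {persona}. "
        f"Mention its {trait}, and state at least one pro and one con."
        for style, persona, trait in combos[:n]
    ]


def to_instruction_example(page_text, product_name, pros, cons):
    """Pair an instruction-prefixed input with a JSON target for seq2seq fine-tuning."""
    # Assumed instruction wording; a domain-adapted instruction could be swapped in here.
    instruction = "Extract the product name, pros and cons from the text."
    target = json.dumps({"product": product_name, "pros": pros, "cons": cons})
    return {"input": f"{instruction}\n\n{page_text}", "target": target}
```

Each generated text, together with the labels used to produce it, could then be passed through `to_instruction_example` to obtain an input/target pair for an encoder-decoder model. Training without instructions would correspond to dropping the instruction prefix from the input.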

Pages (from-to)

681 - 688

URL

https://www.scitepress.org/Link.aspx?doi=10.5220/0012384900003636

Book

Proceedings of the 16th International Conference on Agents and Artificial Intelligence - (Volume 3)

Presented at

16th International Conference on Agents and Artificial Intelligence, 24-26 February 2024, Rome, Italy

License type

CC BY-NC-ND (attribution - non-commercial - no derivatives)

Open access mode

publisher's website