Automating Opinion Extraction from Semi-Structured Webpages: Leveraging Language Models and Instruction Finetuning on Synthetic Data
[ 1 ] Faculty of Engineering Management, Politechnika Poznańska | [ 2 ] Institute of Computing Science, Faculty of Computing and Telecommunications, Politechnika Poznańska | [ S ] student | [ SzD ] doctoral student at the Doctoral School | [ P ] employee
[2.3] Information and communication technology | [6.6] Management and quality studies
2024
chapter in a scientific monograph / conference paper
English
- Language Models
- Information Extraction
- Opinion Mining
EN To extract opinions from semi-structured webpages such as blog posts and product rankings, we employ encoder-decoder transformer models. We improve the models’ performance by generating synthetic data with large language models such as GPT-3.5 and GPT-4, diversified through prompts featuring various text styles, personas, and product characteristics. We experiment with different fine-tuning strategies, training both with and without domain-adapted instructions as well as on synthetic customer reviews, targeting tasks such as extracting product names, pros, cons, and opinion sentences. Our evaluation shows a significant improvement in the models’ performance on both product characteristic and opinion extraction tasks, which validates the effectiveness of synthetic data for fine-tuning and signals the potential of pretrained language models to automate web scraping across diverse web sources.
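The prompt diversification and instruction formatting described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the attribute pools, the prompt wording, the product name, and the function names `build_generation_prompts` and `to_instruction_example` are all hypothetical, and no language model is actually called.

```python
import itertools
import random

# Hypothetical attribute pools; the record does not list the actual values used.
STYLES = ["blog post", "product ranking", "forum reply"]
PERSONAS = ["budget-conscious student", "professional photographer"]
PRODUCT_FEATURES = ["battery life", "build quality", "price"]


def build_generation_prompts(product, n, seed=0):
    """Return up to n distinct prompts asking an LLM for a synthetic review."""
    # Every (style, persona, feature) combination yields a different prompt.
    combos = list(itertools.product(STYLES, PERSONAS, PRODUCT_FEATURES))
    rng = random.Random(seed)
    chosen = rng.sample(combos, k=min(n, len(combos)))
    return [
        f"Write a {style} about the {product} from the point of view of a "
        f"{persona}. Mention its {feature}, and list pros and cons."
        for style, persona, feature in chosen
    ]


def to_instruction_example(text, task, answer):
    """Format one (input, target) pair for a text-to-text model (assumed layout)."""
    return {"input": f"Instruction: {task}\nText: {text}", "target": answer}


# Hypothetical product and review text, used only for illustration.
prompts = build_generation_prompts("Acme X100 camera", n=4)
example = to_instruction_example(
    text="The Acme X100 has great battery life but feels flimsy.",
    task="Extract the product name.",
    answer="Acme X100",
)
```

In such a setup, each prompt would be sent to a generator model, and the returned text, paired with known product attributes, would be converted into instruction examples for fine-tuning an encoder-decoder model.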
681 - 688
CC BY-NC-ND (Attribution - NonCommercial - NoDerivatives)
publisher's website
final published version
at the time of publication
public
5