A large reproducible benchmark on text classification for the legal domain based on the ECHR-OD repository

Alexandre Quemy; Robert Wrembel; Natalia Łopuszyńska; George Papadakis; Agustín D. Delgado

doi:10.1016/j.is.2023.102258

Title

A large reproducible benchmark on text classification for the legal domain based on the ECHR-OD repository

Authors

Alexandre Quemy
Robert Wrembel (WIiT) [1][2.3][P]
Natalia Łopuszyńska
George Papadakis
Agustín D. Delgado

[1] Institute of Computing Science, Faculty of Computing and Telecommunications (WIiT), Poznan University of Technology | [P] employee

Scientific discipline (Ustawa 2.0)

[2.3] Technical computer science and telecommunications

Year of publication

2023

Published in

Information Systems

Year: 2023 | Volume: 119

Article type

research article

Publication language

English

Keywords

European Court of Human Rights
Open repository of legal documents
Legal analytics
Classification algorithms on legal data
Experiments reproducibility

Abstract

This work is a companion reproducible paper of our experiments and results reported in a previous work, Quemy and Wrembel (2022), introducing an open repository of legal documents, called ECHR-OD, together with a large benchmark of Machine Learning (ML) methods for text classification. ML algorithms are used in various domains, including banking, healthcare, manufacturing, energy management, security, trade, and insurance. However, building reliable ML models is challenging. First, building prediction models with ML algorithms requires massive amounts of pre-processed data, but in practice such datasets are scarce or take a tremendous amount of time to prepare. Second, once a model is built, its performance needs to be assessed. To this end, benchmarks are needed, but their availability is limited as well. Although ML algorithms are used in multiple domains, their application to the legal domain has so far received little attention from research communities. This fact motivated us to run a project to build and make available an open repository of judgment documents, called the European Court of Human Rights Open Data (ECHR-OD). In this paper, we describe a step-by-step Extract, Transform, and Load (ETL) process, supported with code snippets, for building ECHR-OD, so that it can be easily reproduced. The process produces (almost) exhaustive datasets that have been transformed, homogenized, re-organized, cleaned beforehand, and made available in a suitable format for ML algorithms. The ECHR-OD repository makes available tabular descriptive features as well as features extracted from natural language documents, accessible via a web user interface. Moreover, we provide a self-contained and easily reproducible set of experiments assessing ML classification algorithms on the content of the ECHR-OD repository.
To the best of our knowledge, the ETL process and the set of experiments form the first open and reproducible benchmark on the prediction of European Court of Human Rights judgments that is fully end-to-end, from ingesting and pre-processing legal documents to obtaining high-quality ML models. Both components, the ETL and the experiments, leverage Docker for reproducibility. The content of this paper weakly reproduces the original results and provides a new weakly reproducible set of experiments.

Date made available online

12 August 2023

Pages (from-to)

102258-1 - 102258-14

DOI

10.1016/j.is.2023.102258

URL

https://www.sciencedirect.com/science/article/abs/pii/S0306437923000947?dgcid=coauthor

Notes

Article Number: 102258

Ministry score / journal

100