Article

Title

A large reproducible benchmark on text classification for the legal domain based on the ECHR-OD repository

Authors

[ 1 ] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ P ] employee

Scientific discipline (Ustawa 2.0)

[2.3] Information and communication technology

Year of publication

2023

Published in

Information Systems

Year: 2023 | Volume: 119

Article type

scientific article

Publication language

English

Keywords
EN
  • European court of human rights
  • Open repository of legal documents
  • Legal analytics
  • Classification algorithms on legal data
  • Experiments reproducibility
Abstract

EN This work is a companion reproducible paper of our experiments and results reported in a previous work, Quemy and Wrembel (2022), introducing an open repository of legal documents, called ECHR-OD, together with a large benchmark of Machine Learning (ML) methods for text classification. ML algorithms are used in various domains, including banking, healthcare, manufacturing, energy management, security, trade, and insurance. However, building reliable ML models is challenging. First, because in order to build prediction models by ML algorithms, massive amounts of pre-processed data are needed, but in practice, such datasets are scarce or require a tremendous amount of time to be prepared. Second, because once a model is built, its performance needs to be assessed. To this end, benchmarks are needed, but their availability is limited as well. Despite the fact that ML algorithms are used in multiple domains, their application to the legal domain so far has received little attention from research communities. This fact motivated us to run a project to build and make available an open repository of judgment documents called the European Court of Human Rights Open Data (ECHR-OD). In this paper, we describe a step-by-step Extract, Transform, and Load (ETL) process, supported with code snippets, for building ECHR-OD, so that it can be easily reproduced. The process produces (almost) exhaustive datasets that have been transformed, homogenized, re-organized, cleaned beforehand, and made available in a suitable format for ML algorithms. The ECHR-OD repository makes available tabular descriptive features as well as features extracted from natural language documents, accessible via a web user interface. Moreover, we provide a self-contained and easily reproducible set of experiments assessing ML classification algorithms on the content of the ECHR-OD repository.
To the best of our knowledge, the ETL process and the set of experiments form the first fully end-to-end, from ingesting and pre-processing legal documents to obtaining high quality ML models, open, and reproducible benchmark on the prediction of the European Court of Human Rights judgments. Both components, the ETL and the experiments, leverage Docker for reproducibility. The content of this paper weakly reproduces the original results and provides a new weakly reproducible set of experiments.

Date made available online

12.08.2023

Pages (from-to)

102258-1 - 102258-14

DOI

10.1016/j.is.2023.102258

URL

https://www.sciencedirect.com/science/article/abs/pii/S0306437923000947?dgcid=coauthor

Notes

Article Number: 102258

Ministry score / journal

100

Impact Factor

3
