A large reproducible benchmark on text classification for the legal domain based on the ECHR-OD repository


[ 1 ] Instytut Informatyki, Wydział Informatyki i Telekomunikacji, Politechnika Poznańska | [ P ] employee

Scientific discipline (Law 2.0)

[2.3] Information and communication technology

Year of publication


Published in

Information Systems

Journal year: 2023 | Journal volume: vol. 119

Article type

scientific article

Publication language


  • European Court of Human Rights
  • Open repository of legal documents
  • Legal analytics
  • Classification algorithms on legal data
  • Experiments reproducibility

EN This work is a companion reproducible paper for the experiments and results reported in our previous work, Quemy and Wrembel (2022), which introduced an open repository of legal documents, called ECHR-OD, together with a large benchmark of Machine Learning (ML) methods for text classification. ML algorithms are used in various domains, including banking, healthcare, manufacturing, energy management, security, trade, and insurance. However, building reliable ML models is challenging. First, building prediction models with ML algorithms requires massive amounts of pre-processed data, but in practice such datasets are scarce or take a tremendous amount of time to prepare. Second, once a model is built, its performance needs to be assessed. To this end, benchmarks are needed, but their availability is limited as well. Although ML algorithms are used in multiple domains, their application to the legal domain has so far received little attention from research communities. This motivated us to run a project to build and make available an open repository of judgment documents, called the European Court of Human Rights Open Data (ECHR-OD). In this paper, we describe a step-by-step Extract, Transform, and Load (ETL) process for building ECHR-OD, supported with code snippets, so that it can be easily reproduced. The process produces (almost) exhaustive datasets that have been transformed, homogenized, re-organized, and cleaned beforehand, and that are made available in a format suitable for ML algorithms. The ECHR-OD repository provides tabular descriptive features as well as features extracted from natural language documents, accessible via a web user interface. Moreover, we provide a self-contained and easily reproducible set of experiments assessing ML classification algorithms on the content of the ECHR-OD repository.
To the best of our knowledge, the ETL process and the set of experiments form the first fully end-to-end (from ingesting and pre-processing legal documents to obtaining high-quality ML models), open, and reproducible benchmark on the prediction of European Court of Human Rights judgments. Both components, the ETL process and the experiments, leverage Docker for reproducibility. The content of this paper weakly reproduces the original results and provides a new weakly reproducible set of experiments.

Date of online publication


Pages (from - to)

102258-1 - 102258-14

Article Number: 102258

Ministry points / journal


Impact Factor

3.7 [List 2022]
