A Python library for extracting CELLAR case law data from EUR-Lex.
This library contains functions to get CELLAR case law data from the EUR-Lex SPARQL endpoint and enrich it with additional information from InfoCuria and CELLAR item sources.
Python 3.9+
- CI: the badge above tracks the default supported test workflow
- Coverage: the badge above tracks the default local test suite coverage snapshot
Pranav Bapat | Piotr Lewandowski | shashankmc | gijsvd | venvis | davidwickerhf
```bash
pip install cellar-extractor
```

cellar-extractor builds enriched EUR-Lex / CELLAR case-law datasets.
It starts from CELLAR metadata and then enriches:
- citation edges
- summaries and keywords
- full text
- sector-specific metadata
- graph-ready node/edge projections
The extractor is currently centered on:
- sector 6 case law: CJEU-style material via InfoCuria
- sector 8 case law: mixed / national-case-law material via CELLAR RDF + item downloads
The main workflow has two stages.
- `get_cellar(...)`
  - fetches the base CELLAR corpus
  - returns CSV-like dataframe output or JSON-like dictionary output
- `get_cellar_extra(...)`
  - enriches that corpus with citations, full text, summaries, keywords, provenance, and missing-data flags
The citation graph is now extracted through the public CELLAR SPARQL endpoint. Legacy EUR-Lex SOAP webservice support is kept only for validation tests and is no longer part of the production path.
| Need | Source |
|---|---|
| Base corpus metadata | CELLAR SPARQL |
| Citation edges (`citing`, `cited_by`) | CELLAR SPARQL |
| Sector 6 full text and structured metadata | InfoCuria |
| Sector 8 full text and summaries | CELLAR RDF + downloadable item manifestations |
| Legacy citation comparison only | EUR-Lex SOAP webservice |
```python
import cellar_extractor as cell

df = cell.get_cellar(
    save=False,
    file_format="csv",
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=100,
)
```

Returns a dataframe with base metadata such as CELEX, ECLI, type, dates, and subject-matter-related fields.
You can also save explicitly to a custom path instead of the default data/ location:
```python
cell.get_cellar(
    save=True,
    file_format="csv",
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    output_path="exports/cellar_january.csv",
)
```

```python
import cellar_extractor as cell

extra_df, fulltext = cell.get_cellar_extra(
    save=False,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=100,
    threads=4,
)
```

Returns:
- `extra_df`: enriched dataframe
- `fulltext`: list of JSON rows containing extracted text and provenance
You can independently control where the enriched CSV and fulltext JSON are written:
```python
cell.get_cellar_extra(
    save=True,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    metadata_output_path="exports/cellar_extra.csv",
    fulltext_output_path="exports/cellar_fulltext.json",
    threads=4,
)
```

```python
import cellar_extractor as cell

nodes, edges = cell.get_nodes_and_edges_lists(extra_df, only_local=True)
```

`only_local=True` keeps only edges whose target CELEX is also present in `extra_df`.

```python
filtered = cell.filter_subject_matter(extra_df, "competition")
```

If you want the largest reproducible scrape, do not run one enormous date range blindly. Use bounded windows and persist each window.
Recommended approach:
- choose a date window by `sd` / `ed`
- run `get_cellar(...)` or `get_cellar_extra(...)`
- save outputs to disk
- repeat for the next window
- concatenate downstream
Practical guidance:
- use month-sized or week-sized windows for stability
- keep `threads` moderate, typically `4` to `10`
- use `save=True` for long runs
- keep the fulltext JSON files; they are the canonical extracted text output
Example file-based run:
```python
import cellar_extractor as cell

cell.get_cellar_extra(
    save=True,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=5000,
    threads=6,
)
```

By default this writes into `data/`:
- a CSV with the enriched tabular dataset
- a `_fulltext.json` file with the text rows
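The windowed approach described above can be sketched as follows. This is a minimal illustration, not part of the package: `month_windows`, `run_windows`, and the per-month file naming are hypothetical helpers, and the only library parameters used are the ones shown in this README (`save`, `sd`, `ed`, `max_ecli`, `threads`, `metadata_output_path`, `fulltext_output_path`).

```python
from datetime import date, timedelta


def month_windows(start, end):
    """Yield (sd, ed) strings covering [start, end] one calendar month at a time."""
    current = date(start.year, start.month, 1)
    while current <= end:
        # First day of the following month.
        if current.month == 12:
            following = date(current.year + 1, 1, 1)
        else:
            following = date(current.year, current.month + 1, 1)
        first = max(current, start)
        last = min(following - timedelta(days=1), end)
        yield first.isoformat(), f"{last.isoformat()}T23:59:59"
        current = following


def run_windows(fetch, start, end, out_dir="exports", threads=6, max_ecli=5000):
    """Call `fetch` once per window, writing one CSV and one JSON file per month."""
    for sd, ed in month_windows(start, end):
        # Hypothetical naming scheme: one file stem per month, e.g. 2025-01.
        stem = f"{out_dir}/cellar_extra_{sd[:7]}"
        fetch(
            save=True,
            sd=sd,
            ed=ed,
            max_ecli=max_ecli,
            threads=threads,
            metadata_output_path=f"{stem}.csv",
            fulltext_output_path=f"{stem}_fulltext.json",
        )
```

Passing the extractor as an argument, for example `run_windows(cell.get_cellar_extra, date(2025, 1, 1), date(2025, 3, 31))`, keeps the window logic testable without network access. Whether the output directory has to exist beforehand is not covered here, so creating it first is the safer choice.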
get_cellar_extra(...) produces:
- an enriched dataframe / CSV
- a fulltext JSON list / file
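Columns in the enriched dataframe / CSV:
- `citing`
- `cited_by`
- `celex_summary`
- `celex_keywords`
- `celex_directory_codes`
- `celex_eurovoc`
- `advocate_general`
- `judge_rapporteur`
- `affecting_ids`
- `affecting_strings`
- `citations_extra_info`
- `fulltext_source`
- `summary_source`
- `missing_reasons`

Fields in each fulltext JSON row:
- `celex`
- `ecli`
- `text`
- `text_source`
- `text_language`
- `text_format`
- `missing_reasons`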
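When several windows have been saved separately, the fulltext files can be concatenated downstream. The sketch below assumes each file holds a JSON list of row objects with a `celex` key, as described above; `merge_fulltext` is a hypothetical helper, and keeping the first row per CELEX is one possible deduplication policy, not necessarily the right one for every dataset.

```python
import json
from pathlib import Path


def merge_fulltext(paths):
    """Merge several fulltext JSON files into one list, keeping the first row per CELEX."""
    merged = {}
    for path in paths:
        rows = json.loads(Path(path).read_text(encoding="utf-8"))
        for row in rows:
            # Assumption: every row is a dict that carries a "celex" key.
            merged.setdefault(row["celex"], row)
    # Dicts preserve insertion order, so rows come back in first-seen order.
    return list(merged.values())
```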
The extractor does not treat empty values as silent success.
Important cases:
- if citation data exists, it should populate `citing` / `cited_by`
- if a document has no citation edges, the columns still exist and are empty
- if full text or summary is not available upstream, `missing_reasons` should reflect that

Typical `missing_reasons` values:
- `FULLTEXT_UNAVAILABLE_UPSTREAM`
- `SUMMARY_UNAVAILABLE_UPSTREAM`
- `UNAVAILABLE_UPSTREAM`
Sector 8 is still best effort because upstream availability is uneven, but the extractor now flags absence explicitly instead of implying completeness.
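To see how much of a run was affected by upstream gaps, the flags can be tallied from the fulltext rows. This is a sketch under an explicit assumption: the exact storage format of `missing_reasons` is not specified here, so the hypothetical helper `tally_missing_reasons` accepts either a list of flags or a single string with flags separated by `;` or `,`.

```python
from collections import Counter


def tally_missing_reasons(rows):
    """Count how often each missing-data flag appears across rows."""
    counts = Counter()
    for row in rows:
        reasons = row.get("missing_reasons") or []
        # Assumption: the value is a list of flags, or one string with
        # flags separated by ";" or ",".
        if isinstance(reasons, str):
            reasons = reasons.replace(";", ",").split(",")
        counts.update(r.strip() for r in reasons if r.strip())
    return counts
```

Rows with an empty or absent `missing_reasons` contribute nothing to the tally, so the counts reflect only explicitly flagged gaps.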
Imported from cellar_extractor/__init__.py:
| Function / class | Purpose |
|---|---|
| `get_cellar(...)` | Fetch base CELLAR metadata |
| `get_cellar_extra(...)` | Fetch enriched metadata + full text |
| `get_nodes_and_edges_lists(df, only_local=False)` | Build citation graph lists |
| `filter_subject_matter(df, phrase)` | Filter dataframe by subject phrase |
| `FetchOperativePart` | Extract operative part from a single case document |
| `Writing` | Write operative-part outputs to CSV / JSON / TXT |
- `get_cellar(ed=None, save_file=<deprecated>, max_ecli=100, sd="2022-05-01", file_format="csv", output_dir="data", output_path=None, return_data=None, save=None)`
- `get_cellar_extra(ed=None, save_file=<deprecated>, max_ecli=100, sd="2022-05-01", threads=10, username="", password="", output_dir="data", metadata_output_path=None, fulltext_output_path=None, save_metadata=None, save_fulltext=None, return_data=None, save=None)`
- `get_nodes_and_edges_lists(df=None, only_local=False)`
- `filter_subject_matter(df=None, phrase=None)`
Notes:
- `username` / `password` are legacy compatibility parameters and no longer change the extraction path
- `save` is the preferred save toggle; `save_file` is kept as a deprecated compatibility alias
- `output_path`, `metadata_output_path`, and `fulltext_output_path` let callers choose exact output locations instead of relying on fixed folders
- when save flags are disabled, the package returns in-memory objects without writing files
- `add_citations_separate(data, threads)`: production citation enrichment
- `add_citations_separate_webservice(data, username, password)`: deprecated legacy comparison path
- `add_citations(data, threads)`: older citation replacement helper
- `add_sections(data, threads, output_path=None, json_filepath=None, fulltext_output_path=None)`: enriches summaries, keywords, text metadata, provenance, and missing-data flags
Main higher-level adapter functions:
- `get_case_data_by_celex_id(celex, language="EN")`
- `get_html_text_by_celex_id(id)`
- `get_summary_html(celex)`
- `get_full_text_from_html(html_text)`
This module contains the sector-aware source logic for InfoCuria and CELLAR item retrieval.
- `get_citations(source_celex, cites_depth=1, cited_depth=1, max_retries=3)`
- `get_citations_csv(celex, max_retries=3)`
- `get_citing(celex, cites_depth, max_retries=3)`
- `get_cited(celex, cited_depth, max_retries=3)`
- `run_eurlex_webservice_query(query_input, username, password)` for legacy SOAP validation only
Advanced query helper class:
- `CellarSparqlQuery`
  - `get_endorsements()`
  - `get_subjects()`
  - `get_parties()`
  - `get_keywords()`
  - `get_citations()`
  - `get_grounds()`
Classes:
- `FetchOperativePart`
- `Writing`
Use this path when you want operative-part extraction for individual documents rather than the full dataset pipeline.
These are the upstream systems the extractor relies on.
| Endpoint family | Used for |
|---|---|
| CELLAR SPARQL `https://publications.europa.eu/webapi/rdf/sparql` | corpus discovery, metadata, citation edges |
| InfoCuria `https://infocuriaws.curia.europa.eu/...` | sector 6 text and metadata |
| InfoCuria `https://infocuria.curia.europa.eu/document/...` | sector 6 document HTML |
| CELLAR resource/item URLs under `https://publications.europa.eu/resource/cellar/...` | sector 8 downloadable text / summary manifestations |
| EUR-Lex SOAP `https://eur-lex.europa.eu/EURLexWebService?wsdl` | legacy redundancy tests only |
```bash
pytest -q
```

Environment flags for the integration tests:
- `RUN_INFOCURIA_INTEGRATION=1`
- `RUN_SECTOR8_INTEGRATION=1`
- `RUN_CITATION_INTEGRATION=1`
Examples:
```bash
RUN_INFOCURIA_INTEGRATION=1 pytest -q tests/test_infocuria_integration.py
RUN_SECTOR8_INTEGRATION=1 pytest -q tests/test_sector8_integration.py
RUN_CITATION_INTEGRATION=1 pytest -q tests/test_citation_graph_integration.py
```

Only needed if you want to re-check SOAP redundancy:

```bash
RUN_WEBSERVICE_INTEGRATION=1 pytest -q tests/test_webservice_credentials_integration.py tests/test_webservice_redundancy_integration.py
```

If used, credentials are read from `.env`:

```
EURLEX_WEBSERVICE_USERNAME=
EURLEX_WEBSERVICE_PASSWORD=
```

These credentials are not required for normal extraction.
A `missing_reasons` value ending in `UNAVAILABLE_UPSTREAM` means the extractor could not find the requested upstream content. This is expected when upstream does not expose a summary or full text for the document.
If `citing` / `cited_by` are empty, check:
- that the document actually has graph relations upstream
- the live SPARQL endpoint availability
- whether you are looking at a very small or isolated sample
Missing sector 8 text or summaries usually indicate an upstream availability issue, not a silent extractor failure. Sector 8 is intentionally handled as best effort with explicit flags.
This project uses setuptools_scm for automatic versioning based on git tags. Follow these steps to release a new version:
```bash
git tag v<major>.<minor>.<patch>
```

For example:

```bash
git tag v1.2.3
```

Then push the tag:

```bash
git push origin v<major>.<minor>.<patch>
```
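Because the version is derived from the tag name, it can help to check a tag against the `v<major>.<minor>.<patch>` pattern before pushing it. The helper below is a hypothetical pre-release check, not part of this project or of setuptools_scm, and it deliberately accepts only the exact three-part form shown above.

```python
import re

# Matches tags such as v1.2.3; suffixes like -rc1 are rejected on purpose.
TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag):
    """Return (major, minor, patch) for a well-formed tag, else raise ValueError."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"tag {tag!r} does not match v<major>.<minor>.<patch>")
    return tuple(int(part) for part in match.groups())
```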