Cellar Extractor

A Python library for extracting CELLAR case law data from EUR-Lex.

This library contains functions to get CELLAR case law data from the EUR-Lex SPARQL endpoint and to enrich it with additional information from InfoCuria and CELLAR item sources.

Version

Python 3.9+

Tests

CI: the badge above tracks the default supported test workflow
Coverage: the badge above tracks the default local test suite coverage snapshot

Contributors

- Pranav Bapat
- Piotr Lewandowski
- shashankmc
- gijsvd
- venvis
- davidwickerhf

How to install?

pip install cellar-extractor

What The Project Does

cellar-extractor builds enriched EUR-Lex / CELLAR case-law datasets.

It starts from CELLAR metadata and then enriches:

- citation edges
- summaries and keywords
- full text
- sector-specific metadata
- graph-ready node/edge projections

The extractor is currently centered on:

- sector 6 case law: CJEU-style material via InfoCuria
- sector 8 case law: mixed / national-case-law material via CELLAR RDF + item downloads

The main workflow has two stages.

- `get_cellar(...)`
  - fetches the base CELLAR corpus
  - returns CSV-like dataframe output or JSON-like dictionary output
- `get_cellar_extra(...)`
  - enriches that corpus with citations, full text, summaries, keywords, provenance, and missing-data flags

The citation graph is now extracted through the public CELLAR SPARQL endpoint. Legacy EUR-Lex SOAP webservice support is kept only for validation tests and is no longer part of the production path.

Data Sources By Type

| Need | Source |
| --- | --- |
| Base corpus metadata | CELLAR SPARQL |
| Citation edges (`citing`, `cited_by`) | CELLAR SPARQL |
| Sector 6 full text and structured metadata | InfoCuria |
| Sector 8 full text and summaries | CELLAR RDF + downloadable `item` manifestations |
| Legacy citation comparison only | EUR-Lex SOAP webservice |

Quick Start

1. Fetch Base CELLAR Metadata

import cellar_extractor as cell

df = cell.get_cellar(
    save=False,
    file_format="csv",
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=100,
)

Returns a dataframe with base metadata such as CELEX, ECLI, type, dates, and subject-matter-related fields.

You can also save explicitly to a custom path instead of the default data/ location:

cell.get_cellar(
    save=True,
    file_format="csv",
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    output_path="exports/cellar_january.csv",
)

2. Fetch The Enriched Dataset

import cellar_extractor as cell

extra_df, fulltext = cell.get_cellar_extra(
    save=False,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=100,
    threads=4,
)

Returns:

extra_df: enriched dataframe
fulltext: list of JSON rows containing extracted text and provenance

You can independently control where the enriched CSV and fulltext JSON are written:

cell.get_cellar_extra(
    save=True,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    metadata_output_path="exports/cellar_extra.csv",
    fulltext_output_path="exports/cellar_fulltext.json",
    threads=4,
)

3. Build A Citation Graph

import cellar_extractor as cell

nodes, edges = cell.get_nodes_and_edges_lists(extra_df, only_local=True)

only_local=True keeps only edges whose target CELEX is also present in extra_df.
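
The `only_local` rule can be pictured with plain Python. This is a minimal sketch of the filtering idea only, not the library's implementation: it assumes edges are `(source, target)` tuples, and the identifiers are placeholders rather than real CELEX numbers. The shapes actually returned by `get_nodes_and_edges_lists` may differ.

```python
from typing import Iterable, List, Set, Tuple

# Assumed edge shape for illustration: (source CELEX, target CELEX).
Edge = Tuple[str, str]


def keep_local_edges(edges: Iterable[Edge], local_celex: Set[str]) -> List[Edge]:
    """Keep only edges whose target is part of the local dataset."""
    return [(src, dst) for src, dst in edges if dst in local_celex]


# Placeholder identifiers, not real CELEX numbers.
local = {"DOC_A", "DOC_B"}
edges = [("DOC_A", "DOC_B"), ("DOC_A", "DOC_X"), ("DOC_B", "DOC_A")]

# The edge pointing at DOC_X is dropped because DOC_X is not in the local set.
local_edges = keep_local_edges(edges, local)
print(local_edges)
```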

4. Filter By Subject Matter

filtered = cell.filter_subject_matter(extra_df, "competition")

Full-Scrape Strategy

If you want the largest reproducible scrape, do not run one enormous date range blindly. Use bounded windows and persist each window.

Recommended approach:

1. choose a date window by `sd` / `ed`
2. run `get_cellar(...)` or `get_cellar_extra(...)`
3. save outputs to disk
4. repeat for the next window
5. concatenate downstream
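
The windowing step in this approach could be sketched as follows. `month_windows` is a hypothetical helper, not part of the package. It only produces `sd` / `ed` strings in the same format as the Quick Start examples; the call to `get_cellar_extra` is left as a comment so the sketch runs without network access.

```python
import calendar
from datetime import date
from typing import Iterator, Tuple


def month_windows(start: date, end: date) -> Iterator[Tuple[str, str]]:
    """Yield (sd, ed) strings covering each calendar month from start to end."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        # Clamp the first and last window to the requested range.
        first = max(date(year, month, 1), start)
        last = min(date(year, month, last_day), end)
        yield first.isoformat(), f"{last.isoformat()}T23:59:59"
        # Advance to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


windows = list(month_windows(date(2024, 12, 15), date(2025, 2, 28)))
for sd, ed in windows:
    print(sd, ed)
    # Each window would then be passed on and persisted, for example:
    # cell.get_cellar_extra(
    #     save=True, sd=sd, ed=ed, threads=4,
    #     metadata_output_path=f"exports/extra_{sd}.csv",
    #     fulltext_output_path=f"exports/fulltext_{sd}.json",
    # )
```

Writing one pair of files per window means a failed window can be rerun on its own without repeating the rest of the scrape.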

Practical guidance:

- use month-sized or week-sized windows for stability
- keep `threads` moderate, typically 4 to 10
- use `save=True` for long runs
- keep the fulltext JSON files; they are the canonical extracted text output

Example file-based run:

import cellar_extractor as cell

cell.get_cellar_extra(
    save=True,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=5000,
    threads=6,
)

By default this writes into data/:

- a CSV with the enriched tabular dataset
- a `_fulltext.json` file with the text rows
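
For the "concatenate downstream" step, the per-window fulltext files can be merged with the standard library. This is a minimal sketch under two assumptions: each file name ends in `_fulltext.json`, and each file holds a JSON list of rows that carry a `celex` key. `merge_fulltext_files` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path
from typing import Dict, List


def merge_fulltext_files(folder: Path, pattern: str = "*_fulltext.json") -> List[dict]:
    """Merge per-window fulltext files, keeping the first row seen per CELEX."""
    merged: Dict[str, dict] = {}
    # Sort the paths so the merge order is reproducible across runs.
    for path in sorted(folder.glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            rows = json.load(handle)
        for row in rows:
            # Overlapping windows can repeat a document; keep one copy.
            merged.setdefault(row["celex"], row)
    return list(merged.values())
```

A call such as `merge_fulltext_files(Path("data"))` would then return one combined list of rows.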

Main Outputs

get_cellar_extra(...) produces:

- an enriched dataframe / CSV
- a fulltext JSON list / file

Important Enriched DataFrame Columns

- `citing`
- `cited_by`
- `celex_summary`
- `celex_keywords`
- `celex_directory_codes`
- `celex_eurovoc`
- `advocate_general`
- `judge_rapporteur`
- `affecting_ids`
- `affecting_strings`
- `citations_extra_info`
- `fulltext_source`
- `summary_source`
- `missing_reasons`

Important Fulltext JSON Fields

- `celex`
- `ecli`
- `text`
- `text_source`
- `text_language`
- `text_format`
- `missing_reasons`
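
A quick sanity check on fulltext rows can use the field names listed in this section. This is a minimal sketch: `absent_fields` is a hypothetical helper, the sample row uses placeholder values, and the real rows may carry more fields than the ones listed here.

```python
from typing import Iterable, List

# Field names taken from the list in this section.
EXPECTED_FIELDS = (
    "celex",
    "ecli",
    "text",
    "text_source",
    "text_language",
    "text_format",
    "missing_reasons",
)


def absent_fields(row: dict, expected: Iterable[str] = EXPECTED_FIELDS) -> List[str]:
    """Return the expected field names that are not present in a row."""
    return [name for name in expected if name not in row]


# Placeholder row for illustration; the values are not real identifiers.
sample_row = {"celex": "DOC_A", "ecli": "ECLI_A", "text": "", "missing_reasons": []}
missing = absent_fields(sample_row)
print(missing)
```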

Completeness Rules

The extractor does not treat empty values as silent success.

Important cases:

- if citation data exists, it should populate `citing` / `cited_by`
- if a document has no citation edges, the columns still exist and are empty
- if full text or summary is not available upstream, `missing_reasons` should reflect that

Typical missing_reasons values:

- `FULLTEXT_UNAVAILABLE_UPSTREAM`
- `SUMMARY_UNAVAILABLE_UPSTREAM`
- `UNAVAILABLE_UPSTREAM`

Sector 8 is still best effort because upstream availability is uneven, but the extractor now flags absence explicitly instead of implying completeness.
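
Because absence is flagged explicitly, the flags can be tallied to see how complete a scrape is. The sketch below assumes `missing_reasons` is stored as a list of strings per row, and also tolerates a single string. The actual representation in the output may differ, so treat this as an illustration of the idea rather than a description of the format.

```python
from collections import Counter
from typing import Iterable


def tally_missing_reasons(rows: Iterable[dict]) -> Counter:
    """Count how often each missing_reasons value occurs across rows."""
    counts: Counter = Counter()
    for row in rows:
        reasons = row.get("missing_reasons") or []
        # Assumed shape: a list of strings; wrap a bare string defensively.
        if isinstance(reasons, str):
            reasons = [reasons]
        counts.update(reasons)
    return counts


# Placeholder rows for illustration only.
sample_rows = [
    {"celex": "DOC_A", "missing_reasons": ["SUMMARY_UNAVAILABLE_UPSTREAM"]},
    {
        "celex": "DOC_B",
        "missing_reasons": ["FULLTEXT_UNAVAILABLE_UPSTREAM", "SUMMARY_UNAVAILABLE_UPSTREAM"],
    },
    {"celex": "DOC_C", "missing_reasons": []},
]
tally = tally_missing_reasons(sample_rows)
print(tally)
```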

Public API Reference

Root-Level Package API

Imported from cellar_extractor/__init__.py:

| Function / class | Purpose |
| --- | --- |
| `get_cellar(...)` | Fetch base CELLAR metadata |
| `get_cellar_extra(...)` | Fetch enriched metadata + full text |
| `get_nodes_and_edges_lists(df, only_local=False)` | Build citation graph lists |
| `filter_subject_matter(df, phrase)` | Filter dataframe by subject phrase |
| `FetchOperativePart` | Extract operative part from a single case document |
| `Writing` | Write operative-part outputs to CSV / JSON / TXT |

Core Modules

`cellar_extractor/cellar.py`

- `get_cellar(ed=None, save_file=<deprecated>, max_ecli=100, sd="2022-05-01", file_format="csv", output_dir="data", output_path=None, return_data=None, save=None)`
- `get_cellar_extra(ed=None, save_file=<deprecated>, max_ecli=100, sd="2022-05-01", threads=10, username="", password="", output_dir="data", metadata_output_path=None, fulltext_output_path=None, save_metadata=None, save_fulltext=None, return_data=None, save=None)`
- `get_nodes_and_edges_lists(df=None, only_local=False)`
- `filter_subject_matter(df=None, phrase=None)`

Notes:

- `username` / `password` are legacy compatibility parameters and no longer change the extraction path
- `save` is the preferred save toggle; `save_file` is kept as a deprecated compatibility alias
- `output_path`, `metadata_output_path`, and `fulltext_output_path` let callers choose exact output locations instead of relying on fixed folders
- when save flags are disabled, the package returns in-memory objects without writing files
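
The relationship between `save` and the deprecated `save_file` alias follows a common pattern, sketched below. This is not the package's code: `resolve_save_flag`, the `_UNSET` sentinel, and the fallback `default=False` are all assumptions made for illustration, and the package's real default may differ.

```python
import warnings

# Hypothetical sentinel standing in for the "<deprecated>" default in the signatures.
_UNSET = object()


def resolve_save_flag(save=None, save_file=_UNSET, default=False):
    """Prefer `save`; fall back to the deprecated `save_file` alias with a warning."""
    if save is not None:
        return bool(save)
    if save_file is not _UNSET:
        warnings.warn(
            "save_file is deprecated; use save instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return bool(save_file)
    # Neither flag was given: use the (assumed) default.
    return default
```

With this shape, an explicit `save` always wins, so callers who pass both flags get the behaviour of the preferred one.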

`cellar_extractor/citations_adder.py`

- `add_citations_separate(data, threads)`: production citation enrichment
- `add_citations_separate_webservice(data, username, password)`: deprecated legacy comparison path
- `add_citations(data, threads)`: older citation replacement helper

`cellar_extractor/fulltext_saving.py`

add_sections(data, threads, output_path=None, json_filepath=None, fulltext_output_path=None): enriches summaries, keywords, text metadata, provenance, and missing-data flags

`cellar_extractor/eurlex_scraping.py`

Main higher-level adapter functions:

- `get_case_data_by_celex_id(celex, language="EN")`
- `get_html_text_by_celex_id(id)`
- `get_summary_html(celex)`
- `get_full_text_from_html(html_text)`

This module contains the sector-aware source logic for InfoCuria and CELLAR item retrieval.

`cellar_extractor/sparql.py`

- `get_citations(source_celex, cites_depth=1, cited_depth=1, max_retries=3)`
- `get_citations_csv(celex, max_retries=3)`
- `get_citing(celex, cites_depth, max_retries=3)`
- `get_cited(celex, cited_depth, max_retries=3)`
- `run_eurlex_webservice_query(query_input, username, password)` for legacy SOAP validation only

`cellar_extractor/cellar_sparql_queries.py`

Advanced query helper class:

- `CellarSparqlQuery`
  - `get_endorsements()`
  - `get_subjects()`
  - `get_parties()`
  - `get_keywords()`
  - `get_citations()`
  - `get_grounds()`

`cellar_extractor/operative_extractions.py`

Classes:

- `FetchOperativePart`
- `Writing`

Use these classes when you want operative-part extraction for individual documents rather than the full dataset pipeline.

Upstream Endpoints Used

These are the upstream systems the extractor relies on.

| Endpoint family | Used for |
| --- | --- |
| CELLAR SPARQL `https://publications.europa.eu/webapi/rdf/sparql` | corpus discovery, metadata, citation edges |
| InfoCuria `https://infocuriaws.curia.europa.eu/...` | sector 6 text and metadata |
| InfoCuria `https://infocuria.curia.europa.eu/document/...` | sector 6 document HTML |
| CELLAR resource/item URLs under `https://publications.europa.eu/resource/cellar/...` | sector 8 downloadable text / summary manifestations |
| EUR-Lex SOAP `https://eur-lex.europa.eu/EURLexWebService?wsdl` | legacy redundancy tests only |
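
To check the SPARQL endpoint by hand, a request URL can be built with the standard library. This sketch relies only on the `query` parameter defined by the SPARQL 1.1 Protocol for GET requests. `build_sparql_get_url` is a hypothetical helper, the probe query is a generic placeholder rather than one the extractor issues, and nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Endpoint URL as listed in the table above.
CELLAR_SPARQL_ENDPOINT = "https://publications.europa.eu/webapi/rdf/sparql"


def build_sparql_get_url(query: str, endpoint: str = CELLAR_SPARQL_ENDPOINT) -> str:
    """Build a SPARQL protocol GET URL with the query percent-encoded."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Generic placeholder query used only to probe availability.
probe_query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
url = build_sparql_get_url(probe_query)
print(url)
```

The resulting URL can be fetched with any HTTP client, typically with an `Accept: application/sparql-results+json` header to request JSON results.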

Testing

Fast Local Suite

pytest -q

Live Integration Flags

RUN_INFOCURIA_INTEGRATION=1
RUN_SECTOR8_INTEGRATION=1
RUN_CITATION_INTEGRATION=1

Examples:

RUN_INFOCURIA_INTEGRATION=1 pytest -q tests/test_infocuria_integration.py
RUN_SECTOR8_INTEGRATION=1 pytest -q tests/test_sector8_integration.py
RUN_CITATION_INTEGRATION=1 pytest -q tests/test_citation_graph_integration.py

Legacy Webservice Tests

Only needed if you want to re-check SOAP redundancy:

RUN_WEBSERVICE_INTEGRATION=1 pytest -q tests/test_webservice_credentials_integration.py tests/test_webservice_redundancy_integration.py

If used, credentials are read from .env:

EURLEX_WEBSERVICE_USERNAME=
EURLEX_WEBSERVICE_PASSWORD=

These credentials are not required for normal extraction.

Troubleshooting

`missing_reasons` is populated

That means the extractor could not find the requested upstream content. This is expected when upstream does not expose a summary or full text for the document.

Citation columns are empty

Check:

- that the document actually has graph relations upstream
- that the live SPARQL endpoint is available
- whether you are looking at a very small or isolated sample

Sector 8 feels sparse

That is usually an upstream availability issue, not a silent extractor failure. Sector 8 is intentionally handled as best effort with explicit flags.

Releasing

This project uses setuptools_scm for automatic versioning based on git tags. Follow these steps to release a new version:

1. Create a git tag

git tag v<major>.<minor>.<patch>

For example:

git tag v1.2.3

2. Push the tag to remote

git push origin v<major>.<minor>.<patch>
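
Before tagging, it can help to confirm that the tag matches the `v<major>.<minor>.<patch>` shape used above. The check below is a minimal sketch: `parse_release_tag` is a hypothetical helper, and it is stricter than setuptools_scm itself, which may accept other tag forms.

```python
import re
from typing import Optional, Tuple

# Matches tags of the form v<major>.<minor>.<patch>, for example v1.2.3.
TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) for a well-formed tag, else None."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


print(parse_release_tag("v1.2.3"))
print(parse_release_tag("1.2.3"))
```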

License

Apache License 2.0
