EU Collection Process

This page describes how EU legislative procedure data flows through the backstage pipeline.

Step 1: Collect

The collect step queries the Cellar SPARQL endpoint to discover the current universe of procedures with interinstitutional reference codes.

What the query returns

Each result row contains:

  • procedure reference: The interinstitutional code (e.g., 2016_399)
  • cellar URI: The Cellar resource identifier (UUID)
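Result rows arrive in the standard W3C SPARQL 1.1 JSON results format, so extracting the two fields is a small mapping step. The sketch below assumes the query binds variables named `proc_ref` and `cellar`; the actual variable names, and the sample cellar URI, are illustrative.

```python
def extract_rows(sparql_json: dict) -> list[tuple[str, str]]:
    """Pull (procedure reference, cellar URI) pairs out of a
    SPARQL 1.1 JSON results payload."""
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        # Variable names ("proc_ref", "cellar") are assumptions;
        # the real query may bind different names.
        rows.append((binding["proc_ref"]["value"], binding["cellar"]["value"]))
    return rows

# Hypothetical sample payload in the shape the endpoint returns.
sample = {
    "results": {
        "bindings": [
            {
                "proc_ref": {"type": "literal", "value": "2016_399"},
                "cellar": {
                    "type": "uri",
                    "value": "http://publications.europa.eu/resource/cellar/0a1b2c3d",
                },
            }
        ]
    }
}
print(extract_rows(sample))
```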

SPARQL snapshots

Each run saves the raw SPARQL result as a dated snapshot at {case}/procedures/state/sparql_snapshots/{date}.json, providing an audit trail of what the endpoint returned on a given date. Downstream steps (download, parse) read procedure references from this snapshot.
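The snapshot layout can be sketched directly from the documented path pattern. Function names here are illustrative, not the pipeline's actual API; only the `{case}/procedures/state/sparql_snapshots/{date}.json` layout comes from the source.

```python
import json
from datetime import date
from pathlib import Path

def snapshot_path(case: str, run_date: date) -> Path:
    # Mirrors the documented layout:
    # {case}/procedures/state/sparql_snapshots/{date}.json
    return (Path(case) / "procedures" / "state" / "sparql_snapshots"
            / f"{run_date.isoformat()}.json")

def save_snapshot(case: str, run_date: date, rows: list[dict]) -> Path:
    """Persist one dated snapshot of the raw SPARQL result."""
    path = snapshot_path(case, run_date)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(rows, indent=2))
    return path

print(snapshot_path("eu", date(2024, 5, 1)))
```

Because each run writes a new dated file rather than overwriting, the snapshot directory doubles as the audit trail the text describes.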

Step 2: Download

For each procedure in the latest SPARQL snapshot, the pipeline downloads the RDF/XML tree notice from Cellar.

Request format

GET https://publications.europa.eu/resource/procedure/{proc_ref}
Accept: application/rdf+xml;notice=tree

The tree notice format returns the complete procedure record including:

  • Procedure metadata (type, title, dates)
  • All events in the procedure timeline (committee opinions, plenary votes, Council adoptions, etc.)
  • Event metadata (dates, responsible institutions)
  • Document references linked to each event

Download modes

Mode            Behavior
Full (default)  Download all procedures regardless of existing data
Incremental     Only download procedures without an existing RDF file in storage
Step 3: Parse

Downloaded RDF notices are parsed by openbasement, which uses YAML-based extraction templates to convert RDF triples into structured dictionaries. The extracted data is then converted to typed openstage EUProcedure models via the from_openbasement() adapter.

Parsed output on S3 is a flat model_dump() of the openstage model: identifiers, title, events, procedure_type, etc.
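The adapter step can be pictured as a mapping from the extracted flat dictionary onto typed fields. The dataclass below is a hypothetical stand-in for the openstage EUProcedure model, and `from_extracted()` stands in for the real `from_openbasement()` adapter; neither reflects the actual library API.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the openstage EUProcedure model.
@dataclass
class Procedure:
    reference: str
    title: str
    events: list[dict] = field(default_factory=list)

def from_extracted(extracted: dict) -> Procedure:
    # Maps the flat dict produced by template extraction onto typed fields,
    # analogous to what from_openbasement() does for the real model.
    return Procedure(
        reference=extracted["reference"],
        title=extracted["title"],
        events=extracted.get("events", []),
    )

proc = from_extracted({"reference": "2016_399", "title": "Example procedure"})
print(proc.reference)
```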

Step 4: Package

Parsed procedure JSON files are validated into typed openstage models and packaged into a ZIP dataset using Dataset.dump(). The ZIP contains one JSON file per procedure plus a metadata.json with dataset-level provenance (name, version, creation date, pipeline dependency versions).
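The ZIP layout described above (one JSON per procedure plus a dataset-level metadata.json) can be sketched with the standard library. `package_dataset()` is a hypothetical stand-in for `Dataset.dump()`, not its actual signature.

```python
import json
import tempfile
import zipfile
from pathlib import Path

def package_dataset(out_zip: Path,
                    procedures: dict[str, dict],
                    metadata: dict) -> None:
    """Hypothetical stand-in for Dataset.dump(): writes one JSON file
    per procedure plus a metadata.json with dataset-level provenance."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for ref, record in procedures.items():
            zf.writestr(f"{ref}.json", json.dumps(record, indent=2))
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "procedures.zip"
    package_dataset(
        out,
        {"2016_399": {"reference": "2016_399", "title": "Example"}},
        {"name": "eu-procedures", "version": "1.0.0",
         "created": "2024-05-01"},
    )
    with zipfile.ZipFile(out) as zf:
        names = sorted(zf.namelist())
print(names)
```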

Step 5: Publish

The packaged dataset is uploaded to Harvard Dataverse with structured metadata including:

  • Dataset title and description
  • Author and contact information
  • Subject classification and keywords
  • Geographic coverage
  • Date of deposit

Publishing supports creating new datasets or adding new versions to existing ones. A dry-run mode allows previewing the metadata without actually publishing.
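A dry run amounts to assembling the deposit metadata and printing it instead of calling the Dataverse API. The sketch below is illustrative only: the real Dataverse native API expects a richer, typed citation-block schema, and the function names here are assumptions.

```python
import json

def build_dataverse_metadata(title: str, description: str, author: str,
                             contact_email: str, keywords: list[str],
                             deposit_date: str) -> dict:
    """Assemble deposit metadata. Simplified: the actual Dataverse
    JSON schema nests these under typed citation-block fields."""
    return {
        "title": title,
        "description": description,
        "author": author,
        "contact": contact_email,
        "keywords": keywords,
        "dateOfDeposit": deposit_date,
    }

def publish(metadata: dict, dry_run: bool = True) -> None:
    if dry_run:
        # Dry-run mode: preview the metadata, make no API call.
        print(json.dumps(metadata, indent=2))
        return
    raise NotImplementedError("real upload goes through the Dataverse API")

publish(build_dataverse_metadata(
    "EU legislative procedures",
    "Tree notices parsed into typed models",
    "Backstage pipeline",
    "contact@example.org",
    ["EU", "legislation"],
    "2024-05-01",
))
```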