Running the Pipeline

Prerequisites

backstage requires Python 3.11+ and uses uv for environment management.

# Install dependencies
uv sync

# For parsing (requires openbasement)
uv sync --extra parsing

# Environment (add your Hetzner S3 credentials)
cp .env.example .env
source .env
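The exact variable names are defined in `.env.example`; a typical layout for Hetzner S3 credentials might look like the sketch below (all names and the endpoint are illustrative, not the project's actual keys):

```shell
# Hypothetical .env layout -- check .env.example for the real variable names.
S3_ENDPOINT_URL="https://fsn1.your-objectstorage.com"   # Hetzner endpoint (illustrative)
S3_ACCESS_KEY="..."                                     # your access key ID
S3_SECRET_KEY="..."                                     # your secret key
S3_BUCKET="backstage-data"                              # target bucket (illustrative)
```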

Full pipeline

Run all steps for a case:

python -m flows.run --case eu --sample          # sample mode (~10 procedures)
python -m flows.run --case eu                   # full production run
python -m flows.run --case eu --dry-run         # skip Dataverse publishing

Incremental update

Collect new procedures and download only missing RDFs, then re-parse and package:

python -m flows.run --case eu --download-mode incremental
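Incremental mode lends itself to scheduled runs; a hypothetical crontab entry (the repository path, log path, and schedule are assumptions, not part of the project) might be:

```shell
# Hypothetical crontab entry: nightly incremental update at 03:00.
# Adjust the repository path and environment loading to your setup.
0 3 * * * cd /opt/backstage && . ./.env && python -m flows.run --case eu --download-mode incremental >> logs/pipeline.log 2>&1
```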

Specific steps

Run only selected steps:

python -m flows.run --case eu --steps collect download
python -m flows.run --case eu --steps parse
python -m flows.run --case eu --steps publish --dry-run   # Dataverse publishing only

Standalone step execution

Each step can also be run independently:


python -m flows.eu.collect --sample
python -m flows.eu.download --sample
python -m flows.eu.parse --sample
python -m flows.eu.publish --dry-run
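The standalone invocations above can be chained in a small wrapper that stops on the first failure; a sketch (the `run_eu_steps` helper and its interpreter argument are illustrative, not part of the project):

```shell
# Sketch: run the standalone eu steps in order, stopping on the first failure.
# The optional first argument overrides the interpreter command, which makes
# the wrapper easy to exercise without the real pipeline installed.
run_eu_steps() {
    local py=${1:-python}
    local step
    for step in collect download parse; do
        "$py" -m "flows.eu.$step" --sample || return 1
    done
    # Publish last, with --dry-run so nothing reaches Dataverse.
    "$py" -m flows.eu.publish --dry-run
}
```

Called with no argument the wrapper uses `python`; passing `echo` as the first argument prints each command instead of running it, which is handy for checking the order.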

All cases

Run the pipeline for all configured cases:

python -m flows.run --case all --sample --dry-run

CLI flags

Flag               Description
--case             Case to run (eu, all). Default: eu
--steps            Specific steps to run. Default: all steps for the case
--sample           Limit to ~10 procedures for testing
--sample-limit N   Override sample size (default: 10)
--dry-run          Skip destructive operations (Dataverse publishing)
--download-mode    full (re-download all) or incremental (only new). Default: full