Getting Started

Installation

openbasement is not yet published on PyPI. Install directly from GitHub:

pip install openbasement @ git+https://github.com/openstage-eu/openbasement.git

Getting RDF data

openbasement operates on in-memory rdflib.Graph objects. It does no network I/O. The caller is responsible for loading the graph.

The recommended Cellar format is the RDF tree notice, which inlines related entities (events, documents) into a single RDF/XML response:

curl -L \
  -H 'Accept: application/rdf+xml;notice=tree' \
  'https://publications.europa.eu/resource/procedure/2019_2026'

This returns standard RDF/XML that rdflib parses natively.

First extraction

from rdflib import Graph
from openbasement import extract

# Load an RDF tree notice
g = Graph()
g.parse("2019_2026.rdf", format="xml")

# Extract using the built-in EU procedure template
results = extract(g, template="eu_procedure")

procedure = results[0]
procedure["_uri"]           # "http://publications.europa.eu/resource/..."
procedure["title"]          # {"en": "Regulation on ...", "fr": "..."}
procedure["date"]           # "2019-12-11"
procedure["events"]         # List of nested event dicts
procedure["_raw_triples"]   # Triples not consumed by the template

Fields and relations are top-level keys. Metadata keys are prefixed with _ (_uri, _rdf_types, _raw_triples, _same_as).

Output shape

{
    "_uri": "http://publications.europa.eu/resource/procedure/2019_2026",
    "_rdf_types": ["http://.../cdm#procedure_codecision"],
    "_same_as": [                                    # present when sameAs merge found aliases
        "http://.../resource/cellar/abc123...",
        "http://.../resource/pegase/1042898",
    ],

    # Scalar fields
    "date": "2019-12-11",
    "reference": "2019_2026",

    # Multilingual fields: language-keyed dicts
    "title": {"en": "Regulation on ...", "fr": "Reglement sur ...", "de": "Verordnung ..."},

    # Wildcard fields: predicate-keyed dicts
    "dates": {"date": "2019-12-11", "date_adopted": "2021-06-09"},

    # Relations: lists of nested entity dicts
    "events": [
        {
            "_uri": "http://.../procedure-event/...",
            "date": "2020-01-29",
            "type_code": "...",
            "works": [...],
            "_raw_triples": [...]
        },
    ],

    # Unconsumed triples (from all alias subjects, excludes owl:sameAs triples)
    "_raw_triples": [("subj", "pred", "obj"), ...]
}

All RDF information is preserved. Fields the template recognizes become structured data. Everything else lands in _raw_triples.

Extracting specific entities

# Extract only events (not the parent procedure)
events = extract(g, template="eu_procedure", entity="event")

# Extract only documents
docs = extract(g, template="eu_procedure", entity="document")

owl:sameAs merging

By default, openbasement merges entities that share owl:sameAs links. CDM data uses multiple URIs for the same entity (pegase IDs, procedure URIs, cellar UUIDs), and merging produces one rich output entity instead of multiple sparse duplicates.

The _same_as metadata field lists all alias URIs. The _uri field contains the canonical URI (preferring resource/procedure/ over internal identifiers).

To disable merging (e.g., for debugging or non-CDM data):

# Per call
results = extract(g, template="eu_procedure", merge_same_as=False)

# Or in a custom template (same_as_merge defaults to true)
template = {
    "version": "1",
    "same_as_merge": False,
    "prefixes": {...},
    "entities": {...},
}

Using custom templates

# From a YAML file path
results = extract(g, template="/path/to/custom.yaml")

# From a dict
results = extract(g, template={"version": "1", "prefixes": {...}, "entities": {...}})

See Templates for the full template format.

Listing built-in templates

from openbasement import list_builtin_templates

list_builtin_templates()  # ["eu_document", "eu_procedure"]

Transforms

Templates can apply named transforms to extracted values. Two built-in transforms are available:

  • year_from_date: Extracts the year from a date string ("2019-12-11" -> "2019")
  • uri_local_name: Extracts the local name from a URI ("http://.../concept/COD" -> "COD")

Custom transforms are callables passed at extraction time:

results = extract(
    g,
    template="eu_procedure",
    transforms={"my_transform": lambda v: v.upper()},
)

Templates reference transforms by name in the transform field option.

Auditing templates

The audit function checks how well a template covers the predicates actually present in a graph:

from openbasement import audit, load_template

template = load_template("eu_procedure")
report = audit(g, template)

report["summary"]["coverage"]             # 0.85 (85% of triples covered)
report["entities"]["event"]["uncovered"]  # Predicates not in template
report["entities"]["event"]["missing"]    # Template predicates not in graph