CDM Patterns
The Common Data Model (CDM) ontology has several patterns that make raw RDF extraction painful. This page documents the quirks that openbasement handles and explains why the template system exists.
Triple aliasing
This is the most pervasive CDM pattern. Every fact appears under multiple predicates simultaneously:
<event> cdm:date "2023-07-05" .
<event> cdm:event_date "2023-07-05" .
<event> cdm:event_legal_date "2023-07-05" .
The three forms follow a naming convention:
- Short form: cdm:date (generic, shared across entity types)
- Entity-prefixed form: cdm:event_date (scoped to the entity category)
- Fully qualified form: cdm:event_legal_date (specific to the entity subtype)
Which predicates are present varies across data vintages. Older procedures (pre-2010) may use different predicate forms than recent ones. Templates handle this with predicate alias lists, trying each in order:
date:
  predicate:
    - "cdm:event_legal_date"
    - "cdm:event_date"
    - "cdm:date"
For cardinality: "one", the first alias that produces data wins. For cardinality: "many", results from all aliases are merged and deduplicated.
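The alias rules above can be illustrated with a minimal sketch. It is not openbasement's implementation: the function name resolve_aliases and the representation of triples as plain (subject, predicate, object) tuples are assumptions made for illustration.

```python
def resolve_aliases(triples, subject, aliases, cardinality="one"):
    """Return values for `subject` using an ordered list of predicate aliases.

    `triples` is a list of (subject, predicate, object) tuples (toy model).
    """
    merged = []
    for predicate in aliases:
        values = [o for s, p, o in triples if s == subject and p == predicate]
        if not values:
            continue
        if cardinality == "one":
            # The first alias that yields data wins; later aliases are ignored.
            return values[0]
        for value in values:
            if value not in merged:  # deduplicate while keeping order
                merged.append(value)
    return None if cardinality == "one" else merged


# Toy data: event1 carries two alias forms, event2 only the short form.
triples = [
    ("event1", "cdm:date", "2023-07-05"),
    ("event1", "cdm:event_date", "2023-07-05"),
    ("event2", "cdm:date", "2009-03-12"),
]
aliases = ["cdm:event_legal_date", "cdm:event_date", "cdm:date"]

one_value = resolve_aliases(triples, "event1", aliases, "one")
many_values = resolve_aliases(triples, "event1", aliases, "many")
old_value = resolve_aliases(triples, "event2", aliases, "one")
```

Here event1 resolves through cdm:event_date and event2 falls through to cdm:date, which is why listing every known alias matters for older data.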
owl:sameAs identity
CDM uses multiple URIs for the same real-world entity. A single legislative procedure might have:
- A pegase ID: resource/pegase/1044494 (internal identifier)
- A procedure URI: resource/procedure/2017_33 (human-readable reference)
- A cellar UUID: resource/cellar/abc123... (storage identifier)
These are linked via owl:sameAs, sometimes through intermediate nodes:
<resource/pegase/1044494> owl:sameAs <resource/cellar/abc123> .
<resource/procedure/2017_33> owl:sameAs <resource/cellar/abc123> .
This matters in two ways:
- Duplicate entities: find_instances() may discover the same real-world entity multiple times (once per URI that carries the target rdf:type). Without merging, the output contains a rich entity (from the pegase URI with all the data) and a near-empty stub (from the procedure URI with only a few triples).
- Broken inverse lookups: Events may point their part_of_dossier predicate to the procedure URI, while the rdf:type lives on the pegase URI. Without sameAs expansion, inverse predicate lookups miss these connections.
openbasement handles both problems:
- sameAs merging (default, controlled by the template's same_as_merge key or merge_same_as= function parameter): After discovering instances, the engine groups them into owl:sameAs equivalence classes using 2-hop expansion. Each group produces one output entity with triples merged from all alias URIs. A canonical URI is selected (preferring resource/procedure/ over pegase/cellar URIs). All aliases are listed in the _same_as metadata field.
- Inverse predicate expansion: The inverse_predicate relation option queries all alias URIs when looking for nodes that point back to the entity. When sameAs merging is active, the alias set is already computed; otherwise, it falls back to expanding sameAs for the single instance.
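The grouping and lookup steps can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: all function names are hypothetical, triples are plain tuples, the cdm: prefix on part_of_dossier is assumed, and the canonical-URI fallback (lexically smallest alias) is invented for the sketch.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"


def expand_same_as(triples, uri, hops=2):
    """Collect URIs reachable from `uri` via owl:sameAs in either direction."""
    neighbours = defaultdict(set)
    for s, p, o in triples:
        if p == SAME_AS:
            neighbours[s].add(o)
            neighbours[o].add(s)
    seen = {uri}
    frontier = {uri}
    for _ in range(hops):
        # Step one hop outwards, keeping only URIs not visited yet.
        frontier = {n for u in frontier for n in neighbours[u]} - seen
        seen |= frontier
    return seen


def group_instances(triples, instances):
    """Group instance URIs so each real-world entity appears once."""
    groups, assigned = [], set()
    for uri in instances:
        if uri in assigned:
            continue
        aliases = expand_same_as(triples, uri)
        assigned |= aliases
        groups.append(aliases)
    return groups


def pick_canonical(aliases):
    """Prefer resource/procedure/ URIs; the fallback rule is an assumption."""
    preferred = sorted(a for a in aliases if "resource/procedure/" in a)
    return preferred[0] if preferred else sorted(aliases)[0]


def inverse_lookup(triples, aliases, predicate):
    """Find subjects whose `predicate` points at any alias of the entity."""
    return sorted({s for s, p, o in triples if p == predicate and o in aliases})


triples = [
    ("resource/pegase/1044494", SAME_AS, "resource/cellar/abc123"),
    ("resource/procedure/2017_33", SAME_AS, "resource/cellar/abc123"),
    # The event points at the procedure URI, not the pegase URI.
    ("event1", "cdm:part_of_dossier", "resource/procedure/2017_33"),
]
instances = ["resource/pegase/1044494", "resource/procedure/2017_33"]

groups = group_instances(triples, instances)
canonical = pick_canonical(groups[0])
events = inverse_lookup(triples, groups[0], "cdm:part_of_dossier")
```

The pegase and procedure URIs are two hops apart (both link to the cellar URI), so a single hop would leave them in separate groups. Querying the whole alias set is what lets the lookup find event1 even when the entity was discovered through its pegase URI.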
Multilingual literals
Most text fields in CDM exist in up to 24 EU languages as language-tagged RDF literals:
<proc> cdm:title "Regulation on ..."@en .
<proc> cdm:title "Reglement sur ..."@fr .
<proc> cdm:title "Verordnung ..."@de .
openbasement collects all language variants into a language-keyed dict:
{"en": "Regulation on ...", "fr": "Reglement sur ...", "de": "Verordnung ..."}
The template's languages.preferred list controls fallback ordering. Untagged literals (those with no language tag) appear under the "_" key when the template sets fallback: "any".
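A small sketch of the collection step, assuming literals arrive as (text, language_tag) pairs with None for untagged values. The helper names are hypothetical. Reading a single value by walking the preferred list is one plausible interpretation of "fallback ordering", not a documented behaviour.

```python
def collect_languages(literals, fallback="any"):
    """Build a language-keyed dict from (text, language_tag) pairs.

    Untagged literals (language_tag is None) are stored under "_" only
    when fallback == "any"; otherwise they are dropped.
    """
    by_language = {}
    for text, lang in literals:
        if lang is None:
            if fallback == "any":
                by_language.setdefault("_", text)
        else:
            by_language.setdefault(lang, text)
    return by_language


def pick_preferred(by_language, preferred):
    """Return the first available variant in `preferred` order, else None."""
    for lang in preferred:
        if lang in by_language:
            return by_language[lang]
    return None


# Toy literals; the untagged entry is made up for the example.
literals = [
    ("Regulation on ...", "en"),
    ("Reglement sur ...", "fr"),
    ("Verordnung ...", "de"),
    ("Untitled draft", None),
]

titles = collect_languages(literals)
strict_titles = collect_languages(literals, fallback="none")
best = pick_preferred(titles, ["it", "fr", "en"])
```

With no Italian variant present, the lookup falls through to the French title.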
Subclass hierarchies
CDM uses rdfs:subClassOf for entity type specialization. A codecision procedure is not directly typed as cdm:procedure_interinstitutional. Instead:
cdm:procedure_codecision rdfs:subClassOf cdm:procedure_interinstitutional .
<proc> rdf:type cdm:procedure_codecision .
Similarly, old procedures without an interinstitutional code use cdm:procedure_without_code_interinstitutional as a subclass.
Without subclass handling, a query for procedure_interinstitutional instances would miss these. The include_subclasses: true template option tells the engine to check one level of rdfs:subClassOf when finding instances.
Cellar pre-computes type hierarchies, so in practice the one-level subclass check is sufficient for tree notices.
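The one-level check can be sketched like this. The function name find_instances_sketch is hypothetical and is deliberately distinct from the library's find_instances(). Triples are again plain tuples.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def find_instances_sketch(triples, target_type, include_subclasses=True):
    """Return subjects typed as `target_type` or one of its direct subclasses."""
    accepted = {target_type}
    if include_subclasses:
        # One level only: direct subclasses, not their descendants.
        accepted |= {
            s for s, p, o in triples if p == SUBCLASS_OF and o == target_type
        }
    return sorted({s for s, p, o in triples if p == RDF_TYPE and o in accepted})


triples = [
    ("cdm:procedure_codecision", SUBCLASS_OF, "cdm:procedure_interinstitutional"),
    ("proc1", RDF_TYPE, "cdm:procedure_codecision"),
    ("proc2", RDF_TYPE, "cdm:procedure_interinstitutional"),
]

with_subclasses = find_instances_sketch(triples, "cdm:procedure_interinstitutional")
without_subclasses = find_instances_sketch(
    triples, "cdm:procedure_interinstitutional", include_subclasses=False
)
```

Only the subclass-aware call finds proc1, which is typed solely as cdm:procedure_codecision.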
Predicate variation across vintages
Predicate usage is not uniform across the full historical dataset. A predicate that works on modern procedures (post-2015) may not exist on older ones (pre-2010). Templates should list all known predicate aliases to maximize coverage across the full date range.
The audit() function helps identify coverage gaps: it compares template predicates against actual graph content and reports which predicates are missing or uncovered.
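The kind of comparison involved can be sketched as a set difference. This is not the audit() function itself: its signature and report format are not shown here, so the helper name coverage_report is hypothetical. The sketch reads "missing" as template predicates absent from the graph and "uncovered" as graph predicates absent from the template, which is also an assumption.

```python
def coverage_report(template_predicates, triples):
    """Compare template predicates with the predicates present in the graph."""
    in_graph = {p for _, p, _ in triples}
    in_template = set(template_predicates)
    return {
        # Listed in the template but never seen in this graph.
        "missing": sorted(in_template - in_graph),
        # Present in the graph but not referenced by the template.
        "uncovered": sorted(in_graph - in_template),
    }


template_predicates = ["cdm:event_legal_date", "cdm:event_date", "cdm:date"]
triples = [
    ("event1", "cdm:date", "2009-03-12"),
    ("proc1", "cdm:title", "Regulation on ..."),
]

report = coverage_report(template_predicates, triples)
```

Running such a check against graphs from different vintages would show which aliases only appear in older or newer data.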