Evidence-based governance for data definitions

Lifecycle: experimental

The Problem

Most data problems aren’t about bad data—they’re about bad definitions.

When a dashboard says “X patients are ready for discharge”, it’s showing the result of a rule, not reality. That rule might be sensible, but once it becomes a target, people adapt to it. The definition doesn’t change, but what it measures does.

This is ontology drift: the gradual divergence between what a definition claims to measure and what it actually measures.

The Solution

ontologyR treats definitions as hypotheses, not decrees:

  • Version definitions — never overwrite, always add new versions
  • Audit definitions — sample cases and check if definitions match reality
  • Detect drift — track disagreement rates over time
  • Govern with evidence — use audit data to inform adoption and deprecation

Installation

# Install from GitHub
devtools::install_github("CathalByrneGit/ontologyR")

Two API Styles

ontologyR offers two equivalent ways to work with ontologies:

Object-based API. Inspired by ontologyIndex, this style feels R-native with $ accessors:

library(ontologyR)

# Create ontology object
ont <- ontology(":memory:")

# Register and define
ont$register_object("Encounter", "encounters", "encounter_id")
ont$define_concept("ready_for_discharge", "Encounter")
ont$add_version("ready_for_discharge", "flow", 1, 
                "NOT planned_intervention", status = "active")

# Explore with $ syntax
ont$concepts
ont$concepts$ready_for_discharge
ont$concepts$ready_for_discharge$flow[[1]]

# Evaluate from version object
v <- ont$concepts$ready_for_discharge$flow[[1]]
result <- v$evaluate()

# Audit workflow
sample <- ont$sample("ready_for_discharge", "flow", n = 20)
ont$record_audit("ready_for_discharge", "flow", 1, "E1", TRUE, FALSE, "reviewer")
ont$drift_status()

ont$disconnect()

Function-based API. Explicit functions with database connection management:

library(ontologyR)

# Connect to ontology database
ont_connect(":memory:")

# Register object type (maps logical name to physical table)
ont_register_object("Encounter", "encounters", "encounter_id")

# Define a concept
ont_define_concept(
  concept_id = "ready_for_discharge",
  object_type = "Encounter",
  description = "Patient is ready to leave hospital"
)

# Add a versioned definition
ont_add_version(
  concept_id = "ready_for_discharge",
  scope = "flow",
  version = 1,
  sql_expr = "NOT planned_intervention_24h",
  status = "active",
  rationale = "Proxy: no planned interventions"
)

# Evaluate the concept
result <- ont_evaluate("ready_for_discharge", "flow", 1)

# Sample for audit
sample <- ont_sample_for_audit("ready_for_discharge", "flow", n = 20)

# Record human judgment
ont_record_audit(
  concept_id = "ready_for_discharge",
  scope = "flow", version = 1,
  object_key = "E123",
  system_value = TRUE,
  reviewer_value = FALSE,  # Disagreement!
  reviewer_id = "dr_smith"
)

# Check for drift
ont_detect_drift("ready_for_discharge", "flow", threshold = 0.15)

# View governance status
ont_drift_status()

Core Concepts

Term | Meaning
Object Type | A logical entity (e.g., “Encounter”) mapped to a physical table
Concept | A named definition (e.g., “ready_for_discharge”)
Version | A specific SQL expression for a concept, scoped and numbered
Scope | Context for a definition (e.g., “flow”, “clinical”, “regulatory”)
Audit | A human judgment comparing system evaluation to ground truth
Drift | When the disagreement rate exceeds a threshold
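
To make the Audit and Drift terms concrete, here is a minimal base R sketch of how a disagreement rate might be computed from audit records and compared with a threshold. The `audits` data frame is invented for illustration, and the calculation is an assumption about the general idea rather than the package's internal method.

```r
# Hypothetical audit records: one row per reviewed case.
audits <- data.frame(
  object_key     = c("E1", "E2", "E3", "E4", "E5"),
  system_value   = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  reviewer_value = c(TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Share of audited cases where the reviewer disagreed with the system.
disagreement_rate <- mean(audits$system_value != audits$reviewer_value)

# Flag drift when the rate exceeds the chosen threshold.
threshold <- 0.15
drifting <- disagreement_rate > threshold
```

Here two of five reviews disagree, so the rate is 0.4 and the definition would be flagged at a 0.15 threshold.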

Concept Templates

Templates let you define a base concept that scope-specific variants inherit. This suits standards such as the ILO unemployment definition, where different countries apply local adaptations.

# Define an ILO unemployment template
ont_define_template(
  template_id = "ilo_unemployed",
  template_name = "ILO Unemployment Definition",
  object_type = "Person",
  base_sql_expr = "age >= {{min_age}} AND age <= {{max_age}} AND NOT employed AND seeking_work",
  parameters = list(
    min_age = list(default = 15, type = "integer"),
    max_age = list(default = 74, type = "integer")
  ),
  source_standard = "ILO"
)

# Create country-specific variants
ont_inherit_concept("unemployed_us", "ilo_unemployed", "united_states",
  parameter_values = list(min_age = 16, max_age = 65),
  deviation_notes = "US uses 16-65 age range per BLS")

ont_inherit_concept("unemployed_ireland", "ilo_unemployed", "ireland",
  parameter_values = list(min_age = 15, max_age = 66))

# Compare all variants
ont_compare_template_variants("ilo_unemployed")
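
The `{{min_age}}` placeholders suggest that a variant's SQL is produced by substituting parameter values into the base expression. The following base R sketch shows one way that substitution could work; `render_template()` is a hypothetical helper written for this illustration and is not part of ontologyR.

```r
# Hypothetical helper: fill {{name}} placeholders using defaults,
# overridden by any variant-specific parameter values.
render_template <- function(base_sql_expr, parameters, parameter_values = list()) {
  defaults <- lapply(parameters, function(p) p$default)
  values <- utils::modifyList(defaults, parameter_values)
  sql <- base_sql_expr
  for (name in names(values)) {
    placeholder <- paste0("{{", name, "}}")
    sql <- gsub(placeholder, as.character(values[[name]]), sql, fixed = TRUE)
  }
  sql
}

base_sql_expr <- "age >= {{min_age}} AND age <= {{max_age}} AND NOT employed AND seeking_work"
parameters <- list(
  min_age = list(default = 15, type = "integer"),
  max_age = list(default = 74, type = "integer")
)

# The template with its defaults, then the US variant from the example.
default_sql <- render_template(base_sql_expr, parameters)
us_sql <- render_template(base_sql_expr, parameters,
                          parameter_values = list(min_age = 16, max_age = 65))
```

With these inputs, `us_sql` is "age >= 16 AND age <= 65 AND NOT employed AND seeking_work", while `default_sql` keeps the 15 to 74 range.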

Interactive Shiny Apps

ontologyR includes three Shiny applications for interactive exploration and management:

Ontology Explorer

Browse concepts, templates, audits, and governance information:

ont_run_explorer()
# Or point to a specific database:
ont_run_explorer(db_path = "my_ontology.duckdb")

Definition Builder

Visual SQL builder that lets non-technical users create concept definitions.

Lineage Viewer

Interactive DAG visualization of datasets and transforms.

The package can also list all available apps.

API Reference

Top-level Entry Points

Function | Description
ontology() | Recommended: create an ontology object with $ accessors
ont_connect() | Connect using the function-based API

Object-based API (ont$...)

ont <- ontology("path/to/db.duckdb")

# Exploration (returns S3 objects)
ont$concepts                           # All concepts
ont$concepts$my_concept                # Single concept
ont$concepts$my_concept$scope[[1]]     # Specific version
ont$objects                            # All object types

# Actions (return invisibly)
ont$register_object(...)
ont$define_concept(...)
ont$add_version(...)
ont$activate(...)
ont$deprecate(...)

# Queries (return data)
ont$evaluate(concept_id, scope, version)
ont$compare(concept_id, scope, v1, v2)
ont$sample(concept_id, scope, n)
ont$audit_summary(concept_id, scope, version)
ont$drift_status()

# Governance
ont$check_policy(action, concept_id, scope, version)
ont$governance_report()

# Utilities
ont$sql("SELECT ...")
ont$refresh()
ont$disconnect()

Function-based API (ont_*())

The function reference is grouped into Connection, Concepts, Auditing, Drift Detection, Governance, Templates, and Interactive Apps.

Design Principles

  1. Definitions are hypotheses — test them, don’t assume they’re right
  2. Disagreement is evidence — audit mismatches are signal, not noise
  3. Drift is measurable — track disagreement rates systematically
  4. Governance follows evidence — block actions on drifting definitions
  5. Coexistence over replacement — multiple versions can be active
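
Principles 3 and 4 can be sketched together in base R: track the disagreement rate per period, then gate an action on the latest value. The `audits` data, the monthly grouping, and the `activation_allowed` flag are all assumptions made for illustration; the package's own policy checks may work differently.

```r
# Hypothetical audit records with the month each review took place.
audits <- data.frame(
  audit_month    = c("2024-01", "2024-01", "2024-01", "2024-01",
                     "2024-02", "2024-02", "2024-02", "2024-02"),
  system_value   = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE),
  reviewer_value = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)
audits$disagree <- audits$system_value != audits$reviewer_value

# Disagreement rate per month, as a named numeric vector in month order.
rates <- vapply(split(audits$disagree, audits$audit_month), mean, numeric(1))

# Gate: allow activation only if the most recent month is within the threshold.
threshold <- 0.15
latest_rate <- rates[[length(rates)]]
activation_allowed <- latest_rate <= threshold
```

In this made-up data the rate rises from 0 in January to 0.5 in February, so the gate would block activation.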

Learn More

Design Notes: Tables vs. Classes

ontologyR uses tables as the source of truth but provides S3 classes for ergonomics.

Why tables (not named lists like ontologyIndex)?

Our problem | Their problem
Definitions change over time | Ontologies are static
Need versioning + audit trails | Just hierarchy traversal
“Is this definition still valid?” | “Is term A ancestor of B?”
Scale unknown (could be enterprise) | Fits in memory

Tables give us:

  • Natural versioning (just add rows)
  • SQL evaluation (definitions are SQL)
  • Integration with data lakes (csolake)
  • Audit queries for free

Why S3 accessors on top?

Interactive R users expect ont$concepts$foo, not ont_get_concept(con, "foo").

The hybrid approach:

  • Database = source of truth
  • S3 objects = cached views with nice printing
  • Both APIs use the same tables

When to use which

Situation | Recommendation
Interactive exploration | ont <- ontology(); ont$concepts
Scripts and packages | ont_connect(); ont_evaluate(...)
Building pipelines | Function-based (explicit control)
Teaching / demos | Object-based (less boilerplate)

Comparison with ontologyIndex

ontologyIndex is excellent for bio-ontologies (HPO, GO). Here’s how we differ:

# ontologyIndex: static hierarchy from OBO file
data(hpo)
hpo$name[["HP:0001250"]]          # "Seizure"
get_ancestors(hpo, "HP:0001250")  # Returns parent terms

# ontologyR: versioned definitions with governance
ont <- ontology("metrics.duckdb")
ont$concepts$ready_for_discharge$flow[[1]]$sql_expr  # "NOT planned_intervention"
ont$concepts$ready_for_discharge$flow[[1]]$audit_summary  # Drift stats

Aspect | ontologyIndex | ontologyR
Domain | Bio-ontologies | Operational definitions
Data structure | Named lists | Tables + S3 views
Mutability | Read-only | Versioned, auditable
Key operation | Hierarchy traversal | SQL evaluation + drift detection
Persistence | In-memory | Database-backed

Use ontologyIndex for: Gene Ontology, HPO, disease taxonomies

Use ontologyR for: KPIs, metrics, business definitions that need governance

Other influences

  • Goodhart’s Law — “When a measure becomes a target, it ceases to be a good measure”
  • Statistical process control — Treating drift as a measurable signal
  • Data mesh — Decentralized ownership with standards
  • Critical analysis of analytics platforms and definition lock-in

License

MIT