The Assembly Line Analogy
Imagine a car factory:
- Station 1: Raw steel arrives, gets cut into body panels
- Station 2: Body panels get welded together into a frame
- Station 3: Frame gets painted
- Station 4: Engine and interior are installed
- Station 5: Final inspection, car rolls off the line
Each station is a transform — it takes inputs and produces outputs. The path from raw steel to finished car is the lineage.
ontologyR transforms work the same way — they define how data flows from sources to outputs, with full traceability.
Why Transforms Matter
Without transforms, data pipelines look like this:
-- dashboard_query.sql (written 3 years ago, nobody knows what it does)
SELECT a.*, b.total, c.status
FROM some_table a
JOIN (SELECT ... FROM other_table ...) b ON a.id = b.id
JOIN (SELECT ... FROM yet_another ...) c ON a.id = c.id
WHERE ...

Problems:
- What tables does this actually use?
- If other_table changes schema, does this break?
- How long does this take to run?
- When did the logic last change?
Transforms solve this by making data transformations explicit, tracked, and reproducible.
Defining a Transform
A transform is a named, tracked data transformation:
library(ontologyR)
ont_connect(":memory:")
# Assume we have some source tables
DBI::dbWriteTable(ont_get_connection(), "orders", tibble::tibble(
order_id = paste0("O", 1:100),
customer_id = paste0("C", sample(1:20, 100, replace = TRUE)),
amount = runif(100, 10, 500),
order_date = Sys.Date() - sample(1:90, 100, replace = TRUE)
))
DBI::dbWriteTable(ont_get_connection(), "customers", tibble::tibble(
customer_id = paste0("C", 1:20),
name = paste("Customer", 1:20),
segment = sample(c("Enterprise", "SMB", "Consumer"), 20, replace = TRUE)
))
# Register source datasets
ont_register_dataset("ds_orders", "Orders", "orders", "source")
ont_register_dataset("ds_customers", "Customers", "customers", "source")
# Define a transform that joins them
ont_define_transform(
transform_id = "txf_order_summary",
transform_name = "Order Summary by Customer",
output_dataset_id = "ds_order_summary",
transform_type = "sql",
code = "
SELECT
c.customer_id,
c.name,
c.segment,
COUNT(o.order_id) as order_count,
SUM(o.amount) as total_spent,
MAX(o.order_date) as last_order
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.segment
",
input_datasets = c("ds_orders", "ds_customers"),
description = "Aggregates order data by customer for reporting"
)

The Recipe Card Analogy
A transform is like a recipe card:
- Name: “Chocolate Chip Cookies”
- Ingredients (inputs): flour, butter, chocolate chips
- Instructions (code): mix, bake at 350°F for 12 minutes
- Result (output): cookies
The recipe card doesn’t make cookies — it describes HOW to make them. Similarly, a transform definition describes the transformation but doesn’t execute it.
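One way to convince yourself of this is a quick sketch (it assumes ontologyR materializes each output dataset as a table named after its output dataset ID, which may differ in your setup): right after ont_define_transform(), the output table should not exist yet.

# Defining a transform does not create its output
# (assumption: outputs are materialized as tables named after the output dataset ID)
DBI::dbExistsTable(ont_get_connection(), "ds_order_summary")
#> [1] FALSE   # nothing has run yet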
Executing Transforms
To actually run the transform:
# Execute the transform
result <- ont_execute_transform("txf_order_summary")
#> v Executed transform txf_order_summary
#> i 20 rows in 0.12 seconds
# Check the result
result
#> $run_id
#> [1] "RUN-20250116-abc123"
#>
#> $transform_id
#> [1] "txf_order_summary"
#>
#> $row_count
#> [1] 20
#>
#> $status
#> [1] "completed"Each execution: - Creates a run record (who, when, how long) - Captures input snapshots (what the inputs looked like) - Creates lineage edges (what flowed where) - Updates the output dataset metadata
Input Snapshots: Reproducibility
Here’s something powerful: every run captures a snapshot of its inputs.
# Get run details
run <- ont_get_run("RUN-20250116-abc123")
# See what the inputs looked like at execution time
jsonlite::fromJSON(run$input_snapshot)
#> $ds_orders
#> $ds_orders$row_count
#> [1] 100
#>
#> $ds_customers
#> $ds_customers$row_count
#> [1] 20

This answers: “When this report was generated, what data was it based on?”
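The snapshot also lets you spot drift: compare what an input looked like at run time with what it looks like now. A sketch, assuming the snapshot stores per-dataset row counts as shown above and that the orders data lives in the orders table:

# Compare the snapshot's row count with the current row count for orders
snapshot <- jsonlite::fromJSON(run$input_snapshot)
current_rows <- DBI::dbGetQuery(
  ont_get_connection(),
  "SELECT COUNT(*) AS n FROM orders"
)$n
current_rows - snapshot$ds_orders$row_count
#> [1] 0   # inputs unchanged since the run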
Building Transform Chains
Real data pipelines have multiple steps. ontologyR tracks the full chain:
# Step 1: Clean orders (remove test orders)
ont_define_transform(
"txf_clean_orders", "Clean Orders", "ds_clean_orders", "sql",
code = "SELECT * FROM orders WHERE amount > 0 AND customer_id NOT LIKE 'TEST%'",
input_datasets = "ds_orders"
)
ont_execute_transform("txf_clean_orders")
# Step 2: Aggregate by customer
ont_define_transform(
"txf_customer_totals", "Customer Totals", "ds_customer_totals", "sql",
code = "SELECT customer_id, SUM(amount) as total FROM ds_clean_orders GROUP BY customer_id",
input_datasets = "ds_clean_orders"
)
ont_execute_transform("txf_customer_totals")
# Step 3: Segment customers
ont_define_transform(
"txf_customer_segments", "Customer Segments", "ds_customer_segments", "sql",
code = "
SELECT customer_id, total,
CASE WHEN total > 1000 THEN 'Gold'
WHEN total > 500 THEN 'Silver'
ELSE 'Bronze' END as tier
FROM ds_customer_totals
",
input_datasets = "ds_customer_totals"
)
ont_execute_transform("txf_customer_segments")

Now you have a three-step pipeline, fully tracked.
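Since each step is just a registered transform, re-running the whole chain means executing the transforms in dependency order. A small sketch (ontologyR may or may not ship a built-in “run everything” helper, so this loops explicitly):

# Re-run the chain in dependency order
pipeline <- c("txf_clean_orders", "txf_customer_totals", "txf_customer_segments")
runs <- lapply(pipeline, ont_execute_transform)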
Visualizing the DAG
Your transforms form a Directed Acyclic Graph (DAG) — data flows in one direction, no cycles.
# Get the transform graph
dag <- ont_get_transform_dag()
dag$nodes
#> # A tibble: 7 x 3
#> node_id node_type label
#> <chr> <chr> <chr>
#> 1 txf_clean_orders transform Clean Orders
#> 2 txf_customer_totals transform Customer Totals
#> 3 txf_customer_segments transform Customer Segments
#> 4 ds_orders source Orders
#> 5 ds_clean_orders derived Clean Orders
#> 6 ds_customer_totals derived Customer Totals
#> 7 ds_customer_segments derived Customer Segments
dag$edges
#> # A tibble: 6 x 3
#> from to edge_type
#> <chr> <chr> <chr>
#> 1 ds_orders txf_clean_orders primary
#> 2 txf_clean_orders ds_clean_orders output
#> 3 ds_clean_orders txf_customer_totals primary
#> ...

This is like a subway map for your data — you can see all the routes and stations.
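Because dag$nodes and dag$edges are plain tibbles, you can hand them to any graph tool you like. A sketch using igraph (an assumption on my part — igraph is not required by ontologyR):

# Plot the DAG with igraph; node_id must be the first column of the vertex table
library(igraph)
g <- graph_from_data_frame(
  d = dag$edges,
  vertices = dag$nodes,
  directed = TRUE
)
plot(g, vertex.label = V(g)$label, vertex.size = 30)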
Impact Analysis: What Will Break?
The killer feature is impact analysis. Before changing anything, see what depends on it:
# What happens if I change the orders table?
impact <- ont_get_impact("ds_orders")
impact$total_affected_datasets
#> [1] 3
impact$total_affected_transforms
#> [1] 3
impact$direct_transforms
#> # A tibble: 1 x 2
#> transform_id transform_name
#> <chr> <chr>
#> 1 txf_clean_orders Clean Orders
impact$all_affected_transforms
#> # A tibble: 3 x 2
#> transform_id transform_name
#> <chr> <chr>
#> 1 txf_clean_orders Clean Orders
#> 2 txf_customer_totals Customer Totals
#> 3 txf_customer_segments Customer Segments
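This makes a natural pre-change check: before altering a source table, warn (or refuse to proceed) when anything downstream would be affected. A minimal sketch using the fields shown above:

# Warn before touching a dataset that other transforms depend on
if (impact$total_affected_transforms > 0) {
  warning(
    impact$total_affected_transforms, " transform(s) and ",
    impact$total_affected_datasets, " dataset(s) depend on ds_orders"
  )
}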
Comparing Runs: What Changed?
When numbers look different, you can compare runs:
# Run the transform twice (maybe data changed)
result1 <- ont_execute_transform("txf_customer_totals")
# ... time passes, data changes ...
result2 <- ont_execute_transform("txf_customer_totals")
# Compare them
comparison <- ont_compare_runs(result1$run_id, result2$run_id)
comparison$output_row_diff
#> [1] 5 # 5 more customers this time
comparison$input_changes
#> Shows row count changes in inputs

This answers: “Why does today’s report show different numbers than yesterday’s?”
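In a scheduled job, the comparison can double as a one-line drift report. A sketch built only from the comparison fields above:

# Summarise what changed between the two runs
if (comparison$output_row_diff != 0) {
  message(
    "Output row count changed by ", comparison$output_row_diff,
    " rows; see comparison$input_changes for which inputs moved"
  )
}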
DAG Validation: Catch Errors Early
ontologyR can validate your transform graph:
# Check for cycles (impossible data flows)
validation <- ont_validate_dag()
validation$is_valid
#> [1] TRUE
validation$cycles
#> list() # No cycles - good!

If you accidentally create a circular dependency (A depends on B, B depends on C, and C depends on A), validation catches it before you waste time debugging.
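Validation is cheap, so it is worth running at the start of any pipeline script (or in CI) so a bad graph stops the job before any transforms execute. A sketch:

# Abort the pipeline early if the transform graph contains a cycle
validation <- ont_validate_dag()
if (!validation$is_valid) {
  stop(
    "Transform DAG has cycles: ",
    paste(unlist(validation$cycles), collapse = " -> ")
  )
}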
Practical Example: Sales Analytics Pipeline
library(ontologyR)
ont_connect(":memory:")
# === Source Data ===
DBI::dbWriteTable(ont_get_connection(), "raw_transactions", tibble::tibble(
txn_id = 1:1000,
product_id = sample(1:50, 1000, replace = TRUE),
quantity = sample(1:10, 1000, replace = TRUE),
unit_price = runif(1000, 5, 100),
txn_date = Sys.Date() - sample(1:365, 1000, replace = TRUE)
))
DBI::dbWriteTable(ont_get_connection(), "products", tibble::tibble(
product_id = 1:50,
product_name = paste("Product", 1:50),
category = sample(c("Electronics", "Clothing", "Food"), 50, replace = TRUE)
))
# === Register Sources ===
ont_register_dataset("ds_transactions", "Transactions", "raw_transactions", "source")
ont_register_dataset("ds_products", "Products", "products", "source")
# === Build Pipeline ===
# Step 1: Calculate line totals
ont_define_transform(
"txf_line_totals", "Line Totals", "ds_line_totals", "sql",
code = "SELECT *, quantity * unit_price as line_total FROM raw_transactions",
input_datasets = "ds_transactions"
)
# Step 2: Enrich with product info
ont_define_transform(
"txf_enriched_sales", "Enriched Sales", "ds_enriched_sales", "sql",
code = "
SELECT t.*, p.product_name, p.category
FROM ds_line_totals t
JOIN products p ON t.product_id = p.product_id
",
input_datasets = c("ds_line_totals", "ds_products")
)
# Step 3: Aggregate by category
ont_define_transform(
"txf_category_summary", "Category Summary", "ds_category_summary", "sql",
code = "
SELECT category,
COUNT(*) as transactions,
SUM(line_total) as revenue
FROM ds_enriched_sales
GROUP BY category
",
input_datasets = "ds_enriched_sales"
)
# === Execute Pipeline ===
ont_execute_transform("txf_line_totals")
ont_execute_transform("txf_enriched_sales")
ont_execute_transform("txf_category_summary")
# === Analyze ===
# Full lineage from source to summary
ont_get_upstream("ds_category_summary")
#> Shows: transactions -> line_totals -> enriched_sales -> category_summary
#> products -> enriched_sales -> category_summary
# Impact if products table changes
ont_get_impact("ds_products")
#> Shows: enriched_sales and category_summary would be affected

Key Concepts Summary
| Concept | What It Is | Analogy |
|---|---|---|
| Transform | Named data transformation | Recipe card |
| Input Dataset | Data source for transform | Ingredients |
| Output Dataset | Result of transform | Finished dish |
| Run | Single execution of transform | One batch of cooking |
| Input Snapshot | State of inputs at run time | Photo of ingredients |
| Lineage | Graph of data dependencies | Family tree |
| DAG | The full transform graph | Subway map |
| Impact Analysis | What depends on X | Domino prediction |
Next Steps
- See vignette("datasets-and-materialization") for concept-based materialization
- See vignette("governance-gates") for quality controls on transforms
Remember: The goal isn’t just to move data — it’s to understand how data moves.