Quick Start

Gilles Colling

2026-02-03

The Problem: Silent Data Corruption

You receive monthly customer exports from a CRM system. The data should have unique customer_id values and complete email addresses. One month, someone upstream changes the export logic. Now customer_id has duplicates and some emails are missing.

Without explicit checks, you won’t notice until something breaks downstream—wrong row counts after a join, duplicated invoices, failed email campaigns.

library(keyed)
library(dplyr)

# January export: clean data
january <- data.frame(
  customer_id = c(101, 102, 103, 104, 105),
  email = c("alice@example.com", "bob@example.com", "carol@example.com",
            "dave@example.com", "eve@example.com"),
  segment = c("premium", "basic", "premium", "basic", "premium")
)

# February export: corrupted upstream (duplicates + missing email)
february <- data.frame(
  customer_id = c(101, 102, 102, 104, 105),  # Note: 102 is duplicated
  email = c("alice@example.com", "bob@example.com", NA,
            "dave@example.com", "eve@example.com"),
  segment = c("premium", "basic", "basic", "basic", "premium")
)

The February data looks fine at a glance:

head(february)
#>   customer_id             email segment
#> 1         101 alice@example.com premium
#> 2         102   bob@example.com   basic
#> 3         102              <NA>   basic
#> 4         104  dave@example.com   basic
#> 5         105   eve@example.com premium
nrow(february)  # Same row count
#> [1] 5

But it will silently corrupt your analysis.
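
For example, merging a February orders table against this export quietly duplicates every order placed by customer 102. (The orders table here is hypothetical, purely to illustrate the inflation.)

# Hypothetical orders table, purely for illustration
orders_feb <- data.frame(
  customer_id = c(101, 102, 104),
  amount = c(100, 200, 300)
)

# merge() matches both rows for customer 102 -- one order becomes two
merged <- merge(orders_feb, february, by = "customer_id")
nrow(orders_feb)
#> [1] 3
nrow(merged)
#> [1] 4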


The Solution: Make Assumptions Explicit

keyed catches these issues by making your assumptions explicit:

# Define what you expect: customer_id is unique
january_keyed <- january |>
  key(customer_id) |>
  lock_no_na(email)

# This works - January data is clean
january_keyed
#> # A keyed tibble: 5 x 3
#> # Key:            customer_id
#>   customer_id email             segment
#>         <dbl> <chr>             <chr>  
#> 1         101 alice@example.com premium
#> 2         102 bob@example.com   basic  
#> 3         103 carol@example.com premium
#> 4         104 dave@example.com  basic  
#> 5         105 eve@example.com   premium

Now try the same with February’s corrupted data:

# This is flagged immediately - duplicates detected
february |>
  key(customer_id)
#> Warning: Key is not unique.
#> ℹ 1 duplicate key value(s) found.
#> ℹ Key columns: customer_id
#> # A keyed tibble: 5 x 3
#> # Key:            customer_id
#>   customer_id email             segment
#>         <dbl> <chr>             <chr>  
#> 1         101 alice@example.com premium
#> 2         102 bob@example.com   basic  
#> 3         102 <NA>              basic  
#> 4         104 dave@example.com  basic  
#> 5         105 eve@example.com   premium

The warning surfaces the problem at import time, not downstream when you’re debugging a mysterious row count mismatch.


Workflow 1: Monthly Data Validation

Goal: Validate each month’s export against expected constraints before processing.

Challenge: Data quality varies month-to-month. Silent corruption causes cascading errors.

Strategy: Define keys and assumptions once, apply consistently to each import.

Define validation function

validate_customer_export <- function(df) {
  df |>
    key(customer_id) |>
    lock_no_na(email) |>
    lock_nrow(min = 1)
}

# January: passes
january_clean <- validate_customer_export(january)
summary(january_clean)
#> 
#> ── Keyed Data Frame Summary
#> Dimensions: 5 rows x 3 columns
#> 
#> Key columns: customer_id
#> ✔ Key is unique
#> 
#> Row IDs: none
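
February, run through the same function, is flagged on the first check: the duplicate key warns as shown earlier, and the missing email additionally trips lock_no_na(). The exact messages depend on your keyed version, so the output is omitted here.

# February: flagged -- duplicate customer_id, plus a missing email
validate_customer_export(february)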

Keys survive transformations

Once defined, keys persist through dplyr operations:

# Filter preserves key
premium_customers <- january_clean |>
  filter(segment == "premium")

has_key(premium_customers)
#> [1] TRUE
get_key_cols(premium_customers)
#> [1] "customer_id"

# Mutate preserves key
enriched <- january_clean |>
  mutate(domain = sub(".*@", "", email))

has_key(enriched)
#> [1] TRUE

Strict enforcement

If an operation breaks uniqueness, keyed errors and tells you to use unkey() first:

# This creates duplicates - keyed stops you
january_clean |>
  mutate(customer_id = 1)
#> Error in `mutate()`:
#> ! Key is no longer unique after transformation.
#> ℹ Use `unkey()` first if you intend to break uniqueness.

To proceed, you must explicitly acknowledge breaking the key:

january_clean |>
  unkey() |>
  mutate(customer_id = 1)
#> # A tibble: 5 × 3
#>   customer_id email             segment
#>         <dbl> <chr>             <chr>  
#> 1           1 alice@example.com premium
#> 2           1 bob@example.com   basic  
#> 3           1 carol@example.com premium
#> 4           1 dave@example.com  basic  
#> 5           1 eve@example.com   premium

Workflow 2: Safe Joins

Goal: Join customer data with orders without accidentally duplicating rows.

Challenge: Join cardinality mistakes are common and hard to debug. A “one-to-one” join that’s actually one-to-many silently inflates your data.

Strategy: Use diagnose_join() to understand cardinality before joining.

Create sample data

customers <- data.frame(
  customer_id = 1:5,
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  tier = c("gold", "silver", "gold", "bronze", "silver")
) |>
  key(customer_id)

orders <- data.frame(
  order_id = 1:8,
  customer_id = c(1, 1, 2, 3, 3, 3, 4, 5),
  amount = c(100, 150, 200, 50, 75, 125, 300, 80)
) |>
  key(order_id)

Diagnose before joining

diagnose_join(customers, orders, by = "customer_id", use_joinspy = FALSE)
#> 
#> ── Join Diagnosis
#> Cardinality: one-to-many
#> x: 5 rows, unique
#> y: 8 rows, 3 duplicates

The diagnosis shows a one-to-many relationship: customers is unique on customer_id, while orders has several rows per customer. Now you know what to expect: a left_join() will create 8 rows (one per order), not 5 (one per customer).
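
You can confirm this with a plain dplyr join on unkeyed copies (keys are dropped here because, depending on the package’s strictness, carrying customers’ key through a row-inflating join may be blocked):

joined <- left_join(unkey(customers), unkey(orders), by = "customer_id")
nrow(joined)
#> [1] 8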

Compare key structures

compare_keys(customers, orders)
#> 
#> ── Key Comparison
#> Comparing on: customer_id
#> 
#> x: 5 unique keys
#> y: 5 unique keys
#> 
#> Common: 5 (100.0% of x)
#> Only in x: 0
#> Only in y: 0

This confirms the join key overlaps completely: every customer_id in orders corresponds to a customer, and every customer has at least one order. There are no orphan keys on either side, which is exactly what you want to verify before joining.
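
To see why this matters, imagine an orders extract that covers only some customers and references an unknown customer 6. compare_keys() would then report keys present on only one side (output omitted, since the exact wording depends on the package):

# Hypothetical partial extract with an unknown customer
orders_partial <- data.frame(
  order_id = 1:3,
  customer_id = c(1, 2, 6),  # customer 6 does not exist in `customers`
  amount = c(100, 200, 50)
) |>
  key(order_id)

compare_keys(customers, orders_partial)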


Workflow 3: Row Identity Tracking

Goal: Track which original rows survive through a complex pipeline.

Challenge: After filtering, aggregating, and joining, you lose track of which source rows contributed to your final data.

Strategy: Use add_id() to attach stable identifiers that survive transformations.

Add row IDs

# Add UUIDs to rows
customers_tracked <- customers |>
  add_id()

customers_tracked
#> # A keyed tibble: 5 x 4
#> # Key:            customer_id | .id
#>   .id                                  customer_id name  tier  
#>   <chr>                                      <int> <chr> <chr> 
#> 1 e87304fc-09ed-4634-8caa-a9d9cf2352cc           1 Alice gold  
#> 2 d4c8b392-666d-43e1-8178-ce8e01efd218           2 Bob   silver
#> 3 149ca4bd-d304-46a8-822b-fe344600d006           3 Carol gold  
#> 4 d6031d5d-90eb-44f7-96a8-db4a0ef6da72           4 Dave  bronze
#> 5 2aad2788-779c-48fb-86de-8dfd71222a2c           5 Eve   silver

IDs survive transformations

# Filter: IDs persist
gold_customers <- customers_tracked |>
  filter(tier == "gold")

get_id(gold_customers)
#> [1] "e87304fc-09ed-4634-8caa-a9d9cf2352cc"
#> [2] "149ca4bd-d304-46a8-822b-fe344600d006"

# Compare with original
compare_ids(customers_tracked, gold_customers)
#> $lost
#> [1] "d4c8b392-666d-43e1-8178-ce8e01efd218"
#> [2] "d6031d5d-90eb-44f7-96a8-db4a0ef6da72"
#> [3] "2aad2788-779c-48fb-86de-8dfd71222a2c"
#> 
#> $gained
#> character(0)
#> 
#> $preserved
#> [1] "e87304fc-09ed-4634-8caa-a9d9cf2352cc"
#> [2] "149ca4bd-d304-46a8-822b-fe344600d006"

The comparison shows exactly which rows were lost (filtered out) and which were preserved.
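
Because compare_ids() returns plain character vectors of IDs (as shown above) and the IDs live in the .id column, you can pull the dropped rows straight back out of the original data, for example:

# Look up the filtered-out rows by their stable IDs
lost_ids <- compare_ids(customers_tracked, gold_customers)$lost
customers_tracked[customers_tracked$.id %in% lost_ids, ]
# Bob, Dave, and Eve -- the rows removed by the tier filter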

Combining data with ID handling

When appending new data, bind_id() handles ID conflicts:

batch1 <- data.frame(x = 1:3) |> add_id()
batch2 <- data.frame(x = 4:6)  # No IDs yet

# bind_id assigns new IDs to batch2 and checks for conflicts
combined <- bind_id(batch1, batch2)
combined
#>                                    .id x
#> 1 beb82b30-6d2b-4a1a-a952-9b710fcf7f62 1
#> 2 766c5fb1-2c63-4b61-9ce3-c46a80e92cfa 2
#> 3 7aac4ff0-a2f9-4965-abf1-7d1501bbf0b6 3
#> 4 c42f55c3-c5d8-4f76-a95e-9303876450fd 4
#> 5 a4ab668b-4e54-4dc1-bdc2-21686e4944d6 5
#> 6 30995197-d6a4-4938-8904-92358c8c7088 6

Workflow 4: Drift Detection

Goal: Detect when data changes unexpectedly between pipeline runs.

Challenge: Reference data (lookup tables, dimension tables) changes upstream without notice. Your pipeline silently uses stale assumptions.

Strategy: Commit snapshots with commit_keyed() and check for drift with check_drift().

Commit a reference snapshot

# Commit current state as reference
reference_data <- data.frame(
  region_id = c("US", "EU", "APAC"),
  tax_rate = c(0.08, 0.20, 0.10)
) |>
  key(region_id) |>
  commit_keyed()
#> ✔ Snapshot committed: 76a76466...

Check for drift

# No changes yet
check_drift(reference_data)
#> 
#> ── Drift Report
#> ✔ No drift detected
#> Snapshot: 76a76466... (2026-02-03 22:34)

Detect changes

# Simulate upstream change: EU tax rate changed
modified_data <- reference_data
modified_data$tax_rate[2] <- 0.21

# Drift detected!
check_drift(modified_data)
#> 
#> ── Drift Report
#> ! Drift detected
#> Snapshot: 76a76466... (2026-02-03 22:34)
#> ℹ Key values changed
#> ℹ Cell values modified

The drift report shows exactly what changed, letting you decide whether to accept the new data or investigate.
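
If the change turns out to be legitimate (say, the EU rate really did rise to 0.21), you would accept it by committing a fresh snapshot. A sketch, assuming commit_keyed() simply records the data’s current state as the new reference:

# Accept the new rates as the reference going forward
reference_data <- modified_data |>
  commit_keyed()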

Cleanup

# Remove snapshots when done
clear_all_snapshots()
#> ! This will remove 1 snapshot(s) from cache.
#> ✔ Cleared 1 snapshot(s).

Quick Reference

Core Functions

| Function | Purpose |
|----------|---------|
| key() | Define key columns (validates uniqueness) |
| unkey() | Remove key |
| has_key(), get_key_cols() | Query key status |

Assumption Checks

| Function | Validates |
|----------|-----------|
| lock_unique() | No duplicate values |
| lock_no_na() | No missing values |
| lock_complete() | All expected values present |
| lock_coverage() | Reference values covered |
| lock_nrow() | Row count within bounds |

Diagnostics

| Function | Purpose |
|----------|---------|
| diagnose_join() | Analyze join cardinality |
| compare_keys() | Compare key structures |
| compare_ids() | Compare row identities |
| find_duplicates() | Find duplicate key values |
| key_status() | Quick status summary |

Row Identity

| Function | Purpose |
|----------|---------|
| add_id() | Add UUIDs to rows |
| get_id() | Retrieve row IDs |
| bind_id() | Combine data with ID handling |
| make_id() | Create deterministic IDs from columns |
| check_id() | Validate ID integrity |

Drift Detection

| Function | Purpose |
|----------|---------|
| commit_keyed() | Save reference snapshot |
| check_drift() | Compare against snapshot |
| list_snapshots() | View saved snapshots |
| clear_snapshot() | Remove specific snapshot |

When to Use Something Else

keyed is designed for flat-file workflows without database infrastructure. If you need:

| Need | Better alternative |
|------|--------------------|
| Enforced schema | Database (SQLite, DuckDB) |
| Version history | Git, git2r |
| Full data validation | pointblank, validate |
| Production pipelines | targets |

keyed fills a specific gap: lightweight key tracking for exploratory and semi-structured workflows where heavier tools add friction.


See Also