# keyed

Primary keys for data frames.
In a database, you declare `customer_id` as a primary key
and the database enforces uniqueness. With CSV and Excel files, you get
no such guarantees - duplicates slip in silently.
keyed brings database-style protections to R data frames through four features:
| Feature | What it does |
|---|---|
| Keys | Declare unique columns, enforced through transformations |
| Locks | Assert conditions (no NAs, row counts, coverage) |
| UUIDs | Track row identity through your pipeline |
| Commits | Snapshot data to detect drift |
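A minimal sketch of the four pieces working together - every function here is documented in the sections below, and `orders.csv` / `order_id` are placeholder names:

``` r
library(keyed)

orders <- read.csv("orders.csv") |>  # placeholder file
  key(order_id) |>                   # errors now if order_id has duplicates
  lock_no_na(customer_id) |>         # errors if any customer_id is missing
  add_id() |>                        # attach a UUID to each row
  commit_keyed()                     # snapshot for later drift checks
```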
## Installation

``` r
# install.packages("pak")
pak::pak("gcol33/keyed")
```

## Keys

Declare which columns must be unique - like a primary key in a database.
``` r
library(keyed)

# Declare the key (errors if not unique)
customers <- read.csv("customers.csv") |>
  key(customer_id)

# Composite keys work too
sales <- key(sales, region, year)
```

Keys follow your data through transformations:
``` r
# Base R
active <- customers[customers$status == "active", ]
has_key(active)
#> [1] TRUE

# dplyr
active <- customers |> filter(status == "active")
has_key(active)
#> [1] TRUE
```

Keys block operations that would break uniqueness:
``` r
customers |> mutate(customer_id = 1)
#> Error: Key is no longer unique after transformation.
#> i Use `unkey()` first if you intend to break uniqueness.

# To proceed, explicitly remove the key first
customers |> unkey() |> mutate(customer_id = 1)
```

Preview joins before running them:
``` r
diagnose_join(customers, orders, by = "customer_id")
#> Cardinality: one-to-many
#> customers: 1000 rows (unique)
#> orders: 5432 rows (4432 duplicates)
#> Left join will produce ~5432 rows
```

## Locks

Assert conditions at checkpoints in your pipeline.
``` r
customers |>
  lock_unique(customer_id) |>  # Must be unique
  lock_no_na(email) |>         # No missing emails
  lock_nrow(min = 100)         # At least 100 rows
```

Locks error immediately if the condition fails - no silent continuation.
Available locks:

| Function | Checks |
|---|---|
| `lock_unique(df, col)` | No duplicate values |
| `lock_no_na(df, col)` | No missing values |
| `lock_complete(df)` | No NAs in any column |
| `lock_coverage(df, threshold, col)` | % non-NA above threshold |
| `lock_nrow(df, min, max)` | Row count in range |
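A short sketch of the locks not shown in the pipeline above, with argument order taken from the table (the toy `survey` data is made up here):

``` r
# Toy data: one missing age out of five
survey <- data.frame(
  id  = 1:5,
  age = c(34, 51, NA, 28, 40)
)

survey |>
  lock_coverage(0.75, age) |>   # passes: 4/5 = 80% of age is non-NA
  lock_nrow(min = 3, max = 10)  # passes: 5 rows is within range

# lock_complete() errors here, since `age` contains an NA;
# try() lets an interactive session continue past the failure
try(lock_complete(survey))
```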
## UUIDs

When your data has no natural key, generate stable row identifiers.
``` r
# Add a UUID to each row
customers <- add_id(customers)
#>                .id  name
#> 1 a3f2c8e1b9d04567 Alice
#> 2 7b1e4a9c2f8d3601   Bob
#> 3 e9c7b2a1d4f80235 Carol
```

UUIDs survive all transformations:
``` r
filtered <- customers |> filter(name != "Bob")
get_id(filtered)
#> [1] "a3f2c8e1b9d04567" "e9c7b2a1d4f80235"
```

Track which rows were added or removed:
``` r
compare_ids(customers, filtered)
#> Lost: 1 row (7b1e4a9c2f8d3601)
#> Kept: 2 rows
```

UUIDs let you trace rows through joins, filters, and reshaping - essential for debugging data pipelines.
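As a sketch, assuming `.id` survives a join the same way it survives a filter (as shown above), you can ask which customers had no matching order:

``` r
# Keep only customers that appear in orders; semi_join avoids row duplication
with_orders <- customers |> dplyr::semi_join(orders, by = "customer_id")

# Rows reported as lost are customers without any order
compare_ids(customers, with_orders)
```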
## Commits

Snapshot your data to detect unexpected changes later.
``` r
# Save a snapshot (stored in memory for this session)
customers <- customers |> commit_keyed()

# Work with your data...
customers <- customers |>
  filter(status == "active") |>
  mutate(score = score + 10)

# Check what changed since the commit
check_drift(customers)
#> Drift detected!
#> - Row count: 1000 -> 847 (-153)
#> - Column 'score' modified
```

How it works:

- Each data frame can have one snapshot attached
- Snapshots persist for your R session (lost on restart)
- `check_drift()` compares current state to the snapshot
- `clear_snapshot()` removes it, `list_snapshots()` shows all
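A quick sketch of the housekeeping helpers named above; whether `clear_snapshot()` takes the data frame as its argument is an assumption here:

``` r
list_snapshots()           # every snapshot stored this session
clear_snapshot(customers)  # assumed signature: drop the snapshot from `customers`
```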
Useful for catching unexpected changes during interactive analysis.
## Alternatives

| Need | Better tool |
|---|---|
| Enforced schema | SQLite, DuckDB |
| Full data validation | pointblank, validate |
| Production pipelines | targets |
keyed gives you database-style protections without database infrastructure. It's built for exploratory workflows where SQLite is overkill but silent corruption is unacceptable.
## License

MIT